Commit 24de29b

Update module versions, X11 info, OpenMPI BTL
https://www.open-mpi.org/faq/?category=openfabrics#ib-btl

NOTE: Prior versions of Open MPI used an sm BTL for shared memory. sm was effectively replaced with vader starting in Open MPI v3.0.0. Switching to the "vader" BTL from "sm" prevents the runtime error:

# --------------------------------------------------------------------------
# As of version 3.0.0, the "sm" BTL is no longer available in Open MPI.
# Efficient, high-speed same-node shared memory communication support in
# Open MPI is available in the "vader" BTL. To use the vader BTL, you
# can re-run your job with:
# mpirun --mca btl vader,self,... your_mpi_application
# --------------------------------------------------------------------------
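A quick way to confirm which shared-memory BTL a given Open MPI build actually provides is `ompi_info`; a minimal sketch, assuming the `openmpi/cuda-8.0/intel-17.0/3.0.0/64` module from the diff below is the one loaded:

```bash
# List the BTL components shipped with the currently loaded Open MPI.
# Open MPI >= 3.0.0 should report "vader"; older builds report "sm".
module load openmpi/cuda-8.0/intel-17.0/3.0.0/64
ompi_info | grep "MCA btl"
```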
1 parent 0b76c9a commit 24de29b

File tree

1 file changed: +28 −18 lines changed


docs/PrincetonUTutorial.md

Lines changed: 28 additions & 18 deletions
@@ -1,11 +1,13 @@
 ## Tutorials
+*Last updated 2019-10-16*
 
 ### Login to TigerGPU
 
 First, login to the TigerGPU cluster headnode via ssh:
 ```
 ssh -XC <yourusername>@tigergpu.princeton.edu
 ```
+Note: `-XC` is optional; it is only needed if you plan to perform remote visualization, e.g. viewing the output `.png` files from the [section below](#Learning-curves-and-ROC-per-epoch). Trusted X11 forwarding can be used with `-Y` instead of `-X` and may prevent timeouts, but it disables X11 SECURITY extension controls. Compression with `-C` reduces bandwidth usage and may help on slow connections.
 
 ### Sample usage on TigerGPU
 
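A quick way to confirm that the X11 forwarding described in the note above is actually active after login; a minimal sketch, assuming a basic X client such as `xclock` is installed on the headnode:

```bash
# Run on the TigerGPU headnode after `ssh -XC ...`
echo $DISPLAY    # should print something like localhost:10.0 when forwarding is active
xclock &         # a trivial X client; a clock window should open on your local display
```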
@@ -15,21 +17,29 @@ git clone https://github.com/PPPLDeepLearning/plasma-python
 cd plasma-python
 ```
 
-After that, create an isolated Anaconda environment and load CUDA drivers:
+After that, create an isolated Anaconda environment and load CUDA drivers, an MPI compiler, and the HDF5 library:
 ```
 #cd plasma-python
-module load anaconda3/4.4.0
+module load anaconda3
 conda create --name my_env --file requirements-travis.txt
 source activate my_env
 
-export OMPI_MCA_btl="tcp,self,sm"
-module load cudatoolkit/8.0
-module load cudnn/cuda-8.0/6.0
-module load openmpi/cuda-8.0/intel-17.0/2.1.0/64
-module load intel/17.0/64/17.0.5.239
+export OMPI_MCA_btl="tcp,self,vader"
+# replace "vader" with "sm" for Open MPI versions prior to 3.0.0
+module load cudatoolkit cudnn
+module load openmpi/cuda-8.0/intel-17.0/3.0.0/64
+module load intel
+module load hdf5/intel-17.0/intel-mpi/1.10.0
+```
+As of the latest update of this document, the above modules correspond to the following versions on the TigerGPU system, as reported by `module list`:
+```
+Currently Loaded Modulefiles:
+  1) anaconda3/2019.3                        4) openmpi/cuda-8.0/intel-17.0/3.0.0/64    7) hdf5/intel-17.0/intel-mpi/1.10.0
+  2) cudatoolkit/10.1                        5) intel-mkl/2019.3/3/64
+  3) cudnn/cuda-9.2/7.6.3                    6) intel/19.0/64/19.0.3.199
 ```
 
-and install the `plasma-python` package:
+Next, install the `plasma-python` package:
 
 ```bash
 #source activate my_env
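Before running the install commands that the hunk above leads into, it can be worth confirming that `my_env` exists and is the active environment; a minimal sketch using standard conda commands:

```bash
conda info --envs    # "my_env" should appear in the list of environments
which python         # should resolve to the python inside my_env, not the base anaconda3 module
```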
@@ -44,7 +54,7 @@ Common issue is Intel compiler mismatch in the `PATH` and what you use in the mo
 you should see something like this:
 ```
 $ which mpicc
-/usr/local/openmpi/cuda-8.0/2.1.0/intel170/x86_64/bin/mpicc
+/usr/local/openmpi/cuda-8.0/3.0.0/intel170/x86_64/bin/mpicc
 ```
 
 If you `source activate` the Anaconda environment after loading the openmpi module, you will pick up the MPI from Anaconda instead, which can lead to errors.
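In the same spirit as the `which mpicc` check above, one can verify that `mpi4py` is linked against the module-provided Open MPI rather than a conda-provided MPI; a hedged sketch, assuming `mpi4py` is already installed in `my_env`:

```bash
# Print the MPI library mpi4py was built against; it should report Open MPI 3.0.0
# (matching the loaded openmpi module), not an MPI implementation pulled in by conda.
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
```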
@@ -93,20 +103,20 @@ For batch analysis, make sure to allocate 1 MPI process per GPU. Save the follow
 #SBATCH -c 4
 #SBATCH --mem-per-cpu=0
 
-module load anaconda3/4.4.0
+module load anaconda3
 source activate my_env
-export OMPI_MCA_btl="tcp,self,sm"
-module load cudatoolkit/8.0
-module load cudnn/cuda-8.0/6.0
-module load openmpi/cuda-8.0/intel-17.0/2.1.0/64
-module load intel/17.0/64/17.0.4.196
+export OMPI_MCA_btl="tcp,self,vader"
+module load cudatoolkit cudnn
+module load openmpi/cuda-8.0/intel-17.0/3.0.0/64
+module load intel
+module load hdf5/intel-17.0/intel-mpi/1.10.0
 
 srun python mpi_learn.py
 
 ```
 where `X` is the number of nodes for distributed training.
 
-Submit the job with:
+Submit the job with (assuming you are still in the `examples/` subdirectory):
 ```bash
 #cd examples
 sbatch slurm.cmd
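After `sbatch`, standard Slurm commands can be used to follow the job; a minimal sketch (the output file name assumes Slurm's default `slurm-<jobid>.out`, unless `slurm.cmd` overrides it):

```bash
squeue -u $USER                                   # queue state of your jobs
sacct -j <jobid> --format=JobID,State,Elapsed     # accounting summary once the job has started
tail -f slurm-<jobid>.out                         # follow the job's output as it runs
```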
@@ -131,7 +141,7 @@ where the number of GPUs is X * 4.
 Then launch the application from the command line:
 
 ```bash
-mpirun -npernode 4 python examples/mpi_learn.py
+mpirun -npernode 4 python mpi_learn.py
 ```
 
 ### Understanding the data
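For the interactive `mpirun` launch above, the node allocation is typically obtained first; a hedged sketch using standard Slurm `salloc` syntax, assuming 4 GPUs per node (implied by the "X * 4" note in the hunk header) and taking `X = 2` as an example:

```bash
# Request 2 nodes interactively, with 4 MPI tasks and 4 GPUs per node, then launch from the prompt.
salloc --nodes=2 --ntasks-per-node=4 --gres=gpu:4
mpirun -npernode 4 python mpi_learn.py
```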
@@ -205,7 +215,7 @@ python -m tensorflow.tensorboard --logdir /mnt/<destination folder name on your
 ```
 You should see something like:
 
-![alt text](https://github.com/PPPLDeepLearning/plasma-python/blob/master/docs/tb.png)
+![tensorboard example](https://github.com/PPPLDeepLearning/plasma-python/blob/master/docs/tb.png)
 
 #### Learning curves and ROC per epoch
 
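With recent TensorFlow installations, TensorBoard is usually launched via the standalone `tensorboard` command rather than `python -m tensorflow.tensorboard` as in the hunk header above; a hedged alternative, assuming the `tensorboard` entry point is on your PATH:

```bash
# Equivalent invocation with the standalone TensorBoard CLI, serving on the default port 6006.
tensorboard --logdir /mnt/<destination folder name on your laptop> --port 6006
```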