https://www.open-mpi.org/faq/?category=openfabrics#ib-btl
NOTE: Prior versions of Open MPI used an sm BTL for shared memory. sm was effectively replaced with vader starting in Open MPI v3.0.0.
Switching from the "sm" BTL to "vader" prevents this runtime error:
# --------------------------------------------------------------------------
# As of version 3.0.0, the "sm" BTL is no longer available in Open MPI.
# Efficient, high-speed same-node shared memory communication support in
# Open MPI is available in the "vader" BTL. To use the vader BTL, you
# can re-run your job with:
# mpirun --mca btl vader,self,... your_mpi_application
# --------------------------------------------------------------------------
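A quick way to confirm that the loaded Open MPI build actually provides the vader BTL, and to select it explicitly, is sketched below (hedged: `ompi_info` should be on your `PATH` once the `openmpi` module is loaded, and the rank count is only an example):

```bash
# List the BTL components compiled into the currently loaded Open MPI;
# "vader" should appear for v3.0.0 and later.
ompi_info | grep "MCA btl"

# Select the shared-memory path explicitly for a single run ...
mpirun --mca btl vader,self,tcp -np 4 python mpi_learn.py

# ... or via the environment variable used in the batch script below.
export OMPI_MCA_btl="tcp,self,vader"
```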
docs/PrincetonUTutorial.md: 28 additions & 18 deletions
@@ -1,11 +1,13 @@
 ## Tutorials
+*Last updated 2019-10-16*
 
 ### Login to TigerGPU
 
 First, login to TigerGPU cluster headnode via ssh:
 ```
 ssh -XC <yourusername>@tigergpu.princeton.edu
 ```
+Note, `-XC` is optional; it is only necessary if you are planning on performing remote visualization, e.g. the output `.png` files from the below [section](#Learning-curves-and-ROC-per-epoch). Trusted X11 forwarding can be used with `-Y` instead of `-X` and may prevent timeouts, but it disables X11 SECURITY extension controls. Compression `-C` reduces the bandwidth usage and may be useful on slow connections.
If you `source activate` the Anaconda environment after loading the openmpi module, you will pick up the MPI from Anaconda rather than the one provided by the module, which can lead to errors.
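To check which MPI ends up first on your `PATH` after activating the environment and loading the modules, a sketch along these lines can help (the module names follow the batch script below; the last check assumes `mpi4py` is installed in `my_env`):

```bash
module load anaconda3
source activate my_env
module load openmpi/cuda-8.0/intel-17.0/3.0.0/64

# Both of these should point at the openmpi module, not at the Anaconda prefix.
which mpirun
mpirun --version

# Optional: confirm that mpi4py was built against the same Open MPI.
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
```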
@@ -93,20 +103,20 @@ For batch analysis, make sure to allocate 1 MPI process per GPU. Save the follow
 #SBATCH -c 4
 #SBATCH --mem-per-cpu=0
 
-module load anaconda3/4.4.0
+module load anaconda3
 source activate my_env
-export OMPI_MCA_btl="tcp,self,sm"
-module load cudatoolkit/8.0
-module load cudnn/cuda-8.0/6.0
-module load openmpi/cuda-8.0/intel-17.0/2.1.0/64
-module load intel/17.0/64/17.0.4.196
+export OMPI_MCA_btl="tcp,self,vader"
+module load cudatoolkit cudnn
+module load openmpi/cuda-8.0/intel-17.0/3.0.0/64
+module load intel
+module load hdf5/intel-17.0/intel-mpi/1.10.0
 
 srun python mpi_learn.py
 
 ```
 
 where `X` is the number of nodes for distributed training.
 
-Submit the job with:
+Submit the job with (assuming you are still in the `examples/` subdirectory):
 ```bash
 #cd examples
 sbatch slurm.cmd
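For orientation, the core of `slurm.cmd` after this change would read roughly as follows; the `#SBATCH` node and GPU request lines are assumptions based on the 'allocate 1 MPI process per GPU' guidance and are not part of this hunk:

```bash
#!/bin/bash
#SBATCH -N X                   # assumed: X = number of nodes, as in the text above
#SBATCH --ntasks-per-node=4    # assumed: 1 MPI rank per GPU, 4 GPUs per TigerGPU node
#SBATCH --gres=gpu:4           # assumed: request all 4 GPUs on each node
#SBATCH -c 4
#SBATCH --mem-per-cpu=0

module load anaconda3
source activate my_env
export OMPI_MCA_btl="tcp,self,vader"
module load cudatoolkit cudnn
module load openmpi/cuda-8.0/intel-17.0/3.0.0/64
module load intel
module load hdf5/intel-17.0/intel-mpi/1.10.0

srun python mpi_learn.py
```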
@@ -131,7 +141,7 @@ where the number of GPUs is X * 4.
 Then launch the application from the command line:
 
 ```bash
-mpirun -npernode 4 python examples/mpi_learn.py
+mpirun -npernode 4 python mpi_learn.py
 ```
 
 ### Understanding the data
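As a concrete example of the `X * 4` arithmetic (hypothetical values, 2 nodes with 4 GPUs each):

```bash
# 2 nodes * 4 ranks per node = 8 MPI ranks total, one per GPU.
mpirun -npernode 4 -np 8 python mpi_learn.py
```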
@@ -205,7 +215,7 @@ python -m tensorflow.tensorboard --logdir /mnt/<destination folder name on your