If you `source activate` the Anaconda environment **after** loading the OpenMPI library, your application would be built with the MPI library from Anaconda, which has worse performance on this cluster and could lead to errors. See [On Computing Well: Installing and Running ‘mpi4py’ on the Cluster](https://oncomputingwell.princeton.edu/2018/11/installing-and-running-mpi4py-on-the-cluster/) for a related discussion.
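For example, a safe ordering when installing `mpi4py` might look like the following sketch (the module name and conda environment name are placeholders; use whatever exists on your cluster):

```bash
# Activate the conda environment first, then load the system OpenMPI so that
# its compiler wrapper (mpicc) is found first on $PATH when mpi4py is built.
source activate frnn          # hypothetical environment name
module load openmpi/gcc       # assumed module name; use the cluster's OpenMPI module

# Rebuild mpi4py against the currently loaded MPI (skip any cached wheel).
pip install --no-cache-dir mpi4py

# Verify that mpi4py is linked against the system OpenMPI rather than Anaconda's MPI.
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
```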
#### Location of the data on Tigress
The JET and D3D datasets contain multi-modal time series of sensory measurements leading up to deleterious events called plasma disruptions. The datasets are located in the `/tigress/FRNN` project directory of the [GPFS](https://www.ibm.com/support/knowledgecenter/en/SSPT3X_3.0.0/com.ibm.swg.im.infosphere.biginsights.product.doc/doc/bi_gpfs_overview.html) filesystem on Princeton University clusters.
Preprocessing saves rescaled copies of the signals in `/tigress/<netid>/processed_shots`, `/tigress/<netid>/processed_shotlists`, and `/tigress/<netid>/normalization`.
You only need to run preprocessing once for each dataset. The dataset is specified in the config file `examples/conf.yaml`:
```yaml
paths:
    data: jet_data_0D
```
Preprocessing this dataset takes about 20 minutes in parallel and can normally be done on the cluster head node.
#### Training and inference
Use the Slurm scheduler to perform batch or interactive analysis on the TigerGPU cluster.
##### Batch analysis
For batch analysis, make sure to allocate 1 MPI process per GPU. Save the following to a `slurm.cmd` file (or make changes to the existing `examples/slurm.cmd`):
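The full script is not reproduced in this excerpt; a minimal sketch, assuming TigerGPU's 4 GPUs per node (the module names, environment name, and walltime below are placeholders, and `examples/slurm.cmd` remains the authoritative version), could look like:

```bash
#!/bin/bash
#SBATCH -N X                    # replace X with the number of nodes
#SBATCH --ntasks-per-node=4     # 1 MPI process per GPU
#SBATCH --gres=gpu:4            # 4 GPUs per TigerGPU node
#SBATCH -t 04:00:00             # walltime; adjust as needed

module load anaconda3 cudatoolkit cudnn openmpi   # assumed module names
source activate frnn                              # hypothetical environment name

# Launch 1 MPI process per GPU; with OpenMPI's Slurm integration,
# `mpirun python mpi_learn.py` can be used instead of srun.
srun python mpi_learn.py
```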
where `X` is the number of nodes for distributed training and the total number of GPUs is `X * 4`. This configuration guarantees 1 MPI process per GPU, regardless of the value of `X`.
Update the `num_gpus` value in `conf.yaml` to correspond to the total number of GPUs specified for your Slurm allocation.
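For example, with `X = 2` nodes the entry (wherever it appears in your copy of `conf.yaml`) would read:

```yaml
num_gpus: 8    # total GPUs = X * 4 = 2 nodes * 4 GPUs per node
```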
Submit the job with (assuming you are still in the `examples/` subdirectory):
```bash
sbatch slurm.cmd
```

And monitor its completion via:

```bash
squeue -u <netid>
```
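By default, Slurm also writes the job's standard output to a file named `slurm-<jobid>.out` in the submission directory (unless the batch script sets `--output`), which you can follow while the job is running:

```bash
tail -f slurm-<jobid>.out
```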
Optionally, add email notification options to the Slurm script to be notified about job completion:
```
#SBATCH --mail-user=<netid>@princeton.edu
#SBATCH --mail-type=ALL
```
##### Interactive analysis
The workflow is to request an interactive session:
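The exact allocation command is not shown in this excerpt; a request consistent with 1 MPI process per GPU on a 4-GPU node might look like this (node count and walltime are placeholders):

```bash
salloc -N 1 --ntasks-per-node=4 --gres=gpu:4 -t 1:00:00
```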
Then, launch the application from the command line:
```bash
mpirun -N 4 python mpi_learn.py
```
where `-N` is a synonym for `-npernode` in OpenMPI. Do **not** use `srun` to launch the job inside an interactive session.
[//]: # (This option appears to be redundant given the salloc options; "mpirun python mpi_learn.py" appears to work just the same. HOWEVER, "srun python mpi_learn.py", "srun --ntasks-per-node python mpi_learn.py", etc. NEVER works--- it just hangs without any output. Why?)
[//]: # (Consistent with https://www.open-mpi.org/faq/?category=slurm ?)
[//]: # (certain output seems to be repeated by ntasks-per-node, e.g. echoing the conf.yaml. Expected?)