
Commit f73ef15

📝 update intro to slurm
1 parent 38473c2 commit f73ef15


blog/2025-11-01-slurm.md

Lines changed: 256 additions & 3 deletions
@@ -6,7 +6,7 @@ tags: [slurm, cloud-computing]
---

:::info
-If you are using the BASIC server, you probabily don't have to care about slurm. However, if you would like to use the resources from the NCHC, you should definetely know how to use it.
+If you're only working on the BASIC Lab server, Slurm might not be necessary yet. However, if you plan to use NCHC resources, then learning Slurm is a must. All NCHC clusters are managed through Slurm.
:::

## Introduction
@@ -453,7 +453,7 @@ Job finished at: Sat Nov 01 10:31:00 2025

### Understanding the Job Script Options

-Let's break down what those #SBATCH lines actually mean:
+Let's break down what those `#SBATCH` lines actually mean:

* `--job-name`: Give your job a memorable name (shows up in `squeue`)
* `--output`: Where to save standard output (`%x` = job name, `%j` = job ID)
@@ -556,6 +556,259 @@ We've all been there. You submit a job, feeling confident, and then... it fails
Don't panic if you see messages in the `*.err` file! Despite the name, not everything in the error file is actually an error. Many programs print normal informational messages, warnings, and progress updates to stderr, which ends up in your `*.err` file. Meanwhile, your `*.out` file might be empty or only contain your explicit `echo` statements. Therefore, always check BOTH files - `*.out` AND `*.err` - to get the full picture of what your job is doing. The `*.err` file often contains the most useful information, even when everything is working perfectly fine.
:::
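
A quick way to get that full picture is to inspect both streams together. A small sketch (the file names here are hypothetical, assuming the `train_<name>_<jobid>` log pattern used later in this post):

```bash
# Follow stdout and stderr side by side while the job runs;
# tail prints a "==> file <==" header before each file's output
tail -f logs/train_exp1_12345.out logs/train_exp1_12345.err
```
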
## Wrapper Script for Dynamic Job Submission

Here's a common frustration: Slurm job scripts don't take command-line arguments the way you might expect. Anything you pass after the script name only arrives as plain positional parameters (`$1`, `$2`, ...) inside the script, and the `#SBATCH` header can't see those values at all - so you can't just do `sbatch my_job.sh --learning-rate 0.001` and have it work.

**The problem**: You want to run the same experiment with different hyperparameters, seeds, or datasets, but you don't want to manually edit your job script 20 times or create 20 different files.

**The solution**: Create a bash wrapper script that takes arguments and generates + submits the Slurm job for you!
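
For context, `sbatch` itself offers two lighter-weight workarounds that cover the simple cases (both are standard `sbatch` behavior):

```bash
# 1) Arguments after the script name are forwarded to the script,
#    so inside my_job.sh you can read them as LR=$1, BATCH_SIZE=$2
sbatch my_job.sh 0.001 32

# 2) Options on the sbatch command line override the #SBATCH
#    directives baked into the script
sbatch --time=12:00:00 --mem=32G my_job.sh
```

You can get reasonably far by combining these, but typing every override on each run gets unwieldy fast; the wrapper below bundles the bookkeeping (consistent job names, log files, and result directories derived from your parameters) into a single command.
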

### Example

Let's say you want to run training jobs with different learning rates and batch sizes. Here's a wrapper script:

<details>

<summary>`submit_training.sh`</summary>

```bash
#!/bin/bash

# Check if correct number of arguments provided
if [ "$#" -ne 3 ]; then
    echo "Usage: $0 <learning_rate> <batch_size> <experiment_name>"
    echo "Example: $0 0.001 32 exp1"
    exit 1
fi

# Parse arguments
LR=$1
BATCH_SIZE=$2
EXP_NAME=$3

# Create a temporary job script
JOB_SCRIPT=$(mktemp /tmp/slurm_job_XXXXXX.sh)

# Write the Slurm job script dynamically
cat > "$JOB_SCRIPT" << EOF
#!/bin/bash
#SBATCH --job-name=train_${EXP_NAME}
#SBATCH --output=logs/train_${EXP_NAME}_%j.out
#SBATCH --error=logs/train_${EXP_NAME}_%j.err
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --partition=gpu

echo "=========================================="
echo "Experiment: ${EXP_NAME}"
echo "Learning Rate: ${LR}"
echo "Batch Size: ${BATCH_SIZE}"
echo "Job ID: \$SLURM_JOB_ID"
echo "=========================================="

module load python/3.10 cuda/11.8
source ~/venv/bin/activate

# Run training with specified parameters
python train.py \\
    --learning-rate ${LR} \\
    --batch-size ${BATCH_SIZE} \\
    --experiment-name ${EXP_NAME} \\
    --output-dir results/${EXP_NAME}

echo "Training completed at \$(date)"
EOF

# Make sure logs directory exists
mkdir -p logs
mkdir -p results/${EXP_NAME}

# Submit the job
echo "Submitting job for experiment: ${EXP_NAME}"
echo "  Learning rate: ${LR}"
echo "  Batch size: ${BATCH_SIZE}"
sbatch "$JOB_SCRIPT"

# Clean up the temporary script
rm "$JOB_SCRIPT"

echo "Job submitted successfully!"
```

</details>

One detail worth noting: because the heredoc delimiter (`EOF`) is unquoted, `${LR}`, `${BATCH_SIZE}`, and `${EXP_NAME}` are expanded while the wrapper writes the file, whereas the escaped `\$SLURM_JOB_ID`, `\$(date)`, and `\\` line continuations are written out literally and only evaluated when the job itself runs.

Now you can easily submit jobs with different parameters:

```bash
> ./submit_training.sh 0.001 32 exp_lr001_bs32
> ./submit_training.sh 0.01 64 exp_lr01_bs64
> ./submit_training.sh 0.0001 128 exp_lr0001_bs128
```

Or even loop through multiple configurations:

```bash
for lr in 0.001 0.01 0.0001; do
    for bs in 32 64 128; do
        ./submit_training.sh $lr $bs exp_lr${lr}_bs${bs}
    done
done
```
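
That nested loop drops nine jobs into the queue at once. To keep an eye on them, `squeue` with a custom format works well (`%i`, `%j`, `%T`, and `%M` are standard squeue format codes for job ID, name, state, and elapsed time):

```bash
# List your own jobs with ID, name, state, and elapsed time
squeue -u $USER --format="%.10i %.25j %.8T %.10M"
```
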

### With Slurm parameters as arguments

Here's a more sophisticated version that also lets you override the Slurm resource requests (CPUs, memory, GPUs, time limit, partition) as optional arguments:

<details>

<summary>`submit_training_with_slurm_params.sh`</summary>

```bash
#!/bin/bash

# Default values
CPUS=4
MEM="16G"
GPUS=1
TIME="04:00:00"
PARTITION="gpu"

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --lr)
            LR="$2"
            shift 2
            ;;
        --batch-size)
            BATCH_SIZE="$2"
            shift 2
            ;;
        --exp-name)
            EXP_NAME="$2"
            shift 2
            ;;
        --cpus)
            CPUS="$2"
            shift 2
            ;;
        --mem)
            MEM="$2"
            shift 2
            ;;
        --gpus)
            GPUS="$2"
            shift 2
            ;;
        --time)
            TIME="$2"
            shift 2
            ;;
        --partition)
            PARTITION="$2"
            shift 2
            ;;
        *)
            echo "Unknown option: $1"
            echo "Usage: $0 --lr <value> --batch-size <value> --exp-name <name> [options]"
            echo "Options:"
            echo "  --cpus <num>        Number of CPUs (default: 4)"
            echo "  --mem <size>        Memory (default: 16G)"
            echo "  --gpus <num>        Number of GPUs (default: 1)"
            echo "  --time <time>       Time limit (default: 04:00:00)"
            echo "  --partition <name>  Partition name (default: gpu)"
            exit 1
            ;;
    esac
done

# Check required arguments
if [ -z "$LR" ] || [ -z "$BATCH_SIZE" ] || [ -z "$EXP_NAME" ]; then
    echo "Error: --lr, --batch-size, and --exp-name are required"
    exit 1
fi

# Create temporary job script
JOB_SCRIPT=$(mktemp /tmp/slurm_job_XXXXXX.sh)

cat > "$JOB_SCRIPT" << EOF
#!/bin/bash
#SBATCH --job-name=train_${EXP_NAME}
#SBATCH --output=logs/train_${EXP_NAME}_%j.out
#SBATCH --error=logs/train_${EXP_NAME}_%j.err
#SBATCH --cpus-per-task=${CPUS}
#SBATCH --mem=${MEM}
#SBATCH --gres=gpu:${GPUS}
#SBATCH --time=${TIME}
#SBATCH --partition=${PARTITION}

echo "=========================================="
echo "Experiment: ${EXP_NAME}"
echo "Learning Rate: ${LR}"
echo "Batch Size: ${BATCH_SIZE}"
echo "CPUs: ${CPUS}, Memory: ${MEM}, GPUs: ${GPUS}"
echo "Job ID: \$SLURM_JOB_ID"
echo "=========================================="

module load python/3.10 cuda/11.8
source ~/venv/bin/activate

python train.py \\
    --learning-rate ${LR} \\
    --batch-size ${BATCH_SIZE} \\
    --experiment-name ${EXP_NAME} \\
    --output-dir results/${EXP_NAME}

echo "Training completed at \$(date)"
EOF

mkdir -p logs results/${EXP_NAME}

echo "Submitting job: ${EXP_NAME}"
echo "  LR: ${LR}, Batch Size: ${BATCH_SIZE}"
echo "  Resources: ${CPUS} CPUs, ${MEM} RAM, ${GPUS} GPUs"
echo "  Time limit: ${TIME}, Partition: ${PARTITION}"

sbatch "$JOB_SCRIPT"
rm "$JOB_SCRIPT"

echo "Job submitted!"
```

</details>

The manual `while`/`case` loop is used here rather than bash's built-in `getopts` because `getopts` only understands single-letter options like `-l`, not long flags like `--lr`.

To use this script, you can run:

```shell
# Basic usage
> ./submit_training_with_slurm_params.sh --lr 0.001 --batch-size 32 --exp-name my_exp

# With custom resources
> ./submit_training_with_slurm_params.sh \
    --lr 0.001 \
    --batch-size 32 \
    --exp-name big_exp \
    --cpus 8 \
    --mem 32G \
    --gpus 2 \
    --time 12:00:00
```

:::note
The wrapper creates temporary job scripts that are deleted after submission. If you want to keep them for debugging, you can save them to a directory instead:

```bash
# Instead of mktemp and rm, use:
JOB_SCRIPT="job_scripts/train_${EXP_NAME}_$(date +%Y%m%d_%H%M%S).sh"
mkdir -p job_scripts
# ... write to $JOB_SCRIPT ...
# Don't delete it
```

:::

## Wrapping Up

So there you have it - Slurm in a nutshell (okay, maybe a large nutshell).

@@ -568,7 +821,7 @@ Once you're comfortable with the basics, here are some powerful features worth exploring:

* Array jobs (`--array`) - Run the same script hundreds of times with different parameters in one submission (see the sketch after this list)
* Job dependencies (`--dependency`) - Chain jobs together so they run sequentially without manual intervention
-* Job history (sacct) - Analyze past jobs to optimize your resource requests
+* Job history (`sacct`) - Analyze past jobs to optimize your resource requests
* Job templates - Create reusable script templates for common workflows

These features can seriously streamline your workflow once you're ready for them!
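
To make the first bullet concrete, here's a minimal, hypothetical sketch of how the earlier learning-rate sweep could collapse into a single array-job submission (the script name and parameter values are illustrative, not from this post's tested examples):

```bash
#!/bin/bash
#SBATCH --job-name=lr_sweep
#SBATCH --output=logs/lr_sweep_%A_%a.out
#SBATCH --array=0-2

# One task per learning rate; %A is the array's job ID, %a the task index
LRS=(0.001 0.01 0.0001)
LR=${LRS[$SLURM_ARRAY_TASK_ID]}

python train.py --learning-rate $LR --experiment-name exp_lr${LR}
```

Submitted once with `sbatch lr_sweep.sh`, this queues three tasks, and a follow-up job can wait on all of them with `sbatch --dependency=afterok:<array_job_id> evaluate.sh`.
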
