
Commit f73ef15

📝 update intro to slurm
1 parent 38473c2 commit f73ef15


blog/2025-11-01-slurm.md

Lines changed: 256 additions & 3 deletions
@@ -6,7 +6,7 @@ tags: [slurm, cloud-computing]
---

:::info
-If you are using the BASIC server, you probabily don't have to care about slurm. However, if you would like to use the resources from the NCHC, you should definetely know how to use it.
+If you're only working on the BASIC Lab server, Slurm might not be necessary yet. However, if you plan to use NCHC resources, then learning Slurm is a must. All NCHC clusters are managed through Slurm.
:::

## Introduction
@@ -453,7 +453,7 @@ Job finished at: Sat Nov 01 10:31:00 2025

### Understanding the Job Script Options

-Let's break down what those #SBATCH lines actually mean:
+Let's break down what those `#SBATCH` lines actually mean:

* `--job-name`: Give your job a memorable name (shows up in `squeue`)
* `--output`: Where to save standard output (`%x` = job name, `%j` = job ID)
@@ -556,6 +556,259 @@ We've all been there. You submit a job, feeling confident, and then... it fails
Don't panic if you see messages in the `*.err` file! Despite the name, not everything in the error file is actually an error. Many programs print normal informational messages, warnings, and progress updates to stderr, which ends up in your `*.err` file. Meanwhile, your `*.out` file might be empty or only contain your explicit `echo` statements. Therefore, always check BOTH files - `*.out` AND `*.err` - to get the full picture of what your job is doing. The `*.err` file often contains the most useful information, even when everything is working perfectly fine.
:::
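
A quick way to get that full picture is to inspect both streams together. A small sketch (the file names here are hypothetical, assuming the `train_<name>_<jobid>` log pattern used later in this post):

```bash
# Follow stdout and stderr side by side while the job runs;
# tail prints a "==> file <==" header before each file's output
tail -f logs/train_exp1_12345.out logs/train_exp1_12345.err
```
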
## Wrapper Script for Dynamic Job Submission

Here's a common frustration: Slurm job scripts don't take command-line arguments the way you might expect. Anything you pass after the script name only arrives as plain positional parameters (`$1`, `$2`, ...) inside the script, and the `#SBATCH` header can't see those values at all - so you can't just do `sbatch my_job.sh --learning-rate 0.001` and have it work.

**The problem**: You want to run the same experiment with different hyperparameters, seeds, or datasets, but you don't want to manually edit your job script 20 times or create 20 different files.

**The solution**: Create a bash wrapper script that takes arguments and generates + submits the Slurm job for you!
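
For context, `sbatch` itself offers two lighter-weight workarounds that cover the simple cases (both are standard `sbatch` behavior):

```bash
# 1) Arguments after the script name are forwarded to the script,
#    so inside my_job.sh you can read them as LR=$1, BATCH_SIZE=$2
sbatch my_job.sh 0.001 32

# 2) Options on the sbatch command line override the #SBATCH
#    directives baked into the script
sbatch --time=12:00:00 --mem=32G my_job.sh
```

You can get reasonably far by combining these, but typing every override on each run gets unwieldy fast; the wrapper below bundles the bookkeeping (consistent job names, log files, and result directories derived from your parameters) into a single command.
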

### Example

Let's say you want to run training jobs with different learning rates and batch sizes. Here's a wrapper script:

<details>

<summary>`submit_training.sh`</summary>

```bash
#!/bin/bash

# Check if correct number of arguments provided
if [ "$#" -ne 3 ]; then
    echo "Usage: $0 <learning_rate> <batch_size> <experiment_name>"
    echo "Example: $0 0.001 32 exp1"
    exit 1
fi

# Parse arguments
LR=$1
BATCH_SIZE=$2
EXP_NAME=$3

# Create a temporary job script
JOB_SCRIPT=$(mktemp /tmp/slurm_job_XXXXXX.sh)

# Write the Slurm job script dynamically
cat > "$JOB_SCRIPT" << EOF
#!/bin/bash
#SBATCH --job-name=train_${EXP_NAME}
#SBATCH --output=logs/train_${EXP_NAME}_%j.out
#SBATCH --error=logs/train_${EXP_NAME}_%j.err
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00
#SBATCH --partition=gpu

echo "=========================================="
echo "Experiment: ${EXP_NAME}"
echo "Learning Rate: ${LR}"
echo "Batch Size: ${BATCH_SIZE}"
echo "Job ID: \$SLURM_JOB_ID"
echo "=========================================="

module load python/3.10 cuda/11.8
source ~/venv/bin/activate

# Run training with specified parameters
python train.py \\
    --learning-rate ${LR} \\
    --batch-size ${BATCH_SIZE} \\
    --experiment-name ${EXP_NAME} \\
    --output-dir results/${EXP_NAME}

echo "Training completed at \$(date)"
EOF

# Make sure logs directory exists
mkdir -p logs
mkdir -p results/${EXP_NAME}

# Submit the job
echo "Submitting job for experiment: ${EXP_NAME}"
echo "  Learning rate: ${LR}"
echo "  Batch size: ${BATCH_SIZE}"
sbatch "$JOB_SCRIPT"

# Clean up the temporary script
rm "$JOB_SCRIPT"

echo "Job submitted successfully!"
```

</details>

One detail worth noting: because the heredoc delimiter (`EOF`) is unquoted, `${LR}`, `${BATCH_SIZE}`, and `${EXP_NAME}` are expanded while the wrapper writes the file, whereas the escaped `\$SLURM_JOB_ID`, `\$(date)`, and `\\` line continuations are written out literally and only evaluated when the job itself runs.

Now you can easily submit jobs with different parameters:

```bash
> ./submit_training.sh 0.001 32 exp_lr001_bs32
> ./submit_training.sh 0.01 64 exp_lr01_bs64
> ./submit_training.sh 0.0001 128 exp_lr0001_bs128
```

Or even loop through multiple configurations:

```bash
for lr in 0.001 0.01 0.0001; do
    for bs in 32 64 128; do
        ./submit_training.sh $lr $bs exp_lr${lr}_bs${bs}
    done
done
```
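
That nested loop drops nine jobs into the queue at once. To keep an eye on them, `squeue` with a custom format works well (`%i`, `%j`, `%T`, and `%M` are standard squeue format codes for job ID, name, state, and elapsed time):

```bash
# List your own jobs with ID, name, state, and elapsed time
squeue -u $USER --format="%.10i %.25j %.8T %.10M"
```
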

### With Slurm parameters as arguments

Here's a more sophisticated version that also lets you override the Slurm resource requests (CPUs, memory, GPUs, time limit, partition) as optional arguments:

<details>

<summary>`submit_training_with_slurm_params.sh`</summary>

```bash
#!/bin/bash

# Default values
CPUS=4
MEM="16G"
GPUS=1
TIME="04:00:00"
PARTITION="gpu"

# Parse command line arguments
while [[ $# -gt 0 ]]; do
    case $1 in
        --lr)
            LR="$2"
            shift 2
            ;;
        --batch-size)
            BATCH_SIZE="$2"
            shift 2
            ;;
        --exp-name)
            EXP_NAME="$2"
            shift 2
            ;;
        --cpus)
            CPUS="$2"
            shift 2
            ;;
        --mem)
            MEM="$2"
            shift 2
            ;;
        --gpus)
            GPUS="$2"
            shift 2
            ;;
        --time)
            TIME="$2"
            shift 2
            ;;
        --partition)
            PARTITION="$2"
            shift 2
            ;;
        *)
            echo "Unknown option: $1"
            echo "Usage: $0 --lr <value> --batch-size <value> --exp-name <name> [options]"
            echo "Options:"
            echo "  --cpus <num>        Number of CPUs (default: 4)"
            echo "  --mem <size>        Memory (default: 16G)"
            echo "  --gpus <num>        Number of GPUs (default: 1)"
            echo "  --time <time>       Time limit (default: 04:00:00)"
            echo "  --partition <name>  Partition name (default: gpu)"
            exit 1
            ;;
    esac
done

# Check required arguments
if [ -z "$LR" ] || [ -z "$BATCH_SIZE" ] || [ -z "$EXP_NAME" ]; then
    echo "Error: --lr, --batch-size, and --exp-name are required"
    exit 1
fi

# Create temporary job script
JOB_SCRIPT=$(mktemp /tmp/slurm_job_XXXXXX.sh)

cat > "$JOB_SCRIPT" << EOF
#!/bin/bash
#SBATCH --job-name=train_${EXP_NAME}
#SBATCH --output=logs/train_${EXP_NAME}_%j.out
#SBATCH --error=logs/train_${EXP_NAME}_%j.err
#SBATCH --cpus-per-task=${CPUS}
#SBATCH --mem=${MEM}
#SBATCH --gres=gpu:${GPUS}
#SBATCH --time=${TIME}
#SBATCH --partition=${PARTITION}

echo "=========================================="
echo "Experiment: ${EXP_NAME}"
echo "Learning Rate: ${LR}"
echo "Batch Size: ${BATCH_SIZE}"
echo "CPUs: ${CPUS}, Memory: ${MEM}, GPUs: ${GPUS}"
echo "Job ID: \$SLURM_JOB_ID"
echo "=========================================="

module load python/3.10 cuda/11.8
source ~/venv/bin/activate

python train.py \\
    --learning-rate ${LR} \\
    --batch-size ${BATCH_SIZE} \\
    --experiment-name ${EXP_NAME} \\
    --output-dir results/${EXP_NAME}

echo "Training completed at \$(date)"
EOF

mkdir -p logs results/${EXP_NAME}

echo "Submitting job: ${EXP_NAME}"
echo "  LR: ${LR}, Batch Size: ${BATCH_SIZE}"
echo "  Resources: ${CPUS} CPUs, ${MEM} RAM, ${GPUS} GPUs"
echo "  Time limit: ${TIME}, Partition: ${PARTITION}"

sbatch "$JOB_SCRIPT"
rm "$JOB_SCRIPT"

echo "Job submitted!"
```

</details>

The manual `while`/`case` loop is used here rather than bash's built-in `getopts` because `getopts` only understands single-letter options like `-l`, not long flags like `--lr`.

To use this script, you can run:

```shell
# Basic usage
> ./submit_training_with_slurm_params.sh --lr 0.001 --batch-size 32 --exp-name my_exp

# With custom resources
> ./submit_training_with_slurm_params.sh \
    --lr 0.001 \
    --batch-size 32 \
    --exp-name big_exp \
    --cpus 8 \
    --mem 32G \
    --gpus 2 \
    --time 12:00:00
```

:::note
The wrapper creates temporary job scripts that are deleted after submission. If you want to keep them for debugging, you can save them to a directory instead:

```bash
# Instead of mktemp and rm, use:
JOB_SCRIPT="job_scripts/train_${EXP_NAME}_$(date +%Y%m%d_%H%M%S).sh"
mkdir -p job_scripts
# ... write to $JOB_SCRIPT ...
# Don't delete it
```

:::

## Wrapping Up

So there you have it - Slurm in a nutshell (okay, maybe a large nutshell).

@@ -568,7 +821,7 @@ Once you're comfortable with the basics, here are some powerful features worth exploring:

* Array jobs (`--array`) - Run the same script hundreds of times with different parameters in one submission (see the sketch after this list)
* Job dependencies (`--dependency`) - Chain jobs together so they run sequentially without manual intervention
-* Job history (sacct) - Analyze past jobs to optimize your resource requests
+* Job history (`sacct`) - Analyze past jobs to optimize your resource requests
* Job templates - Create reusable script templates for common workflows

These features can seriously streamline your workflow once you're ready for them!
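
To make the first bullet concrete, here's a minimal, hypothetical sketch of how the earlier learning-rate sweep could collapse into a single array-job submission (the script name and parameter values are illustrative, not from this post's tested examples):

```bash
#!/bin/bash
#SBATCH --job-name=lr_sweep
#SBATCH --output=logs/lr_sweep_%A_%a.out
#SBATCH --array=0-2

# One task per learning rate; %A is the array's job ID, %a the task index
LRS=(0.001 0.01 0.0001)
LR=${LRS[$SLURM_ARRAY_TASK_ID]}

python train.py --learning-rate $LR --experiment-name exp_lr${LR}
```

Submitted once with `sbatch lr_sweep.sh`, this queues three tasks, and a follow-up job can wait on all of them with `sbatch --dependency=afterok:<array_job_id> evaluate.sh`.
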
