4.3.1. Basic Slurm jobs#
This section describes the basics of queuing jobs on the Esrum
cluster using the Slurm Workload Manager. This includes queuing tasks
with the sbatch command, monitoring jobs with squeue and
sacct, cancelling jobs with scancel, and reserving resources
for jobs that need more CPUs or more RAM.
Users of the PBS (qsub) queuing system on e.g. porus or
computerome can use this PBS to Slurm translation sheet to
migrate qsub scripts/commands to sbatch.
4.3.1.1. A basic job script#
In order to run a job using the Slurm workload manager, you must first
write a shell script containing the commands that you want to execute.
In the following example we just run a single command, echo "Hello,
slurm!", but scripts can contain any number of commands.
#!/bin/bash
echo "Hello, slurm!"
The script can be named anything you like and does not need to be
executable (via chmod +x), but the first line must contain a
shebang (the line starting with #!) to indicate how Slurm should
execute it.
We use #!/bin/bash for the examples in this section, to indicate
that they are bash scripts, but it is also possible to use other
scripting languages by using the appropriate shebang:
#!/usr/bin/env python3
print("Hello, slurm!")
Slurm scripts function like regular scripts for the most part: the current directory corresponds to the directory in which you executed the script, you can access environment variables set outside of the script, and it is possible to pass command-line arguments to your scripts (see below).
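As a minimal sketch (the echoed values are purely illustrative), a job script could make use of all three:
#!/bin/bash
# The working directory is the directory from which the job was queued
echo "Working directory: ${PWD}"
# Environment variables exported before queuing the job are also visible here
echo "User: ${USER}"
# Command-line arguments passed after the script name to sbatch
echo "First argument: ${1:-<none>}"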
4.3.1.2. Queuing a job#
In the following examples we will use the igzip command to compress
a file. The igzip command is similar to gzip except that it is
only available via a module, that it sacrifices compression ratio for
speed, and that it supports multiple threads. This allows us to test
those features with Slurm.
We start with a simple script, with which we will compress the FASTA
file chr1.fasta. This script is saved as my_script.sh:
#!/bin/bash
module load igzip/2.30.0
igzip --keep "chr1.fasta"
The module command is used to load the required software from the KU-IT
provided library of scientific and other software. The
Environment modules page gives an introduction to using modules on
Esrum, but for now all you need to know is that the above command makes
the igzip tool available to us. We could also have loaded the module
on the command-line before queuing the command, as Slurm will remember
what modules we have loaded, but it is recommended to load all required
software in your job scripts to ensure that they are reproducible.
The --keep option for igzip is used to prevent igzip from
deleting our input file when it is done.
To queue this script, run the sbatch command with the filename of
the script as an argument:
$ ls
chr1.fasta my_script.sh
$ sbatch my_script.sh
Submitted batch job 8503
Notice that we do not need to set the current working directory in our
script (unlike PBS). As noted above, this defaults to the directory in
which you queued the script. The number reported by sbatch is the job ID
of your job (JOBID), which you will need should you want to cancel,
pause, or otherwise manipulate your job (see below).
Once the job has started running (or has completed running), you will
also find a file named slurm-${JOBID}.out in the current folder,
where ${JOBID} is the ID reported by sbatch (8503 in this
example):
$ ls
chr1.fasta chr1.fasta.gz my_script.sh slurm-8503.out
The slurm-8503.out file contains any console output produced by your
script/commands. This includes both STDOUT and STDERR by default, but
this can be changed (see Common options). So if we had
misspelled the filename in our command, then the resulting error message
would be found in the .out file:
$ cat slurm-8503.out
igzip: chr1.fast does not exist
4.3.1.2.1. Passing arguments to sbatch scripts#
Arguments specified after the name of the sbatch script are passed
to that script, just as if you were running it normally. This allows us
to update our script above to take a filename on the command line
instead of hard-coding that filename:
#!/bin/bash
module load igzip/2.30.0
igzip --keep "${1}"
We can then invoke the script using sbatch as above, specifying the
name of the file we want to compress on the command line:
$ sbatch my_script.sh "chr1.fasta"
This is equivalent to the original script, except that we can now easily submit a job for any file that we want to process, without having to update our script every time.
For further information, see this tutorial for a brief overview of ways to use command-line arguments in a bash script.
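For example, the script could be extended to compress any number of files passed on the command line (a sketch building on the script above):
#!/bin/bash
module load igzip/2.30.0
# Compress every file passed to the script, e.g.
#   sbatch my_script.sh chr1.fasta chr2.fasta
for filename in "$@"; do
    igzip --keep "${filename}"
done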
4.3.1.3. Monitoring your jobs#
You can check the status of your queued and running jobs using the
squeue --me command. The --me option ensures that only your
jobs are shown, rather than everyone's jobs:
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8503 standardq my_scrip abc123 R 0:02 1 esrumcmpn01fl
The ST column indicates the status of the job (R for running, PD for pending, and so on).
Completed jobs are removed from the squeue list and can instead be
listed using sacct:
$ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
8503 my_script+ standardq+ 1 COMPLETED 0:0
8503.batch batch 1 COMPLETED 0:0
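sacct also accepts a --format option for selecting which columns to show. For example, the following lists the state, elapsed time, and peak memory usage (MaxRSS) of a specific job, using standard sacct field names:
$ sacct -j 8503 --format=JobID,JobName,State,Elapsed,MaxRSS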
4.3.1.4. Cancelling jobs#
Queued and running jobs can be cancelled using the scancel command and
the ID of the job you want to cancel:
$ squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
8503 standardq my_scrip abc123 R 0:02 1 esrumcmpn01fl
$ scancel 8503
Should you wish to cancel all your jobs, use the -u option:
$ scancel -u ${USER}
When running batch jobs you can either cancel the entire job (array, see below) or individual sub-tasks. See the Monitoring your jobs section.
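scancel can also filter the jobs to be cancelled by other properties, for example by state or by job name (a sketch using standard scancel options; unless a --job-name was set, the job name defaults to the script filename):
$ scancel --state=PENDING -u ${USER}
$ scancel --name my_script.sh -u ${USER}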
4.3.1.5. Setting options#
The sbatch command offers two methods for setting options, such as
resource requirements, notifications, etc (see e.g.
Common options). The first method is simply to specify the
options on the command line (e.g. sbatch --my-option my_script.sh).
Note that options for sbatch must be placed before the filename
for your script. Options placed after the filename for your script
(e.g. sbatch my_script.sh --my-option) will instead be passed
directly to that script. This makes it simple to generalize scripts
using standard scripting techniques.
The second method, which we recommend for resource requirements and the
like, is to use #SBATCH comments.
For example, instead of queuing our job with the command
$ sbatch --my-option my_script.sh
we could instead modify my_script.sh by adding a line containing
#SBATCH --my-option near the top of the file:
#!/bin/bash
#SBATCH --my-option
module load igzip/2.30.0
igzip --keep "chr1.fasta"
If we do so, then running sbatch my_script.sh becomes equivalent
to running sbatch --my-option my_script.sh. This has the advantage
that our options are recorded along with the commands, and that we do
not have to remember to specify those options every time we run sbatch
my_script.sh.
This documentation will make use of #SBATCH comments, but remember
that you can also specify options directly on the command-line. If you
specify options on the command-line, then they take precedence over
options specified using #SBATCH comments.
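For example, if my_script.sh contains an #SBATCH --cpus-per-task 8 line, then the following command would run the job with 4 CPUs instead:
$ sbatch --cpus-per-task 4 my_script.sh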
Note
The #SBATCH lines must be at the top of the file, before any
other commands or the like. Moreover, there must be no spaces before
or after the # in the #SBATCH comments. Other comments (lines
starting with #) are allowed before and after the #SBATCH
comments.
#SBATCH comments can also be used with other scripting languages,
provided that you follow the rules described above, but note that
source-code formatters like black may add spaces after the #
and thereby break the #SBATCH comments.
4.3.1.6. Reserving resources#
By default, sbatch will request 1 CPU and just under 15 GB of RAM
per reserved CPU. Jobs will not be executed before the requested
resources are available on a node, and your jobs cannot exceed the amount
of resources you've requested.
Should your job require more CPUs, then you can request them using the
-c or --cpus-per-task option. The following script runs a job
with 8 CPUs, and is therefore automatically assigned 8 * 15 ~= 120
gigabytes of RAM:
#!/bin/bash
#SBATCH --cpus-per-task 8
module load igzip/2.30.0
igzip --keep --threads 8 "chr1.fasta"
Notice that we not only need to reserve the CPUs; in almost all
cases we also need to tell our programs to actually use those CPUs. With
igzip this is accomplished by using the --threads option as
shown above. If this is not done, then the reserved CPUs will have no
effect on how long it takes for your program to run!
To avoid having to write the same number of threads multiple times, we
can instead use the ${SLURM_CPUS_PER_TASK} variable, which is
automatically set to the number of CPUs we've requested:
#!/bin/bash
#SBATCH --cpus-per-task 8
module load igzip/2.30.0
igzip --keep --threads ${SLURM_CPUS_PER_TASK} "chr1.fasta"
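Should you also want to run the same script outside of Slurm, where ${SLURM_CPUS_PER_TASK} is not set, a shell default value can be used as a fallback (a sketch; this is not required when running via sbatch):
#!/bin/bash
#SBATCH --cpus-per-task 8
module load igzip/2.30.0
# Use the number of CPUs reserved by Slurm, or a single thread outside of Slurm
igzip --keep --threads "${SLURM_CPUS_PER_TASK:-1}" "chr1.fasta"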
The amount of RAM allocated by default should be sufficient for most
tasks, but when needed you can request additional RAM using either the
--mem-per-cpu or the --mem option. The --mem-per-cpu option
allows you to request an amount of memory that depends on the number of
CPUs you request (defaulting to just under 15 GB per CPU), while the
--mem option allows you to request a specific amount of memory
regardless of how many (or how few) CPUs you reserve.
The following script runs a task with 8 CPUs and 512 gigabytes of RAM:
#!/bin/bash
#SBATCH --cpus-per-task 8
#SBATCH --mem 512G
module load igzip/2.30.0
igzip --keep --threads ${SLURM_CPUS_PER_TASK} "chr1.fasta"
The same total could have been requested by using #SBATCH
--mem-per-cpu 64G instead of #SBATCH --mem 512G.
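In other words, the script above could equivalently have been written as follows:
#!/bin/bash
#SBATCH --cpus-per-task 8
#SBATCH --mem-per-cpu 64G
module load igzip/2.30.0
igzip --keep --threads ${SLURM_CPUS_PER_TASK} "chr1.fasta"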
As described in the Overview, each node has 128 CPUs available and 2 TB of RAM, of which 1993 GB can be reserved by Slurm. The GPU node has 4 TB of RAM available, of which 3920 GB can be reserved by Slurm, and may be used for jobs that have very high memory requirements. However, since we only have one GPU node we ask that you use the regular nodes unless your jobs actually require that much RAM. See the Using the GPU/hi-MEM node section for how to use the GPU node with or without reserving a GPU.
Warning
The --nodes option and the --ntasks option will start
identical tasks on one or more nodes, so you should not be using
these options unless your tools are specifically designed for this!
Otherwise each instance will try to write to the same output file(s)
and will produce results that are very likely corrupt.
If you need to run the same command on a set of files/samples, then see the Monitoring your jobs section.
4.3.1.6.1. Best practice for reserving resources#
Determining how many CPUs and how much memory you need to reserve for your jobs can be difficult:
Few programs benefit from using a lot of threads (CPUs) due to overhead and due to limits to how much of a given process can be parallelized (see Amdahl's law). Maximum throughput is often limited by how fast the software can read/write data.
We therefore recommend that you
Always refer to the documentation and recommendations for the specific software you are using!
Test the effect of the number of threads you are using before starting a lot of jobs.
Start with fewer CPUs and increase the number only when there is a benefit to doing so. You can for example start with 2, 4, or 8 CPUs per task, and only increase the number once you have determined that the software benefits from the additional CPUs.
4.3.1.6.2. Monitoring resources used by jobs#
Once you have actually started running a job, you have several options for monitoring resource usage:
The /usr/bin/time -f "CPU = %P, MEM = %MKB" command can be used to
estimate the efficiency from using multiple threads and to show how much
memory a program used:
$ /usr/bin/time -f "CPU = %P, MEM = %MKB" my-command --threads 1 ...
CPU = 99%, MEM = 840563KB
$ /usr/bin/time -f "CPU = %P, MEM = %MKB" my-command --threads 4 ...
CPU = 345%, MEM = 892341KB
$ /usr/bin/time -f "CPU = %P, MEM = %MKB" my-command --threads 8 ...
CPU = 605%, MEM = 936324KB
In this example, increasing the number of threads/CPUs did not result in a proportional increase in CPU usage: we only saw a 3.5x increase with 4 CPUs and only a 6x increase with 8 CPUs. Here it would be more efficient to run two tasks with 4 CPUs each rather than one task with 8 CPUs.
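A small loop makes it easy to run such a comparison in one go (a sketch reusing the my-command placeholder from above):
#!/bin/bash
# Compare CPU efficiency and memory usage for a range of thread counts
for threads in 1 2 4 8; do
    echo "Using ${threads} thread(s):"
    /usr/bin/time -f "CPU = %P, MEM = %MKB" my-command --threads ${threads} ...
done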
The sacct command may be used to review the average CPU usage, the
peak memory usage, disk I/O, and more for completed jobs. This makes it
easier to verify that you are not needlessly reserving resources. A
helper script is provided that summarizes some of this information in an
easily readable form:
$ source /projects/cbmr_shared/apps/modules/activate.sh
$ module load sacct-usage
$ sacct-usage
Age User Job State Elapsed CPUs CPUsWasted ExtraMem ExtraMemWasted CPUHoursWasted
13:32:04s abc123 1 FAILED 252:04:52s 8 6.9 131.4 131.4 4012.14
10:54:32s abc123 2[1] COMPLETED 02:49:25s 32 15.7 0.0 0.0 44.38
01:48:43s abc123 3 COMPLETED 01:00:53s 24 2.4 0.0 0.0 2.43
The important information is found in the CPUsWasted and
ExtraMemWasted columns, which show the average number of CPUs that went
unused and the amount of requested memory that went unused. Note that
ExtraMem only counts memory requested in addition to the default
allocation of ~16GB of RAM per CPU. That is because any additionally
reserved memory results in CPUs going unused, unless a user explicitly
asks for less RAM than the default ~16GB per CPU.
The final column indicates the number of CPU hours your job wasted,
calculated as the length of time your job ran multiplied by the number
of wasted CPUs, counting both the reserved CPUs that went unused and the
CPUs that could otherwise have been allocated the default 16GB of RAM had
ExtraMemWasted been zero. Aim for your jobs to resemble the third job,
not the second job, and especially not the first job in the example!
When reserving jobs with additional resources it can also be useful to monitor CPU/memory usage in real time. This can help diagnose poor resource usage much faster than waiting for the program to finish running. See the Monitoring processes in jobs section for information about how to do so.
Because the benefits from additional threads diminish, it is often more efficient to split your job into multiple sub-jobs (for example one job per chromosome) rather than increasing the number of threads used for the individual jobs. See the Advanced Slurm jobs page for more information about batching jobs.
4.3.1.6.3. Common options#
The following provides a brief overview of common options for sbatch
not mentioned above. All of these options may be specified using
#SBATCH comments.
The --job-name option allows you to give a name to your job. This shows up when using squeue, sacct, and more. If not specified, the name of your script is used instead.
The --output and --error options allow you to specify where Slurm writes your script's STDOUT and STDERR. The filenames should always include the text %j, which is replaced with the job ID. See the manual page for usage. Note also that the destination folder must exist or no output will be saved!
--time can be used to limit the maximum running time of your script. We do not require that --time is set, but it may be useful to automatically stop jobs that unexpectedly take too long to run. See the sbatch manual page for how to specify time limits.
--test-only can be used to test your batch scripts. Combine it with --verbose to verify that your options are correctly set before queuing your job:
$ sbatch --test-only --verbose my_script.sh
sbatch: defined options
sbatch: -------------------- --------------------
sbatch: cpus-per-task       : 8
sbatch: test-only           : set
sbatch: time                : 01:00:00
sbatch: verbose             : 1
sbatch: -------------------- --------------------
sbatch: end of defined options
[...]
sbatch: Job 8568 to start at 2023-06-28T12:15:32 using 8 processors on nodes esrumcmpn02fl in partition standardqueue
The --wait option can be used to make sbatch block until the queued tasks have completed. This can be useful if you want to run sbatch from another script.
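As a sketch, several of these options could be combined in a single job script (the job name, log paths, and time limit below are purely illustrative):
#!/bin/bash
#SBATCH --job-name=compress-chr1
#SBATCH --output=logs/%j.out
#SBATCH --error=logs/%j.err
#SBATCH --time=02:00:00
# Remember that the logs/ folder must already exist, or no output will be saved
module load igzip/2.30.0
igzip --keep "chr1.fasta"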
4.3.1.7. Interactive sessions#
If you need to run an interactive process, for example if you need to use an interactive R shell to process a large dataset, or if you just need to experiment with running a computationally heavy process, then you can start a shell on one of the compute nodes as follows:
[abc123@esrumhead01fl ~] $ srun --pty -- /bin/bash
[abc123@esrumcmpn07fl ~] $
Note how the hostname displayed changes from esrumhead01fl to
esrumcmpn07fl, where esrumcmpn07fl may be any one of the Esrum
compute nodes.
You can now run interactive programs, for example an R shell, or test computationally expensive tools or scripts. However, note that you cannot start jobs using Slurm in an interactive shell; jobs can only be started from the head node.
srun takes most of the same arguments as sbatch, including those
used for reserving additional resources if you need more than the
default 1 CPU and 15 GB of RAM:
$ srun --cpus-per-task 4 --mem 128G --pty -- /bin/bash
It is also possible to start an interactive session on the GPU/High-MEM
nodes. See the Using the GPU/hi-MEM node page for more information. See
the Advanced Slurm jobs page for more information about the
srun command.
Once you are done, be sure to exit the interactive shell by using the
exit command or pressing Ctrl+D, so that the resources reserved
for your shell are made available to other users!
4.3.1.7.1. Running graphical programs#
Should you need to run a graphical program in an interactive session,
then you must 1) enable X11 forwarding in the program you use to connect
to the cluster (e.g. using -X option with ssh), and 2) specify
the --x11 option when starting your interactive session:
$ ssh -X esrumhead01fl
$ srun --pty --x11 -- /bin/bash
$ xclock
If X11 forwarding is correctly enabled in your client, then you should see a small clock application on your desktop.
4.3.1.8. sbatch template script#
The following is a simple template for use with the sbatch command.
This script can also be downloaded here.
#!/bin/bash
# The following are commonly used options for running jobs. Remove one
# "#" from the "##SBATCH" lines (changing them to "#SBATCH") to enable
# a given option.
# The number of CPUs (cores) used by your task. Defaults to 1.
##SBATCH --cpus-per-task=1
# The amount of RAM used by your task. Tasks are automatically assigned 15G
# per CPU (set above) if this option is not set.
##SBATCH --mem=15G
# Set a maximum runtime in hours:minutes:seconds. No default limit.
##SBATCH --time=1:00:00
# Request a GPU on the GPU node. Use `--gres=gpu:a100:2` to request both GPUs.
##SBATCH --partition=gpuqueue --gres=gpu:a100:1
# Send notifications when job ends. Remember to update the email address!
##SBATCH --mail-user=abc123@ku.dk --mail-type=END,FAIL
########################
# Your commands go here:
echo "Hello world!"
See also the Writing robust bash scripts page for tips on how to write
more robust bash scripts. A template using those recommendations is
available for download here.
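As a minimal sketch of one common safeguard (a general bash technique, not necessarily identical to the template linked above), a job script might start by making bash abort on errors:
#!/bin/bash
# Abort if a command fails, if an unset variable is used, or if a command in a pipeline fails
set -euo pipefail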
4.3.1.9. What's next#
The next section of the documentation covers advanced usage of Slurm, including how to run jobs on the High-MEM/GPU node. However, if you have not already done so then it is recommended that you read the Environment modules page for an introduction on how to use the module system on Esrum to load the software you need for your work.
4.3.1.10. Troubleshooting#
4.3.1.10.1. Error: Requested node configuration is not available#
If you request too many CPUs (more than 128), or too much RAM (more than 1993 GB for compute nodes and more than 3920 GB for the GPU node), then Slurm will report that the request cannot be satisfied:
# More than 128 CPUs requested
$ sbatch --cpus-per-task 200 my_script.sh
sbatch: error: CPU count per node can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
# More than 1993 GB RAM requested on compute node
$ sbatch --mem 2000G my_script.sh
sbatch: error: Memory specification can not be satisfied
sbatch: error: Batch job submission failed: Requested node configuration is not available
To solve this, simply reduce the number of CPUs and/or the amount of RAM
requested to fit within the limits described above. If your task does
require more than 1993 GB of RAM, then you also need to add the
--partition=gpuqueue option, so that your task gets scheduled on the
GPU/High-MEM node.
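For example (the amount of memory shown is purely illustrative):
$ sbatch --partition=gpuqueue --mem 2500G my_script.sh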
Additionally, you may receive this message if you request GPUs without specifying the correct queue or if you request too many GPUs:
# --partition=gpuqueue not specified
$ srun --gres=gpu:a100:2 -- echo "Hello world!"
srun: error: Unable to allocate resources: Requested node configuration is not available
# More than 2 GPUs requested
$ srun --partition=gpuqueue --gres=gpu:a100:3 -- echo "Hello world!"
srun: error: Unable to allocate resources: Requested node configuration is not available
To solve this error, simply avoid requesting more than 2 GPUs, and
remember to include the --partition=gpuqueue option.
See also the Using the GPU/hi-MEM node section.
4.3.1.11. Additional resources#
Slurm documentation
Slurm summary (PDF)
The srun manual page