Slurm

Resource sharing and allocations on the cluster are handled by a combination of a resource manager (tracking which computational resources are available on which nodes) and a job scheduler (determining when and to which available resources to submit a particular job, and then monitoring it). To accomplish both tasks, we use Slurm, one of the most popular solutions in high performance and supercomputing.

There are two primary reasons to use Slurm. First, other than for basic, short testing, no “real” work should be performed on the login node, which has several responsibilities such as managing users, handling logins, monitoring the other nodes, etc. For that reason, nearly all work should be performed on the compute nodes, and Slurm acts as the “gateway” into those systems. Second, because Slurm keeps track of which resources are available on the compute nodes, it is able to allocate the most efficient set of them for your tasks, as quickly as possible.

Slurm is a powerful and flexible program, and as such it is beyond the scope of this document to provide an exhaustive tutorial. Rather, the examples provided here should be sufficient to get started, and a wide array of online resources is available for further guidance. (This page itself is modeled after the excellent CÉCI Slurm tutorial.)

Before working with any of the examples below, it is assumed you are logged into the cluster.

Gathering Information

Slurm offers a variety of commands to query the nodes. Most of them will not be strictly necessary for you to run, but they can provide a snapshot of the overall computational ecosystem, list jobs that are in progress or queued, and more.

sinfo

The sinfo command lists available partitions and some basic information about each. A partition is a logical grouping of physical compute nodes. Typical examples of partitions might be those dedicated to general purpose or GPU computing, debugging, post processing, or visualization. Running sinfo produces output similar to this:

$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all          up   infinite      4    mix himem01,node[01,04,06]
all          up   infinite      3  alloc node[02-03,05]
node*        up   infinite      3    mix node[01,04,06]
himem        up   infinite      1    mix himem01
gpu          up   infinite      1   idle gpu01

The output shows that we have three distinct partitions: node (with nodes 01-06), himem (with one node), and gpu (with one node). In addition, there is a fourth partition, all, which, as the name suggests, comprises all the nodes and provides a means of directing workloads to all compute resources, leaving Slurm to determine where they will run.

Note that the asterisk identifies the default partition, i.e., the one to which jobs are submitted if no partition is specified.
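If you want to pick out the default partition in a script rather than by eye, a minimal sketch follows. It assumes the default sinfo column layout shown above, and the helper name default_partition is hypothetical (it is not a Slurm command). The sample text is abbreviated from the output above so the snippet can be tried away from the cluster.

```shell
# Print the name of the default partition, given `sinfo` output on stdin.
# Assumes the default layout, where the partition name is the first column
# and the default partition carries a trailing asterisk.
default_partition() {
    awk 'NR > 1 && $1 ~ /\*$/ { sub(/\*$/, "", $1); print $1; exit }'
}

# Sample input abbreviated from the output above.
sinfo_sample='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all          up   infinite      4    mix himem01,node[01,04,06]
node*        up   infinite      3    mix node[01,04,06]
himem        up   infinite      1    mix himem01
gpu          up   infinite      1   idle gpu01'

printf '%s\n' "$sinfo_sample" | default_partition   # prints: node
```

On the cluster itself, the equivalent would be sinfo | default_partition.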

squeue

The command squeue displays a list of jobs that are currently running (denoted with R) or that are pending (denoted with PD). Here is example output:

$ squeue
JOBID PARTITION NAME USER ST  TIME  NODES NODELIST(REASON)
12345     debug job1 dave  R   0:21     4 node[9-12]
12346     debug job2 dave PD   0:00     8 (Resources)
12348     debug job3 ed   PD   0:00     4 (Priority)

In this example, job 12345 is running on nodes 9-12 within the debug partition, job 12346 is pending because the requested resources are unavailable, and job 12348 is pending because one or more higher-priority jobs are ahead of it in the queue. The other columns are largely self-explanatory, though note that TIME is how long a given job has been running so far. The squeue help page describes many other options available to control what information is displayed and how it is formatted.
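For a quick summary of how many jobs are running versus pending, the squeue output can be piped through a small filter. The following is only a sketch: it assumes the default column layout shown above (state in the fifth column), and count_by_state is a hypothetical helper name. The sample text is the output from above, so the snippet also runs away from the cluster.

```shell
# Count jobs per state (R, PD, ...) from `squeue` output on stdin.
# Assumes the default layout, where ST is the fifth column.
count_by_state() {
    awk 'NR > 1 { n[$5]++ } END { for (s in n) print s, n[s] }' | sort
}

# Sample input copied from the output above.
squeue_sample='JOBID PARTITION NAME USER ST  TIME  NODES NODELIST(REASON)
12345     debug job1 dave  R   0:21     4 node[9-12]
12346     debug job2 dave PD   0:00     8 (Resources)
12348     debug job3 ed   PD   0:00     4 (Priority)'

printf '%s\n' "$squeue_sample" | count_by_state
# prints:
# PD 2
# R 1
```

On the cluster, the equivalent would be squeue | count_by_state.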

Job Priority Scheduling

As is commonly the case with HPC clusters, there are often insufficient resources to run all jobs immediately when they are submitted; as such, submitted jobs are regularly placed into the job queue. Each job’s position in the queue is determined through the Fair Tree FairShare algorithm, which depends on a number of factors, including how many jobs a user has completed in the previous week, how long the job has been in the queue, the size of the job, its time requirement, and more. Slurm updates the priority queue every 5 seconds, so a job’s priority may change over time, moving up or down.

Slurm also uses backfill scheduling to “fill in” slots when, e.g., a job completes earlier than estimated, so it is possible, especially for shorter jobs, that a job will start earlier than originally estimated. For this reason, it is critical to estimate the time required for your job as accurately as possible. You should not underestimate, since a job that exceeds its time limit is terminated, but excessive overestimation can make it appear that subsequent jobs won’t start for a long time. A good rule of thumb, when possible, is to request about 10-15% more time than you think is required.
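The rule of thumb above is simple arithmetic, which can be sketched as a small shell helper. The function name padded_time and its default of 15% are assumptions made for illustration; the output uses the days-hours:minutes:seconds form that --time accepts.

```shell
# Print a --time value that pads an estimate (in minutes) by a percentage.
# Usage: padded_time <estimated_minutes> [pad_percent]   (pad defaults to 15)
padded_time() {
    est_min=$1
    pad_pct=${2:-15}
    # Integer arithmetic, rounding up to the next whole minute.
    total=$(( (est_min * (100 + pad_pct) + 99) / 100 ))
    printf '%d-%02d:%02d:00\n' $((total / 1440)) $((total % 1440 / 60)) $((total % 60))
}

padded_time 120       # prints: 0-02:18:00
padded_time 1440 10   # prints: 1-02:24:00
```

For example, a job you expect to take two hours could be submitted with --time="$(padded_time 120)".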

Prior to submitting a job, you can check when it is estimated to be run:
sbatch --test-only myscript.sh

For a job that has already been submitted, you can check its status:
squeue --start -j <jobid>

For a list of your jobids:
squeue -u <username>

sshare

You can check your usage information, as well as your “FairShare score” with sshare; note that RawUsage corresponds to CPU seconds, and that this number will begin to decay after 7 days (meaning, it is essentially a 7-day running tally of your CPU usage):
sshare -u <username>

You can also see a formatted list of all queued jobs, sorted by current priority (which, as noted, constantly updates):
squeue -o '%.7i %.9Q %.9P %.8j %.8u %.8T %.10M %.11l %.8D %.5C %R' -S '-p' --state=pending | less

Creating and Submitting Jobs

Slurm offers two primary ways to submit jobs to the compute nodes: interactive and batch. Interactive is the simpler method, but it is more limited and is generally used only to work with software interactively. Batch is more complex and requires greater planning, but it is by far the most common use of Slurm and provides a great deal of flexibility and power. We’ll cover both options.

Interactive

The simplest way to connect to a set of resources within the compute nodes is to request an interactive shell with resources allocated to it, which can be accomplished with the salloc command. Here is a basic example:

[user@hpc ~]$ salloc -t 60 --cpus-per-task=1 --mem-per-cpu=32gb
[user@node01 ~]$

This example allocates an interactive shell session for 60 minutes (-t 60) and provides one CPU (--cpus-per-task=1) and 32gb of memory to the session (--mem-per-cpu=32gb). As the second line shows, the requested resources were allocated using node01 and the interactive session switched to that node, ready for commands. At the end of 60 minutes, the session will be terminated, demonstrating why it is important to request a suitable amount of time (if you leave off the -t flag and do not specify a time, your session will be allocated only 5 minutes).

Once your interactive session starts, you will be in your home directory and can begin performing work or using interactive software. If you wish to run software with a GUI, however, you must explicitly indicate that by adding the --x11 flag to your salloc command:

[user@hpc ~]$ salloc -t 60 --cpus-per-task=1 --mem-per-cpu=32gb --x11

salloc is extremely powerful and there are a number of other options you can leverage. One of the most useful flags is --ntasks-per-node, which allocates a specific number of computational cores to the session. This can be useful when running software that is optimized for parallel operations, such as Stata or Mathematica. For instance, the following example requests 8 cores on a single node (-N 1-1 sets both the minimum and maximum node count to one), along with 32gb of memory for the node as a whole (--mem=32gb, rather than the per-CPU --mem-per-cpu used earlier):

[user@hpc ~]$ salloc -t 60 -N 1-1 --ntasks-per-node=8 --mem=32gb
[user@node01 ~]$

Two important notes:

- The amount of a given resource (such as memory and cores) is limited to the maximum amount on the node to which you are connecting. See the section on Resource Limits below for more information.
- As a reminder, when doing work on a compute node, it is highly advisable to work with your data within /scratch rather than directly within your home directory. That is because when connected to a compute node, your home directory is mounted over the network, whereas /scratch is a physical disk local to the specific node you are on. For this reason, you should copy data into /scratch, perform work on it, then copy any final output back into your home directory. /scratch cannot be considered a permanent storage location! A program runs daily to delete any files out of /scratch older than a few days, but please clean up after yourself.

When you are finished with your session, the exit command will terminate it and return you to the login node:

[user@node01 ~]$ exit
exit
[user@hpc ~]$

Batch

The most common way to work with Slurm is to submit batch jobs and allow the scheduler to manage which resources are used, and at which times. So, what then, exactly, is a job? A job has two separate parts:

- a resource request
- a list of one or more job steps

Job steps are basically the individual commands to be run sequentially to perform the actual tasks of the job.

The best way to manage these two parts is within a single submission script that Slurm uses to allocate resources and process your job steps. Here is an extremely basic sample submission script (we’ll name it sample.sh):

#!/bin/bash
#
#SBATCH --job-name=sample
#SBATCH --output=sample_output.txt
#
#SBATCH --ntasks=1
#SBATCH --time=3:00
#SBATCH --mem-per-cpu=100mb
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=[NetID]@lafayette.edu

srun hostname
srun sleep 60

The so-called “shebang” line (#!/bin/bash) must be the very first line; it simply declares that this is a bash shell script. The second line (#) is a comment: lines that begin with a hash, other than the shebang line and #SBATCH lines, are treated as comments and ignored. Here the bare # lines merely provide visual space between the job's metadata (such as its name and output file) and the actual resource allocation requests.

Following the shebang line are any number of #SBATCH directives, which handle the resource allocation and other data (e.g., job name, output file location, and potentially many other options) associated with your job. These must all appear at the top of the file, prior to any job steps. In this sample file, the #SBATCH directives define the job as follows: we give the job a name (--job-name=sample); we provide a filename into which any output of commands, programs, scripts, etc. will be written (--output=sample_output.txt); and we request one CPU (--ntasks=1) for 3 minutes (--time=3:00) with 100mb of memory (--mem-per-cpu=100mb). Finally, we ask Slurm to email us (--mail-type=BEGIN,END,FAIL and --mail-user=[NetID]@lafayette.edu) when the job starts and ends, and also in case of failure.

Note that these are only a handful of the many #SBATCH directives you can provide; for a complete list, you can run man sbatch or visit the sbatch help page. A couple of notes:

- You should always provide a --job-name, since it makes the job much easier to track. You might consider including your NetID in the name, such as your_netid_gaussian_run or similar (do not use spaces!).
- The --output directive defaults to writing the file to whichever directory you submitted the job from, which should almost always be your home directory. You can be explicit about this by giving the full path to the file; avoid the tilde ~ shortcut here, because #SBATCH directives are not processed by the shell, so ~ is not expanded to your home directory. Also, note that this output file isn’t necessarily where output from your job steps will be, which generally will be defined within your specific code or script. Rather, this file will contain whatever output you would see had you been running the commands interactively, from a terminal (i.e., you are simply directing any output that would display on-screen to a file instead). This means that you could easily have one or more output files from the code itself, in addition to this file.

Following whichever #SBATCH directives you choose to provide, you then list one or more srun commands with the specific tasks to be executed, in order. In this case, just for a basic demonstration, the command hostname is executed on whichever node the job was assigned to, and its result will be written out to the sample_output.txt file within your home directory (because it would have displayed on the screen had hostname been typed from the command prompt). Then, we issue a sleep 60 command (which does nothing and simply waits for a minute).

Once you have a properly-formatted submission script, you must submit the job to Slurm via the sbatch command:

[user@hpc ~]$ sbatch sample.sh
Submitted batch job 72

Assuming no errors, you will be given the jobid attributed to the job. At this point it will enter the queue in the pending state; once sufficient resources are available and it has the highest priority, it will switch to the running state and begin executing. If everything goes well, it will finish and be set to the completed state, or if something went wrong, it will be set to the failed state.
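If you want to use the jobid in later commands without retyping it, you can capture it when you submit. The sketch below extracts it from the confirmation message shown above; parse_jobid is a hypothetical helper name, and the message is fed in with echo so the snippet runs without a scheduler.

```shell
# Extract the job ID from sbatch's confirmation message on stdin.
parse_jobid() {
    awk '/^Submitted batch job/ { print $4 }'
}

# Simulated here with the message from the example above.
jobid=$(echo 'Submitted batch job 72' | parse_jobid)
echo "$jobid"   # prints: 72

# On the cluster, the same idea would look like:
#   jobid=$(sbatch sample.sh | parse_jobid)
#   squeue -j "$jobid"
```

sbatch also has a --parsable option that prints just the job ID (followed by a semicolon and the cluster name on multi-cluster setups), which avoids parsing the message at all.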

You can monitor a running job with sstat -j <jobid> (replacing <jobid> with the ID that sbatch printed when you submitted the job); note that sstat reports only on jobs that are currently running. By default, this will generate quite a bit of output, most of which probably isn’t that helpful. You can control what is shown with the --format parameter. For more information, run man sstat or view the sstat help page.

Upon completion of this sample script, there will be a file called sample_output.txt (which you can view with cat, less, vi, or any other text display or editing program) with the output of the hostname command, and in addition an email will be sent notifying you that the job ended:

[user@hpc ~]$ cat sample_output.txt
node01.cluster
[user@hpc ~]$

This simple example illustrates a serial job, which runs on a single CPU on a single node. It does not take advantage of multi-core CPUs or span multiple compute nodes, both of which are possible within a cluster. For some workflows, this is sufficient (or may even be necessary). The following section explains how to create parallel jobs.

Parallel Batch

While serial or interactive jobs are incredibly useful, one of the greatest advantages that working on a cluster provides is relatively simple management of parallel tasks. You can, with a comparatively simple script, manage thousands of runs of a program over several hours or days, or run the same program, using a different input file each time, automatically. Many types of parallel jobs are possible, such as parameter sweeps, MPI, and OpenMP. Please note that the examples below are, indeed, merely examples, and are meant to be as basic as possible to demonstrate what is possible. Actual scripts could be much more complex, depending on a particular use case.

You can use the --array flag in the submission script to generate a job array and to populate a special variable ($SLURM_ARRAY_TASK_ID) to pass an integer to your program, which might control various options:

#!/bin/bash
#
#SBATCH --job-name=param_sweep
# %A is the array's job ID and %a the task index, so each task writes its own file
#SBATCH --output=param_sweep_%A_%a.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100mb
#
#SBATCH --array=1-8

srun ./my_script $SLURM_ARRAY_TASK_ID

This would run eight distinct jobs: my_script 1, my_script 2, etc., but note that they could execute in any order (e.g., 2,3,1,4,5,6,8,7).

This approach can also be used to process multiple data files:

#!/bin/bash
#
#SBATCH --job-name=param_sweep
# %A is the array's job ID and %a the task index, so each task writes its own file
#SBATCH --output=param_sweep_%A_%a.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100mb
#
#SBATCH --array=0-7

FILES=(/path/to/data/*)

# Quote the expansion so that file names containing spaces are passed intact
srun ./my_script "${FILES[$SLURM_ARRAY_TASK_ID]}"

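The FILES= line creates a bash array of all files in the given directory, in sorted order. Each array task then uses its own value of $SLURM_ARRAY_TASK_ID as an index into that array, so each file is passed to my_script by exactly one task. Note that bash arrays are indexed from zero, which is why the range is --array=0-7; this example assumes the directory contains exactly eight files.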
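Because the array range has to match the number of files, it can help to compute the range rather than hard-code it. The following is a minimal sketch; array_range_for_dir is a hypothetical helper name, and the temporary directory with three empty files exists only so the snippet can be run anywhere.

```shell
# Print a zero-based --array range ("0-N") covering every entry in a directory.
# Returns non-zero if the directory is empty or does not exist.
array_range_for_dir() {
    count=0
    for f in "$1"/*; do
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    [ "$count" -gt 0 ] || return 1
    printf '0-%d\n' "$((count - 1))"
}

# Demonstration with a throwaway directory holding three files.
demo_dir=$(mktemp -d)
touch "$demo_dir/a.csv" "$demo_dir/b.csv" "$demo_dir/c.csv"
array_range_for_dir "$demo_dir"   # prints: 0-2
```

Options given on the sbatch command line take precedence over #SBATCH directives in the script, so on the cluster you could submit with sbatch --array="$(array_range_for_dir /path/to/data)" followed by the name of your submission script.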

A similar approach allows you to pass non-integer values to your program:

#!/bin/bash
#
#SBATCH --job-name=param_sweep
# %A is the array's job ID and %a the task index, so each task writes its own file
#SBATCH --output=param_sweep_%A_%a.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100mb
#
#SBATCH --array=0-2

ARGS=(0.05 0.76 1.28)
srun ./my_script ${ARGS[$SLURM_ARRAY_TASK_ID]}

Non-numeric arguments can also be passed simply by populating them within the ARGS array: ARGS=("red" "blue" "green"). In addition, it is possible to use non-sequential integers: --array=0,3,4,9-22 or --array=0-12:4 (equivalent to --array=0,4,8,12).
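To check which task IDs a given --array value will produce before submitting, the expansion can be sketched in plain shell. This is an illustration only, not how Slurm itself parses the option: expand_array_spec is a hypothetical helper that handles just the comma, range, and :step forms described above.

```shell
# Expand an --array specification such as "0,3,4,9-11" or "0-12:4"
# into a space-separated list of task IDs.
expand_array_spec() {
    spec=$1
    out=""
    old_ifs=$IFS
    IFS=,
    for part in $spec; do
        step=1
        # Split off an optional ":step" suffix.
        case $part in
            *:*) step=${part##*:}; part=${part%%:*} ;;
        esac
        # A part is either a single value or a "lo-hi" range.
        case $part in
            *-*) lo=${part%%-*}; hi=${part##*-} ;;
            *)   lo=$part; hi=$part ;;
        esac
        i=$lo
        while [ "$i" -le "$hi" ]; do
            out="$out${out:+ }$i"
            i=$((i + step))
        done
    done
    IFS=$old_ifs
    printf '%s\n' "$out"
}

expand_array_spec 0-12:4       # prints: 0 4 8 12
expand_array_spec 0,3,4,9-11   # prints: 0 3 4 9 10 11
```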

If the running time of an individual job is about 10 minutes or less, however, using a job array may introduce unnecessary overhead; instead, you can loop through files manually:

#!/bin/bash
#
#SBATCH --ntasks=8
for file in /path/to/data/*
do
   # Quote "$file" so that file names containing spaces are passed intact
   srun -n1 --exclusive ./my_script "$file" &
done
wait

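This will loop through all files within a given directory, processing up to eight at a time. The trailing & launches each srun step in the background, and the final wait keeps the script alive until every step has finished; because the job requests eight tasks (--ntasks=8), at most eight steps run at once and the rest wait their turn. A variant allows you to send in, e.g., integers to your program, again eight at a time: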

#!/bin/bash
#
#SBATCH --ntasks=8
for i in {1..1000}
do
   srun -N1 -n1 -c1 --exclusive ./my_script $i &
done
wait

Slurm Job Submission Script Template

The following sample batch submission script is a template that can be easily modified for most jobs. Please email the Help Desk at help@lafayette.edu and ask to be connected to the High Performance Computing team for additional assistance.

#!/bin/bash
#
#SBATCH --partition=node           # Partition (job queue)
#SBATCH --requeue                  # Return job to the queue if preempted
#SBATCH --job-name=SAMPLE          # Assign a short name to your job
#SBATCH --nodes=1                  # Number of nodes you require
#SBATCH --cpus-per-task=1          # Cores per task (>1 if multithread tasks)
#SBATCH --mem-per-cpu=16gb         # Real memory per cpu
#SBATCH --time=00-01:00:00         # Total run time limit (DD-HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out   # STDOUT file for SLURM output
#SBATCH --mail-type=BEGIN,END,FAIL # Email on job start, end, and in case of failure
#SBATCH --mail-user=NetID@lafayette.edu

## Move files to /scratch so we can run the job there
mkdir -p /scratch/$USER/$SLURM_JOB_NAME-$SLURM_JOB_ID
cp -r \
data_file_1 \
my_script \
/scratch/$USER/$SLURM_JOB_NAME-$SLURM_JOB_ID

## Run the job
cd /scratch/$USER/$SLURM_JOB_NAME-$SLURM_JOB_ID
srun my_script data_file_1

## Move outputs to /home and clean-up
cp -pru /scratch/$USER/$SLURM_JOB_NAME-$SLURM_JOB_ID/* $SLURM_SUBMIT_DIR
cd
rm -rf /scratch/$USER/$SLURM_JOB_NAME-$SLURM_JOB_ID

sleep 10

Resource Limits

When submitting a batch job or initiating an interactive session via Slurm, it is critical to request sufficient – though not excessive – resources, which typically are memory and cores. Each compute node has hard limits:

node[01-03]
- Memory: 192GB per node
- Cores: Two processors with 20 cores each, for a total of 40 cores per node
node[04-06]
- Memory: 192GB per node
- Cores: Two processors with 26 cores each, for a total of 52 cores per node
himem01
- Memory: 768GB
- Cores: Two processors with 18 cores each, for a total of 36 cores
gpu01
- Memory: 192GB
- Cores: Two processors with 16 cores each, for a total of 32 cores
- GPUs: Two NVIDIA Quadro RTX 8000 (single-precision), each with 48GB of memory and:
  - 4,608 CUDA Parallel-Processing cores
  - 576 NVIDIA Tensor cores
  - 72 NVIDIA RT cores
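As a rough way to see which node types can satisfy a request, the limits above can be put into a small lookup. This is a sketch only: the table simply restates the core and memory figures listed above, nodes_that_fit is a hypothetical helper name, and the memory Slurm actually makes available on a node may be somewhat lower than the physical total.

```shell
# Columns: node name(s), cores per node, memory per node in GB (from the list above).
node_table='node[01-03] 40 192
node[04-06] 52 192
himem01 36 768
gpu01 32 192'

# Print the node types that have at least the requested cores and memory.
# Usage: nodes_that_fit <cores> <memory_gb>
nodes_that_fit() {
    printf '%s\n' "$node_table" |
        awk -v c="$1" -v m="$2" '$2 >= c && $3 >= m { print $1 }'
}

nodes_that_fit 8 217   # prints: himem01
nodes_that_fit 48 64   # prints: node[04-06]
```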

Practically speaking, you should profile your code and have some idea of the amount of resources required. Slurm will immediately kill your job or interactive session if you consume resources beyond those requested. One way to estimate how much memory will be required is to run a single job, and then query Slurm once the job completes to see how much memory was actually consumed:

[user@hpc ~]$ sacct -j 128 --units=G --format=JobID,JobName,MaxRSS,Elapsed
       JobID    JobName     MaxRSS    Elapsed
------------ ---------- ---------- ----------
128                DIME              01:30:19
128.batch         batch      0.01G   01:30:19
128.extern       extern      0.00G   01:30:31
128.0           Rscript      0.00G   00:00:10
128.1           Rscript     94.11G   01:23:49

This job's peak memory use was just over 94GB (the MaxRSS of step 128.1). If you ultimately want to run two such tasks in parallel, they would need roughly 188GB between them before allowing any headroom, so you might request, e.g., 256GB of memory in total. That exceeds the 192GB available on the regular compute nodes, so you would need to direct the job to the high-memory node.
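The estimate above can be scripted so that it is easy to repeat for other jobs. The sketch below assumes sacct was run with --units=G and the same four columns as above; peak_rss_gb and mem_request_gb are hypothetical helper names, and the 15% headroom is an assumption chosen for illustration, not a site policy.

```shell
# Print the largest MaxRSS value (in GB) from sacct output on stdin.
# Assumes --units=G and MaxRSS in the third column; rows with no MaxRSS are skipped.
peak_rss_gb() {
    awk 'NR > 2 && $3 ~ /G$/ { v = $3 + 0; if (v > max) max = v } END { print max + 0 }'
}

# Print a memory request in whole GB for N parallel tasks, with 15% headroom.
# Usage: mem_request_gb <peak_gb_per_task> <tasks>
mem_request_gb() {
    awk -v p="$1" -v n="$2" 'BEGIN { x = p * n * 1.15; r = int(x); if (r < x) r++; print r }'
}

# Sample input copied from the output above.
sacct_sample='       JobID    JobName     MaxRSS    Elapsed
------------ ---------- ---------- ----------
128                DIME              01:30:19
128.batch         batch      0.01G   01:30:19
128.extern       extern      0.00G   01:30:31
128.0           Rscript      0.00G   00:00:10
128.1           Rscript     94.11G   01:23:49'

peak=$(printf '%s\n' "$sacct_sample" | peak_rss_gb)
echo "$peak"               # prints: 94.11
mem_request_gb "$peak" 2   # prints: 217
```

With these assumptions, two parallel tasks come to 217GB, which is more than a regular compute node's 192GB.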

Processor cores are somewhat easier to manage. You should not request more cores than a single node contains (unless you are running, e.g., MPI or other code designed to span multiple nodes). Beyond that, however, provided you set the --cpus-per-task=1 flag, Slurm will allocate as many cores as it needs (and as are available). Standard serial jobs generally won’t use more than a single core unless the program is designed to take advantage of multiple cores (such as Stata). You can use certain other flags, such as --ntasks-per-node or --ntasks, to specify a core count, but in many cases, such as some of the batch processing examples above, Slurm will allocate cores automatically. It all depends on the type of job(s) you want to run.

If you have any questions, please contact the Help Desk at help@lafayette.edu and ask to be connected to the High Performance Computing team.
