Resource sharing and allocations on the cluster are handled by a combination of a resource manager (which tracks which computational resources are available on which nodes) and a job scheduler (which determines when, and to which available resources, a particular job should be submitted, and then monitors it). To accomplish both tasks, we use Slurm, one of the most popular solutions in high-performance computing and supercomputing.
There are two primary reasons to use Slurm. First, other than for basic, short testing, no “real” work should be performed on the login node, which has several responsibilities such as managing users, handling logins, monitoring the other nodes, etc. For that reason, nearly all work should be performed on the compute nodes, and Slurm acts as the “gateway” into those systems. Second, because Slurm keeps track of which resources are available on the compute nodes, it is able to allocate the most efficient set of them for your tasks, as quickly as possible.
Slurm is a powerful and flexible program, and as such it is beyond the scope of this document to provide an exhaustive tutorial. Rather, the examples provided here should be sufficient to get started, and a wide array of online resources is available for further guidance. (This page itself is modeled after the excellent CÉCI Slurm tutorial.)
All of the examples below assume that you are already logged into the cluster.
Slurm offers a variety of commands to query the nodes. Most of them will not be strictly necessary for you to run, but they can provide a snapshot of the overall computational ecosystem, list jobs in process or that are queued up, and more.
The sinfo command lists available partitions and some basic information about each. A partition is a logical grouping of physical compute nodes. Typical examples of partitions might be those dedicated to general purpose or GPU computing, debugging, post processing, or visualization. Running sinfo produces output similar to this:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all       up    infinite      4 mix   himem01,node[01,04,06]
all       up    infinite      3 alloc node[01-06]
node*     up    infinite      3 mix   node[01,04,06]
himem     up    infinite      1 mix   himem01
gpu       up    infinite      1 idle  gpu01
The output shows that we have three distinct partitions: node (comprising nodes 01-06), himem (with one node), and gpu (with one node). In addition, there is a fourth partition, all, which as the name suggests comprises all the nodes and provides a means for directing workloads to all compute resources, leaving Slurm to determine where they'll be run.
Note that the asterisk identifies the default partition, to which jobs are submitted if no partition is specified.
The squeue command displays a list of jobs that are currently running (denoted with R) or that are pending (denoted with PD). Here is example output:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
12345     debug job1 dave  R 0:21     4 node[9-12]
12346     debug job2 dave PD 0:00     8 (Resources)
12348     debug job3   ed PD 0:00     4 (Priority)
In this example, job 12345 is running on nodes 9-12 within the debug partition, job 12346 is pending because the requested resources are unavailable, and job 12348 is pending because it has a lower priority than currently running jobs. The other columns are largely self-explanatory, though note that TIME is the time a given job has been running so far. The squeue help page describes many other options available to control what information is displayed and how it is formatted.
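For example, you can narrow the output to your own jobs and choose a custom set of columns with the -u and --format options (the particular columns below are only an illustration; adjust them to taste):

squeue -u <username> --format="%.10i %.9P %.20j %.8u %.2t %.10M %.6D %R"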
As is commonly the case with HPC clusters, there are often insufficient resources to run all jobs immediately when they are submitted; as such, submitted jobs are regularly placed into the job queue. Each job's position in the queue is determined through the Fair Tree fairshare algorithm, which depends on a number of factors, including how many completed jobs a user has had in the previous week, how long the job has been in the queue, the size of the job, the time requirement, and more. Slurm updates the priority queue every 5 seconds, so a job's priority may change over time, moving up or down.
Slurm also uses backfill scheduling to “fill in” slots when, e.g., a job completes earlier than estimated, so it is possible, especially for shorter jobs, that a job may be run prior to when it was estimated to do so. For this reason, it is critical to estimate the time required for your job as accurately as possible. While you should not underestimate, excessive overestimation can make it appear that subsequent jobs won’t start for a long time. A good rule of thumb, when possible, is to request about 10-15% more time than you think is required.
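For example, if previous runs of a similar job have finished in roughly 8 hours, a reasonable request in a submission script might be about 9 hours (the figures here are purely illustrative):

#SBATCH --time=09:00:00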
Prior to submitting a job, you can check when it is estimated to be run:
sbatch --test-only myscript.sh
For a job that has already been submitted, you can check its estimated start time:
squeue --start -j <jobid>
For a list of your jobids:
squeue -u <username>
You can check your usage information, as well as your “FairShare score,” with sshare; note that RawUsage corresponds to CPU seconds, and that this number begins to decay after 7 days (meaning it is essentially a 7-day running tally of your CPU usage):
sshare -u <username>
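If you prefer a narrower view, sshare also accepts a --format (-o) option listing just the fields you care about; a minimal sketch:

sshare -u <username> -o Account,User,RawUsage,FairShare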
You can also see a formatted list of all queued jobs, sorted by current priority (which, as noted, constantly updates):
squeue -o '%.7i %.9Q %.9P %.8j %.8u %.8T %.10M %.11l %.8D %.5C %R' -S '-p' --state=pending | less
Slurm offers two primary ways to submit jobs to the compute nodes: interactive and batch. Interactive is the simpler method, but its usefulness is somewhat limited; it is generally used for working with software interactively. Batch is more complex and requires greater planning, but it is by far the most common way to use Slurm and provides a great deal of flexibility and power. We'll cover both options.
The simplest way to connect to a set of resources on the compute nodes is to request an interactive shell with resources allocated to it, which can be accomplished with the salloc command. Here is a basic example:
[user@hpc ~]$ salloc -t 60 --cpus-per-task=1 --mem-per-cpu=32gb
[user@node01 ~]$
This example allocates an interactive shell session for 60 minutes (-t 60) and provides one CPU (--cpus-per-task=1) and 32gb of memory (--mem-per-cpu=32gb) to the session. As the second line shows, the requested resources were allocated on node01 and the interactive session switched to that node, ready for commands. At the end of 60 minutes the session will be terminated, which is why it is important to request a suitable amount of time (if you leave off the -t flag and do not specify a time, your session will be allocated only 5 minutes).
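The -t (or --time) flag accepts several formats: a bare number is interpreted as minutes, while longer forms specify hours or days. A few illustrative variations of the command above:

salloc -t 90 --cpus-per-task=1 --mem-per-cpu=32gb          # 90 minutes
salloc -t 2:00:00 --cpus-per-task=1 --mem-per-cpu=32gb     # 2 hours
salloc -t 1-12:00:00 --cpus-per-task=1 --mem-per-cpu=32gb  # 1 day, 12 hours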
Once your interactive session starts, you will be in your home directory and can begin performing work or using interactive software. But if you wish to run software with a GUI, you must explicitly indicate that by adding the --x11 flag to your salloc command:
[user@hpc ~]$ salloc -t 60 --cpus-per-task=1 --mem-per-cpu=32gb --x11
salloc is extremely powerful, and there are a number of other options you can leverage. One of the most useful flags is --ntasks-per-node, which will allocate a specific number of computational cores to the session. This can be useful when running software that is optimized for parallel operations, such as Stata or Mathematica. For instance, the following example modifies the previous command to also request 8 cores:
[user@hpc ~]$ salloc -t 60 -N 1-1 --ntasks-per-node=8 --mem=32gb
[user@node01 ~]$
When you are finished with your session, the exit command will terminate it and return you to the login node:
[user@node01 ~]$ exit
exit
[user@hpc ~]$
The most common way to work with Slurm is to submit batch jobs and allow the scheduler to manage which resources are used, and at which times. So what, then, exactly is a job? A job has two separate parts: resource requests and job steps. Resource requests describe the computing resources the job needs (CPUs, memory, run time, and so on), while job steps are the individual commands to be run sequentially to perform the actual tasks of the job.
The best way to manage these two parts is within a single submission script that Slurm uses to allocate resources and process your job steps. Here is an extremely basic sample submission script (we'll name it sample.sh):
#!/bin/bash
#
#SBATCH --job-name=sample
#SBATCH --output=sample_output.txt
#
#SBATCH --ntasks=1
#SBATCH --time=3:00
#SBATCH --mem-per-cpu=100mb
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=[NetID]@lafayette.edu

srun hostname
srun sleep 60
The so-called “shebang” line (#!/bin/bash) must be the very first line, and simply indicates that this is a bash shell script. The second line (#) is a comment; lines that begin with a hash – other than the very top shebang line and #SBATCH lines – are treated as comments and ignored (in this case, they are merely used to provide visual space between the metadata associated with the job, such as the title and output file, and the actual resource allocation requests). Following the shebang line are any number of #SBATCH directives, which handle the resource allocation and other data (e.g., job name, output file location, and potentially many other options) associated with your job. These must all appear at the top of the file, prior to any job steps. In this sample file, multiple #SBATCH directives define the job: we give the job a name (--job-name=sample), we provide a filename into which any output of commands, programs, scripts, etc. will be written (--output=sample_output.txt), and then we request one CPU (--ntasks=1) for 3 minutes (--time=3:00) and 100mb of memory (--mem-per-cpu=100mb). Finally, we ask Slurm to send us an email (--mail-type=BEGIN,END,FAIL and --mail-user=[NetID]@lafayette.edu) when the job starts and ends, and also in case of failure.
Note that these are only a handful of the many #SBATCH directives you can provide; for a complete list, you can run man sbatch or visit the sbatch help page. A couple of notes:
- It is recommended that you always provide a descriptive --job-name, since it makes the job much easier to track. You might consider including your NetID in the name, such as your_netid_gaussian_run or similar (do not use spaces!).
- The --output directive will default to writing the file to whichever directory you submitted the job from, which should almost always be your home directory (you can be explicit about this by saying, e.g., --output=~/sample_output.txt, since the tilde ~ is a shortcut to your home directory). Also, note that this output file isn't necessarily where output from your job steps will end up; that generally will be defined within your specific code or script. Rather, this file will contain whatever output you would have seen had you been running the commands interactively from a terminal (i.e., you are simply directing any output that would display on-screen to a file instead). This means that you could easily have one or more output files from the code itself, in addition to this file (see also the note following this list).
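One further note on --output: if you expect to submit the same script repeatedly, you can include the job ID in the filename so that successive runs do not overwrite one another; Slurm replaces %j with the job ID (the filename below is just an example):

#SBATCH --output=sample_output_%j.txt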
Following whichever #SBATCH directives you choose to provide, you then list one or more srun commands with the specific tasks to be executed, in order. In this case, just for a basic demonstration, the command hostname is executed on whichever node the job was assigned to, and its result will be written out to the sample_output.txt file within your home directory (because it would have displayed on the screen had hostname been typed from the command prompt). Then we issue a sleep 60 command (which does nothing and simply waits for a minute).
Once you have a properly-formatted submission script, you must submit the job to Slurm via the sbatch command:
[user@hpc ~]$ sbatch sample.sh
Submitted batch job 72
Assuming no errors, you will be given the jobid attributed to the job. At this point it will enter the queue in the pending state; once sufficient resources are available and it has the highest priority, it will switch to the running state and begin executing. If everything goes well, it will finish and be set to the completed state, or if something went wrong, it will be set to the failed state.
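Once the job has finished (successfully or not), you can confirm its final state with sacct, which is discussed further below; a minimal sketch:

sacct -j <jobid> --format=JobID,JobName,State,ExitCode,Elapsed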
You can monitor the state of your job by running sstat -j <jobid> (replacing <jobid> with whatever was output when you submitted the job via sbatch). Note that by default this will generate quite a bit of output, most of which probably isn't that helpful. You can control what is shown with the --format parameter. For more information, run man sstat or view the sstat help page.
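As a starting point, the following sketch restricts sstat to a few commonly useful fields (adjust the field list to suit your job):

sstat -j <jobid> --format=JobID,AveCPU,MaxRSS,MaxVMSize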
Upon completion of this sample script, there will be a file called sample_output.txt (which you can view with cat, less, vi, or any other text display or editing program) with the output of the hostname command, and in addition an email will be sent notifying you that the job ended:
[user@hpc ~]$ cat sample_output.txt
node01.cluster
[user@hpc ~]$
This simple example illustrates a serial job, which runs on a single CPU on a single node. It does not take advantage of multi-core CPUs or span multiple compute nodes, both of which are scenarios that may be leveraged within a cluster. For some workflows, this is sufficient (or may even be necessary). The following section explains how to create parallel jobs.
While serial or interactive jobs are incredibly useful, one of the greatest advantages that working on a cluster provides is relatively simple management of parallel tasks. You can, with a comparatively simple script, manage thousands of runs of a program over several hours or days, or run the same program, using a different input file each time, automatically. Many types of parallel jobs are possible, such as parameter sweeps, MPI, and OpenMP. Please note that the examples below are, indeed, merely examples, and are meant to be as basic as possible to demonstrate what is possible. Actual scripts could be much more complex, depending on a particular use case.
You can use the --array flag in the submission script to generate a job array and to populate a special variable ($SLURM_ARRAY_TASK_ID) to pass an integer to your program, which might control various options:
#!/bin/bash
#
#SBATCH --job-name=param_sweep
#SBATCH --output=param_sweep_output.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100mb
#
#SBATCH --array=1-8

srun ./my_script $SLURM_ARRAY_TASK_ID
This would run eight distinct jobs: my_script 1, my_script 2, etc., but note that they could execute in any order (e.g., 2, 3, 1, 4, 5, 6, 8, 7).
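Also note that each array task is a separate job, so with a single --output file the tasks may overwrite one another's output. One common approach (the filename is just an example) is to include the array job ID (%A) and task ID (%a) in the filename:

#SBATCH --output=param_sweep_%A_%a.txt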
This approach can also be used to process multiple data files:
#!/bin/bash
#
#SBATCH --job-name=param_sweep
#SBATCH --output=param_sweep_output.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100mb
#
#SBATCH --array=0-7

FILES=(/path/to/data/*)

srun ./my_script ${FILES[$SLURM_ARRAY_TASK_ID]}
The FILES= line creates an array of all files in a particular directory; each array task then passes one of them to my_script, using the current integer value of $SLURM_ARRAY_TASK_ID as the index.
A similar approach allows you to pass non-integer values to your program:
#!/bin/bash
#
#SBATCH --job-name=param_sweep
#SBATCH --output=param_sweep_output.txt
#
#SBATCH --ntasks=1
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=100mb
#
#SBATCH --array=0-2

ARGS=(0.05 0.76 1.28)

srun ./my_script ${ARGS[$SLURM_ARRAY_TASK_ID]}
Non-numeric arguments can also be passed simply by populating them within the ARGS array: ARGS=("red" "blue" "green"). In addition, it is possible to use non-sequential integers: --array=0,3,4,9-22 or --array=0-12:4 (equivalent to --array=0,4,8,12).
If the running time of an individual job is about 10 minutes or less, however, using a job array may introduce unnecessary overhead; instead, you can loop through files manually:
#!/bin/bash
#
#SBATCH --ntasks=8

for file in /path/to/data/*
do
    srun -n1 --exclusive ./my_script $file &
done
wait
This will loop through all files within a given directory, processing up to eight at a time. A variant allows you to send in, e.g., integers to your program, again 8 at a time:
#!/bin/bash
#
#SBATCH --ntasks=8

for i in {1..1000}
do
    srun -N1 -n1 -c1 --exclusive ./my_script $i &
done
wait
The following sample batch submission script is a template that can be easily modified for most jobs. Please email the Help Desk at help@lafayette.edu and ask to be connected to the High Performance Computing team for additional assistance.
#!/bin/bash
#
#SBATCH --partition=node                  # Partition (job queue)
#SBATCH --requeue                         # Return job to the queue if preempted
#SBATCH --job-name=SAMPLE                 # Assign a short name to your job
#SBATCH --nodes=1                         # Number of nodes you require
#SBATCH --cpus-per-task=1                 # Cores per task (>1 if multithreaded tasks)
#SBATCH --mem-per-cpu=16gb                # Real memory per CPU
#SBATCH --time=00-01:00:00                # Total run time limit (DD-HH:MM:SS)
#SBATCH --output=slurm.%N.%j.out          # STDOUT file for Slurm output
#SBATCH --mail-type=BEGIN,END,FAIL        # Email on job start, end, and in case of failure
#SBATCH --mail-user=NetID@lafayette.edu

## Move files to /scratch so we can run the job there
mkdir -p /scratch/$USER/$SLURM_JOB_NAME-$SLURM_JOB_ID
cp -r \
    data_file_1 \
    my_script \
    /scratch/$USER/$SLURM_JOB_NAME-$SLURM_JOB_ID

## Run the job
cd /scratch/$USER/$SLURM_JOB_NAME-$SLURM_JOB_ID
srun my_script data_file_1

## Move outputs to /home and clean up
cp -pru /scratch/$USER/$SLURM_JOB_NAME-$SLURM_JOB_ID/* $SLURM_SUBMIT_DIR
cd
rm -rf /scratch/$USER/$SLURM_JOB_NAME-$SLURM_JOB_ID
sleep 10
When submitting a batch job or initiating an interactive session via Slurm, it is critical to request sufficient – though not excessive – resources, which typically means memory and cores; each compute node has hard limits on both.
Practically speaking, you should profile your code and have some idea of the amount of resources required. Slurm will immediately kill your job or interactive session if you consume resources beyond those requested. One way to estimate how much memory will be required is to run a single job, and then query Slurm once the job completes to see how much memory was actually consumed:
[root@hpc ~]# sacct -j 128 --units=G --format=JobID,JobName,MaxRSS,Elapsed
       JobID    JobName     MaxRSS    Elapsed
------------ ---------- ---------- ----------
128                DIME              01:30:19
128.batch         batch      0.01G   01:30:19
128.extern       extern      0.00G   01:30:31
128.0           Rscript      0.00G   00:00:10
128.1           Rscript     94.11G   01:23:49
This job required just over 94GB, so if you ultimately wanted to run two such tasks in parallel, you might request, e.g., 256GB of memory in total (and would therefore need to target the high-memory node rather than the regular compute nodes, which do not have sufficient memory to execute the combined tasks).
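In the submission script itself, memory is requested with flags such as --mem (per node) or --mem-per-cpu; for instance, a sketch targeting the high-memory partition from the sinfo example above with 100gb for a single task (the numbers are illustrative):

#SBATCH --partition=himem
#SBATCH --ntasks=1
#SBATCH --mem=100gb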
Processor cores are somewhat easier to manage. You should not request more cores than a single node contains (unless you are running, e.g., MPI or other code designed to span multiple nodes). Beyond that, provided you set the --cpus-per-task=1 flag, Slurm will allocate as many cores as it needs (and as are available). For standard serial jobs you generally won't use more than a single core, unless you are using a program designed to take advantage of multiple cores (such as Stata). You can specify certain other flags, such as --ntasks-per-node or --ntasks, to set a core count, but in many cases, such as some of the batch processing examples above, Slurm will automatically allocate cores. It all depends on the type of job(s) you want to run.
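As a concrete sketch, a multithreaded (but single-node) program such as Stata might be given one task with several cores; the counts below are illustrative and should stay within a single node's limits:

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8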
If you have any questions, please contact the Help Desk at help@lafayette.edu and ask to be connected to the High Performance Computing team.