Creating Jobs with Slurm: how-to and automation examples
Among the software running inside an HPC cluster, the computational resource manager plays a major role. It is responsible for harmonizing the execution of all users' Jobs, ensuring that everyone has access to the resources they require. This task is rather difficult, given the great variety of requests that can be made and the complexity of the management rules imposed by the administrators. Not to mention that each HPC cluster is in fact a small work of art, created by a combination of hardware and software solutions masterfully (or magically) orchestrated together to ensure maximum performance and reliability. All these complexities are reflected in the resource managers, making them some of the most obnoxious and difficult software to use… but indispensable for a productive daily routine.
In this article, I will do my best to reconcile the relationship between users of HPC clusters and Slurm, one of the most popular open-source resource managers worldwide. Specifically, I will suggest some guidelines for creating and managing Jobs.
How to create a non-interactive Job
For Slurm, as for many other tools of this kind, Jobs can be divided into two macro-groups: interactive and non-interactive. The difference is quite obvious: the former require user input to proceed, while the latter are totally automated. Usually, administrators recommend the latter because they allow a more efficient use of computing resources. However, templates for these fully automated “recipes” are seldom available, so they must be written by hand. This task can be greatly simplified using an interactive Job and a text editor. How?
First of all, I suggest you start one of these Jobs, using the following command:
srun --partition=<QUEUE> --nodes=<NUM SERVER> --ntasks=<NUM CORE> --mem=<MB RAM> --time=<HH:MM:SS> --pty /bin/bash
Please note that the parts in angle brackets should be replaced with the specific values of the cluster you are using.
Optionally, you can add --gres=gpu:<TYPE>:<NUM>, if your Job needs GPUs.
Once a prompt is displayed, the Job will be active and the first commands can be executed as in a classic Linux terminal. Before proceeding, however, I advise you to create a new file with your trusted editor to start writing your automated “recipe”. This file should start with the following lines:
#!/bin/bash
#SBATCH --partition=<QUEUE>
#SBATCH --nodes=<NUM SERVER>
#SBATCH --ntasks=<NUM CORE>
#SBATCH --mem=<MB RAM>
#SBATCH --time=<HH:MM:SS>
The first of these lines contains the interpreter that Linux will use to execute the instructions below it. Then you have the list of all the options you used to submit the interactive Job (with the exception of --pty), each preceded by #SBATCH. This syntax allows Slurm to override its default values, sparing you the burden of rewriting them when submitting the non-interactive Job.
Once the preamble of the “recipe” is complete, you can proceed with executing the commands within the interactive Job, as reported in your application’s documentation, taking care to transcribe them into the text file. Once the transcription is complete, the non-interactive Job is ready to be submitted with the sbatch command:
sbatch <FILE>
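To make this concrete, here is a minimal sketch of what a finished “recipe” might look like. The partition name, resource values, module name and application are all placeholders or hypothetical examples that you must adapt to your cluster:

```shell
#!/bin/bash
#SBATCH --partition=<QUEUE>
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=8000
#SBATCH --time=01:00:00

# Start from the directory the Job was submitted from
cd "$SLURM_SUBMIT_DIR"

# Load the software environment (hypothetical module name)
module load my-application

# Run the application on the allocated cores
srun my-application input.dat > output.log
```

Save it as, say, recipe.sh and submit it with sbatch recipe.sh.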
You may encounter instructions that cannot be automated with the previous method. One of them is editing the input files. This first obstacle can be overcome with the following script:
mv <FILE> <FILE>.bk
head -n <NUM LINES> <FILE>.bk >> <FILE>
cat >> <FILE> << EOF
<MODIFIED LINES>
EOF
tail -n <NUM LINES> <FILE>.bk >> <FILE>
In the first line, the file to be modified is renamed to <FILE>.bk; then the head command copies its first <NUM LINES> lines into a new file (named <FILE>, recreating the original name). Between the cat command and the closing EOF you insert a contiguous block of text containing all the lines you have modified. Finally, the tail command copies the last <NUM LINES> lines of <FILE>.bk to <FILE>.
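As a self-contained sketch, the following applies the technique to a throwaway five-line file, replacing only its third line (the file name and contents are invented for the example):

```shell
# Build a throwaway 5-line file to edit
printf 'one\ntwo\nthree\nfour\nfive\n' > params.txt

# Keep a backup, then rebuild the file around the changed line
mv params.txt params.txt.bk
head -n 2 params.txt.bk >> params.txt   # first 2 lines, unchanged
cat >> params.txt << EOF
THREE-MODIFIED
EOF
tail -n 2 params.txt.bk >> params.txt   # last 2 lines, unchanged
```

After running it, params.txt is identical to the original except for line 3.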
Another case in which you may find some difficulty when automating your Job concerns commands that present interactive menus. To satisfy these prompts, simply enter all the responses (one per line) in a file (<INPUT_FILE> in this example) and use that file as the input of the original command, as shown below.
<CMD> < <INPUT_FILE>
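The sketch below demonstrates the idea with a small stand-in script (menu.sh and its two questions are hypothetical, invented purely for the example) so you can see the redirection at work:

```shell
# A stand-in for an interactive menu: asks two questions, prints a summary
cat > menu.sh << 'SCRIPT'
#!/bin/bash
read -p "Grid size? " size
read -p "Iterations? " iters
echo "size=$size iters=$iters"
SCRIPT
chmod +x menu.sh

# Pre-recorded answers, one per line
printf '64\n1000\n' > answers.txt

# Run the "menu" non-interactively by redirecting the answers file
./menu.sh < answers.txt
```

Since stdin is not a terminal, bash skips the prompts and the script consumes the answers in order.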
Before closing the text editor, I recommend that you use the variables provided by Slurm in your recipe, especially when referring to the number of servers, cores, GPUs and RAM. In this way, your Job will be able to adapt to whatever combination of resources you request in the future. These variables are listed in the “OUTPUT ENVIRONMENT VARIABLES” section of the sbatch man page.
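For instance, a recipe can read the granted task and node counts from Slurm’s output environment variables instead of hard-coding them. A minimal sketch (the :-1 fallbacks are only there so the snippet also runs outside a Job; the launcher line is a hypothetical example):

```shell
# Pick up the resources Slurm actually granted; fall back to 1 when
# running outside a Job (e.g. while testing the script locally).
NTASKS="${SLURM_NTASKS:-1}"
NNODES="${SLURM_JOB_NUM_NODES:-1}"
echo "Running with $NTASKS tasks on $NNODES node(s)"

# e.g. pass the task count to a parallel launcher (hypothetical app):
# mpirun -np "$NTASKS" ./my-application
```

Written this way, the same recipe works unchanged whether you ask for 4 cores or 400.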
The production of scientifically relevant data usually goes through many Jobs, more commonly described as a workflow. Slurm offers several strategies for automating these protocols, freeing the production and refinement of data from human intervention. Below I list the techniques that are, in my opinion, the most popular:
- Jobs that submit other Jobs
In the first part of this article, I mentioned that Slurm runs Jobs within a standard Linux terminal. Among its commands you will certainly also find sbatch, which allows you to submit a new Job while executing another.
This technique is often used when your workflow includes pre- and post-processing of the data. Usually these processes require a much smaller number of resources than the main Job, so it is advisable to release the unused cores so that they can run other Jobs.
Before executing the sbatch command, it is good practice to insert checks (for example with if constructs) that verify the quality of the data produced. One example of a check is to search for a success string in the output files, using the grep command. Another technique consists in verifying an intrinsic characteristic of the files produced, for example that their size is greater than a threshold value. A final note concerns the execution of the sleep 5s command after sbatch: this gives Slurm time to complete the submission of the new Job before the subsequent instructions run.
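The two checks above can be sketched as follows. The output file, its contents and the post-processing script name are invented for the example; the sbatch and sleep lines are commented out because they only make sense on a real cluster:

```shell
# Fake output file standing in for the result of the main Job
echo "calculation converged" > run.log

# Check 1: success string present?  Check 2: file above a size threshold?
if grep -q "converged" run.log && [ "$(wc -c < run.log)" -gt 10 ]; then
    echo "checks passed: submitting next step"
    # sbatch post-process.sh   # submit the follow-up Job (requires Slurm)
    # sleep 5s                 # give Slurm time to register the submission
else
    echo "checks failed: not submitting" >&2
fi
```

Only when both conditions hold does the script reach the sbatch line, so a failed run never triggers useless post-processing.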
- Heterogeneous Jobs
This feature allows you to divide the computing resources of a Job into multiple sub-sets. The most classic use-case concerns the simultaneous execution of multiple post-processing operations, on the same data through different computational resources.
To create a heterogeneous Job, it is sufficient to separate the resource sub-sets with a #SBATCH hetjob line, as follows:
#SBATCH --ntasks=8 --mem-per-cpu=8G
#SBATCH hetjob
#SBATCH --ntasks=1 --mem=16G --gres=gpu:v100:1
During the execution of the Job, the sub-sets can be combined or dedicated exclusively to a single command, using the --het-group option of srun.
srun --het-group=0,1 step-0.sh
srun --het-group=0 step-1a.sh : --het-group=1 step-1b.sh
srun --het-group=1 step-2.sh
In this example, step-0.sh uses both sub-sets. step-1a.sh and step-1b.sh, on the other hand, run simultaneously, one per sub-set. Finally, step-2.sh uses only the second sub-set (i.e. the one with the GPU, according to the previous scheme).
The advantage of a heterogeneous job compared to separate jobs is the immediate availability of resources for all scripts submitted with srun. In fact, all the resources required for a heterogeneous Job are allocated at the beginning of its execution and will not be released until its completion, thus preventing other Jobs from occupying them.
- Dependent Jobs
In Slurm it is possible to constrain the execution of a Job based on the outcome of another. This functionality was implemented to manage highly automated workflows with multiple operational sequences to be undertaken depending on the results of the previous steps. Often, however, users employ it to limit the number of Jobs running simultaneously, thus preserving a share of the computing resources for more urgent tasks.
To take advantage of this technique, it is necessary to use the --dependency option of sbatch. This option supports the following conditions:
- after:<JOBID>:<JOBID>:… in which the current Job will be executed only once all those in the list have started or have been canceled;
- afterany:<JOBID>:<JOBID>:… where the current Job will be executed only after all the Jobs in the list have finished;
- afterok:<JOBID>:<JOBID>:… i.e. the current Job will be executed only if all the Jobs in the list have completed successfully;
- afternotok:<JOBID>:<JOBID>:… in which the current Job will be executed only if all the Jobs in the list have terminated with a failure;
- singleton, which delays execution until no other Job with the same name and belonging to the same user is running.
These conditions can be chained together using “?” when satisfying just one of them is sufficient, or “,” when all of them must be satisfied. The value of this option can also be changed while the Job is still pending, as shown in the following examples:
scontrol update jobid=<JOBID> dependency=afterok:<NEW JOBID>
scontrol update jobid=<JOBID> dependency=""
In the first case the dependency value is overwritten with “afterok:<NEW JOBID>”, while in the second case it is removed entirely.
When implementing this feature in your workflow, remember that the dependency between Jobs is based solely on their status, not on the quality of the data produced. Furthermore, there is no guarantee that Jobs whose dependencies are satisfied will execute immediately, since they still have to compete with the other Jobs in the cluster for the computational resources.
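A common way to build such a chain is to capture each Job’s id at submission time and feed it to the next --dependency. The sketch below uses stub ids so it runs anywhere; on a real cluster you would obtain them with sbatch --parsable, which prints only the job id (the step1.sh/step2.sh/post.sh scripts are hypothetical):

```shell
# Stub job ids for the sketch; on a real cluster you would capture them:
#   jid1=$(sbatch --parsable step1.sh)
#   jid2=$(sbatch --parsable step2.sh)
jid1=101
jid2=102

# Build the condition: post.sh starts only if BOTH steps succeed
dep="afterok:${jid1}:${jid2}"
echo "$dep"

# On the cluster: sbatch --dependency="$dep" post.sh
```

The same pattern extends to any length of chain, since every sbatch call returns the id the next step depends on.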
- Allocations
This last feature consists of submitting a Job that “reserves” computational resources for a certain amount of time, in order to avoid competition with the other Jobs in the cluster. Allocations are usually used for short tests aimed at debugging (of a non-interactive Job script or of an application), or to check the validity of an input file before running a longer simulation.
To create an allocation you need to use the salloc command as shown below:
salloc --partition=<QUEUE> --nodes=<NUM SERVER> --ntasks=<NUM CORE> --mem=<MB RAM> --time=<HH:MM:SS> --begin=<HH:MM:SS> --no-shell &
As you can see, many of the options in this command are similar to those of srun and sbatch. To these you must add --no-shell, which prevents salloc from opening an interactive shell that would occupy the entire allocation. Another option that I consider very useful here is --begin, which indicates the moment after which the Job can start. This lets you align the execution of the Job with your availability on the calendar. If your available time is particularly limited, you can use the --deadline option to indicate a time limit beyond which the allocation is no longer needed: Slurm will automatically remove it from the queued Jobs.
Once the Job is running, the user can request the “reserved” resources using the --jobid option of srun:
srun --nodes=<NUM SERVER> --ntasks=<NUM CORE> --mem=<MB RAM> --jobid=<ALLOCATION_JOBID> test.sh
Within an allocation, more than one srun step can run at a time, provided that the total resources requested do not exceed those “booked”.
It is good practice to indicate a maximum time limit for the use of an allocation, so as not to leave computational resources unnecessarily “reserved” for a long time. Keep in mind also that an allocation must still compete with the other Jobs in the cluster to obtain the required resources, so there is no guarantee that it will start at the time indicated in the --begin option.
Among all the software in HPC clusters, Slurm and the other resource managers almost certainly hold the title of least user-friendly. However, I hope the information in this article helps relieve the daily stress caused by this forced coexistence… and, why not, increases the efficiency of your Jobs by making the most of all the automation I have illustrated. Having said that, it only remains for me to wish you a more peaceful and productive day!