Basic concepts about SLURM

The Slurm scheduler provides three key functions:

  • it allocates access to resources (compute nodes) to users for some duration of time so they can perform work.

  • it provides a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes.

  • it arbitrates contention for resources by managing a queue of pending jobs.

A job consists of two parts: resource requests and job steps.

  • Resource requests describe the amount of computing resources (CPUs, memory, expected run time, etc.) that the job needs to run successfully.

  • Job steps describe the individual tasks that are executed within a job. Most often a job must run several individual computations to complete; each of these partial executions is called a job step. Job steps are launched with the Slurm command srun. A job consists of one or more steps, each consisting of one or more tasks, each using one or more CPUs.

Jobs are typically created with the sbatch command, and steps are created with the srun command. Tasks are requested at the job level with --ntasks or --ntasks-per-node, or at the step level with --ntasks. CPUs are requested per task with --cpus-per-task. Note that jobs submitted with sbatch have one implicit step: the Bash script itself.

The typical way of creating a job is to write a job submission script. A submission script is a shell script (e.g. a Bash script) whose first comments, if they are prefixed with #SBATCH, are interpreted by Slurm as parameters describing resource requests and submission options.

Figure 1 shows an example of a job submission script. In this example we request a total of 6 CPUs and 12 GB of RAM for each job step. You can define any number of job steps, but each job step is assigned a maximum amount of resources to use (usually the input of one step is the output of another).

In the example, job step 1 is parallelized into 3 tasks, and each task uses 2 CPUs (this requires code written with MPI or a similar programming paradigm). Job step 2 executes serial code that uses only 1 core; it is useless to request more than 1 task and 1 CPU for this step because the code is not parallelized (its input data is the output of job step 1). Finally, job step 3 executes a single task that can run on 6 cores in parallel, which requires code written with OpenMP or a similar programming paradigm.

../_images/slurm_job_step.jpg

Figure 1. Example of a job with 3 job steps and 3 tasks per job step.
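The job of Figure 1 could be sketched as a submission script like the following (a minimal outline only; the program names step1_mpi, step2_serial and step3_omp are hypothetical placeholders for your own binaries):

#!/bin/bash
#SBATCH --ntasks=3          # up to 3 tasks per job step
#SBATCH --cpus-per-task=2   # 2 CPUs per task, 6 CPUs in total
#SBATCH --mem=12G           # 12 GB of RAM

# Job step 1: 3 MPI tasks, 2 CPUs each
srun --ntasks=3 --cpus-per-task=2 ./step1_mpi

# Job step 2: serial code, 1 task on 1 CPU (its input is the output of step 1)
srun --ntasks=1 --cpus-per-task=1 ./step2_serial

# Job step 3: a single OpenMP task using the 6 CPUs
export OMP_NUM_THREADS=6
srun --ntasks=1 --cpus-per-task=6 ./step3_omp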

Memory and CPU requests

A large number of users request far more memory and CPUs than their jobs use.

While it is important to request more memory than will be used (10-20% is usually sufficient), requesting 100x, or even 10,000x, more memory only reduces the number of jobs that a user can run as well as overall throughput on the cluster. Many users will be able to run far more jobs if they request more reasonable amounts of memory.

If your job cannot run in parallel, request only 1 CPU. First read the user guide of your application to find out whether your code can be parallelized and with which parameters, then submit it specifying the correct number of CPUs.

When a job finishes without having used the resources it requested, we send the user an email showing how much memory and CPU the job actually used; this information can be used to adjust the requests of future jobs. The Slurm directives for memory requests are --mem and --mem-per-cpu. It is in the user's best interest to adjust the memory request to a more realistic value.
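As an illustration, a job that needs roughly 8 GB in total could request it in either form (the figures here are placeholders):

#SBATCH --mem=8G            # 8 GB for the whole job (per node)

or, for a 4-CPU job:

#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=2G    # 4 CPUs x 2 GB = 8 GB in total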

Requesting more memory than needed will not speed up analyses. Based on their experience of finding their personal computers run faster when adding more memory, users often believe that requesting more memory will make their analyses run faster. This is not the case. An application running on the cluster will have access to all of the memory it requests, and we never swap RAM to disk. If an application can use more memory, it will get more memory. Only when the job crosses the limit based on the memory request does SLURM kill the job.

Slurm commands

Below we briefly explain some of the most commonly used Slurm commands; you can review the complete information on the Slurm site.

Monitoring jobs: squeue

The squeue command is a tool we use to pull up information about the jobs in the queue. You can use the extended command squeue_ to retrieve statistics about the efficiency of your jobs. By default, the squeue command will print out the job ID, QoS, username, job status, number of nodes, and names of nodes for all jobs queued or running within Slurm. Usually you will not need information for all jobs queued in the system, so we can restrict the output to your own jobs with the --user flag:

$ squeue --user=USERNAME

We can output non-abbreviated information with the --long flag. This flag will print out the non-abbreviated default information with the addition of a time limit field:

$ squeue --user=USERNAME --long

The squeue command also provides users with a means to calculate a job's estimated start time by adding the --start flag to our command. This will append Slurm's estimated start time for each job in our output information.

Note: The start time provided by this command can be inaccurate. This is because the time calculated is based on jobs queued or running in the system. If a job with a higher priority is queued after the command is run, your job may be delayed.

$ squeue --user=USERNAME --start

When checking the status of a job, you may want to repeatedly call the squeue command to check for updates. We can accomplish this by adding the --iterate flag to our squeue command. This will run squeue every n seconds, allowing for frequent, continuous updates of queue information without needing to repeatedly call squeue:

$ squeue --user=USERNAME --start --iterate=n_seconds

Press ctrl-c to stop the command from looping and bring you back to the terminal.

more information about squeue

The squeue command details a variety of information on an active job’s status with state and reason codes. Job state codes describe a job’s current state in queue (e.g. pending, completed). Job reason codes describe the reason why the job is in its current state.

The following tables outline a variety of job state and reason codes you may encounter when using squeue to check on your jobs.

Job State Codes:

Job State     Code   Explanation
COMPLETED     CD     The job has completed successfully.
COMPLETING    CG     The job is finishing but some processes are still active.
FAILED        F      The job terminated with a non-zero exit code and failed to execute.
PENDING       PD     The job is waiting for resource allocation. It will eventually run.
PREEMPTED     PR     The job was terminated because of preemption by another job.
RUNNING       R      The job is currently allocated to a node and running.
SUSPENDED     S      A running job has been stopped with its cores released to other jobs.
STOPPED       ST     A running job has been stopped with its cores retained.

A full list of these Job State codes can be found in Slurm’s documentation.
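To see the state and reason of your own pending jobs directly, you can ask squeue for those fields explicitly (%i prints the job ID, %j the name, %T the full state name and %r the reason):

$ squeue --user=USERNAME --states=PD --format="%i %j %T %r"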

Job Reason Codes:

Reason Code               Explanation
Priority                  One or more higher-priority jobs are ahead of yours in the queue. Your job will eventually run.
Dependency                This job is waiting for a dependent job to complete and will run afterwards.
Resources                 The job is waiting for resources to become available and will eventually run.
InvalidAccount            The job's account is invalid. Cancel the job and resubmit with the correct account.
InvalidQoS                The job's QoS is invalid. Cancel the job and resubmit with the correct QoS.
QOSGrpCpuLimit            All CPUs assigned to your job's specified QoS are in use; the job will run eventually.
QOSGrpMaxJobsLimit        The maximum number of jobs for your job's QoS has been reached; the job will run eventually.
QOSGrpNodeLimit           All nodes assigned to your job's specified QoS are in use; the job will run eventually.
PartitionCpuLimit         All CPUs assigned to your job's specified partition are in use; the job will run eventually.
PartitionMaxJobsLimit     The maximum number of jobs for your job's partition has been reached; the job will run eventually.
PartitionNodeLimit        All nodes assigned to your job's specified partition are in use; the job will run eventually.
AssociationCpuLimit       All CPUs assigned to your job's specified association are in use; the job will run eventually.
AssociationMaxJobsLimit   The maximum number of jobs for your job's association has been reached; the job will run eventually.
AssociationNodeLimit      All nodes assigned to your job's specified association are in use; the job will run eventually.

A full list of these Job Reason Codes can be found in Slurm's documentation.

Monitoring finished jobs: sacct

The sacct command allows users to pull up status information about past jobs. This command is used on jobs that have been previously run on the system instead of currently running jobs.

We can use a job's ID…

$ sacct --jobs=job-id

…or your Garnatxa username…

$ sacct --user=USERNAME

to pull up accounting information on jobs run at an earlier time.

By default, sacct will only pull up jobs that were run on the current day. We can use the --starttime flag to tell the command to look beyond its short-term cache of jobs.

$ sacct --user=USERNAME --starttime=YYYY-MM-DD

To see a non-abbreviated version of sacct output, use the --long flag:

$ sacct --user=USERNAME --starttime=YYYY-MM-DD --long

The standard output of sacct may not provide the information we want. To remedy this, we can use the --format flag to choose what we want in our output. The format flag takes a list of comma-separated variables which specify the output data:

$ sacct --user=USERNAME --format=var_1,var_2, ... ,var_N

A chart of some variables is provided below:

Variable       Description
account        Account the job ran under.
avecpu         Average CPU time of all tasks in the job.
averss         Average resident set size of all tasks in the job.
cputime        Formatted (elapsed time * CPU count) used by a job or step.
elapsed        The job's elapsed time, formatted as DD-HH:MM:SS.
exitcode       The exit code returned by the job script or salloc.
jobid          The ID of the job.
jobname        The name of the job.
maxdiskread    Maximum number of bytes read by all tasks in the job.
maxdiskwrite   Maximum number of bytes written by all tasks in the job.
maxrss         Maximum resident set size of all tasks in the job.
ncpus          Number of allocated CPUs.
nnodes         The number of nodes used in the job.
ntasks         Number of tasks in the job.
priority       Slurm priority.
qos            Quality of service.
reqcpu         Required number of CPUs.
reqmem         Required amount of memory for the job.
user           Username of the person who ran the job.
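For example, to review the run time and peak memory of jobs run since a given date, pick variables from the chart above:

$ sacct --user=USERNAME --starttime=YYYY-MM-DD --format=jobid,jobname,elapsed,ncpus,maxrss,exitcode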

more information about sacct

Canceling jobs: scancel

Sometimes you may need to stop a job entirely while it’s running. The best way to accomplish this is with the scancel command. The scancel command allows you to cancel jobs you are running on Garnatxa using the job’s ID. The command looks like this:

$ scancel job-id

To cancel multiple jobs, you can pass a list of job IDs:

$ scancel job-id1 job-id2 job-id3

To cancel all your jobs (running and pending):

$ scancel -u USERNAME
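scancel can also filter by job state, which is handy for clearing only the pending part of your queue while leaving running jobs alone:

$ scancel --user=USERNAME --state=PENDING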

more information about scancel

Checking efficiency of running jobs: squeue_

The extended command squeue_ allows users to easily pull up status information about their currently running jobs, including the requested resources and the resources used so far. The command reports the achieved efficiency as a percentage. It is very important that you monitor your jobs and check the resources they actually consume. If squeue_ shows an efficiency below 80%, you should adjust the resource requests of your next job.

To check the efficiency of a running job:

$ squeue_ -j 10075
 __________________________________________________________________________________________________________________________________________________________________________
| ST | JOB      | NAME               | USER     | ACCOUNT  | QOS      | STARTIME   | TIME         | TIME_LEFT    | ND  |  CPU     E.CPU | PEAK_MEM    E.MEM | NODES        |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| R  | 10075    | EXAMPLE            | USER1    | ACC1     | short    | 2023-01-13 | 4:51:34      | 19:08:26     | 1   | 1/10     10%   | 23G/100G    23%   | cn00         |
|__________________________________________________________________________________________________________________________________________________________________________|
  • CPU: the average number of CPUs used during the execution of the job. In the example, 1/10 means that the user requested 10 CPUs but only 1 CPU is being used on average.

  • PEAK_MEM: the maximum amount of memory the job has used during its execution. In the example, 23G/100G means that the user requested a total of 100 GB of memory and at some point the job reached 23 GB.

Which values measure my efficiency? Check the CPU and PEAK_MEM columns to determine whether the efficiency of your job is below 75% in CPU or memory. If so, modify the requested parameters in your submission script. These values give you an idea of the resources consumed by your job so far. If the efficiency is very low and the job has been running only a short time, cancel it and adjust the requirements; otherwise wait for the job to finish to be sure of what it consumed (next section).

To check the efficiency of all your jobs:

$ squeue_ -u USERNAME

Checking efficiency of running and completed jobs: sacct_

The resources consumed by a running or finished job can be checked by executing sacct_:

$ sacct_ -j 10516

[USERNAME@master test]$ sacct_ -j 10516
 ________________________________________________________________________________________________________________________________________________________________________________________________________________
| JOBID         | NAME          | START                | END                  | ELAPSED    | TOTAL_CPU  | USER          | ACCOUNT  | QOS      | CPU       E.CPU  | PEAK_MEM       E.MEM  | STATE     | EXIT_CODE |
|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 10516         | seqJobTest    | 2023-01-11T11:09:32  | 2023-01-11T11:11:33  | 00:02:01   | 02:01.567  | xxx           | admin    | short    | 1/1       100%   | 6G/10G         60%    | COMPLETED | 0:0       |
| 10516.batch   | batch         | 2023-01-11T11:09:32  | 2023-01-11T11:11:33  | 00:02:01   | 00:00.373  |               | admin    |          | -/2       -      | -/10G          -      | COMPLETED | 0:0       |
| 10516.0       | stress        | 2023-01-11T11:09:33  | 2023-01-11T11:11:33  | 00:02:00   | 02:01.194  |               | admin    |          | -/2       -      | -/10G          -      | COMPLETED | 0:0       |
|________________________________________________________________________________________________________________________________________________________________________________________________________________|

In the example, the job with job_id 10516 consumed 6 GB of memory but the user requested 10 GB, so the efficiency was only 60%. To show the last jobs finished by a user:

$ sacct_ -u USERNAME

And to get a brief output (only efficiencies and discarding job steps):

$ sacct_ -b -u USERNAME

Plotting the job efficiency over time: plotjob

The plotjob command displays a plot of the consumed resources (CPU or memory) over the execution time. You can use this command with running or finished jobs. To use it you need an X11 forwarding connection to Garnatxa (ssh -X). Use plotjob -h for more information.

Attention

Keep in mind that the gray areas on the plot represent resources wasted, both for your jobs and for the rest of the users. Pay attention to the mean (CPU) or peak (memory) values and adjust your sbatch script to the resources the job really needs.

Example of plotting the cpu efficiency:

ssh -X USERNAME@garnatxa
plotjob -j job_id -o cpu
../_images/cpu_efficiency.png

Example of plotting the memory efficiency:

ssh -X USERNAME@garnatxa
plotjob -j <job_id> -o mem
../_images/mem_efficiency.png

If you cannot get graphical output, you can save the plot to disk and copy it to an external location.

plotjob -j <job_id> -o mem -s
ls /tmp/mem_plot_1857479.png
scp /tmp/mem_plot_1857479.png user@external_host:/tmp

Controlling queued and running jobs: scontrol

The scontrol command provides users extended control of their jobs run through Slurm. This includes actions like suspending a job, holding a job from running, or pulling extensive status information on jobs.

To suspend a job that is currently running on the system, we can use scontrol with the suspend command. This stops the running job at its current step; it can be resumed at a later time. We can suspend a job by typing the command:

$ scontrol suspend job_id

To resume a paused job, we use scontrol with the resume command:

$ scontrol resume job_id

Slurm also provides a utility to hold jobs that are queued in the system. Holding a job places it at the lowest priority, effectively "holding" it back from being run. A job can only be held while it is waiting to be run. We use the hold command to place a job into a held state:

$ scontrol hold job_id

We can then release a held job using the release command:

$ scontrol release job_id

scontrol can also provide information on jobs using the show job command. The information provided from this command is quite extensive and detailed, so be sure to either clear your terminal window, grep certain information from the command, or pipe the output to a separate text file:

Output to console

$ scontrol show job job_id

Streaming output to a text file

$ scontrol show job job_id > outputfile.txt

Piping output to grep to find lines containing the word "Time"

$ scontrol show job job_id | grep Time
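scontrol can also modify attributes of a queued job with the update command; for example, reducing its time limit (the value shown is a placeholder):

$ scontrol update JobId=job_id TimeLimit=02:00:00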

more information about scontrol

Submitting jobs to the cluster: sbatch

sbatch submits a batch script to Slurm. The batch script may be given to sbatch through a file name on the command line, or if no file name is specified, sbatch will read in a script from standard input. The batch script may contain options preceded with "#SBATCH" before any executable commands in the script. sbatch will stop processing further #SBATCH directives once the first non-comment non-whitespace line has been reached in the script. sbatch exits immediately after the script is successfully transferred to the Slurm controller and assigned a Slurm job ID. The batch script is not necessarily granted resources immediately; it may sit in the queue of pending jobs for some time before its required resources become available.

By default both standard output and standard error are directed to a file of the name “slurm-%j.out”, where the “%j” is replaced with the job allocation number. The file will be generated on the first node of the job allocation. Other than the batch script itself, Slurm does no movement of user files. When the job allocation is finally granted for the batch script, Slurm runs a single copy of the batch script on the first node in the set of allocated nodes.

A batch job is a shell script that is processed by a batch system. A typical batch job is shown below. It has four sections

  1. shebang (line 1)

  2. submit options (lines 3-5)

  3. initialization (lines 7-8)

  4. data handling and work (lines 10-12)

that are explained afterwards. The example shows the basic structure. Real batch jobs can become more complex.

 1#!/bin/bash
 2
 3# submit options
 4#SBATCH --ntasks=1
 5#SBATCH --time=00:05:00
 6
 7# initialization
 8module load package/version
 9
10# data handling and work
11cd /path/to/working/directory
12binary [arguments]
13
14exit

Explanations

1. shebang The first line of every shell script is the shebang, which specifies the command-line interpreter to use.

2. submit options In a batch job the next lines contain submit options. Alternatively options could be given as arguments to the submit command. The syntax for specifying options is the same in both cases. In a job script submit options must be preceded by a special prefix which is #SBATCH for the SLURM batch system. Syntactically the first character of the prefix makes such a line a shell script comment. The submit command stops processing these lines once the first line containing a shell command has been reached.

3. initialization System-specific initialization: whether it is needed depends on the system; for example, on our system the module function is often needed for job-specific initialization. Job-specific initialization: there are two typical use cases. For application packages the corresponding module must be loaded. For self-compiled software it might be necessary to load (or switch to) exactly the same modules that were loaded at compile time. For MPI programs that are launched with the mpirun command, the MPI module used at compile time must be loaded in any case.

4. data handling and work This part contains commands for handling data and the actual work to be performed, starting with selecting the working directory. The default working directory is the directory in which the submit command was issued.

When you write an sbatch file you can use a set of read-only variables to get the values of your job's resource requests:

Variable                     Description
$SLURM_JOB_ID                The job ID.
$SLURM_SUBMIT_DIR            The path of the job submission directory.
$SLURM_SUBMIT_HOST           The hostname of the node used for job submission.
$SLURM_JOB_NODELIST          List of nodes assigned to the job.
$SLURM_CPUS_PER_TASK         Number of CPUs per task.
$SLURM_CPUS_ON_NODE          Number of CPUs on the allocated node.
$SLURM_JOB_CPUS_PER_NODE     Count of processors available to the job on this node.
$SLURM_CPUS_PER_GPU          Number of CPUs requested per allocated GPU.
$SLURM_MEM_PER_CPU           Memory per CPU. Same as --mem-per-cpu.
$SLURM_MEM_PER_GPU           Memory per GPU.
$SLURM_MEM_PER_NODE          Memory per node. Same as --mem.
$SLURM_GPUS                  Number of GPUs requested.
$SLURM_NTASKS                The number of tasks.
$SLURM_NTASKS_PER_NODE       Number of tasks requested per node.
$SLURM_NTASKS_PER_SOCKET     Number of tasks requested per socket.
$SLURM_NTASKS_PER_CORE       Number of tasks requested per core.
$SLURM_NTASKS_PER_GPU        Number of tasks requested per GPU.
$SLURM_NNODES                Total number of nodes in the job's resource allocation.
$SLURM_TASKS_PER_NODE        Number of tasks to be initiated on each node.
$SLURM_ARRAY_JOB_ID          Job array's master job ID number.
$SLURM_ARRAY_TASK_ID         Job array ID (index) number.
$SLURM_ARRAY_TASK_COUNT      Total number of tasks in a job array.
$SLURM_ARRAY_TASK_MAX        Job array's maximum ID (index) number.
$SLURM_ARRAY_TASK_MIN        Job array's minimum ID (index) number.
$SLURM_RESTART_COUNT         The number of times the job was restarted due to node failures.
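A brief sketch of how these variables can be used inside a submission script (my_app is a hypothetical placeholder for your own program):

# Report where the job runs and name the output after the job ID
echo "Job $SLURM_JOB_ID running on $SLURM_JOB_NODELIST"
my_app --threads=$SLURM_CPUS_PER_TASK > result_${SLURM_JOB_ID}.txt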

Job scripts, the sbatch command, and the srun command support many different resource requests in the form of flags. These flags are available to all forms of jobs. To review all possible flags for these commands, please visit the Slurm page on sbatch. Below, we have listed some useful directives to consider when writing your job script.

Type                  Description                                                Flag
Allocation            Specify an allocation account                              --account=allocation
Quality of service    Specify a QoS (see the section limits)                     --qos=qos
Sending email         Receive email at the beginning and/or end of the job      --mail-type=type
Email address         Email address to receive the email                         --mail-user=user
Number of nodes       The number of nodes needed to run the job                  --nodes=nodes
Number of tasks       The total number of processes needed to run the job        --ntasks=processes
Tasks per node        The number of processes to assign to each node             --ntasks-per-node=processes
CPUs per task         The number of CPUs to be used per task                     --cpus-per-task=number_cpus
Total memory          The total memory (per node requested) for the job          --mem=memory (units: K,M,G,T; default M)
Memory per CPU        The memory required per CPU                                --mem-per-cpu=memory (units: K,M,G,T; default M)
Wall time             The maximum amount of time your job will run               --time=wall_time
Job name              Name your job so you can identify it in the queue          --job-name=jobname
Multithreading        Each task will use only 1 thread per core                  --hint=nomultithread or --threads-per-core=1
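Putting several of these flags together, a typical job header could look like the following (the account, QoS, and email values are placeholders to adapt to your own case):

#!/bin/bash
#SBATCH --job-name=myJob
#SBATCH --account=myaccount
#SBATCH --qos=short
#SBATCH --mail-type=END
#SBATCH --mail-user=user@example.com
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=4G
#SBATCH --time=01:00:00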

Below is a set of example scripts that you can use as templates for building your own Slurm submission scripts. Each script submits a different type of job (sequential, multi-threaded, MPI, array, etc.).

If you want to test the scripts before submitting your own jobs to the cluster, copy the directory /doc/test/ to your account.

[USERNAME@master ~]$ cp -pr /doc/test/ .

The scripts solve a typical bioinformatics problem: index a reference sequence and then align multiple read files to the reference genome. The directory structure is:

[USERNAME@master ~]$ cd test
[USERNAME@master test]$ ls -R
.:
ArrayJob.sh  data  executables  FileJob.sh  files  MPIJob.sh  OpenMPJob.sh  out  ref  SequentialJob.sh

./data:
reads_00.fq  reads_02.fq  reads_04.fq  reads_06.fq  reads_08.fq  reads_10.fq  reads_12.fq  reads_14.fq  reads_16.fq  reads_18.fq  reads_20.fq
reads_01.fq  reads_03.fq  reads_05.fq  reads_07.fq  reads_09.fq  reads_11.fq  reads_13.fq  reads_15.fq  reads_17.fq  reads_19.fq

./out:

./ref:
chr8.fa

If you choose to execute one of these sample scripts, please make sure you understand what each #SBATCH directive does before using the script to submit your jobs. Otherwise, you may not get the result you want and may waste valuable computing resources.

Basic, Single-Threaded Job

This script can serve as the template for many single-processor applications. The --mem flag can be used to request the appropriate amount of memory for your job. Please make sure to test your application and set this value to a reasonable number based on actual memory use. The %j in the --output line tells Slurm to substitute the job ID into the name of the output file. You can also add a -e or --error line with an error file name to separate output and error logs. Note that this type of job is not parallel, so a single CPU is enough; remember to indicate this with the --ntasks and --cpus-per-task lines.

See the script: SequentialJob.sh

 1#!/bin/bash
 2#SBATCH --job-name=seqJobTest       # Job name (showed with squeue)
 3#SBATCH --output=seqJobTest_%j.out  # Standard output and error log
 4#SBATCH --qos=short                 # QoS: short,medium,long,long-mem
 5#SBATCH --nodes=1                   # Required only 1 node
 6#SBATCH --ntasks=1                  # Required only 1 task
 7#SBATCH --cpus-per-task=1           # Required only 1 cpu
 8#SBATCH --mem=10G                   # Required 10GB of memory
 9#SBATCH --time=00:05:00             # Required 5 minutes of execution time.
10
11# The first command is to load the required software.
12module load biotools
13
14# Index the reference genome (ref/chr8.fa). The output files will be renamed with the prefix: chr8_ref
15srun bwa index ref/chr8.fa -p ref/chr8_ref
16
17# Align a single file of reads (data/reads_00.fq) to the indexed reference file (ref/chr8.fa). We are using a single cpu (parameter: -t 1)
18srun bwa aln -I -t 1 ref/chr8_ref data/reads_00.fq > out/example_aln.sai
19
20exit 0

Commented lines:

  • 4. --qos=short: we are requesting 1 CPU, 10 GB of memory, and 5 minutes of execution time, so we have to select the QoS short (see limits).

  • 5. --nodes=1: the minimum number of nodes to run on. A maximum node count may also be specified with the syntax (min-max): --nodes=1-4. You can omit this parameter and let Slurm select the number of nodes that are necessary.

  • 6. --ntasks=1: each job step (each of the srun lines) is submitted with only 1 task.

  • 7. --cpus-per-task=1: each task (each of the srun lines) will run on a single CPU; see the note below.

  • 8. --mem=10G: request 10 GB of RAM for the whole job.

  • 15. srun bwa index: the first job step in the job. It is a sequential task (not parallelized, so it only needs a single CPU) that indexes the reference genome.

  • 18. srun bwa aln: the second job step in the job. It is also sequential (not parallelized, so it only needs a single CPU) and aligns one of the 20 read files to the reference genome.

Important

Garnatxa always works with an even number of logical threads per core. This is for performance reasons related to hyper-threading technology. Each physical core in Garnatxa is associated with two computation threads, which are reserved exclusively for single jobs. This means that when you request a single CPU in Slurm, the system internally reserves an even number (two threads). Whether you use the extra thread (CPU) is then up to you, and depends on whether your job is capable of parallelization.

We start by submitting the job to the queue system. The sbatch command returns the job identifier, which will be used later.

[USERNAME@master test]$ sbatch SequentialJob.sh
Submitted batch job 6757

Now we can review the state of the job in the batch system. squeue reports whether the job is running (R) or pending while waiting for resources (PD); the extended command squeue_ additionally returns efficiency information about the job. This example takes approximately 2 minutes to finish.

[USERNAME@master test]$ squeue -u USERNAME
JOBID PARTITION  QOS    NAME     USER ST       TIME  NODES NODELIST(REASON)
6757     global  short  seqJobTe USERNAME  R       0:05      1 cn07

[USERNAME@master test]$ squeue_ -u USERNAME
 ____________________________________________________________________________________________________________________________________________________________________________________________________________________
| ST | JOB      | NAME               | USER     | ACCOUNT  | QOS      | STARTIME   | TIME         | TIME_LEFT    | ND  | CPU      E.CPU | PEAK_CPU  EFFIC | PEAK_MEM      EFFIC | NOW_MEM       EFFIC | NODES        |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| R  | 6757     | seqJobTest         | USERNAME | admin    | short    | 2022-11-14 | 2:46         | 2:14         | 1   | 1/2      50%   | 1/2       50%   | 0G/10G        0%    | 0G/10G        0%    | cn00         |
|____________________________________________________________________________________________________________________________________________________________________________________________________________________|

While the job is running we can check the amount of CPU and memory used by typing squeue_. You will have to wait at least a couple of minutes until the system shows the first efficiency results. In the example the job is using only 1 CPU, but the system allocated 2 because that is the minimum allocation. The job is currently using 220 MB (squeue_ only shows values above 1 GB) while we requested 10 GB. Note that the PEAK_MEM column shows the maximum amount of memory consumed by your job over the whole execution.

We can wait for the job to complete and then check the resources consumed. In any case, it is clear that you should adjust the number of CPUs and the amount of memory in the next execution.

At any time we can also check the status of a job by typing sacct or sacct_. If the job has finished it will be marked as COMPLETED.

[USERNAME@master test]$ sacct_ -j 6757
__________________________________________________________________________________________________________________________________________________________________________________________________________________
| JOBID         | NAME          | START                | END                  | ELAPSED    | TOTAL_CPU  | USER          | ACCOUNT  | QOS      | CPU      | E.CPU  | PEAK_MEM      | E.MEM  | STATE     | EXIT_CODE |
|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| 6757          | seqJobTest    | 2022-11-14T17:12:20  | 2022-11-14T17:15:02  | 00:02:42   | 00:02:40   | USERNAME      | admin    | short    | 1/1      | 100%   | 0G/10G        | 0%     | COMPLETED | 0:0       |
| 6757.batch    | batch         | 2022-11-14T17:12:20  | 2022-11-14T17:15:02  | 00:02:42   | 00:02:40   |               | admin    |          | -/2      | -      | -/10G         | -      | COMPLETED | 0:0       |
| 6757.0        | bwa           | 2022-11-14T17:12:21  | 2022-11-14T17:14:12  | 00:01:51   | 00:01:48   |               | admin    |          | -/2      | -      | -/10G         | -      | COMPLETED | 0:0       |
| 6757.1        | bwa           | 2022-11-14T17:14:02  | 2022-11-14T17:14:47  | 00:00:50   | 00:00:48   |               | admin    |          | -/2      | -      | -/10G         | -      | COMPLETED | 0:0       |
|__________________________________________________________________________________________________________________________________________________________________________________________________________________|

The above command shows that the main job (6757) executed two job steps (6757.0 and 6757.1). The columns E.CPU and E.MEM show the achieved efficiency (in this case the job consumed less than 1 GB of memory).

Finally we can see the output of the job. Remember that the name of the output file contains the job ID number.

[USERNAME@master test]$ more seqJobTest_6757.out
[bwa_index] Pack FASTA... 0.98 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=292728044, availableWord=32597292
[BWTIncConstructFromPacked] 10 iterations done. 53770876 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 99337660 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 139833516 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 175822284 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 207805164 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 236227612 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 261485516 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 283930748 characters processed.
[bwt_gen] Finished constructing BWT in 85 iterations.
[bwa_index] 72.14 seconds elapse.
[bwa_index] Update BWT... 0.59 sec
[bwa_index] Pack forward-only FASTA... 0.58 sec
[bwa_index] Construct SA from BWT and Occ... 33.64 sec
[main] Version: 0.7.17-r1188
[main] CMD: /storage/apps/BWA/0.7.17/bin/bwa index -p ref/chr8_ref ref/chr8.fa
[main] Real time: 110.993 sec; CPU: 107.935 sec
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 3.35 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.33 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.31 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 786432 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.25 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 1048576 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.25 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 1310720 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.25 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 1572864 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.25 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 1835008 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.28 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 2097152 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.26 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 2359296 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.24 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 2621440 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.26 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 2883584 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.27 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 3145728 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 3.25 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 3407872 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 1.18 sec
[bwa_aln_core] write to the disk... 0.00 sec
[bwa_aln_core] 3502500 sequences have been processed.
[main] Version: 0.7.17-r1188
[main] CMD: bwa aln -I -t 1 ref/chr8_ref data/reads_00.fq
[main] Real time: 50.314 sec; CPU: 47.208 sec

Multi-Threaded SMP Job

This script can serve as a template for applications that are capable of using multiple processors on a single server or physical computer. These applications are commonly referred to as threaded, OpenMP, PTHREADS, or shared memory applications. While they can use multiple processors, they cannot make use of multiple servers and all the processors must be on the same node.

These applications require shared memory and can only run on one node; as such it is important to remember the following:

You must set --ntasks=1, and then set --cpus-per-task to the number of OpenMP threads you wish to use. You must also make the application aware of how many processors to use; how that is done depends on the application. For some applications (those using the OpenMP paradigm), set OMP_NUM_THREADS to a value less than or equal to the number of cpus-per-task you set. For others, use a command-line option when calling the application. Check whether your application provides a parameter indicating the number of threads to use.

The script below requests 4 CPUs to parallelize one of the job steps. Observe that the indexing process is not parallelized (it will use only 1 CPU), while the alignment process uses the -t parameter to request 4 threads. You can use the Slurm variable $SLURM_CPUS_PER_TASK to reference the number of CPUs per task requested (written in the line #SBATCH --cpus-per-task=4).

 1#!/bin/bash
 2
 3#SBATCH --job-name=multiThreadJob       # Job name
 4#SBATCH --output=OpenMPJob_%j.out       # Standard output and error log
 5#SBATCH --nodes=1                       # Run all processes on a single node
 6#SBATCH --ntasks=1                      # Run a single task
 7#SBATCH --cpus-per-task=4               # Number of CPU cores per task
 8#SBATCH --mem=1gb                       # Job memory request
 9#SBATCH --time=00:05:00                 # Time limit hrs:min:sec
10#SBATCH --qos=short                 # QoS: short,medium,long,long-mem
11
12export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK   # Only if your application uses the OpenMP paradigm (exported so the program sees it)
13
14# Load the required software (bwa)
15module load biotools
16
17# Index the reference genome (ref/chr8.fa). The output files will be renamed with the prefix: chr8_ref
18srun bwa index ref/chr8.fa -p ref/chr8_ref
19
20# Align a single file of reads (data/reads_00.fq) to the indexed reference file (ref/chr8.fa).
21# We are using 4 cpus (parameter: -t $SLURM_CPUS_PER_TASK)
22srun bwa aln -I -t $SLURM_CPUS_PER_TASK ref/chr8_ref data/reads_00.fq > out/example_aln.sai
23
24exit 0

Commented lines:

  • 6. --ntasks=1: each job step is assigned a single task, but each task can request 4 CPUs (see line 7).

  • 7. --cpus-per-task=4: if the job step is parallelized, it can use up to 4 CPUs.

  • 8. --mem=1gb: the entire job (including all job steps) requests 1 GB of RAM. Review this parameter after your job has finished.

  • 18. srun bwa index: first job step. The indexing process is a sequential task, so we cannot usefully give it more than 1 CPU.

  • 22. srun bwa aln -I -t $SLURM_CPUS_PER_TASK: second job step. The alignment process can be parallelized to run faster. Use the -t parameter of the bwa application to specify the number of threads running concurrently; $SLURM_CPUS_PER_TASK expands to the number of CPUs requested in the sbatch script (line 7).

Now submit the multi-threaded job and wait until it is completed. Looking at the resulting file, you can check that the alignment job step took 33.59 seconds, while the sequential version took 50.31 seconds.

[USERNAME@master test]$ sbatch MultiThreadJob.sh
Submitted batch job 7035

[USERNAME@master test]$ cat multiThreadJob_7035.out
[bwa_index] Pack FASTA... 1.06 sec
[bwa_index] Construct BWT for the packed sequence...
[BWTIncCreate] textLength=292728044, availableWord=32597292
[BWTIncConstructFromPacked] 10 iterations done. 53770876 characters processed.
[BWTIncConstructFromPacked] 20 iterations done. 99337660 characters processed.
[BWTIncConstructFromPacked] 30 iterations done. 139833516 characters processed.
[BWTIncConstructFromPacked] 40 iterations done. 175822284 characters processed.
[BWTIncConstructFromPacked] 50 iterations done. 207805164 characters processed.
[BWTIncConstructFromPacked] 60 iterations done. 236227612 characters processed.
[BWTIncConstructFromPacked] 70 iterations done. 261485516 characters processed.
[BWTIncConstructFromPacked] 80 iterations done. 283930748 characters processed.
[bwt_gen] Finished constructing BWT in 85 iterations.
[bwa_index] 69.26 seconds elapse.
[bwa_index] Update BWT... 0.65 sec
[bwa_index] Pack forward-only FASTA... 0.64 sec
[bwa_index] Construct SA from BWT and Occ... 31.88 sec
[main] Version: 0.7.17-r1188
[main] CMD: /storage/apps/BWA/0.7.17/bin/bwa index -p ref/chr8_ref ref/chr8.fa
[main] Real time: 105.585 sec; CPU: 103.520 sec
[bwa_aln] 17bp reads: max_diff = 2
[bwa_aln] 38bp reads: max_diff = 3
[bwa_aln] 64bp reads: max_diff = 4
[bwa_aln] 93bp reads: max_diff = 5
[bwa_aln] 124bp reads: max_diff = 6
[bwa_aln] 157bp reads: max_diff = 7
[bwa_aln] 190bp reads: max_diff = 8
[bwa_aln] 225bp reads: max_diff = 9
[bwa_aln_core] calculate SA coordinate... 5.35 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 262144 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.26 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 524288 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.31 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 786432 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.23 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 1048576 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.29 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 1310720 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.44 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 1572864 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.24 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 1835008 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.34 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 2097152 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.23 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 2359296 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.37 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 2621440 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.44 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 2883584 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.35 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 3145728 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 5.34 sec
[bwa_aln_core] write to the disk... 0.02 sec
[bwa_aln_core] 3407872 sequences have been processed.
[bwa_aln_core] calculate SA coordinate... 1.95 sec
[bwa_aln_core] write to the disk... 0.01 sec
[bwa_aln_core] 3502500 sequences have been processed.
[main] Version: 0.7.17-r1188
[main] CMD: /storage/apps/BWA/0.7.17/bin/bwa aln -I -t 4 ref/chr8_ref data/reads_00.fq
[main] Real time: 33.590 sec; CPU: 75.471 sec

Message Passing Interface (MPI) Jobs

MPI (Message Passing Interface) is a specification that lets software developers make use of a cluster of computers. A set of libraries exists for using this standard on modern High Performance Computing (HPC) clusters. The challenge in a computing cluster is that while some CPUs share memory (shared memory), others have a distributed-memory architecture connected only by the network. With MPI, developers can exploit distributed memory, shared memory, or a hybrid of both.

If your application uses MPI, the next script can serve as a template for MPI applications. These are applications that can use multiple processors, which may or may not be on multiple compute nodes.

First, we need to load the openmpi module so that the application can run with the MPI libraries.

Some parameters listed in the script:

  • -n, --ntasks=<number>: number of tasks (MPI ranks). We are requesting 80 MPI processes (tasks); each of them will use a single CPU, so 80 CPUs will be used in total.

  • -c, --cpus-per-task=<ncpus>: request ncpus cores per task.

  • -N, --nodes=<minnodes[-maxnodes]>: request that a minimum of minnodes nodes be allocated to this job. We could omit this parameter and Slurm would use a suitable number of nodes to allocate the job.

As you can see, the last line of the script launches the mpirun command, passing the number of MPI tasks to create (use the Slurm variable ${SLURM_NTASKS}).

 1[USERNAME@master test]$ cat MPIJob.sh
 2#!/bin/bash
 3#SBATCH --job-name=MPIJob       # Job name
 4#SBATCH --nodes=2               # Maximum number of nodes to be allocated
 5#SBATCH --ntasks=80             # Number of MPI tasks (i.e. processes)
 6#SBATCH --cpus-per-task=1       # Number of cores per MPI task
 7#SBATCH --mem=1G                # Memory per node
 8#SBATCH --time=00:05:00         # Wall time limit (days-hrs:min:sec)
 9#SBATCH --output=MPIJob_%j.log  # Path to the standard output and error files relative to the working directory
10#SBATCH --qos=short             # QoS: short,medium,long,long-mem
11
12echo "JOBID                          = $SLURM_JOB_ID"
13echo "Number of Nodes Allocated      = $SLURM_JOB_NUM_NODES"
14echo "Number of Tasks Allocated      = $SLURM_NTASKS"
15echo "Number of Cores/Task Allocated = $SLURM_CPUS_PER_TASK"
16
17module load openmpi4
18
19mpirun -np ${SLURM_NTASKS} ./mpi_hello_world

Commented lines:

  • 4. --nodes=2: in this example we request 80 MPI tasks (80 MPI processes = 80 CPUs), so we need 2 nodes (40 CPUs per node). You can omit this parameter and Slurm will select the number of nodes your job needs.

  • 6. --cpus-per-task=1: each MPI process will consume 1 CPU.

  • 7. --mem=1G: the job requests 1 GB of RAM per allocated node for all the MPI processes. We could also have specified --mem-per-cpu; in that case the total memory requested by the job would be ntasks * cpus-per-task * mem-per-cpu.

  • 17. module load openmpi4: we need to load the OpenMPI libraries before launching an MPI application.

  • 19. mpirun -np ${SLURM_NTASKS}: launches the MPI application (in this example a trivial hello_world program). Select the number of MPI tasks with the -np parameter (use the ${SLURM_NTASKS} variable to reference the number of tasks written in line 5 of the sbatch script: --ntasks).
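As a side note, on many Slurm installations an MPI program can also be launched with srun instead of mpirun, letting Slurm place the ranks itself; whether this works depends on how the MPI library was built, so verify it on your system first:

srun ./mpi_hello_world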

After the job completes we can check that the 80 MPI tasks ran on two nodes.

[USERNAME@master test]$ sbatch MPIJob.sh
Submitted batch job 7053
[USERNAME@master test]$ more MPIJob_7053.log
Number of Nodes Allocated      = 2
Number of Tasks Allocated      = 80
Number of Cores/Task Allocated = 1
Hello world from processor osd01, rank 42 out of 80 processors
Hello world from processor osd01, rank 46 out of 80 processors
Hello world from processor osd01, rank 51 out of 80 processors
Hello world from processor osd01, rank 55 out of 80 processors
Hello world from processor osd01, rank 69 out of 80 processors
Hello world from processor osd01, rank 79 out of 80 processors
Hello world from processor osd01, rank 40 out of 80 processors
Hello world from processor osd00, rank 11 out of 80 processors
Hello world from processor osd01, rank 47 out of 80 processors
Hello world from processor osd01, rank 50 out of 80 processors
Hello world from processor osd01, rank 52 out of 80 processors
Hello world from processor osd01, rank 54 out of 80 processors
Hello world from processor osd01, rank 60 out of 80 processors
Hello world from processor osd01, rank 62 out of 80 processors
Hello world from processor osd00, rank 29 out of 80 processors
Hello world from processor osd01, rank 64 out of 80 processors
Hello world from processor osd00, rank 32 out of 80 processors
Hello world from processor osd01, rank 67 out of 80 processors
Hello world from processor osd01, rank 41 out of 80 processors
Hello world from processor osd00, rank 36 out of 80 processors
Hello world from processor osd01, rank 43 out of 80 processors
Hello world from processor osd00, rank 39 out of 80 processors
Hello world from processor osd01, rank 44 out of 80 processors
Hello world from processor osd01, rank 49 out of 80 processors
Hello world from processor osd01, rank 53 out of 80 processors
Hello world from processor osd00, rank 2 out of 80 processors
Hello world from processor osd01, rank 57 out of 80 processors
Hello world from processor osd00, rank 4 out of 80 processors
Hello world from processor osd01, rank 58 out of 80 processors
Hello world from processor osd00, rank 5 out of 80 processors
Hello world from processor osd01, rank 61 out of 80 processors
Hello world from processor osd00, rank 10 out of 80 processors
Hello world from processor osd01, rank 65 out of 80 processors
Hello world from processor osd00, rank 12 out of 80 processors
Hello world from processor osd01, rank 68 out of 80 processors
Hello world from processor osd00, rank 15 out of 80 processors
Hello world from processor osd01, rank 70 out of 80 processors
Hello world from processor osd00, rank 20 out of 80 processors
Hello world from processor osd01, rank 71 out of 80 processors
Hello world from processor osd00, rank 8 out of 80 processors
Hello world from processor osd01, rank 73 out of 80 processors
Hello world from processor osd01, rank 74 out of 80 processors
Hello world from processor osd00, rank 17 out of 80 processors
Hello world from processor osd01, rank 75 out of 80 processors
Hello world from processor osd00, rank 3 out of 80 processors
Hello world from processor osd01, rank 76 out of 80 processors
Hello world from processor osd00, rank 19 out of 80 processors
Hello world from processor osd01, rank 78 out of 80 processors
Hello world from processor osd00, rank 33 out of 80 processors
Hello world from processor osd01, rank 45 out of 80 processors
Hello world from processor osd00, rank 34 out of 80 processors
Hello world from processor osd01, rank 48 out of 80 processors
Hello world from processor osd00, rank 37 out of 80 processors
Hello world from processor osd01, rank 56 out of 80 processors
Hello world from processor osd00, rank 9 out of 80 processors
Hello world from processor osd01, rank 59 out of 80 processors
Hello world from processor osd00, rank 13 out of 80 processors
Hello world from processor osd01, rank 63 out of 80 processors
Hello world from processor osd00, rank 24 out of 80 processors
Hello world from processor osd01, rank 66 out of 80 processors
Hello world from processor osd00, rank 26 out of 80 processors
Hello world from processor osd01, rank 72 out of 80 processors
Hello world from processor osd00, rank 35 out of 80 processors
Hello world from processor osd01, rank 77 out of 80 processors
Hello world from processor osd00, rank 1 out of 80 processors
Hello world from processor osd00, rank 22 out of 80 processors
Hello world from processor osd00, rank 27 out of 80 processors
Hello world from processor osd00, rank 30 out of 80 processors
Hello world from processor osd00, rank 31 out of 80 processors
Hello world from processor osd00, rank 38 out of 80 processors
Hello world from processor osd00, rank 0 out of 80 processors
Hello world from processor osd00, rank 14 out of 80 processors
Hello world from processor osd00, rank 16 out of 80 processors
Hello world from processor osd00, rank 28 out of 80 processors
Hello world from processor osd00, rank 21 out of 80 processors
Hello world from processor osd00, rank 25 out of 80 processors
Hello world from processor osd00, rank 7 out of 80 processors
Hello world from processor osd00, rank 6 out of 80 processors
Hello world from processor osd00, rank 18 out of 80 processors
Hello world from processor osd00, rank 23 out of 80 processors

Parallelization of data: ArrayJobs

When you have many files that should be processed with the same application, you can use Slurm job arrays to parallelize the processing. The script below takes the genome read files in data/ as input and runs the alignment step on all of them simultaneously. The path to each input file (read file) is assigned to one component of an array, and the range of array indices is set with the #SBATCH --array directive shown in the script below. It is also possible to select which components of the array will be processed: to submit a job array with index values of 1, 3, 5 and 7 you could specify sbatch --array=1,3,5,7; to submit a job array with index values between 1 and 7 with a step size of 2 (i.e. 1, 3, 5 and 7), sbatch --array=1-7:2. The maximum number of simultaneously running tasks from the job array may be specified using a "%" separator; for example, --array=0-15%4 limits the number of simultaneously running tasks from this job array to 4. These forms are summarized below.
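The same forms on the command line, using the example script of this section:

$ sbatch --array=1,3,5,7 ArrayJob.sh     # indices 1, 3, 5 and 7 only
$ sbatch --array=1-7:2 ArrayJob.sh       # indices 1 to 7 with step 2 (1, 3, 5, 7)
$ sbatch --array=0-15%4 ArrayJob.sh      # indices 0-15, at most 4 running at once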

You can find more information about Slurm's job arrays in the Slurm documentation.

 1#!/bin/bash
 2#SBATCH --job-name=ArrayJob
 3#SBATCH --output=arrayJob_%A_%a.out
 4#SBATCH --ntasks=1
 5#SBATCH --cpus-per-task=1
 6#SBATCH --time=00:30:00
 7#SBATCH --mem-per-cpu=1G
 8#SBATCH --array=0-20    # one array task per read file; bash array indices start at 0
 9#SBATCH --qos=short
10
11# Load the required software (bwa)
12module load biotools
13
14# List all reads
15FILES=(data/*)
16
17INPUTFILE=${FILES[$SLURM_ARRAY_TASK_ID]}
18OUTPUTFILE=$(basename ${FILES[$SLURM_ARRAY_TASK_ID]} .fq)
19
20# Index the reference genome (ref/chr8.fa). The output files will be renamed with the prefix: chr8_ref
21srun bwa index ref/chr8.fa -p ref/chr8_ref
22
23# Align the read file assigned to this array task to the indexed reference file. We are using a single cpu (parameter: -t 1)
24srun bwa aln -I -t 1 ref/chr8_ref ${INPUTFILE}  > out/example_ali_${OUTPUTFILE}.sai
25
26exit 0

Commented lines:

  • 7. --mem-per-cpu=1G: you are requesting the memory per CPU instead of the total memory (--mem) per job. Every job in the array will request ntasks * cpus-per-task * mem-per-cpu = 1 * 1 * 1 GB = 1 GB.

  • 21. srun bwa index: the indexing of the reference genome is a preliminary step common to all the alignment processes that follow.

  • 24. srun bwa aln: we need to align the different read files. Slurm creates one array task per index, and each task executes a different alignment depending on the input file assigned by Slurm. Note that with Slurm arrays we can use $SLURM_ARRAY_TASK_ID to reference the array task that is running.

If you submit the script, all the array tasks will run simultaneously (up to the limit of 1000 running jobs per user). The job finishes when all the input files (reads) have been processed, and the output of each alignment is stored in an individual file. Note that with Slurm job arrays every job step is executed as many times as there are components in the array: in the example above the indexing job step is executed once per array task even though it is the same process with the same inputs and outputs, and it will probably fail because its output files are overwritten each time. You should remove the indexing line from this job and submit a previous single job in charge of the indexing, for example with a job dependency as sketched below. You can also read the next section, which explains another way to submit jobs in parallel.
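A minimal sketch of that approach, assuming the srun bwa index line has been moved into a separate script named IndexJob.sh (a hypothetical name): submit the indexing job first, then make the array depend on its successful completion.

index_id=$(sbatch --parsable IndexJob.sh)         # --parsable prints only the job ID
sbatch --dependency=afterok:$index_id ArrayJob.sh # array starts after indexing succeeds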

[USERNAME@master test]$ sbatch ArrayJob.sh
Submitted batch job 7054

[USERNAME@master test]$ squeue_ -u USERNAME
JOBID         QOS            NAME     USER ACCOUNT   TIME    TIME_LEFT START_TIME  NODES CPU MIN_M   NODELIST ST REASON
7054_[11-20]  short      ArrayJob USERNAME  admin    0:00    30:00     N/A         1     1   1G               PD AssocMaxJobsLimit
7054_6        short      ArrayJob USERNAME  admin    1:35    28:25     2022-11-17  1     2   1G      cn03     R  None
7054_7        short      ArrayJob USERNAME  admin    1:35    28:25     2022-11-17  1     2   1G      cn03     R  None
7054_8        short      ArrayJob USERNAME  admin    1:35    28:25     2022-11-17  1     2   1G      cn03     R  None
7054_9        short      ArrayJob USERNAME  admin    1:35    28:25     2022-11-17  1     2   1G      cn03     R  None
7054_10       short      ArrayJob USERNAME  admin    1:35    28:25     2022-11-17  1     2   1G      cn03     R  None

[USERNAME@master test]$ ls arrayJob*
arrayJob_7054_10.out  arrayJob_7054_12.out  arrayJob_7054_14.out  arrayJob_7054_16.out  arrayJob_7054_18.out  arrayJob_7054_1.out   arrayJob_7054_2.out  arrayJob_7054_4.out  arrayJob_7054_6.out  arrayJob_7054_8.out
arrayJob_7054_11.out  arrayJob_7054_13.out  arrayJob_7054_15.out  arrayJob_7054_17.out  arrayJob_7054_19.out  arrayJob_7054_20.out  arrayJob_7054_3.out  arrayJob_7054_5.out  arrayJob_7054_7.out  arrayJob_7054_9.out

If you have a variable number of files to process in each execution, you can define the limits of the array outside the sbatch file. For the above example you can automatically calculate the number of files to process from the command line (the highest index is the file count minus 1, because indices start at 0):

[USERNAME@master test]$ sbatch --array=0-$(($(ls data | wc -l) - 1)) ArrayJob.sh

In this way you can change the number of files in the data directory and omit the #SBATCH --array directive from the sbatch configuration file.

Parallelization of data using a file of commands (arrays version)

Alternatively you can use an input file with a list of samples/datasets (one per line) to process. First create a text file with one execution line per file to be processed. The example below shows the file that contains each alignment command.

 1[USERNAME@master test]$ cat list_of_cmd.txt
 2bwa aln -I -t 1 ref/chr8_ref data/reads_00.fq  > out/example_ali_reads_00.sai
 3bwa aln -I -t 1 ref/chr8_ref data/reads_01.fq  > out/example_ali_reads_01.sai
 4bwa aln -I -t 1 ref/chr8_ref data/reads_02.fq  > out/example_ali_reads_02.sai
 5bwa aln -I -t 1 ref/chr8_ref data/reads_03.fq  > out/example_ali_reads_03.sai
 6bwa aln -I -t 1 ref/chr8_ref data/reads_04.fq  > out/example_ali_reads_04.sai
 7bwa aln -I -t 1 ref/chr8_ref data/reads_05.fq  > out/example_ali_reads_05.sai
 8bwa aln -I -t 1 ref/chr8_ref data/reads_06.fq  > out/example_ali_reads_06.sai
 9bwa aln -I -t 1 ref/chr8_ref data/reads_07.fq  > out/example_ali_reads_07.sai
10bwa aln -I -t 1 ref/chr8_ref data/reads_08.fq  > out/example_ali_reads_08.sai
11bwa aln -I -t 1 ref/chr8_ref data/reads_09.fq  > out/example_ali_reads_09.sai
12bwa aln -I -t 1 ref/chr8_ref data/reads_10.fq  > out/example_ali_reads_10.sai
13bwa aln -I -t 1 ref/chr8_ref data/reads_11.fq  > out/example_ali_reads_11.sai
14bwa aln -I -t 1 ref/chr8_ref data/reads_12.fq  > out/example_ali_reads_12.sai
15bwa aln -I -t 1 ref/chr8_ref data/reads_13.fq  > out/example_ali_reads_13.sai
16bwa aln -I -t 1 ref/chr8_ref data/reads_14.fq  > out/example_ali_reads_14.sai
17bwa aln -I -t 1 ref/chr8_ref data/reads_15.fq  > out/example_ali_reads_15.sai
18bwa aln -I -t 1 ref/chr8_ref data/reads_16.fq  > out/example_ali_reads_16.sai
19bwa aln -I -t 1 ref/chr8_ref data/reads_17.fq  > out/example_ali_reads_17.sai
20bwa aln -I -t 1 ref/chr8_ref data/reads_18.fq  > out/example_ali_reads_18.sai
21bwa aln -I -t 1 ref/chr8_ref data/reads_19.fq  > out/example_ali_reads_19.sai
22bwa aln -I -t 1 ref/chr8_ref data/reads_20.fq  > out/example_ali_reads_20.sai

Then we can modify the Slurm array submission script to read from the file of commands and execute each line in parallel.

 1[USERNAME@master test]$ cat ArrayJob_List.sh
 2#!/bin/bash
 3#SBATCH --job-name=ArrayJob_List
 4#SBATCH --output=arrayJob_List_%A_%a.out
 5#SBATCH --ntasks=1
 6#SBATCH --cpus-per-task=1
 7#SBATCH --time=00:30:00
 8#SBATCH --mem-per-cpu=1G
 9#SBATCH --array=0-20
10#SBATCH --qos=short
11
12# Load the required software (bwa)
13module load biotools
14
15# Put the content of the file of commands into an array
16readarray -t ARRAY_OF_COMMANDS <list_of_cmd.txt
17
18# Index the reference genome (ref/chr8.fa). The output files will be renamed with the prefix: chr8_ref
19srun bwa index ref/chr8.fa -p ref/chr8_ref
20
21# Execute the alignment command assigned to this array task. Each command uses a single cpu (parameter: -t 1)
22eval srun ${ARRAY_OF_COMMANDS[$SLURM_ARRAY_TASK_ID]}
23
24exit 0
  • 16. readarray -t ARRAY_OF_COMMANDS <list_of_cmd.txt: reads the file of commands and puts each line into a component of the array.

  • 22. eval srun ${ARRAY_OF_COMMANDS[$SLURM_ARRAY_TASK_ID]}: submits one alignment per component of the array. We need to prefix the command with eval so that the redirection symbol '>' is interpreted by bash.

The problem with Slurm arrays in this example is that the indexing is executed once per array task when it should be executed only once, and it will fail because the result files are overwritten each time. For this type of job (where part of the work must be executed only once) it is better to run srun in the background. The next section shows how to run this same example using srun in the background with a list of commands: