Array Jobs¶
Slurm array jobs are an efficient way of submitting multiple jobs that perform the same work using the same script(s) but on different data. They provide a simple way of achieving parallelisation on the cluster in cases where there is no interdependency between the individual tasks.
When submitted, a single array job will spawn multiple “sub-jobs”, each denoted by a unique array index but all under the umbrella of the same job ID. For example, if `55620_1` is the identifier, `55620` is the job ID and `1` is the array index.
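As a quick illustration (the submission script name here is hypothetical), submitting a three-task array job and then querying it might look like the following, with sbatch reporting the parent job ID and the sub-jobs identified as `<jobid>_<index>`:

```bash
# Submit an array job with three sub-jobs (my_array_job.sh is illustrative)
sbatch --array=1-3 my_array_job.sh
# sbatch reports the parent job ID, e.g. "Submitted batch job 55620";
# the sub-jobs are then identified as 55620_1, 55620_2 and 55620_3
squeue --job 55620
```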
Please refer to the following information alongside the further details provided in the official Slurm Array Job documentation.
Basic example¶
See the Job Scripts section for detailed explanations of the following script’s core features.
#!/bin/bash
#SBATCH --time=5:0
#SBATCH --qos=bbshort
#SBATCH --array=2-5
set -e
module purge; module load bluebear
echo "${SLURM_JOB_ID}: Job ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_MAX} in the array"
Array size
The maximum number of array tasks that can be specified in a single job is 4,096 (e.g. `--array=1-4096`).
The header `#SBATCH --array=2-5` tells Slurm that this job is an array job and that it should run 4 sub-jobs (with IDs 2, 3, 4 and 5).
Array job environment variables
Slurm provides the following environment variables that can be used to track jobs dynamically within the array.
- `${SLURM_ARRAY_TASK_COUNT}` will be set to the number of tasks in the job array, so in the example this will be 4.
- `${SLURM_ARRAY_TASK_ID}` will be set to the job array index value, so in the example there will be 4 sub-jobs, each with a different value (from 2 to 5).
- `${SLURM_ARRAY_TASK_MIN}` will be set to the lowest job array index value, which in the example will be 2.
- `${SLURM_ARRAY_TASK_MAX}` will be set to the highest job array index value, which in the example will be 5.
- `${SLURM_ARRAY_JOB_ID}` will be set to the job ID provided by running the `sbatch` command.
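As a minimal sketch (separate from the basic example above), the following script simply prints each of these variables from within an array task:

```bash
#!/bin/bash
#SBATCH --time=5:0
#SBATCH --qos=bbshort
#SBATCH --array=2-5

# Print the array-related environment variables for this sub-job
echo "Array job ID: ${SLURM_ARRAY_JOB_ID}"
echo "Task ID:      ${SLURM_ARRAY_TASK_ID} (min ${SLURM_ARRAY_TASK_MIN}, max ${SLURM_ARRAY_TASK_MAX})"
echo "Task count:   ${SLURM_ARRAY_TASK_COUNT}"
```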
Further examples¶
There are numerous approaches to using array jobs within workflows. The tabs below provide examples for the following strategies:
- Name input data files so that they are sequential
- Create a “lookup” file that links filenames with an index
- Dynamically iterate over a directory using e.g. the `ls` or `find` commands
- Port a nested `for` loop from a Bash script
Scenario
Processing a sequence of similarly-named files, e.g. `seq_001.in`, `seq_002.in`, `seq_003.in` etc.
#SBATCH --array=1-10 # Array size must correspond to input sequence
set -e
# For ordering purposes, it is common to zero-fill the numbers in a sequence of files.
# The following printf command pads the $SLURM_ARRAY_TASK_ID numeric string as appropriate
PADDED_TASK_ID=$(printf "%03d" "${SLURM_ARRAY_TASK_ID}") # (1)!
INPUT_FILENAME="seq_${PADDED_TASK_ID}.in"
echo "I am array index ${SLURM_ARRAY_TASK_ID} and am processing file: ${INPUT_FILENAME}"
#some_command --input "${INPUT_FILENAME}"
- Tip: You can also use the `-v` option with `printf` to assign the padded value directly to the named variable, which is neater, although non-standard: `printf -v PADDED_TASK_ID "%03d" "${SLURM_ARRAY_TASK_ID}"`
Scenario
Processing a list of files, provided via a lookup file.
#SBATCH --array=0-9 # Note, array must start with "0" for use with this example
FILENAME_LIST=($(<input_list.txt)) # Creates an indexed array from the contents of input_list.txt
INPUT_FILENAME=${FILENAME_LIST[${SLURM_ARRAY_TASK_ID}]} # Look-up using array index
echo "I am array index ${SLURM_ARRAY_TASK_ID} and am processing file: ${INPUT_FILENAME}"
#some_command --input "${INPUT_FILENAME}"
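The lookup file itself can be created in whatever way suits your data; one possible sketch (the `.in` pattern is illustrative, and this simple word-splitting approach assumes the filenames contain no spaces) is:

```bash
# Build the lookup file once, in the submission directory, before submitting
ls *.in > input_list.txt     # one filename per line
sbatch my_array_job.sh       # script name is illustrative
```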
Scenario
Processing a directory of files of a particular type (or types) but with no common naming style, e.g. `turtle.fa`, `sponge.fa`, `mouse.fa`.
#SBATCH --array=1-10 # Note, array must start with "1" for use with this example
INPUT_FILENAME=$(ls *.fa | sed -n "${SLURM_ARRAY_TASK_ID}p")
echo "I am array index ${SLURM_ARRAY_TASK_ID} and am processing file: ${INPUT_FILENAME}"
#some_command --input "${INPUT_FILENAME}"
Be Defensive!
Tasks will fail if index-lookups don’t match an existing file.
Perform a basic test before progressing to any commands, e.g.:
[[ -f ${INPUT_FILENAME} ]] || { echo >&2 "${INPUT_FILENAME} does not exist. Exiting"; exit 1; }
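The strategy list above also mentions `find`; a roughly equivalent sketch (assuming the `.fa` files sit directly in the submission directory) is shown below, with the explicit `sort` giving a stable ordering across tasks:

```bash
# Build a sorted list of .fa files and pick the Nth entry for this task
INPUT_FILENAME=$(find . -maxdepth 1 -name "*.fa" | sort | sed -n "${SLURM_ARRAY_TASK_ID}p")
```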
Scenario
Port a nested `for` loop from a Bash script, such as the following:
# Outer loop iterates over the first sequence of six values
for x in {0..5}; do
# Inner loop iterates over the second sequence of eight values
for y in {0..7}; do
# Print the combination of items from both sequences
echo "${x} and ${y}"
done
done
Tip
If you are handling non-sequential values, you could perform a lookup:
# Create a lookup using
# an indexed array
x_lookup=(17 19 24 29 37 473)
X_LOOKUP_VAL=${x_lookup[${X_VAL}]}
The size of the job array will equal the number of possible permutations, which in this case is 6 x 8 = 48.
#SBATCH --array=0-47 # i.e. 8 * 6 values
X_VAL=$((${SLURM_ARRAY_TASK_ID} / 8)) # (1)!
Y_VAL=$((${SLURM_ARRAY_TASK_ID} % 8)) # (2)!
echo "${X_VAL} and ${Y_VAL}"
- Counts from 0 to 5, incrementing by one after each complete cycle of eight Y_VAL values. The value could also be further modified, for example adding 1 to get a sequence from 1 to 6: `X_VAL=$((${SLURM_ARRAY_TASK_ID} / 8 + 1))`
- Counts from 0 to 7, repeating six times
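Putting the two ideas together, a sketch that maps each array task onto one combination of non-sequential parameter values might look like the following (the x values are those from the tip above; the y values and `some_command` placeholder are purely illustrative):

```bash
#SBATCH --array=0-47   # 6 x 8 = 48 combinations

# Lookup tables for the non-sequential parameter values
x_lookup=(17 19 24 29 37 473)       # 6 possible x values
y_lookup=(1 2 5 10 20 50 100 200)   # 8 illustrative y values

# Integer division selects the x value; the remainder selects the y value
X_VAL=${x_lookup[$((SLURM_ARRAY_TASK_ID / 8))]}
Y_VAL=${y_lookup[$((SLURM_ARRAY_TASK_ID % 8))]}

echo "I am array index ${SLURM_ARRAY_TASK_ID} and am processing: x=${X_VAL}, y=${Y_VAL}"
#some_command --x "${X_VAL}" --y "${Y_VAL}"
```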
Additional Information¶
`--array` syntax¶
There are several syntactical options available for specifying the array indices.
| Syntax | Explanation |
|---|---|
| `--array=1-10` | Index values from 1 to 10 (inclusive) |
| `--array=1,3,5,7` | Index values of 1, 3, 5 & 7 |
| `--array=1-7:2` | (Also) index values of 1, 3, 5 & 7 (i.e. using a step size of 2) |
| `--array=0-15%4` | Index values from 0 to 15 (inclusive) but with a maximum of 4 concurrently running tasks |
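As with other Slurm options, the array specification can also be supplied on the sbatch command line at submission time rather than in the script header, for example (script name illustrative):

```bash
# Equivalent to placing "#SBATCH --array=0-15%4" in the script header
sbatch --array=0-15%4 my_array_job.sh
```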
Array job output files¶
Be aware that large array jobs will generate a correspondingly large number of output files. The default output file name format is `slurm-%A_%a.out` (where `%A` is the job ID and `%a` is the array index), but this can be modified as follows to direct the output to a subdirectory:
Important!
This subdirectory will need to exist before Slurm can write to it. If the directory doesn’t exist then no output will be written but no error will be shown.
#SBATCH --output="output_dir/slurm-%A_%a.out"
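Since Slurm will not create the directory for you, one simple approach (directory name as above, script name illustrative) is to create it just before submitting:

```bash
# Ensure the output directory exists, then submit the array job
mkdir -p output_dir
sbatch my_array_job.sh
```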