Array Jobs¶
Slurm array jobs are an efficient way of submitting multiple jobs that perform the same work using the same script(s) but on different data. They provide a simple way of achieving parallelisation on the cluster in cases where there is no interdependency between the individual tasks.
When submitted, a single array job will spawn multiple “sub-jobs”, each denoted by a unique array index but all under the umbrella of the same job ID. For example, if `55620_1` is the identifier, `55620` is the job ID and `1` is the array index.
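As a quick illustration (the submission script name here is hypothetical), submitting a three-task array job and then querying it might look like the following, with sbatch reporting the parent job ID and the sub-jobs identified as `<jobid>_<index>`:

```bash
# Submit an array job with three sub-jobs (my_array_job.sh is illustrative)
sbatch --array=1-3 my_array_job.sh
# sbatch reports the parent job ID, e.g. "Submitted batch job 55620";
# the sub-jobs are then identified as 55620_1, 55620_2 and 55620_3
squeue --job 55620
```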
Please refer to the following information alongside the further details provided in the official Slurm Array Job documentation.
Basic example¶
See the Job Scripts section for detailed explanations of the following script’s core features.
#!/bin/bash
#SBATCH --time=5:0
#SBATCH --qos=bbshort
#SBATCH --array=2-5
set -e
module purge; module load bluebear
echo "${SLURM_JOB_ID}: Job ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_MAX} in the array"
Array size
The maximum number of array tasks that can be specified in a single job is 4,096 (e.g. `--array=1-4096`).
The header `#SBATCH --array=2-5` tells Slurm that this job is an array job and that it should run 4 sub-jobs (with IDs 2, 3, 4 and 5).
Array job environment variables
Slurm provides the following environment variables that can be used to track jobs dynamically within the array.
- `${SLURM_ARRAY_TASK_COUNT}` will be set to the number of tasks in the job array, so in the example this will be 4.
- `${SLURM_ARRAY_TASK_ID}` will be set to the job array index value, so in the example there will be 4 sub-jobs, each with a different value (from 2 to 5).
- `${SLURM_ARRAY_TASK_MIN}` will be set to the lowest job array index value, which in the example will be 2.
- `${SLURM_ARRAY_TASK_MAX}` will be set to the highest job array index value, which in the example will be 5.
- `${SLURM_ARRAY_JOB_ID}` will be set to the job ID provided by running the `sbatch` command.
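As a minimal sketch (separate from the basic example above), the following script simply prints each of these variables from within an array task:

```bash
#!/bin/bash
#SBATCH --time=5:0
#SBATCH --qos=bbshort
#SBATCH --array=2-5

# Print the array-related environment variables for this sub-job
echo "Array job ID: ${SLURM_ARRAY_JOB_ID}"
echo "Task ID:      ${SLURM_ARRAY_TASK_ID} (min ${SLURM_ARRAY_TASK_MIN}, max ${SLURM_ARRAY_TASK_MAX})"
echo "Task count:   ${SLURM_ARRAY_TASK_COUNT}"
```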
Further examples¶
There are numerous approaches to using array jobs within workflows. The tabs below provide examples for the following strategies:
- Name input data files so that they are sequential
- Create a “lookup” file that links filenames with an index
- Dynamically iterate over a directory using e.g. the `ls` or `find` commands
- Port a nested `for` loop from a Bash script
Scenario
Processing a sequence of similarly-named files, e.g. `seq_001.in`, `seq_002.in`, `seq_003.in` etc.
#SBATCH --array=1-10 # Array size must correspond to input sequence
set -e
# For ordering purposes, it is common to zero-fill the numbers in a sequence of files.
# The following printf command pads the $SLURM_ARRAY_TASK_ID numeric string as appropriate
PADDED_TASK_ID=$(printf "%03d" "${SLURM_ARRAY_TASK_ID}") # (1)!
INPUT_FILENAME="seq_${PADDED_TASK_ID}.in"
echo "I am array index ${SLURM_ARRAY_TASK_ID} and am processing file: ${INPUT_FILENAME}"
#some_command --input "${INPUT_FILENAME}"
- Tip: You can also use the `-v` option with `printf` to assign the padded value directly to the named variable, which is neater, although non-standard: `printf -v PADDED_TASK_ID "%03d" "${SLURM_ARRAY_TASK_ID}"`
Scenario
Processing a list of files, provided via a lookup file.
#SBATCH --array=0-9 # Note, array must start with "0" for use with this example
FILENAME_LIST=($(<input_list.txt)) # Creates an indexed array from the contents of input_list.txt
INPUT_FILENAME=${FILENAME_LIST[${SLURM_ARRAY_TASK_ID}]} # Look-up using array index
echo "I am array index ${SLURM_ARRAY_TASK_ID} and am processing file: ${INPUT_FILENAME}"
#some_command --input "${INPUT_FILENAME}"
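The lookup file itself can be created in whatever way suits your data; one possible sketch (the `.in` pattern is illustrative, and this simple word-splitting approach assumes the filenames contain no spaces) is:

```bash
# Build the lookup file once, in the submission directory, before submitting
ls *.in > input_list.txt     # one filename per line
sbatch my_array_job.sh       # script name is illustrative
```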
Scenario
Processing a directory of files of a particular type (or types) but with no common naming style, e.g. `turtle.fa`, `sponge.fa`, `mouse.fa`.
#SBATCH --array=1-10 # Note, array must start with "1" for use with this example
INPUT_FILENAME=$(ls *.fa | sed -n "${SLURM_ARRAY_TASK_ID}p")
echo "I am array index ${SLURM_ARRAY_TASK_ID} and am processing file: ${INPUT_FILENAME}"
#some_command --input "${INPUT_FILENAME}"
Be Defensive!
Tasks will fail if index-lookups don’t match an existing file.
Perform a basic test before progressing to any commands, e.g.:
[[ -f ${INPUT_FILENAME} ]] || { echo >&2 "${INPUT_FILENAME} does not exist. Exiting"; exit 1; }
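The strategy list above also mentions `find`; a roughly equivalent sketch (assuming the `.fa` files sit directly in the submission directory) is shown below, with the explicit `sort` giving a stable ordering across tasks:

```bash
# Build a sorted list of .fa files and pick the Nth entry for this task
INPUT_FILENAME=$(find . -maxdepth 1 -name "*.fa" | sort | sed -n "${SLURM_ARRAY_TASK_ID}p")
```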
Scenario
Port a nested `for` loop from a Bash script, such as the following:
# Outer loop iterates over the first sequence of six values
for x in {0..5}; do
# Inner loop iterates over the second sequence of eight values
for y in {0..7}; do
# Print the combination of items from both sequences
echo "${x} and ${y}"
done
done
Tip
If you are handling non-sequential values, you could perform a lookup:
# Create a lookup using
# an indexed array
x_lookup=(17 19 24 29 37 473)
X_LOOKUP_VAL=${x_lookup[${X_VAL}]}
The size of the job array will equal the number of possible permutations, which in this case is 6 x 8 = 48.
#SBATCH --array=0-47 # i.e. 8 * 6 values
X_VAL=$((${SLURM_ARRAY_TASK_ID} / 8)) # (1)!
Y_VAL=$((${SLURM_ARRAY_TASK_ID} % 8)) # (2)!
echo "${X_VAL} and ${Y_VAL}"
- Counts from 0 to 5, incrementing by one after each complete cycle of eight Y_VAL values. The value could also be further modified, for example adding 1 to get a sequence from 1 to 6: `X_VAL=$((${SLURM_ARRAY_TASK_ID} / 8 + 1))`
- Counts from 0 to 7, repeating six times
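Putting the two ideas together, a sketch that maps each array task onto one combination of non-sequential parameter values might look like the following (the x values are those from the tip above; the y values and `some_command` placeholder are purely illustrative):

```bash
#SBATCH --array=0-47   # 6 x 8 = 48 combinations

# Lookup tables for the non-sequential parameter values
x_lookup=(17 19 24 29 37 473)       # 6 possible x values
y_lookup=(1 2 5 10 20 50 100 200)   # 8 illustrative y values

# Integer division selects the x value; the remainder selects the y value
X_VAL=${x_lookup[$((SLURM_ARRAY_TASK_ID / 8))]}
Y_VAL=${y_lookup[$((SLURM_ARRAY_TASK_ID % 8))]}

echo "I am array index ${SLURM_ARRAY_TASK_ID} and am processing: x=${X_VAL}, y=${Y_VAL}"
#some_command --x "${X_VAL}" --y "${Y_VAL}"
```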
Additional Information¶
`--array` syntax¶
There are several syntactical options available for specifying the array indices.
| Syntax | Explanation |
|---|---|
| `--array=1-10` | Index values from 1 to 10 (inclusive) |
| `--array=1,3,5,7` | Index values of 1, 3, 5 & 7 |
| `--array=1-7:2` | (Also) index values of 1, 3, 5 & 7 (i.e. using a step size of 2) |
| `--array=0-15%4` | Index values from 0 to 15 (inclusive) but with a maximum of 4 concurrently running tasks |
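As with other Slurm options, the array specification can also be supplied on the sbatch command line at submission time rather than in the script header, for example (script name illustrative):

```bash
# Equivalent to placing "#SBATCH --array=0-15%4" in the script header
sbatch --array=0-15%4 my_array_job.sh
```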
Array job output files¶
Be aware that large array jobs will generate a correspondingly large number of output files. The default output file name format is `slurm-%A_%a.out` (where `%A` is the job ID and `%a` is the array index), but this can be modified as follows to direct the output to a subdirectory:
Important!
This subdirectory will need to exist before Slurm can write to it. If the directory doesn’t exist then no output will be written but no error will be shown.
#SBATCH --output="output_dir/slurm-%A_%a.out"
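Since Slurm will not create the directory for you, one simple approach (directory name as above, script name illustrative) is to create it just before submitting:

```bash
# Ensure the output directory exists, then submit the array job
mkdir -p output_dir
sbatch my_array_job.sh
```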