Jobs on BlueBEAR

Process summary

A simplified summary of the job submission process on BlueBEAR is as follows:

  1. Compose a job script, which includes:
    • the resources that the job requires
    • the software/application modules to load
    • the commands to run
  2. Submit the job script to the cluster.
    • The job will be queued by the scheduler until there are sufficient resources available for it to run. Queue times for jobs vary depending on how busy BlueBEAR is and the amount of resource that the job has requested.
  3. View the job’s output, either in real time or once the job has completed.
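
As a compact sketch of these three steps (job.sh is the example script shown in the Job Scripts section below, using any text editor to compose it, and <jobid> is a placeholder for the job ID that the scheduler returns):

nano job.sh                # 1. compose the job script
sbatch job.sh              # 2. submit it to the cluster
cat slurm-<jobid>.out      # 3. view the output once the job is running or has finished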

Further information on the mechanics of job submission can be found in the sections below.

Job scheduling with Slurm

Jobs on the cluster are controlled by the Slurm HPC scheduling system. The scheduler is configured to ensure an equitable distribution of resources over time to all users. The key means by which this is achieved are:

  • Jobs are scheduled according to the QOS (Quality of Service) and the resources that are requested. Information on how to request resources for your job is detailed below.
  • Jobs are not necessarily run in the order in which they are submitted.
  • Jobs requiring a large number of cores and/or a long walltime will have to queue until the requested resources become available. The system will run smaller jobs that can fit into the available gaps until all of the resources requested for the larger job become available - this is known as backfill. It is therefore beneficial to specify a realistic walltime for your jobs so that they can be fitted into these gaps.
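
For example, a job expected to finish comfortably within a day and a half could request the following walltime (Slurm accepts a days-hours:minutes:seconds format) rather than the 10-day maximum:

#SBATCH --time=1-12:00:00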

Job Scripts

Example job script

This example script is saved as job.sh:

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=5:0
#SBATCH --qos=bbshort
#SBATCH --mail-type=ALL

set -e

module purge; module load bluebear
module load MATLAB/2020a

matlab -nodisplay -r cvxtest

Script Options Explained

  • #!/bin/bash - run the job using GNU Bourne Again Shell (the same shell as the logon nodes).
  • #SBATCH --ntasks=1 - request one task (which, by default, is allocated one CPU core).
  • #SBATCH --time=5:0 - request a maximum walltime of 5 minutes for the job.
  • #SBATCH --qos=bbshort - submit the job to the bbshort QOS (see Job QOS below).
  • #SBATCH --mail-type=ALL - send email notifications for all job events, e.g. when the job begins and ends (see Emails about jobs below).
  • set -e - makes your script fail on the first error. This is recommended, as early errors can easily be missed.
  • module purge; module load bluebear - resets the environment to ensure that the script hasn’t inherited anything from where it was submitted. This line is required and Slurm will reject the script if it isn’t present – it must be included before any other module load statements.
  • module load MATLAB/2020a - loads the MATLAB 2020a module into the environment. This is required to make the matlab command available.
  • matlab -nodisplay -r cvxtest - the command to run the MATLAB example.

Emails about jobs

Slurm job notification emails can only be sent to a University email address. Using the Slurm option to specify an external email address will result in no email being delivered.
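
For example, a minimal sketch of the email options in a job script (the address shown is only a placeholder for your own University address):

#SBATCH --mail-type=ALL
#SBATCH --mail-user=your.name@bham.ac.uk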

See the Job Options and Resources section below for further details.

Note

The above is a simple example; the options and commands can be as complex as necessary. All of the options that can be set are listed in Slurm’s documentation for sbatch. You can also supply any of these options as command-line arguments to sbatch, but it is recommended that you put all of your options in your job script for ease of reproducibility.
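
For example, the options from the example script could instead be supplied on the command line, although keeping them in the script is preferred for reproducibility:

sbatch --ntasks=1 --time=5:0 --qos=bbshort job.sh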

Overview of Job Operations

Submitting a job

The command to submit a job is sbatch, which reads its input from a job script file. The job is submitted to the scheduling system with the requested resources and will run on the first available node(s) able to provide those resources. For example, to submit the set of commands contained in the above example file job.sh, use the command:

sbatch job.sh

The system will return a job number, for example:

$ sbatch job.sh
Submitted batch job 55260
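
If you want to use the job number in a script (for example to name files or to check on the job later), a sketch using sbatch’s --parsable option, which prints just the job ID:

JOBID=$(sbatch --parsable job.sh)
echo "Submitted job ${JOBID}"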

Note

Slurm is aware of your current working directory when submitting the job so there is no need to manually specify it in the script.

Monitoring a job

There are a number of ways to monitor the current status and output of a job (a combined example is shown after this list):

  • squeue

    squeue -j 55260
    

    squeue is Slurm’s command for viewing the status of your jobs. This shows information such as the job’s ID and name, the QOS/partition used (which will tell you the node type), the user that submitted the job, the time elapsed and the number of nodes being used.

  • scontrol

    scontrol show job 55260
    

    scontrol is a powerful interface that provides an advanced amount of detail regarding the status of your job. The show command within scontrol can be used to view details regarding a specific job.

  • slurm.out and slurm.stats files

    When your job is submitted, a slurm.stats file is created and named to include the job ID, e.g. slurm-55260.stats.
    Once your job begins to run, a slurm.out file is also created (e.g. slurm-55260.out), which contains the standard output (stdout) and standard error (stderr) that would have been shown had you run the command(s) directly in a terminal shell.

    Info

    • These two output files are created in the directory from which you submitted the job.
    • slurm-55260.out is a plain text file (i.e. not an executable). To view its contents you can, for example, run: cat slurm-55260.out
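
Putting these together, a sketch of checking on the example job (ID 55260) while it runs:

squeue -j 55260            # is the job still queued or running?
scontrol show job 55260    # full details of the job's requests and state
tail -f slurm-55260.out    # follow the job's output as it is written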

Cancelling a job

To cancel a queued or running job use the scancel command and supply it with the job ID that is to be cancelled. For example, to cancel the previous job:

scancel 55260
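
scancel can also select jobs in other ways; for example, a sketch of cancelling all of your own queued and running jobs at once:

scancel -u "$USER"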

Job Options and Resources

Resource Limits

  • The maximum duration (walltime) on BlueBEAR is 10 days per job, except…
    • … for bbshort where the maximum walltime is 10 minutes per job
  • Each user is limited to:
    • 1344 cores and 6TB of memory (RAM) per shared CPU QOS
    • 4 GPUs when using the bbgpu QOS
    • These limits are summed across all running jobs in the QOS
    • There are no additional limits on the sizes of individual jobs
    • If any limit is exceeded, any future jobs (for that QOS) will remain queued until the usage falls below these limits again

Info

The maximum of 1344 CPU cores and 6 terabytes of RAM is a total across all of the jobs for one person. For example, this could be made up of:

  • A single job requesting 1344 cores across 12 Sapphire Rapids nodes
  • A single job requesting 14 cores and 6 Terabytes of RAM
  • 12 jobs, each requesting 112 cores, spread across 12 Sapphire Rapids nodes

Note

Different limits may be set on QOSes relating to user-owned resources.

Resource Utilisation

Some software uses multiple CPU cores on one node and some can support multiple cores on multiple nodes. Programming for multiple nodes generally requires a different approach and so this is less common. Unfortunately, increasing the number of cores will not always make your job run faster.
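
As an illustration, a sketch of the two styles of request (the core and node counts here are purely illustrative, not a recommendation):

# Multi-threaded software: several cores on a single node
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8

# Software that supports multiple nodes (e.g. via MPI): tasks spread across nodes
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=16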

Job QOS

BlueBEAR uses the QOS (--qos or -q) option of a job to direct it to a particular set of resources. By default, there are two QOS to which you can submit jobs. These are: bbdefault and bbshort. You may also have access to bblargemem or bbgpu.
All shared QOS have a maximum job length (walltime) of 10 days, with the exception of bbshort where it is 10 minutes.
You can specify the QOS to use by adding the following line to your job script:

#SBATCH --qos=bbshort

QOS Details

bbdefault

This is the default QOS and will be used if no --qos is specified in the job script.
It comprises different types of node, as described on the Standard Resources page.

bbshort

This QOS contains all nodes in the cluster and is the fastest way to get your job to run. The maximum walltime is 10 minutes.

bbgpu

This QOS contains a mixture of GPU nodes which are available if your job requires a GPU. Please see the GPU Service page for more details on these nodes.

bblargemem

This QOS, pre-2022, contained a mixture of large memory nodes that were available if your job required a larger amount of memory on one node. When the Intel Ice Lake nodes were added the bblargemem QOS was retired. Please see the Large Memory Service page for more details on requesting more memory for a job.

Note

Some of the memory a node has is for running system processes and will be unavailable to jobs.


Associating Jobs with Projects

Every job has to be associated with a project to ensure the equitable distribution of resources. Project owners and members will have been issued a project code for each registered project, and only usernames authorised by the project owner will be able to run jobs using that project code. You can see what projects you are a member of by running the command:

my_bluebear

If you are registered on more than one project then the project to use should be specified with the --account option followed by the project code. For example, if your project is _project_name_ then add the following line to your job script:

#SBATCH --account=_project_name_

If a job is submitted using an invalid project, either because the project does not exist, the username is not authorised to use that project, or the project does not have access to the requested QOS, then the job will be rejected with the following error:

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

Memory (RAM) Requests

By default, for each core requested (e.g. using the --ntasks option) the job will be allocated 4096MB RAM. Wherever possible we prefer this method as it ensures efficient distribution of jobs across the BlueBEAR cluster.
However, this default can be overridden by specifying one of the following options to request the amount of memory required:

Memory Units

You can specify values in megabytes (M), gigabytes (G) or terabytes (T) with the default unit being M if none is given.

#SBATCH --mem

  • The memory value specified against --mem will be allocated to each node on which a job is running, regardless of cores. This makes it a less suitable option for distributed jobs and it’s therefore commonly combined with the --nodes=1 option.
  • See, for example, how it is used to run a large memory job.

#SBATCH --mem-per-cpu

  • Default value = 4096M.
  • Safer for distributed jobs as memory allocation scales with the core count.
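
For example, a sketch contrasting the two options (the values are illustrative only):

# Per-node request: 1 node, with 200G available to the job on that node
#SBATCH --nodes=1
#SBATCH --mem=200G

# Per-core request: 16 cores with 8G each (128G in total)
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=8G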

Dedicated Resources

Some research groups have dedicated resources in BlueBEAR. Users who can submit jobs to a dedicated QOS can see what jobs are running in it by using the following command, with _name_ replaced by the name of your QOS:

view_qos _name_

Modules (software applications)

Software on BlueBEAR is generally provided through our BEAR Applications modules, so in a job script you must load an application’s module before that application becomes available to use. In the job script example above, this can be seen where the MATLAB module is loaded (module load MATLAB/2020a) before the MATLAB command is run (matlab -nodisplay -r cvxtest).

Warning

Note that if a job script loads multiple modules, these must all be from the same BEAR Application Version.
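
For example, a hypothetical sketch of loading two modules together - the second module name is invented purely to illustrate the pattern, so check the BEAR Applications website for real module names that share the same BEAR Application Version:

module purge; module load bluebear
module load MATLAB/2020a
module load ExampleApp/1.2.3   # hypothetical second module from the same BEAR Application Version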

For a list of available software modules, see the BEAR Applications website.

For further information on the use of modules (and other options for accessing software), please refer to our Software pages.