Slurm Cluster

In our slurm implementation, a researcher is expected to choose a partition (slurm nomenclature for a group of associated nodes) according to their needs, e.g. one made up of nodes with fast cores, lots of memory, or GPUs, and submit a job to it. The job submission includes the computing resources (cores, memory, GPUs) required, and slurm schedules node(s) to meet the job requirements.

A quick note on the 'billing' term. In the context of slurm, it has nothing to do with financial matters. It is simply the term the slurm developers have chosen to apply to the tracking of resource use by an account (i.e. a user). There is no money involved!

Slurm will bill resource usage against the user's account, and will track usage over time. This usage is involved in the calculation of job execution priority in the event of multiple jobs in contention for the same resources. In short, the more resources you use, the lower your scheduling priority - it is a fair-share based system. More valuable resources will be billed at higher rates.

Our current billing scheme, subject to tweaking as the needs and demands of the department change, is as follows.

  • CPU time is the base billing element, measured in cpu-seconds. It is calculated on a per-core basis, so one core for 60 seconds is billed as 60. Two cores for 30 seconds would also be billed as 60.
  • Memory is billed at a rate of 1 cpu-second for 0.25GB of ram held for one second.
  • GPU time is billed at 16 cpu-seconds per second of use.
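
As a rough illustration of how the rates above combine, the following sketch estimates the billing for a job. It assumes the CPU, memory and GPU charges are simply added together, which this page does not state explicitly, and billing_units is a hypothetical helper, not a command installed on the cluster.

```shell
#!/bin/bash
# billing_units CORES MEM_GB GPUS SECONDS
# Hypothetical helper: estimates billed cpu-seconds from the rates listed above.
# Assumption: the CPU, memory and GPU charges are summed.
billing_units() {
    local cores=$1 mem_gb=$2 gpus=$3 seconds=$4
    # Per second held: 1 per core, 4 per GB (1 cpu-second per 0.25GB), 16 per GPU
    echo $(( (cores + 4 * mem_gb + 16 * gpus) * seconds ))
}

billing_units 1 0 0 60      # one core for 60 seconds: prints 60
billing_units 2 0 0 30      # two cores for 30 seconds: prints 60
billing_units 8 16 0 3600   # 8 cores and 16GB held for an hour: prints 259200
```

Under that assumption, the 16GB of memory in the last example accounts for far more of the total than the 8 cores do, which is why requesting only the memory you need matters.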

Remember that you are billed for the time you have a resource claimed, not only for the time your job is actually using the resource. For example, if you requested an interactive shell, you are being billed for all the resources that you have requested for the entire duration of that session.

Your accumulated resource usage is one of the primary factors in determining your priority should other jobs be in contention for available resources, so it is in your best interests to request only what you need, and to free the resources as soon as your computations are complete.

At this time, we are not using job preemption. Once your job is running, it should run to completion. However, there is a five-day limit on execution time to encourage fair use.

Jobs may be submitted to the cluster from any of the general-use compute servers, currently comps0.cs, comps1.cs, comps2.cs and comps3.cs. The available nodes will fluctuate over time. You can examine the various configurations currently on offer by logging into a compute server and using the 'slurm_report' command (/usr/local/bin/slurm_report). This script is a wrapper for a variety of slurm commands, and aims to provide a more useful output. It has multiple modes, two of which are the -c and -g flags for an overview of CPU node and GPU node availability, e.g.:

user@comps3:~$ slurm_report -c

This is a report of CPU nodes, i.e. machines for non-GPU computing, which displays currently available resources.  Nodes completely in use or unavailable will not appear.

NODELIST is the name of the node.  PARTITION indicates to which partition the node belongs.

CPUS(A/I/O/T) is the state of cores on the node.  Allocated/Idle/Other/Total.  Essentially you want to look at the Idle number to judge which nodes have cores free.

MEMORY FREE is the currently unused system memory available for your job.

Please specify with e.g. -c and --mem the resources you need when submitting your job.

AVAIL_FEATURES will indicate CPU type, useful if your experiments are specific to a particular manufacturer, or you need consistency across multiple job runs.

STATE could be either 'idle' which means the node is entirely free, or 'mixed' which indicates that other jobs are using some of the resources.  The remaining resources available to your job should be accurately reflected in the CPUS Idle and MEMORY FREE fields.

    NODELIST    PARTITION  CPUS(A/I/O/T)  MEMORY FREE       AVAIL_FEATURES      STATE
   cpunode11    cpunodes*      0/32/0/32       128600   Threadripper_1950X       idle
   cpunode12    cpunodes*      0/32/0/32       128600   Threadripper_1950X       idle
   cpunode13    cpunodes*      0/32/0/32       128600   Threadripper_1950X       idle
   cpunode14    cpunodes*      8/24/0/32        63064   Threadripper_1950X      mixed
   cpunode15    cpunodes*      8/24/0/32        63064   Threadripper_1950X      mixed
   cpunode16    cpunodes*      0/32/0/32       128600   Threadripper_1950X       idle
   cpunode17    cpunodes*      0/32/0/32       128600   Threadripper_1950X       idle
   cpunode18    cpunodes*      0/64/0/64       128600  Threadripper_2990WX       idle
    cpunode6    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
    cpunode7    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
    cpunode8    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
    cpunode9    cpunodes*      8/56/0/64        63164  Threadripper_2990WX      mixed
   cpunode19    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
   cpunode20    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
   cpunode21    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
   cpunode22    cpunodes*      8/56/0/64        63164  Threadripper_2990WX      mixed
   cpunode24    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
   cpunode25    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
   cpunode26    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
    cpunode3    cpunodes*    0/112/0/112       515000        XeonGold-6348       idle
    cpunode2    cpunodes*    8/104/0/112       450264         AMD_Epyc7453      mixed
    cpunode4  bigmemnodes    0/112/0/112      2051000         AMD_Epyc7453       idle
 amdgpunode1  bigmemnodes    0/128/0/128       515000        (8x)AMD_MI100       idle
    cpunode5  bigmemnodes    0/512/0/512      1547730         AMD_Epyc9754       idle

Note that 'amdgpunode1' used to be in service as an AMD-specific GPU server, but due to lack of interest has transitioned to a large memory cpu compute node.
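
If you want to pick out the Idle figure from a CPUS(A/I/O/T) field in a script, a minimal sketch follows. The helper name idle_cores is hypothetical. sinfo's %C format field reports CPUs in the same allocated/idle/other/total form, so the helper could also be fed from sinfo directly.

```shell
#!/bin/bash
# idle_cores "A/I/O/T"
# Hypothetical helper: prints the Idle count from a CPUS(A/I/O/T) field.
idle_cores() {
    local field=$1 rest
    rest=${field#*/}       # drop the leading "Allocated/" part
    echo "${rest%%/*}"     # keep everything before the next "/"
}

idle_cores 8/24/0/32      # prints 24
idle_cores 0/112/0/112    # prints 112
```

For example, something like sinfo -N -h -p cpunodes -o "%N %C" prints one node name and its A/I/O/T field per line, and the second column can be passed to idle_cores.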

You will want to confirm which partition a node is in using sinfo, as some nodes may be in multiple partitions or specialized partitions (such as the bigmemnodes partition).

user@comps3:~$ sinfo -p cpunodes,bigmemnodes
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpunodes*      up 5-00:00:00      5    mix cpunode[2,9,14-15,22]
cpunodes*      up 5-00:00:00     16   idle cpunode[3,6-8,11-13,16-21,24-26]
bigmemnodes    up 5-00:00:00      3   idle amdgpunode1,cpunode[4-5]

You can call the slurm_report script with the -h flag to see other options. The default invocation will report on your jobs, including any waiting to execute, and will also indicate how many jobs you have remaining before you hit the limit on simultaneous executions.

Customized information can be found using the squeue and sinfo commands.

Submitting Jobs

Log into one of the comps[0-3].cs machines. These machines have significant RAM allowances to make running heavyweight IDEs easier, without worrying about resource contention.

There are three resources that you must specify to run on anything greater than the default minimum of one core and 1GB of RAM. You must specify the number of cores, amount of RAM, and if required, GPU(s).

You must also specify the partition to which you will submit your job.

For example, to submit a job to the 'cpunodes' partition, requesting 8 cores and 16GB of ram, running a job which reports the resources obtained, you could do as follows:

> srun --partition cpunodes -c 8 --mem=16G --pty slurmtest.sh
running taskset
pid 14205's current affinity mask: ff
pid 14205's new affinity mask: ff
pid 14205's current affinity mask: ff
running memory test 16G
success => 16000MB was allocated
(mask of ff = 8 cores)
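
The affinity mask is a hexadecimal bitmap with one bit per core the job may run on, so the number of set bits is the number of cores granted. A small sketch for checking this follows; mask_to_cores is a hypothetical helper name, not part of slurmtest.sh.

```shell
#!/bin/bash
# mask_to_cores HEXMASK
# Hypothetical helper: counts the set bits in a taskset-style hex affinity mask.
mask_to_cores() {
    local mask=${1#0x} count=0 digit value
    # Work one hex digit at a time so long masks from large nodes do not overflow
    while [ -n "$mask" ]; do
        digit=${mask%"${mask#?}"}    # first character of what is left
        mask=${mask#?}               # remainder
        value=$(( 0x$digit ))
        while [ "$value" -gt 0 ]; do
            count=$(( count + (value & 1) ))
            value=$(( value >> 1 ))
        done
    done
    echo "$count"
}

mask_to_cores ff     # prints 8
mask_to_cores 0xf    # prints 4
```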

To submit a job to the gpunodes partition, requesting 1 core, 2GB of system ram, and one GPU (slurm will choose the first free in the partition), with a time limit of 60 minutes, testing for a GPU, you could run:

> srun --partition=gpunodes -c 1 --mem=2G --gres=gpu:1 -t 60 nvidia-smi -L
srun: job 15418 queued and waiting for resources
srun: job 15418 has been allocated resources
GPU 0: NVIDIA RTX A4000 (UUID: GPU-a509d4d4-83f4-bbe7-32fb-07d4a7a65dd1)

To see which GPUs are available in which nodes, you can use sinfo:

> sinfo -p gpunodes -o "%20N  %10m  %25f  %20G "
NODELIST              MEMORY      AVAIL_FEATURES             GRES
gpunode[4,33]         63900+      RTX_4090,24564_MiB         gpu:rtx_4090:1
gpunode13             32000       GTX_1080_Ti,11178_MiB      gpu:gtx_1080_ti:1(S:
gpunode[32,34]        63900       RTX_4090,24564_MiB         gpu:rtx_4090:1(S:0)
gpunode[2-3]          127000      RTX_A6000,49140_MiB        gpu:rtx_a6000:1
gpunode[6,11]         31900       RTX_A4000,16117_MiB        gpu:rtx_a4000:1(S:0)
gpunode[29-30]        31900       RTX_A4500,20470_MiB        gpu:rtx_a4500:1(S:0)
gpunode[1,28]         63900+      RTX_A2000,12282_MiB        gpu:rtx_a2000:1
gpunode[16-17]        31900       RTX_2080,7982_MiB          gpu:rtx_2080:1(S:0)
gpunode[18-23,25]     31900       RTX_2070,7982_MiB          gpu:rtx_2070:1(S:0)

Then you might want to queue your job for a particular model of GPU, let's say an A2000:

> srun --partition=gpunodes -c 1 --mem=2G --gres=gpu:rtx_a2000:1 nvidia-smi -L
srun: job 15419 queued and waiting for resources
srun: job 15419 has been allocated resources
GPU 0: NVIDIA RTX A2000 12GB (UUID: GPU-5ea88708-76c1-cc7c-01a7-c4900e8cc8b8)
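
If you care about video memory rather than a specific model, the sinfo listing can be filtered on the NNNN_MiB entry in AVAIL_FEATURES. The sketch below is one way to do that; gpus_with_vram is a hypothetical helper, and it assumes header-less input in the same four-column layout as the sinfo example (NODELIST, MEMORY, AVAIL_FEATURES, GRES). The sample lines are copied from that listing.

```shell
#!/bin/bash
# gpus_with_vram MIN_MIB
# Hypothetical filter: reads header-less sinfo lines on stdin and prints the
# node list and GRES of every entry whose NNNN_MiB feature is at least MIN_MIB.
gpus_with_vram() {
    awk -v min="$1" '{
        n = split($3, feat, ",")             # AVAIL_FEATURES is comma-separated
        for (i = 1; i <= n; i++)
            if (feat[i] ~ /^[0-9]+_MiB$/) {
                sub(/_MiB$/, "", feat[i])    # keep only the number
                if (feat[i] + 0 >= min) print $1, $4
            }
    }'
}

printf '%s\n' \
    'gpunode[2-3]    127000  RTX_A6000,49140_MiB  gpu:rtx_a6000:1' \
    'gpunode[16-17]  31900   RTX_2080,7982_MiB    gpu:rtx_2080:1(S:0)' |
    gpus_with_vram 16000
# prints: gpunode[2-3] gpu:rtx_a6000:1
```

On the cluster, the same filter could be fed from sinfo with the -h (no header) flag added to the command shown earlier, and the GRES type it prints (here rtx_a6000) is what goes into --gres.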

You shouldn't copy these example srun command lines and use them directly. Before using a srun command line, you need to adjust the number of CPUs and the amount of RAM your job will allocate and use, and pick a time limit. If you simply use these example srun commands, your jobs will be terminated after sixty minutes and will not have much RAM. If you don't ask for enough RAM, your job will be terminated when it hits the memory limit.

As a user, it is possible to submit an interactive shell job to a partition, that is to say, a job that allows you to run a shell on the node, perhaps to test code prior to doing a compute run. It is suggested that you make limited use of interactive sessions, as you are billed for the resources you have claimed whether or not you are actively running compute jobs in your interactive session.

Here is a sample interactive job submission:

> srun --partition cpunodes -c 4 --mem=8G -t 60 --pty bash --login
srun: job 15420 queued and waiting for resources
srun: job 15420 has been allocated resources
user@cpunode2:~$

Important notes on srun usage

The default resource limits for SLURM jobs in our cluster are 4 GB of RAM per CPU you request and a time limit of 12 hours. These are probably not adequate for your jobs, either CPU jobs or especially GPU jobs, where needing more than four GB of RAM is normal. If your job will potentially run for more than twelve hours, you need to tell SLURM that with the -t argument, such as '-t 3-0' for three days. See the srun manual page for the time options. You can't request a time limit of more than five days, the cluster maximum; if you try, your job will never be scheduled.
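
Because an over-limit request is never scheduled rather than rejected outright, it can help to check the value before submitting. The sketch below converts a number of hours into the days-hours:minutes:seconds form accepted by -t and refuses anything over five days. slurm_time is a hypothetical helper name.

```shell
#!/bin/bash
# slurm_time HOURS
# Hypothetical helper: prints HOURS as days-hours:minutes:seconds for -t,
# and fails for anything over the five-day cluster maximum.
slurm_time() {
    local hours=$1 max_hours=$(( 5 * 24 ))
    if [ "$hours" -gt "$max_hours" ]; then
        echo "requested ${hours}h exceeds the five-day limit" >&2
        return 1
    fi
    printf '%d-%02d:00:00\n' $(( hours / 24 )) $(( hours % 24 ))
}

slurm_time 72    # prints 3-00:00:00
slurm_time 30    # prints 1-06:00:00
```

It could then be used inline, for example -t "$(slurm_time 72)" on an srun command line.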

The amount of memory your job will get is set with the --mem argument, as in the examples above. To see the amount of memory each node has, you can do:

> sinfo -p gpunodes -O PartitionName,NodeHost,Memory,CPUs
PARTITION           HOSTNAMES           MEMORY              CPUS
gpunodes            gpunode4            128600              32
gpunodes            gpunode13           32000               8
gpunodes            gpunode32           63900               16
[...]

For GPU video ram, use the 'features' and 'gres' format as documented above.

All of our GPU servers have only a single GPU. If your job allocates that GPU, you might as well also allocate most of the RAM and all of the CPUs.
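
One way to turn that advice into an srun request is sketched below, using the MEMORY and CPUS figures reported by sinfo for the node you are targeting. gpu_node_args is a hypothetical helper, and the 90% memory figure is only an assumption standing in for "most of the RAM"; adjust it to suit.

```shell
#!/bin/bash
# gpu_node_args NODE_MEM_MB NODE_CPUS
# Hypothetical helper: prints srun arguments claiming the GPU, every core,
# and 90% of the node's memory (the 90% is an assumed headroom, not a site rule).
gpu_node_args() {
    local mem_mb=$1 cpus=$2
    echo "--partition=gpunodes -c $cpus --mem=$(( mem_mb * 90 / 100 ))M --gres=gpu:1"
}

gpu_node_args 32000 8
# prints: --partition=gpunodes -c 8 --mem=28800M --gres=gpu:1
```

The printed arguments can be pasted into an srun command line along with a time limit and your program.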

General Slurm Instructions

A comprehensive guide to slurm is beyond the scope of this article, and there are many excellent references on the web for specific cases.

sinfo will provide details of the slurm cluster configuration.

There are many ways to execute slurm jobs, but ultimately most of them will leverage srun or sbatch.
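
The examples on this page all use srun. For comparison, here is a minimal sketch of a batch script for sbatch requesting the same resources as the earlier cpunodes example. The job name, output file name and two-hour time limit are placeholders, and report_allocation is a hypothetical function standing in for your real workload.

```shell
#!/bin/bash
#SBATCH --partition=cpunodes
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=0-02:00:00
#SBATCH --job-name=example
#SBATCH --output=example-%j.out

# Placeholder workload: report what slurm granted, using the environment
# variables sbatch sets for the job. Replace this with your own commands.
report_allocation() {
    echo "host=$(uname -n) cores=${SLURM_CPUS_PER_TASK:-unset} mem_mb=${SLURM_MEM_PER_NODE:-unset}"
}

report_allocation
```

Saved as, say, example.sh, it would be submitted with sbatch example.sh. Unlike srun, sbatch returns immediately and the job's output is written to the file named by --output, where %j is replaced by the job ID.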

In addition to the sinfo and squeue commands used in the examples above, you should also be cognizant of the sacct command, which is your window into the slurm accounting system. This will allow you to view your activity, assess the state of your job completions, and so on. If you have some jobs succeed and some fail, you would be able to use sacct to clear up confusion regarding which jobs succeeded and which did not. Note that you will have to log into the cluster head node itself in order to run sacct. For example:

user@cluster:~$ sacct --format JobID%4,Partition%12,ExitCode,NodeList,JobName,State,Elapsed
JobI    Partition ExitCode        NodeList    JobName      State    Elapsed
---- ------------ -------- --------------- ---------- ---------- ----------
154+     gpunodes      2:0       gpunode11 gputest.sh     FAILED   00:00:00
154+                   0:0       gpunode11     extern  COMPLETED   00:00:00
154+                   2:0       gpunode11 gputest.sh     FAILED   00:00:00
154+     gpunodes      0:0       gpunode11 nvidia-smi  COMPLETED   00:00:01
154+                   0:0       gpunode11     extern  COMPLETED   00:00:01
154+                   0:0       gpunode11 nvidia-smi  COMPLETED   00:00:01
154+     gpunodes      0:0       gpunode11 nvidia-smi  COMPLETED   00:00:00
...
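
To pull just the failed jobs out of the accounting data, sacct's pipe-separated output is easier to script against than the padded table above. The sketch below filters "JobID|State" lines; failed_jobs is a hypothetical helper, and the job IDs in the sample input are made up for illustration.

```shell
#!/bin/bash
# failed_jobs
# Hypothetical filter: reads "JobID|State" lines on stdin and prints the
# IDs of the jobs whose state is FAILED.
failed_jobs() {
    awk -F'|' '$2 == "FAILED" { print $1 }'
}

# Made-up sample input in the same shape as sacct's parsable output
printf '%s\n' '15417|FAILED' '15418|COMPLETED' '15419|COMPLETED' | failed_jobs
# prints 15417
```

On the head node, the input could come from sacct -X -n -P --format=JobID,State, where -X limits the listing to whole jobs rather than individual steps, -n drops the header and -P selects pipe-separated output.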

If you have suggestions for additions to this page, e.g. useful job submission invocations or reporting/monitoring commands, please don't hesitate to email them to ops@cs.