Slurm Cluster

In our slurm implementation a researcher is expected to choose a partition (slurm nomenclature for a group of associated nodes) according to their needs, e.g. one composed of nodes with fast cores, lots of memory, GPUs, etc., and submit a job to it. The job submission will include the computing resources (cores, memory, GPU) required, and slurm will schedule node(s) to meet the job requirements.

A quick note on the 'billing' term. In the context of slurm, it has nothing to do with financial matters. It is simply the term the slurm developers have chosen to apply to the tracking of resource use by an account (i.e. a user). There is no money involved!

Slurm will bill resource usage against the user's account, and will track usage over time. This usage is involved in the calculation of job execution priority in the event of multiple jobs in contention for the same resources. In short, the more resources you use, the lower your scheduling priority - it is a fair-share based system. More valuable resources will be billed at higher rates.

Our current billing scheme, subject to tweaking as the needs and demands of the department change, is as follows.

CPU is the base billing element, measured in cpu-seconds. These are calculated on a per-core basis, so one core for 60 seconds is billed as 60, and two cores for 30 seconds would also be billed as 60.

Memory is billed at a rate of 1 cpu-second per 0.25GB of RAM held for one second.

GPU time is billed at 16 cpu-seconds per second of use.
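
As a worked example of these rates: a job holding 4 cores, 8GB of RAM and one GPU for one hour would be billed (4 + 8/0.25 + 16) x 3600 = 187,200 cpu-seconds.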

Remember that you are billed for the time you have a resource claimed, not only for the time your job is actually using the resource. For example, if you requested an interactive shell, you are being billed for all the resources that you have requested for the entire duration of that session.

Your accumulated resource usage is one of the primary factors in determining your priority should other jobs be in contention for available resources, so it is in your best interests to request only what you need, and to free the resources as soon as your computations are complete.

At this time, we are not using job preemption. Once your job is running, it should run to completion; however, there is a five-day limit on execution time to encourage fair use.

Jobs may be submitted to the cluster from any of the general-use compute servers, currently comps0.cs, comps1.cs, comps2.cs and comps3.cs. The available nodes will fluctuate over time. You can examine the various configurations currently on offer by logging into a compute server and using the 'slurm_report' command (/usr/local/bin/slurm_report). This script is a wrapper around a variety of slurm commands and aims to provide more useful output. It has multiple modes, two of which are the -c and -g flags for an overview of CPU node and GPU node availability, e.g.:

user@comps3:~$ slurm_report -c

This is a report of CPU nodes, i.e. machines for non-GPU computing, which displays currently available resources.  Nodes completely in use or unavailable will not appear.

NODELIST is the name of the node.  PARTITION indicates to which partition the node belongs.

CPUS(A/I/O/T) is the state of cores on the node.  Allocated/Idle/Other/Total.  Essentially you want to look at the Idle number to judge which nodes have cores free.

MEMORY FREE is the currently unused system memory available for your job.

Please specify the resources you need with e.g. -c and --mem when submitting your job.

AVAIL_FEATURES will indicate CPU type, useful if your experiments are specific to a particular manufacturer, or you need consistency across multiple job runs (see the example after the table below).

STATE could be either 'idle' which means the node is entirely free, or 'mixed' which indicates that other jobs are using some of the resources.  The remaining resources available to your job should be accurately reflected in the CPUS Idle and MEMORY FREE fields.

    NODELIST    PARTITION  CPUS(A/I/O/T)  MEMORY FREE       AVAIL_FEATURES      STATE
   cpunode11    cpunodes*      0/32/0/32       128600   Threadripper_1950X       idle
   cpunode12    cpunodes*      0/32/0/32       128600   Threadripper_1950X       idle
   cpunode13    cpunodes*      0/32/0/32       128600   Threadripper_1950X       idle
   cpunode14    cpunodes*      8/24/0/32        63064   Threadripper_1950X      mixed
   cpunode15    cpunodes*      8/24/0/32        63064   Threadripper_1950X      mixed
   cpunode16    cpunodes*      0/32/0/32       128600   Threadripper_1950X       idle
   cpunode17    cpunodes*      0/32/0/32       128600   Threadripper_1950X       idle
   cpunode18    cpunodes*      0/64/0/64       128600  Threadripper_2990WX       idle
    cpunode6    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
    cpunode7    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
    cpunode8    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
    cpunode9    cpunodes*      8/56/0/64        63164  Threadripper_2990WX      mixed
   cpunode19    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
   cpunode20    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
   cpunode21    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
   cpunode22    cpunodes*      8/56/0/64        63164  Threadripper_2990WX      mixed
   cpunode24    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
   cpunode25    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
   cpunode26    cpunodes*      0/64/0/64       128700  Threadripper_2990WX       idle
    cpunode3    cpunodes*    0/112/0/112       515000        XeonGold-6348       idle
    cpunode2    cpunodes*    8/104/0/112       450264         AMD_Epyc7453      mixed
    cpunode4  bigmemnodes    0/112/0/112      2051000         AMD_Epyc7453       idle
 amdgpunode1  bigmemnodes    0/128/0/128       515000        (8x)AMD_MI100       idle
    cpunode5  bigmemnodes    0/512/0/512      1547730         AMD_Epyc9754       idle

Note that 'amdgpunode1' used to be in service as an AMD-specific GPU server, but due to lack of interest has transitioned to a large memory cpu compute node.
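
If you need a particular CPU type, the feature names shown in AVAIL_FEATURES can be requested with --constraint. As an illustrative sketch (adjust the resources, time limit and feature name to your needs), the following would restrict the job to the 2990WX nodes:

> srun --partition cpunodes -c 4 --mem=8G -t 60 --constraint=Threadripper_2990WX hostname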

You will want to confirm which partition a node is in using sinfo, as some nodes may be in multiple or specialized partitions (such as the bigmemnodes partition).

user@comps3:~$ sinfo -p cpunodes,bigmemnodes
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpunodes*      up 5-00:00:00      5    mix cpunode[2,9,14-15,22]
cpunodes*      up 5-00:00:00     16   idle cpunode[3,6-8,11-13,16-21,24-26]
bigmemnodes    up 5-00:00:00      3   idle amdgpunode1,cpunode[4-5]

You can call the slurm_report script with the -h flag to see other options. The default invocation will report on your jobs, including any waiting to execute, and will also indicate how many jobs you have remaining before you hit the limit on simultaneous executions.

Customized information can be found using the squeue and sinfo commands.
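
For instance, here is a sketch of a customized squeue invocation (the output columns chosen are just an example) that lists only your own jobs:

> squeue -u $USER -o "%.10i %.12P %.20j %.8T %.10M %R"

See the squeue and sinfo manual pages for the full set of output format options.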

Submitting Jobs

Log into one of the comps[0-3].cs machines. These machines have significant RAM allowances to make running heavyweight IDEs easier, without worrying about resource contention.

To run on anything greater than the default minimum of one core and 1GB of RAM, you must specify the resources you need: the number of cores, the amount of RAM, and, if required, GPU(s).

You must also specify the partition to which you will submit your job.

For example, to submit a job to the 'cpunodes' partition, requesting 8 cores and 16GB of RAM, running a job which reports the resources obtained, you could do as follows:

> srun --partition cpunodes -c 8 --mem=16G --pty slurmtest.sh
running taskset
pid 14205's current affinity mask: ff
pid 14205's new affinity mask: ff
pid 14205's current affinity mask: ff
running memory test 16G
success => 16000MB was allocated

(mask of ff = 8 cores)

To submit a job to the gpunodes partition, requesting 1 core, 2GB of system RAM, and one GPU (slurm will choose the first free one in the partition), with a time limit of 60 minutes, testing for a GPU, you could run:

> srun --partition=gpunodes -c 1 --mem=2G --gres=gpu:1 -t 60 nvidia-smi -L
srun: job 15418 queued and waiting for resources
srun: job 15418 has been allocated resources
GPU 0: NVIDIA RTX A4000 (UUID: GPU-a509d4d4-83f4-bbe7-32fb-07d4a7a65dd1)

To see which GPUs are available in which nodes, you can use sinfo:

> sinfo -p gpunodes -o "%20N  %10m  %25f  %20G "
NODELIST              MEMORY      AVAIL_FEATURES             GRES
gpunode[4,33]         63900+      RTX_4090,24564_MiB         gpu:rtx_4090:1
gpunode13             32000       GTX_1080_Ti,11178_MiB      gpu:gtx_1080_ti:1(S:
gpunode[32,34]        63900       RTX_4090,24564_MiB         gpu:rtx_4090:1(S:0)
gpunode[2-3]          127000      RTX_A6000,49140_MiB        gpu:rtx_a6000:1
gpunode[6,11]         31900       RTX_A4000,16117_MiB        gpu:rtx_a4000:1(S:0)
gpunode[29-30]        31900       RTX_A4500,20470_MiB        gpu:rtx_a4500:1(S:0)
gpunode[1,28]         63900+      RTX_A2000,12282_MiB        gpu:rtx_a2000:1
gpunode[16-17]        31900       RTX_2080,7982_MiB          gpu:rtx_2080:1(S:0)
gpunode[18-23,25]     31900       RTX_2070,7982_MiB          gpu:rtx_2070:1(S:0)

Then you might want to queue your job for a particular model of GPU, let's say an A2000:

> srun --partition=gpunodes -c 1 --mem=2G --gres=gpu:rtx_a2000:1 nvidia-smi -L
srun: job 15419 queued and waiting for resources
srun: job 15419 has been allocated resources
GPU 0: NVIDIA RTX A2000 12GB (UUID: GPU-5ea88708-76c1-cc7c-01a7-c4900e8cc8b8)

You shouldn't copy these example srun command lines and use them directly. Before using a srun command line, you need to adjust the number of CPUs and the amount of RAM your job will allocate and use, and pick a time limit. If you simply use these example srun commands, your jobs will be terminated after sixty minutes and will not have much RAM. If you don't ask for enough RAM, your job will be terminated when it hits the memory limit.
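
As a sketch of what an adjusted command might look like (the script name, resource figures and time limit are placeholders, not recommendations):

> srun --partition=cpunodes -c 16 --mem=32G -t 2-0 ./my_analysis.sh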

As a user, it is possible to submit an interactive shell job to a partition, that is to say, a job that allows you to run a shell on the node, perhaps to test code prior to doing a compute run. It is suggested that you make limited use of interactive sessions, as you are billed for the resources you have claimed whether or not you are actively running compute jobs in your interactive session.

Here is a sample interactive job submission:

> srun --partition cpunodes -c 4 --mem=8G -t 60 --pty bash --login
srun: job 15420 queued and waiting for resources
srun: job 15420 has been allocated resources
user@cpunode2:~$
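
When you are finished, exit the shell to end the job and release the allocated resources:

user@cpunode2:~$ exit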

Important notes on srun usage

The default resource limits for SLURM jobs in our cluster are 4 GB of RAM per CPU you request and a time limit of 12 hours. These are probably not adequate for your jobs, whether CPU jobs or (especially) GPU jobs, where needing more than 4 GB of RAM is normal. If your job will potentially run for more than twelve hours, you need to tell SLURM that with the -t argument, such as '-t 3-0' for three days. See the srun manual page for the time options. You can't request a time limit of more than five days, the cluster maximum; if you try, your job will never be scheduled.

The amount of memory your job will get is set with the --mem argument, as in the examples above. To see the amount of memory each node has, you can do:

> sinfo -p gpunodes -O PartitionName,NodeHost,Memory,CPUs                             
PARTITION           HOSTNAMES           MEMORY              CPUS
gpunodes            gpunode4            128600              32
gpunodes            gpunode13           32000               8
gpunodes            gpunode32           63900               16
[...]

For GPU video RAM, use the 'features' and 'gres' output formats as shown above.

For GPU jobs, all of our GPU servers have only a single GPU. If you're allocating the GPU, you might as well also allocate most of the RAM and all of the CPUs.
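
For instance, for one of the smaller GPU nodes (the A4500 nodes have 4 cores and roughly 31GB of RAM, per the GPU node report in the GPU Specifics section below), a whole-node-style request might look like this sketch (the memory figure is illustrative; leave a little headroom below the node total):

> srun --partition=gpunodes -c 4 --mem=28G --gres=gpu:rtx_a4500:1 -t 1-0 --pty bash --login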

General Slurm Instructions

A comprehensive guide to slurm is beyond the scope of this article, and there are many excellent references on the web for specific cases.

sinfo will provide details of the slurm cluster configuration.

There are many ways to execute slurm jobs, but ultimately most of them will leverage srun or sbatch.
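
For non-interactive work, sbatch is usually more convenient than srun: the resource requests go into a batch script, and slurm writes the job's output to a file. A minimal sketch (the script body and file names are hypothetical):

#!/bin/bash
#SBATCH --partition=cpunodes
#SBATCH -c 4
#SBATCH --mem=8G
#SBATCH -t 1-0
#SBATCH -o result-%j.out

./my_experiment.sh

Save this as e.g. myjob.sh, submit it with 'sbatch myjob.sh', and monitor it with squeue.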

In addition to the sinfo and squeue commands used in the examples above, you should also be aware of the sacct command, which is your window into the slurm accounting system. It allows you to review your activity, check the state of your job completions, and so on; if some of your jobs succeed and some fail, sacct will tell you which were which. Note that you will have to log into the cluster head node itself in order to run sacct. For example:

user@cluster:~$ sacct --format JobID%4,Partition%12,ExitCode,NodeList,JobName,State,Elapsed
JobI    Partition ExitCode        NodeList    JobName      State    Elapsed
---- ------------ -------- --------------- ---------- ---------- ----------
154+     gpunodes      2:0       gpunode11 gputest.sh     FAILED   00:00:00
154+                   0:0       gpunode11     extern  COMPLETED   00:00:00
154+                   2:0       gpunode11 gputest.sh     FAILED   00:00:00
154+     gpunodes      0:0       gpunode11 nvidia-smi  COMPLETED   00:00:01
154+                   0:0       gpunode11     extern  COMPLETED   00:00:01
154+                   0:0       gpunode11 nvidia-smi  COMPLETED   00:00:01
154+     gpunodes      0:0       gpunode11 nvidia-smi  COMPLETED   00:00:00
...

If you have suggestions for additions to this page, e.g. useful job submission invocations or reporting/monitoring commands, please don't hesitate to email them to ops@cs.

Scratch Space

For users of slurm, we provide temporary network-based scratch space for storing the output of jobs or for datasets.

This is network-accessed space shared across all slurm nodes.

This scratch space can be referenced by looking at the contents of "/scratch":

ls -1 /scratch/ | egrep expires

This will display one or two usable scratch directories. Their names are in the format "expires-DATE", where DATE is a specific expiry date. For example:

expires-2024-Jun-26
expires-2024-Jul-05

Each directory is created to last for 15 days before it expires, and there will always be a directory in /scratch with an expiry date more than five days in the future.

Once a directory has expired, it will no longer be accessible. Therefore please use it only for storing temporary data, and do not run jobs against it that would run past its expiry date.
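
For example, you might direct a job's output into one of the current scratch directories. A sketch (the directory name is taken from the example listing above, so substitute a current one, and my_experiment.sh stands in for your own program):

> srun --partition cpunodes -c 4 --mem=8G -t 60 -o /scratch/expires-2024-Jul-05/result-%j.txt ./my_experiment.sh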

GPU Specifics

On slurm gpunodes, CUDA can be found in /usr/local/cuda*. We are transitioning from using ubuntu system packages (which are installed to default paths) to using versions installed from nvidia repositories, because the default ubuntu ones are not updated frequently enough.

This means that on non-upgraded nodes, you can find nvcc and other related binaries and libraries in your default path, and on upgraded nodes (which will eventually be all of them) everything cuda-related will be in /usr/local/cuda/[various]/. Python-specific packages such as pytorch or tensorflow should be installed by users via pip.

Please contact ops@cs should you need additional versions of CUDA.

We suggest using Python Virtual Environments to work with GPU tools such as pytorch or tensorflow.
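
A minimal sketch of that workflow (the directory name and package are illustrative; the PATH line is only needed if you want nvcc, and assumes an upgraded node with CUDA under /usr/local/cuda):

python3 -m venv ~/venvs/torch
source ~/venvs/torch/bin/activate
pip install torch
export PATH=/usr/local/cuda/bin:$PATH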

Recall the slurm_report command, which has a -g flag for GPU info:

user@comps2:~$ slurm_report -g

This is a report of GPU nodes, i.e. machines for GPU computing, which displays currently available resources.

NODELIST is the name of the node.

PARTITION indicates to which gpunode partition the node belongs.

CPUS(A/I/O/T) is the state of cores on the node.  Allocated/Idle/Other/Total.  Essentially you want to look at the Idle number to judge which nodes have cores free.

Please note that the default allocation for cores and memory is 1 core and 1GB of ram when a GPU is required.  This is a -very- conservative use of resources.  Please specify with e.g. -c and --mem the resources you need when submitting your job.

AVAIL_FEATURES will indicate GPU type, which you might specify with e.g. --constraint=RTX_2070 (see the example after the table below).

GPUS and FREE show count of total and free GPUs in the node.

STATE could be either 'idle' which means the node is entirely free, or 'mixed' which indicates that other jobs are using some of the resources.  The remaining resources available to your job should be accurately reflected in the CPUS Idle, MEMORY FREE and GPUS FREE fields.

    NODELIST       PARTITION  CPUS(A/I/O/T)  MEMORY FREE         AVAIL_FEATURES  GPUS  FREE  STATE
    gpunode4        gpunodes      0/32/0/32       128600     RTX_4090,24564_MiB     1     1   idle
    gpunode5        gpunodes      0/32/0/32       128600     RTX_4090,24564_MiB     1     1   idle
    gpunode7        gpunodes      8/24/0/32       112216     RTX_4090,24564_MiB     1     0  mixed
   gpunode32        gpunodes       8/8/0/16        47516     RTX_4090,24564_MiB     1     0  mixed
   gpunode33        gpunodes      8/24/0/32       112216     RTX_4090,24564_MiB     1     0  mixed
   gpunode34        gpunodes      2/30/0/32         5720     RTX_4090,24564_MiB     1     0  mixed
    gpunode1        gpunodes     16/16/0/32       112216    RTX_A2000,12282_MiB     1     0  mixed
   gpunode28        gpunodes      2/30/0/32        36440    RTX_A2000,12282_MiB     1     0  mixed
    gpunode6        gpunodes        0/4/0/4        31900    RTX_A4000,16117_MiB     1     1   idle
   gpunode11        gpunodes        0/4/0/4        31900    RTX_A4000,16117_MiB     1     1   idle
   gpunode15        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode16        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode17        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode18        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode19        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode20        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode21        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode22        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode23        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode24        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode25        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode26        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode27        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode29        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
   gpunode30        gpunodes        0/4/0/4        31900    RTX_A4500,20470_MiB     1     1   idle
    gpunode2        gpunodes      2/22/0/24         4120    RTX_A6000,49140_MiB     1     0  mixed
    gpunode3        gpunodes      16/8/0/24       110616    RTX_A6000,49140_MiB     1     0  mixed
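
For example, to request any node with an RTX 4090 by feature rather than by gres type (adjust the core count, memory and time limit to your needs):

> srun --partition=gpunodes -c 2 --mem=8G --gres=gpu:1 --constraint=RTX_4090 -t 60 nvidia-smi -L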

sinfo -p gpunodes will give a quick summary to locate idle machines:

user@comps2:~$ sinfo -p gpunodes
PARTITION   AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpunodes     up 5-00:00:00      4    mix gpunode[4,13,32,34]
gpunodes     up 5-00:00:00      6  alloc gpunode[2-3,6,29-30,33]
gpunodes     up 5-00:00:00     12   idle gpunode[1,11,16-23,25,28]