Slurm Cluster

In our slurm implementation, a researcher is expected to choose a partition (slurm nomenclature for a group of associated nodes) according to their needs, e.g. one composed of nodes with fast cores, lots of memory, older GPUs, or newer GPUs, and submit their job to that partition. The job submission specifies the computing resources (cores, memory, GPUs) required, and slurm will schedule node(s) to meet those requirements.

A quick note on the 'billing' term. In the context of slurm, it has nothing to do with financial matters. It is simply the term the slurm developers have chosen to apply to the tracking of resource use by an account (i.e. a user). There is no money involved!

Slurm will bill resource usage against the user's account, and will track usage over time. This usage is involved in the calculation of job execution priority in the event of multiple jobs in contention for the same resources. In short, the more resources you use, the lower your scheduling priority - it is a fair-share based system. More valuable resources will be billed at higher rates.
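
Depending on how accounting is configured (this is a general slurm capability, not a promise about our specific setup), you may be able to inspect your own accumulated usage and fair-share standing with the sshare command, and the priority factors of your pending jobs with sprio, e.g.:

> sshare -u $USER
> sprio -u $USER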

Our current billing scheme, subject to tweaking as the needs and demands of the department change, is as follows (a worked example appears after the list).

  • CPU is the base billing element, as cpu-seconds. They are calculated on a per-core basis, so one core for 60 seconds is billed as 60. Two cores for 30 seconds would also be billed as 60.
  • Memory is billed at a rate of 1 cpu-second per 0.25GB of RAM held for one second.
  • GPU time is billed depending on the type. Smaller GPUs from the 'smallgpunodes' partition are billed at 4 cpu-seconds per second of use. Larger GPUs from the 'biggpunodes' partition are billed at 16 cpu-seconds per second of use.
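
For example (illustrative numbers only): a job that holds 4 cores, 8GB of RAM and one GPU from the 'smallgpunodes' partition for 100 seconds would be billed roughly 4 x 100 = 400 cpu-seconds for the cores, (8 / 0.25) x 100 = 3200 cpu-seconds for the memory, and 4 x 100 = 400 cpu-seconds for the GPU, for a total of about 4000 cpu-seconds.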

Remember that you are billed for the time you have a resource claimed, not only for the time your job is actually using the resource. For example, if you requested an interactive shell, you are being billed for all the resources that you have requested for the entire duration of that session.

Your accumulated resource usage is one of the primary factors in determining your priority should other jobs be in contention for available resources, so it is in your best interests to request only what you need, and to free the resources as soon as your computations are complete.
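
If a job turns out to be unnecessary, you can release its resources immediately with scancel rather than letting it run on (the job ID below is just a placeholder):

> scancel 12345        # cancel a specific job
> scancel -u $USER     # cancel all of your own jobs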

At this time, we are not using job preemption. Once your job is running, it should run to completion; however, there is a three-day limit on execution time to encourage fair use.
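
If you know your job will finish well inside that limit, you can request a shorter limit with the --time flag (the script name and duration below are placeholders); an accurate, shorter time request can also make it easier for the scheduler to fit your job in:

> srun --partition cpunodes -c 8 --mem=16G --time=12:00:00 myjob.sh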

Jobs may be submitted to the cluster from any of the general-use compute servers, currently comps0.cs, comps1.cs, comps2.cs and comps3.cs. The available nodes will fluctuate over time. You can examine the various configurations currently on offer by logging into a compute server and using the 'slurm_report' command (/usr/local/bin/slurm_report). This script is a wrapper around a variety of slurm commands and aims to provide more useful output. It has multiple modes, two of which are the -c and -g flags for an overview of CPU node and GPU node availability, e.g.:

user@comps3:~$ slurm_report -c

NODELIST   CPUS(A/I/O/T) FREE_MEM   AVAIL_FEATURES       STATE
cpunode23  0/8/0/8       58564      XeonE5-2680          idle
cpunode1   0/32/0/32     89296      Threadripper_1950X   idle
cpunode10  0/32/0/32     120935     Threadripper_1950X   idle
cpunode11  0/32/0/32     120952     Threadripper_1950X   idle
cpunode12  0/32/0/32     121063     Threadripper_1950X   idle
cpunode13  0/32/0/32     120637     Threadripper_1950X   idle
cpunode14  0/32/0/32     121325     Threadripper_1950X   idle
cpunode15  0/32/0/32     121311     Threadripper_1950X   idle
cpunode16  0/32/0/32     121428     Threadripper_1950X   idle
cpunode17  0/32/0/32     121804     Threadripper_1950X   idle
cpunode20  0/64/0/64     117820     Threadripper_2990WX  idle
cpunode21  0/64/0/64     119265     Threadripper_2990WX  idle
cpunode22  0/64/0/64     117987     Threadripper_2990WX  idle
cpunode24  0/64/0/64     106409     Threadripper_2990WX  idle
cpunode25  0/64/0/64     112443     Threadripper_2990WX  idle
cpunode26  0/64/0/64     118976     Threadripper_2990WX  idle
cpunode3   0/112/0/112   506667     XeonGold-6348        idle
cpunode2   0/112/0/112   508195     AMD_Epyc7453         idle
cpunode4   0/112/0/112   2046965    AMD_Epyc7453         idle

This is a report of CPU nodes, i.e. machines for non-GPU computing, which displays currently available resources.

NODELIST is the name of the node.

CPUS(A/I/O/T) is the state of cores on the node.  Allocated/Idle/Other/Total.  Essentially you want to look at the Idle number to judge which nodes have cores free.

FREE_MEM is the amount of RAM available to jobs, in MB.

AVAIL_FEATURES will indicate CPU type, useful if your experiments are specific to a particular manufacturer, or you need consistency across multiple job runs.

STATE could be either IDLE, which means the node is entirely free, or MIXED, which indicates that other jobs are using some of the resources.  The CPUS 'I' field and FREE_MEM will reflect what is available to new jobs if a node is in the MIXED state.

You will want to confirm which partition a node is in using sinfo, as some nodes may be in multiple partitions or in specialized partitions (such as the bigmemnodes partition):

user@comps3:~$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpunodes*        up 5-00:00:00      6  alloc cpunode[6-9,18-19]
cpunodes*        up 5-00:00:00     18   idle cpunode[1-3,10-17,20-26]
bigmemnodes      up 5-00:00:00      1   idle cpunode4
smallgpunodes    up 5-00:00:00     14   idle gpunode[1-5,7-10,14-15,24,26-27]
biggpunodes      up 5-00:00:00      3    mix gpunode[6,13,22]
biggpunodes      up 5-00:00:00      9  alloc gpunode[11-12,16-17,19-21,23,25]
biggpunodes      up 5-00:00:00      1   idle gpunode18

You can call the slurm_report script with the -h flag to see other options. The default invocation will report on your jobs, including any waiting to execute, and will also indicate how many jobs you have remaining before you hit the limit on simultaneous executions.

Customized information can be found using the squeue and sinfo commands.
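
For example, to list only your own jobs with a custom set of columns (the format string is just one possibility):

> squeue -u $USER -o '%.8i %.12P %.20j %.8T %.10M %R'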

Submitting Jobs

Log into one of the comps[0-3].cs machines. These machines have significant RAM allowances to make running heavyweight IDEs easier, without worrying about resource contention.

There are three resources that you must specify to run on anything larger than the default minimum of one core and 1GB of RAM: the number of cores, the amount of RAM, and, if required, GPU(s).

You must also specify the partition to which you will submit your job.

For example, to submit a job to the 'cpunodes' partition, requesting 8 cores and 16GB of RAM, running a job which reports the resources obtained, you could do as follows:

> srun --partition cpunodes -c 8 --mem=16G --pty slurmtest.sh
running taskset
pid 14205's current affinity mask: ff
pid 14205's new affinity mask: ff
pid 14205's current affinity mask: ff
running memory test 16G
success => 16000MB was allocated
(mask of ff = 8 cores)

To submit a job to the gpunodes partition, requesting 1 core, 2GB of system RAM, and one GPU, and testing for a GPU, you could run:

> srun --partition=gpunodes -c 1 --mem=2G --gres=gpu:1 gputest.sh
GPU 0: GeForce GTX 1050 Ti (UUID: GPU-8480d940-000d-1736-8b67-7f788b49391b)

The same thing, but asking for two GPUs:

> srun --partition=gpunodes -c 1 --mem=2G --gres=gpu:2 gputest.sh
GPU 0: GeForce GTX 1050 Ti (UUID: GPU-bb85f88d-2a7e-2ab7-9960-0abb91ad4b2c)
GPU 1: GeForce GTX 1050 Ti (UUID: GPU-db98900c-97ee-f7bc-583f-f554409f8215)

As a user, it is possible to submit an interactive shell job to a partition, that is to say, a job that allows you to run a shell on the node, perhaps to test code prior to doing a compute run. It is suggested that you make limited use of interactive sessions, as you are billed for the resources you have claimed whether or not you are actively running computations in your interactive session.

Here is a sample interactive job submission, asking for four cores, 2GB of RAM, and two GPUs. We are specifically asking for gpunode1:

> srun --partition gpunodes --nodelist gpunode1 -c 4 --gres=gpu:2 --mem=2G --pty bash --login
srun: error: Unable to allocate resources: Requested node configuration is not available

The job failed as gpunode1 only has one GPU available. We can use sinfo to learn that gpunode14 has two 1050 Ti GPUs available:

> sinfo -N -p gpunodes -o '%10N %G'
NODELIST   GRES
gpunode1   gpu:gtx_1050_ti:1(S:0)
gpunode14  gpu:gtx_1050_ti:2(S:0-1)

So instead, let's submit to gpunode14:

> srun --partition gpunodes --nodelist gpunode14 -c 4 --gres=gpu:2 --mem=2G --pty bash --login
user@gpunode14:~$ nvidia-smi -L
GPU 0: GeForce GTX 1050 Ti (UUID: GPU-bb85f88d-2a7e-2ab7-9960-0abb91ad4b2c)
GPU 1: GeForce GTX 1050 Ti (UUID: GPU-db98900c-97ee-f7bc-583f-f554409f8215)
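
When you are finished with an interactive session, exit the shell (exit or Ctrl-D); the job ends at that point and you stop accruing billed usage.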

You can leverage the 'Features' column if you want to submit to specific types of hardware. For example, if you do not specify a GPU type, slurm will give you the next available GPU from the list:

> srun --partition=biggpunodes -c 1 --mem=2G --gres=gpu:1 gputest.sh
GPU 0: GeForce GTX 1080 (UUID: GPU-1581c530-367f-9a0d-57b0-44904baf34d1)

But if you wish, you could specifically request an RTX 2080:

> srun --partition=biggpunodes -c 1 --mem=2G --gres=gpu:1 --constraint=GeForce_RTX_2080 gputest.sh
GPU 0: GeForce RTX 2080 (UUID: GPU-d1fc01bf-5d55-bcea-63cc-db2298cf9b56)
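
To see which feature tags are defined on the nodes of a partition, and can therefore be used with --constraint, you can ask sinfo for the features field, e.g.:

> sinfo -N -p biggpunodes -o '%12N %f'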

General Slurm Instructions

A comprehensive guide to slurm is beyond the scope of this article, and there are many excellent references on the web for specific cases. We will provide here some general use examples for someone new to slurm.

sinfo will provide details of the slurm cluster configuration.
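
For a per-node long listing showing state, core counts, memory and features in one view, try:

> sinfo -N -l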

There are many ways to execute slurm jobs, but ultimately most of them will leverage srun or sbatch.
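
The examples in this article use srun; for longer, non-interactive runs you would typically put the same resource requests into a batch script and hand it to sbatch, which queues the job and returns immediately. A minimal sketch (the script name, job name and program are placeholders, adjust the resources to suit):

#!/bin/bash
#SBATCH --partition=cpunodes
#SBATCH --job-name=myjob
#SBATCH -c 8
#SBATCH --mem=16G
#SBATCH --time=1-00:00:00
#SBATCH --output=myjob_%j.out   # %j expands to the job ID

./my_compute_job.sh

Save this as, say, myjob.slurm and submit it with:

> sbatch myjob.slurm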

In addition to the sinfo and squeue commands used in the examples above, you should also be cognizant of the sacct command, which is your window into the slurm accounting system. It allows you to view your activity, assess the state of your job completions, and so on. If some of your jobs succeed and some fail, sacct will show you which were which. For example:

> sacct --format JobID%4,Partition%12,ExitCode,NodeList,JobName,State,Elapsed
JobI    Partition ExitCode        NodeList    JobName      State    Elapsed
---- ------------ -------- --------------- ---------- ---------- ----------
 205     gpunodes      0:0        gpunode1 nvidia-smi  COMPLETED   00:00:01
 206     gpunodes      0:0        gpunode1      uname  COMPLETED   00:00:01
 207     gpunodes      0:0        gpunode1 gputest.sh  COMPLETED   00:00:00
 208     gpunodes      0:0       gpunode14 gputest.sh  COMPLETED   00:00:00
 209     cpunodes      0:0     cpuramnode1       bash  COMPLETED   00:10:13
 210     cpunodes    0:125     cpuramnode2       bash OUT_OF_ME+   00:00:23
 211     cpunodes      0:0     cpuramnode2       bash  COMPLETED   00:00:19
 212     gpunodes      1:0   None assigned       bash     FAILED   00:00:00
 213     gpunodes      0:0       gpunode14       bash  COMPLETED   00:20:13
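
You can also restrict sacct to a single job with the -j flag, handy when checking on one specific run (the job ID is a placeholder):

> sacct -j 210 --format JobID,JobName,State,ExitCode,Elapsed,MaxRSS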

If you have suggestions for additions to this page, e.g. useful job submission invocations or reporting/monitoring commands, please don't hesitate to email them to ops@cs.