Slurm Cluster
In our slurm setup, a researcher is expected to choose a partition (slurm nomenclature for a group of associated nodes) according to their needs, e.g. one made up of nodes with fast cores, lots of memory, or GPUs, and submit a job to it. The job submission includes the computing resources (cores, memory, GPUs) required, and slurm will schedule node(s) to meet the job requirements.
A quick note on the 'billing' term. In the context of slurm, it has nothing to do with financial matters. It is simply the term the slurm developers have chosen to apply to the tracking of resource use by an account (i.e. a user). There is no money involved!
Slurm will bill resource usage against the user's account, and will track usage over time. This usage is involved in the calculation of job execution priority in the event of multiple jobs in contention for the same resources. In short, the more resources you use, the lower your scheduling priority - it is a fair-share based system. More valuable resources will be billed at higher rates.
Our current billing scheme, subject to tweaking as the needs and demands of the department change, is as follows.
CPU time is the base billing element, measured in cpu-seconds. It is calculated on a per-core basis, so one core for 60 seconds is billed as 60 cpu-seconds; two cores for 30 seconds would also be billed as 60.
Memory is billed at a rate of 1 cpu-second per 0.25GB of RAM held for one second.
GPU time is billed at 16 cpu-seconds per second of use.
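As a hypothetical worked example of these rates: a job that held 4 cores, 8GB of RAM, and one GPU for one hour (3600 seconds) would be billed roughly (4 + 8/0.25 + 16) x 3600 = 52 x 3600 = 187200 cpu-seconds, with the memory term (32 cpu-seconds per second) dominating.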
Remember that you are billed for the time you have a resource claimed, not only for the time your job is actually using the resource. For example, if you requested an interactive shell, you are being billed for all the resources that you have requested for the entire duration of that session.
Your accumulated resource usage is one of the primary factors in determining your priority should other jobs be in contention for available resources, so it is in your best interests to request only what you need, and to free the resources as soon as your computations are complete.
At this time, we are not using job preemption. Once your job is running, it should run to completion; however, there is a five-day limit on execution time to encourage fair use.
Jobs may be submitted to the cluster from any of the general-use compute servers, currently comps0.cs, comps1.cs, comps2.cs and comps3.cs. The available nodes will fluctuate over time. You can examine the various configurations currently on offer by logging into a compute server and using the 'slurm_report' command (/usr/local/bin/slurm_report). This script is a wrapper for a variety of slurm commands, and aims to provide a more useful output. It has multiple modes, two of which are the -c and -g flags for an overview of CPU node and GPU node availability, e.g.:
user@comps3:~$ slurm_report -c
This is a report of CPU nodes, i.e. machines for non-GPU computing, which displays currently available resources.
Nodes completely in use or unavailable will not appear.
NODELIST is the name of the node.
PARTITION indicates to which partition the node belongs.
CPUS(A/I/O/T) is the state of cores on the node: Allocated/Idle/Other/Total. Essentially you want to look at the Idle number to judge which nodes have cores free.
MEMORY FREE is the currently unused system memory available for your job. Please specify with e.g. -c and --mem the resources you need when submitting your job.
AVAIL_FEATURES will indicate CPU type, useful if your experiments are specific to a particular manufacturer, or you need consistency across multiple job runs.
STATE could be either 'idle', which means the node is entirely free, or 'mixed', which indicates that other jobs are using some of the resources. The remaining resources available to your job should be accurately reflected in the CPUS Idle and MEMORY FREE fields.

NODELIST     PARTITION    CPUS(A/I/O/T)  MEMORY FREE  AVAIL_FEATURES       STATE
cpunode11    cpunodes*    0/32/0/32      128600       Threadripper_1950X   idle
cpunode12    cpunodes*    0/32/0/32      128600       Threadripper_1950X   idle
cpunode13    cpunodes*    0/32/0/32      128600       Threadripper_1950X   idle
cpunode14    cpunodes*    8/24/0/32      63064        Threadripper_1950X   mixed
cpunode15    cpunodes*    8/24/0/32      63064        Threadripper_1950X   mixed
cpunode16    cpunodes*    0/32/0/32      128600       Threadripper_1950X   idle
cpunode17    cpunodes*    0/32/0/32      128600       Threadripper_1950X   idle
cpunode18    cpunodes*    0/64/0/64      128600       Threadripper_2990WX  idle
cpunode6     cpunodes*    0/64/0/64      128700       Threadripper_2990WX  idle
cpunode7     cpunodes*    0/64/0/64      128700       Threadripper_2990WX  idle
cpunode8     cpunodes*    0/64/0/64      128700       Threadripper_2990WX  idle
cpunode9     cpunodes*    8/56/0/64      63164        Threadripper_2990WX  mixed
cpunode19    cpunodes*    0/64/0/64      128700       Threadripper_2990WX  idle
cpunode20    cpunodes*    0/64/0/64      128700       Threadripper_2990WX  idle
cpunode21    cpunodes*    0/64/0/64      128700       Threadripper_2990WX  idle
cpunode22    cpunodes*    8/56/0/64      63164        Threadripper_2990WX  mixed
cpunode24    cpunodes*    0/64/0/64      128700       Threadripper_2990WX  idle
cpunode25    cpunodes*    0/64/0/64      128700       Threadripper_2990WX  idle
cpunode26    cpunodes*    0/64/0/64      128700       Threadripper_2990WX  idle
cpunode3     cpunodes*    0/112/0/112    515000       XeonGold-6348        idle
cpunode2     cpunodes*    8/104/0/112    450264       AMD_Epyc7453         mixed
cpunode4     bigmemnodes  0/112/0/112    2051000      AMD_Epyc7453         idle
amdgpunode1  bigmemnodes  0/128/0/128    515000       (8x)AMD_MI100        idle
cpunode5     bigmemnodes  0/512/0/512    1547730      AMD_Epyc9754         idle
Note that 'amdgpunode1' used to be in service as an AMD-specific GPU server, but due to lack of interest has transitioned to a large memory cpu compute node.
You will want to confirm which partition a node is in using sinfo, as some nodes may be in multiple partitions or specialized partitions (such as the bigmemnodes partition).
user@comps3:~$ sinfo -p cpunodes,bigmemnodes
PARTITION    AVAIL  TIMELIMIT   NODES  STATE  NODELIST
cpunodes*    up     5-00:00:00  5      mix    cpunode[2,9,14-15,22]
cpunodes*    up     5-00:00:00  16     idle   cpunode[3,6-8,11-13,16-21,24-26]
bigmemnodes  up     5-00:00:00  3      idle   amdgpunode1,cpunode[4-5]
You can call the slurm_report script with the -h flag to see other options. The default invocation will report on your jobs, including any waiting to execute, and will also indicate how many jobs you have remaining before you hit the limit on simultaneous executions.
Customized information can be found using the squeue and sinfo commands.
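For example, to list just your own jobs with a custom set of columns (this particular format string is only a suggestion; see the squeue manual page for the full list of fields):

> squeue -u $USER -o "%.10i %.12P %.20j %.8T %.10M %R"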
Submitting Jobs
Log into one of the comps[0-3].cs machines. These machines have significant RAM allowances to make running heavyweight IDEs easier, without worrying about resource contention.
To run with anything more than the default minimum of one core and 1GB of RAM, you must specify three resources: the number of cores, the amount of RAM, and, if required, GPU(s).
You must also specify the partition to which you will submit your job.
For example, to submit a job to the 'cpunodes' cluster, requesting 8 cores and 16GB of ram, running a job which reports the resources obtained, you could do as follows:
> srun --partition cpunodes -c 8 --mem=16G --pty slurmtest.sh
running taskset
pid 14205's current affinity mask: ff
pid 14205's new affinity mask: ff
pid 14205's current affinity mask: ff
running memory test 16G
success => 16000MB was allocated
(an affinity mask of ff corresponds to 8 cores)
To submit a job to the gpunodes partition, requesting 1 core, 2GB of system RAM, and one GPU (slurm will choose the first free GPU in the partition), with a time limit of 60 minutes, testing for a GPU, you could run:
> srun --partition=gpunodes -c 1 --mem=2G --gres=gpu:1 -t 60 nvidia-smi -L
srun: job 15418 queued and waiting for resources
srun: job 15418 has been allocated resources
GPU 0: NVIDIA RTX A4000 (UUID: GPU-a509d4d4-83f4-bbe7-32fb-07d4a7a65dd1)
To see which GPUs are available in which nodes, you can use sinfo:
> sinfo -p gpunodes -o "%20N %10m %25f %20G "
NODELIST             MEMORY     AVAIL_FEATURES            GRES
gpunode[4,33]        63900+     RTX_4090,24564_MiB        gpu:rtx_4090:1
gpunode13            32000      GTX_1080_Ti,11178_MiB     gpu:gtx_1080_ti:1(S:
gpunode[32,34]       63900      RTX_4090,24564_MiB        gpu:rtx_4090:1(S:0)
gpunode[2-3]         127000     RTX_A6000,49140_MiB       gpu:rtx_a6000:1
gpunode[6,11]        31900      RTX_A4000,16117_MiB       gpu:rtx_a4000:1(S:0)
gpunode[29-30]       31900      RTX_A4500,20470_MiB       gpu:rtx_a4500:1(S:0)
gpunode[1,28]        63900+     RTX_A2000,12282_MiB       gpu:rtx_a2000:1
gpunode[16-17]       31900      RTX_2080,7982_MiB         gpu:rtx_2080:1(S:0)
gpunode[18-23,25]    31900      RTX_2070,7982_MiB         gpu:rtx_2070:1(S:0)
Then you might want to queue your job for a particular model of GPU, let's say an A2000:
> srun --partition=gpunodes -c 1 --mem=2G --gres=gpu:rtx_a2000:1 nvidia-smi -L
srun: job 15419 queued and waiting for resources
srun: job 15419 has been allocated resources
GPU 0: NVIDIA RTX A2000 12GB (UUID: GPU-5ea88708-76c1-cc7c-01a7-c4900e8cc8b8)
Don't copy these example srun command lines and use them directly. Before using an srun command line, adjust the number of CPUs and the amount of RAM your job will actually allocate and use, and pick a time limit. If you simply reuse these example srun commands, your jobs will be terminated after sixty minutes and will not have much RAM. If you don't ask for enough RAM, your job will be terminated when it hits the memory limit.
As a user, it is possible to submit an interactive shell job to a partition, that is, a job that allows you to run a shell on the node, perhaps to test code prior to doing a compute run. It is suggested that you make limited use of interactive sessions, as you are billed for the resources you have claimed whether or not you are actively running compute jobs in your interactive session.
Here is a sample interactive job submission:
> srun --partition cpunodes -c 4 --mem=8G -t 60 --pty bash --login
srun: job 15420 queued and waiting for resources
srun: job 15420 has been allocated resources
user@cpunode2:~$
Important notes on srun usage
The default resource limits for SLURM jobs in our cluster are 4 GB of RAM per CPU you request and a time limit of 12 hours. These are probably not adequate for your jobs, whether CPU jobs or especially GPU jobs, where needing more than 4 GB of RAM is normal. If your job will potentially run for more than twelve hours, you need to tell SLURM that with the -t argument, such as '-t 3-0' for three days. See the srun manual page for the time options. You can't request a time limit of more than five days, the cluster maximum; if you try, your job will never be scheduled.
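For reference, the -t value can be written several ways: a bare number is minutes (e.g. '-t 90'), and the 'hours:minutes:seconds' (e.g. '-t 12:00:00') and 'days-hours' (e.g. '-t 3-0') forms are also accepted.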
The amount of memory your job will get is set with the --mem argument, as in the examples above. To see the amount of memory each node has, you can do:
> sinfo -p gpunodes -O PartitionName,NodeHost,Memory,CPUs
PARTITION           HOSTNAMES           MEMORY              CPUS
gpunodes            gpunode4            128600              32
gpunodes            gpunode13           32000               8
gpunodes            gpunode32           63900               16
[...]
For GPU video RAM, use the 'features' and 'gres' output formats as documented above.
For GPU jobs, note that all of our GPU servers have only a single GPU. If you're allocating the GPU, you might as well also allocate most of the RAM and all of the CPUs on that node.
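As an illustrative sketch (the script name is a placeholder, and the core and memory numbers only make sense for the 4-core A4500 nodes shown above; adjust them to whichever node type you are targeting):

> srun --partition=gpunodes --gres=gpu:rtx_a4500:1 -c 4 --mem=28G -t 1-0 ./my_training_run.sh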
General Slurm Instructions
A comprehensive guide to slurm is beyond the scope of this article, and there are many excellent references on the web for specific cases.
sinfo will provide details of the slurm cluster configuration.
There are many ways to execute slurm jobs, but ultimately most of them will leverage srun or sbatch.
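For batch (non-interactive) work, a minimal sbatch script might look something like the following sketch; the script name, output pattern and resource numbers are placeholders to adapt to your own job:

> cat myjob.sbatch
#!/bin/bash
#SBATCH --partition=cpunodes
#SBATCH -c 4
#SBATCH --mem=8G
#SBATCH -t 2:00:00
#SBATCH --output=%x-%j.out
./my_compute_task.sh

> sbatch myjob.sbatch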
In addition to the sinfo and squeue commands used in the examples above, you should also be cognizant of the sacct command, which is your window into the slurm accounting system. This will allow you to view your activity, assess the state of your job completions, and so on. If you have some jobs succeed and some fail, you would be able to use sacct to clear up confusion regarding which jobs succeeded and which did not. Note that you will have to log into the cluster head node itself in order to run sacct. For example:
user@cluster:~$ sacct --format JobID%4,Partition%12,ExitCode,NodeList,JobName,State,Elapsed
JobI    Partition ExitCode        NodeList    JobName      State    Elapsed
---- ------------ -------- --------------- ---------- ---------- ----------
154+     gpunodes      2:0       gpunode11 gputest.sh     FAILED   00:00:00
154+                   0:0       gpunode11     extern  COMPLETED   00:00:00
154+                   2:0       gpunode11 gputest.sh     FAILED   00:00:00
154+     gpunodes      0:0       gpunode11 nvidia-smi  COMPLETED   00:00:01
154+                   0:0       gpunode11     extern  COMPLETED   00:00:01
154+                   0:0       gpunode11 nvidia-smi  COMPLETED   00:00:01
154+     gpunodes      0:0       gpunode11 nvidia-smi  COMPLETED   00:00:00
...
If you have suggestions for additions to this page, e.g. useful job submission invocations or reporting/monitoring commands, please don't hesitate to email them to ops@cs.
Scratch Space
For users of slurm, we provide temporary network-based scratch space for storing the output of jobs or for datasets.
This is network-accessed space shared across all slurm nodes.
This scratch space can be referenced by looking at the contents of "/scratch":
ls -1 /scratch/ | egrep expires
This will display one or two currently usable scratch directories. The directory names will be in the format "expires-DATE", where DATE is a specific expiry date. For example:
expires-2024-Jun-26
expires-2024-Jul-05
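As a hypothetical usage sketch, assuming you create your own subdirectory inside the current scratch directory and that your job script takes an output directory argument:

> mkdir -p /scratch/expires-2024-Jul-05/$USER
> srun --partition cpunodes -c 4 --mem=8G -t 2:00:00 ./myjob.sh /scratch/expires-2024-Jul-05/$USER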
Each directory is created to last for 15 days before it expires, and there will always be a directory in /scratch with an expiry date more than five days in the future.
Once a directory has expired, it will no longer be accessible. Therefore, please use it only for storing temporary data, and do not run jobs against it that would extend beyond its expiry date.
GPU Specifics
On slurm gpunodes, CUDA can be found in /usr/local/cuda*. We are transitioning from using Ubuntu system packages (which are installed to default paths) to using versions installed from NVIDIA repositories, because the default Ubuntu ones are not updated frequently enough.
This means that on non-upgraded nodes, you can find nvcc and other related binaries and libraries in your default path, and on upgraded nodes (which will eventually be all of them) everything cuda-related will be in /usr/local/cuda/[various]/. Python-specific packages such as pytorch or tensorflow should be installed by users via pip.
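If a tool can't find nvcc on an upgraded node, adding the CUDA directories to your environment usually suffices; this assumes the conventional /usr/local/cuda symlink layout:

> export PATH=/usr/local/cuda/bin:$PATH
> export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
> nvcc --version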
Please contact ops@cs should you need additional versions of CUDA.
We suggest using Python Virtual Environments to work with GPU tools such as pytorch or tensorflow.
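For example, a minimal virtual environment setup might look like this (the environment name and packages are up to you):

> python3 -m venv ~/venvs/torch-env
> source ~/venvs/torch-env/bin/activate
> pip install --upgrade pip
> pip install torch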
Recall the slurm_report command, which has a -g flag for GPU info:
user@comps2:~$ slurm_report -g
This is a report of GPU nodes, i.e. machines for GPU computing, which displays currently available resources.
NODELIST is the name of the node.
PARTITION indicates to which gpunode partition the node belongs.
CPUS(A/I/O/T) is the state of cores on the node: Allocated/Idle/Other/Total. Essentially you want to look at the Idle number to judge which nodes have cores free.
Please note that the default allocation for cores and memory is 1 core and 1GB of ram when a GPU is required. This is a -very- conservative use of resources. Please specify with e.g. -c and --mem the resources you need when submitting your job.
AVAIL_FEATURES will indicate GPU type, which you might specify e.g. --constraint=RTX_2070.
GPUS and FREE show the count of total and free GPUs in the node.
STATE could be either 'idle', which means the node is entirely free, or 'mixed', which indicates that other jobs are using some of the resources. The remaining resources available to your job should be accurately reflected in the CPUS Idle, MEMORY FREE and GPUS FREE fields.

NODELIST    PARTITION  CPUS(A/I/O/T)  MEMORY FREE  AVAIL_FEATURES       GPUS  FREE  STATE
gpunode4    gpunodes   0/32/0/32      128600       RTX_4090,24564_MiB   1     1     idle
gpunode5    gpunodes   0/32/0/32      128600       RTX_4090,24564_MiB   1     1     idle
gpunode7    gpunodes   8/24/0/32      112216       RTX_4090,24564_MiB   1     0     mixed
gpunode32   gpunodes   8/8/0/16       47516        RTX_4090,24564_MiB   1     0     mixed
gpunode33   gpunodes   8/24/0/32      112216       RTX_4090,24564_MiB   1     0     mixed
gpunode34   gpunodes   2/30/0/32      5720         RTX_4090,24564_MiB   1     0     mixed
gpunode1    gpunodes   16/16/0/32     112216       RTX_A2000,12282_MiB  1     0     mixed
gpunode28   gpunodes   2/30/0/32      36440        RTX_A2000,12282_MiB  1     0     mixed
gpunode6    gpunodes   0/4/0/4        31900        RTX_A4000,16117_MiB  1     1     idle
gpunode11   gpunodes   0/4/0/4        31900        RTX_A4000,16117_MiB  1     1     idle
gpunode15   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode16   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode17   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode18   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode19   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode20   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode21   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode22   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode23   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode24   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode25   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode26   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode27   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode29   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode30   gpunodes   0/4/0/4        31900        RTX_A4500,20470_MiB  1     1     idle
gpunode2    gpunodes   2/22/0/24      4120         RTX_A6000,49140_MiB  1     0     mixed
gpunode3    gpunodes   16/8/0/24      110616       RTX_A6000,49140_MiB  1     0     mixed
sinfo -p will give a quick summary to locate idle machines:
user@comps2:~# sinfo -p gpunodes
PARTITION  AVAIL  TIMELIMIT   NODES  STATE  NODELIST
gpunodes   up     5-00:00:00  4      mix    gpunode[4,13,32,34]
gpunodes   up     5-00:00:00  6      alloc  gpunode[2-3,6,29-30,33]
gpunodes   up     5-00:00:00  12     idle   gpunode[1,11,16-23,25,28]