Slurm Cluster
In our slurm implementation, a researcher chooses a partition (slurm nomenclature for a group of associated nodes) according to their needs, e.g. one made up of nodes with fast cores, lots of memory, older GPUs, or newer GPUs, and submits a job to it. The job submission specifies the computing resources required (cores, memory, GPUs), and slurm schedules node(s) that meet those requirements.
A quick note on the 'billing' term. In the context of slurm, it has nothing to do with financial matters. It is simply the term the slurm developers have chosen to apply to the tracking of resource use by an account (i.e. a user). There is no money involved!
Slurm will bill resource usage against the user's account, and will track usage over time. This usage is involved in the calculation of job execution priority in the event of multiple jobs in contention for the same resources. In short, the more resources you use, the lower your scheduling priority - it is a fair-share based system. More valuable resources will be billed at higher rates.
Our current billing scheme, subject to tweaking as the needs and demands of the department change, is as follows.
- CPU time is the base billing element, measured in cpu-seconds. It is calculated on a per-core basis, so one core for 60 seconds is billed as 60. Two cores for 30 seconds would also be billed as 60.
- Memory is billed at a rate of 1 cpu-second for 0.25GB of ram held for one second.
- GPU time is billed depending on the type. Smaller GPUs from the 'smallgpunodes' partition are billed at 4 cpu-seconds per second of use. Larger GPUs from the 'biggpunodes' partition are billed at 16 cpu-seconds per second of use.
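As a worked illustration of the rates above, here is a minimal sketch in bash. It assumes the CPU, memory and GPU charges are simply added together, which this page does not state explicitly. The function name billing_estimate and its argument order are hypothetical and are not part of slurm or of our tooling.

```shell
#!/usr/bin/env bash
# Rough estimate of billed cpu-seconds for a job, using the rates listed above.
# ASSUMPTION: the CPU, memory and GPU charges are summed; the page does not
# say how the scheduler combines them.
# Usage: billing_estimate CORES MEM_GB GPUS GPU_RATE SECONDS
#   GPU_RATE is 4 for 'smallgpunodes' and 16 for 'biggpunodes'.
billing_estimate() {
    local cores=$1 mem_gb=$2 gpus=$3 gpu_rate=$4 seconds=$5
    local cpu_charge=$(( cores * seconds ))
    # 1 cpu-second per 0.25GB per second is 4 cpu-seconds per GB per second.
    local mem_charge=$(( mem_gb * 4 * seconds ))
    local gpu_charge=$(( gpus * gpu_rate * seconds ))
    echo $(( cpu_charge + mem_charge + gpu_charge ))
}

# CPU charge only (memory set to 0 to isolate it): both lines print 60.
billing_estimate 1 0 0 0 60
billing_estimate 2 0 0 0 30
# 4 cores, 8GB of RAM and one 'biggpunodes' GPU held for an hour.
billing_estimate 4 8 1 16 3600
```

Under these assumptions the last call prints 187200, of which 115200 is the memory charge, so an over-generous memory request can easily cost more than the cores or the GPU.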
Remember that you are billed for the time you have a resource claimed, not only for the time your job is actually using the resource. For example, if you requested an interactive shell, you are being billed for all the resources that you have requested for the entire duration of that session.
Your accumulated resource usage is one of the primary factors in determining your priority should other jobs be in contention for available resources, so it is in your best interests to request only what you need, and to free the resources as soon as your computations are complete.
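To keep an eye on your recorded usage and to release resources you no longer need, the standard slurm commands sshare, sprio and scancel may help. The sketch below wraps them in small bash functions; the function names are hypothetical, and it assumes these commands are available to users on the compute servers, which this page does not confirm.

```shell
#!/usr/bin/env bash
# Thin wrappers around standard slurm commands; nothing runs until called.
# ASSUMPTION: sshare, sprio and scancel are installed and usable by users here.

# Show the fair-share usage recorded for your account.
my_usage() {
    sshare -u "$USER"
}

# Show the priority factors of your pending jobs.
my_priority() {
    sprio -u "$USER"
}

# Release the resources held by a job, e.g. a forgotten interactive shell.
# Usage: release_job JOBID
release_job() {
    scancel "$1"
}
```

For example, release_job 213 would cancel job 213 and stop it accruing usage against your account.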
At this time, we are not using job preemption. Once your job is running, it should run to completion; however, there is a three-day limit on execution time to encourage fair use.
Jobs may be submitted to the cluster from any of the general-use compute servers, currently comps0.cs, comps1.cs, comps2.cs and comps3.cs. The available nodes will fluctuate over time. You can examine the various configurations currently on offer by logging into a compute server and using the 'slurm_report' command (/usr/local/bin/slurm_report). This script is a wrapper for a variety of slurm commands, and aims to provide more useful output. It has multiple modes, two of which are selected by the -c and -g flags, which give an overview of CPU node and GPU node availability, e.g.:
    user@comps3:~$ slurm_report -c
    NODELIST   CPUS(A/I/O/T)  FREE_MEM  AVAIL_FEATURES       STATE
    cpunode23  0/8/0/8        58564     XeonE5-2680          idle
    cpunode1   0/32/0/32      89296     Threadripper_1950X   idle
    cpunode10  0/32/0/32      120935    Threadripper_1950X   idle
    cpunode11  0/32/0/32      120952    Threadripper_1950X   idle
    cpunode12  0/32/0/32      121063    Threadripper_1950X   idle
    cpunode13  0/32/0/32      120637    Threadripper_1950X   idle
    cpunode14  0/32/0/32      121325    Threadripper_1950X   idle
    cpunode15  0/32/0/32      121311    Threadripper_1950X   idle
    cpunode16  0/32/0/32      121428    Threadripper_1950X   idle
    cpunode17  0/32/0/32      121804    Threadripper_1950X   idle
    cpunode20  0/64/0/64      117820    Threadripper_2990WX  idle
    cpunode21  0/64/0/64      119265    Threadripper_2990WX  idle
    cpunode22  0/64/0/64      117987    Threadripper_2990WX  idle
    cpunode24  0/64/0/64      106409    Threadripper_2990WX  idle
    cpunode25  0/64/0/64      112443    Threadripper_2990WX  idle
    cpunode26  0/64/0/64      118976    Threadripper_2990WX  idle
    cpunode3   0/112/0/112    506667    XeonGold-6348        idle
    cpunode2   0/112/0/112    508195    AMD_Epyc7453         idle
    cpunode4   0/112/0/112    2046965   AMD_Epyc7453         idle

This is a report of CPU nodes, i.e. machines for non-GPU computing, which displays currently available resources.
- NODELIST is the name of the node.
- CPUS(A/I/O/T) is the state of cores on the node: Allocated/Idle/Other/Total. Essentially you want to look at the Idle number to judge which nodes have cores free.
- FREE_MEM is the amount of RAM available to jobs, in MB.
- AVAIL_FEATURES will indicate CPU type, useful if your experiments are specific to a particular manufacturer, or you need consistency across multiple job runs.
- STATE could be either IDLE, which means the node is entirely free, or MIXED, which indicates that other jobs are using some of the resources. The CPUS 'I' field and FREE_MEM will reflect what is available to new jobs if a node is in MIXED state.
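If you want to pick out nodes with enough free capacity rather than read the report by eye, a small filter can do it. The following is only a sketch: the function name nodes_with_free is hypothetical, and it assumes the report keeps the whitespace-separated column layout shown above, which may change.

```shell
#!/usr/bin/env bash
# Print nodes that have at least MIN_CORES idle cores and MIN_MEM_MB free memory.
# Reads a report in the column layout shown above on standard input.
# Usage: slurm_report -c | nodes_with_free MIN_CORES MIN_MEM_MB
nodes_with_free() {
    awk -v min_cores="$1" -v min_mem="$2" '
        $1 == "NODELIST" { next }        # skip the header row
        {
            split($2, cpus, "/")         # cpus[2] is the Idle count in A/I/O/T
            if (cpus[2] + 0 >= min_cores + 0 && $3 + 0 >= min_mem + 0)
                print $1
        }'
}

# Try it on a few rows copied from the report above.
nodes_with_free 64 110000 <<'EOF'
NODELIST CPUS(A/I/O/T) FREE_MEM AVAIL_FEATURES STATE
cpunode23 0/8/0/8 58564 XeonE5-2680 idle
cpunode20 0/64/0/64 117820 Threadripper_2990WX idle
cpunode24 0/64/0/64 106409 Threadripper_2990WX idle
cpunode3 0/112/0/112 506667 XeonGold-6348 idle
EOF
```

With this sample input the filter prints cpunode20 and cpunode3: cpunode23 has too few idle cores and cpunode24 too little free memory.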
You will want to confirm which partition a node is in using sinfo, as some nodes may be in multiple partitions or specialized partitions (such as the bigmemnodes partition).
    user@comps3:~$ sinfo
    PARTITION      AVAIL  TIMELIMIT   NODES  STATE  NODELIST
    cpunodes*      up     5-00:00:00  6      alloc  cpunode[6-9,18-19]
    cpunodes*      up     5-00:00:00  18     idle   cpunode[1-3,10-17,20-26]
    bigmemnodes    up     5-00:00:00  1      idle   cpunode4
    smallgpunodes  up     5-00:00:00  14     idle   gpunode[1-5,7-10,14-15,24,26-27]
    biggpunodes    up     5-00:00:00  3      mix    gpunode[6,13,22]
    biggpunodes    up     5-00:00:00  9      alloc  gpunode[11-12,16-17,19-21,23,25]
    biggpunodes    up     5-00:00:00  1      idle   gpunode18
You can call the slurm_report script with the -h flag to see other options. The default invocation will report on your jobs, including any waiting to execute, and will also indicate how many jobs you have remaining before you hit the limit on simultaneous executions.
Customized information can be found using the squeue and sinfo commands.
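As a starting point for such customized queries, the sketch below wraps two invocations in bash functions. The function names are hypothetical; the -u, -N and -o options and the format codes are standard squeue and sinfo options, but the columns chosen are only one possible selection.

```shell
#!/usr/bin/env bash
# Thin wrappers around squeue and sinfo; nothing runs until a function is called.

# List your own jobs: job id, partition, name, state, time used, node list or reason.
my_jobs() {
    squeue -u "$USER" -o '%i %P %j %T %M %R'
}

# List every node with its partition, core count, memory (MB), features and GPUs.
node_overview() {
    sinfo -N -o '%N %P %c %m %f %G'
}
```

The features column printed by node_overview shows the values that can be passed to --constraint when submitting a job.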
Submitting Jobs
Log into one of the comps[0-3].cs machines. These machines have significant RAM allowances to make running heavyweight IDEs easier, without worrying about resource contention.
To run with anything more than the default minimum of one core and 1GB of RAM, you must specify the resources you need: the number of cores, the amount of RAM and, if required, GPU(s).
You must also specify the partition to which you will submit your job.
For example, to submit a job to the 'cpunodes' partition, requesting 8 cores and 16GB of RAM, and running a script which reports the resources obtained, you could do as follows:
    > srun --partition cpunodes -c 8 --mem=16G --pty slurmtest.sh
    running taskset
    pid 14205's current affinity mask: ff
    pid 14205's new affinity mask: ff
    pid 14205's current affinity mask: ff
    running memory test
    16G success => 16000MB was allocated

(mask of ff = 8 cores)
To submit a job to the gpunodes partition, requesting 1 core, 2GB of system RAM, and one GPU, and testing for a GPU, you could run:
    > srun --partition=gpunodes -c 1 --mem=2G --gres=gpu:1 gputest.sh
    GPU 0: GeForce GTX 1050 Ti (UUID: GPU-8480d940-000d-1736-8b67-7f788b49391b)
Same thing, but ask for two GPUs:
    > srun --partition=gpunodes -c 1 --mem=2G --gres=gpu:2 gputest.sh
    GPU 0: GeForce GTX 1050 Ti (UUID: GPU-bb85f88d-2a7e-2ab7-9960-0abb91ad4b2c)
    GPU 1: GeForce GTX 1050 Ti (UUID: GPU-db98900c-97ee-f7bc-583f-f554409f8215)
As a user, it is possible to submit an interactive shell job to a partition, that is to say, a job that allows you to run a shell on the node, perhaps to test code prior to doing a compute run. It is suggested that you make limited use of interactive sessions, as you are billed for the resources you have claimed whether or not you are actively running compute jobs in your interactive session.
Here is a sample interactive job submission, asking for four cores, 2GB of ram, and two GPUs. We are specifically asking for gpunode1:
    > srun --partition gpunodes --nodelist gpunode1 -c 4 --gres=gpu:2 --mem=2G --pty bash --login
    srun: error: Unable to allocate resources: Requested node configuration is not available
The job failed as gpunode1 only has one GPU available. We can use sinfo to learn that gpunode14 has two 1050 Ti GPUs available:
    > sinfo -N -p gpunodes -o '%10N %G'
    NODELIST   GRES
    gpunode1   gpu:gtx_1050_ti:1(S:0)
    gpunode14  gpu:gtx_1050_ti:2(S:0-1)
So instead, let's submit to gpunode14:
    > srun --partition gpunodes --nodelist gpunode14 -c 4 --gres=gpu:2 --mem=2G --pty bash --login
    user@gpunode14:~$ nvidia-smi -L
    GPU 0: GeForce GTX 1050 Ti (UUID: GPU-bb85f88d-2a7e-2ab7-9960-0abb91ad4b2c)
    GPU 1: GeForce GTX 1050 Ti (UUID: GPU-db98900c-97ee-f7bc-583f-f554409f8215)
You can leverage the 'Features' column if you want to submit to a specific type of hardware. For example, if you do not specify a GPU type, slurm will give you the next available one from the list:
    > srun --partition=biggpunodes -c 1 --mem=2G --gres=gpu:1 gputest.sh
    GPU 0: GeForce GTX 1080 (UUID: GPU-1581c530-367f-9a0d-57b0-44904baf34d1)
But if you wish, you could specifically request an RTX 2080:
    > srun --partition=biggpunodes -c 1 --mem=2G --gres=gpu:1 --constraint=GeForce_RTX_2080 gputest.sh
    GPU 0: GeForce RTX 2080 (UUID: GPU-d1fc01bf-5d55-bcea-63cc-db2298cf9b56)
General Slurm Instructions
A comprehensive guide to slurm is beyond the scope of this article, and there are many excellent references on the web for specific cases. We will provide here some general use examples for someone new to slurm.
sinfo will provide details of the slurm cluster configuration.
There are many ways to execute slurm jobs, but ultimately most of them will leverage srun or sbatch.
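The examples on this page all use srun, so here is a minimal sketch of the sbatch equivalent of the 8-core, 16GB cpunodes request. The job name, output file name and one-hour time limit are illustrative assumptions, and report_allocation is a hypothetical helper, not part of slurm.

```shell
#!/bin/bash
#SBATCH --partition=cpunodes
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
#SBATCH --time=0-01:00:00
#SBATCH --job-name=example
#SBATCH --output=example-%j.out

# Report what was allocated. The fallbacks let the script also run
# outside slurm, where the SLURM_* variables are not set.
report_allocation() {
    echo "job id: ${SLURM_JOB_ID:-none}"
    echo "nodes:  ${SLURM_JOB_NODELIST:-$(hostname)}"
    echo "cores:  ${SLURM_CPUS_PER_TASK:-1}"
}

report_allocation
# Your own computation goes here.
```

Saved as, say, example.sh, the script would be submitted with "sbatch example.sh". Unlike srun, sbatch returns immediately, and the job's output is written to example-<jobid>.out once it runs.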
In addition to the sinfo and squeue commands used in the examples above, you should also be cognizant of the sacct command, which is your window into the slurm accounting system. This will allow you to view your activity, assess the state of your job completions, and so on. If you have some jobs succeed and some fail, you would be able to use sacct to clear up confusion regarding which jobs succeeded and which did not. For example:
    > sacct --format JobID%4,Partition%12,ExitCode,NodeList,JobName,State,Elapsed
    JobI    Partition ExitCode        NodeList    JobName      State    Elapsed
    ---- ------------ -------- --------------- ---------- ---------- ----------
     205     gpunodes      0:0        gpunode1 nvidia-smi  COMPLETED   00:00:01
     206     gpunodes      0:0        gpunode1      uname  COMPLETED   00:00:01
     207     gpunodes      0:0        gpunode1 gputest.sh  COMPLETED   00:00:00
     208     gpunodes      0:0       gpunode14 gputest.sh  COMPLETED   00:00:00
     209     cpunodes      0:0     cpuramnode1       bash  COMPLETED   00:10:13
     210     cpunodes    0:125     cpuramnode2       bash OUT_OF_ME+   00:00:23
     211     cpunodes      0:0     cpuramnode2       bash  COMPLETED   00:00:19
     212     gpunodes      1:0   None assigned       bash     FAILED   00:00:00
     213     gpunodes      0:0       gpunode14       bash  COMPLETED   00:20:13
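When the job list is long, it can be easier to filter the accounting output than to scan it. The sketch below assumes pipe-delimited sacct output (the --parsable2 option) with JobID, State and ExitCode as the first three fields; the function name unfinished_jobs is hypothetical.

```shell
#!/usr/bin/env bash
# Print "JobID State ExitCode" for every job whose state is not COMPLETED.
# Note that this also lists jobs that are still pending or running.
# Expects pipe-delimited sacct output on standard input.
# Usage: sacct -X --noheader --parsable2 --format=JobID,State,ExitCode | unfinished_jobs
unfinished_jobs() {
    awk -F'|' '$2 != "COMPLETED" { print $1, $2, $3 }'
}

# Try it on three jobs from the listing above, in pipe-delimited form.
unfinished_jobs <<'EOF'
209|COMPLETED|0:0
210|OUT_OF_MEMORY|0:125
212|FAILED|1:0
EOF
```

With this sample input only jobs 210 and 212 are printed. With --parsable2 the state is not truncated, so the out-of-memory job shows as OUT_OF_MEMORY rather than OUT_OF_ME+.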
If you have suggestions for additions to this page, e.g. useful job submission invocations or reporting/monitoring commands, please don't hesitate to email them to ops@cs.