Slurm Compute Cluster

All of our GPU computing infrastructure is accessed via slurm. Please see our slurm reference document for specifics.

Please do not run applications such as web browsers or email clients on these servers; general-purpose computing should be done on your desktop if you have one, or on an application server if you are connecting remotely.

Although most software on the compute servers should be available as part of the packages provided by the OS vendor, some software from alternate sources resides in /opt. Matlab is one such piece of software, and multiple versions will be available in /opt as they are released. You should consider setting your PATH or using aliases to make running software in /opt easier; your Point of Contact (PoC) can assist you with this.
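For example, assuming a hypothetical Matlab release under /opt (the directory name below is only a placeholder; check /opt for the versions actually installed), you might add lines like these to your ~/.bashrc:

# Placeholder paths; adjust to match what is actually present under /opt
export PATH=/opt/matlab-R2023b/bin:$PATH
alias matlab='/opt/matlab-R2023b/bin/matlab'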

On slurm gpunodes, CUDA, cuDNN, and TensorRT are installed as system packages and should be in the system path (e.g. for python) by default. If you need to locate CUDA manually, various versions will be found in /usr/local. Python-specific packages such as PyTorch or TensorFlow should be installed by users via pip.
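For example, to see which CUDA versions are present on a node and which compiler is first in your path (the output will vary from node to node), you could run:

user@gpunode1:~$ ls -d /usr/local/cuda*
user@gpunode1:~$ nvcc --version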

If a desired (but not installed) package is available as part of our current Ubuntu distribution, contact your Point of Contact (PoC) and have them put in a request that it be added to core software.

For software that is neither available via the OS distribution-provided package list nor in /opt, you or your Point of Contact (PoC) can install programs in your home directory or a suitable work partition for use on these servers, so long as they do not require Administrator/root privileges to install or run.
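As a rough sketch of that pattern (the package name and prefix below are placeholders), Python packages can be installed into your home directory with pip's --user flag, and autotools-style software can be built with a prefix you can write to:

user@comps2:~$ pip3 install --user some-package
user@comps2:~$ ./configure --prefix=$HOME/software && make && make install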

Available Resources

You can examine the various configurations currently on offer by logging into any of our general-use compute servers (i.e. comps0.cs through comps3.cs) and using the 'slurm_report' command (/usr/local/bin/slurm_report). This script is a wrapper around a variety of slurm commands and aims to provide more useful output. It has multiple modes, one of which is the -g flag for an overview of GPU availability:

user@comps2:~$ slurm_report -g

This is a report of GPU nodes, i.e. machines for GPU computing, which displays currently available resources.

NODELIST is the name of the node.

PARTITION indicates to which gpunode partition the node belongs.

CPUS(A/I/O/T) is the state of cores on the node.  Allocated/Idle/Other/Total.  Essentially you want to look at the Idle number to judge which nodes have cores free.

FREE_MEM is the amount of RAM available to jobs, in MB.

Please note that the default allocation for cores and memory is 1 core and 1 GB of RAM when a GPU is requested.  This is a -very- conservative use of resources.  Please specify the resources you need with e.g. -c and --mem when submitting your job; an example srun invocation appears after the report output below.

AVAIL_FEATURES will indicate GPU type, which you might specify with --constraint=.

  NODELIST  CPUS(A/I/O/T)  FREE_MEM                    AVAIL_FEATURES  Total GPUs  Free GPUs
 gpunode14      0/12/0/12      9679  (2x)GeForce_GTX_1050_Ti,4039_MiB           2          2
  gpunode1        0/4/0/4     10056      GeForce_GTX_1050_Ti,4039_MiB           1          1
  gpunode3        0/4/0/4      9496      GeForce_GTX_1050_Ti,4039_MiB           1          1
  gpunode4        0/4/0/4      9448      GeForce_GTX_1050_Ti,4039_MiB           1          1
  gpunode5        0/4/0/4      9410      GeForce_GTX_1050_Ti,4039_MiB           1          1
  gpunode7        0/4/0/4      9491      GeForce_GTX_1050_Ti,4039_MiB           1          1
  gpunode8        0/6/0/6      9456      GeForce_GTX_1050_Ti,4039_MiB           1          1
  gpunode9        0/6/0/6      9444      GeForce_GTX_1050_Ti,4039_MiB           1          1
 gpunode10        0/4/0/4     12515      GeForce_GTX_1050_Ti,4039_MiB           1          1
 gpunode11        0/4/0/4     13026      GeForce_GTX_1050_Ti,4039_MiB           1          1
 gpunode12        0/4/0/4     29133         GeForce_GTX_1080,8119_MiB           1          1
 gpunode13        0/8/0/8     29280     GeForce_GTX_1080_Ti,11178_MiB           1          1
 gpunode15        0/4/0/4      7723         GeForce_RTX_2060,5934_MiB           1          1
 gpunode24        0/4/0/4      7118         GeForce_RTX_2060,5934_MiB           1          1
 gpunode26        0/4/0/4      8533         GeForce_RTX_2060,5934_MiB           1          1
 gpunode27        0/4/0/4      8454         GeForce_RTX_2060,5934_MiB           1          1
 gpunode18        2/2/0/4      4819         GeForce_RTX_2070,7982_MiB           1          0
 gpunode19        0/4/0/4      9717         GeForce_RTX_2070,7982_MiB           1          1
 gpunode20        0/4/0/4      6751         GeForce_RTX_2070,7982_MiB           1          1
 gpunode21        0/4/0/4      6663         GeForce_RTX_2070,7982_MiB           1          1
 gpunode22        0/4/0/4      9311         GeForce_RTX_2070,7982_MiB           1          1
 gpunode23        0/4/0/4      9256         GeForce_RTX_2070,7982_MiB           1          1
 gpunode25        2/2/0/4      3284         GeForce_RTX_2070,7982_MiB           1          0
 gpunode16        1/3/0/4      4016         GeForce_RTX_2080,7982_MiB           1          0
 gpunode17        2/2/0/4      1376         GeForce_RTX_2080,7982_MiB           1          0

You can call the slurm_report script with the -h flag to see other options.
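For example, drawing on the columns above, an interactive job asking for four cores, 16 GB of RAM, and one GTX 1080 Ti might look like the following (the values and feature name are illustrative; the partition names are covered in the next section):

user@comps2:~$ srun --partition=biggpunodes -c 4 --mem=16G --gres=gpu:1 --constraint=GeForce_GTX_1080_Ti -t 2:0:0 --pty bash --login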

Nvidia GPU Machines

user@comps2:~$ sinfo
PARTITION     AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpunodes*        up 3-00:00:00     22   idle cpunode[1-10,18-25],cpuramnode[1-4]
smallgpunodes    up 3-00:00:00      4   idle gpunode[1,3-4,14]
biggpunodes      up 3-00:00:00      5   idle gpunode[12-13,16-17,25]

The slurm partitions for the Nvidia GPUs are 'smallgpunodes' for GPU cards with less than 8 GB of RAM, and 'biggpunodes' for those with 8 GB or more. The slurm reference page will detail the billing cost (note: this is not a financial term. Slurm refers to the tracking of resource use as 'billing'. Fear not, there is no money involved!) for the various resources. Note that the more resources you use, the lower your priority will be when in contention with others for resources in the future.
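As a sketch of a non-interactive submission (the script name, contents, and resource values are all placeholders), a batch job targeting one of the small-memory GPU cards might look like:

user@comps2:~$ cat train.sh
#!/bin/bash
#SBATCH --partition=smallgpunodes
#SBATCH -c 4
#SBATCH --mem=8G
#SBATCH --gres=gpu:1
#SBATCH -t 1-00:00:00
python3 train.py
user@comps2:~$ sbatch train.sh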

Each machine has at least one version of CUDA installed. CUDA can be found in /usr/local/ and as such should be in the path the system searches by default. Versions of cuDNN and TensorRT that are compatible with the available CUDA are also installed. Python-specific packages such as PyTorch or TensorFlow should be installed by users via pip.

Please contact software@cs should you need additional versions of CUDA.

We suggest using Python Virtual Environments to work with TensorFlow. While there is a system-wide install of TensorFlow, it is simply whatever version is currently the default and may not be optimal for the currently available CUDA.

Setting up multiple Python Virtual Environments to support multiple versions of TensorFlow is simple.

Use

user@gpunode1:~$ python3 -m venv ./venv-nvidia

to set up a python3 environment in the directory 'venv-nvidia'. If the directory does not exist, it will be created.
Then run

user@gpunode1:~$ source ./venv-nvidia/bin/activate

to enter that virtual environment.
Use pip to install the version of TensorFlow or pytorch that you need.

(venv-nvidia) user@gpunode1:~$ pip3 install tensorflow
(venv-nvidia) user@gpunode1:~$ python3 -c 'import tensorflow as tf; print(tf.__version__)'
2.8.4
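
Similarly, if you need PyTorch in the same environment, a minimal sketch (the exact pip invocation for a particular CUDA version may differ; see the PyTorch documentation) is:

(venv-nvidia) user@gpunode1:~$ pip3 install torch
(venv-nvidia) user@gpunode1:~$ python3 -c 'import torch; print(torch.__version__, torch.cuda.is_available())'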

To exit the current Python Virtual Environment, run

(venv-nvidia) user@gpunode1:~$ deactivate  

By specifying different directories when setting up an environment, you can maintain multiple environments, and switch between them using the 'activate' and 'deactivate' commands.
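For instance (the directory name and version pin below are only placeholders), a second environment holding a different TensorFlow release can sit alongside the first:

user@gpunode1:~$ python3 -m venv ./venv-tf-older
user@gpunode1:~$ source ./venv-tf-older/bin/activate
(venv-tf-older) user@gpunode1:~$ pip3 install 'tensorflow==2.12.*'
(venv-tf-older) user@gpunode1:~$ deactivate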

AMD GPU Machines

There is currently only one machine with AMD GPUs, amdgpunode1.cs. It provides 8 Instinct MI100 GPUs.

Using Python virtual environments for AMD GPU computing should be similar to the above for NVIDIA computing.

Versions of PyTorch and TensorFlow that support the available GPUs are already installed system-wide on the machine. Please note that, in general, the versions of both that support AMD GPU computing lag a little behind those for NVIDIA GPU computing, and they are not updated unless requested.

Setup for accessing the GPUs is not difficult. Essentially you will need two environment variables:

export PATH=/opt/rocm-5.3.0/bin/:$PATH
export LD_LIBRARY_PATH=/opt/rocm-5.3.0/lib

With those in place, you should be able to utilize as many GPUs as your job has reserved. For example:

user@cluster:~$ srun --partition amdgpunodes -c 8 --mem=64G --gres=gpu:2 -t 3:0:0 --pty bash --login

user@amdgpunode1:~$ export PATH=/opt/rocm-5.3.0/bin/:$PATH
user@amdgpunode1:~$ export LD_LIBRARY_PATH=/opt/rocm-5.3.0/lib

user@amdgpunode1:~$ python3 -c "import torch;print('Torch gpus found: ', torch.cuda.device_count())"
Torch gpus found:  2

user@amdgpunode1:~$ python3 -c "import os;os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3';import tensorflow as tf;print(tf.__version__)"
2.10.0

user@amdgpunode1:~$ python3 -c "import os;os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3';from tensorflow.python.client import device_lib;print(device_lib.list_local_devices())" | egrep '^, name:' 
, name: "/device:GPU:0"
, name: "/device:GPU:1"