Slurm Compute Cluster
All of our GPU computing infrastructure is accessed via slurm. Please see our slurm reference document for specifics.
Please do not run applications such as web browsers or email clients on these servers; general-purpose computing should be done on your desktop if you have one, or on an application server if you are connecting remotely.
Although most software on the compute servers should be available as part of the packages provided by the OS vendor, some software provided by alternate sources resides in /opt. Matlab is one such piece of software, and multiple versions will be available in /opt as they are released. You should consider setting your PATH or using aliases to make running software in /opt easier; your Point of Contact (PoC) can assist you with this.
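For example, you might add a couple of lines like the following to your ~/.bashrc. The Matlab release directory shown is hypothetical; list /opt to see which versions are actually installed.

```shell
# Hypothetical release directory; check /opt for the versions actually installed.
export PATH="/opt/matlab-R2023a/bin:$PATH"
# Or pin a specific release behind an alias:
alias matlab-2023a='/opt/matlab-R2023a/bin/matlab'
```

The PATH approach makes the newest-added release the default `matlab`; aliases let you keep several releases side by side.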
On slurm gpunodes, CUDA, cuDNN, and TensorRT are installed as system packages and should be in the system path (e.g. for python) by default. If you need to locate CUDA manually, the various versions will be found in /usr/local. Python-specific packages such as pytorch or tensorflow should be installed by users via pip.
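As a quick way to see which CUDA toolkit releases a given node provides (the exact version directories vary by node):

```shell
# Each /usr/local/cuda-<version> directory is an installed toolkit release;
# the fallback message covers hosts with no CUDA under /usr/local.
ls -d /usr/local/cuda* 2>/dev/null || echo "no CUDA toolkits under /usr/local"
```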
If a desired (but not installed) package is available as part of our current Ubuntu distribution, contact your Point of Contact (PoC) and have them put in a request that it be added to core software.
With respect to software that is neither available via the OS distribution-provided package list nor in /opt, you or your Point of Contact (PoC) can install programs in your home directory or a suitable work partition for use on these servers, so long as the programs do not require Administrator/root privileges to install or run.
Available Resources
You can examine the various configurations currently on offer by logging into any of our general-use compute servers (i.e. comps0.cs through comps3.cs) and using the 'slurm_report' command (/usr/local/bin/slurm_report). This script is a wrapper for a variety of slurm commands, and aims to provide a more useful output. It has multiple modes, one of which is the -g flag for an overview of GPU availability:
user@comps2:~$ slurm_report -g

This is a report of GPU nodes, i.e. machines for GPU computing, which displays currently available resources.

NODELIST is the name of the node.
PARTITION indicates to which gpunode partition the node belongs.
CPUS(A/I/O/T) is the state of cores on the node: Allocated/Idle/Other/Total. Essentially you want to look at the Idle number to judge which nodes have cores free.
FREE_MEM is the amount of RAM available to jobs, in MB.
AVAIL_FEATURES will indicate GPU type, which you might specify with --constraint=.

Please note that the default allocation for cores and memory is 1 core and 1GB of RAM when a GPU is requested. This is a -very- conservative use of resources. Please specify with e.g. -c and --mem the resources you need when submitting your job.

NODELIST   CPUS(A/I/O/T)  FREE_MEM  AVAIL_FEATURES                    Total GPUs  Free GPUs
gpunode14  0/12/0/12      9679      (2x)GeForce_GTX_1050_Ti,4039_MiB  2           2
gpunode1   0/4/0/4        10056     GeForce_GTX_1050_Ti,4039_MiB      1           1
gpunode3   0/4/0/4        9496      GeForce_GTX_1050_Ti,4039_MiB      1           1
gpunode4   0/4/0/4        9448      GeForce_GTX_1050_Ti,4039_MiB      1           1
gpunode5   0/4/0/4        9410      GeForce_GTX_1050_Ti,4039_MiB      1           1
gpunode7   0/4/0/4        9491      GeForce_GTX_1050_Ti,4039_MiB      1           1
gpunode8   0/6/0/6        9456      GeForce_GTX_1050_Ti,4039_MiB      1           1
gpunode9   0/6/0/6        9444      GeForce_GTX_1050_Ti,4039_MiB      1           1
gpunode10  0/4/0/4        12515     GeForce_GTX_1050_Ti,4039_MiB      1           1
gpunode11  0/4/0/4        13026     GeForce_GTX_1050_Ti,4039_MiB      1           1
gpunode12  0/4/0/4        29133     GeForce_GTX_1080,8119_MiB         1           1
gpunode13  0/8/0/8        29280     GeForce_GTX_1080_Ti,11178_MiB     1           1
gpunode15  0/4/0/4        7723      GeForce_RTX_2060,5934_MiB         1           1
gpunode24  0/4/0/4        7118      GeForce_RTX_2060,5934_MiB         1           1
gpunode26  0/4/0/4        8533      GeForce_RTX_2060,5934_MiB         1           1
gpunode27  0/4/0/4        8454      GeForce_RTX_2060,5934_MiB         1           1
gpunode18  2/2/0/4        4819      GeForce_RTX_2070,7982_MiB         1           0
gpunode19  0/4/0/4        9717      GeForce_RTX_2070,7982_MiB         1           1
gpunode20  0/4/0/4        6751      GeForce_RTX_2070,7982_MiB         1           1
gpunode21  0/4/0/4        6663      GeForce_RTX_2070,7982_MiB         1           1
gpunode22  0/4/0/4        9311      GeForce_RTX_2070,7982_MiB         1           1
gpunode23  0/4/0/4        9256      GeForce_RTX_2070,7982_MiB         1           1
gpunode25  2/2/0/4        3284      GeForce_RTX_2070,7982_MiB         1           0
gpunode16  1/3/0/4        4016      GeForce_RTX_2080,7982_MiB         1           0
gpunode17  2/2/0/4        1376      GeForce_RTX_2080,7982_MiB         1           0
You can call the slurm_report script with the -h flag to see other options.
Nvidia GPU Machines
user@comps2:~# sinfo
PARTITION      AVAIL  TIMELIMIT   NODES  STATE  NODELIST
cpunodes*      up     3-00:00:00  22     idle   cpunode[1-10,18-25],cpuramnode[1-4]
smallgpunodes  up     3-00:00:00  4      idle   gpunode[1,3-4,14]
biggpunodes    up     3-00:00:00  5      idle   gpunode[12-13,16-17,25]
The slurm partitions for the Nvidia GPUs are 'smallgpunodes' for GPU cards with less than 8GB of RAM, and 'biggpunodes' for those with 8GB or greater. The slurm reference page details the billing cost for the various resources. (Note: this is not a financial term. Slurm refers to the tracking of resource use as 'billing'. Fear not, there is no money involved!) Note that the more resources you use, the lower your priority will be when in contention with others for resources in the future.
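Putting the partition and feature names together, a request for a specific large-GPU node type might look like the following (the flag values are illustrative; see the slurm reference document for the full set of options):

user@comps2:~$ srun --partition biggpunodes --constraint GeForce_GTX_1080_Ti -c 4 --mem=16G --gres=gpu:1 -t 1:0:0 --pty bash --login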
Each machine has at least one version of CUDA installed. CUDA can be found in /usr/local/ and as such should be in the path the system searches by default. Versions of cuDNN and TensorRT that are compatible with the available CUDA are also installed. Python-specific packages such as pytorch or tensorflow should be installed by users via pip.
Please contact software@cs should you need additional versions of CUDA.
We suggest using Python Virtual Environments to work with TensorFlow. While there is a system-wide install of TensorFlow, it will be whatever the current default release is, and may not be optimal for the currently available CUDA.
Setting up multiple Python Virtual Environments to support multiple versions of TensorFlow is simple.
Use
user@gpunode1:~$ python3 -m venv ./venv-nvidia
to set up a python3 environment in the directory 'venv-nvidia'. If the directory does not exist, it will be created.
Then run
user@gpunode1:~$ source ./venv-nvidia/bin/activate
to enter that virtual environment.
Use pip to install the version of TensorFlow or pytorch that you need.
(venv-nvidia) user@gpunode1:~$ pip3 install tensorflow
(venv-nvidia) user@gpunode1:~$ python3 -c 'import tensorflow as tf; print(tf.__version__)'
2.8.4
To exit the current Python Virtual Environment, run
(venv-nvidia) user@gpunode1:~$ deactivate
By specifying different directories when setting up an environment, you can maintain multiple environments, and switch between them using the 'activate' and 'deactivate' commands.
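For instance, to keep one environment per framework version, you could do something like the following (the directory names are just examples):

```shell
# Create two independent environments side by side.
python3 -m venv ./venv-tf-a
python3 -m venv ./venv-tf-b

# Work in the first one...
source ./venv-tf-a/bin/activate
# ...and leave it before entering the second.
deactivate
source ./venv-tf-b/bin/activate
deactivate
```

Packages installed with pip inside one environment are invisible to the other, so incompatible TensorFlow or PyTorch versions never collide.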
AMD GPU Machines
There is currently only one machine with AMD GPUs, amdgpunode1.cs. It provides 8 Instinct MI100 GPUs.
Using python virtual environments for AMD GPU computing is much the same as described above for NVIDIA computing.
Versions of PyTorch and TensorFlow that support the available GPUs are already installed system-wide on the machine. Please note that, in general, the versions of both that support AMD GPU computing lag a little behind those for NVIDIA GPU computing, and are not updated unless requested.
Setup for accessing the GPUs is not difficult. Essentially you will need two environment variables:
PATH=/opt/rocm-5.3.0/bin/:$PATH
LD_LIBRARY_PATH=/opt/rocm-5.3.0/lib
With those in place, you should be able to utilize as many GPUs as your job has reserved. For example:
user@cluster:~$ srun --partition amdgpunodes -c 8 --mem=64G --gres=gpu:2 -t 3:0:0 --pty bash --login
user@amdgpunode1:~$ PATH=/opt/rocm-5.3.0/bin/:$PATH
user@amdgpunode1:~$ LD_LIBRARY_PATH=/opt/rocm-5.3.0/lib
user@amdgpunode1:~$ python3 -c "import torch;print('Torch gpus found: ', torch.cuda.device_count())"
Torch gpus found:  2
user@amdgpunode1:~$ python3 -c "import os;os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3';import tensorflow as tf;print(tf.__version__)"
2.10.0
user@amdgpunode1:~$ python3 -c "import os;os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3';from tensorflow.python.client import device_lib;print(device_lib.list_local_devices())" | egrep '^, name:'
, name: "/device:GPU:0"
, name: "/device:GPU:1"
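If you use the AMD node regularly, you can persist these two variables in your ~/.bashrc rather than setting them in every session. The version directory below matches the ROCm 5.3.0 install referenced above; check /opt on the node if the installed version changes.

```shell
# Matches the ROCm 5.3.0 install on amdgpunode1; confirm the version under /opt.
export PATH=/opt/rocm-5.3.0/bin/:$PATH
export LD_LIBRARY_PATH=/opt/rocm-5.3.0/lib
```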