GPU Jobs
Submitting GPU Jobs
To see which GPUs are currently available in the cluster, see the GPU Node Specification section. To request GPUs, use the ngpus flag to specify the number of GPUs you need:
#PBS -l select=1:ncpus=4:mem=24gb:ngpus=1
The value of ngpus is per node. You can request up to 8 GPUs per node. For example, requesting 4 nodes (select=4) with 8 GPUs each (ngpus=8) gives 32 GPUs in total.
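As a worked example of the per-node arithmetic, the sketch below shows the multi-node request described above and computes the resulting total. The ncpus and mem values are copied from the single-GPU example purely as placeholders; they are assumptions, not recommended sizes for an 8-GPU node.

```shell
#!/bin/bash
# 4 nodes, each with 8 GPUs (ncpus and mem are placeholder values).
#PBS -l select=4:ncpus=4:mem=24gb:ngpus=8

# Total GPUs = number of nodes (select) x GPUs per node (ngpus).
select=4
ngpus=8
total_gpus=$((select * ngpus))
echo "Total GPUs requested: ${total_gpus}"   # prints: Total GPUs requested: 32
```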
Warning
Please only request more than 1 GPU if you know your application supports this and you have made the changes needed to use them; some applications need extra steps in order to use more than 1 GPU!
Within a running job, the shell environment variable CUDA_VISIBLE_DEVICES is set automatically to the UUIDs of the allocated GPUs. Our systems use cgroups, meaning you will only be able to see the resources allocated to your job, including GPUs.
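A job script can check what it was given by inspecting this variable. The following is a minimal sketch, assuming CUDA_VISIBLE_DEVICES holds a comma-separated list of device identifiers; the helper name count_visible_gpus is hypothetical and not part of PBS or CUDA.

```shell
#!/bin/bash
#PBS -l select=1:ncpus=4:mem=24gb:ngpus=1

# Hypothetical helper: count the entries in CUDA_VISIBLE_DEVICES.
# Assumes a comma-separated list; prints 0 if the variable is unset or empty.
count_visible_gpus() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    local -a ids
    IFS=',' read -r -a ids <<< "$devices"
    echo "${#ids[@]}"
}

echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
echo "GPUs visible to this job: $(count_visible_gpus)"
```

If the count is lower than the ngpus value you requested, check the resource request before starting a long run.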
Job Limits
All GPU jobs currently run via the gpu72 queue. You will find up-to-date job limits for this queue on the Job sizing guidance page.
Requesting Specific GPU types
To request a specific GPU type, add the gpu_type PBS flag to your resource request.
There are 3 GPU types available in the batch queue: L40S, RTX6000, and A100:
#PBS -l select=1:ncpus=4:mem=24gb:ngpus=1:gpu_type=L40S
or
#PBS -l select=1:ncpus=4:mem=24gb:ngpus=1:gpu_type=RTX6000
or
#PBS -l select=1:ncpus=4:mem=24gb:ngpus=1:gpu_type=A100
The default option is the L40S cards. The RTX6000 can be a good choice if your work doesn't require much GPU compute, as these cards are normally in less demand. We only have a few A100 cards, and they only have 40 GB of VRAM; they are normally used by those who need the double precision performance of an A100 but not a large amount of VRAM.
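To confirm which card a job actually landed on, a job script can query the driver. This is a minimal sketch, assuming nvidia-smi is on the PATH of the GPU nodes (not confirmed by this page); the RTX6000 request is just one of the three options above.

```shell
#!/bin/bash
#PBS -l select=1:ncpus=4:mem=24gb:ngpus=1:gpu_type=RTX6000

# Report the model and total memory of each GPU visible to the job.
# Assumption: nvidia-smi is installed on the node; skip quietly if it is not.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=name,memory.total --format=csv,noheader
else
    echo "nvidia-smi not found on this node"
fi
```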
GPU Node Specification
| | L40S PCIe 48 GB | A100 PCIe 40 GB | A40 PCIe 48 GB | RTX6000 PCIe 24 GB |
|---|---|---|---|---|
| FP64 Double Precision (Tensor) TFLOPS | N/A | 19.5 | N/A | N/A |
| TF32 Single Precision (Tensor) TFLOPS | 183 | 156 | 74.8 | N/A |
| FP64 Double Precision TFLOPS | N/A | 9.7 | N/A | <1 |
| FP32 Single Precision TFLOPS | 91.6 | 19.5 | 37.4 | 16.3 |
| Memory | 48 GB GDDR6 | 40 GB HBM2 | 48 GB GDDR6 | 24 GB GDDR6 |
| Memory Bandwidth GB/s | 864 | 1,555 | 696 | 672 |
| CUDA Compute Capability | 8.9 | 8.0 | 8.6 | 7.5 |
| GPU Architecture | Ada Lovelace | Ampere | Ampere | Turing |
| Available in batch queue | Yes | Yes* | No | No |
| Available in Jupyterhub | No | No | Yes | Yes |
* Note that there are only a few A100s available in the batch queue. Only request them if you're absolutely sure you need them; otherwise you may be queueing for a long time.