HX1

HX1 (or Hex) is the High Performance Computing cluster at Imperial with a low-latency InfiniBand network, designed for "capability" workloads. More specifically, HX1 is designed for the following types of applications:

  • Multi-node parallel applications, typically using MPI to communicate between the compute nodes.
  • GPU accelerated scientific applications that require double precision.
  • GPU accelerated AI workloads requiring more than 48GB of GPU RAM.

For all other workloads, please continue to use the CX3 service. Note that the L40S GPUs on CX3 have a newer GPU architecture (Ada Lovelace) and are generally faster than the A100s on HX1 for AI/ML workloads. If you are in any doubt about whether your job is suited to HX1, please contact the RCS.

Cluster Specification

Please go to the HX1 section of our cluster specification page for details of HX1.

Access

Info

HX1 is designated for high-performance computing (HPC) users who run multi-node or GPU-accelerated workflows, requiring more than 48GB of GPU RAM and/or strong double-precision performance. Users should also have demonstrable experience with HPC clusters. This restriction is in place to ensure the efficient running of the HX1 facility for those users wishing to run large scale workloads.

All users must already have access to the RCS HPC service before applying for access to HX1 - see https://www.imperial.ac.uk/admin-services/ict/self-service/research-support/rcs/get-access/ for more information. Once you have access to the main RCS HPC facility, you can apply for access using the ServiceDesk Ask form.

Groups

Once you have been granted access to HX1, you will be added to the hpc-hx1 secondary group, which will enable you to log in to the service. Your primary group on HX1 will be the same as your primary group on the main CX3 system.

Connecting to HX1

There are two login nodes for hx1 which are accessible over ssh. These can be accessed using the login.hx1.hpc.ic.ac.uk hostname. If you see a message asking you to confirm the host key, the fingerprint given in the message should match one of those given below:

  • SHA256:5rlfNbjizjC3EQqdcSUBsSi90CdoriR2gSS6S1DFUBM (RSA)
  • SHA256:n+mpFNQcNSFscvVEKE7QyYM18lSBSkwrgZ9yMnyVlas (ECDSA)
  • SHA256:lYTv2ukiW7T+LOTRrw2pV4Zo45A0RGwxmKJShMcW3LY (ED25519)

Some ssh clients report fingerprints as MD5 hashes instead, so these are also provided here:

  • MD5:48:fa:ad:db:a8:a8:b7:be:3b:bb:fa:99:38:76:1c:8f (RSA)
  • MD5:ea:03:50:47:d5:fd:e5:5f:76:28:51:9a:2a:da:0c:a1 (ECDSA)
  • MD5:78:7f:eb:0c:c9:14:65:b8:bf:c7:c1:84:fa:6b:98:54 (ED25519)
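To compare the fingerprint shown in the ssh prompt against the SHA256 list above without checking it by eye, a small helper along the following lines may be useful. This is only a sketch: the function name `is_known_hx1_fingerprint` is hypothetical, and the fingerprints are copied from the list above.

```shell
# SHA256 host key fingerprints for the HX1 login nodes, copied from the list above.
HX1_FINGERPRINTS="SHA256:5rlfNbjizjC3EQqdcSUBsSi90CdoriR2gSS6S1DFUBM
SHA256:n+mpFNQcNSFscvVEKE7QyYM18lSBSkwrgZ9yMnyVlas
SHA256:lYTv2ukiW7T+LOTRrw2pV4Zo45A0RGwxmKJShMcW3LY"

# Hypothetical helper: succeeds only if the argument exactly matches
# one of the known fingerprints (fixed-string, whole-line comparison).
is_known_hx1_fingerprint() {
    printf '%s\n' "$HX1_FINGERPRINTS" | grep -Fxq -- "$1"
}

# Example: paste the fingerprint from the ssh prompt as the argument.
# is_known_hx1_fingerprint "SHA256:..." && echo "fingerprint matches"
```

If the helper reports no match, do not accept the host key; contact the RCS instead.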

Note that, due to security concerns, key-based authentication is disabled on the login nodes. Users will need to log in using their college username and password.

Warning

The HX1 login nodes only have IPv6 addresses, meaning you must be connected to the campus network in such a way that you can access IPv6 addresses. Our guide on Remote Access provides further details.
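As a sketch, a connection from a machine with IPv6 connectivity to the campus network might look like the following. The wrapper name `hx1_ssh` is hypothetical. The `-6` option forces IPv6, and `PubkeyAuthentication=no` stops the client offering keys that the login nodes will not accept; both are optional conveniences rather than requirements.

```shell
# Hypothetical wrapper: connect to the HX1 login nodes as the given college username.
hx1_ssh() {
    # -6 forces IPv6; PubkeyAuthentication=no skips key offers, since only
    # password logins are accepted on the login nodes.
    ssh -6 -o PubkeyAuthentication=no "$1@login.hx1.hpc.ic.ac.uk"
}

# Example (replace "username" with your college username):
# hx1_ssh username
```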

Storage

Research Data Store

HX1 has its own dedicated high performance file system utilising the IBM General Parallel File System (GPFS), running on Lenovo hardware. This separation between HX1 and the RDS ensures that high loads generated by HX1 do not adversely affect the RDS, and that problems with the RDS do not cause job failures on HX1. This means that you must move data to your HX1 home directory before running jobs (sftp/rsync and other similar tools can be used to transfer data from the RDS to HX1 during this pilot phase). Please read the following sections for more information.
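For example, data could be pushed with rsync from a system that mounts the RDS to your HX1 home directory. This is a sketch only: the function name `stage_to_hx1`, the `DRY_RUN` switch and the example paths are all hypothetical and should be replaced with your own.

```shell
# Hypothetical helper: copy a directory to your HX1 home directory over ssh.
# Usage: stage_to_hx1 <username> <source_dir> <dest_dir_on_hx1>
stage_to_hx1() {
    user=$1; src=$2; dest=$3
    # -a preserves permissions and timestamps, -v lists files, --partial keeps
    # partially transferred files so an interrupted copy can be resumed.
    set -- rsync -av --partial "$src/" "$user@login.hx1.hpc.ic.ac.uk:$dest/"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$@"    # only print the command that would be run
    else
        "$@"
    fi
}

# Example: show the command without running it.
# DRY_RUN=1 stage_to_hx1 username /path/to/rds/project/input input
```

A relative destination such as `input` is resolved against your HX1 home directory. Results can be copied back in the same way by running rsync with the source and destination swapped.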

Home Directory

When you log in to HX1, you will have a home directory automatically created for you on the file system local to HX1. User home directories on HX1 are intended to provide working space for current jobs only, and it is the expectation that users will move their data to other systems (such as the RDS) once the data is no longer needed on the cluster. Accordingly, there is NO BACKUP, DISASTER RECOVERY OR SNAPSHOTS for HX1, as these affect the performance of the filesystem; in the event of major hardware failure, accidental data deletion, file system corruption, etc., the data will be lost. It is therefore imperative that you copy any important files to another storage system such as the RDS once they have been generated.

Quota

Your home directory on hx1 is subject to a default quota of 1 TB and 2 million files/directories (inodes).

Use the mmlsquota command to determine your specific quota level and usage:

/usr/lpp/mmfs/bin/mmlsquota --block-size auto /dev/gpfs

Quota increases are possible on request if justified. We ask that users make efforts to:

  • Minimise the amount of data that needs to be stored for live projects.
  • Avoid having large numbers of files within their home directory.

Users who request a quota increase but are storing many unused files in their home directory will be asked to move or remove these files first before a quota increase is granted. Any quota increases above the default values will be re-assessed every 6 months and users may be asked to provide an updated justification for their quota level.
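Because the quota counts files and directories as well as bytes, it can help to see which directories hold the most entries before requesting an increase. The following sketch uses only standard tools; the function name `count_inodes` is hypothetical, and its result is an approximation of what GPFS reports rather than the authoritative figure from mmlsquota.

```shell
# Hypothetical helper: count files and directories (inodes) under a directory,
# including the directory itself, without crossing file system boundaries.
count_inodes() {
    find "$1" -xdev | wc -l | tr -d ' '
}

# Example: list the entry count for each top-level directory in your home.
# for d in "$HOME"/*/; do printf '%s\t%s\n' "$(count_inodes "$d")" "$d"; done | sort -n
```

Run this sparingly, since walking a very large directory tree itself puts load on the file system.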

Retention of Data

The HX1 file system is meant for live data only and any important files should be copied elsewhere (including to the RDS) after being generated. Minimising the amount of data stored on the HX1 filesystem ensures that it maintains the high performance we need it for. For these reasons, RCS staff will be undertaking the following steps to ensure that the file system is only used for live data:

  • If RCS staff believe that a user is storing unused data in their home directory, then they may be contacted and asked to move the data to another storage space if justification cannot be provided.
  • If RCS staff believe an account on HX1 has been unused for 6+ months, then we will contact the user and ask them to clean up their home directory. If we have not received a response from the user or registered supervisor within 6 weeks, then the RCS staff reserve the right to remove the data from the home directory.
  • If a user leaves the university and their account becomes deactivated, then the data in that home account will be removed promptly after the user has left. It is the responsibility of the user to ensure that a copy of important data exists elsewhere before they leave the university.

Shared Project Areas

We are still working on a solution for shared project areas; however, we welcome requests for these from either HPC or RDS Project Admins. There will be no charge for these shared spaces, but they are for live data sharing only and any quota must be justified (with a review every 6 months). As with home directories, data within these shared areas that has not been accessed for some time is at risk of being deleted.

Software

This section will explain how to access software that has been centrally installed on HX1.

Please make sure any software you run on HX1 has been optimised for the hardware. Ideally use software provided by us (which has already been optimised) or otherwise please make sure to use relevant optimisation flags when compiling your own software. Please avoid simply copying binaries from other systems such as CX3, unless the software is commercial and/or only the binary is available.

Loading Applications

Loading modules/applications on HX1 is similar to that on CX3 except it is not necessary to load the "production" modules (tools/prod). Please refer to our main Loading Applications page for advice on how to load modules on HX1.

EasyBuild

Most of the software on HX1 is installed using the EasyBuild software installation system and has been optimised for the hardware. Please see our EasyBuild page for more information.

Python and Conda Environments

When using Python on HX1, we recommend that users use the modules that have been installed via EasyBuild and are accessible via the module system, as these packages have been tuned for the HX1 hardware. If the software you need is not available, please raise a request and we will check whether it can be installed via EasyBuild.

We understand that it may not always be possible or desirable to use the modules we provide, and users may wish to manage their Python toolchains using conda instead. If you do want to use conda, we strongly recommend you use miniforge, which is a minimal installer for conda, tailored to use the conda-forge channel by default.

Installing miniforge

Miniforge is a minimal installer for conda that is tailored to use the conda-forge channel by default. It installs both the conda and mamba tools for managing your environments.

The following instructions are based on those found on the Miniforge repository but have been adapted for use on HX1.

[username@hx1-c12-login-1 ~]$ curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh"
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 89.3M  100 89.3M    0     0  93.1M      0 --:--:-- --:--:-- --:--:--  194M

You should end up with a file called Miniforge3-Linux-x86_64.sh. You can then run the installer with:

[username@hx1-c12-login-1 ~]$ bash Miniforge3-Linux-x86_64.sh

Welcome to Miniforge3 25.3.0-3

In order to continue the installation process, please review the license
agreement.
Please, press ENTER to continue
>>>

After accepting the license, you will be asked to confirm where to install Miniforge3. The default location of miniforge3 in your home directory is fine for most circumstances.

Once the files have finished unpacking, you will be asked:

Do you wish to update your shell profile to automatically initialize conda?
This will activate conda on startup and change the command prompt when activated.
If you'd prefer that conda's base environment not be activated on startup,
   run the following command when conda is activated:

conda config --set auto_activate_base false

You can undo this by running `conda init --reverse $SHELL`? [yes|no]
[no] >>>

We strongly advise that you do not update your shell profile i.e. respond with "no".

You can now enable miniforge in your environment by running the shell hook:

[username@hx1-c12-login-1 ~]$ eval "$(/gpfs/home/username/miniforge3/bin/conda shell.bash hook)"
(base) [username@hx1-c12-login-1 ~]$

You can then use the conda commands as usual. Please see our Conda page for more information on using conda.
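Since the shell profile is not modified, the hook has to be run in every new shell, including inside job scripts. A minimal sketch of a guard that could be placed near the top of a script is shown below. The function name `activate_miniforge` and the environment name `myenv` are hypothetical, and the default path assumes Miniforge was installed to the default location in your home directory.

```shell
# Hypothetical helper: enable conda from a Miniforge installation.
# Defaults to $HOME/miniforge3 unless another prefix is given.
activate_miniforge() {
    prefix=${1:-$HOME/miniforge3}
    if [ ! -x "$prefix/bin/conda" ]; then
        echo "conda not found under $prefix" >&2
        return 1
    fi
    # Same shell hook as shown above, using the chosen prefix.
    eval "$("$prefix/bin/conda" shell.bash hook)"
}

# Example use in a job script:
# activate_miniforge && conda activate myenv
```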

Job Submission

Job submission on HX1 (including the use of PBS directives) is identical to CX3 (phase 2) except that there are differences in some of the queues. It is therefore important that:

  • You review this section in its entirety before submitting any jobs
  • You don't simply copy your existing submission script to HX1 and run it, without any modification.

Submitting Jobs

The qsub command can be used to submit jobs to the queue i.e.:

[username@hx1-c12-login-1 ~]$ qsub -q hx myjob.pbs

Jobs can be monitored using the qstat command.

Job Sizing Guidance

The following queues of jobs are supported:

Queue         Use cases                  Nodes per job  Cores per node (ncpus)  Mem per node (GB)  Walltime (hrs)
small24       Low core jobs 24h          1              1 - 16                  1 - 128            0 - 24
small72       Low core jobs 72h          1              1 - 16                  1 - 128            24 - 72
medium24      Single-node jobs 24h       1              1 - 64                  1 - 450            0 - 24
medium72      Single-node jobs 72h       1              1 - 64                  1 - 450            24 - 72
4nodes24      Two to four node jobs 24h  2 - 4          1 - 64                  1 - 450            0 - 24
4nodes48      Two to four node jobs 48h  2 - 4          1 - 64                  1 - 450            24 - 48
capability24  Large multi-node jobs 24h  5 - 32         1 - 64                  1 - 450            0 - 24
capability48  Large multi-node jobs 48h  5 - 32         1 - 64                  1 - 450            24 - 48
a100          Main queue for GPU jobs*   1              1 - 72                  1 - 920            0 - 72

* Please see details for specific queues below as there may be additional restrictions or limitations.
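To make the table concrete, the sketch below maps a resource request onto the queue whose limits it fits. It is for intuition only and is not how PBS routes jobs. The function name `pick_queue` is hypothetical, and treating a walltime of exactly 24 hours as belonging to the shorter queue is an assumption, since the table lists 24 in both ranges.

```shell
# Hypothetical helper: print the queue from the table above that matches a request.
# Usage: pick_queue <nodes> <ncpus_per_node> <mem_gb_per_node> <walltime_hrs> [ngpus]
pick_queue() {
    nodes=$1; ncpus=$2; mem=$3; hrs=$4; ngpus=${5:-0}

    if [ "$ngpus" -gt 0 ]; then
        # GPU jobs: single node, up to 72 cores, 920 GB and 72 hours.
        if [ "$nodes" -eq 1 ] && [ "$ncpus" -le 72 ] && [ "$mem" -le 920 ] && [ "$hrs" -le 72 ]; then
            echo a100; return 0
        fi
    elif [ "$ncpus" -le 64 ] && [ "$mem" -le 450 ]; then
        if [ "$nodes" -eq 1 ] && [ "$hrs" -le 72 ]; then
            # Single-node jobs: "small" up to 16 cores and 128 GB, otherwise "medium".
            size=medium
            if [ "$ncpus" -le 16 ] && [ "$mem" -le 128 ]; then size=small; fi
            if [ "$hrs" -le 24 ]; then echo "${size}24"; else echo "${size}72"; fi
            return 0
        elif [ "$nodes" -ge 2 ] && [ "$nodes" -le 32 ] && [ "$hrs" -le 48 ]; then
            # Multi-node jobs: 2-4 nodes or 5-32 nodes, in 24 or 48 hour variants.
            size=capability
            if [ "$nodes" -le 4 ]; then size=4nodes; fi
            if [ "$hrs" -le 24 ]; then echo "${size}24"; else echo "${size}48"; fi
            return 0
        fi
    fi
    echo "no matching queue" >&2
    return 1
}

# Example: 8 nodes, 64 cores and 200 GB per node, 24 hours.
# pick_queue 8 64 200 24
```

With the arguments in the example, the helper prints capability24. A request that exceeds every row of the table (for instance, more than 32 nodes) prints an error and returns a non-zero status.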

Single core to single node jobs

Small jobs (single core to single node) should not be routinely run on HX1 other than as part of a wider workflow as these jobs take away from the limited resource pool set aside for multi-node jobs. These jobs are better suited to the CX3 Phase 2 facility.

capability24

The maximum number of cpus allocated to a capability24 job is 2048.

capability48

The maximum number of cpus allocated to a capability48 job is 2048.

a100

There is an additional limit of 12 GPUs in total per user on the a100 queue to allow for fair usage of the GPUs.

Example PBS Jobs

MPI Jobs

The following example requests 8 compute nodes, each with 64 cores and 200GB of RAM, and runs 64 MPI tasks per compute node (512 MPI tasks in total).

#PBS -l select=8:ncpus=64:mpiprocs=64:mem=200gb

The following example still requests all available cores on each node but runs only 32 MPI tasks per node (256 MPI tasks in total). You may want to do this to under-subscribe the nodes.

#PBS -l select=8:ncpus=64:mpiprocs=32:mem=200gb

Hybrid OpenMP/MPI Jobs

The following example requests 8 compute nodes, each with 64 cores and 200GB of RAM, and runs 32 MPI tasks on each node (256 MPI tasks in total), each task with 2 OpenMP threads.

#PBS -l select=8:ncpus=64:mpiprocs=32:ompthreads=2:mem=200gb

Please be careful with hybrid jobs as some MPI distributions (such as OpenMPI) automatically pin the processors to cores and unless you specify the pinning method, you may lose significant performance (further details below). It is also advisable to change the default process placement from "free" to "scatter" to ensure the MPI ranks are distributed evenly across the requested nodes.
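The arithmetic behind the select directives above can be sketched as follows: the total number of MPI tasks is the number of nodes multiplied by mpiprocs, and mpiprocs multiplied by ompthreads should not exceed ncpus. The function name `pbs_select` is hypothetical and exists only to illustrate the calculation.

```shell
# Hypothetical helper: print a select directive and the resulting totals.
# Usage: pbs_select <nodes> <ncpus> <mpiprocs> <ompthreads> <mem_gb>
pbs_select() {
    nodes=$1; ncpus=$2; mpiprocs=$3; ompthreads=$4; mem=$5
    # Each MPI task needs ompthreads cores, so the product must fit on a node.
    if [ $((mpiprocs * ompthreads)) -gt "$ncpus" ]; then
        echo "mpiprocs x ompthreads exceeds ncpus" >&2
        return 1
    fi
    echo "#PBS -l select=${nodes}:ncpus=${ncpus}:mpiprocs=${mpiprocs}:ompthreads=${ompthreads}:mem=${mem}gb"
    echo "# total MPI tasks: $((nodes * mpiprocs)), total cores: $((nodes * ncpus))"
}

# Example: the hybrid request above (8 nodes, 32 tasks per node, 2 threads per task).
# pbs_select 8 64 32 2 200
```

For the example arguments this prints the hybrid select directive shown above, followed by a comment line reporting 256 MPI tasks and 512 cores in total.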

MPI Distribution Specific Information

Intel MPI

The following example demonstrates how the HPCG benchmark application, built with the centrally installed Intel toolchain, can be run across multiple compute nodes (please adapt these for your own purposes):

#!/bin/bash
#PBS -l walltime=00:30:00
#PBS -l select=2:ncpus=64:mpiprocs=64:mem=200gb

cd $PBS_O_WORKDIR

module purge
module load HPCG/3.1-intel-2022a

mpirun -v6 xhpcg

Note that it is not necessary to provide Intel MPI with a list of hosts because it can determine this directly from PBS Pro. For hybrid jobs, you can simply follow the advice above, e.g.:

#!/bin/bash
#PBS -l walltime=00:30:00
#PBS -l select=2:ncpus=64:mpiprocs=32:ompthreads=2:mem=200gb
#PBS -l place=scatter

cd $PBS_O_WORKDIR

module purge
module load HPCG/3.1-intel-2022a

mpirun -v6 xhpcg

This will run 32 MPI processes per node, with 2 OpenMP threads per process. Note the extra "place" directive.

Bootstrapping Intel MPI

If you decide to install your own version of Intel MPI, you should set the following environment variables:

export I_MPI_HYDRA_BOOTSTRAP="rsh"
export I_MPI_HYDRA_BOOTSTRAP_EXEC="/opt/pbs/bin/pbs_tmrsh"
export I_MPI_HYDRA_BRANCH_COUNT=0

and start the job with either mpirun -v6 or mpiexec -v6.

OpenMPI

The following example demonstrates how the HPCG benchmark application, built with the centrally installed Foss (GCC/OpenMPI) toolchain, can be run across multiple compute nodes:


#!/bin/bash
#PBS -l walltime=00:30:00
#PBS -l select=2:ncpus=64:mpiprocs=64:mem=200gb

cd $PBS_O_WORKDIR

module purge
module load HPCG/3.1-foss-2022a

mpirun xhpcg

For hybrid jobs, this is more complicated because OpenMPI pins MPI processes to cores by default. Again, note the scatter directive for hybrid jobs.

#!/bin/bash
#PBS -l walltime=00:30:00
#PBS -l select=2:ncpus=64:mpiprocs=32:ompthreads=2:mem=200gb
#PBS -l place=scatter

cd $PBS_O_WORKDIR
cat $PBS_NODEFILE

module purge
module load HPCG/3.1-foss-2022a

mpirun --map-by numa:PE=${OMP_NUM_THREADS} xhpcg

This maps the MPI processes across the NUMA domains and ensures that a number of processing elements (cores) equal to OMP_NUM_THREADS is bound to each MPI process. Alternatively, CPU binding can be disabled with:

mpirun --bind-to none xhpcg

GPU Jobs

GPU Specification

GPU Type       Single Precision (TFLOPS)  Double Precision (TFLOPS)  Memory (GB)  Memory Bandwidth (GB/s)  CUDA Compute Capability  GPU Architecture
A100 80GB SXM  19.5                       9.7                        80           2,039                    8.0                      Ampere

Multi-node GPU Jobs

We are observing issues with some multi-node GPU jobs on HX1: AI frameworks such as TensorFlow and PyTorch either do not run in a performant manner or are simply unable to establish a connection between compute nodes. We presently believe this is most likely due to a lack of compatibility between these frameworks and an IPv6 network environment. We are continuing to work towards a solution.

GPU accelerated "science codes" such as GROMACS and LAMMPS appear to work fine in a multi-node environment.

Example GPU Jobs

#!/bin/bash
#PBS -l walltime=00:30:00
#PBS -l select=1:ncpus=18:mem=200gb:ngpus=1:gpu_type=A100

cd $PBS_O_WORKDIR
cat $PBS_NODEFILE
module purge
module load TensorFlow/2.11.0-foss-2022a-CUDA-11.7.0

python my_TF_code.py

Here we have requested 18 CPUs and 200 GB of RAM because each node has a total of 72 CPUs and 1 TB of RAM shared between 4 GPUs. Users should therefore ideally request no more than 25% of a node's CPUs and RAM per GPU.
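Scaling the same proportions to more than one GPU on a node can be sketched as below. The function name `gpu_select` is hypothetical, and the figures of 18 cores and 200 GB per GPU are taken from the example above rather than from any enforced limit.

```shell
# Hypothetical helper: print a select directive that scales the example above
# (18 cores and 200 GB per GPU) to the requested number of GPUs on one node.
gpu_select() {
    ngpus=$1
    if [ "$ngpus" -lt 1 ] || [ "$ngpus" -gt 4 ]; then
        echo "a node has 4 GPUs; request between 1 and 4" >&2
        return 1
    fi
    echo "#PBS -l select=1:ncpus=$((18 * ngpus)):mem=$((200 * ngpus))gb:ngpus=${ngpus}:gpu_type=A100"
}

# Example: a two-GPU request.
# gpu_select 2
```

For two GPUs this prints a directive requesting 36 cores and 400 GB of RAM, which keeps the request at half of the node's cores for half of its GPUs.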

MPI GPU Jobs (single node)

When using MPI you must specify the number of MPI processes in the PBS select directive or PBS will default to 1 process. For example,

#!/bin/bash
#PBS -l walltime=00:10:00
#PBS -l select=1:ncpus=2:mpiprocs=2:mem=80gb:ngpus=2:gpu_type=A100

cd $PBS_O_WORKDIR
cat $PBS_NODEFILE

module purge
module load NCCL/2.12.12-GCCcore-11.3.0-CUDA-11.7.0
module load OpenMPI/4.1.4-GCC-11.3.0

mpirun -n 2 ./your_job
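Many MPI applications choose a GPU for each rank themselves. Where an application does not, one common approach is a small wrapper that restricts each rank to a single device. The sketch below is hedged accordingly: the function name `gpu_for_rank` and the wrapper file name are hypothetical, OMPI_COMM_WORLD_LOCAL_RANK is the per-node rank variable set by OpenMPI, and it is an assumption that the GPUs assigned to the job are exposed through CUDA_VISIBLE_DEVICES.

```shell
# Hypothetical helper: print the device that a given local rank should use,
# chosen from a comma-separated device list.
# Usage: gpu_for_rank <local_rank> <device_list>
gpu_for_rank() {
    rank=$1
    count=$(printf '%s\n' "$2" | tr ',' '\n' | wc -l | tr -d ' ')
    # Ranks wrap around if there are more ranks than devices.
    printf '%s\n' "$2" | tr ',' '\n' | sed -n "$((rank % count + 1))p"
}

# Example wrapper body (launched as: mpirun -n 2 ./gpu_wrapper.sh ./your_job):
# export CUDA_VISIBLE_DEVICES=$(gpu_for_rank "${OMPI_COMM_WORLD_LOCAL_RANK:-0}" "${CUDA_VISIBLE_DEVICES:-0}")
# exec "$@"
```

Check whether your application already handles device selection before adding a wrapper like this, as doing both can leave some GPUs unused.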

Known Issues

Intel MPI 2021b jobs not starting

Updated 3rd Jan 2024: We believe this issue has now been fixed. If you continue to see an issue using Intel MPI 2021a/b, please let us know by raising a ticket.

We've observed an issue with Intel MPI 2021b jobs not starting on the cluster as communication cannot be established between the nodes. While we identify the cause of the issue, please use 2022a and newer if possible.