R

We have many versions of R already available via EasyBuild ready for you to use that are optimised for our systems.

EasyBuild R

To view the versions of R available for you to use, you can do:

[user@login ~]$ module purge
[user@login ~]$ module load tools/prod
[user@login ~]$ module spider R

module purge is useful if you have loaded other modules that may interfere, otherwise it's not necessary. This will also load the gateway module tools/prod which will allow you to view modules which are currently in the production software stack.

To load R/4.2.1-foss-2022a as an example, you can do the following:

[user@login ~]$ module load tools/prod
[user@login ~]$ module load R/4.2.1-foss-2022a

Remember, that the login nodes are not for running extensive calculations, especially if they are heavy on the file system. If you need to run jobs interactively, please look at the interactive queue section.

Conda and R

To begin, following the instructions in the Conda application guide to setup Conda for your account. Assuming you have done this, the following set of instructions would create an environment called r413 containing R version 4.1.3.

[user@login ~]$ eval "$(~/miniforge3/bin/conda shell.bash hook)"
[user@login ~]$ conda create -n r413 r-base=4.1.3 -c conda-forge
[user@login ~]$ source activate r413

You should then see (r413) on the left hand side which indicates you are in that environment. For installing R packages, it is generally best to stick to the conda-forge channel or the R channel. For example, to install additional packages from the R channel:

[user@login ~]$ conda search -c r "r-*" 
[user@login ~]$ conda install -c r r-png 

An activated environment can be deactivated using:

[user@login ~]$ conda deactivate

Bioconductor

Follow the instructions above to create a suitable conda environment, such as r413. Once you have created this environment, you can search for bioconductor packages by running the following:

[user@login ~]$ conda search bioconductor-*

and you can install bioconductor packages such as bioconductor-teqc by running:

[user@login ~]$ conda install bioconductor-teqc

RStudio

Please note that access to RStudio is provided by the Open OnDemand service.

Running R in Parallel

There are a few ways to run R in Parallel. Probably the simplest is to use Array Jobs to split your work into separate subjobs and pull results together at the end. If this isn't possible then the next simplest is to use the doParallel library within R. There are a few ways to use this library so examples have been given for each. To test these methods a mostly non-sense function is created that will take the place of your work function. Ideally this function should take some seconds or more to run, otherwise the overhead of creating/destroying parallel processes is unlikely to be worth it.

fun <- function(n) {
    size = 1000
    start = size*n
    end = size*(n+1)
    startData = start:end
    endData <- 0:size;
    foreach (i=0:size) %do% {
        endData[i] <- startData[i]^2
    }
    return (sum(endData))
}

This test function manually creates a vector, takes the square of each element, and then returns the sum. Again this function isn't optimised and it just a place holder. Users should take the time to optimise their equivalent function(s) as much as possible.

Getting the right number of cores

Info

In order for this to work correctly, you must make sure to request the ompthreads flag in your jobscript. For example: #PBS -lselect=1:ncpus=8:ompthreads=8:mem=4gb

You can detect the correct number of cores available to your job using future::availableCores(). The function within the parallel::detectCores(), will report the wrong number, the total number of cores in the node, which are not available to you. Using this one when parallelising your code might have unexpected consequences. In other words, you can use the following code snippet to get the maximum number of cores available to you:

library("future")
numOfCores = future::availableCores()

Foreach+dopar

Creating a cluster

It is generally better to create and destroy the cluster to ensure the number of CPUs is set correctly. numOfCores should match the total number of CPUs requested in the PBS script, as described above.

library(doParallel)
numOfCores=4
cl <- makeCluster(numOfCores, type="FORK")
registerDoParallel(cl)
parallelResults <- c()

Updating Loop

If you already have the function being called in foreach loop then it can be very easy to substitute the %do% for %dopar%

dataRange <- 1:100;
parallelResults <- c();
foreach (n=dataRange) %dopar%  {
    parallelResults[n] <- fun(n)
}

mclapply

This is the parallel version of lapply and often one can be replaced with the other. The advantage here is the you don't have to create/destroy the cluster, the disadvantage is everything must fit inside of lapply. Again make sure numOfCores matches the number of cpus you are requesting in PBS.

library(doParallel)
dataRange <- 1:100;
numOfCores=4
ParallelResults <- mclapply(dataRange, fun,mc.cores=numOfCores)

Comparing all Methods

We can see that mclapply is the fastest although again the flexibility of foreach might make that the preferred method for a lot of workflows.

The estimated time using foreach: 14.834 seconds
The estimated time using foreach+dopar: 3.988 seconds
The estimated time using lapply() function: 14.276 seconds
The estimated time using mclapply() function: 3.722 seconds

With numOfCores=4