Allocating Resources and Running Jobs
At the end of this class, you should be able to:
- Allocate resources using SLURM.
- Run jobs in interactive mode.
- Run jobs by submitting batch scripts.
- Use various software (CFD included) on the cluster.
Running a job on HPC
There are two basic ways of accessing resources and running jobs on the cluster:
- run jobs interactively (supervised), or
- run jobs non-interactively (unsupervised).
Both approaches are explored in this class.
Running a job in interactive mode
To allocate resources on the cluster interactively, we use the SLURM command `salloc`. The `salloc` command is used to obtain a SLURM job allocation, which is a set of resources (nodes), possibly with some set of constraints (e.g. number of processors per node, or memory per processor). When `salloc` successfully obtains the requested allocation, the user can run commands directly on the compute node. Finally, when the allocation time is up, `salloc` relinquishes the job.
From the login node type:
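For example, a minimal request might look like this (a sketch: the memory value and the account name `def-someuser` are placeholders, so substitute your own group name):

```bash
# request 1 processor for 10 minutes with 500 MB of memory per CPU
salloc -n 1 --time=00:10:00 --mem-per-cpu=500M --account=def-someuser
```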
As you can see, the `salloc` command requires a few user inputs:
- `-n`: the number of processors.
- `--time`: how long you are requesting the resources for. The format is `hh:mm:ss`.
- `--mem-per-cpu`: the amount of memory to assign to each CPU. This is a very important parameter, and it will become clear in later sections of this course how to select it properly.
- `--account`: the group name on the HPC system, which is an alias of the Research Allocation Project Identifier (RAPI). Typically, group names follow this convention (where `xx` represents a series of numbers):
  - Default: `def-[profname][-xx]`
  - Priority: `rrg-[profname][-xx]`
  - Cloud: `crg-[profname][-xx]`

The `salloc` command can be customized according to user preferences and needs. To learn more about SLURM and `salloc` click HERE.
These parameters, along with the current system utilization, determine how long it will take for the queue manager to allocate the resources. For instance, requesting 2 processors will take much less time than requesting 512, and a 10-minute job will be allocated much faster than a 20-hour one.
Here are a few limits to keep in mind when allocating resources on an Alliance HPC system: Niagara accepts jobs of up to 24 hours run-time; Béluga, Graham and Narval up to 7 days; and Cedar up to 28 days. On general-purpose clusters, longer jobs are restricted by partitions to use only a fraction of the cluster. There are partitions for jobs of
- 3 hours or less,
- 12 hours or less,
- 24 hours (1 day) or less,
- 72 hours (3 days) or less,
- 7 days or less, and
- 28 days or less.
Because any job of 3 hours is also less than 12 hours, 24 hours, and so on, shorter jobs can always run in partitions with longer time limits. A shorter job therefore has more scheduling opportunities than an otherwise-identical longer job. On Béluga, a minimum time limit of 1 hour is also imposed. More about job scheduling policies can be found HERE.
When you press enter, the output should be similar to this:
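The exact wording varies between clusters and SLURM versions, but it should resemble the following sketch (the job ID and node name are the ones discussed below):

```
salloc: Pending job allocation 24928305
salloc: job 24928305 queued and waiting for resources
salloc: job 24928305 has been allocated resources
salloc: Granted job allocation 24928305
salloc: Waiting for resource configuration
salloc: Nodes gra287 are ready for job
```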
In these steps, SLURM tells us that a job with ID 24928305 has been created for us, that it is in the queue waiting for resources, and, eventually, that the resources have been found and granted. Finally, the resources have been configured and a node, `gra287`, is now ready for our task. You will also notice that the command-line prompt has changed to:
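something like the following, where `username` stands in for your own user name:

```
[username@gra287 ~]$
```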
This means that we are no longer sitting on the login node: we are now on a compute node whose identifier is gra287. One might ask: now that I am on a compute node, how do I run my code that is sitting in my `/home` directory? Here is a beautiful feature of a computer cluster: when running the `ls` command from the compute node, you will realize that, although you are on a completely different node, you can still see the exact same files and directories you had in your home. This is due to the architecture of the cluster, which uses a distributed, shared file system mounted on every node. That means you will be able to run your code or access your files from anywhere in the cluster. You can check the status of your job at any time using the SLURM command `squeue -u username`, or simply `sq`. If you open a new command-line window, log into Graham and, from the login node, type `sq`, here is what you will see:
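The two extracts below are a sketch of what `sq` typically prints, first while the job is pending and then once it is running (the user name, account, memory and column layout will differ from case to case):

```
   JOBID     USER      ACCOUNT         NAME  ST  TIME_LEFT NODES CPUS MIN_MEM NODELIST (REASON)
24928305 username def-someuser  interactive  PD      10:00     1    1    500M  (Priority)
```

```
   JOBID     USER      ACCOUNT         NAME  ST  TIME_LEFT NODES CPUS MIN_MEM NODELIST (REASON)
24928305 username def-someuser  interactive   R       9:38     1    1    500M gra287 (None)
```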
Some important things to note:
- The job name is “interactive”, as we were expecting.
- The status `ST` is pending (`PD`) in the first extract, and running (`R`) in the second.
- The number of nodes and the number of CPUs reflect what was defined by the `salloc` command.
If you want to stop/cancel the job at once, you can use the command `scancel` followed by the job ID. In our case:
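With the job ID reported by SLURM above, that is:

```bash
scancel 24928305
```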
With the `salloc` command we ran earlier, we requested an allocation for 10 minutes. Once the requested time is over, SLURM automatically releases the resources and you are back on the login node. Here is the message you will receive when the time is up:
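The exact wording depends on the SLURM version, but it will be something like:

```
salloc: Job allocation 24928305 has been revoked.
```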
Submitting a job in non-interactive mode
Most of the time it is not necessary for the user to interact with the compute node; therefore, the most common way of running a job on the cluster is by submitting a script to SLURM. Think of it as ordering food at a restaurant: your order is written on a piece of paper (the batch script), which is handed to the restaurant manager (the queue manager), who finds you a table and sends the order to the kitchen (allocating the resources and running the job). Your order might be executed right away, or you might wait in a queue depending on how busy the restaurant is. The batch file is usually divided into two parts:
- In the first part, you give SLURM all the directives (or options) for the job. Each line containing these parameters must be preceded by `#SBATCH` in the script. All available directives are described on the sbatch page.
- In the second part, you specify any commands that must be executed as part of the job.
A very simple batch script might look like this:
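Here is a sketch consistent with the description below; the account name is a placeholder:

```bash
#!/bin/bash
#SBATCH --ntasks=1                # number of tasks (1 core)
#SBATCH --mem-per-cpu=256M        # memory per core
#SBATCH --time=00:15:00           # walltime (hh:mm:ss)
#SBATCH --account=def-someuser    # placeholder group/account name

python ./myfirstjob.py            # the Python script to execute
```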
In the above example, `./myfirstjob.py` is a standard Python script that will be run. On general-purpose (GP) clusters, this job reserves 1 core and 256 MB of memory for 15 minutes.
Allocation of multiple cores
As we mentioned earlier, the benefit of using a computer cluster to run numerical simulations is the ability to use multiple cores. In this regard, we highlight below a few rules to keep in mind:
- Allocating whole nodes: if the goal is to allocate more than 32 cores, one might consider requesting a whole node. To do that, it is worth checking the architecture of the Alliance HPC systems and how many cores per node each cluster has. Graham, for instance, has 32 cores per node with a total usable memory of 125 GB (about 3.9 GB per core). Therefore, to request a whole node on Graham, the batch script will look like the sketch shown after this list, where `--mem=0` is interpreted by SLURM to mean “reserve all the available memory on each node assigned to the job”, and `srun` is the command used to execute an application on multiple processes. If one does not need 32 cores but still wants all allocated cores to be on the same node, the number of tasks per node can be specified with `--ntasks-per-node`. Choosing the number of cores to request, and whether or not to request whole nodes, may be a trade-off between running time (or efficient use of the computer) and waiting time (or efficient use of your time). If you want help evaluating these factors, please contact the Alliance technical support.
- Enabling multi-threading: it is important to note that the number of tasks requested in the batch script corresponds to the number of processes that will be started by the `srun` command. If one wants to enable multi-threading, it is good practice to set the process count with `--ntasks` or `--ntasks-per-node`, and to set the thread count with `--cpus-per-task`. More details on how to run a job with multi-threading enabled can be found HERE.
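Here is a sketch of the whole-node request described in the first point above (the walltime and account name are placeholders, and `./mycode` stands for whatever executable you are running):

```bash
#!/bin/bash
#SBATCH --nodes=1                 # request one whole node
#SBATCH --ntasks-per-node=32      # all 32 cores of a Graham node
#SBATCH --mem=0                   # reserve all the available memory on the node
#SBATCH --time=01:00:00           # placeholder walltime
#SBATCH --account=def-someuser    # placeholder group/account name

srun ./mycode                     # start one process per task (32 in total)
```

For a multi-threaded run, one would instead reduce `--ntasks-per-node` to the desired number of processes and add `--cpus-per-task` to set the number of threads each process may use.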
Now that you know how to allocate resources on the cluster, you are ready to run computations! However, you might be wondering: how do I install my CFD package on the cluster? Here comes yet another beautiful feature of most supercomputers: the module system.
The module system
The module system is a concept available on most clusters worldwide and guarantees the use of different software (and different versions of the same software) in a precise and controlled manner. In most cases, a supercomputer has far more software installed than the average user will ever need. Therefore, the settings for all these software packages and their supported versions are encapsulated in “environment modules” maintained by the module system. These modules are in no way related to modules in `perl` or `python` and should not be confused with them. To get an idea of which modules are already loaded in your current profile, from the login node run the command:
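On Alliance clusters, which use the Lmod module system, that is:

```bash
module list
```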
You will notice that some basic modules such as `openmpi`, `python`, or `intel` may already be loaded. If the module or software you are looking for is not in the list of loaded modules, run the command:
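The standard command for this is:

```bash
module avail
```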
This shows which modules are available on this particular cluster. You will notice that:
- There are a number of different versions of the same software.
- The modules are organized and colour-coded by discipline. For example, `bio` stands for modules in Bioinformatics libraries/apps and `phys` stands for modules in Physics libraries/apps.
- If you are in the CFD field, there are common software packages such as `starccm`, `openfoam`, and `gmsh`.
- Modules can be entire software packages such as `openfoam` and `paraview`, or libraries such as `fftw` or `hdf5`.
Standard environments identify combinations of specific compiler and MPI modules that are used most commonly by our team to build other software. These combinations are grouped in modules named `StdEnv`. As of February 2023, there are four such standard environments, versioned `2023`, `2020`, `2018.3` and `2016.4`, with each new version incorporating major improvements. Only versions 2020 and 2023 are actively supported. To learn more about the module system on Alliance clusters click HERE.
Once you have chosen the module you need, all you do is run the command:
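For example, with `<modulename>` standing for the module you chose:

```bash
module load <modulename>
```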
1. When loading a module, it is good practice to check whether the desired module has dependencies that must be loaded first, by typing `module spider <modulename>`.
2. When loading a module, it is also good practice to load the exact version of the software you are looking for, by typing `module load <modulename/X.X.X>`, where `X.X.X` is the version number.
3. The modules you load will only survive in your profile for the duration of the session. Once you log out of the login node, the previously loaded modules are lost and you will need to load them again.
4. Following (3), modules also need to be re-loaded on compute nodes.
However, if you are satisfied with the loaded modules and you know that you will use them constantly, you can run the command:
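On Lmod-based systems such as the Alliance clusters, that is:

```bash
module save
```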
This saves the current module setting as the default configuration. There are several other useful commands in the module system environment, such as `module purge` to unload all the currently loaded modules, and `module switch x y` to switch module `x` with module `y`. And now you are REALLY ready to run your first job on the cluster!
In this very simple example code, we check if a specific year is a leap year. Try running this code on the cluster using all the methods learned so far.
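The original script is not reproduced here, so the following is only a minimal sketch of what it might look like; the file name `isleap.py` (chosen to match the output files mentioned below) and the hard-coded year are assumptions:

```python
# isleap.py -- check whether a given year is a leap year (illustrative sketch)
year = 2024  # placeholder year to test

# a year is a leap year if it is divisible by 4,
# except for century years, which must also be divisible by 400
if (year % 4 == 0 and year % 100 != 0) or (year % 400 == 0):
    print(f"{year} is a leap year")
else:
    print(f"{year} is not a leap year")
```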
The first option is to run the script directly on the login node. After checking that the Python module has been loaded, from the login node we simply type:
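Assuming the script is saved as `isleap.py`, as in the sketch above, this is simply:

```bash
python isleap.py
```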
The second option is to run the script on a compute node in interactive mode. The first step, as mentioned previously, is allocating the resources:
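Matching the request described just below, with a placeholder account name:

```bash
salloc -n 1 --time=00:10:00 --mem-per-cpu=500M --account=def-someuser
```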
Here we requested a single processor for 10 minutes with 500 MB of memory (more than enough for the task at hand).
The third and final option is to run the script by submitting a batch script to SLURM. Let's create a file “firstjob.sh” to give SLURM all the necessary information about the job. The batch script will look something like:
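A sketch of what “firstjob.sh” could contain; the walltime, memory, and account name are placeholders:

```bash
#!/bin/bash
#SBATCH --ntasks=1                # number of tasks (processes started by srun)
#SBATCH --time=00:10:00           # placeholder walltime
#SBATCH --mem-per-cpu=500M        # placeholder memory per core
#SBATCH --account=def-someuser    # placeholder group/account name
#SBATCH --output=isleap.out       # file where the standard output is written
#SBATCH --error=isleap.err        # file where error messages are written

srun python isleap.py             # run the leap-year script
```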
Because the submitted job will run unsupervised, it is good practice to ask SLURM to produce an output file (isleap.out), in which the output of our code will be printed, and an error file (isleap.err), in which the error messages will be printed in case there is a problem. To run the batch script, simply type:
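From the directory containing “firstjob.sh”:

```bash
sbatch firstjob.sh
```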
Once again, you can check the status of your simulation by running the command `squeue -u username` or `sq`, and upon completion you will notice that a new file named “isleap.out” has been created in your current directory, the content of which is exactly:
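With the placeholder script sketched above, that would be something like:

```
2024 is a leap year
```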
You will also notice that the file “isleap.err” has been created; however, its content is empty because no error was encountered during execution. Finally, note that the number of tasks requested from SLURM is the number of processes that will be started by `srun`. Because this code is not written in parallel mode, if more tasks are selected, the different CPUs will all perform the same operation. TASK: in the previous example, modify the batch script to ask for 4 processors and run the job again. What does the output look like?
A bit more terminology
Before we move on to the next section of the course, there are a couple of concepts that we need to familiarize ourselves with first. These are:
- The concept of time in computing, and
- The concept of efficient use of HPC resources.
Time in computing
In HPC terminology there are two key definitions we need to be very comfortable with:
- Walltime: or, more precisely, “elapsed real time”, is the length of time, measured in seconds, that a program takes to run (i.e. execute all its assigned tasks). The walltime is independent of how many resources are used; it is the time it takes according to a clock mounted on the wall.
- CPU hours: the amount of CPU time spent processing. Imagine, for example, executing a program for a walltime of 1 hour on 32 CPUs: we will have used 32 CPU hours.
Efficient use of HPC resources
We introduced the concept of performance in computing earlier, when discussing FLOPS and clock speed; however, these concepts need to be revisited in the context of HPC clusters, where we have a large number of CPUs (and nodes) available.
Say you are a professional copyist. After your long experience, you have optimized your workflow so that you copy an entire page of a manuscript with no difficulty. However, if you are now tasked with copying a 1000-page manuscript, despite your talent it will probably take you a LONG time, more than any employer will ever wait for. A very simple solution would be to delegate some of the workload to nine other copyists that could assist you (while of course communicating between each other), increasing the chances of getting the work done much faster. In very simple terms, our goal will be to distribute the workload effectively and efficiently over the resources we have available, in order to reach the final result in a reasonable time frame.
If you are taking this course, you are probably an application user and not an application developer, meaning that you are not really interested in developing more efficient CFD tools, but rather in using your favourite CFD package as efficiently as possible. Making our workflow more efficient will be the basis of the next section.
QUIZ
Your goal is to write a batch script to run a Python code `problem1.py` for 8 hours on 2 nodes, each containing 32 processors. The code requires Python 3.9.6. Answer the following questions:
1.3.1 How much walltime do you need?
1.3.2 How many CPU hours would you need?
1.3.3 Write a batch script to submit this job using SLURM.
1.3.3 Solution
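A possible solution, sketched with a placeholder account name and placeholder output/error file names (the `python/3.9.6` module satisfies the stated version requirement):

```bash
#!/bin/bash
#SBATCH --nodes=2                 # 2 nodes...
#SBATCH --ntasks-per-node=32      # ...with 32 processors each
#SBATCH --mem=0                   # reserve all the available memory on each node
#SBATCH --time=08:00:00           # 8 hours of walltime
#SBATCH --account=def-someuser    # placeholder group/account name
#SBATCH --output=problem1.out     # placeholder output file
#SBATCH --error=problem1.err      # placeholder error file

module load python/3.9.6          # load the required Python version

srun python problem1.py           # run the code on 64 processes
```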
Having finished this class, you should now be able to answer the following questions:
- How do I run a job in interactive mode on the cluster?
- How do I submit a batch script to SLURM?
- How do I load my favourite CFD software?