Allocating Resources and Running Jobs
At the end of this class, you should be able to:
- Allocate resources using SLURM.
- Run jobs in interactive mode.
- Run jobs by submitting batch scripts.
- Use various software (CFD included) on the cluster.
Running a job on HPC
There are two basic ways of accessing resources and running jobs on the cluster:
- run jobs interactively (supervised), or
- run jobs non-interactively (unsupervised).
Both approaches are explored in this class.
Running a job in interactive mode
To allocate resources on the cluster interactively, we use the SLURM command `salloc`. The `salloc` command is used to obtain a SLURM job allocation, which is a set of resources (nodes), possibly with some set of constraints (e.g. number of processors per node, or memory per processor). When `salloc` successfully obtains the requested allocation, the user can run commands directly on the compute node. Finally, when the allocation time is up, `salloc` relinquishes the job.
From the login node type:
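For example, a minimal request might look like this (a sketch: the memory value and the account name `def-someuser` are placeholders, so substitute your own group name):

```bash
# request 1 processor for 10 minutes with 500 MB of memory per CPU
salloc -n 1 --time=00:10:00 --mem-per-cpu=500M --account=def-someuser
```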
As you can see, the `salloc` command requires a few user inputs:
- `-n`: the number of processors.
- `--time`: how long you are requesting the resources for. The format is `hh:mm:ss`.
- `--mem-per-cpu`: the amount of memory to assign to each CPU. This is a very important parameter, and it will become clear in later sections of this course how to select it properly.
- `--account`: the group name on the HPC system, which is an alias of the Research Allocation Project Identifier (RAPI). Typically, group names follow this convention (where `xx` represents a series of numbers):
  - Default: `def-[profname][-xx]`
  - Priority: `rrg-[profname][-xx]`
  - Cloud: `crg-[profname][-xx]`

The `salloc` command can be customized according to user preferences and needs. To learn more about SLURM and `salloc` click HERE.
These parameters, along with the current system utilization, determine how long it will take for the queue manager to allocate the resources. For instance, requesting 2 processors will take much less time than requesting 512, and a 10-minute job will be allocated much faster than a 20-hour one.
Here are a few limits to keep in mind when allocating resources on an Alliance HPC system: Niagara accepts jobs of up to 24 hours run-time; Béluga, Graham and Narval up to 7 days; and Cedar up to 28 days. On general-purpose clusters, longer jobs are restricted by partitions to use only a fraction of the cluster. There are partitions for jobs of
- 3 hours or less,
- 12 hours or less,
- 24 hours (1 day) or less,
- 72 hours (3 days) or less,
- 7 days or less, and
- 28 days or less.
Because any job of 3 hours is also less than 12 hours, 24 hours, and so on, shorter jobs can always run in partitions with longer time limits. A shorter job therefore has more scheduling opportunities than an otherwise-identical longer job. On Béluga, a minimum time limit of 1 hour is also imposed. More about job scheduling policies can be found HERE.
When you press enter, the output should be similar to this:
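The exact wording varies between clusters and SLURM versions, but it should resemble the following sketch (the job ID and node name are the ones discussed below):

```
salloc: Pending job allocation 24928305
salloc: job 24928305 queued and waiting for resources
salloc: job 24928305 has been allocated resources
salloc: Granted job allocation 24928305
salloc: Waiting for resource configuration
salloc: Nodes gra287 are ready for job
```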
In these steps, SLURM tells us that a job with ID 24928305 has been created for us, that it is in the queue waiting for resources, and, eventually, that the resources have been found and granted. Finally, the resources have been configured and a node, `gra287`, is now ready for our task. You will also notice that the command-line prompt has changed to:
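something like the following, where `username` stands in for your own user name:

```
[username@gra287 ~]$
```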
This means that we are no longer sitting on the login node: we are now on a compute node whose identifier is gra287. One might ask: now that I am on a compute node, how do I run my code that is sitting in my `/home` directory? Here is a beautiful feature of a computer cluster: when running the `ls` command from the compute node, you will realize that, although you are on a completely different node, you can still see the exact same files and directories you had in your home. This is due to the architecture of the cluster, which uses a distributed, shared file system mounted on every node. That means you will be able to run your code or access your files from anywhere in the cluster. You can check the status of your job at any time using the SLURM command `squeue -u username`, or simply `sq`. If you open a new command-line window, log into Graham and, from the login node, type `sq`, here is what you will see:
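The two extracts below are a sketch of what `sq` typically prints, first while the job is pending and then once it is running (the user name, account, memory and column layout will differ from case to case):

```
   JOBID     USER      ACCOUNT         NAME  ST  TIME_LEFT NODES CPUS MIN_MEM NODELIST (REASON)
24928305 username def-someuser  interactive  PD      10:00     1    1    500M  (Priority)
```

```
   JOBID     USER      ACCOUNT         NAME  ST  TIME_LEFT NODES CPUS MIN_MEM NODELIST (REASON)
24928305 username def-someuser  interactive   R       9:38     1    1    500M gra287 (None)
```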
Some important things to note:
- The job name is “interactive”, as we were expecting.
- The status `ST` is pending (`PD`) in the first extract, and running (`R`) in the second.
- The number of nodes and the number of CPUs reflect what was defined by the `salloc` command.
If you want to stop/cancel the job at once, you can use the command `scancel` followed by the job ID. In our case:
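With the job ID reported by SLURM above, that is:

```bash
scancel 24928305
```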
With the `salloc` command we ran earlier, we requested an allocation for 10 minutes. Once the requested time is over, SLURM automatically releases the resources and you are back on the login node. Here is the message you will receive when the time is up:
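The exact wording depends on the SLURM version, but it will be something like:

```
salloc: Job allocation 24928305 has been revoked.
```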
Submitting a job in non-interactive mode
Most of the time it is not necessary for the user to interact with the compute node; therefore, the most common way of running a job on the cluster is by submitting a script to SLURM. Think of it as ordering food at a restaurant: your order is written on a piece of paper (the batch script), which is handed to the restaurant manager (the queue manager), who finds you a table and sends the order to the kitchen (allocating the resources and running the job). Your order might be executed right away, or you might wait in a queue depending on how busy the restaurant is. The batch file is usually divided into two parts:
- In the first part, you give SLURM all the directives (or options) for the job. Each line containing these parameters must be preceded by `#SBATCH` in the script. All available directives are described on the sbatch page.
- In the second part, you specify any commands that must be executed as part of the job.
A very simple batch script might look like this:
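Here is a sketch consistent with the description below; the account name is a placeholder:

```bash
#!/bin/bash
#SBATCH --ntasks=1                # number of tasks (1 core)
#SBATCH --mem-per-cpu=256M        # memory per core
#SBATCH --time=00:15:00           # walltime (hh:mm:ss)
#SBATCH --account=def-someuser    # placeholder group/account name

python ./myfirstjob.py            # the Python script to execute
```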
In the above example, `./myfirstjob.py` is a standard Python script that will be run. On general-purpose (GP) clusters, this job reserves 1 core and 256 MB of memory for 15 minutes.
Allocation of multiple cores
As we mentioned earlier, the benefit of using a computer cluster to run numerical simulations is the ability to use multiple cores. In this regard, we highlight below a few rules to keep in mind:
- Allocating whole nodes: if the goal is to allocate more than 32 cores, one might consider requesting a whole node. To do that, it is worth checking the architecture of the Alliance HPC systems and how many cores per node each cluster has. Graham, for instance, has 32 cores per node with a total usable memory of 125 GB (about 3.9 GB per core). Therefore, to request a whole node on Graham, the batch script will look like the sketch shown after this list, where `--mem=0` is interpreted by SLURM to mean “reserve all the available memory on each node assigned to the job”, and `srun` is the command used to execute an application on multiple processes. If one does not need 32 cores but still wants all allocated cores to be on the same node, the number of tasks per node can be specified with `--ntasks-per-node`. Choosing the number of cores to request, and whether or not to request whole nodes, may be a trade-off between running time (or efficient use of the computer) and waiting time (or efficient use of your time). If you want help evaluating these factors, please contact the Alliance technical support.
- Enabling multi-threading: it is important to note that the number of tasks requested in the batch script corresponds to the number of processes that will be started by the `srun` command. If one wants to enable multi-threading, it is good practice to set the process count with `--ntasks` or `--ntasks-per-node`, and to set the thread count with `--cpus-per-task`. More details on how to run a job with multi-threading enabled can be found HERE.
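Here is a sketch of the whole-node request described in the first point above (the walltime and account name are placeholders, and `./mycode` stands for whatever executable you are running):

```bash
#!/bin/bash
#SBATCH --nodes=1                 # request one whole node
#SBATCH --ntasks-per-node=32      # all 32 cores of a Graham node
#SBATCH --mem=0                   # reserve all the available memory on the node
#SBATCH --time=01:00:00           # placeholder walltime
#SBATCH --account=def-someuser    # placeholder group/account name

srun ./mycode                     # start one process per task (32 in total)
```

For a multi-threaded run, one would instead reduce `--ntasks-per-node` to the desired number of processes and add `--cpus-per-task` to set the number of threads each process may use.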
Now that you know how to allocate resources on the cluster, you are ready to run computations! However, you might be wondering: how do I install my CFD package on the cluster? Here comes yet another beautiful feature of most supercomputers: the module system.
The module system
The module system is a concept available on most clusters worldwide and guarantees the use of different software (and different versions of the same software) in a precise and controlled manner. In most cases, a supercomputer has far more software installed than the average user will ever need. Therefore, the settings for all these software packages and their supported versions are encapsulated in “environment modules” maintained by the module system. These modules are in no way related to modules in `perl` or `python` and should not be confused with them. To get an idea of which modules are already loaded in your current profile, from the login node run the command:
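On Alliance clusters, which use the Lmod module system, that is:

```bash
module list
```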
You will notice that some basic modules such as `openmpi`, `python`, or `intel` may already be loaded. If the module or software you are looking for is not in the list of loaded modules, run the command:
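The standard command for this is:

```bash
module avail
```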
This shows which modules are available on this particular cluster. You will notice that:
- There are a number of different versions of the same software.
- The modules are organized and colour-coded by discipline. For example, `bio` stands for modules in Bioinformatics libraries/apps and `phys` stands for modules in Physics libraries/apps.
- If you are in the CFD field, there are common software packages such as `starccm`, `openfoam`, and `gmsh`.
- Modules can be entire software packages such as `openfoam` and `paraview`, or libraries such as `fftw` or `hdf5`.
Standard environments identify combinations of specific compiler and MPI modules that are used most commonly by our team to build other software. These combinations are grouped in modules named `StdEnv`. As of February 2023, there are four such standard environments, versioned `2023`, `2020`, `2018.3` and `2016.4`, with each new version incorporating major improvements. Only versions 2020 and 2023 are actively supported. To learn more about the module system on Alliance clusters click HERE.
Once you have chosen the module you need, all you do is run the command:
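For example, with `<modulename>` standing for the module you chose:

```bash
module load <modulename>
```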
1. When loading a module, it is good practice to check whether the desired module has dependencies that must be loaded first, by typing `module spider <modulename>`.
2. When loading a module, it is also good practice to load the exact version of the software you are looking for, by typing `module load <modulename/X.X.X>`, where `X.X.X` is the version number.
3. The modules you load will only survive in your profile for the duration of the session. Once you log out of the login node, the previously loaded modules are lost and you will need to load them again.
4. Following (3), modules also need to be re-loaded on compute nodes.
However, if you are satisfied with the loaded modules and you know that you will use them constantly, you can run the command:
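On Lmod-based systems such as the Alliance clusters, that is:

```bash
module save
```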
This saves the current module setting as the default configuration. There are several other useful commands in the module system environment, such as `module purge` to unload all the currently loaded modules, and `module switch x y` to switch module `x` with module `y`. And now you are REALLY ready to run your first job on the cluster!
In this very simple example code, we check if a specific year is a leap year. Try running this code on the cluster using all the methods learned so far.
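The original script is not reproduced here, so the following is only a minimal sketch of what it might look like; the file name `isleap.py` (chosen to match the output files mentioned below) and the hard-coded year are assumptions:

```python
# isleap.py -- check whether a given year is a leap year (illustrative sketch)
year = 2024  # placeholder year to test

# a year is a leap year if it is divisible by 4,
# except for century years, which must also be divisible by 400
if (year % 4 == 0 and year % 100 != 0) or (year % 400 == 0):
    print(f"{year} is a leap year")
else:
    print(f"{year} is not a leap year")
```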
The first option is to run the script directly on the login node. After checking that the Python module has been loaded, from the login node we simply type:
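Assuming the script is saved as `isleap.py`, as in the sketch above, this is simply:

```bash
python isleap.py
```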
The second option is to run the script on a compute node in interactive mode. The first step, as mentioned previously, is allocating the resources:
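Matching the request described just below, with a placeholder account name:

```bash
salloc -n 1 --time=00:10:00 --mem-per-cpu=500M --account=def-someuser
```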
Here we requested a single processor for 10 minutes with 500 MB of memory (more than enough for the task at hand).
The third and final option is to run the script by submitting a batch script to SLURM. Let's create a file “firstjob.sh” to give SLURM all the necessary information about the job. The batch script will look something like:
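A sketch of what “firstjob.sh” could contain; the walltime, memory, and account name are placeholders:

```bash
#!/bin/bash
#SBATCH --ntasks=1                # number of tasks (processes started by srun)
#SBATCH --time=00:10:00           # placeholder walltime
#SBATCH --mem-per-cpu=500M        # placeholder memory per core
#SBATCH --account=def-someuser    # placeholder group/account name
#SBATCH --output=isleap.out       # file where the standard output is written
#SBATCH --error=isleap.err        # file where error messages are written

srun python isleap.py             # run the leap-year script
```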
Because the submitted job will run unsupervised, it is good practice to ask SLURM to produce an output file (isleap.out), in which the output of our code will be printed, and an error file (isleap.err), in which the error messages will be printed in case there is a problem. To run the batch script, simply type:
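From the directory containing “firstjob.sh”:

```bash
sbatch firstjob.sh
```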
Once again, you can check the status of your simulation by running the command `squeue -u username` or `sq`, and upon completion you will notice that a new file named “isleap.out” has been created in your current directory, the content of which is exactly:
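With the placeholder script sketched above, that would be something like:

```
2024 is a leap year
```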
You will also notice that the file “isleap.err” has been created; however, its content is empty because no error was encountered during execution. Finally, note that the number of tasks requested from SLURM is the number of processes that will be started by `srun`. Because this code is not written in parallel mode, if more tasks are selected, the different CPUs will all perform the same operation. TASK: in the previous example, modify the batch script to ask for 4 processors and run the job again. What does the output look like?
A bit more terminology
Before we move on to the next section of the course, there are a couple of concepts that we need to familiarize ourselves with first. These are:
- The concept of time in computing, and
- The concept of efficient use of HPC resources.
Time in computing
In HPC terminology there are two key definitions we need to be very comfortable with:
- Walltime: or, more precisely, “elapsed real time”, is the length of time, measured in seconds, that a program takes to run (i.e. execute all its assigned tasks). The walltime is independent of how many resources are used; it is the time it takes according to a clock mounted on the wall.
- CPU hours: the amount of CPU time spent processing. Imagine, for example, executing a program for a walltime of 1 hour on 32 CPUs: we will have used 32 CPU hours.
Efficient use of HPC resources
We introduced the concept of performance in computing earlier, when discussing FLOPS and clock speed; however, these concepts need to be revisited in the context of HPC clusters, where we have a large number of CPUs (and nodes) available.
Say you are a professional copyist. After your long experience, you have optimized your workflow so that you copy an entire page of a manuscript with no difficulty. However, if you are now tasked with copying a 1000-page manuscript, despite your talent it will probably take you a LONG time, more than any employer will ever wait for. A very simple solution would be to delegate some of the workload to nine other copyists that could assist you (while of course communicating between each other), increasing the chances of getting the work done much faster. In very simple terms, our goal will be to distribute the workload effectively and efficiently over the resources we have available, in order to reach the final result in a reasonable time frame.
If you are taking this course, you are probably an application user and not an application developer, meaning that you are not really interested in developing more efficient CFD tools, but rather in using your favourite CFD package as efficiently as possible. Making our workflow more efficient will be the basis of the next section.
QUIZ
Your goal is to write a batch script to run a Python code `problem1.py` for 8 hours on 2 nodes, each containing 32 processors. The code requires Python 3.9.6. Answer the following questions:
1.3.1 How much walltime do you need?
1.3.2 How many CPU hours would you need?
1.3.3 Write a batch script to submit this job using SLURM.
1.3.3 Solution
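A possible solution, sketched with a placeholder account name and placeholder output/error file names (the `python/3.9.6` module satisfies the stated version requirement):

```bash
#!/bin/bash
#SBATCH --nodes=2                 # 2 nodes...
#SBATCH --ntasks-per-node=32      # ...with 32 processors each
#SBATCH --mem=0                   # reserve all the available memory on each node
#SBATCH --time=08:00:00           # 8 hours of walltime
#SBATCH --account=def-someuser    # placeholder group/account name
#SBATCH --output=problem1.out     # placeholder output file
#SBATCH --error=problem1.err      # placeholder error file

module load python/3.9.6          # load the required Python version

srun python problem1.py           # run the code on 64 processes
```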
Having finished this class, you should now be able to answer the following questions:
- How do I run a job in interactive mode on the cluster?
- How do I submit a batch script to SLURM?
- How do I load my favourite CFD software?