Running simulations on HPC
At the end of this class, you should be able to:
- Organize and plan simulation files on the cluster
- Run the simulation and optimize the workflow
- Perform a run-time analysis of flow parameters
Now we are ready to start a large-scale CFD simulation. We have generated the mesh, set up the simulation, and run scaling tests. The purpose of this section is to standardize the workflow of organizing, running, and monitoring large-scale CFD simulations on a remote HPC system and to provide best-practice tips.
- All examples will be carried out on Graham, but the approach is generalizable to any HPC system
- The workflow is NOT set in stone, but rather suggestions to facilitate HPC usage
- The examples will be performed in both OpenFOAM and SU2; students can toggle between the two sets of commands as shown below.
OpenFOAM commands
SU2 commands
Remote files on the cluster
Before running a simulation, it is good practice to organize the file system on the remote cluster. Although this step may be time-consuming to implement, it will save time in the long run. The question now is: how do we organize our simulation files?
To answer this question, let us first figure out what options we have. Upon logging into Graham, type:
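```bash
# report quota and current usage for your /home, /project and /scratch spaces
diskusage_report
```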
This command will check the available disk space and the current disk utilization on our personal and group profiles. The output will look something like:
where username refers to your personal space, while piname refers to your group (or principal investigator) profile.
- /home has a small capacity, which is suitable for code development, source code, small parameter files, job submission scripts, and version control. Note that we cannot write to the /home drive from the compute nodes.
- /project (group rrg-piname-ac) is a directory that is linked to your principal investigator's account and is meant for longer-term storage and sharing data among members of a research group.
- /scratch is connected to a single user and is intended for intensive read/write operations on large files. As mentioned in section 2.1, it is the right place to set up and run your simulations.
More detailed information on the Alliance’s storage and management systems can be found on their website.
Important files must be copied off /scratch regularly since it is not backed up and older files are subject to purging!
Since we have space on the /home drive, we can clone the course GitHub repository there (if it is not already done), as shown below. The repository contains the input files and meshes for the examples (and is of modest size), therefore the /home drive will allow us to modify and save the examples prior to copying the cases to /scratch for running the simulations.
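A minimal sketch of the clone step is shown below; the repository URL is a placeholder and should be replaced by the address given on the course GitHub page:

```bash
# clone the course repository into /home (URL is a placeholder)
cd ~
git clone https://github.com/ARC4CFD/arc4cfd.git
```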
Create the run directory
Now that you have cloned the repository to your /home directory, we can make any required modifications. Any modifications to the source should happen in /home, as it is your own personal space and nothing will be purged from there. Once you are satisfied with the changes implemented in the source code, you should copy the code into a run directory in /scratch. In this case, no changes were required in the source files, therefore we can copy them directly:
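A sketch of this copy step is shown below; the repository and case directory names are illustrative and should be adapted to your own setup:

```bash
# create a run directory on /scratch and copy the case into it
# (on the Alliance clusters, ~/scratch points to /scratch/$USER)
mkdir -p ~/scratch/bfs_run
cp -r ~/arc4cfd/02_BFS_OF ~/scratch/bfs_run/
```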
Naming convention and folder structure
At this stage, do not underestimate the importance of naming conventions for files and directories. Consistent naming conventions can help you find your data, avoid mistakes, and minimize duplication of effort. Some CFD codes, such as OpenFOAM, have strict folder naming and structure conventions for each simulation, while other codes, such as SU2, do not. As each CFD project typically comprises multiple simulations (mesh refinement, scaling tests, different turbulence models), a consistent naming convention and folder structure, determined a priori, will facilitate the organization, running, postprocessing, and, ultimately, the research data management.
Best practices in folder naming conventions:
- Avoid spaces and special characters in the names: use either dashes (-) or underscores (_) if you need to separate elements in the folder name. Alternatively, you can use camelCase, in which the first word is in lowercase and every other word starts with a capital letter (e.g. pimpleFoam).
- Keep it short, but meaningful: Keep folder names as short as possible and consider using abbreviations (write those down!)
- Write down the naming convention: Write down the naming convention in the data management plan (DMP), more in Section 3, or in a README file.
- Dates in folder names should follow the ISO 8601 format: if dates are used in the folder names, use the well-accepted standard format YYYYMMDD or YYYY-MM-DD.
Here are a couple of considerations for directory naming conventions:
- Think about the simulations you plan to run: consider all the possible simulations that you will need to run. What are the important parameters of the simulations? Here are some typical parameters that may be varied between simulations:
  - turbulence models (SST, k-omega, RSM, etc.)
  - boundary conditions (freestream velocity, wall resolved, wall modelled, Reynolds number)
  - various grid resolutions (coarse, medium, fine, etc.)
  - thermophysical properties (Prandtl number, thermal convection, etc.)
  - …
- Establish a consistent naming convention and folder structure: depending on the parameter space that you plan to cover and the complexity of the CFD project, you may opt for either:
  - a more complex folder hierarchy, or
  - a more comprehensive folder naming convention.
  The naming convention and folder structure need not be unique, and different research projects may have different naming conventions. Here are two different examples:
Having a deeper folder and subfolder hierarchy may help to organize the various simulations, especially for larger CFD projects. As the simulation organization is embedded within the folder structure, the naming convention of each folder is less critical. Here is an example:
For smaller CFD projects, a flatter folder structure may be preferred to facilitate navigation among the various simulations. A flatter folder structure comes at the cost of a more comprehensive naming convention. Here is an example:
[SIMULATION_NAME]_[Reynolds_number]_[Turbulence_model]_[resolution]
where [SIMULATION_NAME] is BFS for the backward facing step, [Reynolds_number] is ReXXXX where XXXX is the Reynolds number, [Turbulence_model] defines the turbulence model (SST, KOM: k-omega, RSM: Reynolds stress modelling, etc.), and [resolution] is the resolution of the mesh (CRS: coarse, MDM: medium, FIN: fine, etc.). Based on these conventions, the coarse-mesh SST simulation of the backward facing step at a Reynolds number of 5,000 would be named:

BFS_Re5000_SST_CRS
Bookkeeping
When running a large number of simulations, in addition to a consistent naming convention and folder structure, it is good practice to maintain a centralized database of the simulations. This database, which could be an Excel sheet, can provide a quick summary of the important details (date of simulation, number of processors, code compilation characteristics, etc.) for each simulation and any user comment.
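As a minimal sketch, such a database could be a simple CSV file with one row per simulation; the fields and values below are purely illustrative:

```bash
# create a central simulation log and append one row per run (illustrative values)
echo "date,case_name,mesh,nprocs,walltime_h,comments" > simulation_log.csv
echo "2024-01-15,BFS_Re5000_SST_CRS,200k,64,24,baseline coarse run" >> simulation_log.csv
```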
Setting up simulation parameters
As the tutorial files have already been prepared, here we highlight only the most important steps in setting up the files prior to running the simulations. We assume that we are using a mesh with a known resolution that has been generated by an external meshing tool (see details in section 2.4). For simplicity and ease of computation, we only use the coarsest mesh, with about 200,000 cells (bfs_200k_DDES in the OpenFOAM example and Coarse in the SU2 example). With this mesh, we can set the remaining parameters of the simulations:
Simulation setups can be daunting. Fortunately, we rarely need to construct a simulation setup from scratch. Instead, the best practice is to start from the tutorial and test cases that are often provided within the CFD package. For example, we can look at:
You can then select the closest case to the one you plan to simulate and adapt it. Selecting the best case among those available is often not trivial.
- Initial and boundary conditions: define an initial flow field and the boundary conditions (e.g. inlet, outlet, and wall conditions, freestream velocity) that are consistent with the physical problem and the targeted Reynolds number.
- Numerics: the selection of the numerical details for both the temporal and spatial discretization will directly impact the computational cost. The specific details on the selection of the numerics fall outside the scope of ARC4CFD; therefore, we only list some of the considerations to set:
- Explicit, Semi-implicit, and Implicit time advancement and/or the order of the selected numerical scheme
- Spatial discretization and order of scheme (convective and diffusive terms, we can also set details for turbulence equations)
- Pre-conditioning scheme (Jacobi, ILU, LU_SGS etc.)
- Type of linear solvers (FGMRES, BCGSTAB, etc.)
- Stabilization schemes
- Convergence criteria
- …
- Time step size: the chosen numerics, the local grid resolution, and the minimal resolved time scale will directly impact the time step of the simulation. In the present tutorial case, with an explicit time advancement, the time step is CFL bound (section 2.3) and requires $\Delta t \le \mathrm{CFL_{max}}\,\Delta x/u$. Alternatively, in most CFD codes, we can set the maximum CFL number, and the solver will select the time step $\Delta t$ to meet the CFL condition. The advantage of fixing $\Delta t$ (say, $\Delta t = 10^{-4}$ s) is that we know we have evolved to 1 s after 10,000 time steps, whereas fixing the CFL number allows us to maximize $\Delta t$ based on the stability of the code.
OpenFOAM
SU2
- Simulation end time: for steady CFD simulations, for a given set of boundary conditions, the residual of the simulation must be reduced to the desired convergence criterion. For unsteady CFD simulations, especially with turbulence and/or geometric complexities, it is important to run the simulation long enough to let the flow properly develop in the computational domain. As the flow adapts from its initial conditions given the boundary conditions of the problem, there will inevitably be a transient phase. Eventually, the flow will reach a statistically steady state during which the spatially- or phase-averaged statistical quantities (drag coefficient, turbulent fluctuations, etc.) remain constant. As mentioned in section 2.3, estimating the time required to reach a statistically steady state is very difficult and case dependent. A reasonably good measure for a rough estimate is the flow-through time (FTT), as described in section 2.2. In this example, we choose the end time to correspond to a set number of flow-through times.
OpenFOAM
SU2
- Output and snapshot time interval: for the flow analysis, we typically rely on a combination of:
- run-time statistics: these statistics are collected during run time
- postprocessed statistics: these statistics are postprocessed after the simulation from the output data
Although run-time statistics are often desired, as we can get high temporal resolution, they can impart a significant run-time penalty on large simulations (e.g. averaging on a plane). Postprocessed statistics provide more flexibility (we can compute new statistics even after the simulation has been run), but they require significantly more flow realizations, or snapshots, to be written to disk.
OpenFOAM
SU2
- Domain decomposition: after performing the scaling test for a given mesh (section 2.5), we know how many processors we should use to optimize the CFD workflow.
OpenFOAM
The user should modify the numberOfSubdomains entry in the case/system/decomposeParDict file. In this example, we use 64 processors. (Depending on the selected parallelization method (see the OpenFOAM user guide), there may be other parameters to modify.)
SU2
There is no need to specify this a priori in SU2; the number of processors is chosen at execution time, as shown below.
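For instance, a 64-processor SU2 run could be launched as follows (a sketch; on the cluster the same line typically goes inside the job script, and srun may be used instead of mpirun):

```bash
# the number of MPI ranks is chosen at launch time
mpirun -np 64 SU2_CFD Backstep_str_config.cfg
```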
After all flow and simulation parameters have been set, we are now ready to run the simulation.
Run a large-scale CFD simulation
As previously seen in section 1.5, when solving the two-dimensional Poisson equation, there are two common ways of running large-scale simulations on the cluster:
- an interactive session, by logging into the compute nodes directly, and
- submitting a batch job to SLURM.

Interactive sessions are easier to set up and debug, as we can interactively run the simulation on the compute node and immediately assess the outputs. However, interactive sessions are not suited for long jobs (as the terminal window must remain open and the workstation on), large processor counts, or multiple parallel simulations. Therefore, interactive sessions should only be used for small simulations, debugging large simulations, and/or running scaling tests.
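As a reminder, an interactive allocation is requested through SLURM before running interactively; a minimal sketch is shown below, where the account name, resources, and duration are placeholders to adapt to your allocation:

```bash
# request an interactive allocation on the compute nodes (placeholder values)
salloc --account=def-yourpi --ntasks=64 --mem-per-cpu=2G --time=02:00:00
```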
Now that we have copied the tutorial case from our /home directory onto /scratch, our goal is to run 3 simulations, starting from the coarse mesh (bfs_200k_DDES in OpenFOAM, Coarse in SU2).
Running in interactive mode
- Allocate required HPC resources:
- Create the sub-case directory bfs_200k_DDES within the main case directory:
- Generate mesh from file using the gmsh utility:
- Convert mesh to OpenFOAM format and modify boundary file to reflect boundary conditions:
- Start the simulation:
Where the Allrun script performs some very important operations. Among them:
After step 5 is completed, you should see the simulation starting on the terminal:
At this point, the terminal window hangs while the simulation runs. If the terminal window is closed, the simulation stops.
- Allocate required HPC resources:
- Generate mesh from file using the gmsh utility:
- Start the simulation:
Submitting a batch script
When dealing with multiple simulations, long durations, or a large number of processors, it is best to submit a batch job to SLURM. As seen earlier, SLURM will queue the job and run it when the resources become available. In this case, for instance, we could include steps 1-5 in a single file run.sh to be run in interactive mode or, even better, in a batch job script run_jobscript.sh to submit to the job scheduler. Both files are included in the GitHub repository and are shown below:
run.sh
run_jobscript.sh
su2job_StdEnv.sh
The command to submit the batch script is simply:
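```bash
# submit the OpenFOAM batch script to the SLURM scheduler
# (for the SU2 case, submit su2job_StdEnv.sh instead)
sbatch run_jobscript.sh
```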
QUIZ
Run the numerical simulation of the same backward facing step flow for the mesh containing
2.6.1 What would be the time step size
2.6.2 What would be the writeInterval required to still print results every 2 milliseconds?
Perform a runtime analysis of the simulation
Once the job is submitted, we should make sure the simulation is running properly. This is done by typing the command:
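```bash
# list your jobs in the queue (on the Alliance clusters, the short alias sq also works)
squeue -u $USER
```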
Based on the output, the code is running, as expected, on 64 processors using 8 nodes. This check, however, does not really tell us that everything is going well, but only that the 64 processes have started and are working on something. The next step would be to check the log file.
If you recall, with the command mpirun pimpleFoam -parallel > log.pimpleFoam in run_jobscript.sh, we asked the code to write all of its output to a log file called log.pimpleFoam. This file appeared in our case directory (bfs_200k_DDES) as soon as the simulation started.
Depending on how far along you are in the simulation, the log.pimpleFoam file might be quite long. To give you a quick overview of how it looks, let us visualize the beginning of it:
See log.pimpleFoam
Important information to retain from the log file includes:
- The time iteration corresponds to the time integration of the equations of motion mentioned in section 2.3.
- The CFL or Courant number is displayed at every time stamp.
- Residuals are shown at each iteration for all velocity components and pressure.
- Local and global mass conservation is also printed at each iteration.
These 4 pieces of information are already incredibly useful to understand if the simulation is converging, diverging, if mass is globally conserved, or if there is a problem in the domain.
The simulation will probably run for several hours, and the ideal scenario is that, every once in a while, we check the behavior of the residuals and mass conservation. As you might guess, staring at numbers on the screen is not the best approach and, once again, it is better to adopt an automated mechanism to visualize the residuals. This can be done using gnuplot, a command-line and GUI program that can generate two- and three-dimensional plots of functions, data, and data fits. Gnuplot is usually present by default on any UNIX system; however, to make sure you have it in your profile on the cluster, you can type:
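```bash
gnuplot    # a welcome message and the gnuplot prompt should appear; type exit to leave
```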
If you do not see the gnuplot welcome message, or if the terminal throws you an error, you can load the gnuplot module just like any other module:
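```bash
# load the gnuplot module (the exact module name/version may differ on your system)
module load gnuplot
```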
We can now write a simple script to plot residuals during runtime:
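A minimal sketch of such a script is given below (this is not the course's exact script: it assumes the default pimpleFoam log format, so the awk field position may need adjusting):

```bash
#!/bin/bash
# plot_residuals.sh -- extract the initial residuals from log.pimpleFoam and
# plot the last 40 time steps with gnuplot (hypothetical helper script)
for f in Ux Uy Uz p; do
    grep "Solving for $f," log.pimpleFoam | awk '{print $8}' | tr -d ',' > residual_$f.dat
done
gnuplot -persist <<'EOF'
set logscale y
set xlabel "Iteration (last 40)"
set ylabel "Initial residual"
plot "< tail -n 40 residual_Ux.dat" with lines title 'Ux', \
     "< tail -n 40 residual_Uy.dat" with lines title 'Uy', \
     "< tail -n 40 residual_Uz.dat" with lines title 'Uz', \
     "< tail -n 40 residual_p.dat"  with lines title 'p'
EOF
```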
The script above will plot the residuals for all velocity components during the past 40 time steps; you can change the plotting range by modifying the corresponding line in the script. The script MUST BE located in the same directory as log.pimpleFoam, and to run it, simply type:
If you recall, in lines 252 to 255 of the Backstep_str_config.cfg file, we have instructed SU2 to generate an output file containing the convergence history under the name history.csv. This file appeared in our case directory (02_BFS_SU2/Coarse/) as soon as the simulation started.
Depending on how far along you are in the simulation, the history.csv file might be quite long. To give you a quick overview of how it looks, let us visualize the beginning of it:
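```bash
# print the first lines of the convergence history file
head -n 5 history.csv
```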
See history.csv
In this case, the file contains the root-mean-square (RMS) residuals of the flow variables, among other quantities. SU2 distinguishes between the following types of output:
- Screen output: the convergence history printed to the console.
- History output: the convergence history written to a file.
- Volume output: everything that is written to the visualization and restart files.
- Output field: a single scalar value for screen and history output, or a vector of a scalar quantity at every node in the mesh for volume output.
- Output group: a collection of output fields.
More information can be found HERE. The simulation will probably run for several hours, and the ideal scenario is that, every once in a while, we check the behavior of the residuals. As in the OpenFOAM case, we can automate this with gnuplot; to make sure it is available in your profile on the cluster, you can type:
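```bash
gnuplot    # a welcome message and the gnuplot prompt should appear; type exit to leave
```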
If you do not see the gnuplot welcome message, or if the terminal throws you an error, you can load the gnuplot module just like any other module:
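```bash
# load the gnuplot module (the exact module name/version may differ on your system)
module load gnuplot
```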
We can now write a simple script to plot the convergence history during runtime:
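A minimal sketch of such a script is given below (not the course's exact script: the column numbers depend on the HISTORY_OUTPUT fields selected in the .cfg file and should be adjusted after inspecting the header of history.csv):

```bash
#!/bin/bash
# plot_history.sh -- plot RMS residuals from SU2's history.csv (hypothetical helper)
gnuplot -persist <<'EOF'
set datafile separator ","
set logscale y
set xlabel "Iteration"
set ylabel "RMS residual"
# column numbers below are placeholders; check the header line of history.csv
plot "history.csv" every ::1 using 0:3 with lines title 'rms[Rho]', \
     "history.csv" every ::1 using 0:4 with lines title 'rms[RhoU]'
EOF
```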
You can change the plotting range by modifying the corresponding lines in the script above. The script MUST BE located in the same directory as history.csv, and to run it, simply type:
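```bash
# assuming the sketch above was saved as plot_history.sh
bash plot_history.sh
```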
You might want to think about output files
When running a simulation in parallel, it is crucial to think about the impact of the output files on the HPC workflow.
Number of output files
Depending on the CFD tool used, when running a simulation in parallel, we need to remember that the computational domain has been decomposed among many processors, and each processor may write its own set of output files to the /scratch directory at every output interval.
In simple terms, the coarse simulation of the BFS we have just carried out on 64 processors for about 25000 iterations would generate about 8 million files if the solution were written at every iteration! This is why the time interval between snapshots should be chosen wisely.
This type of output is known as parallel output, and one should always consider merging all processors' files after the simulation is done or (if possible) during runtime. This is precisely why reconstructPar was included in the OpenFOAM batch script file.
Although some available CFD tools will perform this operation by default, it is always a good idea to perform a rough estimate of the number of output files expected from a numerical simulation. Let's consider our BFS simulation run over its total simulation time:
- Number of files with parallel output: (number of snapshots) × (number of output fields) × (number of processors)
- Number of files with merged output: (number of snapshots) × (number of output fields)
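As a quick sanity check, these estimates can be computed directly in the shell; the five output fields assumed below are illustrative:

```bash
# rough file-count estimate for the coarse BFS case (illustrative field count)
nprocs=64 ; nsnapshots=25000 ; nfields=5
echo "parallel output: $(( nprocs * nsnapshots * nfields )) files"   # ~8 million
echo "merged output:   $(( nsnapshots * nfields )) files"            # 125,000
```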
Check the documentation of your CFD tool, as you might be able to change the way output files are written. In OpenFOAM, for instance, you can switch between the two write methods on the fly by modifying the corresponding controlDict entry while your case is running.
Size of output files
Some thought should also be given to the size of the output files. Without going into too much detail, in HPC we have two possible output formats:
- Binary: as mentioned in section 1 of this course, the binary language is very efficient for programs and is not designed for humans to read. Executables, for instance, are written in binary code by the compiler and contain the set of instructions a program has to execute.
- ASCII: stands for American Standard Code for Information Interchange. It is a coded character set consisting of 128 7-bit characters: 32 control characters, 94 graphic characters, the space character, and the delete character. ASCII is a file format that can easily be read by humans; a very common text file (.txt) is an ASCII file.
Why does this matter in CFD and HPC?
Binary format is faster for read/write since the machine does not have to convert to a human-readable format. The size of a binary output file is also smaller as compared to an ASCII file. Most binary formats are platform-dependent and not easily transferable between systems.
Applying this reasoning to a CFD case:
- For complex geometries and very large mesh files, where the goal is to print the output for hundreds or thousands of snapshots, binary would be a better choice, as writing many data points and many snapshots can be done relatively instantaneously (compared to converting and writing thousands of ASCII files).
- For relatively small cases on simple geometries, where the number of output files is not too large, writing ASCII files will not cause a significant performance hit, and one can open and manipulate single output files.
Advantages:
- ASCII:
  - Suitable for small meshes and few snapshots.
  - Can be visualized and edited using regular text editors.
- Binary:
  - Smaller file size.
  - Faster to read/write.
  - Suitable for large meshes and complex geometries.

Disadvantages:
- ASCII:
  - Larger file size.
- Binary:
  - Cannot be read and edited by regular text editors.
Align restarts with clock time
Many HPC systems use fair-share schedulers, which often impose maximum time durations for each submitted job. On most Digital Research Alliance clusters, the maximum walltime of a job is 24 hours. Therefore, restart files (also called breakpoints) need to be output in order to run the simulation for more than 24 hours. To optimize the computational usage, we should seek to have a restart file written immediately before the end of the allocated walltime.
For a 24 h run (86,400 seconds), if we know that writing the files takes about 2 minutes (dependent on the size of the simulation), then we would want to have a restart file written at, say, 86,000 seconds of wallclock time. Additionally, if we want to ensure that we have some intermediary restart files (in case of issues with the simulation), we may want to write every 8,600 seconds of wallclock time. In OpenFOAM, this can be done by setting the writeControl and writeInterval entries in the controlDict file:
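```bash
# a sketch using OpenFOAM's foamDictionary utility, run from the case directory;
# the entries can equally be edited directly in system/controlDict
foamDictionary -entry writeControl  -set clockTime system/controlDict
foamDictionary -entry writeInterval -set 8600      system/controlDict
```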
Can I trust my results?
Before diving too deep into visualization sessions of our fresh CFD data, we should always ask ourselves: can we trust our results? This is a crucial question to answer, as it is central to answering our scientific question(s). In order to trust our CFD results, there are two important aspects to consider:
Grid sensitivity study
This is an internal test that provides an assessment that the results are insensitive to the selected computational grid. For highly non-linear problems, strongly influenced by the boundary conditions, the grid sensitivity study is carried out by running 3 or more simulations of the same problem, with the same boundary and initial conditions, on successively refined grids, ideally by doubling the refinement in each direction. If the numerical method is stable, and all approximations used in the discretization (finite differences, finite volumes, etc.) are consistent, you will find that the solution eventually converges to a grid-insensitive solution (Ferziger and Peric, 2002). Only at this point can we conclude that the chosen numerical method and discretization scheme can be trusted.
As an example, below we show the result of a grid sensitivity study on the BFS example used in this course. Although we only showed how to run the coarsest mesh, two finer grids (400k and 800k) have also been tested. The figure shows the time-averaged streamwise velocity profiles at several streamwise locations for the different grids.
In the figure above, we notice that the dashed and solid curves at all locations (different colors) are very close to each other. One can estimate the error between the two profiles at each location, and if the difference between the profiles is less than 5%, we can conclude that grid convergence has been reached on the medium-fine grid of about 400,000 points. Any simulation on a grid finer than this will not constitute a significant improvement to the solution and would therefore be an ineffective use of computational resources.
Verification and validation check
This is an external test where the goal is to assess that the models are correctly implemented (verification) and the accuracy of the models in representing the real-world flow (validation). The verification and validation processes are well established and standardized within the CFD community. We define:
- Verification is "the process of determining that a model implementation accurately represents the developer's conceptual description of the model and the solution to the model" (AIAA G-077-1998). The objective is to assess that the implementation of the conceptual model in the solver is correct.
- Validation is "the process of determining the degree to which a model is an accurate representation of the real world from the perspective of the intended uses of the model" (AIAA G-077-1998). The objective is to assure that the selected model represents the physical reality.
For well-established codes, verification of the implemented models is usually done prior to the code's release (see e.g. OpenFOAM's V&V). Validation, on the other hand, is a critical step to assure the validity of, and gain confidence in, the CFD results. As CFD is a predictive tool, it is often difficult to validate the specific test case. In those cases, there are many canonical flows whose sole purpose is to serve as benchmarks (lid-driven cavity flow, backward facing step, turbulent boundary layer, periodic channel flow, etc.).
Once you have ensured that the CFD solution is grid independent, the models are verified, and the CFD tool is validated, you can gain trust in the CFD results and use them to advance the understanding of the physical phenomenon.
Let us visualize the flow!
(more on this in the next class)
Having finished this class, you should now be able to answer the following questions:
- How do I organize simulation files on the cluster?
- How do I run a large-scale CFD simulation on HPC systems?
- How do I monitor the simulation during runtime?
- How do I save data efficiently?