High Performance Computing Systems (CMSC714)

Assignment 2: Profiling and OpenMP

Due: Monday March 8, 2021 @ 11:59 PM Anywhere on Earth (AoE)

The purpose of this programming assignment is to gain experience in using performance analysis tools and in writing OpenMP programs. You will start with a working serial program (quake.c) that models an earthquake, analyze its performance and then add OpenMP directives to create a parallel program.

The goal is to be systematic in figuring out how to parallelize this program. You should start by using HPCToolkit and Hatchet to figure out what parts of the program take the most time. From there you should examine the loops in the most important subroutines and figure out how to add OpenMP directives. The program will be run on a single compute node of the deepthought2 machine.
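For example, once profiling has pointed you at a time-consuming loop, adding a directive typically looks like the sketch below. This is a made-up loop, not code from quake.c: the function, array names, and clauses are illustrative assumptions, and the right clauses for quake's loops depend on what each loop actually reads and writes.

        #include <stdio.h>
        #include <omp.h>

        #define NODES 8

        /* Illustrative only: a hypothetical hot loop over mesh nodes.
           The function and array names are stand-ins, not quake.c variables. */
        void scale_displacements(int nodes, double disp[][3], double scale)
        {
            int i, j;
            /* Iterations are independent, so the outer loop can be divided among
               threads; j is made private so each thread gets its own copy. */
            #pragma omp parallel for private(j)
            for (i = 0; i < nodes; i++)
                for (j = 0; j < 3; j++)
                    disp[i][j] *= scale;
        }

        int main(void)
        {
            double disp[NODES][3];
            int i, j;

            for (i = 0; i < NODES; i++)
                for (j = 0; j < 3; j++)
                    disp[i][j] = i + 0.1 * j;

            scale_displacements(NODES, disp, 2.0);
            printf("disp[1][2] = %f\n", disp[1][2]);
            return 0;
        }

Loops whose iterations write to shared data (or accumulate into a single variable) need additional clauses such as reduction, or a different strategy altogether, so check each loop's dependences before adding a directive.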

Using HPCToolkit and Hatchet

HPCToolkit is available on deepthought2 via the hpctoolkit/gcc module. You can use HPCToolkit to collect profiling data for a program in three steps.

  1. Step I: Creating an hpcstruct file (used in Step III) from the executable:
    hpcstruct exe
    This will create a file called exe.hpcstruct
  2. Step II: Running the code (quake) with hpcrun:
    hpcrun -e WALLCLOCK@5000 ./exe <args>
    This will generate a measurements directory.
  3. Step III: Post-processing the measurements directory generated by hpcrun:
    mpirun -np 1 hpcprof-mpi --metric-db=yes -S exe.hpcstruct -I <path_to_src> <measurements-directory>
    This will generate a database directory.
Hatchet can be used to analyze the database directory generated by hpcprof-mpi using its from_hpctoolkit reader.

You can install Hatchet using pip install hatchet. However, I suggest using the development version of Hatchet by cloning the git repository:


        git clone https://github.com/LLNL/hatchet.git
        
You can install Hatchet on deepthought2 or your local computer by adding the cloned hatchet directory to your PYTHONPATH and running install.sh.

Using OpenMP

To compile OpenMP programs we will be using gcc version 4.8.1 (the default version on deepthought2, which you can get by running module load gcc on the deepthought2 login node), which nicely has OpenMP support built in. In general, you can compile this assignment with:


        gcc -fopenmp -O2 -o quake quake.c -lm
        

The -fopenmp flag tells the compiler to, you guessed it, recognize OpenMP directives. The -lm flag is required because our program uses the math library.

The environment variable OMP_NUM_THREADS sets the number of threads (and presumably processors) that will run the program. Set the value of this environment variable in the batch script you use to submit the job. If it is not set, the program defaults to using all available cores, which on a deepthought2 node means 20.
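As a quick sanity check that OMP_NUM_THREADS took effect inside your job, you can print the thread count at runtime. The short sketch below is just a debugging aid, not something the assignment requires; compile it with gcc -fopenmp like quake.

        #include <stdio.h>
        #include <omp.h>

        int main(void)
        {
            /* omp_get_max_threads() reflects OMP_NUM_THREADS, or the default
               of all available cores (20 on a deepthought2 node) if it is unset. */
            printf("Running with up to %d OpenMP threads\n", omp_get_max_threads());
            return 0;
        }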

Running the program

Quake reads its input file from standard input and produces its output on standard output. Quake generates an output message periodically (every 30 simulation time steps), so you should be able to tell whether it is making progress.

When the program runs correctly, all versions (serial or parallel) running on the quake.in input, irrespective of the number of threads used, should produce this output at the 3840th time step:
        Time step 3840
        5903: -3.98e+00 -4.62e+00 -6.76e+00
        16745: 2.45e-03 2.66e-02 -1.01e-01
        30169 nodes 151173 elems 3855 timesteps

This is the output for quake.in.short:
        Time step 30
        978: 8.01e-03 7.19e-03 8.41e-03
        3394: -3.69e-21 1.57e-20 -5.20e-20
        7294 nodes 35025 elems 34 timesteps

What to Submit

You must submit the following files and no other files:

  • Python scripts that use hatchet for the analysis.
  • A report that describes what you did, and identifies the code regions that consume the most time.
  • quake-omp.c: modified quake.c file with OpenMP directives.
  • In the same report, include the times to run your parallel version on the input file quake.in (for 1, 2, 4, 8, and 16 threads).
You should put the code and report in a single directory (named LastName-assign2), compress it to .tar.gz (LastName-assign2.tar.gz), and upload that to ELMS.

Since quake runs for a while on the above input dataset for small numbers of threads, quake.in.short is another input file that runs in much less time (you can use this for testing).

To time your program, use omp_get_wtime() by placing one call at the beginning of main (next to the omp_get_max_threads call) and another one toward the end of main (before the "Done. Terminating the simulation" print statement).
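A minimal sketch of that timing pattern is shown below. The dummy_work function is just a stand-in for quake's simulation loop (it is not part of quake.c); the point is only the placement of the two omp_get_wtime() calls and reporting their difference.

        #include <stdio.h>
        #include <omp.h>

        /* Stand-in workload; not part of quake.c. */
        static double dummy_work(void)
        {
            double sum = 0.0;
            int i;
            #pragma omp parallel for reduction(+:sum)
            for (i = 0; i < 100000000; i++)
                sum += 1.0 / (i + 1.0);
            return sum;
        }

        int main(void)
        {
            double t_start = omp_get_wtime();   /* first call, near the top of main */
            double result  = dummy_work();
            double t_end   = omp_get_wtime();   /* second call, before the final print */

            printf("result = %f\n", result);
            printf("Elapsed time: %.3f seconds on up to %d threads\n",
                   t_end - t_start, omp_get_max_threads());
            return 0;
        }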

Grading

The project will be graded as follows:

Component                          Percentage
Performance analysis                   20
Runs correctly with 16 threads         40
Performance with 1 thread              10
Speedup on 16 threads                  20
Writeup                                10