CMSC416/CMSC818X - Introduction to Parallel Computing (Fall 2021)

Introduction to Parallel Computing (CMSC416/CMSC818X)

Assignment 1: MPI

Due: Monday October 11, 2021 @ 11:59 PM Eastern Time

The purpose of this programming assignment is to gain experience in parallel programming on a cluster and MPI. For this assignment, you have to write a parallel implementation of a program to simulate the Game of Life.

Serial Algorithm

The game of life simulates simple cellular automata. The game is played on a rectangular board containing cells. At the start, some of the cells are occupied, the rest are empty. The game consists of constructing successive generations of the board. The rules for constructing the next generation from the previous one are:

death: cells with 0,1,4,5,6,7, or 8 neighbors die (0,1 of loneliness and 4-8 of over population)
survival: cells with 2 or 3 neighbors survive to the next generation.
birth: an unoccupied cell with 3 neighbors becomes occupied in the next generation.

For this assignment, the game board has finite size. The x-axis starts at 0 and ends at X_limit-1 (supplied on the command line). Likewise, the y-axis start at 0 and ends at Y_limit-1 (supplied on the command line).

You can use the provided serial code as a baseline to develop your parallel implementation. It provides some basic functionality such as parsing the input file and exporting the final board state to a CSV file. Your task is to implement the parallel version using C, C++, or Fortran, and MPI.

Input/Initialization

Your program should read in a file containing the coordinates of the initial cells. Sample files are located here: life.1.256x256.data and life.2.256x256.data (256x256 board). Each line in this file represents the coordinates of a cell on the board that is live. For instance, the following entry:

1,3

means that the cell at position [1, 3] is live in the initial state. You can also find many other sample patterns on the web (use your favorite search engine on "game of life" and/or "Conway").

Your program should take four command line arguments: the name of the data file, the number of generations to iterate, X_limit, and Y_limit. To be more specific, the command line of your program should be:


          ./life <input file name> <# of generations> <X_limit> <Y_limit>

The number of processes the program will run on is specified as part of the mpirun command with the -np argument.


          mpirun -np <number of processes> ./life <input file name> <# of generations> <X_limit> <Y_limit>

Output

Your program should write a single file called <input-file-name>.<no-of-generations>.csv (from one designated rank) that contains comma separated values representing the board. There should be one line (containing the x coordinate, a comma, and then the y coordinate) for each occupied cell at the end of the last iteration. Sample output files are available:

life.1.256x256.100.csv is the output of the file life.1.256x256.data run for 100 generations on a 256x256 board
life.2.256x256.100.csv is the output of the file life.2.256x256.data run for 100 generations on a 256x256 board

Similar to the output files above, your output in the file should be sorted by the X and Y coordinates.

The only print from your program to standard output should be from process 0 that looks like this:


        TIME: Min: 25.389 s Avg: 27.452 s Max: 41.672 s

where Min, Avg and Max time (in seconds) are calculated using MPI reduction operations over the individual time measurements of the "main" loop (over generations) on different processes for the sample final.512x512.data input file.

Parallel Version

Figure out how you will decompose the problem for parallel execution. Remember that MPI (at least the OpenMPI implementation) does not always have great communication performance and so you will want to make message passing infrequent. Also, you will need to be concerned about static load balancing during data distribution/domain decomposition. To learn about decomposing the problem in different ways, you must generate two parallel versions of the program, one that uses a 1D decomposition (rows or columns) and one that uses a 2D decomposition (both rows and columns).

What to Submit

You must submit the following files and no other files:

life1d.[c,C,f77,f90]: parallel version with 1D decomposition, where the file extension depends on the language used for the implementation
life2d.[c,C,f77,f90]: parallel version with 2D decomposition
Makefile that will compile your code successfully on deepthought2 when using mpicc or mpicxx. You can see a sample Makefile here. Make sure that the executable names are life1d and life2d, and do not include the executables in the tarball.
You must also submit a short report (pdf) with performance results (two line plots). The line plots should present the execution times to run the 1D and 2D parallel versions respectively on the input file final.512x512.data (for 4, 8, 16, and 32 processes), running on a 512x512 board for 500 iterations. Since the cluster you run on has 20 cores per node, you must time your code running on two different numbers of cores per node (--ntasks-per-node=8 and 16) to see the performance effects. In total, you will be running each program 8 times for 4 process counts X 2 cores/node settings. In the report, you should mention:
- how you did the decompositions
- how was the initial data distribution done
- what are the performance results, and are they what you expected

You should put the code, Makefile and report in a single directory (named LastName-FirstName-assign1), compress it to .tar.gz (LastName-FirstName-assign1.tar.gz) and upload that to ELMS.

If you want to try a bigger board, to see if you can get better speedups with more processes, try running on the input file life.1024x1024.data.

Tips

Quick deepthought2 primer.
Use the compiler flag -g while debugging but -O2 when collecting performance numbers.
MPI_Wtime example

Grading

The project will be graded as follows:

Component	Percentage
Runs correctly with 4 processes	30 (15% each decomposition)
Runs correctly with 16 processes	40 (20% each decomposition)
Speedup with 4 processes	10 (5% each decomposition)
Speedup with 16 processes	10 (5% each decomposition)
Writeup	10

NOTE: If your program does not run correctly, you do NOT get any points for performance/speedup.

The speedup will be computed using the provided serial code as the baseline. You will get full points as long as your speedup is within 2 times (2x) of the expected speedup. For performance worse than 2x of the expected speedup, you will receive credit proportional to your speedup.