Last Updated: 2025-11-30 Sun 21:01

CMSC216 Lab12: mmap() / Thread Performance

CODE DISTRIBUTION: lab12-code.zip

CHANGELOG:

1 Rationale

Programs can request virtual memory to be mapped into their address space via the mmap() system call which is used to create heaps, stacks, and other logical regions of memory. This lab demonstrates one use of this to create simple data structures that are explicitly managed via mmap() calls.

Threads are frequently used to improve the performance of applications when a computation can be subdivided between the threads. This exercise explores doing so in the context of summing elements in matrix columns. Completing it will give a basic overview of launching threads, divvying up work, and synchronizing the threads as they contribute a shared result.

Associated Reading / Preparation

Bryant and O'Hallaron: Ch 12.3-6 covers threaded programming.

Grading Policy

Credit for this exercise is earned by completing the code/answers here and submitting a Zip of the work to Gradescope. Students are responsible for checking that the results produced locally via make test are reflected on Gradescope after submitting their completed Zip. Successful completion earns 1 Engagement Point.

Lab Exercises are open resource/open collaboration and students are encouraged to cooperate on labs. Students may submit work as groups of up to 5 to Gradescope: one person submits then adds the names of their group members to the submission.

See the full policies in the course syllabus.

2 Codepack

The codepack for the lab contains the following files:

File   Description
QUESTIONS.txt EDIT Questions to answer: fill in the multiple choice selections in this file.
     
mapped_array.c EDIT PROBLEM 1 Source file to study and edit
     
csums_thread.h Provided PROBLEM 2 Header file defining types and functions used across the Problem 2 sources
csums_thread_util.c Provided PROBLEM 2 Utility functions to manipulate matrices/vectors
csums_thread_funcs.c EDIT PROBLEM 2 Small "algorithms" to time arithmetic operations
csums_thread_main.c EDIT PROBLEM 2 Main function that times different summing techniques
Makefile Build Enables make test and make zip
QUESTIONS.txt.bk Backup Backup copy of the original file to help revert if needed
QUESTIONS.md5 Testing Checksum for answers in questions file
test_quiz_filter Testing Filter to extract answers from Questions file, used in testing
test_lab12.org Testing Tests for this lab
testy Testing Test running scripts

3 Problem 2 Overview and Background

This lab again deals with matrices and vectors, similar to a previous lab/HW. Review that material if you need a refresher on data types such as matrix_t and vector_t.

The code focuses on completing two functions, col_sums_threaded() and col_sums_worker(), to produce a threaded column sums implementation.

4 Basic Overview of Threads

Threads allow the creation of multiple "threads of execution" within the same program. One primary reason to do this is to speed up the execution of a program by utilizing several threads, each of which can be independently scheduled to run. This allows multiple cores/CPUs to be used simultaneously, which may speed up a computation.

The general flow of thread usage is as follows.

  1. A "main" thread will create several work threads
  2. Worker threads usually begin execution in a function of some sort. While they share most memory with the main thread and each other, each thread often needs to be pointed at some contextual information on what part of the computation it should perform
  3. Worker threads perform their own computations as independently as possible. When they need to coordinate, they will often use a mutex, an operating system supported data structure that enables mutual exclusion of threads: only one thread at a time will perform certain operations to avoid potential problems.
  4. The main thread will wait for all worker threads to complete their tasks then perform any final wrap-up before moving to the next task or reporting answers

This flow is present in the provided code csums_thread_main.c and csums_thread_funcs.c. Examine this flow carefully along with the specific function calls used to facilitate it.

5 Special Note On Testing

The code is tuned to run properly on most NORMAL GRACE nodes (not the grace-aws.umd.edu machine). Performance tests are always difficult to get right. The expected node configuration looks like this (available via the lscpu command):

grace5[~/216/lab12-code]% lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4                          # 4 cpus expected
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2                          # 2 cores expected
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 106                        # cpu model and speed
Model name:            Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
Stepping:              6
CPU MHz:               1995.312
BogoMIPS:              3990.62
Hypervisor vendor:     Microsoft
Virtualization type:   full
L1d cache:             48K                        # approximate cache sizes
L1i cache:             32K
L2 cache:              1280K
L3 cache:              49152K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce ...

If tests fail repeatedly, it may be worthwhile to check the node configuration and/or try logging into a different node. Below is an example of how to log into a different Grace node from the one that is assigned to your initial login session:

laptop> ssh kauf0095@grace.umd.edu
(kauf0095@grace.umd.edu) Password: .............
Duo two-factor login for kauf0095...

grace4>> make test
...
## Assigned to a node but tests are failing, ssh to a different node

grace4>> ssh grace5
(kauf0095@grace.umd.edu) Password: .............
Duo two-factor login for kauf0095...

grace5>> make test
## tests pass more reliably

Below is an example of a successful run:

>> make
gcc -Wall -Werror -g -Og -c csums_thread_main.c		
gcc -Wall -Werror -g -Og -c csums_thread_funcs.c		
gcc -Wall -Werror -g -Og -c csums_thread_util.c
gcc -Wall -Werror -g -Og -o csums_thread_main csums_thread_main.o csums_thread_funcs.o csums_thread_util.o

>> ./csums_thread_main 
usage: ./csums_thread_main <rows> <cols> <thread_count>

>> ./csums_thread_main 10000 10000 2
col_sums_single wall time: 9.7008e-02 sec
col_sum_threaded wall time: 5.8766e-02 sec
2 threads speedup: 1.65

>> ./csums_thread_main 10000 10000 4
col_sums_single wall time: 9.6832e-02 sec
col_sum_threaded wall time: 4.8283e-02 sec
4 threads speedup: 2.01

6 QUESTIONS.txt File Contents

Below are the contents of the QUESTIONS.txt file for the lab. Follow the instructions in it to complete the QUIZ and CODE questions for the lab.

                           _________________

                            LAB12 QUESTIONS
                           _________________


Exercise Instructions
=====================

  Follow the instructions below to experiment with topics related to
  this exercise.
  - For sections marked QUIZ, fill in an (X) for the appropriate
    response in this file. Use the command `make test-quiz' to see if
    all of your answers are correct.
  - For sections marked CODE, complete the code indicated. Use the
    command `make test-code' to check if your code is complete.
  - DO NOT CHANGE any parts of this file except the QUIZ sections as it
    may interfere with the tests otherwise.
  - If your `QUESTIONS.txt' file seems corrupted, restore it by copying
    over the `QUESTIONS.txt.bk' backup file.
  - When you complete the exercises, check your answers with `make test'
    and if all is well, create a zip file with `make zip' and upload it
    to Gradescope. Ensure that the Autograder there reflects your local
    results.
  - IF YOU WORK IN A GROUP only one member needs to submit and then add
    the names of their group members.


PROBLEM 1 Overview
==================

  The file `mapped_array.c' demonstrates use of the `mmap() / munmap()'
  system calls to create a simple extendable array called
  `mappedarr_t'. The code is incomplete and requires students to fill
  in portions of two of the utility functions associated with the
  `mappedarr_t' type.


PROBLEM 1 CODE: Complete mapped_array.c
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Study and complete the mapped_array.c file to get a feel for the use
  of the `mmap() / munmap()' functions. Sections marked TODO require
  code to complete the functionality; preceding portions provide
  similar functionality to model your code on:
  - Study the mappedarr_init() function
  - Use what is found there to complete the mappedarr_resize() function

  When the code is complete, the output of running the `mapped_array'
  program will match the listing below. The behavior of the program can
  be tested via `make test-prob1'.
  ,----
  | >> make
  | ...
  | >> ./mapped_array
  | Created marr at 0x600000000000 for 200 ints / 800 bytes and 1 pages
  | Expanding marr at 0x600000000000 from 200 ints to 800 ints
  | Current pages: 1, Required pages: 1
  | Sufficient pages for requested size, no changes made
  | Expanding marr at 0x600000000000 from 800 ints to 3000 ints
  | Current pages: 1, Required pages: 3
  | Expanded to address 0x600000001000 adding 2 pages
  | Expanding marr at 0x600000000000 from 3000 ints to 4100 ints
  | Current pages: 3, Required pages: 5
  | Expanded to address 0x600000003000 adding 2 pages
  | Expanding marr at 0x600000000000 from 4100 ints to 4200 ints
  | Current pages: 5, Required pages: 5
  | Sufficient pages for requested size, no changes made
  | Created marr at 0x610000002000 for 1000 ints / 4000 bytes and 1 pages
  | Expanding marr at 0x610000002000 from 1000 ints to 9000 ints
  | Current pages: 1, Required pages: 9
  | Expanded to address 0x610000003000 adding 8 pages
  | Created marr at 0x610000000000 for 2000 ints / 8000 bytes and 2 pages
  | Expanding marr at 0x610000000000 from 2000 ints to 5000 ints
  | Current pages: 2, Required pages: 5
  | Failed expansion to address 0x610000002000 with 3 pages
  |   Mapped pages at alternate, non-contiguous address
  | Unmapping 0x600000000000 with 5 pages
  | Unmapping 0x610000002000 with 9 pages
  | Unmapping 0x610000000000 with 2 pages
  | 
  | >> make test-prob1
  | ...
  | ./testy -o md test_mapped_array.org 
  | =============================================
  | == test_mapped_array.org : mapped_array Tests
  | == Running 1 / 1 tests
  | 1) PROBLEM 1 CODE: mapped_array : ok
  | =============================================
  | RESULTS: 1 / 1 tests passed
  `----


PROBLEM 1 QUIZ: mmap() Allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  After gaining an understanding of mapped_array.c, answer the following
  questions about mmap().


(A)
---

  After `mmap()' is called, which of the below is used to check whether
  the requested address was successfully mapped?
  - ( ) The return value from `mmap()' is checked for the special value
    `MAP_FAILED' which indicates a failure to map the requested address
  - ( ) The return value from `mmap()' is checked for `NULL' which
    indicates a failure to map the requested address
  - ( ) The return value from `mmap()' is checked for `-1' which
    indicates a failure to map the requested address
  - ( ) The return value for `mmap()' is checked for whether it matches
    the requested address; if not, the mapping did not get made at the
    requested address


(B)
---

  In the `main()' function, why does the last call to expand `marr3'
  fail?
  - ( ) The requested expansion size is too big for the virtual memory
    system to support so the request is rejected
  - ( ) The requested expansion conflicts with pages that are already
    allocated so pages are allocated at an address different from the
    one requested which won't work for a contiguous array
  - ( ) The parameters to the call are incorrect leading to an error
    path in the code
  - ( ) The file that is to be mapped is unavailable so the operating
    system cannot map it


(C)
---

  Run the command `mapped_array 1' (with a 1 on the command line) to
  show the output of pmap on the process before it exits which shows the
  entire virtual address space of the process.

  Which of the below entries from the `pmap' output correspond to the
  `marr1' mapped array?
  - ( ) `0000600000000000 20K rw--- [ anon ]'
  - ( ) `0000610000000000 44K rw--- [ anon ]'
  - ( ) `00005607d7940000 132K rw--- [ anon ]'
  - ( ) `00007ffea14ba000 132K rw--- [ stack ]'

  Which of the below best describes how the `marr2' and `marr3' entries
  appear?
  - ( ) They appear within an entry like this
    ,----
    |   00007ffea14ba000    132K rw---   [ stack ]
    `----
    which is the Stack; these arrays are in the stack at a location
    determined by `mmap()' in collaboration with the process stack
    allocator
  - ( ) They appear within an entry like this
    ,----
    |   00005607d7940000    132K rw---   [ anon ]
    `----
    which is the Heap; these arrays are in the heap at a location
    determined by `mmap()' in collaboration with `malloc()'
  - ( ) They appear as two separate entries like this:
    ,----
    |   0000610000000000      8K rw---   [ anon ]
    |   0000610000002000     36K rw---   [ anon ]
    `----
    because the two regions are mapped via separate `mmap()' calls.
  - ( ) They appear as a single entry like this:
    ,----
    |   0000610000000000     44K rw---   [ anon ]
    `----
    because the two mapped regions for the arrays are contiguous and
    total 44K bytes


PROBLEM 2 Overview
==================

  This problem studies optimizing a matrix computation using two
  techniques.
  1. Ensuring the matrix is accessed in a pattern that is favorable to
     cache.
  2. Employing multiple threads to subdivide the computation for
     parallel execution.

  UNLESS BOTH THESE TECHNIQUES ARE USED, PERFORMANCE WILL BE
  SUBOPTIMAL. Both techniques are discussed as they are both needed for
  an upcoming project. Technique (1) has been previously discussed in an
  assignment but is worth reviewing.


PROBLEM 2 QUIZ: col_sums_A() vs col_sums_B()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The file `example_matsums_funcs.c' contains the solution to a
  previous lab/HW. What is the primary difference between the two
  functions `col_sums_A() / col_sums_B()'?


(A) Code Differences
--------------------

  - ( ) One version allocates and uses more memory than the other which
    will likely affect their relative speeds
  - ( ) One version uses a row-major matrix while the other uses a
    column-major matrix meaning that the underlying memory layout of the
    matrices used is different between the functions
  - ( ) They both compute the sums of all columns but traverse the
    matrix in a different order
  - ( ) They compute two different properties of the columns of the
    matrix and therefore necessarily need to traverse the matrix in a
    different order


(B) Speed Differences
---------------------

  Between the two versions, which is faster and why?
  - ( ) col_sums_A() is faster as it traverses the elements of the
    matrix with no stride, sequentially visiting memory addresses in a
    pattern that is greatly sped up by CPU caches.
  - ( ) col_sums_B() is faster as it traverses the elements of the
    matrix with no stride, sequentially visiting memory addresses in a
    pattern that is greatly sped up by CPU caches.
  - ( ) col_sums_A() is faster as it allocates less memory overall so
    that there is less "memory cache pressure" and the computation can
    proceed faster
  - ( ) col_sums_B() is faster as it allocates less memory overall so
    that there is less "memory cache pressure" and the computation can
    proceed faster


PROBLEM 2 QUIZ: Thread Startup and Shutdown
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(A) Thread Creation and Completion
----------------------------------

  Examine `csums_thread_main.c' and `csums_thread_funcs.c'. Determine
  where worker threads are created and the function used to do so.
  - ( ) Worker threads are created in `main()' using a version of
    `fork()' with optional arguments that were not previously used to
    create processes.
  - ( ) Worker threads are created in `main()' using the
    `pthread_create()' function which has as one of its arguments a
    function that threads start running
  - ( ) Worker threads are created in `col_sums_threaded()' using a
    version of `fork()' with optional arguments that were not previously
    used to create processes.
  - ( ) Worker threads are created in `col_sums_threaded()' using the
    `pthread_create()' function which has as one of its arguments a
    function that threads start running


(B) Thread Completion
---------------------

  Research the function that is used to block a thread until another
  thread completes.  This is used in `csums_thread_funcs.c' to block the
  "main" thread until all worker threads have completed.  Which of the
  following is the most appropriate function to block until a thread has
  completed?

  - ( ) wait_pid(thread_id, NULL, 0)
  - ( ) pthread_join(pthread_struct, NULL)
  - ( ) pthread_wait(pthread_struct, NULL)
  - ( ) kill(thread_id, 9)


PROBLEM 2 QUIZ: col_sums_single() vs col_sums_worker()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  In the file csums_thread_funcs.c, compare the two functions
  col_sums_single() and col_sums_worker().

  Which of the following denote similarities between these two
  functions? Mark all that apply.

  - ( ) Both have the same types of arguments
  - ( ) Both have the same basic computational loop to sum some/all of
    matrix columns
  - ( ) Both make use of a local array allocated in the function to
    store partial results
  - ( ) Both use mutexes for controlled access to shared data

  Examine the differences between these two functions and mark ALL that
  apply below.
  - ( ) col_sums_single() sums all columns while col_sums_worker() only
    computes partial sums of columns
  - ( ) col_sums_worker() accepts its argument through a struct that
    must be cast to access its fields
  - ( ) col_sums_single() accesses the mat/vec through locals while
    col_sums_threaded() accesses the mat/vec through a global variable

  A comment in the `col_sums_worker()' function indicates that, before
  adding on to `vec.data[]', a worker thread should lock a mutex.  What
  reason is given for this requirement?
  - ( ) Locking the mutex converts a pointer from NULL to a valid
    location so that the thread does not cause a segmentation fault.
  - ( ) Locking the mutex improves the performance of adding onto
    `vec.data[]' and speedup is the goal
  - ( ) Locking prevents multiple threads from simultaneously modifying
    the results which may corrupt them.
  - ( ) Trick question: this is actually a lock-free solution which does
    not require any locking.


PROBLEM 2 CODE: Complete col_sums_threaded()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Complete the TODO items in csums_thread_funcs.c to finish a threaded
  implementation of the column summing. Build it and test it on the
  command line to ensure that you see some speedup from using threads.

  CODE and performance tests for this problem can be run via `make
  test-prob2'. Specifically, the tests look for correctness and about a
  1.5x speedup using 2 threads for the column sums computation.

  *SPECIAL NOTE ON TESTING*: The code is tuned to run properly on most
  NORMAL GRACE nodes (not the `grace-aws.umd.edu' machine). Performance
  tests are always difficult to get right. The expected node
  configuration looks like this (available via the `lscpu' command):
  ,----
  | grace5[~/216/lab12-code]% lscpu
  | Architecture:          x86_64
  | CPU op-mode(s):        32-bit, 64-bit
  | Byte Order:            Little Endian
  | CPU(s):                4                          # 4 cpus expected
  | On-line CPU(s) list:   0-3
  | Thread(s) per core:    2
  | Core(s) per socket:    2                          # 2 cores expected
  | Socket(s):             1
  | NUMA node(s):          1
  | Vendor ID:             GenuineIntel
  | CPU family:            6
  | Model:                 106                        # cpu model and speed
  | Model name:            Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
  | Stepping:              6
  | CPU MHz:               1995.312
  | BogoMIPS:              3990.62
  | Hypervisor vendor:     Microsoft
  | Virtualization type:   full
  | L1d cache:             48K                        # approximate cache sizes
  | L1i cache:             32K
  | L2 cache:              1280K
  | L3 cache:              49152K
  | NUMA node0 CPU(s):     0-3
  | Flags:                 fpu vme de pse tsc msr pae mce ...
  `----

  If tests fail repeatedly, it may be worthwhile to check the node
  configuration and/or try logging into a different node.  Below is an
  example of how to log into a different Grace node from the one that is
  assigned to your initial login session:
  ,----
  | laptop> ssh kauf0095@grace.umd.edu
  | (kauf0095@grace.umd.edu) Password: .............
  | Duo two-factor login for kauf0095...
  | 
  | grace4>> make test
  | ...
  | ## Assigned to a node but tests are failing, ssh to a different node
  | 
  | grace4>> ssh grace5
  | (kauf0095@grace.umd.edu) Password: .............
  | Duo two-factor login for kauf0095...
  | 
  | grace5>> make test
  | ## tests pass more reliably
  `----

  Below is an example of a successful run:
  ,----
  | >> make
  | gcc -Wall -Werror -g -Og -c csums_thread_main.c		
  | gcc -Wall -Werror -g -Og -c csums_thread_funcs.c		
  | gcc -Wall -Werror -g -Og -c csums_thread_util.c
  | gcc -Wall -Werror -g -Og -o csums_thread_main csums_thread_main.o csums_thread_funcs.o csums_thread_util.o
  | 
  | >> ./csums_thread_main 
  | usage: ./csums_thread_main <rows> <cols> <thread_count>
  | 
  | >> ./csums_thread_main 10000 10000 2
  | col_sums_single wall time: 9.7008e-02 sec
  | col_sum_threaded wall time: 5.8766e-02 sec
  | 2 threads speedup: 1.65
  | 
  | >> ./csums_thread_main 10000 10000 4
  | col_sums_single wall time: 9.6832e-02 sec
  | col_sum_threaded wall time: 4.8283e-02 sec
  | 4 threads speedup: 2.01
  `----

7 Submission

Follow the instructions at the end of Lab01 if you need a refresher on how to upload your completed lab zip to Gradescope.


Author: Chris Kauffman (profk@umd.edu)
Date: 2025-11-30 Sun 21:01