Last Updated: 2025-11-30 Sun 21:01

CMSC216 Lab12: mmap() / Thread Performance

CODE DISTRIBUTION: lab12-code.zip

CHANGELOG:

1 Rationale

Programs can request virtual memory to be mapped into their address space via the mmap() system call which is used to create heaps, stacks, and other logical regions of memory. This lab demonstrates one use of this to create simple data structures that are explicitly managed via mmap() calls.

Threads are frequently used to improve the performance of applications when a computation can be subdivided between the threads. This exercise explores doing so in the context of summing elements in matrix columns. Completing it will give a basic overview of launching threads, divvying up work, and synchronizing the threads as they contribute a shared result.

Associated Reading / Preparation

Bryant and O'Hallaron: Ch 12.3-6 covers threaded programming.

Grading Policy

Credit for this exercise is earned by completing the code/answers here and submitting a Zip of the work to Gradescope. Students are responsible for checking that the results produced locally via make test are reflected on Gradescope after submitting their completed Zip. Successful completion earns 1 Engagement Point.

Lab Exercises are open resource/open collaboration and students are encouraged to cooperate on labs. Students may submit work as groups of up to 5 to Gradescope: one person submits then adds the names of their group members to the submission.

See the full policies in the course syllabus.

2 Codepack

The codepack for the lab contains the following files:

File   Description
QUESTIONS.txt EDIT Questions to answer: fill in the multiple choice selections in this file.
     
mapped_array.c EDIT PROBLEM 1 Source file to study and edit
     
csums_thread.h Provided PROBLEM 2 Header file defining types and functions used across the Problem 2 sources
csums_thread_util.c Provided PROBLEM 2 Utility functions to manipulate matrices/vectors
csums_thread_funcs.c EDIT PROBLEM 2 Small "algorithms" to time arithmetic operations
csums_thread_main.c EDIT PROBLEM 2 Main function that times different summing techniques
Makefile Build Enables make test and make zip
QUESTIONS.txt.bk Backup Backup copy of the original file to help revert if needed
QUESTIONS.md5 Testing Checksum for answers in questions file
test_quiz_filter Testing Filter to extract answers from Questions file, used in testing
test_lab12.org Testing Tests for this lab
testy Testing Test running scripts

3 Problem 2 Overview and Background

This lab again deals with matrices and vectors, similar to a previous lab/HW. Review that material if you need a refresher on data types such as matrix_t and vector_t.

The code focuses on completing two functions, col_sums_threaded() and col_sums_worker(), to produce a threaded column sums implementation.

4 Basic Overview of Threads

Threads allow the creation of multiple "threads of execution" within the same program. One primary reason to do this is to speed up the execution of a program by utilizing several threads, each of which can be independently scheduled to run. This allows multiple cores/CPUs to be used simultaneously, which may speed up a computation.

The general flow of thread usage is as follows.

  1. A "main" thread will create several work threads
  2. Worker threads usually begin execution in a function of some sort. While they share most memory with the main thread and each other, each thread often needs to be pointed at some contextual information on what part of the computation it should perform
  3. Worker threads perform their own computations as independently as possible. When they need to coordinate, they will often use a mutex, an operating system supported data structure that enables mutual exclusion of threads: only one thread at a time will perform certain operations to avoid potential problems.
  4. The main thread will wait for all worker threads to complete their tasks then perform any final wrap-up before moving to the next task or reporting answers

This flow is present in the provided code csums_thread_main.c and csums_thread_funcs.c. Examine this flow carefully along with the specific function calls used to facilitate it.

5 Special Note On Testing

The code is tuned to run properly on most NORMAL GRACE nodes (not the grace-aws.umd.edu machine). Performance tests are always difficult to get right. The expected node configuration looks like this (available via the lscpu command):

grace5[~/216/lab12-code]% lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4                          # 4 cpus expected
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2                          # 2 cores expected
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 106                        # cpu model and speed
Model name:            Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
Stepping:              6
CPU MHz:               1995.312
BogoMIPS:              3990.62
Hypervisor vendor:     Microsoft
Virtualization type:   full
L1d cache:             48K                        # approximate cache sizes
L1i cache:             32K
L2 cache:              1280K
L3 cache:              49152K
NUMA node0 CPU(s):     0-3
Flags:                 fpu vme de pse tsc msr pae mce ...

If tests fail repeatedly, it may be worthwhile to check the node configuration and/or try logging into a different node. Below is an example of how to log into a different Grace node from the one that is assigned to your initial login session:

laptop> ssh kauf0095@grace.umd.edu
(kauf0095@grace.umd.edu) Password: .............
Duo two-factor login for kauf0095...

grace4>> make test
...
## Assigned to a node but tests are failing, ssh to a different node

grace4>> ssh grace5
(kauf0095@grace.umd.edu) Password: .............
Duo two-factor login for kauf0095...

grace5>> make test
## tests pass more reliably

Below is an example of a successful run:

>> make
gcc -Wall -Werror -g -Og -c csums_thread_main.c		
gcc -Wall -Werror -g -Og -c csums_thread_funcs.c		
gcc -Wall -Werror -g -Og -c csums_thread_util.c
gcc -Wall -Werror -g -Og -o csums_thread_main csums_thread_main.o csums_thread_funcs.o csums_thread_util.o

>> ./csums_thread_main 
usage: ./csums_thread_main <rows> <cols> <thread_count>

>> ./csums_thread_main 10000 10000 2
col_sums_single wall time: 9.7008e-02 sec
col_sum_threaded wall time: 5.8766e-02 sec
2 threads speedup: 1.65

>> ./csums_thread_main 10000 10000 4
col_sums_single wall time: 9.6832e-02 sec
col_sum_threaded wall time: 4.8283e-02 sec
4 threads speedup: 2.01

6 QUESTIONS.txt File Contents

Below are the contents of the QUESTIONS.txt file for the lab. Follow the instructions in it to complete the QUIZ and CODE questions for the lab.

                           _________________

                            LAB12 QUESTIONS
                           _________________


Exercise Instructions
=====================

  Follow the instructions below to experiment with topics related to
  this exercise.
  - For sections marked QUIZ, fill in an (X) for the appropriate
    response in this file. Use the command `make test-quiz' to see if
    all of your answers are correct.
  - For sections marked CODE, complete the code indicated. Use the
    command `make test-code' to check if your code is complete.
  - DO NOT CHANGE any parts of this file except the QUIZ sections as it
    may interfere with the tests otherwise.
  - If your `QUESTIONS.txt' file seems corrupted, restore it by copying
    over the `QUESTIONS.txt.bk' backup file.
  - When you complete the exercises, check your answers with `make test'
    and if all is well, create a zip file with `make zip' and upload it
    to Gradescope. Ensure that the Autograder there reflects your local
    results.
  - IF YOU WORK IN A GROUP only one member needs to submit and then add
    the names of their group members.


PROBLEM 1 Overview
==================

  The file `mapped_array.c' demonstrates use of the `mmap() / munmap()'
  system calls to create a simple extendable array called
  `mappedarr_t'. The code is incomplete and requires students to fill
  in portions of two of the utility functions associated with the
  `mappedarr_t' type.


PROBLEM 1 CODE: Complete mapped_array.c
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Study and complete the mapped_array.c file to get a feel for the use
  of the `mmap() / munmap()' functions. Sections marked TODO require
  code to complete the functionality; preceding portions provide
  similar functionality to model your code on:
  - Study the mappedarr_init() function
  - Use what is found there to complete the mappedarr_resize() function

  When the code is complete, the output of running the `mapped_array'
  program will match the listing below. The behavior of the program can
  be tested via `make test-prob1'.
  ,----
  | >> make
  | ...
  | >> ./mapped_array
  | Created marr at 0x600000000000 for 200 ints / 800 bytes and 1 pages
  | Expanding marr at 0x600000000000 from 200 ints to 800 ints
  | Current pages: 1, Required pages: 1
  | Sufficient pages for requested size, no changes made
  | Expanding marr at 0x600000000000 from 800 ints to 3000 ints
  | Current pages: 1, Required pages: 3
  | Expanded to address 0x600000001000 adding 2 pages
  | Expanding marr at 0x600000000000 from 3000 ints to 4100 ints
  | Current pages: 3, Required pages: 5
  | Expanded to address 0x600000003000 adding 2 pages
  | Expanding marr at 0x600000000000 from 4100 ints to 4200 ints
  | Current pages: 5, Required pages: 5
  | Sufficient pages for requested size, no changes made
  | Created marr at 0x610000002000 for 1000 ints / 4000 bytes and 1 pages
  | Expanding marr at 0x610000002000 from 1000 ints to 9000 ints
  | Current pages: 1, Required pages: 9
  | Expanded to address 0x610000003000 adding 8 pages
  | Created marr at 0x610000000000 for 2000 ints / 8000 bytes and 2 pages
  | Expanding marr at 0x610000000000 from 2000 ints to 5000 ints
  | Current pages: 2, Required pages: 5
  | Failed expansion to address 0x610000002000 with 3 pages
  |   Mapped pages at alternate, non-contiguous address
  | Unmapping 0x600000000000 with 5 pages
  | Unmapping 0x610000002000 with 9 pages
  | Unmapping 0x610000000000 with 2 pages
  | 
  | >> make test-prob1
  | ...
  | ./testy -o md test_mapped_array.org 
  | =============================================
  | == test_mapped_array.org : mapped_array Tests
  | == Running 1 / 1 tests
  | 1) PROBLEM 1 CODE: mapped_array : ok
  | =============================================
  | RESULTS: 1 / 1 tests passed
  `----


PROBLEM 1 QUIZ: mmap() Allocation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  After gaining an understanding of mapped_array.c, answer the following
  questions about mmap().


(A)
---

  After `mmap()' is called, which of the below is used to check whether
  the requested address was successfully mapped?
  - ( ) The return value from `mmap()' is checked for the special value
    `MAP_FAILED' which indicates a failure to map the requested address
  - ( ) The return value from `mmap()' is checked for `NULL' which
    indicates a failure to map the requested address
  - ( ) The return value from `mmap()' is checked for `-1' which
    indicates a failure to map the requested address
  - ( ) The return value for `mmap()' is checked for whether it matches
    the requested address; if not, the mapping did not get made at the
    requested address


(B)
---

  In the `main()' function, why does the last call to expand `marr3'
  fail?
  - ( ) The requested expansion size is too big for the virtual memory
    system to support so the request is rejected
  - ( ) The requested expansion conflicts with pages that are already
    allocated so pages are allocated at an address different from the
    one requested which won't work for a contiguous array
  - ( ) The parameters to the call are incorrect leading to an error
    path in the code
  - ( ) The file that is to be mapped is unavailable so the operating
    system cannot map it


(C)
---

  Run the command `mapped_array 1' (with a 1 on the command line) to
  show the output of pmap on the process before it exits which shows the
  entire virtual address space of the process.

  Which of the below entries from the `pmap' output correspond to the
  `marr1' mapped array?
  - ( ) `0000600000000000 20K rw--- [ anon ]'
  - ( ) `0000610000000000 44K rw--- [ anon ]'
  - ( ) `00005607d7940000 132K rw--- [ anon ]'
  - ( ) `00007ffea14ba000 132K rw--- [ stack ]'

  Which of the below best describes how the `marr2' and `marr3' entries
  appear?
  - ( ) They appear within an entry like this
    ,----
    |   00007ffea14ba000    132K rw---   [ stack ]
    `----
    which is the Stack; these arrays are in the stack at a location
    determined by `mmap()' in collaboration with the process stack
    allocator
  - ( ) They appear within an entry like this
    ,----
    |   00005607d7940000    132K rw---   [ anon ]
    `----
    which is the Heap; these arrays are in the heap at a location
    determined by `mmap()' in collaboration with `malloc()'
  - ( ) They appear as two separate entries like this:
    ,----
    |   0000610000000000      8K rw---   [ anon ]
    |   0000610000002000     36K rw---   [ anon ]
    `----
    because the two regions are mapped via separate `mmap()' calls.
  - ( ) They appear as a single entry like this:
    ,----
    |   0000610000000000     44K rw---   [ anon ]
    `----
    because the two mapped regions for the arrays are contiguous and
    total 44K bytes


PROBLEM 2 Overview
==================

  This problem studies optimizing a matrix computation using two
  techniques.
  1. Ensuring the matrix is accessed in a pattern that is favorable to
     cache.
  2. Employing multiple threads to subdivide the computation for
     parallel execution.

  UNLESS BOTH THESE TECHNIQUES ARE USED, PERFORMANCE WILL BE
  SUBOPTIMAL. Both techniques are discussed as they are both needed for
  an upcoming project. Technique (1) has been previously discussed in an
  assignment but is worth reviewing.


PROBLEM 2 QUIZ: col_sums_A() vs col_sums_B()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  The file `example_matsums_funcs.c' contains the solution to a
  previous lab/HW. What is the primary difference between the two
  functions `col_sums_A() / col_sums_B()'?


(A) Code Differences
--------------------

  - ( ) One version allocates and uses more memory than the other which
    will likely affect their relative speeds
  - ( ) One version uses a row-major matrix while the other uses a
    column-major matrix meaning that the underlying memory layout of the
    matrices used is different between the functions
  - ( ) They both compute the sums of all columns but traverse the
    matrix in a different order
  - ( ) They compute two different properties of the columns of the
    matrix and therefore necessarily need to traverse the matrix in a
    different order


(B) Speed Differences
---------------------

  Between the two versions, which is faster and why?
  - ( ) col_sums_A() is faster as it traverses the elements of the
    matrix with no stride, sequentially visiting memory addresses in a
    pattern that is greatly sped up by CPU caches.
  - ( ) col_sums_B() is faster as it traverses the elements of the
    matrix with no stride, sequentially visiting memory addresses in a
    pattern that is greatly sped up by CPU caches.
  - ( ) col_sums_A() is faster as it allocates less memory overall so
    that there is less "memory cache pressure" and the computation can
    proceed faster
  - ( ) col_sums_B() is faster as it allocates less memory overall so
    that there is less "memory cache pressure" and the computation can
    proceed faster


PROBLEM 2 QUIZ: Thread Startup and Shutdown
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(A) Thread Creation and Completion
----------------------------------

  Examine `csums_thread_main.c' and `csums_thread_funcs.c'. Determine
  where worker threads are created and the function used to do so.
  - ( ) Worker threads are created in `main()' using a version of
    `fork()' with optional arguments that were not previously used to
    create processes.
  - ( ) Worker threads are created in `main()' using the
    `pthread_create()' function which has as one of its arguments a
    function that threads start running
  - ( ) Worker threads are created in `col_sums_threaded()' using a
    version of `fork()' with optional arguments that were not previously
    used to create processes.
  - ( ) Worker threads are created in `col_sums_threaded()' using the
    `pthread_create()' function which has as one of its arguments a
    function that threads start running


(B) Thread Completion
---------------------

  Research the function that is used to block a thread until another
  thread completes.  This is used in `csums_thread_funcs.c' to block the
  "main" thread until all worker threads have completed.  Which of the
  following is the most appropriate function to block until a thread has
  completed?

  - ( ) wait_pid(thread_id, NULL, 0)
  - ( ) pthread_join(pthread_struct, NULL)
  - ( ) pthread_wait(pthread_struct, NULL)
  - ( ) kill(thread_id, 9)


PROBLEM 2 QUIZ: col_sums_single() vs col_sums_worker()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  In the file csums_thread_funcs.c, compare the two functions
  col_sums_single() and col_sums_worker().

  Which of the following denote similarities between these two
  functions? Mark all that apply.

  - ( ) Both have the same types of arguments
  - ( ) Both have the same basic computational loop to sum some/all of
    matrix columns
  - ( ) Both make use of a local array allocated in the function to
    store partial results
  - ( ) Both use mutexes for controlled access to shared data

  Examine the differences between these two functions and mark ALL that
  apply below.
  - ( ) col_sums_single() sums all columns while col_sums_worker() only
    computes partial sums of columns
  - ( ) col_sums_worker() accepts its argument through a struct that
    must be cast to access its fields
  - ( ) col_sums_single() accesses the mat/vec through locals while
    col_sums_threaded() accesses the mat/vec through a global variable

  A comment in the `col_sums_worker()' function indicates that, before
  adding on to `vec.data[]', a worker thread should lock a mutex.  What
  reason is given for this requirement?
  - ( ) Locking the mutex converts a pointer from NULL to a valid
    location so that the thread does not cause a segmentation fault.
  - ( ) Locking the mutex improves the performance of adding onto
    `vec.data[]' and speedup is the goal
  - ( ) Locking prevents multiple threads from simultaneously modifying
    the results which may corrupt them.
  - ( ) Trick question: this is actually a lock-free solution which does
    not require any locking.


PROBLEM 2 CODE: Complete col_sums_threaded()
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

  Complete the TODO items in csums_thread_funcs.c to finish a threaded
  implementation of the column summing. Build it and test it on the
  command line to ensure that you see some speedup from using threads.

  CODE and performance tests for this problem can be run via `make
  test-prob2'. Specifically, the tests look for correctness and about a
  1.5x speedup using 2 threads for the column sums computation.

  *SPECIAL NOTE ON TESTING*: The code is tuned to run properly on most
  NORMAL GRACE nodes (not the `grace-aws.umd.edu' machine). Performance
  tests are always difficult to get right. The expected node
  configuration looks like this (available via the `lscpu' command):
  ,----
  | grace5[~/216/lab12-code]% lscpu
  | Architecture:          x86_64
  | CPU op-mode(s):        32-bit, 64-bit
  | Byte Order:            Little Endian
  | CPU(s):                4                          # 4 cpus expected
  | On-line CPU(s) list:   0-3
  | Thread(s) per core:    2
  | Core(s) per socket:    2                          # 2 cores expected
  | Socket(s):             1
  | NUMA node(s):          1
  | Vendor ID:             GenuineIntel
  | CPU family:            6
  | Model:                 106                        # cpu model and speed
  | Model name:            Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
  | Stepping:              6
  | CPU MHz:               1995.312
  | BogoMIPS:              3990.62
  | Hypervisor vendor:     Microsoft
  | Virtualization type:   full
  | L1d cache:             48K                        # approximate cache sizes
  | L1i cache:             32K
  | L2 cache:              1280K
  | L3 cache:              49152K
  | NUMA node0 CPU(s):     0-3
  | Flags:                 fpu vme de pse tsc msr pae mce ...
  `----

  If tests fail repeatedly, it may be worthwhile to check the node
  configuration and/or try logging into a different node.  Below is an
  example of how to log into a different Grace node from the one that is
  assigned to your initial login session:
  ,----
  | laptop> ssh kauf0095@grace.umd.edu
  | (kauf0095@grace.umd.edu) Password: .............
  | Duo two-factor login for kauf0095...
  | 
  | grace4>> make test
  | ...
  | ## Assigned to a node but tests are failing, ssh to a different node
  | 
  | grace4>> ssh grace5
  | (kauf0095@grace.umd.edu) Password: .............
  | Duo two-factor login for kauf0095...
  | 
  | grace5>> make test
  | ## tests pass more reliably
  `----

  Below is an example of a successful run:
  ,----
  | >> make
  | gcc -Wall -Werror -g -Og -c csums_thread_main.c		
  | gcc -Wall -Werror -g -Og -c csums_thread_funcs.c		
  | gcc -Wall -Werror -g -Og -c csums_thread_util.c
  | gcc -Wall -Werror -g -Og -o csums_thread_main csums_thread_main.o csums_thread_funcs.o csums_thread_util.o
  | 
  | >> ./csums_thread_main 
  | usage: ./csums_thread_main <rows> <cols> <thread_count>
  | 
  | >> ./csums_thread_main 10000 10000 2
  | col_sums_single wall time: 9.7008e-02 sec
  | col_sum_threaded wall time: 5.8766e-02 sec
  | 2 threads speedup: 1.65
  | 
  | >> ./csums_thread_main 10000 10000 4
  | col_sums_single wall time: 9.6832e-02 sec
  | col_sum_threaded wall time: 4.8283e-02 sec
  | 4 threads speedup: 2.01
  `----

7 Submission

Follow the instructions at the end of Lab01 if you need a refresher on how to upload your completed lab zip to Gradescope.


Author: Chris Kauffman (profk@umd.edu)
Date: 2025-11-30 Sun 21:01