#### Introduction to Parallel Computing (CMSC416)



#### Alan Sussman, Department of Computer Science



### **GPGPUs and CUDA**

Many slides borrowed from Daniel Nichols' slides



## Announcements

- Assignment I (MPI) was due today at I I:59 pm
  - We are waiving the penalty for late submissions, so you have until Wed. at 11:59PM to get full credit
  - Questions?
- Assignment 2 will be released on Wed.
  - Will discuss in class on Thursday



## GPGPUs

- Originally developed to handle computation related to graphics processing
- Also found to be useful for scientific computing
- Hence the name: General Purpose Graphics Processing Unit





### Accelerators

#### IBM's Cell processors

Used in Sony's Playstation 3 (2006)

### • GPUs: NVIDIA, AMD, Intel

• First programmable GPU: GeForce 256 (1999)

- Around 1999-2001, early GPGPU results
- **FPGAs**



https://www.cs.unc.edu/xcms/wpfiles/50th-symp/Harris.pdf

Alan Sussman & Abhinav Bhatele (CMSC416)



## Used for mainstream HPC 12111C

### • 2013: NAMD used for molecular dynamics simulations on a supercomputer with 3000 **NVIDIA Tesla GPUs**





THE INTERNATIONAL WEEKLY JOURNAL OF SCIENCE

Atomic structure of the AIDS pathogen's protein coat PAGE 643

COSMOLOGY THE FIRST LIGHT In pursuit of the most distant galaxies PAGE 554

CITATION CROSSING THE BORDERS International collaborations make the most impact PAGE 557

#### ANTICANCER DRUG A SITTING TARGET An indirect hit on 'undruggable' KRAS protein PAGES 577 & 638



5

## **GPGPU Hardware**

- Higher instruction throughput
- Hide memory access latencies with computation







## **Comparing GPUs to CPUs**

#### • Intel i9 11900K

- 8 cores
- 3.3 GHz



- 64 cores
- 2.45 GHz



Alan Sussman & Abhinav Bhatele (CMSC416)

### • NVIDIA GeForce RTX 3090

- 10,496 cores
- I.4 GHz

- NVIDIA A100
  - 17,712 cores
  - 0.76 GHz



## Volta GV100 SM

- CUDA Core
  - Single serial execution unit
- Each Volta Streaming Multiprocessor (SM) has:
  - 64 FP32 cores
  - 64 INT32 cores
  - 32 FP64 cores
  - 8 Tensor cores
- CUDA capable device or GPU
  - Collection of SMs

https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf



#### SM

|                  |           |           |           |           |                                 | L1 Instruc                                                                              | tion Cache       |           |           |           |           |           |           |       |
|------------------|-----------|-----------|-----------|-----------|---------------------------------|-----------------------------------------------------------------------------------------|------------------|-----------|-----------|-----------|-----------|-----------|-----------|-------|
|                  |           | L0 li     | nstruc    | tion C    | ache                            | L0 Instruction Cache<br>Warp Scheduler (32 thread/clk)<br>Dispatch Unit (32 thread/clk) |                  |           |           |           |           |           |           |       |
|                  | Wai       | rp Sch    | nedule    | r (32 t   | hread/clk)                      |                                                                                         |                  |           |           |           |           |           |           |       |
|                  | Di        | spatc     | h Unit    | (32 th    | read/clk)                       |                                                                                         |                  |           |           |           |           |           |           |       |
|                  | Reg       | jister    | File ('   | 16,384    | Register File (16,384 x 32-bit) |                                                                                         |                  |           |           |           |           |           |           |       |
| FP64             | INT       | INT       | FP32      | FP32      |                                 |                                                                                         | FP64             | INT       | INT       | FP32      | FP32      | H         |           |       |
| FP64             | INT       | INT       | FP32      | FP32      |                                 |                                                                                         | FP64             | INT       | INT       | FP32      | FP32      |           |           |       |
| FP64             | INT       | INT       | FP32      | FP32      |                                 |                                                                                         | FP64             | INT       | INT       | FP32      | FP32      |           |           |       |
| FP64             | INT       | INT       | FP32      | FP32      | TENSO                           |                                                                                         | FP64             | INT       | INT       | FP32      | FP32      | TE        | NSOR      | TENSO |
| FP64             | INT       | INT       | FP32      | FP32      | CORE                            | CORE                                                                                    | FP64             | INT       | INT       | FP32      | FP32      | C         | ORE       | CORI  |
| FP64             | INT       | INT       | FP32      | FP32      |                                 |                                                                                         | FP64             | INT       | INT       | FP32      | FP32      |           |           |       |
| FP64             | INT       | INT       | FP32      | FP32      |                                 |                                                                                         | FP64             | INT       | INT       | FP32      | FP32      |           |           |       |
| FP64             | INT       | INT       | FP32      | FP32      |                                 |                                                                                         | FP64             | INT       | INT       | FP32      | FP32      | $\square$ |           | ++    |
| LD/ LD/<br>ST ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/ LD/<br>ST ST                |                                                                                         | LD/ LD/<br>ST ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | LD/<br>ST | SFU   |
|                  |           | L0 li     | nstruc    | tion C    | ache                            |                                                                                         |                  |           | LOI       | nstruc    | tion C    | ache      |           |       |
|                  | War       |           |           |           | hread/clk)                      |                                                                                         |                  | Wa        | rp Scl    | nedule    | r (32 t   | hread     | d/clk)    |       |
|                  | Di        | spatc     | h Unit    | (32 th    | read/clk)                       |                                                                                         |                  | Di        | spatc     | h Unit    | (32 th    | read/     | /clk)     |       |
|                  | Reg       | jister    | File ('   | 16,384    | 4 x 32-bit                      | )                                                                                       |                  | Reg       | gister    | File (    | 16,384    | 4 x 3     | 2-bit)    |       |
| FP64             | INT       | INT       | FP32      | FP32      |                                 |                                                                                         | FP64             | INT       | INT       | FP32      | FP32      | H         |           |       |
| FP64             | INT       | INT       | FP32      | FP32      |                                 |                                                                                         | FP64             | INT       | INT       | FP32      | FP32      | H         |           |       |
| FP64             | INT       | INT       | FP32      | FP32      |                                 |                                                                                         | FP64             | INT       | INT       | FP32      | FP32      | $\square$ |           |       |
| FP64             | INT       | INT       | FP32      | FP32      | TENSO                           |                                                                                         | FP64             | INT       | INT       | FP32      | FP32      | TE        | NSOR      | TENSO |
| FP64             | INT       | INT       | FP32      | FP32      | CORE                            | CORE                                                                                    | FP64             | INT       | INT       | FP32      | FP32      | C         | ORE       | CORI  |
| FP64             | INT       | INT       | EP32      | FP32      |                                 |                                                                                         | EP64             | INT       | INT       | FP32      | FP32      |           |           |       |

INT INT

LD/ LD/ LD/ LD/ LD/ ST ST ST ST ST

Tex

FP32 FP32

LD/ LD/ ST ST

Tex

FP64

FP64

INT INT FP32 FP3

LD/ ST

Tex

SFU

128KB L1 Data Cache / Shared Memory

INT INT

Tex

FP64

FP64

SFU



### Vo





DEPARTMENT OF COMPUTER SCIENCE

Alan Sussman & Abhinav Bhatele (CMSC416)





### **GPU-based nodes**

#### Figure on the right shows a single node of Summit @ ORNL



Alan Sussman & Abhinav Bhatele (CMSC416)



HBM & DRAM speeds are aggregate (Read+Write). All other speeds (X-Bus, NVLink, PCIe, IB) are bi-directional.



### **CUDA: A programming model for NVIDIA** GPUs

- Allows developers to use C++ as a high-level programming language
- Built around threads, blocks and grids
- Terminology:
  - Host: CPU
  - Device: GPU
  - CUDA kernel: a function that gets executed on the GPU



float x = input[threadID]; float y = func(x);output[threadID] = y;

CUDA Kernel

CUDA threads

Alan Sussman & Abhinav Bhatele (CMSC416)

## **CUDA** software abstraction

#### Thread

Serial unit of execution

- Block
  - Collection of threads
  - Number of threads in block <= 1024</p>

### Grid

Collection of blocks











## Software to hardware mapping





https://developer.nvidia.com/blog/cuda-refresher-cuda-programming-model/





## Three steps to writing a CUDA kernel

- Copy input data from host to device memory
- Load the GPU program (kernel) and execute
- Copy the results back to host memory





## **Copying data to the GPU**

double \*d Matrix, \*h Matrix; h Matrix = new double[N];

cudaMalloc(&d Matrix, sizeof(double)\*N);

// ... initialize h Matrix ... cudaMemcpy(d Matrix, h Matrix, sizeof(double)\*N, cudaMemcpyHostToDevice);

// ... some computation on GPU ...

cudaMemcpy(h Matrix, d Matrix, sizeof(double)\*N, cudaMemcpyDeviceToHost);

cudaFree(d Matrix);



**cudaMemcpyHostToDevice cudaMemcpyDeviceToHost cudaMemcpyDeviceToDevice cudaMemcpyHostToHost cudaMemcpyDefault** 



## **CUDA** syntax

global void saxpy(float \*x, float \*y, float alpha) { int i = threadIdx.x; y[i] = alpha\*x[i] + y[i];

int main() {

• • •

• • •

saxpy<<<1, N>>>(x, y, alpha);



Alan Sussman & Abhinav Bhatele (CMSC416)

What happens when: N > 1024 or N >#device threads

#### <<<#blocks, threads per block>>>



## **Multiple blocks**

global void saxpy(float \*x, float \*y, float alpha, int N) { int i = blockDim.x \* blockIdx.x + threadIdx.x; y[i] = alpha\*x[i] + y[i];

#### int main() {

• • •

• • •

int threadsPerBlock = 512; int numBlocks = N/threadsPerBlock + (N % threadsPerBlock != 0);

saxpy<<<numBlocks, threadsPerBlock>>>(x, y, alpha, N);









# UNIVERSITY OF MARYLAND

### Questions?

