

#### Hardware-Software Co-Design

Abhinav Bhatele, Daniel Nichols



#### Announcements

- For those who haven't presented, submit videos by May 1
- Extra credit due May 7
- Exam grades out; submit regrade requests by Friday 4/25



#### What is HW/SW Co-Design?

- So far we have been changing our algorithms to optimally match hardware
- But what if we changed both?

What are some HW inefficiencies we've seen often in this class?



#### Types of HW/SW Co-Design

- Standalone accelerators for specific domains
  - great efficiency but not very general
- Extend existing hardware with task specific components
  - great efficiency but can hurt the performance of the original hardware
- Improve existing hardware



#### Examples



#### TENSOR CORE 4X4X4 MATRIX-MULTIPLY ACC
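A tensor core computes a small matrix multiply-accumulate, D = A x B + C, as one hardware operation. The sketch below spells out that arithmetic for 4x4 tiles in plain Python so the 4 x 4 x 4 = 64 multiply-adds are visible. The function name `mma_4x4x4` is hypothetical, and the sketch ignores the mixed-precision number formats real hardware uses.

```python
def mma_4x4x4(a, b, c):
    """Return D = A @ B + C for 4x4 matrices given as lists of rows."""
    n = 4
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            acc = c[i][j]                  # start from the accumulator input
            for k in range(n):
                acc += a[i][k] * b[k][j]   # one multiply-add per (i, j, k)
            d[i][j] = acc
    return d


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
c = [[1.0] * 4 for _ in range(4)]
d = mma_4x4x4(a, identity, c)   # equals a with 1.0 added to every entry
```

In software this is three nested loops; the co-design point is that hardware does the whole tile in one instruction, so the algorithm is restructured around tiles of that shape.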










#### Goals of Co-Design

- Data movement and locality optimizations
- Specialized computation components
  - higher throughput, lower latency
- Reduced power consumption
- Software development ease
- Reduce costs



- Deep Learning Recommendation Models
- "Deep Learning Recommendation Model for Personalization and Recommendation Systems", M. Naumov et al.
- Online and offline training






#### Neo Overview

- A DLRM training system
  - Neo software with 4D parallelism for embedding operators
  - Optimized sequential embedding implementations
  - ZionEX: a hardware system designed to efficiently run Neo













Use all 4 for ideal parallelism!





(b) Row-wise Parallelism
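Row-wise parallelism splits the rows of one large embedding table across devices, so each lookup is routed to the device that owns the row. A minimal sketch, assuming contiguous equal-sized row blocks (the actual partitioning scheme used by Neo may differ); `shard_rows` and `lookup` are hypothetical names, and Python lists stand in for per-device memory.

```python
def shard_rows(table, num_devices):
    """Split table rows into contiguous blocks, one block per device."""
    rows_per_dev = (len(table) + num_devices - 1) // num_devices  # ceiling division
    shards = [table[d * rows_per_dev:(d + 1) * rows_per_dev]
              for d in range(num_devices)]
    return shards, rows_per_dev


def lookup(shards, rows_per_dev, row):
    """Return (owning device, embedding vector) for a global row index."""
    dev, local = divmod(row, rows_per_dev)
    return dev, shards[dev][local]


table = [[float(r), float(r) * 10.0] for r in range(10)]  # 10 rows, dim 2
shards, rows_per_dev = shard_rows(table, num_devices=4)
dev, vec = lookup(shards, rows_per_dev, row=7)
```

With 10 rows on 4 devices each block holds 3 rows (the last holds 1), so row 7 lives on device 2.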





Use performance models and pipelining to find optimal configurations

#### Optimizing the Embedding Operators

- Two key inefficiencies
  - Each lookup is a separate GPU kernel launch, which incurs high overhead
  - Large tables need multi-GPU implementations
- Operator optimizations
  - Fuse embedding lookups into a single kernel; sort embedding gradients
  - Multi-GPU implementations
  - Co-design with ZionEX to save memory



#### Fused Embedding Operators
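The fusion idea can be sketched as follows: instead of one launch per table, all tables are stored in one flat array with per-table offsets, and a single pass serves every lookup. This is a plain-Python sketch under assumptions, not the Neo implementation; `fuse_tables` and `fused_lookup` are hypothetical names, sum pooling is assumed, and the single loop stands in for a single GPU kernel.

```python
def fuse_tables(tables):
    """Concatenate tables into one flat list and record where each one starts."""
    fused, offsets = [], []
    for t in tables:
        offsets.append(len(fused))
        fused.extend(t)
    return fused, offsets


def fused_lookup(fused, offsets, requests):
    """requests: list of (table_id, row indices). Returns one sum-pooled vector each."""
    out = []
    for table_id, rows in requests:      # one pass covers all tables
        base = offsets[table_id]
        pooled = [0.0] * len(fused[base])
        for r in rows:
            vec = fused[base + r]
            for k in range(len(pooled)):
                pooled[k] += vec[k]
        out.append(pooled)
    return out


tables = [
    [[1.0, 2.0], [3.0, 4.0]],                    # table 0: 2 rows
    [[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]],  # table 1: 3 rows
]
fused, offsets = fuse_tables(tables)
result = fused_lookup(fused, offsets, [(0, [0, 1]), (1, [2])])
```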





#### Sort Gradients of Embedding Operators
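One way to read this optimization: when several samples in a batch hit the same embedding row, sorting the gradients by row index makes duplicates adjacent, so they can be summed once and applied with a single update per row. A minimal sketch assuming plain SGD; `sort_and_merge` and `apply_sgd` are hypothetical names.

```python
def sort_and_merge(indices, grads):
    """Sort gradients by row index and sum the ones that target the same row."""
    order = sorted(range(len(indices)), key=lambda i: indices[i])
    merged = []  # list of (row, summed gradient), in increasing row order
    for i in order:
        row, g = indices[i], grads[i]
        if merged and merged[-1][0] == row:
            prev = merged[-1][1]
            merged[-1] = (row, [p + x for p, x in zip(prev, g)])
        else:
            merged.append((row, list(g)))
    return merged


def apply_sgd(table, merged, lr):
    """Apply exactly one update per touched row."""
    for row, g in merged:
        table[row] = [w - lr * x for w, x in zip(table[row], g)]


indices = [2, 0, 2]                 # row 2 appears twice in the batch
grads = [[1.0], [2.0], [3.0]]
merged = sort_and_merge(indices, grads)
table = [[0.0], [0.0], [0.0]]
apply_sgd(table, merged, lr=0.5)
```

Because each row is written once, parallel workers would not need to contend on the same row, and the result does not depend on the order in which duplicates arrive.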



#### Memory Optimizations

- Make use of device memory, host memory, and disk
- Access behavior is irregular
  - Default managed memory (e.g., CUDA unified memory) performs very poorly
  - Implement custom cache in software
- Embedding compression
  - low precision (high precision cache and low precision embeddings)
  - sparse optimizers
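A software-managed cache of the kind mentioned above can be sketched as a small fast store in front of a large slow one. The sketch assumes an LRU eviction policy, which is an illustrative choice and not necessarily what Neo uses; `EmbeddingCache` is a hypothetical name, an `OrderedDict` stands in for device memory, and a dict stands in for host memory or disk.

```python
from collections import OrderedDict


class EmbeddingCache:
    """LRU cache of embedding rows in front of a slower backing store."""

    def __init__(self, backing, capacity):
        self.backing = backing       # stands in for host memory or disk
        self.capacity = capacity     # number of rows that fit in fast memory
        self.rows = OrderedDict()    # stands in for device memory
        self.hits = 0
        self.misses = 0

    def get(self, row):
        if row in self.rows:
            self.hits += 1
            self.rows.move_to_end(row)        # mark as most recently used
            return self.rows[row]
        self.misses += 1
        vec = self.backing[row]               # slow fetch
        self.rows[row] = vec
        if len(self.rows) > self.capacity:
            self.rows.popitem(last=False)     # evict least recently used
        return vec


backing = {r: [float(r)] for r in range(100)}
cache = EmbeddingCache(backing, capacity=2)
for r in [1, 2, 1, 3, 2]:
    cache.get(r)
```

Tracking hits and misses explicitly is what lets the policy be tuned to irregular access patterns, which generic managed memory cannot do.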






#### ZionEX: Co-Designing with Neo

- ZionEX addresses these shortcomings
- Designed to support 4D parallelism and data requirements of DLRM
- Custom data ingestion servers to overcome latencies





#### Training Performance



#### Other Training Optimizations



#### Shared Memory Nodes

MI300A (AMD) and GH200 (NVIDIA) are combined CPU-GPU nodes







(b) APU.





