# Swordfish:

**A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-in-Memory with Non-Ideal Memristors**

**Taha Michael Shahroodi**, Gagandeep Singh, Mahdi Zahedi, Haiyu Mao, Joel Lindegger, Can Firtina, Stephan Wong, Onur Mutlu, Said Hamdioui











### **Executive Summary**

Context: Basecalling is the first step and a major throughput bottleneck

Basecallers use deep neural networks (DNNs)

DNN-based basecalling **accuracy** and **throughput** directly affect the accuracy and throughput of the subsequent analysis steps

Prior research uses memristor-based Computation-in-Memory (CIM) to accelerate DNNs

Non-idealities in memristor-based CIM are known to hinder accuracy, yet prior works either

- 1. overlook existing non-idealities,
- 2. overestimate achievable accuracy by studying non-idealities in isolation or using imprecise models/methodology,
- 3. overlook the effects of non-ideality mitigation techniques on the achievable throughput

Goal: Enable an accurate and realistic evaluation of accuracy and throughput for DNN-based basecalling on memristor-based CIM

**Key Contribution: Swordfish;** the **first framework** for memristor-based CIM that uses **characterized memories** and **accurate models** to

- 1) accurately and realistically evaluate the effects of non-idealities on basecalling accuracy and throughput
- 2) comprehensively investigate the impact of accuracy enhancement techniques on basecalling accuracy and throughput

Key Results: Across four real datasets of varying sizes, Swordfish realistically provides

- 25.7× better average throughput compared to state-of-the-art basecalling on GPU
- 12% mitigation in basecalling accuracy loss after hardware/software co-designed enhancement techniques
- Three new insights on future research directions for accuracy enhancement techniques

#### **Outline**

**Background & Motivation** 

Swordfish: Design & Implementation

**Evaluation & Key Results** 

**Takeaways & Summary** 

### **Background & Motivation**

# Nanopore Genome Sequencing and Analysis Pipeline

Genome Sequencing: Determining DNA sequence order for

- 1. Personalized medicine,
- 2. Outbreak tracing,
- 3. Understanding evolution

Nanopore Sequencing: A widely used sequencing technology



Basecalling consumes **up to 84.2%** of the execution time **[Bowden+ 2019]** 

Basecalling is

- 1. Accuracy-critical
- 2. Performance Bottleneck

Basecallers are just large DNNs

#### **DNN Hardware Acceleration**

**DNN** execution is dominated by:

- Vector-Matrix Multiplication (VMM)
- Data movement between memory and accelerator (e.g., GPU or TPU)



Memristor-based crossbars support VMM



Computation in Memory (CIM) minimizes data movement

# Memristor-based CIM for DNN Acceleration

Memristor-based crossbars support VMM

Computation in Memory (CIM) minimizes data movement

[Ankit+, ASPLOS 2019], [Chi+, ISCA 2016], [Lou+, PACT2020], [Shafiee+, ISCA 2016]
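As a minimal sketch of why a crossbar supports VMM: if inputs are applied as row voltages and weights are stored as cell conductances, each column current is a dot product (Ohm's law per cell, Kirchhoff's current law per column). The differential-pair mapping for signed weights, the `G_MAX` value, and all function names below are illustrative assumptions, not details taken from the slides.

```python
G_MAX = 1e-4  # assumed maximum device conductance in siemens (illustrative)

def to_differential(weights, w_max):
    """Split each signed weight into a (G+, G-) conductance pair."""
    g_pos = [[max(w, 0.0) / w_max * G_MAX for w in row] for row in weights]
    g_neg = [[max(-w, 0.0) / w_max * G_MAX for w in row] for row in weights]
    return g_pos, g_neg

def column_currents(voltages, conductances):
    """Kirchhoff's current law per column: I_j = sum_i V_i * G_ij."""
    n_cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(n_cols)]

def crossbar_vmm(inputs, weights):
    """Ideal (non-ideality-free) VMM computed through conductances."""
    w_max = max(abs(w) for row in weights for w in row)
    g_pos, g_neg = to_differential(weights, w_max)
    i_pos = column_currents(inputs, g_pos)
    i_neg = column_currents(inputs, g_neg)
    # Convert the differential currents back to the weight scale.
    return [(p - n) * w_max / G_MAX for p, n in zip(i_pos, i_neg)]

def reference_vmm(inputs, weights):
    """Plain digital VMM used as the reference: O_j = sum_i x_i * W_ij."""
    n_cols = len(weights[0])
    return [sum(x * row[j] for x, row in zip(inputs, weights))
            for j in range(n_cols)]

inputs = [0.5, -1.0, 0.25, 2.0]
weights = [[0.2, -0.4, 0.1, 0.0],
           [0.7, 0.3, -0.9, 0.5],
           [-0.1, 0.6, 0.8, -0.2],
           [0.05, -0.3, 0.4, 1.0]]
ideal = crossbar_vmm(inputs, weights)
exact = reference_vmm(inputs, weights)
print(ideal)
print(exact)
```

With ideal devices the two results agree up to floating-point rounding; the interesting cases arise once the devices stop being ideal.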

### **Memristor-based Crossbars**



Taha Michael Shahroodi, Delft University of Technology



#### VMM in Accelerators vs. VMM in Memristor-based Crossbars

[Figure: the same VMM, with weight rows $W_{11}\ W_{12}\ W_{13}\ W_{14}$, $W_{21}\ W_{22}\ W_{23}\ W_{24}$, and so on, and outputs $(O_1\ O_2\ O_3\ O_4)$, computed **in accelerators** (accurate) and **in memory** on a memristor-based crossbar.]



Non-idealities are everywhere
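To make the effect of non-idealities on a VMM concrete, here is a small sketch that perturbs the stored weights and compares the result with the exact VMM. The two modeled effects (a finite number of conductance levels and multiplicative Gaussian device variation), their magnitudes, and the function names are assumptions chosen for illustration; they are not the characterized models Swordfish uses.

```python
import random

def exact_vmm(x, w):
    """Reference VMM: O_j = sum_i x_i * W_ij."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def quantize(value, levels, w_max):
    """Snap a weight to one of `levels` evenly spaced values in [-w_max, w_max]."""
    step = 2.0 * w_max / (levels - 1)
    return round((value + w_max) / step) * step - w_max

def nonideal_vmm(x, w, levels=16, sigma=0.05, seed=0):
    """VMM with assumed non-idealities: finite conductance levels and
    multiplicative Gaussian device variation (both illustrative)."""
    rng = random.Random(seed)
    w_max = max(abs(v) for row in w for v in row)
    stored = [[quantize(v, levels, w_max) * (1.0 + rng.gauss(0.0, sigma))
               for v in row] for row in w]
    return exact_vmm(x, stored)

def max_abs_error(a, b):
    """Largest per-output deviation between two result vectors."""
    return max(abs(p - q) for p, q in zip(a, b))

x = [0.5, -1.0, 0.25, 2.0]
w = [[0.2, -0.4, 0.1, 0.0],
     [0.7, 0.3, -0.9, 0.5],
     [-0.1, 0.6, 0.8, -0.2],
     [0.05, -0.3, 0.4, 1.0]]
exact = exact_vmm(x, w)
err_realistic = max_abs_error(nonideal_vmm(x, w), exact)
# Many levels and no variation should approach the exact result.
err_near_ideal = max_abs_error(nonideal_vmm(x, w, levels=2**16 + 1, sigma=0.0), exact)
print("error with non-idealities:", err_realistic)
print("error with near-ideal devices:", err_near_ideal)
```

Even in this toy setting the per-output error depends on both effects at once, which is one reason studying non-idealities in isolation can overestimate achievable accuracy.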










### **Our Goal**

To **realistically evaluate** end-to-end basecalling **accuracy** and **throughput** for memristor-based CIM

### **Key Idea**

To account for the **non-idealities** in **device**, **circuit** and **architecture** of memristor-based CIM and the **overhead** of non-idealities **mitigation techniques** 

### **Swordfish: Design & Implementation**

#### **Swordfish vs Other Frameworks**

Ideal Memristor-based CIM Frameworks for DNNs



### **Swordfish Framework - Overview**

Realistic Memristor-based CIM Frameworks for DNNs



#### **VMM Model Generator**

**Goal:** Capture the real output of a VMM in the presence of non-idealities. Swordfish supports two approaches.











### **Accuracy Enhancement**

**Goal:** Enhance the accuracy of a VMM by adapting the input currents and the resistance of memristors based on non-idealities.

Swordfish supports four techniques:

1. Analytical Variation Aware Training (VAT)

2. Knowledge Distillation-based (KD) VAT

3. Read-Verify-Write (R-V-W) Training

4. Random Sparse Adaptation (RSA) Training
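As one illustration of the list above, the read-verify-write idea can be sketched as a generic program-and-verify loop: write a cell, read it back, and repeat until the value is within a tolerance. This is a hedged sketch of the general pattern only; the write-noise model, tolerance, and function name are assumptions, and the exact R-V-W procedure used in Swordfish is described in the paper.

```python
import random

def read_verify_write(target, write_noise=0.1, tolerance=0.01,
                      max_iters=50, seed=0):
    """Program one cell toward `target`; return (final_value, writes_used).

    Assumed device model: every write pulse covers the remaining distance
    to the target, but with a relative Gaussian error of `write_noise`.
    """
    rng = random.Random(seed)
    value = 0.0
    for iteration in range(1, max_iters + 1):
        # Write: aim at the target, land imprecisely.
        delta = target - value
        value += delta * (1.0 + rng.gauss(0.0, write_noise))
        # Read and verify: stop once the cell is close enough.
        if abs(value - target) <= tolerance:
            return value, iteration
    return value, max_iters

final_value, writes_used = read_verify_write(target=0.8)
print(f"programmed value: {final_value:.4f} after {writes_used} write(s)")
```

The number of writes per cell is the overhead of this technique, which is why mitigation has to be evaluated for its throughput cost as well as its accuracy benefit.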

### **Example of Accuracy Enhancement**

4. Random Sparse Adaptation (RSA) Training

Read more about the other techniques in the paper.

# Accuracy Enhancement via Random Sparse Adaptation

**Key idea:** Map the weights that would otherwise be stored on error-prone memristor devices to reliable SRAM cells.

#### RSA in 3 Steps:

- 1. Initial Training (one-time, on GPU) and distribution of weights
- 2. VMM operation using both memories
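The weight distribution and the two-memory VMM can be sketched as follows. This is a simplified illustration under stated assumptions: the device model (`read_cell`, where reliable cells deviate by at most 1% and error-prone cells lose half their value), the set of error-prone positions, and all names are hypothetical, and the training step is omitted entirely.

```python
import random

def read_cell(w, is_error_prone, rng):
    """Assumed device model: reliable cells deviate by at most 1%,
    error-prone cells read back at half of the programmed value."""
    if is_error_prone:
        return 0.5 * w
    return w * (1.0 + rng.uniform(-0.01, 0.01))

def memristor_vmm(x, weights, error_prone, rng):
    """VMM with every weight read from the (partly error-prone) crossbar."""
    out = [0.0] * len(weights[0])
    for i, row in enumerate(weights):
        for j, w in enumerate(row):
            out[j] += x[i] * read_cell(w, (i, j) in error_prone, rng)
    return out

def rsa_vmm(x, weights, error_prone, rng):
    """VMM using both memories: crossbar partial sums plus SRAM partial sums."""
    # Distribution: error-prone positions hold 0 in the crossbar,
    # and their weights are kept in SRAM instead.
    crossbar = [[0.0 if (i, j) in error_prone else w
                 for j, w in enumerate(row)] for i, row in enumerate(weights)]
    sram = {(i, j): weights[i][j] for (i, j) in error_prone}
    out = memristor_vmm(x, crossbar, error_prone, rng)
    for (i, j), w in sram.items():
        out[j] += x[i] * w  # exact digital multiply-accumulate
    return out

def max_abs_error(a, b):
    return max(abs(p - q) for p, q in zip(a, b))

x = [0.5, -1.0, 0.25, 2.0]
weights = [[0.2, -0.4, 0.1, 0.0],
           [0.7, 0.3, -0.9, 0.5],
           [-0.1, 0.6, 0.8, -0.2],
           [0.05, -0.3, 0.4, 1.0]]
error_prone = {(1, 2), (3, 3)}  # hypothetical positions of bad devices
exact = [sum(xi * row[j] for xi, row in zip(x, weights)) for j in range(4)]

baseline_err = max_abs_error(
    memristor_vmm(x, weights, error_prone, random.Random(0)), exact)
hybrid_err = max_abs_error(
    rsa_vmm(x, weights, error_prone, random.Random(0)), exact)
print("all-memristor error:", baseline_err)
print("memristor + SRAM error:", hybrid_err)
```

The SRAM path restores accuracy for the remapped weights, but it adds a second memory access and accumulation per VMM, so the fraction of weights moved to SRAM trades accuracy against area and throughput.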



### **More in the Paper**

- Details of capturing non-idealities at VMM level
- Implementation details of Swordfish components:
  - Partition & Map
  - Accuracy Enhancer
  - VMM Model Generator
  - System Evaluator
- Elaborations on accuracy enhancement techniques
  - Analytical Variation Aware Training (VAT)
  - Knowledge Distillation-based (KD) using VAT
  - Read-Verify-Write (R-V-W) Training

### **Evaluation & Key Results**

# **Evaluation Methodology: Experimental Setup**

- We evaluate
  - Basecaller: Bonito [Oxford Nanopore 2023]
  - CIM Architecture: PUMA [Ankit+, ASPLOS 2019]

- Infrastructure
  - 2x AMD EPYC 7742 CPU with 500 GB DDR4 DRAM
  - 8x NVIDIA V100

- Datasets and Workloads [Wick+ 2019, Zook+ 2019, CADDE 2020]
  - 4 real datasets of reads and reference genomes with various genome sizes
    (D1, D2, D3, and D4)

# Evaluated Non-idealities & Enhancement techniques

Non-idealities

Accuracy Enhancement Techniques



# Accuracy: All Non-idealities without Mitigation

Combined non-idealities lead to significant accuracy loss (>18%)

# Accuracy: Enhancement Techniques on All Non-idealities



**Accuracy enhancement** techniques **mitigate** non-idealities, but to different degrees.

Considerable accuracy loss (>6%) remains even with all enhancement techniques.




**Ideal** CIM implementation improves the basecalling throughput over Bonito-GPU by **413.6**× **on average** 



This throughput improvement comes at a high, unacceptable accuracy loss of 18%

[Figure: basecalling throughput on D1, D2, D3, D4, and on average.]



Realistic CIM designs significantly underperform ideal design



Some **realistic CIM designs degrade** throughput compared to Bonito-GPU




Realistic CIM design using RSA+KD provides on average 25.7× higher throughput compared to Bonito-GPU
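A quick back-of-the-envelope check puts the two reported averages side by side. Both inputs come from the slides; dividing one average speedup by the other is only a rough indicator, since it is not a per-dataset comparison.

```python
ideal_speedup = 413.6      # ideal CIM vs. Bonito-GPU, average (from the slides)
realistic_speedup = 25.7   # realistic CIM with RSA+KD vs. Bonito-GPU, average

gap = ideal_speedup / realistic_speedup
retained = realistic_speedup / ideal_speedup

print(f"ideal / realistic speedup: {gap:.1f}x")
print(f"share of the ideal speedup retained: {retained:.1%}")
```

Under these numbers, the realistic design keeps roughly 6% of the ideal speedup, a gap of about 16×, while still being well ahead of the GPU baseline.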

### **More in the Paper**

- Details on evaluation methodology
  - Datasets
  - Array and devices

- Evaluation results
  - Individual non-idealities and architectural limitations on accuracy
  - Accuracy enhancements on individual and combined non-idealities and architectural limitations
  - Accuracy vs. area analysis
  - Observations and trends from the presented figures
  - Results for 256x256 crossbars and comparison with 64x64 crossbars

## **Takeaways**

The target application for memristor-based CIM matters

**Swordfish** enables **realistic** evaluation of accuracy and performance for DNN-based applications on memristor-based CIM

Non-idealities are detrimental to both accuracy and performance

**HW/SW co-designed** techniques mitigate inaccuracy the most

### Summary

**Key Contribution: Swordfish;** the **first framework** for memristor-based CIM that uses **characterized memories** and **accurate models** to

- 1) accurately and realistically evaluate the effects of non-idealities on basecalling accuracy and throughput
- 2) comprehensively investigate the impact of accuracy enhancement techniques on basecalling accuracy and throughput

**Key Results:** Across four real datasets of varying sizes, Swordfish realistically provides

- 25.7× better average throughput compared to state-of-the-art basecalling on GPU
- 12% mitigation in basecalling accuracy loss after hardware/software co-designed enhancement techniques
- Three new insights on future research directions for accuracy enhancement

#### Many opportunities for

- Realistically evaluating the accuracy and throughput of other DNNs on memristor-based CIM
- Developing and evaluating novel accuracy enhancement techniques in software, hardware, or both

We should remain cautious when applying known acceleration techniques to emerging technologies, architectures, and applications



# Questions?









