## **DNN Accelerator Architectures**

#### **CICS/MTL Tutorial (2017)**

Website: http://eyeriss.mit.edu/tutorial.html



Joel Emer, Vivienne Sze, Yu-Hsin Chen

## **Highly-Parallel Compute Paradigms**

#### Temporal Architecture (SIMD/SIMT)



#### Spatial Architecture (Dataflow Processing)











\* multiply-and-accumulate

#### Worst Case: all memory R/W are DRAM accesses

- Example: AlexNet [NIPS 2012] has **724M** MACs → **2896M** DRAM accesses required
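The worst-case figure can be reproduced with a short calculation. This is a minimal sketch, assuming each MAC needs three memory reads (filter weight, fmap activation, partial sum) and one memory write (updated partial sum); that 3 + 1 split is an assumption inferred from the 724M → 2896M ratio, and the names below are illustrative.

```python
# Worst case: every operand of every MAC is fetched from or written to DRAM.
MACS_ALEXNET = 724_000_000  # MAC count quoted on the slide
READS_PER_MAC = 3           # assumption: weight, activation, partial sum
WRITES_PER_MAC = 1          # assumption: updated partial sum


def worst_case_dram_accesses(num_macs: int) -> int:
    """Return the DRAM access count when nothing is reused on chip."""
    return num_macs * (READS_PER_MAC + WRITES_PER_MAC)


accesses = worst_case_dram_accesses(MACS_ALEXNET)
print(f"{accesses / 1e6:.0f}M DRAM accesses")  # 2896M DRAM accesses
```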









<u>Opportunities</u>: **1** data reuse



## **Types of Data Reuse in DNN**

#### **Convolutional Reuse**

CONV layers only (sliding window)

Reuse: Activations, Filter weights

#### **Fmap Reuse**

CONV and FC layers

Reuse: Activations

#### **Filter Reuse**

CONV and FC layers (batch size > 1)

Reuse: Filter weights





#### Opportunities: **1** data reuse, **2** local accumulation

- **1** Can reduce DRAM reads of filter/fmap by up to 500×\*\*
- **2** Partial sum accumulation does **NOT** have to access DRAM
- Example: DRAM access in AlexNet can be reduced from **2896M** to **61M** (best case)

\*\* AlexNet CONV layers
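As a quick check on the AlexNet example, the overall reduction factor follows directly from the two quoted numbers. This sketch only does that arithmetic and makes no claim about how the 61M best case is derived.

```python
# DRAM access counts quoted for AlexNet, in millions.
WORST_CASE_M = 2896  # no reuse, no local accumulation
BEST_CASE_M = 61     # best case with data reuse and local accumulation

reduction = WORST_CASE_M / BEST_CASE_M
print(f"{reduction:.1f}x fewer DRAM accesses")  # about 47.5x
```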

## **Spatial Architecture for DNN**



#### **Low-Cost Local Data Access**



\* measured from a commercial 65nm process

How to exploit **1** data reuse and **2** local accumulation with *limited* low-cost local storage?

A specialized **processing dataflow** is required!

# **Dataflow Taxonomy**

- Weight Stationary (WS)
- Output Stationary (OS)
- No Local Reuse (NLR)



## Weight Stationary (WS)



- Minimize weight read energy consumption
  - maximize convolutional and filter reuse of weights
- Broadcast activations and accumulate psums spatially across the PE array.
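One way to picture weight stationary is as a loop nest with the weight loop outermost. The sketch below is a sequential 1D illustration, not the hardware: in a real array the inner loop would be spread spatially across PEs, and the function name is hypothetical.

```python
def conv1d_weight_stationary(weights, activations):
    """1D convolution (stride 1) with the weight loop outermost."""
    num_outputs = len(activations) - len(weights) + 1
    psums = [0] * num_outputs
    for s, w in enumerate(weights):      # each weight is fetched once and held
        for x in range(num_outputs):     # activations stream past the held weight
            psums[x] += w * activations[x + s]  # psums are updated on every step
    return psums


print(conv1d_weight_stationary([1, 2, 3], [1, 0, 2, 1, 3]))  # [7, 7, 13]
```

Each weight is read once, while every partial sum is touched once per weight; that is the traffic this dataflow trades away to keep weights in place.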



## WS Example: nn-X (NeuFlow)





## **Output Stationary (OS)**



- Minimize partial sum R/W energy consumption
  - maximize local accumulation
- Broadcast/Multicast filter weights and reuse activations spatially across the PE array
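Output stationary can be pictured as the same 1D convolution with the output loop outermost, so each partial sum stays in a local accumulator until it is complete. This is a sequential sketch with a hypothetical function name; an actual array would assign different outputs to different PEs.

```python
def conv1d_output_stationary(weights, activations):
    """1D convolution (stride 1) with the output loop outermost."""
    num_outputs = len(activations) - len(weights) + 1
    outputs = []
    for x in range(num_outputs):
        psum = 0                          # held locally for the whole accumulation
        for s, w in enumerate(weights):   # weights stream past the held psum
            psum += w * activations[x + s]
        outputs.append(psum)              # written out once, when complete
    return outputs


print(conv1d_output_stationary([1, 2, 3], [1, 0, 2, 1, 3]))  # [7, 7, 13]
```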



## **OS Example: ShiDianNao**






## No Local Reuse (NLR)



- Use a large global buffer as shared storage
  - Reduce **DRAM** access energy consumption
- Multicast activations, single-cast weights, and accumulate psums spatially across the PE array



## **NLR Example: UCLA**





[Zhang et al., FPGA 2015]

#### **Taxonomy: More Examples**

- Weight Stationary (WS)

  [Chakradhar, *ISCA* 2010] [nn-X (NeuFlow), *CVPRW* 2014] [Park, *ISSCC* 2015] [ISAAC, *ISCA* 2016] [PRIME, *ISCA* 2016]

- Output Stationary (OS)

  [Peemen, *ICCD* 2013] [ShiDianNao, *ISCA* 2015] [Gupta, *ICML* 2015] [Moons, *VLSI* 2016]

- No Local Reuse (NLR)

  [DianNao, *ASPLOS* 2014] [DaDianNao, *MICRO* 2014] [Zhang, *FPGA* 2015]



## **Energy Efficiency Comparison**

- Same total area
- 256 PEs
- AlexNet CONV layers
- Batch size = 16



# Energy-Efficient Dataflow: Row Stationary (RS)

- Maximize reuse and accumulation at RF
- Optimize for **overall** energy efficiency instead of for *only* one data type



#### **Row Stationary: Energy-efficient Dataflow**

























- Maximize row convolutional reuse in RF
  - Keep a filter row and fmap sliding window in RF
- Maximize row psum accumulation in RF
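The per-PE primitive described above can be sketched as a 1D row convolution in which the filter row stays put and a sliding window of the fmap row moves through local storage. This is an illustrative model only: the `deque` stands in for the RF, stride 1 is assumed, and `pe_row_conv` is a hypothetical name.

```python
from collections import deque


def pe_row_conv(filter_row, fmap_row):
    """Model one PE: the filter row is held while an fmap window slides past it."""
    window = deque(maxlen=len(filter_row))  # fmap sliding window kept in the RF
    psum_row = []
    for activation in fmap_row:             # each activation is fetched once
        window.append(activation)           # oldest value drops out automatically
        if len(window) == len(filter_row):
            psum = 0                        # accumulated locally, not in DRAM
            for w, a in zip(filter_row, window):
                psum += w * a
            psum_row.append(psum)
    return psum_row


print(pe_row_conv([1, 2, 3], [1, 0, 2, 1, 3]))  # [7, 7, 13]
```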





#### **2D Convolution in PE Array**












#### **Convolutional Reuse Maximized**



Filter rows are reused across PEs horizontally






Fmap rows are reused across PEs diagonally



### Maximize 2D Accumulation in PE Array



Partial sums accumulate across PEs vertically
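The three reuse patterns (filter rows horizontal, fmap rows diagonal, psums vertical) can be sketched as a logical PE array in which PE (i, j) pairs filter row i with fmap row i + j and contributes to output row j. The indexing convention and function names are assumptions made for illustration; stride 1 and a single channel are assumed.

```python
def row_conv(filter_row, fmap_row):
    """1D row convolution, the primitive each PE computes."""
    n = len(fmap_row) - len(filter_row) + 1
    return [sum(w * fmap_row[x + s] for s, w in enumerate(filter_row))
            for x in range(n)]


def rs_conv2d(filt, fmap):
    """2D convolution built from 1D row convolutions on a logical PE array."""
    num_filter_rows = len(filt)
    num_out_rows = len(fmap) - num_filter_rows + 1
    out_width = len(fmap[0]) - len(filt[0]) + 1
    out = [[0] * out_width for _ in range(num_out_rows)]
    for j in range(num_out_rows):         # PE column j produces output row j
        for i in range(num_filter_rows):  # PE row i uses filter row i in every column
            # fmap row i + j is shared by all PEs on the same diagonal
            partial = row_conv(filt[i], fmap[i + j])
            # partial sums from the PEs in one column add up vertically
            out[j] = [acc + p for acc, p in zip(out[j], partial)]
    return out


fmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(rs_conv2d([[1, 0], [0, 1]], fmap))  # [[6, 8], [12, 14]]
```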



#### **Dimensions Beyond 2D Convolution**

1. Multiple Fmaps
2. Multiple Filters
3. Multiple Channels



### Filter Reuse in PE

**1** Multiple Fmaps

- Filter 1 (Channel 1, Row 1) \* Fmap 1 (Channel 1, Row 1) = Psum 1 (Row 1)
- Filter 1 (Channel 1, Row 1) \* Fmap 2 (Channel 1, Row 1) = Psum 2 (Row 1)

Both fmaps share the same filter row.

#### Processing in PE: concatenate fmap rows
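A rough sketch of this idea: the same filter row is loaded once and applied to the corresponding rows of several fmaps, processed back to back. How the hardware handles the boundary between concatenated rows is not modelled here; the sketch simply treats each row separately, which is an assumption, and the function names are hypothetical.

```python
def row_conv(filter_row, fmap_row):
    """1D row convolution (stride 1)."""
    n = len(fmap_row) - len(filter_row) + 1
    return [sum(w * fmap_row[x + s] for s, w in enumerate(filter_row))
            for x in range(n)]


def pe_filter_reuse(filter_row, fmap_rows):
    """Apply one held filter row to the same row of several fmaps."""
    # filter_row is fetched once; only the fmap rows change between passes
    return [row_conv(filter_row, fmap_row) for fmap_row in fmap_rows]


# Row 1 of fmap 1 and row 1 of fmap 2, same filter row
print(pe_filter_reuse([1, 2], [[1, 2, 3], [4, 5, 6]]))  # [[5, 8], [14, 17]]
```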





### **Fmap Reuse in PE**

**2** Multiple Filters

#### Processing in PE: interleave filter rows
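A minimal sketch of interleaving, assuming stride 1 and filter rows of equal length: each fmap window is fetched once and then used by every filter row in turn, producing one psum row per filter. The function name is hypothetical.

```python
def pe_fmap_reuse(filter_rows, fmap_row):
    """Apply several filter rows to one fmap row, interleaved per window."""
    width = len(filter_rows[0])
    n = len(fmap_row) - width + 1
    psum_rows = [[0] * n for _ in filter_rows]
    for x in range(n):
        window = fmap_row[x:x + width]  # fetched once, shared by all filters
        for f, filter_row in enumerate(filter_rows):  # filters alternate
            psum_rows[f][x] = sum(w * a for w, a in zip(filter_row, window))
    return psum_rows


# Row 1 of filter 1 and row 1 of filter 2, same fmap row
print(pe_fmap_reuse([[1, 2], [0, 1]], [1, 2, 3]))  # [[5, 8], [2, 3]]
```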





### **Channel Accumulation in PE**

**3** Multiple Channels

#### Processing in PE: interleave channels
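A minimal sketch of channel interleaving, assuming stride 1 and one filter row and one fmap row per channel: contributions from all channels are added into the same psum row, so the accumulation across channels stays inside the PE. The function name is hypothetical.

```python
def pe_channel_accumulation(filter_rows, fmap_rows):
    """Accumulate one psum row across channels (one row pair per channel)."""
    width = len(filter_rows[0])
    n = len(fmap_rows[0]) - width + 1
    psum_row = [0] * n
    for x in range(n):
        # channels alternate while the same psum entry keeps accumulating
        for filter_row, fmap_row in zip(filter_rows, fmap_rows):
            for s in range(width):
                psum_row[x] += filter_row[s] * fmap_row[x + s]
    return psum_row


# Two channels: channel 1 gives [5, 8], channel 2 gives [5, 6]
print(pe_channel_accumulation([[1, 2], [0, 1]], [[1, 2, 3], [4, 5, 6]]))  # [10, 14]
```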





### **DNN Processing – The Full Picture**



Map rows from multiple fmaps, filters, and channels to the same PE to exploit other forms of reuse and local accumulation

# **Optimal Mapping in Row Stationary**



[Chen et al., ISCA 2016]

# Dataflow Simulation Results



#### **Evaluate Reuse in Different Dataflows**

#### Weight Stationary

- Minimize movement of filter weights

#### Output Stationary

- Minimize movement of partial sums

#### No Local Reuse

- No PE local storage. Maximize global buffer size.

#### Row Stationary

#### **Evaluation Setup**

- same total area
- 256 PEs
- AlexNet
- batch size = 16



#### **Variants of Output Stationary**





#### **Dataflow Comparison: CONV Layers**






#### **Dataflow Comparison: FC Layers**



RS uses at least **1.3× lower** energy than other dataflows











# Hardware Architecture for RS Dataflow



### **Eyeriss DNN Accelerator**



## **Data Delivery with On-Chip Network**




How to accommodate different shapes with fixed PE array?



### **Logical to Physical Mappings**



**Physical PE Array** 






## **Data Delivery with On-Chip Network**



Compared to Broadcast, **Multicast** saves >80% of NoC energy



## **Chip Spec & Measurement Results**

| Specification                 | Value                                                                                                                                             |
|-------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Technology                    | TSMC 65nm LP 1P9M                                                                                                                                 |
| On-Chip Buffer                | 108 KB                                                                                                                                            |
| # of PEs                      | 168                                                                                                                                               |
| Scratch Pad / PE              | 0.5 KB                                                                                                                                            |
| Core Frequency                | 100 – 250 MHz                                                                                                                                     |
| Peak Performance              | 33.6 – 84.0 GOPS                                                                                                                                  |
| Word Bit-width                | 16-bit Fixed-Point                                                                                                                                |
| Natively Supported DNN Shapes | Filter Width: 1 – 32<br>Filter Height: 1 – 12<br>Num. Filters: 1 – 1024<br>Num. Channels: 1 – 1024<br>Horz. Stride: 1 – 12<br>Vert. Stride: 1, 2, 4 |
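The peak-performance range in the spec is consistent with the PE count and clock range. The sketch below assumes one MAC per PE per cycle and counts a MAC as two operations (multiply and add); both are assumptions, chosen because they reproduce the quoted figures.

```python
NUM_PES = 168
OPS_PER_MAC = 2  # assumption: one multiply + one add per MAC


def peak_gops(core_freq_mhz: float) -> float:
    """Peak throughput in GOPS, assuming one MAC per PE per cycle."""
    return NUM_PES * OPS_PER_MAC * core_freq_mhz / 1000


print(peak_gops(100), peak_gops(250))  # 33.6 84.0
```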



Supporting 2.66 GMACs [8 billion 16-bit inputs (**16GB**) and 2.7 billion outputs (**5.4GB**)] requires only **208.5MB** (buffer) and **15.4MB** (DRAM) of data access
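The bracketed data volumes follow from the 16-bit word size. The sketch below converts the quoted operand counts to bytes; the observation that 8 billion inputs is roughly three operands per MAC is an inference from the numbers, not something stated on the slide.

```python
BYTES_PER_WORD = 2  # 16-bit fixed-point operands
NUM_MACS = 2.66e9
NUM_INPUTS = 8e9    # quoted input operand count
NUM_OUTPUTS = 2.7e9  # quoted output count

input_gb = NUM_INPUTS * BYTES_PER_WORD / 1e9
output_gb = NUM_OUTPUTS * BYTES_PER_WORD / 1e9
inputs_per_mac = NUM_INPUTS / NUM_MACS  # roughly 3 operands read per MAC

print(f"inputs: {input_gb:.1f} GB, outputs: {output_gb:.1f} GB")
# inputs: 16.0 GB, outputs: 5.4 GB
```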



### **Summary of DNN Dataflows**

#### Weight Stationary

- Minimize movement of filter weights
- Popular with processing-in-memory architectures

#### Output Stationary

- Minimize movement of partial sums
- Different variants optimized for CONV or FC layers

#### No Local Reuse

- No PE local storage  $\rightarrow$  maximize global buffer size

#### Row Stationary

- Adapt to the NN shape and hardware constraints
- Optimized for overall system energy efficiency



#### **Fused Layer**

Dataflow across multiple layers

[Alwani et al., MICRO 2016]

#### **Metrics for DNN Hardware**

- Measure energy and DRAM access relative to number of non-zero MACs and bit-width of MACs
  - Account for impact of sparsity in weights and activations
  - Normalize DRAM access based on operand size
- Energy Efficiency of Design
  - pJ/(non-zero weight & activation)
- External Memory Bandwidth
  - DRAM operand access/(non-zero weight & activation)
- Area Efficiency
  - Total chip mm<sup>2</sup>/multiplier (also include process technology)
  - Accounts for on-chip memory


## **Website to Summarize Results**

- <u>http://eyeriss.mit.edu/benchmarking.html</u>
- Send results or feedback to: <u>eyeriss@mit.edu</u>

| Metric                    | Units        | Input   |
|---------------------------|--------------|---------|
| Name of CNN               | Text         | AlexNet |
| # of Images Tested        | #            | 100     |
| Bits per operand          | #            | 16      |
| Batch Size                | #            | 4       |
| # of Non Zero MACs        | #            | 409M    |
| Runtime                   | ms           | 115.3   |
| Power                     | mW           | 278     |
| Energy/non-zero MACs      | pJ/MAC       | 21.7    |
| DRAM access/non-zero MACs | operands/MAC | 0.005   |



## **Advanced Memory Technologies**

Many new memories and devices explored to reduce data movement

- eDRAM [Chen et al., DaDianNao, MICRO 2014]
- Stacked DRAM [Gao et al., Tetris, ASPLOS 2017] [Kim et al., NeuroCube, ISCA 2016]
- Non-Volatile Resistive Memories [Shafiee et al., ISCA 2016] [Chi et al., PRIME, ISCA 2016]
  - WS dataflow
  - $I_1 = V_1 \times G_1$, $I_2 = V_2 \times G_2$, $I = I_1 + I_2 = V_1 \times G_1 + V_2 \times G_2$
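The current-summing relation for resistive memories amounts to a multiply-and-accumulate: each input voltage is multiplied by a stored conductance, and the resulting currents add. A minimal numerical sketch, ignoring all device non-idealities and using a hypothetical function name:

```python
def column_current(voltages, conductances):
    """Total current I = sum(V_i * G_i) for one column of resistive cells."""
    return sum(v * g for v, g in zip(voltages, conductances))


# Two cells: I = V1*G1 + V2*G2 = 1.0*0.5 + 2.0*0.25
print(column_current([1.0, 2.0], [0.5, 0.25]))  # 1.0
```

Here the conductances play the role of stationary weights and the voltages the role of input activations, which is in line with the WS dataflow label.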