# **DNN Accelerator Architectures**

## **MICRO Tutorial (2016)**

Website: http://eyeriss.mit.edu/tutorial.html

Joel Emer, Vivienne Sze, Yu-Hsin Chen

# **Highly-Parallel Compute Paradigms**

### Temporal Architecture (SIMD/SIMT)



### Spatial Architecture (Dataflow Processing)










\* MAC: multiply-and-accumulate

## Worst Case: all memory R/W are **DRAM** accesses

- Example: AlexNet [NIPS 2012] has 724M MACs → 2896M DRAM accesses required
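The worst-case figure can be reproduced with a short calculation. This is a minimal sketch that assumes each MAC performs three reads (filter weight, fmap activation, partial sum) and one write (updated partial sum). That 3 + 1 split is inferred from the 4:1 ratio between the two numbers on the slide; the function and constant names are hypothetical.

```python
# Worst-case DRAM traffic when every MAC operand is read from / written to DRAM.
# Assumption: 3 reads (weight, activation, psum) + 1 write (updated psum) per MAC.
ALEXNET_MACS = 724_000_000  # MAC count quoted on the slide

READS_PER_MAC = 3
WRITES_PER_MAC = 1


def worst_case_dram_accesses(num_macs: int) -> int:
    """Return the number of DRAM accesses if nothing is reused on chip."""
    return num_macs * (READS_PER_MAC + WRITES_PER_MAC)


accesses = worst_case_dram_accesses(ALEXNET_MACS)
print(f"{accesses // 1_000_000}M DRAM accesses")  # prints "2896M DRAM accesses"
```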









Opportunities: (1) data reuse



# **Types of Data Reuse in DNN**

### **Convolutional Reuse**

CONV layers only (sliding window)

Reuse: activations, filter weights

### **Fmap Reuse**

CONV and FC layers

Reuse: activations

### **Filter Reuse**

CONV and FC layers (batch size > 1)

Reuse: filter weights
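To make the three reuse types concrete, the following sketch estimates how many MACs can touch a single weight or a single activation in one layer. It assumes square filters and outputs with stride 1, and it reports upper bounds (activations near the fmap border are used less). The function name, parameter names, and the example layer shape are illustrative assumptions, not taken from the slides.

```python
def reuse_upper_bounds(batch, num_filters, filter_size, out_size):
    """Upper bounds on per-datum reuse in one layer (square shapes, stride 1).

    batch       : number of input fmaps processed together (N)
    num_filters : number of filters, i.e. output channels (M)
    filter_size : filter width/height (R)
    out_size    : output fmap width/height (E)
    """
    # Convolutional reuse: a weight is applied at every output position;
    # an activation is covered by at most min(R, E) windows per dimension.
    conv_weight = out_size * out_size
    conv_act = min(filter_size, out_size) ** 2
    return {
        # Filter reuse multiplies weight reuse by the batch size.
        "weight_uses": conv_weight * batch,
        # Fmap reuse multiplies activation reuse by the number of filters.
        "activation_uses": conv_act * num_filters,
    }


# Illustrative CONV shape: 3x3 filters, 13x13 outputs, 384 filters, batch 16.
print(reuse_upper_bounds(batch=16, num_filters=384, filter_size=3, out_size=13))
# An FC layer has no sliding window, so only fmap and filter reuse remain.
print(reuse_upper_bounds(batch=16, num_filters=4096, filter_size=6, out_size=1))
```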





## Opportunities: (1) data reuse, (2) local accumulation

- (1) Can reduce DRAM reads of filter/fmap by up to 500×\*\*
- (2) Partial sum accumulation does NOT have to access DRAM
- Example: DRAM access in AlexNet can be reduced from 2896M to 61M (best case)

\*\* AlexNet CONV layers

## **Spatial Architecture for CNN**



## **Low-Cost Local Data Access**

How to exploit (1) data reuse and (2) local accumulation with *limited* low-cost local storage?

Specialized **processing dataflow** required!

\* measured from a commercial 65nm process

# **Dataflow Taxonomy**

- Weight Stationary (WS)
- Output Stationary (OS)
- No Local Reuse (NLR)



# Weight Stationary (WS)



- Minimize weight read energy consumption
  - maximize convolutional and filter reuse of weights
- Broadcast activations and accumulate psums spatially across the PE array.
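As a rough illustration of the weight-stationary idea, the sketch below runs a 1D convolution in which each weight is fetched once and held while all activations stream past it. It is a sequential software model of the loop order only, not of any particular chip; the function name and the `weight_reads` counter are hypothetical.

```python
def conv1d_weight_stationary(weights, acts):
    """1D convolution (CNN-style, no kernel flip) with a WS loop order."""
    num_out = len(acts) - len(weights) + 1
    psums = [0] * num_out
    weight_reads = 0
    for r, w in enumerate(weights):       # conceptually one PE per weight
        weight_reads += 1                 # the weight is fetched once and stays put
        for e in range(num_out):          # activations are broadcast past it
            psums[e] += w * acts[e + r]   # psums accumulate across the PEs
    return psums, weight_reads


outputs, reads = conv1d_weight_stationary([1, 2, 3], [1, 2, 3, 4, 5])
print(outputs, reads)  # prints "[14, 20, 26] 3"
```

The weight is read 3 times in total, versus 9 times (once per MAC) if it were refetched for every output.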



## WS Example: nn-X (NeuFlow)

### A 3×3 2D Convolution Engine

[Farabet et al., ICCV 2009]

### **Top-Level Architecture**





# **Output Stationary (OS)**



- Minimize partial sum R/W energy consumption
  - maximize local accumulation
- Broadcast/Multicast filter weights and reuse activations spatially across the PE array
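The output-stationary loop order can be sketched the same way: each output's partial sum stays in a local accumulator until it is complete, so it is written out only once. This is a sequential software model under that assumption; the function name and the `psum_writes` counter are hypothetical.

```python
def conv1d_output_stationary(weights, acts):
    """1D convolution (CNN-style, no kernel flip) with an OS loop order."""
    num_out = len(acts) - len(weights) + 1
    outputs = []
    psum_writes = 0
    for e in range(num_out):              # conceptually one PE per output
        acc = 0                           # the psum stays in a local register
        for r, w in enumerate(weights):   # weights are broadcast to the PEs
            acc += w * acts[e + r]
        outputs.append(acc)               # written out once, when finished
        psum_writes += 1
    return outputs, psum_writes


result, writes = conv1d_output_stationary([1, 2, 3], [1, 2, 3, 4, 5])
print(result, writes)  # prints "[14, 20, 26] 3"
```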



## **OS Example: ShiDianNao**

## Input Fmap Dataflow in the PE Array







[Du et al., ISCA 2015]





## No Local Reuse (NLR)



- Use a large global buffer as shared storage
  - Reduce **DRAM** access energy consumption
- Multicast activations, single-cast weights, and accumulate psums spatially across the PE array



## **NLR Example: UCLA**





## NLR Example: DianNao





## **Taxonomy: More Examples**

- Weight Stationary (WS): [Chakradhar, *ISCA* 2010] [nn-X (NeuFlow), *CVPRW* 2014] [Park, *ISSCC* 2015] [ISAAC, *ISCA* 2016] [PRIME, *ISCA* 2016]
- Output Stationary (OS): [Peemen, *ICCD* 2013] [ShiDianNao, *ISCA* 2015] [Gupta, *ICML* 2015] [Moons, *VLSI* 2016]
- No Local Reuse (NLR): [DianNao, *ASPLOS* 2014] [DaDianNao, *MICRO* 2014] [Zhang, *FPGA* 2015]



# **Energy Efficiency Comparison**

- Same total area
- 256 PEs
- AlexNet CONV layers
- Batch size = 16


# Energy-Efficient Dataflow: Row Stationary (RS)

- Maximize reuse and accumulation at RF
- Optimize for **overall** energy efficiency instead of for *only* a certain data type



## **Row Stationary: Energy-efficient Dataflow**



























- Maximize row convolutional reuse in RF
  - Keep a filter row and fmap sliding window in RF
- Maximize row psum accumulation in RF
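The per-PE primitive described above can be sketched as a 1D row convolution. In this minimal model the filter row and a sliding window of the fmap row stand in for the register file (RF), and each psum is accumulated locally before being emitted. The function name and the `fmap_reads` counter are hypothetical.

```python
from collections import deque


def pe_row_conv(filter_row, fmap_row):
    """One PE: convolve a filter row over an fmap row (stride 1, no flip)."""
    width = len(filter_row)
    window = deque(maxlen=width)   # sliding window of activations held in the RF
    psum_row = []
    fmap_reads = 0
    for act in fmap_row:
        window.append(act)         # each activation enters the RF exactly once
        fmap_reads += 1
        if len(window) == width:
            acc = 0                # psum accumulates inside the PE
            for w, x in zip(filter_row, window):
                acc += w * x
            psum_row.append(acc)
    return psum_row, fmap_reads


row, reads = pe_row_conv([1, 2, 3], [1, 2, 3, 4, 5])
print(row, reads)  # prints "[14, 20, 26] 5"
```

Nine MACs are performed, but only five activations and three weights are brought into the PE.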
























### **Convolutional Reuse Maximized**

- Filter rows are reused across PEs horizontally
- Fmap rows are reused across PEs diagonally



### Maximize 2D Accumulation in PE Array



Partial sums accumulate across PEs vertically
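The three reuse patterns (filter rows horizontally, fmap rows diagonally, psums vertically) can be checked with a small software model of the PE array. In this sketch, PE (i, j) is assumed to hold filter row i, receive fmap row i + j, and contribute to output row j; the indexing convention and function names are assumptions made for illustration.

```python
def row_conv(filter_row, fmap_row):
    """1D convolution primitive executed inside a single PE."""
    width = len(filter_row)
    return [
        sum(filter_row[s] * fmap_row[x + s] for s in range(width))
        for x in range(len(fmap_row) - width + 1)
    ]


def rs_pe_array_conv2d(filt, fmap):
    """2D convolution (stride 1, no flip) built from row primitives."""
    num_filter_rows = len(filt)
    num_out_rows = len(fmap) - num_filter_rows + 1
    out = []
    for j in range(num_out_rows):              # PE column j -> output row j
        acc = None
        for i in range(num_filter_rows):       # PE row i reuses filter row i
            # fmap row i + j is the same for every PE on one diagonal
            partial = row_conv(filt[i], fmap[i + j])
            # psum rows accumulate vertically within the column
            acc = partial if acc is None else [a + p for a, p in zip(acc, partial)]
        out.append(acc)
    return out


result = rs_pe_array_conv2d(
    [[1, 0], [0, 1]],
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
)
print(result)  # prints "[[6, 8], [12, 14]]"
```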



### **Dimensions Beyond 2D Convolution**



## Filter Reuse in PE

**(1)** Multiple Fmaps, (2) Multiple Filters, (3) Multiple Channels

- Filter 1, Channel 1, Row 1 \* Fmap 1, Channel 1, Row 1 = Psum 1, Row 1
- Filter 1, Channel 1, Row 1 \* Fmap 2, Channel 1, Row 1 = Psum 2, Row 1

Both fmap rows share the same filter row.

### Processing in PE: concatenate fmap rows
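A possible software model of concatenating fmap rows: rows from several fmaps are laid end to end in one stream and processed with a single filter row that stays in the PE. This sketch assumes windows that would straddle the boundary between two fmaps are skipped; the function name and that boundary handling are assumptions, not details from the slides.

```python
def pe_concat_fmap_rows(filter_row, fmap_rows):
    """Apply one filter row to same-length rows taken from several fmaps."""
    width = len(filter_row)
    row_len = len(fmap_rows[0])
    # Concatenate the rows into one stream, as a PE would receive them.
    stream = [act for row in fmap_rows for act in row]
    psum_rows = []
    for n in range(len(fmap_rows)):
        base = n * row_len             # start of fmap n inside the stream
        psum_rows.append([
            sum(filter_row[s] * stream[base + x + s] for s in range(width))
            for x in range(row_len - width + 1)   # stay inside fmap n
        ])
    return psum_rows


print(pe_concat_fmap_rows([1, 1], [[1, 2, 3], [10, 20, 30]]))
# prints "[[3, 5], [30, 50]]"
```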





## **Fmap Reuse in PE**

(1) Multiple Fmaps, **(2)** Multiple Filters



### Processing in PE: interleave filter rows
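Interleaving filter rows can be modeled as follows: rows from several filters are interleaved element by element, every window of the shared fmap row is applied to each filter in turn, and the resulting psums come out interleaved as well. The layout chosen here (filter index varying fastest) and the function name are assumptions for illustration.

```python
def pe_interleave_filter_rows(filter_rows, fmap_row):
    """Apply rows from several filters to one shared fmap row."""
    num_filters = len(filter_rows)
    width = len(filter_rows[0])
    # Interleaved layout: f0[0], f1[0], ..., f0[1], f1[1], ...
    interleaved = [row[s] for s in range(width) for row in filter_rows]
    psums = []
    for x in range(len(fmap_row) - width + 1):
        for m in range(num_filters):   # the same fmap window serves every filter
            psums.append(sum(
                interleaved[s * num_filters + m] * fmap_row[x + s]
                for s in range(width)
            ))
    return psums                        # psums are interleaved by filter


print(pe_interleave_filter_rows([[1, 1], [1, -1]], [1, 2, 3]))
# prints "[3, -1, 5, -1]"
```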





## **Channel Accumulation in PE**








### Processing in PE: interleave channels
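Channel accumulation can be sketched by interleaving the channels of both the filter row and the fmap row, then summing over channels inside the PE so that a single psum row is produced. The channel-fastest layout and the function name are assumptions for illustration.

```python
def pe_interleave_channels(filter_rows_by_ch, fmap_rows_by_ch):
    """Accumulate one psum row over all channels inside a single PE."""
    num_ch = len(filter_rows_by_ch)
    width = len(filter_rows_by_ch[0])
    row_len = len(fmap_rows_by_ch[0])
    # Channel-interleaved layouts: position-major, channel varying fastest.
    filt = [filter_rows_by_ch[c][s] for s in range(width) for c in range(num_ch)]
    fmap = [fmap_rows_by_ch[c][x] for x in range(row_len) for c in range(num_ch)]
    psum_row = []
    for x in range(row_len - width + 1):
        acc = 0                                  # accumulates across channels
        for k in range(width * num_ch):
            acc += filt[k] * fmap[x * num_ch + k]
        psum_row.append(acc)
    return psum_row


print(pe_interleave_channels([[1, 1], [2, 2]], [[1, 2, 3], [10, 20, 30]]))
# prints "[63, 105]"
```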





## **DNN Processing – The Full Picture**





## **Optimal Mapping in Row Stationary**



[Chen et al., ISCA 2016]


# Dataflow Simulation Results



### **Evaluate Reuse in Different Dataflows**

### Weight Stationary

Minimize movement of filter weights

### Output Stationary

Minimize movement of partial sums

### No Local Reuse

- No PE local storage. Maximize global buffer size.

### Row Stationary

### **Evaluation Setup**

- same total area
- 256 PEs
- AlexNet
- batch size = 16



### **Variants of Output Stationary**





### **Dataflow Comparison: CONV Layers**

- RS optimizes for the best **overall** energy efficiency
- RS uses 1.4× – 2.5× lower energy than other dataflows

### **Dataflow Comparison: FC Layers**



RS uses at least **1.3× lower** energy than other dataflows











# Hardware Architecture for RS Dataflow



## **Eyeriss Deep CNN Accelerator**



## **Data Delivery with On-Chip Network**




How to accommodate different shapes with fixed PE array?



## **Logical to Physical Mappings**








### **Multicast Network Design**






Compared to Broadcast, **Multicast** saves >80% of NoC energy


# **Chip Spec & Measurement Results**

| Parameter                     | Value                   |
|-------------------------------|-------------------------|
| Technology                    | TSMC 65nm LP 1P9M       |
| On-Chip Buffer                | 108 KB                  |
| # of PEs                      | 168                     |
| Scratch Pad / PE              | 0.5 KB                  |
| Core Frequency                | 100 – 250 MHz           |
| Peak Performance              | 33.6 – 84.0 GOPS        |
| Word Bit-width                | 16-bit Fixed-Point      |
| Natively Supported DNN Shapes | Filter Width: 1 – 32<br>Filter Height: 1 – 12<br>Num. Filters: 1 – 1024<br>Num. Channels: 1 – 1024<br>Horz. Stride: 1 – 12<br>Vert. Stride: 1, 2, 4 |
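The peak-performance range in the table is consistent with the PE count and core frequency range. The sketch below assumes each PE completes one MAC per cycle and that a MAC counts as two operations (multiply and add); both are assumptions that happen to reproduce the quoted numbers, and the names are hypothetical.

```python
NUM_PES = 168        # from the chip spec table
OPS_PER_MAC = 2      # assumption: one multiply plus one add


def peak_gops(core_mhz: float) -> float:
    """Peak throughput in GOPS, assuming one MAC per PE per cycle."""
    return NUM_PES * OPS_PER_MAC * core_mhz / 1000


print(peak_gops(100), peak_gops(250))  # prints "33.6 84.0"
```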





### **Benchmark – AlexNet Performance**

Image Batch Size of **4** (i.e., 4 frames of 227×227)

Core Frequency = 200 MHz / Link Frequency = 60 MHz

| Layer | Power<br>(mW) | Latency<br>(ms) | # of MAC<br>(MOPs) | Active #<br>of PEs (%) | Buffer Data<br>Access (MB) | DRAM Data<br>Access (MB) |
|-------|---------------|-----------------|--------------------|------------------------|----------------------------|--------------------------|
| 1     | 332           | 20.9            | 422                | 154 (92%)              | 18.5                       | 5.0                      |
| 2     | 288           | 41.9            | 896                | 135 (80%)              | 77.6                       | 4.0                      |
| 3     | 266           | 23.6            | 598                | 156 (93%)              | 50.2                       | 3.0                      |
| 4     | 235           | 18.4            | 449                | 156 (93%)              | 37.4                       | 2.1                      |
| 5     | 236           | 10.5            | 299                | 156 (93%)              | 24.9                       | 1.3                      |
| Total | 278           | 115.3           | 2663               | 148 (88%)              | 208.5                      | 15.4                     |

Supporting 2.66 GMACs [8 billion 16-bit inputs (**16 GB**) and 2.7 billion outputs (**5.4 GB**)] requires only **208.5 MB** of buffer accesses and **15.4 MB** of DRAM accesses


**51682** operand\* accesses per input image pixel

→ 506 accesses/pixel from buffer + 37 accesses/pixel from DRAM



\*operand = weight, activation, psum
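The per-pixel figures can be reproduced from the benchmark totals. This sketch assumes 1 MB = 10^6 bytes, 2-byte (16-bit) words, and that a "pixel" is one spatial location of a 227×227 frame (channels not counted separately). These conventions are inferred because they match the quoted 506 and 37; the variable names are hypothetical.

```python
BATCH = 4                      # frames per batch
FRAME_PIXELS = 227 * 227       # spatial positions per frame
BYTES_PER_WORD = 2             # 16-bit fixed point

BUFFER_BYTES = 208.5e6         # total buffer data access (assuming MB = 1e6 bytes)
DRAM_BYTES = 15.4e6            # total DRAM data access
TOTAL_LATENCY_S = 115.3e-3     # total latency over the five CONV layers

pixels = BATCH * FRAME_PIXELS
buffer_per_pixel = BUFFER_BYTES / BYTES_PER_WORD / pixels
dram_per_pixel = DRAM_BYTES / BYTES_PER_WORD / pixels
frames_per_second = BATCH / TOTAL_LATENCY_S

print(round(buffer_per_pixel), round(dram_per_pixel))  # prints "506 37"
print(round(frames_per_second, 1))                     # prints "34.7"
```

The same latency total gives roughly 34.7 frames per second for the CONV layers at batch size 4.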

# **Comparison with GPU**

|                         | This Work                      | NVIDIA TK1 (Jetson Kit)               |
|-------------------------|--------------------------------|---------------------------------------|
| Technology              | 65nm                           | 28nm                                  |
| Clock Rate              | 200MHz                         | 852MHz                                |
| # Multipliers           | 168                            | 192                                   |
| On-Chip Storage         | Buffer: 108KB<br>Spad: 75.3KB  | Shared Mem: 64KB<br>Reg File: 256KB   |
| Word Bit-Width          | 16b Fixed                      | 32b Float                             |
| Throughput <sup>1</sup> | 34.7 fps                       | 68 fps                                |
| Measured Power          | 278 mW                         | Idle/Active <sup>2</sup> : 3.7W/10.2W |
| DRAM Bandwidth          | 127 MB/s                       | 1120 MB/s <sup>3</sup>                |

1. AlexNet CONV Layers
2. Board Power
3. Modeled from [Tan, SC 2011]



### **From Architecture to System**



https://vimeo.com/154012013



# **Summary of DNN Dataflows**

### Weight Stationary

- Minimize movement of filter weights
- Popular with processing-in-memory architectures

### Output Stationary

- Minimize movement of partial sums
- Different variants optimized for CONV or FC layers

### No Local Reuse

- No PE local storage  $\rightarrow$  maximize global buffer size

### Row Stationary

- Adapt to the NN shape and hardware constraints
- Optimized for overall system energy efficiency



### **MICRO 2016 Papers in the Taxonomy**

- **Stripes**: bit-serial computation in an **NLR**-like engine (based on DaDianNao)
- **NEUTRAMS**: a toolset for accelerators running the **WS** dataflow (synaptic weight memory array)
- **Fused-layer**: exploit inter-layer data reuse in an **NLR** engine (based on [Zhang, FPGA 2015])



### **Fused Layer**

#### Dataflow across multiple layers





[Alwani et al., MICRO 2016]