


**Follow** @eems\_mit



Microsystems Technology Laboratories, Massachusetts Institute of Technology

## Video is the Biggest Big Data

- Over 70% of today's Internet traffic is video
- Over 300 hours of video uploaded to YouTube <u>every minute</u>
- Over 500 million hours of video surveillance collected <u>every day</u>



Need energy-efficient pixel processing!



#### Processing at "Edge" instead of the "Cloud"

#### Privacy









## Example Applications of Machine Learning

#### **Computer Vision**



#### **Speech Recognition**



#### **Medical**









## Machine Learning Pipeline (Inference)




$$\text{Score} = \sum_{i=1}^{n} x_i w_i$$

Main Computation: Dot Product of Features (x) and Weights (w)
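The weighted sum above can be sketched in a few lines of Python. This is a minimal illustration only; the function name `score` and the example values are hypothetical and do not come from the slides.

```python
def score(features, weights):
    """Return the dot product sum_i x_i * w_i of features and weights."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(x * w for x, w in zip(features, weights))


# Hypothetical feature vector x and trained weight vector w.
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 2.0]
example_score = score(x, w)  # 1.0*0.5 + 2.0*(-1.0) + 3.0*2.0
```

Each term of the sum is one multiply-and-accumulate (MAC) operation, which is why MAC counts are used later as a proxy for computation.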

#### What is Deep Learning?



Image Source: [Lee et al., Comm. ACM 2011]





#### Weighted Sums



Image Source: Stanford



## Why is Deep Learning Hot Now?





#### ImageNet: Image Classification Task



[Russakovsky et al., IJCV 2015]



#### Human or Superhuman Accuracy Level

#### Face recognition

– Deep learning accuracy (97.25%) vs. Human accuracy (97.53%)



- Fine grained category recognition (e.g. dogs, monkeys, snakes, birds)
  - Deep learning errors: 7 vs. Human errors: 28



120 species of dogs

[O. Russakovsky et al., IJCV 2015]





#### Deep Learning on Games

#### Google DeepMind AlphaGo

Go is exponentially more complex than chess (10<sup>170</sup> legal positions)







#### Deep Convolutional Neural Networks












**Convolutions** account for more than 90% of overall computation, dominating **runtime** and **energy consumption** 



Input Image (Feature Map)










Element-wise Multiplication











**Sliding Window Processing** 







Many Input Channels (C)









#### High-Dimensional CNN Convolution



Image batch size: 1 – 256 (N)
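The high-dimensional convolution can be written as a plain loop nest. The sketch below is a naive reference, not an optimized implementation. It assumes no padding and a nested-list layout `inputs[n][c][y][x]` and `filters[m][c][i][j]`; the symbols N (batch), M (filters), and C (channels) follow the slides, while `H`, `W`, `R`, `S`, `E`, `F`, and the function name are assumptions made for illustration.

```python
def conv_layer(inputs, filters, stride=1):
    """Naive CNN convolution.

    inputs[n][c][y][x]  : N images, C channels, H x W pixels
    filters[m][c][i][j] : M filters, C channels, R x S weights
    returns out[n][m][y][x] with E x F output positions (no padding).
    """
    N, C = len(inputs), len(inputs[0])
    H, W = len(inputs[0][0]), len(inputs[0][0][0])
    M = len(filters)
    R, S = len(filters[0][0]), len(filters[0][0][0])
    E = (H - R) // stride + 1  # output height
    F = (W - S) // stride + 1  # output width
    out = [[[[0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]
    for n in range(N):                      # batch
        for m in range(M):                  # output channels (filters)
            for y in range(E):              # sliding window, vertical
                for x in range(F):          # sliding window, horizontal
                    acc = 0
                    for c in range(C):      # input channels
                        for i in range(R):  # filter rows
                            for j in range(S):  # filter columns
                                acc += (inputs[n][c][y * stride + i][x * stride + j]
                                        * filters[m][c][i][j])
                    out[n][m][y][x] = acc
    return out
```

The seven nested loops make the reuse opportunities visible: every weight is reused across all window positions and all images in the batch, and every input pixel is reused across all M filters.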


## Large Sizes with Varying Shapes

AlexNet<sup>1</sup> Convolutional Layer Configurations

| Layer | Filter Size (R) | # Filters (M) | # Channels (C) | Stride |
|-------|-----------------|---------------|----------------|--------|
| 1     | 11x11           | 96            | 3              | 4      |
| 2     | 5x5             | 256           | 48             | 1      |
| 3     | 3x3             | 384           | 256            | 1      |
| 4     | 3x3             | 384           | 192            | 1      |
| 5     | 3x3             | 256           | 192            | 1      |

- Layer 1: 34k Params, 105M MACs
- Layer 2: 307k Params, 224M MACs
- Layer 3: 885k Params, 150M MACs
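The parameter and MAC counts quoted for each layer follow from the table: a layer has R × R × C × M weights, and each weight is used once per output position. The sketch below reproduces that arithmetic. The output feature map widths (55, 27, 13, 13, 13) are not given in the table; they are the commonly cited AlexNet values and are an assumption here, as are all the names.

```python
# (filter size R, filters M, channels C, assumed output width E) per CONV layer.
ALEXNET_CONV = [
    (11, 96, 3, 55),
    (5, 256, 48, 27),
    (3, 384, 256, 13),
    (3, 384, 192, 13),
    (3, 256, 192, 13),
]


def layer_params(R, M, C):
    """Number of weights in one CONV layer (biases ignored)."""
    return R * R * C * M


def layer_macs(R, M, C, E):
    """Number of MACs: every weight is applied at each of the E x E outputs."""
    return layer_params(R, M, C) * E * E


params = [layer_params(R, M, C) for R, M, C, _ in ALEXNET_CONV]
macs = [layer_macs(R, M, C, E) for R, M, C, E in ALEXNET_CONV]
total_params = sum(params)
total_macs = sum(macs)
```

Under these assumptions the totals come to roughly 2.3M weights and 666M MACs for the five CONV layers.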



#### Popular DNNs

- LeNet (1998)
- AlexNet (2012)
- OverFeat (2013)
- VGGNet (2014)
- GoogLeNet (2014)
- ResNet (2015)

#### ImageNet: Large Scale Visual Recognition Challenge (ILSVRC)



[O. Russakovsky et al., IJCV 2015]






### **Summary of Popular DNNs**

| Metrics            | LeNet-5 | AlexNet  | VGG-16   | GoogLeNet (v1) | ResNet-50 |
|--------------------|---------|----------|----------|----------------|-----------|
| Top-5 error (%)    | n/a     | 16.4     | 7.4      | 6.7            | 5.3       |
| Input Size         | 28x28   | 227x227  | 224x224  | 224x224        | 224x224   |
| # of CONV Layers   | 2       | 5        | 16       | 21 (depth)     | 49        |
| CONV: Filter Sizes | 5       | 3, 5, 11 | 3        | 1, 3, 5, 7     | 1, 3, 7   |
| CONV: # of Channels | 1, 6   | 3 - 256  | 3 - 512  | 3 - 1024       | 3 - 2048  |
| CONV: # of Filters | 6, 16   | 96 - 384 | 64 - 512 | 64 - 384       | 64 - 2048 |
| CONV: Stride       | 1       | 1, 4     | 1        | 1, 2           | 1, 2      |
| CONV: # of Weights | 2.6k    | 2.3M     | 14.7M    | 6.0M           | 23.5M     |
| CONV: # of MACs    | 283k    | 666M     | 15.3G    | 1.43G          | 3.86G     |
| # of FC Layers     | 2       | 3        | 3        | 1              | 1         |
| FC: # of Weights   | 58k     | 58.6M    | 124M     | 1M             | 2M        |
| FC: # of MACs      | 58k     | 58.6M    | 124M     | 1M             | 2M        |
| Total Weights      | 60k     | 61M      | 138M     | 7M             | 25.5M     |
| Total MACs         | 341k    | 724M     | 15.5G    | 1.43G          | 3.9G      |

CONV Layers increasingly important!


## Complexity versus Difficulty of Task

- Evaluate hardware using the appropriate DNN model and dataset
  - Difficult tasks typically require larger models
  - Different datasets for different tasks

#### **MNIST**

#### **ImageNet**

#### Training vs. Inference









## Challenges






#### Key Metrics

- Accuracy
  - Measured on a publicly available dataset
  - Popular DNN Models
- Programmability
  - Support multiple applications
  - Different weights
- Energy/Power
  - Energy per operation
  - DRAM Bandwidth
- Throughput/Latency
  - GOPS, frame rate, delay
- Cost
  - Area (memory and logic size)













## Website to Summarize DNN Results

- <u>http://eyeriss.mit.edu/benchmarking.html</u>
- Send results or feedback to: <u>eyeriss@mit.edu</u>

| ASIC Specs                              | Input               |
|-----------------------------------------|---------------------|
| Process Technology                      | 65nm LP TSMC (1.0V) |
| Core area (mm<sup>2</sup>) / multiplier | 0.073               |
| On-Chip memory (kB) / multiplier        | 1.14                |
| Measured or Simulated                   | Measured            |
| If Simulated, Syn or PnR? Which corner? | n/a                 |

| Metric                    | Units        | Input   |
|---------------------------|--------------|---------|
| Name of CNN               | Text         | AlexNet |
| # of Images Tested        | #            | 100     |
| Bits per operand          | #            | 16      |
| Batch Size                | #            | 4       |
| # of Non Zero MACs        | #            | 409M    |
| Runtime                   | ms           | 115.3   |
| Power                     | mW           | 278     |
| Energy/non-zero MACs      | pJ/MAC       | 21.7    |
| DRAM access/non-zero MACs | operands/MAC | 0.005   |



# Opportunities in Architecture






## GPUs and CPUs Targeting Deep Learning

- Intel Knights Landing (2016)
- Nvidia PASCAL GP100 (2016)





Knights Mill: next gen Xeon Phi "optimized for deep learning"

Use matrix multiplication libraries on CPUs and GPUs





## Map DNN to a Matrix Multiplication



Goal: Reduce the number of operations to increase throughput
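One common way to map a convolution onto a matrix multiplication is to unroll every sliding-window patch into a row of a matrix, often called im2col. The sketch below shows that idea for a single-channel image with unit stride and no padding. It is an illustration under those assumptions, not necessarily the exact layout shown on the slide, and the function names are hypothetical.

```python
def im2col(image, R):
    """Unroll each R x R sliding-window patch of a 2D image into one row."""
    H, W = len(image), len(image[0])
    rows = []
    for y in range(H - R + 1):
        for x in range(W - R + 1):
            rows.append([image[y + i][x + j] for i in range(R) for j in range(R)])
    return rows


def conv_as_matmul(image, filters):
    """Compute all filter responses as (flattened filters) x (patch matrix)^T.

    Returns one flat list of outputs per filter, in row-major window order.
    """
    R = len(filters[0])
    patches = im2col(image, R)
    flat_filters = [[w for row in f for w in row] for f in filters]
    return [[sum(p * w for p, w in zip(patch, f)) for patch in patches]
            for f in flat_filters]
```

The patch matrix repeats input pixels that are shared by overlapping windows, so this mapping trades extra storage for the ability to use tuned matrix multiplication libraries.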



## Reduce Operations in Matrix Multiplication

- Fast Fourier Transform [Mathieu, ICLR 2014]
  - **Pro:** Direct convolution  $O(N_o^2 N_f^2)$  to  $O(N_o^2 \log_2 N_o)$
  - Con: Increase storage requirements
- Strassen [Cong, ICANN 2014]
  - Pro: O(N<sup>3</sup>) to O(N<sup>2.807</sup>)
  - Con: Numerical stability
- Winograd [Lavin, CVPR 2016]
  - Pro: 2.25x speed up for 3x3 filter
  - Con: Specialized processing depending on filter size



## **Analogy: Gauss's Multiplication Algorithm**

$$(a+bi)(c+di) = (ac-bd) + (bc+ad)i.$$

4 multiplications + 2 additions

$$k_{1} = c \cdot (a + b)$$

$$k_{2} = a \cdot (d - c)$$

$$k_{3} = b \cdot (c + d)$$

Real part = $k_{1} - k_{3}$

Imaginary part = $k_{1} + k_{2}$

3 multiplications + 5 additions

**Reduce** number of multiplications, but **increase** number of additions
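The trick can be checked directly in code. This is a small sketch of the three-multiplication form given above; the function name is hypothetical.

```python
def gauss_complex_multiply(a, b, c, d):
    """Multiply (a + bi)(c + di) using 3 multiplications and 5 additions.

    Returns (real, imaginary).
    """
    k1 = c * (a + b)  # ac + bc
    k2 = a * (d - c)  # ad - ac
    k3 = b * (c + d)  # bc + bd
    return k1 - k3, k1 + k2  # (ac - bd, bc + ad)
```

For example, `gauss_complex_multiply(1, 2, 3, 4)` gives the same result as the direct product (1 + 2i)(3 + 4i). Strassen and Winograd apply the same kind of trade, fewer multiplications for more additions, to matrix multiplication and small convolutions.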



# Accelerators





- Operations exhibit high parallelism
   → high throughput possible
- Memory access is the bottleneck
   → each MAC (multiply-and-accumulate) needs several memory reads and writes

Worst case: all memory R/W are **DRAM** accesses

Example: AlexNet [NIPS 2012] has 724M MACs
 → 2896M DRAM accesses required (4 accesses per MAC)

- Input data reuse opportunities (up to 500x)
   → exploit **low-cost memory**
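The worst-case figure is simple arithmetic: 2896M / 724M = 4 memory accesses per MAC. The sketch below reproduces it. Splitting the four accesses into three reads (weight, activation, partial sum) and one write (updated partial sum) is an assumption made for illustration, as are the names.

```python
ALEXNET_MACS = 724_000_000

# Assumed breakdown of the 4 accesses per MAC:
# read weight, read activation, read partial sum, write updated partial sum.
READS_PER_MAC = 3
WRITES_PER_MAC = 1


def worst_case_dram_accesses(num_macs):
    """DRAM accesses if no operand is ever reused from on-chip memory."""
    return num_macs * (READS_PER_MAC + WRITES_PER_MAC)


worst_case = worst_case_dram_accesses(ALEXNET_MACS)
```

Any operand that is served from a local register file or buffer instead removes one of these DRAM accesses, which is the motivation for the dataflows discussed next.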



# Highly-Parallel Compute Paradigms

#### Temporal Architecture (SIMD/SIMT)



# Spatial Architecture (Dataflow Processing)





# **Advantages of Spatial Architecture**





### How to Map the Dataflow?



Goal: Increase reuse of input data (weights and pixels) and local partial sums accumulation

### Spatial Architecture (Dataflow Processing)






# **Energy-Efficient Dataflow**

Yu-Hsin Chen, Joel Emer, Vivienne Sze, ISCA 2016

### Maximize data reuse and accumulation at RF





### **Data Movement is Expensive**



#### **Processing Engine**



**Data Movement Energy Cost** 



Maximize data reuse at lower levels of hierarchy

# Weight Stationary (WS)



- Minimize weight read energy consumption
  - maximize convolutional and filter reuse of weights
- Examples:

[Chakradhar, ISCA 2010] [nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [Origami, GLSVLSI 2015]



# Output Stationary (OS)



- Minimize partial sum R/W energy consumption
  - maximize local accumulation
- Examples:

[Gupta, *ICML* 2015] [ShiDianNao, *ISCA* 2015] [Peemen, *ICCD* 2013]



## No Local Reuse (NLR)



- Use a large global buffer as shared storage
  - Reduce **DRAM** access energy consumption
- Examples:

[DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014] [Zhang, FPGA 2015]
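The difference between weight stationary and output stationary comes down to which loop is outermost, and therefore which operand stays in local storage. The sketch below shows this for a 1D convolution on a single processing element. It is a software analogy only: the access counters are hypothetical bookkeeping, and No Local Reuse is not modeled.

```python
def conv1d_weight_stationary(inputs, weights):
    """Hold one weight locally and reuse it across every output position."""
    E = len(inputs) - len(weights) + 1
    psums = [0] * E
    weight_fetches = 0
    for i, w in enumerate(weights):  # each weight is fetched exactly once
        weight_fetches += 1
        for e in range(E):           # partial sums move in and out instead
            psums[e] += w * inputs[e + i]
    return psums, weight_fetches


def conv1d_output_stationary(inputs, weights):
    """Hold one partial sum locally until the output is complete."""
    E = len(inputs) - len(weights) + 1
    outputs = []
    psum_writebacks = 0
    for e in range(E):
        acc = 0                      # partial sum stays in the local register
        for i, w in enumerate(weights):  # weights stream past instead
            acc += w * inputs[e + i]
        outputs.append(acc)          # written back once, when finished
        psum_writebacks += 1
    return outputs, psum_writebacks
```

Both functions compute the same outputs; only the order of the MACs, and hence the data that must move, differs.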



### Row Stationary: Energy-efficient Dataflow









































- Maximize row convolutional reuse in RF
  - Keep a filter row and image sliding window in RF
- Maximize row psum accumulation in RF
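A row stationary primitive can be sketched as a 1D convolution of one filter row with one image row, carried out inside a single PE. The code below then builds a 2D convolution by summing the row results for each output row. How those row results are combined across PEs is an assumption made for illustration, as are the function names; the sketch covers one channel, unit stride, and no padding.

```python
def pe_row_conv(filter_row, image_row):
    """One PE: keep a filter row locally and slide it along an image row.

    Returns one row of partial sums.
    """
    R = len(filter_row)
    return [sum(filter_row[j] * image_row[x + j] for j in range(R))
            for x in range(len(image_row) - R + 1)]


def row_stationary_conv2d(image, filt):
    """Build each output row by accumulating the psum rows of R row primitives."""
    R = len(filt)
    E = len(image) - R + 1
    out = []
    for e in range(E):                   # one output row at a time
        row_psum = [0] * (len(image[0]) - len(filt[0]) + 1)
        for i in range(R):               # filter row i meets image row e + i
            partial = pe_row_conv(filt[i], image[e + i])
            row_psum = [p + q for p, q in zip(row_psum, partial)]
        out.append(row_psum)
    return out
```

Each call to `pe_row_conv` reuses the same filter row for every window position and the same image pixels across overlapping windows, which is the reuse the bullets above aim to keep in the register file.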





### Row Stationary Dataflow



### Evaluate Reuse in Different Dataflows

### Weight Stationary

- Minimize movement of filter weights

### Output Stationary

- Minimize movement of partial sums

### No Local Reuse

- Don't use any local PE storage. Maximize global buffer size.

### Row Stationary




#### **Evaluation Setup**

- Same Total Area
- AlexNet
- 256 PEs
- Batch size = 16



#### **Dataflow Comparison: CONV Layers**






# Opportunities in Joint Algorithm Hardware Design



### Cost of Operations






### Commercial Products using 8-bit Integer





#### Nvidia's Pascal (2016)

#### Google's TPU (2016)





# Reduced Precision in Research

#### Reduce number of bits

- Binary Nets [Courbariaux, NIPS 2015]

#### Reduce number of unique weights

- Ternary Weight Nets [Li, arXiv 2016]
- XNOR-Net [Rastegari, ECCV 2016]

#### Non-Linear Quantization

- LogNet [Lee, ICASSP 2017]



#### Log Domain Quantization

**Binary Filters** 
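Two of the ideas listed above can be sketched in a few lines: binarizing a weight to ±1, and quantizing a weight to the nearest power of two in the log domain so that a multiplication can become a shift. This is a generic illustration, not the scheme of any cited paper; the exponent range and function names are assumptions.

```python
import math


def binarize(weight):
    """Map a weight to +1 or -1 according to its sign (zero maps to +1)."""
    return 1 if weight >= 0 else -1


def log2_quantize(weight, min_exp=-7, max_exp=0):
    """Round a weight to the nearest signed power of two in the log2 domain.

    min_exp and max_exp are illustrative clipping bounds on the exponent.
    """
    if weight == 0:
        return 0.0
    exponent = round(math.log2(abs(weight)))
    exponent = max(min_exp, min(max_exp, exponent))
    return math.copysign(2.0 ** exponent, weight)
```

With log-domain weights only the exponent needs to be stored, so the number of bits per weight depends on the exponent range rather than on the precision of the original value.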




### **Sparsity in Data**

Many zeros in output fmaps after ReLU



### Zero Data Processing Gating

- Skip PE local memory access
- Skip MAC computation
- Reduce PE processing power by 45%
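In software terms, zero gating amounts to checking the activation before touching the weight or the multiplier. The sketch below counts how many MACs are skipped; it is an analogy for the hardware behavior, with hypothetical names.

```python
def gated_dot(activations, weights):
    """Dot product that skips the weight read and the MAC when the activation is zero.

    Returns (result, number_of_skipped_macs).
    """
    acc = 0
    skipped = 0
    for index, activation in enumerate(activations):
        if activation == 0:
            skipped += 1          # gate: no local memory read, no MAC
            continue
        acc += activation * weights[index]
    return acc, skipped
```

The result is unchanged because a zero activation contributes nothing to the sum; only the work done to reach it is reduced.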





### Compression Reduces DRAM BW



Simple run-length coding (RLC) is within 5% - 10% of the theoretical entropy limit
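A run-length code for sparse data stores each nonzero value together with the number of zeros that precede it. The sketch below is a generic encoder and decoder of that kind. The pair format and the `max_run` limit of 31 (a 5-bit run field) are assumptions for illustration and are not claimed to match the chip's exact bit layout.

```python
def rlc_encode(values, max_run=31):
    """Encode a sequence as (zero_run, value) pairs.

    Each pair means: zero_run zeros, followed by value.
    """
    pairs = []
    run = 0
    for v in values:
        if v == 0 and run < max_run:
            run += 1
        else:
            pairs.append((run, v))  # v may be 0 if the run field is full
            run = 0
    if run:
        pairs.append((run - 1, 0))  # trailing zeros: last zero stored as the value
    return pairs


def rlc_decode(pairs):
    """Invert rlc_encode."""
    values = []
    for run, v in pairs:
        values.extend([0] * run)
        values.append(v)
    return values
```

The more zeros the feature maps contain after ReLU, the fewer pairs are needed, so DRAM traffic shrinks with sparsity.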





### Sparsity with Basis Projection

Reduce the number of multiplications by projecting onto a basis that increases sparsity (>1.8x power reduction)

**Basis Projection Equation** 



[Suleiman et al., VLSI 2016]





# Pruning – Make Weights Sparse

#### Prune based on *magnitude* of weights



**Example:** AlexNet

- **Weight Reduction:** CONV layers 2.7x, FC layers 9.9x (Most reduction on fully connected layers)
- **Overall:** 9x weight reduction, 3x MAC reduction

[Han et al., NIPS 2015]
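Magnitude-based pruning can be sketched as a threshold on the absolute value of each weight. The code below shows only that thresholding step and the resulting reduction factor; the threshold value and function names are hypothetical, and the retraining that normally follows pruning is not shown.

```python
def prune_by_magnitude(weights, threshold):
    """Set every weight whose magnitude is below the threshold to zero."""
    return [w if abs(w) >= threshold else 0.0 for w in weights]


def weight_reduction(original, pruned):
    """Ratio of original weight count to remaining nonzero weights."""
    remaining = sum(1 for w in pruned if w != 0)
    if remaining == 0:
        raise ValueError("all weights were pruned")
    return len(original) / remaining
```

For instance, pruning `[0.9, -0.05, 0.02, -0.7, 0.1, 0.01]` with a threshold of 0.1 keeps three of six weights, a 2x reduction.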



# Key Metrics for Embedded DNN

- Accuracy → Measured on Dataset
- Speed → Number of MACs
- Storage Footprint → Number of Weights
- Energy → ?



# Energy-Evaluation Methodology



Hardware Energy Costs of each MAC and Memory Access
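An energy estimate of this kind can be sketched as a weighted count: the number of MACs plus the number of accesses at each memory level, each multiplied by a per-operation cost. The relative costs below are illustrative placeholders normalized to one MAC; they are assumptions for this sketch and are not values taken from the slides or from the estimation tool.

```python
# Assumed relative energy per operation, normalized to one MAC.
RELATIVE_COST = {
    "MAC": 1,
    "RF": 1,        # local register file access
    "PE-to-PE": 2,  # neighbor PE transfer
    "buffer": 6,    # global buffer access
    "DRAM": 200,    # off-chip access
}


def estimate_energy(num_macs, accesses):
    """Estimate energy in units of one MAC.

    accesses maps a memory level name to its access count.
    """
    energy = num_macs * RELATIVE_COST["MAC"]
    for level, count in accesses.items():
        energy += count * RELATIVE_COST[level]
    return energy
```

With costs like these, a small number of DRAM accesses can outweigh a large number of MACs, which is why neither the MAC count nor the weight count alone predicts energy.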

Energy estimation tool available at http://eyeriss.mit.edu

### Energy-Aware Pruning



3.7x energy reduction in AlexNet / 1.6x energy reduction in GoogLeNet

[Yang et al., CVPR 2017]



# Opportunities in Mixed Signal Circuits

# Reduce data movement by embedding computation into memory and sensor





## Mixed-Signal Circuit Processing

- Primarily target dot product
  - Reduced precision (e.g., binary weights)
- Challenges
  - Need ADC and DAC conversion
    - Weights trained in digital domain
  - More sensitive to variations and nonlinearity
- Reduce data movement from memory and sensor



### Binary Weight Classifier in SRAM



### Weak because:

1. Weights restricted to be +/-1
2. Bit-cell discharge subject to variation, nonlinearity

[Zhang et al., VLSI 2016]





### Switched Cap MAC for Classification

Reduce ADC conversions by 21x

Input: 32x32x3 (6b) → Output: 4x4x9 (6b); Weight: 3b




[Lee et al., ISSCC 2016]



### Embedded Feature Extraction in Sensor

#### **Compute the HOG feature in Image Sensor**

- Reduce bandwidth by 96.5% (vs. 8b output)
- Mixed-signal computation of gradient angle



[Choi et al., ISSCC 2013]

RESEARCH LABORATORY OF ELECTRONICS AT MIT


# **Opportunities in Advanced Technologies**

# Reduce data movement by embedding computation into memory and sensor





## **Advanced Memory Technologies**

Many new memories and devices explored to reduce data movement

- **Stacked DRAM** [Gao et al., Tetris, ASPLOS 2017] [Kim et al., NeuroCube, ISCA 2016]
- **eDRAM** [Chen et al., DaDianNao, MICRO 2014]
- **Non-Volatile Resistive Memories**: $I_1 = V_1 \times G_1$, $I_2 = V_2 \times G_2$, $I = I_1 + I_2 = V_1 \times G_1 + V_2 \times G_2$





### **ASP: Angle Sensitive Pixels**

### **Extract gradients directly in the sensor**

- Reduces read bandwidth by 10x
- Reduces ADC conversion by 10x



[Chen et al., CICC 2012]



## Hand-Crafted vs. Learned Features










### Joint Algorithm Hardware Optimizations

**Histogram of Weights** 



### Energy-Efficient Object Detection



MIT Object Detection Chip [VLSI 2016]





### **Eyeriss Deep CNN Accelerator**



### Optimization to Reduce Data Movement

- Energy-efficient dataflow to reduce data movement
- Exploit data statistics for high energy efficiency



[Chen et al., ISCA 2016, ISSCC 2016]



## Eyeriss Chip Spec & Measurement Results

| Specification                 | Value              |
|-------------------------------|--------------------|
| Technology                    | TSMC 65nm LP 1P9M  |
| On-Chip Buffer                | 108 KB             |
| # of PEs                      | 168                |
| Scratch Pad / PE              | 0.5 KB             |
| Core Frequency                | 100 – 250 MHz      |
| Peak Performance              | 33.6 – 84.0 GOPS   |
| Word Bit-width                | 16-bit Fixed-Point |
| Natively Supported CNN Shapes | Filter Width: 1 – 32<br>Filter Height: 1 – 12<br>Num. Filters: 1 – 1024<br>Num. Channels: 1 – 1024<br>Horz. Stride: 1 – 12<br>Vert. Stride: 1, 2, 4 |

Die photo: 4000 µm × 4000 µm, showing the Global Buffer and the Spatial Array (168 PEs)

AlexNet: 2.66 GMACs [8 billion 16-bit inputs (**16GB**) and 2.7 billion outputs (**5.4GB**)] require only **208.5MB** of buffer accesses and **15.4MB** of DRAM accesses






### Features: Energy vs. Accuracy




[Suleiman et al., ISCAS 2017]






- Machine Learning is an important area of research
  - Wide range of applications
  - Various methods to extract features (hand-crafted and learned)
- Challenge is to balance the key metrics
  - Accuracy, Energy, Throughput, Cost, etc.
- Opportunities at various levels of hardware design
  - Architecture, Joint Algorithm-Hardware, Mixed-Signal Circuits, Advanced Technologies
  - Important to consider interactions between levels to maximize impact





### Acknowledgements



Research conducted in the **MIT Energy-Efficient Multimedia Systems Group** would not be possible without the support of the following organizations:







### References

# More info about **Eyeriss** and **Tutorial on DNN Architectures** at

http://eyeriss.mit.edu

V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, "*Efficient Processing of Deep Neural Networks: A Tutorial and Survey*", arXiv, 2017

### More info about research in the Energy-Efficient Multimedia Systems Group @ MIT

http://www.rle.mit.edu/eems

For updates

Follow @eems\_mit

http://mailman.mit.edu/mailman/listinfo/eems-news



