# **Energy-Efficient Deep Learning: Challenges and Opportunities**

#### Vivienne Sze

#### **Massachusetts Institute of Technology**





# Example Applications of Deep Learning

#### **Computer Vision**



#### **Speech Recognition**



#### **Medical**







# What is Deep Learning?



Image Source: [Lee et al., Comm. ACM 2011]







## Weighted Sums



Image Source: Stanford



# Why is Deep Learning Hot Now?





# Deep Convolutional Neural Networks

















**Convolutions** account for more than 90% of overall computation, dominating **runtime** and **energy consumption** 





Input Image (Feature Map)










Element-wise Multiplication









**Sliding Window Processing** 







Many Input Channels (C)











Image batch size: 1 – 256 (N)
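The figure labels above (sliding window, element-wise multiplication, C input channels, a batch of N images) describe a CONV layer as a nested loop. A minimal sketch in plain Python, assuming square R x R filters and no padding; the function name `conv_layer` and the nested-list layout are illustrative choices, not taken from the slides:

```python
def conv_layer(ifmap, filters, stride=1):
    """Direct CONV layer.

    ifmap:   nested list indexed [n][c][y][x]   (N images, C channels)
    filters: nested list indexed [m][c][i][j]   (M filters, R x R each)
    Returns (ofmap indexed [n][m][y][x], number of MACs performed).
    """
    N, C = len(ifmap), len(ifmap[0])
    H, W = len(ifmap[0][0]), len(ifmap[0][0][0])
    M, R = len(filters), len(filters[0][0])
    E_h = (H - R) // stride + 1  # output height, no padding assumed
    E_w = (W - R) // stride + 1  # output width
    ofmap = [[[[0.0] * E_w for _ in range(E_h)] for _ in range(M)] for _ in range(N)]
    macs = 0
    for n in range(N):                      # batch
        for m in range(M):                  # filters / output channels
            for y in range(E_h):            # sliding window (vertical)
                for x in range(E_w):        # sliding window (horizontal)
                    acc = 0.0
                    for c in range(C):      # input channels
                        for i in range(R):  # filter rows
                            for j in range(R):  # filter columns
                                acc += (ifmap[n][c][y * stride + i][x * stride + j]
                                        * filters[m][c][i][j])
                                macs += 1
                    ofmap[n][m][y][x] = acc
    return ofmap, macs

# Tiny example: 1 image, 1 channel, 3x3 input of ones, one 2x2 filter of ones.
ifmap = [[[[1.0] * 3 for _ in range(3)]]]
filters = [[[[1.0] * 2 for _ in range(2)]]]
ofmap, macs = conv_layer(ifmap, filters)
print(ofmap[0][0], macs)
```

Every output pixel costs C x R x R multiply-accumulates, which is why the MAC count grows so quickly with filter size, channel count, and batch size.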




# Large Sizes with Varying Shapes

AlexNet<sup>1</sup> Convolutional Layer Configurations

| Layer | Filter Size (R) | # Filters (M) | # Channels (C) | Stride |
|-------|-----------------|---------------|----------------|--------|
| 1     | 11x11           | 96            | 3              | 4      |
| 2     | 5x5             | 256           | 48             | 1      |
| 3     | 3x3             | 384           | 256            | 1      |
| 4     | 3x3             | 384           | 192            | 1      |
| 5     | 3x3             | 256           | 192            | 1      |

- Layer 1: 34k Params, 105M MACs
- Layer 2: 307k Params, 224M MACs
- Layer 3: 885k Params, 150M MACs
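The per-layer parameter and MAC counts can be reproduced from the table of layer configurations. The sketch below takes R, M, and C from that table; the output feature map widths (55, 27, 13, 13, 13) are not listed on the slide and are an assumption based on the commonly cited AlexNet shapes.

```python
# (filter size R, filters M, channels C, assumed output width E) per CONV layer.
# R, M, C come from the configuration table; E is an assumption (see lead-in).
ALEXNET_CONV = [
    (11, 96, 3, 55),
    (5, 256, 48, 27),
    (3, 384, 256, 13),
    (3, 384, 192, 13),
    (3, 256, 192, 13),
]

def layer_cost(r, m, c, e):
    params = r * r * c * m   # one R x R x C filter per output channel
    macs = params * e * e    # every filter is applied at E x E positions
    return params, macs

costs = [layer_cost(*layer) for layer in ALEXNET_CONV]
for idx, (params, macs) in enumerate(costs, start=1):
    print(f"Layer {idx}: {params / 1e3:.1f}k params, {macs / 1e6:.1f}M MACs")

total_params = sum(p for p, _ in costs)
total_macs = sum(m for _, m in costs)
print(f"Total: {total_params / 1e6:.2f}M params, {total_macs / 1e6:.0f}M MACs")
```

Under these assumptions the five layers add up to roughly 2.3M weights and 666M MACs, which agrees with the AlexNet CONV figures in the summary table.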



# Popular CNNs

- LeNet (1998)
- AlexNet (2012)
- OverFeat (2013)
- VGGNet (2014)
- GoogLeNet (2014)
- ResNet (2015)

#### ImageNet: Large Scale Visual Recognition Challenge (ILSVRC)



[O. Russakovsky et al., IJCV 2015]



# Summary of Popular CNNs

| Metrics            | LeNet-5 | AlexNet  | VGG-16   | GoogLeNet (v1) | ResNet-50 |
|--------------------|---------|----------|----------|----------------|-----------|
| Top-5 error (%)    | n/a     | 16.4     | 7.4      | 6.7            | 5.3       |
| Input Size         | 28x28   | 227x227  | 224x224  | 224x224        | 224x224   |
| # of CONV Layers   | 2       | 5        | 16       | 21 (depth)     | 49        |
| Filter Sizes       | 5       | 3, 5, 11 | 3        | 1, 3, 5, 7     | 1, 3, 7   |
| # of Channels      | 1, 6    | 3 - 256  | 3 - 512  | 3 - 1024       | 3 - 2048  |
| # of Filters       | 6, 16   | 96 - 384 | 64 - 512 | 64 - 384       | 64 - 2048 |
| Stride             | 1       | 1, 4     | 1        | 1, 2           | 1, 2      |
| CONV: # of Weights | 2.6k    | 2.3M     | 14.7M    | 6.0M           | 23.5M     |
| CONV: # of MACs    | 283k    | 666M     | 15.3G    | 1.43G          | 3.86G     |
| # of FC Layers     | 2       | 3        | 3        | 1              | 1         |
| FC: # of Weights   | 58k     | 58.6M    | 124M     | 1M             | 2M        |
| FC: # of MACs      | 58k     | 58.6M    | 124M     | 1M             | 2M        |
| Total Weights      | 60k     | 61M      | 138M     | 7M             | 25.5M     |
| Total MACs         | 341k    | 724M     | 15.5G    | 1.43G          | 3.9G      |

CONV Layers increasingly important!
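One way to read this claim off the table is to compute the share of MACs spent in CONV layers for each network. A short sketch using only the CONV and FC MAC counts from the table (the dictionary name `MACS` is illustrative):

```python
# (CONV MACs, FC MACs) copied from the summary table.
MACS = {
    "LeNet-5":   (283e3, 58e3),
    "AlexNet":   (666e6, 58.6e6),
    "VGG-16":    (15.3e9, 124e6),
    "GoogLeNet": (1.43e9, 1e6),
    "ResNet-50": (3.86e9, 2e6),
}

conv_share = {name: conv / (conv + fc) for name, (conv, fc) in MACS.items()}
for name, share in conv_share.items():
    print(f"{name}: {share:.1%} of MACs are in CONV layers")
```

The share rises from the older to the newer networks, which is the trend the slide points at.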



# Training vs. Inference







# Processing at "Edge" instead of the "Cloud"




# Challenges





# Key Metrics

#### Accuracy

- Evaluate hardware using the appropriate DNN model and dataset

#### Programmability

- Support multiple applications
- Different weights

#### Energy/Power

- Energy per operation
- DRAM Bandwidth

#### Throughput/Latency

- GOPS, frame rate, delay

#### Cost

- Area (size of memory and # of cores)

[Sze et al., CICC 2017]



#### ImageNet









Microsystems Technology Laboratories, Massachusetts Institute of Technology

# Opportunities in Architecture






# GPUs and CPUs Targeting Deep Learning

Intel Knights Landing (2016)

Nvidia PASCAL GP100 (2016)





Knights Mill: next gen Xeon Phi "optimized for deep learning"

Use matrix multiplication libraries on CPUs and GPUs





# Map DNN to a Matrix Multiplication



Goal: Reduced number of operations to increase throughput
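The mapping from convolution to matrix multiplication can be sketched by unrolling each filter window into a row of a matrix, so that the whole layer becomes one matrix product. The helper names `im2col` and `matvec` below are illustrative, and the example uses a single channel and a single filter to stay small:

```python
def im2col(image, r):
    """Unroll every r x r window of a 2-D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    rows = []
    for y in range(h - r + 1):
        for x in range(w - r + 1):
            rows.append([image[y + i][x + j] for i in range(r) for j in range(r)])
    return rows

def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
filt = [[1, 0],
        [0, 1]]

flat_filter = [v for row in filt for v in row]
unrolled = im2col(image, 2)            # 4 windows x 4 elements each
outputs = matvec(unrolled, flat_filter)
print(outputs)
```

The unrolled matrix holds 16 values for a 9-pixel image, so the mapping trades duplicated input data for the ability to call an optimized matrix multiplication library.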



# Reduce Operations in Matrix Multiplication

- Fast Fourier Transform [Mathieu, ICLR 2014]
  - Pro: Direct convolution  $O(N_o^2 N_f^2)$  to  $O(N_o^2 \log_2 N_o)$
  - Con: Increase storage requirements
- Strassen [Cong, ICANN 2014]
  - Pro: $O(N^3)$ to $O(N^{2.807})$
  - Con: Numerical stability
- Winograd [Lavin, CVPR 2016]
  - Pro: 2.25x speed up for 3x3 filter
  - Con: Specialized processing depending on filter size
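To make the Winograd trade-off concrete, here is a sketch of the one-dimensional case usually written F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. Nesting the same transform in two dimensions gives 16 multiplications instead of 36 for a 2x2 output tile of a 3x3 filter, which is where the 2.25x figure comes from. Function names are illustrative.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f23(d, g):
    """Same two outputs with 4 multiplications on the data path."""
    # Filter-side terms depend only on the weights and can be precomputed.
    g_sum = (g[0] + g[1] + g[2]) / 2
    g_alt = (g[0] - g[1] + g[2]) / 2
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_sum
    m3 = (d[2] - d[1]) * g_alt
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 2.0]
print(direct_f23(d, g), winograd_f23(d, g))
```

The transforms are specific to the filter size, which is the "specialized processing" cost listed above.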



# Analogy: Gauss's Multiplication Algorithm

$$(a+bi)(c+di) = (ac-bd) + (bc+ad)i.$$

4 multiplications + 2 additions

$$k_{1} = c \cdot (a + b)$$
  

$$k_{2} = a \cdot (d - c)$$
  

$$k_{3} = b \cdot (c + d)$$
  
Real part =  $k_{1} - k_{3}$   
Imaginary part =  $k_{1} + k_{2}$ .

3 multiplications + 5 additions

**Reduce** number of multiplications, but **increase** number of additions
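The identity can be checked numerically. The sketch below implements both forms with the same symbols a, b, c, d and k1 to k3 used in the equations; the function names are illustrative.

```python
def complex_mul_direct(a, b, c, d):
    """(a + bi)(c + di) using 4 multiplications."""
    return (a * c - b * d, b * c + a * d)

def complex_mul_gauss(a, b, c, d):
    """Same product using 3 multiplications and 5 additions/subtractions."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    return (k1 - k3, k1 + k2)   # (real part, imaginary part)

result = complex_mul_gauss(3, 4, 5, 6)
print(result, complex_mul_direct(3, 4, 5, 6))
```

The trade pays off when a multiplier costs more area or energy than an adder, which is the same reasoning behind the Strassen and Winograd transforms.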



# Specialized Hardware (Accelerators)





# Properties We Can Leverage

- Operations exhibit high parallelism
   → high throughput possible
- Memory Access is the Bottleneck



Worst Case: all memory R/W are **DRAM** accesses

Example: AlexNet [NIPS 2012] has 724M MACs
 → 2896M DRAM accesses required
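The 2896M figure follows from multiplying the MAC count by the number of memory accesses per MAC. The sketch assumes four accesses per MAC (read a filter weight, read an input activation, read a partial sum, write the updated partial sum); that breakdown is an assumption consistent with the 4:1 ratio between the two numbers on the slide.

```python
MACS_ALEXNET = 724e6   # from the slide

# Assumption: 3 reads (weight, activation, partial sum) + 1 write per MAC.
ACCESSES_PER_MAC = 4

worst_case_dram_accesses = MACS_ALEXNET * ACCESSES_PER_MAC
print(f"{worst_case_dram_accesses / 1e6:.0f}M DRAM accesses")
```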



# **Properties We Can Leverage**

- Operations exhibit high parallelism
   → high throughput possible
- Input data reuse opportunities (up to 500x)

→ exploit **low-cost memory** 



Images

# Highly-Parallel Compute Paradigms

#### Temporal Architecture (SIMD/SIMT)



Spatial Architecture (Dataflow Processing)





# Advantages of Spatial Architecture





# How to Map the Dataflow?



Goal: Increase reuse of input data (weights and pixels) and local partial sums accumulation

# Spatial Architecture (Dataflow Processing)






# **Energy-Efficient Dataflow**

Yu-Hsin Chen, Joel Emer, Vivienne Sze, ISCA 2016

#### Maximize data reuse and accumulation at RF





## **Data Movement is Expensive**



#### **Processing Engine**



**Data Movement Energy Cost** 



Maximize data reuse at lower levels of hierarchy

# Weight Stationary (WS)



- Minimize weight read energy consumption
  - maximize convolutional and filter reuse of weights
- Examples:

[Chakradhar, ISCA 2010] [nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [Origami, GLSVLSI 2015]


# Output Stationary (OS)



- Minimize partial sum R/W energy consumption
  - maximize local accumulation
- Examples:

[Gupta, *ICML* 2015] [ShiDianNao, *ISCA* 2015] [Peemen, *ICCD* 2013]



# No Local Reuse (NLR)



- Use a large global buffer as shared storage
  - Reduce **DRAM** access energy consumption
- Examples:

[DianNao, ASPLOS 2014] [DaDianNao, MICRO 2014] [Zhang, FPGA 2015]



### Row Stationary: Energy-efficient Dataflow









































- Maximize row convolutional reuse in RF
  - Keep a filter row and image sliding window in RF
- Maximize row psum accumulation in RF
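The two bullets can be sketched as a one-dimensional convolution primitive that runs inside a single processing engine (PE). This is a simplified software model, not the Eyeriss hardware: the names `pe_row_conv` and `row_stationary_2d` are hypothetical, and combining R such PEs into a 2-D output row follows the description in the ISCA 2016 paper as understood here.

```python
def pe_row_conv(filter_row, ifmap_row):
    """1-D convolution primitive of one PE (simplified model).

    The filter row stays in the register file (RF) for the whole row, the
    input window slides by one pixel per output, and each partial sum is
    accumulated locally before it leaves the PE.
    """
    r = len(filter_row)
    psum_row = []
    rf_window = list(ifmap_row[:r])              # sliding window held in the RF
    for x in range(len(ifmap_row) - r + 1):
        acc = 0
        for w, p in zip(filter_row, rf_window):
            acc += w * p                         # psum accumulation in the RF
        psum_row.append(acc)
        if x + r < len(ifmap_row):
            rf_window = rf_window[1:] + [ifmap_row[x + r]]  # fetch 1 new pixel
    return psum_row

def row_stationary_2d(filt, ifmap):
    """Sum the 1-D psum rows of R PEs to form each 2-D output row."""
    r = len(filt)
    out = []
    for y in range(len(ifmap) - r + 1):
        rows = [pe_row_conv(filt[i], ifmap[y + i]) for i in range(r)]
        out.append([sum(col) for col in zip(*rows)])
    return out

filt = [[1, 0], [0, 1]]
ifmap = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
ofmap = row_stationary_2d(filt, ifmap)
print(ofmap)
```

Each PE reads only one new pixel per output and never reloads its filter row, which is the reuse the dataflow is designed to maximize.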





#### **Row Stationary Dataflow**



### Evaluate Reuse in Different Dataflows

#### Weight Stationary

- Minimize movement of filter weights

#### Output Stationary

- Minimize movement of partial sums

#### No Local Reuse

- Don't use any local PE storage. Maximize global buffer size.

#### Row Stationary



#### **Evaluation Setup**

- Same Total Area
- AlexNet
- 256 PEs
- Batch size = 16



#### **Dataflow Comparison: CONV Layers**



### **Eyeriss Deep CNN Accelerator**



# Eyeriss Chip Spec & Measurement Results

| Specification                    | Value                                                                                                                                             |
|----------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------|
| Technology                       | TSMC 65nm LP 1P9M                                                                                                                                 |
| On-Chip Buffer                   | 108 KB                                                                                                                                            |
| # of PEs                         | 168                                                                                                                                               |
| Scratch Pad / PE                 | 0.5 KB                                                                                                                                            |
| Core Frequency                   | 100 – 250 MHz                                                                                                                                     |
| Peak Performance                 | 33.6 – 84.0 GOPS                                                                                                                                  |
| Word Bit-width                   | 16-bit Fixed-Point                                                                                                                                |
| Natively Supported<br>CNN Shapes | Filter Width: 1 – 32<br>Filter Height: 1 – 12<br>Num. Filters: 1 – 1024<br>Num. Channels: 1 – 1024<br>Horz. Stride: 1–12<br>Vert. Stride: 1, 2, 4 |

Die photo labels: 4000 µm dimension marker, Global Buffer, Spatial Array (168 PEs)

AlexNet: 2.66 GMACs involve 8 billion 16-bit inputs (**16GB**) and 2.7 billion outputs (**5.4GB**), yet Eyeriss requires only **208.5MB** of buffer accesses and **15.4MB** of DRAM accesses
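The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes that one MAC counts as two operations when quoting GOPS; the other numbers come from the specification table and the AlexNet line.

```python
NUM_PES = 168
OPS_PER_MAC = 2   # assumption: one MAC counts as a multiply plus an add

# Peak throughput in GOPS at both ends of the core frequency range.
peak_gops = {mhz: NUM_PES * OPS_PER_MAC * mhz / 1e3 for mhz in (100, 250)}

BYTES_PER_WORD = 2                          # 16-bit fixed point
input_gb = 8e9 * BYTES_PER_WORD / 1e9       # 8 billion inputs
output_gb = 2.7e9 * BYTES_PER_WORD / 1e9    # 2.7 billion outputs
dram_mb = 15.4                              # DRAM traffic quoted on the slide

# Ratio of total operand traffic to the traffic that reaches DRAM.
reduction = (input_gb + output_gb) * 1e3 / dram_mb

print(peak_gops, input_gb, output_gb, round(reduction))
```

Under these assumptions the operand traffic is more than a thousand times larger than the DRAM traffic, which illustrates how much of the data movement is absorbed by on-chip reuse.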





# Comparison with GPU

|                         | Eyeriss                       | NVIDIA TK1 (Jetson Kit)               |
|-------------------------|-------------------------------|---------------------------------------|
| Technology              | 65nm                          | 28nm                                  |
| Clock Rate              | 200MHz                        | 852MHz                                |
| # Multipliers           | 168                           | 192                                   |
| On-Chip Storage         | Buffer: 108KB<br>Spad: 75.3KB | Shared Mem: 64KB<br>Reg File: 256KB   |
| Word Bit-Width          | 16b Fixed                     | 32b Float                             |
| Throughput <sup>1</sup> | 34.7 fps                      | 68 fps                                |
| Measured Power          | 278 mW                        | Idle/Active <sup>2</sup> : 3.7W/10.2W |
| DRAM Bandwidth          | 127 MB/s                      | 1120 MB/s <sup>3</sup>                |

1. AlexNet Convolutional Layers Only
2. Board Power
3. Modeled from [Tan, SC11]

#### http://eyeriss.mit.edu



# Machine Learning Pipeline (Inference)






### Energy-Efficient Object Detection





#### Features: Energy vs. Accuracy




[Suleiman et al., ISCAS 2017]



# Opportunities in Joint Algorithm Hardware Design




#### Approaches

#### <u>Reduce size</u> of operands for storage/compute

- Floating point  $\rightarrow$  Fixed point
- Bit-width reduction
- Non-linear quantization
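As an illustration of the first two items, the sketch below converts a floating-point value to a signed fixed-point code with saturation. The function name `quantize_fixed` and the default bit allocation (8 bits total, 4 fractional) are assumptions for the example, not values from the slides.

```python
def quantize_fixed(x, total_bits=8, frac_bits=4):
    """Round x to a signed fixed-point value with `frac_bits` fractional bits."""
    scale = 1 << frac_bits
    q_min = -(1 << (total_bits - 1))
    q_max = (1 << (total_bits - 1)) - 1
    q = max(q_min, min(q_max, round(x * scale)))   # saturate on overflow
    return q, q / scale                            # integer code, real value

code, value = quantize_fixed(1.37)
print(code, value)
print(quantize_fixed(100.0))   # saturates at the largest representable code
```

Reducing `total_bits` shrinks both the storage per operand and the multiplier width, at the cost of a larger rounding error.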

#### <u>Reduce number</u> of operations for storage/compute

- Exploit Activation Statistics (Compression)
- Network Pruning
- Compact Network Architectures



### Commercial Products using 8-bit Integer





Nvidia's Pascal (2016)

#### Google's TPU (2016)





# Reduced Precision in Research

#### Reduce number of bits

- Binary Nets [Courbariaux, NIPS 2015]

#### Reduce number of unique weights

- Ternary Weight Nets [Li, arXiv 2016]
- XNOR-Net [Rastegari, ECCV 2016]

#### Non-Linear Quantization

- LogNet [Lee, ICASSP 2017]
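A rough sketch of log-domain quantization: each value is replaced by the nearest power of two, so a multiplication by it can in principle be replaced by a bit shift. This is a generic illustration, not the exact scheme of the cited paper; `quantize_log2` and its 4-bit exponent range are assumptions.

```python
import math

def quantize_log2(x, bits=4):
    """Quantize |x| to the nearest power of two; the sign is kept separately."""
    if x == 0:
        return 0.0
    exp = round(math.log2(abs(x)))
    exp_min, exp_max = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    exp = max(exp_min, min(exp_max, exp))   # clamp exponent to the code range
    return math.copysign(2.0 ** exp, x)

quantized = [quantize_log2(v) for v in (0.3, -0.12, 1.7, 0.0)]
print(quantized)
```

The quantization levels are dense near zero and sparse for large magnitudes, which suits weight distributions that are concentrated around zero.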



#### Log Domain Quantization

**Binary Filters** 




# Reduced Precision Hardware

Stripes

[Judd et al., MICRO 2016]

**Bit-serial processing for speed** 





#### **KU** Leuven

[Moons et al., VLSI 2016]

#### Voltage scaling for energy savings



# Binary/Ternary Net Hardware

- Examples
  - YodaNN (binary weights)
  - BRein (binary weights and activations)
  - TrueNorth (ternary weights and binary activations)




These designs tend not to support state-of-the-art DNN models (except YodaNN)

#### Sparsity in Feature Maps

Many zeros in output fmaps after ReLU




# **Exploit Sparsity**

Method 1: Skip memory access and computation



Method 2: Compress data to reduce storage and data movement
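Both methods can be sketched in a few lines. The run-length format below, a list of (zeros before value, value) pairs, is an illustrative convention and not the encoding used by any particular chip; all function names are hypothetical.

```python
def relu(values):
    return [max(0, v) for v in values]

def sparse_dot(activations, weights):
    """Method 1: skip the weight read and the MAC when the activation is zero."""
    acc, macs = 0, 0
    for a, w in zip(activations, weights):
        if a == 0:
            continue
        acc += a * w
        macs += 1
    return acc, macs

def run_length_encode(values):
    """Method 2: store (zeros before value, value) pairs instead of raw data."""
    pairs, zeros = [], 0
    for v in values:
        if v == 0:
            zeros += 1
        else:
            pairs.append((zeros, v))
            zeros = 0
    if zeros:
        pairs.append((zeros - 1, 0))   # trailing run ends in a literal zero
    return pairs

def run_length_decode(pairs):
    out = []
    for zeros, v in pairs:
        out.extend([0] * zeros + [v])
    return out

activations = relu([-2, -1, 5, -3, 3, -4, -6])
acc, macs = sparse_dot(activations, [1, 2, 3, 4, 5, 6, 7])
encoded = run_length_encode(activations)
print(activations, acc, macs, encoded)
```

In this example only 2 of 7 MACs are executed and the 7 activations shrink to 3 pairs; the gain depends entirely on how sparse the feature map is.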



### Pruning – Make Weights Sparse

#### **Optimal Brain Damage**

[LeCun et al., NIPS 1989]

#### Prune DNN based on *magnitude* of weights [Han et al., NIPS 2015]





RESEARCH LABORATORY OF ELECTRONICS AT MIT


### Key Observations

- Number of weights *alone* is not a good metric for energy
- All data types should be considered





# Energy-Evaluation Methodology



Hardware Energy Costs of each MAC and Memory Access



Energy estimation tool available at http://eyeriss.mit.edu

# Energy Consumption of Existing DNNs



Deeper CNNs with fewer weights do not necessarily consume less energy than shallower CNNs with more weights

[Yang et al., CVPR 2017]



# Magnitude-based Weight Pruning



Reduce number of weights by **removing small magnitude weights** 
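A minimal sketch of the pruning step on a flat list of weights, assuming a target sparsity is given. The function name and the example values are hypothetical, and the fine-tuning that typically follows pruning in practice is not shown.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with the smallest |w|."""
    n_prune = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]
pruned = prune_by_magnitude(weights, sparsity=0.5)
print(pruned)
```

The criterion looks only at weight values, so it does not account for how much energy each removed weight would have cost.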






### Energy-Aware Pruning




[Yang et al., CVPR 2017]



### NetAdapt: Platform-Aware DNN Adaptation

- Automatically adapt DNN to a mobile platform to reach a target latency or energy budget
- Use **empirical measurements** to guide optimization (avoid modeling of tool chain or platform architecture)



In collaboration with Google's Mobile Vision Team



# Improved Latency vs. Accuracy Tradeoff

 NetAdapt boosts the real inference speed of MobileNet by up to 1.7x with higher accuracy



References:

- **MobileNet:** Howard et al., "Mobilenets: Efficient convolutional neural networks for mobile vision applications", arXiv 2017
- **MorphNet:** Gordon et al., "Morphnet: Fast & simple resource-constrained structure learning of deep networks", CVPR 2018







## Network Architecture Design

### Build Network with series of Small Filters

### **GoogLeNet/Inception v3**



### Apply sequentially



**VGG-16** 



### Apply sequentially
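The figures for this slide did not survive extraction. Assuming the VGG-16 case refers to replacing one 5x5 filter with two 3x3 filters applied in sequence, the sketch below shows that the two-stage version covers the same 5x5 receptive field with fewer weights. The helper `correlate2d`, the image, and the filter values are illustrative.

```python
def correlate2d(image, kernel):
    """Valid 2-D sliding-window correlation on nested lists."""
    r_h, r_w = len(kernel), len(kernel[0])
    return [[sum(image[y + i][x + j] * kernel[i][j]
                 for i in range(r_h) for j in range(r_w))
             for x in range(len(image[0]) - r_w + 1)]
            for y in range(len(image) - r_h + 1)]

image = [[(3 * y + 5 * x) % 7 for x in range(5)] for y in range(5)]
k3 = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # arbitrary 3x3 filter

# 5x5 input -> 3x3 intermediate -> 1x1 output: one output sees all 25 pixels.
two_stage = correlate2d(correlate2d(image, k3), k3)

weights_two_3x3 = 2 * 3 * 3   # 18 weights
weights_one_5x5 = 5 * 5       # 25 weights
print(len(two_stage), len(two_stage[0]), weights_two_3x3, weights_one_5x5)
```

With a nonlinearity between the two stages the result is not identical to a single 5x5 filter; only the receptive field and the weight savings carry over.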







## 1x1 Bottleneck in Popular DNN models
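To see why a 1x1 bottleneck saves weights, compare a direct 3x3 layer with a reduce, convolve, restore sequence. The channel counts below (256 channels, bottleneck width 64) are hypothetical values chosen for the example and are not taken from the slide.

```python
def conv_weights(r, c_in, c_out):
    """Number of weights in a CONV layer with r x r filters."""
    return r * r * c_in * c_out

C = 256   # hypothetical channel count
B = 64    # hypothetical bottleneck width

direct = conv_weights(3, C, C)
bottleneck = (conv_weights(1, C, B)      # 1x1 reduces the channel count
              + conv_weights(3, B, B)    # 3x3 on the reduced tensor
              + conv_weights(1, B, C))   # 1x1 restores the channel count

print(direct, bottleneck, round(direct / bottleneck, 2))
```

The same ratio applies to MACs when the spatial size is unchanged, since every weight is applied at every output position.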






## Tutorial Material on Efficient DNNs

# Proceedings of the IEEE

### Efficient Processing of Deep Neural Networks: A Tutorial and Survey



### Tutorial on Hardware Architectures for Deep Neural Networks

MICRO-49 (Full Day: October 16, 2016)

Joel Emer (MIT, NVIDIA), Vivienne Sze (MIT), Yu-Hsin Chen (MIT)

Email: eyeriss at mit dot edu

#### Updates

▶ Follow @eems\_mit or subscribe to our mailing list for updates on the Tutorial (e.g. notification of when slides will be posted)

### Overview

Deep neural networks (DNNs) are currently widely used for many AI applications including computer vision, speech recognition, robotics, etc. While DNNs deliver state-of-the-art accuracy on many AI tasks, it comes at the cost of high computational complexity. Accordingly, designing efficient hardware architectures for deep neural networks is an important step towards enabling the wide deployment of DNNs in AI systems.

### http://eyeriss.mit.edu/tutorial.html

V. Sze, Y.-H. Chen, T-J. Yang, J. Emer, "*Efficient Processing of Deep Neural Networks: A Tutorial and Survey,*" Proceedings of the IEEE, 2017



## **Need More Comprehensive Benchmarks**

Processors should support a **diverse set of DNNs** that utilize different techniques

## **Example:**

- Sparse and Dense
- Large and Compact network architectures
- Different Layers (e.g., CONV and FC)
- Variable Bit-width

Figure panels: **Network Pruning**, **Compact Network Architecture**, **Reduce Precision** (32-bit float, 8-bit fixed, binary)

[Chen et al., SysML 2018]

## Eyexam: Understanding Sources of Inefficiencies in DNN Accelerators

A systematic way to evaluate how each architectural decision affects performance (throughput) for a given DNN workload

**Tightens the roofline model** 



[Chen et al., In Submission]



# **Opportunities in Memories and Devices**



## **Advanced Memory Technologies**

Many new memories and devices explored to reduce data movement

- **eDRAM:** [Chen et al., DaDianNao, MICRO 2014]
- **Stacked DRAM:** [Gao et al., Tetris, ASPLOS 2017], [Kim et al., NeuroCube, ISCA 2016]
- **Non-volatile resistive memories:** [Shafiee et al., ISCA 2016], [Chi et al., PRIME, ISCA 2016]

In a resistive memory array, input voltages $V_i$ applied across conductances $G_i$ produce currents that sum on a shared line:

$$I_1 = V_1 \times G_1, \qquad I_2 = V_2 \times G_2$$

$$I = I_1 + I_2 = V_1 \times G_1 + V_2 \times G_2$$
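The relation I = V1 x G1 + V2 x G2 from the resistive-memory figure is a dot product computed by the physics of the array: activations are encoded as voltages, weights as conductances, and the summed current is the result. A tiny numerical sketch with hypothetical values and an illustrative function name:

```python
def crossbar_column_current(voltages, conductances):
    """Current on one shared line: I = sum over i of V_i * G_i."""
    return sum(v * g for v, g in zip(voltages, conductances))

V = [0.2, 0.5]   # input activations as voltages (hypothetical values)
G = [3.0, 4.0]   # weights as conductances (hypothetical values)

I = crossbar_column_current(V, G)   # I1 + I2 = V1*G1 + V2*G2
print(I)
```

The multiply and accumulate happen where the weights are stored, so no weight has to be moved to a separate compute unit.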



## Binary Weight Classifier in SRAM



Weak because:

1. Weights restricted to be +/-1
2. Bit-cell discharge subject to variation, nonlinearity

[Zhang et al., VLSI 2016]





## More Compute In Memory




# Benchmarking Metrics for DNN Hardware

How can we compare designs?

V. Sze, Y.-H. Chen, T-J. Yang, J. Emer, "*Efficient Processing of Deep Neural Networks: A Tutorial and Survey*," Proceedings of the IEEE, Dec. 2017





## Metrics for DNN Hardware

### Accuracy

- Quality of result for a given task

### Throughput

- Analytics on high volume data
- Real-time performance (e.g., video at 30 fps)

### Latency

- For interactive applications (e.g., autonomous navigation)

### Energy and Power

- Edge and embedded devices have limited battery capacity
- Data centers have stringent power ceilings due to cooling costs

### Hardware Cost

- \$\$\$



## Specifications to Evaluate Metrics

### Accuracy

- Difficulty of dataset and/or task should be considered

### Throughput

- Number of cores (include utilization along with peak performance)
- Runtime for running specific DNN models

### Latency

- Include batch size used in evaluation

### Energy and Power

- Power consumption for running specific DNN models
- Include external memory access

### Hardware Cost

- On-chip storage, number of cores, chip area + process technology



## Example: Metrics of Eyeriss Chip

| ASIC Specs                                | Input                |
|-------------------------------------------|----------------------|
| Process Technology                        | 65nm LP TSMC (1.0V)  |
| Total Core Area (mm<sup>2</sup>)          | 12.25                |
| Total On-Chip Memory (kB)                 | 192                  |
| Number of Multipliers                     | 168                  |
| Clock Frequency (MHz)                     | 200                  |
| Core area (mm<sup>2</sup>) / multiplier   | 0.073                |
| On-Chip memory (kB) / multiplier          | 1.14                 |
| Measured or Simulated                     | Measured             |

| Metric                                 | Units  | Input    |
|----------------------------------------|--------|----------|
| Name of CNN Model                      | Text   | AlexNet  |
| Top-5 error classification on ImageNet | #      | 19.8     |
| Supported Layers                       |        | All CONV |
| Bits per weight                        | #      | 16       |
| Bits per input activation              | #      | 16       |
| Batch Size                             | #      | 4        |
| Runtime                                | ms     | 115.3    |
| Power                                  | mW     | 278      |
| Off-chip Access per Image Inference    | MBytes | 3.85     |
| Number of Images Tested                | #      | 100      |
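Two derived metrics follow directly from the reported batch size, runtime, and power. The sketch assumes the 115.3 ms runtime covers one batch of 4 images and that the 278 mW figure is the average power over that run.

```python
batch_size = 4
runtime_ms = 115.3   # assumed to be the runtime of one batch
power_mw = 278       # assumed to be the average power during the run

fps = batch_size / (runtime_ms / 1e3)
energy_per_image_mj = power_mw * (runtime_ms / 1e3) / batch_size   # mW * s = mJ

print(f"{fps:.1f} frames/s, {energy_per_image_mj:.2f} mJ per image")
```

The frame rate obtained this way agrees with the 34.7 fps listed for Eyeriss in the GPU comparison table, which is a useful consistency check when a report gives runtime and batch size separately.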



## Comprehensive Coverage

- All metrics should be reported for fair evaluation of design tradeoffs
- Examples of what can happen if certain metric is omitted:
  - Without the accuracy given for a specific dataset and task, one could run a simple DNN and claim low power, high throughput, and low cost – however, the processor might not be usable for a meaningful task
  - Without reporting the off-chip bandwidth, one could build a processor with only multipliers and claim low cost, high throughput, high accuracy, and low chip power – however, when evaluating system power, the offchip memory access would be substantial
- Are results measured or simulated? On what test data?





## Evaluation Process

The evaluation process for whether a DNN system is a viable solution for a given application might go as follows:

1. **Accuracy** determines if it can perform the given task
2. **Latency and throughput** determine if it can run fast enough and in real-time
3. **Energy and power consumption** will primarily dictate the form factor of the device where the processing can operate
4. **Cost**, which is primarily dictated by the chip area, determines how much one would pay for this solution



## Summary

- Deep Learning is an important area of research
  - Wide range of applications
- Challenge is to balance the key metrics
  - Accuracy, Energy, Throughput, Cost, etc.
- Opportunities at various levels of hardware design
  - Architecture, Joint Algorithm-Hardware, Mixed-Signal Circuits/Memories, Advanced Technologies
  - Important to consider interactions between levels to maximize impact

### For updates on Eyeriss v2, Eyexam, NetAdapt, etc.

Follow @eems\_mit

or join EEMS news mailing list









### **Overview Paper**

V. Sze, Y.-H. Chen, T-J. Yang, J. Emer, *"Efficient Processing of Deep Neural Networks: A Tutorial and Survey,"* **Proceedings of the IEEE**, December 2017

More info about **Eyeriss** and **Tutorial on DNN Architectures** <u>http://eyeriss.mit.edu</u>

MIT Professional Education Course on **"Designing Efficient Deep Learning Systems"** July 23 – 24, 2018 on MIT Campus <u>http://professional-education.mit.edu/deeplearning</u>

For updates, follow @eems\_mit

http://mailman.mit.edu/mailman/listinfo/eems-news



## Acknowledgements



Research conducted in the **MIT Energy-Efficient Multimedia Systems Group** would not be possible without the support of the following organizations:


