# Efficient Image Processing with Deep Neural Networks

#### Vivienne Sze, Tien-Ju Yang

**Massachusetts Institute of Technology** 





#### **Contributors**









#### **Vivienne Sze**

Professor, MIT

#### **Tien-Ju Yang**

PhD Candidate, MIT

#### **Joel Emer**

Senior Distinguished Research Scientist, NVIDIA

Professor, MIT

#### **Yu-Hsin Chen**

PhD Graduate, MIT





## Outline of Tutorial

- Brief overview of Deep Neural Networks (DNN)
- **Part 1: Hardware Platforms for DNNs** (e.g., CPU, GPU, FPGA, ASIC) and metrics for evaluating the efficiency of DNNs
- Part 2: Co-design algorithms and hardware for efficient DNNs (e.g., precision, sparsity, network architecture design, network architecture search, designing networks with hardware in the loop)
- Part 3: Application of efficient DNNs on a wide range of image processing and computer vision tasks (e.g., image classification, depth estimation, image segmentation, super-resolution)



## Additional Resources

#### **Overview Paper**

V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, Dec. 2017

Book Coming Soon!

More info about **Tutorial on DNN Architectures** <u>http://eyeriss.mit.edu/tutorial.html</u>






MIT Professional Education Course on "Designing Efficient Deep Learning Systems" <u>http://professional-education.mit.edu/deeplearning</u>

For updates, follow @eems\_mit

http://mailman.mit.edu/mailman/listinfo/eems-news





## Example Applications of Deep Learning

#### **Computer Vision**



#### **Speech Recognition**



#### **Medical**









## Compute Demands for Deep Learning

**Common carbon footprint benchmarks** (in lbs of CO2 equivalent)

- Roundtrip flight b/w NY and SF (1 passenger)
- Human life (avg. 1 year)
- American life (avg. 1 year)
- US car including fuel (avg. 1 lifetime)
- Transformer (213M parameters) w/ neural architecture search: 626,155

Chart: MIT Technology Review • Source: Strubell et al.





## Processing at "Edge" instead of the "Cloud"




## Deep Learning for Self-Driving Cars

Jack Stewart, "Self-Driving Cars Use Crazy Amounts of Power, and It's Becoming a Problem," Transportation, 02.06.18



Shelley, a self-driving Audi TT developed by Stanford University, uses the brains in the trunk to speed around a racetrack autonomously.



Cameras and radar generate ~6 gigabytes of data every 30 seconds.

Prototypes use around 2,500 Watts. This generates wasted heat, and some prototypes need water-cooling!



Photo: Nikki Kahn/The Washington Post/Getty Images

## Existing Processors Consume Too Much Power



< 1 Watt

> 10 Watts





## Overview of Deep Neural Networks


















#### Optional layers in between CONV and/or FC layers









**Convolutions** account for more than 90% of overall computation, dominating **runtime** and **energy consumption** 





#### Convolution (CONV) Layer

- A plane of input activations, a.k.a. **input feature map (fmap)**, and a filter (weights)
- Element-wise multiplication
- **Sliding window processing**
- Many input channels (C)


#### CNN Decoder Ring

- N: Number of input fmaps/output fmaps (batch size)
- C: Number of 2-D input fmaps/filters (channels)
- H: Height of input fmap (activations)
- W: Width of input fmap (activations)
- R: Height of 2-D filter (weights)
- S: Width of 2-D filter (weights)
- M: Number of 2-D output fmaps (channels)
- E: Height of output fmap (activations)
- F: Width of output fmap (activations)
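The shape parameters above map directly onto the loop bounds of a CONV layer. The following is a minimal sketch, assuming no padding, no bias, and a hypothetical `stride` argument; the function name `conv_layer` and the nested-list layout are illustrative choices, not something defined in the tutorial.

```python
def conv_layer(ifmaps, filters, stride=1):
    """Naive CONV layer without padding or bias (illustrative sketch).

    ifmaps:  nested lists indexed [N][C][H][W]
    filters: nested lists indexed [M][C][R][S]
    returns: nested lists indexed [N][M][E][F]
    """
    N, C = len(ifmaps), len(ifmaps[0])
    H, W = len(ifmaps[0][0]), len(ifmaps[0][0][0])
    M = len(filters)
    R, S = len(filters[0][0]), len(filters[0][0][0])
    E = (H - R) // stride + 1  # output fmap height
    F = (W - S) // stride + 1  # output fmap width
    ofmaps = [[[[0] * F for _ in range(E)] for _ in range(M)] for _ in range(N)]
    for n in range(N):                  # batch
        for m in range(M):              # output channels
            for e in range(E):          # output rows
                for f in range(F):      # output columns
                    psum = 0            # partial sum accumulated over C, R, S
                    for c in range(C):
                        for r in range(R):
                            for s in range(S):
                                psum += (ifmaps[n][c][e * stride + r][f * stride + s]
                                         * filters[m][c][r][s])
                    ofmaps[n][m][e][f] = psum
    return ofmaps


ifmaps = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]  # N=1, C=1, H=W=3
filters = [[[[1, 1], [1, 1]]]]                   # M=1, C=1, R=S=2
print(conv_layer(ifmaps, filters))               # [[[[12, 16], [24, 28]]]]
```

Each innermost iteration is one multiply-accumulate (MAC), so the MAC count of a layer is N·M·E·F·C·R·S.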





#### Traditional Activation Functions

Sigmoid: $y = 1/(1+e^{-x})$

Hyperbolic Tangent: $y = (e^{x}-e^{-x})/(e^{x}+e^{-x})$

#### Modern Activation Functions





Image Source: Caffe Tutorial



#### FC Layer – from CONV Layer POV

#### Fully-Connected (FC) Layer

- Height and width of output fmaps are 1 (E = F = 1)
- Filters as large as input fmaps (R = H, S = W)
- Implementation: Matrix Multiplication





## Pooling (POOL) Layer

- Reduce resolution of each channel independently
- Overlapping or non-overlapping, depending on stride



Increases translation-invariance and noise-resilience
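As an illustration of how the stride decides whether pooling windows overlap, here is a minimal max-pooling sketch for a single channel. The function name `max_pool`, its parameters, and the input values are assumptions made for this example.

```python
def max_pool(fmap, size=2, stride=2):
    """Max pooling over one 2-D channel (nested lists), no padding."""
    H, W = len(fmap), len(fmap[0])
    E = (H - size) // stride + 1
    F = (W - size) // stride + 1
    return [[max(fmap[e * stride + r][f * stride + s]
                 for r in range(size) for s in range(size))
             for f in range(F)]
            for e in range(E)]


fmap = [[9, 3, 5, 3],
        [10, 32, 2, 2],
        [1, 3, 21, 9],
        [2, 6, 11, 7]]
print(max_pool(fmap, size=2, stride=2))  # non-overlapping: [[32, 5], [6, 21]]
print(max_pool(fmap, size=2, stride=1))  # overlapping windows, 3x3 output
```

With `stride == size` the windows tile the channel without overlap; a smaller stride makes neighbouring windows share inputs.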





#### Normalization (NORM) Layer

- Batch Normalization (BN)
  - Normalize activations towards mean=0 and std. dev.=1 based on the statistics of the training dataset
  - Placed between the CONV/FC layer and the activation function



Believed to be key to getting high accuracy and faster training on very deep neural networks.





#### BN Layer Implementation

- The normalized value is further scaled and shifted; the scale and shift parameters are learned during training
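In the commonly used formulation, the normalize-then-scale-and-shift step can be written as below. The symbols are assumptions following standard BN notation rather than anything defined on the slide: $\mu$ and $\sigma^2$ are the mean and variance statistics, $\epsilon$ is a small constant that avoids division by zero, and $\gamma$, $\beta$ are the learned scale and shift.

$$\hat{x} = \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta$$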





## Relevant Components for this Tutorial

- Typical operations that we will discuss:
  - Convolution (CONV)
  - Fully-Connected (FC)
  - Max Pooling
  - ReLU



## Popular DNN Models





#### Popular DNNs

- LeNet (1998)
- AlexNet (2012)
- OverFeat (2013)
- VGGNet (2014)
- GoogLeNet (2014)
- ResNet (2015)

#### ImageNet: Large Scale Visual Recognition Challenge (ILSVRC)



[O. Russakovsky et al., IJCV 2015]



## ImageNet

#### **Image Classification**

- ~256x256 pixels (color)
- 1000 classes
- 1.3M training images
- 100,000 testing images (50,000 validation)

For the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), accuracy of the classification task is reported based on top-1 and top-5 error.

Image Source: http://karpathy.github.io/



http://www.image-net.org/challenges/LSVRC/





#### AlexNet

- CONV Layers: 5
- Fully Connected Layers: 3
- Weights: 61M
- MACs: 724M
- ReLU used for non-linearity
- ILSVRC12 Winner
- Uses Local Response Normalization (LRN)



## **Large Sizes with Varying Shapes**

**AlexNet Convolutional Layer Configurations** 

| Layer | Filter Size (RxS) | # Filters (M) | # Channels (C) | Stride |
|-------|-------------------|---------------|----------------|--------|
| 1     | 11x11             | 96            | 3              | 4      |
| 2     | 5x5               | 256           | 48             | 1      |
| 3     | 3x3               | 384           | 256            | 1      |
| 4     | 3x3               | 384           | 192            | 1      |
| 5     | 3x3               | 256           | 192            | 1      |

- Layer 1: 34k Params, 105M MACs
- Layer 2: 307k Params, 224M MACs
- Layer 3: 885k Params, 150M MACs

[Krizhevsky et al., NeurIPS 2012]
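The per-layer parameter and MAC counts quoted above can be reproduced from the layer configuration table. The sketch below assumes output fmap sizes of 55x55, 27x27, and 13x13 for layers 1 to 3; those sizes are not listed in the table and are an assumption here, as is the helper name `conv_cost`. Bias terms are ignored.

```python
def conv_cost(R, S, C, M, E, F):
    """Return (weights, MACs) for one CONV layer, ignoring bias terms."""
    weights = R * S * C * M       # one R x S x C filter per output channel
    macs = weights * E * F        # every weight is used once per output position
    return weights, macs


# (R, S, C, M) come from the configuration table; (E, F) are assumed output sizes.
alexnet_layers = {
    1: (11, 11, 3, 96, 55, 55),
    2: (5, 5, 48, 256, 27, 27),
    3: (3, 3, 256, 384, 13, 13),
}

for layer, shape in alexnet_layers.items():
    weights, macs = conv_cost(*shape)
    print(f"Layer {layer}: {weights / 1e3:.0f}k params, {macs / 1e6:.0f}M MACs")
```

The exact products are 34,848 / 307,200 / 884,736 weights, which correspond to the rounded 34k / 307k / 885k figures on the slide.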





### VGG-16

- CONV Layers: 13
- Fully Connected Layers: 3
- Weights: 138M
- MACs: 15.5G
- Also, 19 layer version

Reduce # of weights: stack two 3x3 CONV layers for a 5x5 receptive field [figure credit A. Karpathy]

Image Source: <u>http://www.cs.toronto.edu/~frossard/post/vgg16/</u>

[Simonyan et al., arXiv 2014, ICLR 2015]



### GoogLeNet/Inception (v1)

- CONV Layers: 21 (depth), 57 (total)
- Fully Connected Layers: 1
- Weights: 7.0M
- MACs: 1.43G
- Also, v2, v3 and v4
- ILSVRC14 Winner



[Szegedy et al., arXiv 2014, CVPR 2015]














ImageNet Classification top-5 error (%)

Image Source: <u>http://icml.cc/2016/tutorials/icml2016\_tutorial\_deep\_residual\_networks\_kaiminghe.pdf</u>





### ResNet-50

- CONV Layers: 49
- Fully Connected Layers: 1
- Weights: 25.5M
- MACs: 3.9G
- Also, 34, **152** and 1202 layer versions
- ILSVRC15 Winner



[He et al., arXiv 2015, CVPR 2016]





ResNet-34

# Summary of Popular CNNs

| Metrics           | LeNet-5 | AlexNet  | VGG-16   | GoogLeNet (v1) | ResNet-50 |
|-------------------|---------|----------|----------|----------------|-----------|
| Top-5 error (%)   | n/a     | 16.4     | 7.4      | 6.7            | 5.3       |
| Input Size        | 28x28   | 227x227  | 224x224  | 224x224        | 224x224   |
| # of CONV Layers  | 2       | 5        | 13       | 21 (depth)     | 49        |
| Filter Sizes      | 5       | 3, 5, 11 | 3        | 1, 3, 5, 7     | 1, 3, 7   |
| # of Channels     | 1, 6    | 3 - 256  | 3 - 512  | 3 - 1024       | 3 - 2048  |
| # of Filters      | 6, 16   | 96 - 384 | 64 - 512 | 64 - 384       | 64 - 2048 |
| Stride            | 1       | 1, 4     | 1        | 1, 2           | 1, 2      |
| # of CONV Weights | 2.6k    | 2.3M     | 14.7M    | 6.0M           | 23.5M     |
| # of CONV MACs    | 283k    | 666M     | 15.3G    | 1.43G          | 3.86G     |
| # of FC Layers    | 2       | 3        | 3        | 1              | 1         |
| # of FC Weights   | 58k     | 58.6M    | 124M     | 1M             | 2M        |
| # of FC MACs      | 58k     | 58.6M    | 124M     | 1M             | 2M        |
| Total Weights     | 60k     | 61M      | 138M     | 7M             | 25.5M     |
| Total MACs        | 341k    | 724M     | 15.5G    | 1.43G          | 3.9G      |

CONV Layers increasingly important!



# Summary of Popular CNNs

### • AlexNet

- First CNN Winner of ILSVRC
- Uses LRN (deprecated after this)

### • VGG-16

- Goes Deeper (16+ layers)
- Uses only 3x3 filters (stack for larger filters)

### • GoogLeNet (v1)

- Reduces weights with Inception and only one FC layer
- Inception: 1x1 and DAG (parallel connections)
- Batch Normalization

### • ResNet

- Goes Deeper (24+ layers)
- Shortcut connections



### Beyond ResNet

- Wide ResNet [Zagoruyko et al., BMVC 2016]
- ResNeXt [Xie et al., CVPR 2017]



# Part 1: Hardware Platforms for DNN Processing





# GPUs and CPUs Targeting Deep Learning

- Intel Xeon Scalable CPU (2019)
- Nvidia's V100 GPU (2018)



### Use matrix multiplication libraries on CPUs and GPUs





# Matrix Multiplication Libraries

- Implementation: Matrix Multiplication (GEMM)
  - CPU: OpenBLAS, Intel MKL, etc.
  - GPU: cuBLAS, cuDNN, etc.
- The library notes the shape of the matrix multiply and selects an implementation optimized for that shape.
- Optimization usually involves proper tiling to match the storage hierarchy



# Map DNN to a Matrix Multiplication



Goal: Reduce the number of operations to increase throughput

# Analogy: Gauss's Multiplication Algorithm

$$(a+bi)(c+di) = (ac-bd) + (bc+ad)i$$

4 multiplications + 2 additions

$$k_{1} = c \cdot (a + b)$$

$$k_{2} = a \cdot (d - c)$$

$$k_{3} = b \cdot (c + d)$$

Real part = $k_{1} - k_{3}$

Imaginary part = $k_{1} + k_{2}$

3 multiplications + 5 additions

**Reduce** number of multiplications, but **increase** number of additions
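The three-multiplication scheme can be checked numerically with a short sketch. The function name `gauss_complex_multiply` is illustrative; the variables `k1`, `k2`, `k3` follow the definitions above.

```python
def gauss_complex_multiply(a, b, c, d):
    """Compute (a + bi)(c + di) with 3 real multiplications."""
    k1 = c * (a + b)
    k2 = a * (d - c)
    k3 = b * (c + d)
    real = k1 - k3        # equals ac - bd
    imag = k1 + k2        # equals bc + ad
    return real, imag


print(gauss_complex_multiply(1, 2, 3, 4))  # (-5, 10), i.e. (1+2i)(3+4i) = -5+10i
```

The trade is worthwhile whenever a multiplication costs noticeably more than an addition, which is the same motivation behind the transforms on the next slide.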




# Reduce Operations in Matrix Multiplication

- Fast Fourier Transform [Mathieu, ICLR 2014]
  - Pro: Direct convolution $O(N_o^2 N_f^2)$ to $O(N_o^2 \log_2 N_o)$
  - Con: Increased storage requirements
- Strassen [Cong, ICANN 2014]
  - Pro: $O(N^3)$ to $O(N^{2.807})$
  - Con: Numerical stability
- Winograd [Lavin, CVPR 2016]
  - Pro: 2.25x speed up for 3x3 filter
  - Con: Specialized processing depending on filter size
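To make the Winograd entry concrete, here is a sketch of the 1-D minimal filtering case F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. Nesting it in two dimensions gives F(2x2, 3x3) with 16 multiplications instead of 36, which is where the 2.25x figure comes from. The function names are illustrative, and the filter-side terms are assumed to be precomputed once per filter.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter g over 4 inputs d: 6 multiplications."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]


def winograd_f23(d, g):
    """Same two outputs using 4 data-dependent multiplications."""
    # Filter transform: depends only on g, so it can be computed ahead of time.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # Four element-wise multiplications on transformed data.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    return [m1 + m2 + m3, m2 - m3 - m4]


d, g = [1, 2, 3, 4], [2, 1, 4]
print(direct_f23(d, g), winograd_f23(d, g))  # both give 16 and 23
```

The transform constants are specific to the filter size, which illustrates the "specialized processing depending on filter size" drawback listed above.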



# Specialized Hardware (Accelerators)




# **Properties We Can Leverage**

- Operations exhibit high parallelism
   → high throughput possible
- Memory Access is the Bottleneck











Worst Case: all memory R/W are **DRAM** accesses

Example: AlexNet [NeurIPS 2012] has 724M MACs
 → 2896M DRAM accesses required
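The worst-case figure follows from counting memory accesses per MAC. The sketch below assumes four accesses per MAC (read the weight, read the input activation, read the partial sum, write the updated partial sum); that breakdown is an assumption that reproduces the quoted number, and the function name is hypothetical.

```python
ACCESSES_PER_MAC = 4  # assumed: weight read + activation read + psum read + psum write


def worst_case_dram_accesses(num_macs, accesses_per_mac=ACCESSES_PER_MAC):
    """DRAM accesses if no operand is ever reused from on-chip storage."""
    return num_macs * accesses_per_mac


alexnet_macs = 724_000_000
print(worst_case_dram_accesses(alexnet_macs) / 1e6, "M accesses")  # 2896.0 M accesses
```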



- Input data reuse opportunities (up to 500x)
   → exploit **low-cost memory**

# Highly-Parallel Compute Paradigms

### Temporal Architecture (SIMD/SIMT)

### Spatial Architecture (Dataflow Processing)





# Advantages of Spatial Architecture







# How to Map the Dataflow?



Goal: Increase reuse of input data (weights and pixels) and local partial sums accumulation

### Spatial Architecture (Dataflow Processing)





# **Energy-Efficient Dataflow**

Y.-H. Chen, J. Emer, V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," ISCA 2016






### **Data Movement is Expensive**





\* measured from a commercial 65nm process

Maximize data reuse at low cost levels of hierarchy

# Weight Stationary (WS)



- Minimize weight read energy consumption
  - maximize convolutional and filter reuse of weights
- Examples:

[Chakradhar, ISCA 2010] [nn-X (NeuFlow), CVPRW 2014] [Park, ISSCC 2015] [Origami, GLSVLSI 2015]



# Output Stationary (OS)



- Minimize partial sum R/W energy consumption
  - maximize local accumulation
- Examples:

[Gupta, *ICML* 2015] [ShiDianNao, *ISCA* 2015] [Peemen, *ICCD* 2013]



# **Row Stationary Dataflow**

- Maximize row convolutional reuse in RF
  - Keep a filter row and fmap sliding window in RF
- Maximize row psum accumulation in RF





# Row Stationary Dataflow



# **Evaluate Reuse in Different Dataflows**

### Weight Stationary

- Minimize movement of filter weights

### Output Stationary

- Minimize movement of partial sums

### No Local Reuse

- Don't use any local PE storage. Maximize global buffer size.

### Row Stationary

### **Evaluation Setup**

- Same Total Area
- AlexNet
- 256 PEs
- Batch size = 16



# **Dataflow Comparison: CONV Layers**



[Chen et al., ISCA 2016]




# **Exploit Sparsity**

<u>Method 1</u>: Skip memory access and computation

<u>Method 2</u>: Compress data to reduce storage and data movement

## **Eyeriss: Deep Neural Network Accelerator**



[Chen et al., ISSCC 2016, ISCA 2016]

*Exploits data reuse for* **100x** reduction in memory accesses from global buffer and **1400x** reduction in memory accesses from off-chip DRAM

Overall >10x energy reduction compared to a mobile GPU (Nvidia TK1)

#### **Results for AlexNet**



### <sup>70</sup> Features: Energy vs. Accuracy



[Suleiman et al., ISCAS 2017]


# Benchmarking Metrics for DNN Hardware

How can we compare designs?

V. Sze, Y.-H. Chen, T-J. Yang, J. Emer,

"Efficient Processing of Deep Neural Networks: A Tutorial and Survey,"

Proceedings of the IEEE, Dec. 2017





### Metrics for DNN Hardware

### • Accuracy

- Quality of result for a given task

### • Throughput

- Analytics on high volume data
- Real-time performance (e.g., video at 30 fps)

### • Latency

- For interactive applications (e.g., autonomous navigation)

### • Energy and Power

- Edge and embedded devices have limited battery capacity
- Data centers have stringent power ceilings due to cooling costs

### • Hardware Cost

- \$\$\$





# Specifications to Evaluate Metrics

### • Accuracy

- Difficulty of dataset and/or task should be considered

### • Throughput

- Number of cores (include utilization along with peak performance)
- Runtime for running specific DNN models

### • Latency

- Include batch size used in evaluation

### • Energy and Power

- Power consumption for running specific DNN models
- Include external memory access

### • Hardware Cost

- On-chip storage, number of cores, chip area + process technology





# **Example: Metrics of Eyeriss Chip**

| ASIC Specs                              | Input               |
|-----------------------------------------|---------------------|
| Process Technology                      | 65nm LP TSMC (1.0V) |
| Total Core Area (mm<sup>2</sup>)        | 12.25               |
| Total On-Chip Memory (kB)               | 192                 |
| Number of Multipliers                   | 168                 |
| Clock Frequency (MHz)                   | 200                 |
| Core Area (mm<sup>2</sup>) / Multiplier | 0.073               |
| On-Chip Memory (kB) / Multiplier        | 1.14                |
| Measured or Simulated                   | Measured            |

| Metric                                 | Units  | Input    |
|----------------------------------------|--------|----------|
| Name of CNN Model                      | Text   | AlexNet  |
| Top-5 error classification on ImageNet | #      | 19.8     |
| Supported Layers                       |        | All CONV |
| Bits per weight                        | #      | 16       |
| Bits per input activation              | #      | 16       |
| Batch Size                             | #      | 4        |
| Runtime                                | ms     | 115.3    |
| Power                                  | mW     | 278      |
| Off-chip Access per Image Inference    | MBytes | 3.85     |
| Number of Images Tested                | #      | 100      |
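
Several headline metrics can be derived from the reported values. The sketch below is an illustration only: it takes the batch size, runtime, power, multiplier count, core area, and on-chip memory from the table and computes throughput, energy per image, and the per-multiplier figures. The function and variable names are assumptions for this example.

```python
def derived_metrics(batch_size, runtime_ms, power_mw):
    """Throughput (images/s) and energy per image (mJ) from reported metrics."""
    fps = batch_size / (runtime_ms / 1000.0)
    # mW * ms = microjoules; divide by 1000 to get millijoules.
    energy_per_image_mj = power_mw * runtime_ms / batch_size / 1000.0
    return fps, energy_per_image_mj


fps, energy_mj = derived_metrics(batch_size=4, runtime_ms=115.3, power_mw=278)
print(f"{fps:.1f} images/s, {energy_mj:.2f} mJ per image")

multipliers = 168
print(f"{12.25 / multipliers:.3f} mm^2 per multiplier")  # core area per multiplier
print(f"{192 / multipliers:.2f} kB per multiplier")      # on-chip memory per multiplier
```

Note that the power figure covers the chip only, so the energy per image here excludes the off-chip access listed in the table.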



# Comprehensive Coverage

- All metrics should be reported for fair evaluation of design tradeoffs
- Examples of what can happen if a certain metric is omitted:
  - Without the accuracy given for a specific dataset and task, one could run a simple DNN and claim low power, high throughput, and low cost – however, the processor might not be usable for a meaningful task
  - Without reporting the off-chip bandwidth, one could build a processor with only multipliers and claim low cost, high throughput, high accuracy, and low chip power – however, when evaluating system power, the off-chip memory access would be substantial
- Are results measured or simulated? On what test data?



# **Evaluation Process**

The evaluation process for whether a DNN system is a viable solution for a given application might go as follows:

- **1. Accuracy** determines if it can perform the given task
- **2. Latency and throughput** determine if it can run fast enough and in real-time
- **3. Energy and power consumption** will primarily dictate the form factor of the device where the processing can operate
- **4. Cost**, which is primarily dictated by the chip area, determines how much one would pay for this solution



# Part 2: Co-Design of Algorithms and Hardware for DNNs



# Approaches

- <u>Reduce size</u> of operands for storage/compute
  - Floating point $\rightarrow$ Fixed point
  - Bit-width reduction
  - Non-linear quantization
- <u>Reduce number</u> of operations for storage/compute
  - Exploit Activation Statistics (Compression)
  - Network Pruning
  - Compact Network Architectures



# **Reduced Precision**





# Cost Per Operation





# Floating Point $\rightarrow$ Fixed Point



# Commercial Products Support Reduced Precision

- Intel's NNP-L (2019)
- Nvidia's Pascal (2016)







8-bit Inference & bfloat16 for Training



# Microsoft BrainWave

Narrow Precision for Inference



Custom 8-bit floating point format ("ms-fp8")

[Chung et al., Hot Chips 2017]





# **Reduced Precision Hardware**

**Stripes** 

[Judd et al., MICRO 2016]

**Bit-serial processing for speed** 





**KU Leuven**

[Moons et al., VLSI 2016]

**Voltage scaling for energy savings**







### • Binary Connect (BC)

- Weights {-1,1}, Activations 32-bit float
- MAC $\rightarrow$ addition/subtraction
- Accuracy loss: 19% on AlexNet

[Courbariaux, NeurIPS 2015]

### • Binarized Neural Networks (BNN)

- Weights {-1,1}, Activations {-1,1}
- MAC $\rightarrow$ XNOR
- Accuracy loss: 29.8% on AlexNet

[Courbariaux, arXiv 2016]

**Binary Filters** 





# Scale the Weights and Activations

### • Binary Weight Nets (BWN)

- Weights $\{-\alpha, \alpha\}$, except first and last layers, which are 32-bit float
- Activations: 32-bit float
- $\alpha$ determined by the $\ell_1$-norm of all weights in a filter
- Accuracy loss: 0.8% on AlexNet

### • XNOR-Net

- Weights $\{-\alpha, \alpha\}$
- Activations $\{-\beta_i, \beta_i\}$, except first and last layers, which are 32-bit float
- $\beta_i$ determined by the $\ell_1$-norm of all activations across channels for a given position $i$ of the input feature map
- Accuracy loss: 11% on AlexNet

Hardware needs to support both activation precisions

Scale factors ($\alpha$, $\beta_i$) can change per filter or position in filter

[Rastegari et al., BWN & XNOR-Net, ECCV 2016]
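A minimal sketch of the weight scaling idea, assuming the scale factor is the mean absolute value of the filter's weights (the $\ell_1$-norm divided by the number of weights) and that zero maps to +1. The function name `binarize_filter` and the sample weights are illustrative.

```python
def binarize_filter(weights):
    """Approximate a filter (flat list of weights) as alpha * sign(w)."""
    alpha = sum(abs(w) for w in weights) / len(weights)  # l1-norm / number of weights
    signs = [1 if w >= 0 else -1 for w in weights]       # 1 bit per weight
    return alpha, signs


weights = [0.5, -0.25, 0.75, -1.0]
alpha, signs = binarize_filter(weights)
approx = [alpha * s for s in signs]
print(alpha, signs)  # 0.625 [1, -1, 1, -1]
print(approx)        # [0.625, -0.625, 0.625, -0.625]
```

Only the sign bits and one scale factor per filter need to be stored, and each multiplication by a weight reduces to an add or subtract followed by a single scaling.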



## Ternary Nets

- Allow for weights to be zero
  - Increase sparsity, but also increase number of bits (2-bits)
- Ternary Weight Nets (TWN) [Li et al., arXiv 2016]
  - Weights {-w, 0, w}  $\rightarrow$  except first and last layers are 32-bit float
  - Activations: 32-bit float
  - Accuracy loss: 3.7% on AlexNet
- Trained Ternary Quantization (TTQ) [Zhu et al., ICLR 2017]
  - Weights  $\{-w_1, 0, w_2\} \rightarrow$  except first and last layers are 32-bit float
  - Activations: 32-bit float
  - Accuracy loss: 0.6% on AlexNet





# Non-Linear Quantization

- Precision refers to the number of levels
  - Number of bits = $\log_2$(number of levels)
- Quantization: mapping data to a smaller set of levels
  - Linear, e.g., fixed-point
  - Non-linear
    - Computed
    - Table lookup

Objective: Reduce size to improve speed and/or reduce energy while preserving accuracy





# **Computed Non-linear Quantization**

#### Log Domain Quantization





[Lee et al., LogNet, ICASSP 2017]



# Reduce Precision Overview

- Learned mapping of data to quantization levels (e.g., k-means)



- Additional Properties
  - Fixed or Variable (across data types, layers, channels, etc.)





# Non-Linear Quantization Table Lookup

**Trained Quantization:** Find K weights via K-means clustering to reduce number of unique weights *per layer* (weight sharing)

Example: AlexNet (no accuracy loss)

- 256 unique weights for CONV layer
- 16 unique weights for FC layer

Consequences: Narrow weight memory and second access from (small) table
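The storage consequence can be sketched as follows: each weight is replaced by a narrow index into a small table of shared values, and reading a weight takes a second lookup. This sketch assumes the codebook has already been found (for example by k-means, which is omitted here); the function names and sample values are hypothetical.

```python
def quantize_to_codebook(weights, codebook):
    """Map each weight to the index of its nearest codebook entry."""
    indices = [min(range(len(codebook)), key=lambda k: abs(w - codebook[k]))
               for w in weights]
    bits_per_index = (len(codebook) - 1).bit_length()  # log2(levels), rounded up
    return indices, bits_per_index


def lookup(indices, codebook):
    """Second access: recover the shared weight values from the small table."""
    return [codebook[i] for i in indices]


codebook = [-1.0, 0.0, 0.5, 2.0]   # assumed shared weights (K = 4)
weights = [0.4, -0.9, 1.9, 0.1]
indices, bits = quantize_to_codebook(weights, codebook)
print(indices, bits)               # [2, 0, 3, 1] 2
print(lookup(indices, codebook))   # [0.5, -1.0, 2.0, 0.0]
```

With 256 shared weights the index is 8 bits wide, and with 16 shared weights it is 4 bits wide, matching the CONV and FC bit widths of the weight sharing row in the summary table.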




# Summary of Reduce Precision

| Category                     | Method                                | Weights<br>(# of bits) | Activations<br>(# of bits) | Accuracy Loss<br>vs. 32-bit float (%) |
|------------------------------|---------------------------------------|------------------------|----------------------------|---------------------------------------|
| Dynamic Fixed<br>Point       | w/o fine-tuning                       | 8                      | 10                         | 0.4                                   |
|                              | w/ fine-tuning                        | 8                      | 8                          | 0.6                                   |
| Reduce weight                | Ternary Weight<br>Nets (TWN)          | 2*                     | 32                         | 3.7                                   |
|                              | Trained Ternary<br>Quantization (TTQ) | 2*                     | 32                         | 0.6                                   |
|                              | Binary Connect (BC)                   | 1                      | 32                         | 19.2                                  |
|                              | Binary Weight Net<br>(BWN)            | 1*                     | 32                         | 0.8                                   |
| Reduce weight and activation | Binarized Neural Net (BNN)            | 1                      | 1                          | 29.8                                  |
|                              | XNOR-Net                              | 1*                     | 1                          | 11                                    |
| Non-Linear                   | LogNet                                | 5(conv), 4(fc)         | 4                          | 3.2                                   |
|                              | Weight Sharing                        | 8(conv), 4(fc)         | 16                         | 0                                     |

\* first and last layers are 32-bit float



# Approaches

- <u>Reduce size</u> of operands for storage/compute
  - Floating point $\rightarrow$ Fixed point
  - Bit-width reduction
  - Non-linear quantization
- <u>Reduce number</u> of operations for storage/compute
  - Exploit Activation Statistics (Compression)
  - Network Pruning
  - Compact Network Architectures



# **Exploit Sparsity**





# Sparsity in Feature Maps

Many zeros in output fmaps after ReLU






# Exploit Sparsity

Method 1: Skip memory access and computation



<u>Method 2</u>: Compress data to reduce storage and data movement



# Pruning – Make Weights Sparse

## **Optimal Brain Damage**

[LeCun et al., NeurIPS 1989]

## Prune DNN based on *magnitude* of weights

[Han et al., NeurIPS 2015]

# **Pruning – Make Weights Sparse**

Remove the weights with the **smallest joint impact** on the output feature map instead of those with the smallest magnitude



[Yang et al., Energy-Aware Pruning, CVPR 2017]



# **Fast Local Fine-Tuning**

We then **locally fine-tune** the remaining weights, which is much faster than performing end-to-end training

[Yang et al., Energy-Aware Pruning, CVPR 2017]

# **Compression of Weights & Activations**

- Compress weights and activations between DRAM and accelerator
- Variable Length / Huffman Coding

Example:

- Value: $16'b0 \rightarrow$ Compressed Code: $\{1'b0\}$
- Value: $16'bx \rightarrow$ Compressed Code: $\{1'b1, 16'bx\}$

Tested on AlexNet $\rightarrow$ 2× overall BW Reduction

| Layer        | Filter / Image<br>bits (0%) | Filter / Image<br>BW Reduc. | IO / HuffIO<br>(MB/frame) | Voltage<br>(V) | MMACs/<br>Frame | Power<br>(mW) | Real<br>(TOPS/W) |
|--------------|-----------------------------|-----------------------------|---------------------------|----------------|-----------------|---------------|------------------|
| General CNN  | 16 (0%) / 16 (0%)           | 1.0x                        | -                         | 1.1            | -               | 288           | 0.3              |
| AlexNet l1   | 7 (21%) / 4 (29%)           | 1.17x / 1.3x                | 1 / 0.77                  | 0.85           | 105             | 85            | 0.96             |
| AlexNet l2   | 7 (19%) / 7 (89%)           | 1.15x / 5.8x                | 3.2 / 1.1                 | 0.9            | 224             | 55            | 1.4              |
| AlexNet l3   | 8 (11%) / 9 (82%)           | 1.05x / 4.1x                | 6.5 / 2.8                 | 0.92           | 150             | 77            | 0.7              |
| AlexNet l4   | 9 (04%) / 8 (72%)           | 1.00x / 2.9x                | 5.4 / 3.2                 | 0.92           | 112             | 95            | 0.56             |
| AlexNet l5   | 9 (04%) / 8 (72%)           | 1.00x / 2.9x                | 3.7 / 2.1                 | 0.92           | 75              | 95            | 0.56             |
| Total / avg. | -                           | -                           | 19.8 / 10                 | -              | -               | 76            | 0.94             |
| LeNet-5 l1   | 3 (35%) / 1 (87%)           | 1.40x / 5.2x                | 0.003 / 0.001             | 0.7            | 0.3             | 25            | 1.07             |
| LeNet-5 l2   | 4 (26%) / 6 (55%)           | 1.25x / 1.9x                | 0.050 / 0.042             | 0.8            | 1.6             | 35            | 1.75             |
| Total / avg. | -                           | -                           | 0.053 / 0.043             | -              | -               | 33            | 1.6              |

[Moons et al., VLSI 2016; Han et al., ICLR 2016]
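
The zero-flag code in the example on this slide (a zero becomes a single `0` bit, any other 16-bit value becomes a `1` flag followed by the value) can be sketched as below. This is an illustration of that example code only, not a full Huffman coder; the function names and the bit-string representation are assumptions.

```python
def compress(values):
    """Encode unsigned 16-bit values: zero -> '0', nonzero -> '1' + 16 value bits."""
    out = []
    for v in values:
        if not 0 <= v < 2 ** 16:
            raise ValueError("expected an unsigned 16-bit value")
        out.append("0" if v == 0 else "1" + format(v, "016b"))
    return "".join(out)


def decompress(bits):
    """Invert compress(): read a flag bit, then 16 value bits if the flag is 1."""
    values, i = [], 0
    while i < len(bits):
        if bits[i] == "0":
            values.append(0)
            i += 1
        else:
            values.append(int(bits[i + 1:i + 17], 2))
            i += 17
    return values


data = [0, 0, 5, 0]
code = compress(data)
print(len(code), "bits instead of", 16 * len(data))  # 20 bits instead of 64
assert decompress(code) == data
```

The scheme pays one extra bit for every nonzero value, so it only reduces bandwidth when zeros are frequent, as they are in sparse weights and post-ReLU activations.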



#### **Sparse Hardware**

# Sparse Hardware – Eyeriss v2

#### Supports both Convolutional and Fully Connected Layers





|                             | AlexNet | sparse-AlexNet |
|-----------------------------|---------|----------------|
| GOPS                        | 148.3   | 405.8          |
| fps                         | 102.4   | 280.1          |
| Over v1 (throughput)        | 15.5×   | 42.5×          |
| GOPS/W                      | 277.9   | 1028.1         |
| Inferences/J                | 191.8   | 709.7          |
| Over v1 (energy efficiency) | 3.0×    | 11.3×          |

[Chen et al., JETCAS 2019]





# Manual Network Architecture Design





# Simplify CONV Layers

Methods can be roughly categorized by how the filters are simplified:

- Reduce spatial size (R, S): stacked filters
- Reduce channels (C): 1x1 convolution, group of filters
- Reduce filters (M): feature map reuse






# Stacked Filters

### **GoogLeNet/Inception v3**



Replace a large filter with a series of smaller filters






## **Stacked Filters**

- Use stack of smaller filters (3x3) to cover the same receptive field with fewer filter weights

Example: 5x5 filter (25 weights) $\rightarrow$ two 3x3 filters (18 weights)
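The weight savings generalize to deeper stacks. The sketch below assumes stride-1 3x3 filters and counts weights for a single input/output channel pair; the function name `stacked_3x3` is illustrative.

```python
def stacked_3x3(num_layers):
    """Receptive field and weight counts for a stack of stride-1 3x3 filters."""
    receptive_field = 2 * num_layers + 1    # each 3x3 layer grows the field by 2
    stacked_weights = 9 * num_layers        # 9 weights per 3x3 filter
    single_weights = receptive_field ** 2   # one large filter with the same field
    return receptive_field, stacked_weights, single_weights


print(stacked_3x3(2))  # (5, 18, 25): two 3x3 filters vs. one 5x5 filter
print(stacked_3x3(3))  # (7, 27, 49): three 3x3 filters vs. one 7x7 filter
```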



### Simplify CONV Layers

Methods can be roughly categorized by how the filters are simplified:

- Reduce spatial size (R, S): stacked filters
- Reduce channels (C): 1x1 convolution, group of filters
- Reduce filters (M): feature map reuse



### 1x1 Convolution

Use **1x1 filter** to condense the cross-channel information.



[Lin et al., Network in Network, arXiv 2013, ICLR 2014]












# **GoogLeNet: 1x1 Convolution**

Apply 1x1 convolution before 'large' convolution filters. Reduce weights such that **entire CNN can be trained on one GPU**. Number of multiplications reduced from 854M $\rightarrow$ 358M.



[Szegedy et al., arXiv 2014, CVPR 2015]







## **Group of Filters**

Idea: split filters and channels of feature map into different groups.

Example: 2 groups, each filter requires **2x fewer weights and multiplications**.


AlexNet uses group of filters to train on two separate GPUs (Drawback: correlation between channels of different groups is not used)
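The 2x saving can be checked with AlexNet's second CONV layer, whose configuration table lists 48 channels per filter because its 96 input channels are split into two groups. The sketch below is illustrative; the function name `conv_weights` and its `groups` parameter are assumptions for this example.

```python
def conv_weights(R, S, C, M, groups=1):
    """Weight count of a CONV layer with C input channels and M filters.

    With grouping, each filter only sees C // groups input channels.
    """
    if C % groups or M % groups:
        raise ValueError("channels and filters must divide evenly into groups")
    return R * S * (C // groups) * M


ungrouped = conv_weights(5, 5, 96, 256)           # every filter sees all 96 channels
grouped = conv_weights(5, 5, 96, 256, groups=2)   # every filter sees 48 channels
print(ungrouped, grouped, ungrouped / grouped)    # 614400 307200 2.0
```

Since each weight is applied once per output position, the number of multiplications falls by the same factor.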







### **Group of Filters**

### Two ways of mixing information from groups





- Pointwise (1x1) convolution (mix in one step): MobileNet
- Shuffle operation (mix in multiple steps): ShuffleNet
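The shuffle operation can be sketched as a reordering of channel indices so that the next grouped layer receives channels from every group. This is a simplified illustration on a flat list of channel labels, assuming the channel count is divisible by the number of groups; it is not taken from the ShuffleNet code.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each new group contains channels from every old group."""
    if len(channels) % groups:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    # View the channels as a (groups x per_group) grid and read it column by column.
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


if __name__ == "__main__":
    print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))
```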





## **MobileNets: Comparison**

| Model             | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
|-------------------|-------------------|-------------------|--------------------|
| 1.0 MobileNet-224 | 70.6%             | 569               | 4.2                |
| GoogleNet         | 69.8%             | 1550              | 6.8                |
| VGG 16            | 71.5%             | 15300             | 138                |

Table 9. Smaller MobileNet Comparison to Popular Models

| Model              | ImageNet Accuracy | Million Mult-Adds | Million Parameters |
|--------------------|-------------------|-------------------|--------------------|
| 0.50 MobileNet-160 | 60.2%             | 76                | 1.32               |
| Squeezenet         | 57.5%             | 1700              | 1.25               |
| AlexNet            | 57.2%             | 720               | 60                 |





### <sup>121</sup> Simplify CONV Layers



Methods can be roughly categorized by how the filters are simplified:

- Reduce spatial size (R, S): stacked filters
- Reduce channels (C): 1x1 convolution, group of filters
- Reduce filters (M): feature map reuse



### **122** Feature Map Reuse





output fmap with M channels



Reuse (M-K) channels in feature maps from previously processed layers





#### **Feature Map Reuse** 123



**Transition layers** 

[Huang et al., CVPR 2017]





### 124 **DenseNet**

Higher accuracy than ResNet with fewer weights and multiplications



Note: 1 MAC = 2 FLOPS



[Huang et al., CVPR 2017]

### 125 Feature Map Reuse

More complicated layer aggregation





[Yu et al., CVPR 2018]



# 126 Simplify FC Layers

AlexNet (ILSVRC12 winner):

- CONV layers: 5
- Fully connected layers: 3
- Weights: 61M
- MACs: 724M

[Krizhevsky et al., NIPS 2012]





# 127 Simplify FC Layers



[Lin et al., ICLR 2014]





# **128** Knowledge Distillation



[Bucilă et al., KDD 2006], [Hinton et al., arXiv 2015]





# Network Architecture Search (NAS)





## **130 Learn Network Architecture**

Rather than handcrafting the architecture, automatically search for it



### **Evaluate NAS Performance** 131

- Key Metrics
  - Achievable DNN accuracy
  - Required search time





- Trade the discoverable architectures for search speed
- May irrecoverably limit the achievable network performance
  - Domain knowledge learned in manual network design provides guidance





- Search space = layer operations + connections between layers

### Common layer operations:

- Identity
- 1x3 then 3x1 convolution
- 1x7 then 7x1 convolution
- 3x3 dilated convolution
- 1x1 convolution
- 3x3 convolution
- 3x3 separable convolution
- 5x5 separable convolution
- 3x3 average pooling
- 3x3 max pooling
- 5x5 max pooling
- 7x7 max pooling
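A rough count shows why the search space is often restricted. The sketch below assumes a plain chain of layers in which each layer independently picks one operation and connections are ignored; that simplification, and the function name, are assumptions made for illustration.

```python
LAYER_OPERATIONS = [
    "identity",
    "1x3 then 3x1 convolution",
    "1x7 then 7x1 convolution",
    "3x3 dilated convolution",
    "1x1 convolution",
    "3x3 convolution",
    "3x3 separable convolution",
    "5x5 separable convolution",
    "3x3 average pooling",
    "3x3 max pooling",
    "5x5 max pooling",
    "7x7 max pooling",
]


def chain_search_space_size(num_layers, num_ops=len(LAYER_OPERATIONS)):
    """Number of distinct chains when every layer picks one of num_ops operations."""
    return num_ops ** num_layers


if __name__ == "__main__":
    for depth in (5, 10, 20):
        print(depth, chain_search_space_size(depth), chain_search_space_size(depth, num_ops=4))
```

Allowing arbitrary connections between layers enlarges the space further, so dropping operations or fixing the connection pattern trades discoverable architectures for search speed.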







# 137 (2) Improve Optimization Algorithm

### Random

Randomly samples the entire space

- Simple
- Does not use previous results

### Coordinate Descent

Starts from the previous best sample and greedily finds the best direction to move

- Uses previous results
- Simple
- Limited number of directions

### Gradient Descent

Starts from the previous best sample and goes in the direction that has the largest gradient

- Explores more directions
- The metric should be differentiable
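The difference between random sampling and coordinate descent can be sketched on a toy problem. Here an "architecture" is a tuple of three layer widths and `toy_score` is a made-up stand-in for accuracy that peaks at widths (6, 3, 8); both are assumptions for illustration and do not model a real network.

```python
import random


def toy_score(arch):
    """Hypothetical stand-in for accuracy; highest at widths (6, 3, 8)."""
    target = (6, 3, 8)
    return -sum((a - t) ** 2 for a, t in zip(arch, target))


def random_search(num_samples, low=1, high=10, seed=0):
    """Sample the whole space at random; earlier samples do not guide later ones."""
    rng = random.Random(seed)
    samples = [tuple(rng.randint(low, high) for _ in range(3)) for _ in range(num_samples)]
    return max(samples, key=toy_score)


def coordinate_descent(start, low=1, high=10):
    """Start from the best sample so far and greedily move along one coordinate."""
    current = tuple(start)
    while True:
        neighbors = [
            current[:i] + (current[i] + step,) + current[i + 1:]
            for i in range(len(current))
            for step in (-1, 1)
            if low <= current[i] + step <= high
        ]
        best = max(neighbors, key=toy_score)
        if toy_score(best) <= toy_score(current):
            return current  # no single-coordinate move improves the score
        current = best


if __name__ == "__main__":
    print(random_search(20), coordinate_descent((1, 1, 1)))
```

On this toy score the coordinates do not interact, so the greedy moves reach the optimum; with interacting coordinates the limited number of directions can leave coordinate descent at a worse point.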



# 138 (2) Improve Optimization Algorithm

| Evolutionary | Reinforcement Learning | Bayesian |
|--------------|------------------------|----------|
| Starts from the previous best sample and goes in the best randomly-sampled direction | Learns from the previous samples and infers the best sample | Models the entire surface of the search space and picks the best sample |
| The metric does not need to be differentiable | Better uses the previous samples | Gets rid of the iterative process |
| More complicated | Needs to design and train the agent | Hard to model a large search space |

- NAS needs only the <u>rank</u> of the performance values
- Method 1: approximate accuracy
- Method 2: approximate weights
- Method 3: approximate metrics (e.g., latency, energy)
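Because only the rank matters, an approximation is useful as long as it orders candidate networks the same way as the exact evaluation. The sketch below compares the ordering produced by a cheap proxy with the ordering from full evaluation; all accuracy values are made-up toy numbers, and the function name is illustrative.

```python
def rank_order(values):
    """Indices of the candidates sorted from best to worst."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Toy numbers for four candidate networks (not measured results).
full_accuracy = [0.71, 0.68, 0.74, 0.65]    # e.g., after full training
proxy_accuracy = [0.52, 0.47, 0.58, 0.40]   # e.g., after a shortened training run

if __name__ == "__main__":
    print(rank_order(full_accuracy), rank_order(proxy_accuracy))
```

The proxy values are far from the full values, yet the search would pick the same candidate because the two orderings agree.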










# 143 Other Things to Know

- The components may not be chosen individually
  - Some optimization algorithms limit the search space
  - Using direct hardware metrics may limit the selection of the optimization algorithms

- Commonly overlooked properties
  - The complexity of implementation and usage
  - The ease of tuning
  - The probability of convergence to a good architecture



## 144 NetAdapt: Platform-Aware DNN Adaptation

- Automatically adapt DNN to a mobile platform to reach a target latency or energy budget
- An example of coordinate descent NAS



In collaboration with Google's Mobile Vision Team

## **Problem Formulation**

$\max_{Net} Acc(Net)$ subject to $Res_j(Net) \leq Bud_j, \; j = 1, \cdots, m$

Break into a set of simpler problems and solve iteratively

$\max_{Net_i} Acc(Net_i)$ subject to $Res_j(Net_i) \leq Res_j(Net_{i-1}) - \Delta R_{i,j}, \; j = 1, \cdots, m$

*Acc*: accuracy function, *Res*: resource evaluation function, *ΔR*: resource reduction, *Bud*: given budget

#### Advantages

- Supports multiple resource budgets at the same time
- Guarantees that the budgets will be satisfied because the resource consumption decreases monotonically
- Generates a family of networks (from each iteration) with different resource versus accuracy trade-offs
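The iterative formulation can be sketched as a loop that tightens one resource constraint by a fixed step per iteration. This is a toy illustration, not the NetAdapt implementation: a network is a list of layer widths, and the caller supplies `resource` and `accuracy` functions that stand in for measured resource consumption and evaluated accuracy. All names and the single-resource setup are assumptions.

```python
def netadapt_sketch(widths, budget, delta, resource, accuracy):
    """Shrink a network step by step until resource(network) <= budget.

    Returns the family of networks produced, one per iteration.
    """
    current = list(widths)
    family = [current]
    while resource(current) > budget:
        target = resource(current) - delta  # constraint for this iteration
        proposals = []
        for layer in range(len(current)):
            # One proposal per layer: shrink only that layer to meet the target.
            proposal = list(current)
            while proposal[layer] > 1 and resource(proposal) > target:
                proposal[layer] -= 1
            if resource(proposal) <= target:
                proposals.append(proposal)
        if not proposals:
            break  # no single layer can satisfy the tightened constraint
        current = max(proposals, key=accuracy)  # keep the most accurate proposal
        family.append(current)
    return family


def toy_resource(net):
    """Toy resource model: total number of filters."""
    return sum(net)


def toy_accuracy(net):
    """Toy accuracy model with diminishing returns per layer."""
    return sum(w ** 0.5 for w in net)


if __name__ == "__main__":
    for net in netadapt_sketch([8, 16, 32], budget=40, delta=4,
                               resource=toy_resource, accuracy=toy_accuracy):
        print(net, toy_resource(net))
```

Each accepted network consumes less than the previous one, so the resource consumption decreases monotonically and every intermediate network is a usable point on the trade-off curve.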



#### <sup>146</sup> Simplified Example of One Iteration



Code available at http://netadapt.mit.edu




## 147 Improved Latency vs. Accuracy Tradeoff

 NetAdapt boosts the real inference speed of MobileNet by up to 1.7x with higher accuracy



Reference:

- **MobileNet:** Howard et al., "MobileNets: Efficient convolutional neural networks for mobile vision applications", arXiv 2017
- **MorphNet:** Gordon et al., "MorphNet: Fast & simple resource-constrained structure learning of deep networks", CVPR 2018

[Yang et al., ECCV 2018]

- Reimplemented framework on PyTorch
- Flexible: can support different networks and tasks
- Scalable: spawn multiple workers to simplify networks in parallel
- Easy-to-use: requires implementing only one file (8 functions)

Code available at https://github.com/denru01/netadapt







RESEARCH LABORATORY OF ELECTRONICS AT MIT








# Hardware In the Loop





## <sup>154</sup> # of Operations vs. Latency

• # of operations (MACs) does not approximate latency well



Source: Google (https://ai.googleblog.com/2018/04/introducing-cvpr-2018-on-device-visual.html)



## <sup>155</sup> # of Weights vs. Energy

- Number of weights *alone* is not a good metric for energy
- All data types should be considered





[Yang et al., CVPR 2017]

### **156** Other Hardware Metrics

• E.g., noise resilience in analog accelerators



DNN model that gives highest accuracy on a digital processor may not be the best for an analog processor





#### <sup>157</sup> Data Movement is Expensive





\* measured from a commercial 65nm process

Energy of weight depends on **memory hierarchy** and **dataflow** 

## 158 Energy Estimation Methodology



Hardware Energy Costs of each MAC and Memory Access
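A common way to combine these quantities is to weight the number of MACs and the number of accesses at each memory level by their per-operation energy costs and add the results. The sketch below assumes that simple additive form; the memory level names and cost values are placeholders for illustration, not measured numbers.

```python
def estimate_energy(num_macs, accesses, mac_cost, access_cost):
    """Energy = computation energy + data movement energy.

    accesses and access_cost map a memory level name to a count and a per-access cost.
    """
    computation = num_macs * mac_cost
    data_movement = sum(accesses[level] * access_cost[level] for level in accesses)
    return computation + data_movement


if __name__ == "__main__":
    # Placeholder costs, normalized to the energy of one MAC.
    costs = {"buffer": 6, "dram": 200}
    layer_accesses = {"buffer": 100, "dram": 10}
    print(estimate_energy(1000, layer_accesses, mac_cost=1, access_cost=costs))
```

Even with few off-chip accesses, the data movement term can exceed the computation term, which is why the memory hierarchy and dataflow matter for the energy of a weight.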






## 159 Energy Estimation Tool V1

#### Website: https://energyestimation.mit.edu/

#### **Deep Neural Network Energy Estimation Tool**

#### Overview

This Deep Neural Network Energy Estimation Tool is used for evaluating and designing energy-efficient deep neural networks that are critical for embedded deep learning processing. Energy estimation was used in the development of the energy-aware pruning method (Yang et al., CVPR 2017), which reduced the energy consumption of AlexNet and GoogLeNet by 3.7x and 1.6x, respectively, with less than 1% top-5 accuracy loss. This website provides a simplified version of the energy estimation tool for shorter runtime (around 10 seconds).

#### Input


To support the variety of toolboxes, this tool takes a single network configuration file. The network configuration file is a txt file, where each line denotes the configuration of a CONV/FC layer. The format of each line is:



- <u>Layer Index</u>: the index of the layer, from 1 to the number of layers. It should be the same as the line number.
- <u>Conf IfMap, Conf Filt, Conf OfMap</u>: the configuration of the input feature maps, the filters and the output feature maps. The configuration of each of the three data types is in the format of "height width number\_of\_channels number\_of\_maps\_or\_filts number\_of\_zero\_entries bitwidth\_in\_bits".
- <u>Stride</u>: the stride of this layer. It is in the format of "stride\_y stride\_x".
- <u>Pad</u>: the amount of input padding. It is in the format of "pad\_top pad\_bottom pad\_left pad\_right".

Therefore, there will be 25 entries separated by commas in each line.
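The entry count (1 + 3 x 6 + 2 + 4 = 25) can be checked by assembling one line programmatically. The sketch below assumes the fields appear in the order listed above and that every entry is comma-separated; the helper name and the example layer are hypothetical, and the tool's own documentation remains the reference for the exact line format.

```python
def layer_line(index, ifmap, filt, ofmap, stride, pad):
    """Build one configuration line from the fields described above.

    ifmap, filt, ofmap: (height, width, channels, maps_or_filts, zero_entries, bitwidth)
    stride: (stride_y, stride_x); pad: (top, bottom, left, right)
    Field order is an assumption of this sketch.
    """
    entries = [index, *ifmap, *filt, *ofmap, *stride, *pad]
    if len(entries) != 25:
        raise ValueError(f"expected 25 entries, got {len(entries)}")
    return ",".join(str(entry) for entry in entries)


if __name__ == "__main__":
    # Hypothetical 3x3 CONV layer: 16 -> 32 channels on a 32x32 map, 16-bit data.
    print(layer_line(
        1,
        ifmap=(32, 32, 16, 1, 0, 16),
        filt=(3, 3, 16, 32, 0, 16),
        ofmap=(32, 32, 32, 1, 0, 16),
        stride=(1, 1),
        pad=(1, 1, 1, 1),
    ))
```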

#### **Running the Estimation Model**

After creating your text file, follow these steps to upload your text file and run the estimation model:

- 1. Check the "I am not a robot" checkbox and complete the Google reCAPTCHA challenge. Help us prevent spam.
- 2. Click the "Choose File" button below to choose your text file from your computer.
- 3. Click the "Run Estimation Model" button below to upload your text file and run the estimation model.

#### **Eyeriss V1**



#### Output DNN energy breakdown across layers







### 161 Energy Estimation Tool V2 - Accelergy



Tutorial at MICRO 2019: http://accelergy.mit.edu/tutorial.html



## 162 Energy Estimation Tool V2 - Accelergy

#### Website: https://accelergy.mit.edu/


#### Accelergy infrastructure (version 0.2)

An infrastructure for architecture-level energy estimations of accelerator designs. Project website: http://accelergy.mit.edu

#### Get started

Infrastructure tested on RedHat Linux 6 and WSL

#### Output DNN energy breakdown across components

- hierarchy.PE[0].ifmap\_sp: 140.0
- hierarchy.PE[0].mac[0]: 70.0
- hierarchy.PE[0].mac[1]: 70.0
- hierarchy.PE[1].ifmap\_sp: 180.0
- hierarchy.PE[1].mac[0]: 70.0
- hierarchy.PE[1].mac[1]: 70.0
- hierarchy.weights\_glb: 5400.0





## 163 Energy-Aware Pruning

- Problem formulation:  $\min_{Net} Erg(Net)$  subject to  $Acc(Net) \ge Th$
- Reduces energy by removing redundant weights
- Uses estimated energy to guide the layer-by-layer pruning
  - Prunes the layer that consumes the most energy first
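The layer ordering can be sketched as a sort on estimated per-layer energy. The layer names and energy values below are made up for illustration; in practice the estimates would come from an energy model of the target hardware.

```python
def pruning_order(layer_energy):
    """Layer names sorted from highest to lowest estimated energy."""
    return sorted(layer_energy, key=layer_energy.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical normalized energy estimates per layer.
    estimates = {"conv1": 4.0, "conv2": 9.5, "conv3": 6.1, "fc1": 2.3}
    print(pruning_order(estimates))
```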





## 164 Energy-Aware Pruning

Directly target energy and incorporate it into the optimization of DNNs to provide greater energy savings

- Sort layers based on energy and prune layers that consume most energy first
- EAP reduces AlexNet energy by **3.7x** and outperforms the previous work that uses magnitude-based pruning by **1.7x**

Pruned models available at http://eyeriss.mit.edu/energy.html





## <sup>165</sup> NetAdapt: Platform-Aware DNN Adaptation

- Automatically adapt DNN to a mobile platform to reach a target latency or energy budget
- Use **empirical measurements** to guide optimization (avoid modeling of tool chain or platform architecture)



In collaboration with Google's Mobile Vision Team

#### NetAdapt: Using Direct Metrics is Important

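- If NetAdapt were guided by the number of MACs, it would also achieve a better accuracy-MAC trade-off
- However, fewer MACs do not mean lower latency: in the table below, the MAC-guided networks use 81% and 87% of the original MACs but run at 129% and 108% of the original latency
- It is important to incorporate direct metrics rather than indirect metrics into the design of DNNs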

| Network            | Top-1 Accuracy | # of MACs (M) | Latency (ms) |
|--------------------|----------------|---------------|--------------|
| Small MobileNet V1 | 45.1 (+0)      | 13.6 (100%)   | 4.65 (100%)  |
| NetAdapt           | 46.3 (+1.2)    | 11.0 (81%)    | 6.01 (129%)  |
| Large MobileNet V1 | 68.8 (+0)      | 325.4 (100%)  | 69.3 (100%)  |
| NetAdapt           | 69.1 (+0.3)    | 284.3 (87%)   | 74.9 (108%)  |



#### NetAdapt: Fast Resource Consumption Estimation

- Taking measurements can be slow due to the long turn-around time and the limited number of platforms
- Solution: use per-layer lookup tables
  - The network latency can be estimated by the sum of the latency of each layer
  - The layers with the same configuration only need to be measured once
  - By contrast, a network-wise lookup table would grow exponentially with the number of layers
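The per-layer lookup table can be sketched as a dictionary keyed by layer configuration. The sketch assumes a layer is described by a hashable tuple and that `measure` is a caller-supplied function that times one layer on the target platform; both are illustrative assumptions rather than the interface of the released code.

```python
def build_lookup_table(layer_configs, measure):
    """Measure each distinct layer configuration once."""
    table = {}
    for config in layer_configs:
        if config not in table:
            table[config] = measure(config)
    return table


def estimate_latency(network, table):
    """Estimate network latency as the sum of its per-layer latencies."""
    return sum(table[config] for config in network)


if __name__ == "__main__":
    calls = []

    def fake_measure(config):
        """Stand-in for an on-device measurement; returns a made-up latency."""
        calls.append(config)
        in_channels, out_channels, kernel = config
        return in_channels * out_channels * kernel * kernel * 1e-5

    network = [(16, 32, 3), (32, 32, 3), (32, 32, 3), (32, 64, 1)]
    table = build_lookup_table(network, fake_measure)
    print(len(calls), estimate_latency(network, table))
```

The repeated (32, 32, 3) layer is measured once and reused, which is what keeps the number of measurements manageable.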




#### <sup>168</sup> NetAdapt: Code

Support building and using lookup tables



Code available at https://github.com/denru01/netadapt





# Part 3: Applications (Beyond Image Classification)







## **170** FastDepth: Fast Monocular Depth Estimation

- Real-time low-power depth sensing is critical for navigation of small robotic vehicles.
- Depth estimation from a single RGB image is desirable, due to the relatively low cost and size of monocular cameras.



Our goal is to enable high accuracy, low latency, high throughput monocular depth estimation on a deployable embedded system.









### 171 Efficient Network Design for FastDepth



FastDepth achieves high frame rates through

- An efficient and lightweight encoder-decoder network architecture with a low-latency decoder design incorporating depthwise separable layers and additive skip connections
- Network pruning (NetAdapt) applied to whole encoder-decoder network
- Platform-specific compilation (TVM) targeting embedded systems







### **172** FastDepth: Fast Monocular Depth Estimation

Depth estimation at **high frame rates on an embedded platform** (an order of magnitude faster than previous approaches) while still maintaining accuracy



Models available at http://fastdepth.mit.edu



~40fps on an iPhone



### 173 Simplify Network by NetAdapt

|            | Before Pruning | After Pruning | Reduction    |
|------------|----------------|---------------|--------------|
| Weights    | 3.93M          | 1.34M         | $2.9 \times$ |
| MACs       | 0.74G          | 0.37G         | $2.0 \times$ |
| RMSE       | 0.599          | 0.604         | -            |
| $\delta_1$ | 0.775          | 0.771         | -            |
| CPU [ms]   | 66             | 37            | $1.8 \times$ |
| GPU [ms]   | 8.2            | 5.6           | $1.5 \times$ |
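The reduction column can be reproduced from the before and after values in the table. The dictionary keys below are labels chosen for this sketch.

```python
BEFORE = {"weights_M": 3.93, "macs_G": 0.74, "cpu_ms": 66, "gpu_ms": 8.2}
AFTER = {"weights_M": 1.34, "macs_G": 0.37, "cpu_ms": 37, "gpu_ms": 5.6}


def reduction(metric):
    """Ratio of the value before pruning to the value after pruning."""
    return BEFORE[metric] / AFTER[metric]


if __name__ == "__main__":
    for name in BEFORE:
        print(name, round(reduction(name), 1))
```

The MAC count drops by 2.0x while CPU and GPU latency drop by 1.8x and 1.5x, consistent with the earlier point that the number of operations does not approximate latency well.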




#### **DeeperLab: Single-Shot Image Parser**

**Results from Xception** 

Joint Semantic and Instance Segmentation (high resolution input image)

One-shot parsing for efficient processing

- Fully convolutional, one-shot parsing (bottom-up approach)
- One backbone for two tasks

http://deeperlab.mit.edu/

[Yang et al., arXiv 2019]

In collaboration with Google's Mobile Vision Team



## **DeeperLab: Efficient Image Parsing**

#### Address memory requirement for large feature map

- Wide MobileNet: Increase kernel size rather than depth
- Space-to-depth/depth-to-space: Avoid upsampling
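Space-to-depth rearranges each spatial block of a feature map into extra channels, so the spatial resolution drops without discarding values; depth-to-space is the inverse. A minimal sketch on nested lists, assuming an H x W x C layout and one particular channel ordering within each block (row, then column, then original channel):

```python
def space_to_depth(fmap, block):
    """Rearrange an H x W x C map into (H/block) x (W/block) x (C*block*block)."""
    height, width = len(fmap), len(fmap[0])
    if height % block or width % block:
        raise ValueError("height and width must be divisible by block")
    return [
        [
            [
                value
                for dy in range(block)
                for dx in range(block)
                for value in fmap[y + dy][x + dx]
            ]
            for x in range(0, width, block)
        ]
        for y in range(0, height, block)
    ]


if __name__ == "__main__":
    fmap = [[[1], [2]], [[3], [4]]]  # H=2, W=2, C=1
    print(space_to_depth(fmap, block=2))
```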



Achieves near real-time 6.19 fps on GPU (V100) with 25.2% PQ and 49.8% PC on Mapillary Vistas dataset



http://deeperlab.mit.edu/



# Applications (Beyond DNN Acceleration)





## **177** Super-Resolution on Mobile Devices



Transmit low resolution for lower bandwidth

Screens are getting larger



Use **super-resolution** to improve the viewing experience of lower-resolution content (*reduce communication bandwidth*)





#### **178** FAST: A Framework to Accelerate SuperRes



**Real-time** 

A framework that accelerates **any SR** algorithm by up to **15x** when running on compressed videos

[Zhang et al., CVPRW 2017]






## <sup>179</sup> Free Information in Compressed Videos







- Pixels: video as a stack of pixels
- Compressed video: block structure and motion compensation (representation in compressed video)

This representation can help accelerate super-resolution







## **180** Transfer is Lightweight



Fractional Bicubic Interpolation

Skip Flag

The complexity of the transfer is comparable to bicubic interpolation. Transfer N frames, accelerate by N







## **181** Evaluation: Accelerating SRCNN







#### Examples of videos in the test set (20 videos for HEVC development)

PartyScene, RaceHorse, BasketballPass





 $4 \times$  acceleration with NO PSNR LOSS.  $16 \times$  acceleration with 0.2 dB loss of PSNR





#### **182** Visual Evaluation



Visual comparison panels: Bicubic, SRCNN, FAST + SRCNN

Look *beyond* the DNN accelerator for opportunities to accelerate DNN processing (e.g., structure of data and temporal correlation)

Code released at <u>www.rle.mit.edu/eems/fast</u>

[Zhang et al., CVPRW 2017]



- DNNs are a critical component in the AI revolution, delivering record-breaking accuracy on many important AI tasks for a wide range of applications; however, this comes at the cost of high computational complexity
- Efficient processing of DNNs is an important area of research with many promising opportunities for innovation at various levels of hardware design, including algorithm co-design
- When considering different DNN solutions, it is important to **evaluate with the appropriate workload** in terms of both input and model, and to recognize that they are **evolving rapidly**
- It is important to consider a comprehensive set of metrics when evaluating different DNN solutions: accuracy, speed, energy, and cost



## **184** Additional Resources

#### **Overview Paper**

V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, Dec. 2017

Book Coming Soon!

More info about **Tutorial on DNN Architectures** <u>http://eyeriss.mit.edu/tutorial.html</u>






MIT Professional Education Course on "Designing Efficient Deep Learning Systems" <u>http://professional-education.mit.edu/deeplearning</u>

For updates, follow @eems\_mit

http://mailman.mit.edu/mailman/listinfo/eems-news





#### • Overview on DNN and Popular DNN Models

- **Batch Normalization**: Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." ICML 2015.
- **LeNet**: LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proc. IEEE 1998.
- **AlexNet**: Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." NIPS 2012.
- **VGGNet**: Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." ICLR 2015.
- **GoogleNet**: Szegedy, Christian, et al. "Going deeper with convolutions." CVPR 2015.
- **ResNet**: He, Kaiming, et al. "Deep residual learning for image recognition." CVPR 2016.
- **DenseNet**: Huang, Gao, et al. "Densely connected convolutional networks." CVPR 2017.
- **Wide ResNet**: Zagoruyko, Sergey, and Nikos Komodakis. "Wide residual networks." BMVC 2016.
- **ResNeXt**: Xie, Saining, et al. "Aggregated residual transformations for deep neural networks." CVPR 2017.





#### • Part 1: Energy-Efficient Hardware for Deep Neural Networks

- Project website: <u>http://eyeriss.mit.edu</u>
- Y.-H. Chen, T. Krishna, J. Emer, V. Sze, "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," IEEE Journal of Solid State Circuits (JSSC), ISSCC Special Issue, Vol. 52, No. 1, pp. 127-138, January 2017.
- Y.-H. Chen, J. Emer, V. Sze, "Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks," International Symposium on Computer Architecture (ISCA), pp. 367-379, June 2016.
- Y.-H. Chen, T.-J. Yang, J. Emer, V. Sze, "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), June 2019.
- Eyexam: <u>https://arxiv.org/abs/1807.07928</u>
- Limitations of Existing Efficient DNN Approaches
  - Y.-H. Chen\*, T.-J. Yang\*, J. Emer, V. Sze, "Understanding the Limitations of Existing Energy-Efficient Design Approaches for Deep Neural Networks," SysML Conference, February 2018.
  - V. Sze, Y.-H. Chen, T.-J. Yang, J. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, vol. 105, no. 12, pp. 2295-2329, December 2017.
  - Hardware Architecture for Deep Neural Networks: <u>http://eyeriss.mit.edu/tutorial.html</u>





#### Transforms for processing on GPU and CPUs

- Lavin, Andrew, and Gray, Scott, "Fast Algorithms for Convolutional Neural Networks," arXiv preprint arXiv:1509.09308 (2015)
- Mathieu, Michael, Mikael Henaff, and Yann LeCun. "Fast training of convolutional networks through FFTs." arXiv preprint arXiv:1312.5851 (2013).
- Cong, Jason, and Bingjun Xiao. "Minimizing computation in convolutional neural networks." International Conference on Artificial Neural Networks. Springer International Publishing, 2014.

#### • Part 2: Co-Design of Algorithms and Hardware for Deep Neural Networks

- T.-J. Yang, Y.-H. Chen, V. Sze, "Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Energy estimation tool: <u>http://eyeriss.mit.edu/energy.html</u>
- T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications," European Conference on Computer Vision (ECCV), 2018. <u>http://netadapt.mit.edu</u>
- T.-J. Yang, V. Sze, "Design Considerations for Efficient Deep Neural Networks on Processing-in-Memory Accelerators," IEEE International Electron Devices Meeting (IEDM), Invited Paper, December 2019.
- Y. N. Wu, J. S. Emer, V. Sze, "Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs," International Conference on Computer Aided Design (ICCAD), November 2019. <u>http://accelergy.mit.edu</u>
- T.-J. Yang, Y.-H. Chen, J. Emer, V. Sze, "A Method to Estimate the Energy Consumption of Deep Neural Networks," Asilomar Conference on Signals, Systems and Computers, Invited Paper, October 2017.





#### • Reduced Precision

- Courbariaux, Matthieu, and Yoshua Bengio. "Binarynet: Training deep neural networks with weights and activations constrained to +1 or -1." arXiv preprint arXiv:1602.02830 (2016).
- Courbariaux, Matthieu, Yoshua Bengio, and Jean-Pierre David. "Binaryconnect: Training deep neural networks with binary weights during propagations," NeurIPS, 2015.
- Rastegari, Mohammad, et al. "XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks," ECCV, 2016.
- Judd, Patrick, Jorge Albericio, and Andreas Moshovos. "Stripes: Bit-serial deep neural network computing." IEEE Computer Architecture Letters (2016).
- Lee, Edward H., et al. "LogNet: Energy-efficient neural networks using logarithmic computation." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.
- Han, Song, Huizi Mao, and William J. Dally. "Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding," ICLR, 2016.



#### • Exploit Sparsity

- LeCun, Yann, et al. "Optimal brain damage," NIPS, 1989.
- Han, Song, et al. "Learning both weights and connections for efficient neural network," NeurIPS, 2015.
- T.-J. Yang, Y.-H. Chen, V. Sze, "Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Parashar, Angshuman, et al. "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks." ISCA, 2017
- Han, Song, et al. "EIE: efficient inference engine on compressed deep neural network," ISCA, 2016.
- Y.-H. Chen, T.-J. Yang, J. Emer, V. Sze, "Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices," IEEE Journal on Emerging and Selected Topics in Circuits and Systems (JETCAS), June 2019.

#### Manual Network Design

- Network in Network: Lin, Min, Qiang Chen, and Shuicheng Yan. "Network in network." ICLR 2014
- **MobileNet**: Howard, Andrew G., et al. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017).
- **ShuffleNet**: Zhang, Xiangyu, et al. "ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices." arXiv preprint arXiv:1707.01083 (2017).
- Yu, Fisher, et al. "Deep layer aggregation." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.







#### Neural Architecture Search

- Learning Network Architecture: Zoph, Barret, et al. "Learning Transferable Architectures for Scalable Image Recognition." arXiv preprint arXiv:1707.07012 (2017).
- T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications," European Conference on Computer Vision (ECCV), 2018. <u>http://netadapt.mit.edu</u>

#### • Hardware In the Loop

- T.-J. Yang, Y.-H. Chen, V. Sze, "Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning," IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Energy estimation tool: <u>http://eyeriss.mit.edu/energy.html</u>
- T.-J. Yang, A. Howard, B. Chen, X. Zhang, A. Go, V. Sze, H. Adam, "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications," European Conference on Computer Vision (ECCV), 2018. <u>http://netadapt.mit.edu</u>
- T.-J. Yang, V. Sze, "Design Considerations for Efficient Deep Neural Networks on Processing-in-Memory Accelerators," IEEE International Electron Devices Meeting (IEDM), Invited Paper, December 2019.
- Y. N. Wu, J. S. Emer, V. Sze, "Accelergy: An Architecture-Level Energy Estimation Methodology for Accelerator Designs," International Conference on Computer Aided Design (ICCAD), November 2019. <u>http://accelergy.mit.edu</u>
- T.-J. Yang, Y.-H. Chen, J. Emer, V. Sze, "A Method to Estimate the Energy Consumption of Deep Neural Networks," Asilomar Conference on Signals, Systems and Computers, Invited Paper, October 2017.





#### • Part 3: Applications Beyond Image Classification

- D. Wofk\*, F. Ma\*, T.-J. Yang, S. Karaman, V. Sze, "FastDepth: Fast Monocular Depth Estimation on Embedded Systems," IEEE International Conference on Robotics and Automation (ICRA), May 2019. <u>http://fastdepth.mit.edu/</u>
- T.-J. Yang, M. D. Collins, Y. Zhu, J.-J. Hwang, T. Liu, X. Zhang, V. Sze, G. Papandreou, L.-C. Chen, "DeeperLab: Single-Shot Image Parser," arXiv, February 2019. <u>http://deeperlab.mit.edu</u>
- Z. Zhang, V. Sze, "FAST: A Framework to Accelerate Super-Resolution Processing on Compressed Videos," CVPR Workshop on New Trends in Image Restoration and Enhancement, July 2017. <u>www.rle.mit.edu/eems/fast</u>

