Multimedia Systems Group
Professor Vivienne Sze
Computer vision, which automatically extracts information from visual data (e.g., object detection, recognition, etc.), promises to enable revolutionary technologies such as wearable vision for the blind, self-driving cars, drone navigation, robotics, etc. In many of these applications, local processing near the image sensor, rather than in the cloud, is desirable due to latency, security/privacy, and communication bandwidth concerns. However, computer vision algorithms tend to be computationally complex, making local processing difficult as the amount of energy available is limited by battery capacity.
The Energy-Efficient Multimedia Systems Group aims to drastically reduce the energy consumption of real-time visual data processing so that computer vision can fully realize its potential in a wide range of embedded vision applications. To meet these demands, our group develops energy-efficient design techniques that cross the traditional research boundaries between signal processing algorithms and hardware (including circuits, architectures and systems).
Current research topics
- Energy-Efficient Computer Vision, Machine Learning and Autonomous Navigation
- Accelerating Super Resolution
- Next-Generation Video Coding
Example Research Projects
Energy-Efficient Deep Neural Networks
Deep neural networks (DNNs) set the foundation for modern AI, and have enabled applications such as object recognition (e.g., automatic photo tagging in Facebook), speech recognition (e.g., Apple Siri), autonomous vehicles, and even strategy planning (e.g., Google DeepMind AlphaGo). While DNNs deliver state-of-the-art accuracy on these applications, they require significant computation resources due to the size of the networks (e.g. hundreds of megabytes for filter weights storage and 30k-600k operations per input pixel). Our goal is to efficiently process these large high-dimensional networks on small, embedded hardware.
Data movement is the dominant source of energy consumption for DNNs due to the high dimensional data. In this project, we developed a framework to generate energy-efficient data flows that minimize data movement. Our architecture consists of a spatial array composed of processing engines (PE), each with local storage; inter PE communication enables regions of PEs to share data. We developed an energy-efficient dataflow, called row stationary, that minimizes data accesses from large expensive memories (DRAM and global buffer), by maximizing data reuse from small low-cost memories (local storage in PE and inter-PE). It exploits all forms of data reuse available in the DNNs including convolutional, filter and image reuse to deliver 1.4 – 2.5x lower energy consumption than other data flows.
We developed a spatial array hardware accelerator, named Eyeriss, to support our row stationary dataflow. In addition to reducing data movement, we also exploit data statistics in two ways to further reduce energy consumption: (1) we reduce the accelerator energy by 2x using data gating to skip reads and multiplications for zero values; (2) we reduce DRAM bandwidth by up to 2x using run-length compression. The Eyeriss chip was designed in 65nm CMOS, and has been integrated into a system that demonstrates real-time 1000-class image classification at below one-third of a Watt, is over 10x more energy efficient than existing mobile GPUs. In addition, Eyeriss can be reconfigured to support varying filter shapes across different layers within the DNN and across different DNN, while still delivering high throughput at high energy-efficiency.
To further reduce energy consumption, we also developed methods to incorporate energy-efficiency into the algorithm development. While there are many previous efforts that try to reduce the CNN model size or amount of computation, we find that they do not necessarily result in lower energy consumption, and therefore do not serve as a good metric for energy cost estimation. To close the gap between DNN design and energy consumption optimization, we propose an energy-aware pruning algorithm for DNNs that directly uses energy consumption estimation of a DNN to guide the pruning process. The energy estimation methodology uses parameters extrapolated from actual hardware measurements that target realistic battery-powered systems. With the proposed pruning method, the energy consumption of AlexNet and GoogLeNet is reduced by 3.7x and 1.6x, respectively, with less than 1% top-5 accuracy loss.
Enabling Autonomous Navigation of Miniaturized Robots
(In collaboration with Sertac Karaman)
Autonomous navigation of miniaturized robots (e.g., nano/pico aerial vehicles) is currently a grand challenge for robotics research, due to the need for processing a large amount of sensor data (e.g., camera frames) with limited on-board computational resources. In this work, we focus on the design of a visual-inertial odometry (VIO) system in which the robot estimates its ego-motion (and a landmark-based map) from on-board camera and IMU data. We argue that scaling down VIO to miniaturized platforms (without sacrificing performance) requires a paradigm shift in the design of perception algorithms, and we advocate a co-design approach in which algorithmic and hardware design choices are tightly coupled.
Towards Ubiquitous Embedded Vision
We would like to reduce the energy consumption substantially such that all cameras can be made “smart” and output meaningful information with little need for human intervention. As an intermediate benchmark, we would like to make understanding pixels as energy-efficient as compressing pixels, so that computer vision can be as ubiquitous as video compression, which is present in most cameras today. This is challenging, as computer vision often requires that the data be transformed into a much higher dimensional space than video compression, which results in more computation and data movement. For example, object detection, used in applications such as Advanced Driver Assistant Systems (ADAS), autonomous control in Unmanned Aerial Vehicles (UAV), mobile robot vision, video surveillance, and portable devices, requires an additional dimension of image scaling to detect objects of different sizes, which increases the number of pixels that must be processed.
In this project, we use joint algorithm and hardware design to reduce computational complexity by exploiting data statistics. We developed object detection algorithms that enforce sparsity in both the feature extraction from the image as well as the weights in the classifier. In addition, when detecting deformable objects using multiple classifiers to identify both the root and the different parts of an object, we only perform the parts classification on the high scoring roots. We then design hardware that exploits these various forms of sparsity for energy-efficiency, which reduces the energy consumption by 5x, enabling object detection to be as energy-efficient as video compression at < 1nJ/pixel. This is an important step towards achieving continuous mobile vision, which benefits applications such as wearable vision for the blind.
Accelerating Super Resolution with Compressed Video
Video resolutions can be limited due to the available content or limited transmission bandwidth. To improve the viewing experiences of low-resolution videos on high-resolution displays, super-resolution algorithms can be used to upsample the video while keeping the frames sharp and clear. Unfortunately, state-of-the-art super-resolution algorithms are computationally complex and cannot deliver the throughput required to support real-time upsampling of the low-resolution videos streamed from the cloud.
In this project, we develop FAST, a framework to accelerate any image based super-resolution algorithm by leveraging embedded information in compressed videos. FAST exploits the similarity between adjacent frames in a video. Given the output of a super-resolution algorithm on one frame, the technique adaptively transfers super-resolution pixels to the adjacent frames to avoid running super-resolution on those frames. The transferring process has negligible computation cost because the required formation including motion vectors, block size, and prediction residual are embedded in the compressed video for free. We show that FAST accelerates state-of-the-art super-resolution algorithms (e.g. SR-CNN) by up to an order of magnitude with acceptable quality loss up to 0.2 dB. FAST is an important step towards enabling real-time super-resolution on streamed videos for large displays.
Next-Generation Video Coding Systems
Video is perhaps the biggest of the ‘big data’ being collected and transmitted. Today, over 500 million hours of video surveillance are collected every day, over 300 million hours of video are uploaded to YouTube every hour, and over 70% of the today’s Internet traffic is used to transport video. The exponential growth of video places a significant burden on global communication networks. Next generation video compression systems must deliver not only higher coding efficiency, but also high throughput to support increasing resolutions, and low energy consumption as most video is captured on battery operated devices. We used joint design of algorithms and hardware in the development of the latest video coding standard, High Efficiency Video Coding (H.265/HEVC), which delivers 50% higher coding efficiency relative to its predecessor H.264/AVC, while at the same time increasing throughput and reducing energy consumption.
CABAC entropy coding was a well-known throughput bottleneck in H.264/AVC due to its highly serial nature with many feedback loops. We redesigned CABAC entropy coding for the H.265/HEVC standard to both increase coding efficiency, and to deliver higher throughput by reordering syntax elements and restructuring the context modeling to minimize feedback loops. We then designed hardware that exploited these features such that our H.265/HEVC CABAC accelerator achieves 4x higher throughput than the fastest existing H.264/AVC CABAC accelerators, enough for Ultra-HD 8K at 120 fps. Another advance is the use of large transforms in H.265/HEVC for higher coding efficiency. Larger transforms traditionally result in more computation and thus larger energy cost. We designed hardware that exploits the sparsity of the coefficients, such that the same energy is consumed per pixel regardless of the transform size. This approach may enable future video coding systems to use larger transforms for higher coding efficiency without impacting energy cost.
Finally, there is also a strong need for video coding tools beyond H.265/HEVC. In this project, we developed a new technique called Rotate Intra Block Copy, which utilizes the rotation invariant similarity between the patches in the same frame to improve intra block prediction, such that it provides coding gain to not only screen content, but all forms of video content. Combined with the existing predictor from HEVC, this technique gives an average of 20% reduction in residual energy. With a novel method to encode the intra motion vector, it achieves a coding gain in BD-rate of 3.4%. In practice, this technique can reduce the transmission bandwidth of the ultra-high-resolution video content.