Coarse-grained reconfigurable architectures for machine learning applications

Lead Research Organisation: University of Cambridge
Department Name: Computer Science and Technology

Abstract

Neural networks have become a state-of-the-art technique for solving problems in computer vision, achieving outstanding accuracy in large-scale image classification. Most existing convolutional neural networks are complex: they require a large number of parameters and intensive computation to achieve high accuracy on large datasets. Running large networks on embedded hardware therefore faces two major difficulties. First, the large number of parameters in a neural network requires a lot of hardware memory. Second, energy consumption is dominated by memory accesses, and fetching these parameters exceeds the energy envelope of power-sensitive embedded systems. To address these problems, the computer architecture community is currently exploring novel hardware architectures for neural network inference. Many custom hardware architectures have been proposed for neural network inference and training: various FPGA-based accelerators have recently been applied to neural networks, and there is an increasing number of ASIC designs for deep neural network inference and training. These accelerators normally use a large on-chip memory and custom computing units to calculate matrix dot-products. A custom accelerator is clearly beneficial for running neural network computations efficiently, but accessing and storing the large number of parameters remains a fundamental limit for these accelerators. This research will explore novel network compression methods and more efficient number representation systems to reduce the size and computational complexity of neural networks. The compressed neural network is then easier to execute on any network accelerator.
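
To make the memory argument concrete, the short Python sketch below estimates the parameter storage of a single convolutional layer at several weight bit-widths. The layer shape (512 filters over 256 input channels with 3x3 kernels) is purely illustrative and not taken from the project.

    # Illustrative only: parameter storage of one convolutional layer at
    # different weight bit-widths (bias terms ignored, layer shape hypothetical).

    def conv_param_count(out_channels, in_channels, kernel_h, kernel_w):
        """Number of weights in a standard convolutional layer."""
        return out_channels * in_channels * kernel_h * kernel_w

    n_weights = conv_param_count(512, 256, 3, 3)   # 1,179,648 weights

    for bits in (32, 16, 8, 4):
        megabytes = n_weights * bits / 8 / 1e6
        print(f"{bits:>2}-bit weights: {megabytes:.2f} MB")

One such layer alone already occupies several megabytes at 32-bit precision, which is why both reducing the parameter count and reducing the bits per parameter matter for embedded deployment.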

Network compression is an area of active research. Pruning and regularization are popular methods for reducing the number of parameters in a neural network. Network pruning removes unimportant connections or neurons and then retrains the resulting smaller topology; if retraining converges and test accuracy remains unchanged, the pruning process has successfully found a smaller network topology. Pruning methods can be classified as fine-grained (pruning individual weights) or coarse-grained (pruning entire filters). Regularization reduces the number of parameters by encouraging sparsity during the training phase. This research aims to combine pruning with regularization to achieve better compression results. Additionally, the research will focus on pruning in a non-deterministic manner, where previously pruned weights can recover if they later prove important.
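
As a concrete illustration of the two pruning granularities described above, here is a minimal NumPy sketch. The saliency criteria used (individual weight magnitude for fine-grained pruning, per-filter L1 norm for coarse-grained pruning) are common defaults in the literature and are assumptions here, not necessarily the criteria developed in this project.

    import numpy as np

    def prune_fine_grained(weights, sparsity=0.5):
        """Zero out the smallest-magnitude individual weights (fine-grained pruning)."""
        threshold = np.quantile(np.abs(weights).ravel(), sparsity)
        mask = np.abs(weights) > threshold
        return weights * mask, mask

    def prune_coarse_grained(conv_weights, sparsity=0.5):
        """Zero out entire filters with the smallest L1 norms (coarse-grained pruning).

        conv_weights has shape (out_channels, in_channels, kH, kW).
        """
        saliency = np.abs(conv_weights).sum(axis=(1, 2, 3))   # one L1 norm per filter
        n_prune = int(sparsity * conv_weights.shape[0])
        pruned_filters = np.argsort(saliency)[:n_prune]
        mask = np.ones_like(conv_weights)
        mask[pruned_filters] = 0.0
        return conv_weights * mask, mask

    # Usage: prune, then retrain with the mask re-applied after every weight update.
    w = np.random.randn(64, 32, 3, 3).astype(np.float32)
    w_pruned, m = prune_coarse_grained(w, sparsity=0.5)

In a deterministic scheme the mask is fixed after pruning so that removed connections stay at zero during retraining; a non-deterministic scheme, as targeted by this research, would instead allow the mask to change between iterations so that pruned weights can return.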

Reducing the bit-width of each individual parameter has a direct impact on the size of a neural network. Popular number representation systems such as fixed-point and dynamic fixed-point have been thoroughly explored and evaluated on various neural networks; both are linear quantization methods. Non-linear quantization, such as various encoding schemes, is more efficient at representing weights but suffers from higher computational complexity because it requires extra hardware encoders and decoders. Part of my future research will focus on exploring novel number representation systems that quantize in a non-linear fashion while maintaining relatively low computational cost.
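
The sketch below contrasts a linear fixed-point quantizer with one simple non-linear alternative, a logarithmic (power-of-two) quantizer, assuming NumPy arrays of weights; the bit-width choices are arbitrary and the logarithmic scheme is only one example of a non-linear code, not the representation proposed by this research.

    import numpy as np

    def quantize_fixed_point(x, total_bits=8, frac_bits=6):
        """Linear quantization: signed fixed-point with a uniform step of 2**-frac_bits."""
        step = 2.0 ** -frac_bits
        qmin = -(2 ** (total_bits - 1)) * step
        qmax = (2 ** (total_bits - 1) - 1) * step
        return np.clip(np.round(x / step) * step, qmin, qmax)

    def quantize_log2(x, exp_bits=4):
        """Non-linear quantization: round each weight to the nearest signed power of two."""
        sign = np.sign(x)
        exponent = np.round(np.log2(np.abs(x) + 1e-12))
        exponent = np.clip(exponent, -(2 ** (exp_bits - 1)), 2 ** (exp_bits - 1) - 1)
        return sign * 2.0 ** exponent

    w = 0.1 * np.random.randn(10000).astype(np.float32)
    print("fixed-point error:", np.mean(np.abs(w - quantize_fixed_point(w))))
    print("log2 error:       ", np.mean(np.abs(w - quantize_log2(w))))

Dynamic fixed-point keeps the linear step size but chooses the number of fractional bits per layer from the observed value range, whereas the logarithmic code spaces its levels non-uniformly and so represents small weights with fewer bits at the cost of extra encoding and decoding logic.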

Publications


Studentship Projects

Project Reference  Relationship  Related To    Start       End         Student Name
EP/N509620/1                                   01/10/2016  30/09/2022
1941039            Studentship   EP/N509620/1  01/10/2017  31/03/2021  Yiren Zhao
 
Description The results of the work span both algorithmic and hardware aspects. The inference speed of DNNs, whether on user devices or in the cloud, directly affects the quality of service. Recent years have thus seen a surge of interest in minimizing the memory and compute costs of DNN inference. Pruning algorithms compress DNNs by setting individual weights (fine-grained pruning) or groups of weights (coarse-grained pruning) to zero, thus removing connections or neurons from the models. Quantization methods reduce the number of bits required to represent each value, and thus provide further memory, bandwidth and compute savings. I would like to highlight two algorithms I've designed for CNN compression: a coarse-grained pruning algorithm for commodity hardware and a novel quantization scheme for custom accelerators. On the hardware side, I have developed a tool (Tomato) for auto-mapping compressed CNNs to reconfigurable hardware devices.
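As an illustration of why coarse-grained pruning suits commodity hardware (the resulting layers stay dense and simply become smaller), the hypothetical NumPy sketch below physically removes pruned filters together with the matching input channels of the following layer; it is a generic illustration, not the pruning algorithm developed in this project.

    import numpy as np

    def shrink_pruned_conv(conv_w, next_conv_w, keep):
        """Remove pruned output filters so dense hardware sees a genuinely smaller layer.

        conv_w:      (out_ch, in_ch, kH, kW) weights of the pruned layer
        next_conv_w: (out_ch2, out_ch, kH, kW) weights of the following layer
        keep:        indices of the filters retained by coarse-grained pruning
        """
        smaller = conv_w[keep]                 # fewer output channels in this layer
        next_smaller = next_conv_w[:, keep]    # drop the matching input channels downstream
        return smaller, next_smaller

    w1 = np.random.randn(64, 32, 3, 3).astype(np.float32)
    w2 = np.random.randn(128, 64, 3, 3).astype(np.float32)
    keep = np.arange(64)[::2]                  # suppose half the filters were pruned away
    s1, s2 = shrink_pruned_conv(w1, w2, keep)
    print(s1.shape, s2.shape)                  # (32, 32, 3, 3) (128, 32, 3, 3)

Fine-grained pruning, by contrast, leaves tensor shapes unchanged and only pays off on hardware with support for sparse computation.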
Exploitation Route The outcomes of the funding are described in academic papers. In addition, I have open-sourced a tool called Mayo that implements all of the algorithms I've proposed, and I plan to open source Tomato, which auto-maps compressed CNNs to reconfigurable hardware devices.
Sectors Digital/Communication/Information Technologies (including Software)