Quantized Guided Pruning for Efficient Hardware Implementations of Convolutional Neural Networks
Ghouthi Boukli Hacene, Vincent Gripon, Matthieu Arzel, Nicolas Farrugia, Yoshua Bengio

To cite this version:
hal-01965304

HAL Id: hal-01965304
https://hal.archives-ouvertes.fr/hal-01965304
Preprint submitted on 25 Dec 2018


Ghouthi Boukli Hacene\textsuperscript{1,2}, Vincent Gripon\textsuperscript{1,2}, Matthieu Arzel\textsuperscript{2}, Nicolas Farrugia\textsuperscript{2} and Yoshua Bengio\textsuperscript{1}
\textsuperscript{1} Université de Montréal, MILA, \textsuperscript{2}IMT Atlantique, Lab-STICC

(Paper submitted to ISCAS 2019 on October 31, 2018)

Abstract—Convolutional Neural Networks (CNNs) are state-of-the-art in numerous computer vision tasks such as object classification and detection. However, the large number of parameters they contain leads to a high computational complexity and strongly limits their usability in budget-constrained devices such as embedded devices. In this paper, we propose a combination of a new pruning technique and a quantization scheme that effectively reduces the complexity and memory usage of the convolutional layers of CNNs, and replaces the complex convolutional operation with a low-cost multiplexer. We perform experiments on the CIFAR10, CIFAR100 and SVHN datasets and show that the proposed method achieves almost state-of-the-art accuracy, while drastically reducing the computational and memory footprints. We also propose an efficient hardware architecture to accelerate CNN operations. The proposed hardware architecture is pipelined and accommodates multiple layers working at the same time to speed up the inference process.

Index Terms—convolutional neural networks, pruning, weight binarization, hardware implementation.

I. INTRODUCTION AND RELATED WORK

For the past few years, Deep Neural Networks (DNNs), and especially Convolutional Neural Networks (CNNs) \cite{1}, have received considerable attention thanks to their remarkable accuracy in computer vision tasks \cite{2}–\cite{5} such as classification and detection \cite{6}. However, their need for intensive computation and memory has meant that most implementations are based on GPUs, while providing efficient hardware implementations is still a very active field of research. As a result, deploying CNNs in embedded systems remains complex and is a potential roadblock for many applications.

To address this limitation, multiple approaches have been proposed in the literature. For example, the authors of \cite{7} and \cite{8} propose to reduce the memory footprint of DNNs by compressing their weights. In these two approaches, the obtained DNN is not retrained after compression, leading to potentially sub-optimal solutions. Following this lead, the authors of \cite{9} have shown that training and compressing weights simultaneously can lead to better accuracy. In the same vein, in \cite{10} and \cite{11}, the authors propose to binarize the weights during the learning phase. As a result, the obtained DNN contains only weights whose values are 1 or –1, while suffering from a very limited drop in accuracy compared to state-of-the-art solutions. These works were later improved in \cite{12}, where the authors proposed to add a scaling factor per layer and per kernel, as a means to offer better diversity to binary networks, with almost no impact on memory usage or complexity. Other approaches have proposed to limit weights to three or more values (e.g., –1, 0, and 1) \cite{13}–\cite{15}. These approaches demonstrate that using slightly more bits to encode weights improves accuracy by a significant amount, but they also require much more memory and other hardware components to compute non-binary operations. In recent works \cite{16}, \cite{17}, authors have proposed to binarize both weights and activations in CNNs, resulting in potentially very efficient hardware implementations. However, these methods end up with a significantly lower accuracy than state-of-the-art ones.

Once a binary neural network has been trained, efficient implementations can benefit from simplified operations. For example, multiplications in binary neural networks can be replaced by simple low-cost multiplexers. Efficient solutions have been proposed in \cite{18}, \cite{19}. However, even binary neural networks still require significant computational power and memory. These solutions also typically lead to a considerable latency, which may be an issue for some applications. In another line of work, authors have been aiming at reducing the number of trainable parameters in DNNs. In \cite{7}, \cite{20}, the authors successfully apply pruning techniques to fully connected layers of DNNs. However, state-of-the-art CNNs rely more and more on convolutional layers: in a typical modern architecture like ResNet18, about 99% of the connections are in convolutional layers, and thus pruning connections only in fully connected layers has almost no impact on the overall complexity and memory usage of the architecture.

In this paper we propose to combine an efficient pruning technique, which can be effectively leveraged at the implementation stage, with binary neural networks. We apply the proposed pruning technique to convolutional layers, resulting in very lightweight convolutions that can be implemented with simple multiplexers. The proposed method approaches state-of-the-art accuracy on the CIFAR10, CIFAR100 and SVHN datasets. We also propose a hardware implementation which uses very few resources and little computational power. This implementation can compute more than one layer at a time and uses a simple multiplexer to perform convolutional operations. As such, it provides significantly smaller latency than existing counterparts.

The outline of the paper is as follows: in Section II we describe the proposed method and report experiments on the CIFAR10, CIFAR100 and SVHN datasets. In Section III we present the proposed hardware architecture and report implementation results. Section IV concludes.

II. PROPOSED METHOD

In this section, we introduce a method to efficiently prune connections in convolutional layers. Note that pruning may have two different aims: a) to decrease the number of parameters to be trained in a given architecture, thus reducing the risk of overfitting, and b) to decrease the memory usage and complexity of a given architecture, so that it becomes lighter to implement in a budget-restricted configuration. Although some authors (e.g. [20]) argue that they achieve both, we believe this is questionable, as the reduction in the number of trainable weights they obtain is offset by the increased complexity of identifying which connections are kept and which are removed.

The proposed method has the twofold benefit of decreasing the number of parameters to be trained while keeping a simple deterministic way of identifying which connections are kept and which are discarded.

A. Details of the Proposed Method

Let us denote by \( x \) (resp. \( y \) or \( w \)) the input (resp. output or kernel) tensor of a given convolutional layer. We index \( x \) (resp. \( y \)) using three indices \( i, j, k \) (resp. \( \ell \)), where \( 0 \leq i < i_{\text{max}} \) and \( 0 \leq j < j_{\text{max}} \) correspond to 2D coordinates and \( 0 \leq k < k_{\text{max}} \) (resp. \( 0 \leq \ell < \ell_{\text{max}} \)) indexes a feature map. Similarly, we index \( w \) using four indices: \( 0 \leq \iota < \iota_{\text{max}} \) and \( 0 \leq \lambda < \lambda_{\text{max}} \) correspond to 2D coordinates within the kernel, and \( k \) and \( \ell \) are as introduced above. So, an element of the input tensor is written \( x_{i,j,k} \), an element of the kernel tensor is written \( w_{\iota,\lambda,k,\ell} \) and an element of the output tensor is written \( y_{i,j,\ell} \).

The idea we propose consists of removing most of the connections in each slice \( w_{\cdot,\cdot,k,\ell} \) of the kernel tensor. The connections to be kept are chosen according to a deterministic rule that is agnostic of the initialization and of the training dataset. Namely, we choose to only keep the connections \( w_{\iota,\lambda,k,\ell} \) for which

\[
\lambda + \iota\,\lambda_{\text{max}} \equiv k \pmod{\iota_{\text{max}} \lambda_{\text{max}}}. \tag{1}
\]

When considering \( 3 \times 3 \) kernels for example, we remove 8 out of 9 connections, i.e., about 89% of the connections in the convolutional layer. The reason for choosing this scheme is quite straightforward: we want diversity in the connections we keep, to be sure our kernels do not simplify to a simple \( 1 \times 1 \) convolution and still cover the initial kernel to its full extent (provided that at least 9 feature maps are used).
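To make the rule concrete, the following Python sketch builds the pruning mask of one kernel slice. It assumes that the kept position is the one whose flattened index `lam + iota * lambda_max` equals `k` modulo the kernel size, where `iota` and `lam` denote the two kernel coordinates. The function names are illustrative and do not come from the original implementation.

```python
def kept_position(k, iota_max=3, lambda_max=3):
    """Return the (iota, lam) kernel coordinate kept for input feature map k.

    Assumption: the kept position is the one whose flattened index
    lam + iota * lambda_max equals k modulo the kernel size.
    """
    flat = k % (iota_max * lambda_max)
    # divmod inverts the flattening: flat = iota * lambda_max + lam
    return divmod(flat, lambda_max)


def slice_mask(k, iota_max=3, lambda_max=3):
    """Binary mask of one kernel slice: 1 marks the single kept connection."""
    kept = kept_position(k, iota_max, lambda_max)
    return [[1 if (iota, lam) == kept else 0 for lam in range(lambda_max)]
            for iota in range(iota_max)]


# The mask depends only on k, so it never has to be stored alongside the weights.
for k in range(9):
    print(k, kept_position(k))
```

With nine or more input feature maps, the nine kept positions together cover the whole \( 3 \times 3 \) kernel, which matches the diversity argument above.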

We then perform the training on the remaining connections, disregarding the other ones. Using this method, the convolution of each slice of the kernel tensor is replaced by a simple multiplication.
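The claim that a pruned slice reduces to a multiplication can be checked numerically. The sketch below is a plain-Python illustration, assuming zero padding and the cross-correlation convention commonly used in CNN frameworks. It compares a full \( 3 \times 3 \) sliding-window computation with a shifted copy of the input scaled by the single remaining weight. Both helper functions are hypothetical names introduced for this example.

```python
def conv_slice(x, w):
    """'Same' 2D cross-correlation of map x with kernel slice w (zero padding)."""
    rows, cols = len(x), len(x[0])
    kh, kw = len(w), len(w[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for a in range(kh):
                for b in range(kw):
                    ii, jj = i + a - kh // 2, j + b - kw // 2
                    if 0 <= ii < rows and 0 <= jj < cols:
                        acc += w[a][b] * x[ii][jj]
            out[i][j] = acc
    return out


def shifted_scale(x, weight, a, b, kh=3, kw=3):
    """Equivalent computation when w has one nonzero entry `weight` at (a, b):
    read a shifted input value and multiply it once."""
    rows, cols = len(x), len(x[0])
    di, dj = a - kh // 2, b - kw // 2
    return [[weight * x[i + di][j + dj]
             if 0 <= i + di < rows and 0 <= j + dj < cols else 0
             for j in range(cols)]
            for i in range(rows)]


x = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
w = [[0, 0, 0], [0, 0, 5], [0, 0, 0]]  # single kept connection at (1, 2)
print(conv_slice(x, w) == shifted_scale(x, 5, 1, 2))
```

The pruned version performs one multiplication per output value instead of nine multiply-accumulate steps.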

To further benefit from the reduced complexity of this pruning technique, we combine it with a weight binarization method. Here, we use Binary Connect (BC) [10]. Once remaining connections have been binarized, it is possible to replace the multiplication operation by a multiplexer.
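As an illustration of why a multiplexer suffices, the sketch below binarizes a weight with the deterministic sign rule and then computes the product by selecting between a value and its negation. This is a software analogy under stated assumptions (zero is mapped to +1, and all function names are hypothetical), not a description of the actual circuit.

```python
def binarize(w):
    """Deterministic sign binarization: +1 if w >= 0, else -1.

    Mapping w == 0 to +1 is an assumption of this sketch.
    """
    return 1 if w >= 0 else -1


def mux(select, in0, in1):
    """Two-input multiplexer: returns in1 when select is true, else in0."""
    return in1 if select else in0


def binary_multiply(x, w_bin):
    """Compute w_bin * x for w_bin in {+1, -1} without a multiplier:
    the weight's sign bit selects between x and -x."""
    return mux(w_bin < 0, x, -x)


for w in (-0.3, 0.7):
    print(w, binarize(w), binary_multiply(12, binarize(w)))
```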

B. Results

To evaluate the performance of our proposed method, we use the CIFAR10 vision benchmark made of tiny 32x32 images. We compare various modern CNN architectures such as Resnet [21], Wide-Resnet [22], Densenet [23], and MobileNet [24]. Note that these architectures contain \( 1 \times 1 \) and \( 3 \times 3 \) convolutional kernels only. Thus we apply the proposed method on the \( 3 \times 3 \) kernels.

As a first experiment, we aim at estimating the drop in performance caused by pruning connections. We thus randomly remove \( m \) connections per kernel slice. Figure 1 shows that the accuracy of the architecture is quite robust to this process, even when 8 out of the 9 connections in slices of \( 3 \times 3 \) kernels are removed.

We then report in Table I the results obtained using Equation (1) to remove kernel connections. Note that contrary to the previous experiment, removed connections are no longer chosen randomly but according to a deterministic scheme. As a consequence, the positions of removed connections do not have to be stored. We compare the accuracy obtained using baseline architectures, pruned ones, binarized ones, and our proposed method mixing pruning and BC. Note that BC offers a compression factor of 32 in terms of memory used, and our method roughly multiplies this factor by 9, achieving a total compression factor of almost 300. We also perform experiments on SVHN (resp. CIFAR100) on Resnet18 (resp. WideResnet-40-10) and obtain 97%/96% (resp. 80%/77%) accuracy for Full-precision/pruning+BC.
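The compression figure can be reproduced with a short calculation. The sketch below assumes 32-bit full-precision weights, 1-bit binarized weights, and one kept connection out of nine per \( 3 \times 3 \) slice; it only accounts for the pruned \( 3 \times 3 \) kernels, not for \( 1 \times 1 \) kernels or other parameters. The function name is illustrative.

```python
def compression_factor(full_bits=32, binary_bits=1, kernel_size=9, kept_per_slice=1):
    """Memory compression of 3x3 kernel weights from binarization plus pruning."""
    binarization_gain = full_bits / binary_bits   # 32x from Binary Connect
    pruning_gain = kernel_size / kept_per_slice   # 9x from keeping 1 of 9 connections
    return binarization_gain * pruning_gain


print(compression_factor())
```

The product is 32 × 9 = 288, which is the "almost 300" quoted above.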

<table>
<thead>
<tr>
<th>Architecture</th>
<th>Accuracy (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Resnet18</td>
<td>95%</td>
</tr>
<tr>
<td>Resnet34</td>
<td>94%</td>
</tr>
<tr>
<td>WideResnet-28-10</td>
<td>96%</td>
</tr>
<tr>
<td>Densenet121</td>
<td>95%</td>
</tr>
<tr>
<td>MobileNetV2</td>
<td>94%</td>
</tr>
</tbody>
</table>

Fig. 1. Evolution of accuracy as a function of the number of connections removed per kernel slice.

III. HARDWARE IMPLEMENTATION

In this section, we first introduce the hardware architecture of the proposed method, its different components, and the way they are connected. Then, we present the hardware implementation of the proposed method, applied on ResNet18, on a Field Programmable Gate Array (FPGA).
TABLE I
COMPARISON OF ACCURACY BETWEEN BASELINE ARCHITECTURES, PRUNED ONES, BINARIZED ONES, AND THE PROPOSED METHOD ON CIFAR10.

<table>
<thead>
<tr>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Full-precision</td>
<td>94.5%</td>
<td>95%</td>
<td>96%</td>
<td>95%</td>
<td>94.4%</td>
</tr>
<tr>
<td>Pruning</td>
<td>93.5%</td>
<td>93.8%</td>
<td>95%</td>
<td>94.3%</td>
<td>93.3%</td>
</tr>
<tr>
<td>BC</td>
<td>93.31%</td>
<td>93.64%</td>
<td>95.2%</td>
<td>94.5%</td>
<td>93%</td>
</tr>
<tr>
<td>Pruning + BC</td>
<td>91%</td>
<td>91.3%</td>
<td>94%</td>
<td>93%</td>
<td>91%</td>
</tr>
</tbody>
</table>

TABLE II
FPGA RESULTS FOR THE PROPOSED ARCHITECTURE ON VU13P (XCVU13P-FIGD2104-1-E).

<table>
<thead>
<tr>
<th>Layers</th>
<th>P</th>
<th>LUT</th>
<th>FF</th>
<th>BRAMs</th>
<th>Frequency</th>
<th>Processing latency</th>
<th>Processing outflow</th>
<th>Power</th>
</tr>
</thead>
<tbody>
<tr>
<td>Conv64 – 64</td>
<td>16</td>
<td>22424</td>
<td>22424</td>
<td>114</td>
<td>240 MHz</td>
<td>52 µs</td>
<td>19230 images/s</td>
<td></td>
</tr>
<tr>
<td>×Conv64 – 64</td>
<td>16</td>
<td>89746</td>
<td>75235</td>
<td>456</td>
<td>240 MHz</td>
<td>256 µs</td>
<td>19230 images/s</td>
<td></td>
</tr>
<tr>
<td>3×Conv128 – 128</td>
<td>32</td>
<td>59780</td>
<td>45024</td>
<td>171</td>
<td>240 MHz</td>
<td>154.8 µs</td>
<td>19379 images/s</td>
<td></td>
</tr>
<tr>
<td>3×Conv128 – 128</td>
<td>64</td>
<td>134090</td>
<td>102552</td>
<td>171</td>
<td>240 MHz</td>
<td>103.2 µs</td>
<td>29069 images/s</td>
<td></td>
</tr>
<tr>
<td>3×Conv256 – 256</td>
<td>64</td>
<td>74067</td>
<td>52051</td>
<td>87</td>
<td>250 MHz</td>
<td>147.3 µs</td>
<td>20366 images/s</td>
<td></td>
</tr>
<tr>
<td>3×Conv256 – 256</td>
<td>128</td>
<td>154599</td>
<td>102723</td>
<td>87</td>
<td>218 MHz</td>
<td>112.8 µs</td>
<td>26555 images/s</td>
<td></td>
</tr>
<tr>
<td>3×Conv512 – 512</td>
<td>128</td>
<td>132155</td>
<td>52151</td>
<td>45</td>
<td>208 MHz</td>
<td>177 µs</td>
<td>16949 images/s</td>
<td></td>
</tr>
</tbody>
</table>

A. Hardware Architecture

In Figure 2, we depict the proposed hardware architecture for performing convolutions, which we name a “layer block”. This architecture uses a simple low-cost multiplexer. In more detail, a layer block is made of two sub-blocks: a memory block and a processing unit.

![Fig. 2. Hardware architecture of a layer block.](image)

The memory block contains two block RAMs (BRAMs) whose content is encoded using an n-bit fixed-point representation. The first is used to store the feature maps as they are computed. Once they are all computed, the content of the first BRAM is copied to the second one, so that it becomes the input of the next layer. At the same time, the computed feature maps of another image can be stored in the first BRAM. We thus obtain a pipeline architecture, in which all implemented layers work at the same time to speed up the classification process.
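The double-buffering behaviour described above can be mimicked with a toy software model. In the sketch below, each layer block owns two buffers standing in for the two BRAMs, and images simply flow through unchanged so that only the scheduling is visible. The class and function names are hypothetical, and the layer computation itself is deliberately left out.

```python
class MemoryBlock:
    """Toy model of a two-BRAM memory block (names are hypothetical)."""

    def __init__(self):
        self.bram1 = None  # receives feature maps as they are computed
        self.bram2 = None  # stable input for the next layer

    def store(self, feature_maps):
        self.bram1 = feature_maps

    def handoff(self):
        """Copy BRAM 1 into BRAM 2 and free BRAM 1 for the next image."""
        self.bram2, self.bram1 = self.bram1, None
        return self.bram2


def run_pipeline(images, num_layers):
    """Push images through num_layers blocks; all blocks are active at each step."""
    blocks = [MemoryBlock() for _ in range(num_layers)]
    outputs = []
    # Extra empty steps at the end drain the pipeline.
    for incoming in list(images) + [None] * num_layers:
        ready = [block.handoff() for block in blocks]
        if ready[-1] is not None:
            outputs.append(ready[-1])
        blocks[0].store(incoming)
        for previous, block in zip(ready, blocks[1:]):
            block.store(previous)
    return outputs


print(run_pipeline(["img0", "img1", "img2"], num_layers=3))
```

Once the pipeline is full, one image leaves at every step even though each image spends several steps inside, which is the latency versus outflow distinction used in the hardware results.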

To avoid data overflow, we process each row of a slice of the input tensor \(X\) independently, and each slice of the kernel independently. In more detail, we copy from the first BRAM to the second one a feature subvector \(X_{i,k}^2 = \{x_{i,1,k}^2, x_{i,2,k}^2, \ldots, x_{i,R^p,k}^2\}\) made of \(R^p\) values, instead of the whole feature vector \(X_{i,k}^1 = \{x_{i,1,k}^1, x_{i,2,k}^1, \ldots, x_{i,R,k}^1\}\) made of \(R > R^p\) values (cf. Figure 2). This is to account for the border effects (padding). To simplify notations, we replace \(X_{i,k}^1\) (resp. \(X_{i,k}^2\)) by \(X^1\) (resp. \(X^2\)) in the following.

To compute a convolutional operation, kernels move along feature maps with a step which is called the stride in CNNs. In the typical case in which the stride value is 1, \(X^2\) contains either the first \(R^p\) values, the middle \(R^p\) values or the last \(R^p\) values of \(X^1\), where \(R^p = R - 2\), depending on the position of the nonzero kernel value. When the stride value is 2 (cf. Figure 4), only half of the values are copied from \(X^1\) to \(X^2\) by selecting either the odd or even values of \(j\) in \(x_{i,j,k}^1\) using multiplexers. This process can be generalised to any stride value other than 1 or 2.
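The selection of \(X^2\) from \(X^1\) can be sketched with simple indexing. The code below assumes that \(X^1\) is a padded row of length \(R\), that the kernel is 3 values wide, and that kernel column 0, 1 or 2 maps to the first, middle or last window respectively; this mapping and the function name are assumptions made for illustration.

```python
def select_subvector(x1, lam, stride=1, kernel_width=3):
    """Pick from the padded row x1 the values needed for kernel column lam.

    With stride 1 this is a contiguous window of len(x1) - 2 values;
    with stride 2 it keeps every other value, starting at offset lam.
    """
    n_out = (len(x1) - kernel_width) // stride + 1
    return [x1[lam + stride * j] for j in range(n_out)]


row = list(range(10))  # stands for a padded row with R = 10
print(select_subvector(row, 0))            # first R - 2 values
print(select_subvector(row, 1))            # middle R - 2 values
print(select_subvector(row, 2))            # last R - 2 values
print(select_subvector(row, 1, stride=2))  # odd positions only
```

Because only one kernel column is nonzero per slice, a single window has to be read per slice instead of three overlapping ones.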

![Fig. 4. Hardware architecture to copy the first BRAM contents to the second BRAM, when stride value is 2.](image)
memory block of the next layer. At the end of this process, the *Iter_done* signal is set to 1 in the processing unit block, so new data can be read from the memory block to process other feature vectors.

To achieve the computation associated with the layer block described in Figure 2, \( k_{\text{max}} j_{\text{max}} \) clock cycles (CCs) are required to copy all contents from the first BRAM to the second one, \( j_{\text{max}} k_{\text{max}} \ell_{\text{max}} / P \) CCs to compute all output feature vectors of one layer, and \( j_{\text{max}} \ell_{\text{max}} \) CCs to write all computed feature vectors into the memory block of the next layer. Thus the total number of CCs required is:

\[
CCs = j_{\text{max}} k_{\text{max}} + j_{\text{max}} k_{\text{max}} \ell_{\text{max}} / P + j_{\text{max}} \ell_{\text{max}}. \tag{2}
\]

This should be compared to [19], where the number of clock cycles becomes:

\[
CCs = 3 j_{\text{max}}^2 k_{\text{max}} \ell_{\text{max}} / P. \tag{3}
\]

Comparing the dominant terms of (2) and (3), we observe that the proposed architecture is about \( 3 j_{\text{max}} \) times faster than [19], which can be significant when \( j_{\text{max}} \) is large. For instance with the CIFAR10 dataset, at the input layer of a CNN \( j_{\text{max}} = 32 \), and thus the proposed method is about 96 times faster. In addition, it is a pipeline architecture, so it can be up to \( 3L j_{\text{max}} \) times faster, where \( L \) is the total number of layer blocks that fit in an FPGA.
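The two cycle counts are easy to evaluate side by side. The following sketch implements Equations (2) and (3) as written; the function names and the example dimensions (a 32-wide layer with 64 input and 64 output feature maps and P = 16) are chosen for illustration only.

```python
def cycles_proposed(j_max, k_max, l_max, p):
    """Equation (2): copy cycles + compute cycles + write-back cycles."""
    copy = j_max * k_max
    compute = j_max * k_max * l_max / p
    write = j_max * l_max
    return copy + compute + write


def cycles_reference(j_max, k_max, l_max, p):
    """Equation (3): cycle count given in the text for the design of [19]."""
    return 3 * j_max ** 2 * k_max * l_max / p


j_max, k_max, l_max, p = 32, 64, 64, 16
ours = cycles_proposed(j_max, k_max, l_max, p)
theirs = cycles_reference(j_max, k_max, l_max, p)
print(ours, theirs, theirs / ours)
```

For these dimensions the sketch gives 12288 cycles against 786432, a ratio of 64; the factor of \( 3 j_{\text{max}} = 96 \) is reached when the copy and write-back terms are negligible compared with the compute term. At 240 MHz, 12288 cycles take about 51.2 µs, which is close to the 52 µs reported in Table II for Conv64 – 64 with P = 16, assuming that row corresponds to these dimensions.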

Note that in the proposed architecture, \( P \) should be less than or equal to \( \ell_{\text{max}} \); otherwise, reaching full parallelism would require reading more than one vector \( X^2 \), and thus more BRAMs, resulting in a more complex architecture.

B. Hardware Results

We implemented one or a few layers of Resnet18 on a Xilinx UltraScale VU13P (xcvu13p-figd2104-1-e) FPGA. The implemented layers are arranged in a pipeline, and their functionality has been verified by comparing the output of each layer block with the one obtained by software simulation over a batch of examples. Table II shows the resources required to implement one or a few layers of Resnet18 trained on the CIFAR10 dataset for different values of \( P \). It also shows that the resulting architecture achieves a low processing latency to compute a valid output of one layer. Moreover, this processing latency increases when processing more than one layer, but the processing outflow is maintained thanks to the pipeline design.
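The relation between latency and outflow in a pipeline can be illustrated with a short calculation. The sketch below assumes that all pipelined layers take the same time, so that one image leaves the pipeline every latency / number-of-layers seconds. This is a simplifying assumption, and the function name is hypothetical.

```python
def outflow_images_per_s(latency_us, num_layers=1):
    """Images per second when num_layers equal-duration stages are pipelined."""
    stage_time_s = latency_us * 1e-6 / num_layers
    return 1.0 / stage_time_s


print(int(outflow_images_per_s(52)))        # single layer, 52 µs latency
print(int(outflow_images_per_s(154.8, 3)))  # three layers, 154.8 µs latency
print(int(outflow_images_per_s(177, 3)))    # three layers, 177 µs latency
```

Under this assumption the sketch reproduces several of the outflow values of Table II (for example 19230 and 19379 images/s), though not every row, so the real stage durations are probably not all identical.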

IV. CONCLUSION

In this paper, we proposed to extend pruning techniques to convolutional layers of DNNs. We introduced a deterministic pruning scheme that can be taken advantage of in implementations. We combined pruning with weight binarization to reduce both complexity and memory usage and showed the resulting neural network is still able to reach very high accuracy.

We implemented the proposed scheme using a low-cost hardware architecture in which complex convolution operations are replaced by simple multiplexers. As a result, we were able to implement a considerable part of some complex CNNs such as Resnet18. Moreover, the architecture only consumes a few watts, making it a good solution for embedded applications. Future work will extend this method to all kernel shapes, and propose a low-cost hardware architecture to handle other challenging vision datasets such as ImageNet.
REFERENCES

applied to document recognition,” Proceedings of the IEEE, vol. 86,
[2] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally,
and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer
parameters and < 0.5 mb model size,” arXiv preprint arXiv:1602.07360,
2016.
thinking the inception architecture for computer vision,” arXiv preprint
no. 7553, p. 436, 2015.
deep neural networks with pruning, trained quantization and huffman
of deep convolutional neural networks for fast and low power mobile
deep neural networks with binary weights during propagations,” in
networks on the fly,” in International Conference on Artificial Neural
Imagenet classification using binary convolutional neural networks,” in
Training low bitwidth convolutional neural networks with low bitwidth
[16] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Ben-
weights and activations constrained to +1 or -1,” arXiv preprint
low power convolutional neural network accelerator based on binary
neural networks with binary weights,” in Circuits and Systems (ISCAS),
[20] ———, “Sparsely-connected neural networks: towards efficient vlsi implementation of deep neural networks,” arXiv preprint arXiv:1611.01427, 2016.
recognition,” in Proceedings of the IEEE conference on computer vision
“Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
2018, pp. 4510–4520.