Scalable High-Performance Architecture for Convolutional Ternary Neural Networks on FPGA
Adrien Prost-Boucle, Alban Bourge, Frédéric Pétrot, Hande Alemdar, Nicholas Caldwell, Vincent Leroy

To cite this version:
Adrien Prost-Boucle, Alban Bourge, Frédéric Pétrot, Hande Alemdar, Nicholas Caldwell, et al.. Scalable High-Performance Architecture for Convolutional Ternary Neural Networks on FPGA. Field Programmable Logic and Applications (FPL), 2017 27th International Conference on, Sep 2017, Gent, Belgium. 2017. <hal-01563763>

HAL Id: hal-01563763
https://hal.archives-ouvertes.fr/hal-01563763
Submitted on 18 Jul 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Distributed under a Creative Commons CC0 - Public Domain Dedication| 4.0 International License

Adrien Prost-Boucle*, Alban Bourge*, Frédéric Pétrot*
Hande Alemdar†, Nicholas Caldwell‡, Vincent Leroy‡
*Univ. Grenoble Alpes, CNRS, Grenoble INP, TIMA, F-38000 Grenoble, France
†Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France
Email: name.surname@univ-grenoble-alpes.fr

Abstract—Thanks to their excellent performance on typical artificial intelligence problems, deep neural networks have drawn a lot of interest lately. However, this comes at the cost of large computational needs and high power consumption. Benefiting from high precision at acceptable hardware cost on these difficult problems is a challenge. To address it, we advocate the use of ternary neural networks (TNN) that, when properly trained, can reach results close to the state of the art using floating-point arithmetic. We present a highly versatile FPGA-friendly architecture for TNN in which we can vary both the number of bits of the input data and the level of parallelism at synthesis time, making it possible to trade throughput for hardware resources and power consumption. To demonstrate the efficiency of our proposal, we implement high-complexity convolutional neural networks on the Xilinx Virtex-7 VC709 FPGA board. While reaching a better accuracy than comparable designs, we can target either high throughput or low power. We measure a throughput up to 27 000 fps at ≈7 W or up to 8.36 TMAC/s at ≈13 W.

I. INTRODUCTION

Artificial neural networks (ANN) have had a long and complicated history [1], but there is now a consensus that networks with many layers and many neurons per layer achieve the best results on a broad range of artificial intelligence tasks. For the record, an ANN needs to be trained on many instances of a problem to determine synaptic weights (a process known as learning) that are later used to solve a new instance of the same problem (a process called inference). Thanks to advances in integration technology and computer architecture, full software solutions to both learning and inference can run at high performance on general-purpose processors and graphics processing units. However, solving problems like clustering or classification is also of great interest on systems [...]

To achieve better results on ever-increasing data sets, ANN have grown wider and deeper, leading to a large number of neurons. As a consequence, many floating-point multiplications are needed to realize the multiply-accumulate operations that compute the activations of the neurons. For instance, the implementation of ConvNet [2], a relatively classical convolutional neural network for synthetic vision, requires 435 million multiply-accumulate operations for VGA-size images when using a 7 × 7 convolution kernel. Our goal in this paper is to demonstrate that it is possible to design deep neural network (DNN) architectures that feature high throughput and low power while producing inference results that are close to the state of the art.

There are two strategies to lower power consumption: limit the amount of data to work on by using application-specific preprocessing and/or perform the computations with a low number of bits or a small set of values [3]. The extreme solution is to binarize all synaptic weights and activations, which eliminates multiplications once and for all, as proposed by [4], [5]. The loss of precision of these approaches is however quite high.

In this paper, we propose an FPGA architecture for ternary neural networks as a trade-off between inference accuracy, hardware resource utilization and power consumption.

II. WHY TERNARY NEURAL NETWORKS?

There have been many recent works aiming at better utilizing the hardware resources to implement DNN. We can classify these works into two main categories.

The first one still uses floating-point arithmetic, but limits the number of possible values to a subset. Among representative works, [3] presents an ASIC architecture where they aim at limiting greatly the number and size of external accesses to memory. To that end, they prune the redundant connections and share weights by adequate training. As a consequence their design works on sparse matrices and uses small indexes to access arrays of weights. The approach proposed in [6] is somewhat different: at training time, it uses only specific combinations of activation and weight values. The pre-computed multiplication results are stored in lookup tables. They use ternary content addressable memory (TCAM) to ensure a fast and low-power search, but are therefore limited to ASIC.

The second category, the one we also follow, limits the number of bits for weight and/or activation values. The approach is not new, and for example [7] is an early paper studying the quality of the result as a function of the number of bits used to code the weights. Using normal arithmetic, it is now accepted that using 6 or 7 bits does not significantly degrade the result of inference [8]. However, more extreme solutions have been advocated lately: binary [8], [9] (BNN) or ternary [10], [11] (TNN) encodings of the weights. Based on these training-focused works, several hardware implementations have been proposed.

We first quickly review the most recent works focusing on binary weights. Andri et al. [12] implement a systolic array which processes each layer sequentially. Their ASIC implementation of BNN achieves state-of-the-art area and energy efficiency, but because of the use of binary weights and activations, their error rate in applications is still fairly high. Umuroglu et al. [13] focus on high-throughput FPGA implementations, and achieve the highest reported throughput on a single FPGA chip. But again, binary trained networks are limited in use by the accuracy they achieve.

The very first work on TNN that we found is [14], a relatively imprecise short abstract from 1988 in which the authors study the adaptation of learning algorithms for ternary weights. Although interesting from a historical perspective, the paper leaves many details out. The first VLSI implementation of a TNN is reported in [15]. It also presents a training approach. However, the results are very difficult to interpret and to compare with current technologies and the state of the art. Since then, we have not found any detailed description of a hardware architecture for TNN while, according to [11], TNN can be fairly accurate when trained with the appropriate technique. To the best of our knowledge, [11] is the only recent work that makes reference to hardware implementations of TNN, FPGA and ASIC, but a) there is no detail whatsoever regarding the hardware architecture and its implementation, and b) they use ternary data obtained after a preprocessing step as primary input.

Given the accuracy achievable with TNN, we believe they are a sweet-spot between resource usage and precision, and that they have a place in applications for which power vs. accuracy trade-offs have to be made, for instance autonomous embedded devices or large-scale datacenters. The rest of this paper is dedicated to the presentation and evaluation of our TNN architecture and its FPGA implementation.

III. PROPOSED ARCHITECTURE

We now detail our TNN architecture. We first give an overview of the architecture in terms of functional blocks, and then we describe each block thoroughly. We also detail how parallelism and area efficiency can be achieved by a proper pipeline design.

A. Overview

The large-scale VGG-like ternary CNN pipeline introduced in [8] is used as an example throughout this paper. The architecture of our CNN is the following:

\[
\begin{aligned}
& (2 \times nCV_{3 \times 3}) - MP_{2 \times 2} - (2 \times 2nCV_{3 \times 3}) - MP_{2 \times 2} - {} \\
& (2 \times 4nCV_{3 \times 3}) - MP_{2 \times 2} - (2 \times 8nFC) - 100FC
\end{aligned}
\tag{1}
\]

where \( mCV_{3 \times 3} \) represents a Convolution Layer (CVL) with \( m \) neurons, window size \( 3 \times 3 \), step 1 and one pixel of zero padding, \( 2 \times mCV_{3 \times 3} \) is a pair of \( mCV_{3 \times 3} \) layers in series, \( MP_{2 \times 2} \) is max-pooling with window size \( 2 \times 2 \), step 2 and no padding, and \( mFC \) is a fully-connected neuron layer with \( m \) neurons.

Figure 1 depicts how the VGG-like network is decomposed into layers connected in the form of a pipeline. All layers are independent from each other: they have their own state machine and image data is streamed through FIFO interfaces. For each layer type, we design a hardware block (hand-written VHDL) that is reused in the pipeline with different parameters. See the top of Figure 1 for a simplified schematic view of the implementation of each block type. Four main layer types are used: Sliding Window Layer (SWL), Neuron Layer (NL), Ternarization Layer (TL) and Max Pooling Layer (MPL). The TL exists because of the constraints introduced by the ternary-only activations: the result of a neuron is a scalar and it must be ternarized before being sent as input to the next NL. The pipeline begins with two CVL. A CVL comprises an SWL, an NL and a TL. Following Equation 1, these two CVL are followed by an MPL, two more CVL, another MPL, two more CVL and a final MPL. The pipeline ends with three fully-connected NL.

Throughout this paper, we use two networks with different dimensions, NN-64 and NN-128, respectively with \( n = \{64, 128\} \). For instance according to Equation 1, the fifth NL has \( 2n \) neurons, hence the second SWL frame size has dimension \( z = 128 \) for NN-64 and \( z = 256 \) for NN-128. Our network NN-128 actually has the same architecture as the network used in [8], except that we increased the number of output neurons from 10 to 100 to enable the use of datasets with up to 100 classes.
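
As a rough cross-check of these dimensions, the sketch below counts the multiply-accumulate operations per frame implied by Equation 1, assuming 32 × 32 inputs with 3 color channels (the image size used in the experiments). The function name and the layer bookkeeping are illustrative assumptions, not part of the hardware design.

```python
def macs_per_frame(n, width=32, height=32, channels=3, classes=100):
    """Count multiply-accumulates per frame for the network of Equation 1.

    Assumes 3x3 convolutions with step 1 and one pixel of zero padding
    (output size equals input size) and 2x2 max-pooling with step 2.
    """
    total = 0
    fan_in = channels
    for mult in (1, 2, 4):                # three groups of two convolution layers
        for _ in range(2):
            neurons = mult * n
            total += width * height * 9 * fan_in * neurons
            fan_in = neurons
        width, height = width // 2, height // 2   # max-pooling halves each side
    fan_in *= width * height              # flatten before the fully-connected layers
    for neurons in (8 * n, 8 * n, classes):
        total += fan_in * neurons
        fan_in = neurons
    return total

print(macs_per_frame(64))    # 155174912
print(macs_per_frame(128))   # 617058304
```

Under these assumptions the totals are about 155 million operations for NN-64 and about 617 million for NN-128, roughly a 4× ratio between the two networks.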

Data channels between two blocks are implemented as small FIFOs (not shown for clarity) to compensate for the pipeline depth of the blocks and simplify their control flow. In our baseline implementation, each of these FIFOs transfers at most one activation value per clock cycle. This directly dictates the design throughput, in frames per second (fps). To increase throughput, parallelism is introduced in the layers that are responsible for the bottleneck. The corresponding FIFOs are widened and more activation values are transferred per clock cycle. How parallelism is implemented depends on the layer type and is explained in the following sections.

B. Sliding Window Layer

The Sliding Window Layer (SWL) is used for feeding either an NL or an MPL. To do so, it is highly configurable, partly at synthesis time and partly at run time. Basically, the SWL acts as a buffer that stores enough data for the following layer and delivers it in the order that layer requires. To save memory resources, an SWL stores only a fraction of a frame and works as a ping-pong buffer. Both input and output sides can be parallelized to increase throughput. The desired output parallelism defines the number of RAM blocks that are read in parallel.

Figure 2 gives an example for an SWL configured with dimensions \( 20 \times 8 \times 8 \) and window dimensions \( 3 \times 3 \). Here, output parallelism is \( P_o = 4 \) using RAM1 to RAM4 and input parallelism is \( P_i = 2 \). Only 2 clock cycles are necessary to read an entire \( z \)-dimension of the window. Window size and step within all three directions can be set at run time. Input data is written in the following fashion: \( z \), then \( x \) and finally \( y \) dimension. One should note that a \( P_i \) up to 4 could have been achieved thanks to the 4 RAM blocks that can be written at the same cycle.
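
The read-out cost of a window follows from the frame depth and the output parallelism. The sketch below is a minimal model, assuming that each clock cycle delivers \( P_o \) consecutive values along \( z \) and that the example frame has \( z = 8 \); the function names are hypothetical.

```python
import math

def cycles_per_z_read(z, p_out):
    """Cycles needed to read all z values at one (x, y) window position."""
    return math.ceil(z / p_out)

def cycles_per_window(z, win_x, win_y, p_out):
    """Cycles needed to read a complete win_x * win_y * z window."""
    return win_x * win_y * cycles_per_z_read(z, p_out)

print(cycles_per_z_read(8, 4))         # 2
print(cycles_per_window(8, 3, 3, 4))   # 18
```

With \( z = 8 \) and \( P_o = 4 \), the model gives the 2 cycles per \( z \)-dimension read mentioned above.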

C. Neuron Layer

One Neuron Layer (NL) is composed of neurons and a memory holding the ternary weights. At each clock cycle, one or more input activation values are broadcast to all neurons. Simultaneously, the weights are read from the memory and distributed to the appropriate neurons. All neurons then perform one multiply-accumulate operation on an internal register.

To extract the values out of the neuron accumulators and to allow a compact placement in the FPGA, neurons are interconnected to form a scan chain, as proposed in [16]. This scan chain has its own registers, which makes it possible to copy accumulator values and extract them while the accumulators process the data of the next frame.

The architectural interest of using ternary values is illustrated in Figure 3, which details the internal structure of the proposed neuron. The ternary multiplier requires two LUT4, which fit into a single LUT6 on a Xilinx FPGA. Hence the neuron mainly consists of its two registers and associated ALUs and multiplexers. The ALUs and multiplexers are small enough to fit in the same slice as their associated registers. The neurons may use more than one slice in height, depending on the accumulator width that is required in the layer. For resource efficiency, control signals are generated by a finite state machine (FSM) that is shared among all the neurons of a layer, in a SIMD fashion. As an example, in the FPGA used in our experiments (433200 LUT6 and 3600 DSP cores), it is possible to implement 5 to 6× more 12-bit ternary neurons (19 LUT6 each) than neurons based on DSP cores. Weight sparsity is intentionally not exploited. Indeed, compared to our highly optimized FSM and neurons, the amount of per-neuron control needed to handle sparsity would come at an excessive cost in area and power.
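
To make the benefit of ternary weights concrete, the following software sketch models one neuron. It only illustrates the arithmetic (the multiplication reduces to a selection between the activation, its negation and zero); it is not the VHDL implementation, and the function names are hypothetical.

```python
def ternary_mac(acc, activation, weight):
    """One multiply-accumulate step with a ternary weight (-1, 0 or +1).

    The 'multiplication' reduces to selecting +activation, -activation or 0.
    """
    if weight not in (-1, 0, 1):
        raise ValueError("weight must be ternary")
    if weight == 1:
        return acc + activation
    if weight == -1:
        return acc - activation
    return acc

def neuron_output(activations, weights):
    """Accumulate one frame of inputs for a single neuron."""
    acc = 0
    for a, w in zip(activations, weights):
        acc = ternary_mac(acc, a, w)
    return acc

print(neuron_output([1, -1, 0, 1], [1, 1, -1, -1]))  # -1
```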

Parallelism levels for input and output of the \( NL \) (\( P_i \) and \( P_o \)) are independent. Figure 4 illustrates how parallelism is applied with \( P_i = 4 \) and \( P_o = 2 \). On the input side, each neuron receives \( P_i \) weight and activation values, which are added up with a small adder tree before the accumulator. On the output side, all neurons are separated into \( P_o \) groups according to their index modulo \( P_o \), each group having its own scan chain.
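
Both forms of parallelism can be mimicked in a few lines of software. In this sketch, the sum over \( P_i \) products stands in for the small adder tree, and neuron indices are distributed over \( P_o \) scan chains by index modulo \( P_o \); the helper names are assumptions made for illustration.

```python
def scan_chain_groups(num_neurons, p_out):
    """Split neuron indices into p_out scan chains by index modulo p_out."""
    return [list(range(g, num_neurons, p_out)) for g in range(p_out)]

def parallel_mac(acc, activations, weights):
    """Add P_i ternary products to the accumulator in one step.

    The partial sum stands in for the small adder tree placed before
    the accumulator.
    """
    partial = sum(a * w for a, w in zip(activations, weights))
    return acc + partial

print(scan_chain_groups(8, 2))                         # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(parallel_mac(0, [1, -1, 1, 0], [1, 1, -1, 1]))   # -1
```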

The weight memories are implemented either using RAM blocks or using the LUTRAM functionality of certain LUTs of the FPGA. For each neuron layer, the memory implementation is selected according to an arbitrary heuristic about the number of weights per neuron (\( W \)): LUTRAM is used when \( W \leq 64 \) or when \( W \leq 128 \) and \( P_i \geq 4 \) or when \( W \leq 256 \) and \( P_i \geq 16 \), otherwise RAM blocks are used. This balances well the usage of LUTs for memory and for the neuron logic, while reserving RAM blocks for the deepest memories of the network.
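
The selection rule above can be written as a small predicate. This is a direct transcription of the stated thresholds; the function name and the returned labels are illustrative.

```python
def weight_memory_kind(weights_per_neuron, p_in):
    """Return 'LUTRAM' or 'BRAM' following the heuristic stated in the text."""
    w = weights_per_neuron
    if w <= 64 or (w <= 128 and p_in >= 4) or (w <= 256 and p_in >= 16):
        return "LUTRAM"
    return "BRAM"

print(weight_memory_kind(27, 1))     # LUTRAM
print(weight_memory_kind(128, 2))    # BRAM
print(weight_memory_kind(256, 16))   # LUTRAM
print(weight_memory_kind(576, 64))   # BRAM
```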

D. Ternarization Layer

The Ternarization Layer (\( TL \)) is used to convert to ternary the scalar values produced by an \( NL \). It acts as the activation function, as is common in the literature. It is composed of one memory storing threshold values, two comparators and a multiplexer. There are two threshold values for each neuron of the previous neuron layer. Ternarization is performed as follows: if the result of a neuron is less than the first threshold, then the output is \(-1\); if it is higher than the second threshold, then the output is \(+1\); and between the two thresholds the output is \(0\). Specifications of this step are closely linked to our training methodology, which is described in [11].
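
The two-threshold rule can be sketched as follows. Mapping values that are exactly equal to a threshold to 0 is an assumption of this sketch, and the threshold values are made up for the example.

```python
def ternarize(value, low, high):
    """Map a neuron result to -1, 0 or +1 using two thresholds (low <= high)."""
    if value < low:
        return -1
    if value > high:
        return 1
    return 0

thresholds = [(-3, 2), (0, 5)]   # hypothetical (low, high) pair per neuron
results = [-7, 5]
print([ternarize(v, lo, hi) for v, (lo, hi) in zip(results, thresholds)])  # [-1, 0]
```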

Parallelism level \( P \) is obtained by instantiating the ternarization block \( P \) times while sharing the same FSM. Instance index \( i \) handles data index \( i \) modulo \( P \). Input and output parallelism levels of this layer are identical. In particular, this parallelism level is identical to the output parallelism of the previous \( NL \).

E. Max Pooling Layer

The Max Pooling Layer (\( MPL \)) is used to find the maximum activation within a window. The window values are sent by an \( SWL \). Like other layers, both the input and the output can be parallelized. Figure 5 depicts an \( MPL \) with \( P_i = 4 \) and \( P_o = 2 \). This parallelism configuration is for illustration only. Actually, it is not particularly well suited to the typical case of a \( 2 \times 2 \) sliding window feeding the \( MPL \) (see Figure 1 for the \( SWL \) configuration). There are 4 data items (\( 2 \times 2 \) window) at each 2-bit input (\( 0 \mod 4 \) to \( 3 \mod 4 \)) and the number of cycles to empty the scan chain is 2 thanks to the output parallelism. Hence in this configuration, the scan chain is stalled half of the time. This output parallelism value is best used when the \( P_i \) reaches 8.
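
The stall argument can be checked with a small model. It assumes that each of the \( P_i \) inputs needs one cycle per window item to receive a window, and that the \( P_i \) results then leave the scan chain \( P_o \) at a time; this is one reading of the description above, and the function names are hypothetical.

```python
def max_pool(window_values):
    """Maximum of the ternary activations in one window."""
    return max(window_values)

def scan_chain_busy_fraction(p_in, p_out, window_items):
    """Fraction of cycles during which the output scan chain moves data.

    Assumes each of the p_in inputs needs window_items cycles to receive
    one window, and that the p_in results then leave p_out per cycle.
    """
    fill_cycles = window_items
    drain_cycles = -(-p_in // p_out)   # ceiling division
    return min(1.0, drain_cycles / fill_cycles)

print(max_pool([-1, 0, 0, 1]))             # 1
print(scan_chain_busy_fraction(4, 2, 4))   # 0.5
print(scan_chain_busy_fraction(8, 2, 4))   # 1.0
```

Under this model the scan chain is busy half of the time for \( P_i = 4 \), \( P_o = 2 \), and fully busy once \( P_i \) reaches 8.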

IV. EXPERIMENTS AND RESULTS

In this section, we first describe our experimental setup. Then we present some characteristics of our TNN namely area vs. throughput and power consumption.

A. Experimental Setup

Experiments are performed on a VC709 FPGA board directly plugged in a PCI-Express slot of a workstation. This board is equipped with the Xilinx FPGA XC7VX690T. We highlight that the on-board 8 GB RAM is unused because only on-chip memory is used in our designs.
The RIFFA framework [17] is used as PCI-Express communication interface with the computer. All designs run at 250 MHz clock frequency, which is the frequency generated by the embedded PCI-Express endpoint. Power measurements are performed with on-board PMBus, an I²C bus dedicated to that purpose. We added a custom and independent UART-to-I²C bridge to our designs to read power values without interfering with PCI-Express data transfers.

The networks used are NN-64 and NN-128. Experiments are conducted with the well-known datasets CIFAR10 [18], GTSRB [19] and SVHN [20]. In all datasets, all images have a size of 32 × 32 pixels on 3 color channels. As in the related works, we pre-process the images before sending them to the FPGA: Global Contrast Normalization followed by LeCun LCN is used for datasets GTSRB and SVHN, and normalization and ZCA whitening are used for dataset CIFAR10. We use 8-bit quantization per pixel and per color channel.

Demonstration materials (bitstreams and communication software) are available on the team webpage¹. They make it possible to reproduce the results of this paper.

B. Area and throughput

In our base design, all layers of the network receive and transmit at most one ternary value per clock cycle. In particular, inside neuron layers, all neurons perform in parallel one multiply-accumulate operation per clock cycle. To increase the design throughput (in frames/second), we parallelize the layers that are the bottleneck of the architecture. Table I presents the parallelism levels applied to neurons and to max pooling layers. Corresponding parallelism levels on ternarization and window layers result directly. Layers are named NL1 to NL3 for Neuron Layers and MPL1 to MPL3 for Max Pooling Layers. The unit used is the number of values transferred per clock cycle in the input and output ports of these layers. The input and output of the pipeline are not bottlenecks and are not parallelized.

We highlight that gaining a 2 × speedup does not necessarily require 2 × more hardware resources. This is illustrated in Figures 6 and 7 (resources not on the same scale for clarity). Indeed, all layers have different execution times, and only the most demanding layers are parallelized, which may be only a small fraction of the design resources. Similarly, inside the neurons themselves, only the size of the adder tree increases, not the entire neuron.

Without parallelism, all weight memory banks are implemented within dedicated block RAM (BRAM) resources of the FPGA. Adding parallelism increases the amount of data that these memory banks have to produce at each clock cycle. Even though the storage needs (in bits) do not increase, the BRAM requirements increase to implement the required output width. To avoid BRAM shortage, the LUTRAM resources are used when parallelism is high enough and frame size is low enough in the neuron layers.

There are two limits to the achievable parallelism with our design. The first is due to hardware resources: NN-128 with parallelism level 128 does not fit in our FPGA. The second is due to our parallelization technique for the SWLs: the maximum parallelism level achievable is the dimension of the image in the \( z \) dimension, which is directly related to the number of neurons. This is why with our current design, the maximum parallelism level for NN-64 is 64. Otherwise, the available hardware resources would allow parallelism level 128, with corresponding throughput 54k fps.
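
A first-order throughput estimate can be derived from the clock frequency and the work of the slowest layer. The sketch below assumes that the second neuron layer is the bottleneck, that it consumes \( 32 \times 32 \times 9 \times n \) activation values per frame, and that it accepts as many values per cycle as the parallelism level. These are simplifying assumptions and the function name is hypothetical.

```python
def estimated_fps(n, par_level, f_clk=250e6, width=32, height=32):
    """Estimate frames per second when the second neuron layer is the bottleneck.

    Assumes that layer consumes width * height * 9 * n activation values
    per frame and accepts par_level values per clock cycle.
    """
    values_per_frame = width * height * 9 * n
    cycles_per_frame = values_per_frame / par_level
    return f_clk / cycles_per_frame

print(round(estimated_fps(64, 64)))    # 27127
print(round(estimated_fps(128, 64)))   # 13563
print(round(estimated_fps(64, 128)))   # 54253
```

These estimates are within about 0.5% of the measured 27043 fps and 13526 fps, and agree with the 54k fps figure quoted for a hypothetical parallelism level of 128 on NN-64.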

C. Power consumption

We measure the power consumption using the core 1 V power supply rail of the FPGA, since our designs fit entirely inside the FPGA. To confirm that this way of doing the measure is correct, we also monitored the global board power (all supply rails measured through PMBus) and observed that it is higher than the core 1 V rail by a rather constant 4.5 W for all designs.

Results are presented in Figure 8. The figures related to NN-64 and NN-128 form two very distinct groups. For each of the two NN sizes, the power consumption is approximately a linear function of the design throughput, and varies little between datasets. This is due to our FPGA implementation not exploiting dataset sparsity (zero-activations and zero-weights) to reduce design activity. When neuron weights are packed inside large RAM banks, it is not possible to inhibit RAM read for selected positions.

TABLE I: Neural network parallelization

<table>
<thead>
<tr>
<th rowspan="2">NN size</th>
<th rowspan="2">Par. level</th>
<th rowspan="2">Parallelism per layer (in/out)</th>
<th colspan="2">FPGA usage</th>
<th rowspan="2">Throughput (fps)</th>
</tr>
<tr>
<th>LUT (logic)</th>
<th>LUTRAM</th>
</tr>
</thead>
</table>

¹ http://tima.imag.fr/sls/research-projects/tm-fpga-implementation/

TABLE II: Comparison with related works

<table>
<thead>
<tr>
<th>Dataset</th>
<th>Authors</th>
<th>Plat. name</th>
<th>NN Arch.</th>
<th>Input quant.</th>
<th>Weight quant.</th>
<th>%err</th>
<th>fps</th>
<th>Power (W)</th>
<th>fps/W</th>
<th>Target</th>
</tr>
</thead>
<tbody>
<tr>
<td>CIFAR10</td>
<td>This work</td>
<td>NN-64</td>
<td>3 ch, 8 bits</td>
<td>2 bits</td>
<td>13.29</td>
<td>27043</td>
<td>6.80</td>
<td>3976</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>This work</td>
<td>NN-128</td>
<td>3 ch, 8 bits</td>
<td>2 bits</td>
<td>10.61</td>
<td>13526</td>
<td>13.64</td>
<td>992</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>[13] FINN</td>
<td>NN-64</td>
<td>24 bits</td>
<td>1 bit</td>
<td>19.90</td>
<td>21900</td>
<td>3.6</td>
<td>6080</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>[21] BCNN</td>
<td>NN-64</td>
<td>3 ch, 6 bits</td>
<td>1 bit</td>
<td>11.32</td>
<td>168</td>
<td>4.7</td>
<td>35.8</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>[22] BNN</td>
<td>NN-64</td>
<td>3 ch, 2 bits</td>
<td>1 bit</td>
<td>12.20</td>
<td>27043</td>
<td>5.10</td>
<td>19.90</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>[13] FINN</td>
<td>NN-64</td>
<td>24 bits</td>
<td>1 bit</td>
<td>5.10</td>
<td>21900</td>
<td>3.6</td>
<td>6080</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>[11] TNN</td>
<td>NN-64</td>
<td>12 ch, 2 bits</td>
<td>2 bits</td>
<td>2.73</td>
<td>3390</td>
<td>4.8</td>
<td>709</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>[11] TNN</td>
<td>NN-128</td>
<td>3 ch, 8 bits</td>
<td>2 bits</td>
<td>1.05</td>
<td>27043</td>
<td>6.40</td>
<td>4073</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>[13] FINN</td>
<td>NN-64</td>
<td>24 bits</td>
<td>1 bit</td>
<td>11.32</td>
<td>168</td>
<td>4.7</td>
<td>35.8</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>[11] TNN</td>
<td>NN-128</td>
<td>12 ch, 2 bits</td>
<td>2 bits</td>
<td>0.80</td>
<td>13526</td>
<td>12.57</td>
<td>1076</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>[11] TNN</td>
<td>NN-128</td>
<td>12 ch, 2 bits</td>
<td>2 bits</td>
<td>0.98</td>
<td>1695</td>
<td>9.58</td>
<td>178</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

![Fig. 8: Power versus framerate](image)

For low throughput (less than 4000 fps), the NN-128 power is roughly 2× the NN-64 power. This was expected because most of the time only layer NL2 runs while other layers wait, so the idle power dominates. As throughput increases, this ratio grows to about 3×. For one frame, NN-128 performs 618 million multiply-accumulate operations while NN-64 performs 155 million, so a ratio of 4× rather than 3× could have been expected. The lower ratio is due to our implementation of parallelism inside neurons: for a given framerate, the neurons in NN-128 have to be parallelized twice as much as in NN-64, but this impacts only the leaves of the adder tree and not the accumulator and scan chain.

We extrapolate the idle power as the intersection with the y-axis, and we obtain around 1.8 W for NN-64 and 4 W for NN-128. The possible sources are the static power, the clocks and the IPs related to the PCI-Express interface. According to the synthesis tool estimations (Vivado 2015.3) for parallelism level 64, the highest contributors to the idle power are the clocks (2.2 W and 3.2 W) and the static power (0.5 W and 0.6 W). The power related to PCI-Express can be high (up to 2.8 W) but, assuming that it scales according to the ratio of the maximum throughput, that communication interface should account for only 0.3 W. So Vivado values are over-estimated for NN-64, but rather close for NN-128.

V. RELATED WORK

Our results are presented in Table II, along with results from other FPGA-based works using the same datasets. For each dataset and neural network architecture, our platform provides the highest throughput and a better accuracy than the related works.

Umuroglu et al. [13] propose FINN, an FPGA implementation of NN-64 with binary weights on the ZC706 board. Their design can classify the datasets CIFAR10 and SVHN at a throughput of 21,900 fps. Our raw processing speed (frames per second) is higher than theirs, but this is exclusively due to our higher frequency (250 MHz instead of 200 MHz). This difference may be linked to our use of a higher-performance FPGA technology (Virtex-7 where they use Zynq-7000). Our power efficiency (throughput per watt) is lower than theirs by 33–37%. Indeed, using ternary weights makes neuron operations a little more complex than with binary weights, which contributes to power. However, we are using a higher-performance FPGA technology and a PCI-Express communication interface, and our FPGA is largely oversized for NN-64. Actually, our design would fit in their board. The strongest difference is accuracy: our error rate is only 13.29% for CIFAR10 and 2.40% for SVHN, whereas theirs is 19.9% and 5.1%, respectively. Given how difficult it usually is to reduce the error rate, this shows the superiority of ternary over binary-only weights.

Zhao et al. [22] propose BNN, an FPGA implementation of NN-128 with binary weights on the ZedBoard. They focus on accelerating the neural network in a very small FPGA, so the resulting throughput is very low. Not all weights fit in the FPGA, so they have to be transferred from the external on-board DDR memory. Moreover, the FPGA is so small that the power consumption of the on-chip processor subsystem dominates. Their accuracy is also notably lower than ours with a neural network of identical size. Overall, the resulting efficiency and accuracy are still interesting for an accelerator attached to small on-chip processors, but they are far from those of related works that focus on performance per watt and/or accuracy.

Li et al. [21] propose BCNN, an FPGA implementation of NN-128 with binary weights on the xc7vx690t FPGA (the same chip as on our VC709 board). Their design is not entirely binary: they use 2-bit weights in the first neuron layer. They use Vivado HLS to generate their design and their results are the Vivado-estimated execution times and power consumption. The communication interface is unknown. Their HLS-generated design runs at 90 MHz and processes the dataset CIFAR10 at 6218 fps. With our hand-written RTL, our frequency is higher (250 MHz) and our design is notably faster. Even if they used our frequency, their throughput would be only 17272 fps, which is still much lower than that of our platform. Their design is presented as a 7.663 TOP/s accelerator (with multiply and accumulate counted as different operations). We have 4.19 TMAC/s for NN-64 and 8.36 TMAC/s for NN-128, hence respectively 8.38 TOP/s and 16.72 TOP/s, an improvement of respectively 9.4% and 118% over their design.
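
The figures in this comparison can be re-derived with a few lines of arithmetic. The sketch assumes that throughput scales linearly with clock frequency and that one multiply-accumulate counts as two operations; the helper names are illustrative.

```python
def scaled_fps(fps, f_from_mhz, f_to_mhz):
    """Scale a throughput linearly with clock frequency."""
    return fps * f_to_mhz / f_from_mhz

def tops_from_tmacs(tmacs):
    """Count each multiply-accumulate as two operations."""
    return 2 * tmacs

def improvement_percent(ours, theirs):
    """Relative improvement of 'ours' over 'theirs', in percent."""
    return 100 * (ours / theirs - 1)

print(round(scaled_fps(6218, 90, 250)))                              # 17272
print(round(improvement_percent(tops_from_tmacs(4.19), 7.663), 1))   # 9.4
print(round(improvement_percent(tops_from_tmacs(8.36), 7.663), 1))   # 118.2
```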

In [11], Alemdar et al. propose ternary neural networks similar to our NN-64 and NN-128, but with 12-channel ternary input. Only a fixed parallelism of 8× is used in their FPGA version and the power consumption is based on pessimistic estimations. Our results bring a significant improvement over their work: our designs are more power-efficient with about 6× better throughput per watt, and error rate is lower.

VI. CONCLUSION

Thanks to their very good performance in solving inference problems when properly trained, TNN are good candidates for efficient hardware implementations. In this work, we have designed a set of blocks that can be stacked and pipelined to build arbitrarily complex convolutional neural networks making use of ternary values for weights and/or activations. The ternary nature of the network leads to significantly better inference results than binary NN, for an increase in resource usage and power affordable in many applications. The resulting designs feature high density, high throughput and low power. With no impact on accuracy, parallelism levels can be tuned to span a broad range of power-area-throughput trade-offs.

ACKNOWLEDGMENT

This project is being funded in part by Grenoble Alpes Métropole through the Nano2017 Esprit project. The authors would like to thank Olivier Menut from ST Microelectronics for his valuable inputs and continuous support.

REFERENCES