Beyond Gbps Turbo Decoder on Multi-Core CPUs
Adrien Cassagne, Thibaud Tonnellier, Camille Leroux, Bertrand Le Gal, Olivier Aumage, Denis Barthou

To cite this version:
Adrien Cassagne, Thibaud Tonnellier, Camille Leroux, Bertrand Le Gal, Olivier Aumage, et al.. Beyond Gbps Turbo Decoder on Multi-Core CPUs. The 10th International Symposium on Turbo Codes and Iterative Information Processing (ISTC 2016), Sep 2016, Brest, France. 10.1109/ISTC.2016.7593092 . hal-01363980

HAL Id: hal-01363980
https://hal.archives-ouvertes.fr/hal-01363980
Submitted on 13 Sep 2016


Adrien Cassagne*†, Thibaud Tonnellier*, Camille Leroux*, Bertrand Le Gal*, Olivier Aumage† and Denis Barthou†

*IMS Lab, Bordeaux INP, France
†Inria / Labri, Univ. Bordeaux, INP, France

Abstract—This paper presents a high-throughput implementation of a portable software turbo decoder. The code is optimized for traditional multi-core CPUs (such as x86) and is based on the Enhanced max-log-MAP turbo decoding variant. The code follows the LTE-Advanced specification. The high performance stems from an inter-frame SIMD strategy combined with a fixed-point representation. Our results show that the proposed multi-core CPU implementation of turbo-decoders is a competitive alternative to GPU implementations in terms of throughput and energy efficiency. On a high-end processor, our software turbo-decoder exceeds 1 Gbps information throughput for all rate-1/3 LTE codes with \( K < 4096 \).

I. INTRODUCTION

Turbo codes [1] are widely used as the channel coding component in digital communication standards, such as the LTE wireless specification [2]. Dedicated hardware architectures of turbo-decoders are usually mapped on custom silicon in order to reach high energy efficiency and high throughput [3]–[6]. In [6], a turbo-decoder ASIC is demonstrated to reach 1.01 Gbps while consuming 0.7 nJ per decoded bit (\( K = 6144 \) and 6 iterations). However, dedicated hardware turbo-decoders lack flexibility. This will be especially true in future 5G mobile networks, where baseband processing will be virtualized and implemented on centralized cloud platforms [7], [8]. Software implementations will then bring flexibility and short development cycles to the channel decoder design, at the cost of lower throughput and higher energy consumption. In such a context, the channel coding functions must be analyzed and optimized to be efficiently mapped on general purpose processors [7], [8]. In the case of turbo coding, some work has been initiated to devise efficient software implementations of turbo decoders, mostly focusing on the LTE standard.

In [9]–[18], turbo decoders were implemented on GPU targets to benefit from their computing power in order to comply with the throughputs required by LTE. This was made possible by exploiting the parallelism within the turbo decoding process (intra-frame parallelism). An alternative is to process several codewords on distinct computation resources (inter-frame parallelism). For instance, in [15], a throughput of 122.8 Mbps was reached for a code dimension \( K = 6144 \) and 6 decoding iterations on a GPU device. One should notice that this throughput is approximately one order of magnitude lower than that of the dedicated hardware implementation in [6], while the power consumption is one to two orders of magnitude higher. Indeed, despite the large amount of parallelism in a GPU, it is difficult to feed every processing unit with data, and a non-negligible amount of time is spent in memory accesses. This leads to an inefficient use of the hardware resources and to high energy consumption.

An intermediate solution between dedicated hardware and a GPU is the use of general purpose processors (GPP). Multi-core devices provide high-performance computation capabilities while consuming noticeably less power. Thanks to a large set of processor cores, it becomes possible to implement computation-intensive applications such as channel decoding with a lower energy consumption than on GPU targets. In [16], [19], [20], software turbo decoders are implemented on GPP targets with intra-frame parallelism, similar to hardware-oriented strategies. Unlike these works, we propose to investigate the exclusive use of inter-frame parallelism. A generic and portable software turbo decoder has been designed and ported to several GPP targets. Experimental results show that inter-frame parallelism allows a more efficient use of CPU resources and that our software turbo-decoder outperforms existing implementations in terms of throughput and energy efficiency. Moreover, it exceeds 1 Gbps information throughput on a high-end CPU, making multi-core CPUs a compelling alternative to GPUs for channel decoding in cloud-based Radio Access Networks (RAN) [7], [8].

The remainder of this paper is organized as follows. Section II presents the turbo code decoding algorithm that was implemented in this work. Section III shows the benefit of inter-frame parallelism for multi-core implementations. Section IV details the optimized implementation of the turbo-decoder. Section V presents experiments and a comparison with related works in the field.

II. OVERVIEW OF THE TURBO DECODING PROCESS

The turbo-decoding process is an iterative process in which two soft-input soft-output (SISO) decoders exchange extrinsic information. Each SISO decoder uses the channel information and a priori extrinsic information to compute a posteriori extrinsic information. The a posteriori information becomes the a priori information for the other SISO decoder and is exchanged via an interleaver/deinterleaver.

In turbo coding, the two component codes are convolutional codes. The associated decoding modules perform the BCJR, or forward-backward, algorithm [21], which is optimal for the maximum a posteriori (MAP) decoding of convolutional codes. In order to calculate the extrinsic information for a bit, a BCJR SISO decoder first computes the probability that a trellis transition occurred during the encoding process. The branch metrics associated with states $s_i^k$ and $s_j^{k+1}$ are computed as:

$$\gamma(s_i^k, s_j^{k+1}) = 0.5(L_{sys}^k + L_{a}^k)u^k + 0.5\,L_{p}^{k}p^k. \tag{1}$$

Here, $L_{sys}^k$ and $L_{a}^k$ are the systematic channel LLR and the a priori LLR for the $k^{th}$ trellis section, respectively, and $u^k, p^k \in \{-1, +1\}$ are the systematic and parity symbols labeling the branch. In addition, the parity LLRs for the $k^{th}$ trellis step are $L_{p}^k = L_{p0}^k$ for MAP decoder 0 and $L_{p}^k = L_{p1}^k$ for MAP decoder 1. We do not need to evaluate the branch metric $\gamma(s_i^k, s_j^{k+1})$ for all 16 possible branches, as there are only four different branch metrics: $\gamma_0^k = 0.5(L_{sys}^k + L_{a}^k + L_{p}^k)$, $\gamma_1^k = 0.5(L_{sys}^k + L_{a}^k - L_{p}^k)$, $-\gamma_0^k$, and $-\gamma_1^k$. After that, the SISO decoder computes forward and backward recursions over the trellis representation of the convolutional code. In this work, we use the Enhanced max-log-MAP algorithm [22], [23]. For each state $j$ of section $k$ of the trellis, the forward ($\alpha$) and backward ($\beta$) metrics are computed as follows:

$$\alpha_j^{k+1} = \max_{i \in \mathbb{F}} \{\alpha_i^k + \gamma(s_i^k, s_j^{k+1})\} \tag{2}$$

$$\beta_j^{k} = \max_{i \in \mathbb{F}} \{\beta_i^{k+1} + \gamma(s_j^k, s_i^{k+1})\} \tag{3}$$

Then, the extrinsic information for each bit at position $k$ is:

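$$\begin{aligned} L_e^k = {} & \max_{\{s^k, s^{k+1}\} \in \mathbb{U}^{+1}} \{\alpha_i^k + \beta_j^{k+1} + \gamma(s_i^k, s_j^{k+1})\} \\ & - \max_{\{s^k, s^{k+1}\} \in \mathbb{U}^{-1}} \{\alpha_i^k + \beta_j^{k+1} + \gamma(s_i^k, s_j^{k+1})\} \end{aligned} \tag{4}$$

where $\mathbb{U}^{+1}$ and $\mathbb{U}^{-1}$ are the sets of trellis transitions associated with $u^k = +1$ and $u^k = -1$, respectively. Finally, $L_e^k$ is scaled by a fixed factor of 0.75.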
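As a minimal sketch of the branch-metric step and of the final scaling, assuming floating-point LLRs and assuming that the two remaining branch metrics are the negations of $\gamma_0^k$ and $\gamma_1^k$, one could write the following. The names `BranchMetrics`, `compute_gamma` and `scale_extrinsic` are hypothetical and are not taken from the authors' code.

```cpp
// Hypothetical helper type: the two branch metrics stored per trellis section.
struct BranchMetrics {
    float g0;  // gamma_0 = 0.5 * (L_sys + L_a + L_p)
    float g1;  // gamma_1 = 0.5 * (L_sys + L_a - L_p)
};

// Computes the two distinct branch metrics of one trellis section.
// Assumption: the other two metrics are -g0 and -g1, so they need no storage.
constexpr BranchMetrics compute_gamma(float l_sys, float l_a, float l_p) {
    return {0.5f * (l_sys + l_a + l_p), 0.5f * (l_sys + l_a - l_p)};
}

// Enhanced max-log-MAP: the extrinsic LLR is scaled by the fixed factor 0.75.
constexpr float scale_extrinsic(float l_e) { return 0.75f * l_e; }
```

Only two values are computed per section, which matches the two $\gamma_i$ metrics that the decoder stores per trellis section.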

III. PARALLELISM ANALYSIS

a) Intra-frame versus inter-frame parallelism: A turbo decoder is in charge of decoding a large set of frames. Two strategies are then possible to speed up the decoding process. i) Intra-frame parallelism: the decoder exploits the parallelism within the turbo-decoding process by executing concurrent tasks during the decoding of one frame. ii) Inter-frame parallelism: several frames are decoded simultaneously.

From the perspective of a hardware implementation, the intra-frame approach is efficient [24] because the area overhead resulting from parallelization is lower than the speedup. On the contrary, the inter-frame strategy is inefficient, since it duplicates multiple hardware turbo-decoders. The resulting speedup comes at a high cost in terms of area overhead.

From the perspective of a software implementation, the issue is different. The algorithm is executed on a programmable, non-modifiable architecture. The degree of freedom lies in the mapping of the different parallelizable tasks onto the parallel units of the processor. Modern multi-core processors support Single Program Multiple Data (SPMD) execution, and each core includes Single Instruction Multiple Data (SIMD) units. The objective is then to identify the parallelization strategy suitable for both the SIMD and SPMD programming models. In the literature, intra-frame parallelism is often mapped onto SIMD units while inter-frame parallelization is usually kept for multi-threaded approaches (SPMD). In [16], [20], multiple trellis-state computations are performed in parallel in the SIMD units. In [9]–[18], [20], the decoded frame is split into sub-blocks that are processed in parallel in the SIMD units. An alternative approach is to process both SISO decoders in parallel, but it requires additional computations for synchronization and/or degrades the error-correction performance [24]. However, for all these approaches, a part of the computation of the BCJR decoder remains sequential, bounding the speedup below the capabilities of the SIMD units. Inter-frame parallelism has been proposed in [9], [10], [16], [20]. Multiple codewords are decoded in parallel, which improves the memory access regularity and the usage rate of the SIMD units. The speedup is no longer bounded by the sequential parts, which are all removed, but this comes at the expense of an increased memory footprint and latency.

In this work, we focus on the inter-frame parallelization and show that the use of this approach allows some register-reuse optimizations that are not possible in the intra-frame strategy.

b) Inter-frame parallelism on multi-core CPUs: The contribution of this work is to propose an efficient mapping of multiple frames onto the CPU SIMD units (inter-frame strategy): the decoding of $M$ frames is vectorized. Before the decoding process can be launched, this approach requires: (a) buffering a set of $M$ frames and (b) reordering the input LLRs so that the SIMD code works on aligned memory transactions (see Fig. 1). Similarly, a reverse reordering step has to be performed at the end of the decoding process. These reordering operations are expensive, but they make the complete decoding process very regular and efficient for SIMD parallelization. Moreover, reordering is applied only once, independently of the number of decoding iterations.
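The reordering step can be illustrated with a small sketch. It assumes a frame-major input buffer and produces a layout in which the $M$ values sharing the same index $k$ are contiguous, so that one SIMD load fetches one LLR per frame. The exact layout of Fig. 1 is not reproduced here, and the names `reorder` and `reorder_back` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Frame-major input:   in[f * K + k]   (frame f, LLR index k)
// SIMD-friendly output: out[k * M + f] (the M values of index k are contiguous)
std::vector<std::int8_t> reorder(const std::vector<std::int8_t>& in,
                                 std::size_t M, std::size_t K) {
    assert(in.size() == M * K);
    std::vector<std::int8_t> out(M * K);
    for (std::size_t f = 0; f < M; ++f)
        for (std::size_t k = 0; k < K; ++k)
            out[k * M + f] = in[f * K + k];
    return out;
}

// Reverse reordering, applied once to the decoder output.
std::vector<std::int8_t> reorder_back(const std::vector<std::int8_t>& in,
                                      std::size_t M, std::size_t K) {
    assert(in.size() == M * K);
    std::vector<std::int8_t> out(M * K);
    for (std::size_t f = 0; f < M; ++f)
        for (std::size_t k = 0; k < K; ++k)
            out[f * K + k] = in[k * M + f];
    return out;
}
```

Both functions are plain transpositions; their cost is paid once per batch of $M$ frames, whatever the number of decoding iterations.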

**Algorithm 1 Standard BCJR implementation**

1: for all frames do $\triangleright$ Sequential loop
2:   for $k = 0$; $k < K$; $k = k + 1$ do $\triangleright$ Parallel loop
3:     $\gamma^k \leftarrow$ computeGamma($L_{sys}^k, L_{p}^{k}, L_e^k$)
4:   $\alpha^0 \leftarrow$ initAlpha()
5:   for $k = 1$; $k < K$; $k = k + 1$ do $\triangleright$ Sequential loop
6:     $\alpha^k \leftarrow$ computeAlpha($\alpha^{k-1}, \gamma^{k-1}$)
7:   $\beta^{K-1} \leftarrow$ initBeta()
8:   for $k = K - 2$; $k \geq 0$; $k = k - 1$ do $\triangleright$ Sequential loop
9:     $\beta^{k} \leftarrow$ computeBeta($\beta^{k+1}, \gamma^{k}$)
10:   for $k = 0$; $k < K$; $k = k + 1$ do $\triangleright$ Parallel loop
11:     $L_e^k \leftarrow$ computeExtrinsics($\alpha^k, \beta^k, \gamma^k$)

In the proposed implementation, the inter-frame parallelism is used to fill the SIMD units of the CPU cores. Algorithm 1 illustrates the traditional implementation of the BCJR (used for the intra-frame vectorization). The inter-frame strategy makes the outer loop over the frames parallel (through vectors). This means that all computations inside this loop operate on SIMD vectors instead of scalars, and the inner loops can be turned into sequential loops on SIMD vectors. This gives the opportunity for memory optimizations through loop fusion: the initial four inner loops are merged into two loops. Algorithm 2 presents this loop fusion optimization. It makes possible the scalar promotion of $\beta_j$ (no longer an array), since it can be directly reused from the CPU registers. In this version, the SIMD units are always kept busy.
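The effect of the loop fusion described above can be sketched on a toy example. The code below is a scalar illustration on a hypothetical 2-state accumulator trellis, not the 8-state LTE trellis and not the authors' code. All names are assumptions, metrics are doubled to stay in integers, and the output is the a posteriori value (the extrinsic would additionally subtract the systematic and a priori terms). In an inter-frame decoder, each `int` would be a SIMD vector holding one value per frame.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Pair = std::array<int, 2>;  // toy trellis: next = s ^ b, parity bit = next
constexpr int NEG = -1000000;     // stands in for "minus infinity"

inline Pair make_gamma(int ls, int la, int lp) { return {ls + la + lp, ls + la - lp}; }

// Metric of the branch leaving state s with bit b (bit 0 -> +1, bit 1 -> -1).
inline int branch(const Pair& g, int s, int b) {
    const bool parity_plus = ((s ^ b) == 0);
    if (b == 0) return parity_plus ? g[0] : g[1];
    return parity_plus ? -g[1] : -g[0];
}
inline Pair step_alpha(const Pair& a, const Pair& g) {
    Pair n;
    for (int j = 0; j < 2; ++j)  // predecessors of j: s = j (b = 0), s = 1 - j (b = 1)
        n[j] = std::max(a[j] + branch(g, j, 0), a[1 - j] + branch(g, 1 - j, 1));
    return n;
}
inline Pair step_beta(const Pair& b, const Pair& g) {
    Pair n;
    for (int s = 0; s < 2; ++s)  // successors of s: s (b = 0), 1 - s (b = 1)
        n[s] = std::max(b[s] + branch(g, s, 0), b[1 - s] + branch(g, s, 1));
    return n;
}
// a: alpha at section k, b: beta at section k + 1.
inline int soft_output(const Pair& a, const Pair& b, const Pair& g) {
    const int plus = std::max(a[0] + branch(g, 0, 0) + b[0], a[1] + branch(g, 1, 0) + b[1]);
    const int minus = std::max(a[0] + branch(g, 0, 1) + b[1], a[1] + branch(g, 1, 1) + b[0]);
    return plus - minus;
}

// Four separate loops: gamma, alpha, beta, output. beta is a full array.
std::vector<int> decode_unfused(const std::vector<int>& ls, const std::vector<int>& la,
                                const std::vector<int>& lp) {
    const std::size_t K = ls.size();
    assert(la.size() == K && lp.size() == K);
    std::vector<Pair> g(K), alpha(K + 1), beta(K + 1);
    std::vector<int> out(K);
    for (std::size_t k = 0; k < K; ++k) g[k] = make_gamma(ls[k], la[k], lp[k]);
    alpha[0] = {0, NEG};
    for (std::size_t k = 0; k < K; ++k) alpha[k + 1] = step_alpha(alpha[k], g[k]);
    beta[K] = {0, 0};
    for (std::size_t k = K; k-- > 0;) beta[k] = step_beta(beta[k + 1], g[k]);
    for (std::size_t k = 0; k < K; ++k) out[k] = soft_output(alpha[k], beta[k + 1], g[k]);
    return out;
}

// Two fused loops: gamma + alpha going forward, beta + output going backward.
// beta is a local variable (register candidate) and is never stored in an array.
std::vector<int> decode_fused(const std::vector<int>& ls, const std::vector<int>& la,
                              const std::vector<int>& lp) {
    const std::size_t K = ls.size();
    assert(la.size() == K && lp.size() == K);
    std::vector<Pair> g(K), alpha(K + 1);
    std::vector<int> out(K);
    alpha[0] = {0, NEG};
    for (std::size_t k = 0; k < K; ++k) {
        g[k] = make_gamma(ls[k], la[k], lp[k]);
        alpha[k + 1] = step_alpha(alpha[k], g[k]);
    }
    Pair beta = {0, 0};
    for (std::size_t k = K; k-- > 0;) {
        out[k] = soft_output(alpha[k], beta, g[k]);
        beta = step_beta(beta, g[k]);
    }
    return out;
}
```

Both functions return the same values; the fused version simply never materializes the `beta` array, which is the scalar promotion mentioned in the text.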

On a multi-core processor, each core decodes $M$ frames using its own SIMD unit. When $T$ threads are activated, a total of $M \times T$ frames are therefore decoded simultaneously with the inter-frame strategy. Theoretically, this multi-threaded parallelization provides an acceleration of up to a factor $T$ with $T$ cores. A large memory footprint that exceeds the L3 cache capacity may reduce the effective speedup, as shown in Section V.

IV. IMPLEMENTATION OF THE DECODER

The presented decoder implementation is available in the AFF3CT\(^1\) software [25]. The use of C++ templates combined with our generic SIMD library enables the same source code to be compiled for different data formats (32-bit float, 16-bit short, and 8-bit char) and different SIMD instruction sets (SSE, AVX and NEON), providing possible trade-offs between SIMD width, throughput and error-correction performance.

a) Fixed-point representation: Modern CPUs provide wide SIMD registers: SSE (x86) and NEON (ARM) registers are 128 bits wide, and AVX registers are 256 bits wide. The number of elements that can be processed in one vector depends on the SIMD length and on the data format: $n_{elem} = sizeof(SIMD)/sizeof(data)$. The key to a high degree of parallelism is therefore to work on short data types.
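The relation between register width and data format can be checked with a one-line helper. This is only an illustration; the function name `n_elem` mirrors the symbol used in the text and is not an identifier from the authors' SIMD library.

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of type T that fit in one SIMD register of `simd_bits` bits.
template <typename T>
constexpr std::size_t n_elem(std::size_t simd_bits) {
    return (simd_bits / 8) / sizeof(T);
}

// Examples: 128-bit SSE/NEON and 256-bit AVX registers.
static_assert(n_elem<float>(128) == 4, "four 32-bit floats per 128-bit register");
static_assert(n_elem<std::int16_t>(256) == 16, "sixteen 16-bit values per 256-bit register");
static_assert(n_elem<std::int8_t>(256) == 32, "thirty-two 8-bit values per 256-bit register");
```

With 8-bit data, a 256-bit register therefore holds eight times more elements than with 32-bit floats, which is the motivation for the fixed-point formats discussed next.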

As there is no floating-point support for 16-bit and 8-bit data, a fixed-point representation is used. The AWGN channel soft information is quantized as follows: $y_{s,v}^k = \Psi(2^v \cdot y^k \pm 0.5)$, where $y^k$ is the current floating-point value from the channel, $s$ is the number of bits of the quantized number, including $v$ bits for the fractional part, and $\Psi(x) = min(max(x, -2^{s-1} + 1), 2^{s-1} - 1)$ is the saturation function. In the experiments (cf. Fig. 2), $Q_{s,v}$ denotes this channel quantization.
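A possible reading of this quantization rule is sketched below. It assumes that the $\pm 0.5$ term rounds away from zero before truncation and that the saturation bounds are the symmetric $s$-bit limits $\pm(2^{s-1} - 1)$; the function name `quantize` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantizes a channel value y to a signed s-bit integer with v fractional bits.
// Assumptions: rounding adds +0.5 to non-negative values and -0.5 to negative
// ones, then truncates; saturation is to [-(2^(s-1) - 1), 2^(s-1) - 1].
inline std::int32_t quantize(float y, int s, int v) {
    assert(s >= 2 && s <= 16 && v >= 0);
    const float scaled = std::ldexp(y, v);  // 2^v * y
    const float rounded = scaled + (scaled >= 0.0f ? 0.5f : -0.5f);
    const float limit = static_cast<float>((1 << (s - 1)) - 1);
    const float clamped = std::min(std::max(rounded, -limit), limit);
    return static_cast<std::int32_t>(clamped);  // truncation toward zero
}
```

For example, with $s = 6$ and $v = 3$, the value 1.3 maps to 10 and any value above roughly 3.9 saturates to 31.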

During the turbo-decoding process, the extrinsic values grow at each iteration. It is then necessary for the internal LLRs to have a larger dynamic range than the channel information. Depending on the data format, 16-bit or 8-bit, the quantization used in the decoder is $Q_{16,3}$ or $Q_{8,2}$, respectively.

b) Memory allocations: The systematic information $L_{sysN}/L_{sysI}$ and the parity information $L_{pN}/L_{pI}$ are stored in the natural domain $N$ as well as in the interleaved domain $I$. Two extrinsic vectors are also stored: $L_{eN}$ in $N$ and $L_{eI}$ in $I$. Inside the BCJR decoding and per trellis section, two $\gamma_i$ and eight $\alpha_j$ metrics are stored. Thanks to the loop fusion optimization, the eight $\beta_j$ metrics are not stored in the global memory. In the proposed implementation, $i \in \{0, 1\}$ and $j \in \{0, 1, 2, 3, 4, 5, 6, 7\}$. Notice that all the previously mentioned vectors contain $K$ elements and are duplicated $M \times T$ times because of the inter-frame strategy. The memory footprint in bytes is approximately equal to: $16 \times K \times sizeof(data) \times M \times T$. The interleaving and deinterleaving lookup tables have been neglected in this model.
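The footprint model can be evaluated directly. In the sketch below, the function name `footprint_bytes` is hypothetical; the factor 16 counts the four systematic/parity vectors, the two extrinsic vectors, the two $\gamma_i$ and the eight $\alpha_j$ vectors.

```cpp
#include <cstdint>

// Approximate memory footprint in bytes: 16 * K * sizeof(data) * M * T.
// Interleaver lookup tables are neglected, as in the text.
constexpr std::uint64_t footprint_bytes(std::uint64_t K, std::uint64_t sizeof_data,
                                        std::uint64_t M, std::uint64_t T) {
    return 16 * K * sizeof_data * M * T;
}

// 8-bit data, 256-bit vectors (M = 32), T = 24 threads, K = 6144.
static_assert(footprint_bytes(6144, 1, 32, 24) == 75497472ULL, "72 MiB");
```

With these parameters the model gives 72 MiB, which is larger than the L3 capacities listed in Table I; halving $M$ (128-bit vectors) halves the footprint.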

c) Forward trellis traversal: The objective is to reduce the number of loads/stores by performing the arithmetic computations (add and max) inside registers. The max-log-MAP algorithm only stresses the integer pipeline of the CPU. These operations take only one cycle to execute, and their latency is also very small (one cycle). In contrast, a load/store can take a larger number of cycles depending on where the value resides in the memory hierarchy. Using data directly from registers is cost-free, whereas loading/storing it from/to the L1/L2/L3 cache can take up to 30 cycles in the worst case.

Per trellis section $k$, the two $\gamma_i^k$ metrics are computed from the systematic and the parity information. These two $\gamma_i^k$ are directly reused to compute the eight $\alpha_j^k$ metrics. Depending on the number of bits available, the trellis traversal requires normalizing the $\alpha_j^k$ because of the accumulations along the multiple sections. In the 8-bit format, the $\alpha_j^k$ metrics are normalized for each section: the first $\alpha_0^k$ value is subtracted from all the $\alpha_j^k$ (including $\alpha_0^k$ itself). In the 16-bit decoder, the normalization is only applied every eight steps (as in [16]), since there are enough bits to accumulate eight values. We have observed in experiments that there is no performance degradation due to the normalization process. At the end of a trellis section $k$, the two $\gamma_i^k$ and the eight normalized $\alpha_j^k$ are stored in memory. In the next trellis section $(k + 1)$, the eight previous $\alpha_j^k$ are not loaded from memory but are directly reused from registers to compute the $\alpha_j^{k+1}$ values.
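The normalization rule can be sketched as follows for the eight $\alpha_j$ of one section. The names `Alpha`, `normalize` and `normalize_if_due` are hypothetical, and the choice of normalizing when `k % period == 0` is an assumption about which sections are selected in the 16-bit case.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Alpha = std::array<std::int16_t, 8>;  // the eight alpha_j of one trellis section

// Subtracts alpha_0 from every metric (alpha_0 itself becomes 0).
inline void normalize(Alpha& a) {
    const std::int16_t ref = a[0];
    for (auto& x : a) x = static_cast<std::int16_t>(x - ref);
}

// period = 1 for the 8-bit decoder (every section), 8 for the 16-bit decoder.
// Assumption: the normalized sections are those with k % period == 0.
inline void normalize_if_due(Alpha& a, std::size_t k, std::size_t period) {
    assert(period > 0);
    if (k % period == 0) normalize(a);
}
```

Since the same value is subtracted from all eight metrics, the differences between states are preserved, which is consistent with the observation that normalization does not degrade the decoding performance.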

d) Backward trellis traversal: Per trellis section $k$, the two $\gamma_i^k$ metrics are loaded from memory. These two metrics are then used to compute, on the fly, the eight $\beta_j^k$ metrics (whenever needed, the $\beta_j^k$ metrics are normalized in the same way as the $\alpha_j^k$ metrics). After that, the $\alpha_j^k$ metrics are loaded from memory. The $\alpha_j^k$, $\beta_j^k$ and $\gamma_i^k$ metrics are used to determine the a posteriori and the extrinsic LLRs. In the next trellis section $(k - 1)$, the previous $\beta_j^k$ metrics are directly reused from registers in order to compute the next $\beta_j^{k-1}$ values. The $\beta_j^k$ metrics are thus never stored in memory.

---

\(^1\)AFF3CT is open-source software (MIT license) for fast forward error correction simulations, see http://aff3ct.github.io

V. EXPERIMENTS AND RESULTS

The experiments have been conducted on three different x86-based processors detailed in Table I. A mid-range processor (P2) is used for comparison with similar CPU targets in the literature [16], [19], [20] while the two high-end processors (P1 and P3) are used for comparison with GPU-based turbo-decoder implementations. Indeed, P1 and P3 have a number of cores that is similar to the number of Streaming Multiprocessors (SM) inside a GPU. Moreover, the code has been compiled on Linux (Ubuntu 14.04 LTS) with the GNU compiler (version 4.8) and with the -Ofast -funroll-loops -msse4.1/-mavx2 flags.

a) BER/FER performance: Fig. 2 shows the decoding performance of the proposed software turbo-decoder for the $K = 6144$, rate-1/3, LTE-specified turbo code. The decoding performance of a floating-point decoder is provided as a reference. Unlike [16], the proposed 16-bit implementation does not degrade the decoding performance. The 8-bit version of our decoder shows a 0.15 dB degradation. The limited dynamic range of the 8-bit format, together with early saturation inside the decoder, is responsible for this small performance loss.

b) Throughput performance: Fig. 3 shows the evolution of the information throughput depending on the code dimension $K$. This experiment was conducted on P2 and P3 (both have Haswell architectures). The throughput tends to increase linearly with the number of cores (up to 24 cores), except in AVX mode, where a performance drop can be observed when $K > 4096$. The reason is that the AVX instructions use vectors 2× wider than those used by SSE instructions, so the inter-frame strategy loads twice the number of frames to fill these vectors. Thus, for $K > 4096$ in AVX mode, the memory footprint exceeds the optimal L3 cache occupancy and the performance is driven by the RAM bandwidth. Then, as $K$ increases, the number of RAM accesses increases and there is not enough memory bandwidth to feed all the cores. This explains the decreasing throughput for $K > 4096$ in AVX mode. Nonetheless, on the P3 target, the throughput exceeds 1 Gbps for all codes with $K < 4096$.

Fig. 4 shows the energy consumed by the processor to decode one information bit ($E_b$) of the codes using SSE and AVX instructions, on the P2 CPU target. For small codewords ($K = 1024$), it is more energy efficient to resort to AVX. But this is not so clear for larger codewords ($K = 6144$) since, with 3 or 4 cores, the SSE code outperforms the AVX one.
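The energy-per-bit metric follows from the measured power and throughput. A minimal sketch, assuming $E_b$ is the average power divided by the information throughput (the function name `energy_per_bit_nj` is hypothetical):

```cpp
// E_b in nJ per information bit = average power (W) / information throughput (Gbit/s),
// since 1 W / (1e9 bit/s) = 1e-9 J/bit.
constexpr double energy_per_bit_nj(double power_watts, double throughput_gbps) {
    return power_watts / throughput_gbps;
}

// Illustrative, non-measured values: 10 W at 0.5 Gbit/s costs 20 nJ per bit.
static_assert(energy_per_bit_nj(10.0, 0.5) == 20.0, "hypothetical example");
```

Under this definition, a wider SIMD mode only lowers $E_b$ if its throughput gain exceeds its increase in power draw.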

Table II shows a performance comparison with related works\(^2\). The variety of CPU/GPU targets and algorithmic parameters makes it possible to identify some global trends. When comparing to similar CPU targets [16], [20], the proposed im-

\(^2\)To be as fair as possible to the other works, we assume that the Intel Turbo Boost (ITB) technology was disabled on their CPUs. For our experiments, ITB was enabled and the actual frequency is used. Moreover, for GPU works, an asterisk indicates that it is unclear whether the CPU/GPU data transfer times have been taken into account.

---

**TABLE I**

SPECIFICATIONS OF THE TARGET PROCESSORS.

<table>
<thead>
<tr>
<th>CPU</th>
<th>P1: Xeon E5-2650</th>
<th>P2: Core i7-4960HQ</th>
<th>P3: Xeon E5-2680v3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Intel Arch.</td>
<td>Ivy Bridge Q1’12</td>
<td>Haswell Q2’13</td>
<td>Haswell Q2’14</td>
</tr>
<tr>
<td>Cores/Freq.</td>
<td>8 cores, 2-2.8 GHz</td>
<td>4 cores, 2.6-3.8 GHz</td>
<td>12 cores, 2.5-3.3 GHz</td>
</tr>
<tr>
<td>LLC</td>
<td>20MB L3</td>
<td>16MB L3</td>
<td>20MB L3</td>
</tr>
<tr>
<td>TDP</td>
<td>95 W</td>
<td>47 W</td>
<td>120 W</td>
</tr>
</tbody>
</table>

---

**Fig. 2.** Bit Error Rate (BER) and Frame Error Rate (FER) of the decoder for $K = 6144$ (6 iters) and $R = 1/3$. Enhanced max-log-MAP algorithm (scaling factor = 0.75). BPSK modulation and AWGN channel were used.

**Fig. 4.** Energy-per-bit ($E_b$) depending on the number of cores and the instruction types. 6 iterations, 8-bit fixed-point. The throughput and power measurements were conducted on P2 with the Intel Power Gadget tool.
**TABLE II**

DECODING PERFORMANCES: COMPARISON WITH RELATED WORKS.

Reference values are normalized to one iteration. Numerical results are rounded off to two decimal places.

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Performance</th>
<th>Latency</th>
<th>Throughput</th>
</tr>
</thead>
<tbody>
<tr>
<td>ML-MAP</td>
<td>1.69 GHz</td>
<td>14.6 ns</td>
<td>1.30 Gbps</td>
</tr>
<tr>
<td>EML-MAP</td>
<td>1.54 GHz</td>
<td>16.5 ns</td>
<td>1.22 Gbps</td>
</tr>
<tr>
<td>EML-MAP</td>
<td>1.69 GHz</td>
<td>18.4 ns</td>
<td>1.33 Gbps</td>
</tr>
<tr>
<td>EML-MAP</td>
<td>1.54 GHz</td>
<td>20.3 ns</td>
<td>1.24 Gbps</td>
</tr>
</tbody>
</table>

ACKNOWLEDGMENT

This work was supported by a grant overseen by the French National Research Agency (ANR), ANR-15-CE25-0006-01.

REFERENCES