An FPGA-CAPH Stereo Matching Processor Based on the Sum of Hamming Distances
Abiel Aguilar-González, Miguel Arias-Estrada

HAL Id: hal-01627292
https://hal.archives-ouvertes.fr/hal-01627292
Submitted on 2 Nov 2017

An FPGA-CAPH stereo matching processor based on the Sum of Hamming Distances

Abiel Aguilar-González and Miguel Arias-Estrada
Instituto Nacional de Astrofísica Óptica y Electrónica (INAOE),
Tonantzintla, Puebla, México
http://www.inaoep.mx/

Abstract. Stereo matching is a useful algorithm to infer depth information from two or more images and has uses in mobile robotics, three-dimensional building mapping and three-dimensional reconstruction of objects. In area-based algorithms, the similarity between one pixel of an image (key frame) and one pixel of another image is measured using a correlation index computed on the neighbors of these pixels (correlation windows). Small correlation windows are necessary to preserve edges, while large windows are required for homogeneous areas. In addition, to improve the execution time, stereo matching algorithms are often implemented in dedicated hardware such as FPGA or GPU devices. In this article, we present an FPGA stereo matching processor based on the Sum of Hamming Distances (SHD). We propose a grayscale-based similarity criterion that separates the objects and the background within the correlation window. With this similarity criterion, it is possible to improve the performance of any grayscale-based correlation coefficient and reach high performance for both homogeneous areas and edges. The developed FPGA architecture reaches high performance compared to other real-time stereo matching algorithms, with up to 10% more accuracy, and makes it possible to increase the processing speed near to 20 megapixels per second.

Keywords: Stereo matching; FPGA; Sum of Hamming Distances; CAPH

1 Introduction

The perception of depth from images is an important task of computer vision systems and has been used in several applications such as recognition, detection, three-dimensional reconstruction and positioning systems for mobile robots [12,16]. There are several techniques to compute depth from images, such as matching all pixels using correlation windows [21,13], matching interest points or features [24,19] and optimization techniques based on dynamic programming or graph cuts [10,9]. In the case of matching all pixels with windows, the correspondence between stereo pairs and the geometrical configuration of the stereo camera allow dense disparity maps to be obtained. To obtain a dense disparity map, it is necessary to measure the similarity of all points in the stereo pair.
1.1 Related work

In this research, we are interested in dense disparity maps. There are several dense stereo matching algorithms in the literature, and to reach real-time processing they are often implemented in FPGA devices. [6] presents an FPGA module for computing dense disparity maps. The developed module enables a hardware-based cellular automata (CA) parallel-pipelined design. The presented algorithm provides high processing speed at the expense of accuracy, with large scalability in terms of disparity levels. In [4] a fuzzy approach for computing dense disparity maps is presented. The FPGA architecture determines the similarity between pixels using a Fuzzy Inference System. Although the proposed algorithm increases the accuracy of the computed disparities, it is sensitive to untextured pixels and its performance in untextured regions is limited. In [5] an FPGA module for computing dense disparity maps using a vergence control is proposed. Unlike previous work, the developed module constantly estimates the required range of disparity levels for a given stereo image set using a vergence control. In [1] a correlation-edge distance approach is described. By using a geometric feature (the Euclidean distance between the selected point and the nearest left edge), the developed FPGA architecture has low utilization of hardware resources and high processing speed, and offers high performance for low disparity levels. However, the accuracy decreases for large disparity values. In [14] an adaptive window algorithm based on the SAD algorithm is proposed. The developed FPGA architecture is more accurate than other FPGA-based stereo matching algorithms in the literature and increases the processing speed, but large correlation window sizes are required and hardware resource consumption is high.

1.2 Motivation and scope

The main disadvantage of matching all pixels with windows is selecting the size of the correlation window. Large windows allow determining the correct correlation values in untextured areas. However, they imply high computational demand and produce erroneous values due to the averaging effect of comparing an object and the background. On the other hand, small windows imply low computational demand, but the correlation coefficient is sensitive to noise and, hence, erroneous values are generated in untextured regions. If a correlation window large enough to avoid noise is used, high accuracy can be reached in untextured areas. However, edges are slightly blurred and erroneous values are generated at depth discontinuities. If the objects and the background in the correlation window are separated, it is possible to compute the correlation index using only pixels that belong to the same object as the reference pixel. This maintains high performance in untextured areas while avoiding blurred edges, i.e., it retains the strengths of both small and large window sizes. To separate the objects and the background in a correlation window, we propose the use of a grayscale-based similarity criterion.

The rest of the article is organized as follows: Section 2 presents the proposed algorithm. In Section 3, the FPGA architecture for the proposed algorithm is described. Experimental results for different synthetic stereo pairs and performance comparisons with other algorithms in the literature are detailed in Section 4. Finally, Section 5 concludes this article.

2 The proposed algorithm

Fig. 1(a) shows many objects at different depths. When a correlation coefficient is computed using all the pixels of the correlation window, the averaging effect yields errors in the estimated disparity, as shown in Fig. 1(b). On the other hand, Fig. 1(c) shows a neighborhood in which only the pixels that are projections of the same object are used. In this case, the retained pixels are similar to the central pixel and they have the same disparity, as shown in Fig. 1(d). Although separating objects and background can improve the accuracy of stereo matching algorithms, previous work has addressed it via segmentation algorithms or superpixels, which have high mathematical complexity and are difficult to implement in real time [3,7,22]. In this research, by contrast, we separate objects and background using a similarity criterion based on the grayscale levels of the correlation window. The proposed similarity criterion can be implemented in dedicated hardware for real-time processing with low hardware resource consumption and a parallel-pipelined design.

![Fig. 1: Tsukuba scene](image-url)
In the proposed algorithm, a fixed-size window is centered on each pixel of the reference image, but only the pixels selected by the similarity criterion are used to compute the correlation coefficient. Any grayscale-based correlation coefficient, such as the Sum of Absolute Differences (SAD), the Sum of Squared Differences (SSD) or the Normalized Cross Correlation (NCC), can be modified using this technique. However, to reach real-time processing, the proposed algorithm is inspired by the Sum of Hamming Distances (SHD) [8], since it consists of binary register operations and can be implemented in FPGA devices with low hardware resource consumption and a parallel-pipelined design. In this case, we propose to use Eqs. 1 and 2.

\[
C(x, y, z) = \sum_{i=-w}^{w} \sum_{j=-w}^{w} H\{Q\}, \quad (1)
\]

\[
Q = \text{XOR}(I_l(x+i,y+j) \cdot \lambda(x,y,i,j),
I_r(x+z+i,y+j) \cdot \lambda(x,y,i,j)), \quad (2)
\]

where the parameter \(\lambda(x,y,i,j)\) is equal to one for all pixels in the correlation window that belong to the same object as the center pixel and zero otherwise, \((2 \cdot w + 1)^2\) is the correlation window size, and \(z\) ranges from 0 to \(z_{\text{max}}\) (the maximum expected disparity). \(I_l(a,b)\) and \(I_r(a,b)\) are binary registers for the pixels from the left and right images, respectively, and the \(H\) operator is defined as shown in Eq. 3.

\[
H\{Q\} = \sum_{k=1}^{k=bpp} q(k), \quad (3)
\]

where \(Q\) is a binary register, \(q(k)\) is its \(k\)-th bit and \(bpp\) is the size of the register \(Q\), i.e., the bits per pixel of the input images.
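
As a minimal sketch of Eqs. 2 and 3 for a single window position, the following Python fragment computes the masked XOR and its Hamming weight. The function names `masked_xor` and `hamming_weight` are illustrative and do not come from the FPGA design, and 8-bit pixels are assumed.

```python
def hamming_weight(q, bpp=8):
    """H{Q} from Eq. 3: sum of the bpp bits of the register Q."""
    return sum((q >> k) & 1 for k in range(bpp))


def masked_xor(left_pixel, right_pixel, lam):
    """Q from Eq. 2 for one window position; lam is 0 or 1."""
    return (left_pixel * lam) ^ (right_pixel * lam)


# Two 8-bit pixels that differ in exactly two bit positions.
q = masked_xor(0b10110010, 0b10010011, lam=1)
print(hamming_weight(q))  # 2

# A pixel rejected by the similarity criterion contributes nothing.
print(hamming_weight(masked_xor(200, 13, lam=0)))  # 0
```

Because the mask multiplies both operands, a rejected pixel yields `Q = 0` and therefore adds zero to the sum in Eq. 1.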

2.1 Technique to define the similarity criterion

We can assume that two pixels have different disparity levels when there is a significant difference between their grayscale values. Hence, we define \(\lambda(x,y,i,j)\) as one only when the gray level \(I_l(x+i,y+j)\) is close to the gray level of the pixel \(I_l(x,y)\), as shown in Eq. 4, where \(\varphi(x,y)\) is the maximum acceptable difference between the grayscale values, defined as shown in Eq. 5.

\[
\lambda(x,y,i,j) = \begin{cases} 
1, & |I_l(x+i,y+j) - I_l(x,y)| \leq \varphi(x,y) \\
0, & \text{otherwise,}
\end{cases} \quad (4)
\]

The value of the \(\varphi\) parameter must be related to the uniformity of the grayscale values in the correlation window. To compute \(\varphi\), we propose to use Eq. 5.

\[
\varphi(x,y) = \frac{\sum_{i=-w}^{w} \sum_{j=-w}^{w} |I_l(x,y) - I_l(x+i,y+j)|}{(2 \cdot w + 1)^2} \quad (5)
\]
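
The similarity criterion of Eqs. 4 and 5 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the image is a list of rows indexed as `img[y][x]`, no border handling is performed, and the names `phi` and `similarity_mask` are hypothetical.

```python
def phi(img, x, y, w):
    """Eq. 5: mean absolute gray-level difference to the center pixel."""
    center = img[y][x]
    total = sum(abs(center - img[y + j][x + i])
                for j in range(-w, w + 1)
                for i in range(-w, w + 1))
    return total / (2 * w + 1) ** 2


def similarity_mask(img, x, y, w):
    """Eq. 4: lambda(x, y, i, j) for every offset (i, j) in the window."""
    center = img[y][x]
    threshold = phi(img, x, y, w)
    return [[1 if abs(img[y + j][x + i] - center) <= threshold else 0
             for i in range(-w, w + 1)]
            for j in range(-w, w + 1)]


# 3 x 3 window: a dark background column on the left, a bright object elsewhere.
window = [[10, 200, 200],
          [10, 200, 200],
          [10, 200, 200]]
print(similarity_mask(window, x=1, y=1, w=1))
# [[0, 1, 1], [0, 1, 1], [0, 1, 1]]
```

In this toy window the threshold is 570 / 9, roughly 63.3, so the three dark pixels (difference 190) are excluded and only pixels of the bright object remain in the correlation.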
2.2 Disparity computation

In standard stereo algorithms, the disparity \( d_l(x, y) \) is defined as the shift \( z \) that gives the maximum (or minimum) of the correlation values in Eq. 1. To detect occlusions, the left-right consistency check is used [17]. The disparity that takes the right image as the reference is computed using Eq. 6 instead of Eq. 2.

\[
Q = \text{XOR}(I_l(x + i, y + j) \cdot \lambda(x, y, i, j),
I_r(x - z + i, y + j) \cdot \lambda(x, y, i, j)), \quad (6)
\]

For each pixel, if the disparity \( d_l(x, y) \) computed using the left image as the reference is equal to the disparity \( d_r(x + z_{\text{max}}, y) \) computed using the right image as the reference, the solution is considered correct. Otherwise, the pixel is marked as occluded; however, its disparity can be assigned as the minimum of \( d_l(x, y) \) and \( d_r(x + z_{\text{max}}, y) \).
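
A software sketch of the disparity search with the left image as the reference is given below. It is not the FPGA implementation: images are lists of rows indexed as `img[y][x]`, borders are ignored, the shift direction follows Eq. 2 as written, and the names `shd_cost`, `disparity` and `left_right_check` are hypothetical.

```python
def shd_cost(left, right, x, y, z, w):
    """C(x, y, z) from Eq. 1, with Q from Eq. 2 and lambda from Eqs. 4-5."""
    offsets = range(-w, w + 1)
    center = left[y][x]
    phi = sum(abs(center - left[y + j][x + i])
              for j in offsets for i in offsets) / (2 * w + 1) ** 2
    cost = 0
    for j in offsets:
        for i in offsets:
            lam = 1 if abs(left[y + j][x + i] - center) <= phi else 0
            q = (left[y + j][x + i] * lam) ^ (right[y + j][x + z + i] * lam)
            cost += bin(q).count("1")  # H{Q}, Eq. 3
    return cost


def disparity(left, right, x, y, w, z_max):
    """Shift z in 0..z_max with the minimum SHD cost."""
    costs = [shd_cost(left, right, x, y, z, w) for z in range(z_max + 1)]
    return costs.index(min(costs))


def left_right_check(d_l, d_r):
    """Return (disparity, occluded) for a pair of left/right disparities."""
    if d_l == d_r:
        return d_l, False
    return min(d_l, d_r), True


# Synthetic pair in which I_r(x + 2, y) = I_l(x, y), i.e. a true shift of 2.
row = [5, 9, 60, 3, 120, 77, 14, 200, 33, 91, 150, 8]
left = [row[:] for _ in range(3)]
right = [[0, 0] + row[:-2] for _ in range(3)]
print(disparity(left, right, x=4, y=1, w=1, z_max=3))  # 2
print(left_right_check(3, 1))  # (1, True)
```

The search is a plain winner-take-all over the candidate shifts; ties are resolved in favor of the smallest shift, which is a choice made for this sketch only.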

3 The FPGA design

In Fig. 2, an overview of the FPGA design is shown. This design has five inputs: clk_pixel, the pixel clock for the input stereo pairs; left_image [7:0] and right_image [7:0], the grayscale values of the pixels from the left and right images, respectively; x_resolution [10:0], the horizontal resolution of the input images; and n [4:0], the number of lines in the correlation window. The design has one output, final_disparity [7:0], corresponding to the disparity values for the output image. Its general behavior can be described as follows: first, the stereopair_buffer module stores the grayscale values of all pixels in the correlation windows for disparity levels from 0 up to \( z_{\text{max}} \). Then, the proposed similarity criterion is computed. Next, the left-disparity and right-disparity modules compute the disparity value using the left and right image as the reference, respectively. Finally, a multiplexer (mux) sets the final disparity value as the minimum of the left and right disparity values.

![Fig. 2: General diagram for the FPGA design](image-url)
3.1 The stereopair_buffer module

The stereopair_buffer module manages an array of $(n + 1) \cdot 2$ BRAM cores which store $n$ lines of the two images, as shown in Fig. 3. The RAM_controller module assigns to each BRAM core its corresponding address and write-read values. The RAM_controller module has two inputs: x_resolution [10:0], the horizontal resolution of the input images, and n [4:0], the number of lines in the correlation window. There are two outputs: w/r [n + 1:0] is a logic vector of $n + 1$ bits in which each bit determines the write-read value of one BRAM core, and address [10:0] is a logic vector that gives the read/write address for all the BRAM cores. Each BRAM core only provides the grayscale value of one pixel from one horizontal line in the correlation window. To store the other horizontal values necessary for the disparity computation, we use the line_vector module. Using the line_vector module, the first horizontal pixels of the $n + 1$ BRAMs are read in parallel, and the pixels of the $n$ BRAMs in read mode are placed in bits 7-0 of $n$ storage vectors, respectively. Then, the first horizontal pixels are moved to bits 15-8 and the second horizontal pixels are placed in bits 7-0. This process is repeated until all horizontal pixels necessary for computing all disparity levels are stored.
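
The behavior described for one storage vector of the line_vector module resembles a byte-wide shift register. The following Python model is a sketch under that assumption; the function name `push_pixel` and the parameter `n_pixels` (how many 8-bit pixels the vector holds) are hypothetical.

```python
def push_pixel(vector, pixel, n_pixels):
    """Shift the storage vector left by one 8-bit pixel and place the new
    pixel in bits 7-0, keeping only the newest n_pixels pixels."""
    mask = (1 << (8 * n_pixels)) - 1
    return ((vector << 8) | (pixel & 0xFF)) & mask


vector = 0
for pixel in (0x11, 0x22, 0x33):
    vector = push_pixel(vector, pixel, n_pixels=2)
    print(hex(vector))
# 0x11
# 0x1122
# 0x2233
```

After the second push the first pixel sits in bits 15-8 and the second in bits 7-0, which matches the order described in the text; the oldest pixel is dropped once the vector is full.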

![Fig. 3: FPGA design for the stereopair_buffer module](image-url)
3.2 The disparity module

For the computation of the disparity map via the proposed algorithm, a pixel-parallel and window-parallel architecture was designed. The architecture of the disparity module is presented in Fig. 4. Its general behavior is as follows: first, the XOR modules compute the XOR operation between the pixels from the left and right images of the correlation window. This process is executed in each of the \( z_{\text{max}} + 1 \) XOR modules, implemented in parallel and configured for expected disparity levels from 0 to \( z_{\text{max}} \); each module processes only one disparity level and computes the XOR operation only for pixels selected by the proposed similarity criterion (Eq. 2). Then, the output of each XOR module is sent to its corresponding binary_opperator module, which implements the \( H \) operator (Eq. 3). Next, the adder module computes the sum of the values for all pixels retained in the correlation window (Eq. 1). Finally, the mux_tree module, which consists of a multiplexer tree, assigns the corresponding index to each correlation value, determines the minimum correlation value and sets the disparity value as the index of that minimum. In the developed FPGA design, two disparity modules were implemented in parallel: the first uses the left image as the reference, while the second uses the right image as the reference.
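
The minimum search performed by the mux_tree module can be modeled in software as a pairwise tournament. The sketch below assumes a binary tree that compares neighboring candidates at each stage; the function name and the tie-breaking rule (the lower index wins) are assumptions, not details taken from the design.

```python
def mux_tree(costs):
    """Return (index of the minimum cost, number of comparison stages)."""
    # Pair every correlation value with its disparity index.
    level = [(cost, z) for z, cost in enumerate(costs)]
    stages = 0
    while len(level) > 1:
        # Each stage keeps the smaller entry of every neighboring pair;
        # an unpaired last entry is passed through unchanged.
        level = [min(level[k:k + 2]) for k in range(0, len(level), 2)]
        stages += 1
    return level[0][1], stages


print(mux_tree([18, 30, 0, 21]))           # (2, 2)
print(mux_tree(list(range(32, 0, -1))))    # (31, 5)
```

With 32 candidate disparity levels, such a binary tree needs five stages, which suggests why the structure lends itself to pipelining.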

![Fig. 4: FPGA design for the disparity module](image)
4 Results and discussions

The developed FPGA architecture was implemented in an Altera Cyclone IV EP4CGX150CF23C8 FPGA. All modules were designed via the CAPH design tool, which allows high-level programming and friendly design for video stream processing [20]. Furthermore, all modules were validated via post-synthesis simulations performed in ModelSim Altera. The selected configuration for the implemented algorithm was: correlation window size equal to 19 · 19, i.e., \(2w+1 = 19\), and maximum disparity level equal to 31, i.e., \(z_{\text{max}} = 31\). This setup reads 19 · 19 pixels from 32 correlation windows in parallel, i.e., it computes the correlation index for all disparity levels of the reference pixel in the same clock cycle. The setup requires 40 BRAM cores, 64 XOR, \(H\) and adder modules, and two mux_tree modules implemented in parallel-pipelined form. The hardware resource consumption of the developed FPGA architecture is shown in Table 1.
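
The module counts quoted above can be reproduced with a few lines of arithmetic. This is a sketch under assumptions: the BRAM count is taken as (n + 1) · 2 with n = 19 lines, the 64 modules are read as one per disparity level for each of the two reference images, and the cost bit width is an upper bound derived from Eq. 1 rather than a figure reported in the article.

```python
w = 9                 # half window size, so that 2*w + 1 = 19
n = 2 * w + 1         # lines in the correlation window
z_max = 31            # maximum disparity level
bpp = 8               # bits per pixel

disparity_levels = z_max + 1              # correlation windows per reference pixel
bram_cores = (n + 1) * 2                  # assumed formula for the BRAM array
modules_per_kind = 2 * disparity_levels   # left- and right-reference modules
max_cost = n * n * bpp                    # every bit of every pixel differs
cost_bits = max_cost.bit_length()         # bits needed to hold max_cost

print(disparity_levels, bram_cores, modules_per_kind, max_cost, cost_bits)
# 32 40 64 2888 12
```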

Table 1: Hardware resource consumption for the FPGA implementation

<table>
<thead>
<tr>
<th>Resource (FPGA:EP4CGX150CF23C8)</th>
<th>Consumption</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total logic elements</td>
<td>62,689/149,760 (41.85%)</td>
</tr>
<tr>
<td>Total pins</td>
<td>25/287 (11.48%)</td>
</tr>
<tr>
<td>Total Memory Bits</td>
<td>242,064/6,635,920 (27.41%)</td>
</tr>
<tr>
<td>Embedded multiplier elements</td>
<td>0/720 (0%)</td>
</tr>
<tr>
<td>Total PLLs</td>
<td>0/6 (0%)</td>
</tr>
</tbody>
</table>

In Table 2, the percentage of erroneous pixels obtained by the proposed algorithm is compared with that of other FPGA-based stereo matching algorithms in the literature. Table 2 shows that the proposed algorithm outperforms most algorithms in the literature. The algorithms presented in [2,23,11] allow estimating dense disparity maps. However, the averaging effect due to the large window sizes used generates errors at depth discontinuities. To avoid the averaging effect, algorithms with small window sizes have been developed [6,4,5]. However, noise sensitivity is increased and erroneous values are generated in untextured regions. In both cases, the rate of erroneous pixels is close to 12%. Other approaches, such as [1], which uses a geometric feature, or [15], which is inspired by the Census transform and introduces an adaptive coefficient, reach a rate of erroneous pixels close to 10%. Finally, although the algorithm presented in [14] uses a grayscale-based similarity criterion to improve the performance of the SAD algorithm, the similarity criterion proposed in Eqs. 1-6 of this article increases the accuracy by nearly 10% with respect to [14]. Furthermore, because the proposed algorithm uses a smaller correlation window than [14], its FPGA resource usage is more efficient than that of [14].
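
The comparison with [14] can be checked with a short calculation on the Table 2 values. Reading the reported gain as a relative reduction in erroneous pixels is an assumption of this sketch, and the function name is illustrative.

```python
def relative_reduction(baseline, proposed):
    """Relative reduction in erroneous pixels, in percent."""
    return (baseline - proposed) / baseline * 100


print(round(relative_reduction(7.6, 6.8), 1))  # Tsukuba, [14] vs. proposed: 10.5
print(round(relative_reduction(3.2, 2.8), 1))  # Venus, [14] vs. proposed: 12.5
```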
In Fig. 5 the disparity maps for the Tsukuba and Venus scenes generated by the proposed method are shown. Although the obtained results retain some noise, they improve on previous algorithms in the literature [6,4,5,1,14,15] both quantitatively and qualitatively; disparity maps for the Tsukuba and Venus scenes of most algorithms compared in Table 2 can be consulted on the Middlebury Stereo Vision Page [18]. Table 3 compares the processing speed with that of other real-time stereo matching algorithms reported in the literature. A considerable increase is observed with respect to several FPGA-based stereo matching algorithms in the literature [15,2,23,11].

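Table 2: Percentage of erroneous pixels compared with other FPGA-based stereo matching algorithms in the literature

<table>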
<thead>
<tr>
<th>Algorithm</th>
<th>Tsukuba (all)</th>
<th>Venus (all)</th>
<th>Correlation window size</th>
</tr>
</thead>
<tbody>
<tr>
<td>Aguilar-González et al., [1]</td>
<td>10.9%</td>
<td>6.93%</td>
<td>3 · 3</td>
</tr>
<tr>
<td>Georgoulas et al., [6]</td>
<td>12.0%</td>
<td>8.0%</td>
<td>7 · 7</td>
</tr>
<tr>
<td>Alba et al., [2]</td>
<td>13.92%</td>
<td>12.6%</td>
<td>19 · 19</td>
</tr>
<tr>
<td>Georgoulas and Andreadis, [4]</td>
<td>11%</td>
<td>8%</td>
<td>7 · 7</td>
</tr>
<tr>
<td>Georgoulas and Andreadis, [5]</td>
<td>12%</td>
<td>9%</td>
<td>7 · 7</td>
</tr>
<tr>
<td>Perri et al., [15]</td>
<td>11.8%</td>
<td>7.2%</td>
<td>13 · 13</td>
</tr>
<tr>
<td>Ttofis et al., [23]</td>
<td>10.4%</td>
<td>12.1%</td>
<td>11 · 11</td>
</tr>
<tr>
<td>Pérez-Patricio and Aguilar-González, [14]</td>
<td>7.6%</td>
<td>3.2%</td>
<td>29 · 29</td>
</tr>
<tr>
<td>Jin et al., [11]</td>
<td>11.57%</td>
<td>5.27%</td>
<td>15 · 15</td>
</tr>
<tr>
<td>Proposed*</td>
<td>6.8%</td>
<td>2.8%</td>
<td>19 · 19</td>
</tr>
</tbody>
</table>

Table 3: Processing speed compared with other real-time stereo matching algorithms in the literature

<table>
<thead>
<tr>
<th>Algorithm</th>
<th>Resolution (pixels)</th>
<th>Frames/s</th>
<th>Pixels/s</th>
</tr>
</thead>
<tbody>
<tr>
<td>Georgoulas et al., [6]</td>
<td>1280 × 1024</td>
<td>65</td>
<td>85,196,800</td>
</tr>
<tr>
<td>Perri et al., [15]</td>
<td>640 × 480</td>
<td>68</td>
<td>20,889,600</td>
</tr>
<tr>
<td>Alba et al., [2]</td>
<td>256 × 256</td>
<td>100</td>
<td>6,553,600</td>
</tr>
<tr>
<td>Ttofis et al., [23]</td>
<td>1280 × 1024</td>
<td>50</td>
<td>65,536,000</td>
</tr>
<tr>
<td>Aguilar-González et al., [1]</td>
<td>1280 × 1024</td>
<td>75</td>
<td>98,304,000</td>
</tr>
<tr>
<td>Pérez-Patricio and Aguilar-González, [14]</td>
<td>450 × 375</td>
<td>592</td>
<td>99,900,000</td>
</tr>
<tr>
<td>Jin et al., [11]</td>
<td>640 × 480</td>
<td>230</td>
<td>70,656,000</td>
</tr>
<tr>
<td>Proposed</td>
<td>1280 × 720</td>
<td>117</td>
<td>107,827,200</td>
</tr>
</tbody>
</table>
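
The pixels-per-second column is the product of the horizontal resolution, the vertical resolution and the frame rate. The short check below reproduces two of the rows; the function name is illustrative.

```python
def pixels_per_second(width, height, fps):
    """Throughput in pixels per second for a given resolution and frame rate."""
    return width * height * fps


print(pixels_per_second(1280, 720, 117))   # proposed design: 107827200
print(pixels_per_second(1280, 1024, 75))   # Aguilar-González et al. [1]: 98304000
```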
5 Conclusions

In this article, an area-based algorithm for stereo matching was presented that uses a similarity criterion as a pixel selector in the correlation window. It was shown that, using the modified Sum of Hamming Distances algorithm proposed in this article, it is possible to exceed the accuracy of most real-time FPGA-based stereo matching algorithms in the literature and to reach a parallel-pipelined design that increases the processing speed. The best performance of the proposed algorithm was obtained with a large window, appropriate for untextured areas. However, because only pixels of the same object are used in the correlation window, blurring effects are avoided and, therefore, erroneous values at depth discontinuities are reduced.

References

