An FPGA Correlation-Edge Distance approach for disparity map
Abiel Aguilar-González, Madain Perez-Patricio, Miguel Arias-Estrada, Jorge-Luis Camas-Anzueto, Héctor-Ricardo Hernández-de León, Avisaí Sánchez-Alegría

To cite this version:
Abiel Aguilar-González, Madain Perez-Patricio, Miguel Arias-Estrada, Jorge-Luis Camas-Anzueto, Héctor-Ricardo Hernández-de León, et al.. An FPGA Correlation-Edge Distance approach for disparity map. 2015 International Conference on Electronics, Communications and Computers (CONIELECOMP), Feb 2015, Cholula, Mexico. pp.21 - 28, 10.1109/CONIELECOMP.2015.7086952 . hal-01627287

HAL Id: hal-01627287
https://hal.archives-ouvertes.fr/hal-01627287
Submitted on 31 Oct 2017


*Instituto Tecnológico de Tuxtla Gutiérrez, Tuxtla Gutiérrez, Chiapas, México
**Instituto Nacional de Astrofísica, Óptica y Electrónica, Tonantzintla, Puebla, México

Abstract—This paper describes an FPGA Correlation-Edge Distance approach for real-time disparity map generation in stereo vision. The proposed method calculates one disparity map from the input images of a stereo pair and a second one from their edge-distance images. In both cases the SAD (Sum of Absolute Differences) algorithm is used to approximate the disparity map. The final disparity map is composed from the two previously generated maps according to a homogeneity parameter defined for each point in the scene. Owing to its low implementation complexity in FPGA devices, the proposed method was implemented in a Cyclone II EP2C35F672C6 FPGA mounted on an Altera DE2 development board. The developed module can process stereo pairs of 1280×1024 pixel resolution at a rate of 75 frames/s and produces 8-bit dense disparity maps within a range of disparities up to 63 pixels. Compared with the correlation-based stereo-vision algorithms in the reported literature, the presented architecture provides a significant improvement in regions with uniform texture and a higher processing rate.

Keywords—FPGA, Verilog, Disparity map.

I. INTRODUCTION

Perceiving the depth of the points contained in a scene is an important task of computer vision systems. Depth perception has recently been used in diverse applications such as navigation systems for mobile robots, object recognition and 3D reconstruction [1]-[4]. The most widely used technique extracts depth information from images captured with a stereo configuration. In this technique, the correspondence between the images, together with the geometry of the stereo configuration, yields depth images called disparity maps [5].

To define a disparity map it is necessary to measure the similarity between points in the two images of the stereo pair. Techniques to determine these similarities are divided into two categories: area-based algorithms [6]-[7] and feature-based algorithms [8]-[9]. In area-based algorithms the gray levels of the pixels around the pixel of interest are used as the similarity measure to produce dense disparity maps, i.e., the disparity is calculated for all the points in the scene. Feature-based algorithms, on the other hand, operate on specific points of interest, which are selected with appropriate feature detectors. They are more stable against changes in lighting and contrast because they represent the geometric properties of a scene. Their main drawback is that they do not generate dense disparity maps. Therefore, these algorithms must be applied in conjunction with other techniques and require an additional feature extraction step, which increases the computational cost and runtime.

Because FPGA devices can handle large amounts of data at high speed, the literature currently reports a wide variety of disparity map estimation algorithms implemented in FPGAs. In [10], an array of four FPGAs is used to estimate the cross correlation for 256×256 pixel resolution images at a rate of 7 frames per second. In [11], a hybrid system that combines digital signal processors with programmable logic devices (PLDs) is presented; its authors generated disparity maps for 256×256 pixel resolution images at a rate of 30 frames per second. In [12], a structure of four Xilinx Virtex 2000E devices is proposed, with which real-time dense disparity maps can be obtained for 256×360 pixel resolution images at a rate of 40 frames per second. In [13], the use of a single FPGA is proposed; the developed system processes 640×480 pixel resolution images at 30 frames per second. An adaptive window technique in conjunction with SAD is used in [14]. That method processes images of up to 1024×1024 pixel resolution at a rate of 47 frames/s and produces 8-bit dense disparity maps within a range of disparities up to 32 pixels. The architecture presented in [15] uses four FPGAs to perform a real-time correction; in the same paper, a left-right consistency check is performed to improve the quality of the disparity map produced. That method processes images of up to 640×480 pixel resolution at a rate of 30 frames/s and produces 8-bit dense disparity maps within a range of disparities up to 128 pixels. In [16], a module for calculating the disparity map in real time is proposed. The module was implemented in a single FPGA of the Altera Stratix IV family. Its authors processed images of up to 640×480 pixel resolution at a rate of 320 frames/s and produced 8-bit dense disparity maps within a range of disparities up to 80 pixels.

In this paper an FPGA module that calculates real-time dense disparity maps is presented. The novelty lies in the architecture design and FPGA implementation of the proposed method. The module calculates 8-bit dense disparity maps at a rate of 75 frames/s for 1280×1024 pixel resolution images within a range of disparities up to 63 pixels. The developed module can be scaled simply and systematically to different disparity ranges; therefore, the resulting hardware could be applied to a wide range of real-time stereo-vision applications such as high-speed tracking, path tracking, high-speed object recognition and mobile robot navigation.
II. THE PROPOSED METHOD

An overview of the proposed method can be seen in Fig. 1. The method consists of two steps. First, disparity maps for the input images and for the edge-distance images are computed; simultaneously, a homogeneity parameter $\psi$ is calculated for each point in the scene. Then, a final disparity map is generated from the previously generated disparity maps using the $\psi$ parameter.

![Fig. 1: Block diagram of the proposed method](image)

**A. Sum of absolute differences (SAD)**

The Sum of Absolute Differences (SAD) is a widely used correlation-based method owing to its high computational efficiency. Its general behavior can be described as follows: given the $(x, y)$ coordinates of a pixel in the left image and the maximum expected disparity $d_{\text{max}}$, a correlation index $\text{Crl}(x, y, s)$ is calculated for each displacement $s$ of the correlation window in the right image. The correlation is calculated with the following equation:

$$\text{Crl}(x, y, s) = \sum_{u=-w}^{w} \sum_{v=-w}^{w} |I_l(x+u, y+v) - I_r(x+u+s, y+v)|$$  \hspace{1cm} (1)

where $2w+1$ is the size of the window centered on the pixel at position $(x, y)$, $I_l$ and $I_r$ are the gray values of the pixels in the left and right images, respectively, and $s$ ranges from 0 to $d_{\text{max}}$. The disparity $d(x, y)$ is defined as the displacement $s$ that minimizes the correlation index:

$$d(x, y) = \arg \min_s \text{Crl}(x, y, s)$$  \hspace{1cm} (2)
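As an illustration of Eqs. (1) and (2), the following minimal software sketch computes a dense SAD disparity map in plain Python. It is not the hardware architecture described later; the function name `sad_disparity`, the row-major list layout and the border policy (shifts that would leave the right image are skipped) are assumptions made only for this example.

```python
def sad_disparity(left, right, w=1, d_max=3):
    """Dense disparity by SAD, following Eqs. (1) and (2).

    left, right: equally sized 2-D lists of gray values, indexed [y][x].
    Shifts that would push the window outside the right image are skipped
    (a border policy assumed for this sketch).
    """
    height, width = len(left), len(left[0])
    disparity = [[0] * width for _ in range(height)]
    for y in range(w, height - w):
        for x in range(w, width - w):
            best_s, best_cost = 0, None
            for s in range(d_max + 1):
                if x + w + s >= width:
                    break  # window would leave the right image
                # Eq. (1): sum of absolute differences over the window
                cost = sum(
                    abs(left[y + v][x + u] - right[y + v][x + u + s])
                    for v in range(-w, w + 1)
                    for u in range(-w, w + 1)
                )
                # Eq. (2): keep the displacement with the smallest cost
                if best_cost is None or cost < best_cost:
                    best_s, best_cost = s, cost
            disparity[y][x] = best_s
    return disparity
```

With `w = 1` the window is 3×3. The three nested loops make the cost grow with the image size, the window area and `d_max`, which is why a purely sequential evaluation is slow.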

The main problem with this method is selecting the correlation window size. Large windows make it possible to determine the true correlation values in areas with uniform texture; however, they imply a high computational demand and produce erroneous values at certain points because edges are blurred and small features are eliminated (see Fig. 3). Small windows, on the other hand, imply a low computational demand, but the correlation measurement becomes very sensitive to noise and erroneous values are generated in regions with uniform texture, as seen in Fig. 4. Fig. 2 shows the true disparity map of the Tsukuba scene.

![Fig. 2: Tsukuba scene, true disparity map](image)

![Fig. 3: Tsukuba scene, disparity map SAD $w = 15$.](image)

![Fig. 4: Tsukuba scene, disparity map SAD $w = 1$.](image)

**B. Edge distance**

For each pixel of the left and right images, $I_l(x, y)$ and $I_r(x, y)$, the Euclidean distance to the nearest edge on its left is calculated as follows:

$$k(x, y) = |I_\delta(x, y) - I_\delta(x-1, y)|$$  \hspace{1cm} (3)

$$\text{distance}(x, y) = \begin{cases} 
 l = 0, & k(x, y) < \beta \\
 l = l + 1, & k(x, y) > \beta 
\end{cases}$$  \hspace{1cm} (4)

where $\beta$ is the threshold value that defines an edge and $\delta$ is $l$ or $r$ for the left or right image, respectively.
Fig. 5 shows the nearest left edge distance for each point of the left image of the Tsukuba scene. Darker values represent short distances while lighter values represent long distances. Fig. 6 shows the disparity map obtained using the edge-distance images as input images for SAD. This figure shows a significant improvement in regions with uniform texture but also an increase of errors in regions with uneven texture.
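The running counter behind Eqs. (3) and (4) can be sketched in a few lines of Python. This sketch assumes that a pixel counts as an edge when $k(x, y)$ reaches $\beta$, so the counter is reset there and incremented elsewhere, and that the counter restarts at the beginning of every row. Both points are assumptions of this example; if the branches of Eq. (4) are meant literally as printed, the comparison in the code should be swapped. The function name `left_edge_distance` is hypothetical.

```python
def left_edge_distance(image, beta):
    """Horizontal distance from each pixel to the nearest edge on its left.

    k follows Eq. (3). The counter resets where k reaches beta (assumed to
    mark an edge) and grows by one otherwise; it also restarts at the start
    of every row. Both choices are assumptions of this sketch.
    """
    result = []
    for row in image:
        distances = [0]  # the first column has no left neighbour
        count = 0
        for x in range(1, len(row)):
            k = abs(row[x] - row[x - 1])  # Eq. (3)
            count = 0 if k >= beta else count + 1
            distances.append(count)
        result.append(distances)
    return result
```

For the single row `[10, 10, 10, 200, 200, 200, 200]` and `beta = 32`, the jump from 10 to 200 is the only edge, so the distances are `[0, 1, 2, 0, 1, 2, 3]`. Flat regions thus receive a ramp of distinct values, which is the kind of structure that SAD can match where raw gray levels are uniform.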

C. Homogeneity parameter

A \( \psi \) parameter corresponding to the degree of homogeneity of each point of the left image of the stereo pair is determined as follows:

$$h(x, y) = \sum_{u=-w}^{w} \sum_{v=-w}^{w} I_l(x + u, y + v)$$  \hspace{1cm} (5)

$$\psi(x, y) = \begin{cases} 
0, & h(x, y)/(w + 1)^2 < \lambda \\
1, & h(x, y)/(w + 1)^2 \geq \lambda 
\end{cases}$$  \hspace{1cm} (6)

where \( \lambda \) is the threshold value that determines the homogeneity of a point with respect to its corresponding correlation window and \( 2w + 1 \) is the size of the window centered on the pixel at position \((x, y)\). Fig. 7 shows the \( \psi \) values for all points contained in the Tsukuba scene considering \( \lambda = 1 \).

D. Composition

Using the \( \psi \) parameter it is possible to determine a final disparity map by assigning the values obtained from the edge disparity map (Fig. 6) to points with uniform texture and the values obtained from the disparity map (Fig. 4) to points with uneven texture, as follows:

$$\text{disparity}(x, y) = \begin{cases} 
\text{distance}(x, y), & \psi(x, y) = 0, \sigma = 1 \\
d(x, y), & \psi(x, y) = 1 
\end{cases}$$  \hspace{1cm} (7)

where \( \sigma = 1 \) when the condition \( 0 < \text{disparity}(x, y) < d_{\text{max}} \) holds.
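The per-pixel selection of Eq. (7) can be illustrated with the short Python sketch below. It assumes that the $\sigma$ test is applied to the edge-based disparity value, and that when $\psi = 0$ but $\sigma = 0$ the gray-level SAD disparity is used as a fallback; Eq. (7) does not specify that case, so the fallback is purely an assumption. The name `compose` and its argument names are hypothetical.

```python
def compose(d_sad, d_edge, psi, d_max):
    """Per-pixel selection following Eq. (7).

    d_sad:  disparity from SAD on the gray-level images
    d_edge: disparity from SAD on the edge-distance images
    psi:    homogeneity flags (0 = uniform texture, 1 = uneven texture)
    When psi is 0 but the edge-based value is not strictly inside
    (0, d_max), the sketch falls back to d_sad (an assumed policy).
    """
    out = []
    for row_sad, row_edge, row_psi in zip(d_sad, d_edge, psi):
        out_row = []
        for s, e, p in zip(row_sad, row_edge, row_psi):
            sigma = 0 < e < d_max  # validity test on the edge-based value
            out_row.append(e if p == 0 and sigma else s)
        out.append(out_row)
    return out
```

For example, `compose([[5, 6, 7]], [[9, 0, 3]], [[0, 0, 1]], 15)` returns `[[9, 6, 7]]`: the first pixel takes the edge-based value, the second falls back because that value is 0, and the third is marked as uneven texture.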

Fig. 8 shows the disparity map generated by the proposed method for the Tsukuba scene, where the \( w, \beta \) and \( \lambda \) parameters were set to \{1, 32, 1\}, respectively. The figure shows a large improvement in regions of uniform texture and a small improvement at points near the edges.

The proposed method requires less computational load than various methods in the reported literature [7]-[8], [15]-[16]; however, computing a disparity map for a stereo pair of 384\( \times \)288 pixel resolution (the Tsukuba scene resolution) implies a runtime close to 1 second. Such runtimes are not acceptable for real-time applications. This was the main motivation to search for an efficient way to implement the developed method, and an FPGA implementation was chosen. Section III presents the hardware implementation in detail, while Section IV shows the experimental results generated by this implementation.
III. FPGA IMPLEMENTATION

The main advantages of using FPGA devices are the ease of redesigning architectures to new specifications without incurring non-recurring engineering costs, and the ability to process in parallel, which allows real-time processing. Owing to these advantages, the proposed method was implemented in a single Cyclone II EP2C35F672C6 FPGA mounted on an Altera DE2 development board, using a 50 MHz clock frequency. Input stereo pairs are obtained from a TRDB DC2 board connected to the first expansion port of the DE2 board, which provides stereo pairs of 1280×1024 pixel resolution in RGB format. The green channel is used as the grayscale value of the input stereo pairs. The developed architecture is described in detail in the following subsections. A general diagram of the designed architecture is shown in Fig. 9, and TABLE I gives a full description of the physical ports used by the proposed architecture.

<table>
<thead>
<tr>
<th>Name</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>Clk</td>
<td>Input</td>
<td>Pixel clock</td>
</tr>
<tr>
<td>Left image</td>
<td>Input</td>
<td>Logical vector [7 : 0]</td>
</tr>
<tr>
<td>Right image</td>
<td>Input</td>
<td>Logical vector [7 : 0]</td>
</tr>
<tr>
<td>Disparity map</td>
<td>Output</td>
<td>Logical vector [7 : 0]</td>
</tr>
</tbody>
</table>

TABLE I: PHYSICAL PORTS USED BY THE PROPOSED ARCHITECTURE.

A. Sum of absolute differences (SAD)

To calculate the disparity map by SAD, fragments of the images to be processed must be stored. To manage the FPGA memory, two vectors were used, one with capacity for three rows of the left image and one for three rows of the right image. These vectors behave like a shift register, but because of their simplicity they demand fewer hardware resources. In general terms, at time zero the value of pixel (1, 1) is stored at index 0 of the vector; one clock cycle later this value is shifted to index 1 and pixel (1, 2) is stored at index 0. The same process is repeated for all the pixels of the image.
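The behavior of such a storage vector can be modelled in software as a fixed-length queue. The sketch below is a behavioral model only, not the Verilog used in the design: it assumes a raster-ordered pixel stream, models the vector with `collections.deque`, and emits a $(2w+1) \times (2w+1)$ window once enough rows and columns have arrived. The name `stream_windows` is hypothetical.

```python
from collections import deque


def stream_windows(pixels, width, w=1):
    """Yield (x, y, window) for a raster-ordered pixel stream.

    A single shift register of (2w+1) rows is modelled with a deque; one
    pixel enters per iteration ("clock cycle"), and a (2w+1)x(2w+1) window
    is emitted once enough rows and columns have arrived.
    """
    size = 2 * w + 1
    buf = deque(maxlen=size * width)  # the oldest pixel drops out automatically
    for n, p in enumerate(pixels):
        buf.append(p)
        y, x = divmod(n, width)
        if y >= size - 1 and x >= size - 1:
            last = len(buf) - 1  # the newest pixel is the bottom-right corner
            window = [
                [buf[last - (size - 1 - r) * width - (size - 1 - c)]
                 for c in range(size)]
                for r in range(size)
            ]
            yield x - w, y - w, window  # coordinates of the window centre
```

For a 4×4 image streamed as `range(16)`, the first window is produced when pixel 10 arrives and equals `[[0, 1, 2], [4, 5, 6], [8, 9, 10]]`, centred on `(1, 1)`. Only three rows are ever held, so the storage requirement depends on the image width and not on its height.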

To calculate the SAD, a 3×3 correlation window and a maximum expected disparity $d_{max} = 63$ are used. A pixel-parallel, window-parallel architecture was designed. Its inputs are read from the previously configured vectors; with appropriate indexes the image can be processed at video rate. SAD calculation starts after the first three rows of each image have been stored. This results in a disparity map resolution of $(x-w) \times (y-w)$, where $x, y$ are the resolution values of the input image and $2w+1$ is the correlation window size. Fig. 10 shows the architecture that calculates the correlation for a disparity value equal to zero. The correlation values for the remaining disparities are calculated by similar architectures implemented in parallel.

To determine the appropriate disparity value, the min module is used (Fig. 11). First, this module takes the previously calculated correlation values and, using an array of multiplexers driven by comparators with $Sel : I_1 > I_2$, determines the minimum correlation value over all window displacements. Then, a CASE structure assigns the corresponding correlation index $Crl_{index}$. Finally, to obtain the disparity value, $Crl_{index}$ is scaled to the 8-bit output range as follows:

$$\text{disparity} = Crl_{index} \times \left(\frac{255}{d_{max}}\right)$$ (8)
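
The selection and scaling steps can be sketched in software as follows. The pairwise reduction below is one plausible reading of an array of comparator-driven multiplexers; the actual tree shape in Fig. 11, the tie-breaking rule and the rounding mode of Eq. (8) are not given in the text, so all three are assumptions here, as are the names `min_index` and `scale_to_8bit`.

```python
def min_index(costs):
    """Pairwise (tournament) minimum over correlation values.

    Each round keeps the smaller of two neighbours, mirroring a comparator
    driving a multiplexer; ties keep the lower index (assumed).
    """
    entries = list(enumerate(costs))  # (index, correlation value) pairs
    while len(entries) > 1:
        survivors = []
        for i in range(0, len(entries) - 1, 2):
            a, b = entries[i], entries[i + 1]
            survivors.append(b if a[1] > b[1] else a)  # Sel: I1 > I2 picks I2
        if len(entries) % 2:
            survivors.append(entries[-1])  # odd element passes through
        entries = survivors
    return entries[0][0]


def scale_to_8bit(crl_index, d_max=63):
    """Eq. (8) with integer truncation (the rounding mode is assumed)."""
    return crl_index * 255 // d_max
```

For instance, `min_index([9, 4, 7, 4, 8])` returns 1, and with `d_max = 63` the indexes 0, 10 and 63 map to 0, 40 and 255, so the full disparity range uses the full 8-bit gray scale.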
B. Edge distance

Fig. 12 shows the architecture for nearest left edge distance calculation. First, the storage_vector module stores the pixels with coordinates \((x - 1, y)\) and \((x, y)\) of the left and right images. Then, the edge_distance module calculates the nearest left edge distance by applying Equations 3 and 4.

C. Homogeneity parameter

Fig. 13 shows the architecture for calculating the \(\psi\) parameter. The vector_storage module stores three lines of the left image of the stereo pair. Then, using the stored lines and applying Equations 5 and 6, the homogeneity_computation module evaluates the homogeneity of the correlation window centered on the pixel of interest and assigns the corresponding \(\psi\) value.

D. Composition

The composition module operates on the principle of a conventional multiplexer: for each point \((x, y)\) it outputs one of its two inputs, as determined by the Sel parameter.
Finally, the generated disparity maps are shown on a 4.3” Terasic LCD screen with 800×480 pixel resolution connected to the second expansion port of the DE2 board.

IV. EXPERIMENTAL RESULTS

The architecture presented in Section III was implemented using a top-down approach. All modules were coded in Verilog and simulated using ModelSim-Altera 6.6c to verify their functionality. Quartus II Web Edition version 10.1 SP1 was used for synthesis and for downloading the design to a Cyclone II EP2C35F672C6 FPGA mounted on an Altera DE2 development board. The resource consumption of the developed architecture is shown in TABLE II.

To evaluate the performance of the proposed method, the developed architecture was tested using different values for the $\beta$ and $\lambda$ parameters. Tests were performed using the Tsukuba scene as test images. Because the hardware implementation operates on binary strings with a color depth of 8 bits, values based on powers of 2 were tested. Fig. 15 shows the behavior of the error obtained in the final disparity map for $\beta = \{1, 8, 16, 32, 64, 128, 255\}$ and $\lambda = 1$. Fig. 16 shows the behavior of the error obtained in the final disparity map for $\lambda = \{1, 8, 16, 32, 64, 128, 255\}$ and $\beta = 1$. To determine the number of erroneous pixels, the generated disparity maps were evaluated using the Middlebury stereo vision system web site [21].

In TABLE III, quantitative results for the proposed method with a 3×3 correlation window are compared with other methods in the literature [17]. Here $\text{occ}$ is the error for points of occlusion in the input images, $\text{disc}$ is the error for points with discontinuities and $\text{all}$ is the error for all points in the scene. The values shown were obtained from the Middlebury stereo vision system evaluation [21] with the following image settings: Tsukuba (384×288 resolution, $d_{\text{max}} = 15$), Venus (434×383 resolution, $d_{\text{max}} = 19$), Teddy (450×375 resolution, $d_{\text{max}} = 59$) and Cones (450×375 resolution, $d_{\text{max}} = 59$). The developed architecture was redesigned to operate with $d_{\text{max}}$ values equal to $\{15, 31, 63, 63\}$, respectively. For all test images the $w$, $\beta$ and $\lambda$ parameters were set to $\{1, 32, 1\}$, respectively. Fig. 17 shows the disparity maps generated by the proposed method for the same test scenes. TABLE IV compares the speed of the developed architecture with other FPGA implementations reported in the literature, and TABLE V compares its hardware resources demand with the same implementations.

By analyzing TABLE III a significant improvement over other methods reported in the literature is observed. This improvement is due to the left-edge distance that, like feature-based methods, represents the geometries of the input scene.

Fig. 17: Disparity maps generated by the proposed method for different test images.

Owing to the mathematical simplicity of the proposed method, the developed architecture achieves a high processing speed. Compared with the other FPGA implementations reported in the literature (TABLE IV), it processes up to 32,768,000 more pixels per second (with respect to [20]); only [16] reports a higher pixel rate.

Owing to strategies such as the use of storage vectors and small correlation windows, the developed architecture has a low hardware resources demand (TABLE V). Compared with the other methods reported in the literature, reductions of up to 76,694 logic elements and 60,259 memory bits were observed.
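
The figures quoted in the two preceding paragraphs follow directly from the entries of TABLE IV and TABLE V. The short script below reproduces the arithmetic; the values are copied from the tables and the variable names are chosen only for this example.

```python
# (width, height, frames/s) per method, copied from TABLE IV
speed = {
    "[16]": (1024, 1024, 102),
    "[18]": (1280, 1024, 65),
    "[19]": (640, 480, 230),
    "[20]": (1280, 1024, 50),
    "Proposed": (1280, 1024, 75),
}
# Pixel rate = width * height * frames per second
pixel_rate = {m: w * h * fps for m, (w, h, fps) in speed.items()}
others = [m for m in speed if m != "Proposed"]
max_rate_gain = max(pixel_rate["Proposed"] - pixel_rate[m] for m in others)

# (logic elements, memory bits) per method, copied from TABLE V
resources = {
    "[16]": (86252, 62669),
    "[18]": (89459, 84307),
    "[19]": (53616, 60598),
    "[20]": (31863, 47331),
    "Proposed": (12765, 24048),
}
max_le_saving = max(resources[m][0] - resources["Proposed"][0] for m in others)
max_mem_saving = max(resources[m][1] - resources["Proposed"][1] for m in others)

for m in speed:
    print(f"{m:9s} {pixel_rate[m]:>12,d} pixels/s")
print(max_rate_gain, max_le_saving, max_mem_saving)
```

The largest speed gain is obtained against [20], and the largest savings in logic elements and memory bits are both obtained against [18].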

<table>
<thead>
<tr>
<th>Method</th>
<th>Resolution</th>
<th>Frames/s</th>
<th>Pixels/s</th>
</tr>
</thead>
<tbody>
<tr>
<td>[16]</td>
<td>1024x1024</td>
<td>102</td>
<td>106,954,752</td>
</tr>
<tr>
<td>[18]</td>
<td>1280x1024</td>
<td>65</td>
<td>85,196,800</td>
</tr>
<tr>
<td>[19]</td>
<td>640x480</td>
<td>230</td>
<td>70,656,000</td>
</tr>
<tr>
<td>[20]</td>
<td>1280x1024</td>
<td>50</td>
<td>65,536,000</td>
</tr>
<tr>
<td>Proposed*</td>
<td>1280x1024</td>
<td>75</td>
<td>98,304,000</td>
</tr>
</tbody>
</table>

*Operating frequency = 50 MHz

TABLE IV: PROCESSING SPEED COMPARED WITH OTHER FPGAs IMPLEMENTATIONS.

<table>
<thead>
<tr>
<th>Method</th>
<th>Logic elements (LEs)</th>
<th>Memory bits (ALUTs)</th>
</tr>
</thead>
<tbody>
<tr>
<td>[16]</td>
<td>86,252</td>
<td>62,669</td>
</tr>
<tr>
<td>[18]</td>
<td>89,459</td>
<td>84,307</td>
</tr>
<tr>
<td>[19]</td>
<td>53,616</td>
<td>60,598</td>
</tr>
<tr>
<td>[20]</td>
<td>31,863</td>
<td>47,331</td>
</tr>
<tr>
<td>Proposed</td>
<td>12,765</td>
<td>24,048</td>
</tr>
</tbody>
</table>

TABLE V: HARDWARE RESOURCES DEMAND COMPARED WITH OTHER FPGAs IMPLEMENTATIONS.

V. CONCLUSIONS

In this paper a module for real-time disparity map computation has been presented. The developed architecture performs better in regions of uniform texture than other methods mentioned in the literature. The principal advantages of the developed module are its high processing speed and low consumption of hardware resources, which allow the proposed method to be implemented in FPGA devices with relatively few resources and facilitate its use in real-time stereo vision applications.

In addition, one of the main characteristics of the developed architecture is its flexibility to be reconfigured or modified to work with different window sizes and different maximum values of expected disparity.

REFERENCES