Dense Feature Matching Core for FPGA-based Smart Cameras

Abiel Aguilar-González, Miguel Arias-Estrada, François Berry

To cite this version:
Abiel Aguilar-González, Miguel Arias-Estrada, François Berry. Dense Feature Matching Core for FPGA-based Smart Cameras. 11th International Conference on Distributed Smart Cameras (ICDSC 2017), Sep 2017, Stanford, CA, United States. pp.41-48, 10.1145/3131885.3131922. hal-01657267

HAL Id: hal-01657267
https://hal.archives-ouvertes.fr/hal-01657267
Submitted on 6 Dec 2017


Abiel Aguilar-González1,2, Miguel Arias-Estrada1, François Berry2
1. Instituto Nacional de Astrofísica, Óptica y Electrónica (INAOE), Tonantzintla, Mexico
2. Université Clermont Auvergne (UCA), Institut Pascal, Clermont-Ferrand, France
Contact: abiel@inaoep.mx

ABSTRACT

Smart cameras are image/video acquisition devices that integrate image processing algorithms close to the image sensor, so they can deliver high-level information to a host computer or a high-level decision process. In this context, a central issue is the implementation of complex and computationally intensive computer vision algorithms inside the camera fabric. For low-level processing, FPGA devices are excellent candidates because they support data parallelism with high data throughput. One computer vision algorithm that is highly promising for FPGA-based smart cameras is feature matching. Unfortunately, most previous feature matching formulations have inefficient FPGA implementations or deliver relatively poor information about the observed scene. In this work, we introduce a new feature matching algorithm that aims for dense feature matching and, at the same time, a straightforward FPGA implementation. We propose a new mathematical formulation that addresses the feature matching task as a feature tracking problem. We demonstrate that our algorithmic formulation delivers robust feature matching with low mathematical complexity and obtains accuracy superior to previous algorithmic formulations. An FPGA architecture is laid down and hardware acceleration strategies are discussed. Finally, we apply our feature matching algorithm in a monocular-SLAM system and show that our algorithmic formulation provides promising results in real-world applications.

Keywords
Feature matching; Feature tracking; Smart camera; FPGA

1. INTRODUCTION

Smart cameras are image/video acquisition devices with self-contained image processing algorithms that simplify the formulation of a particular application. For example, algorithms for video surveillance could detect and track pedestrians, whereas for a robotic application the algorithms could be edge and feature detection. In recent years, progress in microprocessor power and FPGA technology has allowed the creation of compact, low-cost smart cameras and has increased the performance of smart camera applications. As a result, in current embedded vision applications, smart cameras represent a promising on-board solution for different application domains: motion detection, object detection/tracking, inspection and surveillance, human behavior recognition [7, 9], etc. In any case, the flexibility of the application domain relies on the large variety of image processing algorithms that can be implemented inside the camera.

One task widely used by computer vision applications is feature matching between different camera views. In computer vision, feature matching aims for pixel/point correspondences across different viewpoints of the same scene/object (Fig. 1), and it is the basis of several computer vision applications such as augmented reality, object recognition [2], etc. The most common formulation consists in detecting a set of feature points and associating each point with a visual descriptor. Once feature points and their descriptors have been extracted from at least two images, it is possible to match features across the images. In this context, feature tracking seems to be a simple point matching problem. Nevertheless, in practice it is a complex task, since matching performance depends on the properties of the feature extractor and the visual descriptor. Specific detectors and descriptors, appropriate for the content of the input images, have to be used in specific applications. For example, if the input images are a microscopic view of bacteria or cells, a blob detector should be used. On the other hand, if the images are a city view, a corner detector is more suitable to find building structures. In addition, if the input images have high degradation (rotation, orientation or scale changes), complex and computationally intensive visual descriptors that account for the image degradation are required in order to guarantee stability.

Figure 1: The feature matching problem: visual features (circles) have to be matched (lines) across different viewpoints from the same scene/object (squares).

2. RELATED WORK

In current computer vision systems, several applications use feature matching as a keystone of their mathematical formulations, so a smart camera that includes feature matching among its self-contained algorithms is highly desirable. In recent work, there are several approaches that aim for an embedded feature matching core; several FPGA architectures have been developed and several solutions have been proposed [24, 17]. In [8] an embedded system architecture for feature detection and matching was presented. The proposed FPGA architecture implements the FAST [15] (Features from Accelerated Segment Test) feature detector and the BRIEF [5] (Binary Robust Independent Elementary Features) feature descriptor in a customizable FPGA block. The developed blocks were designed to use hardware interfaces based on the AMBA AXI4 interface protocol and were connected using a DMA (Direct Memory Access) architecture. The proposed architecture computes feature matching over two consecutive HD frames coming from an external memory at 48 frames per second.

In [26] an FPGA architecture of the SIFT (Scale Invariant Feature Transform) visual descriptor associated with an image matching algorithm was presented. For an efficient FPGA-SIFT image matching implementation (in terms of speed and hardware resource usage), the original SIFT algorithm was optimized as follows: 1) upsampling operations were replaced with downsampling, in order to avoid interpolation operations; 2) only four scales with two octaves were used; 3) the dimension of the visual descriptor was reduced to 72 instead of the 128 of the original SIFT formulation. This implementation is able to detect and match features at 640×480 image resolution at 33 frames per second.

More recently, Weheruss [25] proposed an FPGA architecture for the ORB [16] (Oriented FAST and Rotated BRIEF) descriptor associated with a feature matching algorithm. A Harris corner detector [10] was the feature extractor and ORB visual descriptors were computed at each corner. Finally, the previous features (stored in a 2D shift register) and the current features were matched using the Hamming distance as discrimination metric. In 2017, Vourvoulakis [23] presented an FPGA-SIFT architecture for feature matching. In order to achieve high hardware parallelism, the procedures of SIFT detection and description were reformulated. At every clock cycle, the current pixel in the pipeline is tested and, if it is a SIFT feature, its descriptor is extracted. Furthermore, every detected feature in the current frame is matched with one of the stored features of the previous frame, using a moving window, without breaking the "pixel pipeline". False matches are rejected using the RANSAC (Random Sample Consensus) algorithm. The architecture was implemented on a Cyclone IV. The maximum supported clock frequency was set to 25 MHz and the architecture was capable of processing 81 frames per second at 640×480 image resolution.

In most cases, previous FPGA-based feature matching formulations [24, 17, 8, 15, 5, 26, 25, 16, 23] provide relatively good performance under real-world scenarios. Unfortunately, in several applications, and in particular in smart camera applications, these algorithms are not suitable due to their relatively high hardware requirements and their algorithmic formulation. We can mention three important limitations affecting current feature matching algorithms:

1. Low performance for embedded applications: nowadays computers can process several feature matching algorithms in real time. Unfortunately, in embedded applications such as smart cameras, mobile applications, autonomous robotics or compact smart vision systems, the use of computers is difficult due to their high power consumption and size. The use of FPGA technology is an alternative, but there are hard challenges because previous visual descriptors (SIFT, BRIEF, ORB) and matching techniques were designed for software implementation and often contain several iterative operations that cannot be parallelized. As a result, most previous FPGA architectures have high hardware requirements and a relatively low processing rate.

2. Sparse matching: in order to maintain high discrimination between descriptors, only features with a high thresholding response are matched (since it is assumed that these features are associated with a highly responsive visual descriptor that has a low probability of being similar to that of other features). In practice, this assumption ensures consistency in the matching process. However, there is an important limitation: only a few image points are matched, so scene/object information is available only at certain sparse points of the image, as shown in Fig. 2.

3. Outliers: in certain cases, image ambiguities around features (color/texture repeatability, occlusion, etc.) generate similar visual descriptors for two or more different features. In such a scenario, the matching techniques deliver wrong results that can affect the global performance of several computer vision applications (camera calibration, structure from motion, visual odometry, etc.), see Fig. 2. To solve this problem, statistically robust methods like RANSAC have to be applied as an outlier filter. In this case, statistical methods remove wrong matches, but they increase the matching cost.

Figure 2: Limitations of feature matching algorithms: most previous formulations work with few image points. Therefore, scene information is limited to certain sparse points in the image. On the other hand, most previous work delivers outliers that affect performance in real-world applications. (Figure modified from [25])

3. THE PROPOSED ALGORITHM

Most previous FPGA-based feature matching formulations provide relatively good performance under real-world scenarios. However, there are several important limitations (low performance for embedded applications, sparse matching and outliers) that affect performance. In this work, we assume that a more efficient solution consists in addressing the feature matching task as a feature tracking problem. We consider that a feature tracking approach provides more data parallelism than previous formulations based on SIFT/ORB visual descriptors. In the current state of the art, there are some FPGA architectures for feature tracking [22, 20]. Unfortunately, most previous work addressed the problem via the KLT (Kanade-Lucas-Tomasi) tracking algorithm, which is highly exhaustive and has high hardware requirements in the case of FPGA implementation.

3.1 Feature matching and feature tracking

The basis of feature matching is to extract visual features from two or more different viewpoints of the same scene/object and then match these visual features by comparing visual descriptors computed around each feature, as shown in Fig. 3. Feature tracking, on the other hand, consists in extracting visual features from an image and then trying to find the same features in a similar image (commonly the next frame of a video sequence), as illustrated in Fig. 4.

3.2 Image storage

Since in most cases the image sensor delivers data as a stream, storage is required to access two consecutive frames at the same time. Details of the storage architecture are presented in Section 4.1. In the mathematical formulation, the first frame (frame at time $t$) is denoted $I_t(x, y)$ and the second frame (frame at time $t + 1$) is denoted $I_{t+1}(x, y)$; the following sections write them as $I_1$ and $I_2$.

3.3 Feature extractor

In this work, the Shi-Tomasi feature extractor is used because it provides a good trade-off between accuracy/robustness, processing speed and hardware requirements. This extractor is based on products of the spatial gradients:

$$
A(x, y) = \frac{\partial I}{\partial x} \cdot \frac{\partial I}{\partial x}
$$
$$
B(x, y) = \frac{\partial I}{\partial y} \cdot \frac{\partial I}{\partial y}
$$
$$
C(x, y) = \frac{\partial I}{\partial x} \cdot \frac{\partial I}{\partial y}
$$

Gaussian filtering is applied to the $A, B, C$ matrices in order to reduce noise and to remove fine-scale structures that affect the corner response. The smoothed matrices are denoted $A', B', C'$. The original Shi-Tomasi corner metric, Eq. 1, provides a high response value for corners and a low response otherwise, as illustrated in Fig. 5b.

$$
D(x, y) = \sqrt{(A'(x, y) + B'(x, y))^2 + 4C'(x, y)^2}
$$

(1)
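As a hedged illustration of the quantities above, the following Python sketch computes the gradient products $A, B, C$, smooths them and evaluates a corner response. It assumes central differences for the gradients, a 3×3 binomial kernel as a stand-in for the Gaussian filter, and the textbook Shi-Tomasi minimum-eigenvalue response $\frac{1}{2}\left[(A' + B') - \sqrt{(A' - B')^2 + 4C'^2}\right]$, which is not necessarily the exact expression used by the authors. All function names are hypothetical.

```python
import math

def gradients(img):
    """Central-difference gradients (Ix, Iy); border pixels are left at zero."""
    h, w = len(img), len(img[0])
    ix = [[0.0] * w for _ in range(h)]
    iy = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    return ix, iy

def smooth3(m):
    """3x3 binomial smoothing (assumed stand-in for the Gaussian filter)."""
    k = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
    h, w = len(m), len(m[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = sum(k[j][i] * m[y + j - 1][x + i - 1]
                            for j in range(3) for i in range(3)) / 16.0
    return out

def corner_response(img):
    """Minimum eigenvalue of the smoothed matrix [[A', C'], [C', B']]."""
    ix, iy = gradients(img)
    h, w = len(img), len(img[0])
    a = smooth3([[ix[y][x] * ix[y][x] for x in range(w)] for y in range(h)])
    b = smooth3([[iy[y][x] * iy[y][x] for x in range(w)] for y in range(h)])
    c = smooth3([[ix[y][x] * iy[y][x] for x in range(w)] for y in range(h)])
    d = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            trace = a[y][x] + b[y][x]
            root = math.sqrt((a[y][x] - b[y][x]) ** 2 + 4.0 * c[y][x] ** 2)
            d[y][x] = (trace - root) / 2.0
    return d

# Synthetic image: an 8x8 bright square on a dark 16x16 background.
img = [[10.0 if 4 <= y <= 11 and 4 <= x <= 11 else 0.0 for x in range(16)]
       for y in range(16)]
d = corner_response(img)
# Values at a square corner, at the midpoint of an edge, and in the flat interior.
print(d[4][4], d[4][8], d[8][8])
```

On this synthetic square the response is large at the corner and vanishes along the straight edge and in the flat interior, which is the behaviour the text expects from a corner metric.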

To determine whether a pixel $P$ is a corner, the maximum value of the corner response is retained. However, many pixels around each corner are still detected even after filtering with a threshold $\alpha$. These pixels are false feature candidates and are difficult to match/track. One way to remove them is to apply a non-maxima suppression step. An appropriate FPGA-based non-maxima suppression step can be defined as follows: considering $D(x, y)$ as the corner response image and $\Omega(x, y)$ as a fixed $3 \times 3$ neighborhood around $D(x, y)$, the "good features" measure is computed as:

$$
\beta(x, y) = \max \left[ \Omega(x, y) \ast \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 1 \end{pmatrix} \right]
$$

Finally, a threshold $\alpha$ has to be applied in order to select the "good features/corners", see Eq. 2.

$$
\text{corners}(x, y) = \begin{cases} 1 & \text{if } \beta(x, y) > \alpha \\ 0 & \text{otherwise} \end{cases}
$$

(2)
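The suppression and thresholding steps can be sketched in Python as follows. This is a minimal illustration under one stated assumption: a pixel is kept only if its response exceeds both the threshold $\alpha$ and the maximum of its eight neighbors (the masked $3 \times 3$ maximum $\beta$). This combined test is our reading of the non-maxima suppression described above and may differ from the authors' hardware logic. The function name `select_corners` is hypothetical.

```python
def select_corners(d, alpha):
    """Binary corner map from a corner response image d (list of rows).

    Assumption: a pixel is a corner when d > alpha and d is strictly
    greater than beta, the maximum over its 8 neighbours.
    """
    h, w = len(d), len(d[0])
    corners = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # beta: maximum of the 3x3 neighbourhood with the centre masked out
            beta = max(d[y + j][x + i]
                       for j in (-1, 0, 1) for i in (-1, 0, 1)
                       if (i, j) != (0, 0))
            if d[y][x] > alpha and d[y][x] > beta:
                corners[y][x] = 1
    return corners

# Toy response image with one clear peak (9) and a weaker neighbour (5).
response = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 1, 0],
    [0, 2, 9, 5, 0],
    [0, 1, 2, 1, 0],
    [0, 0, 0, 0, 0],
]
corner_map = select_corners(response, alpha=4)
```

In the toy example the value 5 passes the threshold but is suppressed because its neighborhood contains the stronger response 9, so a single corner survives.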

In our case, we propose a new feature tracking formulation that can be extended to feature matching. Our contributions are twofold: first, we propose a tracking/matching framework that alleviates most of the current limitations of previous feature matching algorithms and that is fully compliant with FPGA architectures. Second, we lay down an FPGA architecture suitable for smart camera implementation. An overview of our algorithmic formulation is shown in Fig. 6. In general, our algorithm first stores two consecutive frames from an input video sequence. Visual features are extracted from both frames. In addition, the curl of the intensity gradient is computed over both frames. Here, the curl of the intensity gradient aims to remove image ambiguities around features (color/texture repeatability, pixel similarity, etc.). Then, a fully parallelized feature tracking algorithm computes preliminary matches between features in the first frame and pixel points in the second frame. For that, the curls of the intensity gradient are used as input, since it is assumed that they guarantee consistent tracking around features. Finally, considering the feature points in the second frame, a feature matching step refines the tracking result by comparing the tracked points in the second frame with the visual feature coordinates computed in the same frame.

3.4 Improving feature discrimination

Previous work [19] demonstrated that simple pixel similarity metrics such as SAD (Sum of Absolute Differences), Hamming distances or NCC (Normalized Cross-Correlation) deliver poor results in real-world scenarios. This is due to several image ambiguities around features: for example, because of color/texture repetition, different features can yield similarity metrics with a low difference between them (close to zero). To solve these problems, previous feature tracking formulations [22, 20] used more complex approaches, such as the eigenvalues of the gradient matrix or the Jacobian matrix, as similarity metric. Unfortunately, they require high hardware resources for FPGA implementation.

In this work, we propose to improve the feature discrimination by using the curl of the intensity gradient $\frac{\partial I}{\partial x}$ at each point. The curl is a vector operator that describes infinitesimal rotation: at every point, the curl is represented by a vector whose attributes (length and direction) characterize the rotation at that point. In our case, we use only the norm of $\text{Curl} I(x, y)$, given by:

$$\text{Curl} I(x, y) = \nabla \times \frac{dI(x, y)}{dx}$$

(3)

where $\nabla$ is the Del operator.

$$\text{Curl} I(x, y) = \sqrt{ \left( \frac{\partial I}{\partial y} \frac{\partial I}{\partial x} - \frac{\partial I}{\partial x} \frac{\partial I}{\partial y} \right)^2} = \frac{\partial I}{\partial y} \frac{\partial I}{\partial x} - \frac{\partial I}{\partial x} \frac{\partial I}{\partial y}$$

(4)

3.5 Feature tracking

The tracking process assumes that feature displacements between frames are small enough for two successive "search regions" to overlap. A search region is defined as a patch around a feature to track. This process is illustrated in Fig. 7. Considering that the illumination is stable between $I_1$ and $I_2$, a similarity-based metric provides good accuracy. This similarity is calculated with a SAD (Sum of Absolute Differences), as defined in Eq. 5, and the tracked position is given by Eq. 6 and 7. Tracking alone can still deliver wrong correspondences. To solve this problem, a matching technique is used as an outlier filter. Considering $x_2(h), y_2(h)$ as reference spatial coordinates for the feature matching (Eq. 6 and 7) and given $\text{corners}_2(x, y)$, the feature extraction from $I_2$ (Eq. 2), a pixel tracking is correct only if there is one unique feature in $I_2$ located in the region that surrounds the previously computed tracking localization (see Eq. 8 - 10).

$$SAD(a, b) = \sum_{u=-r}^{r} \sum_{v=-r}^{r} \left| \text{Curl} I_1(x + u, y + v) - \text{Curl} I_2(x + u + a, y + v + b) \right|$$

(5)

$$x_2(h) = x_1(h) + \arg\min_{a} SAD(a, b)$$

(6)

$$y_2(h) = y_1(h) + \arg\min_{b} SAD(a, b)$$

(7)
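The tracking step of Eq. 5 to 7 can be illustrated with a short Python sketch. It is only a software illustration under stated assumptions: the inputs `f1` and `f2` are generic 2D maps standing in for the curl images of $I_1$ and $I_2$, `r` is the patch radius, `s` is the half-size of the search region, and the names `sad` and `track` are hypothetical. Ties between equal SAD values are broken by the smallest displacement, which is a choice of this sketch.

```python
import random

def sad(f1, f2, x, y, a, b, r):
    """Sum of absolute differences between the patch centred at (x, y) in f1
    and the patch centred at (x + a, y + b) in f2 (cf. Eq. 5)."""
    return sum(abs(f1[y + v][x + u] - f2[y + v + b][x + u + a])
               for v in range(-r, r + 1) for u in range(-r, r + 1))

def track(f1, f2, x, y, r=2, s=3):
    """Position in f2 that minimises the SAD over displacements in [-s, s]."""
    _, a, b = min((sad(f1, f2, x, y, a, b, r), a, b)
                  for b in range(-s, s + 1) for a in range(-s, s + 1))
    return x + a, y + b

# Synthetic check: f2 is f1 shifted by 2 pixels in x and 1 pixel in y.
rng = random.Random(0)
f1 = [[rng.randint(0, 255) for _ in range(20)] for _ in range(20)]
f2 = [[f1[y - 1][x - 2] if y >= 1 and x >= 2 else 0 for x in range(20)]
      for y in range(20)]
x2, y2 = track(f1, f2, 8, 8)
```

In hardware all candidate displacements are evaluated in parallel, whereas this sketch evaluates them sequentially; the result is the same minimum.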

4. FPGA ARCHITECTURE

Fig. 8 presents an overview of the FPGA architecture for the feature matching algorithm. The architecture is centered on an FPGA implementation in which all recursive/parallelizable algorithms are accelerated in the FPGA fabric. The basis of the proposed architecture is the frame storage unit. In this block, frames captured by the imager are transferred to/from an external SDRAM memory using a DMA. Two consecutive frames are read out into buffers that hold local sections of the frames being tracked and that allow the local parallel access required for parallel processing.

4.1 Frame storage unit

Images from the image sensor are stored in an external SDRAM that holds at least 2 frames of the sequence, and the SDRAM is later read by the FPGA to cache parts of the frames into buffers. The frame storage unit is responsible for data transfers of image segments (usually several rows of pixels) to/from the SDRAM. The core of the FPGA architecture is the set of buffers attached to the local processors; these buffers temporarily cache image sections from the two frames and deliver parallel data to the processors. For the SDRAM controller, both Xilinx and Altera provide IPs for this purpose. For the buffers, we use a circular buffer scheme in which input data from the previous N rows are stored in memory buffers until an n×n neighborhood is scanned along subsequent rows. This approach has high hardware reuse and high flexibility for computer vision applications. For more details, see [1].
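The circular buffer idea can be illustrated in software. The following Python sketch is a rough analogue rather than the hardware design: it consumes a row-major pixel stream, keeps only the last n rows in a bounded queue, and emits every n×n neighborhood once the rows it needs are available. The function name `sliding_windows` and its parameters are hypothetical.

```python
from collections import deque

def sliding_windows(stream, width, n=3):
    """Yield n x n neighbourhoods from a row-major pixel stream.

    Only the last n rows are kept, mimicking a circular line buffer:
    the oldest row is dropped automatically when a new row completes.
    """
    rows = deque(maxlen=n)
    current = []
    for pixel in stream:
        current.append(pixel)
        if len(current) == width:          # a full image row has arrived
            rows.append(current)
            current = []
            if len(rows) == n:             # enough rows for an n x n window
                for x in range(width - n + 1):
                    yield [row[x:x + n] for row in rows]

# A 4x4 image streamed as the pixel values 0..15.
windows = list(sliding_windows(range(16), width=4, n=3))
```

For the 4×4 example, four 3×3 windows are produced, while at most three rows are ever held in memory.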

4.2 Feature extraction unit

Fig. 9a gives an overview of the feature extractor unit. First, the architecture computes the horizontal/vertical gradients, \( \partial I_1 / \partial x \) and \( \partial I_1 / \partial y \), respectively. Then it computes the \( A(x,y), B(x,y), C(x,y) \) variables. After that, a buffer delivers parallel data for the Gaussian filtering, and the reconfigurable convolution units (see [11]) compute the smoothing operation. Finally, the FPGA architecture computes the corner response metric and the non-maxima suppression step. In order to simplify the implementation of the square root operation in the feature extraction step, we adapted the architecture developed by Yamin Li and Wanming Chu [11]. This architecture uses a shift register mechanism and compares the most significant/least significant bits to compute the square root of Eq. 1 with relatively high accuracy and low hardware resources.
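To give an idea of how a square root can be obtained with shifts, comparisons and subtractions only, the sketch below shows the classic bit-by-bit integer square root in Python. It is a generic textbook method shown for illustration and is not claimed to reproduce the cited architecture; the function name `isqrt_shift` and the `bits` parameter (an even word width) are assumptions of this example.

```python
def isqrt_shift(n, bits=32):
    """Integer square root using only shifts, compares and subtractions.

    Generic bit-by-bit method; 'bits' must be even and n must satisfy
    0 <= n < 2**bits.
    """
    root = 0
    bit = 1 << (bits - 2)      # highest power of four inside the word
    while bit > n:             # skip leading positions that cannot contribute
        bit >>= 2
    while bit:
        if n >= root + bit:    # trial subtraction succeeds: set this result bit
            n -= root + bit
            root = (root >> 1) + bit
        else:                  # trial fails: this result bit stays zero
            root >>= 1
        bit >>= 2
    return root
```

Each loop iteration resolves one bit of the result, so a 32-bit operand needs at most 16 iterations, which maps naturally onto a small pipelined or iterative hardware unit.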

4.3 Feature discrimination unit

Fig. 9b shows an overview of the feature discrimination unit. It reuses the gradient computation carried out in the feature extraction unit. Then, the curl of the intensity gradient is computed as described in Section 3.4. Two logical processes compute the curl of the intensity gradient for \( I_1 \) and \( I_2 \) in parallel.

The outlier filter used by the feature matching step (Eq. 8 to 10) is defined as:

$$
\text{filter}_h(x,y) = \sum_{u=-1}^{1} \sum_{v=-1}^{1} \text{corners}_2(x+u, y+v)
$$

(8)

$$
x_f(h) = \begin{cases}
x_2(h) & \text{if } \text{filter}_h(x,y) = 1 \\
0 & \text{otherwise}
\end{cases}
$$

(9)

$$
y_f(h) = \begin{cases}
y_2(h) & \text{if } \text{filter}_h(x,y) = 1 \\
0 & \text{otherwise}
\end{cases}
$$

(10)

4.4 Feature tracking unit

For the feature tracking unit, we consider that the tracking problem can be seen as a generalization of the dense stereo matching problem. Stereo matching algorithms track all points/pixels within a stereo pair (searching along the horizontal axis around points in the reference image), whereas feature tracking aims to track feature points between two consecutive frames of a video sequence (searching around the spatial coordinates of the features in the reference frame). It is therefore possible to adapt previous stereo matching FPGA architectures to our application domain. In this work, we adapted the FPGA architecture presented in [14], which has low hardware requirements and high parallelism. The developed architecture is shown in Fig. 10. The feature points for \( I_1 \) are known, since they are obtained by the feature extraction unit. Then, the search region modules (see Fig. 7) construct \( n \) search regions, where search regions are constructed via logical pointers into the input buffer for \( I_2 \). For each feature point in the reference image \( I_1 \), the search region centers correspond to all patches within the search region in frame \( I_2 \). Once the search regions are constructed, the SAD similarity modules compute the correlation response (applying the sum of absolute differences as similarity metric), i.e., they compare all search regions with the reference region. Finally, a multiplexer tree determines the \( a, b \) indices that minimize the correlation function, and therefore the tentative position in \( I_2 \) of the feature points extracted from the reference image \( I_1 \).
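The role of the multiplexer tree can be illustrated with a pairwise reduction in Python. The sketch below is a software analogy under assumptions of our own: each tree level compares adjacent candidates and forwards the smaller one together with its index, and the function name `argmin_tree` is hypothetical. The input must contain at least one value.

```python
def argmin_tree(values):
    """Index of the minimum, found by pairwise comparison level by level.

    Each pass of the while loop corresponds to one level of a
    comparator/multiplexer tree; ties keep the left (lower-index) input.
    """
    nodes = list(enumerate(values))          # (index, value) pairs
    while len(nodes) > 1:
        next_level = []
        for i in range(0, len(nodes) - 1, 2):
            left, right = nodes[i], nodes[i + 1]
            next_level.append(left if left[1] <= right[1] else right)
        if len(nodes) % 2:                   # odd element passes through
            next_level.append(nodes[-1])
        nodes = next_level
    return nodes[0][0]

# SAD responses of five candidate displacements (illustrative values).
best = argmin_tree([7, 3, 9, 1, 4])
```

A tree over $k$ candidates needs about $\log_2 k$ comparison levels, which is why this structure suits a parallel hardware implementation better than a sequential scan.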

Figure 9: (a) FPGA architecture for the feature extractor unit. (b) FPGA architecture for the feature discrimination unit.

Figure 10: FPGA architecture for the feature tracking unit.

4.5 Feature matching unit

The feature matching unit consists of a single module that compares the feature tracking results with the visual features extracted from $I_2$. The feature matching $x_f(h), y_f(h)$ (final result) is obtained by comparing the tracking positions $x_2(h), y_2(h)$ with the visual features (corners) in frame $I_2$.
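This comparison can be sketched in Python as a direct software reading of Eq. 8 to 10: a tracked position is kept only when exactly one corner of $I_2$ lies in its $3 \times 3$ neighborhood, and it is set to $(0, 0)$ otherwise. The function name `match_filter` and the handling of image borders (out-of-range neighbors are ignored) are assumptions of this sketch.

```python
def match_filter(tracked, corners2):
    """Keep tracked points that have exactly one corner of frame 2 nearby.

    tracked  : list of (x2, y2) positions delivered by the tracking step
    corners2 : binary corner map of frame 2 (list of rows)
    Rejected points are returned as (0, 0), as in Eq. 9 and 10.
    """
    h, w = len(corners2), len(corners2[0])
    result = []
    for x2, y2 in tracked:
        count = sum(corners2[y2 + v][x2 + u]
                    for v in (-1, 0, 1) for u in (-1, 0, 1)
                    if 0 <= y2 + v < h and 0 <= x2 + u < w)
        result.append((x2, y2) if count == 1 else (0, 0))
    return result

# 6x6 corner map with corners at (x=2, y=2), (x=4, y=4) and (x=5, y=5).
corners2 = [[0] * 6 for _ in range(6)]
corners2[2][2] = corners2[4][4] = corners2[5][5] = 1
matches = match_filter([(2, 3), (4, 5), (0, 0)], corners2)
```

In the example, the first point has a single corner nearby and is accepted, the second sees two corners and is rejected as ambiguous, and the third has no corner in range.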

5. RESULTS

In order to validate our mathematical formulation, we implemented our feature matching algorithm as MATLAB R2016b code that captures video sequences from a web camera. Feature points (corners) are then tracked along the video sequence. In all experiments, feature points are obtained by applying the algorithm presented in Section 3.3. Video sequences of 1920$\times$1080 pixel resolution and 800 frames were used. Fig. 11 shows the results of applying our algorithm to an outdoor scenario.

Figure 11: Feature matching under outdoor scenarios. Our mathematical formulation reaches accurate matching (without outliers) and dense matching (more than 14000 pixels are matched).

5.1 Performance for dataset scenes

For performance comparisons, we compared our feature matching algorithm with previous feature matching/tracking algorithms. For feature tracking, we applied the KLT, KL and Mean-shift (MS) algorithms. For feature matching, we used three classic feature matching frameworks, based on the ORB, SIFT and SURF visual descriptors, respectively. We evaluated the algorithms with several video sequences; all videos were obtained from [18]. Two performance tests were conducted. Accuracy comparisons are shown in Table 1.

To validate the accuracy numerically, the RMS error is computed as $\epsilon = \sqrt{x^2 + y^2}$, where $x$ is the error along the $x$ axis, defined as the average difference between the ground truth $x$ position in each frame of the video sequence and the position in the same frame computed by the algorithm under test (the ground truth of the visual features was computed using the camera localization ground truth provided by the dataset), and $y$ is the error along the $y$ axis, computed in the same way as $x$. In all cases, our algorithm outperforms the KL and MS tracking algorithms, and outperforms feature matching algorithms based on visual descriptors such as ORB, SIFT and SURF. This is because the KL algorithm does not consider the occlusion problem. In addition, the KL algorithm uses simple similarity metrics, which introduce erroneous measurements under image ambiguities. The MS algorithm, on the other hand, was formulated for object tracking in dynamic scenes; therefore, its performance in rigid scenes is low. For algorithms that use robust visual descriptors (ORB, SIFT, SURF), occlusions, ambiguities and perspective changes between frames introduce outliers; therefore, accuracy is lower than for tracking approaches. Although the KLT algorithm outperforms our algorithmic formulation, KLT is highly exhaustive, its processing speed is low, and its implementation in real-time/embedded applications is highly limited.
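The error measure can be written out as a short Python sketch. It is only an illustration of the formula $\epsilon = \sqrt{x^2 + y^2}$ under one assumption: the "average difference" along each axis is taken as the mean absolute difference between ground truth and estimate, which the text does not state explicitly. The function name `rms_error` and the sample coordinates are hypothetical.

```python
import math

def rms_error(ground_truth, estimated):
    """epsilon = sqrt(x^2 + y^2), with x and y the mean absolute differences
    (assumed) between ground-truth and estimated positions per axis."""
    n = len(ground_truth)
    x_err = sum(abs(g[0] - e[0]) for g, e in zip(ground_truth, estimated)) / n
    y_err = sum(abs(g[1] - e[1]) for g, e in zip(ground_truth, estimated)) / n
    return math.sqrt(x_err ** 2 + y_err ** 2)

# Illustrative positions (in pixels) of one feature over two frames.
truth = [(10.0, 10.0), (20.0, 20.0)]
estimate = [(13.0, 14.0), (23.0, 24.0)]
epsilon = rms_error(truth, estimate)
```

With a constant offset of 3 pixels in $x$ and 4 pixels in $y$, the sketch yields an error of 5 pixels.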

For density comparisons: traditional feature matching algorithms extract and match visual features via visual descriptor comparisons. Unfortunately, in all cases, the maximal number of features that can be matched varies between 0.5% and 1.0% of all pixels in the image, depending on the selected descriptor and its particular configuration. In practice, computer vision systems work with configurations that allow extracting nearly 1% of the pixels of an image. This is illustrated in Table 2, where feature matching based on visual descriptors reaches less than 200 matches per frame. This limits performance in real-world applications since less than 1% of the image points are available; thus, visual environmental understanding, the application of high-level descriptors and the recognition of objects/structures in the scene could have low stability under real-world scenarios. Even the most current and popular feature matching approaches, for example those based on the ORB descriptor, are limited to sparse matching. On the other hand, tracking algorithms such as KLT and KL deliver high density compared with matching approaches, as shown in Table 2. Both KL and KLT can track nearly 2% of all points in the scene (×44 more than the matching algorithms). Our feature matching algorithm can track/match nearly 5% of the points within the scene (×2 more than previous feature tracking approaches and ×85 more than previous feature matching approaches).
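The density factors quoted above can be checked against Table 2 with a few lines of Python. The sketch uses the fr1/room row and assumes that its columns hold, in order, three descriptor-based matchers, then KLT, KL, MS and the proposed algorithm; this column assignment is our reading of the surrounding text, and the variable names are hypothetical.

```python
# Matches per frame for the fr1/room sequence (values taken from Table 2).
descriptor_matches = 167    # first descriptor-based matcher (assumed column)
klt_matches = 7469          # KLT tracker (assumed column)
proposed_matches = 14286    # proposed algorithm (assumed column)

tracking_vs_matching = klt_matches / descriptor_matches
proposed_vs_tracking = proposed_matches / klt_matches
proposed_vs_matching = proposed_matches / descriptor_matches

print(round(tracking_vs_matching, 1),
      round(proposed_vs_tracking, 1),
      round(proposed_vs_matching, 1))
```

The three ratios come out close to 44, 2 and 85, in line with the factors stated in the text.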

5.2 Performance for real world applications

We implemented our feature matching algorithm in a monocular-SLAM system. We applied our algorithm to obtain point correspondences across the video sequence. These point correspondences allow the fundamental matrix to be estimated. Finally, using the fundamental matrix, it is possible to estimate a 3D reconstruction and the camera pose along the video sequence. In this case, our feature matching algorithm is capable of increasing the point cloud density, as shown in Fig. 12. We consider that this could be highly useful for several computer vision applications that use feature matching in their mathematical formulations (SLAM, SfM, 3D reconstruction), since more information (\( \times 85 \) more than previous feature matching approaches and \( \times 2 \) more than previous feature tracking formulations) is available. Thus, visual environmental understanding, the application of high-level descriptors and the recognition of objects/structures could be improved. Regarding our FPGA architecture, we consider that our architectural formulation could be implemented within a smart camera fabric. In this scenario, our feature matching algorithm could be an important contribution for smart cameras, because several computer vision applications use point correspondences between frames/camera views as a key part of their mathematical formulation. For more details about the performance of the proposed algorithm, see the material accompanying this manuscript; all material can be found at https://dlr.mrs/i/S/1AgkoNeNXKa6ijFI0rkKB5LFp30_CP.

6. CONCLUSIONS

In this work, we have introduced a new feature matching algorithm that delivers accurate/dense feature matching under indoor/outdoor scenarios. We proposed a new mathematical formulation that addresses the feature matching task as a feature tracking problem, and we have used the curl of the intensity gradient as a feature discrimination technique. An FPGA architecture was laid down and hardware acceleration strategies were discussed. Since several computer vision applications use feature matching as a keystone of their mathematical formulations, we consider that feature matching within a smart camera fabric could be promising for current computer vision applications. We have applied our feature matching algorithm in a monocular-SLAM system and have shown that our algorithmic formulation improves performance in SLAM applications. As work in progress, we are implementing our feature matching algorithm inside the DREAMCAM [4], a robust/flexible smart camera.

7. ACKNOWLEDGMENTS

This work has been sponsored by the French government research program "Investissements d’avenir" through the IMobS3 Laboratory of Excellence (ANR-10-LABX-16-01), by the European Union through the program Regional competitiveness and employment 2007-2013 (ERDF Auvergne region), and by the Auvergne region. This work has also been sponsored by the National Council for Science and Technology (CONACyT), Mexico, through the scholarship No. 567804.

8. REFERENCES


Table 1: The proposed algorithm compared with previous feature matching/tracking algorithms (accuracy). Error is measured in pixels.

<table>
<thead>
<tr>
<th>Sequence</th>
<th>ORB</th>
<th>SIFT</th>
<th>SURF</th>
<th>KLT</th>
<th>KL</th>
<th>MS</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>fr1/room</td>
<td>67.22</td>
<td>79.38</td>
<td>76.24</td>
<td>0.21</td>
<td>4.81</td>
<td>9.62</td>
<td>1.87</td>
</tr>
<tr>
<td>fr2/desk</td>
<td>69.83</td>
<td>81.12</td>
<td>73.63</td>
<td>0.45</td>
<td>4.9</td>
<td>9.81</td>
<td>1.55</td>
</tr>
<tr>
<td>fr1/plant</td>
<td>59.29</td>
<td>77.74</td>
<td>75.24</td>
<td>0.39</td>
<td>4.12</td>
<td>8.25</td>
<td>1.7</td>
</tr>
<tr>
<td>fr1/teddy</td>
<td>75.38</td>
<td>83.53</td>
<td>76.73</td>
<td>0.47</td>
<td>4.91</td>
<td>9.82</td>
<td>1.71</td>
</tr>
</tbody>
</table>

Table 2: The proposed algorithm compared with previous feature matching/tracking algorithms (density). Density is measured as the number of feature matches per frame.

<table>
<thead>
<tr>
<th>Sequence</th>
<th>ORB</th>
<th>SIFT</th>
<th>SURF</th>
<th>KLT</th>
<th>KL</th>
<th>MS</th>
<th>Proposed</th>
</tr>
</thead>
<tbody>
<tr>
<td>fr1/room</td>
<td>167</td>
<td>174</td>
<td>79</td>
<td>7469</td>
<td>7546</td>
<td>85</td>
<td>14286</td>
</tr>
<tr>
<td>fr2/desk</td>
<td>183</td>
<td>188</td>
<td>77</td>
<td>7942</td>
<td>7252</td>
<td>77</td>
<td>14598</td>
</tr>
<tr>
<td>fr1/plant</td>
<td>124</td>
<td>158</td>
<td>78</td>
<td>6264</td>
<td>6576</td>
<td>74</td>
<td>13547</td>
</tr>
<tr>
<td>fr1/teddy</td>
<td>172</td>
<td>183</td>
<td>75</td>
<td>7722</td>
<td>7112</td>
<td>92</td>
<td>14968</td>
</tr>
</tbody>
</table>

Figure 12: The proposed algorithm applied in a monocular-SLAM system. (a) Feature matching via the BRIEF visual descriptor. (b) Feature matching via the proposed algorithm. With our algorithm, the 3D density is increased, so visual environmental understanding, the application of high-level descriptors and the recognition of objects/structures can be highly improved.