DreamCam: A modular FPGA-based smart camera architecture
Merwan Birem, François Berry

HAL Id: hal-01625648
https://hal.archives-ouvertes.fr/hal-01625648
Submitted on 28 Oct 2017
Merwan Birem*, François Berry

Institut Pascal – UMR 6602 UBP/CNRS – Campus des Cézeaux, 24 Avenue des Landais, 63177 Aubière Cedex, France

**Article info**

*Article history:* Received 9 July 2012
Received in revised form 9 October 2013
Accepted 21 January 2014
Available online 31 January 2014

**Keywords:**
Smart camera
Image processing
Interest points
VHDL
Harris and Stephens algorithm
Field Programmable Gate Array (FPGA)
Hardware implementation
Real-time system

**Abstract**

DreamCam is a modular smart camera built around an FPGA as its main processing board. The core of the camera is an Altera Cyclone-III associated with a CMOS imager and six private RAM blocks. The main novel feature of our work is a new smart camera architecture together with several modules (IP) that efficiently extract and sort visual features in real time. In this paper, extraction is performed by Harris and Stephens filtering associated with customized modules. These modules extract, select and sort visual features in real time. As a result, DreamCam (with such a configuration) provides a description of each visual feature in the form of its position and the grey-level template around it.

**1. Introduction**

Intelligent robots are becoming increasingly important. One of the key components of an intelligent robot is its ability to understand its environment and recognize its position. In the robot community, most researchers use information from sources such as odometry, laser-range-finders, and sonar sensors. In contrast, in the vision community, new methods using information from camera sequences are being developed, see [1,2].

Using an entire image as an observation is difficult or impossible owing to the high resolution, typically of the order of a hundred thousand pixels. Thus, feature extraction, which is the first crucial step, should be used to identify interest points [3].

The algorithms that extract features are time-consuming, which is a major drawback when developing real-time applications. One solution to this problem is the use of dedicated hardware for the algorithms, such as a Field Programmable Gate Array (FPGA), which can provide dedicated functional blocks that perform complex image processing operations in parallel. In addition to the parallel properties of FPGAs, which lead to high throughput, an FPGA has a small footprint and low power consumption, which makes it ideal for mobile applications.

FPGAs have achieved rapid acceptance and growth over the past decade because they can be used in a very wide range of applications [4]. Although they are slower than traditional Application Specific Integrated Circuits (ASICs), their design flexibility is a major advantage. Users can change the program as desired at any stage of the experiment, thereby saving time and cost.

In this paper, we propose a customized smart sensor based on a CMOS imager. The main original feature is system management by a System-On-Chip integrated in an FPGA. Our approach allows most early perception processes to be performed in the main sensing unit (FPGA), and sends just the main sensing features to the host computer so as to reduce a classic communication bottleneck. Another advantage of this method is the real-time feedback on the sensor. The different parameters can be actively tuned to optimize perception and render it similar to primate perception. For instance, in strong light the pupil contracts and becomes small, but still allows light to be cast over a large part of the retina [5]. This embedded sensor can be considered as a reactive architecture, and above all, as a research platform for the smart sensor.

To highlight the novel elements in our work, we present in the following section previous research carried out in this field. In Section 3, we give a broad overview of the smart sensor. An application for the extraction of visual features based on the Harris and Stephens algorithm is presented in Section 4. In this work, we consider feature extraction as a combination of feature detection followed by feature description. Thus, feature detection consists in finding the interest points (features) in the image, whereas feature description consists in representing them. The final goal is to compare them with other interest points for applications such as navigation, object recognition, etc. In Section 5 we include results that support the relevance of our approach and in Section 6 we give a conclusion.

* Corresponding author. Tel.: +33 760517676.
E-mail address: merwan.birem@hotmail.fr (M. Birem).

2. Previous work

Comparison of our work with previous works can be done on two levels:

- System-level: At this level, we review the most popular smart cameras developed in the last decade. As a reminder, a smart camera is defined as a vision system whose fundamental function is to produce a high-level understanding of the imaged scene. A camera is called a “smart camera” when it performs application-specific information processing (ASIP). The output of such cameras is either the features extracted from the captured images or a high-level description of the scene. More details about such systems can be found in articles by Wolf [6] and Shi and Lichman [7].

Table 1 presents an overview of the most common smart camera platforms found in the literature.

Other works have been based on silicon-integrated smart cameras. In these systems, the authors propose an approach in which image sensing and processing are integrated on a single silicon die. Such devices are called “vision chips”. The interested reader can find a detailed description in [14]. Among these works, the SCAMP project by P. Dudek is one of the best known [15]. Other contributions in the same vein can be found in [16,17].

- Algorithm-level: As explained below, we implemented a “Harris and Stephens”-based algorithm to extract visual features in our camera. At the output of this extraction step, several modules (filtering, sorting, description) were added to provide high-level features from the image. Thus, in this part, we propose a short overview of published papers about the implementation of the Harris and Stephens detector on FPGA. To our knowledge, there is no work proposing a full approach with detection, filtering, sorting and description steps. Consequently, the works presented below represent a fragmented bibliography mainly focused on Harris detection.

The work presented in [18] implements Harris detection on an FPGA connected to a stereo rig. The FPGA provides a simple extraction on 320 × 480 pixels at 27 fps. Another “stereo camera-based” work was proposed by Dietrich [19]. In this work, the author used a Spartan-3E to rectify the stereo images and to detect the Harris points. Most of the authors took the same basic approach to implementing the Harris and Stephens detector. In [20], the authors used a Spartan Filter to define the region of interest. With these windows, a classifier is used to detect and identify some objects in the scene. However, these works (see Table 2) propose only architectures to detect corners by the Harris and Stephens method. In our work, we propose to filter the detected points, to sort the most robust ones and to describe each feature by a grey-level template. These last steps are fundamental in computer vision, in which the input is not images but semantic features.

3. Hardware description of the “DreamCam”

The goal of artificial vision research is to exploit images or image sequences generated by sensors, in order to effectively translate an environment. From this translation, different processes can be used to identify or inspect an object, control robots, etc. The first way to treat the vision problem is to carry out passive vision approaches, which is the classic way to analyze images. In contrast, another approach exists known as “active vision”, which is the result of an attempt to simulate the human visual system.

Based on this concept, our approach consists in integrating the control of the imager in the perception loop, especially in the early perception processes. By integration of early processing, close to the imager, a reactive sensor can be designed. With such a smart sensor, it is possible to perform basic processing and the selection of relevant features [4]. For example, FPGAs have already been used to accelerate real-time point tracking [21], stereo-vision [22], color-based object detection [23], and video and image compression [24]. In our case, the notion of a system on programmable chip (SOPC) describes the whole system.

Most vision applications are focused on several small image areas and consequently acquisition of the whole image is not always necessary. It is evident, therefore, that one of the main goals of an efficient vision sensor is to select regions of interest (ROI) in the image and concentrate processing resources on these. The notion of local study is then predominant and the choice of imaging technology becomes crucial. This explains why the CMOS imager was chosen. It is generally accepted that CMOS technology will replace the traditional CCD technique in many applications owing to its capabilities:

- to allow accessing only parts of the image (ROI)
- to allow higher speeds (up to 60 MHz per output channel)
- to allow functionality on the chip (camera-on-the-chip)
- to provide a much higher dynamic range (up to 120 dB)
- to be based on standard manufacturing processes

Table 1

Overview of the most common smart camera systems.

<table>
<thead>
<tr>
<th>System</th>
<th>Sensor</th>
<th>CPU</th>
<th>Power</th>
<th>Application</th>
</tr>
</thead>
<tbody>
<tr>
<td>CMUCam [8]</td>
<td>CMOS Omnivision</td>
<td>Proc. ARM7</td>
<td>Battery</td>
<td>Robotic applications</td>
</tr>
<tr>
<td>MeshEye [9]</td>
<td>ADNS-3060 optical mouse sensor + CMOS VGA</td>
<td>Micro-controller AT91SAM7S</td>
<td>Battery</td>
<td>Distributed imaging applications</td>
</tr>
<tr>
<td>SeeMOS [10]</td>
<td>CMOS Cypress Lupa 4000</td>
<td>FPGA Stratix 60</td>
<td>Mains</td>
<td>Tracking</td>
</tr>
<tr>
<td>LE2I-Cam [11]</td>
<td>CMOS Micron (MT9M413)</td>
<td>FPGA Virtex II</td>
<td>Mains</td>
<td>High speed imaging</td>
</tr>
<tr>
<td>WiCa mote [12]</td>
<td>VGA CMOS</td>
<td>Xetal IC3D</td>
<td>Battery</td>
<td>Vehicle detection and speed estimation</td>
</tr>
<tr>
<td>ITI [13]</td>
<td>LM-9618 CMOS</td>
<td>DSP TMS320C6415</td>
<td>Mains</td>
<td>Traffic control</td>
</tr>
</tbody>
</table>

Table 2

Hardware implementation of H&S algorithm.

<table>
<thead>
<tr>
<th>System</th>
<th>Sensor</th>
<th>Processor</th>
<th>Resol. @ Fps</th>
</tr>
</thead>
<tbody>
<tr>
<td>[19]</td>
<td>EyeBot M6 OV6690</td>
<td>Xilinx Spartan-3E</td>
<td>352 × 288 @ 7.37</td>
</tr>
<tr>
<td>[18]</td>
<td>MT9P031 CMOS Micron</td>
<td>Xilinx Virtex-II</td>
<td>320 × 480 @ 27</td>
</tr>
<tr>
<td>[20]</td>
<td>Video frame from a memory</td>
<td>Xilinx Virtex-5 FPGA</td>
<td>640 × 480 @ 266</td>
</tr>
</tbody>
</table>

The global processing system is composed of a SOPC (System On Programmable Chip), by which an entire system of components is put on a single chip (FPGA). The fine-grained structure of the FPGA allows the development of highly optimized implementations. Image processing is well known to be algorithmically simple but computationally costly. Moreover, an FPGA is well suited to interfacing with a wide range of peripheral devices. DreamCam is a modular smart camera in which image sensors or communication boards can be easily changed.

3.1. Global hardware architecture

The architecture of the camera is constructed with five interconnected boards as shown in Fig. 1. The core of this system is an FPGA, which allows high versatility. Thus, the image sensor board and the communication board can be easily replaced or updated in order to change the type of imager or the communication layer. Currently, we offer two different image sensors and a choice of USB2.0 or Giga-Ethernet communication link. Each board is described in detail below.

3.1.1. Image sensor board

Both developed image sensor boards are based on a similar electronic architecture. This architecture can accept parallel differential or single-ended outputs from different kinds of image sensors. The image sensors used in this work are:

- MT9M031 imager: This 1.2-megapixel (1280 × 960) CMOS image sensor is manufactured by Aptina. It can operate at 45 fps at full 1280 × 960 pixel resolution or at 60 fps at 720p HD resolution (reduced FOV). The power consumption is 270 mW in 720p60 mode. The dynamic range is 83.5 dB, which is quite high for a global shutter sensor.
- EV76C560 imager: This is a 1.3-megapixel (1280 × 1024) CMOS active pixel sensor dedicated to industrial vision that features both rolling and global shutters. The pixel design offers excellent performance in low-light conditions with a high readout speed of 60 fps at full resolution. Novel pixel integration/read-out modes and embedded image pre-processing deliver superior performance parameters, including a bi-frame wide dynamic range (>100 dB). Other on-chip pre-processing functions are included, such as image histograms, multi-ROI, defective pixel correction, etc.

3.1.2. Processing board

This is the main part of the system and is built around an Altera Cyclone-III EP3C120 FPGA (Fig. 2). The need for strong parallelization led us to connect six 1 MByte asynchronous SRAM memory blocks to the FPGA. Each memory has a private data bus and a private address bus. Consequently, six processes (using 1 MB each) can access all the memories at the same time. We chose the low-power Cyclone III FPGA family for the following reasons:

- Firstly, its architecture consists of 120 K vertically arranged logic elements (LEs), 4 Mbits of embedded memory arranged as 9-Kbit (M9K) blocks, and 288 18 × 18 embedded multipliers.
- Secondly, Cyclone integrates DSP Blocks. These embedded DSP Blocks have been optimized to implement several DSP functions with maximum performance and minimum use of logic resource. In addition, these embedded DSP Blocks can be used to create DSP algorithms and complex math routines in high-performance hardware DSP Blocks and they can be viewed as custom instructions to the NIOS CPU.
- Lastly, Cyclone is optimized to maximize the performance benefits of SOPC integration based on a NIOS embedded Processor. A NIOS processor is a user configurable soft core processor, allowing many implementations and optimization options.

3.1.3. Communication board

This board is connected to the main board and manages all communications with the host computer. The communication layer is currently either high-speed USB 2.0 or Giga–Ethernet.

- USB2.0 is managed by the Cypress CY7C68013 microcontroller. It incorporates an enhanced processor based on an 8051 core whose instruction set is compatible with the standard 8051 and improved in many ways. For example, the maximum operating frequency is 48 MHz, an instruction cycle takes four clock cycles, and the device provides two UART interfaces, three counters and an I2C interface.

- The Giga-Ethernet protocol is handled by the Marvell 88E1111 transceiver. It is a physical layer device containing a single Gigabit Ethernet (GbE) transceiver. The transceiver implements the Ethernet physical layer portion of the 1000BASE-T, 100BASE-TX, and 10BASE-T standards.

3.1.4. Memory board

The bank of memories contains six asynchronous SRAM memories, each with a size of 1 MWord. These memories are high-speed, 16 Mbit static RAMs organized as 1024 K words by 16 bits. They have a fast access time of 8 ns at 3.3 V and can be easily controlled. For instance, a read cycle consists only in presenting an address; after an output hold time the data can be read.

3.1.5. Power board

The different boards need different supply voltages according to their respective devices. The initial input voltage is 6.5 V and a set of regulators generates the different voltages. The overall power consumption with the FPGA unprogrammed is approximately 1.4 W. Of course, this consumption varies widely with the configuration of the FPGA. This board provides 18 different voltages from 1.2 V to 5 V. In addition, a JTAG programmer (USB Blaster-like) has been integrated to configure the Cyclone III FPGA.

3.2. Internal FPGA design

The aim of the proposed design is to create a flexible interface between the sensing device board and the host computer. This is how the whole system is separated into two main parts: a software part which is basically a C++ code that retrieves the data that are in the USB packets sent from DreamCam; and a hardware part, which is developed in this paper.

In this approach, two blocks are very important and must be used in every design: the first is the Image sensor IP block, which controls the CMOS image sensor, and the second is the Communication IP block, which manages communication between the host computer and the DreamCam. The Mem IP blocks are used when external memories are needed. These blocks control each memory by generating its input signals (Address Bus: A0–A19, Chip Enable signal: CE, Write Enable signal: WE, Output Enable signal: OE) and by receiving the data. Finally, the Image processing algorithm block contains the algorithm that we want to implement on the FPGA (in the present work the Harris algorithm was chosen). The diagram of the system is shown in Fig. 3.
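As a small worked check, the 20-bit address bus (A0–A19) can be related to the memory organization given in Section 3.1.4 (1024 K words by 16 bits). The sketch below is only illustrative: the helper `pixel_address`, the row-major layout and the choice of one pixel per 16-bit word are assumptions, not details taken from the DreamCam design.

```python
# Capacity addressed by the bus A0..A19 with 16-bit words (values from the text).
ADDRESS_BITS = 20
WORD_BITS = 16

words = 1 << ADDRESS_BITS            # number of addressable words
capacity_bits = words * WORD_BITS    # total capacity of one SRAM in bits


def pixel_address(x, y, cols):
    """Word address of pixel (x, y), assuming row-major order and one
    pixel per word (hypothetical layout, for illustration only)."""
    return y * cols + x


# Last pixel of an image with 1024 columns and 800 lines.
last_address = pixel_address(1023, 799, 1024)

print(words, capacity_bits, last_address)
```

Under these assumptions a full 1024 × 800 image fits in a single memory, since its last pixel address is below the number of addressable words.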

These different blocks work together as follows. After the system is powered up, the CMOS imager starts to work under the control of the Image sensor IP inside the FPGA by sending the pixels of the image one by one and line by line (flow noted Pix in Fig. 3). These pixels are sent to the Image processing algorithm block, where they are processed according to the algorithm implemented in it. After that, the results are sent to the Communication IP block (flow noted Data Frame in Fig. 3), where they are packed and sent.

4. Harris corner extractor application

In an image, the corner is an important local feature which focuses on a great amount of important image information and is rarely affected by illumination change [25]. In addition, it has rotation invariant properties [26]. Provided there is no data loss, the corner feature is the smallest piece of data to deal with so that it improves the speed of detection. Thus, corner detecting has many important applications in practice, especially in the real-time target tracking field and autonomous navigation of vehicles.

In this section, we propose an implementation of a feature extractor on the DreamCam. The term feature extractor is used to describe the combination of a feature detector and a feature descriptor. Detectors are used to find interest points in an image, after which a descriptor is created that describes the local neighborhood around the points. A state-of-the-art overview of feature extractors is given in [27].

Many extractors of features from an image have been reported in the literature. They differ in the method used to detect and describe the features, which implies a difference in algorithm complexity, processing time and resources needed. However, if the complexity of the algorithm increases, computation becomes heavier.

The feature extractor used in our work is a combination of the Harris corner detector [26] and a simple descriptor which gives for each interest point an intensity patch from the image. The Harris corner detector (also known as the Harris–Stephens or Plessey detector) is one of the most widely used interest point detectors, owing to its improved detection rate over the Moravec [28] corner detector and to its high repeatability rate.

In the Harris corner detector the main operations used are the first derivative and convolution. These operations are single instruction multiple data (SIMD) and therefore highly parallelizable, which means they are suitable for implementation on FPGAs, which are low cost and high density gate arrays capable of performing many complex computations in parallel while hosted by conventional computer hardware.

A pixel is considered to be an interest point when its interest value $R_i$ is higher than the predefined threshold, and the higher this value the more accurate is the detected interest point. The value is computed for each pixel according to Harris and Stephens [26] using the following formula:

$$ R_i = \text{Det}(M_i) - k \times \text{Trace}(M_i)^2 $$

where

$$ M_i = \begin{pmatrix} A & C \\ C & B \end{pmatrix} $$

which means that

$$ R_i = (A \times B - C^2) - k \times (A + B)^2 $$

Here $k \in [0.04, 0.06]$ is an empirical value [26],

$$ A = \left( \frac{\partial I}{\partial x} \right)^2 \otimes W \quad B = \left( \frac{\partial I}{\partial y} \right)^2 \otimes W \quad C = \left( \frac{\partial I}{\partial x} \frac{\partial I}{\partial y} \right) \otimes W $$

$$ W(u, v) = \exp\left( -\frac{u^2 + v^2}{2\sigma^2} \right) $$

and $\frac{\partial}{\partial x}$, $\frac{\partial}{\partial y}$ are the $x$ and $y$ derivative operators. $I$ is the $3 \times 3$ window surrounding the $i$th pixel, and $\otimes$ represents the convolution operator.
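The formulas above can be illustrated with a minimal software sketch. This is not the VHDL implementation of the paper: the function names, the use of central differences for the derivatives, the value $\sigma = 1$ and the zero-valued borders are all assumptions made for the example.

```python
import math


def gaussian_window(size=3, sigma=1.0):
    """W(u, v) = exp(-(u^2 + v^2) / (2 sigma^2)) sampled on a size x size grid."""
    h = size // 2
    return [[math.exp(-(u * u + v * v) / (2.0 * sigma * sigma))
             for u in range(-h, h + 1)] for v in range(-h, h + 1)]


def harris_response(img, k=0.04, size=3, sigma=1.0):
    """Return the interest value R for every pixel of a 2-D list of grey levels.

    Pixels where the derivative or the window does not fit keep R = 0.
    """
    rows, cols = len(img), len(img[0])
    # First derivatives by central differences (an assumed operator).
    ix = [[0.0] * cols for _ in range(rows)]
    iy = [[0.0] * cols for _ in range(rows)]
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            ix[y][x] = (img[y][x + 1] - img[y][x - 1]) / 2.0
            iy[y][x] = (img[y + 1][x] - img[y - 1][x]) / 2.0
    w = gaussian_window(size, sigma)
    h = size // 2
    r = [[0.0] * cols for _ in range(rows)]
    for y in range(h, rows - h):
        for x in range(h, cols - h):
            a = b = c = 0.0
            # Weighted sums of the derivative products over the window.
            for v in range(-h, h + 1):
                for u in range(-h, h + 1):
                    wt = w[v + h][u + h]
                    gx, gy = ix[y + v][x + u], iy[y + v][x + u]
                    a += wt * gx * gx
                    b += wt * gy * gy
                    c += wt * gx * gy
            r[y][x] = (a * b - c * c) - k * (a + b) ** 2
    return r


# Synthetic 10 x 10 image: a bright square whose corner lies near (x, y) = (5, 5).
image = [[100 if (y >= 5 and x >= 5) else 0 for x in range(10)] for y in range(10)]
R = harris_response(image)
print(R[4][4] > 0, R[7][5] < 0, R[1][1] == 0.0)
```

On this synthetic image the response is positive next to the corner, negative along a straight edge (where only one derivative is non-zero) and zero in flat areas, which is the behaviour the thresholding step relies on.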

The feature extraction system proposed in this paper first detects the interest points in an image, sorts them, and then describes them using a patch of pixels from the image. The system is composed of several modules (see Fig. 4) that have been developed in VHDL (VHSIC Hardware Description Language), and are fully compatible with an FPGA implementation. The main modules of the system are as follows.

1. The Harris corner detector module: Detects the interest points and filters them, in order to obtain only one point (pixel) for each corner.
2. The sort module: Sorts the interest points in decreasing order according to their interest value \( R_i \).
3. The swap memories module: Retrieves the patch of each interest point, and constructs the data frame containing the interest point coordinates \( (X_i, Y_i) \) and their patches. More details about the data frame are given in Section 4.3.

In the first module the results of the detection are filtered, because when an image is processed using the Harris corner detector, several pixels around a corner will be considered as interest points. The desired outcome is one interest point, which means one pixel, for each corner (see Fig. 6). In previous works [29,19,30,20], this problem was solved by the non-maximum suppression method. To perform such a suppression, a window of a specific odd size is moved over the picture after all the pixels have been processed. If the center value of the window is the maximum interest value within the whole window, the filter response is one; otherwise the filter response is zero. In terms of hardware considerations, this method has several disadvantages, such as the use of more memory and FIFO queues. In addition, it induces latency due to the buffering of at least three lines when a 3 × 3 window is used to perform the non-maximum suppression.
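For reference, the classic non-maximum suppression described above can be sketched in a few lines. This is a generic software illustration, not code from the cited works; the function name and the handling of ties (equal maxima are all kept) are assumptions.

```python
def non_max_suppression(r, threshold, size=3):
    """Return the (x, y) positions whose interest value is above the
    threshold and is the maximum of the size x size window centred on them."""
    h = size // 2
    rows, cols = len(r), len(r[0])
    points = []
    for y in range(h, rows - h):
        for x in range(h, cols - h):
            centre = r[y][x]
            if centre <= threshold:
                continue
            window = [r[y + v][x + u]
                      for v in range(-h, h + 1) for u in range(-h, h + 1)]
            # The whole window must be available before deciding, which is
            # why a hardware version has to buffer several image lines.
            if centre >= max(window):
                points.append((x, y))
    return points


# A cluster of four responses around one corner: only the strongest survives.
response = [
    [0, 0, 0, 0, 0],
    [0, 5, 7, 0, 0],
    [0, 6, 9, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
corners = non_max_suppression(response, threshold=4)
print(corners)
```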

4.1. Harris corner detector module

This module represents the feature detector (the first part of the feature extractor). Fig. 5 gives an overview of the architecture used to implement the Harris detector algorithm on FPGA.

To achieve a higher frequency, the system has to be parallelized to the maximum degree possible allowed by the architecture of the smart camera presented in this article. As shown in Fig. 5, all operations that are independent of one another were implemented separately. The performance of the system can also be increased by using DSP-blocks for all the multiplications and summations that are in the Harris corner detector algorithm.

The system receives the stream of pixels and places them one by one in a FIFO queue. The calculation of the interest value \( R_i \) will start when the FIFO queue is almost full, more precisely when the second pixel of the fifth line is reached. After calculation of the interest value \( R_i \), a simple comparator is used to determine if the treated pixel is an interest point or not.

This module contains a submodule that has nearly the same function as non-maximum suppression. The main difference between the two is in the pixel that will be kept. In non-maximum suppression the pixel kept is the one with the highest interest value \( R_i \) in an odd-size window. In the submodule implemented on FPGA the pixel kept is the first one to appear, which means there is no need for more memory or latency to obtain the results. This submodule is based on two notions. When an interest point is detected the system will check if there is no interest point near it in the \( m \) previous lines. If this is the case, the pixel will be set as an interest point, and the following \( n \) pixels will not be treated at all. If there is an interest point near it in one of the \( m \) previous lines the system passes to the next pixel and so on. Fig. 6 shows an example of the results obtained with and without this module.
The most important output signals of this module are \( CE \), \( FE \), \( Xi \), and \( Yi \). The first and second ones are set to "1" (for one clock cycle) when an interest point is detected and when the last pixel of an image is treated, respectively. The last two signals, \( Xi \) and \( Yi \), represent the coordinates of the detected interest point.
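The first-come filtering submodule described in this section can be sketched as follows. The text does not define precisely what "near" means, so the sketch assumes that a candidate is near a kept point when that point lies in one of the \( m \) previous lines and within \( n \) columns; the function name and the list-based bookkeeping are likewise illustrative and do not reflect the hardware structure.

```python
def first_come_filter(r, threshold, m=2, n=2):
    """Keep the first pixel of each corner cluster, scanning in raster order.

    A candidate is dropped when a kept point exists in one of the m previous
    lines within n columns (assumed meaning of "near"). After a point is
    kept, the following n pixels of the line are skipped.
    """
    rows, cols = len(r), len(r[0])
    kept = []
    for y in range(rows):
        x = 0
        while x < cols:
            if r[y][x] > threshold:
                near = any(0 < y - py <= m and abs(x - px) <= n
                           for px, py in kept)
                if not near:
                    kept.append((x, y))
                    x += n  # the following n pixels are not treated
            x += 1
    return kept


response = [
    [0, 0, 0, 0, 0],
    [0, 5, 7, 0, 0],
    [0, 6, 9, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
kept_points = first_come_filter(response, threshold=4)
print(kept_points)
```

On this small cluster the filter keeps the first pixel encountered, (1, 1), rather than the pixel with the highest interest value, and the decision is taken as soon as the pixel arrives, without waiting for later lines.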

### 4.2. Sort module

The robustness and accuracy of an interest point depend on the value of \( Ri \): the higher the value of \( Ri \), the more robust and accurate the point.

The technique used to sort the interest points is the one described in [31]. It is done in two steps:

- **Step 1**: The ordering process, in which the order of the detected interest points is found.
- **Step 2**: The rearranging process of the points, which places them in a memory according to their order.

In this method, the sorting of the detected interest points requires the presence of all points. This is why the sorting is done only after image processing. It means that sorting the points detected in the \( i \)th image will be done while the \( (i + 1) \)th image is being processed.

The principle of this sorting method is as follows. For a given set of interest points \( SetIP = \{IP_1, IP_2, \ldots, IP_n\} \), the order \( Ci \) (of each interest point) is easily calculated by counting results of comparisons between \( Ri \) (the interest value of the \( i \) th point) and all the other values. Each time a value higher than \( Ri \) is found, \( Ci \) is incremented by 1. \( Ci \) represents the number of items in the set having a value higher than the \( i \) th point, and represents the order of the point. The rearranging process uses the different values of \( Ci \) as addresses to put the points in decreasing order in a memory.

The basic algorithm to compute the order \( Ci \) is as follows.

**Algorithm 1. Compute the order: \( Ci \)**

\[
\begin{array}{l}
C_i = 0 \\
j = 1 \\
\text{while } j \leq n \text{ do} \\
\quad \text{if } R_i < R_j \text{ then} \\
\quad \quad C_i = C_i + 1 \\
\quad \text{end if} \\
\quad j = j + 1 \\
\text{end while}
\end{array}
\]
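The two steps (ordering, then rearranging) can be sketched in software as below. The function name and the tuple layout are illustrative. One point is an assumption added for the example: with a strict comparison, two points with equal \( Ri \) would receive the same order \( Ci \) and collide at the same address, so the sketch breaks ties by index.

```python
def rank_sort(points):
    """Sort (R, x, y) tuples by decreasing R using comparison counting.

    Step 1 computes the order C_i of each point by counting the points with
    a higher interest value; step 2 uses C_i as the address in the output.
    """
    n = len(points)
    out = [None] * n
    for i, (ri, xi, yi) in enumerate(points):
        ci = 0
        for j, (rj, _, _) in enumerate(points):
            # Strict comparison as in Algorithm 1, plus an index tie-break
            # (assumption) so that equal values get distinct addresses.
            if rj > ri or (rj == ri and j < i):
                ci += 1
        out[ci] = (ri, xi, yi)  # rearranging step: C_i is the address
    return out


detected = [(10, 0, 0), (30, 1, 1), (20, 2, 2), (30, 3, 3)]
ordered = rank_sort(detected)
print(ordered)
```

Every comparison is independent of the others, which is what makes this counting scheme attractive for a parallel hardware implementation, at the cost of a number of comparisons that grows with the square of the number of points.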

### 4.3. Swap memory module

This module represents the feature descriptor (the second part of the feature extractor). As mentioned previously, the descriptor chosen to be implemented on FPGA is a simple one, which gives an intensity patch for each interest point. This module receives signals \( CE \), \( FE \), \( Xi \), and \( Yi \) from the sort module. The signals are used to construct the data frame shown in Fig. 7.

The first two elements and all the \((Xi, Yi)\) coordinates of the data frame are obtained from the Harris corner detector and sort modules. The pixels in each patch are obtained from one of two memories, \( M_1 \) or \( M_2 \) (see Fig. 4). These memories contain the previously processed image (in read mode) and the image currently being processed (in write mode), respectively. At the end of each image, the two memories exchange their operating modes, i.e. they swap.

This module is composed of two processes. The first one constructs the table that will contain the memory addresses of the detected interest points. The second process constructs the data frame. To do this, the process uses the memory put on read mode and the table constructed by the first process. This module controls the two asynchronous memories by controlling their \( WE \) signal, and their \( DATA \) and \( ADD \) buses.
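The swapping of the two memories can be sketched as a double buffer. The class and method names are hypothetical, and returning zeros for patch pixels that fall outside the image is an assumption, since the text does not say how borders are handled.

```python
class SwapMemories:
    """Two frame buffers that exchange read and write roles after each image."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.mem = [[0] * (rows * cols), [0] * (rows * cols)]
        self.write_idx = 0  # index of the memory currently in write mode

    def write_pixel(self, x, y, value):
        """Store a pixel of the image under treatment."""
        self.mem[self.write_idx][y * self.cols + x] = value

    def swap(self):
        """Exchange the operating modes at the end of an image."""
        self.write_idx ^= 1

    def read_patch(self, x, y, w):
        """Return the w x w patch centred on (x, y) from the memory in read
        mode; pixels outside the image are returned as 0 (assumption)."""
        src = self.mem[self.write_idx ^ 1]
        h = w // 2
        patch = []
        for v in range(-h, h + 1):
            for u in range(-h, h + 1):
                px, py = x + u, y + v
                inside = 0 <= px < self.cols and 0 <= py < self.rows
                patch.append(src[py * self.cols + px] if inside else 0)
        return patch


sm = SwapMemories(4, 4)
for yy in range(4):
    for xx in range(4):
        sm.write_pixel(xx, yy, yy * 4 + xx)   # first image
sm.swap()
for yy in range(4):
    for xx in range(4):
        sm.write_pixel(xx, yy, 99)            # next image, written meanwhile
patch = sm.read_patch(1, 1, 3)                # read from the first image
print(patch)
```

The patch is read from the first image even though the second image has already been written, which mirrors how descriptors of image \( i \) can be built while image \( i + 1 \) is being acquired.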

The combination of this module with the Harris corner detector and the sort module will give us a full feature extractor that detects, sorts, and describes the interest points. In other words, the extractor takes images as input and provides semantic information as output. This information can be used for navigation, 3D reconstruction or other applications.

### 5. Experimental results

The proposed algorithm (presented in the previous section) was implemented on the DreamCam. For all results given in this section, the DreamCam was equipped with an E2V imager and a USB2.0 communication layer. The image resolution was \( 800 \times 1024 \) and the size of each interest point patch was set to \( 15 \times 15 \). The software used for these experiments was Quartus II V13.0 with the default options (no optimization for performance or particular effort level).

5.1. Consumption of FPGA resources

A first result concerns the consumption of resources in the FPGA. The desired number of interest points directly impacts the consumption of logic elements in the FPGA. This is due to the feature descriptor module (named Swap memory module in the architecture), whose role is to prepare data output. To do this it has to store each point descriptor with its grey-level template, its coordinates and Ri value, and it is this storage process that draws on internal resources of the FPGA.

Fig. 8 shows the linear consumption of Logic Elements for a linear increase of desired interest points.

For information purposes, Table 3 gives all resources used for 200, 400 and 600 points of interest.
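As a rough worked example, the incremental cost per interest point can be estimated from the Total Logic Element figures of Table 3. The sketch only differences the three published values; it is not a model of the design and should not be extrapolated beyond the measured range.

```python
# Total Logic Elements reported in Table 3 for 200, 400 and 600 points.
points = [200, 400, 600]
logic_elements = [31433, 60034, 87754]

# Logic elements added per extra interest point between consecutive rows.
slopes = [
    (logic_elements[i + 1] - logic_elements[i]) / (points[i + 1] - points[i])
    for i in range(len(points) - 1)
]
print(slopes)
```

The two increments are close to each other (roughly 140 logic elements per additional point), which is consistent with the approximately linear trend reported for Fig. 8.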

5.2. Maximum frequency

A second result concerns the maximum operating frequency as a function of the maximum number of Harris points. The maximum frequency decreases from 105 MHz to 80 MHz (Fig. 9). This decrease is explained by the increase in the length of the critical path in the Swap memory module.

5.3. Comparison with other works

As explained in the section “Previous work”, other authors focused only on the Harris and Stephens implementation. We therefore compare the performance of the existing works with our implementation of the Harris and Stephens module in Table 4.

The best result is given by [20], but they used a Virtex 5. Virtex 5 offers a clock tree of up to 550 MHz, whereas the clock tree specification for Cyclone III (speed grade 7) is only 430 MHz.
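One way to read the figures of Table 4 is to compare the pixel rate (resolution times frame rate) with the reported maximum frequency. The sketch below does this arithmetic; interpreting a ratio close to one as "about one pixel processed per clock cycle" is our assumption and is not stated in the compared works.

```python
def pixel_rate(width, height, fps):
    """Pixels processed per second for a given resolution and frame rate."""
    return width * height * fps


# (platform, width, height, fps, Fmax in Hz), values taken from Table 4.
entries = [
    ("Cyclone III", 1024, 800, 76, 62e6),
    ("Spartan-3E", 352, 288, 8, 8e6),
    ("Virtex-II", 320, 480, 27, 41e6),
    ("Virtex-5", 640, 480, 266, 81e6),
]

pixels_per_clock = {
    name: pixel_rate(w, h, fps) / fmax for name, w, h, fps, fmax in entries
}
for name, ratio in pixels_per_clock.items():
    print(f"{name}: {ratio:.2f} pixels per clock cycle")
```

With these numbers the Cyclone III and Virtex-5 designs come out close to one pixel per clock cycle, whereas the two other designs come out near 0.1, so part of the frame-rate gap reflects throughput per cycle rather than clock frequency alone.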

---

**Table 3**

FPGA resources used.

<table>
<thead>
<tr>
<th>Number of interest points</th>
<th>200</th>
<th>400</th>
<th>600</th>
</tr>
</thead>
<tbody>
<tr>
<td>Total Logic Element</td>
<td>31,433</td>
<td>60,034</td>
<td>87,754</td>
</tr>
<tr>
<td>Combinatorial</td>
<td>16,601</td>
<td>29,121</td>
<td>41,636</td>
</tr>
<tr>
<td>Registers</td>
<td>23,014</td>
<td>43,814</td>
<td>64,615</td>
</tr>
</tbody>
</table>

**Table 4**

Comparison with other hardware implementations of the H&S algorithm.

<table>
<thead>
<tr>
<th>System</th>
<th>Platform</th>
<th>Resol. @ Fps</th>
<th>F<sub>max</sub></th>
</tr>
</thead>
<tbody>
<tr>
<td>Our method</td>
<td>Altera Cyclone III</td>
<td>1024 × 800 @ 76</td>
<td>62 MHz</td>
</tr>
<tr>
<td>[19]</td>
<td>Xilinx Spartan-3E</td>
<td>352 × 288 @ 8</td>
<td>8 MHz</td>
</tr>
<tr>
<td>[18]</td>
<td>Xilinx Virtex-II</td>
<td>320 × 480 @ 27</td>
<td>41 MHz</td>
</tr>
<tr>
<td>[20]</td>
<td>Xilinx Virtex-5</td>
<td>640 × 480 @ 266</td>
<td>81 MHz</td>
</tr>
</tbody>
</table>


5.4. Experiments

Fig. 10 shows the results obtained when processing images captured from a mobile robot in experimental conditions. Images on the left are obtained without the feature descriptor, and images on the right are obtained with the feature descriptor. The latter are constructed using the data frame that the Harris corner extractor sends to the host PC. The images used for this experiment were $256 \times 256$ pixels in size.

The Harris corner extractor implemented on the FPGA works on the stream of pixels coming from the CMOS imager. Because of this, the size of the data frame must be smaller than or equal to that of the processed image, so that the data frame can be sent entirely without losing any information. In general, if the processed images have $L \times C$ pixels each, and the patch chosen to describe the interest points has $W \times W$ pixels, then the maximum number of interest points that the system can detect is $n \leq \frac{L \times C - 2}{W^2 + 4}$, where $-2$ is for the first two elements of the data frame (the number of interest points detected, and the size of the patch), and $+4$ is for the two coordinates $X_i, Y_i$ of each interest point. The $+4$ means that the two coordinates are encoded on two bytes each, which allows the system to report the coordinates of interest points detected in images larger than $256 \times 256$ pixels.
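
The bound can be evaluated with a few lines of Python. This is a minimal sketch that reads the $-2$ and $+4$ terms described above as two header elements per frame and four coordinate bytes per point. The function names and the $15 \times 15$ patch used in the example are hypothetical choices for illustration, not values taken from the experiments.

```python
def frame_size(n_points, patch_width):
    """Elements in a data frame: 2 header elements, then for each point
    4 coordinate bytes plus a patch_width x patch_width template."""
    return 2 + n_points * (patch_width ** 2 + 4)


def max_interest_points(rows, cols, patch_width):
    """Largest n such that the data frame still fits in one rows x cols image."""
    return (rows * cols - 2) // (patch_width ** 2 + 4)


# Hypothetical example: 256 x 256 images with a 15 x 15 patch.
n_max = max_interest_points(256, 256, 15)
print(n_max, frame_size(n_max, 15), 256 * 256)
```

For this example the limit is 286 points: the corresponding frame holds 65,496 elements, just under the 65,536 pixels of the image, while one more point would overflow it.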

6. Conclusion and future work

This paper describes the construction of a sensor for real-time visual applications. Its main originality consists in using a CMOS imager and an FPGA architecture to create a versatile smart system. The approach, based on FPGA technology and a CMOS imager, reduces the classic bottleneck between sensor and processing unit.

The system can acquire images of superior quality using the 1.3-megapixel (1280 $\times$ 1024) CMOS image sensor IBIS5. Precise timing control guarantees the accuracy of the image data. ROI readout guarantees the high frame rate of the system (more than 100 fps for $640 \times 480$ pixels). The average transmission speed with USB is 48 MB/s, which meets the demands of real-time data transmission. The system can be used in many applications that require high resolution, a high frame rate and real-time operation.
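
The relation between these figures can be checked with simple arithmetic. The sketch below assumes one byte per pixel, uncompressed frames and 1 MB = 10^6 bytes. None of these assumptions is stated in the text, and the function names are hypothetical.

```python
USB_RATE_BYTES_PER_S = 48e6  # average USB transfer rate quoted in the text (48 MB/s)


def raw_video_rate(width, height, fps, bytes_per_pixel=1):
    """Bytes per second needed to stream uncompressed frames."""
    return width * height * fps * bytes_per_pixel


def max_fps_over_link(width, height, link_rate=USB_RATE_BYTES_PER_S, bytes_per_pixel=1):
    """Highest frame rate the link can sustain for uncompressed frames."""
    return link_rate / (width * height * bytes_per_pixel)


roi_rate = raw_video_rate(640, 480, 100)
print(f"640 x 480 @ 100 fps needs {roi_rate / 1e6:.2f} MB/s")
print(f"1280 x 1024 sustains at most {max_fps_over_link(1280, 1024):.1f} fps")
```

Under these assumptions, a $640 \times 480$ ROI at 100 fps needs about 30.7 MB/s and fits within the USB budget, whereas streaming raw full-resolution frames would be limited to about 36 fps by the link.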

The feature extractor application was successfully implemented on the DreamCam, which can process up to 42 fps ($800 \times 1024$ pixels) and gives good quality results, as seen in Section 5. The blocks of this application were developed in generic mode, which means the user can change the size of the image, the number of points needed, the lowest threshold allowed, and other parameters, and then compile and synthesize the project to obtain a new system.

Two further steps could improve the project. The first is a complete controller of the system from the PC, so that a simple mouse click or keystroke reconfigures the DreamCam: Global or Rolling shutter mode, image size, integration time, algorithm threshold and other parameters could then be chosen or set without recompiling the HDL project. The second is the addition of blocks for feature tracking or matching.

Acknowledgement

The work reported in this paper was supported by the Euripides European Program (Eureka), Seamoves Project, and the Altera Corporation under an equipment grant.

Appendix A. Supplementary data

Supplementary data associated with this article can be found in the online version, at http://dx.doi.org/10.1016/j.sysarc.2014.01.006.


Merwan Birem graduated as an engineer from the National Polytechnic School, Algiers (Algeria), in 2010. He is now pursuing his Ph.D. at the Images Perception systems and Robotics (ISPR) Group of Pascal Institute-CNRS, Clermont-Ferrand (France). His research focuses on the development of neuro-inspired modules that will control autonomous mobile robots.

François Berry received his doctoral degree and the Habilitation to conduct research in Electrical Engineering from the University of Blaise Pascal in 1999 and 2011, respectively. His PhD was on visual servoing and robotics and was undertaken at Pascal Institute in Clermont-Ferrand. Since September 1999, he has been an Associate Professor at the University of Blaise Pascal and a member of the Perception System and Robotics group (within GRAVIR, Pascal Institute-CNRS). He researches smart cameras, active vision, embedded vision systems and hardware/software co-design algorithms. He is in charge of a Masters in Microelectronics and heads the DREAM (Research on Embedded Architecture and Multi-sensor) group. He has authored and coauthored more than 40 papers for journals, conferences and workshops. He has also led several research projects (Robea, ANR, Euripides) and has served as a reviewer and a program committee member. He is a co-founder of the Workshop on Architecture of Smart Camera (WASC) and Scabot (Workshop in conjunction with IEEE IROS).