Jitter-Free Registration for Unmanned Aerial Vehicle Videos



Introduction
Unmanned Aerial Vehicles (UAVs) are becoming increasingly popular for tasks such as video surveillance or remote data acquisition [15]. Tethered drones [10] can now fly for several hours in stationary flight, up to 50 meters above ground. Their video feed looks much like that of a classic surveillance camera. However, their lack of stability makes tasks such as background subtraction or object tracking more complex than with a fixed viewpoint (Fig. 1.a and 1.b). This paper proposes a real-time, online method to convert videos acquired from a stationary drone into jitter-free videos whose viewpoint is kept as constant as possible.
Real-world applications for stationary UAVs, such as traffic monitoring or crowd surveillance, often present a high density of mobile objects, which may provoke drifting and jitter issues over time. Yet, prior work on UAV image stabilization has been evaluated on datasets that include very few mobile objects and rather short sequences [13,2].
We propose to tackle the problem of lengthy sequences with multiple mobile objects. In order to leave room for further analysis processes, our solution needs to be online and computationally low-cost. For this purpose, we propose a generic model which can be applied to 2D rigid motion estimation methods. We show how to combine stabilization and registration techniques, and we apply this approach to a lightweight 2D-rigid registration algorithm.

Fig. 1. Images extracted from the M4 (left column) and the C2 (right column) sequences in our database. (a) First frame, used as a reference image. (b) Current frame, after approx. 15 seconds (M4) and 45 seconds (C2), which we want to register to the first frame of the sequence. (c) Output of StabNet [16]. (d) Output of CNN-Registration [18]. (e) Output of the proposed method.

Prior work
Producing a constant-viewpoint video from a mobile camera is roughly equivalent to determining the camera orientation at every frame. Determining the extrinsic parameters of a monocular camera within a 3D environment is a problem typically studied by Structure from Motion (SfM) [12] or Simultaneous Localization and Mapping (SLAM) approaches, some of which can operate in real time [11]. However, the latter are mostly designed to work on static environments and rely on parallax, i.e., they require enough camera movement to infer the 3D structure of the scene.
Video registration and video stabilization tackle this problem by searching for an image transformation that optimally compensates for camera motion. In the registration case, this transformation is estimated between the current frame and a reference image. In the stabilization case, a trajectory is computed, defined as a combination of consecutive inter-frame motion estimations. This trajectory is then filtered, and the image is reprojected so as to follow the desired, smooth trajectory.
In the UAV context, authors have stated that the direct application of a registration method to a reference frame leads to unsatisfying, jittery results [13,2,1]. Jitter is often linked to an unstable image source (handheld camera, mechanical high-frequency noise, etc.). However, it can also correspond to a high-frequency noise caused by the image motion compensation itself. To our understanding, this happens because most registration methods are based on sparse feature point matching, generally accompanied by an inlier selection technique such as RANSAC. The intermittent presence of points, caused by thresholds in the matching or inlier selection processes, may cause such high-frequency noise.
Classic video stabilization methods such as [8] and [6] have been adapted to the UAV context, sometimes in association with video stitching [7]. Motion estimation is often performed with the very popular Kanade-Lucas Tracker (KLT) [14], but other techniques have been proposed, such as a specific optical flow model that enforces spatial coherence [9]. Most methods are able to handle a mobile camera and thus do not assume the existence of a constant background, which however exists in our context. They eliminate jitter very well, but they tend to drift, i.e., to slowly change viewpoint over time.
More recently, convolutional neural networks have been applied to both registration [18] and stabilization [16]. Both methods use a rich warping model based on Thin Plate Splines (TPS). While such approaches look promising, their direct application to our data proved problematic. We may observe in Figs. 1.c (both columns) and 1.d (right column) that 3 out of 4 frames are misaligned relative to the reference image (Fig. 1.a). The authors of StabNet [16] based their approach on a siamese convolutional network trained on a stabilization database. This database was acquired with a single handheld rig to which 2 different cameras were attached, only one of which was physically stabilized with a gimbal. Such ground truth is not available in our setting. Moreover, this method still does not assume the presence of a constant background to which it should register. CNN-Registration [18] is able to handle large appearance changes and seems suitable for long-term registration with important lighting variations and the presence of multiple mobile objects. However, our experiments have shown that this approach is not invariant to rotation (e.g., it is not able to handle a video rotated beyond some extent, Fig. 1.d) or to very large displacements. In our context, applying it would thus require some prior alignment step, which confirms the need for a simple and robust registration method that takes temporal data into account.

Modelling the problem
The idea behind a video stabilization or registration algorithm is to compensate for undesired camera motion, while preserving the image content variability over time. First, we need to define the degrees of freedom of our problem.
As a first approximation, the effects of camera movements can be modeled and compensated through 2D-rigid warping transforms. On stationary drones, the camera is mostly affected by undesired, relatively low-magnitude yaw, pitch and roll motion (following the Tait-Bryan chained rotations convention). Since the drone is never perfectly stationary, additional undesired 3D translational motion of the camera in space, as well as the 3D geometry of the scene, add to the complexity of our problem.
Eventually, we estimate the camera motion between two images $Im_i$ and $Im_j$ through a 2D linear transformation matrix $\tilde{M}(i,j)$. It is defined by a unique quadruplet of parameters $(t_x, t_y, \alpha, s)$ (resp. translation along the horizontal and vertical axes, rotation of angle $\alpha$, and a positive scale in the 2D plane), which are used to approximate the effects on the image of a physical 3D motion performed by the drone (resp. yaw, pitch, roll and translation along the optical axis). Warping $Im_i$ according to the transformation matrix $\tilde{M}(i,j)$ aims at setting it in the closest possible viewpoint to $Im_j$. Conversely, warping $Im_j$ according to $\tilde{M}(j,i) = \tilde{M}(i,j)^{-1}$ sets it to the closest possible viewpoint to $Im_i$.
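For illustration, here is a minimal sketch of how such a matrix can be built from the quadruplet $(t_x, t_y, \alpha, s)$, assuming the OpenCV types used by the implementation in Section 5 (the helper name is ours):

```cpp
#include <opencv2/core.hpp>
#include <cmath>

// Build the 3x3 homogeneous 2D similarity matrix described above:
// rotation by alpha and uniform scale s, followed by a translation
// (tx, ty). Hypothetical helper, not the authors' code.
cv::Matx33d rigid2D(double tx, double ty, double alpha, double s) {
    const double c = s * std::cos(alpha);
    const double k = s * std::sin(alpha);
    return cv::Matx33d(c, -k, tx,
                       k,  c, ty,
                       0,  0, 1);
}
```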
In any case, we rely on the estimation of the motion between two images $Im_i$ and $Im_j$, for which we propose the following decomposition:

$$\tilde{M}(i,j) = E_{\tilde{M}}(i,j) \cdot M_{cam}(i,j) \qquad (1)$$

where:
- $\tilde{M}(i,j)$ is the estimated camera motion between $Im_i$ and $Im_j$;
- $M_{cam}(i,j)$ corresponds to the motion associated with the actual, physical camera movement, measured as background motion between $Im_i$ and $Im_j$;
- $E_{\tilde{M}}(i,j)$ corresponds to a motion estimation error.

$E_{\tilde{M}}(i,j)$, $\tilde{M}(i,j)$ and $M_{cam}(i,j)$ are all expressed as linear transformation matrices.
Many motion estimation or registration methods are available in the literature, ranging from holistic [5] to sparse [14], with various properties and advantages. Equation (1) can be used to characterize any motion estimation algorithm that outputs a 2D linear transform.
Registering a video is the problem of canceling the term $M_{cam}$ over the course of a video. With a reference frame denoted as 0, the applied warping can be expressed as

$$W(i) = \tilde{M}(0,i)^{-1} \qquad (2)$$

Stabilizing a video is the problem of smoothing $M_{cam}$ over the course of a video. This is performed by constructing a trajectory, which is defined as

$$T_{\tilde{M}}(i) = \tilde{M}(i-1,i) \cdot \tilde{M}(i-2,i-1) \cdots \tilde{M}(0,1) \qquad (3)$$

which we denote as:

$$T_{\tilde{M}}(i) = \overleftarrow{\prod}_{k=1}^{i} \tilde{M}(k-1,k) \qquad (4)$$

The left arrow sign ($\overleftarrow{\prod}$) means that we perform a left-hand product: each new estimate is multiplied on the left of the running product. Then, we filter $T_{\tilde{M}}$ over time: the warping applied to the original images can be seen as the difference between the filtered trajectory and the original trajectory,

$$W(i) = F_{T_{\tilde{M}}}(i) \cdot T_{\tilde{M}}(i)^{-1} \qquad (5)$$
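In code, the left-hand product simply means multiplying each new inter-frame estimate on the left of the running trajectory. A minimal sketch, under the same OpenCV assumption (the struct is illustrative, not the authors' code):

```cpp
#include <opencv2/core.hpp>

// Running trajectory T(i) = M(i-1,i) * M(i-2,i-1) * ... * M(0,1),
// updated online with a left-hand product at every new frame.
struct Trajectory {
    cv::Matx33d T = cv::Matx33d::eye();  // T(0) is the identity

    void update(const cv::Matx33d& M_interframe) {
        T = M_interframe * T;  // left-hand product (the arrow in Eq. (4))
    }
};
```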
where $F_{T_{\tilde{M}}}(i)$ is the output at frame i of a smoothing filter applied on the set of trajectories $T_{\tilde{M}}$. It is also expressed as a 2D linear transform matrix. Any output of F should be a plausible approximation given the physical constraints of the problem. In practice, one can filter $t_x$, $t_y$, $\alpha$ and $s$ independently. For real-time, online applications, a Kalman Filter [17] can be used.
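As an illustration of filtering the four parameters independently, the following sketch uses a scalar random-walk Kalman filter, one instance per parameter; the noise variances q and r are illustrative placeholders, not the paper's tuned values:

```cpp
// Minimal scalar Kalman filter (random-walk state model), one instance
// per trajectory parameter (tx, ty, alpha, s).
struct ScalarKalman {
    double x = 0.0;   // filtered estimate
    double p = 1.0;   // estimate variance
    double q = 1e-4;  // process noise variance (illustrative)
    double r = 1e-1;  // measurement noise variance (illustrative)

    double filter(double z) {         // z: raw trajectory parameter
        p += q;                       // predict
        const double k = p / (p + r); // Kalman gain
        x += k * (z - x);             // correct
        p *= (1.0 - k);
        return x;                     // smoothed parameter
    }
};
```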
Given Eq. (1), we can reformulate Eq. (3) as follows:

$$T_{\tilde{M}}(i) = \overleftarrow{\prod}_{k=1}^{i} E_{\tilde{M}}(k-1,k) \cdot M_{cam}(k-1,k) \qquad (6)$$

By definition,

$$M_{cam}(0,i) = \overleftarrow{\prod}_{k=1}^{i} M_{cam}(k-1,k) \qquad (7)$$

In the general case, we cannot develop Eq. (6) any further. However, we can introduce an equivalent error term $E^{equiv}_{\tilde{M}}(0,i)$ such that Eq. (3) becomes:

$$T_{\tilde{M}}(i) = E^{equiv}_{\tilde{M}}(0,i) \cdot M_{cam}(0,i) \qquad (8)$$

The more dissimilar $Im_i$ and $Im_j$, the more significant $E_{\tilde{M}}(i,j)$ is likely to be. Consecutive images being rather similar, they usually yield an $E_{\tilde{M}}$ term of low magnitude. However, in such cases, foreground objects often perform little movement from $Im_i$ to $Im_j$. When i and j are close in time, a part of the term $E_{\tilde{M}}(i,j)$ may correspond to light foreground motion that was wrongly considered as background motion by the motion estimator. Such errors accumulate in Eq. (6) to form a drifting trajectory (Eq. (8)): for instance, a systematic error of only 0.1 pixel per frame builds up to 150 pixels of drift after one minute at 25 fps. This drifting error term explains why it is not recommended to simply use $T_{\tilde{M}}^{-1}$ as a registration solution. Most of the literature on stabilization and registration focuses on minimizing the term $E_{\tilde{M}}$ within the motion estimation step. This minimization is essential towards achieving good performance, but the existence of such error is unavoidable. However, its nature tends to vary from jitter in the registration case to drifting in the stabilization case. We show how to take advantage of both jittery and drifting behaviors to propose an efficient and low-cost solution towards jitter-free constant viewpoint generation.

Proposed method
In this section, we show how to efficiently combine registration and stabilization approaches into a single hybrid method (Fig. 2). From now on, we will denote by $\tilde{M}_s$ (resp. $\tilde{M}_r$) the specific motion estimator for the stabilization (resp. registration) part of the proposed method. The idea is to calculate the product between the trajectory $T_{\tilde{M}_s}$ of a stabilization method and the correction $\tilde{M}_r(0,i)^{-1}$ applied in a registration method:

$$D(i) = T_{\tilde{M}_s}(i) \cdot \tilde{M}_r(0,i)^{-1} \qquad (9)$$
Following the model proposed in Eq. (1), as reported in Eqs. (2) and (8), we can reformulate Eq. (9) as:

$$D(i) = E^{equiv}_{\tilde{M}_s}(0,i) \cdot E_{\tilde{M}_r}(0,i)^{-1} \qquad (10)$$

As suggested previously, this matrix is essentially the product of a smooth, low-frequency drifting error ($E^{equiv}_{\tilde{M}_s}(0,i)$) and a jittery, high-frequency error ($E_{\tilde{M}_r}(0,i)^{-1}$). Filtering D allows us to isolate the drifting component:

$$F_D(i) \approx E^{equiv}_{\tilde{M}_s}(0,i) \qquad (11)$$
Combining the output of this filter with $T_{\tilde{M}_s}^{-1}$ finally allows us to obtain a jitter-free video registration on long sequences, without the need for particularly elaborate motion estimation techniques. The applied correction is the following:

$$W(i) = T_{\tilde{M}_s}(i)^{-1} \cdot F_D(i) \approx M_{cam}(0,i)^{-1} \qquad (12)$$
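Putting the pieces together, here is a sketch of the per-frame combination; it assumes the rigid2D() helper and four ScalarKalman instances (kTx, kTy, kAlpha, kScale; the names are ours) from the previous sketches:

```cpp
#include <opencv2/core.hpp>
#include <cmath>

// Per-frame combination of the stabilization trajectory T_s(i) and the
// registration estimate M_r(0,i), following Eqs. (9)-(12).
cv::Matx33d hybridCorrection(const cv::Matx33d& T_s,   // T_Ms(i)
                             const cv::Matx33d& M_r) { // M_r(0,i)
    cv::Matx33d D = T_s * M_r.inv();                   // Eq. (9)

    // Decompose D into (tx, ty, alpha, s) and smooth each parameter
    // independently to isolate the drifting component (Eq. (11)).
    double tx = D(0, 2), ty = D(1, 2);
    double alpha = std::atan2(D(1, 0), D(0, 0));
    double s = std::hypot(D(1, 0), D(0, 0));
    cv::Matx33d FD = rigid2D(kTx.filter(tx), kTy.filter(ty),
                             kAlpha.filter(alpha), kScale.filter(s));

    return T_s.inv() * FD;                             // Eq. (12)
}
```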

Implementation
This approach was implemented in C++, using computationally lightweight algorithms provided by the OpenCV library [4]. Both $\tilde{M}_s$ and $\tilde{M}_r$ were estimated on a sparse image representation basis using the KLT approach.
The image is first resized to 576x324 (30% of 1080p resolution) and converted to one-channel grayscale. The first frame was adopted as the reference frame for the tested videos. $\tilde{M}_r(0,i)$ (resp. $\tilde{M}_s(i-1,i)$) was estimated by extracting 200 Shi-Tomasi corners [14] on $Im_0$ (resp. $Im_{i-1}$), further tracked on $Im_i$ using the Lucas-Kanade Pyramidal Optical Flow (LKPOF) algorithm [3]. A Least Squares Regression (LSR) was used to solve for the motion estimation matrix. We used a Kalman Filter [17] on the four motion estimation parameters $(t_x, t_y, \alpha, s)$ independently for the implementation of F in Eq. (12). The LKPOF algorithm being sensitive to its initialization, we used the KF prediction as the initialization for all of the tracked point locations on the registration part.
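For reference, here is a condensed sketch of one estimation of $\tilde{M}_r(0,i)$ with the OpenCV calls named above; cv::estimateAffinePartial2D stands in for our explicit LSR solver, the KF-based initialization is omitted for brevity, and all thresholds are illustrative:

```cpp
#include <opencv2/imgproc.hpp>
#include <opencv2/video/tracking.hpp>
#include <opencv2/calib3d.hpp>
#include <vector>

// One estimation of M_r(0,i): track Shi-Tomasi corners from the reference
// frame into the current frame, then fit a 4-DOF (tx, ty, alpha, s)
// transform to the surviving correspondences.
cv::Mat estimateRegistration(const cv::Mat& ref, const cv::Mat& cur) {
    std::vector<cv::Point2f> p0, p1;
    cv::goodFeaturesToTrack(ref, p0, 200, 0.01, 8);   // 200 corners [14]

    std::vector<uchar> status;
    std::vector<float> err;
    cv::calcOpticalFlowPyrLK(ref, cur, p0, p1, status, err);  // LKPOF [3]

    std::vector<cv::Point2f> src, dst;                // keep tracked points
    for (size_t k = 0; k < p0.size(); ++k)
        if (status[k]) { src.push_back(p0[k]); dst.push_back(p1[k]); }

    return cv::estimateAffinePartial2D(src, dst);     // 2x3 similarity
}
```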
The experiments were carried out on a 2.5 GHz Intel Core i7 MacBook Pro with 16 GB of DDR3 memory under macOS High Sierra. Under these settings, each frame is processed in less than 16 milliseconds using CPU operations only, enabling real-time applications and leaving room for further processing.

Evaluation protocol
To show the benefits of the proposed combination, we have compared it with different combinations of its elementary components. The following settings were tested:
- Raw: the original, unprocessed video.
- StabilizationKalman: the video stabilized by the algorithm described in Eq. (5), using the same computation of $\tilde{M}_s$ and the same filter as described in Section 5.2.
- RegistrationLastPos: the video registered by the algorithm described as $\tilde{M}_r$ in Section 5.2, with the registration proposed at frame i-1 as an initialization for the registration of frame i.
- RegistrationKalman: the video registered by the algorithm described as $\tilde{M}_r$ in Section 5.2, with a KF set as described in Section 5.2 for both the initialization and the filtering.
- Ours: the proposed method.
Evaluating stabilization and registration algorithms in our context is a delicate task, since our dataset provides no ground truth about the actual camera movements or the image content.
To quantify the registration performance of our approach, we propose to track a set of feature points from the reference frame to the current frame, using the same tracker settings as in Section 5.2 for $\tilde{M}_r(0,i)$. The median displacement of all tracked reference points was used as a measure of registration quality. This measure, denoted frame displacement (fd), was computed on each frame independently.
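A possible implementation of the fd measure, assuming points with a failed tracking status have already been discarded:

```cpp
#include <opencv2/core.hpp>
#include <algorithm>
#include <cmath>
#include <vector>

// Frame displacement (fd): median distance between reference points and
// their tracked positions in the current (registered) frame.
double frameDisplacement(const std::vector<cv::Point2f>& refPts,
                         const std::vector<cv::Point2f>& curPts) {
    std::vector<double> d(refPts.size());
    for (size_t k = 0; k < refPts.size(); ++k)
        d[k] = std::hypot(curPts[k].x - refPts[k].x,
                          curPts[k].y - refPts[k].y);
    std::nth_element(d.begin(), d.begin() + d.size() / 2, d.end());
    return d[d.size() / 2];  // median displacement, in pixels
}
```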
To quantify the stabilization performance of our approach, we propose to calculate the mean absolute difference of pixel grayscale values between two consecutive frames, over the length of a video sequence. To avoid parts of the image where no data is available, this measurement was performed on the overlapping regions between consecutive images (mpvd).
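A possible implementation of the mpvd measure; the masks marking valid (non-empty) pixels after warping are assumptions of this sketch:

```cpp
#include <opencv2/core.hpp>

// Mean pixel value difference (mpvd): mean absolute grayscale difference
// between consecutive frames, restricted to the region where both warped
// images carry data.
double mpvd(const cv::Mat& prev, const cv::Mat& cur,
            const cv::Mat& prevMask, const cv::Mat& curMask) {
    cv::Mat overlap, diff;
    cv::bitwise_and(prevMask, curMask, overlap);  // common support
    cv::absdiff(prev, cur, diff);
    return cv::mean(diff, overlap)[0];            // mean over the overlap
}
```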

Results
The first property that we wanted to quantify is whether our solution is capable of staying registered to a constant viewpoint. We computed the proposed mean fd over the whole course of the tested sequences (Table 1). Results emphasize the idea that a stabilization technique, such as StabilizationKalman, is not designed to guarantee a constant viewpoint over the course of a video. On videos 'C0', 'C1' and 'M2', which are subject to little camera motion, all tested registration methods (RegistrationLastPos, RegistrationKalman) and the proposed method perform very similarly. RegistrationKalman suffers from inertia, which degrades its performance, eventually leading to being badly registered during several hundreds of frames on 'C2'. On all tested videos, the difference between the proposed method and the best-performing registration method is well within subpixel range. This suggests that the proposed algorithm effectively preserves the registration performance of its base component.

The second evaluation focused on assessing the stability properties of the different methods based on the mpvd values (Table 2). The assumption here is that on stable sequences, only mobile objects should cause pixel values to change significantly from one frame to the next. On the other hand, jitter would cause pixel values to change suddenly over significant parts of the image, including the background. Stability is thus assessed by the lowest possible mpvd value. This is verified in our experiments. In general, the poorest performance is observed on the original video, which is unstable. RegistrationLastPos, where jitter occurs despite the image being overall well registered, displays high values. Filtering the output of the registration (RegistrationKalman) significantly improves the results, which shows that this solution was able to tackle most of the jitter issues. On all of the sequences, the best performances are observed with StabilizationKalman and the proposed combination. The proposed method displays the highest performance thanks to its ability to stay consistently registered to the same viewpoint. This quantitative outcome confirms the robustness of the proposed method and the qualitative impression given by visual inspection of the videos. Our proposed approach can be effectively labeled as a jitter-free registration method.

Conclusion and perspectives
In this paper, we have addressed the problem of generating a constant viewpoint from videos acquired by stationary UAVs. The camera being subjected to small movements, the view is unstable, which is a problem for applying automatic processing techniques or long-term analyses such as trajectory registration. In this context, we have proposed a generic model to describe the inherent error of motion estimation algorithms. We have used it as the foundation for combining registration and stabilization techniques into one single hybrid method. The method is real-time and online. It prevents both jittery and drifting behaviors, even in the presence of multiple mobile objects. Results show that it retains the best properties of the tested stabilization and registration techniques.
Further work will focus on two main aspects. The first is to investigate how to handle situations where linear 2D-rigid warping is inappropriate, for instance when significant parallax is observed. The second is how to update the reference image during the course of a day. This should enable us to better cope with appearance changes in the background, such as lighting conditions, which is a common problem in video surveillance applications.