Delayed Monocular SLAM Approach Applied to Unmanned Aerial Vehicles

In recent years, many researchers have addressed the issue of making Unmanned Aerial Vehicles (UAVs) more and more autonomous. In this context, the state estimation of the vehicle position is a fundamental necessity for any application involving autonomy. However, the problem of position estimation cannot be fully solved in some scenarios, even when a GPS signal is available, for instance, in an application that requires performing precision manoeuvres in a complex environment. Therefore, additional sensory information should be integrated into the system in order to improve accuracy and robustness. In this work, a novel vision-based simultaneous localization and mapping (SLAM) method with application to unmanned aerial vehicles is proposed. One of the contributions of this work is the design and development of a novel technique for estimating feature depth, which is based on a stochastic technique of triangulation. In the proposed method the camera is mounted over a servo-controlled gimbal that counteracts the changes in attitude of the quadcopter. Under this assumption, the overall problem is simplified and the focus is placed on the position estimation of the aerial vehicle. Also, the tracking process of visual features is made easier by the stabilized video. Another contribution of this work is to demonstrate that integrating very noisy GPS measurements into the system for a short initial period of time is enough to initialize the metric scale. The performance of the proposed method is validated by means of experiments with real data carried out in unstructured outdoor environments. A comparative study shows that, when compared with related methods, the proposed approach performs better in terms of accuracy and computational time.


Introduction
There are still important problems to be solved in autonomous robotics, and simultaneous localization and mapping (SLAM) is one of them. This paper tackles this problem and contributes to giving even more autonomy to mobile robots. The term SLAM refers to the process of building a map of an unknown space and, simultaneously, using this map to navigate through that space while tracking the vehicle position. While keyframe-based SLAM methods are usually shown to give accurate results when enough computational power is available, filtering-based SLAM methods might be beneficial if limited processing power is available [15].
Objectives and contributions: In this work, the authors propose a new filter-based monocular SLAM scheme. The method presented in this research has been designed to take advantage of the hardware resources commonly available in this kind of platform. The performance of the method is validated by means of experiments with real data carried out in unstructured outdoor environments. An extensive comparative study is presented to contrast the operation and effectiveness of this proposal with respect to other relevant methods. One of the contributions of this work is a novel technique for estimating feature depth. The proposed approach is based on a stochastic technique of triangulation. While this technique is inspired by the authors' previous work [34], crucial and contributive modifications have been introduced in order to accommodate it to the particularities of the current application: • In this work, the camera is mounted over a servo-controlled gimbal that counteracts the changes in attitude of the quadcopter. Under this assumption, the overall problem is simplified and the focus is placed on the position estimation of the MAV. Also, the tracking process of visual features is made easier by the stabilized video.
• Instead of using an external pattern of known dimensions, in this work the GPS signal is used during a short initial period of time for recovering the metric scale of the estimates.
• Features are directly parametrized in their Euclidean form, instead of the inverse depth parametrization. The consequence is a reduction of the computational cost of the filter, due to the reduced dimension of the system state.
• A novel technique for the tracking process of candidate points is proposed. In this case the search of visual features is limited to regions of the image circumscribed by ellipses derived from epipolar constraints. The consequence is an improvement in the execution time.
Compared with other methods presented in the literature, one of the contributions of this work is to demonstrate that the integration of very noisy GPS measurements into the system for a short initial period is enough to initialize the metric scale. For example, in [35] the monocular scale factor is retrieved from a feature pattern with known dimensions. In [29] and [36], the map is initially set by hand, by aligning the first estimates with the ground truth in order to get the scale of the environment. Additionally, the proposed approach is simpler when compared with similar approaches, because the estimation of the camera orientation is avoided by using the servo-controlled gimbal. In [26], feature depth is computed by direct triangulation between visual correspondences using SIFT descriptors. In this work, a novel technique, based on patch correlation, is used for the tracking process of candidate points. It is well known that local descriptors like SIFT or SURF are more robust than patch-correlation techniques for matching visual features. Nevertheless, the stabilized video and the stochastic nature of the whole initialization method make the technique proposed in this work reliable for tracking candidate points, with the implicit gain in terms of computational cost.
Perhaps the most widespread technique for initializing map features in EKF-SLAM is the family of UID-based methods (e.g. [22,25]). Nevertheless, the comparative study presented in this work shows that the proposed method can surpass the UID method in terms of accuracy and computational time, at least for the kind of application studied.
Paper outline: Section 2 states the problem description and assumptions. Section 3 describes the proposed method in a detailed manner. In Section 4 experimental results are shown together with a comparative study and the discussion about those results and, finally, Section 5 presents the conclusions of this work.

Assumptions
The platform considered in this work is a quadrotor with free movement in $\mathbb{R}^3 \times SO(3)$, shown in Fig 1. However, it is important to highlight that the proposed monocular SLAM method could be applied to other kinds of platforms. The proposed method is mainly intended for local autonomous vehicle navigation. In this case, the local tangent frame is used as the navigation reference frame. Thus, the initial position of the vehicle defines the origin of the navigation coordinate frame. The navigation system follows the NED (North, East, Down) convention. The magnitudes expressed in the navigation and camera frames are denoted respectively by the superscripts $N$ and $C$. All coordinate systems are right-handed. It is also assumed that the location of the origin of the camera frame with respect to the other elements of the quadcopter (e.g. the GPS antenna) is known and fixed. In this case, the position of the vehicle can be computed from the estimated location of the camera.
In aerial vehicle applications, attitude estimation is well handled by available systems (e.g. [37] and [38]); therefore, this work focuses on position estimation. Also, it is assumed that the monocular camera is mounted over a servo-controlled gimbal (see Fig 1). This kind of accessory, used mainly for stabilizing video capture, has become very common in aerial applications. In our case, the gimbal is configured to counteract the changes in attitude of the quadcopter, thereby stabilizing the orientation of the camera towards the ground (the down axis in NED coordinates). This has two important consequences: i) the tracking process of visual features is made easier by the stabilized video; ii) the orthogonal matrix $R^{CN}$, defining the rotation from the camera frame to the navigation frame, is assumed to be known.
A standard monocular camera is considered. In this case, a central-projection camera model is assumed. The image plane is located in front of the camera's origin, where a non-inverted image is formed. The camera frame $C$ is right-handed with the z-axis pointing towards the field of view. Let $p^C = [x^C, y^C, z^C]^T$ be a 3D point $p^N = (x, y, z)^T$ expressed in the camera frame $C$ by $p^C = R^{NC} p^N$, where $R^{NC}$ is the rotation matrix that transforms from the navigation frame $N$ to the camera frame $C$. It is fulfilled that $R^{NC} = (R^{CN})^T$, and $R^{CN}$ is known by the use of the gimbal. The $\mathbb{R}^3 \Rightarrow \mathbb{R}^2$ projection of the 3D point to the image point $p = (u, v)$, with $u$ and $v$ expressed in pixel units, is defined by:

$$u = u_0 + f\,\frac{x^C}{z^C} \qquad (1)$$

$$v = v_0 + f\,\frac{y^C}{z^C} \qquad (2)$$

Inversely, a directional vector $h^C = [h^C_x, h^C_y, h^C_z]^T$ can be computed from the image point coordinates $u$ and $v$ as:

$$h^C = \left[\frac{u - u_0}{f},\; \frac{v - v_0}{f},\; 1\right]^T \qquad (3)$$

Vector $h^C$ points from the camera optical center to the 3D point location and can be expressed in the navigation frame by $h^N = R^{CN} h^C$. Note that for the $\mathbb{R}^2 \Rightarrow \mathbb{R}^3$ mapping case, defined in Eq 3, depth information is lost. The distortion caused by the camera lens is considered through the model described in [39]. Using this model (and its inverse form), undistorted pixel coordinates $(u, v)$ can be obtained from $(u_d, v_d)$ and conversely. It is assumed that the intrinsic parameters of the camera are already known: focal length $f$, principal point $(u_0, v_0)$, and radial lens distortion coefficients $k_1, \ldots, k_n$.
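The projection and back-projection described above can be sketched as follows (a minimal NumPy sketch under the stated pinhole assumptions; the function names and the use of a single focal length f in pixel units are illustrative choices, not the authors' implementation):

```python
import numpy as np

def project(p_n, R_nc, t_n, f, u0, v0):
    """Central projection of a 3D point p^N, given in the navigation
    frame, to pixel coordinates (u, v). R_nc rotates the navigation
    frame to the camera frame; t_n is the camera optical center."""
    p_c = R_nc @ (p_n - t_n)          # point expressed in the camera frame
    u = u0 + f * p_c[0] / p_c[2]      # z-axis points toward the field of view
    v = v0 + f * p_c[1] / p_c[2]
    return np.array([u, v])

def back_project(u, v, R_cn, f, u0, v0):
    """R^2 -> R^3 mapping: unit directional vector h^N pointing from the
    optical center toward the 3D point. Depth information is lost."""
    h_c = np.array([(u - u0) / f, (v - v0) / f, 1.0])
    h_n = R_cn @ h_c                  # rotate camera frame -> navigation frame
    return h_n / np.linalg.norm(h_n)
```

For instance, a point lying on the optical axis projects exactly to the principal point (u0, v0), and back-projecting the principal point returns the optical-axis direction.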

Problem description
The main goal of the proposed method is to estimate the following system state $x$:

$$x = [x_v, y_1, y_2, \ldots, y_n]^T \qquad (4)$$

where $x_v$ represents the state of the camera-quadcopter, and $y_i$ represents the location of the $i$-th feature point in the environment. In turn, $x_v$ is composed of:

$$x_v = [r^N, v^N]^T \qquad (5)$$

Let $r^N = [p_x, p_y, p_z]$ represent the position of the vehicle (camera) expressed in the navigation frame. Let $v^N = [v_x, v_y, v_z]$ denote the linear velocity of the robot expressed in the navigation frame. The location of a feature $y_i$ is parametrized in its Euclidean form:

$$y_i = [x_i, y_i, z_i]^T \qquad (6)$$

Prediction
The work presented in this paper is motivated by the application of monocular SLAM to small aerial vehicles. In this case, and due to the limited resources commonly available in this kind of application, filtering-based SLAM methods still seem more appropriate than keyframe methods. Moreover, filtering-based methods are better suited for incorporating, in a simple manner, additional sensors into the system. In this sense, most robotic applications make use of multiple sensor inputs.
The architecture of the system is defined by the typical loop of prediction-update steps of the EKF in direct configuration, where the EKF propagates the vehicle state as well as the feature estimates. In this case, the camera-vehicle state $x_v$ takes a step forward through the following simple model:

$$\begin{cases} r^N_{k+1} = r^N_k + v^N_k \Delta t \\ v^N_{k+1} = v^N_k + V^N \end{cases} \qquad (7)$$

At every step, it is assumed that there is an unknown linear acceleration, modelled as a zero-mean Gaussian process with known covariance $\sigma_a$, producing an impulse of linear velocity $V^N$. It is assumed that the map features $y_i$ remain static (rigid scene assumption), so $y_{i,k+1} = y_{i,k}$. The state covariance matrix $P$ takes a step forward by:

$$P_{k+1} = \nabla F_x P_k \nabla F_x^T + Q$$

where $Q = \nabla F_u U \nabla F_u^T$, and the Jacobians $\nabla F_x$ and $\nabla F_u$ are built respectively from $\frac{\partial f_v}{\partial x_v}$, the derivatives of the nonlinear prediction model (Eq 7) with respect to the robot state $x_v$, and $\frac{\partial f_v}{\partial u}$, the derivatives of the nonlinear prediction model with respect to the system input $u$. Uncertainties are incorporated into the system by means of the process noise covariance matrix $U = \sigma_a^2 I_{3\times3}$, through the parameter $\sigma_a^2$.
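The prediction step above can be sketched as follows (an illustrative NumPy sketch assuming the state layout [r^N, v^N, y_1, ..., y_n]; the exact way the acceleration impulse enters the position block of the input Jacobian is an assumption of this sketch, not the authors' code):

```python
import numpy as np

def ekf_predict(x, P, dt, sigma_a, n_feats):
    """One EKF prediction step for the constant-velocity model of Eq 7.
    Assumed state layout: x = [r^N (3), v^N (3), y_1 (3), ..., y_n (3)].
    Features are static, so only the 6 camera-vehicle entries evolve."""
    dim = 6 + 3 * n_feats
    Fx = np.eye(dim)
    Fx[0:3, 3:6] = dt * np.eye(3)      # dr/dv = dt * I; feature blocks stay I
    # An unknown acceleration impulse perturbs velocity (and, through it,
    # position); this input Jacobian shape is an assumption of the sketch:
    Fu = np.zeros((dim, 3))
    Fu[0:3, :] = 0.5 * dt**2 * np.eye(3)
    Fu[3:6, :] = dt * np.eye(3)
    U = sigma_a**2 * np.eye(3)         # process noise covariance U
    x_new = Fx @ x                     # r <- r + v*dt, features unchanged
    P_new = Fx @ P @ Fx.T + Fu @ U @ Fu.T
    return x_new, P_new
```

Note that because the feature blocks of the Jacobian are identity and receive no process noise, the rigid-scene assumption is preserved: feature estimates are only corrected at the update step.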

Detection of candidate points
The proposed method states that a minimum number of features $y_i$ must be predicted to appear in the image; otherwise, new features should be added to the map. In the latter case, new points are detected in the image through a random search. For this purpose, the Shi-Tomasi corner detector [40] is applied, but other detectors could also be used. These points in the image, which are not yet added to the map, are called candidate points (see Fig 2). Only image areas free of both candidate points and mapped features are considered for detecting new points with the saliency operator. At the $k$-th frame, when a visual feature is detected for the first time, the following entry $c_l$ is stored in a table, containing $y^c_i = [(t^N_{c_0})^T, \theta_0, \phi_0]^T$, which models a 3D semi-line defined on one side by the vertex $(t^N_{c_0})^T$, corresponding to the current optical center coordinates of the camera expressed in the navigation frame, and pointing to infinity on the other side with azimuth and elevation $\theta_0$ and $\phi_0$ respectively, both angles being computed from the directional vector $h^N = [h^N_x, h^N_y, h^N_z]^T$, which is in turn computed as indicated in Section 2. Let $P_{y_i}$ be a 5 × 5 covariance matrix which models the uncertainty of $y^c_i$. Therefore $P_{y_i} = JPJ^T$, where $P$ is the system covariance matrix and $J$ is the Jacobian matrix formed by the partial derivatives of the function defining $y^c_i$. Let $[u, v]$ be the location in the image of the candidate point. Also, a $p \times p$-pixel window, centered in $[u, v]$, is extracted and linked to the corresponding candidate point.
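The azimuth and elevation stored in $c_l$ can be computed from the directional vector $h^N$ as in the following sketch (a hypothetical helper chosen to be consistent with the directional vector $m(\theta, \phi) = (\cos\theta\sin\phi, \sin\theta\sin\phi, \cos\phi)^T$ used later for feature initialization; not the authors' code):

```python
import numpy as np

def azimuth_elevation(h_n):
    """Azimuth theta and elevation phi of a directional vector h^N,
    chosen so that m(theta, phi) = (cos t sin p, sin t sin p, cos p)^T
    reproduces the normalized direction."""
    h = h_n / np.linalg.norm(h_n)
    theta = np.arctan2(h[1], h[0])   # azimuth in the horizontal plane
    phi = np.arccos(h[2])            # elevation measured from the z-axis
    return theta, phi
```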

Tracking of candidate points
At every subsequent frame k + 1, k + 2. . .k + n, the location of candidate points is tracked. In this case, a candidate point is predicted to appear inside an elliptical region S centered in the point [u, v], taken from c l , see Fig 3. In order to optimize the speed of the search, the major axis of the ellipse is aligned with the epipolar line defined by image points e 1 and e 2 .
The epipole $e_1$ is computed by projecting $t^N_{c_0}$, which is stored in $c_l$, to the current image plane by Eqs 1 and 2. The point $e_2$ is computed by projecting the 3D point $p^N$ defined by the data stored in $c_l$, through Eqs 1 and 2 also, but assuming a depth equal to one ($d = 1$). In this case, $p^N$ models a 3D point located at:

$$p^N = t^N_{c_0} + d\, m(\theta_0, \phi_0)$$

where $m(\theta_i, \phi_i)$ is a directional unit vector defined by:

$$m(\theta, \phi) = (\cos\theta \sin\phi,\; \sin\theta \sin\phi,\; \cos\phi)^T$$

The orientation of the ellipse $S_c$ is determined by $\alpha_c = \mathrm{atan2}(e_y, e_x)$, where $e = e_2 - e_1$ and $e_y$, $e_x$ represent the y and x coordinates of $e$ respectively. The size of the ellipse $S_c$ is determined by its major and minor axes, respectively $a$ and $b$. In this case, $a$ and $b$ are free parameters constrained to $b < a$.
The ellipse $S_c$ is represented in its matrix form by:

$$S_c = R(\alpha_c) \begin{bmatrix} a^2 & 0 \\ 0 & b^2 \end{bmatrix} R(\alpha_c)^T$$

where $R(\alpha_c)$ is the 2D rotation matrix of angle $\alpha_c$. The ellipse $S_c$ represents a probability region where the candidate point must lie in the current frame. In this case, patch cross-correlation is applied over all the image locations $[u_S, v_S]$ within the search region. If the score of a location $[u_S, v_S]$, determined by the best cross-correlation between the candidate patch and the $n$ patches defined by the region of search, is higher than a threshold, then this pixel location $[u_S, v_S]$ is considered as the current candidate point location, and $c_l$ is updated with $[u, v] = [u_S, v_S]$. At this stage, there is not yet reliable information about the depth of candidate points. For this reason, it is difficult to determine an optimal size for the search ellipse. In this case, the parameter $a$ is chosen empirically as a function of the particularities of the application, such as the velocity of the vehicle and the frame rate of the video. In this work, good results were found with a value of $a = 20$ pixels.
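The construction of the search ellipse and the membership test for candidate pixel locations can be sketched as follows (an illustrative NumPy sketch; the helper names are hypothetical and the ellipse matrix form follows the reconstruction above):

```python
import numpy as np

def search_ellipse(e1, e2, a, b):
    """Matrix S_c of the elliptical search region: major axis of length a
    aligned with the epipolar direction e = e2 - e1, minor axis b."""
    e = e2 - e1
    alpha = np.arctan2(e[1], e[0])          # orientation alpha_c
    c, s = np.cos(alpha), np.sin(alpha)
    R = np.array([[c, -s], [s, c]])         # 2D rotation by alpha_c
    return R @ np.diag([a**2, b**2]) @ R.T

def inside(S, center, p):
    """True when pixel p lies within the probability region of the
    ellipse S centered at the predicted candidate location."""
    d = np.asarray(p, float) - np.asarray(center, float)
    return d @ np.linalg.inv(S) @ d <= 1.0
```

Only pixels passing this test are evaluated with patch cross-correlation, which is what yields the reported gain in execution time with respect to a full circular search window.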
Nevertheless, the effects obtained by varying the ratio b/a, which determines the proportion of the ellipse, can be investigated. In this case, some experimental results were obtained using the same methodology described in Section 4. The results can be summarized as follows (see Fig 4): • As the ellipse tends to be a circle, the time required to track a candidate point increases considerably (left plot).
• When the ellipse tends to be a circle, the number of candidate points being tracked is lower (middle plot). This is because when the ellipse is too thin, some candidate points are lost and new ones must be detected.
• When the parameter b is chosen in order to define a very thin ellipse, the total time required for the whole tracking process of candidate points is much lower (right plot).
Based on the above results, it is recommended to set the parameter b to about one tenth of a.
It is important to note that no extra effort is invested in obtaining a more robust descriptor. There are two main reasons supporting this approach: i) the method proposed for tracking the candidate points is applied only during a short initial period when a new visual feature is detected. During this initial period, prior to the initialization of the candidate point as a new map feature, some information about the feature depth is gathered. ii) Unlike the general monocular SLAM problem, the stabilized video also makes the tracking process of candidate points easier.

Feature initialization
Depth information cannot be obtained in a single measurement when bearing sensors (e.g. a projective camera) are used. To infer the depth of a feature, the sensor must observe this feature repeatedly as it moves freely through its environment, estimating at each position the angle between the feature and the sensor center. The difference between those angle measurements is the parallax angle. In fact, parallax is the key that allows estimating feature depth. In the case of indoor sequences, a displacement of centimeters could be enough to produce parallax; on the other hand, the more distant the feature, the more the sensor has to travel to produce parallax (see Fig 5).
Every time a new image location $z_{uv} = [u, v]$ is obtained for a candidate point $c_l$, a hypothesis of depth $d_i$ is computed by:

$$d_i = \frac{\|e_l\| \sin\gamma}{\sin\alpha_i}$$

Let $\alpha_i = \pi - (\beta + \gamma)$ be the parallax. Let $e_l = t^N_c - t^N_{c_0}$ indicate the displacement of the camera from its first observation to its current position. Let $\beta$ be the angle defined by $h_1$ and $e_l$, where $h_1$ is the normalized directional vector $m(\theta, \phi) = (\cos\theta \sin\phi, \sin\theta \sin\phi, \cos\phi)^T$ computed taking $\theta_0$, $\phi_0$ from $c_l$. Let $\gamma$ be the angle defined by $h_2$ and $-e_l$, where $h_2 = h^N$ is the directional vector pointing from the current camera optical center to the feature location, computed as indicated in Section 2 from the current measurement $z_{uv}$. At each step, the hypothesis of depth $d_i$ is low-pass filtered, because the depth estimated by triangulation varies considerably, especially at low parallax. In the authors' previous work [34] it is demonstrated that only a few degrees of parallax are enough to reduce the uncertainty in depth estimations.
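The triangulation above can be sketched numerically as follows (a minimal NumPy sketch of the law-of-sines depth hypothesis; the function name and argument conventions are illustrative):

```python
import numpy as np

def depth_hypothesis(t_c0, t_c, h1, h2):
    """Depth hypothesis for a candidate point by stochastic triangulation.
    t_c0, t_c: camera optical centers at the first and current observation;
    h1, h2: unit direction vectors toward the feature at those instants.
    Returns (d_i, alpha_i), with d_i measured from t_c0 along h1."""
    e = t_c - t_c0                                     # camera displacement
    beta = np.arccos(h1 @ e / np.linalg.norm(e))       # angle at first pose
    gamma = np.arccos(h2 @ (-e) / np.linalg.norm(e))   # angle at current pose
    alpha = np.pi - (beta + gamma)                     # parallax angle
    d = np.linalg.norm(e) * np.sin(gamma) / np.sin(alpha)  # law of sines
    return d, alpha
```

For example, a feature 10 m ahead of the first camera position, observed again after a 2 m lateral displacement, yields about 11° of parallax and recovers the 10 m depth exactly in the noise-free case.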
When the parallax $\alpha_i$ is greater than a specific threshold ($\alpha_i > \alpha_{min}$), a new feature $y_{new} = [p_{x_i}, p_{y_i}, p_{z_i}]^T$ is added to the system state vector $x$:

$$y_{new} = t^N_{c_0} + d_i\, m(\theta_0, \phi_0)$$

where $t^N_{c_0}$, $\theta_0$ and $\phi_0$ are taken from $c_l$, and $d_i$ is the filtered depth hypothesis. The system state covariance matrix $P$ is updated by augmenting it with $P_{y_{new}}$, the 3 × 3 covariance matrix which models the uncertainty of the new feature $y_{new}$:

$$P_{y_{new}} = J \begin{bmatrix} P_{y_i} & 0 \\ 0 & \sigma_d^2 \end{bmatrix} J^T \qquad (20)$$

In Eq 20, $P_{y_i}$ is taken from $c_l$. Let $\sigma_d^2$ be a parameter modelling the uncertainty of the depth estimation process. Let $J$ be the Jacobian matrix formed by the partial derivatives of the function that maps $(y^c_i, d_i)$ to $y_{new}$.
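The conversion of a converged candidate point into a Euclidean map feature can be sketched as follows (illustrative helpers, with m(θ, ϕ) as defined above; the covariance augmentation via the Jacobian J is omitted for brevity):

```python
import numpy as np

def m(theta, phi):
    """Directional unit vector from azimuth theta and elevation phi."""
    return np.array([np.cos(theta) * np.sin(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(phi)])

def init_feature(t_c0, theta0, phi0, d_i):
    """Euclidean position of a new map feature: the vertex of the
    semi-line stored in c_l plus the filtered depth hypothesis d_i
    along its direction."""
    return t_c0 + d_i * m(theta0, phi0)
```

Because the new feature enters the state directly in Euclidean form (three parameters), the filter avoids the six-parameter inverse depth representation used by the undelayed methods compared later.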

Visual updates and map management
The process of tracking visual features $y_i$ is conducted by means of an active search technique [41]. In this case, and differently from the tracking method described in subsection 3.4, the search region is defined by the innovation covariance matrix $S_i$:

$$S_i = \nabla h_i P \nabla h_i^T + R$$

Assuming that, for the current frame, $n$ visual measurements are available for features $y_1, y_2, \ldots, y_n$, then the filter is updated with the standard Kalman update equations. Let $\frac{\partial h_i}{\partial x_v}$ be the partial derivatives of the measurement prediction model $h_i$ with respect to the robot state $x_v$. Let $\frac{\partial h_i}{\partial y_i}$ be the partial derivatives of $h_i$ with respect to the feature $y_i$. Note that $\frac{\partial h_i}{\partial y_i}$ has nonzero values only at the location (indexes) of the observed feature $y_i$. Let $R = \sigma_{uv}^2 I_{2n\times2n}$ be the measurement noise covariance matrix. A SLAM framework that works reliably locally can easily be applied to large-scale problems using different methods, such as sub-mapping, graph-based global optimization [15] or global mapping [42]. Therefore, in this work, large-scale SLAM and loop closing are not considered; nevertheless, these problems have been intensively studied in the past. Candidate points whose tracking process fails are pruned from the system. In a similar way, visual features with a high percentage of mismatches are removed from the system state and covariance matrix.
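A generic EKF visual update consistent with the equations above can be sketched as follows (an illustrative NumPy sketch with stacked measurements; not the authors' implementation):

```python
import numpy as np

def ekf_update(x, P, z, h_pred, H, sigma_uv):
    """Standard EKF update: z stacks the 2n pixel measurements of the
    observed features, h_pred their predicted projections, and H the
    stacked measurement Jacobian (nonzero only in the camera block and
    in the block of each observed feature)."""
    R = sigma_uv**2 * np.eye(len(z))     # measurement noise covariance
    S = H @ P @ H.T + R                  # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)       # Kalman gain
    x_new = x + K @ (z - h_pred)         # state correction
    P_new = (np.eye(len(x)) - K @ H) @ P # covariance correction
    return x_new, P_new
```

The same innovation covariance S also bounds the active-search region in the image, so the update and the matching stage share this computation.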

Metric scale and System initialization
Even when a GPS signal is available, the problem of position estimation cannot be fully solved for some specific scenarios, for instance in an application requiring precision manoeuvres in a complex environment. In this case, and due to several sources of error, the position obtained with a GPS can vary by several meters in a matter of seconds, even for a static location [43]. In such a scenario, the use of GPS readings, smoothed or not, as the feedback signal of the control system can be unreliable, because the controller is not able to discriminate between sensor noise and actual small movements of the vehicle (see Fig 6). On the other hand, in a robotics context, obtaining the metric scale of the world can be a tough requirement. Indeed, one of the most challenging aspects of working with monocular sensors is the impossibility of directly recovering the metric scale of the world: if no additional information is used, and a single camera is the sole source of data of the system, the map and trajectory can only be recovered without metric information [14]. In this case, neither monocular vision nor GPS is suitable to be used separately for navigation purposes.
In this work, noisy data obtained from the GPS is incorporated into the system at the beginning in order to introduce the metric information of the environment. After an initial period of convergence, during which the system is considered to be in an initialization mode, the system can operate relying only on visual information.
Position measurements obtained from the GPS are modelled by:

$$z_r = r^N + v_r \qquad (23)$$

where $v_r$ is a Gaussian white noise with PSD $\sigma_r^2$, and $r^N$ has already been defined in Eq 7. Commonly, position measurements are obtained from GPS devices in geodetic coordinates (latitude, longitude and height). Therefore, in Eq 23 it is assumed that GPS position measurements have been previously transformed to their corresponding local tangent frame coordinates. It is also assumed that the offset between the GPS antenna and the vehicle frame has been taken into account in this previous transformation.
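The transformation of geodetic GPS fixes to the local tangent (NED) frame can be sketched with a flat-earth approximation (an illustrative sketch assuming the WGS-84 equatorial radius; a rigorous implementation would go through a full geodetic-to-ECEF-to-NED chain, but the approximation is adequate for the short, local trajectories considered here):

```python
import numpy as np

def geodetic_to_ned(lat, lon, h, lat0, lon0, h0):
    """Flat-earth approximation mapping a geodetic GPS fix (degrees,
    meters) to the local NED tangent frame anchored at the initial fix
    (lat0, lon0, h0), which defines the navigation-frame origin."""
    R_earth = 6378137.0                   # WGS-84 equatorial radius [m]
    d_lat = np.radians(lat - lat0)
    d_lon = np.radians(lon - lon0)
    north = d_lat * R_earth
    east = d_lon * R_earth * np.cos(np.radians(lat0))
    down = h0 - h                         # NED convention: down is positive
    return np.array([north, east, down])
```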
For system updates, the simple measurement model $h_r = h(x_v)$ is used:

$$h_r = r^N$$

In the next section, it will be shown that the proposed method is robust enough to be initialized with noisy GPS measurements.

Experimental setup
Fig 7 shows the vehicle used by the authors to obtain real data for the experiments; the platform is a customized quadrotor. The platform uses an Ardupilot unit [44] as flight controller. As main sensors, the platform is equipped with a radio telemetry unit (3DR at 915 MHz), a GPS unit (NEO-M8N), a camera (DX201 DPS) with a wide-angle lens and a video transmitter (at 5.8 GHz). The camera is mounted over a very low-cost gimbal which is servo-controlled by standard servos. During the experiments, the quadrotor has been manually controlled by radio.
For capturing sensor data and digitalized video from the vehicle, a software application was built by the authors in C++. The protocol used for reception/transmission is the MAVLink protocol [45]. GPS and AHRS (Attitude and Heading Reference System) data are synchronized and recorded in a database for their study. Video frames have been acquired at a resolution of 320x240 gray-scale pixels at 25 fps. All the experiments have been performed in an outdoor park with trees, whose surface is almost flat with grass and some dirt areas. Flight observations include some plants and small structured parts. On average, 9-10 GPS satellites are visible at the same time. Finally, a MATLAB implementation of the proposed method was executed offline over the dataset in order to estimate the flight trajectory and the map of the environment. In the experiments, for evaluating the performance of the proposed method, the P4P technique described in the Appendix was used in order to have an external reference of the flight trajectory. The different files containing all the data collected by the robot sensors, which have been used by the authors to perform the experiments contained in this research paper, can be downloaded from the following website: https://figshare.com/articles/Experiments/4029111.

Flight trajectories
Two different flight trajectories ($F_a$ and $F_b$) were performed over the test field. In both cases, an initial period of 5 seconds (from t = 0s to t = 5s) of flight was considered for initialization purposes, as explained in Section 3.7. Fig 8 shows some frames of the video recorded in flight $F_a$. At the beginning of the trajectory (left plot), at instant t = 2.84s, the first features are added to the system state. Note that at this moment, most of the tracked points are considered candidate points. At instant t = 10.23s (middle plot), the system is operating relying only on visual information for estimating the position of the quadcopter and the map of the environment. The right plot shows a frame at instant t = 30.11s. Fig 9 shows a 3D perspective of the estimated map and trajectory for both flight trajectories $F_a$ and $F_b$. In the next subsection, a more detailed analysis of the experimental results is presented.

Comparative study
A comparative study has been performed in order to gain more insight into the performance of the proposed delayed monoSLAM (DE) method. For this purpose, the DE method has been tested against the popular undelayed inverse depth method (UID) and its variant, the undelayed inverse depth to Euclidean method (UID2E). The implementations of the UID and UID2E methods are based on [23] and [46] respectively. The UID and UID2E methods have been chosen because the undelayed inverse depth method has become almost a standard for implementing filter-based monocular SLAM systems. In the experiments, the 1-point RANSAC method [47] has been used for validating the visual matches of map features. In the particular case of the DE method, no extra validation technique was used for the matching process of candidate points. For the DE method, a value of $\alpha_{min} = 5°$ has been used. For the UID and UID2E methods, values of $\rho_{ini} = 1$ and $\sigma_{\rho_{ini}} = 1$ have been used. In general, all the methods were tested under the same conditions. Only the parameter $\sigma_r^2$, used for modelling the uncertainty in GPS readings during the initialization period, has been slightly tuned for each method in order to produce a good initial metric convergence.
The search for new candidate points in each frame is conducted in a random manner for the DE method, as is the search for new features in the UID and UID2E methods. For this reason, the results of the methods can vary at each run. In this case, in order to have a better statistical appreciation of the performance of each method, 10 Monte Carlo runs have been executed for computing each result. Tables 1 and 2 show the results obtained for the flight trajectories $F_a$ and $F_b$ respectively. The number of visual features being tracked at each frame can affect the performance of monocular SLAM methods. For this reason, the methods have been tested by setting three different values of minimum distance (MD) between the visual features being tracked: the larger the value, the fewer visual features can be tracked. Also, in the experiments, features are removed from the system state if they are predicted to appear in the image but are not tracked for 25 consecutive frames.
Under the above conditions, the tables show the results obtained after applying the three different methods at the end of their trajectories. Several indicators have been computed for each method (DE, UID and UID2E): i) number of features initialized into the system state (NIF), ii) number of features deleted from the system state (NDF), iii) execution time per frame (ETF), iv) total time of execution (TTE) and v) average mean absolute error (aMAE) of the vehicle position. For computing the aMAE, the P4P trajectory has been used as an independent reference of the vehicle position (see the Appendix). However, it is important to note that the trajectory obtained by the P4P technique should not be considered a perfect ground-truth reference. Despite this consideration, the results obtained still reflect very well the performance of each method. Fig 10 shows the estimated position obtained with each method for the flight trajectories $F_a$ and $F_b$. A plot for each NED coordinate (north, east and down) is given. Only the results obtained with a minimum distance between features higher than 20 pixels (MD = 20) are shown.

Discussion
According to the results of the comparative study, some implications can be inferred. A slight variation in the number of features that are allowed to be tracked at each frame can significantly affect the number of features that are initialized into the system state. In this case, a reduction of 10 pixels in the MD roughly doubles the number of initialized features. Indeed, an increment in the features initialized into the system state implies an increment of the computational time. On the other hand, theoretically, and due to the increment of available information, the increase of tracked features should improve the estimated trajectory. However, the results do not show a considerable improvement in this sense. In this case, only with the trajectory $F_a$ was a consistent but minor improvement obtained with the increase of features, and at the cost of about twice the computational time.
Regarding the average mean absolute error (aMAE) computed for the estimated trajectory of the quadrotor, the DE method has shown consistently slightly better results than the UID method. However, it is important to note that the difference could be within the margin of error of the methodology followed for computing the aMAE. Unfortunately, statistics about this margin of error are not available. For this reason, according to the results, the DE method offers at least a similar performance in accuracy with respect to the UID method. On the other hand, the UID2E method shows, in every case, the worst behaviour of all the methods. It is worth noting that, for this application, the UID2E method has exhibited a considerable tendency to drift in the metric scale of the estimations. This phenomenon can be clearly appreciated in Fig 11 (lower plots), where the degradation in scale of the estimated map is especially noticeable.
Regarding the computational efficiency of the methods, it is clear that the proposed DE method presents the best results. This can be explained by two factors: the use of the Euclidean parametrization, and the use of fewer but stronger visual features.
In the case of the undelayed methods, the use of the inverse depth (ID) parametrization becomes mandatory due to the nonlinear nature of the measurement equation when features are initialized right after they are detected. On the other hand, the ID parametrization requires six parameters instead of the three of the Euclidean one. Therefore, as the number of features increases, with the ID parametrization the length of the state tends to be twice the length that it has with the Euclidean parametrization. For EKF-based approaches, the ID parametrization consequently implies a well-known increase in computational cost. In this sense, the UID2E method was designed to improve the computational efficiency of the UID method: features whose depth converges are converted from the ID to the Euclidean parametrization. The results validate this claim; however, for the application presented in this work, the benefit in computational efficiency is minimal compared with the increase of error drift obtained with the UID2E method.
For the DE method, the period used for tracking candidate points is mainly intended for obtaining information about the features depth prior to their inclusion into the system state. This also has the collateral benefit of pruning weak visual features that fail to be tracked during this period. In contrast to the undelayed methods (UID and UID2E), where all the detected visual features are initialized into the system, the delayed method (DE) initializes fewer but stronger visual features. This is more evident if the number of initialized features is considered (see Tables 1 and 2), as well as the percentage of deleted features with respect to the number of initialized features: DE = 68%, UID = 77% and UID2E = 78%. These figures mean not only that the undelayed methods initialize many useless visual features, but also that the features initialized with the delayed method are better retained in the system.

Conclusion
In this work, a novel monocular SLAM method with application to a quadcopter has been presented. In this case, a monocular camera is integrated into a UAV in order to provide visual information of the ground. Since attitude estimation is well handled by available systems for this kind of application, this research is focused only on position estimation. In order to avoid the need of estimating the camera orientation, a servo-controlled gimbal is used to keep the camera oriented towards the ground.
Traditionally, the position estimation of UAVs has been addressed by the use of GPS. However, GPS is not a fully reliable service: its availability can be limited in urban canyons, and it is unavailable in indoor environments. Moreover, even when the GPS signal is available, the problem of position estimation cannot be solved in some specific scenarios, for instance in an application that requires performing precise manoeuvres in a complex environment. Therefore, some additional sensory information should be integrated into the system in order to improve accuracy and robustness. In this context, the use of monocular vision has advantages in terms of weight, space, energy consumption, and scalability.
On the other hand, two challenging aspects of monocular sensors are the impossibility of directly recovering the depth of visual features and, consequently, the metric scale of the world. To address the first aspect, a novel technique for estimating the features depth, based on a stochastic triangulation technique, has been presented. Regarding the second aspect, it is assumed that GPS readings are available for a short period at the beginning of the system operation. After this initial period, used for incorporating information about the metric scale of the world, the system can operate relying only on visual information for estimating the position of the vehicle.
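The geometric core of depth estimation by triangulation can be sketched with a deterministic two-ray example: given two camera centers and the bearing of the same feature from each, the depth follows from the closest point between the two rays. This is only a simplified stand-in for the stochastic technique proposed in the paper, which additionally propagates measurement uncertainty; function and variable names are illustrative:

```python
import numpy as np

def triangulate_depth(c1, d1, c2, d2):
    """Depth of a landmark along the first viewing ray.

    c1, c2: camera centers; d1, d2: bearing directions of the same
    feature. Solves for the point on ray 1 closest to ray 2 (least
    squares). The denominator shrinks to zero when the rays are
    parallel, i.e. when there is no parallax and depth is unobservable.
    """
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    b = c2 - c1
    a11, a12, a22 = d1 @ d1, d1 @ d2, d2 @ d2
    denom = a11 * a22 - a12 ** 2
    t1 = (a22 * (b @ d1) - a12 * (b @ d2)) / denom
    return t1  # depth along the first ray

# landmark at (0, 0, 4) observed from two cameras 1 m apart
c1, c2 = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
depth = triangulate_depth(c1, np.array([0.0, 0.0, 1.0]),
                          c2, np.array([-1.0, 0.0, 4.0]))
print(depth)  # 4.0
```

The near-zero denominator under low parallax is precisely why a single depth estimate is unreliable and why accumulating evidence over a tracking period, as the delayed method does, pays off.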
The performance of the proposed method has been validated by means of experiments with real data carried out in unstructured outdoor environments. To assess the contribution of this research, an extensive comparative study is presented, validating the performance of the proposed approach with respect to similar methodologies. For the kind of aerial application presented in this paper, and according to the experimental results, the proposed method has performed better, in terms of accuracy and execution time, than the UID and UID2E methods.

P4P reference trajectory
Experimental setups in natural outdoor environments can be a challenge for small aerial vehicles. Some difficulties arise from the absence of resources available in laboratories (e.g. a Vicon system). In this particular case, for fine flight manoeuvres, the trajectory provided by the GPS is too inaccurate to serve as a reference for the actual flight trajectory. In this work, in order to have an independent reference for evaluating the performance of the proposal, the following methodology is proposed.
Four marks are placed on the ground, forming a square of known dimensions. A Perspective-4-Point (P4P) technique [48] is applied iteratively in order to compute the relative position of the camera with respect to the known metric reference. At each frame, the image location of the four corners is provided by a simple tracking algorithm designed for this purpose.
The P4P technique used for estimating the camera position, defined by R_CN and r_N, is based on [49]. A linear system is formed with the vector b as the unknown parameter (Eq 27), where the third column of the matrix R_CN is formed from combinations of the values of the first and second columns of the same matrix. The results obtained with this procedure can be very noisy (see left plot of Fig 12). For this reason, a simple low-pass filter is applied in order to obtain the flight trajectory (right plot, Fig 12). The P4P trajectory is computed with respect to the metric reference, while the trajectories obtained through visual SLAM have their own reference frame. In the experiments, both reference frames are aligned so that the trajectories coincide at the beginning; in other words, it is assumed that the initial position of the quadcopter is known.
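The low-pass filtering step applied to the noisy P4P positions can be sketched as a first-order exponential moving average per axis. The paper does not specify the filter type or coefficients, so both the choice of filter and the value of `alpha` below are assumptions for illustration:

```python
import numpy as np

def lowpass(traj, alpha=0.1):
    """First-order low-pass (exponential moving average) applied
    per axis to an (N, 3) trajectory. Smaller alpha gives stronger
    smoothing at the cost of more lag.
    """
    traj = np.asarray(traj, dtype=float)
    out = np.empty_like(traj)
    out[0] = traj[0]
    for k in range(1, len(traj)):
        out[k] = alpha * traj[k] + (1.0 - alpha) * out[k - 1]
    return out

# noisy P4P-style samples around a fixed hovering position
rng = np.random.default_rng(0)
noisy = np.array([1.0, 2.0, 3.0]) + 0.2 * rng.standard_normal((200, 3))
smooth = lowpass(noisy, alpha=0.1)
```

Aligning the SLAM and P4P reference frames at the start, as described above, then amounts to subtracting the initial position offset (and rotation, if any) from one trajectory before comparison.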