A New Pose Estimation Algorithm Using a Perspective-Ray-Based Scaled Orthographic Projection with Iteration

Pose estimation aims at measuring the position and orientation of a calibrated camera using known image features. The pinhole model is the dominant camera model in this field. However, the imaging precision of this model is not accurate enough for an advanced pose estimation algorithm. In this paper, a new camera model, called incident ray tracking model, is introduced. More importantly, an advanced pose estimation algorithm based on the perspective ray in the new camera model, is proposed. The perspective ray, determined by two positioning points, is an abstract mathematical equivalent of the incident ray. In the proposed pose estimation algorithm, called perspective-ray-based scaled orthographic projection with iteration (PRSOI), an approximate ray-based projection is calculated by a linear system and refined by iteration. Experiments on the PRSOI have been conducted, and the results demonstrate that it is of high accuracy in the six degrees of freedom (DOF) motion. And it outperforms three other state-of-the-art algorithms in terms of accuracy during the contrast experiment.


Introduction
Estimating the pose of a calibrated camera has lots of applications in augmented reality, air refueling, and unmanned aerial vehicle (UAV) navigation [1][2][3]. The augmented reality often operates on the basis of prior knowledge of the environment, which limits range and accuracy of registration. Pose estimation attempts to locate 3D features in the feature map, and provides registration when the reference map is in the sensing range [4]. In air refueling, a single monocular camera is mounted on the receiver aircraft while the probe and drogue is mounted on the tanker aircraft. Pose estimation algorithm is proposed for the purpose of tracking the drogue during the capture stage of autonomous aerial refueling [5]. In UAV navigation, pose estimation is employed in the formation flying of UAVs. To guarantee the relative positions of these UAVs, the IR-LEDs on the leader UAV is captured by the IR-camera on the follower UAV and the detected features are transmitted to the pose estimation algorithm [6].
Pose estimation, also known in the literature as the Perspective-n-Point (PnP) problem, measures the position and orientation of a calibrated camera with known image features [7]. The features available to solve the PnP problem are usually given in the form of a set of point correspondences, each constituting a space point expressed in object coordinates and its image projection expressed in image coordinates. In the past few decades, a huge amount of work has been done to address the problem. Various solutions to the PnP problem, including the EPnP [8], the DLS [9], the RPnP [10], the ASPnP [11], the LHM [12], etc., are developed. To the best of our knowledge, these PnP solutions can show high accuracy only when dealing with dozens or even hundreds of point correspondences. Unfortunately, considering the terrible environment in pose estimation applications, it is hard to offer too many stable and distinguishable point correspondences. Although the DLS is applicable to situations of n 7, the moving range of the target object is extremely limited [9]. The PnP solutions, especially the P4P solutions, have been in great demand in recent years. The P4P solutions can be classified into two types: model-based solutions, which depend on the approximation of a camera model, and geometric configuration solutions that handle the relationship between image space and object space with geometrical characteristic such as distance, angle, parallel, vertical, etc.. POSIT is a popular solution to the non-coplanar P4P problem and is one of the representative solutions in the first category [13]. Scaled orthographic projection is employed in the algorithm, and the rotation matrix and translation vector of a calibrated camera is obtained through the projection. Iteration is also introduced to refresh the old image coordinates of feature points, and then repeat the previous steps. The iteration does not stop until the output has satisfied the preset accuracy or the algorithm is circulated for preset times. For the stability and high accuracy, the POSIT is continuously introduced into applications in complex interference environment [14][15][16][17][18][19]. The latter solutions take advantage of the geometric configuration of the special feature points. The geometric configuration of the P4P problem is a core research. Liu. M. L. et al. [20] made full use of the geometric configuration of the four non-coplanar feature points, including the angle between two perspective lines, the mixed product among the perspective lines, the segments in object space, etc.. The follow-up researches did not surpass the category of the geometric configuration by Liu. M. L.. Z. Y. Hu et al. [21] mathematically analyzed the geometric configuration of non-coplanar P4P problem. They parameterized the relationship between the numbers of possible solutions and the numbers of geometric configuration. Wu PC et al. [22] focused on the plausible pose, and proposed an analytical motion model to interpret, or even eliminate, the geometric illusion. Yang Guo [23] researched the coplanar P4P problem. By converting perspective transformation to affine transformation and using invariance to 3D affine transformation, it is found that the upper bound of the coplanar P4P problem is two. A technique based on a singular value decomposition (SVD) is also proposed for the coplanar P4P problem by Yang Guo, unverified by any real test. To improve estimation accuracy, Long Li et al. [24] introduced Frobenius norm into the determinant of rotation matrix, instead of the SVD-based method. Unfortunately, the proposed method did not contribute to accuracy and noise resistance, it only reduced the runtime. Bujnak M. et al. [25] and Kuang Y. et al. [26] focused on the recovery of the unknown focal length from the P4P solutions, and were not interested in the accuracy of the P4P solutions. From the studies in [13][14][15][16][17][18][19][20][21][22][23][24][25][26], it can be concluded that the research concerning accuracy improvement of the P4P solutions is slow and unattractive.
To sum up, the camera model of the above solutions is a pinhole camera, in which all the incident rays are projected directly onto the detector plane through a single point, called the effective pinhole of the camera model [27]. In practice, the incident rays are deviated on account of the compound lenses. The P4P solutions are negatively influenced by the imprecise camera model. There are still other expressions proposed to describe the camera model [28][29][30][31]. By using lens geometry model, the geometric relationship between images and objects is established via Snell's Law and skew ray tracking in [28] and [29]. The camera model is represented by a matrix equation that relates the parameters of the image plane with the incident ray. But it is complex because each incident ray is represented by a set of six pose parameters. In a general imaging model, the cameral is regarded as "black box" [30,31]. A set of virtual sensing elements called "raxel" is used to describe a linear mapping from incident rays to the image plane. The "raxel" is composed of three parameters: an image projection, the yaw and pitch directions of the projective ray. The calibration of "raxel" is tedious and entirely depends on the accuracy of rotation stage. In this paper, an incident ray tracking model (IRT) is proposed, where two reference planes are regarded as camera model parameters. Through analyzing the geometric properties of the proposed model, the incident ray is mathematically summarized as a perspective ray which is positioned by two points respectively located in the two reference planes. Considering the excellent scaled orthographic projection, a perspectiveray-based scaled orthographic projection is employed in this paper. The projection formulates a linear system which calculates the approximation of object pose, and iteration loops are also introduced to obtain a more accurate approximation. The camera calibration based on the incident ray tracking model (IRT) and the perspective-ray-based scaled orthographic projection with iteration (PRSOI) for pose estimation will be described in detail in the following sections.

Incident Ray Tracking Model
Incident ray formulation Irrespective of its specific design, the purpose of an imaging system is to map incident rays from the scene onto pixels on the detector. Each pixel in Fig 1 collects energy from the incident ray in the optical system that has a non-zero aperture size. However, the incident ray can be represented by a perspective ray when studying the geometric properties of the imaging system. As shown in Fig 1, the system maps the incident ray to the pixel. Because the path that incident ray traverses from scene to the pixel can be arbitrarily complex, the incident ray should be replaced by an abstract mathematical equivalent that is referred to as a perspective ray l (I, P m , P n ). The IRT is composed of the incident rays, in the field of view. In the following section, the parameters of the IRT will be introduced.

Parameters of camera model
If the radiometric response function of each perspective ray is computable, one can linearize the radiometric response with respect to the image plane. In our context of camera model, the ray to image mapping may be parameterized as Fig 2. A point P i (x,y,t) imaged at (x,y) at depth t is imaged along a perspective ray l (t is the vertical distance from P i to the P n ). It will be more convenient to represent the model if the perspective rays, such as l, are arranged on two planes called reference planes, such as P m and P n . Each perspective ray will intersect the two reference planes respectively at only one point, namely P m and P n . The reference planes could be written as a function: ( A perspective ray could be determined uniquely through the reference planes P m and P n , and the IRT is parameterized by the two reference planes.

Computing parameters
The parameters used to specify the IRT are derived from the reference planes. The perspective ray passes through the two reference planes P m (x, y) and P n (x, y), and intersects the image plane P c (u, v) at point I(u,v). Ignoring the position of the planes, the mapping from P n (x, y) to There is a one-to-one mapping between the image plane and the reference plane. As the two planes are both represented by a set of points, the mapping is recasted as the following equation: where (C ij ,D ij ) are the mapping parameters, n is the order of the mapping, (x,y) is the space coordinates of the points in the plane ABCD while (u,v) is the image coordinates of them in the plane abcd.
The (C ij ,D ij ) are obtained using Levenberg-Marquard method [32]. The reference plane P m (x, y) is represented by rational functions g m x ðu; vÞ and g m y ðu; vÞ, consisting of C m ij and D m ij . m can be replaced by n.

Pose Estimation Based on Perspective Ray Object pose formulation
Considering the geometrical features of the perspective ray l k ðI k ; P m k ; P n k Þ, which is described in Fig 4. The points P i 0 and P j k are located on the object. P i 0 Àuvw is the object coordinate system, and O-ijk is the reference plane coordinate system.
Pose estimation in this paper aims to compute the rotation matrix and translation vector of the object. The purpose of the rotation matrix is to transform the object coordinates such as À! P i 0 P j k into coordinates defined in the reference plane coordinate system such as À! P n 0 P n k (n represents a point located on the plane P n ). The dot product À! P i 0 P j k i between the vector À! P i 0 P j k and the first row of the matrix correctly provides the projection of this vector on the unit vector i of the reference plane coordinate system. The rotation matrix can therefore be written as: where i u , i v , i w are the coordinates of i in the object coordinate system. To compute the rotation matrix, it is only needed to compute i and j in the object coordinate system. The vector k is then obtained by the cross-product i × j.
The translation vector, T, is the vector À! OP i 0 . The point P i 0 is determined by the perspective ray l 0 which can be expressed as: ( where z m and z n are respectively the z coordinate of the planes P m and P n . From Eq (4), the vector À! OP i 0 could be expressed as: ; 0Þ; f 0 y ðz i Þ À g n y ð0; 0Þ; z i À z n Þ ð 5Þ where z i is the z coordinate of the plane P i . Therefore to compute the object translation, only the z coordinate needs computing. Thus the object pose is fully defined once the unknowns i, j and z are found.

Projection on the perspective ray
The image point corresponding to the feature point, which projects on the perspective ray, is shown in Fig 5. Only two feature points P i 0 and P j k appeared in the projection. The perspective rays l 0 and l k are respectively in correspondence with the feature points P i 0 and P j k , and are computed by the calibration parameters. The object coordinate system is centered at P i 0 , and the coordinate of P j k relative to P i 0 is known. The point P i 0 locates on the plane P i which parallels to the planes P m and P n .
Scaled orthographic projection is an approximation to the perspective projection. It is assumed that the depths of different points can all be set as the same depth z i . The geometric construction to obtain the perspective ray l k of P j k in a perspective projection and the perspective ray 1 k 0 of P j k in a scaled orthographic projection is shown in Fig 5. The point P j k is projected on the plane P i at Q i k by a scaled orthographic projection.

Formulations of projections
Formulations of perspective projection. Now consider the equations that characterize a perspective projection and relate the unknown row vectors i and j of the rotation matrix and the unknown z i coordinate of the translation vector to the known coordinates of the vector À! P i 0 P j k in the object coordinate system, and to the known coordinates of P n 0 . In Fig 6, the perspective ray l k intersects the plane P i in P i k , and P j k projects on the plane P i at Q i k . The vector À! P i 0 P j k is the sum of three vectors: The vector À! P i 0 P i k is constrained by two perspective rays l 0 and l k . It can be expressed as: where ðf 0 x ; f 0 y Þ and ðf k x ; f k y Þ are the functions of l 0 and l k . The vector À! P i k Q i k is also constrained by l k and l k 0 . For the z coordinate of P j k is z i0 = z i (1+ε i ) (ε i ¼ À! P i 0 P j k k=z i ), the vector À! P i k Q i k is defined as: The vector À! Q i k P j k is perpendicular to the reference plane P i , and it can be defined as: The sum of the three vectors can then be expressed as: Then take the dot product of Eq (10) with the unit vector i and j. The dot products À! P i 0 P j k i and À! P i 0 P j k j are expressed as: 8 < : Solving Eq (11) for the unknowns would provide all the information required to define the object pose.
Formulations of scaled orthographic projection. The right hand sides of Eq (11), the terms f k x ðz i0 Þ and f k y ðz i0 Þ, are in fact the coordinates of the point Q i k , which are the scaled orthographic projections of the feature point P j k . Consider the points P i 0 , P j k , and the projections Q i k of P j k on the plane P i , the vector À! P i 0 P j k is the sum of two vectors Then take the dot product of the vector À! P i 0 P j k with the unit vector i. The dot product À! Q i k P j k i is zero, and the dot product Consequently, the dot products À! P i 0 P j k i and À! P i 0 P j k j are similar to Eq (11).

Iteration for scaled orthographic projection
Eq (11) can also be written: : As the points P i 0 and P n 0 , P j k and P n k respectively locate in the perspective rays l 0 and l k , Eq (13) could be approximated as: where I = s i Ái, j = s i Áj. Eq (14) provides a linear system of equations in which the only unknowns are respectively the coordinates of I and J. The norm of I and J are respectively the scaling factor s i and s j between the vector À! P i 0 P j k and À! P n 0 P n k . Then the length of the two vectors can be written as: It can be parameterized as: If values are given to the term ε i , z i is obtained from Eq (16). The proposed algorithm, used to determine the pose by solving the linear system, is called perspective-ray-based scaled orthographic projection (PRSO). The solution of the PRSO algorithm is only an approximation if the values given to the term ε i are not exact. But once the unknowns i and j have been computed, more exact values can be computed for the term ε i , and the equations can be solved again with these better values. The iteration algorithm is named PRSOI (PRSO with Iterations). It generally makes the values of i, j and z i converge towards values which correspond to a correct pose through iterations.
Initially, the term ε i is equal to zero. In fact, it can be assumed that P j k and Q i k coincide. When tracking an object, the initial value for the term ε i is preferably chosen equal to the value obtained at the last iteration of the pose estimation for the previous image. The computed error of coordinates, which is between the projection point Q n k of P j k in the prior iteration and the one in the current iteration, reaches the minimum at the end of iterations.

Solving the system of PRSO algorithm
Within the preceding iterative algorithm, the solution of Eq (14) is still a problem. This equation could be rewritten in a more compact form: : where The dot products of this equation are expressed in terms of vector coordinates in the object coordinate frame: ( These are linear equations where the unknowns are the coordinates of I and J. The other parameters are known: f 0 x ,f 0 y ,f k x ,f k y are the known functions of l 0 and l k , and u i , v i , w i are the known coordinates of P j k in the object coordinate frame. Substitute the n feature points for Eq (18), a linear system is generated for the coordinates of the unknown vectors I and J: ( where A is the matrix of the coordinates of the object points in the object coordinate frame, In general, if there are at least four non-coplanar points, the least square solution of the linear system is given by: ( where the object matrix B is the pseudo inverse of the matrix A. Once the least square solutions to I and J are obtained, the unit vectors i and j are simply obtained by normalizing I and J. Now the translation vector T of the object can be obtained. It is vector À! OP i 0 , and z i is computed by Eq (16). Then the vector T is computed by Eq (5).

Camera calibration results
In the experiment, a domestically developed CCD camera with image resolution 768×576 pixels, pixel size 0.0083mm×0.0086mm, and field angle 60°, is used. It is fixed on a linear stage via a bracket. The type of the linear stage is Zolix KSA300-11-X, with repeatability of 3μm, straightness of 10μm, and travel of 300mm. The calibration target is a solid circular array pattern with 7×9 circular points evenly distributed. The size of the target is 500×600mm 2 , and the distance between the adjacent points is 60mm in the horizontal and vertical directions.
Fix the target on the optical platform, and then move the camera to make the calibration target cover most of the field of view. The captured images are taken at six different positions, and the distance between the adjacent positions is 30mm. Two specific images, such as the images captured at 0mm and 150mm, are regarded as the calibration data. As the camera parameters are obtained, the captured images, including the two specific ones, are introduced into the IRT to compute the space error of the calibration points. The captured images are shown in Fig 6. Table 1 lists the camera parameters. The reference planes P m and P n are described as the fifth order polynomials.  The root mean square error (RMSE) of the calculated calibration points is 0.17mm in horizontal direction, and 0.12mm in vertical direction. According to the error statistics of the calibration points, it is obvious that the camera can be described by the IRT completely.

Pose estimation results
The experiment devices for pose estimation are shown in During the experimental process, the target is fixed on the rotating platform, and the image is captured at every 1°. The rotating angle of the target between the initial position and the current position is measured by the two captured images. The three directions of rotational motion are tested. Then fix the target on the linear stage, and capture the image of it at every 2mm. The moving distance of the target between the initial position and the current position is also measured by the two captured images. The three directions of translational motion are tested. Fig 9 shows a real image of four non-coplanar feature points captured by the calibrated camera.
Notice the central part in the Fig 9: the effective coverage of the four feature points in the captured image is just about 1.49%. This is very different from the captured image of the popular PnP solutions. The PRSOI is tested by the captured data, and compared with the stateof-the-art P4P solutions. For the pinhole camera, the geometric configuration solution by Liu ML and Wong KH [20], denoted by LW in short, as well as the popular iterative solution POSIT [13], are considered. For the IRT camera, the LW+IRT solution is considered, since the LW incorporates the IRT. The results of the P4P solutions are shown in Fig 10. The calculated pose of the target are checked by comparison with the standard positions which are obtained from the interface controller.
Statistics are used in estimation error analysis, and the RMSE of the P4P solutions are summarized in Table 2.
Through the comparison between the LW and the LW+IRT, it is obvious that the accuracy of the LW+IRT is higher than that of the LW. The result suggested that the IRT is effective in the P4P solutions. As the accuracy of the PRSOI is higher than that of the POSIT, it demonstrates that the perspective-ray-based scaled orthographic projection is superior to the scaled orthographic projection in a pinhole camera. Considering the accuracy of the four P4P solutions, it can be proved that the accuracy of the PRSOI outperforms the other three state-of-theart P4P solutions.
The PRSOI is an iterative solution, though powerful, does have a shortfall: planning the correct pose for each position is slow. In this paper, accuracy is the major concern while computational cost is ignored.

Conclusion
This paper puts forward and deeply analyzes the IRT and the PRSOI. The IRT, which with definite geometric meaning, consists of two reference planes P m and P n . The PRSOI introduces the IRT into a scaled orthographic projection, then adopts an iteration to make the perspective-ray-based scaled orthographic projection more accurate. Four non-coplanar points are used as feature points in the real image experiment. And three other P4P solutions are introduced to be compared with the PRSOI. Experiment results demonstrated that the PRSOI is of high accuracy in the six-DOF motion. The P4P solution proposed in this paper is of significance in the P4P applications such as the positioning of mechanical arm, the four-wheel aligners, the installation of super-huge workpiece, etc..
To the best of our knowledge, it is the first study to incorporate the perspective ray with the scaled orthographic projection, and the incorporation works effectively in the P4P situation. Supporting Information S1 Dataset. Camera captured dataset. This archive contains the captured data files used as the basis for the P4P solutions described in the manuscript. The data are provided in a directory hierarchy where each degree of freedom has a separate directory. And the calibration data is the captured data used in the camera calibration. (ZIP)