An Efficient Object Tracking Method on Quad-/Oc-Trees

We introduce a fast error-free tracking method applicable to sequences of two and three dimensional images. The core idea is to use Quadtree (resp. Octree) data structures for representing the spatial discretization of an image in two (resp. three) spatial dimensions. This representation enables one to merge into large computational cells the regions that can be faithfully described with such a coarse representation, thus significantly reducing the total number of degrees of freedom that are processed, without compromising accuracy. This encoding is particularly effective in the case of algorithms based on moving fronts, since the adaptive refinement provides a natural means to focus the processing resources on information near the moving front. In this paper, we use an existing contour based tracker and reformulate it to the case of Quad-/Oc-tree data structures. Relevant mathematical assumptions and derivations are presented for this purpose. We then demonstrate that, on standard bio-medical image sequences, a speed up of 5X is easily achieved in 2D and about 10X in 3D.


Introduction
Tracking objects in image sequences is a fundamental problem in computer vision. Error-free tracking is essential for various vision applications such as surveillance, security systems, and medical/biological image analysis. However, data uncertainty presents significant challenges for reliable object tracking. This includes, for example, illumination changes across consecutive frames, rigid/nonrigid deformation of the object of interest, partial occlusion, or data corruption.
Numerous frameworks have been proposed in recent years to address the above-mentioned problems. For example, a class of methods includes deformable shape models [1], active shape models (ASMs) [2] and active appearance models (AAMs) [3], which capture the statistics of shape and appearance variations using training examples. By combining ASM and AAM in a multiscale fashion, Mitchell et al. [4] achieved robustness against noise and clutter. Tsai et al. [5] applied principal component analysis (PCA) to model the variability in the training shapes represented by signed distance functions (SDFs). Zhu et al. [6] proposed a subject-specific dynamic model designed for medical application using multilinear PCA. A potential drawback of these methods is that their performance depends on the training data, which may not encompass all possible scenarios.
Recently, graph-based algorithms have been proposed for tracking deformable objects. For example, Ishikawa and Jermyn [7] developed a polynomial time algorithm to extract similar objects from a set of time sequence images. Schoenemann and Cremers [8] proposed a realtime solution for tracking, implemented on GPUs, that is based on finding a minimum cost cyclic path in the product space spanned by the template shape and the given image. The cost function is derived from both the image data and the shape prior. They then extended their previous approach [9] by incorporating the edge information, which provides robustness against illumination changes. In another work [10], they also introduced a motion layer decomposition framework, which is solved using both discrete and continuous optimization. Although this method has been shown to be robust against occlusion, it is unclear how this algorithm may be adapted for tracking objects with unknown shape statistics. Discriminative methods [11,12], which cast visual tracking as a two-class classification problem, dynamically update the classifiers in order to account for appearance changes and partial occlusion of the object-of-interest. Approaches such as [13] select the target as the one which minimizes the projection error in the space spanned by observed (tracked results from previous frames) and trivial templates (with one nonzero element).
Rich, dense flow information between consecutive frames in time sequence images can also be incorporated as a means to improve effectiveness. One way to include this information is to first compute a correspondence map (registration) between successive frames before computing the correlated segmentation of the video frames to solve the tracking problem. In recent works [14,15], it has been shown that tracking performance can be further improved by solving the registration and the segmentation problems simultaneously (SRS for simultaneous registration and segmentation) thanks to a function that establishes the correspondence between the target and the reference level-set functions. The performance is improved in [16] with the help of a dynamical prior term (SRS+DP). The same authors recently generalized the SRS+DP framework (GSRS for generalized SRS), which can handle shading, reflections, and illumination changes. In [17], Ghosh et al. extend GSRS [18] for fast and efficient implementation. Modifications include the reconstruction of the sparse-error (due to partial occlusion, shading and reflections) between consecutive frames using a regularized variant of the L 1 norm, a new formulation of the dynamical shape prior term and the incorporation of a combination term, which modifies the update rule for the estimated functions. Through examples with natural and biological images sequences, the authors have illustrated the robustness of their formulation.
However, another fundamental problem in computer vision is the efficiency of the underlying strategy, in terms of CPU and memory requirement. A fast error-free tracking can be determining in some applications such as medical analysis or security systems. Medical analysis, with the ever increasing amount of data generated by modern acquisition systems, is in particular an area where large data sets need to be processed efficiently. The effectiveness of the approach of Ghosh et al. [17] is in large part due to the use of rich dense flow information between consecutive frames, a computationally expensive feature. However, dense flow information is mainly useful in the region near the evolving front so that the use of adaptive grid techniques that automatically coarsen away from the moving front have a distinct advantage.
The key contribution of this paper is the use of Quad-/Oc-tree Cartesian grids for efficient processing of an algorithm that is robust against partial occlusion, drastic illumination changes, and data corruption.
The outline of this paper is as follows: we first briefly introduce the level-set formalism in section 2 and the SRS method with sparse error reconstruction in section 3. In section 4, we develop the algorithm on Quad-/Oc-tree Cartesian grids. We finally compare the results obtained with the present method and that of [17] in section 5.

Level-Set Formalism
The level-set method, introduced by Osher and Sethian [19], is a general numerical technique for capturing the evolution of a moving front. Specifically, the front is represented as the zero level set of Lipschitz continuous function ϕ. We will denote the moving front as Γ and thus define Γ = {x : ϕ(x) = 0}. The interior and exterior regions delimited by Γ are described by {x : ϕ (x) > 0} and {x : ϕ(x) < 0}, respectively. The implicit level-set representation is particularly convenient in the case of topological changes. Given a velocity field u, the motion of the front is given by the equation: t þ u Á r ¼ 0: Finally, while any Lipschitz continuous function can define Γ, it is advantageous to use the signed distance function to the front. For a general front described by an arbitrary level-set function ϕ init , the signed distance function is obtained by solving, in pseudo time τ, the equation: where Sign refers to the signum function. This is the so-called reinitialization equation, introduced for uniform grids by Sussman et al. [20], improved by Russo and Smereka [21] and extended to unstructured grids in Min and Gibou [22].

Simultaneous Registration and Segmentation
Consider two consecutive images Iðt À 1Þ : Oð R n Þ ! R, with n = 2, 3 and IðtÞ : Oð R n Þ ! R, in an image sequence at two consecutive time instances, t − 1 and t. Assuming that the initial contour of the object of interest is known a priori, we consider that the shape of the tracked object of interest at any arbitrary time t is embedded in the level-set function F 0 ðx; tÞ : O ! R n .

Registration Model
The aim of registration is to find a displacement vector field u : O ! R n that establishes the dense correspondence between two consecutive image frames. We denote the components of u as (u(x), v(x)) and (u(x), v(x), w(x)) in two and three spatial dimensions, respectively. This vector field is traditionally obtained by maximizing the posterior distribution P(u|I(t), I(t − 1)) (see for example [1,23] and references therein): In the present work, based on [17], we simultaneously estimate both misalignment (dense correspondence map) and corruption in the image sequence by assuming that the corruption due to partial occlusion, shading, reflections, etc., is sparse in nature. Consider that the mapping G : O ! R models the sparse corruptions between successive time frames. We can rewrite Eq (1) by incorporating G as follows: where we assumed in the last step that u and G are independent of each other. Maximizing the expression in Eq (2) is equivalent to minimizing the following energy functional under certain simplifying assumptions: where, T(x) = x + u(x), α u controls the smoothness of the derived vector field, while α g 1 and α g 2 control the sparsity and the smoothness in the estimated corruption. It has been demonstrated in [3] that L 1 -minimization is more suitable than the quadratic penalizer, ρ( , the most robust convex penalizer known in the optical flow community. Due to the small positive constant , ρ u is still convex, which offers advantages for the minimization process. The constant is chosen in our experiments to be 0.001.
Since the energy E RG is highly nonlinear, the minimization process is not trivial. A minimizer for u must satisfy the Euler-Lagrange equations: where I d ¼ IðTðxÞ; tÞ À Iðx; t À 1Þ À G; Similarly, the Euler-Lagrange equation that needs to be satisfied by G is: To solve these equations, we adapted the approach from [24,25]: u and G are both solved iteratively in multiple scales using fixed point schemes. We implement a fully implicit scheme for the smoothing terms (e.g., the term associated with α u and α g2 ) and a semi-implicit scheme for the data terms (e.g., the first term in Eq (3) and the term associated with α g1 ). The term corresponding to warping, I(T(x), t), is obtained using linear interpolation.

Segmentation Model
It is standard to segment the t-frame using the dynamic prior term, given the observed contour ϕ 0 (t). However, there are a few drawbacks with this approach. Examples include the absence of an energy minimizer or the dependence on the choice of the domain. These can be avoided with the formulation of Ghosh et al. [17], which is based on the dual formulation of the total variation (TV) norm, by including the shape prior term to define a new dynamic shape prior term, d 2 where ρ p (.) is a robust estimator; here we assume ρ p (y 2 ) = y 2 . One then considers the following constrained minimization problem: The solution is constrained to lie between 0 and 1 since the unconstrained problem does not have a stationary solution. The first term in Eq (6) measures the weighted total variation norm of the function ϕ 0 . It can be rigorously demonstrated (see [26,27] and references therein) that Eq (6) is equivalent to solving the following unconstrained problem: where zðyÞ :¼ max f0; 2jy À 1 2 j À 1g andâ > a s1 2 jjZðxÞjj L 1 ðOÞ . We can also rewrite z(y) as: 2ðy À 1Þ if y > 1; 0 otherwise: Since the two energies in Eqs (6) and (7) agree for {ϕ 0 2 L 1 (O) : 0 ϕ 0 (x) 1 ,8x} it is sufficient to prove that any minimizer of Eq (7) satisfies the constraint 0 ϕ 0 (x) 1. The variational formulation (7) can now be solved using a convex regularization [28] term: where β > 0 is a small constant, so that φ is a close approximation of ϕ o .

Simultaneous Registration and Segmentation
A commonly used objective function in this regard is the one which establishes the correspondence between the target (ϕ o (t)) and the reference (ϕ o (t − 1)) level-set functions using a mapping function u(x, t − 1, t). The functional combination term seeks to register the isocontours of the source and target shapes within a narrow band of the zero level sets. The width of the narrow band is chosen by a parameter 0 . The choice of 0 is crucial since it determines the scale at which objects are registered (see [29] for a detailed discussion). In contrast, Ghosh et al. [17] proposed a functional that does not require the scale information a priori. In this case, the combination term is (similar to Eq (5)) defined as: where ρ c (y 2 ) = y 2 . Finally, the method simultaneously estimates u, G, and ϕ 0 (t) for the current frame at time t. This combination term modifies the update rule for the estimated functions.
Recalling Eq (4), the new Euler-Lagrange equations that need to be satisfied for u can be written as: The update rule for the sparse error, G(x, y), remains the same since it is independent of the combination term E NC . For the segmentation problem, we use: This formulation has been shown to be robust against partial occlusion, drastic illumination changes, and data corruption. A drawback of the existing approach is the computational cost associated with solving the partial differential equations of the model on a grid with as many degree of freedom as the number of pixels/voxels. However, the region of interest for processing the data at every step is located in only a small band around the moving front. It is therefore desirable to adopt a strategy that can leverage the locality of information. Next, we describe an algorithm based on Quad-/Oc-tree data structure, which enables to significantly decrease the number of degrees of freedom, while conserving the accuracy of the method of [17].

Algorithm on Quad/Oc-tree Data Structures
In this section, we describe how the level-set method on Quad-/Oc-tree introduced by Min and Gibou [22] can be combined with the methodology of [17] to produce an efficient algorithm for object tracking and segmentation.

Quad/Oc-tree Data Structures
Quadtree (resp. Octree) data structures can be used to represent the spatial discretization of a geometrical domain in two (resp. three) spatial dimensions as depicted in Fig 1: initially the root of the tree is associated with the entire domain cell. Further refinement is obtained by recursively splitting each cell into four (resp. eight) children until the desired level of detail is achieved. We call 'level of the tree' its depth. We refer the reader to the books of Samet [30,31] for more details on the definition of these data structures and the standard algorithms associated with them.
In this work, the maximum level of the tree is that corresponding to the pixel size, i.e. a 1024 × 1024 image will be associated with a quadtree with a maximum level of 10 (as in 2 10 ).
Also, in the case where a tree cell is associated to a subset of an image, the cell's value is defined as the average pixels' value.

Constructing Quad-/Oc-tree
Since the accuracy of the method depends on the size of the cells adjacent to the moving front, we impose that the finest cells lie on the interface. This can be achieved by using the signed distance function to the interface along with the Whitney decomposition, as first proposed by Strain in [32]. For a general function : R n ! R with Lipschitz constant Lip(ϕ), the Whitney decomposition is extended in Min [22] as: starting from a root cell, split any cell C if max v2NodesðCÞ jðvÞj 1 2 LipðÞ Â diag À sizeðCÞ; where diag-size(C) refers to the length of the diagonal of the current cell C and Nodes(C) refers to the set of four (resp. eight) nodes of the current cell. This condition enforces the splitting of any cell whose edge length exceeds its distance to the interface, hence producing a computational grid where the smallest cells capture the interface location. Larger and larger cells are located as the distance from the interface increases (see [22] for details). Fig 2 depicts an example of an image (left) and its quadtree representation (right). Note that far from the simulated moving front (depicted by a red interface), the image is blurred due to averaging of pixels' value, while the original image resolution is kept near the interface. In the work of [16], the image was filtered by a gaussian filter to avoid spurious noise. In the case of Quad-/Oc-tree representation, the coarsening away from the front acts as a filter so we do not apply a Gaussian filter.

Finite Difference Discretizations
In the case of nonregular Cartesian grids, the main difficulty is to derive discretizations at Tjunction nodes, i.e. nodes for which there is a missing neighboring node in one of the Cartesian An Efficient Object Tracking Method on Quad-/Oc-Trees directions. For example, Fig 3 depicts a T-junction node, v 0 , with three neighboring nodes v 1 , v 2 and v 5 aligned in three Cartesian directions and one ghost neighboring node, v g , replacing the missing grid node in the x-direction. Referring to Fig 3, the value, u G g , of the node-sampled function u : fv i g ! R at the ghost node v g is given by the third-order interpolation: In three spatial dimensions, there are two ghost values (see Fig 4) that are defined as [22]: ¼ s 11 s 12 u 11 þ s 11 s 9 u 12 þ s 10 s 12 u 9 þ s 10 s 9 u 10 ðs 10 þ s 11 Þðs 9 þ s 12 Þ À s 10 s 11 s 2 þ s 1 The third-order interpolations defined above allow us to define finite difference formulas for the firstand second-order derivatives at every node using standard formulas in a dimensionby-dimension framework. For example, referring to Fig 5, we use the central difference formulas for u x and u xx : the forward and backward first-order accurate approximations of the first-order derivatives: and the second-order accurate approximations of the first-order derivatives: where the minmod slope limiter [33,34], defined as: y otherwise; ( is used to avoid differencing across regions where gradients are large (i.e. near kinks). Similarly, approximations for first-order and second-order derivatives are obtained in the y-and z-directions. These numerical approximations are sufficient to discretize all of the equations described in section 3.

Experimental Results
In this section, we first present results on two and three dimensional image sequences. When available, we compare our results to those obtained in Ghosh et al. [16], which allows us to compare the effectiveness of both methods in term of robustness as well as in terms of computational efficiency. These numerical examples illustrate that the present method is robust even in the presence of severe occlusions and provides a speed up over the approach of Ghosh et al. [16].
The advantage of the present method is that it significantly reduces the number of degrees of freedom processed at each iteration. In fact, since most degrees of freedom are those adjacent to the interface, this approach lowers the dimension of the problem. On the other hand, there are additional costs associated to the handling of the data structure, especially the O(log(n)) cost to access data stored in the leaves of the tree and the remeshing procedure. Despite this additional cost, the performance of the current approach is superior than that on uniform grids.

Parameter Setting
We keep all the parameters constant throughout our experiments. The parameters α u , α g 1 , and α g 2 are empirically set to 8 × 10 −4 , 9 × 10 −5 , and 3 × 10 −3 , respectively. The weight for the dynamical prior term d 2 p and the combination term d 2 c are both approximately set to 2 × 10 −5 . The coefficient α s 1 corresponding to Eq (8) is selected randomly in the range [0.02, 0.05].
We first extract a simple gray scale histogram (for both natural and synthetic images) of the foreground and background from a few frames. These are then used to obtain the likelihood values η(x) for the rest of the frames. In the current implementation, we also use the edge information of the image. The edge information is easily incorporated by minimizing TV gþd 2 p þd 2 c instead of TV d 2 p þd 2 c in Eq (9). Here, g ¼ 1 1þgjrI s j 2 is the inverse of the smoothed gradient magnitude function. In our experiments, we keep γ = 600.
The proposed tracking framework is implemented in C++ on a 2.3 GHz Intel Core i5 CPU. In our implementation we update u, ϕ 0 , and G twice for each time frame.

Two Dimensional Data
We consider a medical data sequence where the objects of interest are vessels. In the first slice, the objects of interest are four cross-sections of vessels, which merge into two larger vessels and then into one large vessel in subsequent slices. This example illustrates the error-free tracking effectiveness. Fig 6 compares the results obtained with the approach of Ghosh et al. [16] (top subfigures) with the results of the current approach (bottom subfigures). These results demonstrate that the four initial vessels are indeed properly detected and that the algorithm enables changes in topology (here merging). Fig 7 depicts the three-dimensional segmentation obtained from the series of two-dimensional segmentations.
The use of the adaptive grid significantly reduces the number of pixels processed. Specifically, the uniform grid contains 201 × 201 = 40, 401 cells, whereas the adaptive grid processes about 1000 cells at each iteration. Therefore, even though the adaptive approach requires the data structure's management, the remeshing at each step and the O(log(n)) complexity for An Efficient Object Tracking Method on Quad-/Oc-Trees accessing data (n being the number of cells processed), the adaptive approach increases the speed in a meaningful way: on this example, the process time is more than 25 minutes on the uniformly sampled image, compared to less than 5 minutes with the same parameters and the same number of frames in the case of adaptive grids. We note that no attempts has been made to optimize the implementation, for which, in average, the overhead time for considering the adaptive grid is about 0.03 second per slice. This amounts to 7.71 seconds over the course of the total segmentation that takes a total computational time of 201.66 seconds. For this example, the overhead for using adaptive grids is thus about 4%.

Three Dimensional Data
We consider three different sequences of three dimensional medical data consisting of snapshots of a beating heart, available from the BIRN data base [35]. Each frame has dimensions 100 × 100 × 34. The sequences, corresponding to patient 12, 13 and 15 consist of 20, 21 and 12 frames, respectively. Figs 8, 9 and 10 depict the results of the segmentation process using the current approach. In each case, three different cross sections (corresponding to a top, medium and bottom cross- section) are depicted at three different time in the image sequence. In all cases, the segmentation results are satisfactory. We note that since the algorithm requires an initial segmentation, we used one 2D cross-section of the first frame and applied the 2D algorithm to construct the initial 3D segmentation. Therefore, overall, only an initial 2D segmentation is needed to process the time sequence of three dimensional data. The algorithm takes about 55 seconds to process one 3D frame with the Octree representation and about 560 seconds for the uniform one.

Large Deformation and Disconnected Sets
We consider a synthetic example to demonstrate that the method is applicable to the case where the shape of interest undergoes large deformations. It also demonstrates that disconnected sets can be readily considered. We first built an image sequence consisting of two disks deforming under a given velocity field: in a domain O = [0, 1] 2 , we construct a disk of radius.15 and center (.5,.75) and a second disk with a radius of.1 and center (.2,.2). These two disks define the initial zero level set contour. The image intensity values are then randomly set to be between [200,255] inside the zero level set with a value of 200 near the level set contour and between [50, 155] outside the zero level-set with a value of 50 near the contour. The image is and the algorithm on quadtree is used to segment the image sequence. Fig 11 illustrates that the algorithm on adaptive grids enables the segmentation of a disconnected shape that undergoes large deformations. The computational time on this 100 × 100 image is slightly more than a second per frame.

Conclusion
We have extended the approach of [17] to the case of adaptive Cartesian grids using Quad-/ Oc-tree data structures. This approach is based on a robust variational framework for level-setbased simultaneous registration and segmentation procedures that can be applied to natural and biological image sequences. In addition, the use of adaptive grids produces results that can be processed much faster than [17] while retaining its accuracy and robustness. Using numerical results on image sequences in two and three spatial dimensions as well as timings, we have illustrated the computational efficiency of the approach. While we have presented the use of adaptive grids using a particular functional, the advantages of adaptive grids (reduction in the number of processed data and the ability to focus on processing data near the moving front) could be used with other functionals used in level-set formulations. The strategy presented in this paper could be use to the segmentation of large data sets, such as those generated by new generation CT acquisition scanners or by structural MRI scans.