Sparse Coding and Counting for Robust Visual Tracking

In this paper, we propose a novel sparse coding and counting method under a Bayesian framework for visual tracking. In contrast to existing methods, the proposed method employs a combination of the L0 and L1 norms to regularize the linear coefficients of an incrementally updated linear basis. The sparsity constraint enables the tracker to effectively handle difficult challenges, such as occlusion and image corruption. To achieve real-time processing, we propose a fast and efficient numerical algorithm for solving the proposed model. Although the underlying problem is NP-hard, the proposed accelerated proximal gradient (APG) approach is guaranteed to converge to a solution quickly. In addition, we derive a closed-form solution for the combined L0 and L1 regularized representation, which yields better sparsity. Experimental results on challenging video sequences demonstrate that the proposed method achieves state-of-the-art results in both accuracy and speed.


Introduction
Visual tracking plays an important role in computer vision and has many applications such as video surveillance, robotics, motion analysis and human-computer interaction. Even though many algorithms have been proposed, visual tracking remains a challenging problem due to complex object motion, heavy occlusion, illumination change and background clutter.
Visual tracking algorithms can be roughly divided into two major categories: discriminative methods and generative methods. Discriminative methods (e.g., [11,1,3]) view object tracking as a binary classification problem in which the goal is to separate the target object from the background. Generative methods (e.g., [4,17,13,23,12]) employ a generative appearance model to represent the target's appearance.
We focus on generative methods and briefly review the relevant work below. Recently, sparse representation has been successfully applied to visual tracking (e.g., [15,10,25,6]). Trackers based on sparse representation assume that the appearance of a tracked object can be sparsely represented by an over-complete dictionary, which can be dynamically updated to maintain holistic appearance information. Traditionally, the over-complete dictionary is a series of redundant object templates; however, a set of basis vectors from the target subspace can also serve as the dictionary, since an orthogonal dictionary performs as efficiently as a redundant one. In visual tracking, we call the L1 regularized object representation "sparse coding" (e.g., [15]), and the L0 regularized object representation "sparse counting" (e.g., [16]). The method of [15] has been shown to be robust against partial occlusions, which improves tracking performance. However, because of the redundant dictionary, the heavy computational overhead of L1 minimization hampers tracking speed. Recent efforts have improved this method in terms of both speed and accuracy by using the accelerated proximal gradient (APG) algorithm [2] or by modeling the similarity between different candidates [25]. Different from [15], IVT [17] incrementally learns a low-dimensional PCA subspace representation, which adapts online to appearance changes of the target. To suppress image noise, Lu et al. [21] introduce L1 noise regularization into the PCA reconstruction, which is able to handle partial occlusion and other challenging factors. Pan et al.
[16] employ the L0 norm to regularize the linear coefficients of an incrementally updated linear basis (sparse counting) to remove redundant features of the basis vectors. However, sparse counting can produce unstable solutions because of its nonconvexity and discontinuity. Although sparse coding performs well, it may cause biased estimation, since it penalizes large true coefficients more heavily and thus over-penalizes. Consequently, it is necessary to find a way to overcome the disadvantages of both sparse coding and sparse counting.
From the viewpoint of statistics, sparse representation is similar to variable selection when the dictionary is fixed. Moreover, the Bayesian framework has been successfully applied to variable selection by enforcing appropriate priors. Laplace priors have been used to avoid overfitting and enforce sparsity in sparse linear models, which yields the sparse coding problem. To further enforce sparsity and reduce the over-penalization of sparse coding, each coefficient is assigned a Bernoulli variable. We therefore propose a novel model, interpreted from a Bayesian perspective via maximum a posteriori (MAP) estimation, which turns out to be a combination of the sparse coding and sparse counting models. In [14], Lu et al. also consider the L0 and L1 norms under a Bayesian perspective. However, considering that tracking involves occlusion, illumination change and background clutter, we constrain the noise with an L1 norm. Besides, we use an orthogonal dictionary to replace the redundant object templates, since similar atoms among redundant templates may cause coefficient errors and high computational complexity. Lastly, we derive a closed-form solution for the regularizer that combines the L0 and L1 norms, whereas Lu et al. obtain only an approximate solution using greedy coordinate descent. Tracking results using unconstrained regularization, sparse counting, sparse coding and our model under the same dictionary D are shown in Fig. 1.

Figure 1: Comparison of coefficients, optimal candidates and reconstruction. The top row shows the coefficients of our method versus unconstrained, sparse coding and sparse counting regularization, respectively. The bottom row shows the optimal candidates and reconstruction results using unconstrained, sparse coding, sparse counting and our method under the same dictionary, respectively.

As shown in Fig. 1, the coefficients of unconstrained regularization and sparse coding are not actually sparse, and the target object is not tracked well. Similarly, sparse counting with sparse coefficients sometimes cannot obtain an appropriate linear combination of the orthogonal basis vectors, which interferes with tracking accuracy. In contrast, our method reconstructs the object well and finds a good candidate, thereby improving the tracking results. We also compare our model with unconstrained regularization, sparse counting and sparse coding over all 50 sequences in the benchmark; the precision and success plots are shown in Fig. 2. The parameter settings are given in the section Experimental Results.

Contributions:
The contributions of this work are threefold.
(1) We propose a sparse coding and counting model from a novel Bayesian perspective for visual tracking. Compared to state-of-the-art algorithms, the proposed method achieves more reliable tracking results.
(2) We derive a closed-form solution for the regularization that combines the L0 norm and the L1 norm into a single regularizer.
(3) Although the sparse coding and counting related minimization is an NP-hard problem, we show that the proposed model can be efficiently estimated by the proposed APG method. This makes our tracking method computationally attractive in general and comparable in speed with the SP method [21] and the accelerated L1 tracker [2].

Visual Tracking based on the Particle Filter
In this paper, we employ a particle filter to track the target object. The particle filter provides an estimate of the posterior distribution of random variables related to a Markov chain. Given a set of observed image vectors Y_t = {y_1, y_2, ..., y_t} up to the t-th frame and the target state variable x_t that describes the six affine motion parameters, the posterior distribution p(x_t | Y_t) is estimated recursively via Bayes' theorem:

p(x_t | Y_t) ∝ p(y_t | x_t) ∫ p(x_t | x_{t-1}) p(x_{t-1} | Y_{t-1}) dx_{t-1},

where p(y_t | x_t) is the observation model that estimates the likelihood of an observed image patch y_t belonging to the object class, and p(x_t | x_{t-1}) is the motion model that describes the state transition between consecutive frames.
The Motion Model: The motion model p(x_t | x_{t-1}) = N(x_t; x_{t-1}, Σ) models the parameters by independent Gaussian distributions around their counterparts in x_{t-1}, where Σ is a diagonal covariance matrix whose elements are the variances of the affine parameters. In the tracking framework, the optimal target state x̂_t is obtained by maximizing the approximate posterior probability: x̂_t = arg max_{x_t^i} p(x_t^i | Y_t), where x_t^i denotes the i-th sample of the state x_t.
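As a concrete illustration, the sampling-propagation-reweighting loop above can be sketched in a few lines. This is a minimal sketch under our own assumptions: the function names and the toy likelihood are ours, not from the paper, and the isotropic noise scale stands in for the diagonal covariance Σ.

```python
import numpy as np

def propagate(particles, sigma, rng):
    """Motion model p(x_t | x_{t-1}) = N(x_t; x_{t-1}, Sigma): perturb each
    affine-parameter vector with independent Gaussian noise (sketch: isotropic)."""
    return particles + rng.normal(scale=sigma, size=particles.shape)

def track_step(particles, weights, likelihood, sigma, rng):
    """One particle-filter step: resample by the previous weights, propagate
    through the motion model, reweight with the observation likelihood
    p(y_t | x_t), and return the MAP particle (maximal posterior weight)."""
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    particles = propagate(particles[idx], sigma, rng)
    weights = np.array([likelihood(x) for x in particles])
    weights = weights / weights.sum()
    return particles, weights, particles[np.argmax(weights)]
```

In the tracker, `likelihood` would be the observation model of the next subsection evaluated on the image patch cropped at each candidate state.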
The Observation Model: In this paper, we assume that the tracked target object is generated by a subspace (spanned by D and centered at µ) with corruption (i.i.d. Gaussian-Laplacian noise):

y = µ + Dα + e + ε,

where y ∈ R^N denotes an observation vector centered at µ, the columns of D = [d_1, d_2, ..., d_K] ∈ R^{N×K} are orthogonal basis vectors of the subspace, α indicates the coefficients of the basis vectors, and ε and e stand for the Gaussian and Laplacian noise vectors, respectively. The Gaussian component models small dense noise, while the Laplacian one handles outliers. As proposed by [20], under the i.i.d. Gaussian-Laplacian noise assumption, the distance between the vector y and the subspace (D, µ) is the least soft-threshold squares distance. Thus, for each observation y_t corresponding to a predicted state x_t, the observation model p(y_t | x_t) is set to

p(y_t | x_t) = exp(−τ ||y_t − µ − Dα* − e*||_2^2),

where α* and e* are the optimal solutions of Eq. (5), introduced in detail in the next section, and τ is a constant controlling the shape of the Gaussian kernel.
Model Update: It is essential to update the observation model to handle appearance changes of the target during visual tracking. Since the error term e can be used to identify outliers (e.g., Laplacian noise, illumination), we adopt the strategy proposed by [20] to update the appearance model using incremental PCA with mean update [17]: the i-th element of the sample used for the update is taken as y_i if e_i = 0 and as µ_i otherwise, where y_i, e_i, and µ_i are the i-th elements of y, e, and µ, respectively, and µ is the mean vector computed as in [17].

Object Representation under Bayesian Framework
Based on the discussion in the preceding section, if y is viewed as the vectorized target region, it can be represented by an image subspace with corruption:

y = µ + Dα + e + ε.   (4)

[16] shows that sparse counting can remove redundant features (e.g., background portions) while selecting useful parts of the subspace. However, sparse counting can produce unstable solutions because of its nonconvexity and discontinuity, while sparse coding may over-penalize despite its good stability. Considering that the Bayesian framework has the capacity to encode prior knowledge and to make valid estimates of uncertainty, we propose a novel model combining sparse coding and sparse counting for visual tracking:

min_{α,e} (1/2)||y − µ − Dα − e||_2^2 + γ||α||_0 + λ||α||_1 + β||e||_1,  s.t. D^T D = I,   (5)

where ||·||_0 denotes the L0 norm, which counts the number of non-zero elements, ||·||_2 and ||·||_1 denote the L2 and L1 norms, respectively, γ, λ and β are regularization parameters, and I is an identity matrix. The term ||e||_1 is used to reject outliers (e.g., occlusions), while ||α||_0 and ||α||_1 are used to select the useful subspace features. Next we introduce this model under the Bayesian framework in detail. By Bayes' theorem, the joint posterior distribution of α, r, e and σ² can be written as

p(α, r, e, σ² | y) ∝ p(y | D, α, r, e, σ²) p(α | σ², μ̃) p(r | κ) p(e | σ̃) p(σ² | τ_1, τ_2),   (6)

where the factors denote the priors on the noisy vectorized target region, the coefficient vector α = [α_1, α_2, ..., α_K], the index vector r = [r_1, r_2, ..., r_K] (r_l = I(α_l ≠ 0), l = 1, 2, ..., K), the Laplacian noise, and the noise level, respectively. In Eq. (6), the parameters μ̃, τ_1, τ_2, σ̃, and κ are the constant parameters of the priors.
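Our reading of the combined objective in Eq. (5) can be written down directly. The function below is an illustrative sketch (the name and signature are ours); it only evaluates the objective, which is useful for checking that a solver monotonically decreases it.

```python
import numpy as np

def objective(y, D, mu, alpha, e, gamma, lam, beta):
    """Value of the combined sparse coding-and-counting objective:
    0.5*||y - mu - D@alpha - e||^2 + gamma*||alpha||_0
    + lam*||alpha||_1 + beta*||e||_1  (our reading of Eq. (5))."""
    r = y - mu - D @ alpha - e
    return (0.5 * r @ r
            + gamma * np.count_nonzero(alpha)   # L0 "counting" term
            + lam * np.abs(alpha).sum()         # L1 "coding" term
            + beta * np.abs(e).sum())           # L1 penalty on outliers
```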
With the definition of the index variable r_l, Eq. (4) can be rewritten accordingly. We generally assume that the noise ε_j follows a Gaussian distribution, i.e., p(ε_j) = N(0, σ²). We treat the Laplacian noise term e_j as missing values with the same Laplacian prior. Therefore, the prior p(y | D, α, r, e, σ²) follows a Gaussian distribution. To enforce sparsity, the coefficients α are assumed to follow a Laplace distribution.
Our goal is to remove redundant features while preserving the useful parts of the dictionary. Since the Laplace prior underlying sparse coding may over-penalize large coefficients, we assume the index variable r_l of each coefficient α_l to be a Bernoulli variable, which enforces sparsity while reducing over-penalization:

p(r_l | κ) = κ^{r_l} (1 − κ)^{1 − r_l},

where κ ≤ 1/2. Here, the Bernoulli prior on r_l means that r_l is 1 with probability κ and 0 with probability 1 − κ, given the prior information.
The noise e_j aims at handling outliers, so it follows a Laplace distribution. The variance of the noise is assigned an Inverse-Gamma prior,

p(σ² | τ_1, τ_2) = (τ_2^{τ_1} / Γ(τ_1)) (σ²)^{−τ_1−1} exp(−τ_2 / σ²),

where Γ(·) denotes the gamma function.
Then, the optimal α, r, e, and σ² are obtained by maximizing the posterior probability. Taking the negative logarithm of Eq. (6) and combining it with the priors in Eqs. (8)-(12) yields the objective in Eq. (14). Fixing σ² = 1, Eq. (14) can be rewritten as Eq. (15), and further simplified to Eq. (16). Inspecting the objective function in Eq. (16), one finds that its essential regularization is a combination of sparse coding and sparse counting. With a fixed appropriate orthogonal dictionary D, Eq. (16) can be written as the optimization problem in Eq. (5).

Theory of Fast Numerical Algorithm
As is well known, APG is an excellent algorithm for convex programming [9,18] and has been used in visual tracking. In this section, we propose a fast numerical algorithm for solving the proposed nonconvex and nonsmooth model using the APG approach. Experimental results show that it converges to a solution quickly and achieves attractive performance. In addition, the closed-form solution of the combined L0 and L1 based regularization is provided.
APG Algorithm for Solving Eq. (17): Eq. (5) contains two subproblems: solving for α with e fixed, and solving for e with α fixed; this alternating formulation is given in Eq. (17). Solving Eq. (17) is an NP-hard problem because it involves a discrete counting metric. We adopt an optimization strategy based on the APG approach [9], which ensures that each step can be solved easily. In the APG algorithm, we need to solve the proximal subproblems of Eq. (18), whose solutions are given by the shrinkage operators S_θ(y) = sign(y) max(|y| − θ, 0) and E_(δ,η)(y), the latter defined in the next subsection. The numerical algorithm for solving Eq. (17) is summarized in Algorithm 1. Due to the orthogonality of D, Algorithm 1 converges fast, and its computational cost does not increase compared to a solver for the L1 regularized model.
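To make the two-block structure concrete, here is a simplified sketch of the alternation (all names are ours). It omits the APG momentum that the paper uses for acceleration and exploits the fact that with orthonormal D both block updates have exact closed forms via the two shrinkage operators.

```python
import numpy as np

def soft(x, t):
    """Soft-thresholding S_t(x) = sign(x) * max(|x| - t, 0), the prox of t*||.||_1."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def hard_soft(x, delta, eta):
    """E_(delta,eta)(x): elementwise prox of delta*||.||_1 + eta*||.||_0.
    Keeps the soft-thresholded value only where |x| > delta + sqrt(2*eta)."""
    return np.where(np.abs(x) > delta + np.sqrt(2.0 * eta), soft(x, delta), 0.0)

def solve(y, D, mu, gamma, lam, beta, iters=50):
    """Block-coordinate sketch of
    min_{a,e} 0.5*||y - mu - D@a - e||^2 + gamma*||a||_0 + lam*||a||_1 + beta*||e||_1.
    Because D has orthonormal columns (D.T @ D = I), each block update is exact."""
    e = np.zeros_like(y)
    a = np.zeros(D.shape[1])
    for _ in range(iters):
        a = hard_soft(D.T @ (y - mu - e), lam, gamma)  # exact alpha-step
        e = soft(y - mu - D @ a, beta)                 # exact e-step
    return a, e
```

On a toy problem with one in-subspace component and one outlier pixel outside the subspace, the solver recovers a one-sparse coefficient vector and assigns the outlier to e.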

Closed-Form Solution of the Combined L1 and L0 Regularization
This subsection focuses on a sparse combinatory model that combines the L0 and L1 norms in the regularizer:

min_x (1/2)(x − y)² + δ|x| + η|x|_0,   (21)

where x, y ∈ R^1 and |x|_0 denotes the L0 norm: |x|_0 = 0 if x = 0, and |x|_0 = 1 otherwise.
Lemma. The optimal solution x* of Eq. (21) is

x* = sign(y) max(|y| − δ, 0) if |y| > δ + sqrt(2η), and x* = 0 otherwise.

The proof can be found in the Supporting Information. If x ∈ R^N, Eq. (21) becomes a separable vector problem, Eq. (23), which decomposes into a sequence of scalar optimizations over x_i, i = 1, ..., n, each solved by the Lemma. More analysis of the combined L1 and L0 regularization can be found in the Supporting Information. In Eq. (23), if we set δ = 0 and η = 0, the model degenerates to linear regression. If we set δ = 0, Eq. (23) reduces to L0 regularized regression, and it becomes L1 regularized regression when η = 0. Fig. 3 (a) shows the closed-form solutions of these four cases. We set δ = η = 0.5 in Eq. (23) (L0 + L1 regularized regression), η = 1 in L0 regularized regression, and δ = 1 in L1 regularized regression. We note that L0 + L1 regularized regression has the same sparsity as L0 regularized regression, while incurring less over-penalization than L1 regularized regression. Fig. 3 (b) shows the sparsity-threshold changes of L0, L1 and L0 + L1 regularized regression, respectively. When δ = 1 − η varies from 0 to 1, the sparsity threshold of L0 + L1 varies from that of L0 to that of L1. Moreover, the threshold of L0 + L1 is larger than those of L0 and L1 on the interval (0, 0.8].
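The scalar closed form can be implemented by directly comparing the cost of the best nonzero candidate (the soft-thresholded point) against the cost of x = 0, which sidesteps the explicit threshold. A sketch under our naming, checked against a brute-force grid search:

```python
import numpy as np

def prox_scalar(y, delta, eta):
    """Closed-form minimizer of f(x) = 0.5*(x - y)**2 + delta*|x| + eta*1[x != 0].
    Compare the best nonzero candidate (soft-thresholded point) with x = 0;
    this is equivalent to keeping S_delta(y) iff |y| > delta + sqrt(2*eta)."""
    x_nz = np.sign(y) * max(abs(y) - delta, 0.0)  # minimizer over x != 0
    f_nz = 0.5 * (x_nz - y) ** 2 + delta * abs(x_nz) + (eta if x_nz != 0 else 0.0)
    return x_nz if f_nz < 0.5 * y * y else 0.0    # f(0) = 0.5 * y**2
```

For example, with y = 2, δ = 0.5, η = 1 the threshold δ + sqrt(2η) ≈ 1.91 is exceeded, so the solution is the soft-thresholded value 1.5; with y = 1 it is not, so the solution is 0.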

Orthogonal Dictionary learning for Visual Tracking
In this section, we describe dictionary learning in detail in three parts: dictionary initialization, orthogonal dictionary update, and dictionary reinitialization.
Dictionary Initialization: There are two schemes to initialize the orthogonal dictionary: applying PCA to the set Y_k of the first k frames, or applying RPCA to Y_k. When the initial frames are free of corruption (e.g., occlusion or illumination change), we apply PCA to Y_k instead of RPCA: we compute the skinny SVD of Y_k and take the basis vectors of its column space as the initial dictionary. However, when the initial frames contain large sparse noise, RPCA is used to extract the intrinsic low-rank features Z_k, obtained by solving [23]

min_{Z_k, E_k} ||Z_k||_* + λ||E_k||_1,  s.t. Y_k = Z_k + E_k.   (25)

After solving Eq. (25), the skinny SVD of Z_k is readily available, Z_k = U_k Σ_k V_k^T, and D = U_k is the initial orthogonal dictionary. Fig. 4 (a) shows that PCA initialization and RPCA initialization both perform well when the first k frames contain little noise. The initial frames are generally clean; therefore, we choose PCA initialization as the default.
Orthogonal Dictionary Update: As the appearance of a target may change drastically, it is necessary to update the orthogonal dictionary D. Here we adopt an incremental PCA algorithm [8] to update the dictionary.
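The PCA initialization path can be sketched in a few lines (a simplified sketch with our own naming; the incremental update of [8] and the RPCA branch are not shown). Y_k is stored with one vectorized frame per column.

```python
import numpy as np

def init_dictionary(Yk, K):
    """PCA initialization: center the first k frames and take the top-K
    left singular vectors of the centered data as the orthogonal dictionary."""
    mu = Yk.mean(axis=1, keepdims=True)          # mean frame
    U, s, Vt = np.linalg.svd(Yk - mu, full_matrices=False)  # skinny SVD
    return U[:, :K], mu.ravel()                  # orthonormal basis, center
```

By construction the returned basis satisfies D.T @ D = I, which is the assumption the solver in the previous section relies on.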
Dictionary Reinitialization: When the tracker is prone to drift, the dictionary is dynamically reinitialized to recover the intrinsic subspace features. We adopt the strategy proposed by [23]. Reinitialization is performed at the t-th frame if σ = ||e_t||_0 / len(e_t) > thr, where e_t is the noise term at the t-th frame, len(·) is the length of the vector, and thr > 0 is a threshold parameter (generally 0.5). If σ > thr, we reinitialize the dictionary in the same way as in the RPCA initialization, but with a different Y_t in Eq. (25): here, Y_t consists of the optimal candidate observations from the initial n (generally 10) frames and the latest t − n frames (we set t = 30). Fig. 4 (b) compares the tracking performance with and without RPCA reinitialization when the object undergoes variable illumination. After reinitializing the dictionary, our tracker recovers the object, so dictionary reinitialization is effective in improving the reconstruction ability. In Algorithm 2, we summarize the overall tracking process for frame t.
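The reinitialization trigger σ = ||e_t||_0 / len(e_t) > thr is a one-liner; a sketch with our naming:

```python
import numpy as np

def needs_reinit(e_t, thr=0.5):
    """Fire the dictionary reinitialization when the fraction of nonzero
    entries in the noise term e_t exceeds thr (sigma = ||e_t||_0 / len(e_t))."""
    return np.count_nonzero(e_t) / len(e_t) > thr
```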

Algorithm 2 Robust Visual Tracking Using Our tracker
Initialization: Initialize the orthogonal dictionary D by performing PCA on Y_k.
Input: State x_{t-1} (t > k) and orthogonal dictionary D.
Step 1: Draw new samples x_t^i from x_{t-1} and obtain the corresponding candidates y_t^i.
Step 2: Obtain α_t^i and e_t^i using Eq. (17).
Step 3: For each candidate, calculate the observation probability p(y_t^i | x_t^i) using Eq. (2).
Step 4: Find the tracking result patch y_t^* with the maximal observation likelihood and its corresponding noise e_t^*.
Step 5: Perform an incremental PCA algorithm to update the orthogonal dictionary D every five frames. If σ > thr, reinitialize the dictionary at the t-th frame using Eq. (25).
Output: State x_t^* and the corresponding image patch; orthogonal dictionary D.

Qualitative Evaluation
Fig. 5 shows sampled frames from the 50 videos with qualitative results for our method compared with the top-performing SP and SST. We choose some examples from the 50 sequences to illustrate the effectiveness of our method; Fig. 6 shows the visualization results.
Heavy Occlusion: Fig. 6 (a) and (b) show four challenging sequences with heavy occlusion. In Faceocc1 and Faceocc2, the targets undergo heavy occlusion and in-plane rotation, and our method outperforms the other tracking algorithms. Freeman4 and David3 demonstrate that the proposed method captures the accurate location of objects in terms of position and scale when the target undergoes severe occlusion (e.g., Freeman4 #0144 and David3 #0085). In contrast, the IVT, L1APG, MIL, SP, SCM, ASLA, TLD, SPOT, FOT, SST, MTT, and Struck methods drift away from the target object when occlusion occurs. For these four sequences, the IVT method performs poorly since conventional PCA is not robust to occlusions. Although L1APG and SP utilize sparsity to model outliers, their occlusion detection is not stable under drastic appearance changes. In contrast, our method is robust to heavy occlusion, because the combined L0 and L1 regularized appearance model can accurately reconstruct the object.
Fast Motion: Fig. 6 (c) shows the sequences Boy and Jumping with fast motion. It is difficult to predict the locations of tracked objects under abrupt motion. In Boy, the captured images are severely blurred, but Struck and our method track the target faithfully throughout, while the IVT, MTT, ASLA, SCM and SST methods drift away seriously. Most of the other trackers drift due to the abrupt motion in Jumping, whereas SST and our method successfully track the target for the whole video.
Drastic Pose, Scale and Illumination Changes: In Fig. 6 (d) and (e), we test five challenging sequences with drastic pose, scale and illumination changes. The Fish and Tiger1 clips contain significant illumination variation; the L1APG, MTT, and MIL methods are less effective in these cases (e.g., Fish #0305 and Tiger1 #0240). In Singer2 and Jogging-2, the other trackers drift away under variable illumination and pose variation (e.g., Singer2 #0110 and Jogging-2 #0100), while our method still performs well. Our method also achieves good performance in CarScale with scale variation (e.g., CarScale #0204). Subspace-based approaches may fail to update the appearance model because the coefficients in their models may retain redundant background features. Our method successfully adapts to drastic changes, since the combination of sparse coding and sparse counting is not merely stable but also able to capture the intrinsic features of the subspace.

Quantitative Evaluation
We use two metrics to evaluate the proposed algorithm against other state-of-the-art methods. The first metric is the center location error measured against manually labeled ground-truth data. The second is the overlap rate, score = area(R_T ∩ R_G) / area(R_T ∪ R_G), where R_T is the tracking bounding box and R_G is the ground-truth bounding box. Larger average scores mean more accurate results.
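The overlap rate is the standard intersection-over-union score; a small sketch for boxes given as (x, y, w, h) tuples (the function name is ours):

```python
def overlap_rate(bt, bg):
    """Overlap (IoU) score between two boxes given as (x, y, w, h):
    area(R_T ∩ R_G) / area(R_T ∪ R_G)."""
    x1 = max(bt[0], bg[0]); y1 = max(bt[1], bg[1])
    x2 = min(bt[0] + bt[2], bg[0] + bg[2])
    y2 = min(bt[1] + bt[3], bg[1] + bg[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)   # intersection area
    union = bt[2] * bt[3] + bg[2] * bg[3] - inter   # union area
    return inter / union
```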
Table 1 shows the average overlap rates. Table 2 reports the average center location errors (in pixels), where a smaller average error means a more accurate result. As can be seen from the tables, on most sequences our method yields lower average errors and higher overlap rates. We provide the precision and success plots in Fig. 7 to evaluate performance over all 50 sequences; the evaluation parameters are set to the defaults in [22]. Our algorithm performs well on videos with occlusion, deformation, in-plane rotation, and out-of-plane rotation under both the precision metric and the success-rate metric, as shown in Fig. 8 and Fig. 9, respectively. Both the tables and the figures show that our method achieves favorable performance against other state-of-the-art methods.
To further compare the running times of four subspace-based tracking algorithms (i.e., IVT, L1APG, SP and our method), we calculated the average frames per second (FPS) for 32 × 32 image patches (see the last row of Table 1). For L1APG, we report FPS with its APG acceleration. IVT is considerably faster than the other trackers, as its computation only involves matrix-vector multiplication. Both SP and our method are faster than L1APG, and our method is also much faster than SP. This is due to the choice of optimization scheme: SP adopts a naive alternating minimization strategy, whereas our method is efficiently solved by APG.

Conclusion
In this paper, we propose a sparse coding and counting method under a Bayesian framework for robust visual tracking. The proposed method combines L0 regularization and L1 regularized sparse representation into a single model; it therefore has a better ability to sparsely represent an object, and the reconstruction results are also better. To solve the proposed model, we develop a fast and efficient APG algorithm, and we provide the closed-form solution of the combined L0 and L1 norm regularization. Extensive experiments testify to the superiority of our method over state-of-the-art methods, both qualitatively and quantitatively.

Figure 2 :
Figure 2: Precision and success plots of the overall performance comparison among unconstrained regularization, sparse counting, sparse coding and our method for the 50 videos in the benchmark. The mean precision scores are reported in the legends.

Figure 3 :
Figure 3: Analysis of the combination of L1 and L0 regularization. (a) shows the closed-form solutions of linear regression and of L0, L1, and L0 + L1 regularized regression, respectively. (b) shows the sparsity-threshold changes of L0, L1 and L0 + L1 regularized regression, respectively.

Figure 4 :
Figure 4: Comparison of the PCA process to the RPCA process. The upper portion of each image is the tracking frame. The middle consists of three sub-pictures: the left is the mean image, the middle is the reconstruction result, and the right is the Laplace noise. The bottom shows the top ten basis vectors of the dictionary. (a) shows the tracking results of PCA and RPCA dictionary initialization. The tracking performance with and without RPCA reinitialization is shown in (b).

Figure 5 :
Figure 5: Qualitative results for our method, compared with SP and SST. Reprinted from [22] under a CC BY license, with permission from Yi Wu, original copyright 2013.

Fig. 6 (
f) demonstrates the tracking results in Deer, Basketball, and Football with background clutter. Basketball is a difficult sequence because it contains a cluttered background, illumination change, heavy occlusion and non-rigid pose variation. Except for our tracker, none of the compared algorithms works well on it (e.g., Basketball #0486 and #0614). As shown in Deer and Football, our tracker performs relatively well (e.g., Deer #0031 and Football #0304), as it excludes background clutter via the sparse errors, while TLD, FOT, and MIL fail.

Figure 7 :
Figure 7: Precision and success plots over all 50 sequences. The mean precision scores are reported in the legends.

Figure 6 :
Figure 6: Sampled tracking results of the evaluated algorithms on fourteen challenging image sequences. Reprinted from [22] under a CC BY license, with permission from Yi Wu, original copyright 2013.

Figure 8 :
Figure 8: The plots of OPE with attributes based on the precision metric.

Figure 9 :
Figure 9: The plots of OPE with attributes using the success rate metric.

Table 1 :
Average overlap rate and average frames per second (FPS). The best and the second-best results are highlighted in bold.

Table 2 :
Average center location error (in pixels) and average frames per second (FPS). The best and the second-best results are highlighted in bold.