## Abstract

In this paper, we propose a novel sparse coding and counting method under a Bayesian framework for visual tracking. In contrast to existing methods, the proposed method employs a combination of the *L*_{0} and *L*_{1} norms to regularize the linear coefficients of an incrementally updated linear basis. The sparsity constraint enables the tracker to effectively handle difficult challenges, such as occlusion or image corruption. To achieve real-time processing, we propose a fast and efficient numerical algorithm for solving the proposed model. Although it is an NP-hard problem, the proposed accelerated proximal gradient (APG) approach is guaranteed to converge to a solution quickly. Besides, we provide a closed-form solution of the combined *L*_{0} and *L*_{1} regularized representation to obtain better sparsity. Experimental results on challenging video sequences demonstrate that the proposed method achieves state-of-the-art results in both accuracy and speed.

**Citation: **Liu R, Wang J, Shang X, Wang Y, Su Z, Cai Y (2016) Sparse Coding and Counting for Robust Visual Tracking. PLoS ONE 11(12):
e0168093.
https://doi.org/10.1371/journal.pone.0168093

**Editor: **Quan Zou,
Tianjin University, CHINA

**Received: **January 10, 2016; **Accepted: **November 24, 2016; **Published: ** December 16, 2016

**Copyright: ** © 2016 Liu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability: **All relevant data are within the paper.

**Funding: **This work is partially supported by the National Natural Science Foundation of China (Nos. 61672125, 61300086, 61432003), the Fundamental Research Funds for the Central Universities (DUT15QY15), the Hong Kong Scholar Program (No. XJ2015008), and National Science and Technology Major Project (Nos. 2013ZX04005-021, 2014ZX001011). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

Visual tracking plays an important role in computer vision and has many applications such as video surveillance, robotics, motion analysis and human-computer interaction. Even though various algorithms have been proposed, visual tracking is still a challenging problem due to complex object motion, heavy occlusion, illumination change and background clutter.

Visual tracking algorithms can be roughly categorized into two major categories: discriminative methods and generative methods. Discriminative methods (*e.g.*, [1–3]) view object tracking as a binary classification problem in which the goal is to separate the target object from the background. Generative methods (*e.g.*, [4–8]) employ a generative appearance model to represent the target’s appearance.

We focus on the generative category and briefly review the relevant work below. Recently, sparse representation has been successfully applied to visual tracking (*e.g.*, [9–12]). Trackers based on sparse representation rely on the assumption that the appearance of a tracked object can be sparsely represented by an over-complete dictionary, which can be dynamically updated to maintain holistic appearance information. Traditionally, the over-complete dictionary is a series of redundant object templates; however, a set of basis vectors from the target subspace is also used as the dictionary, because an orthogonal dictionary performs as efficiently as a redundant one. In visual tracking, we will call the *L*_{1} regularized object representation “sparse coding” (*e.g.*, [9]), and the *L*_{0} regularized object representation “sparse counting” (*e.g.*, [13]). The method of [9] has been shown to be robust against partial occlusions, which improves tracking performance. However, because it uses a redundant dictionary, the heavy computational overhead of *L*_{1} minimization hampers the tracking speed. Very recent efforts have been made to improve this method in terms of both speed and accuracy by using the accelerated proximal gradient (APG) algorithm [14] or by modeling the similarity between different candidates [11]. Different from [9], IVT [5] incrementally learns a low-dimensional PCA subspace representation, which adapts online to appearance changes of the target. To get rid of image noise, Lu *et al.* [15] introduce *L*_{1} noise regularization into the PCA reconstruction, which is able to handle partial occlusion and other challenging factors. Pan *et al.* [13] employ the *L*_{0} norm to regularize the linear coefficients of an incrementally updated linear basis (sparse counting) to remove the redundant features of the basis vectors. However, sparse counting can cause unstable solutions because of its nonconvexity and discontinuity. Although sparse coding has good performance, it may cause biased estimation since it penalizes truly large coefficients more, producing over-penalization. Consequently, it is necessary to find a way to overcome the disadvantages of sparse coding and sparse counting.

From the viewpoint of statistics, sparse representation is similar to variable selection when the dictionary is fixed. Moreover, the Bayesian framework has been successfully applied to variable selection by enforcing appropriate priors. Laplace priors were used to avoid overfitting and enforce sparsity in sparse linear models, which derives the sparse coding problem. To further enforce sparsity and reduce the over-penalization of sparse coding, each coefficient is assigned a Bernoulli variable. Therefore, a novel model interpreted from a Bayesian perspective by carrying out maximum a posteriori (MAP) estimation is proposed, which turns out to be a combination of the sparse coding and counting models. In [16], Lu *et al.* also consider the *L*_{0} and *L*_{1} norms under a Bayesian perspective. However, considering that there will be occlusion, illumination change and background clutter in tracking, we constrain the noise with the *L*_{1} norm. Besides, we use an orthogonal dictionary to replace the redundant object templates, as similar atoms among redundant templates may cause coefficient errors and high computational complexity. Lastly, we propose a closed-form solution for the regularization that combines the *L*_{0} norm and *L*_{1} norm, whereas Lu *et al.* obtain an approximate solution by using greedy coordinate descent.

Tracking results obtained by using unconstrained regularization, sparse counting, sparse coding and our model under the same dictionary *D* are shown in Fig 1. As shown in Fig 1, one can see that the coefficients of unconstrained regularization and sparse coding are actually not sparse and the target object is not tracked well. Similarly, sparse counting with sparse coefficients sometimes cannot obtain an appropriate linear combination of the orthogonal basis vectors, which interferes with the tracking accuracy. However, we note that our method is able to reconstruct the object well and find the good candidate, thus facilitating the tracking results. We also compare our model with unconstrained regularization, sparse counting and sparse coding over all 50 sequences in the benchmark; the precision and success plots are shown in Fig 2. The parameter settings are given in the section Experimental Results.

The top row shows the coefficients of our method versus unconstrained, sparse coding and sparse counting regularization, respectively. The bottom row shows the optimal candidates and reconstruction results obtained by unconstrained regularization, sparse coding, sparse counting and our method under the same dictionary, respectively.

The mean precision scores are reported in the legends.

**Contributions**: The contributions of this work are threefold.

- We propose a sparse coding and counting model from a novel Bayesian perspective for visual tracking. Compared to the state-of-the-art algorithms, the proposed method achieves more reliable tracking results.
- We propose a closed-form solution for combining the *L*_{0} norm and *L*_{1} norm based regularization in a unified form.
- Although the sparse coding and counting related minimization is an NP-hard problem, we show that the proposed model can be efficiently estimated by the proposed APG method. This makes our tracking method computationally attractive in general and comparable in speed with the SP method [15] and the accelerated *L*_{1} tracker [14].

## Visual Tracking based on the Particle Filter

In this paper, we employ a particle filter to track the target object. The particle filter provides an estimate of the posterior distribution of random variables related to a Markov chain. Given a set of observed image vectors **Y**_{t} = {**y**_{1}, **y**_{2}, …, **y**_{t}} up to the *t*-th frame and the target state variable **x**_{t} that describes the six affine motion parameters, the posterior distribution *p*(**x**_{t}|**Y**_{t}) based on the Bayesian theorem is estimated by:
*p*(**x**_{t}|**Y**_{t}) ∝ *p*(**y**_{t}|**x**_{t}) ∫ *p*(**x**_{t}|**x**_{t−1}) *p*(**x**_{t−1}|**Y**_{t−1}) d**x**_{t−1}, (1)
where *p*(**y**_{t}|**x**_{t}) is the observation model that estimates the likelihood of an observed image patch **y**_{t} belonging to the object class, and *p*(**x**_{t}|**x**_{t−1}) is the motion model that describes the state transition between consecutive frames.

**The Motion Model:** The motion model *p*(**x**_{t}|**x**_{t−1}) = *N*(**x**_{t}; **x**_{t−1}, **Σ**) models the parameters by independent Gaussian distributions around the counterparts in **x**_{t−1}, where **Σ** is a diagonal covariance matrix whose elements are the variances of the affine parameters. In the tracking framework, the optimal target state is obtained by the maximum a posteriori (MAP) probability over the sampled states, *i.e.*, the sample with maximal *p*(**y**_{t}^{i}|**x**_{t}^{i}) *p*(**x**_{t}^{i}|**x**_{t−1}), where **x**_{t}^{i} indicates the *i*-th sample of the state **x**_{t}.
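The motion model and MAP state selection above can be sketched in a few lines of numpy; this is a minimal illustration, and the function names, particle count, and variance values are hypothetical rather than taken from the paper:

```python
import numpy as np

def propagate_particles(particles, sigmas, rng):
    # p(x_t | x_{t-1}) = N(x_t; x_{t-1}, Sigma): perturb each of the six
    # affine parameters with independent Gaussian noise (diagonal Sigma).
    return particles + rng.normal(0.0, sigmas, size=particles.shape)

def map_state(particles, likelihoods):
    # MAP estimate over the sampled states: the particle whose observation
    # likelihood p(y_t | x_t^i) is maximal.
    return particles[np.argmax(likelihoods)]

rng = np.random.default_rng(0)
particles = np.zeros((600, 6))                        # 600 samples of x_{t-1}
sigmas = np.array([4.0, 4.0, 0.01, 0.0, 0.005, 0.0])  # hypothetical variances
new_particles = propagate_particles(particles, sigmas, rng)
likelihoods = rng.random(600)                         # stand-in for p(y_t | x_t^i)
best_state = map_state(new_particles, likelihoods)
```

In a full tracker, the likelihoods would come from the observation model described next, rather than random numbers.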

**The Observation Model:** In this paper, we assume that the tracked target object is generated by a subspace (spanned by **D** and centered at **μ**) with corruption (i.i.d. Gaussian-Laplacian noise),

**ȳ** = **Dα** + **ϵ** + **e**, (2)

where **ȳ** = **y** − **μ** denotes an observation vector centered at **μ**, the columns of **D** are orthogonal basis vectors of the subspace, **α** indicates the coefficients of the basis vectors, and **ϵ** and **e** stand for the Gaussian noise and Laplacian noise vectors, respectively. The Gaussian component models small dense noise, and the Laplacian one aims to handle outliers. As proposed by [17], under the i.i.d. Gaussian-Laplacian noise assumption, the distance between the vector **y** and the subspace (**D**, **μ**) is the least soft threshold squares distance:

*d*(**y**; **D**, **μ**) = min_{**α**, **e**} ½‖**ȳ** − **Dα** − **e**‖_{2}^{2} + *λ*‖**e**‖_{1}. (3)

Thus, for each observation **y**_{t} corresponding to a predicted state **x**_{t}, the observation model *p*(**y**_{t}|**x**_{t}) is set to be

*p*(**y**_{t}|**x**_{t}) = exp(−*τ* *d*(**y**_{t}; **D**, **μ**)), (4)

where **α*** and **e*** are the optimal solutions of Eq (18), which will be introduced in detail in the next section, and *τ* is a constant controlling the shape of the Gaussian kernel.
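Under this reading of Eqs (2)–(4), the candidate likelihood could be computed as in the following sketch; it is an illustration, assuming the least-soft-threshold-squares distance is the residual energy plus an *L*_{1} penalty on **e** with a weight `lam` (names and defaults are hypothetical):

```python
import numpy as np

def observation_likelihood(y, D, mu, alpha, e, lam=0.5, tau=0.05):
    # Centre the observation around the subspace mean (y_bar = y - mu),
    # then score the candidate by the least soft threshold squares distance
    # mapped through a Gaussian-shaped kernel controlled by tau.
    y_bar = y - mu
    residual = y_bar - D @ alpha - e          # dense (Gaussian) error part
    dist = 0.5 * residual @ residual + lam * np.abs(e).sum()
    return np.exp(-tau * dist)

# toy check: a candidate lying exactly in the subspace has likelihood 1
rng = np.random.default_rng(0)
D, _ = np.linalg.qr(rng.normal(size=(64, 8)))   # orthogonal basis vectors
mu = rng.normal(size=64)
alpha = rng.normal(size=8)
y = mu + D @ alpha
p_perfect = observation_likelihood(y, D, mu, alpha, np.zeros(64))
```

Candidates farther from the subspace receive exponentially smaller likelihoods, which is what drives the MAP particle selection.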

**Model Update:** It is essential to update the observation model for handling appearance change of the target in visual tracking. Since the error term **e** can be used to identify some outliers (*e.g.*, Laplacian noise, illumination), we adopt the strategy proposed by [17] to update the appearance model using the incremental PCA with mean update [5] as follows,
(5)
where *y*_{i}, *e*_{i}, and *μ*_{i} are the *i*-th elements of **y**, **e**, and **μ**, respectively, and **μ** is the mean vector computed in the same way as in [5].
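A minimal sketch of this update strategy, assuming (as in [17]) that pixels flagged as outliers by **e** are replaced by the corresponding mean values before the incremental PCA update; the function name is illustrative:

```python
import numpy as np

def observation_for_update(y, e, mu):
    # Pixels with nonzero Laplacian noise e are treated as outliers
    # (e.g. occluded) and replaced by the mean template before the
    # incremental PCA update; clean pixels keep their observed values.
    return np.where(e != 0, mu, y)

y  = np.array([1.0, 2.0, 3.0])
e  = np.array([0.0, 5.0, 0.0])
mu = np.array([9.0, 9.0, 9.0])
cleaned = observation_for_update(y, e, mu)   # second pixel replaced by mean
```

This prevents occluded pixels from contaminating the learned subspace.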

## Object Representation by Bayesian Framework

### Motivation

Considering **y** as the vectorized target object region, it can be represented by a feature subspace with both sparse corruptions and dense errors, *i.e.*,
**y** = **Dα** + **ϵ** + **e**. (6)
Most existing sparsity-based trackers aim to directly apply *L*_{1} regularization on **α** to suppress small coefficients for subspace reconstruction. However, by carefully investigating the soft-thresholding operator corresponding to the *L*_{1} minimization subproblem, it can be observed that such a simple regularization consistently suppresses the values of all coefficients, and thus destroys the discriminative property of the learned feature subspace.

To address this limitation of existing work, we incorporate two different sparse regularization techniques within a Bayesian perspective, which has the capacity to encode prior knowledge and to make valid estimates of uncertainty. In other words, our goal is to propose a Bayesian inference framework that incorporates both coefficient thresholding and coefficient selection to improve the discrimination of our feature subspace learning formulation. Specifically, by defining an index vector **r** = [*r*_{1}, *r*_{2}, …, *r*_{K}] (*r*_{l} ∈ {0, 1}), Eq (6) can be rewritten as
**y** = **D**(**r** ⊙ **α**) + **ϵ** + **e**, (7)
Here the additional index vector **r** can be considered as a dictionary selection operator, and we will enforce a particular prior distribution on it to enhance the discriminative power of our model for subspace reconstruction. To further enhance the representative ability of our model, we will also develop a novel dictionary learning framework to build an orthogonal subspace dictionary for Eq (6). Please notice that the orthogonality of the learned dictionary will also significantly simplify the numerical optimization process. Please see the following sections for more details.

### Bayesian Formulation

Now we will introduce our model under the Bayesian framework in detail. The joint posterior distribution of **α**, **r**, **e** and *σ*^{2} based on the Bayesian theorem can be written as
(8)
where *p*(**y**|**D**, **α**, **r**, **e**, *σ*^{2}), *p*(**α**), *p*(**r**|*κ*), *p*(**e**), and *p*(*σ*^{2}|*τ*_{1}, *τ*_{2}) denote the priors on the noisy vectorized target region, the coefficient vector **α** = [*α*_{1}, *α*_{2}, …, *α*_{K}], the index vector **r** = [*r*_{1}, *r*_{2}, …, *r*_{K}] (*r*_{l} ∈ {0, 1}), the Laplacian noise, and the noise level, respectively. In Eq (8), *τ*_{1}, *τ*_{2}, *κ*, and the remaining scale parameters are the relevant constants of the priors.

We generally assume that the noise *ϵ*_{j} follows the Gaussian distribution, *i.e.*, *p*(*ϵ*_{j}) = *N*(0, *σ*^{2}). We treat the Laplacian noise term *e*_{j} as missing values with the same Laplacian prior. Therefore, the prior *p*(**y**|**D**, **α**, **r**, **e**, *σ*^{2}) has the following distribution:
(9)

To enforce sparsity, the coefficients **α** are assumed to follow a Laplace distribution.
(10)

Our goal is to remove redundant features while preserving the useful parts of the dictionary. As the Laplace prior that yields sparse coding may lead to over-penalization of the large coefficients, we assume the index variable *r*_{l} of each coefficient *α*_{l} to be a Bernoulli variable to further enforce sparsity and reduce over-penalization.
(11)
where *κ* ≤ 1/2. Here, the Bernoulli prior means that *r*_{l} takes the value 1 with probability *κ* and 0 with probability 1 − *κ*.

The noise term *e*_{j} aims at handling outliers, so it follows a Laplace distribution:
(12)

The variance of the noise is assigned an Inverse Gamma prior as follows: (13) where Γ(⋅) denotes the gamma function.

Then, the optimal **α**, **r**, **e**, and *σ*^{2} are obtained by the MAP probability. After taking the negative logarithm, the formula is
(14)
Combining the aforementioned Eqs (8)–(13), we have
(15)
Fixing *σ*^{2} = 1, Eq (15) can be rewritten as
(16)
where the regularization weights are determined by the constants of the priors. After merging these constants, Eq (16) can be rewritten as
(17)

### Final Optimization Model

By observing the objective function in Eq (17), it can be found that the essential regularization in Eq (17) is a combination of sparse coding and sparse counting. With a fixed appropriate orthogonal dictionary **D**, Eq (17) can be written as the following optimization problem
min_{**α**, **e**} ½‖**y** − **Dα** − **e**‖_{2}^{2} + *γ*‖**α**‖_{0} + *λ*‖**α**‖_{1} + *β*‖**e**‖_{1}, (18)
where ‖⋅‖_{0} denotes the *L*_{0} norm, which counts the number of non-zero elements, *γ*, *λ* and *β* are regularization parameters, and ‖⋅‖_{2} and ‖⋅‖_{1} denote the *L*_{2} and *L*_{1} norms, respectively. The term ‖**e**‖_{1} is used to reject outliers (*e.g.*, occlusions), while ‖**α**‖_{0} and ‖**α**‖_{1} are used to select the most discriminative subspace features. Notice that we also implicitly assume that **D**^{⊤}**D** = **I**, where **I** is an identity matrix.

## Theory of Fast Numerical Algorithm

It is known that APG is an excellent algorithm for convex programming [18, 19] and has been used in visual tracking. In this section, we propose a fast numerical algorithm for solving the proposed nonconvex and nonsmooth model using the APG approach. The experimental results show that it converges to a solution quickly and achieves attractive performance. Besides, the closed-form solution of the combined *L*_{0} and *L*_{1} regularization is provided.

### APG Algorithm for Solving Eq (19)

Eq (18) contains two subproblems: one is solving **α** given fixed **e**; the other is solving **e** given fixed **α**. The formulation is shown as follows:
(19)

*α*Solving Eq (19) is an NP-hard problem because it involves a discrete counting metric. We adopt a special optimization strategy based on the APG approach [18], which ensures each step be solved easily. In APG Algorithm, we need to solve
(20)
where *F*(**α**, **e**) = ½‖**y** − **Dα** − **e**‖_{2}^{2}, ∇_{α} *F*(**α**, **e**) = **D**^{⊤}(**Dα** + **e** − **y**), ∇_{e} *F*(**α**, **e**) = **e** − (**y** − **Dα**), and *L* is a Lipschitz constant.

The solutions of Eq (20) can be obtained by (21) where the involved thresholding operator is defined as (22)

The numerical algorithm for solving Eq (19) is summarized in Algorithm 1. Due to the orthogonality of **D**, Algorithm 1 converges quickly, and its computational cost does not increase compared to the solver of the *L*_{1} regularized model.

**Algorithm 1** Fast Numerical Algorithm for Solving Eq (19)

**Initialize:** Set initial guesses *α*_{0} = *α*_{−1} = **0**, **e**_{0} = **e**_{−1} = **0**, and *t*_{0} = *t*_{−1} = 1.

**while** not convergence or termination **do**

**Step 1:**
;

**Step 2:**
;

**Step 3:**
;

**Step 4:**
;

**Step 5:**
, *k* ← *k*+1.

**end while**
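Since the step formulas of Algorithm 1 are rendered as images above, the following is a self-contained numpy sketch of one plausible reading of the iteration for Eq (19): Nesterov extrapolation, a gradient step on the smooth term, then the closed-form *L*_{0} + *L*_{1} proximal step for **α** and soft-thresholding for **e**. Parameter values are illustrative, not the paper's:

```python
import numpy as np

def prox_l0_l1(v, delta, eta):
    # Closed-form proximal operator of delta*|x| + eta*|x|_0 (elementwise):
    # soft-shrink by delta, then keep only entries above delta + sqrt(2*eta).
    shrunk = np.sign(v) * np.maximum(np.abs(v) - delta, 0.0)
    return np.where(np.abs(v) > delta + np.sqrt(2.0 * eta), shrunk, 0.0)

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def apg_sparse_coding_counting(y, D, gamma, lam, beta, L=2.0, n_iter=300):
    # Minimise 0.5*||y - D a - e||^2 + gamma*||a||_0 + lam*||a||_1 + beta*||e||_1
    # by accelerated proximal gradient; D is assumed column-orthogonal.
    K, n = D.shape[1], y.shape[0]
    a = a_prev = np.zeros(K)
    e = e_prev = np.zeros(n)
    t = t_prev = 1.0
    for _ in range(n_iter):
        w = (t_prev - 1.0) / t
        za = a + w * (a - a_prev)          # extrapolation (Nesterov momentum)
        ze = e + w * (e - e_prev)
        r = D @ za + ze - y                # shared residual at the extrapolated point
        a_prev, e_prev = a, e
        a = prox_l0_l1(za - (D.T @ r) / L, lam / L, gamma / L)
        e = soft_threshold(ze - r / L, beta / L)
        t_prev, t = t, 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
    return a, e
```

Because **D** is orthogonal, each iteration only needs matrix-vector products, which is what keeps the cost comparable to an *L*_{1}-only solver.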

### Closed-form Solution for Combining *L*_{1} and *L*_{0} Regularization

This subsection mainly focuses on a sparse combinatory model which combines the *L*_{0} and *L*_{1} norms together as the regularizer term

*x** = argmin_{x} *E*(*x*), with *E*(*x*) = ½(*x* − *y*)^{2} + *δ*|*x*| + *η*|*x*|_{0}, (23)

where *δ*, *η* > 0 and |*x*|_{0} denotes the scalar *L*_{0} norm: |*x*|_{0} = 0 if *x* = 0, and |*x*|_{0} = 1 otherwise.

**Proposition 1.** *The optimal solution* *x** *of* Eq (23) *is*

*x** = sign(*y*)(|*y*| − *δ*) *if* |*y*| > *δ* + √(2*η*), *and* *x** = 0 *otherwise*. (24)

*Proof*. First, denote *E*(*x*) = ½(*x* − *y*)^{2} + *δ*|*x*| + *η*|*x*|_{0}. It is obvious that if *x* = 0, then *E*(0) = ½*y*^{2}. Then we need to discuss the case that *x* ≠ 0:

- if *x* > 0, then *E*(*x*) = ½(*x* − *y*)^{2} + *δx* + *η*. Writing its KKT condition, we get *x* = *y* − *δ*, and the objective value is *E*(*y* − *δ*) = *δy* − ½*δ*^{2} + *η*.
- if *x* < 0, then *E*(*x*) = ½(*x* − *y*)^{2} − *δx* + *η*. It is easy to get *x* = *y* + *δ*, and the objective value is *E*(*y* + *δ*) = −*δy* − ½*δ*^{2} + *η*.

Then we compare these three cases. If *E*(0) > *E*(*y* − *δ*), we have (*δ* − *y*)^{2} > 2*η*. Combining this with *x* = *y* − *δ* > 0, we have *y* > *δ* + √(2*η*). Similarly, if *E*(0) > *E*(*y* + *δ*), then *y* < −*δ* − √(2*η*). And *x** = 0, otherwise.

If *x* and *y* are replaced by vectors **x**, **y** ∈ ℝ^{n}, Eq (23) changes into
(25)
where the penalties are applied elementwise. It is obvious that Eq (25) can be turned into
(26)
So it can be seen as a sequence of scalar optimizations over *x*_{i}, *i* = 1, …, *n*, each of which can be solved by Proposition 1.

In Eq (25), if we set *δ* = 0 and *η* = 0, the model degenerates to linear regression. If we set *δ* = 0, Eq (25) reduces to *L*_{0} regularized regression, while it becomes *L*_{1} regularized regression when *η* = 0. Fig 3(a) shows the closed-form solutions of these four cases. We set *δ* = *η* = 0.5 in Eq (25) (*L*_{0} + *L*_{1} regularized regression), *η* = 1 in *L*_{0} regularized regression, and *δ* = 1 in *L*_{1} regularized regression. We note that *L*_{0} + *L*_{1} regularized regression has the same sparsity as *L*_{0} regularized regression, while causing less over-penalization than *L*_{1} regularized regression. In Fig 3(b), the sparsity threshold changes of *L*_{0}, *L*_{1} and *L*_{0} + *L*_{1} regularized regression are shown, respectively. When *δ* = 1 − *η* changes from 0 to 1, the sparsity threshold of *L*_{0} + *L*_{1} varies from that of *L*_{0} to that of *L*_{1}. Besides, it is obvious that the threshold of *L*_{0} + *L*_{1} is larger than those of *L*_{0} and *L*_{1} in the interval (0, 0.8].
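To make Proposition 1 concrete, the closed-form solution can be checked numerically against a brute-force grid search over *E*(*x*); this sketch uses the thresholding rule |*y*| > *δ* + √(2*η*) derived above:

```python
import numpy as np

def prox_combined(y, delta, eta):
    # Closed-form minimiser of E(x) = 0.5*(x - y)^2 + delta*|x| + eta*|x|_0:
    # soft-shrink by delta, then hard-threshold at delta + sqrt(2*eta).
    if abs(y) > delta + np.sqrt(2.0 * eta):
        return np.sign(y) * (abs(y) - delta)
    return 0.0

def brute_force_min(y, delta, eta):
    # Dense grid search over E(x) as an independent check of the closed form.
    x = np.linspace(-5.0, 5.0, 200001)           # grid contains 0 exactly
    E = 0.5 * (x - y) ** 2 + delta * np.abs(x) + eta * (x != 0)
    return x[np.argmin(E)]
```

For instance, with *δ* = *η* = 0.5 the sparsity threshold is 0.5 + 1 = 1.5, so inputs of magnitude below 1.5 are set exactly to zero while larger inputs are only shrunk by *δ*, matching the *L*_{0} + *L*_{1} curve in Fig 3(a).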

(a) shows the closed solutions of linear regression, *L*_{0}, *L*_{1}, *L*_{0} + *L*_{1} regularized regression, respectively. (b) shows the sparsity threshold changes of *L*_{0}, *L*_{1} and *L*_{0} + *L*_{1} regularized regression, respectively.

## Orthogonal Dictionary Learning for Visual Tracking

In this section, we describe the dictionary learning in detail through three parts: dictionary initialization, orthogonal dictionary update, and dictionary reinitialization.

**Dictionary Initialization:** There are two schemes to initialize the orthogonal dictionary: one is performing PCA on the set **Y**_{k} of the first *k* frames; the other is performing RPCA on **Y**_{k}. When the initial frames do not undergo corruption (*e.g.*, occlusion or illumination change), we perform PCA on **Y**_{k} instead of RPCA. The PCA procedure computes the skinny SVD of **Y**_{k} and takes the basis vectors of its column space as the initial dictionary. However, when the initial frames contain large sparse noise, RPCA is selected to get the intrinsic low-rank features **Z**_{k}, which can be obtained by solving [7]:
(27)
As analyzed in [6], when solving Eq (27) the skinny SVD of **Z**_{k} = **U**_{k}**Σ**_{k}**V**_{k}^{⊤} is readily available, and **D** = **U**_{k} is the initial orthogonal dictionary. Fig 4(a) shows that PCA initialization and RPCA initialization both perform well when the first *k* frames have little noise. The initial frames are generally clean; therefore, we choose PCA initialization as the default.
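A minimal sketch of the PCA initialization scheme: centre the first *k* frames, take the skinny SVD, and keep the leading left singular vectors. The 16-basis choice mirrors the experimental setting; the rest is illustrative:

```python
import numpy as np

def init_dictionary_pca(Y_k, n_basis=16):
    # Y_k: d x k matrix whose columns are the vectorized first k frames.
    mu = Y_k.mean(axis=1, keepdims=True)          # mean template
    U, s, Vt = np.linalg.svd(Y_k - mu, full_matrices=False)
    return U[:, :n_basis], mu.ravel()             # orthogonal dictionary D, mean

rng = np.random.default_rng(0)
Y_k = rng.normal(size=(1024, 20))                 # stand-in for 20 initial frames
D, mu = init_dictionary_pca(Y_k)
```

By construction the returned columns satisfy **D**^{⊤}**D** = **I**, which is the orthogonality assumption exploited throughout the optimization.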

The upper portion of the image is the tracking frame. The middle of the image consists of three sub-pictures: the left is the mean image, the middle is the reconstruction result, and the right is the Laplace noise. The bottom of the image shows the top ten basis vectors of the dictionary. (a) shows the tracking results of PCA and RPCA dictionary initialization. The tracking performance with and without RPCA reinitialization is shown in (b). Reprinted from [20] under a CC BY license, with permission from Yi Wu, original copyright 2013.

**Orthogonal Dictionary Update:** As the appearance of a target may change drastically, it is necessary to update the orthogonal dictionary **D**. Here we adopt an incremental PCA algorithm [21] to update the dictionary.

**Dictionary Reinitialization:** When the tracker is prone to drift, dynamically reinitializing the dictionary to recover the intrinsic subspace features is needed. We adopt the strategy proposed by [7]. The reinitialization is performed at the *t*-th frame if *σ* = ‖**e**_{t}‖_{0}/*len*(**e**_{t}) > *thr*, where **e**_{t} is the noise term at the *t*-th frame, *len*(⋅) is the length of the vector, and *thr* > 0 is a threshold parameter (generally 0.5). If *σ* > *thr*, we reinitialize the dictionary in the same way as the initialization by performing RPCA, but with a different **Y**_{t} in Eq (27). Here, **Y**_{t} consists of optimal candidate observations from the initial *n* (generally 10) frames and the latest *t* − *n* frames (we set *t* = 30). Fig 4(b) compares the tracking performance with and without RPCA reinitialization when the object undergoes variable illumination. After reinitializing the dictionary, our tracker re-tracks the object, so reinitialization is effective in improving the reconstruction ability. In Algorithm 2, we summarize the overall tracking process for frame *t*.
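The reinitialization trigger itself is straightforward to state in code; this sketch follows the criterion in the text (the threshold default mirrors the stated value of 0.5):

```python
import numpy as np

def needs_reinit(e_t, thr=0.5):
    # sigma = ||e_t||_0 / len(e_t): the fraction of pixels flagged as
    # outliers by the Laplacian noise term at frame t.
    sigma = np.count_nonzero(e_t) / e_t.size
    return sigma > thr
```

When more than half of the pixels are explained by the Laplacian noise term, the subspace no longer fits the target and RPCA reinitialization is invoked.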

## Experimental Results

In this section, we compare the performance of our proposed tracker with several state-of-the-art tracking algorithms, namely TLD [22], IVT [5], ASLA [23], *L*_{1}APG [14], MTT [11], SP [15], SPOT [24], FOT [25], SST [26], SCM [27], MIL [2], and Struck [3], on twenty-two video sequences from the popular benchmark [20], including basketball, bolt, boy, car4, carDark, carScale, crossing, david, david2, david3, deer, faceocc1, faceocc2, fish, football, mountainBike, shaking, skating1, trellis, walking, walking2 and woman. These sequences are publicly available online at http://cvlab.hanyang.ac.kr/tracker_benchmark/datasets.html. Representative videos including Tiger1 and Singer1 have been downloaded from the open video datasets of the paper [28]. Our tracker is implemented in MATLAB (R2013b) and runs at 4.2 fps on an Intel 2.53 GHz Dual-Core CPU with 8 GB memory running Windows 7. We empirically set *η* = 0.1, *λ* = 0.5, *γ* = 0.1, *τ* = 0.05 and the Lipschitz constant *L* = 2. Before solving Eq (18), all the candidates **y** are centralized. For efficiency, the updated orthogonal dictionary **D** consists of the columns corresponding to the 16 largest eigenvalues of PCA or RPCA, 600 particles are adopted, and the model is incrementally updated every 5 frames. In the following, we present both qualitative and quantitative comparisons of the above-mentioned methods.

**Algorithm 2** Our Robust Visual Tracking Algorithm

**Initialization:** Initialize orthogonal dictionary **D** by performing PCA on .

**Input:** State **x**_{t−1} (*t* > *k*) and orthogonal dictionary **D**.

**Step 1:** Draw new samples from **x**_{t−1} and obtain corresponding candidates .

**Step 2:** Obtain and using Eq (19).

**Step 3:** For each candidate, calculate the observation probability using Eq (4).

**Step 4:** Find the tracking result patch with the maximal observation likelihood and its corresponding noise .

**Step 5:** Perform an incremental PCA algorithm to update the orthogonal dictionary **D** every five frames. If *σ* > *thr*, reinitialize the dictionary at the *t*-th frame using Eq (27).

**Output:** State and corresponding image patch; orthogonal dictionary **D**.

### Qualitative Evaluation

We choose some examples from the 22 sequences to illustrate the effectiveness of our method. Fig 5 shows the visualization results.

**Heavy Occlusion:** Fig 5(a) and 5(b) show three challenging sequences with heavy occlusion. In *Faceocc1* and *Faceocc2*, the targets undergo heavy occlusion and in-plane rotation; it can be seen that our method outperforms the other tracking algorithms. *David3* demonstrates that the proposed method can capture the accurate location of objects in terms of position and scale when the target undergoes severe occlusion (*e.g.*, *David3* #0085). However, the IVT, *L*_{1}APG, MIL, SP, SCM, ASLA, TLD, SPOT, FOT, SST, MTT, and Struck methods drift away from the target object when occlusion occurs. For these sequences, the IVT method performs poorly since conventional PCA is not robust to occlusions. Although *L*_{1}APG and SP utilize sparsity to model outliers, it is observed that their occlusion detection is not stable when a drastic change of appearance happens. In contrast, our method is robust to heavy occlusion. This is because our combined *L*_{0} and *L*_{1} regularized appearance model can exactly reconstruct the object.

**Fast Motion:** Fig 5(c) shows the sequences *Boy* and *Deer* with fast motion. It is difficult to predict the locations of the tracked objects when they undergo abrupt motion. In *Boy*, the captured images are seriously blurred, but Struck and our method track the target faithfully throughout the sequence. The IVT, MTT, ASLA, SCM and SST methods drift away seriously. We note that most of the other trackers have drift problems due to the abrupt motion and background clutter in the sequence *Deer*. In contrast, the SST and our method successfully track the target through the whole video.

**Illumination Changes and Scale Variation:** In Fig 5(d) and 5(e), we test three challenging sequences with illumination changes and scale variation. The *Fish* clip contains significant illumination variation. We can see that the *L*_{1}APG, MTT, and MIL methods are less effective in these cases (*e.g.*, *Fish* #0305). In *CarDark*, our method still performs well, but TLD, FOT, and MIL fail. Our method also achieves good performance in *CarScale* with scale variation (*e.g.*, *CarScale* #0204). Subspace-based approaches may fail to update the appearance model, as the calculation of coefficients in their models may retain redundant background features. Our method can successfully adapt to drastic appearance changes, since the combination of sparse coding and sparse counting is not merely stable but also able to obtain the intrinsic features of the subspace.

**Background Clutters:** Fig 5(f) demonstrates the tracking results in *Basketball* and *Football* with background clutter. *Basketball* is a difficult sequence because it contains a cluttered background, illumination change, heavy occlusion and non-rigid pose variation. Apart from our tracker, none of the compared algorithms works well on it (*e.g.*, *Basketball* #0486 and #0614). As shown in *Football*, our tracker performs relatively well (*e.g.*, *Football* #304) as it has excluded background clutter via the sparse errors, but TLD, FOT, and MIL fail.

### Quantitative Evaluation

We use two metrics to evaluate the proposed algorithm against other state-of-the-art methods. The first metric is the center location error measured with manually labeled ground truth data. The second one is the overlap rate, *i.e.*, *score* = *area*(*R*_{T} ∩ *R*_{G})/*area*(*R*_{T} ∪ *R*_{G}), where *R*_{T} is the tracking bounding box and *R*_{G} is the ground truth bounding box. Larger average scores mean more accurate results.
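The overlap rate is the standard intersection-over-union score; for axis-aligned boxes given as (x, y, w, h) it can be computed as in the following sketch:

```python
def overlap_rate(box_t, box_g):
    # Intersection-over-union of tracking box R_T and ground-truth box R_G,
    # both given as (x, y, width, height).
    xt, yt, wt, ht = box_t
    xg, yg, wg, hg = box_g
    iw = max(0.0, min(xt + wt, xg + wg) - max(xt, xg))
    ih = max(0.0, min(yt + ht, yg + hg) - max(yt, yg))
    inter = iw * ih
    union = wt * ht + wg * hg - inter
    return inter / union if union > 0 else 0.0
```

For example, two identical boxes score 1.0, disjoint boxes score 0.0, and two 2×2 boxes offset by one pixel horizontally overlap with a score of 1/3.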

Table 1 shows the average overlap rates. Table 2 reports the average center location errors (in pixels), where a smaller average error means a more accurate result. Notice that the results are calculated by averaging 5 runs of these algorithms. As can be seen from the tables, on most sequences our method achieves lower average errors and higher overlap rate values. We provide the precision and success plots in Fig 6 to evaluate our performance over all the 22 sequences. The evaluation parameters are set as default in [20]. We note that our algorithm performs well for the videos with occlusion, low resolution, in-plane rotation, and background clutter based on the precision metric and the success rate metric, as shown in Figs 7 and 8 respectively. Both the tables and the figures show that our method achieves favorable performance against other state-of-the-art methods.

The best and the second-best results are highlighted.

The best and the second-best results are highlighted.

The mean precision scores are reported in the legends.

To further compare the running time of four subspace-based tracking algorithms (i.e., IVT, *L*_{1}APG, SP and our method), we calculated the average frames per second (FPS) for a 32 × 32 image patch (see the last row of Table 1). For *L*_{1}APG, we report the FPS of its APG acceleration. It can be seen that IVT is considerably faster than the other trackers, as its computation only involves matrix-vector multiplications. Both SP and our method are faster than *L*_{1}APG. It is also observed that our method is much faster than SP. This is due to the different choices of optimization scheme: SP adopts a naive alternating minimization strategy; in contrast, our method is efficiently solved by APG.

## Conclusion

In this paper, we propose a sparse coding and counting method under a Bayesian framework for robust visual tracking. The proposed method combines the *L*_{0} regularized and *L*_{1} regularized sparse representations into a unified model; therefore, it has a better ability to sparsely represent an object, and the reconstruction results are also better. Besides, to solve the proposed model, we develop a fast and efficient APG algorithm. Moreover, the closed-form solution of the combined *L*_{0} norm and *L*_{1} norm regularization is provided. Extensive experiments testify to the superiority of our method over state-of-the-art methods, both qualitatively and quantitatively.

## Acknowledgments

This work is partially supported by the National Natural Science Foundation of China (Nos. 61300086, 61432003, 61672125), the Fundamental Research Funds for the Central Universities (DUT15QY15), the Hong Kong Scholar Program (No. XJ2015008), and National Science and Technology Major Project (Nos. 2013ZX04005-021, 2014ZX001011).

## Author Contributions

**Conceptualization:** RL JW ZS. **Data curation:** JW. **Formal analysis:** RL JW. **Funding acquisition:** RL ZS. **Investigation:** RL JW. **Methodology:** RL JW. **Project administration:** RL ZS. **Software:** JW YC. **Supervision:** ZS. **Validation:** JW XS. **Visualization:** JW YW. **Writing – original draft:** RL JW. **Writing – review & editing:** RL JW XS.

## References

- 1. Liu R, Cheng J, Lu H. A robust boosting tracker with minimum error bound in a co-training framework. In: ICCV; 2009. p. 1459–1466.
- 2. Babenko B, Yang MH, Belongie SJ. Visual tracking with online Multiple Instance Learning. In: CVPR; 2009. p. 983–990.
- 3. Hare S, Saffari A, Torr PHS. Struck: Structured output tracking with kernels. In: ICCV; 2011. p. 263–270.
- 4. Jepson AD, Fleet DJ, El-Maraghi TF. Robust Online Appearance Models for Visual Tracking. IEEE TPAMI. 2003;25(10):1296–1311.
- 5. Ross DA, Lim J, Lin RS, Yang MH. Incremental Learning for Robust Visual Tracking. IJCV. 2008;77(1-3):125–141.
- 6. Liu R, Lin Z, Su Z, Gao J. Linear time Principal Component Pursuit and its extensions using *ℓ*_{1} filtering. Neurocomputing. 2014;142:529–541.
- 7. Zhang C, Liu R, Qiu T, Su Z. Robust visual tracking via incremental low-rank features learning. Neurocomputing. 2014;131:237–247.
- 8. Liu R, Jin W, Su Z, Zhang C. Latent Subspace Projection Pursuit with Online Optimization for Robust Visual Tracking. IEEE MultiMedia. 2014;21:47–55.
- 9. Mei X, Ling H. Robust visual tracking using *ℓ*_{1} minimization. In: ICCV; 2009. p. 1436–1443.
- 10. Liu B, Yang L, Huang J, Meer P, Gong L, Kulikowski C. Robust and fast collaborative tracking with two stage sparse optimization. In: ECCV; 2010. p. 624–637.
- 11. Zhang T, Ghanem B, Liu S, Ahuja N. Robust Visual Tracking via Structured Multi-Task Sparse Learning. IJCV. 2013;101(2):367–383.
- 12. Jin W, Liu R, Su Z, Zhang C, Bai S. Robust visual tracking using latent subspace projection pursuit. In: ICME; 2014. p. 1–6.
- 13. Pan J, Lim J, Su Z, Yang MH. *ℓ*_{0}-Regularized Object Representation for Visual Tracking. In: BMVC; 2013.
- 14. Bao C, Wu Y, Ling H, Ji H. Real time robust *ℓ*_{1} tracker using accelerated proximal gradient approach. In: CVPR; 2012. p. 1830–1837.
- 15. Wang D, Lu H, Yang MH. Online Object Tracking With Sparse Prototypes. IEEE TIP. 2013;22(1):314–325.
- 16. Lu X, Wang Y, Yuan Y. Sparse Coding From a Bayesian Perspective. IEEE Transactions on Neural Networks and Learning Systems. 2013;24(6):929–939. pmid:24808474
- 17. Wang D, Lu H, Yang MH. Least Soft-threshold Squares Tracking. In: CVPR; 2013. p. 2371–2378.
- 18. Lin Z, Ganesh A, Wright J, Wu L, Chen M, Ma Y. Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix. UIUC; 2009.
- 19. Tseng P. On accelerated proximal gradient methods for convex-concave optimization; 2008. Submitted to SIAM J. Optimiz.
- 20. Wu Y, Lim J, Yang MH. Online Object Tracking: A Benchmark. In: CVPR; 2013. p. 2411–2418.
- 21. Levey A, Lindenbaum M. Sequential Karhunen-Loeve basis extraction and its application to images. IEEE Trans on IP. 2000;9(8):1371–1374.
- 22. Kalal Z, Mikolajczyk K, Matas J. Tracking-Learning-Detection. IEEE TPAMI. 2012;34(7):1409–1422.
- 23. Jia X, Lu H, Yang MH. Visual tracking via adaptive structural local sparse appearance model. In: CVPR; 2012. p. 1822–1829.
- 24. Zhang L, Maaten L. Structure preserving object tracking. In: CVPR; 2013. p. 1838–1845.
- 25. Vojíř T, Matas J. The enhanced flock of trackers. In: Registration and Recognition in Images and Videos. Springer; 2014. p. 113–136.
- 26. Zhang T, Liu S, Xu C, Yan S, Ghanem B, Ahuja N, et al. Structural sparse tracking. In: CVPR; 2015. p. 150–158.
- 27. Zhong W, Lu H, Yang MH. Robust object tracking via sparsity-based collaborative model. In: CVPR; 2012. p. 1838–1845.
- 28. Bai Q, Wu Z, Sclaroff S, Betke M, Monnier C. Randomized ensemble tracking. In: ICCV; 2013. p. 2040–2047.