Multi-part and scale adaptive visual tracker based on kernel correlation filter

Accurate visual tracking is a challenging problem in computer vision. Correlation filter (CF) based methods are popular in visual tracking because of their efficiency and high performance. Nonetheless, CF-based trackers are sensitive to partial occlusion, which may reduce their overall performance and even lead to tracking failure. In this paper, we present a tracker based on the kernelized correlation filter (KCF) tracker. Firstly, we employ a multi-part tracking algorithm to improve the overall capability of correlation filter based trackers, especially under partial occlusion. Secondly, to cope with scale variation, we employ a scale adaptive scheme that divides the target into four patches and computes the scale factor by finding the maximum response position of each patch via the kernelized correlation filter. With this method, scale computation is transformed into locating the centers of the patches. Thirdly, because the Gaussian objective function varies only slightly near its center, the maximum response position can be ambiguous; to solve this problem, new Gaussian kernel functions are introduced. Experiments on the default 51 video sequences in the Visual Tracker Benchmark demonstrate that the proposed tracker provides significant improvement over state-of-the-art trackers.


Introduction
Visual object tracking is a crucial research problem in computer vision and has many applications, including video surveillance, traffic monitoring, robotics and human-computer interfaces. In the past decade, visual tracking algorithms have improved greatly [1,2,3,4,5,6], but tracking remains challenging in scenarios such as illumination variation, scale variation, occlusion, deformation and background clutter.
Recently, correlation filter based methods have become popular in visual tracking because of their efficiency and high performance. Correlation filters generate correlation peaks for each patch of interest in a frame while producing low responses to the background, so they are often used as detectors of the expected model. The Kernelized Correlation Filter (KCF) tracker runs at very high speed while maintaining good tracking performance. For a given image, the KCF tracker learns the target's appearance with a kernel least squares classifier. However, the KCF tracker cannot handle scale changes. Danelljan et al. [7] relieve the scaling issue using a feature pyramid and a 3-dimensional correlation filter. Yang Li et al. [8] apply a scaling pool to handle scale variations. These methods have largely solved the scaling problem. Occlusion is also a difficult problem for correlation filter based trackers. In general, a multi-part tracking scheme helps to gain robustness against partial occlusion. In this respect, Akin et al. [9] propose a tracker that depends on coupled interactions between a global tracker and several part trackers. Jeong et al. [10] apply a naive multi-block scheme based on DSST [7]. These methods can handle partial occlusion to a large extent.
However, frequent use of sub-part trackers degrades the overall performance of the tracker, since sub-part trackers treat part of the target as background during training and detection. To avoid accumulating these negative effects, sub-trackers should be employed only in frames in which the object is occluded or deformed.
In this paper, we employ an effective spatial distribution to divide the target into two sub-parts. To avoid applying sub-trackers frequently, we give the sub-trackers a reliability weight based on the fluctuation of the correlation response of the global tracker, so that sub-trackers are chosen only when the target is occluded or deformed. We assign different learning rates to different trackers based on the ratio of their response values. Moreover, robust scale calculation is a challenging problem in visual tracking, and most existing trackers fail to handle large scale variations in complex videos. To address this issue, we propose a robust and efficient scale-adaptive tracker in the tracking-by-detection framework, which divides the target into four patches and computes the scale factor by finding the maximum response position of each patch via the kernelized correlation filter. With this method, scale computation is transformed into locating the centers of the patches. Finally, because the Gaussian objective function varies only slightly near its center, the target location can be ambiguous; to solve this problem, new Gaussian kernel functions are introduced in this paper.

Related works
The KCF tracker [11] achieves excellent results and high-speed performance on the Visual Tracker Benchmark [12], even though its idea and implementation are very simple. The KCF tracker collects positive and negative samples around the target using the structure of the circulant matrix, which improves the discriminative capability of the tracking-by-detection tracker. The circulant matrix can be diagonalized with the Discrete Fourier Transform (DFT), enabling fast element-wise operations instead of expensive matrix algebra.
The goal of the KCF tracker is to find a function $f(x) = w^{T}x$ that minimizes the squared error over the data matrix X and the regression target y,

$$\min_{w} \; \|Xw - y\|^{2} + \lambda \|w\|^{2} \quad (1)$$

where the square matrix X contains all circulant shifts of the base sample x, the regression target y is Gaussian-shaped, and λ is a regularization parameter that ensures the generalization performance of the classifier. Eq (1) has the closed-form solution

$$w = (X^{H}X + \lambda I)^{-1} X^{H} y \quad (2)$$
The circulant matrix X has some intriguing properties [16] [11]; the most useful one is that it can be diagonalized by the Discrete Fourier Transform (DFT):

$$X = F \, \mathrm{diag}(\hat{x}) \, F^{H} \quad (3)$$

where F is the DFT matrix, $F^{H}$ is its Hermitian transpose, and $\hat{x}$ denotes the DFT of x, $\hat{x} = \mathcal{F}(x)$. Applying Eq (3) to the solution of the linear regression (Eq (2)), we obtain

$$\hat{w} = \frac{\hat{x}^{*} \odot \hat{y}}{\hat{x}^{*} \odot \hat{x} + \lambda} \quad (4)$$

where $\hat{x}^{*}$ is the complex conjugate of $\hat{x}$. The symbol $\odot$ and the fraction denote element-wise product and division, respectively.
To detect the new location of the target in the next frame, we compute the response f(z) for all candidate patches z. Using the same diagonalization, the response is obtained in the Fourier domain as

$$\hat{f}(z) = \hat{w} \odot \hat{z} \quad (5)$$
The candidate patch with the maximum response is considered as the new location of target.
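The training and detection steps of Eqs (4) and (5) can be sketched with NumPy as follows. This is a minimal illustration for the linear, single-channel case only, not the tracker's implementation: the function names, the wrap-around label layout with its peak at index (0, 0), and the values of `sigma` and `lam` are assumptions made for the example.

```python
import numpy as np


def gaussian_labels(shape, sigma):
    """Gaussian-shaped regression target with its peak at index (0, 0).

    Distances wrap around the borders, matching the cyclic shifts
    implied by the circulant data matrix (layout is an assumption).
    """
    h, w = shape
    dy = np.minimum(np.arange(h), h - np.arange(h))
    dx = np.minimum(np.arange(w), w - np.arange(w))
    dist2 = dy[:, None] ** 2 + dx[None, :] ** 2
    return np.exp(-dist2 / (2.0 * sigma ** 2))


def train_filter(x, y, lam=1e-4):
    """Eq (4): element-wise ridge regression solution in the Fourier domain."""
    x_hat = np.fft.fft2(x)
    y_hat = np.fft.fft2(y)
    return np.conj(x_hat) * y_hat / (np.conj(x_hat) * x_hat + lam)


def detect(w_hat, z):
    """Eq (5): response map for patch z and the index of its maximum."""
    response = np.real(np.fft.ifft2(w_hat * np.fft.fft2(z)))
    peak = np.unravel_index(np.argmax(response), response.shape)
    return response, peak


# Illustrative use: a cyclically shifted copy of the training patch.
rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))
w_hat = train_filter(patch, gaussian_labels(patch.shape, sigma=2.0))
_, peak = detect(w_hat, np.roll(patch, shift=(3, 5), axis=(0, 1)))
```

With these conventions the response peak is expected to move with the cyclic shift applied to the patch, which is how the maximum response yields the target displacement.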

The proposed tracker
In this section, we describe our tracker, which is based on the kernelized correlation filter (KCF) [11]. We first present the multi-part tracking algorithm, which improves the robustness of correlation filter based trackers, then introduce the adaptive scale calculation method, and finally discuss the selection of the Gaussian function.

Multi-part tracking
In visual tracking tasks, partial occlusion is one of the major challenges limiting tracker performance. A multi-part scheme [13] [14] simply splits the target into multiple parts and tracks them independently. When the target is partially occluded or deformed, the tracker can still locate it by relying on the sub-part that remains effective. The high frame rate of KCF also allows a multi-part scheme to be applied to real-time tasks. However, a sub-part tracker does not perform as well as the global tracker in most non-occluded frames, even though it sometimes has a higher response value, since sub-part trackers treat part of the target as background during training and detection. Therefore, the best strategy is to use the global tracker when the object is not occluded and a sub-tracker when occlusion occurs.
Our goal is to develop a multi-part tracker in which the sub-part trackers and the global tracker each take effect in the frames where they are most effective. We employ effective spatial distributions to divide the target into two sub-parts, one for horizontally aligned and one for vertically aligned objects, based on the ratio of the height and width of the target, as illustrated in Fig 1. The key to our method is how to select the optimal tracker among the global and sub-part trackers for each frame, as illustrated in Fig 2. If we simply chose the tracker with the maximum response, sub-part trackers would frequently be applied to non-occluded frames. Fortunately, when the target is occluded or deformed, the response value of the global tracker fluctuates significantly relative to frames in which the target is not occluded. Based on this fact, we propose a reliability weight w for the sub-part trackers. The weight w gives the multi-part tracker the ability to identify whether the object is occluded, so that it can select the optimal tracker for each frame by itself.
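The aspect-ratio based division into two sub-parts can be sketched as follows. This is only an illustration under stated assumptions: the box format (top-left corner, width, height), the rule that a target at least as tall as it is wide is split into upper and lower halves, and the function name `split_target` are all hypothetical, since the exact layout is given only in Fig 1.

```python
def split_target(x, y, w, h):
    """Split a target box into two equal sub-parts.

    The box is (x, y, w, h) with (x, y) the top-left corner.
    Assumption: tall targets are cut into upper and lower halves,
    wide targets into left and right halves.
    """
    if h >= w:
        half = h / 2.0
        return [(x, y, w, half), (x, y + half, w, half)]
    half = w / 2.0
    return [(x, y, half, h), (x + half, y, half, h)]
```

Each returned box would then be handed to its own sub-part tracker, alongside the global tracker that keeps the full box.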
Firstly, we introduce a fluctuation parameter $\Delta_t$ of the global tracker (Eq (6)). For the first frame of tracking, $\Delta_1$ is set to 0. $R_g^t$ is the global tracker's response value in the current frame, and $R_g^L$ is the response value of the global tracker at the last time it was selected as the optimal tracker; both are obtained by Eq (5). The parameter indicates the change in response value after the object is occluded or deformed. The smaller the parameter, the larger the occluded area of the object, which means that the global tracker's reliability is reduced.
To prevent sub-part trackers from being selected as the optimal tracker in non-occluded frames, we assign a reliability weight to the response values of the sub-part trackers. The reliability weight at the t-th frame is defined by Eq (7), where η and θ are the reliability and sensitivity parameters, respectively; in our experiments, η is set to 0.4 and θ = 1. The reliability weight reduces the probability that a sub-tracker is selected as the optimal one unless the weight is less than -0.4; a value below -0.4 implies that the object is likely to be occluded. The multi-part tracker chooses the optimal tracker using Eq (8), where $R_{si}^t$ is the response value of the i-th sub-tracker.
If the optimal tracker is the global tracker, the new position is obtained directly. If one of the sub-trackers is selected as the optimal tracker, the new position is obtained by applying the corresponding shift to the previous center coordinates.
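The selection logic described above can be sketched in a few lines. The sketch is hypothetical in several respects, and the paper's own equations should be preferred where available: `relative_fluctuation` assumes the fluctuation is the relative change of the global response, `step_weight` is a simple stand-in gate (with an arbitrary `low` value) rather than the reliability weight actually used, and `select_tracker` assumes the weight multiplies the sub-tracker responses before the maximum is taken.

```python
def relative_fluctuation(r_now, r_last):
    """Assumed form of the fluctuation: relative change of the global response."""
    return (r_now - r_last) / r_last


def step_weight(delta, eta=0.4, low=0.5):
    """Hypothetical gate: full weight only after a drop larger than eta."""
    return 1.0 if delta < -eta else low


def select_tracker(r_global, r_subs, weight):
    """Return 0 for the global tracker, or i >= 1 for the i-th sub-tracker.

    Sub-tracker responses are down-weighted before comparison, so a
    sub-tracker wins only when its weighted response beats the global one.
    Ties are resolved in favour of the global tracker.
    """
    scores = [r_global] + [weight * r for r in r_subs]
    return max(range(len(scores)), key=lambda i: scores[i])
```

For example, with a stable global response the gate keeps the weight low, so a slightly higher sub-tracker response does not displace the global tracker; after a sharp drop in the global response the sub-trackers compete at full weight.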

Subsection scale calculation method
Assume that the location of the target center in frame t−1 is $p_{t-1}$ and the target scale is $w_{t-1} \times h_{t-1}$. In frame t−1, the image block $z_{t-1}$ of size $\beta w_{t-1} \times \beta h_{t-1}$, centered at $p_{t-1}$, is selected to update the appearance template $\hat{x}$ and the coefficient $\hat{\alpha}$ according to Eq (9), where β is the expansion coefficient and η is the learning rate. A coordinate system is constructed with $p_{t-1}$ as the origin. The $w_{t-1} \times h_{t-1}$ image is divided into four equal sub-blocks, whose centers are $(w_1(t-1), h_1(t-1))$, $(w_2(t-1), h_2(t-1))$, $(w_3(t-1), h_3(t-1))$ and $(w_4(t-1), h_4(t-1))$. A linear classifier is trained on each of the four sub-blocks as in Eq (1), and its template and coefficients are updated as in Eq (9).
In frame t, the target scale is calculated as follows. First, the image block $z_{t0}$ of size $\beta w_{t-1} \times \beta h_{t-1}$, centered at $p_{t-1}$, is selected, and the maximum response position $p_t$ is taken as the target center in the current frame. Then the image block $z_{t1}$ of size $w_{t-1} \times h_{t-1}$, centered at $p_t$, is selected, and a coordinate system is constructed with $p_t$ as the origin. The two axes divide the image block into four sub-blocks. The classifiers trained on the four sub-blocks are used to find the positions of largest response on the sub-blocks, $(w_1(t), h_1(t))$, $(w_2(t), h_2(t))$, $(w_3(t), h_3(t))$ and $(w_4(t), h_4(t))$, and the scaling factor $\gamma_t$ is given by the relative change of the center positions in the w and h dimensions [15] (Eq (10)).
After the scaling factor $\gamma_t$ is calculated, a moving average (MA) is used to compute the target scale, in order to reduce the influence of noise on the scale calculation and increase its robustness. With moving average parameter T, the moving average $\rho_t$ of the scaling factor is given by Eq (11). In particular, when T = 1 in Eq (11), the moving average degenerates to $\rho_t = \gamma_t$. The target scale in the t-th frame is then given by Eq (12), where $w_1$ and $h_1$ are the target scale in the initial frame. After the target scale in the t-th frame is calculated, the image block $z_t$ of size $\beta w_t \times \beta h_t$, centered at $p_t$, is selected to update the appearance template $\hat{x}$ and the coefficient $\hat{\alpha}$. At the same time, the $w_t \times h_t$ target area is divided into four sub-blocks, and the sub-block centers, the sub-block templates and the classifiers on the sub-blocks are updated.
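The idea of turning scale estimation into locating four sub-block centers, followed by moving-average smoothing, can be sketched as below. Both pieces are assumptions for illustration rather than the paper's exact equations: `scale_factor` uses one plausible square-root form (the ratio of root-sum-squared center offsets between frames), and `ScaleSmoother` uses a plain arithmetic mean over the last T factors, which is consistent with the stated property that T = 1 returns the raw factor.

```python
import math
from collections import deque


def scale_factor(prev_centers, curr_centers):
    """Assumed scale factor from sub-block centers given relative to the target center.

    Each argument is a list of four (w_i, h_i) offsets; the factor is the
    ratio of their root-sum-squared magnitudes (current over previous).
    """
    num = sum(w * w + h * h for w, h in curr_centers)
    den = sum(w * w + h * h for w, h in prev_centers)
    return math.sqrt(num / den)


class ScaleSmoother:
    """Arithmetic moving average over the last T scale factors (assumed form)."""

    def __init__(self, T):
        self.history = deque(maxlen=T)

    def update(self, gamma):
        # Until T values have been seen, the mean covers only the values so far.
        self.history.append(gamma)
        return sum(self.history) / len(self.history)
```

If the four response peaks move outward from the center by a common factor, `scale_factor` returns that factor, and the smoother damps frame-to-frame noise at the cost of reacting more slowly to genuine scale changes as T grows.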

Selection of Gaussian kernel function
In the tracking algorithm, the objective function y is generally a Gaussian function of the distance $|p - p_0|$ with constant width σ (Eq (13)), where p = (m, n) and $p_0 = (m_0, n_0)$ is the target center position, and

$$|p - p_0| = \sqrt{(m - m_0)^2 + (n - n_0)^2} \quad (14)$$

The partial derivatives of the Gaussian function at $p_0 = (m_0, n_0)$ are zero,

$$\left.\frac{\partial y}{\partial m}\right|_{p_0} = \left.\frac{\partial y}{\partial n}\right|_{p_0} = 0 \quad (15)$$

This shows that the objective function varies very little near $p_0 = (m_0, n_0)$. Since the target position during tracking is determined by the maximum response position, this small variation near the center causes location ambiguity. To solve this problem, the following Gaussian kernel function is introduced in this paper.
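$$\hat{y} = \exp\left(-\frac{|p - p_0|}{2\theta}\right) \quad (16)$$

where θ > 0 is a constant. The partial derivatives of the function in Eq (16) are

$$\frac{\partial \hat{y}}{\partial m} = -\frac{m - m_0}{2\theta\,|p - p_0|}\exp\left(-\frac{|p - p_0|}{2\theta}\right), \qquad \frac{\partial \hat{y}}{\partial n} = -\frac{n - n_0}{2\theta\,|p - p_0|}\exp\left(-\frac{|p - p_0|}{2\theta}\right) \quad (17)$$

In particular, the one-sided partial derivatives at $p_0 = (m_0, n_0)$ are

$$\left.\frac{\partial \hat{y}}{\partial m}\right|_{p_0} = \mp\frac{1}{2\theta}, \qquad \left.\frac{\partial \hat{y}}{\partial n}\right|_{p_0} = \mp\frac{1}{2\theta} \quad (18)$$

where the upper sign is the right partial derivative ($m = m_0^{+}$, $n = n_0^{+}$) and the lower sign is the left partial derivative ($m = m_0^{-}$, $n = n_0^{-}$). Eq (18) shows that the left and right partial derivatives of the Gaussian kernel function at $p_0 = (m_0, n_0)$ are not equal, so the partial derivatives at $p_0$ do not exist. However, both one-sided derivatives exist and are nonzero constants, which means that the objective function changes quickly near $p_0 = (m_0, n_0)$. This is beneficial to the accurate positioning of the target center during tracking.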
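The difference between the two objective functions near the center can be checked numerically. The sketch below assumes the conventional form exp(−r²/(2σ²)) for the Gaussian function and uses the kernel of Eq (16) for the new one, both as functions of the distance r = |p − p0|; the function names, the finite-difference step and the parameter values in the usage note are illustrative choices, not values from the experiments.

```python
import math


def gaussian_target(r, sigma):
    """Assumed conventional Gaussian objective as a function of distance r."""
    return math.exp(-r * r / (2.0 * sigma * sigma))


def sharp_target(r, theta):
    """Kernel of Eq (16): decays linearly in r inside the exponent."""
    return math.exp(-r / (2.0 * theta))


def one_sided_slope(f, h=1e-6):
    """Forward finite-difference slope of f at r = 0."""
    return (f(h) - f(0.0)) / h
```

With, for instance, σ = θ = 2, the slope of `gaussian_target` at the center is numerically close to zero, while that of `sharp_target` is close to −1/(2θ), so a one-pixel offset from the center lowers the new objective noticeably more than the Gaussian one.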

Experiments
In this section, we first introduce the experimental setup and methodology. Then, to evaluate the performance of the proposed Multi-part and Scale Adaptive Tracker (MSAT), we compare our method with correlation filter based trackers and other state-of-the-art trackers on the default 51 video sequences in the Visual Tracker Benchmark [12].

Experimental setup and methodology
The proposed tracker is implemented in MATLAB R2014a. All experiments are conducted on a PC with an Intel Xeon(R) E3-1226 V3 CPU (3.30 GHz) and 16 GB RAM. The HoG cell size is 4×4 and the number of orientation bins is 9. The padding window is 2.5 times the size of the target object, and the learning rate parameter γ is set to 0.015. The σ used in the Gaussian kernel is set to 0.5.
We select two quantitative evaluation criteria. The first is the mean overlap precision (OP), the percentage of frames in a sequence in which the intersection-over-union (IOU) overlap exceeds a threshold of 0.5. The second criterion is the area under the curve (AUC), computed as the average of the success rates at overlap thresholds sampled from 0 to 1.
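The two criteria can be computed from per-frame bounding boxes as in the following sketch. The box format (top-left x, y, width, height), the strict "greater than" comparison and the default of 21 evenly spaced thresholds are assumptions for illustration; the function names are hypothetical and are not part of the benchmark toolkit.

```python
import numpy as np


def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def overlap_precision(ious, threshold=0.5):
    """Fraction of frames whose IOU exceeds the threshold."""
    return float(np.mean(np.asarray(ious, dtype=float) > threshold))


def success_auc(ious, num_thresholds=21):
    """Mean success rate over overlap thresholds sampled evenly in [0, 1]."""
    ious = np.asarray(ious, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([(ious > t).mean() for t in thresholds]))
```

Given a list of per-frame IOU values for one sequence, `overlap_precision` yields the OP at 0.5 and `success_auc` the AUC; averaging these over sequences gives the mean scores.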
We tested the proposed method with different values of the reliability parameter η, varied from 0.1 to 0.7, as shown in Fig 3. The smaller η is, the higher the probability that a sub-tracker is selected as the optimal tracker, and frequently choosing sub-trackers reduces the performance of the proposed method. Conversely, assigning η too large a value is equivalent to using only the global tracker.
To evaluate the overall performance of the proposed approach, we first run seven correlation filter based trackers and then compare with other state-of-the-art trackers on the default 51 video sequences in the Visual Tracker Benchmark [12].

Comparison to correlation filter based trackers
To show the performance improvements brought by the multi-part and scale adaptive scheme, we compare our MSAT tracker with recent correlation filter based trackers, namely CSK [16], KCF [11], DSST [7], SAMF [17], OCT_KCF [18] and CN [19], on the OTB dataset. All of these trackers use circulant matrices or kernelized correlation filters. Fig 4 shows the mean OP and AUC scores of these trackers overall and under occlusion and scale variation. Table 1 summarizes the overall evaluation of the seven trackers, and Fig 5 compares them in challenging situations. The success plots in Fig 4 show that our MSAT tracker performs better than the other correlation filter based trackers. The results also show that our multi-part scheme yields high OP and AUC scores in the occlusion challenge, and our tracker is the only one that handles the partial occlusion in Fig 5(B). Additionally, the trackers that explicitly use a scale adaptive strategy to address scale change (MSAT, SAMF, DSST) have an advantage in the experiments.
Features are essential to visual object tracking. CSK employs only raw pixels and ranks lowest among the correlation filter based trackers. CN uses both raw pixels and color-naming features and improves considerably on CSK. The trackers with HoG and color-naming features (MSAT, SAMF) outperform KCF, which employs only the HoG feature.
In the precision plots, OCT_KCF [18] has the highest OP score, because OCT_KCF models the distribution of the correlation response in a Bayesian optimization framework to alleviate the drifting problem, making the position in each frame more accurate. In Fig 5(H), the performance of our tracker is inferior to that of DSST [7], which uses 33 different scales for tracking, but this scale strategy increases the computational cost. Table 1 indicates that our tracker has the best overall evaluation among the seven kernel correlation filter based trackers. Compared to KCF [11], the MSAT tracker achieves improvements of 10.1% and 16% in OP score and AUC score, respectively. The results also demonstrate that MSAT improves on SAMF [17], which uses the same features and scale strategy as our tracker, especially in the occlusion challenge. Our proposed MSAT tracker runs at about 10 fps, which is still within real time range.

PLOS ONE

Comparison with the state-of-art trackers
In our next experiment, we compare our approach and KCF [11] with the 29 state-of-the-art trackers reported in the benchmark experiment in [12] on the OTB dataset. Fig 6 presents the overall scores of the proposed tracker against the top nine performing state-of-the-art trackers on the default 51 video sequences in the Visual Tracker Benchmark [12]. The correlation filter based trackers (MSAT, KCF, CSK) outperform the other state-of-the-art trackers. The trackers with HoG features (MSAT, KCF) clearly outperform SCM [4] and Struck [1] in both success and precision plots. The top nine performing state-of-the-art trackers obtain a mean AUC score of 0.446, compared to 0.596 for our MSAT tracker. Table 2 shows the mean OP score on the Visual Tracker Benchmark dataset and its challenging sub-categories for the top ten tracking algorithms. Our MSAT tracker obtains the best score in 7 and the second-best score in 2 of the 9 sub-category tasks. These results suggest that our tracker with the multi-part and scale adaptive scheme is more effective in visual tracking challenges.

Conclusions
This paper presents a tracker based on the kernelized correlation filter. It proposes a multi-part tracking algorithm to improve the overall capability of correlation filter based trackers, especially under partial occlusion. By using a reliability weight, the multi-part tracking algorithm is able to select the optimal tracker for each frame by itself. Moreover, this paper proposes a robust and efficient scale-adaptive tracker in the tracking-by-detection framework, which divides the target into four patches and computes the scale factor by finding the maximum response position of each patch via the kernelized correlation filter. With this method, scale computation is transformed into locating the centers of the patches. To solve the problem of location ambiguity, new Gaussian kernel functions are introduced. Our proposed MSAT tracker runs at about 10 fps, which is still within real time range. Extensive experiments demonstrate the validity of the proposed tracker.