
Improved siamese tracking for temporal data association

  • Yi Tao,

    Roles Conceptualization, Formal analysis, Project administration, Resources, Supervision

    Affiliations Xi’an Aerospace Automation Co., Ltd., The 6th Academy of China Aerospace Science and Industry Corporation, Xi’an, Shaanxi, China, Xidian University, Xi’an, Shaanxi, China

  • Fei Wang ,

    Roles Data curation, Investigation, Methodology, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    wf_asam@163.com

    Affiliation Xi’an Aerospace Automation Co., Ltd., The 6th Academy of China Aerospace Science and Industry Corporation, Xi’an, Shaanxi, China

  • Mohan Li,

    Roles Data curation

    Affiliation Xi’an Aerospace Automation Co., Ltd., The 6th Academy of China Aerospace Science and Industry Corporation, Xi’an, Shaanxi, China

  • Jie Liu,

    Roles Writing – review & editing

    Affiliation Xi’an Aerospace Automation Co., Ltd., The 6th Academy of China Aerospace Science and Industry Corporation, Xi’an, Shaanxi, China

  • Juncheng Zhou,

    Roles Formal analysis

    Affiliation Xi’an Aerospace Automation Co., Ltd., The 6th Academy of China Aerospace Science and Industry Corporation, Xi’an, Shaanxi, China

  • Bo Dong,

    Roles Writing – review & editing

    Affiliation Xi’an Aerospace Automation Co., Ltd., The 6th Academy of China Aerospace Science and Industry Corporation, Xi’an, Shaanxi, China

  • Ruidong Liu,

    Roles Writing – review & editing

    Affiliation Xi’an Aerospace Automation Co., Ltd., The 6th Academy of China Aerospace Science and Industry Corporation, Xi’an, Shaanxi, China

  • Sihao Chen,

    Roles Writing – review & editing

    Affiliation Xi’an Aerospace Automation Co., Ltd., The 6th Academy of China Aerospace Science and Industry Corporation, Xi’an, Shaanxi, China

  • Kan Jiao

    Roles Writing – review & editing

    Affiliation Xi’an Aerospace Automation Co., Ltd., The 6th Academy of China Aerospace Science and Industry Corporation, Xi’an, Shaanxi, China

Abstract

Temporal image data association is essential for visual object tracking. The association task is typically stated as the process of connecting signals from the same object at different times along the time axis, and it is usually performed before state estimation. The accuracy of the data association results is fundamental to guaranteeing the correctness of all subsequent procedures. This paper proposes an efficient approach for temporal data association, focused on obtaining accurate association results within the Siamese network framework. Siamese networks have recently demonstrated strong performance in visual object tracking owing to their balanced accuracy and speed. Based on data association processing and multi-tracker collaboration, our algorithm achieves high accuracy and strong robustness, outperforming several state-of-the-art trackers, including standard Siamese trackers.

1. Introduction

Large-scale time series datasets represent sequences of data points collected or recorded at time intervals and ordered chronologically. Such data collection is imperative for the surveillance of dynamic alterations within multifarious systems, serving as a fundamental element for both analytical review and prospective estimation, thereby delivering significant value in a breadth of fields.

The pertinence of time series datasets is manifold, facilitating profound examinations into enduring tendencies and elucidating the trajectory of specific indices over temporal spans. These data are central to discerning trends, which are particularly salient for entities whose operations are susceptible to temporal variability. Historical data points within these series are central in forecasting imminent events.

Visual tracking [1–13] provides a deep understanding of how objects move and evolve over time, aiding tasks from security surveillance to environmental studies. Visual tracking enables the persistent observation of mobile subjects and is particularly advantageous in a range of scenarios, from meteorological surveillance to faunal tracking. This process involves the initial localization of an object and the continuous detection and localization of that object in subsequent frames, even in the presence of complex environmental changes, occlusions, illumination changes, or camera motion.

Temporal image data association is an intermediate key step in visual object tracking tasks. In the context of visual object tracking, temporal data can include the location and size of the object. These data points constitute the trajectory of the target in time and space and are essential for understanding and predicting target behavior. Given that the ultimate outcome of visual object tracking largely hinges on the quality of the association, it is worthwhile to pursue an optimal solution. Generally, a time-context model trained with information about the target and the surrounding background detects the location of the target.

In video scenes, a multitude of challenges, such as complex environmental changes, occlusions, illumination changes, and camera motion, can significantly impair the continuity of video data as a time series. As illustrated in Fig 1, occlusions, where objects overlap or are partially hidden by other elements within the scene, are particularly problematic. These occlusions can cause tracking algorithms to lose sight of the target, leading to incorrect associations and disrupted temporal coherence, and complicate post-occlusion recovery due to appearance changes or similar objects. Additionally, variations in lighting, weather conditions, and camera motion can introduce noise and distortions, further complicating the tracking process. Fast-moving targets, abrupt changes in direction, and cluttered backgrounds also pose significant obstacles, as they can easily confuse tracking systems and result in lost tracks or false positives. These challenges collectively undermine the reliability and consistency of the time series data, necessitating robust algorithms capable of maintaining accurate and continuous tracking despite these adversities.

Recently, the Siamese network has demonstrated strong performance in visual tracking owing to its balanced accuracy and speed [14–19]. The Siamese network, which is specifically designed to address the image similarity problem, is inherently more appropriate than classical Convolutional Neural Networks (CNNs) for visual object tracking. By formulating the visual tracking task as a matching problem, a Siamese tracker, which consists of a template branch and a test branch sharing all CNN parameters, is trained as a generic similarity function between the two branches on a video dataset. The Siamese tracker then searches for the target in a search region by correlation with a sliding window. The fully convolutional Siamese network (SiamFC) [14] has achieved good tracking performance with no model update at all.

However, relying only on the information of the first frame restricts the discrimination ability of SiamFC, which is consequently weak in the face of multiple challenges, e.g., target appearance changes. We propose a template ensemble built by adopting reliable online update processes to improve tracking performance. We combine multiple templates trained online via different reliable update processes in the Siamese network framework; these templates complement each other. This strategy ensures that the template ensemble always retains previous reliable information, which is effective for improving tracking robustness.

In this paper, we propose a decision-level fusion that combines multiple templates, trained online via different reliable update processes, into a compound tracker and corrects the errors induced by similar distractors. This method ensures the coherence and consistency of the time series data and can construct a more accurate motion model to improve the stability and accuracy of tracking. Specifically, we first obtain multiple templates by online training via different reliable update processes in the Siamese network framework and select the appropriate one from among them. We implement the reliable update strategy by discriminating changes in the response map. We then correct the errors from similar distractors to improve tracking robustness. Fig 2 shows that our method effectively improves the performance of our baseline, SiamFC. Our collaborative tracking algorithm combines diverse templates trained online via different reliable update processes to achieve improved tracking performance, and our tracker re-detects the target from multiple target-like regions to maintain the continuity of tracking when the target suffers from background clutter, similar-appearance objects, deformations, and occlusions. Our contributions are summarized as follows:

Fig 2. Performance improvement of SiamFC by the proposed algorithm in this paper.

https://doi.org/10.1371/journal.pone.0320746.g002

  • We explore model fusion strategies to make better use of diverse models obtained via different update processes.
  • We harness historical information through a variety of online update mechanisms and corresponding models, each meticulously trained utilizing time-series data, to enhance the precision and robustness of object tracking.
  • To maintain the coherence and consistency of time series data, we discriminate changes in the response map to determine whether the samples of the current frame are reliable.
  • To maintain the continuity of tracking, we re-detect the target from multiple target-like regions to handle background clutter, similar appearance objects, deformations and occlusions.

2. Related work

2.1. Deep similarity tracking

Siamese trackers formulate visual object tracking as a generic similarity problem. First, a model is trained as a generic similarity function between two branches on a video dataset during an offline phase. Then the model is applied to evaluate the similarity, by correlation, between two network inputs: the target template and the current frame. SiamFC [14] trains two identical fully convolutional networks to represent the object and the search area, and generates the tracking result by finding the maximum value of the correlation response map.

SiamFC outperformed several state-of-the-art trackers while achieving real-time speed. Several improvements were subsequently proposed. For example, rather than performing correlation on deep features directly, the correlation filter network (CFNet) [20] trained a correlation filter on the extracted object features to speed up tracking. The Siamese network with semantic and appearance features (SA-Siam) [16] encoded the target with a semantic branch and an appearance branch to improve tracking robustness. However, since these Siamese trackers only use the output of the last convolutional layers, more detailed target-specific information from earlier layers is not exploited. In our work, we propose a Siamese tracker that combines features from different hierarchical levels.
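Combining features from different hierarchical levels can be realized at the decision level by fusing per-level response maps. The snippet below is an illustrative NumPy sketch, not the authors' implementation: the function name, the min-max normalization, and the equal default weights are all assumptions made for the example.

```python
import numpy as np

def fused_response(responses, weights=None):
    """Decision-level fusion sketch: normalize each level's response map to
    [0, 1] and average them (optionally weighted), so that coarse semantic
    and fine appearance levels both contribute to the final peak."""
    if weights is None:
        weights = [1.0 / len(responses)] * len(responses)
    out = np.zeros_like(responses[0], dtype=float)
    for r, w in zip(responses, weights):
        rmin, rmax = r.min(), r.max()
        # Min-max normalization puts maps from different layers on a
        # common scale before they are averaged.
        out += w * (r - rmin) / (rmax - rmin + 1e-12)
    return out
```

Normalizing before averaging matters because response magnitudes from different layers are not directly comparable; without it, one layer could dominate the fused map regardless of its discriminative quality.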

2.2. Online updating with time series data

The fixed models of the original Siamese trackers are likely to fail when the target appearance changes. This weakness can be mitigated by online updates. Some discriminative correlation filter (DCF) based trackers employ a straightforward linear strategy that is simple to implement while remaining efficient in memory consumption and computational complexity. This approach updates the object appearance model using features extracted from each frame, under the assumption that the object appearance changes at a fixed rate in consecutive frames. Inspired by this strategy, several Siamese trackers [17,21,22] update the object appearance template online with a fixed learning rate across all frames of the video. However, contaminated templates, which may lead to model drift and even tracking failure, are undesirable, for example under severe occlusions. To overcome this problem, we propose two conservative strategies that apply previous reliable information to the update of the object template branch.

(1) θ_t = (1 − α) θ_{t−1} + α g(x_t)

where θ_{t−1} and θ_t are the previous and updated parameters, α is a learning rate, and x_t is the sample extracted from the current frame. For a template-based tracker, g is the identity function; for a correlation filter-based tracker, g maps the sample to filter coefficients.
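The linear strategy of Equation 1 can be sketched in a few lines of NumPy. This is a hedged illustration, not code from the paper; the function name and the default learning rate are assumptions, and g defaults to the identity as in the template-based case.

```python
import numpy as np

def linear_update(template_prev, sample, alpha=0.01, g=lambda x: x):
    """Linear online update (Equation 1): blend the previous template with
    the feature g(sample) extracted from the current frame at learning rate
    alpha. For a template-based tracker g is the identity; for a correlation
    filter-based tracker g would map the sample to filter coefficients."""
    return (1.0 - alpha) * template_prev + alpha * g(sample)
```

Because the update is a convex combination, a converged template is left unchanged by a sample identical to it, while a sequence of corrupted samples still drifts the template, which is exactly the failure mode the reliable-sample selection in Section 3.3 guards against.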

2.3. Multi-branch tracking fusion

The diversity of target appearances during tracking shows that a single fixed template cannot be discriminative in all tracking situations with varied challenges. In the pursuit of more precise and reliable object tracking, the concept of multi-branch tracking fusion holds significant relevance. As established above, the ever-changing nature of target appearances demands adaptable tracking mechanisms. Multi-branch tracking fusion, in its various manifestations, such as multi-domain convolutional neural networks (MDNet) [23], tracking adaptation context-aware auto-encoders (TRACA) [24], multi-branch Siamese tracking (MBST) [25], and the multi-features Siamese tracker (MFST) [26], attempts to counter the challenges posed by diverse target conditions. MDNet [23] pretrains multiple branches using independent information belonging to different domains; however, a notable shortcoming is that the independence of the domains might lead to suboptimal integration during actual tracking. TRACA [24] trains multiple auto-encoders for different contexts and selects the best one with a context-aware network. While this context-based selection mechanism is innovative, it suffers from high computational complexity. MBST [25] selects the optimal branch from multiple branches with diverse feature representations according to their response maps. Although it provides a practical way to adapt to varying target appearances, it depends heavily on the quality and diversity of the predefined feature representations. MFST [26] fuses multiple feature representations extracted from different layers of two models, aiming to enhance tracking performance by combining different levels of semantic and visual information. Nevertheless, the challenge lies in determining the optimal combination weights for the different features: an improper weighting scheme can either overemphasize less relevant features or underutilize crucial ones, leading to subpar tracking accuracy.

3. Our proposed tracker

As illustrated in Fig 3, we propose a decision-level fusion, which combines multiple templates online trained via different reliable update processes into a compound tracker, and a re-examination method, which re-detects the target misclassified as background by the basic tracking network, to improve the temporal consistency of time series data.

thumbnail
Fig 3. Overall framework of our proposed tracking algorithm.

https://doi.org/10.1371/journal.pone.0320746.g003

3.1. Siamese baseline

In the recent landscape of visual target tracking, algorithms derived from Siamese networks have become a focal point due to their capacity to balance precision with computational efficiency. In this work, we adopt SiamFC [14] as our baseline. SiamFC treats visual target tracking as a similarity-learning problem. It deploys two congruent fully convolutional networks, acting as a target exemplar branch and a search branch, to extract features from the target and its surrounding search area. After feature extraction, these features are combined through a cross-correlation procedure, creating a response map; the peak of this map indicates the position of the target for the ensuing tracking step. Both the target image patch x and the search area z are processed by an AlexNet [27] backbone ϕ with shared parameters to obtain the feature maps ϕ(x) and ϕ(z). The response map is calculated by the cross-correlation function,

(2) f(x, z) = ϕ(x) ⋆ ϕ(z) + b·1

where ⋆ denotes the cross-correlation operation and b·1 is a bias term taking the same value at every location.

SiamFC locates the target within the search zone by applying the cross-correlation between the target exemplar representation and a translational scan of the proposed area, pinpointing the object at the highest point of the resultant response map. For high discriminative power, the network is trained offline on many random pairs from a video dataset, employing the logistic loss as follows:

(3) ℓ(y, v) = log(1 + exp(−y v))

where y ∈ {+1, −1} is the true label of the image pair, and v is the actual correlation value between the sample image and the candidate image.
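The cross-correlation of Equation 2 and the logistic loss of Equation 3 can be illustrated with a naive NumPy sketch. Real implementations run the correlation on multi-channel deep feature maps on the GPU; here the bias term b·1 is omitted for brevity, single-channel maps are assumed, and the function names are our own.

```python
import numpy as np

def response_map(template_feat, search_feat):
    """Cross-correlation (Equation 2 without the bias): slide the template
    feature map over the search-region feature map and record the inner
    product at each offset."""
    th, tw = template_feat.shape
    sh, sw = search_feat.shape
    out = np.empty((sh - th + 1, sw - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(template_feat * search_feat[i:i + th, j:j + tw])
    return out

def logistic_loss(y, v):
    """Element-wise logistic loss (Equation 3): log(1 + exp(-y * v))."""
    return np.log1p(np.exp(-y * v))
```

The tracking prediction is simply the argmax of the response map, which is why contamination of the template, or a strong distractor peak, translates directly into a wrong position estimate.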

3.2. Multi-template trained with time series data fusion

Challenges such as target appearance variations, similar distractors, and occlusions, which widely exist in video datasets, demand more from the template branch. The fixed target template extracted from the first frame is not sufficient to handle these challenges. We propose to add online appearance information extracted from subsequent frames to the templates to improve tracking performance. In the general update strategy, all update possibilities follow Equation 1. Different from this strategy, we initially store all update possibilities and update only the selected template in the current frame.

The target template of the first frame is the only deterministic template and can cope with frequent and severe occlusion challenges. The online-updated templates complement this fixed first-frame template, and efficiently combining the two is a research problem. A reliable-sample judgment mechanism, described in the next section, screens reliable samples, and multiple different templates are trained through different update processes using the screened samples. The template used in the current frame is updated online only when the estimation for the current frame is judged reliable, while the other templates are retained. Fusing these templates allows the tracker to adapt to the scene of the current frame, essentially selecting historical sample information for online template training adaptively. The evaluation criterion for multi-template fusion is the peak-to-sidelobe ratio (PSR), a classic measurement used by correlation filter tracking algorithms to evaluate the discriminative power of a model. The PSR can accurately select the most discriminative template in the template set.
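PSR-based template selection can be sketched as follows, assuming single-channel response maps. The 11×11 sidelobe-exclusion window is a common convention in correlation filter tracking, not a value given in the paper, and both function names are our own.

```python
import numpy as np

def psr(response):
    """Peak-to-sidelobe ratio: (peak - sidelobe mean) / sidelobe std, where
    the sidelobe is the response map excluding an 11x11 window around the
    peak (a conventional window size, assumed here)."""
    peak_idx = np.unravel_index(np.argmax(response), response.shape)
    peak = response[peak_idx]
    mask = np.ones_like(response, dtype=bool)
    r0 = max(peak_idx[0] - 5, 0)
    c0 = max(peak_idx[1] - 5, 0)
    mask[r0:peak_idx[0] + 6, c0:peak_idx[1] + 6] = False
    side = response[mask]
    return (peak - side.mean()) / (side.std() + 1e-12)

def select_template(templates, responses):
    """Pick the template whose response map has the highest PSR, i.e. the
    most discriminative one for the current frame."""
    scores = [psr(r) for r in responses]
    return templates[int(np.argmax(scores))]
```

A sharp, isolated peak yields a high PSR, while a flat or multi-modal response map yields a low one, which is why the PSR serves both for template selection here and for the interference test in Section 3.4.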

3.3. Reliable samples selection policy

Numerous tracking systems incrementally refine their visual models [8,11,20,26] with fresh samples taken from each subsequent frame to accommodate changes in the observed target. These models adapt at a predetermined learning rate with every new frame, without weighing the reliability of the incorporated samples. This practice can lead to model corruption when samples are compromised by occlusion or tracking errors. Such corruption may accumulate progressively and eventually result in significant model drift, or even a total breakdown of tracking. To filter for high-quality samples, we introduce a method to gauge the trustworthiness of the samples from each frame.

Conceptually, a frame that yields dependable samples is characterized by a response map that shows nominal fluctuation, whereas, when the target is occluded or a tracking error occurs, the response map is likely to change strikingly. We therefore use variations in the direction of the response map to determine the trust level of each frame. Specifically, the direction of the response map is computed over an annular region: the area within 0.5 times the search region around the maximum confidence score, excluding the inner area within 0.4 times the search region. This annular region is divided into eight distinct sectors by spatial position, giving eight spatial bins at even intervals over the full 360-degree range (0–2π radians). The response values, viewed as weighted contributions, are accumulated into their matching spatial bins according to their positions within the frame.

Then, the total response value R_i of each region i is projected onto the X-axis and Y-axis to obtain V_x and V_y according to the angle coefficient corresponding to each region:

(4) V_x = Σ_{i=1}^{8} w_{x,i} R_i,  V_y = Σ_{i=1}^{8} w_{y,i} R_i

where w_{x,i} and w_{y,i} are the corresponding weighting coefficients in the X-axis and Y-axis directions, obtained by trigonometric functions (the cosine and sine of the approximate angle of region i). The direction angle A of the response map is obtained through the following equation:

(5) A = arctan(V_y / V_x)

The difference in direction angles between adjacent frames is mapped into the interval [0, π] using the following equation:

(6) ΔA_t = min(|A_t − A_{t−1}|, 2π − |A_t − A_{t−1}|)

The change in the direction of the response map between two adjacent frames is then described by the following equation:

(7) C_t = ΔA_t / π

so that C_t ∈ [0, 1], with large values indicating a drastic change of the response map and hence unreliable samples.
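The direction-based reliability test of Eqs 4–7 can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: the eight angular bins use their centre angles as the trigonometric coefficients, the annulus radii follow the 0.4/0.5 factors given in the text (applied to the map half-size), and the angle difference between frames is wrapped into [0, π].

```python
import numpy as np

def response_direction(response):
    """Direction angle of a response map (Eqs 4-5): response mass in an
    annulus around the peak is accumulated into 8 evenly spaced angular
    bins; each bin total is projected onto the X and Y axes with the
    cosine/sine of its centre angle, and the angle is recovered by atan2."""
    h, w = response.shape
    py, px = np.unravel_index(np.argmax(response), response.shape)
    yy, xx = np.mgrid[0:h, 0:w]
    dy, dx = yy - py, xx - px
    r = np.hypot(dy, dx)
    r_in, r_out = 0.4 * min(h, w) / 2, 0.5 * min(h, w) / 2
    ring = (r >= r_in) & (r < r_out)
    ang = np.mod(np.arctan2(dy, dx), 2 * np.pi)
    bins = np.minimum((ang / (2 * np.pi / 8)).astype(int), 7)
    vx = vy = 0.0
    for k in range(8):
        centre = (k + 0.5) * 2 * np.pi / 8
        total = response[ring & (bins == k)].sum()
        vx += total * np.cos(centre)   # Eq 4, X-axis projection
        vy += total * np.sin(centre)   # Eq 4, Y-axis projection
    return np.arctan2(vy, vx)          # Eq 5

def direction_change(a_prev, a_curr):
    """Eqs 6-7 (up to normalization): difference of adjacent-frame direction
    angles, wrapped into [0, pi]; a large value flags the frame's samples
    as unreliable."""
    d = abs(a_curr - a_prev) % (2 * np.pi)
    return min(d, 2 * np.pi - d)
```

Under occlusion or a tracking error, the response mass shifts off-centre, the annulus histogram changes shape, and the resulting direction angle jumps between consecutive frames, which is the signal used to reject the frame's samples.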

3.4. Correcting the errors from similar distractors

Under conditions of extreme clutter, frequent occlusions, and similar-appearance objects, detecting the precise target amid multiple misleading regions is a demanding problem. Such distractors can bias the similarity indicators of the tracking templates and potentially cause a complete breakdown of the tracking process. We propose a re-examination method that re-detects a target misclassified as background by the basic tracking network, filtering out false backgrounds and improving the temporal consistency of the algorithm. We evaluate the confidence of similar target regions using the PSR and the local maximum of each peak to determine the final tracking result. Before this decision, we examine the effect of peaks away from the presumed target, to verify whether distractors have significantly deviated the tracking and whether a re-calibration of the target position is justified. Re-detection is triggered when all of the following conditions hold. First, the PSR of the maximum of the cosine-windowed response map is low, reflecting insufficient discriminative capacity of the tracking template. Second, the difference in magnitude between the first and second peaks of the response map without the cosine window is small, indicating that the template confuses the true target with the strongest distractor; this level of similarity is captured by the first-to-second ratio (FSR) between the first-peak and second-peak values on the response map without the cosine window. Third, both the first and second peaks of the response map without the cosine window are physically far from the peak of the cosine-windowed map, indicating a notable disparity between the distractor region and the true target area, which, if incorrectly identified, would be detrimental to tracking fidelity. Meeting all these conditions signals that the estimated target has been heavily influenced, necessitating a renewed analysis of its position. If, in such a scenario, the PSR of the highest peak on the unwindowed response map is the largest, the initial target position is deemed erroneous, and the tracking result is promptly corrected to the position of the peak on the unwindowed response map.
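The triggering conditions above can be sketched as a boolean decision. All thresholds here are placeholders for the dataset-specific values of Section 4.1, the helper names are our own, and the PSR of the windowed map is assumed to be computed elsewhere (e.g., as in the PSR sketch of Section 3.2).

```python
import numpy as np

def second_peak(response, peak, excl=5):
    """Location of the second-highest peak, excluding a small window
    around the first peak (window size is an assumption)."""
    r = response.copy()
    r[max(peak[0] - excl, 0):peak[0] + excl + 1,
      max(peak[1] - excl, 0):peak[1] + excl + 1] = -np.inf
    return np.unravel_index(np.argmax(r), r.shape)

def needs_redetection(psr_win, resp_raw, psr_thresh, fsr_thresh,
                      dist_thresh, peak_win):
    """Decision sketch for Sect. 3.4: re-detect when (a) the PSR of the
    cosine-windowed response is low, (b) the first-to-second ratio (FSR)
    of the raw response is close to 1, and (c) both raw peaks are far
    from the windowed peak. All thresholds are hypothetical placeholders."""
    p1 = np.unravel_index(np.argmax(resp_raw), resp_raw.shape)
    p2 = second_peak(resp_raw, p1)
    fsr = resp_raw[p1] / (resp_raw[p2] + 1e-12)
    far1 = np.hypot(p1[0] - peak_win[0], p1[1] - peak_win[1]) > dist_thresh
    far2 = np.hypot(p2[0] - peak_win[0], p2[1] - peak_win[1]) > dist_thresh
    return bool(psr_win < psr_thresh and fsr < fsr_thresh and far1 and far2)
```

When this check fires, the final step in the text compares PSR values and, if the unwindowed peak wins, moves the tracking result to that peak's position.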

4. Experiments

The performance of the proposed tracking algorithm is comprehensively evaluated using three video datasets in this section. Specifically, the datasets used for evaluation are Temple Color 128 (TC-128) [28], unmanned aerial vehicle 123 (UAV123) [29], and long-term unmanned aerial vehicle 20 (UAV20L) [29], as shown in Table 1. These datasets cover a wide range of scenarios, including pedestrians, close-range toys, and vehicles, ensuring the complexity and diversity of the video material. This variety enables a more comprehensive assessment of the tracking algorithm's performance under different conditions. In this section, the experimental setup is first described. Subsequently, detailed information about the proposed tracking algorithm and comparisons with several state-of-the-art tracking algorithms on each dataset are provided. Finally, qualitative evaluations are conducted on selected representative video sequences to further illustrate the strengths and weaknesses of the proposed algorithm.

Table 1. Performance of the proposed tracking algorithm and comparison algorithms in three datasets. The higher the metrics, the better.

https://doi.org/10.1371/journal.pone.0320746.t001

4.1. Implementation details

Before correcting the errors caused by similar distractors, it is crucial to evaluate the degree of interference on the tracking results. This evaluation is performed with different thresholds across the datasets. In each of the TC-128 [28], UAV123 [29], and UAV20L [29] datasets, the target is judged to be strongly interfered with when the PSR and the FSR fall below dataset-specific thresholds and, additionally, the maximum-value region of the cosine-windowed response map, the maximum-value region of the response map without the cosine window, and the second-largest-value region do not intersect. Similarly, the samples of a frame are judged unreliable when the direction change of the response map exceeds a dataset-specific threshold; the TC-128 and UAV123 datasets share one such threshold, while UAV20L uses another.

4.2. Quantitative evaluation of our proposed tracking algorithm performance

4.2.1. Performance of our proposed tracking algorithm on TC-128 dataset.

The fully convolutional Siamese network for temporal data association (SiamFC_TDA), which we propose, is evaluated against five representative trackers on the TC-128 dataset [28]: SiamFC [14], Spatially Regularized Discriminative Correlation Filter with Deconvolution (SRDCFdecon) [30], Spatially Regularized Discriminative Correlation Filter (SRDCF) [31], Staple [32], and Adaptive Structured Local Analysis (ASLA) [33]. SiamFC serves as the baseline tracker for our proposed algorithm. Fig 4 presents the precision and success-rate plots for these algorithms on the TC-128 dataset. Compared with the other trackers, SiamFC_TDA demonstrates superior performance in both the precision and success-rate plots. Specifically, in the precision curve, SiamFC_TDA outperforms the second-best performer, SiamFC, by 8.7%. In the success-rate plot, SiamFC_TDA surpasses Staple, the second-best performer, by 5% and improves upon the base tracker SiamFC by 6.6%.

Fig 4. Performance of SiamFC_TDA algorithm and comparison algorithm on TC-128.

https://doi.org/10.1371/journal.pone.0320746.g004

To conduct a detailed analysis and comparison of each tracker's performance, we evaluated the trackers across 11 distinct attributes. The comparative performance of the six tracking algorithms on these attributes is illustrated in Figs 5 and 6, and summarized in Table 2. Performance is measured in terms of both precision and success rate, and SiamFC_TDA outperforms all other algorithms in both metrics on seven attributes. Specifically, on the Motion Blur (MB) attribute, SiamFC_TDA exhibits the highest success rate. Across most attributes, SiamFC_TDA significantly improves upon the base tracker SiamFC, particularly in areas where SiamFC performs poorly, such as Background Clutter (BC), Deformation (DEF), In-Plane Rotation (IPR), Occlusion (OCC), Out-of-Plane Rotation (OPR), and Scale Variation (SV).

Fig 5. Precision plots of SiamFC_TDA algorithm and comparison algorithm on TC-128 for each challenge attribute.

https://doi.org/10.1371/journal.pone.0320746.g005

Fig 6. Success plots of SiamFC_TDA algorithm and comparison algorithm on TC-128 for each challenge attribute.

https://doi.org/10.1371/journal.pone.0320746.g006

Table 2. AUC values of SiamFC_TDA algorithm and comparison algorithm on each challenge attribute of TC_128. The higher the metrics, the better.

https://doi.org/10.1371/journal.pone.0320746.t002

Since SiamFC relies solely on the first frame as a template and lacks the ability to adapt to appearance changes through template updates, it performs poorly in scenarios involving DEF, IPR, OCC, OPR, and SV. Conversely, our proposed reliable template update strategy, based on a reliable sample selection mechanism, not only adapts to appearance changes through template updates but also avoids template contamination during the update process in the presence of occlusion. Furthermore, the use of multi-template fusion allows the algorithm to effectively handle appearance changes, making it highly effective for dealing with changes in the target’s appearance.

The notable improvement in BC attribute indicates that the proposed method of correcting the errors of similar distractors plays a crucial role in scenes with increasing interference enhancement, substantially improving the performance of the tracking algorithm. Different update processes generate template sets, which are then fused using the concept of ensemble learning. This approach further enhances the proposed tracking algorithm SiamFC_TDA by integrating the advantages of each template, enabling it to perform well in multiple challenging scenarios and adapt to a wider range of complex and changing scenes.

The extensive evaluation on the TC-128 dataset highlights the superior performance of SiamFC_TDA in terms of both accuracy and success rate. The proposed innovations, including a reliable template update strategy, error correction methods for similar distractors, and multi-template fusion, contribute significantly to the algorithm’s robustness and reliability, positioning SiamFC_TDA as a leading solution for video tracking tasks.

4.2.2. Performance of SiamFC_TDA on UAV123 dataset.

Six tracking algorithms were compared on the UAV123 dataset [29]: SiamFC_TDA, SiamFC [14], SRDCFdecon [30], SRDCF [31], Staple [32], and ASLA [33]. Fig 7 presents the precision and success-rate curves for these tracking algorithms on the UAV123 dataset, which comprises aerial video sequences. The proposed tracking algorithm, SiamFC_TDA, demonstrates superior performance in both curves. Specifically, in the precision curve plot, SiamFC_TDA achieves an average distance precision (DP) value of 72.1% at a threshold of 20 pixels, which is 3.6% higher than the second-best performing algorithm, SiamFC, the baseline tracker for SiamFC_TDA. In the success-rate curve, SiamFC_TDA attains an area-under-curve (AUC) value of 49.9%, which is 4.6% higher than the second-best performing baseline tracker SiamFC.

Fig 7. Performance of SiamFC_TDA algorithm and comparison algorithm on UAV123.

https://doi.org/10.1371/journal.pone.0320746.g007

The UAV123 dataset introduces several unique challenges compared to the TC-128 dataset. Video sequences in the UAV123 dataset are annotated with 12 video attribute labels. The comparative performance of the six tracking algorithms across these 12 attributes is depicted in Figs 8 and 9, and summarized in Table 3. Performance is measured in terms of both accuracy and success rate, and SiamFC_TDA excels in both metrics on nine attributes. Notably, on the attribute of BC, SiamFC_TDA exhibits the highest accuracy. Across most attributes, SiamFC_TDA shows improved performance compared to the baseline tracker SiamFC.
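Attribute-wise scores such as those in Table 3 are conventionally obtained by averaging per-sequence results over the sequences annotated with each attribute. A hypothetical helper illustrating this aggregation:

```python
def attribute_auc(per_seq_auc, seq_attrs, attr):
    """Average AUC over the sequences annotated with a given attribute.

    per_seq_auc: {sequence name -> AUC score}
    seq_attrs:   {sequence name -> set of attribute labels, e.g. {"BC", "FM"}}
    attr:        attribute label to aggregate, e.g. "BC"
    """
    vals = [auc for seq, auc in per_seq_auc.items() if attr in seq_attrs[seq]]
    if not vals:
        raise ValueError(f"no sequence carries attribute {attr!r}")
    return sum(vals) / len(vals)
```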

Fig 8. Precision plots of SiamFC_TDA algorithm and comparison algorithm on UAV123 for each challenge attribute.

https://doi.org/10.1371/journal.pone.0320746.g008

Fig 9. Success plots of SiamFC_TDA algorithm and comparison algorithm on UAV123 for each challenge attribute.

https://doi.org/10.1371/journal.pone.0320746.g009

Table 3. AUC values of SiamFC_TDA algorithm and comparison algorithm on each challenge attribute of UAV123. Higher values are better.

https://doi.org/10.1371/journal.pone.0320746.t003

These enhancements are primarily attributed to the proposed reliable template update strategy based on a reliable sample selection mechanism, the method to correct the errors caused by similar distractors, and the application of multi-template fusion. These innovations are particularly effective in addressing the most challenging scenarios, such as those involving BC, FM, and IV. The robust template update strategy ensures that the tracking model adapts dynamically to changes in the target appearance, thereby maintaining high accuracy even in complex environments. The multi-template fusion technique leverages multiple templates to enhance the system’s resilience against occlusions and other visual disturbances, further improving the overall tracking performance.
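The distractor correction idea can be sketched as a penalty that down-weights the fused response wherever previously identified distractors also respond strongly, so a similar object cannot hijack the peak. The exact correction used by SiamFC_TDA is defined earlier in the paper; the additive form and the `beta` weight below are only illustrative assumptions.

```python
import numpy as np

def suppress_distractors(response, distractor_responses, beta=0.5):
    """Subtract an averaged distractor response from the target response,
    shifting the peak away from regions dominated by similar objects."""
    if not distractor_responses:
        return response
    penalty = np.mean(np.stack(distractor_responses), axis=0)
    return response - beta * penalty
```

With a suitable `beta`, a distractor whose response exceeds the true target's is demoted below it, and the argmax returns to the correct location.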

The UAV123 dataset’s diverse set of video attributes provides a comprehensive evaluation framework for assessing the robustness of tracking algorithms. SiamFC_TDA’s superior performance across these attributes underscores its capability to maintain stable tracking even in challenging and dynamic aerial video sequences. This robustness is critical for real-world applications where consistent and reliable tracking is paramount.

4.2.3. Performance of SiamFC_TDA on UAV20L dataset.

The proposed tracking algorithm SiamFC_TDA has been rigorously evaluated against other tracking algorithms on the challenging UAV20L dataset [29], including SiamFC [14], SRDCFdecon [30], SRDCF [31], Staple [32], and ASLA [33]. The UAV20L dataset is specifically designed for long aerial video sequences, presenting a more significant challenge than the UAV123 dataset and requiring higher robustness from the tracking algorithms.

As illustrated in Fig 10, SiamFC_TDA demonstrates superior performance in both accuracy and success rate metrics. In the precision curve plot, the proposed tracking algorithm SiamFC_TDA achieves an average DP value of 65.1% with a threshold of 20 pixels. This performance is 7.5% higher than the second-best performing algorithm, SiamFC, which serves as the baseline tracker for SiamFC_TDA. Furthermore, in the success rate curve, SiamFC_TDA attains an AUC value of 44.1%, surpassing the second-best performer, ECO, by 1.4% and outperforming the baseline tracker SiamFC by 10.6%.

Fig 10. Performance of SiamFC_TDA algorithm and comparison algorithm on UAV20L.

https://doi.org/10.1371/journal.pone.0320746.g010

As depicted in Figs 11 and 12, and summarized in Table 4, SiamFC_TDA outperforms all other algorithms across various attributes, with the exception of Low Resolution (LR) in terms of precision rate and Similar Objects (SOB) in terms of AUC values. This exceptional performance underscores the robustness of the proposed algorithm, even in highly complex and challenging long video sequences. The robustness of SiamFC_TDA is attributed to its effective solutions for addressing common issues such as sample contamination, template contamination, and interference.

Fig 11. Precision plots of SiamFC_TDA algorithm and comparison algorithm on UAV20L for each challenge attribute.

https://doi.org/10.1371/journal.pone.0320746.g011

Fig 12. Success plots of SiamFC_TDA algorithm and comparison algorithm on UAV20L for each challenge attribute.

https://doi.org/10.1371/journal.pone.0320746.g012

Table 4. AUC values of SiamFC_TDA algorithm and comparison algorithm on each challenge attribute of UAV20L. Higher values are better.

https://doi.org/10.1371/journal.pone.0320746.t004

The UAV20L dataset presents several unique challenges that exacerbate the difficulties faced by tracking algorithms. Long aerial video sequences often contain dynamic scenes with significant variations in lighting, weather conditions, and camera movements, which can lead to frequent occlusions and rapid changes in target appearance. Despite these challenges, SiamFC_TDA consistently maintains high tracking accuracy and stability, demonstrating its ability to handle complex scenarios effectively.

The improvements introduced by SiamFC_TDA, namely the reliable template update strategy, the correction of errors caused by similar distractors, and multi-template fusion, contribute significantly to its superior performance. These enhancements enable the algorithm to adapt quickly to changing conditions and maintain consistent tracking accuracy, even in the presence of occlusions and other visual disturbances. The robustness of SiamFC_TDA is further evidenced by its consistent performance across various attributes, indicating its suitability for a wide range of applications in aerial surveillance and tracking.

The comprehensive evaluation on the UAV20L dataset highlights the robustness and effectiveness of the proposed SiamFC_TDA tracking algorithm. Its superior performance in terms of both precision and success rates, along with its ability to maintain stable tracking under challenging conditions, positions SiamFC_TDA as a leading solution for long aerial video sequence tracking tasks.

4.3. Qualitative evaluation of SiamFC_TDA algorithm performance

Fig 13 illustrates the qualitative results of six tracking algorithms, including SiamFC_TDA, SiamFC [14], SRDCFdecon [30], SRDCF [31], Staple [32], and ASLA [33], on three video sequences. These sequences (Busstation_ce2, Bicycle, and Basketball_ce3) combine multiple challenge attributes, such as occlusion, illumination variation, and background clutter, reflecting the complexity and variability of real-world tracking scenarios. As shown in Fig 13, the primary difficulty arises from occlusion, which commonly occurs when individuals overlap one another or are obscured by other objects. Occlusion not only complicates the tracking process and can lead to the misidentification of targets, but also disrupts the continuity of the time series data, making it difficult for algorithms to maintain consistent and accurate target identification throughout a sequence. The results in Fig 13 highlight the superior performance of SiamFC_TDA compared to the other algorithms, particularly when contrasted with the baseline SiamFC. This qualitative assessment underscores the robustness and effectiveness of the proposed SiamFC_TDA algorithm in handling intricate environments.

Fig 13. Visualization of tracking results of SiamFC_TDA algorithm and comparison algorithm on challenging sequences.

https://doi.org/10.1371/journal.pone.0320746.g013

5. Conclusion

In this paper, we propose a novel multi-template tracking approach that enhances the coherence and consistency of time series data in complex scenes. Our method addresses the limitations of the original Siamese network tracking algorithm, which often struggles with significant variations in target appearance and with environmental conditions such as occlusion and background clutter. To this end, we introduce a multi-template fusion strategy that combines multiple online-updated templates with the initial frame template. The online-updated templates provide the adaptability needed to cope with changes in the target's appearance, while the initial frame template mitigates the effects of occlusion and background noise; fusing these templates corrects for potential template contamination in challenging environments. Furthermore, we propose two key strategies to improve the reliability of the tracking process. First, an interference-aware strategy minimizes the generation of incorrect samples by filtering out unreliable data points, ensuring that only high-quality samples contribute to the tracking process. Second, a sample reliability-aware strategy blocks the propagation of erroneous information to the templates, preventing the degradation of tracking performance due to contaminated data. Experimental evaluations on extensive datasets demonstrate the effectiveness of the proposed algorithm: it enhances the robustness of the Siamese network tracking algorithm in complex scenes and significantly improves its tracking accuracy.

Supporting information

S1 File. Results_TC-128.

The results of the experiment on TC-128.

https://doi.org/10.1371/journal.pone.0320746.s001

(RAR)

S2 File. Results_UAV123.

The results of the experiment on UAV123.

https://doi.org/10.1371/journal.pone.0320746.s002

(RAR)

S3 File. Results_UAV20L.

The results of the experiment on UAV20L.

https://doi.org/10.1371/journal.pone.0320746.s003

(RAR)

References

1. Marvasti-Zadeh SM, Cheng L, Ghanei-Yakhdan H, Kasaei S. Deep learning for visual tracking: A comprehensive survey. IEEE Trans Intell Transp Syst. 2021:1–26.
2. Li P, Wang D, Wang L, Lu H. Deep visual tracking: Review and experimental comparison. Pattern Recogn. 2018;76:323–38.
3. Smeulders AW, Chu DM, Cucchiara R, Calderara S, Dehghan A, Shah M. Visual tracking: An experimental survey. IEEE Trans Pattern Anal Mach Intell. 2014;36(7):1442–68. pmid:26353314
4. Fiaz M, Mahmood A, Javed S, Jung SK. Handcrafted and deep trackers. ACM Comput Surv. 2019;52(2):1–44.
5. Xu X, Liu W, Wang Z, Hu R, Tian Q. Towards generalizable person re-identification with a bi-stream generative model. Pattern Recogn. 2022;132:108954.
6. Yuan X, Xu X, Wang Z, Zhang K, Liu W, Hu R. Searching parameterized retrieval & verification loss for re-identification. IEEE J Sel Top Signal Process. 2023;17(3):560–74.
7. Liu W, Xu X, Chang H, Yuan X, Wang Z. Mix-modality person re-identification: A new and practical paradigm. arXiv preprint. 2024.
8. Wang S, Xu X, Chen H, Jiang K, Wang Z, Tang K. Low-light salient object detection meets the small size. IEEE Trans Emerg Top Comput Intell. 2024.
9. Wang S, Xu X, Ma X, Jiang K, Wang Z. Informative classes matter: Towards unsupervised domain adaptive nighttime semantic segmentation. Proceedings of the 31st ACM International Conference on Multimedia. 2023 Oct:163–72.
10. Xu X, Wang S, Wang Z, Zhang X, Hu R. Exploring image enhancement for salient object detection in low light images. ACM Trans Multimedia Comput Commun Appl. 2021;17(1s):1–19.
11. Li S, Yeung DY. Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. National Conference on Artificial Intelligence. 2017:4140–6.
12. Wu Y, Lim J, Yang MH. Online object tracking: A benchmark. Proc IEEE Conf Comput Vis Pattern Recognit. 2013:2411–8.
13. Wu Y, Lim J, Yang MH. Object tracking benchmark. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1834–48. pmid:26353130
14. Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr PHS. Fully-convolutional Siamese networks for object tracking. Proc Eur Conf Comput Vis. 2016:850–65.
15. Guo Q, Feng W, Zhou C, Huang R, Wan L, Wang S. Learning dynamic Siamese network for visual object tracking. 2017 IEEE International Conference on Computer Vision (ICCV). 2017:1781–9.
16. He A, Luo C, Tian X, Zeng W. A twofold Siamese network for real-time object tracking. Proc IEEE Conf Comput Vis Pattern Recognit. 2018:4834–43.
17. Li B, Yan J, Wu W, Zhu Z, Hu X. High performance visual tracking with Siamese region proposal network. Proc IEEE Conf Comput Vis Pattern Recognit. 2018:8971–80.
18. Wang Q, Zhang L, Bertinetto L, Hu W, Torr PHS. Fast online object tracking and segmentation: A unifying approach. Proc IEEE Conf Comput Vis Pattern Recognit. 2019:1328–38.
19. Wang X, Li C, Bin L, Tang J. SINT++: Robust visual tracking via adversarial positive instance generation. Proc IEEE Conf Comput Vis Pattern Recognit. 2017:4864–73.
20. Valmadre J, Bertinetto L, Henriques JF, Vedaldi A, Torr PHS. End-to-end representation learning for correlation filter based tracking. Proc IEEE Conf Comput Vis Pattern Recognit. 2018:5000–8.
21. Wang Q, Teng Z, Xing J, Gao J, Hu W, Maybank S. Learning attentions: Residual attentional Siamese network for high performance online visual tracking. Proc IEEE Conf Comput Vis Pattern Recognit. 2018:4854–63.
22. Zhu Z, Wang Q, Li B, Wu W, Yan J, Hu W. Distractor-aware Siamese networks for visual object tracking. Proc Eur Conf Comput Vis. 2018:101–17.
23. Nam H, Han B. Learning multi-domain convolutional neural networks for visual tracking. Proc IEEE Conf Comput Vis Pattern Recognit. 2016:4293–302.
24. Choi J, Chang HJ, Fischer T, Yun S, Lee K, Jeong J, et al. Context-aware deep feature compression for high-speed visual tracking. Proc IEEE Conf Comput Vis Pattern Recognit. 2018:479–88.
25. Li Z, Bilodeau G, Bouachir W. Multi-branch Siamese networks with online selection for object tracking. Proc Int Symp Visual Comput. 2018:309–19.
26. Li Z, Bilodeau G, Bouachir W. MFST: Multi-Features Siamese Tracker. Proc Int Conf Pattern Recognit. 2021:8416–22.
27. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Neural Inf Process Syst. 2012;25:1097–105.
28. Liang P, Blasch E, Ling H. Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans Image Process. 2015;24(12):5630–44. pmid:26415202
29. Mueller M, Smith N, Ghanem B. A benchmark and simulator for UAV tracking. Proc Eur Conf Comput Vis. 2016:445–61.
30. Danelljan M, Hager G, Khan FS, et al. Adaptive decontamination of the training set: A unified formulation for discriminative visual tracking. Proc IEEE Conf Comput Vis Pattern Recognit. 2016:1430–8.
31. Danelljan M, Hager G, Khan FS, Felsberg M. Learning spatially regularized correlation filters for visual tracking. 2015 IEEE Int Conf Comput Vis (ICCV). 2015:4310–8.
32. Bertinetto L, Valmadre J, Golodetz S, et al. Staple: Complementary learners for real-time tracking. Proc IEEE Conf Comput Vis Pattern Recognit. 2016:1401–9.
33. Jia X, Lu H, Yang M-H. Visual tracking via adaptive structural local sparse appearance model. Proc IEEE Conf Comput Vis Pattern Recognit. 2012:16–21.