
UAV target tracking method based on global feature interaction and anchor-frame-free perceptual feature modulation

  • Yuanhong Dan,

    Roles Formal analysis

    Affiliation College of Computer Science and Engineering, Chongqing University of Technology, Chongqing, China

  • Jinyan Li ,

    Roles Writing – original draft

    ljy042901@163.com

    Affiliation College of Computer Science and Engineering, Chongqing University of Technology, Chongqing, China

  • Yu Jin,

    Roles Software

    Affiliation College of Computer Science and Engineering, Chongqing University of Technology, Chongqing, China

  • Yong Ji,

    Roles Visualization

    Affiliation College of Computer Science and Engineering, Chongqing University of Technology, Chongqing, China

  • Zhihao Wang,

    Roles Data curation

    Affiliation College of Computer Science and Engineering, Chongqing University of Technology, Chongqing, China

  • Dong Cheng

    Roles Methodology

    Affiliation College of Computer Science and Engineering, Chongqing University of Technology, Chongqing, China

Abstract

Target tracking from the UAV perspective uses onboard cameras to capture video streams and to identify and track specific targets in real time. Deep learning UAV trackers based on the Siamese family have achieved significant results but still struggle to reconcile accuracy with speed. In this study, to refine the feature representation and reduce computational cost, we perform feature fusion on the deep cross-correlation outputs and introduce a global attention mechanism that enlarges the model's field of view and sharpens feature refinement, improving tracking performance on small targets. In addition, we design an anchor-frame-free perceptual feature modulation mechanism that reduces computation, generates high-quality anchors, and optimizes target-box refinement, improving adaptability to target deformation and motion. Comparison experiments with several popular algorithms on UAV tracking datasets, including UAV123@10fps, UAV20L, and DTB70, show that the algorithm balances speed and accuracy. To verify the reliability of the algorithm, we built a physical experimental environment on the Jetson Orin Nano platform and achieved a real-time processing speed of 30 frames per second.

Introduction

In recent years, unmanned aerial vehicles (UAVs) have been widely used in civil and scientific applications due to their small size, easy operation, and flexibility. UAV-based target tracking integrates data processing, communication, target detection and tracking, and automatic control, promoting the rapid development and broad application of UAV target tracking technology. These advances have further expanded UAV applications in environmental detection, agricultural monitoring, and security surveillance. The main generic problem faced by UAV visual tracking is that, once the UAV reaches a certain altitude, fuselage jitter causes large variations in the captured images, leading to target blurring, target occlusion, and other degradations that restrict feature extraction and modeling and may cause the tracker to lose the target. In addition, since trackers must often be deployed on embedded devices, keeping speed and accuracy compatible remains a challenge, making it necessary to improve tracker performance.

Among deep learning-based target tracking methods, algorithms based on the Siamese network framework have achieved remarkable results. Such methods mainly use convolutional neural networks (CNNs) or other deep learning techniques to improve tracking accuracy and robustness. However, the framework suffers from two significant drawbacks. First, the cross-correlation operation used to compute the similarity between the template image and the search region, and thus to locate the target, leaves the tracker short of contextual information and sensitive to scale and shape changes; its limited ability to model nonlinear and complex similarity relationships restricts performance on complex targets or backgrounds. To overcome these shortcomings, SiamAttn introduces an attention mechanism into the feature matching of Siamese networks, significantly improving tracking performance without significantly increasing the computational effort. Similarly, SA-Siam [1] introduces an attention mechanism into the feature extraction process, dynamically adjusting the network's attention to different regions and feature channels and thereby improving tracking accuracy. SiamBAN [2] combines the cross-correlation operation with an attention mechanism in the regression and classification branches and weights the feature maps with a balanced attention module, better highlighting the target region during matching and performing well under background clutter or occlusion. SiamGAT [3] embeds a graph attention network (GAT) into the Siamese framework to model the relationship between the target and its neighborhood, and therefore performs well under target deformation. In addition, SCSA [4] introduces a stacked channel-spatial attention mechanism for visual tracking to improve the efficiency and accuracy of the tracker. Enhancing the feature refinement capability of the network through attention is therefore common practice, and we introduce a global attention mechanism after the cross-correlation between the template image and the search region to improve the efficiency of the model's feature refinement in nonlinear and complex cases.

In addition, anchor boxes are used to generate candidate regions in image processing. However, traditional anchor box designs often suffer from complexity, hyperparameter sensitivity, and high computational overhead, all of which affect the tracker's real-time performance. To solve these problems, subsequent anchor-free algorithms instead generate anchor points. For example, SiamRPN [5] incorporates a region proposal network to handle scale and shape variations better. SiamFC++ [6] uses an anchor-free design that simplifies the model and improves tracking accuracy by directly predicting the target's center position, width, and height without relying on predefined anchor boxes. Ocean [7] is an anchor-free Siamese network that employs a bounding-box-center-based localization method for enhanced target sensing. SiamCAR [8] is an anchor-box-free Siamese tracker that predicts the position and size of a target through separate classification and regression branches without predefined anchor boxes, providing higher tracking accuracy, especially in multi-scale target tracking scenarios. SiamAPN [9] introduces the Anchor Proposal Network (APN), which automatically generates anchor boxes that match the target's characteristics, so that these boxes are closer to the actual target in size and location while remaining computationally efficient. The anchor mechanism therefore mainly affects the efficiency and complexity of the algorithm. By combining the Anchor Proposal Network with perceptual feature modulation, we can train the tracker to handle multi-scale situations without sacrificing efficiency, enabling tracking under target deformation.

To address the problems caused by the cross-correlation operation and the anchor box mechanism while keeping speed and accuracy compatible, we first introduce a global attention mechanism [10] after the cross-correlation of the template branch and the search branch, in order to enhance the global feature-capturing ability of the response map and to suppress background noise. This enables the network to capture the salient features of the target better, improving tracking accuracy on small targets and in complex backgrounds. We then design the anchor-frame-free aware feature modulation (AFAFM) module: after the Anchor Proposal Network generates high-quality anchor box offsets, we combine it with the SAFM [11] module, and by weighting and modulating the features at each spatial location, the network focuses more effectively on the high-frequency information in critical regions and enhances those features while reducing the computational burden on unimportant regions. This approach achieves multi-scale refinement when predicting the target's location and size, simplifies the model, enhances adaptability to target deformation, and preserves efficiency while improving localization accuracy, thereby balancing accuracy and speed.

The main contributions can be summarized as follows:

  • A novel UAV target tracking method is proposed under a deep learning tracking framework based on the Siamese family. This framework combines global feature interaction and anchor-free frame-aware feature modulation techniques.
  • A Global Attention Mechanism (GAM) [10] module is introduced, which enhances the global feature capture capability of the response map by feature refinement of the response map generated by the inter-correlation operation between the template frame and the search region. This improvement optimizes the feature representation of the response map and enables the model to perform target tracking more effectively when facing tiny targets and complex scenes. Meanwhile, the anchor-free frame-aware feature modulation (AFAFM) module is designed to perform spatial position-weighted modulation after the drift of high-quality anchor frames. This makes the features of the candidate frames more prominent, enhancing the difference between the target and the background and better adapting to the target’s shape and scale changes.
  • Comparisons with other tracking methods are made on UAV vision datasets such as UAV123, UAV20L, and DTB70, verifying our algorithm’s advantages in terms of accuracy and speed. Meanwhile, a physical platform was built to verify the algorithm’s practical feasibility and ensure that the speed requirements in real applications are met.

This study designs a new approach to address the above challenges. First, a global attention mechanism (GAM) is introduced to enhance feature representation by considering both spatial and channel information. In addition, we design the anchor-frame-free perceptual feature modulation (AFFPFM) module, which optimizes features and improves target localization accuracy by modulating the feature map at multiple scales. Finally, to verify the practicality of the algorithm, we built a physical platform and validated the algorithm on a real UAV, successfully tracking the target.

Related work

Single-target tracking algorithms fall into two main branches: tracking algorithms based on correlation filtering and tracking algorithms based on deep learning. This study builds on the deep-learning-based Siamese family and therefore reviews related Siamese methods in detail.

The correlation filtering approach significantly improves tracking speed by converting the solution process to the frequency domain, enabling real-time tracking. The core challenge of correlation filter tracking is solving the filter efficiently. In 2010, Bolme et al. introduced correlation filters into target tracking for the first time with the Minimum Output Sum of Squared Error (MOSSE) tracking algorithm [12]. Subsequently, Danelljan et al. proposed the DSST algorithm [13], which handles large scale variations in complex image sequences through a robust scale estimation method. Henriques et al. proposed the kernelized correlation filter (KCF) [14], which reduces storage and computation through the discrete Fourier transform to further improve speed. The DCF correlation tracking algorithm introduces multi-channel features (e.g., HOG features) to further improve tracking performance. Li et al. proposed the AutoTrack algorithm [15], which adds online, automatically adaptive spatio-temporal regularization to the DCF framework, using the local variation of the spatial response map as the spatial regularization. SRDCF [16] optimizes the correlation filter with spatial regularization, effectively suppressing background interference and adapting to target scale changes. STRCF [17] extends SRDCF by applying regularization in both the spatial and temporal dimensions, improving the filter's ability to handle appearance changes and motion uncertainty. Yuan et al., building on the DCF filter, proposed ASTCA [18], which characterizes the dynamic changes of the target during tracking more comprehensively by considering its spatial and temporal characteristics simultaneously. SCSTCF [19] is a correlation-filter-based tracker that improves performance by enhancing the spatial and channel selection capabilities of the filter and introducing temporal regularization. Zhang et al. [20] introduced sparsity and spatial regularization mechanisms to further enhance the tracking performance of correlation filters. The main advantages of such algorithms are that they run in real time, usually require only CPU support, and are easy to deploy on embedded devices. However, compared with deep learning trackers, correlation filtering algorithms usually rely on HOG or grayscale features, which have limited ability to represent complex objects and struggle to capture the deep features of the target adequately. Target loss is more common in complex situations such as illumination change, scale change, fast motion, and severe occlusion, and model robustness is relatively poor. Thus, there is a gap in precision and accuracy compared to deep learning methods.

In addition, there are a number of studies combining filters with deep learning algorithms. For example, self-SDCT [21] combines deep learning and correlation filtering to train a deep neural network through self-supervised learning and utilizes correlation filters for efficient target tracking. Another example is Recapture-SiamCF [22], which combines the traditional correlation filtering method and the Siamese network structure, synthesizing the advantages of both. The recapture mechanism further improves tracking stability and accuracy.

On the other hand, lightweight twin networks of the Siamese family occupy an important place in deep learning-based target-tracking algorithms. Bertinetto et al. proposed the SiamFC algorithm [23], which learns the similarity of targets by training a fully convolutional twin network. The network uses a fully convolutional architecture and inter-correlation operations and performs target tracking through template matching in the inference phase, validating the great potential of Siamese family-based trackers. Thus, an efficient Siamese-based tracker is also a promising option for UAV tracking.

Under the Siamese network branch, attention-based Siamese trackers were the first to be actively explored to enhance feature refinement. SiamAttn incorporated attention mechanisms to improve adaptability to complex backgrounds and target appearance changes through finer-grained feature selection. SA-Siam [1] used spatial and channel attention to enhance the network's ability to discriminate targets from backgrounds. SiamBAN [2] incorporates adaptive attention into a twin network and uses an anchor-free design to better handle changes in target scale. SiamGAT [24] introduces a graph attention mechanism that captures complex relationships between targets and backgrounds by building graph structures on the feature maps, enhancing adaptability to changes in target appearance. Attention thus plays different roles depending on where it is placed in each branch. Transformer structures have also been introduced: CorrFormer [25] combines the Transformer with the cross-correlation operation, using the Transformer for context awareness and cross-correlation for target matching and tracking, and SCATT [26] proposes a Transformer-based tracker whose symmetric cross-attention mechanism fully exploits the Transformer's strengths and enhances context awareness.

Next, regarding the anchor mechanism in the Siamese family, SiamRPN introduces a region proposal network (RPN) on top of SiamFC to generate multiple candidate boxes, which improves localization accuracy, especially in complex backgrounds. DaSiamRPN [27] adopts the anchor mechanism of SiamRPN and extends the negative-sample generation strategy to improve robustness to background interference and appearance changes. SiamRPN++ [28] improves SiamRPN by using deeper features and introducing multi-scale anchors and multi-scale feature fusion, improving tracking accuracy and robustness. SiamMask [29] adds a mask prediction branch to SiamRPN and outputs an accurate segmentation mask of the target, unifying target segmentation and tracking. SiamFC++ [6] adopts an anchor-free design that simplifies the network structure and improves localization flexibility and accuracy by directly regressing the target center and bounding box. SiamCAR [8] simplifies the design of SiamRPN by replacing the RPN with anchor-free classification and regression branches, reducing the complexity of anchor design and providing more efficient and accurate tracking. SiamAPN [9] introduces an Anchor Proposal Network (APN) module to better predict target size and shape, enhancing tracking performance in complex scenarios. The choice of anchor mechanism, and the way the target location is determined after anchors are generated, therefore affect the accuracy of the classification and regression results and, in turn, the tracker's efficiency and speed.

In summary, considering where the attention mechanism is placed and how the anchor structure is designed, we propose an improved UAV target tracking method based on global feature interaction with anchor-frame-free perceptual feature modulation. First, we introduce the global attention mechanism (GAM) after the cross-correlation between the template and search branches of the twin network. By combining spatial attention and channel attention, GAM enhances the global feature-capturing ability of the response map and optimizes the feature representation, so that the model can track the target more efficiently on tiny targets and in complex scenes. In addition, we design the anchor-frame-free aware feature modulation (AFAFM) module, which performs spatial position-weighted modulation after the offsets of high-quality anchor boxes are generated, making the features of candidate boxes more prominent. This not only enhances the difference between the target and the background but also adapts better to the target's shape and scale variations, improving tracking accuracy and robustness.

Global feature interaction with anchorless frame perception feature modulation UAV tracking algorithm

Overall algorithm

This study proposes a UAV target tracking method based on global feature interaction with anchor-frame-free perceptual feature modulation to address the above challenges. A typical twin-network-based target tracking algorithm takes the template frame and the search frame as input, models their features, localizes the target, and outputs the predicted target box. The flow of the algorithm, shown in Fig 1, is divided into the following four parts:

  1. Feature extraction: a twin network framework with AlexNet as the backbone is used, and hierarchical features are extracted from the fourth and fifth layers. The fifth-layer features of the template image are cross-correlated with the fifth-layer features of the search image and fed into the feature fusion network; likewise, the fourth-layer features of the template and search images are cross-correlated and fed into the anchor proposal network, improving the feature expression ability of the model.
  2. Feature fusion: after the cross-correlation of the fourth- and fifth-layer feature maps, the resulting response maps are concatenated. The GAM module is introduced to dynamically adjust the importance of each channel and enhance the quality of the feature representation, which helps capture the critical information of the input and provides robust information for the subsequent classification and regression branches.
  3. AFFPFM network: the anchor-frame-free perceptual feature modulation (AFFPFM) module adopts an anchor-free mechanism, which significantly speeds up the target search. Meanwhile, the SAFM module is introduced to enhance the feature representation by modulating features at multiple scales, enabling more accurate target localization.
  4. Classification and regression network: appropriate loss functions are constructed through a three-layer classification network and a two-layer regression feature fusion network to enhance the classification and regression of the target and generate accurate prediction boxes.
Fig 1. The overview of the GAFFPFM tracker.

It is composed of four subnetworks: the feature extraction network, the feature fusion network, the anchor-frame-free perceptual feature modulation (AFFPFM) network, and the multi-classification regression network.

https://doi.org/10.1371/journal.pone.0314485.g001

First, the feature extraction network uses AlexNet as the backbone and adopts the classical twin network structure: one branch receives the template image as input and the other receives the search image. The feature maps of the fourth and fifth layers are extracted after five layers of convolution. The template branch outputs φ4(z), φ5(z) and the search branch outputs φ4(x), φ5(x). Next, cross-correlation is performed on φ4(z), φ4(x) and on φ5(z), φ5(x) to obtain the response maps R1, R2, which are then concatenated. The concatenated feature map is input to the channel-space interaction module (GAM). This module is designed to enhance the expressive power of the template features, adjust the contextual information, and strengthen the relationship between global and local features to reduce information dispersion. This design makes fuller use of the AlexNet twin network structure: the cross-correlation operation together with the channel-space interaction module improves the feature expression capability and accuracy of the tracking algorithm.
R1 = c1(φ4(z) ⋆ φ4(x)) (1)
R2 = c2(φ5(z) ⋆ φ5(x)) (2)

In Eqs (1) and (2), ⋆ denotes the cross-correlation operation, c1(·) and c2(·) denote the corresponding convolution operations, and R1, R2 represent the resulting response maps. After the deep cross-correlation produces R1 and R2, the global attention enhancement mechanism is introduced to improve the effectiveness of the feature fusion network. Next, R1 is input to the anchor-free mechanism network, together with the multi-scale SAFM network. The SAFM network improves the resolution of the features without affecting the efficiency of the anchor-free network and provides robust information for target regression. Finally, the fused spatial-channel feature information and the output of the anchor-frame-free aware feature modulation network are fed into the classification and regression networks, respectively, to determine the location of the target to be tracked.
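For concreteness, the following is a minimal PyTorch sketch of the depthwise cross-correlation that produces the response maps R1 and R2. The function name xcorr_depthwise and the feature-map shapes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def xcorr_depthwise(search_feat, template_feat):
    """Depthwise cross-correlation: each search channel is correlated with the
    matching template channel, producing one response channel per feature channel."""
    b, c, h, w = search_feat.shape
    search = search_feat.reshape(1, b * c, h, w)
    kernel = template_feat.reshape(b * c, 1, template_feat.size(2), template_feat.size(3))
    resp = F.conv2d(search, kernel, groups=b * c)
    return resp.reshape(b, c, resp.size(2), resp.size(3))

# Illustrative shapes only: correlate layer-4 and layer-5 features and concatenate
# the two response maps before the GAM-based fusion module.
z4, x4 = torch.randn(1, 256, 8, 8), torch.randn(1, 256, 28, 28)
z5, x5 = torch.randn(1, 256, 6, 6), torch.randn(1, 256, 26, 26)
R1 = xcorr_depthwise(x4, z4)        # 21 x 21 response map from layer 4
R2 = xcorr_depthwise(x5, z5)        # 21 x 21 response map from layer 5
fused = torch.cat([R1, R2], dim=1)  # concatenated input to the GAM module
```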

GAM global attention mechanism

In single-target tracking algorithms of the Siamese family, a cross-correlation operation computes the similarity between the template image and the search region to determine the location of the target. However, this approach leaves the tracker short of contextual information, sensitive to scale and shape variations, and limited in its ability to model nonlinear and complex similarity relationships, restricting performance on complex targets or backgrounds. To overcome these problems, we introduce the global attention mechanism (GAM) [10], an enhanced version of the CBAM module [30] capable of capturing global information. After the cross-correlation operation, GAM performs feature refinement, enhances the global feature-capturing ability of the response map, and optimizes the feature representation, enabling the model to track more effectively on tiny targets and in complex scenes. Rahman et al. [4] improved tracking performance by fusing the CBAM module with SiamFC, which enhances the tracking effect but reduces tracking speed. Therefore, to improve the performance of the tracking algorithm, we introduce the GAM module, which is specifically designed to enhance cross-dimensional interactions between features [10]. This module provides vital information for the subsequent classification and regression tasks. GAM enhances the interaction between channel and spatial information by dynamically adjusting the importance of each channel in the feature map; it captures more global contextual information and efficiently selects and emphasizes important features, reducing information loss and improving gradient propagation, which significantly improves the model's feature representation and overall performance.

Specifically, the channel and spatial attention modules are shown in Fig 2, which illustrates how cross-dimensional interactions are enhanced by preserving channel and spatial information. The global attention mechanism (GAM) combines a channel attention sub-module, built from a 3D permutation followed by a multi-layer perceptron (MLP), with a convolutional spatial attention sub-module. The 3D permutation and MLP dynamically adjust the importance of each channel in the feature map and strengthen the channel dependencies between features, while the convolutional spatial attention sub-module captures and enhances the spatial structures in the feature maps. This combined use of channel and spatial attention allows the model to capture global contextual information more efficiently and to select and emphasize essential features, improving the performance of deep neural networks on complex tasks.
P2 = Mc(P1) ⊗ P1 (3)
P3 = Ms(P2) ⊗ P2 (4)

Fig 2. The overview of GAM.

Channel attention and spatial attention are included.

https://doi.org/10.1371/journal.pone.0314485.g002

In Eqs (3) and (4), Mc and Ms denote the channel and spatial attention maps, respectively, and ⊗ denotes element-wise multiplication. P1, P2, and P3 denote the input feature map, the feature map generated by the channel attention module, and the feature map generated by the spatial attention module, respectively. Channel attention weights are applied to each channel of the input feature map to enhance the representation of essential features, and spatial attention weights are applied to each spatial location to enhance the representation of specific spatial regions.

Fig 3 shows the channel attention module. It uses a 3D permutation to preserve information across the three dimensions and then a two-layer multilayer perceptron (MLP), with its strong function-approximation ability, to amplify the cross-dimensional channel-spatial dependencies and generate the channel attention map.

The spatial attention module is shown in Fig 4. In the spatial attention sub-module, two convolution layers fuse spatial features to concentrate the spatial information. To integrate features effectively, improve reconstruction performance, and enhance cross-dimensional interactions, we adopt the same reduction ratio r as BAM to reduce the number of intermediate feature maps, lowering the computation and storage requirements. We avoid pooling operations to preserve the feature maps, further accelerate model convergence through dimensionality reduction and normalization, and finally generate the spatial attention map used for weighting.
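The following is a hedged PyTorch sketch of a GAM-style block as described above: channel attention via a 3D permutation and a two-layer MLP, followed by convolutional spatial attention without pooling. The class name, the reduction rate of 4, and the 7 × 7 kernels are assumptions based on the cited GAM design, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """Sketch of a global attention block: channel attention (permutation + MLP),
    then convolutional spatial attention, both applied multiplicatively."""
    def __init__(self, channels, rate=4):
        super().__init__()
        hidden = channels // rate
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, channels),
        )
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=7, padding=3),
            nn.BatchNorm2d(hidden),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, p1):
        b, c, h, w = p1.shape
        # Channel attention (Eq 3): permute to (B, H*W, C) so the MLP acts on channels.
        att = self.channel_mlp(p1.permute(0, 2, 3, 1).reshape(b, h * w, c))
        att = torch.sigmoid(att.reshape(b, h, w, c).permute(0, 3, 1, 2))
        p2 = p1 * att
        # Spatial attention (Eq 4): two 7x7 convolutions produce a spatial weighting map.
        p3 = p2 * torch.sigmoid(self.spatial(p2))
        return p3

# Example: refine the concatenated response map before the classification/regression heads.
refined = GAM(512)(torch.randn(1, 512, 21, 21))
```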

Anchorless frame mechanism

For target tracking algorithms in UAV scenarios, being lightweight and adaptive not only improves tracking stability from the UAV perspective but also meets the real-time requirements of the algorithm. The anchor-free mechanism in this paper does not tile predefined anchor boxes; instead, an anchor point (i, j) is generated at each position of the response map, corresponding to the location (pi, pj) in the search image x. (5) (6)

In Eqs (5) and (6), ws and hs denote the width and height of the search image, respectively, and s is the stride of the network. Classification and regression branches are then introduced to determine the location and size of the target box; both are derived from the response map R. The classification branch outputs three classification feature maps. The first indicates whether each anchor point is positive or negative: measuring its overlap with the ground truth, we set the threshold to 0.7, so anchors with overlap greater than 0.7 are positive and those below 0.7 are negative. The two-dimensional vectors in the second map denote the foreground and background scores in the search image x: a location is foreground when it falls within the ground-truth box and background otherwise. The third map denotes the quality of each box: (7)

In Eq (7) above, the quantities are the distances from the corresponding location in the search image to the four edges of the bounding box, where (x0, y0) and (x1, y1) are the coordinates of the upper-left and lower-right corners of the ground-truth box, and (x, y) is the position in the search image generated by the anchor point (i, j). The total classification loss function [9] can then be written as Eq (8): (8)

In addition, the regression branch outputs two feature maps from the response map. The first indicates the distance from each location (x, y) in the search image to the borders of the ground-truth box, which can be expressed by Eq (7) above, while the offset between the predicted box and the ground-truth box can be expressed as the following equation: (9) where in Eq (9), gx, gy, gw, gh denote the center coordinates and the width and height of the ground-truth box, respectively, and px, py, pw, ph denote the center coordinates and the width and height of the predicted box, respectively. This yields the overall regression loss function [9]: (10)

In summary, the overall loss function is: (11)
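As a rough illustration of the anchor-free mechanism, the sketch below maps each response-map position to an anchor point in the search image and combines a classification term with a regression term, in the spirit of Eq (11). The offset convention of Eqs (5)-(6), the cross-entropy/smooth-L1 choices, and the weight lam are assumptions; the paper's exact loss forms in Eqs (8)-(10) are not reproduced here.

```python
import torch
import torch.nn.functional as F

def anchor_points(resp_h, resp_w, stride, search_size):
    """Map each response-map position (i, j) to a point in the search image
    (the exact offset convention of Eqs (5)-(6) is assumed, not quoted)."""
    offset = (search_size - (resp_w - 1) * stride) / 2.0
    xs = offset + torch.arange(resp_w, dtype=torch.float32) * stride
    ys = offset + torch.arange(resp_h, dtype=torch.float32) * stride
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    return torch.stack([xx, yy], dim=-1)          # (H, W, 2) pixel coordinates (p_i, p_j)

def total_loss(cls_logits, cls_labels, reg_pred, reg_target, lam=1.0):
    """Combine classification and regression terms as in Eq (11). Cross-entropy,
    smooth-L1 on positive anchors only, and the weight `lam` are assumptions."""
    loss_cls = F.cross_entropy(cls_logits, cls_labels, ignore_index=-1)
    pos = cls_labels == 1                          # anchors whose overlap exceeds 0.7
    if pos.any():
        loss_reg = F.smooth_l1_loss(reg_pred[pos], reg_target[pos])
    else:
        loss_reg = reg_pred.sum() * 0.0
    return loss_cls + lam * loss_reg
```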

Anchor-frame-free perceptual feature modulation

SAFM (spatially-adaptive feature modulation) was originally proposed to learn multi-scale feature representations independently and to dynamically aggregate them for spatial modulation [11], mainly in image super-resolution tasks. In this paper, an anchor-frame-free aware feature modulation module is designed to perform spatial position-weighted modulation after the offsets of high-quality anchor boxes are generated, which makes the features of these candidate boxes more prominent, enhances the difference between the target and the background, and adapts better to the target's shape and scale changes.

In this paper, the SAFM module accepts the feature maps generated by the anchor-free mechanism as input. By operating on these feature maps, the SAFM module can use the information extracted by the neural network more efficiently, improving performance and efficiency. We explore a long-range adaptation mechanism based on multi-scale feature modulation of the anchor-free feature maps. To enhance local context information, we further develop a compact method that improves feature expressiveness and detail capture while effectively exploiting multi-scale feature information, reducing computational complexity, and significantly improving feature quality and overall network efficiency.

The FMM module consists of SAFM and CCM for feature selection; the network can thus be viewed as a unified feature mixing module that selects representative features, which can be expressed as:
X̂ = SAFM(LN(X)) + X (12)
Y = CCM(LN(X̂)) + X̂ (13)
where LN is the LayerNorm layer in Eqs (12) and (13), and X̂ is the variable generated by the intermediate transformation.

As shown in Fig 5, the network mainly consists of a feature mixing module (FMM) and an upsampler layer:

  • A 3 × 3 convolutional layer transforms the input feature map to generate shallow features.
  • These features are fed into the FMM for high-resolution feature map reconstruction. The FMM consists of the spatially-adaptive feature modulation (SAFM) mechanism and a convolutional channel mixer (CCM), while global residuals are introduced to learn high-frequency details.
  • A lightweight upsampling layer performs fast reconstruction.
Fig 5. An overview of the proposed SAFM.

It mainly includes the FMM module.

https://doi.org/10.1371/journal.pone.0314485.g005

Under the SAFM module, the features are divided into four scale groups. Each group is processed by a depthwise convolution at the appropriate scale, the results are aggregated by a 1 × 1 convolutional layer, and a ReLU activation is then applied to realize the feature modulation. The group at scale 0 is processed directly by depthwise convolution, while the remaining three groups are first reduced in resolution by adaptive max pooling, processed by depthwise convolution, and then sampled back to the original size with nearest-neighbor interpolation. Finally, the training process is stabilized by a normalization layer. In CCM, local information is fused by first encoding the spatial local context with a 3 × 3 convolution to enhance local spatial modeling, and then reducing the number of channels back to the original input dimension with a 1 × 1 convolution. (14) (15)

In Eqs (14) and (15), Fe represents the input features, F0 represents the shallow features after convolution that are fed to the FMM module, ℧ represents the upsampler function, and Mθ represents the FMM processing, which generates the reshaped feature map Fre. (16) In Eq (16), f represents the Fourier transform, Fr is the high-quality real feature map, and Fre is the reconstructed feature map.
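A minimal PyTorch sketch of the SAFM block described above follows. It assumes four scale groups, depthwise 3 × 3 convolutions, adaptive max pooling, nearest-neighbor upsampling, and a ReLU-activated 1 × 1 aggregation for the modulation; the channel count is assumed divisible by four, and details may differ from the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAFM(nn.Module):
    """Sketch of spatially-adaptive feature modulation: channels are split into four
    scale groups, processed by depthwise 3x3 convolutions, aggregated by a 1x1
    convolution, activated, and used to modulate the input feature map."""
    def __init__(self, channels, levels=4):
        super().__init__()
        self.levels = levels
        group = channels // levels  # channels assumed divisible by `levels`
        self.dwconvs = nn.ModuleList(
            [nn.Conv2d(group, group, 3, padding=1, groups=group) for _ in range(levels)])
        self.aggregate = nn.Conv2d(channels, channels, 1)
        self.act = nn.ReLU(inplace=True)  # the text above applies ReLU for the modulation

    def forward(self, x):
        h, w = x.shape[-2:]
        parts = x.chunk(self.levels, dim=1)
        outs = [self.dwconvs[0](parts[0])]                    # scale 0: full resolution
        for i in range(1, self.levels):
            # Reduce resolution with adaptive max pooling, convolve, sample back (nearest).
            p = F.adaptive_max_pool2d(parts[i], (max(h // 2 ** i, 1), max(w // 2 ** i, 1)))
            outs.append(F.interpolate(self.dwconvs[i](p), size=(h, w), mode="nearest"))
        modulation = self.act(self.aggregate(torch.cat(outs, dim=1)))
        return x * modulation                                 # spatial feature modulation

# Example: modulate the anchor-free response features before regression.
out = SAFM(256)(torch.randn(1, 256, 21, 21))
```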

Experimental analysis

Implementation details and evaluation criterion

Implementation details.

The experimental platform was configured as follows: a 12th Gen Intel(R) Core(TM) i7-12700K CPU with an RTX 3080 Ti graphics card, running Ubuntu 18.04. The algorithm was implemented in PyTorch 1.10 under Python 3.6. The training data came from the COCO [31] dataset, which contains 91 object categories and 2.5 million labeled instances.

The network uses AlexNet as the backbone. The parameters of the first two convolutional layers were frozen and loaded from a pre-trained model, and the last three convolutional layers were fine-tuned. The feature extraction network was kept frozen for the first ten epochs while the anchor-free mechanism network, the feature fusion network, and the feature and spatio-temporal augmentation networks were trained; for the last 30 epochs the whole network was trained end-to-end. The learning rate was reduced from 0.005 to 0.0005 with a momentum of 0.9. The template and search image sizes are 127 × 127 and 287 × 287 pixels, respectively. For a short yet fair comparison of precision (P) and success (S), we evaluated against the baseline under identical conditions: the compared SiamAPN and our algorithm were both trained on the same COCO training set and tested on the same test sets, a common and effective way to ensure data consistency.
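The following sketch summarizes this training setup in PyTorch. The parameter-name prefixes (conv1, conv2), the weight decay, the total epoch count of 40 (10 frozen plus 30 end-to-end), and the exponential decay schedule are assumptions beyond what is stated above.

```python
import torch

def build_optimizer(model, base_lr=5e-3, momentum=0.9, weight_decay=1e-4):
    # Freeze the first two backbone convolution layers, which are loaded from the
    # pre-trained model (the name prefixes are an assumption about the code layout).
    for name, param in model.backbone.named_parameters():
        if name.startswith(("conv1", "conv2")):
            param.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.SGD(trainable, lr=base_lr, momentum=momentum, weight_decay=weight_decay)

# Assumed schedule: 10 epochs with the backbone frozen, then 30 epochs end-to-end,
# with the learning rate decayed exponentially from 0.005 to 0.0005 over 40 epochs.
TEMPLATE_SIZE, SEARCH_SIZE = 127, 287
FROZEN_EPOCHS, TOTAL_EPOCHS = 10, 40

def build_scheduler(optimizer):
    gamma = (5e-4 / 5e-3) ** (1.0 / (TOTAL_EPOCHS - 1))
    return torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
```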

Evaluation criterion.

In order to fully evaluate the performance of our algorithm, we use the following evaluation metrics:

  1. Precision of OPE: the one-pass evaluation (OPE) protocol is used. The tracker is initialized in the first frame at the ground-truth position of the target, and after running the tracking algorithm we compute the center position error between the predicted box and the ground-truth box. Precision is the proportion of frames whose center error is below a specified threshold (e.g., 20 pixels).
  2. Success Plot of OPE: the success rate is calculated from the intersection over union (IoU) between the predicted box and the ground-truth box, as the proportion of frames with IoU greater than or equal to a specified threshold. Plotting the success rate over a range of thresholds visualizes performance under different IoU criteria.
  3. Frames Per Second (FPS): FPS measures the number of frames the algorithm processes per second. A higher frame rate means the algorithm is closer to real time and better able to meet real-time tracking requirements. A minimal sketch of how these metrics can be computed is given below.
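The sketch below uses NumPy and assumes boxes are stored as (x, y, w, h) arrays with one row per frame; the AUC reported in the success plots averages the success rate over IoU thresholds.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are (x, y, w, h) per row."""
    cp = pred[:, :2] + pred[:, 2:4] / 2.0
    cg = gt[:, :2] + gt[:, 2:4] / 2.0
    return np.linalg.norm(cp - cg, axis=1)

def iou(pred, gt):
    """Intersection over union of (x, y, w, h) boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-9)

def precision_at(pred, gt, thresh=20.0):
    """Proportion of frames whose center error is below `thresh` pixels."""
    return float((center_error(pred, gt) <= thresh).mean())

def success_at(pred, gt, thresh=0.5):
    """Proportion of frames whose IoU is at least `thresh`."""
    return float((iou(pred, gt) >= thresh).mean())
```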

State-of-the-Art comparison

Experiment on UAV123@10fps dataset.

The UAV123@10fps dataset contains 123 sequences, each captured at ten frames per second. The large scene variations between frames add multiple challenges, including fast motion, low resolution, aspect ratio variations, background clutter, fast camera movement, motion blur, and occlusion, making UAV123@10fps more demanding for evaluating tracker performance than the original UAV123 dataset; it is well suited to assessing trackers on low-frame-rate UAV videos. To validate the tracking effectiveness of our algorithm GAFFPFM, as shown in Table 1 and Fig 6, we compare it with 25 other trackers, including SiamAPN [9], TADT [32], SiamFC [23], UDT [33], AutoTrack [15], ARCF [34], ARCF_F [34], CSRDCF [35], ECO_HC [36], STRCF [17], MCCT_H [37], Staple [38], Staple_CA [38], SRDCF_decon [39], SRDCF [16], BACF [40], KCC, SAMF [41], SAMF_CA [41], fDSST [42], DSST [43], LCT [44], CN, DCF, and KCF [14]. These comparisons include classical deep learning trackers (e.g., SiamAPN, SiamFC, UDT, AutoTrack) as well as classical filter-based trackers (e.g., ARCF, SRDCF, BACF, STRCF, KCF). The results demonstrate the superior performance of our tracker in terms of precision and area under the curve (AUC). Compared with the baseline algorithm SiamAPN, our precision score improves by 2.6% and our AUC score improves by 3.4%, which fully demonstrates our advantages in feature refinement and perceptual modulation.

Table 1. Precision and success scores on the UAV123@10fps dataset.

https://doi.org/10.1371/journal.pone.0314485.t001

Experiment on UAV20L dataset.

The UAV20L dataset contains 20 long-duration tracking video sequences of over 2,000 frames each, totaling over 58,000 frames, covering a wide range of scenario designs and providing a basis for real-world tracking challenges. The dataset is suitable for evaluating tracker performance in long-duration tracking, especially in the face of target disappearance, reappearance, and long-term environmental changes. To validate the performance of our algorithm GAFFPFM in such scenarios, as shown in Table 2 and Fig 7, we compare it with 25 other trackers, including SiamAPN [9], TADT [32], SiamFC [23], UDT [33], AutoTrack [15], ARCF [34], ARCF_F [34], CSRDCF [35], ECO_HC [36], STRCF [17], MCCT_H [37], Staple [38], Staple_CA [38], SRDCF_decon [39], SRDCF [16], BACF [40], KCC, SAMF [41], SAMF_CA [41], fDSST [42], DSST [43], LCT [44], CN, DCF, and KCF [14]. These comparison algorithms include classical deep learning methods (e.g., SiamAPN, SiamFC, UDT, AutoTrack, etc.) as well as traditional filter-based tracking algorithms (e.g., ARCF, SRDCF, BACF, STRCF, KCF, etc.). The results show that our tracker performs superiorly in terms of both Precision and area under the curve (AUC). In the context of long-time tracking, our Precision improves by 6.4%, and our AUC score improves by 6.1%, compared to the baseline algorithm, SiamAPN, demonstrating the high performance of our algorithm in dealing with target disappearance and reappearance as well as long-term environmental changes.

Table 2. Precision and success scores on the UAV20L dataset.

https://doi.org/10.1371/journal.pone.0314485.t002

Experiment on DTB70 dataset.

The DTB70 dataset consists of 70 video sequences that feature targets and backgrounds of high complexity and challenge, including multiple factors interfering, diverse target classes, scene diversity, and long tracking times. Therefore, this dataset is particularly suitable for evaluating the performance of the tracker in the presence of complex backgrounds and interferences, testing the robustness of the tracking algorithms. As shown in Table 3 and Fig 8, we compare GAFFPFM with 25 other trackers, including SiamAPN [9], TADT [32], SiamFC [23], UDT [33], AutoTrack [15], ARCF [34], ARCF_F [34], CSRDCF [35], ECO_HC [36], STRCF [17], MCCT_H [37], Staple [38], Staple_CA [38], SRDCF_decon [39], SRDCF [16], BACF [40], KCC, SAMF [41], SAMF_CA [41], fDSST [42], DSST [43], LCT [44], CN, DCF, and KCF [14]. Compared to the baseline algorithm SiamAPN, our algorithm improves by 0.3% and 1% in Precision and AUC, respectively, showing the enhancement of tracker performance by perceptual feature modulation and feature refinement in complex contexts.

Table 3. Precision and success scores on the DTB70 dataset.

https://doi.org/10.1371/journal.pone.0314485.t003

Meanwhile, to compare tracking speed, Table 4 shows that filtering-based algorithms generally have the advantage of high speed; KCF, for example, is remarkably fast. Among deep learning trackers, however, our method reaches more than 100 frames per second and ranks fifth on UAV123@10fps, UAV20L, and DTB70. Compared with the algorithms above, our tracker greatly improves and maintains tracking speed while retaining leading precision and AUC, reflecting the advantage of the anchor-free mechanism and the superiority of feature modulation and feature optimization in detail capture and feature refinement, and demonstrating that our algorithm strikes a balance between efficiency and accuracy.

Table 4. Comparison of tracking speed on UAV123@10fps, UAV20L, and DTB70 for different algorithms.

https://doi.org/10.1371/journal.pone.0314485.t004

Ablation experiments

Table 5 and Fig 9 show the comparison results for baseline, baseline + GAM, baseline + SAFM, and baseline + GAM + SAFM on the UAV-view challenge datasets UAV123@10fps and UAV20L. On UAV123@10fps, the baseline algorithm trained on the COCO dataset under the same conditions achieves a precision (P) of 0.699 and a success rate (S) of 0.486. With global attention augmentation, the baseline reaches a precision of 0.700 and a success rate of 0.488 under the same data and conditions. With SAFM, the baseline keeps a precision of 0.699 while the success rate rises to 0.507. When the two enhancement mechanisms are applied together, precision and success rate increase significantly to 0.716 and 0.513, respectively, as shown in Table 5 and Fig 9. The corresponding results on the UAV20L dataset are reported in the same table.

Table 5. Performance comparison of different combinations on UAV123@10FPS and UAV20L datasets.

https://doi.org/10.1371/journal.pone.0314485.t005

Qualitative comparison

As shown in Fig 10, to compare against the deep learning baseline SiamAPN, we evaluated our algorithm and SiamAPN on the UAV123@10fps dataset using the same training data. The number in the upper-left corner of each image is the frame index; the yellow box is the prediction of our algorithm, the green box is the ground-truth box, and the blue box is the prediction of SiamAPN. For pedestrian prediction (frame 64), our algorithm stays closer to the green ground-truth box. For vehicle prediction, e.g., frame 228, the blue box shrinks and distorts as the vehicle is about to leave the field of view, whereas the yellow box of our algorithm still overlaps the ground-truth box well. In addition, at frame 242 for the car and frame 514 for the swimmer's rowing boat, our algorithm accurately predicts the target location and maintains overlap with the ground-truth box despite the large changes between consecutive frames.

As shown in Fig 11, two scenes selected from the UAV20L dataset (frames 179 and 732) show a small target car and a pedestrian walking under a dynamically changing viewpoint. The yellow box of our algorithm follows the spatio-temporal transformation more accurately and stays closer to the ground-truth box.

Physical verification

For this purpose, we built a complete UAV live system, as shown in Fig 12, to deploy and run our algorithm in a real application and verify its usability. The actual flow of the whole system is shown in Fig 13. First, the first frame containing the target is input to our algorithm. The UAV then checks its status to confirm that it has taken off normally and that the target tracking algorithm is running correctly. The system then selects and boxes the target. The UAV acquires the pixel information, performs attitude solving to calculate the distance between itself and the target, and sends velocity commands to fly toward it. When the target position is updated, the UAV adjusts its flight direction while keeping the nose heading unchanged, adopting a headless mode to better delineate the quadrant and determine the exact position. The whole process runs in a loop until the distance between the UAV and the target is less than a set threshold, indicating that the UAV has arrived directly above the target, at which point tracking ends.
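The loop below sketches this control flow. The interface names (get_frame, track, solve_offset, send_velocity) and the stop threshold are placeholders for the actual flight-control stack and tracker, not a real SDK.

```python
STOP_DISTANCE = 0.5   # metres (assumed); tracking ends once the UAV is roughly above the target

def follow_target(uav, tracker, first_frame, first_box):
    """Sketch of the follow loop in Fig 13. `uav` and `tracker` stand in for the
    flight-control interface and the GAFFPFM tracker; all method names are placeholders."""
    tracker.init(first_frame, first_box)          # box the selected target in the first frame
    while True:
        frame = uav.get_frame()                   # grab the current camera image
        box = tracker.track(frame)                # predicted target box in pixels
        dx, dy = uav.solve_offset(box)            # attitude solving: pixel box -> ground offset
        uav.send_velocity(dx, dy, headless=True)  # headless mode keeps the nose heading fixed
        if (dx ** 2 + dy ** 2) ** 0.5 < STOP_DISTANCE:
            break                                 # target reached: UAV is directly above it
```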

After the UAV takes off, the target is boxed, as shown in Fig 14; we select the target after takeoff, realizing target acquisition and then target tracking. As shown in Fig 15, the motion trajectories of the UAV and the target object under simulation are shown separately and overlap. In the illustration, red indicates the position change of the UAV and blue indicates the position change of the target. At the beginning of the virtual flight, we define the direction of the UAV's nose as the X-axis, the direction perpendicular to it as the Y-axis, and the takeoff point as the coordinate origin; in the figure, the horizontal axis is the X-axis and the vertical axis is the Y-axis. The figure shows that the UAV recognizes the target after takeoff and follows it, with the two trajectories coinciding. Also shown is the actual trajectory of the UAV and the target in the physical flight. The curves are rather messy because changes in the UAV's attitude during actual flight cause frequent changes in the target's position in the field of view. In the upper-right corner of the figure, the UAV fails to follow the target accurately due to the target's rapid position change and the UAV's delayed response. However, after the takeoff point (0, 0), the UAV gradually follows the target and shows an overall trend of tracking it.

Conclusion

In this paper, a UAV target tracking method based on global feature interaction with anchor-frame-free perceptual feature modulation is proposed. In the tracking subsystem, a channel-space interaction mechanism is introduced and a real-time, multi-scale feature modulation network with an anchor-free mechanism is constructed for UAV target tracking, giving finer information representation and feature refinement. On the three datasets UAV123@10fps, UAV20L, and DTB70, the precision (P) scores reached 0.716, 0.659, and 0.690, and the success (S) scores reached 0.513, 0.469, and 0.485, respectively. At the same time, we constructed a physical UAV flight platform to validate the practical deployment and reliability of our algorithm. The experimental results show that the method performs well in terms of realism and effectiveness of target tracking. In future work, we will also consider combining active learning with deep learning [45] to optimize the training process of the tracking algorithm and improve the learning efficiency and performance of the model.

However, in the actual control subsystem, we found that the UAV's flight was not smooth enough and that the handling of target loss could be improved. In the future, we plan to smooth the tracking behavior by tuning the PID parameters that control the UAV's attitude, angle, and speed. At the same time, we will employ Kalman filtering to predict the target's position, compensating for the delays introduced by decision-making, coordinate transmission, and aircraft response, so as to track the target synchronously. These improvements aim to build a complete UAV target tracking system and to strengthen its relevance and effectiveness in practical applications.

References

  1. 1. He, A., Luo, C., Tian, X., & Zeng, W. A twofold siamese network for real-time object tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 4834-4843.
  2. 2. Chen, Z., Zhong, B., Li, G., Zhang, S., & Ji, R. Siamese box adaptive network for visual tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 6668-6677.
  3. 3. Lu J., Li S., Guo W., Zhao M., Yang J., Liu Y., & Zhou Z. Siamese graph attention networks for robust visual object tracking. Computer Vision and Image Understanding, 2023, 229: 103634.
  4. 4. Rahman M. M., Fiaz M., & Jung S. K. Efficient visual tracking with stacked channel-spatial attention learning. IEEE Access, 2020, 8: 100857–100869.
  5. 5. Li, B., Yan, J., Wu, W., Zhu, Z., & Hu, X. High performance visual tracking with siamese region proposal network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 8971-8980.
  6. 6. Xu, Y., Wang, Z., Li, Z., Yuan, Y., & Yu, G. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines. Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(07): 12549-12556.
  7. 7. Zhang, Z., Peng, H., Fu, J., Li, B., & Hu, W. Ocean: Object-aware anchor-free tracking. Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16. Springer International Publishing, 2020: 771-787.
  8. 8. Guo, D., Wang, J., Cui, Y., Wang, Z., & Chen, S. SiamCAR: Siamese fully convolutional classification and regression for visual tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 6269-6277.
  9. 9. Fu, C., Cao, Z., Li, Y., Ye, J., & Feng, C. Siamese anchor proposal network for high-speed aerial tracking. 2021 IEEE International Conference on Robotics and Automation (ICRA), 2021: 510-516.
  10. 10. Liu, Y., Shao, Z., & Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv preprint arXiv:2112.05561, 2021.
  11. 11. Sun, L., Dong, J., Tang, J., et al. Spatially-adaptive feature modulation for efficient image super-resolution. Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023: 13190-13199.
  12. 12. Bolme, D. S., Beveridge, J. R., Draper, B. A., et al. Visual object tracking using adaptive correlation filters. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE, 2010: 2544-2550.
  13. 13. Danelljan, M., Häger, G., Khan, F., et al. Accurate scale estimation for robust visual tracking. British Machine Vision Conference, Nottingham, September 1-5, 2014. BMVA Press, 2014.
  14. 14. Henriques J. F., Caseiro R., Martins P., et al. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014, 37(3): 583–596.
  15. 15. Li, Y., Fu, C., Ding, F., et al. AutoTrack: Towards high-performance visual tracking for UAV with automatic spatio-temporal regularization. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020: 11923-11932.
  16. 16. Danelljan, M., Hager, G., Shahbaz Khan, F., & Felsberg, M. Learning spatially regularized correlation filters for visual tracking. Proceedings of the IEEE International Conference on Computer Vision, 2015: 4310-4318.
  17. 17. Li, F., Tian, C., Zuo, W., Zhang, L., & Yang, M. H. Learning spatial-temporal regularized correlation filters for visual tracking. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 4904-4913.
  18. 18. Yuan D., Chang X., Li Z., & He Z. Learning adaptive spatial-temporal context-aware correlation filters for UAV tracking. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2022, 18(3): 1–18.
  19. 19. Zhang J., Feng W., Yuan T., Wang J., & Sangaiah A. K. SCSTCF: Spatial-channel selection and temporal regularized correlation filters for visual tracking. Applied Soft Computing, 2022, 118: 108485.
  20. 20. Zhang J., He Y., & Wang S. Learning adaptive sparse spatially-regularized correlation filters for visual tracking. IEEE Signal Processing Letters, 2023, 30: 11–15.
  21. 21. Yuan D., Chang X., Huang P. Y., Liu Q., & He Z. Self-supervised deep correlation tracking. IEEE Transactions on Image Processing, 2020, 30: 976–985. pmid:33259298
  22. 22. Zhang J., Sun J., Wang J., Li Z., & Chen X. An object tracking framework with recapture based on correlation filters and siamese networks. Computers & Electrical Engineering, 2022, 98: 107730.
  23. 23. Bertinetto, L., Valmadre, J., Henriques, J. F., et al. Fully-convolutional siamese networks for object tracking. Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II 14. Springer International Publishing, 2016: 850-865.
  24. 24. Guo, D., Shao, Y., Cui, Y., Wang, Z., Zhang, L., & Shen, C. Graph attention tracking. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021: 9543-9552.
  25. 25. Zhang J., He Y., Chen W., Kuang L. D., & Zheng B. CorrFormer: Context-aware tracking with cross-correlation and transformer. Computers and Electrical Engineering, 2024, 114: 109075.
  26. 26. Zhang J., Chen W., Dai J., & Zhang J. SCATT: Transformer tracking with symmetric cross-attention. Applied Intelligence, 2024: 1–16.
  27. 27. Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., & Hu, W. Distractor-aware siamese networks for visual object tracking. Proceedings of the European Conference on Computer Vision (ECCV), 2018: 101-117.
  28. 28. Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., & Yan, J. SiamRPN++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 4282-4291.
  29. 29. Hu W., Wang Q., Zhang L., Bertinetto L., & Torr P. H. SiamMask: A framework for fast online object tracking and segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(3): 3072–3089. pmid:37022470
  30. 30. Woo, S., Park, J., Lee, J. Y., et al. CBAM: Convolutional Block Attention Module. Proceedings of the European Conference on Computer Vision (ECCV), 2018: 3-19.
  31. 31. Lin, T. Y., Maire, M., Belongie, S., Bourdev, L., Girshick, R., Hays, J., et al. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), 2014: 740-755.
  32. 32. Li, X., Ma, C., Wu, B., He, Z., & Yang, M. H. Target-aware deep tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 1369-1378.
  33. 33. Wang, N., Song, Y., Ma, C., Zhou, W., Liu, W., & Li, H. Unsupervised deep tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 1308-1317.
  34. 34. Huang, Z., Fu, C., Li, Y., Lin, F., & Lu, P. Learning aberrance repressed correlation filters for real-time UAV tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019: 2891-2900.
  35. 35. Lukezic, A., Vojir, T., Čehovin Zajc, L., Matas, J., & Kristan, M. Discriminative correlation filter with channel and spatial reliability. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6309-6318.
  36. 36. Danelljan, M., Bhat, G., Shahbaz Khan, F., & Felsberg, M. ECO: Efficient convolution operators for tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6638-6646.
  37. 37. Zhang, T., Xu, C., & Yang, M. H. Multi-task correlation particle filter for robust object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 4335-4343.
  38. 38. Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., & Torr, P. H. Staple: Complementary learners for real-time tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 1401-1409.
  39. 39. Danelljan, M., Hager, G., Shahbaz Khan, F., & Felsberg, M. Convolutional features for correlation filter based visual tracking. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2015: 58-66.
  40. 40. Kiani Galoogahi, H., Fagg, A., & Lucey, S. Learning background-aware correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, 2017: 1135-1143.
  41. 41. Li, Y., & Zhu, J. A scale adaptive kernel correlation filter tracker with feature integration. In Computer Vision-ECCV 2014 Workshops: Zurich, Switzerland, September 6-7 and 12, 2014, Proceedings, Part II 13. Springer International Publishing, 2014: 254-265.
  42. 42. Danelljan, M., Häger, G., Khan, F., & Felsberg, M. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, Nottingham, September 1-5, 2014. BMVA Press, 2014.
  43. 43. Danelljan M., Häger G., Khan F. S., & Felsberg M. Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(8): 1561–1575. pmid:27654137
  44. 44. Ma, C., Yang, X., Zhang, C., & Yang, M. H. Long-term correlation tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 5388-5396.
  45. 45. Yuan D., Chang X., Liu Q., Yang Y., Wang D., Shu M., et al. Active learning for deep visual tracking. IEEE Transactions on Neural Networks and Learning Systems, 2023.