Abstract
Few-shot learning techniques have enabled the rapid adaptation of a general AI model to various tasks using limited data. In this study, we focus on class-agnostic low-shot object counting, a challenging problem that aims to achieve accurate object counting with only a few annotated samples (few-shot) or even in the absence of any annotated data (zero-shot). Existing methods focus primarily on enhancing performance, while relatively little attention is given to inference time, an equally critical factor in many practical applications. We propose a model that achieves real-time inference without compromising performance. Specifically, we design a multi-scale hybrid encoder to enhance feature representation and optimize computational efficiency. This encoder applies self-attention exclusively to high-level features and cross-scale fusion modules to integrate adjacent features, reducing training costs. Additionally, we introduce a learnable shape embedding and an iterative exemplar feature learning module that progressively enriches exemplar features with class-level characteristics by learning from similar objects within the image, which is essential for improving subsequent matching performance. Extensive experiments on the FSC147, Val-COCO, Test-COCO, CARPK, and ShanghaiTech datasets demonstrate our model's effectiveness and generalizability compared to state-of-the-art methods.
Citation: Yang Q, Liu B, Tian Y, Shi Y, Du X, He F, et al. (2025) An efficient low-shot class-agnostic counting framework with hybrid encoder and iterative exemplar feature learning. PLoS One 20(6): e0322360. https://doi.org/10.1371/journal.pone.0322360
Editor: Panos Liatsis, Khalifa University of Science and Technology, UNITED ARAB EMIRATES
Received: November 30, 2024; Accepted: March 20, 2025; Published: June 6, 2025
Copyright: © 2025 Yang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The FSC147 dataset can be downloaded as instructed in FamNet: https://github.com/cvlab-stonybrook/LearningToCountEverything. The CARPK dataset is available at: https://lafi.github.io/LPN/. The ShanghaiTech Dataset is available at: https://github.com/desenzhou/ShanghaiTechDataset. The source code is available at: https://github.com/Erica-Yang/Efficient-low-shot-counting.
Funding: This research was funded by Special projects in key areas of ordinary colleges and universities in Guangdong Province (2021ZDZX1074), School-level scientific research project of Guangdong Institute of Petrochemical Technology (72100003152) and Technology Innovation for Science and Technology-based Small and Medium-sized Enterprises (2022DZXHT039). R&D Program of Beijing Municipal Education Commission (No. KM202211417005) and Supported by the Academic Research Projects of Beijing Union University (No. ZK90202106). The funders’ roles included investigation, analysis, and provision of resources.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Deep learning technologies have become an essential component across various fields [1, 2]. To tackle the challenges posed by dynamic changes in task diversity, few-shot learning [3, 4] facilitates rapid model adaptation with limited annotated data, significantly enhancing the flexibility and operational efficiency of systems. Moreover, some applications extend beyond object recognition, requiring quantitative or density-based analyses to fulfill specific task requirements. In this paper, we focus on low-shot object counting (LSC), which encompasses both few-shot counting (FSC) and zero-shot counting (ZSC).
Object counting aims to count objects of interest within an image. Early studies primarily focused on counting specific categories, such as crowds [5–7], vehicles [8, 9], animal species [10–12], and plants [13, 14]. However, these methods typically demand extensive annotated datasets for training and exhibit limited adaptability for counting objects in novel categories, thus limiting their applications.
To overcome these limitations, GMN [15] proposed class-agnostic counting (CAC), which aims to count any class of object based on given exemplars. Following the release of the challenging FSC147 dataset [16], few-shot object counting (FSC) has garnered significant attention in the research community [17–31], aiming to enable counting in arbitrary categories using only a few exemplars. This advancement enables models to generalize to unseen classes, offering promising applications across diverse scenarios.
As illustrated in Fig 1, in LSC, the model is trained on base classes using only a few (or none) exemplars and tested on novel classes under low-shot settings. ZSC is a reference-less, class-agnostic approach aimed at counting the most frequently occurring class in an image.
Most current FSC methods follow an extract-then-match paradigm: first extracting features from both the image and exemplars, processing them, and then matching them by calculating the similarity between the query image and exemplar features. The resulting similarity map is then regressed to generate a density map, with the sum of the density map representing the object count. These methods primarily differ in how image embeddings and exemplar-feature similarities are constructed. ZSC methods follow a similar paradigm but rely on identifying exemplars based on repeated objects in the image [18] or via attention mechanisms [19].
However, as model performance improves, these methods have become increasingly complex, leading to higher training costs and slower inference speeds. Real-time performance is critical for practical applications, particularly in UAV operations, yet existing CAC methods seldom address or thoroughly analyze this aspect.
We analyze existing low-shot counting methods, comparing their model sizes and inference speeds. Through our analysis, we find that limitations in the Transformer attention modules' ability to process image feature maps often lead to high training costs, slow convergence, and restricted feature spatial resolution. To overcome these issues, we propose a hybrid encoder that reduces training costs and inference time. Specifically, we use a hybrid encoder to process multi-scale features from the backbone, which is critical for accommodating various object sizes. Multi-head self-attention (MHA) is applied only to top-level image features, which are reduced to a small fraction of the original image size. Given that computational complexity increases sharply with feature size, this strategy reduces computational cost significantly compared to vanilla attention applied at multiple levels (e.g., S3, S4, and S5). We then employ a two-path, top-down and bottom-up approach using convolutional neural networks (CNNs) to fuse features across different levels.
To ensure accurate exemplar matching, we iteratively update the exemplar features by learning from the query image features. First, we extract shape information from the exemplar, project it to a high-dimensional space, and fuse the exemplar features with the shape embeddings via cross-attention. The fused exemplar features then interact with the image features via cross-attention, learning from other similar objects in the image to enrich the exemplar features with class-level characteristics, thus improving subsequent matching accuracy.
To address overlap issues in the ground truth (GT) density map, which arise from considering only the shape size of a few-shot exemplar in an image, we propose a method to dynamically generate a more accurate GT density map by combining both object shape and distance information among GT points in the image, providing more precise supervisory information during training.
Our contributions can be summarized as follows:
- We propose a method to adaptively generate a more accurate GT density map by combining both object shape and distance information among GT points in the image.
- We propose a novel low-shot counting model incorporating an effective hybrid encoder, which achieves real-time inference without compromising high performance.
- Experiments on counting benchmarks demonstrate the effectiveness of our approach. Compared to state-of-the-art density-based method [24], our model achieved substantially lower latency (18.22 ms vs. 44.77 ms on the validation set, 17.92 ms vs. 45.37 ms on the test set) and reduced MAE by 9.5% and 6.3% on the validation and test sets, respectively.
Related works
Early object counting tasks were approached with class-specific detectors, which could accurately locate object positions. However, these detection-based methods struggle in occluded and crowded scenarios. To address this limitation, regression-based methods [10, 32, 33] were developed, treating counting as a supervised regression task. Under this paradigm, most research focuses on optimizing model architectures [33], multi-scale strategies [34, 35], or novel learning targets [32, 36]. These approaches only require point annotations, which are less labor-intensive than detection-based methods that rely on bounding box annotations. However, all of these methods require large training datasets and are limited to specific classes, making them less generalizable to unseen classes.
Few-shot counting (FSC) has gained significant attention due to its ability to count objects using only a few exemplars as references, with adaptability to novel classes during testing. GMN [15] introduced a generic matching network architecture for class-agnostic counting, extracting exemplar and image features in a two-stream fashion, pooling the exemplar features to 1×1 spatial dimensions, and concatenating with the query features for regression into a density map. To address the unreliable location precision caused by direct concatenation, CFOCNet [17] drew on Siamese network principles from object tracking [37], using the exemplar feature as a 2D kernel to convolve over the query feature map to compute similarity. FamNet [16] introduced the widely used FSC147 dataset for FSC research and proposed a Siamese backbone adaptation strategy to improve correlation robustness during testing. BMNet [21] proposed a similarity-aware framework that jointly learns representations and similarity metrics end-to-end, using self-attention to reduce intra-class appearance variability in the test image. SAFECount [23] introduced a similarity-aware feature enhancement block that compares exemplar and query image features to create a score map, subsequently generating a reliable similarity map. This is followed by a feature enhancement module that uses similarity values as weighting coefficients to integrate the support features into the query image features. CounTR [22] proposed a Transformer-based architecture [38] that uses cross-attention to fuse image and exemplar features and employs a two-stage training regimen, starting with self-supervised pre-training and followed by supervised fine-tuning. LOCA [24] separately considers exemplar shape and appearance features, iteratively adapting them into object prototypes, while DAVE [30] implements a detect-and-verify paradigm that generates a high-recall detection set and verifies detections to filter out outliers.
More recently, with the advancement of large language models, some studies have explored open-world object counting, integrating both visual and language modality features. In this paradigm, objects of interest can be specified by text (class names or descriptions), exemplars, or both. ZSC [26], CLIP-Count [25], and VLCounter [28] target zero-shot object counting, while CounTX [27] can directly predict object counts using an inference image and an arbitrary object class description. Although these methods do not yet perform as well as previous approaches, they show considerable potential.
Our approach falls within density-based methods, using state-of-the-art LOCA [24] as our baseline for comparison.
Methods
Preliminaries
We adopt the general settings of few-shot counting (FSC), training our model on base classes and evaluating it on novel classes, with no overlap between the two sets of classes. The ground truth annotations for each image comprise the center points of all objects within the target classes, along with the bounding boxes of K (K-shot) representative exemplars; the ground truth density map is generated from these points. Given a query image I, the counting model predicts a density map, and summing the values of the predicted density map yields the estimated object count in I for the specified class.
Model architecture
As shown in Fig 2, our model architecture proceeds through five modules: (i) image feature extraction (backbone), (ii) image feature enhancement (hybrid encoder), (iii) exemplar feature learning (iterative exemplar feature update), (iv) exemplar-image matching (similarity maps), and (v) density regression (decoder, generating the density map).
The Hybrid Encoder is a module that enhances image features. The i-EFL (iterative Exemplar Feature Learning) module iteratively enriches exemplar features from the image. The symbol '*' denotes convolution with the enhanced exemplar features as the kernel, '×' denotes multiplication by the softmax weights, and '+' denotes combining the weighted similarity maps into the response map.
The input image is resized to H×W pixels, and image augmentation techniques, including tiling, color jittering, and horizontal flipping, are applied [22]. Multi-scale image features are extracted using a Swin Transformer [39] backbone. To reduce the computational cost of attention mechanisms, our hybrid encoder applies attention only to high-level features, complemented by cross-scale fusion with CNN layers. The features from different levels are then upsampled to the same size, concatenated, and projected to generate the encoded image features FI.
Next, for the K-shot exemplars, RoIAlign is used to obtain exemplar features. In the iterative exemplar feature learning (i-EFL) module, a learnable shape embedding is fused with exemplar features using multi-head cross-attention (MHCA) and then enriched with image features via another MHCA, producing exemplar features with class-level commonalities.
In the matching module, image features are depth-wise correlated with the enhanced exemplar features, generating a similarity tensor. The similarity tensors of the exemplars are reweighted by their softmax scores and combined to produce a joint response tensor (Response Map).
Finally, a progressive up-sampling regression head in the decoder module predicts the final density map, where the sum of its values represents the total count of objects in the image relevant to the exemplars.
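To make the matching and regression steps concrete, below is a minimal PyTorch-style sketch (not the paper's exact implementation): each enhanced exemplar feature acts as a depth-wise correlation kernel over the encoded image features, the resulting similarity maps are re-weighted by softmax scores and summed into a joint response map, and a small decoder regresses the response map into a density map whose sum gives the count. The tensor shapes, the 1×1 prototype assumption, the scoring layer, and the simplified decoder are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MatchAndRegress(nn.Module):
    """Sketch of exemplar-image matching and density regression (shapes and layers are assumptions)."""
    def __init__(self, channels=256):
        super().__init__()
        self.score = nn.Linear(channels, 1)           # scores each exemplar prototype
        self.decoder = nn.Sequential(                 # simplified progressive up-sampling head
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 1, 1), nn.ReLU(),
        )

    def forward(self, f_img, f_exm):
        # f_img: (B, C, H, W) encoded image features; f_exm: (B, K, C) enhanced exemplar features
        # With 1x1 prototypes, depth-wise correlation followed by summing over channels
        # reduces to an inner product between each prototype and every spatial location.
        sim = torch.einsum("bchw,bkc->bkhw", f_img, f_exm)                 # (B, K, H, W) similarity maps
        w = torch.softmax(self.score(f_exm).squeeze(-1), dim=1)            # (B, K) exemplar weights
        response = (sim * w[:, :, None, None]).sum(dim=1, keepdim=True)    # joint response map
        density = self.decoder(response)                                   # predicted density map
        count = density.flatten(1).sum(dim=1)                              # count = sum of the density map
        return density, count

# usage (toy shapes): density, count = MatchAndRegress()(torch.randn(2, 256, 64, 64), torch.randn(2, 3, 256))
```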
Ground truth density map generation
The density map is essential in density-based counting models. To more effectively capture shape features, we adaptively generate the ground truth density map based on object size and inter-point distances. Specifically, we calculate the average height and width of objects from the K-shot annotated boxes, which determine the window size of the Gaussian function used to generate the density map. As illustrated in the left image of Fig 3, the density map corresponding to an object accounts for both its height and width, rather than merely generating a circular region. However, when objects vary widely in size and distribution density, as shown in the right image of Fig 3, densely packed regions in the generated density map may suffer from significant overlap, leading to inaccuracies. To address this, we use a hybrid density generation method that accounts for both object shape and neighbor distances. For each point, we compute the distance to all other points, taking the average of the three smallest values as its distance metric d. Based on this metric, the window size is dynamically adjusted and a Gaussian density map is generated. Points in the image are grouped by the value of d, and our experiments indicate that two groups are sufficient for accurate density map generation. The steps are as follows:
The window size is adaptively adjusted based on point distances to resolve the overlap issue described above. A threshold hyperparameter filters points by the ratio of object size to the distance metric d, while a scale factor adjusts the window-size parameter of the Gaussian filter. As shown in the second row of Fig 3, the left image is generated using Eq (2) and the middle image using Eq (3). Combining these maps produces a final density map in which the overlap problem is alleviated, resulting in a more accurate ground truth density map.
Hybrid encoder.
In the FSC147 dataset [16], object sizes vary significantly, making multi-scale features essential for improving counting accuracy, accelerating training convergence, and enhancing performance [40]. Despite the strong performance of Transformers, one key limitation is their high computational and memory requirements for large numbers of key elements.
To address this, we introduce a hybrid encoder module combining High-level Features Self-Attention (HFSA) and Cross-level Features Fusion (CFF). The encoder structure is shown in Fig 4. High-level features (S5) occupy only a small fraction of the input image resolution, yet contain richer semantic information, capturing relations among conceptual entities and thereby aiding object localization and recognition. As analyzed in RT-DETR [51], applying self-attention to high-level features captures conceptual connections, enhancing subsequent modules' ability to recognize objects. Lower-level intra-scale interactions, however, are unnecessary due to the lack of semantic concepts and the risk of redundancy and confusion with high-level feature interactions.
The encoder comprises a High-Level Features Self-Attention (HFSA) mechanism and a Cross-Level Features Fusion (CFF) module; the lower part of the figure shows the detailed structure of the CFF module.
In the CFF module, two adjacent scale features are fused via top-down and bottom-up pathways. Finally, we resize all feature levels to a consistent size, concatenate them, and project to a lower dimension, yielding the final image features FI. This process can be summarized as FI = Proj(Concat(Resize(F3), Resize(F4), Resize(F5))), where F3, F4, and F5 denote the fused features at each level.
In the CFF module, we employ a content-guided attention (CGA)-based mixup fusion scheme [49], which effectively fuses features and enhances gradient flow. As shown in Fig 4, CGA generates channel-specific spatial importance maps (SIMs), producing an exclusive SIM for each input channel in a coarse-to-fine manner. This approach integrates both channel and spatial attention weights, ensuring effective information interaction and guiding the model to focus on significant regions within each channel.
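The following is a minimal sketch of how the hybrid encoder can be organized, assuming all backbone levels have already been projected to a common channel dimension: multi-head self-attention is applied only to the flattened top-level features (HFSA), and adjacent levels are then fused with convolutions before concatenation and projection. The plain-convolution fusion stands in for the CGA-based mixup fusion, and the layer sizes and single top-down path are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridEncoderSketch(nn.Module):
    """HFSA on the top level + simplified cross-level fusion (CFF stand-in; bottom-up path omitted)."""
    def __init__(self, channels=256, heads=8):
        super().__init__()
        self.hfsa = nn.TransformerEncoderLayer(d_model=channels, nhead=heads,
                                               dim_feedforward=1024, batch_first=True)
        self.fuse54 = nn.Conv2d(2 * channels, channels, 3, padding=1)  # fuses S5 into S4
        self.fuse43 = nn.Conv2d(2 * channels, channels, 3, padding=1)  # fuses fused S4 into S3
        self.proj = nn.Conv2d(3 * channels, channels, 1)               # final projection to FI

    def forward(self, s3, s4, s5):
        # s3, s4, s5: multi-scale backbone features (finest to coarsest), assumed projected to C channels
        B, C, H5, W5 = s5.shape
        # self-attention only on the smallest, semantically richest level (HFSA)
        t5 = self.hfsa(s5.flatten(2).transpose(1, 2)).transpose(1, 2).reshape(B, C, H5, W5)
        # top-down fusion with CNNs (CGA-based mixup fusion replaced by plain convolutions here)
        p4 = self.fuse54(torch.cat([s4, F.interpolate(t5, size=s4.shape[-2:], mode="bilinear")], dim=1))
        p3 = self.fuse43(torch.cat([s3, F.interpolate(p4, size=s3.shape[-2:], mode="bilinear")], dim=1))
        # resize all levels to a common size, concatenate, and project to the encoded image features FI
        size = s4.shape[-2:]
        levels = [F.interpolate(x, size=size, mode="bilinear") for x in (p3, p4, t5)]
        return self.proj(torch.cat(levels, dim=1))
```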
Iterative exemplar features learning.
The structure of the iterative exemplar feature learning module is illustrated in Fig 5. It is almost the same as the object prototype extraction module in LOCA [24], and we elaborate on its details below.
In the K-shot setting, given K bounding boxes as support exemplars, we first obtain K exemplar features Fexm via RoIAlign [41]. To learn the shape information of each exemplar, we apply a mapping function that maps each object's height and width to match the shape of the exemplar features. This function is implemented as a multilayer perceptron (MLP) with three linear layers, each followed by ReLU activations. After mapping, we obtain the shape embeddings.
Next, we fuse the exemplar features and shape embeddings using multi-head cross-attention to generate refined exemplar features. These features are further fused with the image features FI through cross-attention, querying objects similar to the exemplars in the image to learn class-level characteristics and generate prototype-like exemplar features FP. To reduce computational complexity, we replace this cross-attention with linear attention. To acquire more generalized exemplar features, we iteratively update FP by repeating the above steps. Since the query image contains multiple objects of the same class as the exemplar, this iterative approach enables the model to capture additional similarities within the same category, yielding generalized exemplar features. The iterative process of the algorithm is as follows:
Algorithm 1: Iterative update of the exemplar features.
Input:
Exemplar shape embeddings
Exemplar features: Fexm
Image features: FI
Iterative update, for iteration k:
1. Fuse the shape embeddings into the current exemplar features via multi-head cross-attention (MHCA).
2. Enrich the fused exemplar features with the image features FI via a second MHCA (implemented with linear attention), producing the prototype-like exemplar features FP.
3. Use FP as the exemplar features for the next iteration.
4. Then repeat step 1.
where MHCA denotes multi-head cross-attention. In our experiments, we performed two iterations, resulting in three generalized exemplar features for subsequent matching operations. During training, the intermediate exemplar features from each iteration are used alongside the final feature to assist in model training. For testing, only the final feature is used for matching.
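A minimal PyTorch-style sketch of the i-EFL loop is given below: box sizes are mapped by a three-layer MLP into shape embeddings, fused into the exemplar features by multi-head cross-attention, and then enriched from the image features by a second cross-attention, with the update repeated. Standard `nn.MultiheadAttention` is used in place of the linear attention mentioned above, and the pooled exemplar-feature shape and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class IterativeExemplarLearning(nn.Module):
    """Sketch of the i-EFL module: shape fusion + image-feature enrichment, repeated."""
    def __init__(self, channels=256, heads=8, iters=2):
        super().__init__()
        self.iters = iters
        self.shape_mlp = nn.Sequential(               # (h, w) -> shape embedding
            nn.Linear(2, channels), nn.ReLU(),
            nn.Linear(channels, channels), nn.ReLU(),
            nn.Linear(channels, channels), nn.ReLU(),
        )
        # standard MHCA used here; the paper replaces the image cross-attention with linear attention
        self.shape_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.image_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, f_exm, box_hw, f_img):
        # f_exm: (B, K, C) pooled RoIAlign exemplar features; box_hw: (B, K, 2) heights/widths
        # f_img: (B, C, H, W) encoded image features
        f_shape = self.shape_mlp(box_hw)                              # (B, K, C) shape embeddings
        img_tokens = f_img.flatten(2).transpose(1, 2)                 # (B, H*W, C)
        outputs, f_p = [], f_exm
        for _ in range(self.iters + 1):
            fused, _ = self.shape_attn(f_p, f_shape, f_shape)         # fuse shape info into exemplars
            f_p, _ = self.image_attn(fused, img_tokens, img_tokens)   # learn from similar objects in the image
            outputs.append(f_p)                                       # intermediate features aid training
        return outputs                                                # the last element is used at test time
```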
Training losses
The training loss is defined as the Mean Squared Error (MSE, or $\ell_2$ loss) between the predicted and ground-truth density maps in pixel space:

$$\mathcal{L}_2 = \frac{1}{N_p}\left\lVert \hat{D} - D \right\rVert_2^2,$$

where Np represents the number of objects computed from the ground-truth density map, $\hat{D}$ is the model's predicted density map, and $D$ is the ground-truth density map.

Additionally, the intermediate enhanced exemplar features are matched with the image features to generate auxiliary density maps. The auxiliary loss is calculated as:

$$\mathcal{L}_{aux} = \frac{1}{N_p}\sum_{j=1}^{2}\left\lVert \hat{D}_j - D \right\rVert_2^2.$$

Thus, the total loss is

$$\mathcal{L} = \mathcal{L}_2 + \lambda\,\mathcal{L}_{aux},$$

where $\lambda$ is the auxiliary loss weight, set experimentally (see the ablation in Table 12).
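A small sketch of the resulting training objective, assuming normalization by the object count for both terms (the auxiliary weight value shown is only a placeholder; the paper selects it by the ablation in Table 12):

```python
import torch

def counting_loss(pred_density, aux_densities, gt_density, aux_weight=0.3):
    """Main MSE loss plus weighted auxiliary loss from intermediate density maps (sketch).

    pred_density, gt_density: (B, 1, H, W); aux_densities: list of (B, 1, H, W).
    aux_weight is a placeholder value, not the paper's chosen setting.
    """
    n_obj = gt_density.sum().clamp(min=1.0)                 # object count from the GT density map
    main = ((pred_density - gt_density) ** 2).sum() / n_obj
    aux = sum(((d - gt_density) ** 2).sum() / n_obj for d in aux_densities)
    return main + aux_weight * aux
```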
Experiments
Datasets and metrics
FSC147.
The FSC-147 dataset [16] was introduced for few-shot object counting tasks and comprises 6,135 images across a wide range of 147 object categories, ranging from kitchen utensils and office stationery to vehicles and animals. The dataset exhibits a significant variation in object counts, with images containing between 7 and 3,731 objects and an average of 56 objects per image. In each image, every object instance is annotated with a dot marking its approximate center. Additionally, three object instances per image are randomly selected as exemplars, each accompanied by axis-aligned bounding box annotations.
Following [16], we partitioned the dataset into 89 object categories for training, 29 for validation, and 29 for testing. These subsets comprise 3,659 images in the training set, 1,286 images in the validation set, and 1,190 images in the test set. Fig 6 illustrates the width and height distributions of objects across these sets. Notably, while many objects in FSC-147 are small, the dataset demonstrates a substantial range in object sizes.
The second row visualizes the relationship between width and height across these datasets. The model is trained on 89 classes and evaluated on 29 classes for both validation and testing.
CARPK & ShanghaiTech.
To further evaluate the generalizability of the model, we introduce two datasets designed for specific class counting. The CARPK dataset [8] is a class-specific counting benchmark focused exclusively on car instances. It consists of 1,448 drone-captured images from four different parking lots, with approximately 90,000 annotated car instances. The ShanghaiTech Part B dataset [50], commonly used for crowd counting, contains 716 images with a total of 88,488 annotated human instances.
Metrics.
In line with prior studies [15, 16, 21, 23, 24], we assess the counting method's performance using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE), defined as follows:

$$\mathrm{MAE} = \frac{1}{N_t}\sum_{i=1}^{N_t}\left|C_i - \hat{C}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N_t}\sum_{i=1}^{N_t}\left(C_i - \hat{C}_i\right)^2},$$

where Nt is the number of test images, and Ci and $\hat{C}_i$ represent the ground-truth and predicted object counts for the i-th test image, respectively.
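For concreteness, both metrics can be computed from per-image counts as in the short sketch below (array names are illustrative):

```python
import numpy as np

def mae_rmse(gt_counts, pred_counts):
    """Mean Absolute Error and Root Mean Squared Error over per-image object counts."""
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    mae = np.mean(np.abs(gt - pred))
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    return mae, rmse

# example: mae_rmse([12, 30, 7], [10.5, 28.0, 9.2])
```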
Implementation details
We resized the images to 512×512 and applied data augmentation techniques including color jittering, horizontal flipping, and tiling. Tiling was applied to images, bounding boxes, and density maps. This augmentation technique divides the image into smaller pieces (tiles), applies random transformations (e.g., horizontal flipping, scaling) to each piece, and then recombines them. This approach increases dataset diversity and enhances model generalization. The ground-truth density map for each image was generated using a Gaussian kernel function, with the kernel size set to a fraction of the average size of the exemplar bounding boxes.
In low-shot counting (LSC) methods, the classes in the training and test sets are different. To preserve generic semantic information learned from pre-trained models, the backbone is often initialized with weights trained on large datasets and then frozen (or partially frozen). Previous methods such as GMN [15], FamNet [16], and LOCA [24] use ResNet50 pre-trained on ImageNet [47] as the backbone, while CounTR [22] and CACViT [29] use a Vision Transformer (ViT) [38] pre-trained on ImageNet. In this paper, we employ Swin-T [39] as the backbone, initializing its parameters with those of GroundingDINO [42], an open-vocabulary object detection model pre-trained on large-scale datasets. We trained our model using the AdamW optimizer [43] with a learning rate of 10^-4 and weight decay of 10^-4, running on an RTX A4000 GPU with a batch size of 4, completing training in under six hours.
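A minimal sketch of this optimizer configuration; the `backbone` attribute and the freezing logic are assumptions for illustration, while the learning rate and weight decay follow the settings above:

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """AdamW with the stated learning rate and weight decay; backbone freezing is illustrative."""
    for p in model.backbone.parameters():    # assumes the model exposes a `backbone` attribute
        p.requires_grad = False              # keep the pre-trained backbone frozen
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-4, weight_decay=1e-4)
```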
Comparison with state-of-the-art methods
Our model is evaluated on the FSC147 benchmark [16], following the standard evaluation protocols [15, 16, 21, 23, 24]. We compare the mean absolute error (MAE) and root mean squared error (RMSE) with other approaches.
Few-shot counting.
In the few-shot setting, each query image is accompanied by three support exemplars during both training and inference. Table 1 summarizes the results across different methods on the validation and test sets, including details on model backbone, image resolution, performance, and inference latency. To enhance generalization to novel object counting, we freeze the backbone in both the training and testing phases. Although including small objects in high-resolution images aids accuracy, it also increases computational complexity, particularly for transformer-based methods. To balance performance and efficiency, we use a resolution of 512 for query images. As shown in Table 1, our model surpasses all others on the validation set and ranks second on the test set. Moreover, it achieves more than double the inference speed of the baseline LOCA [24], making it suitable for real-time applications. CACViT [29], which has the best performance on the test set, is designed only for few-shot counting, unlike our model, which can also perform zero-shot counting.
One-shot counting.
In the one-shot scenario, only a single object annotation is provided. Table 2 presents a comparison of various methods: our model outperforms others on the validation set and is second best on the test set, while maintaining lower latency, demonstrating robustness even with limited data.
Zero-shot counting.
For zero-shot counting, where reference target annotations are missing, we used the zero-shot handling method from LOCA [24]. This method modifies the iterative module by removing Step 1 in Algorithm 1. Additionally, the exemplar embedding in Step 2 is initialized using learnable object queries. For more details, please refer to LOCA [24]. As shown in Table 3, our model substantially outperforms all methods, with relative improvements of 29.0% and 11.5% in MAE on the validation and test sets, respectively, setting a solid new state-of-the-art.
Qualitative analysis under various counting settings.
To assess qualitative performance, we selected images with diverse target sizes and densities, overlaying predicted density maps onto original images for visualization. As illustrated in Fig 7, our method adapts effectively to different scenarios, including dense scenes, colorful environments, and large depth variations. These advantages demonstrate the model's ability to accurately localize and count objects.
The selected images contain a variety of challenges. These include images with large variations in target size, target density and target depth.
Cross-dataset generalization
Few-shot transfer to CARPK dataset with and without fine-tuning.
Following established protocols in [16, 22–24, 29], we evaluated our model's cross-dataset generalization by training on the FSC147 dataset [16] and testing on the CARPK dataset [8]. To focus on generalization, we excluded the 'car' class from FSC147's training set. In the few-shot scenario, twelve exemplars were sampled as supervised annotations.
For comparison, we conducted two evaluations: one without fine-tuning and one with fine-tuning. The results are presented in Table 4. Our model achieves better cross-dataset generalization under both evaluations. From the penultimate row of Table 4, it can be seen that directly fine-tuning the model pre-trained without excluding the car class also works on the CARPK dataset. When fine-tuning the model pre-trained with the car class excluded, our method outperforms the current state-of-the-art with a relative improvement of 22.8% in MAE and 24.3% in RMSE on the validation set.
Qualitative results on CARPK.
For CARPK, images originally sized 1280×720 were resized to a lower resolution to reduce computational load. Fig 8 visualizes counting results, illustrating our model's ability to accurately localize and count objects of various shapes, sizes, and densities.
Few-shot transfer to ShanghaiTech dataset.
We also evaluated our method on crowd counting datasets to assess its generalizability. Consistent with the settings in SAFECount, only five support images were randomly sampled from the training set and fixed for both fine-tuning and testing. The results are presented in Table 5. While the performance of our model is slightly lower, the number of epochs required for fine-tuning is significantly reduced (35 vs. 1000).
The qualitative results on ShanghaiTech are shown in Fig 9. The five given exemplars are shown on the top. We choose images containing different numbers of people for comparison. The positions of individuals are represented by dots, and the predicted total count is shown in the bottom-right corner.
Convergence speed during fine-tuning.
We compared performance using a pre-trained model with and without the car class. Table 6 shows that our model converges quickly, reaching optimal performance within four fine-tuning epochs.
Comparison with object detectors on Val-COCO and Test-COCO
Val-COCO and Test-COCO [16] are FSC-147 subsets derived from COCO, designed for evaluating object counting models. We compared our model against several detection-based models, including Faster-RCNN [45], RetinaNet [46], and Mask-RCNN [41], as well as recent few-shot models such as FamNet [16], BMNet+ [21], CounTR [22], and LOCA [24]. Results for direct evaluation on Val-COCO and Test-COCO (without fine-tuning) using our pre-trained model are shown in Table 7, from which we find that our model achieves better performance.
Ablation study
Ablation on the adaptive generation of ground truth density map.
We propose a dynamic adjustment of the Gaussian window parameters based on the inter-point distances in high-density regions, thereby generating improved density maps. The steps are shown in Algorithm 2. The threshold used to select high-density points and the weight parameter 1.6 used in adjusting the Gaussian parameter are set based on experiments: we conducted experiments for the threshold within the range of [0.2, 0.5] with an interval of 0.05, and for the weight parameter within the range of [1, 2] with an interval of 0.1. The optimal values for both parameters were selected based on the results. These ablation experiments were all conducted on the FSC147 dataset under the few-shot setting.
We compare the results before and after applying the dynamic adjustment of density maps in Table 8. From the table, it can be observed that the dynamically generated density maps improved the model’s performance on both the validation and test sets, demonstrating the effectiveness of the module.
To further validate the method, we refer to the CACViT [29] method, dividing images into subsets based on the number of targets they contain. The low-density subset includes images with 8–37 targets, while the high-density subset contains 37–3701 targets. The experimental results are shown in Table 9.
Algorithm 2. Dynamic adjustment of Gaussian parameters for density map generation.
Input:
The distribution map of target center points P in the image.
The average height and width of the objects computed from the given K-shot annotated boxes.
The distance metric d of all points in the image.
Output:
Dynamically adjusted density map.
Steps:
1. Compute the global mean of the distance metric d.
2. Roughly select points that may belong to high-density regions using the threshold condition in Eq (1).
3. Classify the image into high-density and low-density regions based on Eq (1).
4. Generate the density map for low-density regions using Eq (2).
5. Generate the density map for high-density regions: a) Calculate the average point distance within the high-density regions. b) Dynamically adjust the Gaussian gamma parameter based on this distance, scaled by the weight parameter 1.6. c) Apply the adjusted gamma to generate the high-density map using Eq (3).
6. Combine the low-density and high-density maps to obtain the final density map.
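The sketch below illustrates the spirit of Algorithm 2 under stated assumptions: per-point neighbor distances split points into high- and low-density groups, low-density points use an anisotropic Gaussian window derived from the average exemplar box size, and high-density points use a window shrunk according to their neighbor distances. The exact conditions and formulas of Eqs (1)-(3) are not reproduced here; the threshold 0.3 and weight 1.6 are example values within the ranges reported above, and the sigma scaling is an assumption.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def adaptive_density_map(points, avg_h, avg_w, shape, thr=0.3, weight=1.6):
    """Sketch of shape- and distance-adaptive GT density map generation.

    points: (N, 2) array of (y, x) object centers; avg_h/avg_w: mean exemplar box size;
    shape: (H, W) of the output map. thr and weight are example hyperparameter values.
    """
    H, W = shape
    # distance metric d: mean distance to the (up to) three nearest neighbors
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    k = min(3, max(len(points) - 1, 1))
    d = np.sort(dist, axis=1)[:, :k].mean(axis=1)

    obj_size = (avg_h + avg_w) / 2.0
    density = np.zeros(shape, dtype=np.float32)
    for (y, x), di in zip(points.astype(int), d):
        single = np.zeros(shape, dtype=np.float32)
        single[np.clip(y, 0, H - 1), np.clip(x, 0, W - 1)] = 1.0
        if di / obj_size < thr:
            # high-density point: shrink the Gaussian window according to neighbor distance
            sigma = max(weight * di / 4.0, 1.0)
        else:
            # low-density point: anisotropic window follows the average exemplar box size
            sigma = (avg_h / 4.0, avg_w / 4.0)
        density += gaussian_filter(single, sigma)
    return density
```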
Encoder ablation.
To evaluate the effectiveness of our hybrid encoder, we experimented with three encoder structures: one using standard multihead attention (MHA), another using multi-scale deformable attention (MSDA), and the proposed encoder. These module structures are illustrated in Fig 10.
We conduct experiments on the FSC147 dataset under few-shot settings to compare the performance of various encoders. Each input image is resized to 512×512 pixels, and multi-scale features are extracted from the backbone at different stages (S3, S4, and S5). Given an input image, the three encoders process these features as follows:
(a) Standard transformer attention (3 layers), applied to the multi-scale features;
(b) Multi-scale deformable attention (3 layers), applied to the multi-scale features;
(c) The proposed hybrid encoder (1 layer), which applies self-attention only to S5, followed by cross-level fusion.
We evaluated the three encoders on the FSC147 validation set, comparing their MAE, parameter counts, and latency. The results are summarized in Table 10. Although our hybrid encoder includes a CNN-based fusion module that slightly increases the parameter count, it achieves low latency without compromising performance.
We also tracked the GPU and CPU memory required by LOCA and our method while training on FSC147 in the few-shot setting. As shown in Table 11, for the same training setup, our model requires considerably less GPU memory and moderately less CPU memory.
Ablation study on the exemplar feature learning module.
To evaluate the effectiveness of the i-EFL module, we analyzed the response maps generated with and without it. After matching the exemplar features with the image features, the model produces a response map, which we project for visual comparison. As shown in Fig 11, the features are learned more effectively with the enhanced exemplar features.
Comparison of results across different categories.
As shown in Fig 6, the distribution of the number of objects across categories in the FSC147 dataset is highly imbalanced. We compared and visualized the average MAE for each category in the validation and test sets, with the results presented in Fig 12. From the figure, we can observe that the performance has slightly improved for categories with highly imbalanced densities, such as books, chairs and shirts.
Analysis of the auxiliary loss weight setting.
The auxiliary loss is defined as the sum of two intermediate-layer errors. To balance it with the primary loss, we conducted experiments using auxiliary-loss weight values ranging from 0 to 0.5 in increments of 0.1. These experiments were carried out on the FSC147 dataset under few-shot learning scenarios. The results, summarized in Table 12, demonstrate that incorporating the auxiliary loss enhances the model's performance. However, an excessive auxiliary loss weight can negatively impact the optimization of the primary task.
Convergence speed and performance comparison.
LOCA [24] represents the state-of-the-art in density-based methods. We compare our model’s convergence speed and performance during training under different few-shot settings. As shown in Fig 13, our model achieves faster convergence and improved performance across various settings.
We compare the MAE and latency across different methods, as shown in Fig 14. Our model achieves the best performance and lowest latency on the validation set. On the test set, although our model’s performance ranks second to CACViT, it demonstrates significantly lower latency (17.93 ms vs. 64.57 ms).
Discussion
The application of AI technology for analyzing and processing images holds substantial potential for enhancing data-driven decision-making across various domains. This work addresses the generalized visual object counting problem, specifically counting objects from arbitrary categories using only a few exemplars. To balance performance and inference speed, we design a hybrid encoder module that reduces model complexity and an iterative exemplar feature enhancement module that boosts performance. Additionally, we employ a synthetic approach to generate more accurate ground-truth density maps. Experiments on FSC-147, Val-COCO, and Test-COCO demonstrate that our method meets real-time requirements while achieving high accuracy, and experiments on CARPK and ShanghaiTech demonstrate the model's generalizability.
Supporting information
S1 Fig. Different domain images in FSC147 dataset.
https://doi.org/10.1371/journal.pone.0322360.s001
(TIF)
S2 Fig. Application in different domain results.
https://doi.org/10.1371/journal.pone.0322360.s002
(TIF)
References
- 1. Tampuu A, Matiisen T, Kodelja D, Kuzovkin I, Korjus K, Aru J, et al. Multiagent cooperation and competition with deep reinforcement learning. PLoS One. 2017;12(4):e0172395. pmid:28380078
- 2. Wang Q, Bi S, Sun M, Wang Y, Wang D, Yang S. Deep learning approach to peripheral leukocyte recognition. PLoS One. 2019;14(6):e0218808. pmid:31237896
- 3. Wang M, Cai Y, Gao L, Feng R, Jiao Q, Ma X, et al. Study on the evolution of Chinese characters based on few-shot learning: From oracle bone inscriptions to regular script. PLoS One. 2022;17(8):e0272974. pmid:35984774
- 4. Zhang Y, Fang M, Wang N. Channel-spatial attention network for fewshot classification. PLoS One. 2019;14(12):e0225426. pmid:31830065
- 5.
Cao XK, Wang ZP, Zhao YY, Su F. Scale aggregation network for accurate and efficient crowd counting. Computer Vision - ECCV 2018, Pt V. 2018;11209:757–73.
- 6. Szarka N, Biljecki F. Population estimation beyond counts-Inferring demographic characteristics. PLoS One. 2022;17(4):e0266484. pmid:35381028
- 7.
Arteta C, Lempitsky V, Noble JA, Zisserman A. Interactive object counting. Computer Vision - ECCV 2014, Pt III. 2014;8691:504–18.
- 8.
Hsieh MR, Lin YL, Hsu WH. Drone-based object counting by spatially regularized regional proposal network. In: 2017 IEEE International Conference on Computer Vision (ICCV), 2017. p. 4165–73.
- 9.
Mundhenk TN, Konjevod G, Sakla WA, Boakye K. A Large contextual dataset for classification, detection and counting of cars with deep learning. Computer Vision - ECCV 2016, Pt III. 2016;9907:785–800.
- 10.
Arteta C, Lempitsky V, Zisserman A. Counting in the wild. Computer Vision - ECCV 2016, Pt VII. 2016;9911:483–98.
- 11. Linchant J, Lhoest S, Quevauvillers S, Lejeune P, Vermeulen C, Semeki Ngabinzeke J, et al. UAS imagery reveals new survey opportunities for counting hippos. PLoS One. 2018;13(11):e0206413. pmid:30427890
- 12. Zavrtanik V, Vodopivec M, Kristan M. A segmentation-based approach for polyp counting in the wild. Eng Appl Artif Intel. 2020;88.
- 13. Lu H, Cao Z, Xiao Y, Zhuang B, Shen C. TasselNet: counting maize tassels in the wild via local counts regression network. Plant Methods. 2017;13:79. pmid:29118821
- 14. Madec S, Jin X, Lu H, De Solan B, Liu S, Duyme F, et al. Ear density estimation from high resolution RGB imagery using deep learning technique. Agricultural and Forest Meteorology. 2019;264:225–34.
- 15.
Lu E, Xie WD, Zisserman A. Class-agnostic counting. Computer Vision - ACCV 2018, Pt III. 2019;11363:669–84.
- 16.
Ranjan V, Sharma U, Nguyen T, Hoai M. Learning to count everything. In: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, 2021. p. 3393–402.
- 17.
Yang SD, Su HT, Hsu WH, Chen WC. Class-agnostic few-shot object counting. In: IEEE Wint Conf Appl, 2021:869–77.
- 18.
Ranjan V, Nguyen MH. Exemplar Free Class Agnostic Counting. Computer Vision - ACCV 2022, Pt IV. 2023;13844:71–87.
- 19.
Hobley M, Prisacariu V. Learning to count anything: reference-less class-agnostic counting with weak supervision. arXiv preprint 2022. https://arxiv.org/abs/2205.10203
- 20.
Nguyen T, Pham C, Nguyen K, Hoai M. Few-shot object counting and detection. Computer Vision, ECCV 2022, Pt XX. 2022;13680:348–65.
- 21.
Shi M, Lu H, Feng C, Liu CX, Cao ZG. Represent, compare, and learn: a similarity-aware framework for class-agnostic counting. Proc CVPR IEEE. 2022. p. 9519–28.
- 22.
Liu C, Zhong Y, Zisserman A, Xie W. Countr: transformer-based generalised visual counting. In: British Machine Vision Conference (BMVC), 2022.
- 23.
You Z, Yang K, Luo W, Lu X, Cui L, Le X. Few-shot object counting with similarity-aware feature enhancement. In: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). 2023. p. 6304–13. https://doi.org/10.1109/wacv56688.2023.00625
- 24.
Dukic N, Lukezic A, Zavrtanik V, Kristan M. A low-shot object counting network with iterative prototype adaptation. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV 2023). 2023. p. 18826–35.
- 25.
Jiang R, Liu B, Chen C. CLIP-count: towards text-guided zero-shot object counting. In: Proceedings of the 31st ACM International Conference on Multimedia MM 2023. 2023. p. 4535–45.
- 26.
Xu JY, Le H, Nguyen V, Ranjan V, Samaras D. Zero-shot object counting. In: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023. p. 15548–57.
- 27.
Amini-Naieni N, Amini-Naieni K, Han T, Zisserman A. Open-world text-specified object counting. British Machine Vision Conference (BMVC). 2023.
- 28. Kang S, Moon W, Kim E, Heo J-P. VLCounter: Text-Aware Visual Representation for Zero-Shot Object Counting. AAAI. 2024;38(3):2714–22.
- 29. Wang Z, Xiao L, Cao Z, Lu H. Vision Transformer Off-the-Shelf: A Surprising Baseline for Few-Shot Class-Agnostic Counting. AAAI. 2024;38(6):5832–40.
- 30.
Pelhan J, Lukežič A, Zavrtanik V, Kristan M. DAVE – a detect-and-verify paradigm for low-shot counting. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024. p. 23293–302. https://doi.org/10.1109/cvpr52733.2024.02198
- 31.
Xu YW, Song FF, Zhang HF. Learning spatial similarity distribution for few-shot object counting. In: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024. 2024. p. 1507–15.
- 32. Wang B, Liu H, Samaras D, Nguyen MH. Distribution matching for crowd counting. Adv Neural Inf Process Syst. 2020;33:1595–607.
- 33.
Zhang YY, Zhou DS, Chen SQ, Gao SH, Ma Y. Single-image crowd counting via multi-column convolutional neural network. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. p. 589–97.
- 34.
Song Q, Wang C, Wang Y, Tai Y, Wang C, Li J, et al., editors. To choose or to fuse? Scale selection for crowd counting. In: Proceedings of the AAAI Conference on Artificial Intelligence; 2021.
- 35.
Xiong HP, Lu H, Liu CX, Liu L, Cao ZG, Shen CH. From open set to closed set: counting objects by spatial divide-and-conquer. IEEE I Conf Comp Vis (ICCV). 2019. p. 8361–70.
- 36.
Ma ZH, Wei X, Hong XP, Gong YH. Bayesian loss for crowd count estimation with point supervision. IEEE International Conference on Computer Vision (ICCV). 2019. p. 6141–50.
- 37.
Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr PHS. Fully-convolutional siamese networks for object tracking. Computer Vision - ECCV 2016 Workshops, Pt II. 2016;9914:850–65.
- 38.
Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An image is worth 16x16 words: transformers for image recognition at scale. International Conference on Learning Representations (ICLR), 2021.
- 39.
Liu Z, Lin YT, Cao Y, Hu H, Wei YX, Zhang Z, et al. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021 Ieee/Cvf International Conference on Computer Vision (ICCV 2021). 2021. p. 9992–10002.
- 40.
Zhu X, Su W, Lu L, Li B, Wang X, Dai J. Deformable DETR: deformable transformers for end-to-end object detection. arXiv preprint 2020. https://arxiv.org/abs/2010.04159
- 41.
He K, Gkioxari G, Dollár P, Girshick R, editors. Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision; 2017.
- 42.
Liu SL, Zeng ZY, Ren TH, Li F, Zhang H, Yang J, et al. Grounding DINO: marrying DINO with grounded pre-training for open-set object detection. Computer Vision - ECCV 2024, Pt XLVII. 2025;15105:38–55.
- 43.
Loshchilov I, Hutter F. Decoupled weight decay regularization. In: International Conference on Learning Representations (ICLR), 2019.
- 44.
Lin H, Hong X, Wang Y. Object counting: You only need to look at one. CVPR 2022. arXiv preprint 2021. https://arxiv.org/abs/2112.05993
- 45. Ren S, He K, Girshick R, Sun J. Faster R-CNN: towards real-time object detection with region proposal networks. Adv Neural Inf Process Syst NIPS. 2015;28:2015.
- 46.
Lin TY, Goyal P, Girshick R, He KM, Dollár PF. Focal loss for dense object detection. In: 2017 IEEE International Conference on Computer Vision (ICCV), 2017. p. 2999–3007.
- 47.
Deng J, Dong W, Socher R, Li LJ, Li K, Li FF. ImageNet: a large-scale hierarchical image database. In: CVPR: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009.p. 248–55.
- 48.
Gong SJ, Zhang SS, Yang J, Dai DX, Schiele B. Class-agnostic object counting robust to intraclass diversity. In: Computer Vision - ECCV 2022, Pt XXXIII, 2022;13693:388–403.
- 49. Chen Z, He Z, Lu Z-M. DEA-Net: Single Image Dehazing Based on Detail-Enhanced Convolution and Content-Guided Attention. IEEE Trans Image Process. 2024;33:1002–15. pmid:38252568
- 50.
Zhang YY, Zhou DS, Chen SQ, Gao SH, Ma Y. Single-image crowd counting via multi-column convolutional neural network. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016. p. 589–97.
- 51.
Zhao Y, Lv WY, Xu SL, Wei JM, Wang GZ, Dan QQ, et al. DETRs beat YOLOs on real-time object detection. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2024. p. 16965–74.