Abstract
In open environments, complex and variable backgrounds and dense multi-scale targets are two key challenges for crowd counting. Because current methods rely on supervised learning with labeled data, they struggle to adapt to crowd detection in complex scenarios when training data is limited. Moreover, detection-based methods may miss many targets when dealing with dense, small-scale target groups. This paper proposes a simple yet effective point-based contrastive learning method to alleviate these issues. First, we construct contrastive cropped samples and feed them into a convolutional neural network to predict the head points of each image patch. On top of the classification and regression losses for these points, we incorporate an auxiliary supervised contrastive learning loss to enhance the model's ability to differentiate foreground heads from the background. Additionally, a multi-scale feature fusion module is proposed to obtain high-quality feature maps for detecting targets of different scales. Comparative experiments on public crowd counting datasets demonstrate that the proposed method achieves state-of-the-art performance.
Citation: Cao R, Yu J, Liu Z, Liang Q (2025) Towards real-world monitoring scenarios: An improved point prediction method for crowd counting based on contrastive learning. PLoS One 20(7): e0327397. https://doi.org/10.1371/journal.pone.0327397
Editor: Ayesha Maqbool, National University of Sciences and Technology NUST, PAKISTAN
Received: November 14, 2024; Accepted: June 13, 2025; Published: July 2, 2025
Copyright: © 2025 Cao et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The UCF_CC_50 dataset analyzed during the current study is available at https://www.crcv.ucf.edu/data/ucf-cc-50/. The ShanghaiTech Part A and Part B datasets analyzed during the current study are available at https://github.com/desenzhou/ShanghaiTechDataset.
Funding: No funding was received to assist with the preparation of this manuscript.
Competing interests: The authors have no competing interests to declare that are relevant to the content of this article.
Introduction
Crowd counting is a crucial research topic in the fields of public safety and video surveillance, with broad application prospects. For instance, by analyzing crowd counts, we can estimate the aggregation state of crowds in large stadiums, squares, conference centers, and entertainment hubs. This enables situational awareness of crowd dynamics, thereby preventing incidents such as stampedes and mass brawls. Additionally, by examining crowd density distributions, we can assess the commercial value of specific locations or regions, facilitating the formulation of reasonable market planning and development strategies. In traffic management, counting the number of people at intersections, major thoroughfares, and public transportation hubs helps devise effective traffic management measures, such as security personnel allocation and evacuation plans.
Technically, crowd counting, often referred to as crowd density estimation, involves using computer vision techniques to estimate the number of individuals in a given image. Crowd density estimation based on computer vision has been studied for nearly 30 years. Early methods, primarily based on shallow visual computing techniques, have been quite comprehensive [1]. Representative methods include pixel feature statistics [2,3], texture analysis [4], head template features [5], background estimation and Expectation Maximization (EM) [6], Gaussian Process Regression (GPR) [7], support vector machine regression based on feature points [8], multi-camera information fusion [9], real-time head counting based on feature points [10], multi-output regression models based on feature mining [11], Bayesian regression [12], unsupervised Bayesian detection [13], semi-supervised elastic net [14], among others. These methods have significantly advanced crowd counting techniques. However, due to their reliance on manual feature extraction, their performance in adapting to complex scenarios remains suboptimal, falling short of practical technical standards.
With the widespread application of deep learning methods in computer vision, crowd counting techniques have seen substantial advancements [15,16]. Under the deep learning framework, existing crowd density estimation methods can be roughly divided into three categories: detection-based methods, density map-based methods, and regression-based methods. Detection-based methods treat pedestrians as visual targets, using deep learning techniques to obtain a series of target boxes, from which the total number of individuals in the image is counted. For instance, head visual feature learning based on cascade Adaboost and convolutional networks [17] and adaptive head candidate region generation based on scale maps [18] improve the accuracy of pedestrian detection by enhancing the network structure design for crowd scenes. Additionally, frameworks such as Faster-RCNN [19] and the YOLO series [20–26] are widely applied in crowd counting systems, enhancing pedestrian counting accuracy across various scenarios and crowd sizes. However, these methods heavily depend on the accuracy of detection boxes, leading to significant counting errors when pedestrians are densely packed and occlude each other.
Density map-based methods achieve crowd counting primarily through traditional kernel density estimation or density map estimation based on deep learning. Within the deep learning framework, methods such as dilated kernel convolutional networks [27], spatial divide-and-conquer networks [28], shallow feature dense attention networks [29], dilation rate adaptive convolution [30], and local counting maps [31] have improved the quality of crowd density map estimation. Some methods focus on generating high-quality crowd density maps, such as Fusion Count [32], adaptive density map generators [33], patch-level density map generation networks [34], and dynamic crowd density map refinement networks [35,36]. These methods have enriched the technical means of crowd counting. However, density map-based crowd counting methods in crowded areas are affected by cluttered backgrounds, crowd scale, perspective effects, target occlusion, and density loss [37–39]. Furthermore, density map methods struggle to directly obtain the location of each individual, impacting practical video surveillance applications.
Point-based methods use key points to locate heads or bodies, directly identifying individuals in the scene. Essentially, these methods belong to a localization approach. Song et al. [40] proposed a point-based joint crowd counting and individual localization framework, introducing density-normalized average precision and constructing a Point to Point Network (P2PNet) that directly predicts a set of point proposals representing head locations in the image. They employed the Hungarian algorithm to match predicted points with ground truth points, with matched points representing head locations, thus counting the total number of individuals. Additionally, methods such as point confidence prediction based on Transformers [41], self-attention-guided center point methods [42], bipartite matching point-supervised crowd counting methods [43], self-training methods based on point-level annotations [44], and Bayesian loss probability models [45] have improved the accuracy of pedestrian point prediction and localization. Despite providing precise individual locations, these methods depend on the network’s feature extraction capabilities, often struggling with differentiating positive and negative samples and detecting heads of varying scales.
To address these issues, this paper introduces contrastive learning techniques into a head point prediction framework to enhance the model’s ability to distinguish between positive and negative samples. During model training, a series of positive sample image patches containing heads and negative sample patches without heads are randomly cropped to construct positive and negative sample groups for contrastive learning loss calculation. This approach trains a network model that better distinguishes between head foreground and background. Additionally, considering the varying sizes of real crowd objects in video applications, a multi-scale feature fusion module is designed to enhance the model’s feature extraction capabilities for head targets in real scenes. This method directly predicts head points in images during inference, thus counting the total number of individuals.
For current practical applications, since video surveillance requires crowd location information, we primarily compare the YOLOv7 object detection method [46] with the head-point detection model proposed in this paper. The comparison of bounding-box detection and point detection in a dense crowd scenario is shown in Fig 1. To improve the robustness of the model, we introduce additional real-world training data and data augmentation techniques; training with varied lighting conditions, weather changes, and viewpoints helps the model perform well in real-world applications.
The upper image shows the performance of the YOLOv7 bounding-box detection method, while the lower image presents the results of our point-based detection method. The point-based method recovers markedly more heads in the densely occluded regions. All faces are blurred in Fig 1 for privacy preservation.
The main contributions of this paper include the following aspects.
- We propose a novel and effective method for crowd counting and localization, which achieves state-of-the-art performance on crowd counting datasets and has been widely applied in practical video surveillance scenarios.
- We propose a crowd feature representation network based on patch-supervised auxiliary contrastive learning, which fully leverages the local density characteristics of the crowd in scene images, enhancing the discriminative capability between head targets and background without adding extra inference burden.
- We introduce a multi-scale feature fusion module, designing a weighted cross-scale connection structure to aggregate features at different resolutions, thereby improving the model’s ability to learn head features of varying scales.
Related work
In this section, we review some recent works on crowd counting and contrastive learning. As detection-based methods and point-based methods can be summarized as localization-based methods, we discuss these two crowd counting approaches.
Density map-based methods
For a given crowd image, density map-based methods aim to generate a density map and then sum over the predicted density map to obtain the count [27–31,47,48]. Specifically, Li et al. [27] proposed the Congested Scene Recognition Network (CSRNet) model, which uses dilated kernels to provide a larger receptive field and generate high-quality density maps, thus achieving high-precision crowd counting. Liu et al. [47] introduced a nonlinear continuous counting quantization strategy and transformed the problem of sample-block counting imbalance into a class imbalance over counting levels. Xiong et al. [28] proposed the Spatial Divide-and-Conquer Network (S-DCNet), whose core idea is to learn a counting classifier on a closed set and then extend it to open-set counting. Miao et al. [29] proposed a Shallow feature-based Dense Attention Network (SDANet), capturing multi-scale information through densely connected hierarchical feature maps and achieving crowd counting for static images. Bai et al. [30] proposed a dilation-rate-adaptive convolution operation, establishing a self-correcting supervision mechanism to improve the accuracy of crowd counting. Liu et al. [31] proposed a local counting map, constructing a scale-aware module, a hybrid regression module, and an adaptive soft-region module to achieve high-precision counting regression by focusing on the difference between the global crowd count and the sum of the density map during testing. Liu et al. [48] transformed the counting problem into a sequential decision problem, using a scale-weighting strategy to build a crowd counting model based on deep reinforcement learning.
Most existing density map-based crowd counting methods mainly focus on crowd density estimation. Relatively few works focus on generating crowd density maps. Ma et al. [32] proposed a crowd counting model that extensively leverages representations learned during encoding to compute first-phase multiscale features, and its decoder further fuses these scale-aware features to generate the density map. Wan et al. [33] analyzed the impact of different density maps and constructed a density map refinement network. They built an adaptive density map generator, using annotation dot maps as input to learn the density map representation of the counter, generating ground-truth density maps. Xu et al. [49] extracted patch-level density maps through a density estimation model, introducing multi-polar center loss to automatically normalize each patch density map online, achieving density map clustering. Jiang et al. [50] constructed a multi-level convolutional neural network that adaptively learns multi-level density maps, each focusing on handling pedestrians of specific sizes and fusing them to predict the final output. Tian et al. [35] proposed a more intuitive and understandable Density Map Dynamic Refinement Network (DDRNet), consisting of a counter and refiner. The refiner, composed of convolution layers with different dilation rates, iteratively refines and improves the quality of the density map using the counter’s output as dynamic input. Liu et al. [36] constructed a dynamic fine density map network with a designed regional attention module (RAM) that adaptively adjusts the head size relationship in different positions of the dot map, refining existing ground-truth density maps through joint training of the counter and learnable refinement network.
These methods have advanced crowd counting techniques but have yet to meet practical requirements. Firstly, manually annotating crowd density maps is challenging. Additionally, these methods heavily rely on the quality of crowd density maps. In practice, the density maps generated by existing methods are easily affected by changes in head proportions due to the multi-scale nature of pedestrians in images, failing to reflect the actual size of heads in the image, thereby impacting counting accuracy.
Localization-based methods
Localization-based methods utilize object detection techniques [51–53] for crowd counting, with the core idea of treating pedestrians or heads as visual targets and counting the crowd by locating these targets. Gao et al. [17] proposed using the cascade Adaboost algorithm to replace the candidate region generation module in the R-CNN framework [54], using convolutional networks to learn head visual features and employing a support vector classifier for pedestrian and non-pedestrian classification. Khan et al. [18] proposed using a scale map to generate scale-adaptive head candidate regions, followed by using convolutional neural networks for head detection. Rani et al. [19] first used Faster-RCNN [55] to detect heads and then built a pedestrian counting system. Additionally, the YOLO series object detection frameworks [20] have been widely applied in pedestrian detection and crowd counting tasks [21–26]. Sam et al. [34] developed a tailored detection framework for dense crowd counting by predicting head bounding boxes, using a top-down feature modulation strategy to better distinguish pedestrian targets and producing fine-grained predictions at multiple resolutions, reliably outputting head localization results from sparse to dense crowds.
Moreover, some point-based methods have been applied to crowd counting. The primary task of point prediction is to use supervised deep learning methods to achieve pedestrian target point regression based on pedestrian (particularly head) point annotations, thereby providing individual localization positions. Song et al. [40] proposed a Point to Point Network (P2PNet), a classic architecture among such methods, directly predicting a set of point proposals to locate head positions. Yuan et al. [41] constructed a Localization Guided Transformer (LGT) framework, a point-based model using regression heads and classification heads to simultaneously predict head point proposals and point confidence, providing more discriminative representations for high-quality density map estimation. Ma et al. [42] proposed a self-attention guidance-based crowd localization and counting network (SA-CLCN), using original point annotations from crowd datasets as supervision to train the network, predicting each head’s center point coordinates and the crowd count. Liu et al. [43] proposed a bipartite matching-based point-supervised crowd counting method, matching annotated pixel points through bipartite matching to reduce the impact of incorrect point matching on counting performance. Wang et al. [44] proposed a novel self-training method, using point-level annotations and crowd-aware loss to guide network training, predicting pedestrian center points and sizes in crowded scenes. Ma et al. [45] proposed a density contribution probability model based on Bayesian loss and point annotations, where the training loss constrains not the value of each pixel in the density map but the expected count of each annotated point, improving crowd counting accuracy in dense scenes.
Contrastive learning
The concept of contrastive learning was proposed as early as 2006 to learn invariant representations of patterns [56]. Subsequently, Khosla et al. [57] developed supervised contrastive learning, whose core idea is to cluster the embeddings of positive samples while separating the embeddings of negative samples. Supervised contrastive learning has significantly enhanced the learning ability of image feature representations. Recently, contrastive learning methods have gradually been applied to pedestrian detection and crowd counting. For instance, Lin et al. [58] proposed an example-guided contrastive learning framework to guide feature learning. Under the contrastive learning framework, they used pedestrian appearance as a prior knowledge example dictionary, constructing effective contrastive training pairs and using the constructed example dictionary to evaluate the quality of candidate pedestrians. Chen et al. [59] proposed a discriminative feature learning framework for crowd counting, consisting of a Masked Feature Prediction Module (MPM) and a Contrastive Learning Module (CLM). MPM randomly masks feature vectors in the feature map, enhancing the model’s pedestrian localization ability in high-density areas through a supervised reconstruction strategy. CLM brings the representations of pedestrian targets closer and pushes the background features away from pedestrian representations, increasing their discriminability. However, these methods do not solve the problem of crowd counting in both sparse and dense crowd scenarios. Unlike previous works, our method combines patch-level foreground and background contrastive learning with a point detection framework, providing more accurate results in various applications.
Problem description
For images from wide-area surveillance scenarios that may contain crowds, let $N$ represent the number of people, and let $p_i = (x_i, y_i)$ denote the center point of the $i$-th head. The set of real crowd points in the image can be represented as $\mathcal{P} = \{p_i \mid i = 1, \ldots, N\}$. The set of head points predicted by the model is denoted as $\hat{\mathcal{P}} = \{\hat{p}_j \mid j = 1, \ldots, \hat{N}\}$, and the corresponding set of confidence scores is $\hat{\mathcal{C}} = \{\hat{c}_j \mid j = 1, \ldots, \hat{N}\}$. Thus, the crowd counting problem can be described as ensuring that each predicted point $\hat{p}_j$ with high confidence $\hat{c}_j$ is as close as possible to a real point $p_i$, while the predicted number of people $\hat{N}$ is as close as possible to the real number $N$. Therefore, our method must simultaneously consider accurate category prediction and precise position regression. Given the varying head sizes in real scenarios and the presence of negative samples with features similar to heads, our model needs multi-scale feature learning capability and strong positive-negative sample discrimination.
Our method
In this section, we propose a point-based crowd counting method incorporating contrastive learning, which consists of five key components: (1) Random cropping to obtain contrastive samples; (2) Backbone for extracting image features; (3) Multi-Scale Feature Fusion Module (MSFM) to improve the ability to detect heads of different sizes; (4) Multi-branch head modules for classification, regression, and projection; (5) Point matching during the training process and point prediction during the inference process.
Contrastive learning network design
The objective of this section is to construct positive and negative samples for contrastive learning. Given a crowd image, we randomly crop a fixed number of 128 × 128 image patches, which may or may not contain people. Our key insight is to aggregate the feature representations of crowd regions while separating those of background regions. By batching the image samples, we are highly likely to obtain a series of samples that include both crowd and background regions. These samples can then be used as positive and negative samples to establish a contrastive learning framework. For instance, when the batch size is set to 16 and the number of patches is 8, each training iteration generates 128 samples, each either containing or excluding people, which are applicable to the contrastive learning process.
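This sampling step can be sketched as follows. The helper below is a minimal illustration (the function name and labeling rule are our assumptions, not the paper's code): a patch is labeled positive when at least one annotated head center falls inside it.

```python
import numpy as np

def crop_contrastive_patches(image, head_points, num_patches=8, patch=128, rng=None):
    """Randomly crop `num_patches` square patches from `image` and label each
    patch 1 if any annotated head center (x, y) falls inside it, else 0."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    patches, labels = [], []
    for _ in range(num_patches):
        top = int(rng.integers(0, h - patch + 1))
        left = int(rng.integers(0, w - patch + 1))
        patches.append(image[top:top + patch, left:left + patch])
        inside = [(left <= x < left + patch) and (top <= y < top + patch)
                  for (x, y) in head_points]
        labels.append(int(any(inside)))
    return np.stack(patches), np.array(labels)
```

Stacking these labeled patches across a mini-batch yields the mixed pool of crowd and background samples used by the contrastive loss.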
Similar to existing crowd counting models [40], we can use VGG-16_bn [60] as the backbone network for extracting image features. As shown in Fig 2, by outputting three different levels of feature maps, we design a Multi-Scale Feature Fusion Module (MSFM) to obtain better feature representations of crowds of varying scales and distributions. Subsequently, based on the same output of the MSFM, we construct three different head branches. During the training phase, we adopt a linear projection layer as the projection head, and utilize three stacked convolutional layers interwoven with ReLU activation as the classification head and regression head. During inference, only the classification and regression branches are maintained for point prediction.
The blue dashed box includes four different-sized feature layers from the VGG backbone network. The backbone network section can be replaced with other structures such as ResNet. The dashed box containing the projection head, contrastive loss, and point matching is used only during the training process.
Multi-scale feature fusion module
The designed Multi-Scale Feature Fusion Module (MSFM) is shown in Fig 3. It takes three feature maps as input and outputs a fused feature map. The input feature maps $F_1$, $F_2$, and $F_3$ have sizes $H/4 \times W/4$, $H/8 \times W/8$, and $H/16 \times W/16$, respectively, and the output feature map has size $H/8 \times W/8$. We first utilize FPN [61] to introduce top-down and bottom-up paths, obtaining three new feature maps with stronger semantics and finer detail, namely $L$, $M$, and $H$, which represent the low-level, medium-level, and high-level features, respectively. Considering the limitations of FPN in multi-scale feature fusion, we further introduce two additional connections. In the formulas below, Conv2d denotes the standard two-dimensional convolution operation, Upsample denotes the upsampling operation, and the Swish activation function [62] is chosen for better performance, specifically defined as

$\mathrm{Swish}(x) = x \cdot \sigma(\beta x)$,

where $\beta$ is a constant or trainable parameter, set to 1 in our network module.

For cross-scale connections, we additionally add a bottom-up path from the $L$ level to the $M$ level, using a MaxPool operation with stride 2 to adjust the output of the $L$ level to the same shape as the $M$ level. Moreover, we add an extra edge connection that sums features of the same level into the $M$ output, enabling the fusion of more features without incurring significant cost. Because input features at different resolutions typically contribute unequally to the output features [63], we perform a weighted summation over the inputs $H$, $M$, and $L$, with learnable weights $W_1$, $W_2$, and $W_3$, giving the final fusion formula

$O = \mathrm{Conv2d}\left(\mathrm{Swish}\left(\frac{W_1 \cdot \mathrm{Upsample}(H) + W_2 \cdot M + W_3 \cdot \mathrm{MaxPool}(L)}{W_1 + W_2 + W_3 + \epsilon}\right)\right)$,

where $\epsilon$ is a small constant for numerical stability.
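The weighted fusion step can be sketched in numpy as below. This is an illustrative fast-normalized weighted sum in the spirit of [63]; the nearest-neighbor resampling, equal weights, and omission of the final convolution are our simplifying assumptions.

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x / (1.0 + np.exp(-beta * x))

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of an (H, W) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def maxpool2x(x):
    """2x2 max pooling with stride 2 of an (H, W) map."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def weighted_fusion(L, M, H, w, eps=1e-4):
    """Normalized weighted sum of three feature levels, resampled to M's scale."""
    w = np.maximum(np.asarray(w, dtype=float), 0.0)  # keep weights non-negative
    inputs = [maxpool2x(L), M, upsample2x(H)]
    fused = sum(wi * fi for wi, fi in zip(w, inputs)) / (w.sum() + eps)
    return swish(fused)
```

In the full module, the learnable weights are trained jointly with the rest of the network, so each resolution's contribution is adapted to the data.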
Point matching and prediction
Similar to P2PNet [40], we use a one-to-one matching strategy for point matching. A set of fixed reference points is introduced, densely and uniformly arranged over the patch cropped from the input image, with coordinates $R = \{r_k = (x_k, y_k) \mid k = 1, \ldots, K\}$. The classification branch outputs the confidence scores of these points, while the regression branch generates their coordinate offsets. For a predicted point $\hat{p}_k$ with offset $(\Delta x_k, \Delta y_k)$, the predicted coordinates are calculated as

$\hat{p}_k = (x_k + \gamma \Delta x_k,\; y_k + \gamma \Delta y_k)$,

where $\gamma$ is a regularization parameter used to adjust the magnitude of the offset.

For each point proposal in the set $\hat{\mathcal{P}}$, a one-to-one matching strategy assigns it to the ground-truth target set $\mathcal{P}$. Based on the Euclidean distance and confidence score of these points, a pairwise cost matrix $D$ is constructed as

$D(i, j) = \tau \, \lVert p_i - \hat{p}_j \rVert_2 - \hat{c}_j$,

where $\tau$ is a weighting parameter and $\hat{c}_j$ is the confidence score of $\hat{p}_j$.
Using the above pairwise cost matrix, the Hungarian algorithm [64,65] is employed for matching. In our implementation, the total number of reference points exceeds the number of pixels in the image, ensuring that the number of predicted points exceeds the number of ground truth points. Thus, matched predicted points can be labeled as positive samples, while unmatched predicted points are labeled as negative samples.
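The matching step can be sketched compactly with SciPy's Hungarian solver, assuming the pairwise cost of Euclidean distance weighted by τ minus confidence described above (the τ value and example points are hypothetical):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_points(gt_points, pred_points, confidences, tau=0.5):
    """One-to-one matching of ground-truth points (rows) to predicted point
    proposals (columns), minimizing D(i, j) = tau * ||p_i - p_hat_j||_2 - c_hat_j."""
    gt = np.asarray(gt_points, dtype=float)       # (N, 2)
    pred = np.asarray(pred_points, dtype=float)   # (K, 2), with K >= N
    conf = np.asarray(confidences, dtype=float)   # (K,)
    dist = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1)
    cost = tau * dist - conf[None, :]
    rows, cols = linear_sum_assignment(cost)      # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))
```

Proposals appearing in the returned pairs are the positive samples; all remaining proposals are treated as negatives, as described above.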
Loss function design
In our framework, there are three loss objectives: classification loss, regression loss, and contrastive loss. The first two are based on the matched and unmatched points obtained above, while the last one is calculated based on the deep feature maps output by the MSFM.
For the classification loss, we adopt the focal loss [66] for dense target detection, described as

$L_{cls} = -\frac{1}{M} \sum_{k=1}^{M} \alpha_t \,(1 - p_{t,k})^{\gamma} \log(p_{t,k})$,

where $p_{t,k}$ is the model-estimated probability for the true label of the $k$-th proposal, $\alpha_t$ is a weighting parameter to address class imbalance, $\gamma$ is a focusing parameter that reduces the loss contribution from easy examples and extends the range of examples receiving low loss, and $M$ is the mini-batch size.
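A small numpy sketch of this classification loss in its binary form follows; the α and γ defaults are the common choices from [66], not necessarily the values used in the paper.

```python
import numpy as np

def focal_loss(probs, labels, alpha=0.25, gamma=2.0, eps=1e-8):
    """Binary focal loss: mean of -alpha_t * (1 - p_t)^gamma * log(p_t),
    where p_t is the predicted probability of the true class."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    p_t = np.where(labels == 1, probs, 1.0 - probs)
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + eps)))
```

With $\gamma = 0$ and $\alpha = 0.5$ this reduces to half the standard binary cross-entropy, which is a quick sanity check; increasing $\gamma$ progressively down-weights well-classified examples.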
For the regression loss, we introduce the MSE loss to supervise point regression, defined as

$L_{loc} = \frac{1}{N} \sum_{i=1}^{N} \lVert p_i - \hat{p}_{\sigma(i)} \rVert_2^2$,

where $p_i$ denotes the position of the $i$-th ground-truth point, $\hat{p}_{\sigma(i)}$ denotes the predicted point matched with $p_i$, and $N$ is the number of matched points.
The contrastive loss is calculated on the feature maps output by the MSFM. Since we use random cropping to obtain patches during training and input them into the network, the first dimension of the output feature map is the product of the batch size and the number of patches, denoted as $B$. Let $i \in I = \{1, \ldots, B\}$ be the index of any sample, $A(i) = I \setminus \{i\}$ be the index set excluding $i$ itself, and $P(i)$ be the index set of other aligned visual features with the same label as $i$. The contrastive loss for a training step is defined as

$L_{con} = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau_c)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau_c)}$,

where $\tau_c$ is the temperature hyperparameter, $\cdot$ denotes the dot product, $z_i \cdot z_p$ is the cosine similarity between sample $i$ and sample $p$, and $z_i$ and $z_p$ are the normalized high-dimensional feature vectors of samples $i$ and $p$, respectively. Therefore, the overall loss function can be formulated as follows.
$L = \lambda_1 L_{con} + \lambda_2 L_{cls} + \lambda_3 L_{loc}$,

where the weights $\lambda_1$, $\lambda_2$, and $\lambda_3$ for these loss terms are hyperparameters. In our experiments, we set $\lambda_1$ to 0.01, $\lambda_2$ to 1, and $\lambda_3$ to 2e-4.
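The supervised contrastive term above can be sketched in numpy over L2-normalized embeddings (the temperature value here is illustrative):

```python
import numpy as np

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over L2-normalized feature vectors.
    Positives for sample i are all other samples sharing its label."""
    z = np.asarray(features, dtype=float)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # cosine sim via dot product
    labels = np.asarray(labels)
    sim = z @ z.T / temperature
    n = len(labels)
    total = 0.0
    for i in range(n):
        others = [a for a in range(n) if a != i]
        positives = [p for p in others if labels[p] == labels[i]]
        if not positives:
            continue  # skip samples with no positive pair
        log_denom = np.log(np.sum(np.exp(sim[i, others])))
        total += -np.mean([sim[i, p] - log_denom for p in positives])
    return float(total)
```

Pulling same-label (head or background) patch features together and pushing different-label features apart is what sharpens the foreground/background boundary during training.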
Experiment
Experimental details
We utilize multiple publicly available crowd counting datasets, such as UCF_CC_50 [67], ShanghaiTech Part A and Part B [68], to demonstrate the superiority of our method. The UCF_CC_50 dataset contains 50 images with the number of people ranging from 94 to 4543, with an average count of 1280. This dataset is diverse in scene distribution and challenging for detection. We use the 5-fold cross-validation method, similar to other papers. The ShanghaiTech Part A dataset is randomly collected from the internet, containing 300 training images and 182 test images. Part B is captured from busy streets in the Shanghai metropolis, with 400 training images and 316 test images. The crowd density varies significantly between Part A and Part B, making this dataset more challenging and representative than most existing datasets. Additionally, we collected a non-public Tower dataset for practical mid-to-high viewpoint video surveillance applications, consisting of 671 training images and 100 test images.
Similar to previous works, we apply random scaling, keeping the shorter side no smaller than 128 pixels. We then randomly crop 8 fixed-size 128 × 128 patches from each resized image. Finally, we apply random flipping with a probability of 0.5. The Adam algorithm with an initial learning rate of 1e-4 is used first for training, followed by the SGD optimizer to fine-tune the best model [69]. The backbone network is pretrained on ImageNet, as in previous works.
For the contrastive cropping sample parameter settings, we focus on the training data batch size and the number of patches per image. We set the batch size to 8 and the patch number to 3. Therefore, for each training process, we obtain 24 samples per batch, with labels indicating whether they contain people or not, which can be used for the contrastive learning process.
Model evaluation
Our method has been compared with state-of-the-art methods on several publicly available crowd counting datasets, as shown in Table 1. We use the Mean Absolute Error (MAE) and the Root Mean Squared Error (MSE) as evaluation metrics, defined as

$\mathrm{MAE} = \frac{1}{K} \sum_{k=1}^{K} \lvert N_k - \hat{N}_k \rvert, \qquad \mathrm{MSE} = \sqrt{\frac{1}{K} \sum_{k=1}^{K} (N_k - \hat{N}_k)^2}$,

where $K$ is the number of test images, $N_k$ is the actual number of people in the $k$-th image, and $\hat{N}_k$ is the estimated number of people in the $k$-th image. Roughly speaking, MAE measures the accuracy of the estimates, while MSE measures their robustness.
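These two metrics can be computed directly (the example counts below are hypothetical):

```python
import math

def mae(gt_counts, pred_counts):
    """Mean Absolute Error over per-image crowd counts."""
    return sum(abs(g - p) for g, p in zip(gt_counts, pred_counts)) / len(gt_counts)

def mse(gt_counts, pred_counts):
    """Root Mean Squared Error (conventionally called MSE in crowd counting)."""
    return math.sqrt(sum((g - p) ** 2
                         for g, p in zip(gt_counts, pred_counts)) / len(gt_counts))
```

Because MSE squares the per-image errors before averaging, a few large miscounts inflate it much more than they inflate MAE, which is why it serves as the robustness indicator.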
Our framework achieves the best performance among all methods. Specifically, on the dense crowd dataset SHTech Part A, our method reduces the MAE by 1.4 and the MSE by 1.16 compared to the state-of-the-art method P2PNet [40]. On the sparse crowd dataset SHTech Part B, our method also achieves the best performance, reducing the MAE by 0.12 compared to the state-of-the-art method RSI-ResNet50 [70]. On the challenging UCF_CC_50 dataset with a wide range of crowd densities, our method balances accuracy and robustness. Compared to P2PNet, our method reduces the MSE by 7.3 while maintaining a similar MAE, and compared to CAN, our method reduces the MAE by 38.7 while maintaining a similar MSE. Considering both MAE and MSE metrics, our method not only achieves high accuracy but also demonstrates high robustness, achieving the best overall performance.
As shown in Fig 4, we present visualized crowd counting results on the ShanghaiTech Part A and Part B datasets. ShanghaiTech Part A contains densely crowded regions, while ShanghaiTech Part B is more sparsely populated: the images in Part A have higher densities, with the number of people ranging from 33 to 3139, whereas the images in Part B have lower densities, ranging from 9 to 578 [36]. Due to the varying object shapes and sizes in these datasets, our multi-scale feature fusion model demonstrates significant advantages over traditional methods. Experiments on these datasets demonstrate that our method performs well in both sparse and dense crowd scenarios.
The left image is from ShanghaiTech Part A, with a predicted crowd count of 375; the right image is from Part B, with a predicted crowd count of 18. All faces are blurred in Fig 4 for privacy preservation.
Additionally, we evaluated our model using the Tower dataset. As shown in Fig 5, our method accurately predicts the number of people in both near-field images with fewer people and far-field images with more people. This indicates that our algorithm is adaptable to both near-field and far-field, as well as sparse and dense crowd scenarios.
The left image shows a close-up view of a high-speed rail station exit, with a predicted crowd count of 2; the right image shows a distant view of a street, with a predicted crowd count of 33. All faces are blurred in Fig 5 for privacy preservation.
Ablation experiments
Considering that the main innovations of the paper revolve around contrastive learning and multi-scale feature fusion, our ablation experiments focus on four aspects: the effectiveness of the projection head, the effectiveness of the contrastive loss, the effectiveness of the multi-scale feature fusion module, and the number of patches cropped from a single image.
Effectiveness of the projection head.
Given that different types of projection heads perform differently, we conducted an ablation study on the structure of the projection head, covering three settings: an identity mapping; a linear projection head using a 256×128 fully connected (FC) layer; and a nonlinear projection head using a 256×256 FC layer, a 256×128 FC layer, and a ReLU activation in between. As shown in Table 2, under the same framework with the patch number set to 4 by default, the linear projection head shows a clear advantage: the learnable linear transformation between the representations and the contrastive loss significantly improves the quality of the learned representations.
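The three projection-head settings can be sketched as follows. This is a minimal NumPy illustration with randomly initialized weights; in the actual model the heads are trained layers inside the network:

```python
import numpy as np

rng = np.random.default_rng(0)

def identity_head(x):
    # Identity mapping: features feed the contrastive loss unchanged.
    return x

def linear_head(x, w=None):
    # Linear projection: a single 256 -> 128 fully connected layer.
    if w is None:
        w = rng.standard_normal((256, 128)) * 0.01
    return x @ w

def nonlinear_head(x, w1=None, w2=None):
    # Nonlinear projection: 256 -> 256 FC, ReLU, then 256 -> 128 FC.
    if w1 is None:
        w1 = rng.standard_normal((256, 256)) * 0.01
    if w2 is None:
        w2 = rng.standard_normal((256, 128)) * 0.01
    h = np.maximum(x @ w1, 0.0)  # ReLU activation between the two FC layers
    return h @ w2
```

Only the output dimensionality differs between the linear and nonlinear variants' interfaces; the ablation isolates how the extra nonlinearity (or its absence) affects the representations seen by the contrastive loss.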
Effectiveness of the multi-scale feature fusion module.
To analyze the impact of feature fusion, we compared the original FPN module with our Multi-Scale Feature Fusion Module (MSFM), using the same VGG backbone network and the default hyperparameter settings mentioned above; the results are shown in Table 3. The MSFM reduces the MAE by 1.18 and the MSE by 2.45 compared to the original FPN module, so replacing FPN with our proposed MSFM yields better crowd counting performance. The results indicate that multi-scale feature fusion can extract scale-related features from crowd images, facilitating crowd detection in different contexts.
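The MSFM internals are not reproduced here, but the top-down multi-scale fusion idea it builds on can be sketched generically. The following is an FPN-style illustration under our own simplifying assumptions (nearest-neighbor upsampling, element-wise addition), not the paper's module:

```python
import numpy as np

def upsample2x(f):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return f.repeat(2, axis=1).repeat(2, axis=2)

def fuse_pyramid(features):
    """Top-down fusion of pyramid features, coarsest level first.
    Each coarser map is upsampled and added to the next finer map,
    so fine levels inherit semantic context from coarse levels."""
    fused = features[0]
    outputs = [fused]
    for f in features[1:]:
        fused = upsample2x(fused) + f
        outputs.append(fused)
    return outputs
```

The design choice being ablated is precisely how this combination step is performed; any scheme that mixes levels of different resolutions must first bring them to a common spatial size, as the upsampling step does here.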
Effectiveness of the contrastive loss.
We conducted an ablation study on the use of the contrastive loss, with the results shown in Table 4. To ensure a fair comparison, the baseline was the same model trained without the contrastive loss, with the patch number set to 4 by default. Adding the contrastive loss reduces the MAE by 0.52 and the MSE by 1.16, showing that the model with the contrastive loss performs better.
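To make the role of the auxiliary loss concrete, a standard InfoNCE-style contrastive loss can be sketched as below. This is a hedged illustration: the paper's exact formulation and temperature value are assumptions here, not taken from the source:

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style contrastive loss for a single anchor feature.
    The loss is small when the anchor is similar to its positive
    and dissimilar to the negatives (illustrative formulation)."""
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(anchor, positive) / tau)
    neg = sum(np.exp(cos(anchor, n) / tau) for n in negatives)
    return -np.log(pos / (pos + neg))
```

Intuitively, pulling head (foreground) features together and pushing them away from background features is what sharpens the foreground/background boundary that the ablation measures.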
Patch number parameter analysis.
Under uniform parameter configurations, experiments were conducted on the SHTechA dataset to investigate the influence of the contrastive cropping module by varying the number of patches cropped from a single image. As depicted in Fig 6, the MAE reaches its optimum of 51.34 when the patch number is set to 3. A moderate patch count lets the contrastive learning framework capture multi-scale contextual information while keeping a balanced distribution of positive/negative sample pairs, which is crucial for learning discriminative features in crowd scenes. As the patch number exceeds 3, the MAE begins to fluctuate significantly: excessive patches introduce redundant or semantically similar regions, diluting the informativeness of the contrastive pairs. Because the contrastive loss relies on meaningful semantic disparities between sample pairs, an overly high patch number forces the model to learn from highly correlated patches, leading to unstable gradient updates and degraded feature representations. In particular, when the patch number surpasses 8, the overload of intra-image patches disrupts the contrastive learning objective and the MAE exceeds the baseline value of 52.36. These findings indicate that the patch number should be kept below 8 to balance contrastive feature diversity against information redundancy, validating the design of the proposed contrastive cropping module. By selecting an appropriate number of patches, the module balances multi-scale feature extraction with contrastive sample quality, enhancing the discriminative power of the learned representations and consequently improving crowd counting performance.
Patch number parameter is the number of samples cropped from a single image for contrastive learning.
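The cropping step itself reduces to sampling patch locations per image. The sketch below assumes uniform random crops of a fixed size, which is an illustrative choice rather than the paper's exact strategy:

```python
import random

def crop_patches(height, width, num_patches, patch_size):
    """Sample top-left corners for `num_patches` square crops of
    side `patch_size` from an image of the given dimensions.
    Returns (top, left, h, w) tuples; uniform sampling is an
    illustrative assumption."""
    coords = []
    for _ in range(num_patches):
        top = random.randint(0, height - patch_size)
        left = random.randint(0, width - patch_size)
        coords.append((top, left, patch_size, patch_size))
    return coords
```

With `num_patches` as the parameter swept in Fig 6, each cropped region becomes one sample fed to the contrastive branch, so the parameter directly controls how many contrastive pairs each image contributes.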
Qualitative study on contrastive learning
To obtain more insight into the effectiveness of contrastive learning, we apply t-SNE [78] to MSFM features produced by different models and compare their capability to distinguish between positive and negative samples. Fig 7 presents t-SNE maps for the baseline model and our model based on contrastive learning. The t-SNE map of features after the MSFM module integrated with contrastive learning shows greater separability than the baseline t-SNE map.
Left: features after the MSFM model integrated with contrastive learning; Right: features of the baseline model.
Complexity analysis
Table 5 reports a comparison of model size and inference speed measured on a single NVIDIA T4 GPU. The inference time is averaged over 100 runs on a 1920 × 1080 test sample; the test images are collected from real-world monitoring scenes. Our method has fewer parameters and faster inference than the previous state-of-the-art method STEERER [77]. Evidently, our model excels in real-world scenarios in terms of both resource consumption and inference performance.
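The reported timing follows the standard protocol of averaging wall-clock time over repeated forward passes. A generic sketch (with a placeholder `model_fn` standing in for the detector, and a warm-up phase that is our own conventional addition) looks like this:

```python
import time

def average_inference_time(model_fn, sample, runs=100, warmup=10):
    """Average wall-clock time of `model_fn(sample)` over `runs`
    passes, after `warmup` untimed passes to stabilize caches and
    device clocks (generic benchmarking sketch)."""
    for _ in range(warmup):
        model_fn(sample)
    start = time.perf_counter()
    for _ in range(runs):
        model_fn(sample)
    return (time.perf_counter() - start) / runs
```

On a GPU, one would additionally synchronize the device before reading the clock so that queued kernels are included in the measurement.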
Conclusion
This paper proposes a novel and effective point-based crowd counting method using contrastive learning. Recent studies have shown that detecting dense small objects is a significant and challenging problem in public safety and video surveillance. To address this issue, we introduce a point-based crowd detection framework and leverage auxiliary supervised contrastive learning to enhance the model’s ability to represent crowd foreground and background. Additionally, a multi-scale feature fusion module is proposed for detecting crowds in both high-density and low-density regions. Several experiments conducted on public crowd datasets demonstrate the effectiveness of our method. To further improve the detection accuracy in dense crowd scenes, other deep learning models such as Transformers could be considered to enhance the feature extraction capability. In the future, our method can also be applied to count other objects, such as vehicles and animals, and analyze their behaviors based on the point locations obtained by our model.
References
- 1. Saleh SAM, Suandi SA, Ibrahim H. Recent survey on crowd density estimation and counting for visual surveillance. Eng Appl Artif Intell. 2015;41:103–14.
- 2. Chow TWS, Yam JY-F, Cho S-Y. Fast training algorithm for feedforward neural network: application to crowd estimation at underground stations. Artif Intell Eng. 1999;13(3):301–7.
- 3. Cho SY, Chow TS, Leung CT. A neural-based crowd estimation by hybrid global learning algorithm. IEEE Trans Syst Man Cybern B Cybern. 1999;29(4):535–41. pmid:18252328
- 4. Marana AN, Costa LF, Lotufo RA, et al. On the efficacy of texture analysis for crowd monitoring. Proceedings SIBGRAPI’98. International Symposium on Computer Graphics, Image Processing, and Vision (Cat. No. 98EX237). IEEE; 1998. pp. 354–61.
- 5. Lin SF, Chen JY, Chao HX. Estimation of number of people in crowded scenes using perspective transformation. IEEE Trans Syst Man Cybern-Part A: Syst Hum. 2001;31(6):645–54.
- 6. Hou Y-L, Pang GKH. People counting and human detection in a challenging situation. IEEE Trans Syst Man Cybern - Part A: Syst Hum. 2011;41(1):24–33.
- 7. Fradi H, Dugelay J-L. People counting system in crowded scenes based on feature regression. 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO). Bucharest, Romania; 2012. pp. 136–40.
- 8. Conte D, Foggia P, Percannella G. A method for counting moving people in video surveillance videos. EURASIP J Adv Signal Process. 2010;2010:1–10.
- 9. Dittrich F, Koerich AL, Oliveira LES. People counting in crowded scenes using multiple cameras. 2012 19th International Conference on Systems, Signals and Image Processing (IWSSIP). Vienna, Austria; 2012. pp. 138–41.
- 10. Riachi S, Karam W, Greige H. An improved real-time method for counting people in crowded scenes based on a statistical approach. 2014 11th International Conference on Informatics in Control, Automation and Robotics (ICINCO). Vienna, Austria; 2014. pp. 203–12.
- 11. Chen K, Loy CC, Gong S, et al. Feature mining for localised crowd counting. BMVC. 2012;1(2):3.
- 12. Chan AB, Vasconcelos N. Counting people with low-level features and Bayesian regression. IEEE Trans Image Process. 2012;21(4):2160–77. pmid:22020684
- 13. Brostow GJ, Cipolla R. Unsupervised Bayesian Detection of Independent Motion in Crowds. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). New York, NY, USA; 2006. pp. 594–601.
- 14. Tan B, Zhang J, Wang L. Semi-supervised elastic net for pedestrian counting. Pattern Recogn. 2011;44(10–11):2297–304.
- 15. Akhtar R, Malhotra D. Intelligent techniques for crowd detection and people counting—A systematic study. In: Bansal JC, Engelbrecht A, Shukla PK, editors. Computer vision and robotics. Algorithms for intelligent systems. Singapore: Springer; 2022.
- 16. Cahyadi N, Rahardjo B. Literature review of people counting. 2021 International Conference on Artificial Intelligence and Mechatronics Systems (AIMS). Bandung, Indonesia; 2021. pp. 1–6.
- 17. Gao C, Li P, Zhang Y, Liu J, Wang L. People counting based on head detection combining Adaboost and CNN in crowded surveillance environment. Neurocomputing. 2016;208:108–16.
- 18. Khan SD, Ullah H, Ullah M, Conci N, Cheikh FA, Beghdadi A. Person Head Detection Based Deep Model for People Counting in Sports Videos. 2019 16th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). Taipei, Taiwan; 2019. pp. 1–8.
- 19. Baby Rani N, Lavanya J, Sathwika K, Likhitha M, Sowjanya N. People Counting System Based on Head Detection using Faster RCNN from Both Images and Videos. Turk J Comput Math Educ. 2023;14(03):947–54.
- 20. Redmon J, Divvala S, Girshick R, Farhadi A. You Only Look Once: Unified, Real-Time Object Detection. arXiv preprint arXiv:1506.02640. 2016.
- 21. Elaoua A, Nadour M, Cherroun L, Elasri A. Real-Time People Counting System using YOLOv8 Object Detection. 2023 2nd International Conference on Electronics, Energy and Measurement (IC2EM). Medea, Algeria; 2023. pp. 1–5.
- 22. Ren P, Fang W, Djahel S. A novel YOLO-based real-time people counting approach. International Smart Cities Conference (ISC2). IEEE; 2017. pp. 1–2.
- 23. Rahim A, Maqbool A, Rana T. Monitoring social distancing under various low light conditions with deep learning and a single motionless time of flight camera. PLoS One. 2021;16(2):e0247440. pmid:33630951
- 24. Menon A, Omman B, Asha S. Pedestrian counting using Yolo V3. International Conference on Innovative Trends in Information Technology (ICITIIT). IEEE; 2021. pp. 1–9.
- 25. Ruchika RK, Purwar S, Verma S. Analytical study of yolo and its various versions in crowd counting. Intelligent Data Communication Technologies and Internet of Things: Proceedings of ICICI 2021. Springer; 2022. pp. 975–89.
- 26. Gündüz MŞ, Işık G. A new YOLO-based method for real-time crowd detection from video and performance analysis of YOLO models. J Real Time Image Process. 2023;20(1):5. pmid:36744218
- 27. Li Y, Zhang X, Chen D. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018. pp. 1091–100.
- 28. Xiong H, Lu H, Liu C, Liu L, Cao Z, Shen C. From open set to closed set: Counting objects by spatial divide-and-conquer. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. pp. 8362–71.
- 29. Miao Y, Lin Z, Ding G, Han J. Shallow feature based dense attention network for crowd counting. Proceedings of the AAAI Conference on Artificial Intelligence. 2020. pp. 11765–72.
- 30. Bai S, He Z, Qiao Y, Hu H, Wu W, Yan J. Adaptive dilated network with self-correction supervision for counting. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. pp. 4594–603.
- 31. Liu X, Yang J, Ding W, Wang T, Wang Z, Xiong J. Adaptive mixture regression network with local counting map for crowd counting. Computer Vision–ECCV 2020: 16th European Conference. Glasgow, UK: Springer; 2020. pp. 241–57.
- 32. Ma Y, Sanchez V, Guha T. FusionCount: Efficient Crowd Counting via Multiscale Feature Fusion. 2022.
- 33. Wan J, Chan A. Adaptive density map generation for crowd counting. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South); 2019. pp. 1130–9.
- 34. Sam DB, Peri SV, Sundararaman MN, Kamath A, Babu RV. Locate, Size, and Count: Accurately Resolving People in Dense Crowds via Detection. IEEE Trans Pattern Anal Mach Intell. 2021;43(8):2739–51.
- 35. Tian S, Sang J, Qiao X, Liu X, Liu K, Xia X. Crowd counting based on density map dynamic refinement. International Joint Conference on Neural Networks. Padua, Italy; 2022. pp. 1–7.
- 36. Liu Y, Cao G, Ge Z, Hu G. Crowd counting method via a dynamic-refined density map network. Neurocomputing. 2022;497:191–203.
- 37. Luo A, Yang F, Li X, Nie D, Jiao Z, Zhou S, et al. Hybrid graph neural networks for crowd counting. Proceedings of the AAAI Conference on Artificial Intelligence. 2020. pp. 11693–700.
- 38. Oh MH, Olsen P, Ramamurthy KN. Crowd counting with decomposed uncertainty. Proceedings of the AAAI Conference on Artificial Intelligence. 2020. pp. 11799–806.
- 39. Ma Z, Wei X, Hong X, Lin H, Qiu Y, Gong Y. Learning to count via unbalanced optimal transport. Proceedings of the AAAI Conference on Artificial Intelligence. 2021. pp. 2319–27.
- 40. Song Q, Wang C, Jiang Z, Wang Y, Tai Y, Wang C, et al. Rethinking counting and localization in crowds: A purely point-based framework. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021. pp. 3365–74.
- 41. Yuan L, Chen Y, Wu H, Wan W, Chen P. Crowd counting via localization guided transformer. Comput Elec Eng. 2022;104(Part B):1–13.
- 42. Zhouzhou M, Guanghua G, Wenrui Z. Self-attention Guidance Based Crowd Localization and Counting. Mach Intell Res. 2024.
- 43. Liu H, Zhao Q, Ma Y. Bipartite Matching for Crowd Counting with Point Supervision. IJCAI; 2021. pp. 860–6.
- 44. Wang Y, Hou J, Hou X, Chau L-P. A self-training approach for point-supervised object detection and counting in crowds. IEEE Trans Image Process. 2021;30:2876–87. pmid:33539297
- 45. Ma Z, Wei X, Hong X, Gong Y. Bayesian Loss for Crowd Count Estimation With Point Supervision. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South); 2019.
- 46. Wang CY, Bochkovskiy A, Liao HYM. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. pp. 7464–75.
- 47. Liu L, Lu H, Xiong H, Xian K, Cao Z, Shen C. Counting Objects by Blockwise Classification. IEEE Transactions on Circuits and Systems for Video Technology. 2020;30(10):3513–27.
- 48. Liu L, Lu H, Zou H, Xiong H, Cao Z, Shen C. Weighing counts: Sequential crowd counting by reinforcement learning. Computer Vision–ECCV 2020: 16th European Conference. Glasgow, UK: Springer; 2020. pp. 164–81.
- 49. Xu C, Qiu K, Fu J, Bai S, Xu Y, Bai X. Learn to Scale: Generating Multipolar Normalized Density Maps for Crowd Counting. 2019 IEEE/CVF International Conference on Computer Vision (ICCV). Seoul, Korea (South); 2019. pp. 8381–9.
- 50. Jiang X, Zhang L, Lv P, Guo Y, Zhu R, Li Y, et al. Learning Multi-Level Density Maps for Crowd Counting. IEEE Trans Neural Netw Learn Syst. 2020;31(8):2705–15. pmid:31562106
- 51. Girshick R. Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision. 2015. pp. 1440–8.
- 52. Li Z, Zhang L, Fang Y, Wang J, Xu H, Yin B, et al. Deep people counting with faster R-CNN and correlation tracking. Proceedings of the International Conference on Internet Multimedia Computing and Service. 2016. pp. 57–60.
- 53. Ge Z, Jie Z, Huang X, Xu R, Yoshie O. PS-RCNN: Detecting secondary human instances in a crowd via primary object suppression. IEEE International Conference on Multimedia and Expo (ICME). IEEE; 2020. pp. 1–6.
- 54. Girshick R, Donahue J, Darrell T, Malik J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2014. pp. 580–7.
- 55. Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Advances in Neural Information Processing Systems. 2015. pp. 91–9.
- 56. Hadsell R, Chopra S, LeCun Y. Dimensionality reduction by learning an invariant mapping. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). IEEE; 2006. pp. 1735–42.
- 57. Khosla P, Teterwak P, Wang C. Supervised contrastive learning. Adv Neural Inf Process Syst. 2020;33:18661–73.
- 58. Lin Z, Pei W, Chen F, Zhang D, Lu G. Pedestrian Detection by Exemplar-Guided Contrastive Learning. IEEE Trans Image Process. 2023;32:2003–16. pmid:35839180
- 59. Chen Y. Learning discriminative features for crowd counting. arXiv preprint. 2023.
- 60. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint. 2014.
- 61. Lin TY, Dollár P, Girshick R, He K, Hariharan B, Belongie S. Feature pyramid networks for object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017. pp. 2117–25.
- 62. Ramachandran P, Zoph B, Le QV. Searching for activation functions. arXiv preprint. 2017.
- 63. Tan M, Pang R, Le QV. EfficientDet: Scalable and efficient object detection. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. pp. 10781–90.
- 64. Kuhn HW. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly. 1955;2(1–2):83–97.
- 65. Stewart R, Andriluka M, Ng AY. End-to-end people detection in crowded scenes. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. pp. 2325–33.
- 66. Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision. 2017. pp. 2980–8.
- 67. Idrees H, Saleemi I, Seibert C, Shah M. Multi-source multi-scale counting in extremely dense crowd images. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013. pp. 2547–54.
- 68. Zhang Y, Zhou D, Chen S, Gao S, Ma Y. Single-image crowd counting via multi-column convolutional neural network. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. pp. 589–97.
- 69. Keskar NS, Socher R. Improving generalization performance by switching from Adam to SGD. arXiv preprint. 2017.
- 70. Cheng ZQ, Dai Q, Li H, Song J, Wu X, Hauptmann AG. Rethinking spatial invariance of convolutional networks for object counting. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. pp. 19638–48.
- 71. Liu W, Salzmann M, Fua P. Context-aware crowd counting. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019. pp. 5099–108.
- 72. Ma Z, Wei X, Hong X, Gong Y. Bayesian loss for crowd count estimation with point supervision. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. pp. 6142–51.
- 73. Cheng ZQ, Li JX, Dai Q, Wu X, Hauptmann AG. Learning spatial awareness to improve crowd counting. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. pp. 6152–61.
- 74. Hu Y, Jiang X, Liu X, Zhang B, Han J, Cao X, et al. NAS-Count: Counting-by-density with neural architecture search. Computer Vision–ECCV 2020: 16th European Conference. Glasgow, UK; 2020. pp. 747–66.
- 75. Wang B, Liu H, Samaras D, Nguyen MH. Distribution matching for crowd counting. Adv Neural Inf Process Syst. 2020;33:1595–607.
- 76. Shu W, Wan J, Tan KC, Kwong S, Chan AB. Crowd counting in the frequency domain. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. pp. 19618–27.
- 77. Han T, Bai L, Liu L, Ouyang W. Resolving scale variations for counting and localization via selective inheritance learning. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. pp. 21848–59.
- 78. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res. 2008;9(11).