Fig 1.
Comparison of bounding-box detection and point-based detection on the Tower dataset, which was collected from practical applications.
The upper image shows the result of the YOLOv7 bounding-box detection method, and the lower image shows the result of our point-based detection method. In this example, the point-based method gives the better result. All faces are blurred in Fig 1 for privacy preservation.
Fig 2.
Overall framework of point-based crowd counting with contrastive learning.
The blue dashed box encloses four feature layers of different sizes from the VGG backbone network. The backbone can be replaced with other architectures such as ResNet. The dashed box containing the projection head, contrastive loss, and point matching marks components that are used only during training.
Fig 3.
Structure of the multi-scale feature fusion module.
L, M, and H represent Low-level, Medium-level, and High-level features, respectively. W1, W2, and W3 are learnable weights for features at different levels. The symbol Σ denotes the element-wise weighted summation operation.
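The element-wise weighted summation described in this caption can be illustrated with a minimal sketch. It assumes the three feature maps have already been resampled to a common shape and treats W1, W2, and W3 as plain scalars; the caption does not say how the features are resized or whether the weights are normalized, so both points are assumptions here. The function name `fuse_features` is hypothetical.

```python
import numpy as np


def fuse_features(low, mid, high, weights):
    """Element-wise weighted sum of three feature maps of identical shape.

    `weights` stands in for the learnable scalars (W1, W2, W3); in a real
    model they would be trained parameters (assumption for illustration).
    """
    feats = np.stack([low, mid, high])        # shape: (3, ...) feature dims
    w = np.asarray(weights, dtype=float)      # shape: (3,)
    # Contract the level axis: sum_k w[k] * feats[k], keeping feature dims.
    return np.tensordot(w, feats, axes=1)


# Toy inputs: constant maps make the arithmetic easy to check by hand.
low = np.ones((2, 4, 4))
mid = 2.0 * np.ones((2, 4, 4))
high = 3.0 * np.ones((2, 4, 4))
fused = fuse_features(low, mid, high, weights=[0.5, 0.3, 0.2])
# Every element equals 0.5*1 + 0.3*2 + 0.2*3 = 1.7
```

In a trainable setting the three scalars would be registered as parameters of the network and updated by backpropagation along with the other weights.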
Table 1.
The overall performance of our framework.
Fig 4.
Visualization of our method on the ShanghaiTech Part A and Part B datasets.
The left image is from ShanghaiTech Part A, with a predicted crowd count of 375; the right image is from Part B, with a predicted crowd count of 18. All faces are blurred in Fig 4 for privacy preservation.
Fig 5.
Example visualization results of our method on the Tower dataset.
The left image shows a close-up view of a high-speed rail station exit, with a predicted crowd count of 2; the right image shows a distant view of a street, with a predicted crowd count of 33. All faces are blurred in Fig 5 for privacy preservation.
Table 2.
Ablation study on projection head.
Table 3.
Evaluation of the effectiveness of MSFM.
Table 4.
Evaluation of the effectiveness of contrastive loss.
Fig 6.
MAE of our method on the SHTechA crowd counting dataset for different patch numbers.
The patch number is the number of samples cropped from a single image for contrastive learning.
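For reference, MAE in crowd counting is conventionally the mean absolute difference between predicted and ground-truth counts over the test images. The sketch below follows that convention. The function name and the counts in the usage lines are hypothetical and are not results from this work.

```python
def mean_absolute_error(predicted_counts, true_counts):
    """Mean absolute difference between per-image predicted and true counts."""
    if len(predicted_counts) != len(true_counts):
        raise ValueError("count lists must have the same length")
    if not predicted_counts:
        raise ValueError("at least one image is required")
    total = sum(abs(p - t) for p, t in zip(predicted_counts, true_counts))
    return total / len(predicted_counts)


# Hypothetical counts for two images (illustration only).
mae = mean_absolute_error([100, 50], [90, 54])
# (|100 - 90| + |50 - 54|) / 2 = 7.0
```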
Fig 7.
t-SNE map of MSFM features from different models. Blue dots and red dots denote positive samples and negative samples, respectively.
Left: MSFM features of the model integrated with contrastive learning; Right: MSFM features of the baseline model.
Table 5.
Comparison of parameter count (M) and inference speed (s/100 images).