Towards real-world monitoring scenarios: An improved point prediction method for crowd counting based on contrastive learning

doi:10.1371/journal.pone.0327397

Fig 1.

Comparison of bounding box detection and point detection on the Tower dataset, which was collected from practical applications.

The upper image shows the output of the YOLOv7 bounding box detection method, while the lower image shows the output of our point-based detection method, which performs better on this scene. All faces in Fig 1 are blurred for privacy preservation.

Fig 2.

Overall framework of point-based crowd counting with contrastive learning.

The blue dashed box encloses four feature layers of different sizes from the VGG backbone network. The backbone can be replaced with other architectures such as ResNet. The components in the dashed box containing the projection head, contrastive loss, and point matching are used only during training.

Fig 3.

Structure of the multi-scale feature fusion module.

L, M, and H represent Low-level, Medium-level, and High-level features, respectively. W1, W2, and W3 are learnable weights for features at different levels. The symbol Σ denotes the element-wise weighted summation operation.
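
The weighted summation described in this caption can be illustrated with a minimal sketch. It assumes that the three feature maps have already been brought to the same shape, and that the raw weights are normalized with a softmax so that they sum to 1; the caption states neither, so both are assumptions, and the names `softmax` and `fuse` are hypothetical. Plain nested lists stand in for feature tensors.

```python
import math


def softmax(raw_weights):
    """Normalize raw weights so they are positive and sum to 1 (assumed scheme)."""
    peak = max(raw_weights)
    exps = [math.exp(w - peak) for w in raw_weights]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(low, mid, high, raw_weights):
    """Element-wise weighted sum of three equally sized 2D feature maps."""
    w1, w2, w3 = softmax(raw_weights)
    return [
        [w1 * l + w2 * m + w3 * h for l, m, h in zip(low_row, mid_row, high_row)]
        for low_row, mid_row, high_row in zip(low, mid, high)
    ]


# Toy 2x2 feature maps standing in for the L, M, and H features.
low = [[1.0, 2.0], [3.0, 4.0]]
mid = [[0.0, 0.0], [0.0, 0.0]]
high = [[2.0, 2.0], [2.0, 2.0]]

# Equal raw weights give each level a weight of 1/3.
fused = fuse(low, mid, high, [0.0, 0.0, 0.0])
```

In a trained model the three raw weights would be updated by backpropagation together with the rest of the network; here they are fixed values chosen only to make the arithmetic easy to follow.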

Table 1.

The overall performance of our framework.

Fig 4.

Visualization of our method on the ShanghaiTech Part A and Part B datasets.

The left image is from ShanghaiTech Part A, with a predicted crowd count of 375; the right image is from Part B, with a predicted crowd count of 18. All faces are blurred in Fig 4 for privacy preservation.

Fig 5.

Example visualization results of our method on the Tower dataset.

The left image shows a close-up view of a high-speed rail station exit, with a predicted crowd count of 2; the right image shows a distant view of a street, with a predicted crowd count of 33. All faces are blurred in Fig 5 for privacy preservation.
