Fig 1.
A crowd image is first passed through a VGG-19 network for convolutional feature extraction. The flattened output feature map is then fed into a transformer encoder with multi-head attention. Finally, a regression decoder predicts the density map. The Optimal Transport (OT) and Total Variation (TV) losses are optimized during training.
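Two steps of this pipeline can be illustrated with a minimal, dependency-free sketch: flattening a C x H x W feature map into a token sequence for the transformer encoder, and a TV term between predicted and ground-truth density maps. The function names are hypothetical, and the TV term assumes the DM-Count-style formulation (half the L1 distance between maps normalized to sum to one); the caption does not state the exact form or weighting used in VGGT-Count, and the OT term is omitted.

```python
from typing import List

Matrix = List[List[float]]


def flatten_feature_map(fmap: List[Matrix]) -> List[List[float]]:
    """Turn a C x H x W feature map into H*W tokens of length C (row-major order)."""
    channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [fmap[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


def predicted_count(density: Matrix) -> float:
    """The crowd count is the sum over all pixels of the density map."""
    return sum(sum(row) for row in density)


def tv_loss(pred: Matrix, target: Matrix) -> float:
    """Assumed TV term: 0.5 * L1 distance between the two maps,
    each normalized to sum to 1. Any loss weighting is left out."""
    p_sum = predicted_count(pred)
    t_sum = predicted_count(target)
    if p_sum <= 0 or t_sum <= 0:
        raise ValueError("density maps must have positive total mass")
    total = 0.0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            total += abs(p / p_sum - t / t_sum)
    return 0.5 * total


if __name__ == "__main__":
    # Toy feature map with 2 channels on a 2 x 2 grid.
    fmap = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    print(flatten_feature_map(fmap))
    print(tv_loss([[1.0, 0.0]], [[0.0, 1.0]]))
```

Each spatial location becomes one token whose length equals the number of channels, which is the shape a standard transformer encoder expects. The TV term is zero when the two normalized maps coincide and reaches one when their mass does not overlap at all.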
Fig 2.
Loss variation across training epochs for the VGGT-Count network.
Table 1.
Comparison with state-of-the-art methods on ShanghaiTech A, ShanghaiTech B, and UCF-QNRF.
The top performance is highlighted in bold, while the second best is underlined.
Table 2.
Comparison of real-time performance across models in terms of model size, frame rate, and inference time.
Table 3.
Performance with different components and structures on the ShanghaiTech B dataset.
Fig 3.
Visual comparison of VGGT-Count and DM-Count results.
Fig 4.
Visualization results of VGGT-Count in different scenarios.