Fig 1.
Our CTP network avoids the complex matching task.
(a) We estimate 2D heatmaps from all views. (b) After all 2D keypoint heatmaps are projected into a common 3D space, the space is voxelized into a regular grid. (c) The front layers of the backbone convolve this volume to produce preliminary 3D feature maps. (d) The 3D feature maps are transformed into 2D feature maps and passed to a 2D CNN, which generates the center of each person in the top view. (e) The 3D bounding box is regressed. (f) The 3D bounding box is voxelized into a finer grid for estimating an accurate 3D pose. (g) Our network outputs the estimated 3D poses.
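Steps (a) and (b) can be illustrated with a minimal sketch. It assumes a pinhole camera given as a 3x4 projection matrix, nearest-neighbor heatmap sampling, and fusion by averaging over views; none of these choices is stated in the caption, and the function names `voxel_centers`, `project`, `sample`, and `fuse_views` are hypothetical.

```python
from itertools import product


def voxel_centers(origin, size, n):
    """Centers of an n[0] x n[1] x n[2] grid of cubic voxels with edge `size` (meters)."""
    return [
        tuple(origin[a] + (idx[a] + 0.5) * size for a in range(3))
        for idx in product(range(n[0]), range(n[1]), range(n[2]))
    ]


def project(P, X):
    """Project 3D point X with a 3x4 matrix P; return (u, v), or None if behind the camera."""
    xh = (X[0], X[1], X[2], 1.0)
    u, v, w = (sum(P[r][c] * xh[c] for c in range(4)) for r in range(3))
    if w <= 0:
        return None
    return u / w, v / w


def sample(heatmap, uv):
    """Nearest-neighbor lookup in a row-major heatmap; 0.0 outside the image."""
    if uv is None:
        return 0.0
    col, row = round(uv[0]), round(uv[1])
    if 0 <= row < len(heatmap) and 0 <= col < len(heatmap[0]):
        return heatmap[row][col]
    return 0.0


def fuse_views(heatmaps, cameras, centers):
    """Assumed fusion rule: average the sampled heatmap value over all views."""
    return [
        sum(sample(h, project(P, X)) for h, P in zip(heatmaps, cameras)) / len(cameras)
        for X in centers
    ]
```

In this sketch, `fuse_views` is called once per keypoint type, with one heatmap and one projection matrix per view, and returns one fused score per voxel center.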
Fig 2.
The structure of our backbone.
The input is the voxelized common space containing all projected 2D heatmaps. The CTP head outputs the center and bounding box of each person, and the PRM then produces the 3D pose. C, M, N, and P denote the input parameters of the convolution blocks.
Fig 3.
Results of the ablation experiments.
(a) shows our best result, which serves as the baseline for comparison. (b) and (c) show the visual results when the voxel anchor scale of the 3D common space is changed to 0.15 m and 0.25 m. (d) and (e) show the visual results when the voxel anchor scale of the 3D bounding box is changed to 0.1 m and 0.25 m. In (f), (g), and (h), the 3D bounding box scale is changed to 1 m, 2.8 m, and 6.4 m. To speed up computation in (f), (g), and (h), the voxel anchor scale of the 3D bounding box is set to 0.1 m, so these results are compared with (d).
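The cost trade-off behind these settings can be sketched with a small calculation. It assumes that the box scale is the edge length of a cubic box and that the voxel anchor scale is the edge length of one cubic voxel; the caption does not confirm either reading, and the helper names are hypothetical.

```python
def cells_per_side(box_edge_m, voxel_edge_m):
    """Number of voxels along one edge of a cubic box (assumed interpretation)."""
    # round() absorbs floating-point error, e.g. 2.8 / 0.1 evaluates to 27.999...
    return max(1, round(box_edge_m / voxel_edge_m))


def voxel_count(box_edge_m, voxel_edge_m):
    """Total voxels in the cubic box; grows cubically as the voxel edge shrinks."""
    return cells_per_side(box_edge_m, voxel_edge_m) ** 3
```

Under these assumptions, a 2.8 m box at a 0.1 m voxel edge has 28 cells per side, and enlarging the box or shrinking the voxel increases the voxel count cubically.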
Table 1.
PCP results of the ablation study on our CTP network.
The number in each parameter name denotes the anchor scale or box scale.
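For reference, PCP (percentage of correct parts) is commonly computed as sketched below: a limb counts as correct when the mean error of its two endpoints is at most half the ground-truth limb length. The threshold `alpha=0.5`, the limb list, and the function name `pcp` are assumptions of this sketch; the exact evaluation protocol used for the table is not given in the caption.

```python
import math


def pcp(pred, gt, limbs, alpha=0.5):
    """Fraction of limbs whose mean endpoint error is <= alpha * ground-truth limb length.

    pred, gt: sequences of 3D joint positions (x, y, z) in the same order.
    limbs: pairs of joint indices, one pair per body part.
    """
    correct = 0
    for a, b in limbs:
        mean_error = (math.dist(pred[a], gt[a]) + math.dist(pred[b], gt[b])) / 2
        if mean_error <= alpha * math.dist(gt[a], gt[b]):
            correct += 1
    return correct / len(limbs)
```

The function returns a fraction in [0, 1]; multiplying by 100 gives the percentage form in which PCP is usually reported.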
Fig 4.
Visual results on the Shelf dataset.
(a1) shows the standard visualization on the Shelf dataset. (a2) shows the 2D poses obtained by projecting the estimated 3D poses; these are not the 2D poses estimated by HRNet. The result in (a2) shows that multi-view 3D pose estimation can compensate for pose information that is not visible in a view.
Fig 5.
3D pose estimation results on the Campus dataset.
Although there are only three people and they appear at a small scale, complete 3D poses are still estimated, which shows that our backbone is effective.
Table 2.
Quantitative comparison on the Campus and Shelf datasets using PCP.
Results for other methods are taken from their respective papers.
Fig 6.
Multi-person visual results on the CMU Panoptic dataset.
We select subsets with different numbers of people to evaluate our method. Our method remains robust in multi-person scenes with more than 5 people.
Table 3.
Comparison with [37] on the CMU Panoptic dataset using 5 views.