Controlled comparative study of YOLOv8-Pose, YOLOv11-Pose, and Detectron2 for vertebrae detection and keypoint estimation

doi:10.1371/journal.pone.0347290

Table 1.

List of common abbreviations.

Fig 1.

Image distribution across dataset splits:

The 698 fully annotated images were divided into training, validation, and test sets.

Table 2.

Comparison of anchor-free and anchor-based architectures for vertebrae detection.

Fig 2.

Unified view of YOLOv8n-Pose and YOLOv11n-Pose architectures:

Shared components are in gray, YOLOv8n path in blue, and YOLOv11n path in orange.

Fig 3.

Detectron2 keypoint R-CNN architecture with R-50 and R-101 FPN backbones:

Blue indicates the ResNet-50 path, orange indicates the ResNet-101 path, and gray shows shared components.

Table 3.

Software and hardware setup used for all training and evaluation experiments, ensuring reproducibility.

Table 4.

Detailed training configurations for pose estimation models. This table corresponds to the results presented in Table 6.

Table 5.

Training configuration under data augmentation. Remaining parameters are as in Table 4. This table corresponds to the results presented in Table 7.

Table 6.

Quantitative evaluation of models on the original dataset. Metrics are reported separately for validation and test sets to assess both model performance and generalization. Bounding box and keypoint results are expressed as percentages (%), while inference time, best epoch, and training time provide insights into computational efficiency.

Table 7.

Quantitative evaluation of models trained on the augmented dataset. Results illustrate the impact of increased data diversity and can be directly compared with Table 6.

Fig 4.

Keypoint mAP before and after augmentation:

Results are shown on the test set for both the strict (0.5:0.95) and loose (0.5) OKS thresholds.
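
The two thresholds can be related with a short sketch. It assumes the COCO convention, in which the 0.5:0.95 score is the mean of the average precision (AP) over ten thresholds from 0.50 to 0.95 in steps of 0.05, while the 0.5 score is the AP at the first threshold only. The function names and the input dictionary are hypothetical and are not taken from the study's code.

```python
def coco_thresholds():
    """Ten matching thresholds 0.50, 0.55, ..., 0.95 (COCO convention, assumed here)."""
    return [round(0.50 + 0.05 * i, 2) for i in range(10)]


def summarize_ap(ap_by_threshold):
    """Return (strict, loose) scores from per-threshold AP values.

    ap_by_threshold: hypothetical dict mapping each threshold from
    coco_thresholds() to the AP measured at that threshold.
    """
    thresholds = coco_thresholds()
    # Strict score: mean AP over all ten thresholds (0.5:0.95).
    strict = sum(ap_by_threshold[t] for t in thresholds) / len(thresholds)
    # Loose score: AP at the single 0.5 threshold.
    loose = ap_by_threshold[0.5]
    return strict, loose
```

Because AP typically falls as the threshold rises, the strict score is usually lower than the loose one, which is why the two are reported side by side.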

Fig 5.

Qualitative comparison of model predictions in a simple case with and without augmentation:

The top row shows the input image and ground truth (GT) overlay. The middle row shows predictions from models trained without augmentation, while the bottom row shows results with augmentation. Predicted bounding boxes are in green, with red dots indicating keypoints. Predictions are visualized using an IoU threshold of 0.5 for bounding boxes and an OKS threshold of 0.5 for keypoints. All models perform well here; however, YOLOv8n-Pose and Detectron2-R101 without augmentation show one extra bounding box with keypoints compared to the ground truth. After augmentation, all bounding boxes and keypoints align correctly. Yellow arrows highlight keypoints that appear missing, usually due to occlusion rather than prediction failure.
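
For readers unfamiliar with the IoU criterion used for bounding boxes here, the following is a minimal sketch of how a predicted box might be compared with a ground-truth box at a threshold of 0.5. The `(x1, y1, x2, y2)` box format and the function names are assumptions made for illustration; they are not the evaluation code used in the study.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlap rectangle (empty if the boxes do not intersect).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box, gt_box, threshold=0.5):
    """Treat a prediction as matching when its IoU reaches the threshold (assumed rule)."""
    return box_iou(pred_box, gt_box) >= threshold
```

Under this rule, an extra predicted box that overlaps no ground-truth vertebra by at least 0.5 IoU counts as a false positive.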

Fig 6.

Qualitative results for a challenging case:

Vertebrae detection and keypoint localization by the different models, with and without augmentation (row layout as in Fig 5). Predictions are visualized using an IoU threshold of 0.5 for bounding box matching and an OKS threshold of 0.5 for keypoint evaluation. Blue arrows highlight incorrectly predicted keypoints. Yellow arrows indicate keypoints that appear missing, usually due to occlusion rather than prediction failure. Orange dotted boxes mark expected vertebra locations, and purple boxes denote missing detections. For missing detections, only the purple outline is shown, without keypoints, to reduce visual clutter. Only a representative subset of errors is annotated for clarity.
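
The OKS (object keypoint similarity) criterion can be sketched as follows, assuming the COCO-style definition in which each visible keypoint contributes exp(-d² / (2·s²·k²)), with d the distance to the ground truth, s² the object area, and k = 2σ a per-keypoint constant. The function name, the four-keypoint layout, and the σ value used in the usage note are hypothetical; the study's actual constants are not stated in this caption.

```python
import numpy as np


def oks(pred_xy, gt_xy, visible, area, sigmas):
    """COCO-style object keypoint similarity for one object (illustrative sketch).

    pred_xy, gt_xy: (K, 2) keypoint coordinates in pixels.
    visible: length-K flags, True where the ground-truth keypoint is labeled.
    area: ground-truth object area in pixels squared (plays the role of s^2).
    sigmas: length-K per-keypoint constants (hypothetical values).
    """
    pred_xy = np.asarray(pred_xy, dtype=float)
    gt_xy = np.asarray(gt_xy, dtype=float)
    visible = np.asarray(visible, dtype=bool)
    sigmas = np.asarray(sigmas, dtype=float)
    if not visible.any():
        return 0.0
    d2 = np.sum((pred_xy - gt_xy) ** 2, axis=1)  # squared pixel distances
    k2 = (2.0 * sigmas) ** 2                     # k_i^2 with k_i = 2 * sigma_i
    e = d2 / (2.0 * area * k2 + np.finfo(float).eps)
    # Average the per-keypoint similarities over labeled keypoints only.
    return float(np.mean(np.exp(-e[visible])))
```

For example, with four corner keypoints and a placeholder σ of 0.05 for each, `oks(pred, gt, [True] * 4, area, [0.05] * 4) >= 0.5` would mark the vertebra's keypoints as correctly localized at the threshold used in this figure. Keypoints that are not labeled in the ground truth do not affect the score.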

Fig 7.

Comparison of box loss and pose loss on the original dataset:

Loss curves for YOLO (v8n, v11n) and Detectron2 (R50, R101) models are shown, illustrating convergence behavior during training.

Fig 8.

Comparison of box loss and pose loss on the augmented dataset:

Loss curves for YOLO (v8n, v11n) and Detectron2 (R50, R101) models are shown, illustrating the effect of data augmentation on training convergence.

Table 8.

Performance of additional YOLO variants on the validation set (original dataset). All parameters are consistent with those in Table 4.

Table 9.

Performance of Detectron2 under its best-performing configurations. Training parameters are listed first for clarity, followed by performance metrics on the validation set (original dataset). All other parameters are consistent with those in Table 4.

Table 10.

Performance of all models across five independent training runs on the test set (original dataset). Metrics are reported as mean ± standard deviation across five random seeds. Bold values indicate top performers.
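
A minimal sketch of how a "mean ± standard deviation" entry could be produced from five per-seed scores is given below. It assumes the sample standard deviation (n − 1 denominator); the caption does not say whether the sample or population form was used, and the function names are hypothetical.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed metric values."""
    mean = statistics.mean(scores)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    std = statistics.stdev(scores)
    return mean, std


def format_mean_std(scores, digits=2):
    """Format per-seed scores as 'mean ± std' with a fixed number of decimals."""
    mean, std = summarize_runs(scores)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"
```

Calling `format_mean_std` on the five test-set values of one metric for one model gives the string for a single table cell. A small standard deviation relative to the gap between two models suggests that the ranking does not depend on the random seed.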

Table 11.

Effect of individual augmentation types on keypoint performance across all models on the test set (augmented dataset). Bold values indicate top performers.
