Abstract
Deep learning (DL) models are widely adopted in biomedical imaging, where image segmentation is increasingly recognized as a quantitative tool for extracting clinically meaningful information. However, model performance critically depends on dataset size and training configuration, including model capacity. Traditional sample size estimation methods are inadequate for DL due to its reliance on high-dimensional data and its nonlinear learning behavior. To address this gap, we propose a DL-specific framework to estimate the minimal dataset size required for stable segmentation performance. We validate this framework across two distinct clinical tasks: colorectal polyp segmentation from 2D endoscopic images (Kvasir-SEG) and glioma segmentation from 3D brain MRIs (BraTS 2020). We trained residual U-Nets—a simple, yet foundational architecture—across 200 configurations for Kvasir-SEG and 40 configurations for BraTS 2020, varying data subsets (2%–100% for the 2D task and 5%–100% for the 3D task). In both tasks, performance metrics such as the Dice Similarity Coefficient (DSC) consistently improved with increasing data and depth, but gains invariably plateaued beyond approximately 80% data usage. The best configuration for polyp segmentation (6 layers, 100% data) achieved a DSC of 0.86, while the best for brain tumor segmentation reached a DSC of 0.79. Critically, we introduce a surrogate modeling pipeline using Long Short-Term Memory (LSTM) networks to predict these performance curves. A simple uni-directional LSTM model accurately forecasted the final DSC with low mean absolute error across both tasks. These findings demonstrate that segmentation performance can be reliably estimated with lightweight models, suggesting that collecting a moderate amount of high-quality data is often sufficient for developing clinically viable DL models. Our framework provides a practical, empirical method for optimizing resource allocation in medical AI development.
Citation: Lee J, Chung H, Suh M, Lee J-H, Choi KS (2025) Deep learning for deep learning performance: How much data is needed for segmentation in biomedical imaging? PLoS One 20(12): e0339064. https://doi.org/10.1371/journal.pone.0339064
Editor: Ziyu Qi, University of Marburg: Philipps-Universitat Marburg, GERMANY
Received: June 14, 2025; Accepted: December 1, 2025; Published: December 31, 2025
Copyright: © 2025 Lee et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The datasets analyzed during the current study are publicly available. The Kvasir-SEG dataset can be accessed at https://datasets.simula.no/kvasir-seg/, and the BraTS 2020 dataset can be accessed at https://www.med.upenn.edu/cbica/brats2020/.
Funding: This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. RS-2023-00251022) (K.S.C.); the Phase III (Postdoctoral fellowship) grant of the SPST (SNU-SNUH Physician Scientist Training) Program (K.S.C.); the SNUH Research Fund (No. 04-2024-0600) (K.S.C.); and the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI) grant funded by the Ministry of Health & Welfare (No. RS-2024-00439549) (K.S.C.).
Competing interests: The authors have declared that no competing interests exist.
Introduction
In the field of biomedical imaging, deep learning (DL) models are increasingly being adopted to extract quantitative information and support clinical decision-making [1,2]. Recently, segmentation has evolved into a powerful tool for deriving novel imaging biomarkers from routine scans by precisely delineating anatomical and pathological structures [3]. Recognizing this transformative potential, regulatory agencies have introduced streamlined approval processes for AI-enabled medical technologies [4,5].
Despite these advances, a persistent challenge in developing DL models lies in acquiring sufficient training data [6,7] (Fig 1). While larger datasets are generally presumed to enhance model generalization, real-world limitations—such as high annotation costs, patient privacy constraints, and clinical heterogeneity—often hinder large-scale data collection efforts. Biomedical datasets also frequently exhibit class imbalances [8] and institutional or demographic bias [9], further complicating the development of robust and generalizable models.
Fig 1. The graph depicts the relationship between the training dataset ratio and the corresponding test error for models of varying complexities, categorized by the number of layers.
The relationship between data volume, model complexity, and performance is not linear. While increasing data or model size can improve performance, recent studies reveal that beyond certain thresholds, this scaling can induce performance degradation or plateau effects [10–12], underscoring the need for principled, task-specific planning. This raises a critical and practical question: “how much is enough?” While the pursuit of state-of-the-art performance is paramount, understanding how to achieve it efficiently is vital, particularly as not all research and clinical institutions have access to unlimited computational resources. The ability to forecast performance and estimate the point of diminishing returns can guide more sustainable and targeted research strategies.
This question is not new. As early as 2015, Cho et al. proposed an empirical framework for estimating dataset requirements in medical imaging [13], and similar questions have continued to appear in the literature [2,14,15], reflecting an enduring demand for data-efficient DL strategies. Traditional statistical power analysis methods, often used to determine minimal sample size, are ill-suited for DL-based medical imaging tasks. The complexity of high-dimensional data, annotation noise, and nonlinear model behavior deviates significantly from classical assumptions, necessitating empirical, DL-specific frameworks for data planning. While foundation models have shown promise for zero- and one-shot segmentation, they often struggle to generalize across diverse medical modalities and clinical settings without significant fine-tuning [16]. Therefore, for many specialized tasks, supervised learning remains the most practical paradigm.
To address this need, we propose a DL-specific framework to estimate the minimal dataset size required for stable segmentation performance. This approach can provide “DL-specific” guidance for study design, serving as a practical alternative to traditional statistical methods. Using two distinct proof-of-concept tasks—colorectal lesion segmentation and brain tumor segmentation—we systematically vary dataset size and model depth in residual U-Nets under controlled conditions, isolating the effects of these two core variables on performance. Our findings reinforce the perspective that acquiring moderate amounts of high-quality data may yield greater value than pursuing indiscriminate scale [17,18], a view that aligns with emerging trends in efficient model development.
Materials and methods
Generalizable function determination method
The universal approximation theorem asserts that neural networks with one or more hidden layers can approximate any function to arbitrary accuracy within a training dataset [19]. However, this theorem guarantees approximation capabilities only within the training environment, without ensuring generalization to unseen data. Generalization must therefore be validated on independent test datasets to evaluate performance beyond the training data. Notably, many studies emphasize that the success of neural networks stems not solely from their approximation capabilities but also from their capacity to extract meaningful features [20]. This feature-extraction ability suggests that a neural network trained on a particular task has potential predictive power on independent test datasets.
The final model performance, influenced by multiple training-related factors, arises from interactions among variables such as network architecture, initial weights, data distribution, optimization techniques, and hyperparameter configurations. These variables considerably affect both model learning capacity and generalization performance, yet accurately quantifying their individual and combined effects remains an open research challenge. To address this issue, we propose a novel framework based on “learning dynamics.” Traditional approaches from statistical learning theory, such as VC dimension [21] and Rademacher complexity [22], offer theoretical bounds on model performance but typically lack direct applicability in real-world contexts [23]. Therefore, building upon these theoretical foundations, we adopt a computational methodology supplemented by analytical insights where necessary.
Hypothesis validation pipeline
The interplay between dataset size and model complexity significantly influences model performance. Building upon this relationship, once the function to be learned is established as generalizable, it becomes feasible to estimate the dataset size necessary to achieve clinically relevant performance. This estimation employs a predictive algorithm trained to capture the relationship between dataset size, model complexity, and test performance. Specifically, desired test performance is estimated by training a secondary neural network with performance metrics obtained from various combinations of dataset sizes and model complexities.
In this study, we introduce a hypothesis validation pipeline to predict performance for task-specific deep learning (DL) models. The pipeline comprises two primary components: a task model and a performance prediction model. The task model is trained on datasets of varying sizes and models with differing parameter counts, allowing exploration of how these factors impact performance for a specific task. The performance prediction model receives dataset size and model parameter count as inputs, predicting task model performance under specified conditions. This secondary model is trained to capture the “learning dynamics” of the task model, enabling predictions of generalization error based on training configurations. Integrating these components, our pipeline estimates the optimal dataset size required to meet target test error thresholds for clinical applications. This approach simplifies resource allocation for training and provides insights into how training conditions influence model generalization. Consequently, the pipeline offers a systematic methodology to optimize neural network development in biomedical imaging contexts.
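To make the pipeline concrete, the following minimal Python sketch simulates the two-stage procedure. Here train_task_model is a hypothetical placeholder that mimics a saturating learning curve; in the actual pipeline it would train a residual U-Net and return its test-set DSC.

```python
# Minimal, self-contained sketch of the hypothesis validation pipeline.
# train_task_model() is a placeholder that simulates a saturating learning
# curve; in practice it would train the task model and return test DSC.
import math

def train_task_model(depth: int, data_ratio: float) -> float:
    # Placeholder dynamics: DSC rises with data and depth, then plateaus.
    return 0.86 * (1 - math.exp(-6 * data_ratio)) * (0.88 + 0.02 * depth)

data_ratios = [r / 100 for r in range(2, 101, 2)]  # 2%..100% at 2% steps
depths = [3, 4, 5, 6]                              # 4 x 50 = 200 configurations

# Stage 1: sweep the (depth, data ratio) grid and record test performance.
dsc_curves = {d: [train_task_model(d, r) for r in data_ratios] for d in depths}

# Stage 2 (conceptual): a secondary model is fit on the observed prefix of
# each curve and extrapolates the remainder; see the LSTM sketch in
# "Performance prediction model" below.
```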
Biomedical segmentation tasks
As a proof-of-concept, we applied our framework to two distinct segmentation tasks.
Task 1: Colorectal cancer lesion segmentation.
We first developed a DL model for the automatic segmentation of polyps from 2D colonoscopy images. This task is critical for the early detection of colorectal cancer, a leading cause of cancer-related mortality worldwide. For this task, we utilized the Kvasir-SEG dataset [24], a publicly available collection of 1,000 polyp images and their corresponding ground truth masks. Image resolutions vary from 332 × 487 to 1920 × 1072 pixels. The dataset was partitioned into training (n = 700), validation (n = 100), and test subsets (n = 200). All images were standardized to a 256 × 256 pixel resolution for consistency, and ground-truth annotations were provided as binary regions of interest (ROIs).
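A hedged sketch of this preprocessing, assuming Kvasir-SEG has been downloaded locally; the images/ and masks/ directory names and the contiguous 700/100/200 split are illustrative assumptions.

```python
# Sketch of Task 1 preprocessing: resize to 256 x 256 and binarize masks.
from pathlib import Path
import numpy as np
from PIL import Image

def load_pair(name: str, size: int = 256):
    """Load one image/mask pair, resize, and binarize the ROI mask."""
    img = Image.open(Path("images") / name).convert("RGB").resize((size, size))
    msk = Image.open(Path("masks") / name).convert("L").resize((size, size))
    x = np.asarray(img, dtype=np.float32) / 255.0                     # HxWx3 in [0, 1]
    y = (np.asarray(msk, dtype=np.float32) > 127).astype(np.float32)  # binary ROI
    return x, y

names = sorted(p.name for p in Path("images").glob("*.jpg"))
train, val, test = names[:700], names[700:800], names[800:]           # 700/100/200
```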
Task 2: Brain tumor segmentation.
To validate the generalizability of our framework, we selected a second, more complex task: multi-class glioma segmentation from 3D brain MRI scans. We used the publicly available BraTS 2020 dataset [25], which contains multimodal 3D MRIs (T1, T1-Gd, T2, T2-FLAIR) and ground-truth masks for enhancing tumor (ET), tumor core (TC), and whole tumor (WT). For this study, we focused on the whole tumor segmentation task. The dataset was partitioned into training (n = 258), validation (n = 37), and test (n = 74) sets. All MRI volumes were co-registered, interpolated to a uniform 1 mm³ isotropic resolution, and skull-stripped. We extracted 128 × 128 × 128 voxel patches from the volumes for training to manage the computational load of the 3D data.
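The patch extraction step can be sketched as below, assuming each preprocessed case is stored as a 4-channel NumPy volume (T1, T1-Gd, T2, T2-FLAIR) with a matching whole-tumor mask; names and shapes are illustrative.

```python
# Sketch of random 128^3 patch extraction from a (C, D, H, W) volume.
import numpy as np

def random_patch(volume: np.ndarray, mask: np.ndarray, size: int = 128):
    _, d, h, w = volume.shape
    rng = np.random.default_rng()
    z = rng.integers(0, max(d - size, 0) + 1)   # upper bound is exclusive
    y = rng.integers(0, max(h - size, 0) + 1)
    x = rng.integers(0, max(w - size, 0) + 1)
    sl = (slice(z, z + size), slice(y, y + size), slice(x, x + size))
    return volume[(slice(None),) + sl], mask[sl]
```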
Model architecture and training
The task-specific DL model employed a residual U-Net architecture [26,27] (Fig 2) for lesion segmentation.
Fig 2. The network employs an encoder-decoder structure connected via skip connections, enhancing spatial detail retention. Encoder layers feature residual connections in the down-sampling pathway. The image and lesion mask were obtained from the publicly available Kvasir-SEG dataset (https://datasets.simula.no/dataset/kvasir-seg).
Justification for model choice.
Our selection of the residual U-Net was a deliberate strategic decision. The U-Net architecture is a seminal and foundational model in biomedical image segmentation. Its encoder-decoder structure with skip connections has become the de facto standard and forms the backbone of countless state-of-the-art models currently in use. By focusing our investigation on a fundamental property of this core architecture—network depth—we aim to provide insights that are broadly applicable to the entire community that builds upon this framework. This focused approach allows us to isolate the impact of dataset size and model depth without introducing the confounding variables that would arise from a comparison across many disparate, complex architectures.
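One plausible way to instantiate such residual U-Nets of varying depth is via MONAI's UNet, which follows the Kerfoot et al. residual U-Net design [26]. Interpreting "depth" as the number of encoder levels, and the doubling channel progression, are assumptions; the exact configuration is not specified in the text.

```python
# Hedged sketch: residual U-Nets of configurable depth with MONAI.
from monai.networks.nets import UNet

def build_unet(depth: int, spatial_dims: int = 2) -> UNet:
    channels = tuple(16 * 2 ** i for i in range(depth))  # depth 4 -> (16, 32, 64, 128)
    return UNet(
        spatial_dims=spatial_dims,            # 2 for endoscopy, 3 for MRI
        in_channels=3 if spatial_dims == 2 else 4,
        out_channels=1,                       # binary ROI / whole-tumor mask
        channels=channels,
        strides=(2,) * (depth - 1),
        num_res_units=2,                      # residual units per layer
    )

model_2d = build_unet(depth=6)                    # deepest Task 1 model
model_3d = build_unet(depth=5, spatial_dims=3)    # deepest Task 2 model
```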
Training protocol.
The model was trained in a supervised learning framework on paired input images and manually annotated ROIs, using pixel-wise binary cross-entropy loss optimized with the Adam optimizer [28]. Model complexity was modulated by varying the convolutional network depth. For the polyp segmentation task, we used depths of 3, 4, 5, or 6 layers and generated 50 distinct dataset sizes by sampling incremental subsets of the training data from 2% to 100% at 2% intervals (200 configurations in total). For the brain tumor segmentation model, we used depths of 3 or 5 layers and sampled the training data from 5% to 100% at 5% intervals (40 configurations in total).
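A minimal PyTorch sketch of this protocol follows; the learning rate, batch size, and epoch count are assumptions not stated above, and nested subsets of a fixed shuffle stand in for the incremental sampling.

```python
# Hedged sketch of the training protocol: pixel-wise binary cross-entropy
# (with-logits variant) and the Adam optimizer on a fractional subset.
import torch
from torch.utils.data import DataLoader, Subset

def train_on_subset(model, train_ds, ratio: float, epochs: int = 50):
    n = max(1, int(ratio * len(train_ds)))               # e.g. ratio = 0.02 .. 1.0
    loader = DataLoader(Subset(train_ds, range(n)), batch_size=8, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.BCEWithLogitsLoss()               # pixel-wise binary CE
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
    return model
```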
Performance prediction model.
To forecast segmentation performance (as measured by DSC) across dataset sizes, four distinct performance prediction models based on long short-term memory (LSTM) [29] networks were developed (Fig 3):
Fig 3. The one-step prediction approaches independently predict outcomes at each individual time point, while the full-step prediction approach sequentially predicts multiple future outcomes, utilizing previously predicted results to inform subsequent predictions.
Uni-directional one-step prediction model: This model predicts the DSC value for the next data point (t) using only the sequence of actual, observed DSC values from previous data points (1 to t-1). It makes a single forecast at each step.
Uni-directional full-step prediction model: This model is autoregressive. It first uses the observed DSC values to make an initial prediction, and then uses its own previously predicted DSCs as input to forecast all subsequent values in the sequence (sketched in code below).
Bi-directional one-step prediction model: This model uses a bi-directional LSTM to process the input sequence of observed DSCs from both forward (past to present) and backward (future to present) directions, providing a richer context to predict the DSC value at each individual step.
Bi-directional full-step prediction model: Similar to its uni-directional counterpart, this model is autoregressive but leverages a bi-directional LSTM. It uses context from both directions to predict the entire sequence of DSC values sequentially.
All LSTM models were optimized with the Adam optimizer, using the DSC data generated by the task models as input.
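A minimal PyTorch sketch of the uni-directional full-step (autoregressive) variant is shown below; the hidden size is an assumption, and the model is seeded with observed DSC values before rolling out its own predictions.

```python
# Hedged sketch of the uni-directional full-step LSTM predictor.
import torch
import torch.nn as nn

class FullStepLSTM(nn.Module):
    def __init__(self, hidden: int = 32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, observed: torch.Tensor, horizon: int) -> torch.Tensor:
        # observed: (B, T, 1) DSC values; returns (B, horizon) forecasts.
        out, state = self.lstm(observed)
        step = self.head(out[:, -1:])                # first forecast
        preds = [step]
        for _ in range(horizon - 1):
            out, state = self.lstm(step, state)      # feed prediction back in
            step = self.head(out)
            preds.append(step)
        return torch.cat(preds, dim=1).squeeze(-1)
```

Seeded with, say, the first 25 observed points of a 50-point curve, the model forecasts the remaining 25 points in one autoregressive pass.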
Evaluation metrics
The performance of the task-specific DL models was assessed using standard segmentation metrics, including Intersection over Union (IoU), Dice Similarity Coefficient (DSC), Recall, and Precision. DSC values from the task-specific model on the test set served as both inputs and target outputs for the performance prediction models. Predictive performance of the LSTM models was evaluated using Mean Absolute Error (MAE).
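These metrics can be computed directly from binary masks, as in the sketch below; the epsilon guard against empty masks is an implementation assumption.

```python
# Sketch of the evaluation metrics on binary masks (NumPy arrays of 0/1).
import numpy as np

def seg_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> dict:
    tp = np.sum(pred * gt)                 # true positives
    fp = np.sum(pred * (1 - gt))           # false positives
    fn = np.sum((1 - pred) * gt)           # false negatives
    return {
        "IoU": tp / (tp + fp + fn + eps),
        "DSC": 2 * tp / (2 * tp + fp + fn + eps),
        "Recall": tp / (tp + fn + eps),
        "Precision": tp / (tp + fp + eps),
    }

def mae(pred_dsc: np.ndarray, true_dsc: np.ndarray) -> float:
    return float(np.mean(np.abs(pred_dsc - true_dsc)))
```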
Results
Task 1: Colorectal lesion segmentation evaluation
S1–S4 Figs display heatmaps of segmentation metrics (IoU, DSC, Recall, Precision) on the Kvasir-SEG test set as functions of training data ratio and model depth. All metrics demonstrate improvement with larger training datasets, with deeper architectures showing pronounced gains once the training data ratio surpasses approximately 30%. The best configuration (6 layers, 100% data) achieved an IoU of 0.76, DSC of 0.86, Recall of 0.84, and Precision of 0.88. S5–S12 Figs further illustrate these trends. Fig 4 shows representative segmentation masks, where accuracy clearly improves with both increasing data and model depth. Deeper models consistently produce more refined delineations, reflecting their enhanced capacity to capture subtle details. A minor but notable observation in the performance curves (Fig 5) is the presence of small fluctuations, particularly at lower data ratios. These fluctuations are likely attributable to the stochastic elements inherent in DL training, such as random weight initialization and data shuffling, as well as the specific composition of the smaller, randomly sampled data subsets.
Fig 4. Representative examples of polyp segmentation on colonoscopy images. Sub-figures (a) and (b) compare the performance of a shallow model (3 layers, top row) and a deep model (6 layers, bottom row) trained on increasing amounts of data (2%, 30%, 60%, and 100%). The predicted segmentation masks (blue overlay) become progressively more accurate and aligned with the ground truth as both training data volume and model depth increase.
Fig 5. Performance prediction results for the Kvasir-SEG dataset using the uni-directional full-step LSTM model. The scatter points represent the actual Dice Similarity Coefficient (DSC) scores achieved by the U-Net model for each layer depth. The solid lines show the LSTM model’s predicted performance curve, which closely tracks the actual results. The dashed lines represent a simple linear regression baseline for comparison.
Task 2: Brain tumor segmentation evaluation
To confirm the generalizability of our findings, we replicated the experiment on the BraTS 2020 dataset. Fig 6 provides qualitative examples, showing how the predicted tumor masks on MRI scans become more accurate and complete as training data and model depth increase, closely aligning with the ground truth. This visual improvement is mirrored by the quantitative results, summarized in Fig 7, which show a remarkably similar trend. Performance on the whole tumor segmentation task improved consistently with increases in both the training data ratio and model depth. The best-performing model (5 layers, 100% data) achieved a DSC of 0.79 and an IoU of 0.65. As with the Kvasir-SEG experiment, performance gains began to plateau significantly after the training data ratio exceeded 80%. The 3D nature of the task resulted in overall higher computational requirements, but the fundamental relationship between data, complexity, and performance held.
Fig 6. Representative examples of whole tumor segmentation on 3D brain MRI scans. The figure illustrates how predicted tumor masks (colored overlay) improve with more data and greater model complexity. Each row compares a shallow model (3 layers) to a deep model (5 layers) trained on varying data ratios (e.g., 5%, 30%, 60%, 100%). Deeper models trained on more data produce segmentations that more accurately delineate the tumor boundaries, closely matching the ground truth.
Fig 7. Performance prediction results for the BraTS 2020 dataset, generated by the uni-directional full-step LSTM model. As in Fig 5, the plot compares the actual DSC scores (scatter points) from the segmentation task with the predicted performance curve from the LSTM model (solid lines) and a linear regression baseline (dashed lines). The model accurately forecasts the performance trend, showing consistent improvement before plateauing.
Model complexity and training time analysis
To provide a comprehensive reference for practical applications, we analyzed the computational cost of our models. Table 1 details the parameter counts, computational complexity, and inference throughput for each model depth. As expected, the number of parameters increases rapidly with model depth, leading to a substantial rise in computational demand and training duration. This analysis highlights the critical trade-off between model performance and the computational resources required to train and deploy such models.
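Parameter counts of the kind reported in Table 1 can be reproduced in one line per configuration, as sketched below (reusing the hypothetical build_unet constructor from the architecture sketch above).

```python
# Count trainable parameters for each 2D model depth.
for depth in (3, 4, 5, 6):
    n = sum(p.numel() for p in build_unet(depth).parameters())
    print(f"depth {depth}: {n / 1e6:.2f}M parameters")
```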
Task performance prediction evaluation
Fig 5 demonstrates the predictive accuracy of the uni-directional full-step performance prediction model in estimating DSC scores for the Kvasir-SEG task. The scatter points represent actual DSC scores, while solid lines indicate the model’s predictions. The prediction model reliably extrapolates performance, closely aligning with empirically observed scores, particularly for configurations using ≥50% of the training data, underscoring its reliability in forecasting segmentation performance. This stepwise predictive approach effectively captures the inherent learning dynamics of the task model. Similarly, Fig 7 shows the prediction results for the BraTS 2020 task, where the LSTM model again accurately forecasts the performance trend, showing consistent improvement before plateauing. S13–S18 Figs illustrate similarly high predictive accuracy for the other LSTM model variants across both tasks.
Comparison of task performance prediction models
The influence of model depth on prediction accuracy was evaluated across all LSTM methods for the Kvasir-SEG dataset. As shown in Table 2, prediction errors generally decreased with increasing model depth, indicating enhanced predictive stability in deeper models. The uni-directional full-step prediction model achieved the lowest overall mean absolute error (0.0067 ± 0.0066), slightly outperforming the uni-directional one-step model (0.0075 ± 0.0076). Bi-directional models presented marginally higher mean errors. Consequently, the uni-directional full-step prediction model demonstrated the most robust prediction capability overall. A similar analysis on the BraTS 2020 dataset (Table 3) showed the uni-directional one-step prediction model achieving the lowest MAE (0.0153 ± 0.0157).
Discussion
This study introduced and validated a DL framework for estimating data and model size requirements, focusing on the relationships between dataset size, model complexity, and predictive performance. Our findings, now demonstrated across two distinct tasks—2D endoscopic polyp segmentation and 3D brain tumor segmentation—show that larger datasets and deeper models significantly enhance performance up to a point of diminishing returns. In both cases, metrics such as DSC and IoU showed consistent improvement as the training data ratio increased, but these gains began to plateau after approximately 80% of the dataset was used. This suggests that while more data is generally beneficial, a saturation point exists where the cost of further data acquisition may not be justified by the marginal performance improvement. Similarly, increasing model depth improved performance, but this improvement plateaued as complexity increased. This finding was consistent across both the 2D task (evaluated with 3–6 layers) and the 3D task (evaluated with 3 and 5 layers).
The strength of our framework lies in its validated applicability across different imaging modalities (endoscopy vs. MRI) and task complexities (2D binary vs. 3D multi-class segmentation). This suggests that the principle of modeling “learning dynamics” to predict a performance plateau is a generalizable strategy. Researchers working on other tasks, such as lung nodule detection in CT scans (e.g., LUNA16) [30] or cell segmentation in microscopy, could adopt this method to perform a preliminary analysis on a smaller data fraction to forecast the resources needed to reach a target performance level, thereby optimizing study design and resource allocation.
It is important to acknowledge the deliberate scope of our study. We focused exclusively on varying model depth within the U-Net architecture to isolate a key component of model complexity. This was a strategic choice to avoid the confounding variables that would arise from comparing different architectural families. While this provides a clean analysis of depth’s impact, it is not an exhaustive exploration of all possible model designs. Future work should extend this framework to other popular architectures (e.g., Transformers, attention-based models) to test whether similar learning dynamics hold.
Our performance prediction pipeline, using a simple LSTM, proved highly effective. It achieved a low mean absolute error in both tasks (a best MAE of 0.0067 for polyp segmentation and 0.0153 for brain tumor segmentation), accurately predicting the final DSC scores across various configurations. This provides an efficient method for optimizing model development without resorting to exhaustive trial-and-error experimentation. The training duration analysis (Fig 8) further emphasizes the need for this efficiency. While training a single model for 6 hours might seem acceptable, the cumulative time for hyperparameter tuning across hundreds of configurations becomes prohibitive. Our framework mitigates this by allowing researchers to forecast the endpoint. For resource-constrained settings, established techniques like transfer learning [31,32] and model compression [33,34] could be used in conjunction with our framework to further reduce computational demands.
Fig 8. The histogram illustrates the distribution of training times (in hours) for models trained with varying proportions of the dataset (train data ratio, indicated by the color gradient bar). Each bin represents counts of models completing training within specific time intervals, with darker colors indicating lower train data ratios and lighter colors indicating higher train data ratios.
Our work builds upon the understanding that traditional statistical power analyses are ill-suited for DL tasks, a point raised by others in the field [11,12]. Our empirical findings align with reports of performance saturation [12], and challenge the “more is always better” assumption. This reinforces the idea that focusing on acquiring a moderate amount of high-quality data can be more efficient than indiscriminately expanding dataset size, a philosophy that resonates with recent trends in efficient AI, such as model distillation from smaller, high-fidelity datasets [35].
A primary limitation of our study is that the proposed framework was validated on only two datasets. Although our results are promising, further validation across a more diverse range of medical imaging datasets is crucial to fully establish the framework’s generalizability and robustness. To this end, future work should prioritize evaluation on established public benchmarks; datasets such as LUNA16 [30] would serve as excellent candidates for these validation studies. Looking forward, the framework could be further enhanced by systematically analyzing other variables, such as the impact of different data augmentation strategies or optimizer choices. Furthermore, extending the prediction pipeline to leverage multimodal data—by integrating imaging with clinical reports or genomic data, for example—and to incorporate advanced transfer learning techniques represents a promising avenue for future research.
In conclusion, this study presents a systematic and empirically validated framework for optimizing dataset size and model complexity in biomedical deep learning applications. Our core contribution is a practical methodology that allows researchers to forecast segmentation performance and estimate the point of diminishing returns for data collection and model scaling. By demonstrating its effectiveness on both 2D endoscopic and 3D MRI datasets, we show that this approach offers a scalable, resource-efficient solution for model development. The simple yet powerful LSTM-based prediction model provides reliable, actionable guidance, facilitating impactful AI development by enabling a more strategic balance between performance and computational efficiency.
Supporting information
S1 Fig. The IoU values across different data ratios and model architectures.
https://doi.org/10.1371/journal.pone.0339064.s001
(PNG)
S2 Fig. The Dice Score values across different data ratios and model architectures.
https://doi.org/10.1371/journal.pone.0339064.s002
(PNG)
S3 Fig. The Recall values across different data ratios and model architectures.
https://doi.org/10.1371/journal.pone.0339064.s003
(PNG)
S4 Fig. The Precision values across different data ratios and model architectures.
https://doi.org/10.1371/journal.pone.0339064.s004
(PNG)
S5 Fig. IoU changes with increasing data ratio for different model architectures.
https://doi.org/10.1371/journal.pone.0339064.s005
(PNG)
S6 Fig. Dice Score changes with increasing data ratio for different model architectures.
https://doi.org/10.1371/journal.pone.0339064.s006
(PNG)
S7 Fig. Recall changes with increasing data ratio for different model architectures.
https://doi.org/10.1371/journal.pone.0339064.s007
(PNG)
S8 Fig. Precision changes with increasing data ratio for different model architectures.
https://doi.org/10.1371/journal.pone.0339064.s008
(PNG)
S9 Fig. IoU changes with increasing model complexity for different data ratios.
https://doi.org/10.1371/journal.pone.0339064.s009
(PNG)
S10 Fig. Dice Score changes with increasing model complexity for different data ratios.
https://doi.org/10.1371/journal.pone.0339064.s010
(PNG)
S11 Fig. Recall changes with increasing model complexity for different data ratios.
https://doi.org/10.1371/journal.pone.0339064.s011
(PNG)
S12 Fig. Precision changes with increasing model complexity for different data ratios.
https://doi.org/10.1371/journal.pone.0339064.s012
(PNG)
S13 Fig. DSC performance prediction for polyp segmentation using the uni-directional one-step model.
Performance prediction results generated by the uni-directional one-step LSTM model for the Kvasir-SEG dataset. The plot compares the actual DSC scores (scatter points) from the segmentation task with the model’s predicted performance (solid lines) and a linear regression baseline (dashed lines). This one-step model predicts each data point based on the sequence of preceding actual values.
https://doi.org/10.1371/journal.pone.0339064.s013
(TIF)
S14 Fig. DSC performance prediction for polyp segmentation using the bi-directional one-step model.
Performance prediction results from the bi-directional one-step LSTM model for the Kvasir-SEG dataset. The plot elements are consistent with previous figures: actual DSC scores (scatter points), predicted performance (solid lines), and a linear regression baseline (dashed lines). This model leverages both past and future data context to predict each point independently.
https://doi.org/10.1371/journal.pone.0339064.s014
(TIF)
S15 Fig. DSC performance prediction for polyp segmentation using the bi-directional full-step model.
Performance prediction results from the bi-directional full-step LSTM model for the Kvasir-SEG dataset. The graph displays the actual DSC scores (scatter points) against the autoregressive predicted performance curve (solid lines) and a linear regression baseline (dashed lines). This model uses bi-directional context and its own previous predictions to forecast the entire performance trajectory.
https://doi.org/10.1371/journal.pone.0339064.s015
(TIF)
S16 Fig. DSC performance prediction for brain tumor segmentation using the uni-directional one-step model.
Performance prediction results generated by the uni-directional one-step LSTM model for the BraTS 2020 dataset. The plot compares the actual DSC scores (scatter points) from the segmentation task with the model’s predicted performance (solid lines) and a linear regression baseline (dashed lines). This one-step model predicts each data point based on the sequence of preceding actual values.
https://doi.org/10.1371/journal.pone.0339064.s016
(TIF)
S17 Fig. DSC performance prediction for brain tumor segmentation using the bi-directional one-step model.
Performance prediction results from the bi-directional one-step LSTM model for the BraTS 2020 dataset. The plot elements are consistent with previous figures: actual DSC scores (scatter points), predicted performance (solid lines), and a linear regression baseline (dashed lines). This model leverages both past and future data context to predict each point independently.
https://doi.org/10.1371/journal.pone.0339064.s017
(TIF)
S18 Fig. DSC performance prediction for brain tumor segmentation using the bi-directional full-step model.
Performance prediction results from the bi-directional full-step LSTM model for the BraTS 2020 dataset. The graph displays the actual DSC scores (scatter points) against the autoregressive predicted performance curve (solid lines) and a linear regression baseline (dashed lines). This model uses bi-directional context and its own previous predictions to forecast the entire performance trajectory.
https://doi.org/10.1371/journal.pone.0339064.s018
(TIF)
References
- 1. Suganyadevi S, Seethalakshmi V, Balasamy K. A review on deep learning in medical image analysis. Int J Multimed Inf Retr. 2022;11(1):19–38. pmid:34513553
- 2. Varoquaux G, Cheplygina V. Machine learning for medical imaging: methodological failures and recommendations for the future. NPJ Digit Med. 2022;5(1):48. pmid:35413988
- 3. Choi S-J, Kim J-S, Jeong SY, Son H, Sung J-J, Park C-M, et al. Association of Deep Learning-based Chest CT-derived Respiratory Parameters with Disease Progression in Amyotrophic Lateral Sclerosis. Radiology. 2025;315(2):e243463. pmid:40358443
- 4. Hillis JM, Visser JJ, Cliff ERS, van der Geest-Aspers K, Bizzo BC, Dreyer KJ, et al. The lucent yet opaque challenge of regulating artificial intelligence in radiology. NPJ Digit Med. 2024;7(1):69. pmid:38491126
- 5. Petrick N, Chen W, Delfino JG, Gallas BD, Kang Y, Krainak D, et al. Regulatory considerations for medical imaging AI/ML devices in the United States: concepts and challenges. J Med Imaging (Bellingham). 2023;10(5):051804. pmid:37361549
- 6. Alberto IRI, Alberto NRI, Ghosh AK, Jain B, Jayakumar S, Martinez-Martin N, et al. The impact of commercial health datasets on medical research and health-care algorithms. Lancet Digit Health. 2023;5(5):e288–94. pmid:37100543
- 7. Goetz L, Seedat N, Vandersluis R, van der Schaar M. Generalization-a key challenge for responsible AI in patient-facing clinical applications. NPJ Digit Med. 2024;7(1):126. pmid:38773304
- 8. Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, et al. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88. pmid:28778026
- 9. Norori N, Hu Q, Aellen FM, Faraci FD, Tzovara A. Addressing bias in big data and AI for health care: A call for open science. Patterns (N Y). 2021;2(10):100347. pmid:34693373
- 10. Nakkiran P, Kaplun G, Bansal Y, Yang T, Barak B, Sutskever I. Deep double descent: where bigger models and more data hurt. J Stat Mech. 2021;2021(12):124003.
- 11. Yoshida Y, Okada M. Data-dependence of plateau phenomenon in learning with neural network—statistical mechanical analysis. J Stat Mech. 2020;2020(12):124013.
- 12. Thian YL, Ng DW, Hallinan JTPD, Jagmohan P, Sia SY, Mohamed JSA, et al. Effect of Training Data Volume on Performance of Convolutional Neural Network Pneumothorax Classifiers. J Digit Imaging. 2022;35(4):881–92. pmid:35239091
- 13. Cho J, Lee K, Shin E, Choy G, Do S. How much data is needed to train a medical image deep learning system to achieve necessary high accuracy? arXiv preprint arXiv:1511.06348. 2015.
- 14. Lunardo F, Baker L, Tan A, Baines J, Squire T, Dowling JA, et al. How much data do you need? An analysis of pelvic multi-organ segmentation in a limited data context. Phys Eng Sci Med. 2025;48(1):409–19. pmid:40067638
- 15. Rajput D, Wang W-J, Chen C-C. Evaluation of a decided sample size in machine learning applications. BMC Bioinformatics. 2023;24(1):48. pmid:36788550
- 16. Shi P, Qiu J, Abaxi SMD, Wei H, Lo FP-W, Yuan W. Generalist Vision Foundation Models for Medical Imaging: A Case Study of Segment Anything Model on Zero-Shot Medical Segmentation. Diagnostics (Basel). 2023;13(11):1947. pmid:37296799
- 17. Schouten JPE, Matek C, Jacobs LFP, Buck MC, Bošnački D, Marr C. Tens of images can suffice to train neural networks for malignant leukocyte detection. Sci Rep. 2021;11(1):7995. pmid:33846442
- 18. Karimi D, Dou H, Warfield SK, Gholipour A. Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis. Med Image Anal. 2020;65:101759. pmid:32623277
- 19. Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Networks. 1989;2(5):359–66.
- 20. Arpit D, Jastrzębski S, Ballas N, Krueger D, Bengio E, Kanwal MS, et al. A closer look at memorization in deep networks. In: Proceedings of the International Conference on Machine Learning (ICML). PMLR; 2017.
- 21. Vapnik V, Levin E, Cun YL. Measuring the VC-Dimension of a Learning Machine. Neural Computation. 1994;6(5):851–76.
- 22. Mohri M, Rostamizadeh A. Rademacher complexity bounds for non-iid processes. Advances in Neural Information Processing Systems. 2008;21.
- 23. Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. 2016.
- 24. Jha D, Smedsrud PH, Riegler MA, Halvorsen P, De Lange T, Johansen D, et al. Kvasir-SEG: A segmented polyp dataset. In: Ro Y, et al., editors. MultiMedia Modeling (MMM 2020). Springer; 2020. https://doi.org/10.1007/978-3-030-37734-2_37
- 25. Menze BH, Jakab A, Bauer S, Kalpathy-Cramer J, Farahani K, Kirby J, et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans Med Imaging. 2015;34(10):1993–2024. pmid:25494501
- 26. Kerfoot E, Clough J, Oksuz I, Lee J, King AP, Schnabel JA. Left-ventricle quantification using residual U-Net. In: Pop M, et al., editors. Statistical Atlases and Computational Models of the Heart: Atrial Segmentation and LV Quantification Challenges (STACOM 2018). Springer; 2019. https://doi.org/10.1007/978-3-030-12029-0_40
- 27. Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells W, Frangi A, editors. Medical Image Computing and Computer-Assisted Intervention (MICCAI 2015). Springer; 2015. https://doi.org/10.1007/978-3-319-24574-4_28
- 28. Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980. 2014.
- 29. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997;9(8):1735–80. pmid:9377276
- 30. Setio AAA, Traverso A, de Bel T, Berens MSN, van den Bogaard C, Cerello P, et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Med Image Anal. 2017;42:1–13. pmid:28732268
- 31. Kim HE, Cosa-Linan A, Santhanam N, Jannesari M, Maros ME, Ganslandt T. Transfer learning for medical image classification: a literature review. BMC Med Imaging. 2022;22(1):69. pmid:35418051
- 32. Morid MA, Borjali A, Del Fiol G. A scoping review of transfer learning research on medical image analysis using ImageNet. Comput Biol Med. 2021;128:104115. pmid:33227578
- 33. Cheng Y, Wang D, Zhou P, Zhang T. Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges. IEEE Signal Process Mag. 2018;35(1):126–36.
- 34. Han S, Mao H, Dally WJ. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149. 2015.
- 35. Sun X, Zhang P, Zhang P, Shah H, Saenko K, Xia X. DIME-FM: Distilling multimodal and efficient foundation models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2023.