Abstract
Introduction
Class imbalance, in which clinically important “positive” cases make up less than 30% of the dataset, systematically reduces the sensitivity and fairness of medical prediction models. Although data-level techniques (random oversampling, random undersampling, SMOTE) and algorithm-level approaches such as cost-sensitive learning are widely used, the empirical evidence on when these corrections improve model performance remains scattered across diseases and modelling frameworks. This protocol outlines a scoping systematic review with meta-regression that will map and quantitatively summarise 15 years of research on resampling strategies in imbalanced clinical datasets, addressing a key methodological gap in reliable medical AI.
Methods and analysis
We will search MEDLINE, EMBASE, Scopus, Web of Science Core Collection, and IEEE Xplore, along with grey literature sources (medRxiv, arXiv, bioRxiv), for primary studies (1 Jan 2009–31 Dec 2024) that apply at least one resampling or cost-sensitive strategy to binary clinical prediction tasks with a minority-class prevalence of less than 30%. There will be no language restrictions. Two reviewers will screen records, extract data using a piloted form, and document the process in a PRISMA flow diagram. A descriptive synthesis will catalogue clinical domain, sample size, imbalance ratio, resampling strategy, model type, and performance metrics. Where 10 or more studies report compatible AUCs, a random-effects meta-regression of logit-transformed AUC will examine the effect of moderators, including imbalance ratio, resampling strategy, model family, and sample size. Small-study effects will be assessed with funnel plots, Egger’s test, trim-and-fill, and weight-function models; influence diagnostics and leave-one-out analyses will evaluate robustness. Because this is a methodological review, formal clinical risk-of-bias tools are optional; instead, design-level screening, influence diagnostics, and sensitivity analyses will support transparency.
Discussion
By combining a comprehensive conceptual framework with quantitative estimates, this review aims to determine when data-level versus algorithm-level balancing leads to genuine improvements in discrimination, calibration, and cost-sensitive metrics across various medical fields. The findings will help researchers select concise, evidence-based methods for addressing imbalance, inform journal and regulatory reporting standards, and identify research gaps such as the under-reporting of calibration and misclassification costs, which must be addressed before balanced models can be reliably trusted in clinical practice.
Citation: Abdelhay O, Shatnawi A, Najadat H, Altamimi T (2025) Resampling methods for class imbalance in clinical prediction models: A scoping review protocol. PLoS One 20(11): e0330050. https://doi.org/10.1371/journal.pone.0330050
Editor: Hamed Tavolinejad, University of Pennsylvania Perelman School of Medicine, UNITED STATES OF AMERICA
Received: July 10, 2025; Accepted: October 10, 2025; Published: November 3, 2025
Copyright: © 2025 Abdelhay et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: No datasets were generated or analysed during the current study. All relevant data from this study will be made available upon study completion.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Medical prediction datasets often exhibit an imbalance, with the clinically important “positive” class making up less than 30% of observations. This skew systematically biases traditional (e.g., logistic regression) and modern machine-learning classifiers towards the majority class, reducing sensitivity for the minority group [1–3].
To mitigate this threat, a set of data-level resampling strategies—random oversampling (ROS), random undersampling (RUS), and the Synthetic Minority Oversampling Technique (SMOTE)—modifies the training data before modelling [1,4,5]. Although commonly used, ROS can cause overfitting due to duplicate instances, RUS may discard potentially informative data points, and SMOTE or its variants might generate unrealistic synthetic examples [6–9].
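For illustration, the sketch below shows how the data-level strategies described above might be applied in R, the analysis environment used elsewhere in this protocol. The data frame `df`, its binary `outcome` column, and the optional use of the smotefamily package are hypothetical, illustrative choices and are not part of the planned analysis.

```r
## Minimal sketch of data-level resampling, assuming a data frame `df` with a
## binary factor column `outcome` (minority class coded "1"). Names are illustrative.
set.seed(42)
minority <- df[df$outcome == "1", ]
majority <- df[df$outcome == "0", ]

# Random oversampling (ROS): duplicate minority rows until the classes are balanced
ros <- rbind(majority,
             minority[sample(nrow(minority), nrow(majority), replace = TRUE), ])

# Random undersampling (RUS): discard majority rows down to the minority count
rus <- rbind(minority,
             majority[sample(nrow(majority), nrow(minority)), ])

# SMOTE-type oversampling (one possible implementation; features must be numeric):
# library(smotefamily)
# sm <- SMOTE(X = df[, setdiff(names(df), "outcome")],
#             target = df$outcome, K = 5)
# balanced <- sm$data   # original plus synthetic minority rows ("class" column added)
```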
Evidence comparing resampling with alternative strategies remains inconclusive. An extensive systematic review showed no consistent performance advantage of machine learning over logistic regression when event-per-variable ratios were adequate [10]. Furthermore, simulation and empirical studies suggest that effective sample size planning, rather than aggressive post-hoc balancing, often negates the need for resampling [11–16].
At the algorithm level, cost-sensitive learning directly penalises errors in the minority class and can outperform methods that operate at the data level; however, it is infrequently reported in medical AI research [4,17].
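As a hedged illustration of the algorithm-level alternative, the following sketch weights each observation by the inverse prevalence of its class in an ordinary logistic regression; the data frame `df` and column `outcome` are again hypothetical, and this is one simple form of cost-sensitive learning rather than a prescription for the studies under review.

```r
## Minimal sketch of cost-sensitive (class-weighted) fitting: misclassifying a
## minority case costs more during fitting via inverse-prevalence case weights.
y <- as.integer(df$outcome == "1")
w <- ifelse(y == 1, 1 / mean(y), 1 / mean(1 - y))

# quasibinomial() avoids the "non-integer #successes" warning that binomial()
# emits with non-integer prior weights; the coefficient estimates are identical.
fit_weighted <- glm(outcome ~ ., data = df, family = quasibinomial(), weights = w)

# Boosting libraries expose a similar knob, e.g. xgboost's scale_pos_weight,
# typically set to (number of negatives) / (number of positives).
```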
Developments in binary classification theory—from early statistical formulations to perceptrons, support vector machines, and boosted ensembles—highlight how model choice interacts with class distribution and cost structure [18–21].
Across clinical class-imbalance settings, approaches span data-level resampling (random over/undersampling; SMOTE and variants), algorithm-level/cost-sensitive learning, and increasingly ensembles/transfer learning. Resampling can be helpful when minority events are scarce, but it may induce boundary distortion/overfitting. Cost-sensitive methods align optimisation with misclassification costs. Ensembles and transfer learning can improve robustness but add complexity and computational demands. Given heterogeneity in prevalence, thresholds, and reporting, we restrict quantitative pooling to ROC-AUC and synthesise PR-AUC, MCC, F1, calibration, and decision-analytic measures descriptively, with PR-AUC/MCC receiving greater interpretive weight under skew. A fuller comparative appraisal (pros/cons and clinical suitability) is deferred to the results paper, consistent with protocol scope [22–24].
In this context, we will undertake a scoping systematic review with meta-regression to (i) map the resampling and cost-sensitive strategies employed in imbalanced medical datasets, (ii) quantify their effects on discrimination and calibration, and (iii) identify methodological moderators and research gaps. This protocol outlines the intended methods.
Objectives
Primary objective.
This study aims to assess whether, in clinical prediction studies with binary outcomes and a minority-class prevalence below 30%, applying data-level resampling or algorithm-level cost-sensitive strategies significantly improves model performance compared to training on the original imbalanced data.
Specific objectives.
- Evidence mapping – Catalogue the complete range of imbalance-correction strategies reported between 2009 and 2024 (oversampling, undersampling, hybrids, and weighted or focal-loss models), together with the clinical domains, dataset sizes, imbalance ratios, and modelling frameworks in which they were applied.
- Comparative effectiveness – Quantify and compare discrimination metrics (e.g., AUC, sensitivity, specificity) and, where available, calibration metrics achieved by (i) oversampling, (ii) undersampling, (iii) hybrid pipelines, and (iv) cost-sensitive algorithms against models trained without any balancing.
- Moderator analysis – Using mixed-effects meta-regression, evaluate how study-level characteristics (imbalance ratio, sample size, number of predictors, model family, and clinical domain) influence the effectiveness of each imbalance-correction strategy.
- Bias and robustness assessment – Examine small-study effects, publication bias, and influential outliers through funnel-plot diagnostics, trim-and-fill, weight-function models, and leave-one-out analyses, and assess how these factors affect the pooled estimates.
- Methodological gap identification – Emphasise recurring pitfalls, such as neglecting calibration, misclassification costs, or external validation, and develop evidence-based recommendations for future research and reporting.
We hypothesise: (H1) conditional on adequate sample size, resampling strategies (over-, under-, hybrid, or SMOTE-type) do not improve predictive performance over no resampling in imbalanced binary clinical prediction tasks; (H2) cost-sensitive methods outperform pure over-/undersampling at IR < 10%; (H3) hybrid (resampling plus algorithmic) methods outperform single-strategy approaches; (H4) external validation yields lower AUC than internal validation; and (H5) studies reporting calibration perform better on net benefit, where available. Hypotheses will be treated as exploratory where data are sparse.
Covariates: imbalance ratio, sample size, validation tier, clinical domain, leakage safeguards.
Methods
This protocol adheres to the PRISMA-P (S3 File) [25] and PRISMA-ScR [26] guidelines and has been registered with INPLASY (ID: INPLASY202550026) (S1 File). Any amendments (e.g., changes to eligibility criteria or analyses) will be logged with the date, rationale, and affected sections in the public registry record (INPLASY/OSF) and cited in the final report.
Eligibility criteria (PICOTS)
- Population: Clinical prediction studies that analyse binary outcomes with an explicit minority-class prevalence of less than 30%. For this review, a binary outcome is limited to diagnostic, prognostic, or treatment-response predictions in which the dependent variable has exactly two mutually exclusive states (e.g., disease present/absent).
- Interventions: Data-level resampling (random oversampling, random undersampling, SMOTE or variants, hybrid pipelines) and algorithm-level cost-sensitive strategies (weighted losses, focal loss).
- Comparators: Models trained on the original imbalanced data and/or alternative resampling or weighting strategies.
- Outcomes: Primary outcome is AUC; secondary outcomes are sensitivity, F1-score, specificity, balanced accuracy, calibration metrics, and reported misclassification costs.
- Timing: Publications from 1 Jan 2009–31 Dec 2024.
- Study design: Retrospective or prospective primary studies (such as model-development and validation papers) and systematic reviews that reanalyse primary data. Simulation-only papers, non-binary tasks, and abstracts lacking methods are excluded. Studies focusing solely on radiomics, image-segmentation pipelines, or pixel-level classification tasks will also be excluded, as these do not produce patient-level binary predictions.
- Scope exclusion (imaging segmentation/radiomics): We exclude pixel/voxel-level segmentation and radiomics tasks because they optimise dense, pixel-level predictions and are evaluated with overlap/shape metrics (e.g., Dice/Jaccard/Hausdorff), which are not commensurable with the patient-level clinical prediction metrics (e.g., ROC-AUC, PR-AUC, calibration) that are the focus of this review. Including segmentation would mix fundamentally different targets, class-imbalance structures, and metrics; such studies are therefore out of scope. [27,28]
Information sources and search strategy
Searches will be conducted in MEDLINE (PubMed), EMBASE, Scopus, Web of Science Core Collection, and IEEE Xplore. A peer-reviewed strategy combines controlled vocabulary and free-text terms covering class imbalance, resampling, and clinical prediction; an example MEDLINE string is provided in S2 File. No language limits will be applied; non-English full texts will be included where they can be translated.
Grey literature (medRxiv/arXiv/bioRxiv/GitHub): We include medRxiv, arXiv, bioRxiv, and GitHub to (i) reduce publication bias and small-study effects by capturing studies not yet in indexed journals, as recommended by major evidence-synthesis guidance, and (ii) map rapidly evolving ML methods whose earliest public disclosure is often via preprints or code releases. To mitigate the risks of variable peer review and reporting quality, we apply minimum reporting standards (TRIPOD+AI-aligned task clarity, data splits and leakage safeguards, model specification, performance reporting, and reproducibility) and versioning (latest preprint version; tagged GitHub commit). We will (a) label preprints and code-only sources explicitly, (b) exclude records failing the minimum standards from the synthesis (retaining them in the PRISMA flow), and (c) run sensitivity analyses that exclude grey-literature records to assess their influence on conclusions. This approach follows PRISMA/PRISMA-ScR guidance, which aims to map evidence comprehensively while managing reporting quality transparently.
We will screen these sources but will include a record in the synthesis only if the following minimum reporting standards are met:
- Predictive task clarity (target population, outcome definition, prediction horizon).
- Data & split transparency (source, inclusion/exclusion, train/validation/test strategy; leakage safeguards);
- Model specification (algorithms, hyperparameters, resampling/cost strategies);
- Performance reporting aligned with TRIPOD+AI (discrimination; threshold-dependent metrics when used; calibration if available) and, for imaging-AI studies, CLAIM elements as applicable;
- Reproducibility (accessible code or sufficient procedural detail to replicate). Records failing these are catalogued but excluded from synthesis (retained in PRISMA flow). [29,30]
Version control for preprints/GitHub: For preprints, we will use the latest version available at extraction. For GitHub, we require a tagged release or commit hash to ensure reproducibility. Reporting items and reproducibility checks for grey-literature records are aligned with TRIPOD+AI; where LLM-based prediction studies appear, TRIPOD-LLM items will be consulted. [31]
Study selection
Search results will be imported into Zotero for deduplication [32] and prioritised with ASReview [33]. Two reviewers will independently screen titles and abstracts, followed by full texts, resolving conflicts by consensus or through third-party adjudication. Reasons for exclusion will be recorded and displayed in a PRISMA flow diagram [25]. Data missing from the full text will be requested from authors (two-week window). We will detect duplicate/overlapping cohorts (e.g., preprint→journal of the same dataset) by matching data sources/time windows/outcomes and will retain the most complete, peer-reviewed record; secondary records contribute unique methodological details. A de-duplication table will document decisions. [34]
Data extraction
A standardised, pilot-tested form will record bibliometrics, clinical domain, sample size, imbalance ratio, resampling strategy, model family, performance metrics, calibration statistics, and cost-sensitive measures. Two reviewers will independently extract all items into a REDCap database (version 14.0.19). A third reviewer will run the comparison report, resolve discrepancies, and export a single verified dataset. Statistical analyses will be performed in R (v4.4.0) using the metafor (v4.8-0), dplyr (v1.1.4), and ggplot2 (v3.5.2) packages. After publication, all code and a session-info file will be uploaded to the OSF repository.
Outcomes and effect measures
- a. Evaluation Metrics
Why is accuracy insufficient? In imbalanced settings, accuracy can be high while the minority class is poorly detected; we report it only for completeness.
Discrimination. We prioritise ROC-AUC for quantitative pooling due to ubiquity and cross-study comparability, while noting its optimistic behaviour under skewed prevalence. PR-AUC will be emphasised in interpretation because it reflects positive-class performance and is more informative in cases of imbalance. [35]
Threshold-dependent metrics: We will tabulate and visualise F1, sensitivity/specificity, and the Matthews correlation coefficient (MCC); MCC provides a balanced assessment from the full confusion matrix and is often more informative than accuracy or F1 on skewed data. These metrics will not be pooled because thresholds and prevalences vary across studies [36].
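A short worked example (with hypothetical confusion-matrix counts) illustrates the point made above: under 5% prevalence a weak model can still report high accuracy, while sensitivity and MCC expose the shortfall.

```r
## Hypothetical confusion matrix: 1,000 patients, 50 events (5% prevalence).
TP <- 20; FN <- 30; FP <- 30; TN <- 920

accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # 0.94 despite missing 60% of events
sensitivity <- TP / (TP + FN)                    # 0.40
precision   <- TP / (TP + FP)                    # 0.40
f1          <- 2 * precision * sensitivity / (precision + sensitivity)   # 0.40
mcc         <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))                    # ~0.37

round(c(accuracy = accuracy, sensitivity = sensitivity,
        precision = precision, F1 = f1, MCC = mcc), 3)
```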
Calibration & decision impact: Calibration slope/intercept and Brier score will be summarised descriptively (no pooling). Where authors report decision-curve analysis (net benefit) or explicit misclassification costs, we will extract and summarise without imputing costs; multiple author-reported cost scenarios will be presented as sensitivity analyses. [37–39]
- b. Outcomes & Synthesis
Primary metric & pooling: only ROC-AUC will be meta-analysed (random-effects on logit-AUC). Pooling requires ≥5 clinically comparable studies (same target, prediction horizon, and validation tier). We summarise heterogeneity with τ² and I2; if I2 > 75% or subgroups are sparse/incoherent, we will not pool. [40]
Interpretive weighting under imbalance: while only ROC-AUC is pooled, PR-AUC and MCC will receive greater interpretive weight in narrative/visual synthesis for imbalanced datasets. [35,36]
When pooling is inappropriate (e.g., sparse subgroups, incompatible outcomes, or overlapping cohorts), we will use a structured narrative synthesis following SWiM guidance, accompanied by standardised tables and figures. [40]
Risk-of-bias and methodological quality
Although this is a methodological scoping review, we will apply a tailored quality checklist informed by TRIPOD+AI reporting items and PROBAST/PROBAST+AI domains (with a focus on reproducibility, data-leakage safeguards, validation, and calibration reporting) to describe reporting quality and potential risk of bias and applicability concerns [34]. We will apply design-level screening for reproducibility, influence diagnostics (Cook’s distance [41], studentised residuals [42]), and small-study-effect tests (funnel plots [42], Egger’s regression [43], and the Vevea–Hedges weight function [44]) to inform sensitivity analyses. We will also assess whether studies report blinding, handle missing data, and provide external validation, and will incorporate these elements into a supplementary risk-of-bias table. Results will be summarised narratively (no scoring).
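The diagnostics named above map onto standard metafor functionality; the sketch below assumes the hypothetical unmoderated model `res` from the pooling example, and names the weightr package as one possible, not prescribed, implementation of the Vevea–Hedges model.

```r
## Influence diagnostics and small-study-effect checks on the fitted model `res`.
inf <- influence(res)   # Cook's distances, studentised deleted residuals, etc.
plot(inf)

funnel(res)             # visual check for funnel-plot asymmetry
regtest(res)            # Egger-type regression test for small-study effects
trimfill(res)           # trim-and-fill (requires a model without moderators)

# Vevea-Hedges weight-function (selection) model, e.g. via the weightr package:
# weightr::weightfunct(effect = dat$yi, v = dat$vi)
```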
Terminology and bias signals
To avoid ambiguity, we standardise terminology and use “resampling strategy” to denote data-level methods (random over-/undersampling, SMOTE variants, hybrids). We adopt “small-study effects” as an umbrella term for patterns whereby smaller studies report larger effects; such patterns can arise from publication bias, outcome-reporting bias, lower study quality, between-study heterogeneity, or chance. We will inspect funnel-plot asymmetry and, where feasible, apply Egger’s test as a screening tool; however, we will interpret asymmetry as evidence of small-study effects rather than of publication bias alone and discuss plausible causes in context. [45,46]
How we’ll report it
Consistent with PRISMA 2020, we will report whether small-study effects were assessed, which methods were used (visual inspection, Egger’s test), and limitations of these tests. We will refrain from formal testing when subgroups contain too few studies (e.g., < 10), and will emphasise qualitative interpretation when power is low.
Data synthesis
Phase 1 – Descriptive mapping: Tables and visualisations (e.g., heat maps, temporal plots) will summarise trends in resampling use, model type, imbalance severity, and performance.
Phase 2 – Quantitative synthesis: Random-effects meta-regression of logit-transformed AUC will examine moderators (imbalance ratio, sample size, resampling strategy, model family). Pooling requires ≥5 clinically coherent studies (same target, horizon, and validation tier). The REML estimator and Knapp–Hartung confidence intervals will be employed [42]. Heterogeneity will be assessed using τ² and I² [42], and leave-one-out analyses will test robustness. Analyses will be implemented in R (metafor, dplyr, ggplot2) [42]. If I² is very high (above approximately 75%) or subgroups are sparse or incoherent, we will not pool and will instead follow SWiM guidance for a structured narrative synthesis. [40]
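To make the Phase 2 specification concrete, a minimal sketch of the intended metafor call follows; the moderator column names (`imbalance_ratio`, `sample_size`, `resampling_strategy`, `model_family`) are placeholders for the extraction fields listed above.

```r
## Sketch of the mixed-effects meta-regression with REML and Knapp-Hartung CIs.
mr <- rma(yi, vi,
          mods = ~ imbalance_ratio + log(sample_size) +
                   resampling_strategy + model_family,
          data = dat, method = "REML", test = "knha")
summary(mr)     # tau^2, I^2, QE/QM tests, Knapp-Hartung intervals for moderators

# Robustness: leave-one-out analysis on the unmoderated model
leave1out(rma(yi, vi, data = dat, method = "REML", test = "knha"))
```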
Subgroup and sensitivity analyses
Planned subgroup contrasts include oversampling versus undersampling, hybrid versus single-technique pipelines, cost-sensitive versus data-level only, high (>20%) versus very low (<5%) minority prevalence, and deep learning versus traditional models. Sensitivity analyses will exclude studies with high influence, those lacking external validation, and studies without calibration reporting. The imbalance ratio (IR) will be stratified a priori into four bins: very rare (< 5%), rare (5–10%), moderate (10–20%), and mild (20–30%) [6]. If any bin contains fewer than 10 studies, it will be merged with the next wider bin. For meta-regression, these bins will be dummy-coded (reference = mild), and IR will also be modelled as a restricted cubic spline to test linearity. Sensitivity analyses will replicate the model using two dichotomies (< 10% versus ≥ 10%; < 20% versus ≥ 20%).
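The pre-specified IR handling could be coded as below; the column `dat$ir` (minority prevalence as a proportion) is a hypothetical extraction field, and the natural cubic spline from the splines package stands in for the protocol's restricted cubic spline (rms::rcs would be an equivalent choice).

```r
## Sketch of a priori IR binning, dummy coding (reference = mild), and a spline
## check of linearity. `dat$ir` holds minority prevalence as a proportion;
## eligibility caps prevalence below 30%, so [0.20, 0.30) covers the mild bin.
dat$ir_bin <- cut(dat$ir,
                  breaks = c(0, 0.05, 0.10, 0.20, 0.30),
                  labels = c("very rare", "rare", "moderate", "mild"),
                  right = FALSE)
dat$ir_bin <- relevel(dat$ir_bin, ref = "mild")

m_bins   <- rma(yi, vi, mods = ~ ir_bin, data = dat,
                method = "REML", test = "knha")        # dummy-coded bins

m_spline <- rma(yi, vi, mods = ~ splines::ns(ir, df = 3), data = dat,
                method = "REML", test = "knha")        # flexible (non-linear) IR

# Sensitivity dichotomies (< 10% vs >= 10%; < 20% vs >= 20%)
m_cut10 <- rma(yi, vi, mods = ~ I(ir < 0.10), data = dat,
               method = "REML", test = "knha")
m_cut20 <- rma(yi, vi, mods = ~ I(ir < 0.20), data = dat,
               method = "REML", test = "knha")
```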
Discussion
Class imbalance remains one of the most persistent threats to safe clinical prediction: skewed data encourages algorithms to optimise overall accuracy at the expense of rare—but clinically essential—events. Algorithm-level approaches that embed explicit misclassification penalties can theoretically offset this bias [47,48]. Simultaneously, recent deep learning innovations such as deep belief networks and focal loss functions promise further gains in high-dimensional settings [49,50]. However, the empirical value of these strategies has never been systematically synthesised across the medical spectrum. Our planned scoping review with meta-regression addresses a critical methodological gap.
Anticipated challenges
- Extreme heterogeneity: preliminary scoping indicates broad dispersion in clinical domains, imbalance ratios, sample sizes, and metrics. Even when studies report AUC, converting to a common logit scale may not entirely harmonise differences in test-set construction and cross-validation folds.
- Inconsistent reporting: fewer than one in ten studies in the initial screening publish calibration indices, and details of cost-sensitive losses are frequently relegated to supplementary code or omitted entirely.
- Sparse external validation: Most papers evaluate performance using random internal splits; true generalisability remains uncertain.
- Publication and small-study effects: Funnel plot asymmetry is anticipated, as smaller datasets often utilise aggressive oversampling, which skews apparent discrimination.
- Metric multiplicity: sensitivity, specificity, F-score, precision-recall AUC, and balanced accuracy are reported idiosyncratically, complicating quantitative synthesis.
Strengths
- Breadth of evidence: the search spans five bibliographic databases and grey-literature repositories across 15 years of work, yielding the most extensive curated corpus of imbalance-related prediction studies.
- Dual synthesis: A descriptive map is paired with a random-effects meta-regression that explores moderators such as imbalance severity, sample size, and model family, yielding detailed insights not available in narrative reviews.
- Rigorous bias diagnostics: Influence statistics, funnel-plot tests, trim-and-fill, and Vevea–Hedges models will quantify the robustness of pooled estimates, alleviating the optimism that pervades the model-development literature.
- Technology-enabled workflows: ML-assisted screening using ASReview accelerates and transparently documents selection decisions [33].
- Alignment with contemporary guidance: Search, extraction, and reporting follow the PRISMA 2020 extensions to enhance reproducibility and uptake [25].
Limitations
Despite these safeguards, several constraints persist. First, residual heterogeneity is unavoidable; even a comprehensive meta-regression may explain only a modest fraction of the between-study variance. Second, using AUC as the primary effect size risks neglecting threshold-dependent performance and real-world decision thresholds. Third, cost-sensitive studies may still be too few or too inconsistently reported to allow quantitative pooling, necessitating a descriptive approach that limits formal comparisons with resampling methods. Fourth, living-review updates will depend on how quickly newly published work reports compatible statistics; the review may therefore lag very recent methodological advances.
Potential impact and influence on practice
By determining when and for whom resampling or weighting truly adds value, this review will help data scientists avoid reflexive oversampling, which can obscure calibration or encourage overfitting. Evidence suggests that cost-sensitive losses rival data-level balancing [47–50], which could shift practice towards simpler, loss-function-centric pipelines readily available in mainstream frameworks. Clinicians and journal editors might use the findings to demand more comprehensive reporting of calibration, confusion matrices, and misclassification costs, thereby accelerating the adoption of emerging AI reporting extensions (e.g., TRIPOD+AI [29]; see also [25]). Regulators may likewise refer to our recommendations when evaluating the fairness of deployed diagnostic or prognostic models.
Future directions
The mapped gaps suggest four priorities:
- Prospective, multi-centre cohorts with rare outcomes to test whether cost-sensitive and focal-loss networks outperform oversampling in truly out-of-sample settings.
- Standardised reporting templates that mandate disclosure of class distribution, sampling strategy, calibration, and decision-curve analysis; our findings can feed directly into upcoming guideline revisions.
- Generative augmentation and domain-adapted GANs: Early evidence (e.g., synthetic EEG and radiology data) hints at privacy-preserving promise but requires rigorous external validation [51].
- Continuous evidence surveillance through annual database alerts and semi-automated screening pipelines aligns with the living-review paradigm and ensures the conclusions remain current as new imbalance-handling techniques emerge [25,33].
The planned review will quantify the performance lift (or degradation) attributable to balancing strategies and outline a research agenda for more reproducible, cost-aware, and clinically grounded predictive modelling.
Supporting information
S1 File. INPLASY Protocol.
INPLASY Protocol Registration.
https://doi.org/10.1371/journal.pone.0330050.s001
(DOCX)
S2 File. Search Queries.
Ready-to-paste search queries with date limits (1 Jan 2009–31 Dec 2024).
https://doi.org/10.1371/journal.pone.0330050.s002
(DOCX)
S3 File. PRISMA-P Checklist.
PRISMA-P Checklist.
https://doi.org/10.1371/journal.pone.0330050.s003
(DOCX)
References
- 1. Mena LJ, Gonzalez JA. Machine learning for imbalanced datasets: Application in medical diagnostic. FLAIRS. 2006.
- 2. Li D-C, Liu C-W, Hu SC. A learning method for the class imbalance problem with medical data sets. Comput Biol Med. 2010;40(5):509–18. pmid:20347072
- 3. Rahman MM, Davis DN. Addressing the Class Imbalance Problem in Medical Datasets. IJMLC. 2013:224–8.
- 4. Mienye ID, Sun Y. Performance analysis of cost-sensitive learning methods with application to imbalanced medical data. Informatics in Medicine Unlocked. 2021;25:100690.
- 5. Alahmari F. A Comparison of Resampling Techniques for Medical Data Using Machine Learning. J Info Know Mgmt. 2020;19(01):2040016.
- 6. Carvalho M, Pinho AJ, Brás S. Resampling approaches to handle class imbalance: a review from a data perspective. J Big Data. 2025;12(1).
- 7. Panjainam P, Kanjanawattana S. A Comparison of the Hybrid Resampling Techniques for Imbalanced Medical Data. In: Proceedings of the 2024 7th International Conference on Robot Systems and Applications. 2024. p. 46–50. https://doi.org/10.1145/3702468.3702477
- 8. Jo T, Japkowicz N. Class imbalances versus small disjuncts. SIGKDD Explor Newsl. 2004;6(1):40–9.
- 9. van den Goorbergh R, van Smeden M, Timmerman D, Van Calster B. The harm of class imbalance corrections for risk prediction models: illustration and simulation using logistic regression. J Am Med Inform Assoc. 2022;29(9):1525–34. pmid:35686364
- 10. Christodoulou E, Ma J, Collins GS, Steyerberg EW, Verbakel JY, Van Calster B. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12–22. pmid:30763612
- 11. Demidenko E. Sample size determination for logistic regression revisited. Stat Med. 2007;26(18):3385–97. pmid:17149799
- 12. Yenipınar A, Koç Ş, Çanga D, Kaya F. Determining sample size in logistic regression with G-Power. Black Sea J Eng Sci. 2019;2(1):16–22.
- 13. Charan J, Kaur R, Bhardwaj P, Singh K, Ambwani SR, Misra S. Sample Size Calculation in Medical Research: A Primer. ANAMS. 2021;57:74–80.
- 14. Balki I, Amirabadi A, Levman J, Martel AL, Emersic Z, Meden B, et al. Sample-Size Determination Methodologies for Machine Learning in Medical Imaging Research: A Systematic Review. Can Assoc Radiol J. 2019;70(4):344–53. pmid:31522841
- 15. Vabalas A, Gowen E, Poliakoff E, Casson AJ. Machine learning algorithm validation with a limited sample size. PLoS One. 2019;14(11):e0224365. pmid:31697686
- 16. Figueroa RL, Zeng-Treitler Q, Kandula S, Ngo LH. Predicting sample size required for classification performance. BMC Medical Informatics and Decision Making. 2012;12:1–10.
- 17. Araf I, Idri A, Chairi I. Cost-sensitive learning for imbalanced medical data: a review. Artif Intell Rev. 2024;57(4).
- 18. Cox DR. The Regression Analysis of Binary Sequences. J Royal Statistical Society Series B: Statistical Methodology. 1959;21(1):238–238.
- 19. Block HD. The Perceptron: A Model for Brain Functioning. I. Rev Mod Phys. 1962;34(1):123–35.
- 20. Stitson M, Weston J, Gammerman A, Vovk V, Vapnik V. Theory of support vector machines. University of London. 1996;117(827):188–91.
- 21. Hastie T, Tibshirani R, Friedman J. Boosting and additive trees. 2009.
- 22. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. J Artif Intell Res. 2002;16:321–57.
- 23. Branco P, Torgo L, Ribeiro RP. A Survey of Predictive Modeling on Imbalanced Domains. ACM Comput Surv. 2016;49(2):1–50.
- 24. He H, Garcia EA. Learning from Imbalanced Data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84.
- 25. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. pmid:33782057
- 26. Tricco AC, Lillie E, Zarin W, O’Brien KK, Colquhoun H, Levac D, et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann Intern Med. 2018;169(7):467–73. pmid:30178033
- 27. Müller D, Soto-Rey I, Kramer F. Towards a guideline for evaluation metrics in medical image segmentation. BMC Res Notes. 2022;15(1):210. pmid:35725483
- 28. Reinke A, Tizabi MD, Baumgartner M, Eisenmann M, Heckmann-Nötzel D, Kavur AE, et al. Understanding metric-related pitfalls in image analysis validation. Nat Methods. 2024;21(2):182–94. pmid:38347140
- 29. Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, Van Calster B, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378. pmid:38626948
- 30. Tejani AS, Klontzas ME, Gatti AA, Mongan JT, Moy L, Park SH, et al. Checklist for Artificial Intelligence in Medical Imaging (CLAIM): 2024 Update. Radiol Artif Intell. 2024;6(4):e240300. pmid:38809149
- 31. Gallifant J, Afshar M, Ameen S, Aphinyanaphongs Y, Chen S, Cacciamani G, et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med. 2025;31(1):60–9. pmid:39779929
- 32. Zotero. Version 7.0.15. Vienna, VA, USA: Corporation for Digital Scholarship; 2025.
- 33. van de Schoot R, de Bruin J, Schram R, Zahedi P, de Boer J, Weijdema F, et al. An open source machine learning framework for efficient and transparent systematic reviews. Nat Mach Intell. 2021;3(2):125–33.
- 34. Wolff RF, Moons KGM, Riley RD, Whiting PF, Westwood M, Collins GS, et al. PROBAST: A Tool to Assess the Risk of Bias and Applicability of Prediction Model Studies. Ann Intern Med. 2019;170(1):51–8. pmid:30596875
- 35. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. pmid:25738806
- 36. Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6. pmid:31898477
- 37. Van Calster B, McLernon DJ, van Smeden M, Wynants L, Steyerberg EW, Topic Group ‘Evaluating diagnostic tests and prediction models’ of the STRATOS initiative. Calibration: the Achilles heel of predictive analytics. BMC Med. 2019;17(1):230. pmid:31842878
- 38. Binuya MAE, Engelhardt EG, Schats W, Schmidt MK, Steyerberg EW. Methodological guidance for the evaluation and updating of clinical prediction models: a systematic review. BMC Med Res Methodol. 2022;22(1):316. pmid:36510134
- 39. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26(6):565–74. pmid:17099194
- 40. Campbell M, McKenzie JE, Sowden A, Katikireddi SV, Brennan SE, Ellis S, et al. Synthesis without meta-analysis (SWiM) in systematic reviews: reporting guideline. BMJ. 2020;368:l6890. pmid:31948937
- 41. Cook RD. Detection of influential observation in linear regression. Technometrics. 1977;19(1):15–8.
- 42. Harrer M, Cuijpers P, Furukawa T, Ebert D. Doing Meta-Analysis with R: A Hands-On Guide. Boca Raton (FL): Chapman & Hall/CRC; 2021.
- 43. Egger M, Davey Smith G, Schneider M, Minder C. Bias in meta-analysis detected by a simple, graphical test. BMJ. 1997;315(7109):629–34. pmid:9310563
- 44. Vevea JL, Hedges LV. A General Linear Model for Estimating Effect Size in the Presence of Publication Bias. Psychometrika. 1995;60(3):419–35.
- 45. Sterne JA, Egger M, Smith GD. Systematic reviews in health care: Investigating and dealing with publication and other biases in meta-analysis. BMJ. 2001;323(7304):101–5. pmid:11451790
- 46. Page MJ, Moher D, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021;372:n160. pmid:33781993
- 47. Elkan C. The foundations of cost-sensitive learning. In: Proceedings of the International Joint Conference on Artificial Intelligence. Lawrence Erlbaum Associates Ltd; 2001.
- 48. Zhou Z-H, Liu X-Y. Training cost-sensitive neural networks with methods addressing the class imbalance problem. IEEE Trans Knowl Data Eng. 2006;18(1):63–77.
- 49. Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006;18(7):1527–54. pmid:16764513
- 50. Lin T-Y, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. 2017.
- 51. Shickel B, Tighe PJ, Bihorac A, Rashidi P. Deep EHR: a survey of recent advances in deep learning techniques for electronic health record (EHR) analysis. IEEE J Biomed Health Inform. 2018;22(5):1589–604. pmid:29989977