
Advancing patient care: Machine learning models for predicting grade 3+ toxicities in gynecologic cancer patients treated with HDR brachytherapy

  • Andres Portocarrero-Bonifaz ,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    aportocarrerob@pucp.edu.pe

    Affiliations Department of Radiation Oncology, Brown Cancer Center, University of Louisville School of Medicine, Louisville, Kentucky, United States of America, Department of Radiation Oncology, Mayo Clinic, Jacksonville, Florida, United States of America

  • Salman Syed,

    Roles Data curation, Investigation, Methodology, Validation, Writing – review & editing

    Affiliation Department of Radiation Oncology, Brown Cancer Center, University of Louisville School of Medicine, Louisville, Kentucky, United States of America

  • Maxwell Kassel,

    Roles Data curation, Methodology, Software, Validation, Writing – review & editing

    Affiliation Department of Radiation Oncology, Brown Cancer Center, University of Louisville School of Medicine, Louisville, Kentucky, United States of America

  • Grant W. McKenzie,

    Roles Data curation, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Department of Radiation Oncology, Brown Cancer Center, University of Louisville School of Medicine, Louisville, Kentucky, United States of America

  • Vishwa M. Shah,

    Roles Data curation, Investigation, Methodology, Validation, Writing – review & editing

    Affiliation Division of Gynecologic Oncology, Department of Gynecology and Obstetrics, Loma Linda University Medical Center, Loma Linda, California, United States of America

  • Bryce M. Forry,

    Roles Data curation, Investigation, Methodology, Validation, Writing – review & editing

    Affiliation Department of Radiation Oncology, Brown Cancer Center, University of Louisville School of Medicine, Louisville, Kentucky, United States of America

  • Jeremy T. Gaskins,

    Roles Data curation, Formal analysis, Investigation, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Department of Bioinformatics and Biostatistics, University of Louisville School of Public Health and Information Sciences, Louisville, Kentucky, United States of America

  • Keith T. Sowards,

    Roles Conceptualization, Investigation, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Department of Radiation Oncology, Brown Cancer Center, University of Louisville School of Medicine, Louisville, Kentucky, United States of America

  • Thulasi Babitha Avula,

    Roles Data curation, Investigation, Methodology, Validation, Writing – review & editing

    Affiliation Department of Radiation Oncology, Brown Cancer Center, University of Louisville School of Medicine, Louisville, Kentucky, United States of America

  • Adrianna Masters,

    Roles Data curation, Investigation, Methodology, Supervision, Validation, Writing – review & editing

    Affiliation Department of Radiation Oncology, Brown Cancer Center, University of Louisville School of Medicine, Louisville, Kentucky, United States of America

  • Jose G. Schneider,

    Roles Data curation, Formal analysis, Investigation, Methodology, Software, Validation, Writing – review & editing

    Affiliation Department of Radiation Oncology, Vanderbilt University Medical Center, Nashville, Tennessee, United States of America

  • Scott R. Silva

    Roles Conceptualization, Data curation, Investigation, Methodology, Project administration, Supervision, Validation, Writing – review & editing

    Affiliation Department of Radiation Oncology, Brown Cancer Center, University of Louisville School of Medicine, Louisville, Kentucky, United States of America

Abstract

Background

Gynecological cancers are among the most prevalent cancers in women worldwide. Brachytherapy, often used as a boost to external beam radiotherapy, is integral to treatment. Advances in computation, algorithms, and data availability have popularized the use of machine learning to predict patient outcomes. Recent studies have applied models such as logistic regression, support vector machines, and deep learning networks to predict specific toxicities in patients who have undergone brachytherapy.

Objective

To develop and compare machine learning models for predicting grade 3 or higher toxicities in gynecological cancer patients treated with high dose rate (HDR) brachytherapy, aiming to contribute to personalized radiation treatments.

Methods

A retrospective analysis was performed on gynecological cancer patients who underwent HDR brachytherapy with Syed-Neblett or Tandem and Ovoid applicators from 2009 to 2023. After applying exclusion criteria, 233 patients were included in the analysis. Dosimetric variables for the high-risk clinical target volume (HR-CTV) and organs at risk, along with tumor, patient, and toxicity data, were collected and compared between groups with and without grade 3 or higher toxicities using statistical tests. Seven supervised classification machine learning models (Logistic Regression, Random Forest, K-Nearest Neighbors, Support Vector Machines, Gaussian Naive Bayes, Multi-Layer Perceptron Neural Networks, and XGBoost) were constructed and evaluated. The training process involved sequential feature selection (SFS) when appropriate, followed by hyperparameter tuning. Final model performance was characterized using a 25% withheld test dataset.

Results

The top three ranking models were Support Vector Machines, Random Forest, and Logistic Regression, with F1 testing scores of 0.63, 0.57, and 0.52; normMCC testing scores of 0.75, 0.77, and 0.71; and accuracy testing scores of 0.80, 0.85, and 0.81, respectively. The SFS algorithm selected 10 features for the highest-ranking model. In traditional statistical analysis, HR-CTV volume, Charlson Comorbidity Index, Length of Follow-Up, and D2cc - Rectum differed significantly between groups with and without grade 3 or higher toxicities.

Conclusions

Machine learning models were developed to predict grade 3 or higher toxicities, achieving satisfactory performance. Machine learning presents a novel solution to creating multivariable models for personalized radiation therapy.

Introduction

Gynecological cancers rank among the most diagnosed malignancies affecting women on a global scale [1]. In the United States of America, it is estimated that there were 116,930 new cases and 36,250 deaths in 2024 attributable to gynecologic malignancies [2]. Treatments for gynecologic cancers include surgery, chemotherapy, and/or radiotherapy [3]. Brachytherapy is necessary in the management of locally advanced cervical cancer, since patients who do not receive brachytherapy following concurrent external beam radiation therapy (EBRT) and chemotherapy have significantly worse overall survival [4]. Colson-Fearon et al. reported that the 4-year overall survival in locally advanced cervical cancer patients treated with brachytherapy versus without brachytherapy is 67.7% versus 45.7%, respectively [5]. With 3-dimensional magnetic resonance image-guided brachytherapy for cervical cancer, the 5-year local control is 92% [6].

Possible side effects in patients who have undergone radiation treatment for gynecological cancers with a brachytherapy component include gastrointestinal (GI), genitourinary (GU), and vaginal (VAG) toxicities. According to the Common Terminology Criteria for Adverse Events (CTCAE) version 5.0, these toxicities are graded from 1 (mild) to 5 (death) based on their severity and impact on daily activities. Common GI toxicities include diarrhea, proctitis, and rectal bleeding. GU toxicities often present as urinary frequency, incontinence, or cystitis. Vaginal toxicities such as stenosis, mucositis, and fistulas may also occur. Follow-up data from the EMBRACE-I trial reported that at three years, the incidences of severe (grade 3 or higher) GI, GU, and vaginal toxicities were 7% (5.6–8.8%), 6.1% (4.8–7.7%), and 3.6% (2.7–5.0%), respectively. The overall incidence of adverse events, including non-GI, GU, and VAG toxicities, was 21.5% at three years and 26.6% at five years [7].

Machine learning (ML) has been defined as an optimization problem to find the most suitable predictive model for new data based on an existing dataset obtained from a similar context [8]. The recent rise in popularity of ML has been due to the development of new algorithms, theory, data availability, and improvements in low-cost computation [9]. For many problems, ML has been shown to have better overall predictive metrics than conventional statistical models (CSM) [10–12]. In the treatment of cervical cancer with HDR brachytherapy, ML has proven to be applicable to several aspects. Among others, tree-based models have been found to provide accurate classification performance for optimal applicator selection [13], and deep-learning methods show promise in the improvement of treatment planning, such as segmentation, reconstruction, plan optimization, dose calculation, and other aspects related to treatment outcomes [14].

ML is a bottom-up approach that has the advantages of being data-driven, of not requiring strict a priori assumptions about the forms of the relationships between variables and outcomes, and of accounting for complex interactions among input features [15]. In contrast, CSMs can be viewed as top-down approaches; their main advantages are interpretability (they usually focus on parsimonious relationships between input and response), the low computational resources needed to fit them, and lower susceptibility to overfitting with large datasets [16,17].

Binary classification, in which the ML model predicts an output that is either one of two possible classes, is one of the most common tasks that can be solved with supervised machine learning [18]. For this problem, a model is trained with data that contains both features and the response labels, and the algorithm compares the actual and predicted results using an appropriate assessment metric [19]. This study aims to build and compare some of the more common binary classification machine learning models in the context of predicting if a patient will develop grade 3 or higher toxicities (Output: Yes/No) in gynecologic cancer patients treated with EBRT and brachytherapy. By accurately forecasting severe toxicities based on clinical, demographic, and treatment-related features, we aim to enhance personalized patient care by enabling clinicians to identify high-risk patients before treatment initiation. This predictive capability allows for proactive adjustments to treatment plans and implementation of preventive measures regarding potential side effects. This study forms part of a larger research effort aimed at improving treatment outcomes and quality of life (QoL) for patients by integrating machine learning models into the clinical decision-making process.

Methods

Data collection

A comprehensive retrospective analysis was conducted, encompassing a total of 233 patients who had undergone high dose rate (HDR) brachytherapy with Syed-Neblett or Fletcher-Suit-Delclos Tandem and Ovoid (T&O) applicators for treatment of gynecological cancer (cervix, endometrium, vagina, or vulva) at a single institution spanning the period from 2009 to 2023. As healthcare professionals directly involved in patient care, the authors had access to identifying patient information throughout the study. This data was accessed for research purposes from March 1st, 2022, to March 1st, 2024. Demographic details, tumor characteristics, treatment variables, dosimetric information (including if the patient received an EBRT boost), and occurrences of gastrointestinal (GI), genitourinary (GU), and vaginal (VAG) toxicities during and post-treatment were gathered. The exclusion criteria included the following: patients with a prior brachytherapy history, those treated with more than a single type of brachytherapy applicator, conflicting dosimetric data found in records, concurrent external beam radiotherapy for a distinct proximal disease site, or a combination of low dose rate (LDR) and HDR treatments. Toxicities were classified according to the Common Terminology Criteria for Adverse Events (CTCAE) version 5.0 [20], and the integrity of the database was reviewed three times by both a physician and a medical physicist to ensure its accuracy and reliability. For treatment planning, the dosimetry goals as detailed in the EMBRACE trials and ASTRO Clinical Practice Guideline were followed [6,21]. All patients received EBRT and brachytherapy. The process used to calculate the total EQD2 dose has been described in detail in a previous work, and follows the procedure suggested by ICRU 89 [22,23]. This study was approved by our institutional review board (IRB 22.0117).

Statistical analysis

Preliminary dataset exploration was done by comparing patients that developed no higher than a grade 2 toxicity with those that developed grade 3 or higher toxicities at any point in time after EBRT initiation. Continuous variables were reported as means and standard deviations and compared with 2-sample t-tests. Categorical variables were listed as counts and percentages and compared with the Fisher exact test. Non-normal continuous variables were reported as medians and interquartile ranges (IQR) and compared with the non-parametric Mann-Whitney test; a p-value of 0.05 or lower was considered statistically significant. Kaplan-Meier curves for disease-free survival and local control were created to characterize the cohort. This analysis was performed using R statistical software version 4.3.2 [24].
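The group comparisons described above were run in R; a minimal Python/SciPy sketch of the same three tests, using made-up values in place of the study data, might look like:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical values standing in for a continuous variable (e.g., HR-CTV
# volume in cc) in the no-toxicity and grade 3+ toxicity groups.
no_tox = rng.normal(39, 10, size=100)
tox = rng.normal(50, 12, size=30)

# 2-sample t-test (normal continuous variables)
t_stat, t_p = stats.ttest_ind(no_tox, tox)

# Mann-Whitney U test (non-normal continuous variables)
u_stat, u_p = stats.mannwhitneyu(no_tox, tox)

# Fisher exact test for a 2x2 table of made-up counts
# (rows: toxicity yes/no; columns: categorical feature yes/no)
odds_ratio, f_p = stats.fisher_exact([[20, 36], [60, 117]])
```

Each call returns the test statistic and the p-value, which is then compared against the 0.05 significance threshold used in the paper.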

Data preprocessing

The machine learning analysis was done using Python 3 and Jupyter Notebook (IPython kernel). Various code libraries (collections of pre-written functions and classes), including Scikit-learn v1.3.2 [25], were used for their straightforward integration and reproducibility; care was taken to ensure compatibility and the use of the appropriate library versions. Charlson Comorbidity Index was categorized into approximate quartiles (“Low” (0–2), “Medium” (3), “High” (4–5), or “Very High” (> 5)), and KPS was assigned categories according to clinical interpretation: “Bad” (50–70), “Normal” (80), or “Good” (90–100). Data pre-processing involved four steps: A) Encoding, B) Imputation, C) Class Balancing and D) Normalization.

For data encoding, categorical and ordinal variables were assigned to numeric labels. The data then underwent a stratified split based on the target, resulting in two groups with an equivalent proportion of toxicity events: 75% for training (n = 174) and 25% for testing (n = 59).
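Assuming scikit-learn's `train_test_split` (the study used Scikit-learn v1.3.2), the stratified 75/25 split can be sketched as follows; the feature matrix and labels here are random stand-ins, but the resulting 174/59 counts match the paper's:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the encoded feature matrix and binary toxicity labels
# (the real study had n = 233 patients with a 24% positive class).
rng = np.random.default_rng(42)
X = rng.normal(size=(233, 5))
y = (rng.random(233) < 0.24).astype(int)

# A 75/25 stratified split keeps the positive-class proportion
# approximately equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(len(X_train), len(X_test))  # 174 and 59
```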

Imputation of missing feature values was done via the K-nearest neighbors (KNN) algorithm due to its simplicity, low computational cost, and better overall performance over other data imputation methods such as Multilayer Perceptron (MLP) or mean/mode imputation; this approach aligns with the recommendations of Garcia-Laencina et al. [26]. Imputation was essential to allow consistent, unbiased comparison across multiple ML algorithms, including those which inherently cannot function with missing values, and to fully utilize our limited dataset without discarding cases. The missing data points primarily result from unobtainable patient records or measurements not being taken, suggesting the missingness is Missing at Random (MAR). KNN imputation is appropriate for MAR imputation when there is only a moderate amount of missingness (15%-30% missing), as in our setting, as it matches an individual with missing data to a similar patient based on the observed data; the advantages of KNN may be less pronounced with lower levels of missing data, as suggested by Beretta et al. [27]. For categorical and ordinal variables, a K-nearest neighbors imputer was employed utilizing the single nearest neighbor to guarantee imputation to a single class for that feature. For the numerical continuous features, the KNN imputer was used with K = 5 neighbors, and the missing features were imputed by the average. This parameter was chosen after extensive experimentation. These imputers identify their nearest neighbors by calculating the Euclidean distance between data points (not including the missing data). They were fitted using the training data only and then applied to both the training and testing data [28].
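A minimal sketch of this fit-on-train, apply-to-both imputation scheme using scikit-learn's `KNNImputer`; the values and K = 2 below are illustrative, not the study's data or parameters:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Small numeric example with missing values encoded as np.nan.
X_train = np.array([[1.0, 2.0],
                    [2.0, np.nan],
                    [3.0, 6.0],
                    [4.0, 8.0]])
X_test = np.array([[2.5, np.nan]])

# Neighbors are found via (nan-aware) Euclidean distance on observed values,
# and the missing entry is replaced by the neighbors' average.
imputer = KNNImputer(n_neighbors=2)
X_train_filled = imputer.fit_transform(X_train)  # fitted on training data only
X_test_filled = imputer.transform(X_test)        # same fitted imputer for test
```

Here the missing entry in the second training row is filled with the mean of its two nearest neighbors' values (2.0 and 6.0), i.e., 4.0.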

The defined positive class of Grade 3 or higher toxicity was observed in a minority of patients (24%, 56/233), leading to an imbalanced dataset. To address this imbalance, the class-balancing algorithm SVM-SMOTE [29] was used only during model training. (Preliminary analyses suggested this algorithm had better performance than alternative balancing algorithms such as SMOTE [30] and ADASYN [31]). Out of the initial 174 samples from the training data, an additional 90 synthetic positive cases were generated for a total of 264 samples (132 positive, 132 negative).
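SVM-SMOTE itself is provided by the imbalanced-learn library; the pure-NumPy sketch below illustrates only the core SMOTE interpolation step (synthesizing minority samples between nearest minority neighbors) and omits the SVM borderline refinement that distinguishes SVM-SMOTE:

```python
import numpy as np

def smote_like_oversample(X_pos, n_new, k=5, rng=None):
    """Generate synthetic minority samples by interpolating between a random
    minority point and one of its k nearest minority neighbors.
    (SVM-SMOTE additionally restricts generation to points near the SVM
    decision boundary; that refinement is omitted in this sketch.)"""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_pos))
        d = np.linalg.norm(X_pos - X_pos[i], axis=1)
        neighbors = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbors)
        gap = rng.random()                   # interpolation factor in [0, 1)
        synthetic.append(X_pos[i] + gap * (X_pos[j] - X_pos[i]))
    return np.array(synthetic)

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(42, 4))  # stand-in for the positive training cases
X_new = smote_like_oversample(X_pos, n_new=90, k=5, rng=1)
print(X_new.shape)  # (90, 4): 90 synthetic positives, as in the paper
```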

The final pre-processing step included the normalization/standardization of values. After experimentation, the Standard Scaler was used for continuous numerical variables. For categorical and ordinal variables, both Target Encoding and One Hot Encoding were initially considered. Due to our sample size, the former was ultimately favored due to it not increasing the number of features, given how an increased features-to-samples ratio has been associated with overfitting the data [32]. Other normalization/standardization methods such as MinMax Scaler and the Robust Scaler were also explored but not reported in this work due to obtaining worse results. The fitted Scalers and the Target Encoding objects were stored into a Joblib file and then employed in the testing data.
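The fit-on-train scaling and Joblib persistence can be sketched with scikit-learn as follows; the one-column data is a toy stand-in, and Target Encoding of categorical variables is omitted for brevity:

```python
import os
import tempfile
import numpy as np
import joblib
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5]])

scaler = StandardScaler().fit(X_train)   # fitted on the training data only
X_train_std = scaler.transform(X_train)  # zero mean, unit variance on train
X_test_std = scaler.transform(X_test)    # same fitted transform reused on test

# Persist the fitted scaler for later reuse, as described in the paper.
path = os.path.join(tempfile.gettempdir(), "scaler_demo.joblib")
joblib.dump(scaler, path)
reloaded = joblib.load(path)
```

Fitting only on the training data and reusing the stored transform prevents information from the test set leaking into preprocessing.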

Investigation into collinearity between input features was also performed using Pearson’s correlation coefficient. For each pair of features with a correlation greater than 0.80, one feature was eliminated from the final model. Other thresholds such as 0.7 and 0.95 were also analyzed but yielded worse results. When dose metrics were collinear, D2cc and D90 were given priority to remain in the final model, as they are the most widely used clinical values [21].
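A hypothetical pandas implementation of the 0.80-correlation filter; the `D90`/`D98`/`HR_CTV` columns are fabricated stand-ins for the real dose metrics:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
d90 = rng.normal(80, 5, 200)
df = pd.DataFrame({
    "D90": d90,
    "D98": d90 * 0.9 + rng.normal(0, 0.5, 200),  # nearly collinear with D90
    "HR_CTV": rng.normal(45, 15, 200),           # independent feature
})

# Absolute Pearson correlations; keep only the upper triangle so each
# pair is inspected once and the earlier (prioritized) column survives.
corr = df.corr(method="pearson").abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [c for c in upper.columns if (upper[c] > 0.80).any()]
df_reduced = df.drop(columns=to_drop)
print(to_drop)  # ['D98'] -- D90 is kept, matching the paper's priority rule
```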

Evaluation of machine learning models

There are multiple performance metrics that can be used to assess a model’s performance on predicting new data. In this study, the Accuracy, Precision, Recall, F1 score, Matthews Correlation Coefficient (MCC), the area under the receiver operating characteristic curve (AUC-ROC), and the area under the precision-recall curve (AUC-PR) were used; the first five metrics are defined using the number of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN) as follows:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{4}$$

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{5}$$

In this context, Positive/Negative represents whether the ML model predicts a toxicity event, and True/False represents whether the ML prediction agrees/disagrees with the patient record. Accuracy, as shown in formula (1), is the ratio of correctly predicted instances over the total number of patients. Precision, which is represented by formula (2), is the ratio of correctly predicted positive observations to the total number of observations predicted to be positive. Recall, also known as Sensitivity, is the ratio of correct predictions among patients with toxicities, as shown in formula (3); the F1 score, as defined in formula (4), is equivalent to the harmonic mean of precision and recall [33]; MCC, or its normalized version (normMCC) [34], is a balanced measure that considers all four basic metrics (TP, FP, TN, FN) as shown in formula (5). Additional metrics such as the AUC-ROC and AUC-PR evaluate the overall performance of the model by considering performance across all possible decision thresholds of the model [35]. In this work, the reported F1, recall, and precision scores are calculated for the positive class (patients who present a toxicity). For the AUC-ROC curve, the baseline denotes a random classifier, manifesting as a diagonal line with an AUC-ROC value of 0.5. Conversely, the PR curve’s baseline reflects a situation where all classifications are assumed to be positive, resulting in a horizontal line on the precision-recall plot; the position of this line on the Y-axis is contingent upon the characteristics of the data under consideration [36,37]. These prediction metrics are calculated and reported for the training data (without the SVM-SMOTE-generated synthetic samples used for data balancing) and the withheld test data (with missing data imputed for both). For the purposes of this work, the authors define the top ML models as those with the highest test data F1 score.
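For concreteness, the threshold-based metrics can be computed with scikit-learn on a small hypothetical prediction vector; normMCC is obtained by rescaling MCC from [-1, 1] to [0, 1]:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

# Hypothetical labels and predictions for 10 patients (1 = grade 3+ toxicity).
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]
# Here TP = 3, TN = 5, FP = 1, FN = 1.

acc = accuracy_score(y_true, y_pred)     # (TP + TN) / total = 0.80
prec = precision_score(y_true, y_pred)   # computed for the positive class
rec = recall_score(y_true, y_pred)       # sensitivity
f1 = f1_score(y_true, y_pred)            # harmonic mean of precision, recall
mcc = matthews_corrcoef(y_true, y_pred)
norm_mcc = (mcc + 1) / 2                 # rescale MCC from [-1, 1] to [0, 1]
```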
Confidence intervals of 95% were calculated using bootstrap resampling with replacement over 1000 iterations. In each iteration, the model was retrained using balanced data, and metrics were recalculated on the unbalanced training and testing datasets. The standard deviation of the metric values across iterations was used to estimate the confidence interval under the assumption of a normal distribution [38].
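A simplified version of this bootstrap procedure, with fewer iterations, a toy dataset, and plain logistic regression standing in for the tuned models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(174, 4))
y = (X[:, 0] + rng.normal(0, 1, 174) > 0.7).astype(int)  # imbalanced toy labels

scores = []
for _ in range(200):                       # the paper used 1000 iterations
    idx = rng.integers(0, len(X), len(X))  # resample training data w/ replacement
    model = LogisticRegression().fit(X[idx], y[idx])
    scores.append(f1_score(y, model.predict(X)))

# Normal-approximation 95% CI: mean +/- 1.96 * standard deviation,
# matching the assumption described in the paper.
m, s = float(np.mean(scores)), float(np.std(scores))
ci = (m - 1.96 * s, m + 1.96 * s)
```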

Sequential feature selection

In various domains, including healthcare, datasets may exhibit high dimensionality, referring to the presence of a large number of variables or features. This characteristic can adversely affect the development and interpretability of some machine learning algorithms (Logistic Regression, Support Vector Machines, K-Nearest Neighbors, and Gaussian Naive Bayes) [39,40]. To reduce dimensionality, several approaches exist, such as feature extraction and feature selection [41]. In this work, multiple variations of sequential feature selection were initially considered, including Sequential Forward Selection (SFS), Sequential Backward Selection (SBS), Sequential Forward Floating Selection (SFFS), and Sequential Backward Floating Selection (SBFS), each of which used as its estimator the same model later used for training [42–44]; after experimentation, Sequential Forward Selection was chosen and applied during the training of LR, SVM, KNN, and GNB due to faster computation times and better performance metrics. Note that, regardless of the traditional statistical analysis, both the variables that were marginally significant and those that were not are explored when training the ML algorithms. The forward feature selection process adds one feature into the model at a time, determining inclusion based on which predictor optimizes the evaluation criterion, which in our case was the F1 score. As part of model training, a 10-fold Stratified Shuffle Split cross-validator was used over the class-balanced training data to reduce overfitting and appropriately assess the performance metrics of the sequential feature selection algorithm [45,46]. A SHAP analysis [47] or variable importance plot (for RF) was done for the top 3 models to study the relevance of the final features that were included.
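Scikit-learn's `SequentialFeatureSelector` implements this forward selection with a chosen scorer and cross-validator; the sketch below uses synthetic data in which only features 0 and 3 carry signal, so forward selection should recover them:

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(264, 8))  # stand-in for the balanced training set
y = (X[:, 0] - X[:, 3] + rng.normal(0, 0.5, 264) > 0).astype(int)

# 10-fold Stratified Shuffle Split, as used in the paper.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)

sfs = SequentialFeatureSelector(
    LogisticRegression(),
    n_features_to_select=2,   # fixed here; the paper let performance decide
    direction="forward",      # Sequential Forward Selection
    scoring="f1",             # selection criterion used in the study
    cv=cv,
)
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())
print(selected)  # the two informative features, 0 and 3
```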

Machine learning algorithms

A total of 7 machine learning models were implemented and compared: Logistic Regression (LR), Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machines (SVM), Gaussian Naive Bayes (GNB), Multi-Layer Perceptron Neural Network (MLP), and XGBoost (XGB). While there are many other ML classification algorithms in the literature, these seven choices represent the most commonly utilized algorithms for radiation outcome modeling [48]. The selection of these methods provided a range from simpler models (such as LR, KNN, and GNB) to more complex approaches (such as RF, SVM, MLP, and XGB), addressing practical challenges related to dataset size and potential risks of overfitting. The baseline for the precision-recall curve was determined to be a horizontal line equal to 0.237, based on a classifier that labels all instances as positive within the held-out testing data. After selecting the most relevant features through the Sequential Feature Selection process for the appropriate models, the hyperparameters of all 7 models were further fine-tuned using a Grid Search over another 10-fold Stratified Shuffle Split cross-validator to optimize prediction under each model choice; the hyperparameter search space used by Grid Search is detailed in S1 Table. The Python code and study database are available at: https://github.com/AndresPB95/ML-Model-Gynecological-HDR-G3Plus-Toxicities. A comprehensive diagram depicting the full machine learning workflow is provided in Fig 1, and S2 Table presents a summary of all the features explored by the ML models, along with their types. The initial features included in the database of study were selected through discussions with physicians and professionals directly involved in HDR treatments.
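The grid search step can be sketched with `GridSearchCV`; the parameter grid shown here is a small hypothetical one for SVM, not the study's actual search space (which is given in its S1 Table):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(264, 5))  # stand-in for the balanced training set
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 264) > 0).astype(int)

# Hypothetical SVM search space; the real grid is in the paper's S1 Table.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# 10-fold Stratified Shuffle Split cross-validator, F1 as the criterion.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=cv)
search.fit(X, y)

best_model = search.best_estimator_  # refit on all data with the best params
```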

Fig 1. Flowchart outlining the steps used when training and evaluating the different models.

The process is divided into the following steps: (A) Initial Train/Test split: The data is initially divided into training and testing sets. The training set is used for most of the model development process, while the testing set is reserved to simulate new, unseen data. (B) Data preprocessing (Training Set): Preprocessing steps include: (i) A KNN Imputer is fitted and applied to the training data to fill in missing values, (ii) Collinear features are removed, (iii) SVM SMOTE is used to oversample the positive class (*Only for training). Note: A separate, unbalanced copy of the training set was retained for evaluation, (iv) a StandardScaler is fitted and applied to the training data ensuring they are on a comparable scale. (C) Data preprocessing (Testing Set): The preprocessing objects fitted to the training set are subsequently applied to the testing set: (i) The KNN Imputer is used to fill in missing values in the testing data, (ii) Collinear features are removed, (iii) the StandardScaler is applied for normalization. Note: SVM SMOTE was NOT used to oversample the test set. (D) Hyperparameter Tuning: For each model, the following tuning procedures are conducted using 10-fold cross-validation: (i) Sequential Feature Selection (if applicable) creates and trains multiple models by adding one feature at a time. Each model’s F1 score is tested by comparing the predicted values with the known labels, and features that improve the F1 score are retained, building towards the most effective feature set. (ii) GridSearch trains multiple models with various hyperparameter combinations. Each combination’s F1 score is tested by comparing the predicted values with the known labels, and the best-performing combination is selected for the final model. (E) Final Model Generation: After identifying the optimal hyperparameters and features, a final model is trained using the entire balanced training set. 
(F) Evaluation: The model’s performance is evaluated by comparing its predictions against the known labels using both the unbalanced training set and the testing set.

https://doi.org/10.1371/journal.pone.0312208.g001

Results

The dataset included demographic and clinical data for n = 233 patients, of which n = 56 (24%) had a grade 3 or higher toxicity. It comprised 32 features: 3 ordinal, 7 categorical, and 22 numerical variables. The highest missing data rates were observed for the total number of treatment days (11%) and maximum tumor length (16.7%), while all other features had missingness rates of 4.3% or less. The demographic, treatment, and tumor-related data are shown in Table 1. Patients who experienced grade 3 or higher toxicity were found to have longer follow-up (median 12.4 months versus 3.8 months), were more likely to have low or very high comorbidity scores, and had significantly higher HR-CTV values (median 50 cc versus 39 cc, p = 0.041).

Table 1. Comparison of patient, treatment, and tumor characteristics between groups with and without grade 3 or higher toxicities.

https://doi.org/10.1371/journal.pone.0312208.t001

Table 2 compares median dose coverage to the tumor (V100%, D50%, D90%, and D98%) and the doses to the organs at risk (OARs) by toxicity status. Patients with grade 3 or higher toxicities had significantly higher D2cc doses to the rectum (p = 0.043), but no other doses were statistically significantly different. The HR-CTV V100, D1cc - Rectum, and doses to the sigmoid colon were slightly higher for the group with grade 3 or higher toxicities but not statistically significant.

Table 2. HR-CTV and OAR dosimetric values between groups with and without grade 3 or higher toxicities.

https://doi.org/10.1371/journal.pone.0312208.t002

The seven machine learning models were then fitted using all variables included in Tables 1 and 2 as described in the Methods section. The performance of these models on the withheld test data are depicted visually in Figs 2 and 3. Numeric comparisons based on both the (class-imbalanced) training data and withheld test data are shown in Table 3. The top three models for predicting grade 3 or higher toxicities are found to be Support Vector Machines (SVM), Random Forests (RF), and Logistic Regression (LR) with F1 testing scores of 0.63, 0.57 and 0.52, normMCC testing scores of 0.75, 0.77 and 0.71, and Accuracy testing scores of 0.80, 0.85 and 0.81, respectively. All values shown in Table 3 assume a classification threshold value of 0.5 for toxicity prediction. Note that this table also includes the metrics from the training data, which for some models (MLP and KNN) disagree strongly with the test data performance measures, indicating severe overfitting in the training data. Table 4 exhibits the most relevant features and the values of the hyperparameters selected by the GridSearchCV optimization algorithm over the training data. The top features repeated among these three models are Chemotherapy, Charlson Comorbidity Index, KPS, D2cc - Small Bowel, Stage, Histology, and Follow-Up Time.

Fig 2. Precision-Recall curves comparing 7 machine learning models and a baseline value.

PR curves are computed using the withheld test data. SVM is the model with the highest area under the curve.

https://doi.org/10.1371/journal.pone.0312208.g002

Fig 3. Receiver Operating Characteristics curves for 7 machine learning models and a baseline value.

ROC curves are computed using the withheld test data. SVM is the model with the highest area under the curve.

https://doi.org/10.1371/journal.pone.0312208.g003

Table 3. Training and testing performance metrics for the considered machine learning models.

https://doi.org/10.1371/journal.pone.0312208.t003

Table 4. Most important features as selected by the Sequential Feature Selection algorithm (where appropriate) and found optimal hyperparameters for the top 3 scoring models.

https://doi.org/10.1371/journal.pone.0312208.t004

The total runtime for the training of Logistic Regression, Random Forest, K-Nearest Neighbors, Support Vector Machines, Gaussian Naive Bayes, Multi-Layer Perceptron, and XGBoost was 0.35, 8.24, 0.45, 0.73, 0.13, 104.32, and 28.86 minutes, respectively. The Python script was executed on a computer equipped with a 14th Generation Intel(R) Core(TM) i9-14900HX 2.20 GHz processor (E-cores up to 4.10 GHz P-cores up to 5.80 GHz), 32 GB of DDR5–5600 MHz (SODIMM) RAM, and a 64-bit Windows 11 Pro operating system.

Discussion

This study aimed to investigate the utility of using machine learning models to predict grade 3 or higher toxicities in gynecologic cancer patients treated with EBRT and interstitial or T&O brachytherapy. The database was analyzed using traditional statistics, comparing groups with and without grade 3+ toxicities; disease-free survival and local control were also reported (S1 Fig). To design the toxicity prediction models, data were encoded and pre-processed. Next, a sequential feature selection method was used when appropriate, and hyperparameter tuning was performed.

From a clinical application standpoint, ML models can significantly enhance patient care by identifying individuals at higher risk of developing severe toxicities. Clinicians can use this predictive information to modify the treatment plan to reduce the risk of severe toxicities, tailor patient counseling, set appropriate expectations, and implement proactive monitoring strategies. For example, patients identified as high-risk may benefit from more frequent follow-up appointments, early interventions for symptom management, or a change in the prescription dose. Additionally, the model can aid in shared decision-making by involving patients in discussions about the potential risks and benefits of their treatment plans. By stratifying patients based on their predicted toxicity risk, healthcare teams can allocate resources more efficiently and personalize supportive care measures to mitigate adverse effects and maintain or improve patients’ QoL. Thus, ML models can serve as a valuable tool for enhancing patient outcomes through individualized clinical management. Once validated, the best model could be deployed into clinical workflows as a decision-support application that provides real-time risk assessments during treatment planning. This tool could be implemented directly within the Eclipse environment via ESAPI scripting (C#), as a standalone browser-based application accessible to clinicians, or through alternative deployment methods adapted to the specific needs and infrastructure of each clinic.

A comparison of the patients with and without grade 3+ toxicities, using basic marginal statistical analysis, suggested a few differences between the groups, including HR-CTV, Charlson Comorbidity Index, Length of Follow-Up, and D2cc - Rectum. Some of these variables, such as HR-CTV and D2cc - Rectum, have previously been shown to be predictors of grade 3 or higher toxicities in HDR brachytherapy. Lee et al. observed that patients with grade 3–4 toxicities had a significantly higher median HR-CTV of 111 cc compared to 43 cc for those patients with grade 0–2 toxicities [49]. Mesko et al. found a statistically significant difference between patients with and without grade 3 toxicities, with median HR-CTV values of 93.8 cc and 51 cc, respectively [50]. Mazeron et al. found that rectal D2cc values equal to or greater than 75 Gy EQD2 are associated with higher grade and more frequent toxicities in MRI-guided adaptive brachytherapy for locally advanced cervical cancer [51]. In contrast to traditional statistics, machine learning models consider nonlinear interactions between variables [52]. Our top-performing model selected 10 features, with Length of Follow-Up, D2cc to the Small Bowel, Charlson Comorbidity Index, HR-CTV volume, and D2cc to the Sigmoid being the most relevant (S2 Fig), agreeing with the discussed literature and the statistical analysis. Notably, Length of Follow-Up and Charlson Comorbidity Index were also significant predictors in our second and third best-performing models (S3 and S4 Figs). Higher D2cc doses to the small bowel and sigmoid indicate greater radiation exposure, increasing the risk of severe toxicity, while a higher Charlson Comorbidity Index reflects poorer overall health and greater susceptibility to adverse effects. Larger HR-CTVs may necessitate more aggressive treatment, heightening toxicity risk, and Length of Follow-Up is crucial for capturing late-onset toxicities.
It is important to note that the features chosen by SFS may exclude variables that are easily manipulable when creating a treatment plan, particularly dosimetric variables. Two explanations are possible: 1) certain combinations of hyperparameters could limit the ability of SFS to find the correct interactions between features in the final selection; or 2) certain combinations of features could be more relevant and produce better predictions than the actual dosimetric data. A model without any dosimetric features would still be useful for predicting toxicity risk but would not give the clinician the option of adjusting the treatment plan to reduce that risk.

Supervised machine learning has been utilized to perform classification tasks in various areas of healthcare, such as predicting the diagnosis and prognosis of COVID-19 patients, hospitalization due to heart disease, and outcomes of infectious diseases [53–55]. To the authors’ knowledge, this is the first analysis using and comparing multiple models for predicting grade 3 or higher toxicities in gynecologic cancer patients treated with external beam radiation and HDR interstitial or T&O brachytherapy. Through March 2020, there were only 53 published studies on the use of machine learning to predict radiation-induced toxicities [56], and through September 2023, only 14 studies had been published on deep learning models to predict toxicities from radiation treatment [57].

Regarding ML in brachytherapy toxicity prediction, Tian et al. developed a model for predicting fistula formation, reporting a recall of 97.1% and an AUC of 0.904 using the SMOTE algorithm and an SVM model with a radial basis kernel function on a database of 35 patients with 7 positive cases; the limitations of this study are the small dataset, the absence of a withheld test dataset, the high risk of model overfitting, and the use of only one model [58]. For rectal toxicities, Chen et al. and Zhen et al. predicted grade 2 or higher toxicity using SVM and convolutional neural networks, respectively, with cross-validation-estimated recall and AUC of 0.85 and 0.91 for the former and 0.75 and 0.89 for the latter. Their work includes the addition of dose map features for model training; both studies were done with a database of 42 patients, 12 of whom developed toxicities [59,60]. Additionally, Lucia et al. developed Normal Tissue Complication Probability (NTCP) models for acute and late gastrointestinal, genitourinary, and vaginal toxicities using a database of 102 patients that included radiomic features, but only for a logistic regression model, obtaining balanced accuracy scores between 63.99 and 78.41 [61]. Cheon et al. considered deep learning models for predicting late bladder toxicities that outperformed their multivariable logistic regression counterpart, achieving an F1 score of 0.76 with data from 281 patients [62]. In contrast to earlier investigations, our study presents the largest patient dataset used for predicting grade 3 or higher toxicities. Similar to these studies, we employ data-balancing algorithms to promote stability in the model training stage. Our methodology incorporates feature selection for all models except MLP and RF.
Specifically, we leverage the Sequential Feature Selection Algorithm to promote parsimony within the model fit. This aligns with the methodologies employed in previous reports.
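The SMOTE-style balancing discussed above synthesizes minority-class points by interpolating between a minority sample and one of its nearest minority neighbors. The pure-NumPy sketch below illustrates the idea on toy data; `smote_sketch` is a hypothetical helper name, and in practice a library implementation such as imbalanced-learn's SMOTE would be used.

```python
# Minimal sketch of the SMOTE idea (Chawla et al. [30]): new minority samples
# are interpolations between a minority point and one of its k nearest
# minority neighbors. Illustration only, not a production implementation.
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic samples from the minority-class matrix X_min."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)   # distances to sample i
        neighbors = np.argsort(d)[1:k + 1]             # k nearest, skipping itself
        j = rng.choice(neighbors)
        gap = rng.random()                             # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

minority = np.random.default_rng(1).normal(size=(20, 4))  # toy minority class
new_pts = smote_sketch(minority, n_new=30)
print(new_pts.shape)  # (30, 4)
```

Because each synthetic point lies on a segment between two real minority samples, the oversampled class stays within the observed feature region rather than duplicating existing rows.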

Overfitting occurs when a model becomes overly complex, capturing noise in the training data instead of learning the underlying patterns, leading to poor predictions when applied to new data [63]. To mitigate this phenomenon, a withheld testing data set is required to assess the degree of overfitting and the performance of the model [64]. The authors recommend that both the training and withheld-test scores always be reported to provide a comprehensive understanding of a model’s performance. A clear illustration of overfitting can be seen in Table 3 for the MLP, KNN, and XGB models, which achieved training F1 scores of 1.00, 0.72, and 1.00, respectively, contrasting sharply with their testing scores of 0.39, 0.32, and 0.48. These scores show that these three models do not generalize to new, similar data points. A likely explanation for their performance is hyperparameter sensitivity: MLP has a large number of hyperparameters and possibilities for neuron activation; XGBoost likewise offers many adjustable settings; and KNN relies on an accurate selection of k neighbors to make a reliable classification [65]. Each model merits a robust and thorough search space to cover sufficient model possibilities and an analysis of different hyperparameter tuning packages.
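The train-versus-test gap just described is easy to reproduce on synthetic data. The sketch below (an illustration, not the study's models) lets an unconstrained decision tree memorize a noisy training set, reaching a perfect training F1 while its withheld-test F1 drops sharply.

```python
# Toy reproduction of the overfitting pattern in Table 3: perfect training
# score, much lower withheld-test score.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=233, n_features=20, n_informative=4,
                           weights=[0.77], flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

# No depth limit: the tree can memorize every training point, noise included.
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_f1 = f1_score(y_tr, model.predict(X_tr))
test_f1 = f1_score(y_te, model.predict(X_te))
print(f"train F1 = {train_f1:.2f}, test F1 = {test_f1:.2f}")
```

Reporting only the training score would hide the gap entirely, which is why both numbers appear in Table 3.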

Regarding the scoring metrics, our study showed that the support vector machine was the best model for predicting grade 3+ toxicities, obtaining a training F1 score of 0.61, accuracy of 0.82, normMCC of 0.75, precision of 0.63, recall of 0.60, AUC-ROC of 0.87, and AUC-PR of 0.60; on the test data, the same model obtained an F1 score of 0.63, accuracy of 0.80, normMCC of 0.75, precision of 0.56, recall of 0.71, AUC-ROC of 0.78, and AUC-PR of 0.65. In the withheld test data, out of all the patients that had a toxicity (n = 14), 71% were correctly predicted by the model (TP = 10); and out of all the predicted cases, 56% represented a true toxicity event rather than a false positive (FP = 8). Given the high level of clinical uncertainty in whether patients will develop toxicities, this may be viewed as adequate performance; while slightly lower, it remains comparable to similar studies, likely reflecting the increased difficulty of accurately predicting higher-grade (Grade ≥3) toxicities. An important consideration is that precision is as important as recall, since in normal clinical practice it is equally important to avoid false positives as to detect true positive cases. In particular, a toxicity prediction model may suggest that the physician consider lowering the dose to certain OARs to prevent these high-grade radiation-related side effects; an algorithm with good recall but prone to false positives may lead to reducing the dose for a patient not susceptible to developing toxicities. This reduction, in turn, may sacrifice a portion of tumor coverage, potentially decreasing tumor control. For this reason, the F1 score emerges as the optimal metric for evaluating the model’s performance. Future investigations in this area could explore prioritizing either the recall or the precision score, neither of which is replaceable by specificity.
Notably, specificity becomes less valuable in situations marked by an imbalance with a majority of true negatives [66], as the model’s ability to predict negative outcomes can render overly optimistic scores in such scenarios. Once a best-performing model has been identified, multi-institutional clinical trials will be needed to assess its performance in routine clinical practice.
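As a worked check of the confusion counts reported above (14 true toxicity cases, TP = 10, FP = 8), the test precision, recall, and F1 follow directly from their definitions:

```python
# Precision, recall, and F1 recomputed from the reported confusion counts.
TP, FP, FN = 10, 8, 14 - 10          # 14 positives in the test set, 10 caught

precision = TP / (TP + FP)           # 10/18
recall = TP / (TP + FN)              # 10/14
f1 = 2 * TP / (2 * TP + FP + FN)     # equivalent harmonic-mean form of F1
print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
# precision=0.556, recall=0.714, F1=0.625
```

These match the reported 0.56, 0.71, and 0.63 to rounding; note that none of the three depends on the true-negative count, which is what makes them informative under class imbalance.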

The strength of this work lies in several key aspects. First, the study analyzes multiple machine learning models to find the best fit across a variety of common prediction algorithms. Additionally, we divide the data into training and testing sets before employing cross-validation for model training, enhancing the generalizability of the final models and providing more trustworthy measures of out-of-sample performance, despite potential reductions in the values of these metrics. The Stratified Shuffle Split approach guarantees that a positive class will be present in every validation fold of the cross-validation, ensuring meaningful performance estimates in every split. Furthermore, the focus on the F1 score and the reporting of precision as performance metrics is of practical relevance for assessing the clinical performance of the model, especially when predicting toxicities.
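The guarantee provided by stratified splitting can be demonstrated directly. In this toy example (synthetic data with an illustrative ~4:1 class ratio), every `StratifiedShuffleSplit` test fold retains minority-class cases:

```python
# StratifiedShuffleSplit preserves the class ratio in each random split, so a
# positive (toxicity-like) class is guaranteed in every validation fold.
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit

X, y = make_classification(n_samples=233, weights=[0.8], random_state=0)
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

fold_counts = []
for _, test_idx in splitter.split(X, y):
    counts = Counter(y[test_idx])
    fold_counts.append(counts)
    print(dict(counts))   # both classes appear in every fold
```

With a plain (unstratified) shuffle split and a rare positive class, a fold with zero positives is possible, which would make F1 and recall undefined for that fold.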

The primary limitation of this study is the relatively small sample size. Although our dataset of 233 data points is larger than those used in previous studies, only 75% of this data was utilized for training, which may influence the model’s accuracy and introduce uncertainty. Research has shown that smaller datasets are more prone to overfitting and can be influenced by random variability in the data [67]. To address this, several models were evaluated, ranging from simpler methods to more complex machine learning algorithms. The authors plan to conduct multi-institutional studies to address sample size limitations in the future. Additionally, only dosimetric, treatment, and tumor variables were considered in this study, without additional features such as dose maps with spatial information. Regarding data balancing through synthetic oversampling, alternative techniques like threshold tuning could be investigated.

Furthermore, developing methods to address overfitting could be beneficial, particularly for MLP, KNN, and XGBoost. For KNN, alternative classifiers could be explored, such as weighted KNN or Radius-based Nearest Neighbors, or a distance metric other than Euclidean. For MLP, a more tightly defined hyperparameter space will be addressed in future studies; several overfitting solutions have been outlined by Ying [68], such as early stopping and L1 and L2 regularization. For XGBoost, the number of trees and their depth could be further limited, along with other relevant hyperparameters.
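Two of the KNN remedies suggested above are one-argument changes in scikit-learn. The following sketch (synthetic data, illustrative scores only) compares distance weighting and an alternative metric against the plain Euclidean classifier:

```python
# KNN variants from the text: distance-weighted voting and a Manhattan metric,
# each compared to the default uniform-Euclidean classifier via CV F1.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=233, weights=[0.8], random_state=0)

variants = {
    "uniform-euclidean": KNeighborsClassifier(n_neighbors=5),
    "distance-weighted": KNeighborsClassifier(n_neighbors=5, weights="distance"),
    "manhattan-metric": KNeighborsClassifier(n_neighbors=5, metric="manhattan"),
}
results = {}
for name, knn in variants.items():
    results[name] = cross_val_score(knn, X, y, cv=5, scoring="f1").mean()
    print(f"{name}: mean CV F1 = {results[name]:.2f}")
```

Which variant wins is dataset-dependent, which is exactly why such alternatives belong in the hyperparameter search space rather than being fixed a priori.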

Finally, more computationally intensive approaches could be explored in future research. These include performing comprehensive hyperparameter optimization using dedicated libraries such as Optuna [69], evaluating additional ensemble-based methods like AdaBoost, or combining multiple algorithms. Other promising directions include the implementation of Multi-Feature Combined models as described by Yang et al. [70], as well as physics-informed ML [71] approaches that incorporate fundamental brachytherapy dose calculations to provide a physics-based foundation for data-driven model refinements.

Conclusion

Multiple machine learning models were trained and assessed to predict grade 3 or higher toxicity development in patients with gynecologic malignancies who received EBRT and interstitial or T&O brachytherapy treatment, yielding satisfactory results for the top-performing model. This novel approach to toxicity prediction holds the potential to set a new paradigm in standard clinical care and contribute toward personalized care in radiation therapy. New techniques to improve model training need to be explored, and overcoming machine learning limitations like small datasets requires collaborative efforts among peers. In the future, further investigations are needed to prospectively validate these models in other healthcare settings.

Supporting information

S1 Fig. Kaplan-Meier plots for the entire patient cohort.

A) Disease Free Survival and B) Local control.

https://doi.org/10.1371/journal.pone.0312208.s001

(PDF)

S2 Fig. SHAP analysis for Support Vector Machine.

https://doi.org/10.1371/journal.pone.0312208.s002

(PDF)

S3 Fig. Variable Importance plot for Random Forest.

https://doi.org/10.1371/journal.pone.0312208.s003

(PDF)

S4 Fig. SHAP analysis for Logistic Regression.

https://doi.org/10.1371/journal.pone.0312208.s004

(PDF)

S1 Table. Hyperparameter Search Space and MLP architecture.

https://doi.org/10.1371/journal.pone.0312208.s005

(PDF)

S2 Table. Summary of input features and output from models.

The variable type and number of missing data points for each input is shown.

https://doi.org/10.1371/journal.pone.0312208.s006

(PDF)

References

  1. Costa M, Lai C. Coordinated efforts to harmonise gynaecological cancer care. Lancet Oncol. 2022;23(8):971–2. pmid:35901816
  2. Siegel RL, Giaquinto AN, Jemal A. Cancer statistics, 2024. CA Cancer J Clin. 2024;74(1):12–49.
  3. Portelance L, Corradini S, Erickson B, Lalondrelle S, Padgett K, van der Leij F, et al. Online Magnetic Resonance-Guided Radiotherapy (oMRgRT) for Gynecological Cancers. Front Oncol. 2021;11:628131. pmid:34513656
  4. Robin TP, Amini A, Schefter TE, Behbakht K, Fisher CM. Disparities in standard of care treatment and associated survival decrement in patients with locally advanced cervical cancer. Gynecol Oncol. 2016;143(2):319–25. pmid:27640961
  5. Colson-Fearon D, Han K, Roumeliotis MB, Viswanathan AN. Updated Trends in Cervical Cancer Brachytherapy Utilization and Disparities in the United States From 2004 to 2020. Int J Radiat Oncol Biol Phys. 2024;119(1):154–62. pmid:38040060
  6. Pötter R, Tanderup K, Schmid MP, Jürgenliemk-Schulz I, Haie-Meder C, Fokdal LU, et al. MRI-guided adaptive brachytherapy in locally advanced cervical cancer (EMBRACE-I): a multicentre prospective cohort study. Lancet Oncol. 2021;22(4):538–47. pmid:33794207
  7. Vittrup AS, Kirchheiner K, Pötter R, Fokdal LU, Jensen NBK, Spampinato S, et al. Overall Severe Morbidity After Chemo-Radiation Therapy and Magnetic Resonance Imaging-Guided Adaptive Brachytherapy in Locally Advanced Cervical Cancer: Results From the EMBRACE-I Study. Int J Radiat Oncol Biol Phys. 2023;116(4):807–24. pmid:36641039
  8. Wiens J, Shenoy ES. Machine Learning for Healthcare: On the Verge of a Major Shift in Healthcare Epidemiology. Clin Infect Dis. 2018;66(1):149–53. pmid:29020316
  9. Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015;349(6245):255–60. pmid:26185243
  10. Shin S, Austin PC, Ross HJ, Abdel-Qadir H, Freitas C, Tomlinson G, et al. Machine learning vs. conventional statistical models for predicting heart failure readmission and mortality. ESC Heart Fail. 2021;8(1):106–15. pmid:33205591
  11. Singal AG, Mukherjee A, Elmunzer BJ, Higgins PDR, Lok AS, Zhu J, et al. Machine learning algorithms outperform conventional regression models in predicting development of hepatocellular carcinoma. Am J Gastroenterol. 2013;108(11):1723–30. pmid:24169273
  12. Walsh CG, Ribeiro JD, Franklin JC. Predicting Risk of Suicide Attempts Over Time Through Machine Learning. Clinical Psychological Science. 2017;5(3):457–69.
  13. Stenhouse K, Roumeliotis M, Ciunkiewicz P, Banerjee R, Yanushkevich S, McGeachy P. Development of a Machine Learning Model for Optimal Applicator Selection in High-Dose-Rate Cervical Brachytherapy. Front Oncol. 2021;11:611437. pmid:33747926
  14. Shi J, Chen J, He G, Peng Q. Artificial intelligence in high-dose-rate brachytherapy treatment planning for cervical cancer: a review. Front Oncol. 2025;15:1507592. pmid:39931087
  15. Breiman L. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Statist Sci. 2001;16(3).
  16. Rajula HSR, Verlato G, Manchia M, Antonucci N, Fanos V. Comparison of Conventional Statistical Methods with Machine Learning in Medicine: Diagnosis, Drug Development, and Treatment. Medicina (Kaunas). 2020;56(9):455. pmid:32911665
  17. Ley C, Martin RK, Pareek A, Groll A, Seil R, Tischer T. Machine learning and conventional statistics: making sense of the differences. Knee Surg Sports Traumatol Arthrosc. 2022;30(3):753–7. pmid:35106604
  18. Rainio O, Teuho J, Klén R. Evaluation metrics and statistical tests for machine learning. Sci Rep. 2024;14(1):6086. pmid:38480847
  19. Saravanan R, Sujatha P. A state of art techniques on machine learning algorithms: a perspective of supervised learning approaches in data classification. 2018 Second International Conference on Intelligent Computing and Control Systems (ICICCS). 2018:945–9.
  20. National Cancer Institute. Common Terminology Criteria for Adverse Events (CTCAE). Accessed April 16, 2024. Available at: https://ctep.cancer.gov/protocolDevelopment/electronic_applications/ctc.htm
  21. Chino J, Annunziata CM, Beriwal S, Bradfield L, Erickson BA, Fields EC, et al. Radiation Therapy for Cervical Cancer: Executive Summary of an ASTRO Clinical Practice Guideline. Pract Radiat Oncol. 2020;10(4):220–34. pmid:32473857
  22. Portocarrero-Bonifaz A, Syed S, Kassel M, McKenzie GW, Shah VM, Forry BM, et al. Dosimetric and toxicity comparison between Syed-Neblett and Fletcher-Suit-Delclos Tandem and Ovoid applicators in high dose rate cervix cancer brachytherapy. Brachytherapy. 2024;23(4):397–406. pmid:38643046
  23. International Commission on Radiation Units and Measurements. J ICRU. 2013;13(1–2):NP. pmid:27335497
  24. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2024. Accessed November 25, 2024. Available at: https://www.R-project.org/
  25. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
  26. García-Laencina PJ, Sancho-Gómez J-L, Figueiras-Vidal AR. Pattern classification with missing data: a review. Neural Comput & Applic. 2009;19(2):263–82.
  27. Beretta L, Santaniello A. Nearest neighbor imputation algorithms: a critical evaluation. BMC Med Inform Decis Mak. 2016;16 Suppl 3(Suppl 3):74. pmid:27454392
  28. Zhang H, Zheng G, Xu J, Yao X. Research on the Construction and Realization of Data Pipeline in Machine Learning Regression Prediction. Mathematical Problems in Engineering. 2022;2022:1–5.
  29. Nguyen HM, Cooper EW, Kamei K. Borderline over-sampling for imbalanced data classification. IJKESDP. 2011;3(1):4.
  30. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: Synthetic Minority Over-sampling Technique. JAIR. 2002;16:321–57.
  31. He H, Bai Y, Garcia EA, et al. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). 2008:1322–8.
  32. Vabalas A, Gowen E, Poliakoff E, Casson AJ. Machine learning algorithm validation with a limited sample size. PLoS One. 2019;14(11):e0224365. pmid:31697686
  33. Vujovic ŽÐ. Classification Model Evaluation Metrics. IJACSA. 2021;12(6).
  34. Chicco D, Jurman G. The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification. BioData Min. 2023;16(1):4.
  35. Sofaer HR, Hoeting JA, Jarnevich CS. The area under the precision-recall curve as a performance metric for rare binary events. Methods Ecol Evol. 2019;10(4):565–77.
  36. Hajian-Tilaki K. Receiver Operating Characteristic (ROC) Curve Analysis for Medical Diagnostic Test Evaluation. Caspian J Intern Med. 2013;4(2):627–35. pmid:24009950
  37. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432. pmid:25738806
  38. Raschka S. Model evaluation, model selection, and algorithm selection in machine learning. arXiv [Preprint]. 2018 [cited 2024 Aug 11]. Available from: https://arxiv.org/abs/1811.12808
  39. Jia W, Sun M, Lian J, Hou S. Feature dimensionality reduction: a review. Complex Intell Syst. 2022;8(3):2663–93.
  40. Zekić-Sušac M, Pfeifer S, Šarlija N. A Comparison of Machine Learning Methods in a High-Dimensional Classification Problem. Business Systems Research Journal. 2014;5(3):82–96.
  41. Soufan O, Kleftogiannis D, Kalnis P, Bajic VB. DWFS: a wrapper feature selection tool based on a parallel genetic algorithm. PLoS One. 2015;10(2):e0117988. pmid:25719748
  42. Li J, Cheng K, Wang S, Morstatter F, Trevino RP, Tang J, et al. Feature selection: A data perspective. ACM Comput Surv. 2017;50(6):1–45.
  43. Chandrashekar G, Sahin F. A survey on feature selection methods. Computers & Electrical Engineering. 2014;40(1):16–28.
  44. Molina LC, Belanche L, Nebot A. Feature selection algorithms: a survey and experimental evaluation. In: 2002 IEEE International Conference on Data Mining, 2002. Proceedings. 2002:306–13.
  45. Prusty S, Patnaik S, Dash SK. SKCV: Stratified K-fold cross-validation on ML classifiers for predicting cervical cancer. Front Nanotechnol. 2022;4.
  46. Xu Y, Goodacre R. On Splitting Training and Validation Set: A Comparative Study of Cross-Validation, Bootstrap and Systematic Sampling for Estimating the Generalization Performance of Supervised Learning. J Anal Test. 2018;2(3):249–62. pmid:30842888
  47. Lundberg S, Lee S. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30:4765–74.
  48. Luo Y, Chen S, Valdes G. Machine learning for radiation outcome modeling and prediction. Med Phys. 2020;47(5):e178–84. pmid:32418338
  49. Lee LJ, Damato AL, Viswanathan AN. Clinical outcomes of high-dose-rate interstitial gynecologic brachytherapy using real-time CT guidance. Brachytherapy. 2013;12(4):303–10. pmid:23491023
  50. Mesko S, Swamy U, Park S-J, Borja L, Wang J, Demanes DJ, et al. Early clinical outcomes of ultrasound-guided CT-planned high-dose-rate interstitial brachytherapy for primary locally advanced cervical cancer. Brachytherapy. 2015;14(5):626–32. pmid:26024784
  51. Mazeron R, Fokdal LU, Kirchheiner K, Georg P, Jastaniyah N, Šegedin B, et al. Dose-volume effect relationships for late rectal morbidity in patients treated with chemoradiation and MRI-guided adaptive brachytherapy for locally advanced cervical cancer: Results from the prospective multicenter EMBRACE study. Radiother Oncol. 2016;120(3):412–9. pmid:27396811
  52. Li R, Shinde A, Liu A, Glaser S, Lyou Y, Yuh B, et al. Machine Learning-Based Interpretation and Visualization of Nonlinear Interactions in Prostate Cancer Survival. JCO Clin Cancer Inform. 2020;4:637–46. pmid:32673068
  53. Muhammad LJ, Algehyne EA, Usman SS, Ahmad A, Chakraborty C, Mohammed IA. Supervised Machine Learning Models for Prediction of COVID-19 Infection using Epidemiology Dataset. SN Comput Sci. 2021;2(1):11. pmid:33263111
  54. Dai W, Brisimi TS, Adams WG, Mela T, Saligrama V, Paschalidis IC. Prediction of hospitalization due to heart diseases by supervised learning methods. Int J Med Inform. 2015;84(3):189–97. pmid:25497295
  55. Noorbakhsh-Sabet N, Zand R, Zhang Y, Abedi V. Artificial Intelligence Transforms the Future of Health Care. Am J Med. 2019;132(7):795–801. pmid:30710543
  56. Isaksson LJ, Pepa M, Zaffaroni M, Marvaso G, Alterio D, Volpe S, et al. Machine Learning-Based Models for Prediction of Toxicity Outcomes in Radiotherapy. Front Oncol. 2020;10:790. pmid:32582539
  57. Tan D, Mohd Nasir NF, Abdul Manan H, Yahya N. Prediction of toxicity outcomes following radiotherapy using deep learning-based models: A systematic review. Cancer Radiother. 2023;27(5):398–406. pmid:37482464
  58. Tian Z, Yen A, Zhou Z, Shen C, Albuquerque K, Hrycushko B. A machine-learning-based prediction model of fistula formation after interstitial brachytherapy for locally advanced gynecological malignancies. Brachytherapy. 2019;18(4):530–8. pmid:31103434
  59. Chen J, Chen H, Zhong Z, Wang Z, Hrycushko B, Zhou L, et al. Investigating rectal toxicity associated dosimetric features with deformable accumulated rectal surface dose maps for cervical cancer radiotherapy. Radiat Oncol. 2018;13(1):125. pmid:29980214
  60. Zhen X, Chen J, Zhong Z, Hrycushko B, Zhou L, Jiang S, et al. Deep convolutional neural network with transfer learning for rectum toxicity prediction in cervical cancer radiotherapy: a feasibility study. Phys Med Biol. 2017;62(21):8246–63. pmid:28914611
  61. Lucia F, Bourbonne V, Visvikis D, Miranda O, Gujral DM, Gouders D, et al. Radiomics Analysis of 3D Dose Distributions to Predict Toxicity of Radiotherapy for Cervical Cancer. J Pers Med. 2021;11(5):398. pmid:34064918
  62. Cheon W, Han M, Jeong S, Oh ES, Lee SU, Lee SB, et al. Feature Importance Analysis of a Deep Learning Model for Predicting Late Bladder Toxicity Occurrence in Uterine Cervical Cancer Patients. Cancers (Basel). 2023;15(13):3463. pmid:37444573
  63. Peng Y, Nagata MH. An empirical overview of nonlinearity and overfitting in machine learning using COVID-19 data. Chaos Solitons Fractals. 2020;139:110055. pmid:32834608
  64. El Naqa I, Boone JM, Benedict SH, Goodsitt MM, Chan H-P, Drukker K, et al. AI in medical physics: guidelines for publication. Med Phys. 2021;48(9):4711–4. pmid:34545957
  65. Zhang Z. Introduction to machine learning: k-nearest neighbors. Ann Transl Med. 2016;4(11):218. pmid:27386492
  66. Ali MM, Paul BK, Ahmed K, Bui FM, Quinn JMW, Moni MA. Heart disease prediction using supervised machine learning algorithms: Performance analysis and comparison. Comput Biol Med. 2021;136:104672. pmid:34315030
  67. Rajput D, Wang W-J, Chen C-C. Evaluation of a decided sample size in machine learning applications. BMC Bioinformatics. 2023;24(1):48. pmid:36788550
  68. Ying X. An Overview of Overfitting and its Solutions. J Phys: Conf Ser. 2019;1168:022022.
  69. Akiba T, Sano S, Yanase T, et al. Optuna: A next-generation hyperparameter optimization framework. KDD ’19: 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019:2623–31.
  70. Yang Z, Wang C, Wang Y, Lafata KJ, Zhang H, Ackerson BG, et al. Development of a multi-feature-combined model: proof-of-concept with application to local failure prediction of post-SBRT or surgery early-stage NSCLC patients. Front Oncol. 2023;13:1185771. pmid:37781201
  71. Karniadakis GE, Kevrekidis IG, Lu L, Perdikaris P, Wang S, Yang L. Physics-informed machine learning. Nat Rev Phys. 2021;3(6):422–40.