Predicting preterm birth using explainable machine learning in a prospective cohort of nulliparous and multiparous pregnant women

Preterm birth (PTB) presents a complex challenge in pregnancy, often leading to significant perinatal and long-term morbidities. While machine learning (ML) algorithms have shown promise in PTB prediction, the lack of interpretability in existing models hinders their clinical utility. This study aimed to predict PTB in a pregnant population using ML models, identify the key risk factors associated with PTB through the SHapley Additive exPlanations (SHAP) algorithm, and provide comprehensive explanations for these predictions to assist clinicians in providing appropriate care. This study analyzed a dataset of 3509 pregnant women in the United Arab Emirates and selected 35 risk factors associated with PTB based on the existing medical and artificial intelligence literature. Six ML algorithms were tested, of which the XGBoost model exhibited the best performance, with an area under the receiver operating characteristic curve of 0.735 and 0.723 for parous and nulliparous women, respectively. The SHAP feature attribution framework was employed to identify the most significant risk factors linked to PTB. Additionally, individual patient analysis was performed using the SHAP and the local interpretable model-agnostic explanations (LIME) algorithms. The overall incidence of PTB was 11.23% (11% and 12.1% in parous and nulliparous women, respectively). The main risk factors associated with PTB in parous women were previous PTB, previous cesarean section, preeclampsia during pregnancy, and maternal age. In nulliparous women, body mass index at delivery, maternal age, and the presence of amniotic infection were the most relevant risk factors. The trained ML prediction model developed in this study holds promise as a valuable screening tool for predicting PTB within this specific population. Furthermore, SHAP and LIME analyses can assist clinicians in understanding the individualized impact of each risk factor on their patients and provide appropriate care to reduce morbidity and mortality related to PTB.


Introduction
The World Health Organization reports that, every year, approximately one in ten babies is born prematurely, before completing 37 weeks of gestation [1]. Preterm birth (PTB) complications rank as a leading cause of death among children under the age of five, with an estimated 75% of the one million PTB-related deaths being preventable [1]. The incidence of PTB varies across 184 countries, ranging from 5 to 18% [2]. In 2019, the United States (US) reported a PTB prevalence of roughly 10.2%, while the United Arab Emirates (UAE) estimated a prevalence of around 6.3% in the same year, considering both Emirati and expatriate populations [3]. PTB is a complex condition with multifactorial causes [1]. Among the widely studied risk factors are maternal demographics and characteristics (such as advanced maternal age and adolescent pregnancy), social determinants (including smoking and substance use), economic factors, medical complications, obstetric history (such as a short interpregnancy interval and previous PTB), and conditions specific to the current pregnancy [1][2][3]. However, despite this extensive list of risk factors, predicting the occurrence of PTB is challenging because the signs and symptoms of preterm labor are common and can be nonspecific. Thus, the current assessment methods for predicting individual PTB risk are problematic, particularly for nulliparous women with no obstetric history. Traditionally, statistical models have relied on single factors, such as demographic history, obstetric history, and clinical characteristics.
Machine learning (ML)-based models have successfully predicted the risks of numerous medical conditions [4]. Several studies have employed ML models to predict PTB. However, despite their mathematical sophistication, these models are "black boxes," lacking both interpretability and explanation. Therefore, for clinical utility, the predictions made by ML models must be interpretable by clinicians, enabling them to assess PTB risk for each patient and understand the contribution of individual risk factors to these predictions [5,6]. Explainable ML methods, such as SHapley Additive exPlanations (SHAP) and local interpretable model-agnostic explanations (LIME), can be used to achieve this goal. Once clinicians understand the ML model results, they can gain confidence in the ML model prediction of impending PTB risk [7][8][9][10][11]. This informed clinical risk stratification can substantially improve the health outcomes of preterm infants and their mothers. Consequently, this study aimed to predict PTB in nulliparous and parous women using ML models, identify important predictors associated with PTB, and provide explanations for the contribution of each risk factor to PTB prediction using SHAP and LIME.

Data and population
The dataset utilized in this analysis was obtained from an ongoing prospective maternal and child cohort study, the Mutaba'ah Study, conducted in Al Ain, UAE [12]. Eligible participants included all pregnant women aged 18 years and above from the Emirati population residing in Al Ain City, who provided informed consent for themselves and their newborns. This study received approval from the Abu Dhabi Health Research and Technology Ethics Committee (DOH/CVDC/2022/72) and was conducted in strict accordance with the Declaration of Helsinki. Prior to data collection, written informed consent was obtained from all participants. This analysis encompassed 3509 women with singleton pregnancies recruited between May 2017 and February 2021.
As certain risk factors did not apply to nulliparous women, such as previous PTB, previous cesarean section (CS), and parity, the sample was separated into nulliparous mothers (n = 801) and parous mothers (n = 2708).
Participants were followed up during pregnancy, with data collection conducted through self-administered questionnaires and medical records. PTB was categorized based on gestational age at birth, provided in weeks. Any birth occurring before 37 weeks of gestation was classified as PTB. Clinicians performed a scoping review of the literature on PTB risk factors. Simultaneously, a review of the features employed in existing PTB prediction studies utilizing ML models was carried out, as shown in S1 Table. A set of risk factors was derived from the common factors present in both current ML-based studies and medical literature. This combined list was then filtered using data from the electronic medical records (applying ICD-10 coding) and relevant information from the questionnaire (S2 Table). Notably, 35 relevant features were identified in this study (S2

Study population characteristics
Descriptive statistics were employed to illustrate and compare the distribution of the characteristics of the study population based on PTB status. Continuous variables were represented by means and standard deviations, discrete quantitative variables by medians and ranges, and categorical variables by counts and percentages. Student's t-test was used to determine the differences between group means for continuous variables, while categorical variables were compared using Pearson's chi-square test or Fisher's exact test. Statistical analyses were performed using Stata 16.1 (Stata Corp, College Station, TX, USA). A p-value less than or equal to 0.05 was considered statistically significant.
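The categorical comparison described above can be illustrated with a minimal sketch of Pearson's chi-square test on a 2x2 table. The study ran its analyses in Stata; the Python function below, its name `pearson_chi_square_2x2`, and the counts are hypothetical and serve only to show the calculation (no continuity correction, one degree of freedom).

```python
from math import erfc, sqrt


def pearson_chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 table.

    Uses 1 degree of freedom and no continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for r, row in enumerate(table):
        for k, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[r] * col_totals[k] / n
            statistic += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts (NOT from the study):
# rows = exposed / unexposed, columns = PTB / no PTB.
counts = [[10, 20], [30, 40]]
statistic, p_value = pearson_chi_square_2x2(counts)
significant = p_value <= 0.05  # significance threshold stated in the text
```

With these illustrative counts the difference would not reach the 0.05 threshold. For tables with small expected counts, Fisher's exact test, which the text also mentions, would be the usual alternative.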

Machine learning models
We used a combination of domain knowledge and empirical evaluations to select the models. Specifically, we chose models that have been widely employed in the literature with demonstrated good performance in similar studies (Table 1).
Literature review. In this study, the performances of six ML classifiers were evaluated to select the most accurate method for predicting PTB in the population. These ML models include the support vector machine (SVM) [32], random forest (RF) [33], logistic regression (LR) [34,35], multilayer perceptron (MLP) [36], gradient boosting machine (GBM) [37], and XGBoost [38]. While XGBoost outputs variable importance, it does not measure the direction and level of impact of the variables on the outcomes.
SHapley Additive exPlanations (SHAP). SHAP values were introduced to better explain the contribution of features or risk factors to the outcome, specifically PTB [39][40][41]. Shapley values are an attribution method that fairly distributes a prediction among the individual features. SHAP is a computational method for calculating Shapley values, and it also provides global interpretation methods based on combinations of Shapley values across the dataset. A higher SHAP value indicates that a feature increases the likelihood of PTB, while a lower SHAP value suggests that a feature reduces the outcome likelihood. Thus, the SHAP method can rank the importance of features and reveal the relationship between these features and the outcome. Further details regarding SHAP [5,[19][20][21][22] are provided in the S1 File. We obtained a list of the ten most important risk factors and their SHAP values (refer to the Results section).
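To make the attribution idea concrete, the following is a minimal sketch of exact Shapley values computed by enumerating feature coalitions. It is not the SHAP library and not the study's XGBoost model: the function `shapley_values`, the toy scoring function `toy_risk`, its coefficients, and the baseline patient are all hypothetical, and absent features are simply replaced by baseline values, which is one illustrative convention among several.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a baseline input.

    Features outside a coalition take their baseline value
    (an illustrative assumption, not necessarily the study's choice).
    """
    n = len(x)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                without_i = [x[j] if j in coalition else baseline[j]
                             for j in range(n)]
                with_i = list(without_i)
                with_i[i] = x[i]
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (predict(with_i) - predict(without_i))
    return phi


def toy_risk(features):
    """Hypothetical risk score (NOT the study's model)."""
    previous_ptb, preeclampsia, bmi = features
    return (0.05 + 0.30 * previous_ptb + 0.20 * preeclampsia
            - 0.004 * (bmi - 25) + 0.10 * previous_ptb * preeclampsia)


patient = [1, 1, 20.0]    # previous PTB, preeclampsia, BMI 20 (hypothetical)
reference = [0, 0, 25.0]  # hypothetical baseline patient
phi = shapley_values(toy_risk, patient, reference)
```

In this toy case the contributions sum to the difference between the patient's score and the baseline score, and the interaction term is shared equally between the two interacting features. Exact enumeration grows exponentially with the number of features, which is why practical tools rely on model-specific or approximate algorithms.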
Local interpretable model-agnostic explanations (LIME). LIME offers explanations for predictions by replacing a complex model with a locally interpretable surrogate model [42]. Therefore, we performed a risk factor-based analysis of individual patients using LIME. Further details regarding LIME are included in the S1 File.
Sensitivity and specificity of LR were 0.50 and 0.64 in the first trimester, respectively, whereas they were 0.29 and 0.84 for RF, respectively. Similarly, in the second trimester, the sensitivity and specificity of RF were 0.45 and 0.94, respectively, and those of ANN were 0.62 and 0.84, respectively.
Belaghi et al. [20] PTB prediction AUC of LR in the first trimester were 0.68 and 0.73 for nulliparous and multiparous women, respectively, whereas in the second trimester, they were 0.72 and 0.78, respectively.
Lee et al. [22] PTB prediction LR, ANN, and RF. AUC within the range of 0.52-0.58 was achieved for a highly imbalanced dataset.
Cately et al. [24] PTB classification Data resampling and ANN. AUC of 0.71 was achieved with sensitivity of 0.33.

Experimental settings
The experiments were conducted as follows. First, the missing values in the dataset were imputed using the MissForest algorithm [43]. MissForest is built on the RF algorithm: it predicts the missing values of each variable from the other variables in the dataset, fitting an RF model and refining the predictions over successive iterations. This approach allows the algorithm to capture complex relationships within the data. Subsequently, tenfold cross-validation was conducted on the dataset to evaluate the average performance. We divided the data into two sets: a training set containing 80% of the data and a test set containing the remaining 20%. This partition ensured that the training and test sets included the same proportion of preterm samples. Six commonly used ML classifiers were compared to identify the best classifier for the task. The area under the receiver operating characteristic (ROC) curve (AUC) was used as the evaluation criterion. The AUC represents the ability of a classifier to distinguish between positive and negative instances; a higher AUC indicates better classification performance, and a ROC curve with good predictive performance has an AUC close to 1. Thereafter, the global behavior of the best ML model was explained using SHAP to identify the risk factors. A higher SHAP value signifies that the feature increases the likelihood of PTB, while a lower SHAP value suggests that the feature reduces this likelihood. Finally, individual patient analyses were conducted using SHAP to identify the underlying risk factors associated with each patient. For comparison, LIME was also used for individual patient analyses. All experiments were conducted using Python 3.8 on a personal computer with an Intel(R) Core i9-9900 CPU @ 3.10 GHz and 8 GB RAM.
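Two steps of this pipeline, the class-preserving 80/20 split and the AUC criterion, can be sketched in plain Python. This is an illustration under stated assumptions rather than the study's code: the helper names `stratified_split` and `roc_auc`, the random seed, and the toy labels and scores are hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split indices so both sets keep roughly the same class proportions."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def roc_auc(labels, scores):
    """AUC as the probability that a random positive case is scored
    higher than a random negative case (ties count as one half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical cohort: 10 preterm births (label 1) among 100 deliveries.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)

# Hypothetical predicted risks for four patients.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(toy_labels, toy_scores)
```

The pairwise definition used here is equivalent to the area under the ROC curve and makes the interpretation explicit: an AUC of 0.5 corresponds to random ranking, and values near 1 mean preterm cases are almost always ranked above term cases.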

Study population characteristics
The distribution of risk factors and descriptive characteristics of the parous (n = 2708) and nulliparous (n = 801) mothers are presented in Table 2 and S1 Table, respectively. The overall incidence of PTB in this study was 11.23% (11% in parous women and 12.1% in nulliparous women).
In parous mothers (Table 2), low levels of education, exposure to passive smoking, and history of infertility treatment were more frequent among women with PTB than among those without PTB. Parous women with PTB differed substantially from those without PTB in terms of maternal age, gravidity, preexisting hypertension, preexisting diabetes mellitus, previous PTB, previous cesarean delivery, and previous pregnancy loss. During pregnancy, a considerable proportion of parous mothers with PTB had conditions such as preeclampsia, antepartum hemorrhage, oligohydramnios, infection of the amniotic sac, placenta previa and placental disorders, group B Streptococcus carrier status, or genitourinary infection. The most significant characteristics for women with PTB were higher maternal age, higher gravidity, and a low level of education. Exposure to passive smoking, history of infertility treatment, preexisting hypertension, preexisting diabetes mellitus, and history of PTB were more frequent in women with PTB (Table 2).
Among nulliparous mothers (Table 3), significant differences were observed in the self-reported planning status of pregnancy, physical activity before pregnancy, and history of infertility treatment between those with and without PTB. For pregnancy and delivery characteristics, preexisting diabetes mellitus, preeclampsia, antepartum hemorrhage, oligohydramnios, infection of the amniotic sac, premature rupture of membranes, and placental abruption were considerably more common in nulliparous mothers with PTB than in those without PTB.

Performance and interpretation of the ML model
Parous women. For parous women, the LR, SVM, MLP, GBM, RF, and XGBoost models achieved an AUC of 0.720, 0.521, 0.706, 0.720, 0.726, and 0.735, respectively (Fig 2A). As XGBoost exhibited the highest AUC, signifying the best predictive performance among these models, we employed it for explainable analysis using SHAP and LIME.
The SHAP plot based on the weights of the risk factors is depicted in Fig 2B. This figure shows that the most important risk factor in the parous population was a history of PTB, followed by a history of CS and preeclampsia. Other important risk factors include maternal age, placenta previa, BMI at delivery, and interpregnancy interval.
Nulliparous women. We evaluated the performance of six ML classifiers in predicting PTB in nulliparous women. The XGBoost algorithm achieved the highest performance, with an AUC of 0.723 (Fig 4A). Important risk factors for PTB were identified, including maternal BMI at delivery, maternal age, amniotic sac infection, preeclampsia, and history of infertility treatment (Fig 4B and 4C). Other significant risk factors included oligohydramnios, physical activity before pregnancy, and preexisting diabetes.
We further analyzed the individual risk factors associated with a set of nulliparous mothers (

Principal findings
This study aimed to use ML to develop a prediction model for PTB and identify the key predictors associated with PTB in pregnant women. We used the identified PTB predictors and risk-stratified each patient to generate a risk score using SHAP values for parous and nulliparous mothers. Out of the six most commonly used ML models for PTB prediction, XGBoost exhibited the best performance. The top five most important risk factors in parous women were previous PTB, previous cesarean section, diagnosis of preeclampsia, maternal age, and placenta previa. Among the nulliparous women, the important risk factors were BMI at delivery, maternal age, amniotic fluid infection, premature rupture of membranes, and preeclampsia.

Results in the context of what is known
PTB rates vary by region and country. The overall PTB rate in this study was 11.23%. North America recorded a rate of 11.2%, whereas the highest rates were recorded in North Africa and Sub-Saharan Africa, at 13.4% and 12%, respectively [44]. The US ranked among the top ten countries for PTB
The strength of AI and ML is their ability to learn from new inputs and leverage those insights to enhance health outcomes and patient experiences. Our XGBoost model achieved an AUC of 0.735 and identified risk factors that are plausible and well known to clinicians, such as previous cesarean section, previous PTB, diagnosis of preeclampsia, maternal age, placenta previa, and BMI [45]. Although Lee et al. [46] achieved an AUC of 0.54-0.83, the extensive list of predictors they described was difficult to explain and unrelated to established predictors of PTB (e.g., "upper gastrointestinal tract symptom, gastroesophageal reflux disease, Helicobacter pylori"; all entities describing the same symptom), particularly for practicing clinicians. Although Sun et al. [16] achieved a maximum AUC of 0.885, the risk factors they identified for their prediction model included age, magnesium, fundal height, serum inorganic phosphorus, mean platelet volume, waist size, total cholesterol, triglycerides, globulins, and total bilirubin. Risk factors such as fundal height and waist size are measured by healthcare professionals. Because these measurements are skill-dependent, they may be inaccurate, resulting in misleading outcomes, particularly in obese patients [47]. A study using retrospective medical data reported an AUC of 0.739 [16]. The risk factors included blood pressure, blood glucose, lipids, and uric acid as metabolic predictors of PTB. Uric acid testing is not routine in all pregnant women. Although it is a significant risk factor, it has only been measured for pregnancy outcomes in women with preeclampsia/eclampsia [48]. In a systematic review of traditional prediction models for the risk of spontaneous PTB based on routine clinical parameters, the AUC for these models ranged from 0.54 to 0.67, a range against which our ML predictions are beginning to demonstrate potential [49]. Moreover, unlike LR, the XGBoost algorithm uses a nonparametric assessment; therefore, the correlation of the independent variables has no significant impact on the weighted ranking of each variable. Notably, XGBoost demonstrated highly promising performance in patients with diabetic retinopathy, with an AUC of 0.99 [50]. The difference in the AUC from our results may be explained by the sample size of 32 452 women [50]. This suggests that XGBoost can also be used as a predictive model for other diseases [51]. Finally, our study progressed beyond the black-box approach (Table 1), so that the results can be easily interpreted by clinicians, thereby enhancing the clinical usability of this study.

Clinical implications
ML in healthcare can increase the accuracy of prediction and diagnosis, thus helping clinicians make informed decisions and personalize patient health care. Our study estimated the predictive probabilities of PTB in asymptomatic, nulliparous, and at-risk mothers using real-world data. Nulliparous women often present a dilemma for clinicians, even when traditional prediction models are used. Previous results confirmed that traditional models discriminated poorly for nulliparous women (AUC 0.51-0.56) [49]. We produced individual risk categorizations for PTB and, more importantly, for nulliparous women using SHAP values, where each risk factor was assigned a weight by the algorithm. Once patients are deemed at-risk or high-risk, their treating physicians will then follow them up using serial endovaginal ultrasound measurements of cervical length from 16 to 24 weeks of gestation [52]. Thus, the SHAP method is a step closer to a reliable method for making the output of the XGBoost model clinically interpretable.

Research implications
The set of risk factors used in this study to derive the ML model provides the first step, and other indicators, such as ultrasound parameters, biomarkers, and fetal fibronectin, can be incrementally added to test the performance of the XGBoost method for improved predictability. Our results must be externally validated and verified in other populations, particularly asymptomatic and nulliparous women. The development of a risk calculator from these predictors to stratify risk based on the scores obtained would make an immense contribution to clinicians [50].

Strengths and limitations
In this study, risk factor selection to build the ML model for PTB prediction was based on the medical literature. The major advantage of this study was personalized risk stratification with SHAP values and easy clinical interpretability, fostering individual management recommendations. XGBoost outperformed the five other ML techniques with the highest AUC value (0.735), which was the key guide for evaluating the function of the predictive model. In addition, this algorithm is less time-consuming than other ML algorithms.
This study has several limitations. First, information regarding the etiology of PTB in our population was absent, specifically whether it was indicated or spontaneous. Consequently, we restricted our analysis to parous and nulliparous women.
Second, the results were only internally validated. We must be cautious about generalizability, as the model needs to be tested on other datasets and evaluated in other centers. Although XGBoost has considerable potential, the model performance can be improved by adding more indicators. This study provides only a preliminary explanation of the interpretability of the ML model.

Conclusion
This study highlights the use of a novel technology (XGBoost) for risk stratification among individual pregnant women, particularly asymptomatic, nulliparous, and at-risk patients with
Fig 1 illustrates the entire methodology, from data collection and preprocessing, through the application of different ML algorithms to select the best-performing ML model and the identification of the important risk factors and their relative risk scores for PTB, to the individual patient analyses and clinical recommendations.
A summary graph of the ten most important risk factors is illustrated in Fig 2C. The prediction of a set of patients using SHAP is depicted in Fig 3. The first patient (parous mother), represented in Fig 3(a), was at an exceptionally low risk of PTB delivery (0.0). This low risk is primarily attributed to factors such as the absence of a history of PTB, a maternal age of 28 y, and a long interpregnancy interval (903 d). The data of the second patient presented in Fig 3(b) demonstrated a medium risk

Fig 1. Proposed methodology for predicting PTB using machine learning models. https://doi.org/10.1371/journal.pone.0293925.g001
Fig 5). For example, patient (a) in Fig 5 had a total risk score of 0.00 and was not diagnosed with PTB. Her most protective factors were a maternal age of 22 y, no infection of the amniotic sac, a BMI at delivery of 28.8 kg/m², and no premature rupture of membranes. Patient (b) in Fig 5 had a medium risk of PTB (0.22), which was associated with infection of the amniotic sac or membranes (intrauterine infection), preeclampsia, premature rupture of membranes, and a history of infertility treatment. Finally, nulliparous mother (c) in Fig 5 had a higher risk of PTB delivery (0.85) owing to her low BMI at delivery (19.9 kg/m²), infection of the amniotic sac, premature rupture of membranes, and no physical activity before pregnancy. The SHAP dependence plot of BMI versus PTB for nulliparous women (S2 Fig) shows that low BMI had higher SHAP values, clarifying the difference between patient (a) and patient (c) in Fig 5. We have also provided LIME explanations for these patients in S3 Fig.

Fig 2. A. ROC curve for PTB prediction in parous women (n = 2708). B. SHAP-based feature importance plot for parous women. C. Summary plot for top 10 SHAP-based risk factors in parous women. Each dot in the graph indicates a patient and her relative risk towards PTB prediction. Several patients at the same point create a dense region. The colors indicate the feature values (scale shown vertically on the right side): blue indicates lower values while red indicates higher values of a risk factor. For instance, for previous cesarean section (CS) delivery, as the number of CS deliveries increases, the risk of PTB delivery increases, while patients with fewer (or no) CS deliveries are at a lower risk of PTB. We also observed negative interactions: patients with higher BMI are at a relatively lower risk of PTB, whereas those with lower BMI are at a higher risk.

Fig 3. Individual patient set analysis for parous women using SHAP. The feature values in red indicate the risk factors increasing the chances of PTB, whereas those in blue indicate factors reducing the chances of PTB. The size of the risk factor indicates its degree of influence on that specific patient. Patient (a) is at lower risk, patient (b) at medium risk, and patient (c) at higher risk of PTB. https://doi.org/10.1371/journal.pone.0293925.g003

Table 1. Related research on PTB prediction using machine learning models.
Reference Problem and approach Methods used Performance
An AUC of 0.67 was achieved for CDC data, whereas a maximum AUC of 0.64 was achieved for the NYC dataset via ANN and Light GBM.

Table 2. Descriptive characteristics of the parous pregnant women.
This patient exhibited several risk factors, including oligohydramnios, exposure to passive smoking, a history of previous pregnancy loss, and lower level of education. Notably, there was no history of PTB or CS. Finally, Fig 3(c) illustrates the data of the third parous mother with a higher risk of PTB delivery (0.87) because she had preeclampsia, a history of previous PTB, a history of CS, maternal age > 43 y, and preexisting diabetes. For comparison, the data of the same set of patients were explained using LIME by selecting the ten most important risk factors (S1 Fig).

Table 3. Descriptive characteristics of the nulliparous pregnant women.
Individual patient risk scores were calculated to determine the level of risk. These results were internally validated. The results of both the SHAP and LIME algorithms aligned, except in a few cases, mainly because LIME does not guarantee an accurate distribution of the effects [21,42]. However, because SHAP focuses on local accuracy and consistency, it generates more accurate model outcomes. Both SHAP and LIME guide clinicians toward individual risk categorization. This will prevent overinvestigation and undue economic burdens on health systems.