An imbalance-aware deep neural network for early prediction of preeclampsia

Preeclampsia (PE) is a hypertensive complication affecting 8-10% of US pregnancies annually. While there is no cure for PE, aspirin may reduce complications for those at high risk. PE also disproportionately affects racial minorities, who carry a higher burden of morbidity and mortality. Previous studies have shown that early prediction of PE would allow for prevention. We approached the prediction of PE with a new method based on a cost-sensitive deep neural network (CSDNN) that accounts for the severe imbalance and sparse nature of the data, as well as racial disparities. We validated our models using large extant data sources that represent a diverse cohort of minority populations in the US: the Texas Public Use Data Files (PUDF), the Oklahoma PUDF, and the Magee Obstetric Medical and Infant (MOMI) database. We identified the most influential clinical and demographic features (predictor variables) relevant to PE for both the general populations and the smaller racial groups. We also investigated the effectiveness of multiple network architectures using three hyperparameter optimization algorithms: Bayesian optimization, Hyperband, and random search. Our proposed models equipped with the focal loss function yield superior and reliable prediction performance compared with state-of-the-art techniques, with an average area under the curve (AUC) of 66.3% and 63.5% for the Texas and Oklahoma PUDF respectively, while the CSDNN model with the weighted cross-entropy loss function performs best on the MOMI data with an AUC of 76.5%. Furthermore, our CSDNN model equipped with the focal loss function achieves an AUC of 66.7% for Texas African American and 57.1% for Texas Native American patients.
The best results for the remaining sub-populations are 62.3% AUC with the CSDNN and weighted cross-entropy loss function for Oklahoma African American patients, 58% AUC with the DNN and balanced batches for Oklahoma Native American patients, and 72.4% AUC using either the CSDNN with weighted cross-entropy loss or the CSDNN with focal loss and balanced batches for the MOMI African American dataset. Our results provide the first evidence of the predictive power of clinical databases for PE prediction among minority populations.


Introduction
Preeclampsia (PE) spectrum disorders occur in pregnant women and are generally defined by new-onset hypertension and proteinuria after week 20 of gestation. PE afflicts 8-10% of the approximately 4 million yearly pregnancies in the US [1]. Among women who survive, PE is associated with long-term health effects, such as increased risk of heart disease, stroke, and diabetes [2]. Children of women with PE also have an increased risk of long-term cardiovascular illness [3]. Studies have shown that low-dose aspirin taken early in pregnancy can reduce the occurrence of PE in pregnant women who are at high risk [4]. Stratification of women at the highest risk of PE would allow clinicians to provide primary prevention at the right time.
Several statistical and machine learning (ML) models have been developed for PE prediction. A comprehensive overview of previous studies is given in Table 1. Kenny et al. (2014) [5] and Sandstrom et al. (2019) [6] applied logistic regression to predict PE among nulliparous women. Moreira et al. (2017) [7] and the authors of [8] have successfully used random forest classifiers to predict PE; however, the random forest algorithm and its variable importance measures tend to show bias in the presence of predictor variables with many categories and variables with different scales of measurement [9]. Marić et al. (2020) [10] proposed the use of the elastic net model for PE prediction, but their study focused on a single high-risk referral hospital with a higher occurrence of PE than in the general population.
Previous studies have shown that the mothers' race and ethnicity are significant risk factors for PE [11]. Minority women experience severe morbidity and mortality rates as high as four times those of their Caucasian counterparts during pregnancy and postpartum [12][13][14][15][16], with several studies reporting an increased incidence of PE among African American (AA) and American Indian/Native American (NA) women [17][18][19][20][21][22][23]. AA women have a higher risk of a prolonged length of stay in the hospital and progression to severe forms of PE [24]. Long-term follow-up of patients with PE indicates a high recurrence risk in future pregnancies and a two- to eight-fold risk of cardiovascular disease [25,26], affecting minorities more frequently and with more adverse maternal and neonatal outcomes [27]. Delivery may reduce the risk of adverse outcomes for the mother, but premature delivery presents many complications for the baby, with consequences for the whole family. Unfortunately, minority women often initiate prenatal care later [28], limiting the lead time for physicians to assess each patient's individual risk for PE and negatively affecting care delivery effectiveness [13]. Existing works have rarely addressed the inherent sparsity and large number of categorical variables in the data available in large clinical databases, the presence of noisy and missing data, and the skewed distribution of observations (known as imbalanced data) between preeclamptic and healthy individuals. Therefore, these ML models may not be reliable or interpretable (due to the sparsity and the presence of a large number of categorical variables). The severity of these issues is even higher for AA and NA populations, as confirmed by our study on a number of PE datasets. Modern ML models have been extremely effective in dealing with these issues in predictive modeling applications [29,30].
In particular, imbalance-aware models can handle imbalanced data during the learning process [31][32][33][34][35].
To our knowledge, there are no reliable ML-based decision support approaches specific to PE prediction, particularly for AA and NA populations. No study has investigated the risk factors of PE and developed predictive models for AA and NA populations using advanced ML techniques, particularly deep neural networks (DNNs) [36,37]. DNNs have been useful for large, high-dimensional datasets [38][39][40], and they can readily be extended to imbalanced classification problems [41]. The main contributions of this work are as follows.
1. First, using the chi-square feature selection method, we identify the significant clinical and demographic attributes associated with developing PE among women from AA and NA populations, as well as the general population, using large datasets. No existing work has studied PE risk prediction using ML for minority populations.
2. Second, we construct a new cost-sensitive deep neural network (CSDNN) prediction model capable of identifying women with high suspicion of developing PE and estimating their associated risk (probability) using highly imbalanced and high-dimensional sparse datasets.
In particular, we extend the idea of using focal loss to classify sparse, imbalanced PE datasets; in the literature, focal loss has primarily been utilized for object detection problems [47]. To the best of the authors' knowledge, there is no advanced deep neural network algorithm that takes into consideration the imbalanced and sparse nature of the preeclampsia prediction problem.
3. Lastly, we demonstrate the effectiveness and impact of the proposed scheme through a rich array of datasets representing a diverse cohort of both AA and NA populations: the Texas Public Use Data Files (PUDF), the Oklahoma PUDF, and the Magee Obstetric Medical and Infant (MOMI) databases. Our work is distinguished from previous works that studied small EHR datasets (with a few hundred or fewer patients in their cohorts [7,43,48,49]).
4. Furthermore, to improve the accuracy of our proposed models on the MOMI data, we added a new variable for each patient representing the number of incidents of spikes in blood pressure within the first 14 weeks. This variable was not available for the Texas and Oklahoma datasets; on the MOMI dataset, our work aims to identify PE patients as early as possible with higher accuracy than existing studies by considering these early incidents of blood-pressure spikes.

Artificial neural networks
Artificial Neural Networks (ANNs) originated in the 1940s with the McCulloch-Pitts neuron [50]. The idea of "artificial neurons" is inspired by the human brain, in which a neuron takes "input" in the form of signals from surrounding cells and will only activate, in the form of an electrical spike, if the combined signals pass a threshold level. An artificial neuron mimics this behavior by taking a series of features x, multiplying each by an individually chosen weight w, summing the results, and adding a bias term b before checking whether a predefined threshold is met, which allows for classification. Later versions of ANNs adapted the artificial neuron to represent more complicated functions by linking neurons together into a multilayer perceptron (MLP), or feedforward ANN [51]. The MLP is typically composed of multiple layers, each containing a pre-defined number of neurons, or nodes. These layers can be subdivided into three types: the input layer, which takes each feature x as input; a number of hidden layers (the number of layers here denotes the depth of the network), each of which performs the linear computation described above on its inputs before passing the output to the next layer; and the output layer, which returns the final prediction. Each node in a layer is connected to every node in the next layer, making a fully-connected neural network in which the final prediction is a functional composition of each layer.
These functions each take the form

$z = \mathbf{w}^{T}\mathbf{x} + b,$

where $\mathbf{x}$ is the vector of input features, $\mathbf{w}$ holds the weights of each node in the layer, and $b$ is the accompanying bias term. More modern versions of neural networks add non-linearity through the use of activation functions [52]. The most commonly used activation function is the sigmoid activation, given by

$\sigma(z) = \frac{1}{1 + e^{-z}},$

where $z$ represents the linear output of the node. The downside of this activation function is that it can saturate: if the output is too large or too small, the gradient can become close to 0, which negatively affects the ability of the network to update its parameters. To mitigate this issue, the sigmoid function has been improved upon by using the related hyperbolic tangent function, given by

$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}.$

This function outputs values between -1 and 1 and has a significantly steeper gradient, which makes it easier to train than the sigmoid function. Another common activation function is the rectified linear unit (ReLU) function:

$\mathrm{ReLU}(z) = \max(0, z).$

ReLU tends to converge faster than sigmoid or tanh [58], which makes the learning of a neural network more efficient. Due to their ability to learn complex non-linear functions, neural networks have been used successfully in a variety of machine learning problems, such as image recognition [59,60], machine translation [61,62], speech recognition [63,64], weather forecasting [65], credit scoring [66], and cancer detection [67][68][69].
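As a concrete illustration, the three activation functions above can be sketched in a few lines of NumPy (a minimal sketch for illustration, not the paper's implementation, which relies on Keras built-ins):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1); saturates for large |z|.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Outputs in (-1, 1) with a steeper gradient around 0 than sigmoid.
    return np.tanh(z)

def relu(z):
    # Passes positive inputs through unchanged; zeroes out negatives.
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # values in (0, 1)
print(tanh(z))     # values in (-1, 1)
print(relu(z))     # -> [0. 0. 2.]
```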

Back-propagation and gradient-based learning
ANN models employ specific optimization processes, collectively called the back-propagation stage [70], to obtain the parameters (w*, b*) at each layer. Stochastic gradient descent (SGD) [71], Adaptive Moment Estimation (Adam) [72], NAdam [73], and root mean squared propagation (RMSProp) [74] are the most common optimizers. We use the RMSProp, Adam, and NAdam methods in the back-propagation for each dataset due to their superior computational efficiency [73][74][75]. We select the best optimizer for each dataset using the model selection algorithms explained below.
The parameters (w*, b*) are updated by these optimization algorithms at each epoch. The number of epochs used in the model is another hyperparameter: too few epochs may prevent the model from learning the data, and too many may result in over-fitting, limiting generalization to other datasets [36]. The best number of epochs is identified through the early stopping method, in which the risk of overtraining is reduced by stopping training once the validation error has stabilized or no further improvement is made.
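The early stopping rule described above can be sketched as a simple patience loop (`early_stopping` is a hypothetical helper for illustration; in a Keras workflow the `EarlyStopping` callback plays this role):

```python
def early_stopping(val_errors, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation errors.

    Training halts once the validation error has failed to improve by more
    than `min_delta` for `patience` consecutive epochs; the weights from
    `best_epoch` would then be restored.
    """
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best - min_delta:
            best, best_epoch, wait = err, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop here; best model at best_epoch
    return len(val_errors) - 1, best_epoch

# Validation error stabilizes after epoch 3, so training stops at epoch 6.
stop, best = early_stopping([0.9, 0.7, 0.6, 0.55, 0.56, 0.55, 0.57])
print(stop, best)  # -> 6 3
```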

Cost-sensitive neural networks
Despite the success of neural networks in a variety of applications, their use might be challenging due to the distribution of the given dataset. In classification, many machine learning algorithms, including neural networks, assume that the distribution of classes is roughly the same. When this assumption is violated, the neural network can best reduce the misclassification cost by outputting the majority class in every case. This results in a model with high accuracy, but with no ability to distinguish between classes [76].
Our method removes this assumption using a cost-sensitive learning approach. In this approach, originally proposed by Kukar [77], the cost function is modified such that different costs are associated with the true value of any given sample. Two specific loss functions were employed: weighted cross-entropy and focal loss functions.

Weighted cross-entropy loss
In neural networks, the cross-entropy (CE) loss function is usually used for binary classification problems and is defined by

$\mathrm{CE}(p, y) = -\log(p_t), \qquad p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise,} \end{cases}$

where $y \in \{\pm 1\}$ is the ground-truth class and $p \in [0, 1]$ is the model's estimated probability of the class with label $y = 1$. This basic loss function can be modified by multiplying the cost of each individual sample by a class-specific weight [47], which results in what is referred to as the weighted CE (WCE), or balanced CE, defined by

$\mathrm{WCE}(p_t) = -C_t \log(p_t),$

where $C_t = C_{+}$ for positive samples and $C_t = C_{-}$ for negative samples, with $C_{+} = \frac{N}{2N_{+}}$ and $C_{-} = \frac{N}{2N_{-}}$, and where $N_{+}$ and $N_{-}$ are the sizes of the positive and negative classes respectively. The parameters $C_{+}$ and $C_{-}$ represent the misclassification penalties of samples in the minority (positive) and majority (negative) classes respectively. Accordingly, the error cost function is formulated with Eq 7.
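A minimal NumPy sketch of the weighted CE loss, using the equivalent {0, 1} label convention rather than {±1} (illustrative only; the experiments themselves use Keras loss functions):

```python
import numpy as np

def weighted_cross_entropy(y, p, eps=1e-12):
    """Weighted CE with class weights C+ = N/(2*N+) and C- = N/(2*N-).

    y: ground-truth labels in {0, 1}; p: predicted probability of class 1.
    """
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    n, n_pos = len(y), y.sum()
    n_neg = n - n_pos
    c_pos, c_neg = n / (2 * n_pos), n / (2 * n_neg)
    loss = -(c_pos * y * np.log(p) + c_neg * (1 - y) * np.log(1 - p))
    return loss.mean()

# Misclassifying the single PE case (label 1) is penalized more heavily
# than misclassifying one of the three Non-PE cases.
y = [1, 0, 0, 0]
print(weighted_cross_entropy(y, [0.1, 0.1, 0.1, 0.1]))  # PE case missed: high loss
print(weighted_cross_entropy(y, [0.9, 0.1, 0.1, 0.1]))  # PE case caught: low loss
```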

Focal loss
Focal loss (FL) is an extension of the CE loss for binary imbalanced classification proposed by Lin et al. [47], initially developed for object detection applications. The main idea behind FL is to focus training on hard samples while reducing the loss contribution from well-classified, easy samples by adding a modulating factor to the sigmoid CE loss. Suppose the predicted outputs from the model for the two classes are $\hat{y} = [\hat{y}_1, \hat{y}_2]^{T}$. The sigmoid function calculates the probability for the minority and majority classes as $p_t = \mathrm{sigmoid}(\hat{y}_t) = 1/(1 + \exp(-\hat{y}_t))$, where $p_t$ is provided in Eq 8. The focal loss can be formulated with Eq 9:

$\mathrm{FL}(p_t) = -(1 - p_t)^{\gamma} \log(p_t),$

where $y \in \{\pm 1\}$ is the ground-truth class and $p \in [0, 1]$ is the model's estimated probability for the class with label $y = 1$. The parameter $\gamma \geq 0$ must be tuned. The modulating factor $(1 - p_t)^{\gamma}$ reduces the loss contribution from easy examples: in essence, the more confident a model is in its prediction, the less the sample contributes to the loss. FL is equivalent to CE when $\gamma = 0$, and the effect of the modulating factor increases as $\gamma$ increases [47]. In addition, an α-balanced variant of the original focal loss has been developed to further account for the effective number of samples. The parameter $\alpha_t \in [0, 1]$ is defined with Eq 10 as $\alpha_t = \alpha$ for the positive class and $\alpha_t = 1 - \alpha$ for the negative class.
The α-balanced variant of the focal loss has shown better performance than the non-α-balanced form [47]. It is calculated with Eq 11:

$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t).$
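The α-balanced focal loss can be sketched in NumPy as follows (again using the {0, 1} label convention; an illustrative sketch rather than the exact training implementation):

```python
import numpy as np

def focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """Alpha-balanced focal loss; y in {0, 1}, p = P(class 1)."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), eps, 1 - eps)
    p_t = np.where(y == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    # (1 - p_t)^gamma down-weights easy, well-classified samples.
    return (-alpha_t * (1 - p_t) ** gamma * np.log(p_t)).mean()

# An easy negative (p = 0.05) contributes almost nothing to the loss,
# while a hard negative (p = 0.60) still contributes substantially.
print(focal_loss([0], [0.05], gamma=2.0))
print(focal_loss([0], [0.60], gamma=2.0))
```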
We used the WCE and α-balanced focal loss functions to treat imbalanced classification, and we denote the deep neural networks based on these two loss functions as CSDNN-WCE and CSDNN-FL, respectively. We also used the standard deep neural network with the CE loss function for comparison purposes; for simplicity, we denote it as DNN throughout this paper.

Balanced batch training for imbalanced data
For comparison, we also used the balanced batch generator from the Imbalanced-learn library [78], which is compatible with scikit-learn. It applies a chosen sampling strategy to balance the dataset before generating each batch of training data. We applied random oversampling with replacement, which randomly selects data samples from the PE class (the minority class) and includes them in the training data.
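A minimal sketch of how a balanced batch could be drawn via random oversampling with replacement (the actual experiments used Imbalanced-learn's balanced batch generator; `balanced_batch` here is a hypothetical helper illustrating the same idea):

```python
import numpy as np

def balanced_batch(X, y, batch_size, rng=np.random.default_rng(0)):
    """Draw a batch with equal numbers of PE (1) and Non-PE (0) samples.

    Minority-class indices are sampled with replacement (random
    oversampling), mirroring the strategy described in the text.
    """
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    half = batch_size // 2
    idx = np.concatenate([
        rng.choice(pos, size=half, replace=True),
        rng.choice(neg, size=half, replace=True),
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

X = np.arange(20).reshape(10, 2)
y = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])  # 10% minority, like PE
Xb, yb = balanced_batch(X, y, batch_size=8)
print(yb.sum())  # -> 4: half of the batch is the minority class
```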

Chi-square feature selection
Chi-square (χ²) feature selection [79] selects significant features using a test of independence between each feature and the classes:

$\chi^{2} = \sum_{i} \sum_{j=1}^{l} \frac{(N_{ij} - E_{ij})^{2}}{E_{ij}},$

where $N_{ij}$ is the number of samples belonging to class $C_i$ in the $j$-th interval, $E_{ij}$ is the expected frequency of $N_{ij}$, and $l$ is the number of intervals. The features with the highest χ² values are selected for the predictive model. This method is well suited to categorical data, which made it ideal for our datasets.
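A direct NumPy implementation of the χ² statistic above for one feature (illustrative; in practice a library routine such as scikit-learn's `SelectKBest` would typically be used to rank many features):

```python
import numpy as np

def chi_square_score(feature, y):
    """Chi-square test of independence between one categorical feature
    and the class labels: sum over (N_ij - E_ij)^2 / E_ij."""
    feature, y = np.asarray(feature), np.asarray(y)
    classes, values = np.unique(y), np.unique(feature)
    n = len(y)
    score = 0.0
    for c in classes:
        for v in values:
            n_ij = np.sum((y == c) & (feature == v))          # observed count
            e_ij = np.sum(y == c) * np.sum(feature == v) / n  # expected count
            score += (n_ij - e_ij) ** 2 / e_ij
    return score

y = np.array([1, 1, 1, 0, 0, 0, 0, 0])
informative = np.array([1, 1, 1, 0, 0, 0, 0, 0])  # perfectly tracks the class
random_feat = np.array([1, 0, 1, 0, 1, 0, 1, 0])  # unrelated to the class
# The informative feature gets a much higher chi-square score.
print(chi_square_score(informative, y) > chi_square_score(random_feat, y))  # -> True
```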

Performance measures
The most commonly-used performance measures in binary classification tasks are calculated from the confusion matrix (Fig 1). The number of true positives (TP) represents the number of preeclamptic (PE) patients correctly classified, while the number of true negatives (TN) is the number of non-preeclamptic (Non-PE) patients correctly classified as Non-PE. The number of false positives (FP) refers to Non-PE patients classified as PE, while the number of false negatives (FN) represents PE patients classified as Non-PE.
The most common metric in classification tasks is accuracy, provided by Eq 14, which measures the total proportion of correctly classified samples.
Accuracy is very sensitive to the size of the majority class (Non-PE): a model can obtain a misleadingly high accuracy by following the majority-class pattern while misclassifying most minority-class samples. Because accuracy does not account for this imbalance, we used additional metrics to better understand model performance.
We report precision, recall, G-mean, and AUC. Precision measures the proportion of predicted positive cases that are actually positive, while recall (or sensitivity) measures the percentage of positive cases captured by the model. Specificity refers to the percentage of negative examples that are truly negative. We also report the G-mean, which takes into account both specificity and sensitivity, as well as the area under the receiver operating characteristic curve (AUC), which measures the trade-off between correctly classified positive samples (TP) and incorrectly classified negative samples (FP). The performance metrics were calculated with Eqs 15-18.
$G\text{-mean} = \sqrt{\mathrm{Sensitivity} \times \mathrm{Specificity}} \qquad (18)$
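Eqs 15-18 translate directly into code; a small sketch computing the metrics from confusion-matrix counts (the counts used below are made-up numbers for illustration):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall (sensitivity), specificity, and
    G-mean computed directly from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    g_mean = math.sqrt(recall * specificity)
    return accuracy, precision, recall, specificity, g_mean

# On an imbalanced set (50 PE vs 950 Non-PE), accuracy is 93% even
# though precision on the minority class is only 37.5%.
print(classification_metrics(tp=30, tn=900, fp=50, fn=20))
```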

Data preparation
We compare and validate the ML models using three datasets: the 2013 Texas Inpatient Public Use Data File (PUDF) (case 1), the 2017-2018 Oklahoma PUDF sets (case 2), and granular research-hospital data from the Magee Obstetric Medical and Infant (MOMI) database (case 3). These datasets represent a diverse cohort of African American and Native American populations, allowing us to specifically examine racial disparities in PE outcomes in our analysis. The PUDFs exclude information that could identify patients directly or indirectly. Access to these data files is given to users after submission and approval of the Data Use Agreement. The University of Oklahoma institutional ethical review board approval was obtained for this study (#12718, 07/20/2020). The characteristics of these datasets are explained in this section. Case 1: Texas PUDF. The 2013 Texas Inpatient PUDF has patient-level information related to inpatient hospital stays collected from all state-licensed hospitals except those that are exempt from reporting. "Exempt" hospitals include those located in a county with a population of less than 35,000, or those located in a county with a population of more than 35,000 that have fewer than 100 licensed hospital beds and are not located in an area delineated as an urbanized area by the United States Bureau of the Census [80]. This data is maintained and extracted from the Texas Department of State Health Services' Hospital Discharge Database [81].
The Texas PUDF includes both sociodemographic and clinical information about each patient. The clinical information in particular is contained within up to 25 admission and discharge diagnosis codes for each patient, and are defined by the The International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM).
We first filtered the records of women who delivered in hospital using an ICD-9-CM code beginning with V27 (Outcome of delivery) and then analyzed this subset of data. The Texas PUDF contains 360,943 patients who delivered at the hospital; of those, 14,375 (3.98%) developed PE. Table 2 summarizes the demographic attributes of this dataset, and Table 3 shows the distribution of patients based on ethnicity and race. There is a large number of missing values for race among the non-Hispanic population, and a large number of missing values for ethnicity among Other Races and the White population in this data. As shown in Table 4, AA patients had the highest incidence of PE among all race/ethnic groups, with a frequency of 9.51% (as a proportion of their population). Fig 4 shows that preeclamptic women are more likely to have a prolonged length of stay in hospital compared to Non-PE patients; in particular, AA Hispanic patients with PE have a longer average hospital stay (Table 5). The average length of stay for women with PE is longer overall: the median and interquartile range (IQR) for women with PE are 3 days and 1 day, compared to a median of 2 days and an IQR of 1 day for women without PE.
Case 2: Oklahoma PUDF. The second set of data that we use is the 2017 and 2018 Oklahoma Discharge Public Use Data Files (PUDF). The Oklahoma PUDF consists of statewide discharge data collected from two data sources: 1) the Uniform Claims and Billing Form (UB-92) for the hospital inpatient and outpatient surgeries, 2) the HCFA/CMS 1500 claims forms for the ambulatory surgery centers. It is maintained by the Health Care Information Division of the Oklahoma State Department of Health.
Unlike the 2013 Texas PUDF, the 2017-2018 Oklahoma PUDF uses ICD-10-CM diagnosis codes [83]. The women who delivered in hospital were filtered based on the ICD-10-CM codes beginning with Z37 (Outcome of delivery). This dataset contains a total of 84,632 women who delivered in-hospital, of which 4,721 (5.58%) developed PE. Table 6 summarizes the demographic attributes, such as age, race, insurance type, and month of delivery; the frequency of each feature's values (as a percentage of the population) is reported in this table. Unlike the Texas PUDF, no data on ethnicity is collected for each patient, but race remains an available variable, along with additional attributes such as marital status and month of admission. There are no records indicating the delivery date for each patient in this data, so we used the month of admission to estimate the month of delivery. The prevalence of PE by age follows a U-shaped curve, highlighting the youngest and oldest patients as the most at-risk. Table 7 shows the frequency of PE among each racial group in the Oklahoma PUDF, while Fig 6 shows the absolute number of patients in each racial category and their respective prevalence of PE. Despite White patients contributing the overwhelming majority of patients in this dataset, Native Americans and African Americans have the highest prevalence of PE. In particular, PE among Native Americans is almost twice that of the "Other/Unknown" race.
Similar to the Texas PUDF, the average length of stay is longer for patients with PE compared to those without PE, as shown in Fig 7. The median and IQR length of stay for those without PE are 2 and 1 days respectively, while for those with PE the median and IQR are 3 and 2 days respectively. The length of stay also varies with a patient's race: according to Table 8, AAs and NAs had longer hospital stays compared to their White and "Other" counterparts.
After selecting the initial set of clinical features, each was formulated as a binary feature based on the presence of the corresponding ICD-9-CM/ICD-10-CM codes in any of the diagnosis columns. We set the value of the feature to 1 if any of the corresponding ICD-9-CM/ICD-10-CM codes used in this study were present in any of the diagnosis columns; otherwise, the value was set to zero. A full list of ICD-9-CM/ICD-10-CM codes is described in the S3 Table of the S1 File. Furthermore, a detailed analysis of the prevalence of each code in both the Texas and Oklahoma datasets can be found in the S5 Table of the S1 File.
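The binary encoding can be sketched as follows (the code prefixes shown are illustrative examples only; the study's full ICD code lists are in S3 Table of S1 File, and `encode_icd_features` is a hypothetical helper):

```python
def encode_icd_features(diagnosis_codes, feature_prefixes):
    """Map a patient's diagnosis codes to binary features.

    A feature is 1 if any diagnosis code starts with one of the
    feature's ICD prefixes, 0 otherwise.
    """
    return {
        name: int(any(code.startswith(p)
                      for code in diagnosis_codes
                      for p in prefixes))
        for name, prefixes in feature_prefixes.items()
    }

# Illustrative prefixes only, not the study's actual code list.
features = {
    "chronic_hypertension": ("I10",),
    "preexisting_diabetes": ("E10", "E11"),
}
print(encode_icd_features(["I10", "Z37.0"], features))
# -> {'chronic_hypertension': 1, 'preexisting_diabetes': 0}
```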
Case 3: MOMI. Maintained by the University of Pittsburgh Medical Center Magee-Womens Hospital since 1995, the MOMI database reports about 300 variables for more than 200,000 deliveries. The dataset is extracted from medical records coding, admitting services, outpatient encounters, ultrasound, and other ancillary systems for all mother-infant pairs delivered at Magee. Unlike the Texas and Oklahoma data, this dataset contains patient information from multiple prenatal visits in addition to the demographic and clinical features. For this study, we considered each patient's information at her latest prenatal visit within the first 14 weeks, which yielded in total 31,431 women who delivered in-hospital, of whom 2,743 developed PE; the demographic attributes are summarized in Table 9. The clinical attributes of the MOMI data are summarized in the S6 Table of the S1 File. Furthermore, we added a new feature for each patient representing the number of incidents of spikes in blood pressure within the first 14 weeks, where a spike is defined as systolic pressure above 130 or diastolic pressure above 80. In addition, this dataset contains several numeric variables, including weight, age, previous pregnancies, the number of abortions and deliveries, etc. All numeric variables were normalized so that all values fall within the range of 0 to 1 prior to training the ML models. The race variable for patients from the groups "Indian(Asian)", "Chinese", "Korean", "Filipino", "Japanese", "Vietnamese", and "Other Asian" is labeled as "Asian," and patients from any of the cohorts "Hawaiian", "Samoan", "Guam/Chamorro", and "Other Pacific Islander" are identified as "Polynesian." We include "Native American" and "Alaskan Native" patients as one group ("Native American (NA)"). Ethnicity (whether or not a mother is Hispanic) is included as a variable in the MOMI dataset; however, 42% of its values were missing, so we dropped this variable from our analysis.
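The two MOMI-specific preprocessing steps, the blood-pressure-spike count and the [0, 1] normalization, can be sketched as follows (hypothetical helpers, assuming readings arrive as (systolic, diastolic) pairs):

```python
def count_bp_spikes(readings):
    """Count prenatal readings (within the first 14 weeks) where systolic
    pressure exceeded 130 or diastolic pressure exceeded 80."""
    return sum(1 for systolic, diastolic in readings
               if systolic > 130 or diastolic > 80)

def min_max_normalize(values):
    """Scale a numeric variable into [0, 1] prior to training."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Two of the three readings qualify as spikes.
print(count_bp_spikes([(118, 76), (134, 79), (126, 85)]))  # -> 2
print(min_max_normalize([20, 30, 45]))  # -> [0.0, 0.4, 1.0]
```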
We observed a noticeable U-shaped trend in incidence, reflecting the high risk of PE among the oldest patients and a slightly increased risk among the youngest patients. Fig 9 shows the prevalence of PE among different racial groups in the MOMI dataset; African American patients experience a higher rate of PE (11.4%) compared to other racial groups. To reduce the computational complexity and increase the interpretability of our results, we used chi-square feature selection to extract the most critical variables related to preeclampsia, as described in the Results section.
Missing data challenge. In the Texas PUDF, the county feature has the most missing values (2.5%), followed by the patients' ethnicity (0.95%), race (0.24%), and insurance (0.04%). In the Oklahoma PUDF, the variable with the most missing values was marital status (17.74%), followed by the county (0.004%) and insurance (0.004%) variables.
In the MOMI dataset, features with greater than 20% missing values were dropped prior to any preprocessing steps. The most frequently missing feature is race (2.06%), followed by the mean arterial pressure (1.40%) and infant sex (1.32%). Furthermore, we observed that if there was a missing value for the total number of previous pregnancies for a patient, some others such as deliveries, miscarriages, abortions, and whether or not this is a first pregnancy were also missing. More details about features with missing values are provided in the S7-S9 Tables of the S1 File. We used a Multiple Imputation technique [84] to estimate missing values. For the Multiple Imputation implementation, we used Bayesian ridge regression [85] repeated 10 times in order to gain a more robust estimate for the final imputed values.
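The multiple-imputation step can be sketched with scikit-learn's `IterativeImputer` and Bayesian ridge regression, averaging 10 posterior draws (an approximation of the procedure described; the toy data below is illustrative):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Toy matrix with correlated columns and one missing entry.
X = np.array([[1.0, 2.0],
              [2.0, 4.1],
              [3.0, np.nan],
              [4.0, 8.2],
              [5.0, 9.9]])

# Draw 10 imputations (sampling from the posterior each time) and
# average them for a more robust final estimate.
imputed = [
    IterativeImputer(estimator=BayesianRidge(), sample_posterior=True,
                     random_state=seed).fit_transform(X)
    for seed in range(10)
]
X_filled = np.mean(imputed, axis=0)
print(X_filled[2, 1])  # close to 6, the value implied by the linear trend
```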

Feature selection
Using chi-square feature selection, we identified the 20 most important risk factors, ranked in terms of variable importance, for each of the three datasets. Case 3: MOMI data. Our computational results with the MOMI data show that our models performed best when all features were considered in training. However, we identified the top 20 most important features in order to improve the interpretability of the data (Figs 16 and 17). As in the Texas and Oklahoma datasets, hypertension and diabetes are significant predictors; however, other variables such as kidney disease are among the important risk factors that were not identified within the Texas and Oklahoma datasets (cases 1 and 2). Interestingly, we note that the number of previous spikes in high blood pressure was among the significant risk factors.
In the African American dataset, as shown in Fig 17, four features were identified as important that do not overlap with the full population: maternal collagen vascular disease, maternal structural heart disease, hemorrhagic disorders, and maternal anemia without hemoglobinopathy.

Model evaluation and discussion
In this section, we present the results of the DNN and the proposed CSDNN algorithms on the Oklahoma, Texas, and MOMI datasets. The performance of these algorithms was compared based on the evaluation measures described in the Performance Measures section. We implemented both the DNN and CSDNN algorithms in Python version 3.6 with the Keras [86] and TensorFlow [87] libraries. All experiments and data processing were performed on an AMD Ryzen 5 3.8 GHz 6-core processor with 16 GB of RAM on a 64-bit platform. As a preprocessing step prior to classification, continuous variables were normalized to zero mean and unit standard deviation. In the MOMI dataset, outliers were removed using the Local Outlier Factor [88]. Furthermore, a dropout rate of 20% was applied to reduce the risk of overfitting when training both the DNN and CSDNN.
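The standardization and outlier-removal steps can be sketched with NumPy and scikit-learn's `LocalOutlierFactor` (synthetic data; the parameter choices here are illustrative, not the study's settings):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(0, 1, size=(100, 3))
X[0] = [10.0, 10.0, 10.0]          # plant one obvious outlier

# Standardize continuous variables to zero mean and unit std.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Local Outlier Factor labels outliers -1 and inliers 1; keep inliers.
labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X_std)
X_clean = X_std[labels == 1]
print(len(X_clean) < len(X_std))  # the planted outlier is removed
```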
CSDNN architecture. Neural networks contain many hyperparameters that must be set prior to training, such as the learning rate, depth, number of nodes per layer, activation functions, and weight-initialization strategy. Given the large number of hyperparameters, we found the best-performing combination for both the DNN and CSDNN models using three hyperparameter search strategies: random search [89], Bayesian optimization [90], and Hyperband [91]. All hyperparameter tuning was performed using the Keras Tuner library [92].
The initial ranges of each hyperparameter for the model selection algorithms are summarized in Table 10. These hyperparameters are the batch size (B), the number of epochs (T), the number of hidden layers (h), the number of neurons in the hidden layers (k), the learning rate (η), and the activation functions (a). Each model selection algorithm was run on each dataset using 10-fold cross-validation repeated 5 times to increase the robustness of the results, while for the small sub-population datasets we performed 10-fold cross-validation repeated 35 times. The best set of hyperparameters was selected as the one yielding the highest G-mean. The best architectures and hyperparameters obtained from the three model selection techniques for the DNN and the CSDNN with the WCE and FL functions, as well as for hybrid models that further balance batches with oversampling (Balanced Batches), are summarized in the S10-S15 Tables in S1 File. We observe that Hyperband model selection consistently performs well on all datasets for both the DNN and the CSDNNs. The number of layers in most models does not exceed 4. Most models employed a larger learning rate (e.g., 0.001), but a few of the smaller datasets (e.g., the Oklahoma NA datasets) selected smaller learning rates (0.00001), particularly for the DNN and CSDNN-WCE with the balanced batches method. Overfitting can be mitigated through early stopping of the neural networks; a detailed analysis of training and validation AUC versus the number of epochs for each model can be found in the S1-S6 Figs in S1 File.
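A bare-bones sketch of one of the three strategies, random search, over a hyperparameter space shaped like Table 10 (the actual tuning used the Keras Tuner library; the ranges below and the `fake_gmean` objective, which stands in for the cross-validated G-mean, are illustrative assumptions):

```python
import random

# Hyperparameter ranges (illustrative, mirroring the structure of Table 10).
SPACE = {
    "batch_size": [64, 128, 256],
    "epochs": [50, 100, 200],
    "hidden_layers": [1, 2, 3, 4],
    "neurons": [16, 32, 64, 128],
    "learning_rate": [1e-5, 1e-4, 1e-3],
    "activation": ["relu", "tanh", "sigmoid"],
}

def random_search(evaluate, n_trials=20, seed=0):
    """Sample configurations at random and keep the one with the
    highest score returned by `evaluate`."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

# Stand-in objective: pretend deeper networks with ReLU score best.
def fake_gmean(cfg):
    return cfg["hidden_layers"] + (1 if cfg["activation"] == "relu" else 0)

best, score = random_search(fake_gmean)
print(best["hidden_layers"], best["activation"], score)
```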

Comparative analysis of CSDNN-FL versus parameters γ and α.
To further investigate the effect of the FL function on CSDNN performance, we examined the cumulative loss generated for different values of the γ parameter. Inspired by the original paper by Lin et al. [47], Figs 18-20 were created by training the CSDNN-FL model with the best performing α for each dataset and different values of γ. The samples in the test set were split into positive and negative samples, and the loss was calculated for each sample for each value of γ. The plots were then created by ordering the normalized losses from lowest to highest and plotting the cumulative sums for the positive and negative classes for the various γ, resulting in the cumulative sum plots (CSPs) shown in Figs 18-20. The effect of γ on the positive samples (PE cases) was not as noticeable; however, its effect on the negative samples (non-PE cases) was substantially different. Both positive and negative CSPs appeared relatively similar when γ = 0. Increasing γ had a large effect on down-weighting the easy negative samples, as FL focuses learning on hard negative samples. This is consistent with earlier literature on FL [47].
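The down-weighting behavior described above can be reproduced in a few lines. This is a minimal per-sample implementation of the binary focal loss of Lin et al. [47], not the paper's training code; the probabilities in the example are made up to contrast an easy and a hard negative.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single sample: -alpha_t * (1 - p_t)**gamma * log(p_t).

    p is the predicted probability of the positive class, y the true label (0/1).
    With gamma = 0 this reduces to the alpha-weighted cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# An easy negative (p_t = 0.9) versus a hard negative (p_t = 0.3):
easy = lambda g: focal_loss(0.1, 0, gamma=g)
hard = lambda g: focal_loss(0.7, 0, gamma=g)
# Raising gamma shrinks the easy-sample loss much faster than the hard-sample
# loss, so the gradient signal concentrates on the hard negatives.
```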
Tables 11-13 show that as γ increased, AUC decreased, but G-mean increased slightly, except in the Texas dataset. CSDNN-FL for the Full Texas dataset showed relatively inferior specificity and higher recall values as γ increased. The results in these tables were obtained by first finding the best performing α and then fixing α while varying γ.
Comparative analysis of CSDNNs with the balanced batch method. We also compared our proposed CSDNN-FL and CSDNN-WCE with the standard DNN, with and without Balanced Batches (BB), on the Full Texas, Oklahoma, and MOMI datasets as well as the sub-population datasets. We report the average G-mean, AUC, accuracy, precision, recall, and specificity values in Table 14. We observed that CSDNN-FL performs better than CSDNN-WCE and the DNN on the Texas and Oklahoma datasets (in terms of G-mean and AUC). On the Oklahoma African American dataset, there was no statistically significant difference between the CSDNN-WCE and CSDNN-FL results; a detailed description is presented in the Statistical Analysis Section. Interestingly, the DNN and CSDNN-FL with balanced batches performed better than the other methods for the MOMI Full and MOMI African American datasets, respectively. We also observed that in all cases CSDNN-FL and CSDNN-WCE improve recall. For the Texas Full dataset, the CSDNN-FL model has a recall of 61.6% versus 11.8% for the standard DNN model, which indicates that CSDNN-FL is capable of detecting more preeclamptic women than the standard DNN.
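For reference, the G-mean used throughout these comparisons is the geometric mean of recall and specificity; the confusion counts in the test below are invented for illustration.

```python
import math

def binary_metrics(tp, fn, tn, fp):
    """Recall (sensitivity), specificity, and their geometric mean (G-mean).

    G-mean penalizes a classifier that sacrifices one class for the other,
    which is why it is preferred over plain accuracy on imbalanced PE data:
    a model that labels everyone non-PE gets high accuracy but G-mean = 0.
    """
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return recall, specificity, math.sqrt(recall * specificity)
```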
To measure and compare the characteristics of CSDNN-FL, CSDNN-WCE, and the DNN with and without balanced batches, we used box plots of the G-mean measure, as shown in Figs 21-24; these were obtained over 50 iterations on the same data for each algorithm for the Full datasets and 350 iterations on the same data for the subpopulations. As shown by Figs 21-24 (and the standard deviations in S4 Table of the S1 File), CSDNN-FL was more robust than the other models in terms of G-mean. In addition, the models equipped with balanced batches more frequently demonstrated greater variation between runs for the Oklahoma African American and Texas Native American datasets, which reflected their propensity for overfitting. Moreover, the African American subpopulation had more variation in G-mean under the DNN compared to the other methods for both the Texas and Oklahoma datasets. We also observe that the Native American subpopulation of the Oklahoma dataset had the most variable G-mean, most likely due to the small dataset size.
Table 11. Comparative analysis of CSDNN-FL versus γ using the Full Texas dataset. ACC, SP, PR, and RE represent accuracy, specificity, precision, and recall, respectively. The highest G-mean and AUC values are denoted in bold. The parameter α is set to 0.97.
Comparative analysis of computation time. Table 15 shows the average time required for the proposed algorithms to complete training and testing. In addition to evaluating the quality of the predictive models in terms of AUC and G-mean, it is critical to investigate whether there is a computational bottleneck. Each running time for training and testing is reported as an average of 10 iterations in which no hyperparameter tuning or cross validation was performed. Each model had the same architecture configuration, with three hidden layers consisting of 60, 30, and 45 nodes respectively, and a batch size of 8192.
The MOMI data was an exception, in which each model was implemented with a batch size of 4096. We observed that the proposed algorithms were capable of dealing with large amounts of data in a reasonable amount of time.
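The balanced-batch idea used by the BB variants, filling each mini-batch half-and-half by sampling each class with replacement, can be sketched as follows. The sample representation and batch size here are illustrative, not the paper's data pipeline.

```python
import random

def balanced_batch(positives, negatives, batch_size, rng=None):
    """Draw one mini-batch with equal class counts by sampling each class
    with replacement, so the rare PE class is effectively oversampled."""
    rng = rng or random.Random()
    half = batch_size // 2
    batch = [rng.choice(positives) for _ in range(half)]
    batch += [rng.choice(negatives) for _ in range(batch_size - half)]
    rng.shuffle(batch)  # avoid presenting all positives first
    return batch
```

Because the minority samples are drawn with replacement, the same PE case can appear several times within one batch; this raises recall but, as the box plots above suggest, can also increase run-to-run variance and the risk of overfitting on small subpopulations.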
Comparative analysis with traditional ML algorithms. To further evaluate the performance of the CSDNN methods, we compared our CSDNNs with multiple existing methods: logistic regression (LR), support vector machine with a linear kernel (SVM-Lin), support vector machine with a radial basis function kernel (SVM-RBF), and cost-sensitive versions of each of these models, including weighted LR (WLR), weighted SVM-Lin (WSVM-Lin), and weighted SVM-RBF (WSVM-RBF). Tables 16-18 show the average G-mean and AUC values for the Texas, Oklahoma, and MOMI datasets, respectively. In all cases, the cost-sensitive versions outperform in terms of both AUC and G-mean; however, the best performing model for the Texas and Oklahoma datasets is CSDNN-FL, with 66% and 64% AUC, respectively. On the MOMI dataset, CSDNN-WCE and CSDNN-FL perform well compared to the other techniques, with 76% AUC; however, WLR produced the highest G-mean values, followed by WSVM-RBF and the CSDNN methods.
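The "weighted" variants above reweight each class inversely to its frequency. A common heuristic for choosing the weights, the "balanced" mode in scikit-learn, is sketched below as an assumption; the paper's exact weights may differ.

```python
def balanced_class_weights(n_pos, n_neg):
    """Class weights inversely proportional to class frequency:
    w_c = n_samples / (n_classes * n_c).

    The rare PE class receives the larger weight, so misclassifying a PE
    case costs more in the loss than misclassifying a non-PE case.
    """
    n = n_pos + n_neg
    return {1: n / (2 * n_pos), 0: n / (2 * n_neg)}
```

A useful sanity check on this scheme is that the total weighted sample count is preserved: n_pos * w[1] + n_neg * w[0] = n.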
The ROC curves of all models are shown in Figs 25 and 26. These graphs show an improvement over the other traditional algorithms, consistent with the results in Tables 16-18. We note that the results of the pairwise Wilcoxon rank sum test are summarized in S16-S18 Tables of the S1 File and are described in the Statistical Analysis of Results Section. The robustness of an ML algorithm is critical in the PE prediction problem: a promising prediction method should produce the same results over several iterations. To measure this, we employed box plots, as shown in Figs 27-29, which were obtained over 50 iterations on the same data for each algorithm. As the figures show, CSDNN-FL was more robust than the other algorithms for the Oklahoma and Texas datasets. A small standard deviation was observed for CSDNN-FL, followed by CSDNN-WCE, for both datasets. WLR, followed by CSDNN-WCE, showed better results each time for the MOMI dataset, while the performance of both LR and SVM was inferior in most cases. Therefore, integrating cost-sensitive learning into prediction models can improve their accuracy.
Statistical analysis of results. To test whether there is a statistical difference between the models, a Kruskal-Wallis test was performed for each dataset. Table 19 shows that the null hypothesis is rejected with an extremely low p-value for each dataset (at a significance level of α = 0.05). We conclude that there is a statistically significant difference between the models.
To test whether our CSDNN models (CSDNN-FL and CSDNN-WCE) perform well compared to the other existing methods, we performed a pairwise Wilcoxon rank sum test between the CSDNN models and the benchmark methods. This test is performed on the G-mean values obtained from the 10-fold cross validation repeated 5 times. Since this test must be run multiple times, the family-wise error rate is controlled by reducing the significance level to 0.0005.
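As a sketch of the test used for these pairwise comparisons (in practice one would call `scipy.stats.ranksums`), a one-sided rank-sum test via the normal approximation, assuming no tied values, looks like:

```python
import math
from statistics import NormalDist

def rank_sum_p(x, y):
    """One-sided Wilcoxon rank-sum (Mann-Whitney) p-value via the normal
    approximation, testing whether values in x tend to exceed those in y.

    Minimal sketch: assumes no tied values and reasonably large samples;
    a production test would use average ranks and a tie-corrected variance.
    """
    pooled = sorted(x + y)
    w = sum(pooled.index(v) + 1 for v in x)          # rank sum of sample x
    n1, n2 = len(x), len(y)
    mean = n1 * (n1 + n2 + 1) / 2                    # E[W] under H0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)     # SD[W] under H0
    z = (w - mean) / sd
    return 1.0 - NormalDist().cdf(z)
```

With the family-wise correction above, a comparison counts as significant only if this p-value falls below 0.0005.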
The results of a one-tailed Wilcoxon rank-sum test between the CSDNNs and the benchmark methods for the Texas, Oklahoma, and MOMI Full datasets are presented in S16-S18 Tables of the S1 File. The null hypothesis is rejected if the p-value for the test is lower than the significance level α = 0.0005. CSDNN-FL and CSDNN-WCE significantly outperform the corresponding methods on both the Texas and Oklahoma datasets, as shown in S16 and S17 Tables of the S1 File. While the CSDNN-WCE method outperformed most methods, it showed significantly inferior results compared to WSVM-RBF on the Texas dataset, and to WLR, CSDNN-FL-BB, and DNN-BB on the Oklahoma dataset. For the MOMI dataset, our CSDNN models performed significantly better than most methods, except WLR, WSVM-RBF, DNN-BB, and CSDNN-FL-BB (S18 Table of the S1 File).
The results of the one-tailed Wilcoxon rank-sum test for the Texas and Oklahoma African American datasets are shown in S19 and S20 Tables in the S1 File, respectively. For these datasets, CSDNN-FL significantly outperformed the other methods, except on the Oklahoma African American dataset, where CSDNN-WCE performed significantly better than CSDNN-FL.
The one-tailed Wilcoxon rank-sum test results for the Texas and Oklahoma Native American datasets are presented in S21 and S22 Tables in the S1 File. We observed that CSDNN-FL significantly outperformed most methods, except DNN-BB, on both datasets.

Conclusion
False-negative PE predictions may result in high rates of maternal morbidity and mortality, while false positives may lead to unnecessary interventions. As such, identifying patients who would be well suited for outpatient management is challenging. Providing physicians with reliable and accurate tools to improve the targeting and implementation of prevention measures is critical to advancing the life-long health of preeclamptic patients. We propose the use of CSDNNs for PE prediction, a task that suffers from highly imbalanced datasets. We compared the focal loss function in the CSDNN (originally applied to image data) with both the weighted cross-entropy and standard cross-entropy loss functions. In addition, we evaluated and compared the results of the CSDNNs with the corresponding models trained on balanced batches obtained from random oversampling.
We performed an extensive experimental analysis using three clinical datasets to show the advantages of our CSDNN algorithm. Given that African American and Native American women experience severe morbidity and mortality rates compared to their Caucasian counterparts during pregnancy, we studied the performance of our method on each subpopulation in addition to the full datasets. We further compared the CSDNN results with the performance of the existing methods. Our results demonstrated that in many cases (5 out of 8 datasets), our CSDNN equipped with the focal loss function performed better, with significantly less variation in the results, compared to the other methods in terms of G-mean and AUC.
The limitations of this study largely involve the Oklahoma and Texas PUDFs, which do not include laboratory test results, detailed drug or alcohol usage, detailed blood pressure readings, specific height/weight information, etc. To overcome this drawback, we studied the MOMI data, which contains granular information about prenatal visits. Furthermore, the Texas PUDF does not allow us to distinguish multiple-incident events for the same patient, which is necessary to address the bias in our statistical modeling results. To overcome this issue, we studied the Texas PUDF data for only one year.
Future studies should extend the application of the models to early- and late-onset PE. In addition, while our model accounts for race/ethnicity in PE prediction and presents promising classification results for each group in a highly imbalanced setting, additional investigation and computational testing for more equitable results, as well as further data collection for the minority groups, should be explored in the future. The proposed models for minority groups can be extended to other health problems with disparities in outcomes. In future work, we will improve the results of the deep neural network, especially on small datasets [93]. In addition, future studies should include detailed information on socioeconomic status, maternal weathering, and allostatic load. Fairness in machine learning also merits continued investigation.
Supporting information
S1 File. File of supporting results and data analysis. (ZIP)