Investigating gene methylation signatures for fetal intolerance prediction

Pregnancy is a long and complex process during which one or more offspring develop inside a woman. A short period of oxygen shortage after birth is quite normal for most babies and does not threaten their health. However, a prolonged period of oxygen shortage indicates pathological fetal intolerance, which may cause death. Distinguishing pathological fetal intolerance from physiological oxygen shortage has long been an important clinical problem in obstetrics. Clinically, five symptoms typically indicate that a baby may suffer from fetal intolerance. With the rapid development of molecular sequencing technologies, liquid biopsy combined with high-throughput sequencing or mass spectrometry now provides a quick approach to detecting real-time alterations in the peripheral blood at multiple levels. Gene methylation is functionally correlated with gene expression; thus, combining gene methylation and expression information can help screen out the key regulators in the pathogenesis of fetal intolerance. Here, for the first time, we combined gene methylation and expression features and screened out the optimal features, including gene expression and methylation signatures, for fetal intolerance prediction. In addition, we applied several computational methods to construct a comprehensive pipeline for identifying potential biomarkers of fetal intolerance from liquid biopsy samples. We set up qualitative and quantitative computational models for predicting fetal intolerance during pregnancy. Moreover, we provide a new perspective on the detailed pathological mechanism of fetal intolerance. This work can provide a solid foundation for further experimental research and contribute to the application of liquid biopsy in antenatal care.


Introduction
Pregnancy is a long and complex process during which one or more offspring develop inside a woman [1,2]. Various pathological syndromes and severe complications may occur during pregnancy [3][4][5]. Fetal intolerance, also known as fetal distress, is a common but dangerous condition during birth [6]. It generally refers to babies suffering from oxygen shortage during the birth process [6][7][8]. A short period of oxygen shortage after birth is quite normal for most babies and does not threaten their health [7]. However, a prolonged period of oxygen shortage indicates pathological fetal intolerance, which may cause death.
Distinguishing pathological fetal intolerance from physiological oxygen shortage has long been an important clinical problem in obstetrics. Clinically, the following five symptoms indicate that a baby may suffer from fetal intolerance [9][10][11]: 1) high heart rate (tachycardia); 2) low heart rate (bradycardia); 3) irregular heart rate (arrhythmia); 4) lack of movement in the womb; and 5) stool found in the amniotic fluid. Some alteration of the heart rate, for example, is quite normal for newborn babies. However, persistently abnormal heart rate patterns strongly indicate pathological fetal intolerance [11]. Fetal intolerance is a severe condition that can lead to the death of the baby. Medical staff must act quickly once severe symptoms manifest, so an accurate and effective way to predict fetal intolerance (e.g., quick early diagnosis) is needed.
With the rapid development of molecular sequencing technologies, liquid biopsy [12][13][14] combined with high-throughput sequencing or mass spectrometry provides a quick approach to detecting real-time alterations in the peripheral blood at multiple levels (e.g., genomics [12], transcriptomics [13], and proteomics [14]). Various genetic variations, such as mutations in IGF-II and H19, have been confirmed to participate in the pathogenesis of fetal intolerance [15]. Genomic methylation status has also been shown to be functionally correlated with fetal intolerance. In 2018, an independent study of the methylation status of SLC9B1 showed that its methylation pattern can predict the clinical outcome of pregnancies at risk of fetal intolerance [16]. Gene methylation is functionally correlated with gene expression. Thus, combining gene methylation and expression information can help screen out the key regulators in the pathogenesis of fetal intolerance.
Here, for the first time, we combined gene methylation and expression features and screened out the optimal features, including gene expression and methylation signatures, for fetal intolerance prediction. Moreover, we applied several computational methods to construct a comprehensive pipeline for identifying potential biomarkers of fetal intolerance from liquid biopsy samples. We set up qualitative and quantitative computational models for predicting fetal intolerance during pregnancy. Furthermore, we provide a new perspective on the detailed pathological mechanism of fetal intolerance. This work can provide a solid foundation for further experimental research and contribute to the application of liquid biopsy in antenatal care.

Data
We downloaded the gene expression and methylation profiles of fetal intolerance from the Gene Expression Omnibus (GEO, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE107460) [16]. From the original dataset, we extracted 22 fetal intolerance and 96 control samples that had both gene expression and methylation profiles. The expression levels of 15,505 genes were measured with the Illumina HumanHT-12 V4.0 expression BeadChip. The methylation data were measured with the Illumina HumanMethylation450 BeadChip. Probes with missing values in more than 20% of the samples were removed. The remaining missing values were then imputed with the function impute.knn (K = 10) from the R package impute (https://bioconductor.org/packages/impute/). After this filtering, 449,094 methylation probes remained. Our aim was to investigate the gene expression and methylation differences between fetal intolerance and control samples.
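The two preprocessing steps described above (dropping probes with too many missing values, then nearest-neighbour imputation) can be sketched as follows. This is a minimal pure-Python illustration, not the impute.knn implementation used in the study: the function names, the toy matrix, and the choice of a root-mean-square distance over shared samples are all assumptions made for the example.

```python
import math

def filter_probes(matrix, max_missing=0.2):
    """Keep probes (rows) whose fraction of missing values (None) is at most max_missing."""
    return [row for row in matrix
            if sum(v is None for v in row) / len(row) <= max_missing]

def _distance(a, b):
    """Root-mean-square difference over the samples observed in both probes."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_impute(matrix, k=10):
    """Fill each missing entry with the mean of that sample's values in the k nearest probes."""
    imputed = [list(row) for row in matrix]
    for i, row in enumerate(matrix):
        if None not in row:
            continue
        # Rank all other probes by their distance to the current probe
        neighbours = sorted(
            (j for j in range(len(matrix)) if j != i),
            key=lambda j: _distance(row, matrix[j]),
        )
        for col, value in enumerate(row):
            if value is None:
                donors = [matrix[j][col] for j in neighbours
                          if matrix[j][col] is not None][:k]
                if donors:
                    imputed[i][col] = sum(donors) / len(donors)
    return imputed

# Toy methylation matrix (rows = probes, columns = samples); values are made up
toy = [
    [0.10, 0.12, None, 0.11, 0.09],   # 20% missing: kept and imputed
    [0.11, 0.13, 0.12, 0.10, 0.10],
    [0.90, 0.88, 0.91, 0.92, 0.89],
    [None, None, 0.50, 0.52, 0.49],   # 40% missing: removed
]
kept = filter_probes(toy)
filled = knn_impute(kept, k=1)
```

With real data, K = 10 as stated above would be passed as `k=10`; the toy example uses `k=1` only because it contains three probes.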

SMOTE
The dataset analyzed here is imbalanced, with far fewer positive than negative samples (22 vs. 96). We therefore first applied the synthetic minority oversampling technique (SMOTE) [17] to obtain a balanced dataset for constructing the classification models. SMOTE iteratively produces new samples for the minority class (here, fetal intolerance samples) until its size equals that of the majority class (here, control samples). In this study, the tool "SMOTE" in Weka was used to produce equal numbers of samples in the two classes.
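The core idea of SMOTE, creating a synthetic sample on the line segment between a minority sample and one of its nearest minority-class neighbours, can be sketched as below. This is a simplified illustration and not the Weka implementation: the function name, the random choice of base samples, the seed, and the toy data are assumptions. For the dataset described above, balancing would require 96 − 22 = 74 synthetic samples.

```python
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample (excluding itself)
        others = [s for s in minority if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Toy example with made-up values: 4 minority samples, 10 majority samples
minority = [[0.20, 0.70], [0.25, 0.65], [0.30, 0.75], [0.22, 0.72]]
n_majority = 10
synthetic = smote(minority, n_new=n_majority - len(minority), k=3)
balanced_minority = minority + synthetic
```

Because every synthetic sample lies between two real minority samples, the new points stay inside the region already occupied by the minority class.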

Boruta feature filtering
Boruta feature filtering [18] is a wrapper method built on random forest (RF) that selects all features relevant to the target output. The algorithm recognizes important features by comparing the importance scores of the real features with those of shuffled copies. The Boruta approach comprises three main steps: i) creation of shuffled (shadow) features by copying the training dataset and shuffling the values of each feature; ii) calculation of the importance score of each real and shuffled feature by training an RF classifier on the dataset extended with the shuffled features; and iii) comparison of the scores, retaining the real features whose importance is significantly higher than that of the shuffled features.
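The shadow-feature comparison at the heart of this procedure can be illustrated with a deliberately simplified sketch. Two substitutions are made here and should be read as assumptions: the RF importance score is replaced by a stand-in (the absolute difference between class means), and the statistical test of Boruta is replaced by a plain hit-rate threshold. All names and values are hypothetical.

```python
import random

def mean_gap_importance(column, labels):
    """Stand-in importance: absolute difference between the two class means."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))

def shadow_feature_filter(columns, labels, n_rounds=50, min_hit_rate=0.9, seed=0):
    """Keep features that beat the best shuffled (shadow) feature in most rounds."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_rounds):
        # Shadow features: a shuffled copy of every real feature
        best_shadow = 0.0
        for values in columns.values():
            shadow = list(values)
            rng.shuffle(shadow)
            best_shadow = max(best_shadow, mean_gap_importance(shadow, labels))
        # A real feature scores a hit when it beats the best shadow feature
        for name, values in columns.items():
            if mean_gap_importance(values, labels) > best_shadow:
                hits[name] += 1
    return [name for name in columns if hits[name] / n_rounds >= min_hit_rate]

# Toy data with made-up values: 6 positive and 6 negative samples
labels = [1] * 6 + [0] * 6
columns = {
    "informative": [0.81, 0.85, 0.90, 0.88, 0.83, 0.86,
                    0.12, 0.15, 0.10, 0.18, 0.14, 0.11],
    "noise": [0.5, 0.4, 0.6, 0.5, 0.4, 0.6,
              0.5, 0.4, 0.6, 0.5, 0.4, 0.6],
}
selected = shadow_feature_filter(columns, labels)
```

In the toy data, the "noise" feature has identical class means and therefore never beats a shadow feature, whereas the "informative" feature does so in nearly every round.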

Feature ranking and selection
Minimum redundancy maximum relevance. Minimum redundancy maximum relevance (mRMR) [19][20][21][22] rests on two criteria: the selected features should have minimum redundancy among themselves and maximum relevance to the class labels. Both redundancy and relevance are measured by mutual information, and mRMR ranks features by how well they satisfy the two criteria simultaneously. The top-ranked features are the most informative for helping the subsequent classification model distinguish the class labels (here, fetal intolerance or not).
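A greedy mRMR ranking can be sketched as follows. The sketch assumes discretized feature values and uses the difference form of the criterion (relevance minus mean redundancy); whether the study used this exact variant is not stated, so this is an illustration only. Feature names and values are hypothetical.

```python
import math
from collections import Counter

def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr_rank(features, labels):
    """Greedy ranking: relevance to the label minus mean redundancy with chosen features."""
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    ranked, remaining = [], set(features)
    while remaining:
        def score(f):
            if not ranked:
                return relevance[f]
            redundancy = sum(mutual_information(features[f], features[g])
                             for g in ranked) / len(ranked)
            return relevance[f] - redundancy
        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked

# Toy binary features with made-up values; B duplicates A, C carries different information
labels = [1, 1, 1, 1, 0, 0, 0, 0]
features = {
    "A": [1, 1, 1, 0, 0, 0, 0, 0],
    "B": [1, 1, 1, 0, 0, 0, 0, 0],
    "C": [1, 1, 0, 1, 0, 0, 1, 0],
}
ranking = mrmr_rank(features, labels)
```

In this toy case, B is as relevant as A but fully redundant with it, so the less relevant but less redundant feature C is ranked ahead of B.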
Incremental feature selection. Incremental feature selection (IFS) [23] determines the optimal number of features from a ranked feature list. First, IFS builds a series of feature subsets from the mRMR-ranked features: the first subset consists of the top-ranked feature, the second of the top two features, and so forth. For each feature subset, IFS trains one classification model on the training data restricted to the features in that subset. The performance of each model is evaluated by 10-fold cross-validation [24]. Finally, IFS selects the feature subset with the best performance as the optimum feature subset.
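The IFS loop itself is short and can be sketched as below. The `evaluate` callback is a placeholder for "train a classifier on this subset and return its cross-validated score"; the feature names and the score table are invented purely so that the example runs.

```python
def incremental_feature_selection(ranked_features, evaluate):
    """Score each top-k prefix of the ranking and return the best prefix with all scores."""
    scores = []
    for k in range(1, len(ranked_features) + 1):
        subset = ranked_features[:k]
        scores.append((k, evaluate(subset)))
    # max() keeps the first maximum, so ties favour the smaller subset
    best_k, best_score = max(scores, key=lambda item: item[1])
    return ranked_features[:best_k], best_score, scores

# Hypothetical ranked features and made-up scores standing in for cross-validated MCC
ranked = ["f1", "f2", "f3", "f4", "f5"]
toy_scores = {1: 0.40, 2: 0.62, 3: 0.71, 4: 0.71, 5: 0.66}

def evaluate(subset):
    return toy_scores[len(subset)]

best_subset, best_score, all_scores = incremental_feature_selection(ranked, evaluate)
```

Plotting `all_scores` (number of features against score) gives the kind of IFS curve from which the optimum subset is read off.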

Classification algorithm
Random forest. RF [25][26][27] builds an ensemble classification model consisting of several tree classifiers (i.e., decision trees). RF determines the predicted class of a sample by aggregating the votes of the individual trees. Because the individual trees differ slightly from one another, aggregating their predictions into a consensus result reduces overfitting and improves the robustness of the model.
Support vector machine. The support vector machine (SVM) [28][29][30][31] is a classification model based on statistical learning theory that assigns data samples to a given class. SVM maps the original data from a low-dimensional space to a high-dimensional one by using a given kernel function (e.g., a Gaussian kernel). During training, the model separates the samples of each class in the high-dimensional space by maximizing the margin between them. A new sample is then assigned to a class according to the side of the separating boundary on which it falls. In this study, we used the sequential minimal optimization algorithm implemented in the Weka software [32,33] to build a two-class SVM model.
Rule learning classifier RIPPER. We used RIPPER [34] to generate classification rules that assign samples to different classes. RIPPER yields an interpretable classification model expressed as IF-ELSE rules, which is then used to predict new data. RIPPER learns the rules one class at a time: it starts with the smallest (minority) class, moves on to the next smallest, and ends with the dominant class. In this study, the "JRip" algorithm implemented in the Weka software was used.
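How an ordered IF-ELSE rule list of this kind is applied to a new sample can be sketched as follows. The rules, site names, and thresholds below are entirely hypothetical placeholders and are not the rules learned in this study; the sketch only shows the mechanics (first matching rule wins, otherwise the default class).

```python
def classify_with_rules(sample, rules, default_label):
    """Return the label of the first rule whose conditions all hold, else the default."""
    for conditions, label in rules:
        if all(test(sample[feature]) for feature, test in conditions):
            return label
    return default_label

# Hypothetical rules on hypothetical methylation sites (thresholds are made up)
rules = [
    ([("site_a", lambda v: v >= 0.60), ("site_b", lambda v: v <= 0.30)],
     "fetal intolerance"),
    ([("site_c", lambda v: v >= 0.75)], "fetal intolerance"),
]

sample_1 = {"site_a": 0.72, "site_b": 0.21, "site_c": 0.40}
sample_2 = {"site_a": 0.35, "site_b": 0.50, "site_c": 0.40}
prediction_1 = classify_with_rules(sample_1, rules, default_label="control")
prediction_2 = classify_with_rules(sample_2, rules, default_label="control")
```

Rules for the minority class come first and the dominant class serves as the default, which mirrors the learning order described above.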

Performance evaluation
In this study, a commonly used measurement, the Matthews correlation coefficient (MCC) [35][36][37], was used to evaluate the prediction performance of each classification model under 10-fold cross-validation. MCC ranges between −1 and +1 and reaches +1 when the classification is perfect. Because all models in this study are two-class classifiers, the MCC for binary problems was adopted:

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

where TP, TN, FP, and FN are the numbers of true-positive, true-negative, false-positive, and false-negative samples, respectively. We also calculated sensitivity (SN), specificity (SP) and accuracy (ACC) for each model to give a full evaluation.
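The four measurements can be computed from the confusion-matrix counts as in the sketch below. The function name and the example counts are hypothetical, and returning 0 when a denominator is zero is a common convention adopted here as an assumption.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Return SN, SP, ACC and MCC from binary confusion-matrix counts."""
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention assumed here: MCC is set to 0 when the denominator is 0
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}

# Hypothetical counts, chosen only for illustration
metrics = confusion_metrics(tp=18, tn=90, fp=6, fn=4)
perfect = confusion_metrics(tp=10, tn=10, fp=0, fn=0)
inverted = confusion_metrics(tp=0, tn=0, fp=10, fn=10)
```

Unlike ACC, MCC uses all four cells of the confusion matrix, which makes it informative when the classes differ in size.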

Results and discussion
In this study, we adopted several advanced computational methods to analyze the gene expression and methylation profiles of fetal intolerance. The whole procedure is illustrated in Fig 1. This section presents the detailed results and discusses them.

Results of Boruta and mRMR
The dataset was first analyzed by Boruta to select key features. Fifteen relevant features were retained; they are provided in S1 Table. These features were then ranked by the mRMR method, and the resulting feature list is also available in S1 Table.

Selected features and classification models for distinguishing pregnant patients with or without fetal intolerance
From the obtained feature list, the IFS method generated a series of feature subsets such that the top feature comprised the first subset, the top two features constituted the second subset, and so forth. Fifteen feature subsets were assessed. For each feature subset, a classification model was built with one of the three classification algorithms (RF, SVM and RIPPER), and each model was evaluated by 10-fold cross-validation. SMOTE was employed in this procedure because the control samples far outnumbered the fetal intolerance samples. The obtained SN, SP, ACC and MCC values are listed in Tables 1-3. For ease of observation, we plotted a curve of the IFS results for each classification algorithm, with the number of features on the X-axis and MCC on the Y-axis (Fig 2). For SVM, the highest MCC was 0.796, obtained with the top 10 features. Thus, an optimum SVM model can be built with these top 10 features. The other three measurements (SN, SP and ACC) of this model are listed in Table 2. For RF, the highest MCC was 0.832, obtained with the top 11 features. Accordingly, an optimum RF model was built with these top 11 features.

PLOS ONE
SN, SP and ACC of this model are provided in Table 1. Evidently, the optimum RF model was superior to the optimum SVM model. Because RF and SVM are black-box algorithms, their classification principles are hard to interpret, and few biological insights can be obtained from them. We therefore applied RIPPER in the same way. Its IFS performance is listed in Table 3 and plotted in Fig 2. The best MCC was 0.687, obtained with the top 13 features. Thus, an optimum RIPPER model was set up with these features. The other three measurements of this model are listed in Table 3. This model was clearly inferior to the optimum RF and SVM models. However, explicit rules can be extracted from it, which make the classification procedure transparent. Based on the top 13 features, we obtained five rules via RIPPER, which are listed in Table 4.

As described above, we presented qualitative and quantitative computational approaches to distinguish pregnant patients with fetal intolerance from healthy pregnant women on the basis of their blood gene expression and methylation profiles. We not only identified a group of genes whose specific expression or methylation patterns contribute to the diagnosis of fetal intolerance but also set up a set of quantitative rules for accurate and interpretable prediction. The identified gene expression and methylation patterns and the quantitative rules are supported, directly or indirectly, by recent publications. A detailed analysis and discussion of the top-ranked genes and rules follows.

Optimal genes for fetal intolerance diagnosis and monitoring
Our computational methods identified fifteen methylation sites that are correlated with fetal intolerance. These sites map to five genes: NHEDC1, COMTD1, DLGAP2, HEG1, and KIAA1875.
The first gene, NHEDC1 (also known as SLC9B1), harbors five of these methylation sites and has been widely reported to participate in intracellular pH regulation in germ cells [38]. Its biological effects have been reported to vary considerably with its methylation status [39,40], and its abnormal methylation has been confirmed to participate in cell differentiation [39]. This gene has already been reported as a biomarker for the clinical prediction of fetal intolerance [16], which supports the efficacy and accuracy of our prediction.
The second gene is COMTD1, which encodes a methyltransferase with O-methyltransferase activity [41][42][43]. No direct evidence has confirmed that COMTD1 can independently predict fetal intolerance; however, COMTD1 in the mother's blood is correlated with several congenital disorders, such as psychotic diseases and autism [44]. Given that congenital disorders are among the major causes of fetal intolerance [45,46], biomarkers associated with this gene may help monitor the condition. Furthermore, COMTD1 has been confirmed to be detectable in the blood at the methylation level [43]. This finding supports the potential of this gene as a biomarker for fetal intolerance prediction and monitoring.
DLGAP2, the third gene, encodes a membrane-associated protein and has been widely reported to participate in the molecular organization of synapses and in neuronal cell signaling [47,48]. In 2019, an independent study confirmed that the methylation status of this gene in the blood reflects maternal blood sugar level and maternal insulin sensitivity during pregnancy [49,50]. Considering that maternal blood sugar level is also pathologically correlated with fetal intolerance [51,52], this gene can be regarded as a potential biomarker for fetal intolerance monitoring and diagnosis.
HEG1 is an important regulator of heart and vessel development during the early developmental stage [53]. In 2019, a report indicated that abnormal methylation regulation of this gene may induce trophoblast invasion at the maternal-fetal interface, thereby inducing a high level of maternal psychological distress [54] and abnormal development of the fetal heart [54,55], although this has not been validated in humans. Considering that fetal heart development has also been predicted to be correlated with fetal intolerance [56][57][58], HEG1 can be regarded as a potential biomarker for fetal intolerance prediction.
KIAA1875, also known as WDR97, has been reported as a blood biomarker with moderate functional annotation [59]. This gene has also been confirmed to be detectable at the epigenomic level in the blood [60]. Thus, KIAA1875 may act as a quality control biomarker for measuring the reliability of the samples, although no report to date has confirmed a specific role for it in fetal intolerance prediction.

Optimal rules for fetal intolerance diagnosis and monitoring
In addition to the qualitative analysis above, we set up a group of quantitative rules for diagnosing fetal intolerance in clinical application. These rules (Table 4) are used to evaluate the risk that a pregnancy is affected by fetal intolerance. Five methylation sites with a specific methylation tendency (hypermethylation or hypomethylation) contribute to the rules. These sites are annotated to two genes: COMTD1 and NHEDC1. Both genes are functionally linked to fetal intolerance, as discussed above.
According to the rules, hypermethylation of NHEDC1 and hypomethylation of COMTD1 contribute to the identification of patients with fetal intolerance. Recent studies have shown that the methylation of NHEDC1 can indicate the onset of fetal intolerance [16], which supports this prediction. COMTD1 is functionally correlated with fetal intolerance, and its hypomethylation may cause congenital disorders [41], which may in turn induce pathological fetal intolerance. Therefore, these quantitative rules contribute to the accurate prediction of fetal intolerance using blood samples.

Conclusion
In summary, the optimal genes and rules identified in this study are supported by recent publications, which in turn supports the efficacy and accuracy of our predictions. With the computational approaches presented here, blood gene methylation profiling of certain biomarkers may be accurate and effective enough for the clinical monitoring of fetal intolerance during pregnancy. This work may therefore not only reveal several potential pathological factors of fetal intolerance but also establish a set of potential diagnostic standards (biomarkers and rules) for its clinical monitoring and diagnosis.

Supporting information
S1 Table. List of features ranked by mRMR.