The prediction of late-onset preeclampsia: Results from a longitudinal proteomics study

Background Late-onset preeclampsia is the most prevalent phenotype of this syndrome; nevertheless, only a few biomarkers for its early diagnosis have been reported. We sought to correct this deficiency using a high-throughput proteomic platform. Methods A case-control longitudinal study was conducted, including 90 patients with normal pregnancies and 76 patients with late-onset preeclampsia (diagnosed at ≥34 weeks of gestation). Maternal plasma samples were collected throughout gestation (normal pregnancy: 2–6 samples per patient, median of 2; late-onset preeclampsia: 2–6, median of 5). The abundance of 1,125 proteins was measured using an aptamer-based proteomics technique. Protein abundance in normal pregnancies was modeled using linear mixed-effects models to estimate mean abundance as a function of gestational age. Data were then expressed as multiples-of-the-mean (MoM) values in normal pregnancies. Multi-marker prediction models were built using data from one of five gestational age intervals (8–16, 16.1–22, 22.1–28, 28.1–32, 32.1–36 weeks of gestation). The predictive performance of the best combination of proteins was compared to placental growth factor (PlGF) using bootstrap.
Results 1) At 8–16 weeks of gestation, the best prediction model included only one protein, matrix metalloproteinase 7 (MMP-7), that had a sensitivity of 69% at a false positive rate (FPR) of 20% (AUC = 0.76); 2) at 16.1–22 weeks of gestation, MMP-7 was the single best predictor of late-onset preeclampsia with a sensitivity of 70% at a FPR of 20% (AUC = 0.82); 3) after 22 weeks of gestation, PlGF was the best predictor of late-onset preeclampsia, identifying 1/3 to 1/2 of the patients destined to develop this syndrome (FPR = 20%); 4) 36 proteins were associated with late-onset preeclampsia in at least one interval of gestation (after adjustment for covariates); 5) several biological processes, such as positive regulation of vascular endothelial growth factor receptor signaling pathway, were perturbed; and 6) from 22.1 weeks of gestation onward, the set of proteins most predictive of severe preeclampsia was different from the set most predictive of the mild form of this syndrome. Conclusions Elevated MMP-7 early in gestation (8–22 weeks) and low PlGF later in gestation (after 22 weeks) are the strongest predictors for the subsequent development of late-onset preeclampsia, suggesting that the optimal identification of patients at risk may involve a two-step diagnostic process.


Introduction
Maternal hemodynamic status differs in patients with early- and late-onset preeclampsia [79]. These differences can be identified as early as 24 weeks of gestation. Those who develop late-onset preeclampsia have increased cardiac output and relatively unchanged total vascular resistance [79], whereas patients with early-onset preeclampsia have lower cardiac output and relatively increased vascular resistance.
New technology, not based on antigen-antibody reactions, has been developed to increase the number of proteins that can be detected simultaneously with a high degree of sensitivity and dynamic range [127,128]. This aptamer-based method uses single-strand DNA or RNA molecules that bind to proteins, peptides, or other pre-defined molecules with high affinity and specificity. The use of aptamer technology for the discovery of biomarkers for cardiovascular disease [129] and other medical conditions [130][131][132][133][134][135] has recently been reported, and we have previously reported changes in the maternal plasma proteome as a function of gestational age [128]. Therefore, we used this high-throughput proteomic platform to identify proteins that can serve as biomarkers for the identification of patients who subsequently develop late-onset preeclampsia.

Study design
A retrospective nested case-control study was designed to include 90 patients with normal pregnancies (controls) and 76 patients with late-onset preeclampsia (defined as preeclampsia diagnosed at ≥34 weeks of gestation). Patients were enrolled between February 2007 and December 2013 as part of a longitudinal cohort study conducted at the Center for Advanced Obstetrical Care and Research of the Perinatology Research Branch, NICHD/NIH/DHHS, the Detroit Medical Center and Wayne State University. Plasma samples were collected at the time of each prenatal visit scheduled at four-week intervals from the first or early second trimester until delivery. Each patient had at least two samples collected during the following gestational age intervals: 8-<16 weeks, 16-<24 weeks, 24-<28 weeks, 28-<32 weeks, 32-<37 weeks and >37 weeks. The median number (range) of samples per patient was 5 (2-6) for cases and 2 (2-6) for controls. All patients provided written informed consent, and the use of biological specimens, as well as clinical and ultrasound data, for research purposes was approved by the Wayne State University Human Investigation Committee and the Institutional Review Board of NICHD.
at least two occasions, 4 hours to 1 week apart) and proteinuria (≥300 mg in a 24-hour urine collection, or two random urine specimens obtained 4 hours to 1 week apart containing ≥1+ by dipstick or one dipstick demonstrating ≥2+ protein) [83,136].
Early-onset preeclampsia was defined as preeclampsia diagnosed before 34 weeks [83]. Late-onset preeclampsia was defined as preeclampsia diagnosed at or after 34 weeks of gestation.

Proteomic analysis
Maternal plasma protein abundance was determined using the SOMAmer (Slow Off-rate Modified Aptamers) platform and its reagents that allowed the abundance of 1,125 proteins to be profiled [138][139][140]. Proteomics profiling services were provided by Somalogic, Inc. (Boulder, CO, USA) in December 2014.
The serum samples were diluted and then incubated with the respective SOMAmer mixes pre-immobilized onto streptavidin-coated beads. The beads were washed in order to remove all non-specifically bound proteins and other matrix constituents. Proteins that remained bound to their cognate SOMAmer reagents were tagged using an NHS-biotin reagent. After the labeling reaction, the beads were exposed to an anionic competitor solution to prevent non-specific interactions from reforming after disruption.
Using this approach, pure cognate-SOMAmer complexes and unbound (free) SOMAmer reagents are released from the streptavidin beads using ultraviolet light that cleaves the photocleavable linker used to quantitate proteins. The photo-cleavage eluate, which contains all SOMAmer reagents (some bound to a biotin-labeled protein and some free), was separated from the beads and then incubated with a second streptavidin-coated bead that binds the biotin-labeled proteins and the biotin-labeled protein-SOMAmer complexes. The free SOMAmer reagents were then removed using subsequent washing steps. In the final elution step, protein-bound SOMAmer reagents were released from their cognate proteins using denaturing conditions. These SOMAmer reagents were then quantified by hybridization to custom DNA microarrays. The Cyanine-3 signal from the SOMAmer reagent was detected on microarrays and used for quantification [138][139][140].
Adjustment of the overall signal within a single plate (85 samples processed per plate/run) was performed in three steps: Hybridization Control Normalization, Median Signal Normalization, and Calibration, using the manufacturer's protocol. Outlier protein abundance values above 2 × the 98th percentile of all samples were replaced with 2 × the 98th percentile of all samples (data thresholding) (See S1 File for the protein abundance data after the thresholding step). Protein abundance was then log (base 2) transformed to improve normality. Linear mixed-effects models with cubic splines (number of knots = 3) were used to model protein abundance in controls as a function of gestational age using the lme4 package [141] under the R statistical language and environment (www.r-project.org). Data for all samples were then expressed as multiples-of-the-mean (MoM) values for the corresponding gestational age in normal pregnancies.
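The thresholding, log transformation, and MoM steps can be illustrated with a short Python sketch on synthetic data. This is not the pipeline used in the study: a plain cubic polynomial fitted by least squares stands in for the linear mixed-effects spline model fitted with lme4 (so repeated measurements within a patient are ignored), the MoM is taken as the ratio to the fitted mean on the log2 scale, and all function and variable names are hypothetical.

```python
import numpy as np

def threshold_outliers(abundance):
    """Replace values above 2 x the 98th percentile of all samples with that cap."""
    cap = 2.0 * np.percentile(abundance, 98)
    return np.minimum(abundance, cap)

def fit_control_mean(ga_weeks, log2_abundance, degree=3):
    """Assumed stand-in for the mixed-effects spline fit: polynomial least squares."""
    coefs = np.polyfit(ga_weeks, log2_abundance, degree)
    return lambda ga: np.polyval(coefs, ga)

def log2_mom(ga_weeks, log2_abundance, mean_fn):
    """log2 MoM: observed log2 abundance minus the expected value at that gestational age."""
    return log2_abundance - mean_fn(ga_weeks)

# Synthetic control samples (illustrative values only)
rng = np.random.default_rng(0)
ga_ctrl = rng.uniform(8, 40, 200)
raw_ctrl = 2.0 ** (10 + 0.05 * ga_ctrl + rng.normal(0, 0.2, 200))
raw_ctrl[0] = 1e9                                   # artificial outlier

capped = threshold_outliers(raw_ctrl)               # data thresholding
log2_ctrl = np.log2(capped)                         # log (base 2) transform
mean_fn = fit_control_mean(ga_ctrl, log2_ctrl)      # expected abundance by gestational age
log2_mom_ctrl = log2_mom(ga_ctrl, log2_ctrl, mean_fn)
mom_ctrl = 2.0 ** log2_mom_ctrl                     # MoM on the linear scale
```

In this sketch the log2 MoM values of the controls average to zero by construction, so a log2 MoM of 0.5 in any sample would correspond to an abundance about 1.4 times the expected value for that gestational age.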
Development of multi-marker prediction models. The goal of this analysis was to develop parsimonious, accurate prediction models by using protein abundance in each gestational age interval separately (8-16, 16.1-22, 22.1-28, 28.1-32, 32.1-36 weeks of gestation) applying predictive modeling techniques for omics data that we previously reported [142][143][144]. Log (base 2) MoM values for one protein at a time were used to fit a Linear Discriminant Analysis (LDA) model, and compute, by leave-one-out cross-validation (LOOCV), a classification performance measure for each protein. This performance measure was the partial Area Under the Receiver Operating Characteristic (ROC) curve (pAUC) using a cut-off of 0.5 false positive rate. The use of a partial as opposed to the full area under the ROC curve was chosen to emphasize the need to find proteins that have high sensitivity at low false positive rates. Further, proteins that did not change at least 10% in average abundance between the groups were removed from the analysis. Then, LDA models were fit using increasing sets of up to 5 of the top proteins ranked by the pAUC. To enforce model parsimony, the inclusion of each additional protein was conditioned on the increase of 0.01 units in a pAUC statistic. Classification performance indices [AUC, sensitivity, specificity, positive and negative predictive values, likelihood ratio (+) and (-)] were obtained for the best combinations of markers in each interval by LOOCV. While this accounts for biases due to over-fitting of the data for a given set of selected proteins, it does not account for the fact that those proteins were selected from a large pool of candidate predictors. Therefore, classification performance indices were also obtained using bootstrap. With this approach, after data transformation into MoM, patients (both cases and controls) were selected with replacement. 
All analysis steps involved in the prediction model development (including selection of predictor proteins) were performed using only data from the selected patients (training set) and prediction performance was calculated by applying the resulting model on the patients left out (test set). Averages of 100 such bootstrap iterations are reported and the discussion of results is based primarily on these performance estimates since they are considered most robust.
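The difference between LOOCV of a fixed marker and a bootstrap that repeats marker selection can be sketched as follows. This is a simplified illustration under stated assumptions rather than the study's implementation: only the single best protein is selected (the 10% abundance-change filter and the stepwise addition of up to five proteins are omitted), the data are synthetic, 20 rather than 100 bootstrap iterations are run, and every function name is hypothetical.

```python
import numpy as np

def lda_fit(x, y):
    """One-feature LDA: class means, pooled variance, and case prior."""
    x0, x1 = x[y == 0], x[y == 1]
    ss = ((x0 - x0.mean()) ** 2).sum() + ((x1 - x1.mean()) ** 2).sum()
    return x0.mean(), x1.mean(), ss / (len(x) - 2), len(x1) / len(x)

def lda_score(model, x):
    """Log posterior odds of being a case (higher = more case-like)."""
    m0, m1, var, p1 = model
    return (m1 - m0) / var * (x - (m0 + m1) / 2) + np.log(p1 / (1 - p1))

def partial_auc(y, score, max_fpr=0.5):
    """Area under the empirical ROC curve restricted to FPR <= max_fpr."""
    cases = np.sort(score[y == 1])
    ctrls = np.sort(score[y == 0])[::-1]          # highest-scoring controls first
    n0 = len(ctrls)
    # Sensitivity reached just before each control becomes a false positive
    tpr = 1.0 - np.searchsorted(cases, ctrls, side="right") / len(cases)
    width = np.clip(max_fpr - np.arange(n0) / n0, 0.0, 1.0 / n0)
    return float(np.sum(tpr * width))

def loocv_pauc(x, y):
    """pAUC of scores obtained by leaving each patient out in turn."""
    idx = np.arange(len(x))
    scores = np.array([lda_score(lda_fit(x[idx != i], y[idx != i]), x[i]) for i in idx])
    return partial_auc(y, scores)

def best_protein(X, y):
    """Index of the protein with the highest LOOCV pAUC."""
    return int(np.argmax([loocv_pauc(X[:, j], y) for j in range(X.shape[1])]))

def bootstrap_pauc(X, y, n_iter=20, seed=0):
    """Repeat selection and fitting on resampled patients; test on those left out."""
    rng = np.random.default_rng(seed)
    n, results = len(y), []
    for _ in range(n_iter):
        train = rng.integers(0, n, n)                 # patients drawn with replacement
        test = np.setdiff1d(np.arange(n), train)      # patients left out of this draw
        if len(np.unique(y[train])) < 2 or len(np.unique(y[test])) < 2:
            continue
        j = best_protein(X[train], y[train])          # selection uses training data only
        model = lda_fit(X[train, j], y[train])
        results.append(partial_auc(y[test], lda_score(model, X[test, j])))
    return float(np.mean(results))

# Synthetic log2 MoM data: 40 controls, 40 cases, 20 proteins; only protein 0 is informative
rng = np.random.default_rng(42)
y = np.repeat([0, 1], 40)
X = rng.normal(size=(80, 20))
X[y == 1, 0] += 1.5
top = best_protein(X, y)
oob_pauc = bootstrap_pauc(X, y)
```

With a cut-off of 0.5 false positive rate, the pAUC ranges from 0 to 0.5, and an uninformative marker is expected to score about 0.125. Because marker selection is repeated inside each bootstrap iteration, the averaged out-of-sample pAUC reflects both sources of over-fitting described above.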
Differential abundance analysis. Since the classifier development pipeline described above is focused on finding the most accurate, parsimonious set of proteins that predict late-onset preeclampsia, it will not necessarily retain all proteins showing evidence of differential abundance. Therefore, a complementary analysis was performed to test for differences between mean log (base 2) MoM values between cases and controls at each gestational age interval. Linear models with coefficient significance evaluated via moderated t-tests were implemented using the limma package [145] of Bioconductor repository [146]. With this procedure, standard deviation estimates of log2 MoM values for each protein are shrunk toward a common (pooled) value to improve robustness. Significance was inferred based on the false discovery rate adjusted p-value (q-value) <0.25 and fold-change in abundance >1.1 fold after adjusting for BMI, smoking status, maternal age, and parity.
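The significance criteria can be made concrete with a small Python sketch. It assumes that nominal p-values and log2 fold changes are already available (for example, from the moderated t-tests, which are not reimplemented here) and that the false discovery rate adjustment is of the Benjamini-Hochberg type; the function names and the example numbers are hypothetical.

```python
import numpy as np

def bh_qvalues(p):
    """Benjamini-Hochberg adjusted p-values (assumed FDR method)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    q_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(q_sorted, 1.0)
    return q

def select_differential(p, log2_fc, q_cut=0.25, fc_cut=1.1):
    """Flag proteins with q < q_cut and a fold change above fc_cut in either direction."""
    q = bh_qvalues(p)
    return (q < q_cut) & (np.abs(np.asarray(log2_fc)) > np.log2(fc_cut))

# Made-up p-values and log2 fold changes for five proteins
p_values = [0.001, 0.01, 0.04, 0.3, 0.8]
log2_fold_changes = [0.3, 0.05, -0.2, 0.5, 0.4]
q_values = bh_qvalues(p_values)
significant = select_differential(p_values, log2_fold_changes)
```

In this toy example, the second protein passes the q-value cut-off but is excluded because its fold change is below 1.1, which illustrates why both criteria are applied together.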
Gene ontology enrichment analysis. Proteins selected as differentially abundant between late-onset preeclampsia and normal pregnancy in each interval of gestation were mapped to Entrez gene identifiers [147] based on Somalogic, Inc., annotation, and then to gene ontology [148]. Biological processes over-represented among the proteins that changed with late-onset preeclampsia were identified using a Fisher's exact test. Gene ontology terms with three or more hits and q-values <0.1 were considered significantly enriched.
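An over-representation test of this kind can be sketched with the Python standard library alone. The sketch assumes a one-sided Fisher's exact test (equivalent to a hypergeometric tail probability) and assumes that the background is the set of all proteins measured on the platform; neither detail is specified above, and the function names and counts are hypothetical.

```python
from math import comb

def fisher_enrichment_p(hits, selected, annotated, background):
    """One-sided p-value: probability of at least `hits` annotated proteins
    among `selected` proteins drawn from `background`, of which `annotated`
    carry the gene ontology term."""
    total = comb(background, selected)
    upper = min(selected, annotated)
    tail = sum(comb(annotated, k) * comb(background - annotated, selected - k)
               for k in range(hits, upper + 1))
    return tail / total

def enrichment_odds_ratio(hits, selected, annotated, background):
    """Sample odds ratio of the corresponding 2 x 2 table."""
    a = hits
    b = selected - hits
    c = annotated - hits
    d = background - selected - annotated + hits
    return (a * d) / (b * c) if b * c > 0 else float("inf")

# Toy example: 3 selected proteins, 2 of them annotated, 4 annotated among 10 in total
p_small = fisher_enrichment_p(hits=2, selected=3, annotated=4, background=10)
or_small = enrichment_odds_ratio(hits=2, selected=3, annotated=4, background=10)
```

The resulting p-values for all tested terms would then be adjusted for multiple testing before applying the q-value and minimum-hit criteria.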

Clinical characteristics of the study population
Women with late-onset preeclampsia had a lower median gestational age at delivery (p<0.001) and a higher median maternal BMI (p = 0.03) than the controls. Thirty-seven percent (28/76) of cases had severe preeclampsia and 63% (48/76) had mild preeclampsia. Median gestational age at delivery was lower both in patients who had mild preeclampsia and in those who had severe preeclampsia than in the controls (p<0.001), but the median maternal BMI was higher than the controls only in patients who had severe preeclampsia (p = 0.01) (Table 1).

Proteomic prediction models for late-onset preeclampsia prior to diagnosis
Fig 1 depicts a summary of the LOOCV (black segments for best combination of markers; red segments, PlGF alone) and bootstrap (bars with 95% CI) based performance estimates for the prediction of late preeclampsia. The bootstrap estimates of AUC (Fig 1A) and sensitivity at a 20% false positive rate (FPR) (Fig 1B) achieved by the best combinations of proteins were significantly higher than those of PlGF in the first two gestational age intervals (8-16 and 16.1-22 weeks) (bars higher than red line segments).
At 8-16 weeks of gestation, the best combination of proteins included only matrix metalloproteinase 7 (MMP-7), which had a sensitivity of 69% at a FPR of 20% (black segment on top of the red bar at 8-16 weeks, Fig 1B) and 57% at a FPR of 10% (AUC = 0.79; see black segment at 8-16 weeks, Fig 1A and Table 2). Individual patient longitudinal MMP-7 profiles are depicted in Fig 2A, highlighting the differences in the samples taken between 8-16 weeks of gestation. When random sets of cases and controls were selected with replacement, and the entire procedure to build the classification model was repeated, MMP-7 was chosen as the best predictor in 88 of the 100 bootstrap trials and the typical (mean) AUC of the prediction model was 0.76 (see Table 3, and red bar at 8-16 weeks Fig 1A). The consistency of bootstrap-based (AUC = 0.76) and final model estimates (AUC = 0.79) of prediction performance suggests minimal to no data over-fitting. The second most frequently selected predictor protein (23/100 iterations) either by itself or in combination with other proteins was BMP-1 (AUC = 0.74) (see Fig 2B).
At 16.1-22 weeks of gestation, MMP-7 was again the single best predictor of late-onset preeclampsia with a sensitivity of 68% at a FPR of 20%, and 62% at a FPR of 10% (AUC = 0.83; 0.82 bootstrap estimate) (Tables 2 and 3). Longitudinal MMP-7 profiles emphasizing the differences in the samples taken between 16.1 to 22 weeks of gestation are shown in Fig 3. MMP-7 was selected in the best model of 94 of the 100 bootstrap trials with the next most frequently selected proteins HMG-1 (high-mobility group protein box-1) and gpIIbIIIa (Integrin alpha-IIb: beta-3 complex) being selected only 18 and 17 times, respectively.
At 22.1-28 weeks of gestation, the proteomics profile predicted late-onset preeclampsia with a sensitivity of 48% at a FPR of 20% and with a sensitivity of 23% at a FPR of 10% (AUC = 0.72). The two proteins included in the final model at this gestational age interval were RAN (RAs-related Nuclear protein, also known as GTP-binding nuclear protein Ran) and METAP1 (Methionine aminopeptidase 1). However, the bootstrap-estimated performance of combinations of proteins at this gestational age interval was substantially lower (29% sensitivity at a FPR of 20%, AUC = 0.55): PlGF (Fig 4: longitudinal profiles) was selected most frequently in the best model (24/100 times) followed by METAP1 (16/100), MMP-7 (15/100) and RAN (12/100) (Table 3).
Prediction performance for late-onset preeclampsia at the 28.1-32 and 32.1-36 week intervals did not exceed the values obtained at the 8-16 and 16.1-22 week intervals, with proteins such as RAN, Calcium/calmodulin-dependent protein kinase type II alpha chain (CAMK2A), PlGF, tissue factor (TF), and Cathepsin B being among the most frequently (14 to 44 times out of 100) included as predictors in the optimal LDA prediction models for late-onset preeclampsia (Table 3).

Prediction of late-onset preeclampsia according to its severity
When severe and mild late-onset preeclampsia cases were compared separately against the controls, the estimated prediction performance of multi-protein models was very similar to the one for overall late-onset preeclampsia (Fig 1A and 1B and Tables 2 and 3). Except for the 8-16 and 16.1-22 week intervals, in which MMP-7 was selected as the best model in a majority of bootstrap trials, there were differences in the top proteins included for prediction of subsequent mild as opposed to severe late-onset preeclampsia (Table 3). PlGF, PTP-1B (Tyrosine-protein phosphatase non-receptor type 1), and FCN2 (Ficolin-2) were the most frequently selected to predict severe preeclampsia (10-24/100 times) while RAN, TF, FER, and Cathepsin B were the most frequently selected in the best combinations of predictors of mild late-onset preeclampsia (Table 3). Since combinations of proteins did not perform any better than PlGF alone, we describe only the prediction performance indices for PlGF in the intervals from 22.1-36 weeks of gestation (see red line segments in Fig 1): at 22.1-28 weeks, the sensitivity of PlGF was 53% (FPR = 20%) for overall late-onset preeclampsia (50% for mild and 59% for severe preeclampsia) (Fig 1B); at 28.1-32 weeks, the sensitivity of PlGF was 36% (FPR = 20%) for overall late-onset preeclampsia (30% for mild and 46% for severe preeclampsia) (Fig 1B); and at 32.1-36 weeks, the sensitivity of PlGF was 56% (FPR = 20%) for overall late-onset preeclampsia (45% for mild and 69% for severe preeclampsia) (Fig 1B).

Differential protein abundance summary
In addition to the few proteins that were included in the parsimonious models predictive of late-onset preeclampsia at different gestational age intervals (Table 2), 36 additional proteins showed evidence for differential abundance after adjusting for BMI, smoking status, maternal age, and parity (q-value<0.25 and fold change >1.1) in at least one interval of gestation. Table 4 shows the linear fold-changes in the MoM values between late-onset preeclampsia and the control groups, as well as the nominal and FDR adjusted p-values (q-values) for each gestational age interval. The heatmap summarizes the differential abundance patterns across all gestational age intervals considered (Fig 5 and Table 4). Notably, the abundance of MMP-7, CDK8/cyclin C (Cyclin-dependent kinase 8:Cyclin-C complex), PPID (Peptidylprolyl isomerase D), and RAN was higher while the abundance of HSP70 (Heat shock 70 kDa protein 1A/1B) was lower in cases compared to the controls in the first three gestational age intervals (8-16, 16.1-22, 22.1-28 weeks).
Of the 36 proteins associated with late-onset preeclampsia in at least one gestational age interval, 11 (31%) were among those modulated during gestation in normal pregnancy [149] (OR = 4.3, p<0.001) (Table 4). This supports our prediction that proteins that change with gestation in normal pregnancy could be helpful in understanding obstetrical complications and may serve as biomarkers for the prediction of these disorders.

Biological processes perturbed in late-onset preeclampsia
Gene ontology analysis of the proteins that changed significantly between the cases and controls was performed for each gestational age interval. Despite the inherent limited power of such analysis (due to few significant proteins at each gestational age interval), we have identified biological processes perturbed in late-onset preeclampsia. These gene ontologies included: small molecule metabolic process and positive regulation of apoptotic process at 8-16 weeks, and positive regulation of vascular endothelial growth factor receptor signaling pathway, positive regulation of cell adhesion, and extracellular matrix organization at 16-22 weeks (OR = 3.1-38.1, all q<0.1) (Table 5).

98], and a relatively high rate of thrombocytopenia, elevated liver enzyme abnormalities, and the HELLP syndrome [150]. By contrast, late-onset preeclampsia is thought to result from a mismatch between the nutrient supply by the mother and the metabolic demands of the fetus at the end of pregnancy [77-79].
Typically, the placenta is of normal weight [92]; it is less likely to have maternal vascular lesions of underperfusion than the placenta in early-onset disease [90][91][92][93]; fetuses are frequently of appropriate or large birth weight for gestational age [105][106][107][108]; and the uterine arteries and umbilical artery Doppler velocimetries are generally within normal range [79]. Late-onset preeclampsia is more likely to occur in obese patients [151][152][153]. Cardiac output, total vascular resistance, and the morphology of the left ventricle, as determined by echocardiography, are also different in early- and late-onset preeclampsia by 24 weeks of gestation [79].
Because the etiologies of early- and late-onset preeclampsia are different, biomarkers predicting their development are expected to diverge. For example, the concentrations of PlGF and anti-angiogenic factors (sFlt-1 and sEng) are good predictors of early-onset preeclampsia, but not of late-onset disease. We undertook the discovery of biomarkers focusing exclusively on late-onset disease: we and other investigators previously addressed the prediction of early-onset preeclampsia [99,101,119-123].
MMP-7, a predictor of late-onset preeclampsia. Elevated abundance of MMP-7 in maternal plasma before 22 weeks of gestation was the strongest predictor of late-onset preeclampsia. This matrix metalloproteinase, also called matrilysin, is involved in the degradation of several types of collagen (III, IV, V, IX, X, XI), proteoglycans, fibronectin, elastin, and casein [154]. It is the smallest MMP that circulates in the blood. The main form is pro-MMP-7, which is enzymatically inactive. MMP-7 is involved in innate immune processes, mainly in the lung and gut, due to its proteolytic activity that activates α-defensins (anti-bacterial peptides able to disrupt bacterial membrane) [154]. Indeed, silencing MMP-7 in mice will result in the inability to activate pro-α-defensins in the gut and a higher susceptibility to intestinal bacterial infections [155]. MMP-7 also has an important role in releasing TNF-α from macrophages; and it is involved in the transepithelial migration of neutrophils by cleaving syndecan-1, the main heparan sulphate proteoglycan on the epithelium. Maternal plasma concentration of TNF-α is elevated in preeclampsia [156,157].
Recent evidence suggests that MMP-7 may play a role in atherosclerotic disease, which has many parallels to preeclampsia [158]. Indeed, the SUMMIT Consortium (surrogate markers for micro- and macrovascular hard endpoint for innovative diabetes tools) reported that circulating MMP-7 concentrations were higher in patients with Type 2 diabetes mellitus, correlated with patients' age, and were independently associated with the prevalence of cardiovascular disease and the burden of atherosclerosis as well as arterial stiffness and plaque inflammation. Baseline MMP-7 concentrations were elevated in patients who had a coronary event during the study period [158]. Circulating concentrations of MMP-7 are significantly higher in patients with histologically unstable atherosclerotic carotid lesions compared to patients with stable lesions [159]. In addition, markedly higher mRNA levels of MMP-7 were found within carotid plaques than in arteries without plaques [160]. MMP-7 within the carotid plaques was primarily localized in macrophages [160,161], and in vitro studies showed that combined stimulation of inflammatory mediators (TNF-α), oxidized LDL, and hypoxia markedly increased MMP-7 expression in monocytes [160]. In atherosclerotic plaques, MMP-7 is expressed by lipid-laden macrophages [161], the same cells present in acute atherosis of the spiral arteries, a lesion associated with preeclampsia [162,163]. Thus, MMP-7 may contribute to plaque destabilization in patients with carotid artery stenosis. Abbas et al. [160] reported that MMP-7 concentrations were especially high if the patients were symptomatic within the two months prior to sampling. Moreover, high plasma concentrations of MMP-7 in these patients were independently associated with total mortality [160].
During pregnancy, MMP-7 is expressed in the decidua and trophoblast. In the first trimester, uterine NK cells and macrophages abundant in the decidua express MMP-7; and matrilysin may have a role in the process of transformation of the spiral arteries, because 50%-75% of leukocytes infiltrating and remodeling the vessels are positive for MMP-7 and MMP-9 [164]. During normal pregnancy there is constant expression of MMP-7 in the intermediate trophoblast and decidual cells throughout gestation [165].
Matrilysin is associated with pregnancy complications: 1) its amniotic fluid concentrations are elevated in women with preterm labor and intact membranes who deliver preterm regardless of the presence of intra-amniotic infection [166]; 2) in the placentas of patients with severe preeclampsia, there is extensive immunostaining of all layers of villous trophoblast for MMP-7 [165]; 3) by contrast, in placentas from patients with severe early-onset preeclampsia with fetal growth restriction, the interstitial trophoblast cell expression of MMP-3 and MMP-7 is markedly reduced [167]. The authors attributed this finding to the fact that decidual NK cells aggregated near the spiral arteries secrete leukemia inhibitory factor (LIF) that suppresses the expression of MMPs. This may impede the physiological transformation of the spiral arteries, which has been implicated in the pathophysiology of early-onset preeclampsia [167]. Collectively, these reports suggest that MMP-7 may be involved in two fundamental processes associated with the development of preeclampsia: placentation and inflammation.
A previous study reported that MMP-2 is elevated in maternal urine as early as 12-16 weeks of gestation, and an elevated concentration of MMP-2 at 12 weeks predicted the development of preeclampsia with a sensitivity of 100% and a specificity of 62.5%; at 16 weeks of gestation, an elevated MMP-2 in maternal urine predicted the development of preeclampsia with a sensitivity of 87.5% and a specificity of 74.1% [168]. The urine concentration of MMP-7 in patients who subsequently developed preeclampsia did not differ from that in patients with a normal pregnancy, but the study could not differentiate between patients who subsequently developed early- and late-onset preeclampsia [168].
What are the differences in the proteomic profile between patients with mild and severe late-onset preeclampsia? The severity of preeclampsia has major implications for maternal and neonatal outcomes. Patients with mild disease need only timely delivery and observation. By contrast, women with severe preeclampsia have a high rate of maternal morbidity, including eclampsia, abruption, elevated liver enzymes, and emergency cesarean delivery [20]. Therefore, early identification of women who will subsequently develop severe preeclampsia is important as they may benefit from a timely delivery prior to the onset of severe preeclampsia [169].
Until 22 weeks of gestation, MMP-7 was the most predictive protein for the development of late-onset preeclampsia, either in mild or severe form. After 22 weeks, we observed differences in the set of proteins most predictive of mild or severe preeclampsia. PlGF optimally identified patients destined to develop severe preeclampsia at 22.1-28 and 32.1-36 weeks of gestation, whereas patients destined to develop mild preeclampsia were better predicted by a different set of proteins at each gestational age. These proteins are involved in angiogenesis (e.g., PlGF), coagulation (e.g., tissue factor), cell division (e.g., RAs-related Nuclear protein), and cell-to-cell interaction (e.g., tyrosine-protein kinase Fer). The finding that, after 22 weeks of gestation, PlGF is the best predictor of late-onset preeclampsia, especially in its more severe form, is consistent with previous reports [61,115,170-173]. We and others [55,70,174] presented the use of this angiogenic factor as a tool for the assessment of the impending risk for preeclampsia, demonstrating lower concentrations of PlGF in cases when compared to controls as early as at least six weeks prior to the onset of the disease [61]. Moreover, the determination of this angiogenic factor has prognostic value in patients presenting to the obstetrical triage area with suspected preeclampsia for the identification of those requiring delivery due to impending preeclampsia [171,172].
Identification of patients who subsequently developed late-onset preeclampsia may warrant a two-stage assessment approach. The comparison of the proteomic prediction models built to predict subsets of cases based on the severity of this syndrome suggests that we may need a two-step approach for the prediction of late-onset preeclampsia. This is similar to the current paradigm for the identification of patients at risk of aneuploidy, for which a two-step model has been used (the first step, at 11-13 weeks of gestation, includes nuchal translucency and biochemical markers such as hCG and PAPP-A; the second, at 17 weeks, includes alpha-fetoprotein, hCG, and E3, as well as inhibin in cases of quad test) to generate an integrated risk that serves as the basis for further diagnostic tests, e.g., amniocentesis to diagnose aneuploidy [175][176][177][178]. Unlike the detection of patients at risk for early-onset preeclampsia, in which maternal background characteristics, PlGF concentration, and maternal blood pressure at the time of sample collection can identify the majority of patients at risk for the development of this syndrome [82,114,118,119], our study indicates that optimal prediction of late-onset preeclampsia may involve two diagnostic steps: the first assessment during early gestation (8-22.1 weeks), using MMP-7, and the second one later during the third trimester (28.1-32 weeks). Until 22.1 weeks, MMP-7 has the highest predictive performance for the identification of patients at risk to develop late-onset preeclampsia regardless of severity, whereas, after 22 weeks, the set of optimal proteomic predictors differs according to the severity of late-onset preeclampsia. This has implications for clinical management, since those who are at risk for the development of severe late-onset preeclampsia may benefit from timely delivery near 37 weeks of gestation, while those who are destined to have a mild disease may continue pregnancy to term under close surveillance.
Strengths and limitations. The major strengths of this study are the large number of proteins tested, as well as its longitudinal design and the number of samples included in the analysis, especially during early stages of pregnancy. This is the first study to demonstrate that proteomic profiles identify patients destined to develop severe or mild late-onset preeclampsia as early as 16 weeks with a sensitivity that surpasses that of PlGF. Our study includes mainly African American women; this may limit the generalizability of our results to this ethnic group, which is at much higher risk to develop preeclampsia than other ethnic groups.
It is common in the field of high-dimensional biology to combine predictors (e.g., mRNAs, proteins, metabolites, etc.) in a logistic regression (or other type of prediction model) and report one set of predictive performance indices on the full set of patients used to select the predictors and fit models for this purpose. However, such approaches would lead to optimistically biased performance indices due to at least two sources of bias. The most important is the feature selection bias, since, when selecting from a large pool of candidate biomarkers, it is generally possible to find a few "biomarkers" that appear to predict the outcome better than expected by chance (e.g., AUC>0.5). The second source of bias comes from tuning (estimating) the weights (coefficients) of a predefined set of predictors to fit the available data. We avoided these common pitfalls by relying on bootstrap-estimated performance indices. With this procedure, predictor/feature selection and model fitting are repeated 100 times on data from a training set of patients while the model is tested on data from patients left out at each iteration. As shown in Fig 1, the LOOCV AUC and sensitivity estimates of the best combination of markers are in the worst case as low as those of PlGF, yet we only claim better prediction compared to PlGF alone in the first two intervals when the bootstrap-based estimates of multi-marker models are significantly higher than those of PlGF.
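The feature selection bias described above can be reproduced with a small simulation. The sketch below is purely illustrative and uses no study data: it generates random noise for 1,125 hypothetical "proteins" in 90 controls and 76 cases, selects the protein with the highest AUC in one half of the patients, and then evaluates that same protein in the other half. All names are assumptions made for the example.

```python
import numpy as np

def auc_per_feature(X, y):
    """Mann-Whitney AUC of every column of X for binary labels y (no ties assumed)."""
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    n1 = int(y.sum())
    n0 = len(y) - n1
    return (ranks[y == 1].sum(axis=0) - n1 * (n1 + 1) / 2) / (n0 * n1)

rng = np.random.default_rng(1)
n_controls, n_cases, n_proteins = 90, 76, 1125
y = np.r_[np.zeros(n_controls, dtype=int), np.ones(n_cases, dtype=int)]
X = rng.normal(size=(len(y), n_proteins))        # pure noise: no true biomarker exists

# Split patients at random into a selection half and a held-out half
idx = rng.permutation(len(y))
train, test = idx[: len(y) // 2], idx[len(y) // 2:]

train_auc = auc_per_feature(X[train], y[train])
best = int(np.argmax(train_auc))                 # "biomarker" chosen on the selection half
apparent_auc = float(train_auc[best])            # optimistic estimate
heldout_auc = float(auc_per_feature(X[test], y[test])[best])
```

Even though no protein carries any signal, the apparent AUC of the selected protein is well above 0.5, whereas its AUC in the held-out patients is expected to fall back toward 0.5. Repeating selection inside each resampling iteration is what protects the reported estimates from this effect.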
Also, we and others [143] have noted that, when high-dimensional data are used to build prediction models, the same prediction performance can be achieved with widely different sets of predictors, owing, among other reasons, to the correlation that may exist among them. Therefore, instead of emphasizing the sets of proteins identified in the final models (Table 2), we focused our inferences on the proteins that were selected as the best predictors most often during the 100 bootstrap iterations. For instance, PlGF was the most reliable predictor of late-onset preeclampsia in the interval 22.1-28 weeks, being included in the best combination 24/100 times; yet when all data were used to fit the final model, RAN and METAP1 appeared to be the best choices, even though they were selected in the best combination only 12 and 16 times, respectively, out of 100 bootstrap trials.
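The selection-frequency idea can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the data are simulated, `protein_A` and `protein_B` are hypothetical markers that both track the same latent signal (and are therefore correlated), and the selection rule is simply "highest training AUC".

```python
import random
from collections import Counter

def auc(scores, labels):
    """AUC via the Mann-Whitney statistic; returns 0.5 if a class is absent."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(7)
n = 60
y = [i % 2 for i in range(n)]

# Two correlated markers driven by one latent signal, plus five noise markers.
latent = [1.5 * yi + rng.gauss(0, 1) for yi in y]
data = {
    "protein_A": [v + rng.gauss(0, 0.3) for v in latent],
    "protein_B": [v + rng.gauss(0, 0.3) for v in latent],
}
for k in range(5):
    data[f"noise_{k}"] = [rng.gauss(0, 1) for _ in range(n)]

def best_predictor(rows):
    """Name of the marker with the highest AUC on the given patients."""
    labels = [y[i] for i in rows]
    return max(data, key=lambda name: auc([data[name][i] for i in rows], labels))

# Count how often each marker wins across 100 bootstrap resamples.
picks = Counter()
for _ in range(100):
    rows = [rng.randrange(n) for _ in range(n)]
    picks[best_predictor(rows)] += 1

# The single "final model" fitted on all patients names only one winner.
final_choice = best_predictor(range(n))
print("selection frequency:", picks.most_common())
print("final model on all data:", final_choice)
```

The final model reports a single marker, while the bootstrap counts show how the choice is shared between the correlated candidates, which is why selection frequency is the more stable basis for inference.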

Conclusion
We present herein new biomarkers, identified with a high-throughput proteomics method, for patients who will develop late-onset preeclampsia. We report that elevated MMP-7 early in gestation (8-22 weeks) and low PlGF later in gestation (after 22 weeks) are the strongest predictors for the subsequent development of late-onset preeclampsia, suggesting that the optimal identification of patients at risk may involve a two-step diagnostic approach. In addition, abnormal proteomic profiles before 22 weeks of gestation are associated with perturbation of several biological processes, including the positive regulation of the vascular endothelial growth factor receptor signaling pathway.

Supporting information
S1 File. Proteomics data used in the analyses presented in this study. Protein abundance data for each sample (rows) and each of the 1125 proteins are given in this table. ID: anonymized patient identifier; GA: gestational age at sampling; LatePE: 1 for late-onset preeclampsia and 0 for normal pregnancy. Protein symbols and names provided by SomaLogic, Inc., are included above the protein accession numbers. (CSV)
S2 File. Summary of differential abundance analysis between late-onset preeclampsia and normal pregnancy in five intervals of gestation. Thirty-six proteins that were significant (Sig.) (q<0.25 and fold change >1.1) in at least one interval are shown. Adjustment was performed for BMI, maternal age, parity and smoking. FC: linear fold change, with negative values denoting lower and positive values denoting higher levels in cases than in controls; p: p-value; q: adjusted p-value; GA: gestational age. The column labeled "Changes with GA" indicates whether the protein abundance changes with gestational age [128]