Systematic review of prediction models for gestational hypertension and preeclampsia.

Introduction
Prediction models for gestational hypertension and preeclampsia have been developed with data and assumptions from developed countries. Their suitability and application for low-resource settings have not been tested. This review aimed to identify and assess the methodological quality of prediction models for gestational hypertension and preeclampsia with reference to their application in low-resource settings.

Methods
Using combinations of keywords for gestational hypertension, preeclampsia and prediction models, seven databases were searched to identify prediction models developed with maternal data obtained before 20 weeks of pregnancy and including at least three predictors (PROSPERO registration CRD42017078786). Prediction model characteristics and performance measures were extracted using the CHARMS, STROBE and TRIPOD checklists. The National Institutes of Health quality assessment tools for observational cohort and cross-sectional studies were used for study quality appraisal.

Results
We retrieved 8,309 articles, of which 40 were eligible for review. Seventy-seven percent of all the prediction models combined biomarkers with maternal clinical characteristics. The biomarkers used as predictors in most models were pregnancy-associated plasma protein-A (PAPP-A) and placental growth factor (PlGF). Only five studies were conducted in a low- or middle-income country.

Conclusions
Most of the studies evaluated did not completely follow the CHARMS, TRIPOD and STROBE guidelines in prediction model development and reporting. Adherence to these guidelines will improve prediction modelling studies and the subsequent application of prediction models in clinical practice. Prediction models using maternal characteristics, with good discrimination and calibration, should be externally validated for use in low- and middle-income countries where biomarker assays are not routinely available.



Introduction
Hypertensive disorders of pregnancy (HDPs) are important causes of maternal morbidity and mortality globally, but the burden is greatest in low- and middle-income countries (LMICs) [1][2][3]. These disorders of pregnancy include gestational hypertension, preeclampsia and eclampsia, and are characterized by an increase in blood pressure and multi-organ derangements ranging from mild to severe [4]. There is no known cure, but daily administration of low-dose aspirin from early in the first trimester has been shown to reduce the incidence and severity of preeclampsia [5][6][7][8]. Preeclampsia is a major indication for preterm delivery, accounting for about 15% of all preterm deliveries [9][10][11][12][13], and increases healthcare costs through prolonged stays of the mother or newborn in intensive care units [14]. Prediction models provide estimates of the probability or risk of the future occurrence of a particular outcome or event in individuals at risk of such an event [15]. Prediction models have been used to identify women at high risk of developing HDPs later in pregnancy, so that closer monitoring can be provided from early pregnancy onwards, including low-dose aspirin prophylaxis [5][6][7][8].
The aim of this systematic review was to evaluate how effectively multivariable prediction models identify pregnant women at risk of gestational hypertension and preeclampsia. The objectives were to identify prediction models for gestational hypertension and preeclampsia; to assess the methodological quality of the studies that developed and externally validated these models using the CHARMS checklist [16]; and to identify prediction models that can be applied in low- and middle-income country settings.

Methods
This study was conducted using the critical appraisal and data extraction for systematic reviews of prediction modelling studies (CHARMS) [16], strengthening the reporting of observational studies in epidemiology (STROBE) [17] and the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) [18] checklists. The Population, Intervention, Comparator and Outcome (PICO) format for the review was as follows: P (pregnant women), I (prediction models), C (none) and O (gestational hypertension or preeclampsia). The study protocol was registered with the PROSPERO International Prospective Register of Systematic Reviews (CRD42017078786). The seven databases were searched through 18 September 2017, and the search was updated to 15 October 2019 (DLV, EA). The MeSH database, EMTREE subject headings and CINAHL subject headings were used to construct the search strategy, along with author keywords and general keywords. In addition, an electronic hand search was conducted in a number of journals from 10 September through 25 September 2017 and from 1 October to 15 October 2019. Finally, grey literature was searched using the New York Academy of Medicine Grey Literature, OCLC's OAIster, and OpenGrey databases.
The search strategy is provided as a supplementary file (S1 Data).

Eligibility/Inclusion criteria
Cohort studies, nested case-control studies and randomized controlled trials were eligible for inclusion. Case-control studies, cross-sectional studies, animal studies, bio-molecular studies, letters, reviews and case reports were excluded, because prediction modelling requires absolute risks, whereas case-control and cross-sectional designs yield only relative risks. The primary outcomes for the included studies were gestational hypertension and preeclampsia.

Definition of terms
Gestational hypertension was defined as elevated systolic blood pressure equal to or greater than 140 mmHg and/or diastolic blood pressure equal to or greater than 90 mmHg on at least two occasions four hours apart, appearing for the first time after 20 weeks of gestation without proteinuria [4]. Preeclampsia was defined as gestational hypertension with proteinuria of 300 mg or more in a 24-hour urine sample or a spot urine protein/creatinine ratio of 30 mg/mmol [4]. Some studies further divided preeclampsia as an outcome into early-onset preeclampsia (requiring preterm delivery before 34 weeks of gestation) and late-onset preeclampsia (delivery at or after 34 weeks of gestation) [19][20][21][22][23][24]. A prediction model [25] was defined as a logistic regression formula or a survival model with three or more predictors that could be used to estimate risk probabilities for individual patients or to distinguish between groups of patients at different risk.
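
To illustrate this definition, a logistic prediction model converts a linear combination of predictor values into an individual risk via the inverse logit. The sketch below uses entirely hypothetical predictors and coefficients, not values from any reviewed model:

```python
import math

def predicted_risk(intercept, coefs, values):
    """Inverse-logit of the linear predictor: risk = 1 / (1 + exp(-lp))."""
    lp = intercept + sum(b * x for b, x in zip(coefs, values))
    return 1.0 / (1.0 + math.exp(-lp))

# Hypothetical three-predictor model: BMI, mean arterial pressure (MAP),
# and previous preeclampsia (0/1). Coefficients are illustrative only.
intercept = -8.0
coefs = [0.05, 0.04, 1.2]   # log-odds per unit of BMI, MAP, history
risk = predicted_risk(intercept, coefs, [32.0, 95.0, 1.0])
print(f"Predicted preeclampsia risk: {risk:.1%}")
```

The same formula, with the study's fitted intercept and coefficients, is what such a model applies to each new patient.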

Screening methods for study identification
Two reviewers (EA, MAC) independently assessed the titles and abstracts of the search results to select relevant papers for further screening. After removal of duplicates, the articles were obtained for screening/reading of the full text after which eligible papers were selected for inclusion in the systematic review. Discrepancies between the reviewers were resolved through consensus.

Data extraction and management
Data extraction from the identified studies was done using the CHARMS checklist (EA). Extracted data were checked (MAC) and disagreements were resolved by consensus (EA, MAC); in case of persistent disagreement a third reviewer (KKG) was consulted. Studies were analysed qualitatively given the large variability of the studies included.
The following categories were extracted: authors, journal, year of publication, region or place where the study was conducted, period of data collection, study design, inclusion and exclusion criteria, sample size of the derivation and/or validation cohort, gestational age at which women were enrolled into the study, and the number of outcomes. Other information extracted included the number and types of predictors, the target population for whom the prediction model is intended, the handling of missing data, the modelling method used, the model selection method, the handling of continuous data, the method used for internal validation and whether or not an external validation was done.

Quality assessment
The quality of the studies was assessed independently by two authors (EA, MAC) using the CHARMS, STROBE and TRIPOD checklists and the National Institutes of Health (NIH) [26] quality assessment tools for observational cohort and cross-sectional studies. The NIH quality assessment tools focus on concepts that are key for critical appraisal of the internal validity of a study. The tool uses a 14-item checklist to assess the study design, inclusion criteria, outcome and variable description and collection, and loss to follow-up, among others. Each item is scored as yes, no or other (not reported, not applicable or cannot determine). The tool also provides guidance on grading the studies as good, fair or poor. The studies were finally graded for risk of bias as "low" if risk of bias was unlikely, "moderate" if there were no essential flaws but not all criteria had been satisfied, and "high" if there were flaws in one or more important items. We adapted the tool and used 13 of the 14 items, because one item, "for exposures that can vary in amount or level, did the study examine different levels of the exposure as related to the outcome (e.g., categories of exposure, or exposure measured as continuous variable)?", was not relevant to our review.

Meta-analysis
We performed a meta-analysis on 22 of the studies with preeclampsia as the outcome, using MedCalc Statistical Software version 19.1.7 (MedCalc Software Ltd, Ostend, Belgium; https://www.medcalc.org; 2020). These 22 studies fully reported the area under the curve with 95% confidence intervals. We used a random effects model.
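
For illustration, a common random-effects pooling approach (DerSimonian-Laird) can be sketched in plain Python; we do not claim this reproduces MedCalc's exact implementation, and the AUCs and confidence intervals below are hypothetical, not values from the reviewed studies:

```python
def random_effects_pool(estimates, ci_lows, ci_highs):
    """DerSimonian-Laird random-effects pooling of study estimates
    whose 95% confidence intervals are reported."""
    k = len(estimates)
    # Standard errors recovered from the reported 95% CIs (width = 2 * 1.96 * SE).
    se2 = [((hi - lo) / (2 * 1.96)) ** 2 for lo, hi in zip(ci_lows, ci_highs)]
    w = [1 / v for v in se2]                        # fixed-effect weights
    fixed = sum(wi * y for wi, y in zip(w, estimates)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, estimates))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)              # between-study variance
    w_star = [1 / (v + tau2) for v in se2]          # random-effects weights
    return sum(wi * y for wi, y in zip(w_star, estimates)) / sum(w_star)

# Hypothetical AUCs with 95% CIs from three studies (illustrative values).
pooled = random_effects_pool([0.80, 0.75, 0.90],
                             [0.75, 0.70, 0.85],
                             [0.85, 0.80, 0.95])
print(f"Pooled AUC: {pooled:.3f}")
```

With equal standard errors, as here, the random-effects pooled estimate reduces to the simple mean; unequal precision shifts weight toward the more precise studies.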

Prediction models for gestational hypertension and pre-eclampsia
All forty studies included in this review were conducted between 2000 and 2019. Table 1 gives an overview of important parameters of the selected studies. The studies have been grouped in the following order: maternal characteristics only, maternal characteristics and uterine artery Doppler, maternal characteristics with biomarkers and maternal characteristics with biomarkers and uterine artery Doppler.
Twelve studies were conducted in the United Kingdom, eight in the United States of America, four each in Australia, Spain and Italy, and three in New Zealand. Two studies each were done in the Netherlands, Ireland, Brazil, Chile and Ghana, with one each in Japan, China, Germany, Norway, Bulgaria, Greece, Belgium and Canada.
Most of the studies were prospective cohort studies (33/40 = 82.5%), four were retrospective cohort studies (10%), three were nested-case control studies (7.5%) and one study combined a retrospective and prospective cohort design for data collection. The prediction models were derived through logistic regression or parametric survival modeling.
The gestational age at inclusion into the studies ranged between eight and twenty weeks. All the gestational ages were confirmed by ultrasound. The sample size for the studies ranged between 173 and 35,948. The events per variable in the studies ranged between 2.1 and 88.2.
Seventy-seven percent of all the prediction models combined biomarkers with maternal clinical characteristics. Body mass index (BMI) was the most frequently used predictor (19/40). Other maternal clinical predictors used in the models were first trimester systolic and diastolic blood pressure, mean arterial pressure, maternal ethnicity, parity, previous history of preeclampsia, family history of hypertension, family history of preeclampsia, history of smoking and history of gestational diabetes mellitus. The following biomarkers were included: uterine artery pulsatility index (UtA PI, 17/40), pregnancy-associated plasma protein-A (PAPP-A, 16/40) and placental growth factor (PlGF, 16/40). The following predictors were used fewer than ten times in the studies under review: free beta human chorionic gonadotropin (fβ-hCG), alpha-fetoprotein (AFP), soluble fms-like tyrosine kinase-1 (sFlt-1), placental protein 13 (PP13), A disintegrin and metalloproteinase 12 (ADAM12), soluble endoglin (sEng) and vascular endothelial growth factor (VEGF). Fig 2 shows the frequency of predictor variables in the prediction models.

Methodological quality of the studies to develop or validate prediction models using the CHARMS, STROBE and TRIPOD checklists
Source of data. All the studies indicated the type of study design used to obtain data for the prediction modelling. Thirty-seven were cohort studies whilst three were nested case-control studies.

PLOS ONE
Systematic review of prediction models.
Participants. All the studies indicated the participant eligibility and recruitment criteria, including the study location, number of centres and the inclusion and exclusion criteria.
Outcomes to be predicted. All the studies gave a standard definition for the outcome(s) to be predicted. Most of the studies had a single outcome while eleven studies had two or more outcomes.
Candidate predictors. All the studies defined and described the candidate predictors and the methods for their measurement. The timing of predictor measurements was also provided in all studies. Handling of predictors in the modelling process was described by 31 of the 40 studies. Nine of the studies categorized continuous variables, whilst 21 studies kept continuous predictors on their original continuous scale.
Sample size. All studies provided the number of participants and the number of outcomes. Only nine of the studies explicitly estimated the sample size before the onset of the study. The number of outcomes in relation to the number of candidate predictors (events per variable) was deduced from the data and ranged between 2.1 and 88.2.
Missing data. The number of participants with any missing value for each predictor was not provided by the studies. Nine of the studies did not indicate how missing data were handled. Complete case analysis was used by 26 out of the 40 studies whilst five studies imputed missing data using the single regression imputation method [19,32], expectation maximization method [33,48] and multiple imputation [47].
Model development. All the studies selected candidate predictors for inclusion in the model through univariable analysis using a pre-determined p-value. Logistic regression and survival modelling were used to derive the prediction models. For selection of predictors during multivariable modelling, one study used the stepwise forward selection method, 14 studies used the stepwise backward selection method and two studies used stepwise selection without further specification. One study [46] applied the Lasso regression approach and another used survival analysis, whilst 21 studies did not state the selection method used.
Model performance. Discrimination of the prediction models, expressed as the c-statistic or the area under the receiver operating characteristic (ROC) curve, was reported by 34 (85%) of the studies, while calibration was reported by five (12.5%). Classification measures were reported by 37 (92.5%) of the studies (Table 1).
Risk of bias assessment. Risk of bias refers to the extent to which flaws in the design, conduct and analysis of the primary prediction modelling study lead to biased, often overly optimistic, estimates of predictive performance measures such as model calibration, discrimination or (re)classification (usually due to overfitted models). Details of the risk of bias assessment are presented in Table 2.
Prediction models applicable in low and middle income settings. Apart from two models each from Brazil and Chile, both upper-middle-income countries in Latin America, and two models from Ghana, all the other models in the literature that met our inclusion criteria were developed in high-income countries.


Discussion
We set out to review the published evidence on the performance of multivariable prediction models for gestational hypertension and preeclampsia, and to assess how effectively these models identify pregnant women at risk. The specific objectives of this study were to identify prediction models for gestational hypertension and preeclampsia in the literature, to assess the methodological quality of the prediction modelling studies by applying the CHARMS checklist, and to identify prediction models that can be applied in low- and middle-income country settings.

Prediction models for gestational hypertension and preeclampsia
Our study identified 40 prediction models for gestational hypertension and preeclampsia, most of which had been developed and validated in high-income countries in Europe, Australia and the USA. Only two of these studies had been conducted in a low- and middle-income country setting. Most of the prediction models were developed in single centres, but a few had been developed using data from multiple centres in one or more countries.

Methodological quality of prediction modeling studies
The STROBE (Strengthening the reporting of observational studies in epidemiology), TRIPOD (Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis) and CHARMS checklists outline steps for developing, validating and reporting prediction models. The CHARMS checklist in particular provides guidance on the items to extract when conducting a systematic review of prediction studies. An assessment of the methods used for model development in the studies evaluated in this review showed gaps in the application of the recommendations in the CHARMS, TRIPOD and STROBE checklists. The following domains of the CHARMS checklist were not adequately addressed in most of the studies: source of data, study participants, outcome(s) to be predicted, candidate predictors, sample size, missing data, model development, model performance, model evaluation, results, and interpretation and discussion. For example, continuous predictors were dichotomized in some of the studies despite evidence and recommendations to the contrary [62][63][64][65]. Bias in predictor selection is known to occur when continuous predictors are categorized. Categorizing continuous variables also assumes that risk changes in steps from one cut-off point to the next. Bodnar et al [66] have demonstrated a dose-dependent relationship between pre-pregnancy BMI and the risk of preeclampsia: as BMI increases, so does the risk. Categorizing the predictor therefore discards the true functional relationship between the continuous predictor and the outcome, and nonlinear transformations such as restricted cubic splines or fractional polynomials cannot be applied [62,67,68].
To prevent overfitting, it is recommended that the number of outcome events relative to the number of candidate predictors (events per variable) be at least ten to one [69,70]. This requires a sample size large enough to ensure a sufficient number of outcome events in the study. Sample size estimation at the onset of the study is therefore an important methodological consideration: it secures an adequate events-per-variable ratio and thereby prevents overestimation of the predictive performance of the models. Unfortunately, most of the studies under review did not report on sample size estimation. An adequate sample size also minimizes predictor selection bias, which tends to be greater in smaller datasets with a small events-per-variable ratio, especially when weak predictors are present [16].
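
The ten-events-per-variable rule of thumb translates directly into a minimum cohort size, as this small illustration shows (the 8 candidate predictors and 5% event rate are assumptions for the example, not figures from the review):

```python
import math

def min_sample_size(n_candidate_predictors, expected_event_rate, epv=10):
    """Smallest cohort giving at least `epv` outcome events per candidate
    predictor (the 10:1 events-per-variable rule of thumb)."""
    required_events = epv * n_candidate_predictors
    return math.ceil(required_events / expected_event_rate)

# Illustrative: 8 candidate predictors and an assumed 5% preeclampsia rate
# imply 80 events and therefore at least 1,600 participants.
print(min_sample_size(8, 0.05))
```

The same arithmetic run in reverse (observed events divided by candidate predictors) yields the events-per-variable values of 2.1 to 88.2 deduced for the reviewed studies.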
Information on missing data should be reported as part of the results of a study. This includes the number of participants with any missing value (for predictors or outcomes), the number of participants with missing data for each predictor, and how the missing data were handled, for example by complete-case analysis, imputation or other methods. Information about missing data gives an indication of the extent of bias, depending on the reasons for the missingness. Where data are not missing completely at random, the prediction estimates are likely to be biased [64,[71][72][73][74][75]. Missing data are seldom missing completely at random and are often related to other observed participant data; consequently, participants with completely observed data are likely to differ from those with missing data. Complete-case analysis, the commonest method used to handle missing data in the studies reviewed, deletes participants with a missing value from the analysis, thereby losing information from a subset of the study population. This may result in over- or underestimation of the predictive effect and reduced performance in an external population.
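
The practical difference between complete-case analysis and imputation can be sketched as follows (toy records with hypothetical BMI and mean arterial pressure values; single mean imputation is shown for brevity, although multiple imputation is generally preferred):

```python
from statistics import mean

def complete_case(rows):
    """Keep only rows with no missing (None) values - discards information."""
    return [r for r in rows if None not in r]

def mean_impute(rows):
    """Replace each missing value with the column mean of the observed values.
    (Single imputation; multiple imputation is preferred in practice.)"""
    cols = list(zip(*rows))
    col_means = [mean(v for v in col if v is not None) for col in cols]
    return [[v if v is not None else col_means[j] for j, v in enumerate(r)]
            for r in rows]

# Illustrative records: [BMI, mean arterial pressure]; None marks missing.
data = [[30.0, 95.0], [25.0, None], [None, 88.0], [28.0, 92.0]]
print(len(complete_case(data)))   # only 2 of 4 rows survive complete-case analysis
print(mean_impute(data)[1])       # row with missing MAP filled with the column mean
```

Here complete-case analysis discards half of the records, which is exactly the loss of information (and potential bias) described above.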
Prediction model performance is one of the important domains to be reported on [71]. Model performance indicators include calibration, discrimination and classification. It is recommended that discrimination and calibration always be reported for prediction models. Discrimination indicates how well the prediction model distinguishes between two outcomes, such as disease and non-disease, and is assessed using the c-statistic or the area under the receiver operating characteristic curve (AUC) [76][77][78]. The AUC ranges from 0.5 to 1 and represents the probability that the model assigns a higher risk to a randomly selected individual with the outcome than to a randomly selected individual without it [78][79][80][81]. An AUC of 1.0 is considered perfect, 0.9-0.99 excellent, 0.8-0.89 good, 0.7-0.79 fair and 0.51-0.69 poor; an AUC of 0.5 is non-informative. The AUC in the studies under review ranged between 0.65 and 0.98. Apart from the study by Kuijk et al [19], which had an AUC of 0.65, all the other studies reported an AUC greater than or equal to 0.70, indicating fair to excellent discrimination. Calibration refers to how well the predicted risks compare with the observed outcomes. It is usually evaluated in a calibration plot of observed against predicted event rates [16,67,82], which may be supplemented by the Hosmer-Lemeshow test, a formal statistical test of whether calibration is adequate. Unfortunately, most of the studies under review did not report a calibration plot. This shortcoming leaves room for uncertainty in applying the models in clinical practice because one cannot determine the probability range within which a model works well. Both discrimination and calibration are essential in determining model performance.
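
Both measures can be sketched in a few lines: the c-statistic as the proportion of case/non-case pairs ranked correctly, and the observed-versus-predicted table that underlies a calibration plot (the predicted risks and outcomes below are illustrative only):

```python
def c_statistic(predicted, observed):
    """C-statistic (AUC): probability that a randomly chosen case receives
    a higher predicted risk than a randomly chosen non-case; ties count 0.5."""
    cases = [p for p, y in zip(predicted, observed) if y == 1]
    controls = [p for p, y in zip(predicted, observed) if y == 0]
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c in cases for n in controls)
    return wins / (len(cases) * len(controls))

def calibration_table(predicted, observed, groups=4):
    """Mean predicted risk vs observed event rate per risk group - the basis
    of a calibration plot (deciles are typical; fewer groups used here)."""
    pairs = sorted(zip(predicted, observed))
    size = len(pairs) // groups
    table = []
    for g in range(groups):
        chunk = pairs[g * size:(g + 1) * size]
        pred_mean = sum(p for p, _ in chunk) / len(chunk)
        obs_rate = sum(y for _, y in chunk) / len(chunk)
        table.append((round(pred_mean, 3), round(obs_rate, 3)))
    return table

# Illustrative predicted risks and outcomes (1 = preeclampsia).
pred = [0.05, 0.10, 0.15, 0.20, 0.40, 0.55, 0.70, 0.90]
obs  = [0,    0,    0,    1,    0,    1,    1,    1]
print(c_statistic(pred, obs))
print(calibration_table(pred, obs, groups=2))
```

A well-calibrated model produces a table (and plot) in which mean predicted risk and observed event rate agree in every group; discrimination alone, however high, reveals nothing about this agreement.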
Prediction model evaluation can be undertaken by internal validation (using the same dataset as that used to develop the model) and external validation (using a dataset different from that used to develop the model). The external dataset should use the same predictor and outcome definitions and measurements. Again, most of the studies did not report whether internal validation had been performed, omitting an important methodological step. Most of the studies did not follow the guidelines in the TRIPOD, STROBE and CHARMS checklists. A possible explanation is that some of the studies were conducted before these guidelines were developed, so the investigators could not have benefited from them.

Prediction models applicable in low and middle income settings
Only five of the studies had been conducted in a low- or middle-income country setting. Given contextual differences between high-income and low- and middle-income countries, many of the prediction models under review, which were developed in high-income countries, may at present not be applicable in most low- and middle-income countries. This is because these prediction models included biomarkers and the uterine artery pulsatility index as predictors in addition to maternal clinical characteristics [20,21,23,24,27,28,30,[36][37][38][39][40][41]44,46,[48][49][50][51][52]61,83]. At present, uterine artery Doppler measurement and serum biomarker assays are not widely available in many low- and middle-income countries, so prediction models using biomarkers and the uterine artery pulsatility index cannot be routinely applied in these settings.
Generally, prediction models developed in one setting have to be externally validated in new populations to assess their performance before they are applied in clinical decision-making. The model intercept and the regression coefficients often have to be updated to fit the new context or population to which the prediction model is being applied. Thus prediction models developed elsewhere may be updated for use in other settings, provided the predictors and outcome are the same. Where a prediction model includes variables that cannot be measured in the setting in which it is to be applied, the model cannot be used in that population. Consequently, most prediction models developed in high-income countries that include variables such as serum biomarkers and the uterine artery pulsatility index are at present not applicable in most low- and middle-income countries, where the burden of hypertensive disorders of pregnancy is greatest. For now, prediction models that use maternal clinical characteristics and show good predictive performance can be externally validated and applied in low-resource settings.

Conclusion
Most of the studies evaluated did not completely follow the CHARMS, TRIPOD and STROBE guidelines in prediction model development and reporting. Adherence to these guidelines will improve prediction modelling studies and the subsequent application of prediction models in clinical practice. Prediction models using maternal characteristics, with good discrimination and calibration, should be externally validated for use in low- and middle-income countries where biomarker assays are not routinely available.