Development and validation of a prognostic tool: Pulmonary embolism short-term clinical outcomes risk estimation (PE-SCORE)

Objective Develop and validate a prognostic model for clinical deterioration or death within days of pulmonary embolism (PE) diagnosis using point-of-care criteria. Methods We used prospective registry data from six emergency departments. The primary composite outcome was death or deterioration (respiratory failure, cardiac arrest, new dysrhythmia, sustained hypotension, and rescue reperfusion intervention) within 5 days. Candidate predictors included laboratory and imaging right ventricle (RV) assessments. The prognostic model was developed from 935 PE patients. Univariable analysis of 138 candidate variables was followed by penalized and standard logistic regression on 26 retained variables, and then tested with a validation database (N = 801). Results Logistic regression yielded a nine-variable model, then simplified to a nine-point tool (PE-SCORE): one point each for abnormal RV by echocardiography, abnormal RV by computed tomography, systolic blood pressure < 100 mmHg, dysrhythmia, suspected/confirmed systemic infection, syncope, medico-social admission reason, abnormal heart rate, and two points for creatinine greater than 2.0 mg/dL. In the development database, 22.4% had the primary outcome. Prognostic accuracy of logistic regression model versus PE-SCORE model: 0.83 (0.80, 0.86) vs. 0.78 (0.75, 0.82) using area under the curve (AUC) and 0.61 (0.57, 0.64) vs. 0.50 (0.39, 0.60) using precision-recall curve (AUCpr). In the validation database, 26.6% had the primary outcome. PE-SCORE had AUC 0.77 (0.73, 0.81) and AUCpr 0.63 (0.43, 0.81). As points increased, outcome proportions increased: a score of zero had 2% outcome, whereas scores of six and above had ≥ 69.6% outcomes. In the validation dataset, PE-SCORE zero had 8% outcome [no deaths], whereas all patients with PE-SCORE of six and above had the primary outcome. 
Conclusions PE-SCORE model identifies PE patients at low- and high-risk for deterioration and may help guide decisions about early outpatient management versus need for hospital-based monitoring.

Introduction An important indicator of acute pulmonary embolism (PE) of moderate to high severity is an acute increase in right ventricular pressure or size or decreased systolic function. PE-provoked right ventricle (RV) abnormality is commonly assessed in two ways: 1) laboratory surrogates of myocardial stretch and injury, and 2) imaging assessments for RV dilatation, pressure increases, and decreased systolic function. The most common diagnostic tests are natriuretic peptide and troponin, and imaging by computed tomography (CT) and echocardiography. Assessments for abnormal RV (abnlRV) are absent in validated clinical prognostic models, such as the original and simplified Pulmonary Embolism Severity Index (PESI and sPESI) and Hestia [1][2][3]. These prognostic prediction models utilized a limited set of candidate variables without pertinent imaging and laboratory measurements [4]. Risk of early clinical deterioration from worsening RV function is not captured in current prediction models [5][6][7].
The newer anticoagulants offer efficacy and safety in PE treatment, yet there is hesitancy to discharge those with acute PE. Hospitalization rates for PE are as high as 90%-95% in the U.S. and Europe, yet 41%-51% of PE patients are classified as low-risk by existing clinical prediction models [8][9][10][11][12]. Clinical algorithms, checklists, and prognostic models are being developed and updated to optimize the safety of outpatient management, improve prognostic accuracy for outcome(s), and provide guidance to reduce practice variation. Imaging and laboratory assessments for PE-provoked abnlRV have now been incorporated into hybrid clinical algorithms [1,7,13-16], and some meta-analyses now support use of one or multiple RV assessment methods [4,17,18]. A consistent definition of PE-provoked abnlRV, however, is lacking [19][20][21][22].
Acute care providers are thus challenged to identify PE patients who are considered low-risk (and safe for early discharge) and those at greater risk of clinical deterioration without a clear guideline on RV assessment in acute PE. Providers must make disposition decisions driven by concerns for acute deterioration (respiratory failure, cardiac arrest, new dysrhythmia, sustained hypotension, and rescue reperfusion intervention) within the first days of PE diagnosis rather than events at 30 days or later. Thus, we aimed to develop and validate a

Participants
Inclusion and exclusion criteria were the same for both development and validation databases. Men and women 18 years or older with image-confirmed acute PE diagnosed within 12 hours of ED presentation were eligible for enrollment. Patients were excluded for any of the following reasons: age 17 years old and younger at the time of screening; refusal to participate in study; radiologist's determination that filling defects were chronic, resolving, or unchanged after comparison to previous CT, if available; empiric anticoagulation or escalated intervention initiated more than 12 hours before PE diagnosis; incidental identification of either segmental or subsegmental intraluminal filling defects on CT or unrelated to primary diagnostic workup or ED presentation.

Data collection
The electronic case report included over 400 variable entry fields for prognostic model testing and other aims of the registry. For the prognostic tool, we collected 138 data elements on each patient, including vital signs at presentation, risk factors for PE, comorbidities, contemporaneous measurements of cardiac biomarkers [troponin and brain natriuretic peptide (BNP)], and CT and goal-directed echocardiography evaluations performed early in ED management of the index PE event.

Outcome measures
The primary composite outcome comprised morbidity and mortality outcomes of interest to emergency providers, which require hospital-based monitoring or time-sensitive interventions. We used a composite of death (all cause and PE-related) and clinical deterioration within five days of index PE confirmation. We incorporated and adapted a composite primary outcome previously used by researchers and considered to be important to providers and pulmonary embolism response teams in the USA and other countries [5,7,13,25-28]. The individual components of the composite outcome have been reported previously [5,27]. Deaths were classified as PE-related when the site investigator reviewed the case and determined death was not likely to be due to another cause, such as septic shock or acute myocardial infarction. Elements of clinical deterioration included respiratory failure, cardiac arrest, new dysrhythmia, sustained hypotension requiring intravenous volume expansion or adrenergic medication, and rescue reperfusion intervention.
Respiratory failure was defined as respiratory distress associated with emergent interventions with mechanical ventilation (intubation, non-invasive positive pressure ventilation, or surgical cricothyrotomy). Cardiac arrest was defined as any unstable cardiac rhythm or absent electrical activity requiring cardiopulmonary resuscitation or advanced cardiac life support for asystole, pulseless electrical activity, ventricular fibrillation, or unstable ventricular tachycardia. New dysrhythmia was defined as the identification of atrial fibrillation with rapid ventricular response, atrial flutter, supraventricular tachycardia, stable ventricular tachycardia, or bradycardia that was not evident at ED presentation. Hypotension was defined as systolic blood pressure less than 90 mmHg (or a 40 mmHg decrease from baseline) or shock index >1 associated with administration of greater than 500 mL of intravenous fluids within 15 minutes for volume expansion or administration of norepinephrine, dopamine, or epinephrine infusion.
Major bleeding was attributed to treatment with anticoagulation or thrombolysis and not as a primary outcome of clinical deterioration due to PE severity. The presence of death or any clinical deterioration element within five days of hospitalization was considered to be positive for the primary outcome. The absence of death or clinical deterioration within five days post-PE confirmation was considered negative for primary outcome. Each patient could have more than one element of clinical deterioration.
Although not the focus of this report, our secondary outcome included the same events in 5 days with the addition of major bleeding, recurrence of venous thromboembolism (VTE), or subsequent hospitalization within 30 days.

Predictor variables
We considered 138 candidate variables available at the point-of-care, including laboratory and imaging tests relevant to assessment of abnlRV, and those previously vetted by PE registries, sPESI, Hestia, and European Society of Cardiology (ESC) [3,4,15,29,30]. Predictor variables were measured and assessed while blinded to outcomes. We included symptoms, signs, and findings likely to represent higher PE severity. As an example, we chose syncope instead of shortness of breath or chest pain based on clinical experience and evidence in the literature [31][32][33][34]. We added a variable that factored in initial heart rate < 50 or > 100 bpm [35]. We included a component of Hestia that employed provider gestalt of medical and social support reasons for hospitalization as social determinants of health. Variables that addressed the safety risk of PE treatment (including predispositions to bleeding) were not included as candidate variables for the primary outcomes of clinical deterioration. We report on the missingness of variables in the final prognostic model and associated outcomes of those with missing variable responses.
site performed univariable analyses to determine completeness and sensibility of entries. Verification queries were performed, with corrections made if necessary.

Sample size
We used Peduzzi's rule for logistic regression to guide determination of sample size [39]. This rule declares the maximum number of independent (predictor) variables is no more than N/10, where N is the number of observations (subjects) in the smaller of the two groups (outcome dichotomous yes/no). We were prepared to accommodate up to 22 final variables. Thus, 220 subjects (220/10 = 22) were needed in the smaller (clinical deterioration yes) subgroup. Using an estimated 25% occurrence of clinical deterioration within several days (based on previously cited literature), a sample size of 880 was required for the development database [6,27,40].
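The arithmetic behind this sample size can be sketched as follows. This is a minimal illustration of the N/10 rule as described above, not the authors' code; the function name and its parameters are hypothetical.

```python
import math


def peduzzi_sample_size(max_predictors, event_rate, events_per_variable=10):
    """Return (events_needed, total_sample_size) under the N/10 rule.

    Assumes the outcome group is the smaller of the two groups,
    i.e. event_rate is at most 0.5 (hypothetical helper, for illustration).
    """
    if not 0 < event_rate <= 0.5:
        raise ValueError("event_rate must be in (0, 0.5]")
    # Subjects required in the smaller (outcome yes) group.
    events_needed = max_predictors * events_per_variable
    # Total enrollment needed to expect that many outcome events.
    total = math.ceil(events_needed / event_rate)
    return events_needed, total


events, total = peduzzi_sample_size(max_predictors=22, event_rate=0.25)
print(events, total)  # 220 880
```

With 22 candidate final variables and a 25% expected outcome rate, this reproduces the 220 outcome events and 880 total subjects stated in the text.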

Missing values
We reported the percentage of missing observations for each variable. Missing categorical data were marked as absent [41].

Statistical analysis methods
Data cleaning. We performed three interim data cleans during the enrollment phase before importing to SAS for the final data clean after the final enrollment. During the enrollment phase, important variables were assessed for missingness and discrepancies during data cleaning. For example, we reported outliers in vital signs or laboratory measurement values to the site investigators. At the close of enrollments, descriptive statistics were used to examine predictor and outcome variables for sensibility and missingness. Instructions for corrective actions were assigned to the site investigator and clinical research team by referring to source documents within the electronic health records. After missingness was mitigated and sensibility of data optimized, the database was used for analysis. We then imported the development and external validation databases to SAS Enterprise Guide 7.1 (SAS Institute Inc., Cary, NC, USA).
We computed overall descriptive statistics on all variables in each dataset. We reported on the number of non-missing and missing values, the mean, median, standard deviation, minimum and maximum values for continuous variables. We used frequencies and percentages of each value (including missing values) for categorical variables. The PI and biostatistician inspected reports and made queries to verify and correct data as needed.
Model development. Fig 1 shows the steps taken to derive the prognostic model. We screened 138 candidate variables with bivariate analyses of the primary outcome in the development dataset. We used Student's t-test for continuous variables, Cochran-Armitage test for trend for ordinal variables, and the chi-square test for categorical variables. We chose a significance level of 5% or clinical importance as preliminary screening criteria for the full model testing and filtering of candidate variables for subsequent regression model testing. The rule for retaining variables was not simply p < 0.05. Rather, the decision whether to retain a variable was based on a combination of factors, including strength of association, prior research findings, and clinical importance as determined by investigators. Below, we outline the subsequent steps taken to optimize its clinical utility. Full descriptions of each step follow the outline.
1. We used a least absolute shrinkage and selection operator (LASSO) logistic regression model for variable selection [42].
2. To further assess the predictor variables selected by the LASSO procedure, we included them in a standard logistic regression with the primary outcome as the response on the development data. We excluded predictor variables with p > 0.10 from further analysis.
3. We ran a generalized linear mixed model (GLMM) on the development database, with the primary outcome as the response and the reduced set of predictor variables identified by the LASSO and standard logistic regression models. The GLMM included a random intercept term for the clinical site to adjust for intra-site clustering.
4. To facilitate real-time clinical use by providers, we simplified the final 9-variable logistic regression model to a 9-variable points model that we named PE-SCORE.
Per the outline, we first used a LASSO logistic regression model for variable selection. LASSO is a type of penalized regression method that minimizes collinearity and avoids overfitting the model [43]. In addition, LASSO partitioned the development database such that two-thirds (67%) of data were used to train (or fit) the model, while 33% of the data were used for the first stage of internal validation of the model [44]. We selected the optimal level of penalization by using average squared error between responses and predictions in the internal validation data [44].
To further assess the predictor variables selected by the LASSO procedure, we included them in a standard logistic regression with the primary outcome as the response on the development data. We excluded predictor variables with p > 0.10 to create a more parsimonious model. Because of possible intra-site clustering, we considered the clinical research site to have a potentially important random effect in modeling the primary outcome. To assess site differences on the primary outcome and selected predictor variables, we used one-way analysis of variance for continuous variables, the Kruskal-Wallis test for ordinal variables, and the chi-square test for categorical variables. Informed by these findings, we ran a generalized linear mixed model (GLMM) on the development database, with the primary outcome as the response and the reduced set of predictor variables identified by the LASSO and standard logistic regression models. The GLMM included a random intercept term for the clinical research site to adjust for intra-site clustering. To determine the importance of the site effect in the model, we assessed its variance using a test based on the ratio of residual pseudo-likelihoods. We tested odds ratios of retained variables and used their confidence intervals (CIs) to determine significance as predictors of the primary composite outcome.
Presentation of prediction model. For the logistic regression, we reported coefficients for the variables in the final model, p-values, likelihood ratios, and odds ratios with confidence intervals. [The logistic regression equation is available for calculation of the probabilities.] Next, we assigned whole points and weights to the final variables of the tool, which were proportional to each variable's odds ratio for the primary outcome. We developed the points tool, called Pulmonary Embolism Short-term Clinical Outcome Risk Estimator (PE-SCORE), for real-world usefulness to providers at the point of decision-making [2,30,45-47].
External validation. We used the external validation database to test the PE-SCORE model for reproducibility of results and to measure performance of the model on an entirely different sample. Site investigators and data extractors were blinded to the selection of development and validation databases.
We reported descriptive statistics to determine similarities and differences between the databases and compared them with t-test and chi-square analyses for predictor variables. We ran the points model on the validation database.
Prognostic model performance. We reported on the prognostic performance of both the logistic model and the points model (PE-SCORE) on the development and validation databases. We measured and assessed sensitivity, specificity, and positive and negative predictive values for the primary outcome (yes/no) using two thresholds (low-risk and high-risk) for the PE-SCORE model. To report on discrimination, we reported sensitivity, 1 minus specificity, and receiver operating characteristic (ROC) to derive the area under the curve (AUC) and area under precision recall curve (AUCpr), with 95% confidence intervals and F1 scores and curves for visualization. For calibration, we 1) reported the proportion of observed actual events versus predicted probabilities, and 2) assessed goodness of fit between individuals with and without the outcome of interest with the Spiegelhalter z test and its p-value [48]. We reported measurements of calibration slope for overestimation and underestimation of risk prediction and the intercept for calibration-in-the-large [49,50]. We provided figures of calibration curves for visualization [50,51]. We used the following interpretation guideline: A slope < 1.0 suggests estimated risks are exaggerated, whereas slope > 1 suggests risks are underestimated. The calibration intercept was used for overall calibration-in-the-large. Using an optimal value of 0, negative values indicated overestimation, whereas positive values suggested underestimation.
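The threshold-based accuracy metrics named above can be illustrated with a short sketch. It assumes that a patient counts as test-positive when the score is at or above a chosen cutoff (for example, 1 or more when score 0 defines low risk, and 5 or more for high risk). The function name and the toy data are hypothetical and are not study data.

```python
def threshold_metrics(scores, outcomes, cutoff):
    """Sensitivity, specificity, PPV, and NPV when score >= cutoff is test-positive."""
    tp = fp = tn = fn = 0
    for score, outcome in zip(scores, outcomes):
        test_positive = score >= cutoff
        if test_positive and outcome:
            tp += 1
        elif test_positive and not outcome:
            fp += 1
        elif not test_positive and outcome:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Return NaN rather than raising when a cell combination is empty.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical toy data: point scores and observed primary outcome (1 = yes).
scores = [0, 0, 1, 2, 5, 6, 3, 0]
outcomes = [0, 0, 0, 1, 1, 1, 0, 1]

low = threshold_metrics(scores, outcomes, cutoff=1)   # score 0 treated as low risk
high = threshold_metrics(scores, outcomes, cutoff=5)  # score >= 5 treated as high risk
print(low)
print(high)
```

Reporting both cutoffs separately reflects that the low-risk analysis is mostly concerned with NPV, whereas the high-risk analysis is mostly concerned with specificity and PPV.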
To compare model performance, we compared the AUC of the full logistic model with the PE-SCORE in the development dataset. For this comparison, we used the method described by DeLong [52]. To compare AUC of PE-SCORE in the development and validation databases, we used the chi square test presented by Gonen [53].

Participants
We enrolled 1008 patients into the development database, with 73 post-enrollment exclusions, leaving 935 records for analysis. We enrolled 815 patients in the validation database, with 14 post-enrollment exclusions, leaving 801 records for analysis. As shown in Table 1, patient characteristics in both databases were similar, as was the incidence of primary composite outcome and each of its components. Recurrence of VTE, major bleeding, and death within 30 days were higher in the development database.
There was low missingness for candidate variables. The variable with the most frequent missing responses (marked as absent) was GDE score at 2.2% and 3.4% in the development and validation databases, respectively. GDE missingness, however, was expected. Our assessment showed the impact of missing GDE values on outcomes was minimal: the percentage of patients experiencing the primary outcome for those with GDE negative, positive, and missing responses for abnlRV were 14.4%, 38.4%, and 28.6%, respectively (S1 Table).

Model development
S1 Table shows main results of univariable analyses of candidate variables on the development database. Notably, any cancer (p = 0.987) and heart failure (p = 0.285) had non-significant p-values. Twenty-six of the 138 candidate variables vetted by univariable analyses had p-values below 0.05 and were retained for subsequent LASSO regression. We re-entered variables for chronic obstructive pulmonary disease (COPD), cancer, and oxygen saturation below 90% because these were variables in validated sPESI and Hestia models. LASSO retained 13 variables; cancer was not retained again. We next ran a standard logistic regression with the 13 retained variables, nine of which had p < 0.10 in the logistic model and were retained for further analysis.
In the univariable comparisons of clinical research sites, we found statistically significant differences between sites for the variables shown in S2 Table (primary composite outcome, race, age, ethnicity, abnormal heart rate, creatinine greater than 2.0 mg/dL, abnormal RV by imaging, and medical/social reasons for hospitalization). Moreover, the random intercept term for the clinical research site was statistically significant (p < 0.01) in the GLMM. Accordingly, we retained 'clinical research site' as a random effect in the model. [Although clotting disorder was statistically significant, it was uncommon; thus, it was not included in the final prognostic model.] Table 2 shows the nine variables used in the final logistic regression equation.

Model specification
The logistic regression equation to determine probability of the primary outcome is $P = \left[1 + \exp\left(-\left(\alpha_{RE} + \sum_i \beta_i x_i\right)\right)\right]^{-1}$, where $\alpha_{RE}$ is the fixed intercept (-2.91) and $\sum_i \beta_i x_i$ is the sum-product of the nine fixed regression coefficients $\beta_i$ of the random effects model and the corresponding predictor values $x_i$.
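As a hedged illustration of how this equation would be evaluated, the sketch below uses the reported fixed intercept of -2.91. The coefficient values are placeholders invented for the example (the fitted coefficients are those in Table 2 and are not reproduced here), and the site random effect is assumed to be zero.

```python
import math

INTERCEPT = -2.91  # fixed intercept reported in the text


def predicted_probability(coefficients, predictors, intercept=INTERCEPT):
    """Evaluate P = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    if len(coefficients) != len(predictors):
        raise ValueError("coefficients and predictors must have the same length")
    linear_predictor = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# HYPOTHETICAL placeholder coefficients, not the fitted values from Table 2.
placeholder_betas = [1.0] * 9

# With all nine predictors absent, only the intercept contributes.
baseline = predicted_probability(placeholder_betas, [0] * 9)
print(round(baseline, 3))  # 0.052
```

With every predictor absent, the estimate depends only on the intercept, which gives a baseline probability of roughly 5%.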
To convert the 9-variable logistic regression prognostic model into a simpler format for usefulness, we used the odds ratios shown in Table 3. The odds ratios of most of the nine predictor variables were similar and each was assigned 1 point, except for the creatinine > 2.0 mg/dL variable, which was assigned 2 points. The reason 2 points were awarded for creatinine > 2.0 mg/dL was based on the adjusted odds ratio of 5.37 for this variable. The adjusted odds ratio of 5.37 for renal impairment was more than double that of 5 variables in the model. Compared to dysrhythmia, which had the second highest adjusted odds of 4.00, the adjusted odds for creatinine elevation was 40% higher. We recognized that by awarding 2 points for creatinine elevation, the range for our point system would be 0-10, which is standard for many similar scales. The weights assigned to each variable in the final PE-SCORE model are listed in Table 3. The lowest PE-SCORE is 0 and the highest score is 10.

PLOS ONE

Table 4 shows the actual versus predicted events of PE-SCORE on the validation database. Predicted events were derived from the logistic regression model estimations. At the low end of the risk estimation, actual events in the validation database were higher (8% compared to 2% in the development database). There were no deaths within 5 days for patients with PE-SCORE of zero. There was one death among patients with PE-SCORE of 4, but it was not considered PE-related (segmental PE with coexisting perforated intestinal ulcer and gastrointestinal bleeding). The patient did not have CT or GDE finding of RV abnormalities, although both troponin and BNP were elevated. In this case, the PE-SCORE was elevated (although the sPESI was zero) because of other medical conditions, a heart rate of 105 bpm, and creatinine greater than 2.0 mg/dL.
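The point assignment (one point for each of eight findings and two points for creatinine greater than 2.0 mg/dL) can be sketched as a small calculator. This is an illustrative sketch rather than a validated implementation: the field names are hypothetical, abnormal heart rate is assumed to mean below 50 or above 100 bpm as described for the candidate variable, and the "intermediate" label for scores between the low-risk (0) and high-risk (5 or more) thresholds is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class PEScoreInputs:
    # Hypothetical field names for the nine PE-SCORE components.
    abnormal_rv_echo: bool
    abnormal_rv_ct: bool
    systolic_bp_mmhg: float
    dysrhythmia: bool
    systemic_infection: bool  # suspected or confirmed
    syncope: bool
    medico_social_admission_reason: bool
    heart_rate_bpm: float
    creatinine_mg_dl: float


def pe_score(p):
    """Sum the points: 1 each for eight findings, 2 for creatinine > 2.0 mg/dL."""
    points = 0
    points += int(p.abnormal_rv_echo)
    points += int(p.abnormal_rv_ct)
    points += int(p.systolic_bp_mmhg < 100)
    points += int(p.dysrhythmia)
    points += int(p.systemic_infection)
    points += int(p.syncope)
    points += int(p.medico_social_admission_reason)
    # Assumed definition of abnormal heart rate: < 50 or > 100 bpm.
    points += int(p.heart_rate_bpm < 50 or p.heart_rate_bpm > 100)
    points += 2 * int(p.creatinine_mg_dl > 2.0)
    return points


def risk_band(score):
    """Map a score to the two thresholds in the text; 'intermediate' is an assumed label."""
    if score == 0:
        return "low"
    if score >= 5:
        return "high"
    return "intermediate"


example = PEScoreInputs(
    abnormal_rv_echo=False, abnormal_rv_ct=False, systolic_bp_mmhg=120,
    dysrhythmia=False, systemic_infection=False, syncope=False,
    medico_social_admission_reason=True, heart_rate_bpm=105, creatinine_mg_dl=2.4,
)
print(pe_score(example), risk_band(pe_score(example)))  # 4 intermediate
```

The example patient is hypothetical; it shows how a tachycardic patient with elevated creatinine and a medico-social admission reason accumulates 4 points without any RV imaging abnormality.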

Prognostic model performance
All 9 components of the prognostic model were available for full scoring of PE-SCORE for 888 of 935 patients (95%) in the development database and 737 of 801 patients (92%) in the validation database. In the development database, for the minimum score of zero, the proportion with primary composite outcome was 2%. Among those with scores of 6 or higher, the proportion with the composite outcome was at least 69.6%. The exception to the increase in outcome proportion with increasing score was in the middle range: 38% for a score of 3 and 35.6% for a score of 4. In the validation dataset, for the minimum score of 0, the proportion with primary outcome was 8%. Among those with scores of 6 or higher, 100% had the primary outcome. The discrepancy in the middle ranges was absent. Based on the results, we set a low-risk threshold for PE-SCORE at 0 points and high-risk threshold at 5 points.
We report on the AUCpr due to the imbalance in outcomes on both development and validation databases [54]. Fig 3 shows AUCpr for logistic regression model on the development database, followed by the AUCpr for PE-SCORE on the development and validation databases. In Table 5, we provide four metrics for calibration of PE-SCORE on the development and validation datasets and for the logistic regression model on the development database. Fig 4 shows calibration slope and intercept values to be excellent: 1) the Spiegelhalter z test did not indicate lack of fit (p > 0.05); 2) calibration curve slope values were close to 1.0 and linear regression intercept values were close to zero on both databases. Although the Hosmer-Lemeshow test suggested lack of fit (p < 0.1) for the full regression model and points model on the development database, those results were offset by three calibration test metrics that indicated excellent calibration [50].
Of 935 patients in the development database, 888 (95%) had complete responses for all nine components of the PE-SCORE tool. Of 801 patients in the validation database, 737 (92%) had complete responses for all nine components of the PE-SCORE tool.
GDE was the only variable missing. None of the other 8 variables used to calculate the PE-SCORE were missing. For patients with missing GDE, a modified PE-SCORE that did not include GDE was calculated with a reduced potential range of 0-9 points. Their modified scores had an actual range of 0-4. The percentages of patients experiencing the primary outcome among those with modified PE-SCORE scores of 0, 1, 2, 3, and 4 were 16.7%, 16.7%, 50.0%, 50.0%, and 0%, respectively. In comparison, the percentages of patients without missing GDE who experienced the primary outcome were 2.1%, 7.3%, 18.6%, 38.0%, and 35.6% among those with a PE-SCORE of 0, 1, 2, 3, and 4, respectively. Except for a modified score of 4, the percentage in each point category was higher for patients with missing GDE than for patients with the same score who were not missing GDE.
Table 6 shows traditional prognostic accuracy performance metrics for PE-SCORE (at two different risk thresholds) on the development and validation databases. We used a threshold of zero points for PE-SCORE to address low-risk stratification. A threshold of 5 for PE-SCORE indicates high-risk for clinical deterioration. At the lower-risk threshold, providers are most interested in the negative predictive value (NPV) of a prognostic model. We report on the model's performance in low-versus high-risk stratification because the decisions made are quite different. Low-risk stratification increases consideration for immediate outpatient clinical management, whereas high-risk stratification increases the intensity of monitoring.

Discussion
We used prospective registry databases and developed and validated an original prognostic model from a field of 138 candidate variables. The registry involved contemporaneous and early assessments for PE-provoked RV abnormalities with predefined laboratory and imaging assessments, and focused on outcomes of interest to providers at the point of decision-making and to pulmonary embolism response teams [5,7,13,28]. The final variables in the prognostic model are readily available during ED evaluation, including interview questions, witnessed events (syncope), vital signs at presentation or on a cardiac monitor (heart rate and systolic blood pressure), past medical history, routine laboratory findings, and imaging. The two imaging variables were CT RV:LV ratio, which is determined from CT images, and goal-directed echocardiography, which is performed at the patient's bedside and provides multiple dynamic images of the RV.

Table 6. Performance of PE-SCORE model at two risk thresholds on both databases.

In a meta-analysis of 71 prognostic model reports, 17 were original prognostic models like our study [4]. The other 54 reports were validating, updating, or investigating the impact of prognostic models. For the 17 original prognostic studies, the number of candidate variables ranged from seven to greater than 30. In five studies, the number of candidate variables was either unclear or not reported [16,55-58]. Few studies included imaging findings as candidate variables: echocardiography finding of abnlRV (one study), RV:LV ratio (two studies), CT findings of RV abnormality (one study), and ultrasound for venous thrombosis (four studies) [1,16,57-61].
Most reports on prognostic models for acute PE focus on outcomes of death, recurrent VTE, and bleeding at a time point of 30 days or longer [4,18,62]. In contrast, our study focused on death or clinical deterioration within five days of PE diagnosis, as outcomes that are important to providers and researchers [5,7,13,25].
Prognostic performance of the logistic regression and PE-SCORE models was strong for discrimination and calibration. The logistic regression model had an AUC of 0.83 in the development database. The user-friendly PE-SCORE points tool had AUCs of 0.78 and 0.77 in the development and validation databases, respectively. When decision-making priority is focused on patient candidacy for outpatient treatment, PE-SCORE set to a low-risk threshold has high negative predictive value. When the decision-making priority is determining whether increased intensity of monitoring or increased considerations for escalated PE treatment may be indicated, PE-SCORE set to a high-risk threshold has moderate accuracy.
Regression analyses provide plausible ranking of importance of RV imaging variables in PE risk stratification: GDE had greater odds ratio than CT. We used GDE instead of comprehensive echocardiography to visually detect PE-provoked abnormalities of RV size, pressure, and systolic function. To assign a GDE score of 1 or more, providers were required to detect RV dilatation (not severe RV systolic dysfunction or septal shift alone). The ordinal nature of GDE scoring was itself calibrated, showing increased odds of clinical deterioration as GDE scores increased.
The absence of variables in our final prognostic model deserves discussion. Troponin and natriuretic peptides are considered influential PE prognostic predictors in meta-analyses [18,22,63-65]. Although our study's univariable analyses showed significant differences in both cardiac biomarkers in outcome groups, neither troponin nor natriuretic peptide elevation was retained after regression analyses. Our findings rank the predictive accuracy of laboratory RV assessments lower than imaging RV assessments in a restricted prognostic model. It is plausible that natriuretic peptide and troponin do not directly identify the cardiac chamber experiencing acute myocardial stretch and myocardial injury. In contrast, GDE directly identifies RV dilatation and abnlRV systolic function. Age and cancer (predictors featured in models like PESI, sPESI, Hestia, and ESC) were not significant in univariable analyses or with penalized regression analysis. In the original PESI study, those aged > 65 years accounted for 52%-59% of the development and validation cohorts [2]. In our study, the proportion of patients aged 65 or older in both databases was lower at 39.3%. In the original PESI derivation and validation report, cancer was present in 19% and 16% of the databases, respectively. In our PE-SCORE study, cancer was present in 24.8% of PE patients, but did not reach statistical significance for prognosis of acute clinical deterioration.
The absence of advanced age or cancer as discrete variables in the PE-SCORE does not prevent these features from being considered at the provider's discretion as social or medical reasons for hospitalization or an increased level of monitoring, an important component of PE-SCORE. Original clinical scores or guidelines, which were developed for outcomes of death, recurrent VTE, or major bleeding 30 days or later, tend to be pragmatically modified or adapted to consider other social/medical conditions or laboratory or imaging findings instead of being used in isolation during clinical practice [7,13,15,66]. With PE-SCORE, variables for provider discretion on social or medical reasons for hospitalization for increased monitoring and RV imaging assessment are built in.
In contrast to our findings, some studies found that troponin and echocardiography findings of abnlRV did not have prognostic value in determining in-hospital adverse events [67]. Zondag et al. reported that although 35% of patients classified as low-risk by Hestia criteria had coexisting RV abnormality by CTPA, there was no difference in outcomes compared to patients without abnlRV [68].
Our study has several limitations. Although the validation was performed on a different database with data collected during a different time period, external validation should be conducted at sites outside the current registry. Our study focused on clinical deterioration and early mortality due to PE severity. We did not assess outcomes due to PE treatment (e.g., bleeding, bleeding risk, compliance with treatment), which would influence disposition decisions and need for safety outcomes. The study setting was focused on emergency department patients and ambulatory care settings where the cadence and feasibility of testing may not be generalizable to patients developing acute PE while already in the inpatient setting. Already hospitalized patients who develop acute PE may have different risk factors or susceptibilities to PE-associated deterioration from those diagnosed in an outpatient setting.
Our a priori study design included using troponin measurements as continuous data; however, institutional change in troponin assay at the central site interrupted plans to perform linear regression on the troponin variable. Similarly, two of the six sites used NT proBNP, while others used point-of-care BNP assay measurements. Therefore, we used institutional assay cut-offs to create categorical variables (troponin and natriuretic peptide elevation). Univariable analyses showed significant differences in mean troponin, point-of-care BNP, and NT proBNP measurements between outcome groups in both databases. Valuable information, however, may have been lost by converting a continuous variable into a categorical variable [69].
Univariable analysis identified the clinical site itself as a variable of importance. The logistic regression model therefore has a random effects intercept for clinical sites. The random intercept cannot be used in a risk calculation on patients at sites outside of the six sites of this study, as the random effect of the new site is unknown. Thus, only the fixed intercept of the random effects model is used in the risk calculation. Model performance at a clinical site outside the six sites in this study may differ. Other discrete variables that may be of interest (e.g., median income, insurance status, other social determinants of health) were not included in this study. Despite significant differences in patient characteristics between sites, the prognostic model performed well on patients.
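The fixed-intercept risk calculation described above can be written out as a sketch. The notation is ours and is not taken from the fitted model: $Y_{ij}$ is the primary outcome for patient $i$ at site $j$, $\mathbf{x}_{ij}$ is the vector of retained predictors, $\beta_0$ is the fixed intercept, and $u_j$ is the site-specific random intercept. The normal distribution for $u_j$ is the conventional assumption for such models and is stated here only as an assumption.

```latex
\begin{align}
\operatorname{logit}\Pr(Y_{ij}=1)
  &= \beta_0 + u_j + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta},
  \qquad u_j \sim \mathcal{N}(0,\sigma_u^2) \\
\hat{p}_{\text{new}}
  &= \frac{1}{1+\exp\!\left\{-\left(\hat{\beta}_0
     + \mathbf{x}_{\text{new}}^{\top}\hat{\boldsymbol{\beta}}\right)\right\}}
\end{align}
```

For a patient at a site outside the six study sites, $u_j$ is unknown, so the second expression sets it to zero (its assumed mean) and uses only the fixed intercept and fixed-effect coefficients.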
Another possible limitation of our report is that machine-based learning derivation techniques may offer better management of multiple variables (including those with interactions); however, our preliminary steps with classification tree analysis were not helpful. The logistic regression model we developed had an AUC of 0.83 (95% CI 0.80-0.86), whereas the PE-SCORE yielded an AUC of 0.78 on the development database. Although PE-SCORE had lower prognostic performance than the logistic regression model, PE-SCORE performed similarly by AUC on both databases and offers real-world usefulness at the site-of-care.
It is plausible that definitions of candidate variables may be modified or optimized in future updates or revisions of prognostic tools. For example, other CT-derived variables, such as contrast reflux or the pulmonary arterial occlusion index, may be considered as candidate variables for a prognostic model. We only used CT RV:LV ratio from the CT. Initial oxygen saturation by pulse oximetry and initial respiratory rate at presentation were not retained. Both clinical variables were measured with patients at rest. Because exertional shortness of breath is a common symptom of PE, oxygen saturation and respiratory rate (measured after fixed and defined exertion) may yield different results when developing a prognostic tool. Even retained variables, like initial heart rate, can be optimized by measuring heart rate after fixed exertion or by using highest heart rate within a fixed time interval as a candidate variable.
Our prognostic model included creatinine elevation (greater than 2.0 mg/dL) as a parameter of renal function. Other reports have identified acute renal injury as a prognostic factor [70,71]. Acute renal injury was not included as a candidate variable in PESI/sPESI. In our study, we did not attempt to differentiate renal injury from renal failure. We did not use glomerular filtration rate to assess renal function, and we used a modest cut-off for creatinine level evaluation for provider use. It is possible that a different creatinine cut-off value or a different parameter of renal function may offer better prognostic performance.
Another possible limitation is that we used an ordinal GDE score of visually estimated severe RV dilatation (absolute or relative to the left ventricle) and severe RV systolic dysfunction. Use of quantitative measurements from two-dimensional or other echocardiographic modalities may increase risk stratification stringency or provide recommendations for optimal cut-offs.
Although this study was performed at academic centers, competency in GDE has been expected of those emerging from EM residency training for the past decade. Our results may indicate an opportunity to study the impact of incorporating GDE into PE risk stratification. Upon external validation, any real-world application of PE-SCORE would include a recommendation that technically difficult or uninterpretable GDE images limit full use of PE-SCORE. None of the other eight variables used to calculate PE-SCORE were missing during development. When faced with absent GDE scores, providers should use available clinical information, recognizing that the worst-case scenario (that GDE is abnormal) has not been ruled out. Providers may either add a point or consider the partial PE-SCORE a minimum score. The other real-world option is to consider comprehensive echocardiography (by the cardiology service).
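The handling of an absent GDE result described above can be illustrated with a minimal sketch. It assumes each PE-SCORE item has already been adjudicated as present or absent by the provider; the function name, argument names, and the use of `None` for a missing or uninterpretable GDE are illustrative assumptions, not part of the published tool. Point weights follow the tool as reported: one point per item and two points for creatinine greater than 2.0 mg/dL.

```python
from typing import Optional, Tuple


def pe_score_range(
    abnormal_rv_echo: Optional[bool],  # None = GDE absent or uninterpretable (assumed encoding)
    abnormal_rv_ct: bool,
    sbp_below_100: bool,
    dysrhythmia: bool,
    systemic_infection: bool,
    syncope: bool,
    medico_social_admission: bool,
    abnormal_heart_rate: bool,
    creatinine_above_2: bool,
) -> Tuple[int, int]:
    """Return (minimum, worst-case) PE-SCORE for one patient."""
    # Items other than GDE that score one point each.
    one_point_items = [
        abnormal_rv_ct,
        sbp_below_100,
        dysrhythmia,
        systemic_infection,
        syncope,
        medico_social_admission,
        abnormal_heart_rate,
    ]
    # Creatinine > 2.0 mg/dL scores two points.
    partial = sum(one_point_items) + (2 if creatinine_above_2 else 0)

    if abnormal_rv_echo is None:
        # GDE missing: the partial score is a minimum; the worst case
        # assumes the GDE would have been abnormal (one extra point).
        return partial, partial + 1

    total = partial + int(abnormal_rv_echo)
    return total, total
```

When the two returned values differ, the first corresponds to treating the partial PE-SCORE as a minimum score and the second to adding a point for a presumed abnormal GDE.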
Most of the clinical outcomes were determined during hospitalization and may not have been recognized outside of the hospital setting. The study design did not directly assess the impact or safety of implementing the prognostic prediction or its PE-SCORE on provider decisions regarding disposition, level of monitoring needed, or escalation of treatment.
Potential benefits of PE-SCORE include early detection of deterioration and avoiding misclassification of patients who experience the outcome but would have been classified as low-risk by another prognostic tool. Potential harms may include unnecessary testing or interventions in those who did not experience any clinical deterioration outcomes despite higher risk classification, subjecting them to potential adverse events of the interventions, and increased lengths of stay and medical costs. After external validation, we anticipate use of the PE-SCORE tool in acute care settings with similar prevalence of early clinical deterioration to identify PE patients likely to benefit from early discharge and those who may need higher level monitoring and escalated PE interventions. However, incorporation of any new prognostic tool into clinical practice requires implementation and impact studies to better understand the clinical consequences [72].

Conclusions
We have summarized development and validation of a new prognostic tool that uses readily available imaging findings from CT, GDE, vital signs, and interview information. A PE-SCORE score of zero conferred a low probability and a score of ≥ 6 predicted high probability of clinical deterioration/death within days of PE diagnosis. External validation may support use of this prognostic tool to inform decisions about early outpatient management versus the need for hospital-based monitoring and considerations for escalated PE interventions.
Supporting information S1