Prediction of 30-day pediatric unplanned hospitalizations using the Johns Hopkins Adjusted Clinical Groups risk adjustment system

Background The Johns Hopkins ACG System is widely used to predict patient healthcare service use and costs. Most applications have focused on adult populations. In this study, we evaluated the use of the ACG software to predict pediatric unplanned hospital admission in a given month, based on the past year’s clinical information captured by electronic health records (EHRs).
Methods and findings EHR data from a multi-state pediatric integrated delivery system were obtained for 920,051 patients with at least one physician visit during January 2009 to December 2016. Over this interval an average of 0.36% of patients each month had an unplanned hospitalization. In a 70% training sample, we used a generalized linear mixed model (GLMM) to generate regression coefficients for demographic predictors, clinical predictors derived from the ACG system, and prior-year hospitalizations. Applying these coefficients to a 30% test sample to generate risk scores, we found that the area under the receiver operating characteristic curve (AUC) was 0.82. Omitting prior hospitalizations decreased the AUC from 0.82 to 0.80 and increased under-estimation of hospitalizations at the higher risk levels. Patients in the top 5% of risk scores accounted for 43%, and those in the top 1% accounted for 20%, of all unplanned hospitalizations.
Conclusions A predictive model based on 12 months of demographic and clinical data using the ACG system has excellent predictive performance for 30-day pediatric unplanned hospitalization. This model may be useful in population health and care management applications targeting patients likely to be hospitalized. External validation at other institutions should be done to confirm our results.


Introduction
About one-third of pediatric healthcare costs result from hospital admissions [1]. In 2012 the average cost of a pediatric hospitalization in the United States was $6,415, at a rate of 7,928 stays per 100,000 population aged 0-17 years; when neonatal stays were excluded, the average cost rose to $11,143 and the rate fell to 2,505 stays per 100,000 population [2]. Health systems that seek to reduce costs or admissions, either to improve efficiency or patient flow, often target patients at high risk of hospitalization. To develop and aim appropriate programs, risk assessment tools are needed that can accurately identify an at-risk population. Unfortunately, there are few pediatric-specific risk assessment tools that can be used to segment a population by its need for care management or other preventive services [3].
Certain types of hospitalizations are predictable because they are scheduled admissions for such indications as chemotherapy, surgery, and diagnostic tests. The majority, however, are unplanned and thus have some degree of associated preventability. Although there have been several studies on risk factors for pediatric readmission [4][5][6][7][8][9][10], there has been less attention given to developing predictive models for unplanned hospitalizations in populations of children and adolescents.
Our aim in this study was to develop a parsimonious risk model that used patient demographic, clinical, and service use data over a one-year period to predict unplanned hospitalization (i.e., excluding admissions scheduled in advance of the admission date) in the next 30 days. Rather than developing a completely novel model, we built on the established Johns Hopkins ACG System's clinical markers as the core of our modeling approach [11]. Prior studies have demonstrated that the ACG system is useful for classifying pediatric populations by levels of healthcare service use [12][13][14], but none has used this risk adjustment system to predict pediatric unplanned hospitalizations.

Data source and study sample
This study was done using Electronic Health Record (EHR) data for patients seen in the Children's Hospital of Philadelphia (CHOP) health system. CHOP includes a large primary and specialty care outpatient network and a major inpatient facility that services a primary healthcare market in the states of Pennsylvania, New Jersey, and Delaware. Data were extracted from the CHOP EHR System (Epic) for visits in outpatient, emergency department, and inpatient settings for patients with at least one physician visit in any of these settings from January 2009 to December 2016. During the study period, 920,226 patients met these selection criteria. Applying a criterion that the children not already be hospitalized at the start of the reference month (see following) reduced the population to 920,051. The CHOP Institutional Review Board designated this study as not human subjects research.

Unplanned hospitalization
Because a portion of pediatric hospitalizations are scheduled for such activities as inpatient chemotherapy administration, neurological testing, and surgery and thus are not preventable, we focused on those that were unplanned. These hospitalizations have been confirmed as real events, and not administrative artifacts, by ensuring that the site of care was an inpatient place of service in the CHOP hospital. Unplanned hospitalizations were those that were not flagged as elective hospitalizations in an Admission/Discharge/Transfer table in our database. Among all confirmed hospital admissions during the study period, 87% were unplanned.

Predictors
Clinical variables were derived from the ACG system and included its DxPM score (a diagnosis-based probability estimate for patient risk of future healthcare use [3]), number of chronic conditions (0, 1, 2, 3+), and number of hospital dominant conditions (0, 1, 2+), the latter defined as a diagnosis associated with at least a 50% probability of hospitalization among patients of all ages within the coming year [15]. DxPM was categorized based on the percentile value for the cohort in the preceding year: 0-50% was the default, and other categories were 51-75%, 76-85%, 86-95%, 96-98%, and 99%. Demographic predictors were patient age, gender, race/ethnicity, and insurance type. Age was treated as a categorical variable, with the age of the patient's first visit during the prior year divided into three-month blocks up to three years and one-year blocks afterward up to age 18. We used finer age stratifications in the first year of life because infancy holds the highest risk of hospitalization (excluding inpatient stays for birth). The insurance types were binary variables, defined as whether prior coverage of the patient was public insurance, private or self-pay. The number of unplanned hospitalizations in the past year was categorized as 0, 1, or 2+; because prior hospitalizations turned out to be a strong predictor and we were concerned about potential bias using hospitalizations to predict hospitalizations, we tested an alternative model omitting this predictor.

Statistical analyses
We generated 84 epochs (12 months x 7 years) on a sliding window of 12 months of patient data across 2009-2016. Each successive window began and ended a month later than the preceding. For instance, the period January 2009 through December 2009 was used to predict a hospitalization occurring in January 2010, and so on. We split the study population into a 70% training sub-sample to develop the models and a 30% test sample to test model performance on a different set of patients.
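The sliding-window construction and the patient-level split described above can be sketched in a few lines. This is an illustrative sketch only, not the study's code: the function names (`add_months`, `build_epochs`, `split_patients`), the dictionary keys, and the random seed are assumptions introduced here.

```python
import random
from datetime import date

def add_months(d: date, n: int) -> date:
    """Shift a first-of-month date by n calendar months."""
    idx = d.year * 12 + (d.month - 1) + n
    return date(idx // 12, idx % 12 + 1, 1)

def build_epochs(first_window_start: date, n_epochs: int) -> list:
    """Each epoch pairs a 12-month look-back window with the month that follows it."""
    epochs = []
    for i in range(n_epochs):
        start = add_months(first_window_start, i)
        epochs.append({
            "epoch": i + 1,                                # position in the sequence
            "lookback_first_month": start,
            "lookback_last_month": add_months(start, 11),
            "target_month": add_months(start, 12),         # month being predicted
        })
    return epochs

def split_patients(patient_ids, train_fraction=0.7, seed=0):
    """Split by patient (not by record) so the test set holds different patients."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return set(ids[:cut]), set(ids[cut:])

epochs = build_epochs(date(2009, 1, 1), 84)
train_ids, test_ids = split_patients(range(10))
```

With these assumptions the first epoch uses January through December 2009 to predict January 2010, and the 84th epoch predicts December 2016.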
Logistic regression was used to model the risk of a patient being hospitalized in the current month. Patients who were already in hospital, with an admission that began before the current month and extended into or past it, were excluded. For this exclusion, we did not limit to planned or unplanned hospitalizations, or apply the other checks used to confirm unplanned hospitalizations.
As the outcome is a binary variable representing whether the patient had any admissions in a given month, this will necessarily drop hospitalizations that are readmissions that follow an admission earlier in that month. Similarly, our exclusion rule drops admissions that are readmissions for patients who are excluded due to an ongoing hospitalization as described above.
A generalized linear mixed model (GLMM) for prediction of risk of hospitalization in the current month was created using the demographic, clinical, and prior hospitalization variables as predictors. Multiple measurements from the same patient were accounted for with a patient-level random effect, which described how each patient's individual risk might vary from that of the overall population after controlling for the predictors. To account for time-varying trends, we also included month of epoch (12 values, January through December) and its position in the sequence (a real-valued number scaling from 1 to 84). The GLMM was implemented in the statistical computer language R [16] using the lme4 package [17].
The GLMM was used to derive risk scores computed as the beta coefficients from the model derived from the training sample and applied to the covariates for patients in the testing sample. The scores were based only on the demographic, clinical and prior use coefficients, not on patient-based random effects or the time-based predictors added to the GLMM. Patient-based effects had to be dropped as the random effects would not be applicable to the test set or any new group of patients. Time-based predictors would not be relevant within a given epoch. This approach allowed us to classify patients by risk of future hospitalization within a given epoch using a consistent approach across all epochs.
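As a minimal sketch of how such a score can be computed, the function below sums the fixed-effect coefficients for the categories a patient falls into, treating reference categories as zero and ignoring the intercept, random effects, and time-based terms. The variable names, category labels, and coefficient values are hypothetical placeholders, not the fitted values from this study.

```python
def risk_score(patient: dict, coefs: dict) -> float:
    """Sum the coefficients for the patient's categories.

    Categories missing from `coefs` are treated as reference levels (0.0).
    The intercept, patient-level random effects, and time-based predictors
    are deliberately left out, as described in the text.
    """
    return sum(coefs.get((var, level), 0.0) for var, level in patient.items())

# Hypothetical coefficients, for illustration only.
coefs = {
    ("dxpm_percentile", "99"): 2.0,
    ("chronic_conditions", "3+"): 0.5,
    ("prior_unplanned_hosp", "1"): 0.75,
}

# Hypothetical test-set patient described by categorical predictors.
patient = {
    "dxpm_percentile": "99",
    "chronic_conditions": "0",      # reference level, contributes 0.0
    "prior_unplanned_hosp": "1",
}

score = risk_score(patient, coefs)
```

Because the score omits the epoch-specific terms, it ranks patients consistently within any epoch but is not itself a probability.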
Area under the curve (AUC), which estimates the probability that a hospitalized patient will outscore a non-hospitalized patient, was used to describe how well the model can discriminate among patients at different risk for hospitalization. As a model can behave better on a training set than on a new set of data, model optimism was defined as the difference between the AUC for the training set and the AUC for the test set [18]; some decline in AUC is expected, as the model can fit noise as well as real effects in the data, but a large decline would indicate that the model results may not be generalizable.
Table 1 shows the distribution of clinical, demographic and hospitalization variables among patients. Because age, prior hospitalizations in past year, and clinical variables can all be expected to change across time windows, the table shows the number and percentage of patients with at least one record in a given value. The table also shows the distribution of time windows (epochs) that include a particular patient, and the distribution of patient parameters across these epochs. A total of 53,091 patients (5.72%) were represented in all 84 epochs, and the median number of epochs per patient was 24. Finally, the total number of unplanned hospitalizations (barring exclusions for patients already in hospital, as described above) for each epoch was tallied and used to estimate the overall rate of hospitalization in a given month across all epochs and within specific categories.
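The pairwise interpretation of AUC and the definition of optimism used in this section can be illustrated directly. The sketch below is a brute-force version for small examples, with ties counted as half a win (an assumption made here); it is not the computation used in the study, and a rank-based routine would be needed for data of this size.

```python
def auc(scores, outcomes) -> float:
    """Probability that a hospitalized patient outscores a non-hospitalized one.

    outcomes: 1 = hospitalized in the target month, 0 = not hospitalized.
    Ties in score count as half a win (an assumption of this sketch).
    """
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one event and one non-event")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def optimism(auc_train: float, auc_test: float) -> float:
    """Model optimism: training-set AUC minus test-set AUC."""
    return auc_train - auc_test

# Toy data: two hospitalized and two non-hospitalized patients.
toy_auc = auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

In the toy data, three of the four hospitalized/non-hospitalized pairs are ordered correctly, so `toy_auc` is 0.75.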

Results
The 84 epochs contained an average of 369,980 patients (SD 21,759), of whom an average of 1,322 (SD 132) or 0.36% (SD 0.04%) were hospitalized in the next month. There was some seasonal effect: the rates in December and January averaged 0.40%, while those in July averaged 0.31% (S1 Fig). There was also evidence of a long-term decline over time with monthly hospitalization rates declining from about 0.37% in 2009 to 0.33% in 2016 (S2 Fig). The declining rate was due to fairly constant hospitalization counts with an increasing size of the at-risk population. Because of these trends, the GLMM model across all epochs included a linear term for decline of hospitalization rate and a month-based factor for the seasonal variation.
The GLMM model coefficients with standard errors are shown in Table 2, with positive values reflecting increased risk of hospitalization. We found that prior hospitalizations had a large predictive value for new hospitalizations, so for comparison, we also show the GLMM coefficients for the alternative model fit without prior hospitalizations as a predictor. A striking factor is the 'U-shaped' estimate of the effect of age, decreasing with age for the first several years of life, and then increasing again at age 13. Also note the seasonal variation, where risk is higher in the winter and lower in the summer. Although the alternative model without prior hospitalizations does not perform as well as the main model (see following), there is little difference in parameters between model fits.
Omitted from Table 2 for clarity are two parameters which are not included in the GLMM-derived score, although they are included in the calculation of predicted hospitalization risk for patients in a given epoch. One parameter is the intercept (baseline value), which for the main model is -7.381 (SE 0.027), corresponding to a baseline risk of hospitalization of 0.06% per month. The other is the per-epoch adjustment, which has a coefficient of -0.048 (SE 0.002) per year.
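The conversion from the intercept to a baseline monthly risk follows from the inverse logit of a logistic model, and the per-year coefficient can be read as an odds ratio. The short sketch below reproduces that arithmetic; the helper name `inv_logit` is introduced here for illustration.

```python
import math

def inv_logit(x: float) -> float:
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

intercept = -7.381          # main-model intercept reported in the text
per_year_coef = -0.048      # per-epoch adjustment, expressed per year

baseline_risk = inv_logit(intercept)        # roughly 0.0006, i.e. about 0.06% per month
yearly_odds_ratio = math.exp(per_year_coef) # roughly 0.95, i.e. odds fall about 5% per year

print(f"baseline monthly risk: {baseline_risk:.4%}")
print(f"odds ratio per year:   {yearly_odds_ratio:.3f}")
```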
The fixed effects, without the time-dependent predictors per month or per epoch, were used to generate a score to identify hospitalization risk for patients within each epoch. The results were compared for the training and test patient populations. The AUC for all epochs was 0.826 in the training set and 0.821 in the test set, suggesting negligible overfitting. When we omitted prior hospitalizations, AUC fell to 0.808 for training and 0.802 for test. There were no visible trends in AUC over time. Table 3 shows how the decile of calculated score compares to both the observed hospitalization rates and the predicted rates from the GLMM including time-varying fixed effects but not patient-level random effects. These random effects were left out of the prediction calculation because they are not available for the test set and will not be available for patient populations outside our own. The intra-class correlation coefficient for the GLMM is 0.215, indicating that 21.5% of the variability in results can be attributed to patient-specific factors that would be accounted for in the omitted patient-level random effect. Deciles were calculated within epoch so that it would be possible to get an idea of variability.
Table 1. Distribution of patients and demographic/clinical variables. Left column is by individual patient and whether they had at least one epoch (time window) with a given factor. Middle column is total number of epochs, treating the same patient in different epochs as different records. Monthly hospitalization rate, the rightmost column, is calculated from the total number of hospitalizations and the total number of epochs in a given category.
Note that at the highest decile, the model prediction underestimates the true unplanned hospitalizations. Plotting the ratio of observed/predicted rates against decile (Fig 1), we see that the main model tends to under-estimate lower risks of hospitalization, and that the observed/predicted ratios have parallel increases with decile. Comparing the main model to the model without prior hospitalizations, we can see that the reduced model further underestimates the percentage at higher rates. The higher AUC for the main model may be attributable to better discrimination between low- and high-risk patients, even if the actual assessment of risk is biased.
To examine the feasibility of targeting patients at greater risk of hospitalization, we looked at hospitalizations captured in groups defined by increasing cut-offs of score based on percentile within an epoch, using data from the test sample. Using a 10% cut-off, an average of 56% of all observed unplanned hospitalizations were captured in the group of records above the cut-off; the top 5% accounted for 43% and the top 1% accounted for 20% of hospitalizations.
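The capture calculation for a single epoch can be sketched as follows: rank patients by score, take the top fraction, and divide the hospitalizations in that group by all hospitalizations in the epoch. The function name, the rounding of the group size upward, and the arbitrary handling of tied scores are assumptions of this sketch rather than details taken from the study.

```python
import math

def capture_rate(scores, outcomes, top_fraction: float) -> float:
    """Share of all events that fall among the top-scoring fraction of patients.

    scores:   risk scores for the patients in one epoch
    outcomes: 1 if the patient had an unplanned hospitalization, else 0
    """
    n = len(scores)
    total_events = sum(outcomes)
    if n == 0 or total_events == 0:
        raise ValueError("need at least one patient and one event")
    k = max(1, math.ceil(n * top_fraction))          # size of the targeted group
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    captured = sum(outcomes[i] for i in ranked[:k])
    return captured / total_events

# Toy epoch: ten patients with scores 0..9 and two hospitalizations.
toy_scores = list(range(10))
toy_outcomes = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
top10 = capture_rate(toy_scores, toy_outcomes, 0.10)
```

In the toy epoch the single highest-scoring patient is one of the two hospitalized patients, so the top 10% captures half of the events. Averaging this quantity over epochs would give figures comparable to those reported above.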
To address whether the model bias at higher rates could be attributed to specific diagnoses, we calculated the ratio of average hospitalizations and average predicted rate for each patient in the test set and linked the resulting table to the condition records to determine which Major Expanded Diagnosis Clusters (MEDC) from the ACG system were associated with higher ratios of observed hospitalization to predicted rates. We limited the analysis to those conditions associated with direct visits (inpatient, outpatient, ER or observation). The MEDC codes

Discussion
This study sought to determine whether the Johns Hopkins ACG risk adjustment system is useful for the specific question of hospitalization risk within the limited population of pediatric patients. The results are encouraging. The AUC, describing discrimination power of the scoring model, is 0.821. The closest analogue in the literature to the current model may be predictive models for 30-day readmissions, and prior studies did not see an AUC above 0.83 and only a minority of studies had AUC above 0.70 [8,19]. There are two benefits of this. One is that we have a new assessment of what risk factors hold for pediatric patients. Although some of our findings, such as the effect of race, may be more specific to our patient cohort, the seasonality and age-based coefficients may be of more general applicability. The other is that we have shown that an existing validated clinical software package can be used to distill a patient's potentially complex history into a parsimonious set of predictors for outcome modeling. For our model, we must consider whether further refinements could improve performance, particularly among the highest risk patients. One avenue for expanding the current model is in considering hospitalization risk beyond the current month. However, a model which predicts multiple hospitalizations over a period of a few months may require added sophistication to account for correlations between longitudinal measurements for the same patient. Tools for such models are currently available [20] but still relatively experimental.
An assumption of our model is that all prior admissions are equal: we do not distinguish between admission and readmission, or consider whether there are readmissions that would lead to more than one hospitalization in a given month. The question of whether all admissions are the same may also impact the outcome being modeled. For example, Leyenaar et al considered whether the time-sensitive nature of some conditions made direct admission or admission through ER more appropriate for some patients [21].
It is reasonable to assume that patients at greater risk for short-term readmission may also be at increased risk for hospitalization over a longer time frame [22]. The type and extent of surgery is known to affect readmission rate [5,7], as is length of stay during a hospitalization [14,23]. Auger and Davis found that patients admitted on a weekend were more likely to be readmitted within 30 days [10]. All of these factors should be available in a database.
Cecil et al followed a birth cohort specifically to examine factors affecting unplanned admissions [24]. They found that higher usage of outpatient visits, indicating a sicker child, is a potential indicator of greater risk of unplanned admissions; among 5-9 year-old children, an additional sick outpatient visit per year increased the risk of unplanned admissions by 23%. The other finding of note from this study was that incomplete vaccinations increased the risk among 1-4 year-old children by 89%. Outpatient visits are one indicator of children who are sicker or otherwise more prone to hospitalization. Another is emergency visits, which have been seen as a factor in hospitalization [25] and readmission [5] rates. These are examples of additional predictors that could be added to our model.
Our predictive model for unplanned hospitalization does not consider environmental factors such as climate, pollution, or family situation. These data are now readily available by linking EHR data to area-level data-sets using the patient's residence and converting it to census block or tract [26]. The current effort was deliberately limited to information that would be available solely in EHRs.