A comparison of three methods in categorizing functional status to predict hospital readmission across post-acute care

Background: Methods used to categorize functional status to predict health outcomes across post-acute care settings vary significantly.
Objectives: We compared three methods that categorize functional status to predict 30-day and 90-day hospital readmission across inpatient rehabilitation facilities (IRF), skilled nursing facilities (SNF) and home health agencies (HHA).
Research design: Retrospective analysis of 2013–2014 Medicare claims data (N = 740,530). Data were randomly split into two subsets using a 1:1 ratio. We used half of the cohort (development subset) to develop functional status categories for three methods, and then used the rest (testing subset) to compare outcome prediction. Three methods to generate functional categories were labeled as: Method I, percentile based on proportional distribution; Method II, percentile based on change score distribution; and Method III, functional staging categories based on Rasch person strata. We used six differentiation and classification statistics to determine the optimal method of generating functional categories.
Setting: IRF, SNF and HHA.
Subjects: We included 130,670 (17.7%) Medicare beneficiaries with stroke, 498,576 (67.3%) with lower extremity joint replacement and 111,284 (15.0%) with hip and femur fracture.
Measures: Unplanned 30-day and 90-day hospital readmission.
Results: For all impairment conditions, Method III best predicted 30-day and 90-day hospital readmission. However, we observed overlapping confidence intervals among some comparisons of three methods. The bootstrapping of 30-day and 90-day hospital readmission predictive models showed the area under curve for Method III was statistically significantly higher than both Method I and Method II (all paired-comparisons, p<.001), using the testing sample.
Conclusions: Overall, functional staging was the optimal method to generate functional status categories to predict 30-day and 90-day hospital readmission. To facilitate clinical and scientific use, we suggest the most appropriate method to categorize functional status should be based on the strengths and weaknesses of each method.


Introduction
In many disciplines of medicine, clinical staging refers to hierarchical categories along the continuum of the measured construct [1][2][3]. The concept of "clinical staging" is also applied in acute and post-acute prospective payment systems, for example, in the skilled nursing facility (SNF) resource utilization groups, known as case-mix groups [4,5]. Individuals in the same SNF resource utilization group are expected to share common abilities, respond similarly to assessment items, and likely have analogous needs for resources or equivalent costs of care [4,5]. When applied to functional status, where it is known as "functional staging", such categorization allows clinicians to plan care and track prognosis accurately, and enables researchers to define and refine case-mix adjustment groups. Functional staging can also be used to examine intervention effectiveness [6][7][8][9][10], and it enables meaningful categorical comparisons within and across groups of persons and settings.
While continuous scores may provide detailed clinical information for clinicians [11,12], categorizing scores facilitates policy discussion and decision-making. Additionally, a continuous score is typically reported as a summed score, and the same summed score can in fact represent different levels of performance [13]. The site-neutral unified payment model, proposed by the Medicare Payment Advisory Commission [14], recommends eliminating payment differences across settings for patients with similar case-mix demographics and severity of impairments. Generating categories based on functional status provides clinical evidence for unified payment models and other health reform measures. Investigators have demonstrated that adding functional status categories in risk-adjustment models (e.g., hospital readmission) reduces differences in population-level case-mix [15,16]. Adding functional status categories in predictive models can therefore improve the equality of resource allocation and care quality, and generate more accurate estimates of care costs [17]. Practitioners and researchers have used functional status categories to present hierarchical levels of patients' function for decades [6][7][8][9][10]. However, methods used to categorize functional status to predict health outcomes are often arbitrary and vary significantly. To identify the optimal method to categorize functional status, we compared three approaches to developing functional status categories to predict hospital readmission. Method I is a conventional percentile approach: tertile, quartile or quintile based on the summed-score distribution. Method II combines the change score with the percentile method: tertile, quartile or quintile of the change score between admission and discharge. Method III is a functional staging method using person strata categories based on latent trait theory.
This paper aims to identify the best-performing approach to categorizing functional status for predicting hospital readmission among Medicare beneficiaries. Hospital readmission was chosen as the main outcome in this study because it is an important national quality measure of patient care [4,18].

Data source
The study included 100% Medicare claims data from 2013-2014. We used the following data files: Inpatient Rehabilitation Facility (IRF) and Inpatient Rehab Facility-Patient Assessment Instrument (IRF-PAI) [19]; Skilled Nursing Facility (SNF) and Minimum Data Set (MDS 3.0) [20]; Home Health Agency (HHA) and Outcome and Assessment Information Set (OASIS-C) [21]; the Medicare Provider Analysis and Review and the Master Beneficiary Summary files.

Ethical assurances
This study was approved by the University Institutional Review Board (IRB # 16-0014). Additionally, a Data Use Agreement was established with the Centers for Medicare and Medicaid Services prior to all data analyses.

Cohort selection
We identified 2,953,006 eligible cases using a combination of Medical Severity Diagnosis Related Group codes and ICD-9 procedure codes for three impairment conditions: stroke (061-066), lower extremity joint replacement (469-470, 81.51 and 81.54) and hip/femur fractures (480-482). Using a combination of claims and assessment data, we included only those beneficiaries discharged from a hospital to one of the three post-acute care (PAC) settings: IRF, SNF and HHA. After applying exclusion criteria (S1 Table), the final analytical sample included 740,530 cases: 17.7% with stroke (n = 130,670), 67.3% with lower extremity joint replacement (n = 498,576), and 15.0% (n = 111,284) with hip and femur fracture (Table 1).
To develop and validate the three proposed methods, we used a 1:1 ratio to randomly split the study cohort into a development subset (n = 370,265) and a testing subset (n = 370,265). The development subset was used to develop functional status categories for the three methods. The testing subset was used to compare outcome prediction across the three methods.
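The 1:1 random split described above can be illustrated with a minimal sketch. The function name `split_cohort`, the fixed seed, and the use of integer case identifiers are assumptions made for illustration; the study itself was run in SAS and its exact randomization procedure is not described here.

```python
import random

def split_cohort(case_ids, seed=0):
    """Randomly split case identifiers into two halves (1:1 ratio).

    The seed and the identifiers are hypothetical; they only make the
    illustration reproducible.
    """
    shuffled = list(case_ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# 740,530 cases split into a development and a testing subset.
development, testing = split_cohort(range(740_530))
print(len(development), len(testing))  # 370265 370265
```

Because the split is made once at the case level, no case contributes to both the category development and the outcome comparison.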
We also conducted a sensitivity analysis to examine differences in demographics and person-level characteristics before and after excluding 23% of potential patients (step 12 vs. step 15 in S1 Table). The cohort in step 12 included patients who did not receive PAC. The step-12 cohort, which included these 23% of patients, had a shorter total SNF stay within 90 days at IRF than the cohort used in this study (generated by step 15). However, no other variables differed significantly between the step-12 cohort and our study cohort (S6 Table).

Study outcome
The primary outcome was unplanned all-cause 30-day and 90-day hospital readmission (yes/no) after index hospital discharge [18,22]. We chose a 30-day window to reflect the current reimbursement system. Additionally, we included a longer follow-up period (90 days) to be consistent with the episode-based payment initiatives [23,24].

Primary variable
The primary variable was functional status categories for two domains (Self-Care and Mobility) generated from three methods (details below). The Self-Care and Mobility domains were chosen because they are consistently measured across the PAC settings. Additionally, these two domains are potentially modifiable factors relevant to hospital readmission.

Functional status categories
Comparable items of the Self-Care and Mobility domains from each assessment were selected based on their conceptual meanings (e.g., eating items were selected from IRF-PAI, MDS and OASIS as the three items measure the same activity: eating). The number of selected items by assessment was 11 in IRF-PAI (6 Self-Care and 5 Mobility items), 11 in MDS (5 Self-Care and 6 Mobility) and 8 in OASIS (5 Self-Care and 3 Mobility) (S2 Table). We used co-calibration tables [25] to co-calibrate Self-Care and Mobility scores separately onto a 0-100 scale for the following three methods.
Method I: Percentile based on proportional distribution. For each impairment condition, we created tertile, quartile and quintile categories based on the co-calibrated summed score distribution for each assessment. Self-Care and Mobility had the same number of categories. We first generated percentiles for each assessment and then used c-statistics to determine whether to choose tertiles, quartiles or quintiles for each impairment condition in each setting. Based on the c-statistics, quartiles were chosen for stroke and lower extremity joint replacement, and quintiles were chosen for hip and femur fracture. S1 Fig demonstrates an example of using Method I to generate functional categories of IRF-PAI Self-Care in stroke. The same procedure was repeated for MDS and OASIS across impairment conditions. Detailed categories are provided in S3 Table.
Method II: Change score with percentile distribution. We first calculated the change score between admission and discharge for each assessment (Self-Care and Mobility were calculated separately). Secondly, we calculated percentiles (tertiles, quartiles and quintiles) based on the change score distribution.
Lastly, to increase clinical meaningfulness when interpreting negative, zero and positive change scores, we combined the percentile change score distribution with the following operational definitions: tertiles (small, medium and large change), quartiles (negative and zero change, small positive change, medium positive change and large positive change) and quintiles (negative change, zero change, small positive change, medium positive change and large positive change).
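As a rough illustration of the quintile definition above, the sketch below labels a change score as negative, zero, or small/medium/large positive. The text does not state the exact cut points, so the rule used here is an assumption: positive change scores are divided into thirds using cut points estimated from a development subset. The function names and the toy scores are hypothetical.

```python
import statistics

def positive_change_cutoffs(change_scores):
    """Tertile cut points among positive change scores (assumed rule)."""
    positives = [c for c in change_scores if c > 0]
    low, high = statistics.quantiles(positives, n=3)
    return low, high

def categorize_change(change, cutoffs):
    """Map one admission-to-discharge change score to a quintile-style label."""
    low, high = cutoffs
    if change < 0:
        return "negative change"
    if change == 0:
        return "zero change"
    if change <= low:
        return "small positive change"
    if change <= high:
        return "medium positive change"
    return "large positive change"

# Toy change scores on the 0-100 co-calibrated scale (hypothetical values).
development_changes = [-5, 0, 0, 2, 4, 5, 8, 10, 12, 15, 20, 30]
cutoffs = positive_change_cutoffs(development_changes)
for change in (-3, 0, 1, 1000):
    print(change, categorize_change(change, cutoffs))
```

Estimating the cut points on one subset and applying them unchanged to another mirrors the development/testing design, although the authors' actual cut points are those reported in S4 Table.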
As with Method I, Self-Care and Mobility of each assessment had the same number of categories due to the nature of the percentile method. Using c-statistics, quartiles were selected for stroke and lower extremity joint replacement; quintiles were selected for hip and femur fracture. Quintiles were found inapplicable for stroke and lower extremity joint replacement because the same functional score fell into more than one category. S2 Fig demonstrates an example of using Method II to generate functional categories of IRF-PAI Self-Care in stroke. The same procedure was repeated for MDS and OASIS across impairment conditions. Detailed categories are provided in S4 Table.
Method III: Functional staging. Fig 1 provides the detailed procedures demonstrating how we generated functional staging categories for IRF-PAI Self-Care in stroke. We generated a person separation index (Gp) and calculated person strata, to statistically distinguish different ability levels, using the Rasch person strata formula (4 × Gp + 1)/3 [26][27][28][29][30][31]. We followed this existing formula to calculate the number of person strata for each assessment by impairment condition [26][27][28][29][30][31][32][33][34][35][36]. Person strata are a norm-referenced concept that uses the distribution of person measures, centered on the mean of the person distribution. Strata must be separated by at least three measurement errors to be statistically distinct [26][27][28][29][30][31]. We then identified the corresponding cutoff raw score from the 0-100 scale co-calibration table [25].
Using the development subset, for stroke, we generated four categories for Self-Care and three categories for Mobility for all three instruments. For lower extremity joint replacement, we generated three Self-Care and two Mobility categories for IRF-PAI and OASIS; and three Self-Care and three Mobility categories for MDS. For hip and femur fracture, we generated three Self-Care and two Mobility categories for IRF-PAI; two Self-Care and three Mobility categories for MDS; and four Self-Care and three Mobility categories for OASIS (S5 Table).
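The person strata formula (4 × Gp + 1)/3 used in Method III can be evaluated directly. In the sketch below, the Gp values are hypothetical, and rounding the result down to a whole number of categories is an assumption made for illustration; the text does not state how fractional strata were handled.

```python
import math

def person_strata(gp):
    """Rasch person strata from the person separation index Gp: (4 * Gp + 1) / 3."""
    return (4 * gp + 1) / 3

def n_categories(gp):
    """Whole number of statistically distinct levels (rounding down is an assumption)."""
    return math.floor(person_strata(gp))

# Hypothetical separation indices, not values from the study.
for gp in (1.25, 2.0, 2.75):
    print(gp, person_strata(gp), n_categories(gp))
# 1.25 2.0 2
# 2.0 3.0 3
# 2.75 4.0 4
```

Under this formula a higher separation index supports more distinct functional levels, which is consistent with the number of categories differing by domain, assessment and impairment condition.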

Model comparisons
Six indices were used to compare the outcome prediction of the three methods.
C-statistics/Area under the Curve (AUC). The c-statistic measures the discrimination ability of the model. We compared logistic model discrimination using c-statistics with asymptotic 95% confidence intervals. The c-statistic is also known as the AUC, the area under the receiver operating characteristic curve. The AUC is the most commonly used measure of model performance for binary outcomes, with higher values indicating better discrimination [37][38][39][40][41][42].
Somers' Delta (Somers' D). Somers' D is a nonparametric measure of the strength and direction of the association between an ordinal dependent variable and an ordinal independent variable. Somers' D assumes a monotonic relationship between the independent and the outcome variables. A higher Somers' D indicates better model fit [43].
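For a binary outcome, both the c-statistic and Somers' D can be obtained by comparing every event with every non-event. The sketch below is a plain pairwise implementation on toy data; the function name and the example values are hypothetical, and a production analysis would use a statistical package instead.

```python
def concordance(y_true, y_prob):
    """Return (c-statistic, Somers' D) from all event/non-event pairs."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    nonevents = [p for y, p in zip(y_true, y_prob) if y == 0]
    concordant = discordant = tied = 0
    for p_event in events:
        for p_nonevent in nonevents:
            if p_event > p_nonevent:
                concordant += 1
            elif p_event < p_nonevent:
                discordant += 1
            else:
                tied += 1
    pairs = len(events) * len(nonevents)
    c_statistic = (concordant + 0.5 * tied) / pairs
    somers_d = (concordant - discordant) / pairs
    return c_statistic, somers_d

# Toy data: 1 = readmitted, 0 = not readmitted; probabilities are made up.
y = [1, 1, 0, 0, 0]
p = [0.8, 0.4, 0.5, 0.3, 0.4]
c, d = concordance(y, p)
print(c, d)  # 0.75 0.5
```

With ties counted as half, the two indices are linked by D = 2c − 1, so for a binary outcome they rank competing models identically.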

Akaike information criterion (AIC)/Bayesian information criterion (BIC).
Both AIC and BIC [44] evaluate goodness-of-fit (model fit) using the log-likelihood function and penalize an excessive number of estimated parameters. AIC and BIC thus provide a criterion for balancing model fit against model parsimony [45,46]. Lower AIC/BIC values indicate better model fit [44,45].
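The standard definitions are AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L, where k is the number of estimated parameters, n the number of observations and ln L the maximized log-likelihood. The sketch below applies them to two hypothetical models; the log-likelihoods, parameter counts and sample size are invented for illustration and are not results from this study.

```python
import math

def log_likelihood(y_true, y_prob):
    """Bernoulli log-likelihood of predicted probabilities for a binary outcome."""
    return sum(
        math.log(p) if y == 1 else math.log(1 - p)
        for y, p in zip(y_true, y_prob)
    )

def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_lik

def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L."""
    return n_params * math.log(n_obs) - 2 * log_lik

# Hypothetical reference model and a candidate with three extra parameters.
ll_reference, ll_candidate = -1250.0, -1235.0
print(aic(ll_reference, 20), aic(ll_candidate, 23))  # 2540.0 2516.0
print(bic(ll_reference, 20, 5000) > bic(ll_candidate, 23, 5000))
```

Because ln(n) exceeds 2 once n is larger than about 7, BIC penalizes additional category indicators more heavily than AIC in samples of this size.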
Integrated Discrimination Improvement (IDI). The IDI indicates the difference in discrimination slopes between two models. The IDI measures whether the new model improves the average sensitivity without sacrificing its average specificity [47]. Higher (positive) values of IDI indicate that the new model is better than the reference model.
Net Reclassification Improvement (NRI). The NRI is a reclassification measure using reclassification tables constructed separately for respondents with and without events (i.e., outcome occurs or not) between two models [48]. Higher (positive) values of NRI (percent) indicate reclassification by the new model had higher sensitivity compared to the reference model.
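The two reclassification indices can be written out in a few lines. The sketch below uses the usual definitions: IDI as the difference in discrimination slopes, and a categorical NRI as net upward movement among events plus net downward movement among non-events. The risk cutoffs (10% and 20%) and the toy probabilities are hypothetical and are not taken from the study.

```python
from bisect import bisect_right
from statistics import mean

def idi(y_true, p_ref, p_new):
    """Difference in discrimination slopes (new model minus reference model)."""
    events = [i for i, y in enumerate(y_true) if y == 1]
    nonevents = [i for i, y in enumerate(y_true) if y == 0]
    slope_new = mean([p_new[i] for i in events]) - mean([p_new[i] for i in nonevents])
    slope_ref = mean([p_ref[i] for i in events]) - mean([p_ref[i] for i in nonevents])
    return slope_new - slope_ref

def nri(y_true, p_ref, p_new, cutoffs):
    """Categorical NRI using risk categories defined by sorted cutoffs."""
    up_e = down_e = n_e = up_n = down_n = n_n = 0
    for y, ref, new in zip(y_true, p_ref, p_new):
        # Positive move = reclassified into a higher risk category.
        move = bisect_right(cutoffs, new) - bisect_right(cutoffs, ref)
        if y == 1:
            n_e += 1
            up_e += move > 0
            down_e += move < 0
        else:
            n_n += 1
            up_n += move > 0
            down_n += move < 0
    return (up_e - down_e) / n_e + (down_n - up_n) / n_n

# Toy data: two events followed by two non-events.
y = [1, 1, 0, 0]
p_reference = [0.15, 0.30, 0.25, 0.05]
p_candidate = [0.25, 0.30, 0.15, 0.05]
print(round(idi(y, p_reference, p_candidate), 3))    # 0.1
print(nri(y, p_reference, p_candidate, [0.10, 0.20]))  # 1.0
```

In this toy example one of two events moves up a category and one of two non-events moves down, giving the maximum improvement contribution of 0.5 from each group.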

Statistical analyses
We stratified all analyses by impairment condition for both development and testing subsets. First, we constructed a baseline logistic regression model which included sociodemographic variables (age, sex, race/ethnicity, disability entitlement and Medicare-Medicaid dual eligibility), health status (Hierarchical Condition Category composite score, Elixhauser comorbidity categories, condition-specific severity, hospital length of stay, intensive care days and coronary care days) and post-acute length of stay. Then, we added each of the three types of functional status categories to the baseline logistic regression model. We used the baseline model to (a) ensure a fair comparison of the functional status categories from the three methods, and (b) examine the magnitude of change in outcome prediction from adding functional status variables. The predictive models with the three methods of generating functional status categories were examined by AUC, Somers' D, AIC, BIC, IDI and NRI using the testing sample. To validate the stability of the estimates, a bootstrap procedure with 1000 re-samples was used to statistically compare c-statistics of the three methods using the testing sample. Bootstrapping the c-statistic is a standardized way to compare models. The three methods were then compared using paired t-tests if a significant difference existed among methods. We used SAS version 9.4 (SAS Institute, Inc., Cary, NC) to perform all analyses.
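The bootstrap comparison of c-statistics can be sketched as follows, assuming two sets of predicted probabilities for the same testing cases. This is an illustration rather than the authors' SAS procedure: the data are synthetic, the function names are hypothetical, and the bootstrap differences are summarized with a percentile interval instead of the paired t-test described above.

```python
import random

def auc(y_true, y_prob):
    """c-statistic: share of event/non-event pairs ranked correctly (ties count half)."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    nonevents = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in nonevents)
    return wins / (len(events) * len(nonevents))

def bootstrap_auc_difference(y_true, p_a, p_b, n_resamples=1000, seed=0):
    """Paired bootstrap of AUC(model A) - AUC(model B) on the same cases."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack events or non-events
            diffs.append(auc(ys, [p_a[i] for i in idx]) - auc(ys, [p_b[i] for i in idx]))
    return diffs

# Synthetic testing sample: 20 events, 40 non-events (not study data).
rng = random.Random(1)
y = [1] * 20 + [0] * 40
p_model_a = [0.3 * yi + 0.7 * rng.random() for yi in y]  # partly informative
p_model_b = [rng.random() for _ in y]                    # uninformative
diffs = sorted(bootstrap_auc_difference(y, p_model_a, p_model_b))
lower, upper = diffs[25], diffs[974]  # approximate 95% percentile interval
print(sum(diffs) / len(diffs), lower, upper)
```

Resampling case indices once per replicate and scoring both models on the same resample keeps the comparison paired, which is what allows a within-replicate difference to be formed.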

Bootstrapping
In both 30-day and 90-day hospital readmission models, the results of the bootstrapping using testing sample showed that the AUC for Method III was the highest compared with both Method I and Method II for the three impairment conditions (all paired-comparisons, p<.001).

Clinical application
We provide functional status categories generated from Methods I-III (S3-S5 Tables). We also provide the estimated risk of 30-day and 90-day hospital readmission for combinations of self-care and mobility categories based on Method III functional staging (Fig 2). For example, among patients with stroke who had a self-care score between 6 and 11, those with a mobility score between 5 and 15 have an estimated 22.8% probability of 30-day readmission and 33.4% probability of 90-day readmission (Fig 2).

Discussion
Generating meaningful categories allows for functional status comparisons and optimal outcome prediction across post-acute settings. This study compared three functional category methods and found that the functional staging approach (Method III) generated the relatively optimal prediction for 30-day and 90-day hospital readmission. While the study findings imply that using the functional staging approach can be relatively optimal for outcome prediction, it is unclear whether this improvement also yields categories that are more clinically meaningful. To facilitate clinical and scientific use, we suggest the most appropriate method to categorize functional status should be based on the strengths and weaknesses of each approach. For example, Method I may have the advantage of convenience (quick to calculate), Method II may have the advantage when reporting functional change and Method III may have the advantage in outcome prediction (i.e., hospital readmission). The choice of method requires careful judgment and a balance among available resources, time demands and study purpose. This study provides preliminary data to guide future healthcare policy reforms (e.g., bundled payment) when classifying patients' self-care and mobility function. We also generated tables of functional categories based on the three methods and plots of function-based readmission risks using functional staging for clinicians and researchers to use. Policymakers are beginning to explore the impact of functional status on classification systems in post-acute risk-adjusted capitation payments [15,49,50]. Researchers and the Medicare Payment Advisory Commission reported that adding functional status improved prediction of resource use and cost of care [15][16][49][50][51]. Categorizing patients into clusters would be clinically and administratively useful (e.g., patients in the same cluster may experience comparable care costs or require similar resources).
By its nature, functional staging is hierarchical and thus may provide gradients of functional recovery (or loss) that can help case-mix adjustment in services use and outcome comparisons, aiding in care provision, resource allocation decisions and eventually quality of care evaluation.
We acknowledge that patients with varying clinical characteristics and disease severity may benefit differently from various levels of care provided at different types of post-acute settings. However, recent healthcare reform proposals emphasize the need for a unified prospective payment system for post-acute settings [14]. Thus, comparisons of effectiveness and efficiency of care for patients with similar case-mix demographics across post-acute settings are imminent and inevitable. Identifying standardized and consistent approaches to measure functional status across post-acute settings could inform future policy decisions and improve quality of patient care after hospitalization. Based on the Improving Medicare Post-Acute Care Transformation Act of 2014, Centers for Medicare and Medicaid Services Section GG data elements were implemented to collect unified functional data across PAC settings [52,53]. While Section GG data elements would potentially resolve functional assessment issues related to uniformity across PAC settings, using a standardized functional categorization method based on co-calibrated functional scores provides firsthand comparisons of functional status across PAC settings. This study serves as a basis for developing hierarchical functional categorizations from Section GG data elements across settings in the future.
The study findings also indicate that generating more categories is not associated with better outcome prediction. Our results support the notion that the number of functional status categories varies by impairment condition, and that using distinct functional levels may be more appropriate than arbitrary percentile cutoff criteria, where a predefined fixed number of distribution-based categories dictates the categorization. Functional staging considers hierarchical functional levels; thus, this empirical approach can classify patients into distinct functional levels.
Current evidence regarding the advantages and limitations of different functional category methods remains unclear and largely unexplored. In the emerging environment of value-based care and precision medicine, it is reasonable to ask: are percentile proportional distribution and change score too insensitive to provide accurate functional categories necessary to assess and predict quality outcomes? If the answer is yes, then what are the appropriate approaches? Our study and findings address this question and provide a potential solution for improving rigor in comparative effectiveness studies across post-acute settings.
Ongoing demonstration projects of uniform functional assessment, episode-based payment models, and unified payment system across post-acute settings signify the growing need to conduct rigorous post-acute health services and health policy research. This study is the first we are aware of to examine the impact of quality measures based on different categorization methods of functional status. Future study should examine whether different categorization methods of functional status are associated with different provision of care services. It is also important to explore other variables in addition to functional status to optimize outcome prediction accuracy for individual patients. In addition, future study should validate whether our finding can be applied to other quality outcomes, such as successful community discharge for Medicare beneficiaries.

Study limitations
This study has limitations related to using Medicare files [54]. For example, our findings may not be applicable to persons younger than 66 years or those enrolled in insurance plans other than Fee-For-Service. In addition, this study focused on the physical aspects of functional status, while cognitive function is an essential element of functional performance. We suggest that future studies of this kind include cognitive function items. We are aware of the importance of stability of functional staging for both clinical application and policy decision-making, and recognize that co-calibration methodologies may introduce conversion measurement errors. We are also aware that categorization may introduce discontinuity at the boundaries of cut-off scores, and thus limit statistical power and precision and obscure the 'functionality' of individual differences. Future studies also need to identify whether the improvement from the functional staging approach is clinically meaningful compared to alternative methods. We also suggest that future studies investigate whether different clinically meaningful change levels can and/or should be included within each category, or whether items should be weighted to enhance accuracy for both clinical utility and policy decision-making.

Conclusions
Current measures and methods examining functional status across post-acute settings vary significantly. To compare effectiveness and quality of care across post-acute settings, identifying an optimal functional category method is imperative. While our study found that the functional staging approach generated functional categories that best predicted both 30-day and 90-day hospital readmission, we are uncertain whether the functional staging approach can provide clinically meaningful improvement compared to alternative methods. We suggest clinicians, researchers and policy makers exercise their best judgment to balance the strengths and weaknesses of each method when categorizing functional status. Additional research is needed to better understand the advantages and the limitations of using functional staging categories to assess and predict other important national quality measures across post-acute settings.