Exploring the use of machine learning for risk adjustment: A comparison of standard and penalized linear regression models in predicting health care costs in older adults

Background
Payers and providers still primarily use ordinary least squares (OLS) to estimate expected economic and clinical outcomes for risk adjustment purposes. Penalized linear regression represents a practical and incremental step forward that provides transparency and interpretability within the familiar regression framework. This study conducted an in-depth comparison of prediction performance of standard and penalized linear regression in predicting future health care costs in older adults.

Methods and findings
This retrospective cohort study included 81,106 Medicare Advantage patients with 5 years of continuous medical and pharmacy insurance from 2009 to 2013. Total health care costs in 2013 were predicted with comorbidity indicators from 2009 to 2012. Using 2012 predictors only, OLS performed poorly (e.g., R² = 16.3%) compared to penalized linear regression models (R² ranging from 16.8 to 16.9%); using 2009–2012 predictors, the gap in prediction performance increased (R²: 15.0% versus 18.0–18.2%). OLS with a reduced set of predictors selected by lasso showed improved performance (R² = 16.6% with 2012 predictors, 17.4% with 2009–2012 predictors) relative to OLS without variable selection but still lagged behind the prediction performance of penalized regression. Lasso regression consistently generated prediction ratios closer to 1 across different levels of predicted risk compared to other models.

Conclusions
This study demonstrated the advantages of using transparent and easy-to-interpret penalized linear regression for predicting future health care costs in older adults relative to standard linear regression. Penalized regression showed better performance than OLS in predicting health care costs. Applying penalized regression to longitudinal data increased prediction accuracy. Lasso regression in particular showed superior prediction ratios across low and high levels of predicted risk. Health care insurers, providers and policy makers may benefit from adopting penalized regression such as lasso regression for cost prediction to improve risk adjustment and population health management and thus better address the underlying needs and risk of the populations they serve.


Introduction
Risk adjustment models are applied by payers and health care delivery organizations to adjust for differences in patient characteristics when estimating expected health care resource use, clinical outcomes, and quality of care. Commonly used predictors in risk adjustment models include demographic information and clinical variables. The dominant type of risk adjustment model in practice is standard linear regression based on ordinary least squares (OLS) [1]. For example, the HHS-Hierarchical Condition Categories (HHS-HCC) model, a risk adjustment model adopted for health plans participating in the Affordable Care Act, uses standard linear regression with age, gender, diagnoses and interactions between diagnoses to predict medical expenditure risk [2].
An emerging literature has begun to explore the potential application of machine learning methods to predict health care costs and utilization for risk adjustment purposes [3][4][5][6]. These studies compared a variety of machine learning techniques for risk adjustment including penalized regression, random forests, multivariate adaptive regression splines, boosted regression trees, neural network, and super learner. Early success has demonstrated the potential value of machine learning regression and classification methods for predicting costs and utilization. With new data sources becoming available for population health management [7][8][9], machine learning methods will become increasingly useful to process and analyze increasingly complex population-level health data.
However, despite the potential value of advanced machine learning approaches to predicting risk, payers and providers still rely heavily on OLS regression to risk adjust and manage their patient populations. The slow adoption of advanced machine learning techniques can be partly explained by the unfamiliarity of risk stratification analysts with such techniques and by the complex interpretation and integration of results needed in practice. One approach to moving the needle toward machine learning adoption in risk adjustment practice is to introduce incremental, effective and transparent machine learning regression models that stay within the framework of standard linear regression and also perform as well as some more sophisticated but less transparent machine learning techniques [3]. This study concentrated on penalized linear regression models including lasso (least absolute shrinkage and selection operator) [10], ridge [11] and elastic net [12] and conducted a thorough comparison of penalized regression with standard linear regression in predicting total health care costs, which was not previously reported in published literature. We focused on older adults (≥65 years old) as they incur disproportionately more health care spending [13].
Multiple factors make penalized linear regression a viable potential next step beyond OLS for risk prediction and adjustment. First, transparency of a risk adjustment model is paramount for care management and resource allocation. Penalized linear regression provides almost the same level of transparency and interpretability as standard linear regression. Some machine learning techniques such as random forests and neural networks are hard to estimate and difficult to interpret, and yet they do not offer better prediction compared to penalized regression in predicting health care costs [3]. Second, although standard linear regression remains the most popular risk adjustment approach, penalized linear regression can be scaled and deployed just as easily in environments with limited computational power and thus represents a pragmatic step forward for risk adjustment. Third, penalized regression such as lasso regression selects and retains important variables for prediction. Providers often have incentives to increase the intensity of coding medical services (a practice referred to as "upcoding"), especially those included in a risk adjustment model, in order to maximize reimbursement [14]. Carefully selecting predictors for a risk adjustment model with clinical insights and statistical criteria may curtail the opportunity for upcoding. As an example, HCC models accomplished this by creating a hierarchy of grouped conditions based only on a subset of all available diagnosis codes [2]. In addition, keeping only important variables in a model may facilitate care management as it is easier for care managers to target key risk factors.
The study also assessed the value of penalized regression in generating more parsimonious models as well as using additional predictors collected over a longer period of time. We tested parsimonious OLS models by including only important predictors selected by lasso regression. OLS provides unbiased estimates when specified correctly whereas penalized regression sacrifices unbiasedness for a potential reduction of expected prediction error. Variable selection may reduce the number of irrelevant predictors included in a model and thus increase efficiency and reduce the chance of overfitting. We also compared predictive model performance using baseline predictors from 1 year versus 4 years in the past.
The overall goal of this study was to assess the potential of penalized linear regression models for risk adjustment. Specifically, the study 1) compared standard linear regression with penalized linear regression in predicting future total health care costs in older adults, 2) compared standard linear regression using full and reduced sets of predictors selected by lasso regression, and 3) assessed the value of using longitudinal data from 4 years versus 1 year in the past as predictors.

Methods
This retrospective cohort study used IMS LifeLink Health Plan Claims Database [15], which is comprised of fully adjudicated and de-identified medical and pharmaceutical claims from health insurance plans. The database captures a geographically diverse sample of health plan enrollees in the U.S. Charges, allowed and paid amounts are available for all services rendered, as well as date of service for all claims. The database is fully compliant with the Health Insurance Portability and Accountability Act (HIPAA). The Institutional Review Board at the Johns Hopkins Bloomberg School of Public Health reviewed the study proposal and determined that the human subjects research activity described in the application meets the criteria for Exemption under 45 CFR 46.101(b), Category (4). It approved proposed use of an existing limited data set from commercial health plan claims in the U.S. (IRB No: 00008699). Patients were selected from a large health plan with longitudinal patient records. Patients were required to have 5 years of continuous medical and pharmacy insurance benefits from 2009 to 2013 and be at least 65 years old at the end of 2012. Although they were all Medicare Advantage enrollees, the selected patients were not nationally representative of Medicare Advantage enrollees.
Total health care costs in 2013 were the target outcome for all predictive models. Predictors were extracted from data prior to 2013. Previous diseases and symptoms as indicated by recorded medical diagnoses and pharmacy claims were included as predictors. The Johns Hopkins Adjusted Clinical Groups (ACG) System version 11.0 [16] was applied to medical and pharmacy claims to generate binary comorbidity indicators by grouping International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM) diagnosis codes from inpatient and outpatient claims and National Drug Codes (NDCs) from pharmacy claims. Only diagnoses made by a physician (excluding labs, imaging, and other provisional diagnoses) were included for grouping. The high-level "rolled-up" comorbidity groups have up to 282 diagnosis-based conditions called Expanded Diagnosis Clusters (EDCs) and up to 67 pharmacy-based conditions called Rx-defined Morbidity Groups (RxMGs). EDC and RxMG grouping algorithms were created by clinicians based on clinical judgement and cover a large aggregate set of comorbidities. RxMGs represent conditions treated with medications and do not completely overlap with EDCs which are based solely on diagnosis codes. Comorbidities with zero prevalence were excluded. In addition to the yearly comorbidity indicators, age (at the end of 2012), age squared and sex were included as predictors in all predictive models. To compare predictive model performance using information from baseline periods of different length, yearly EDC and RxMG comorbidity indicators were first extracted from medical and pharmacy claims in 2012 for 1-year prospective prediction models, and then 4 sets of the same yearly indicators were extracted in each of the 4 years from 2009 to 2012 for longitudinal prediction models.
The primary difference between standard and penalized regression is that penalized regression adds a regularization term in a least squares loss function before it is optimized to estimate coefficients. Lasso regression adds the sum of absolute values of coefficient estimates as the regularization term (i.e., L1 regularization) whereas ridge regression adds the sum of squares of coefficient estimates as the regularization term (i.e., L2 regularization). Elastic net adds a weighted average of L1 and L2. One unique feature of lasso regression is that it selects predictors simultaneously with model estimation. We compared standard linear regression with penalized linear regression with lasso (α = 1), ridge (α = 0), and elastic net (0 < α < 1) regularization as defined by $\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1$ (β is a vector of coefficients). We tested elastic net regularization with α ranging from 0.1 to 0.9 with an interval of 0.1. The regularization term is multiplied by a model hyperparameter called lambda that determines the total amount of regularization when added to the least squares loss function. This study used cross-validation to find the optimal value of lambda that achieved minimum cross-validation mean standard error [17]. In addition, we tested two parsimonious OLS models with 2012 and 2009-2012 predictors, including only predictors selected by lasso regression. The OLS regression predicting 2013 costs with the full set of 2012 predictors represented the standard base case model for comparison purposes.
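The penalty defined above can be illustrated with a minimal Python sketch. This is not the study's code, which was written in R with glmnet; the function names, the toy data, and the objective scaling of 1/(2n) times the residual sum of squares are assumptions made for illustration. The sketch fits the penalized model by cyclic coordinate descent on centered data and contrasts OLS (lambda = 0), lasso (α = 1) and ridge (α = 0) on a small set of binary indicators.

```python
from itertools import product


def soft_threshold(z, gamma):
    """Soft-thresholding operator; the source of exact zeros under L1."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def elastic_net_penalty(beta, alpha):
    """(1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1."""
    l1 = sum(abs(b) for b in beta)
    l2_squared = sum(b * b for b in beta)
    return (1.0 - alpha) / 2.0 * l2_squared + alpha * l1


def fit_elastic_net(X, y, lam, alpha, n_iter=200):
    """Minimize RSS/(2n) + lam * penalty by cyclic coordinate descent.

    X is a list of rows. Columns are centered so the intercept is not
    penalized. Columns with zero variance must be removed beforehand
    (the study likewise excluded comorbidities with zero prevalence).
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    resid = [yi - y_mean for yi in y]  # residuals when all beta are 0
    col_ss = [sum(Xc[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Covariance of predictor j with its partial residual.
            rho = sum(Xc[i][j] * resid[i] for i in range(n)) / n
            rho += col_ss[j] * beta[j]
            new_b = soft_threshold(rho, lam * alpha)
            new_b /= col_ss[j] + lam * (1.0 - alpha)
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= Xc[i][j] * delta
                beta[j] = new_b
    intercept = y_mean - sum(x_mean[j] * beta[j] for j in range(p))
    return intercept, beta


# Toy data (illustrative only): three binary indicators, the third irrelevant.
X = [list(row) for row in product([0, 1], repeat=3)]
y = [10.0 + 6.0 * r[0] + 0.5 * r[1] for r in X]

ols_a, ols_b = fit_elastic_net(X, y, lam=0.0, alpha=1.0)
lasso_a, lasso_b = fit_elastic_net(X, y, lam=0.2, alpha=1.0)
ridge_a, ridge_b = fit_elastic_net(X, y, lam=0.25, alpha=0.0)
print("OLS  :", ols_b)
print("lasso:", lasso_b)
print("ridge:", ridge_b)
```

In this toy example lasso shrinks the strong coefficient and sets the weak and irrelevant ones exactly to zero, whereas ridge shrinks the nonzero coefficients without removing them, which mirrors the distinction between variable selection and pure shrinkage described above.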
The entire study sample was split into training (75%) and test (25%) sets. All model development and validation was conducted in the training set. OLS was estimated in the training set directly as no model tuning is needed. Penalized linear regression was tuned using 10-fold cross-validation in the training set. Tuned penalized regression models were re-estimated using the entire training dataset. Predictive performance of final estimated models was assessed in the test set by: (1) R squared (R²), representing the percent of total variation of actual costs explained by a model (a higher percent indicates better performance), (2) root mean squared error (RMSE): square root of mean squared differences between predicted and actual costs (a smaller value indicates better performance), (3) mean absolute prediction error (MAPE): mean absolute value of differences between predicted and actual costs (a smaller value indicates better performance), and (4) prediction ratio (PR): sum of predicted costs divided by sum of actual costs (a value closer to 1 indicates better performance). Model performance was assessed in the entire test set as well as within each of the 10 deciles of predicted costs in the test set. All programming was performed in R version 3.4.2 [18] with glmnet version 2.0-16 [19]. R code can be found at https://github.com/hkan2018/risk_adjustment_with_penalized_regression.
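The four performance measures listed above can be sketched in a few lines of Python. The function names and the toy cost values are hypothetical, and R² is computed here as one minus the ratio of residual to total sum of squares, which is an assumption about the exact definition used in the study. The grouping function sorts observations by predicted cost and computes the prediction ratio within equal-sized groups (10 for deciles).

```python
import math


def r_squared(actual, predicted):
    """Share of variation in actual costs explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mape(actual, predicted):
    """Mean absolute prediction error (in cost units, not percent)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def prediction_ratio(actual, predicted):
    """Sum of predicted costs divided by sum of actual costs."""
    return sum(predicted) / sum(actual)


def prediction_ratio_by_group(actual, predicted, n_groups=10):
    """Prediction ratio within groups ordered by predicted cost.

    Requires at least n_groups observations so that no group is empty.
    """
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    n = len(order)
    ratios = []
    for g in range(n_groups):
        idx = order[g * n // n_groups:(g + 1) * n // n_groups]
        ratios.append(prediction_ratio([actual[i] for i in idx],
                                       [predicted[i] for i in idx]))
    return ratios


# Toy values (illustrative only, not study data).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [150.0, 210.0, 290.0, 350.0]

overall_pr = prediction_ratio(actual, predicted)
group_pr = prediction_ratio_by_group(actual, predicted, n_groups=2)
print(r_squared(actual, predicted), rmse(actual, predicted),
      mape(actual, predicted), overall_pr, group_pr)
```

In the toy values the overall prediction ratio equals 1 even though the low-risk group is over-predicted and the high-risk group is under-predicted, which is why assessing the ratio within groups of predicted cost is informative in addition to the overall value.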

Results
A total of 81,106 patients met the selection criteria with 60,737 split to a training set and 20,369 to a test set. In the entire study sample, mean (standard deviation (SD)) age was 73.8 (6.7) years old and 50.8% were females. Mean total health care costs (SD) in 2013 was $16,509 (41,376). Proportion of patients with a specific EDC (n = 277) or RxMG (n = 67) in 2012 in the training set can be found in S1 Table.
Table 2 shows model performance within each of the 10 deciles of predicted costs for OLS (with the full and reduced sets of predictors), ridge, and lasso regression using 2012 predictors. Among the 4 models, lasso regression showed prediction ratios consistently close to 1 across all the 10 deciles of predicted costs (e.g., PR of 0.979 in decile 1 and 1.019 in decile 10). OLS with the full set of predictors under-predicted costs in low predicted risk deciles and over-predicted costs in high predicted risk deciles (e.g., PR of 0.433 in decile 1 and 1.073 in decile 10). Although the parsimonious OLS and ridge regression improved on prediction ratio compared to OLS with the full set of predictors, both models showed inferior prediction ratios in low and high ends of predicted costs compared to lasso regression (e.g., PR of 0.539 in the parsimonious OLS and 0.754 in ridge regression compared to 0.979 in lasso regression in decile 1; PR of 1.075 in the parsimonious OLS and 1.022 in ridge regression compared to 1.019 in lasso regression in decile 10). Elastic net regression showed similar performance by deciles as lasso regression (see Table A in S2 Table).
The longitudinal predictive model included 1,387 predictors over the 4-year period from 2009 to 2012. Table 3 shows the same direction of performance gaps between standard and penalized linear regression with 4 years of predictors as shown by the models with 1 year of data, but the performance gaps enlarged as indicated by R², RMSE and MAPE. For example, the difference in R² between OLS with the full set of predictors (15.0%) and penalized regression models with 4 years of predictors (18.0-18.2%) was larger than between the models with 1 year of data (16.3% versus 16.8-16.9%). However, penalized regression with 4 years of data showed a slightly larger prediction ratio (1.004) compared to 1.002-1.003 in penalized regression with 1 year of data. Improved performance of penalized regression models with 4 years versus 1 year of predictors (R²: 18.0-18.2% versus 16.8-16.9%) indicates the value of longitudinal data for better prediction performance. However, this gain only occurred with penalized regression. OLS with full 2009-2012 predictors actually had worse performance (e.g., R² = 15.0%) than OLS with full 2012 predictors (R² = 16.3%). It is noteworthy that R² of OLS with 2009-2012 predictors assessed in the training set was 21.5% vs. 15.0% in the test set, indicating more serious overfitting. However, OLS with important predictors over 4 years selected by lasso performed better (e.g., R² = 17.4%) than OLS with full 2012 predictors (R² = 16.3%).
Out of the original 1,387 predictors over the 4-year period, lasso regression selected 276 important predictors, among which 46, 44, 65 and 119 comorbidity indicators came from 2009, 2010, 2011, and 2012, respectively, indicating that all of the 4 previous years of data contributed to prediction of 2013 health care costs with more recent years of comorbidities more likely being selected as important variables. Although the parsimonious OLS regression (e.g., R² = 17.4%) performed better than OLS with the full set of 2009-2012 variables (R² = 15.0%), it still fell short of the performance achieved by penalized regression (R²: 18.0-18.2%), indicating that variable selection for OLS was not enough to achieve the same level of prediction improvement displayed by penalized regression. Table 4 shows model performance by deciles of predicted costs with 4 years of predictors of the same 4 models (i.e., OLS with full and reduced sets of 2009-2012 predictors, ridge, and lasso regression). Comparing Table 2 and Table 4 shows more pronounced differences in prediction ratios between lasso and the other three models with 4 years of predictors. Prediction ratios of lasso regression were much closer to 1 across low and high levels of predicted costs compared to the other three models (e.g., PR of -0.177 in OLS with the full set of predictors [...]). Elastic net regression showed similar performance by deciles as lasso regression (see Table B in S2 Table).

Discussion
Payers and providers commonly use standard OLS linear regression for risk adjustment and population health management. Although machine learning methods in general have shown initial promising results, payers and providers have been slow in adopting unfamiliar complex methods with difficult-to-interpret results. However, they might be more amenable to techniques such as penalized linear regression with underlying machine learning fundamentals but a familiar and transparent regression framework. This study demonstrated important advantages of using penalized regression versus traditional standard OLS regression to predict future healthcare costs among older adults with demographic and comorbidity variables. Specifically, our findings showed that penalized linear regression outperformed OLS with full and reduced (selected by lasso) sets of predictors, based on R², RMSE, and MAPE, except for prediction ratio in which OLS showed a slight advantage. Although all penalized regression models performed similarly when evaluated in the entire test set, lasso regression consistently showed superior prediction ratios across high and low levels of predicted risk compared to ridge and OLS. Coefficient shrinkage and variable selection may have helped lasso to achieve better performance across the entire risk spectrum. Built-in variable selection of lasso regression may reduce overfitting as well as the number of irrelevant predictors included in the model. In addition, lasso regression generated a much smaller number of negative predicted costs with only 2 observations in the test set with negative predictions compared to 120 negative predictions by the OLS model (data not shown). Although elastic net regression showed similar performance as lasso within deciles of predicted risk, lasso regression may be preferable for its simpler interpretation with built-in variable selection. In contrast, OLS suffers from biased prediction as indicated by prediction ratio deviating from 1 in low and high risk patients. Alleviating group-level biased prediction is critical to a health plan or a clinical care organization that may enroll a biased population of patients with underlying risk skewed towards either the high or low end of risk spectrum.
This study also demonstrated better prediction of parsimonious OLS models with a smaller set of important comorbidity indicators selected by lasso regression than OLS with the full set of predictors. OLS using a full set of predictors without any variable selection may suffer from including irrelevant predictors leading to increased standard error of estimates [20] and/or overfitting. In practice, including only important predictors in a risk adjustment model can both reduce opportunities for upcoding and facilitate care management by allowing care managers to focus on patients with key risk factors.
This study also compared predictive performance of OLS versus penalized regression models against various temporal cuts of the data to simulate situations where "longer" health care data is available (e.g., Medicare data). Comorbidities from each of the past 4 years contributed to better prediction by penalized regression compared to using only 1 year of prior data, and this gain in performance with longitudinal data can only be harnessed by penalized regression as standard linear regression actually showed worse performance using 4 years of prior predictors. We also compared overall performance of OLS and lasso regression with 1, 2, 3, and 4 years of prior data and saw a clean trend that with an increasing number of years of prior data, OLS lost prediction power while lasso gained prediction power (data not shown). This further confirms the advantage of using penalized regression such as lasso regression to model longitudinal data. Both payer and provider organizations can utilize this advantage of penalized regressions to increase the utility of their longer historical data that they are accumulating over time.
Although OLS may produce unbiased estimates when specified correctly, in practice, we do not expect a risk adjustment model for health care costs to be correctly specified, meaning incorporating only relevant variables and relating them to the cost outcome with correct functional specification. This is because individuals are exposed to numerous factors related to biology, behavior, health care, social and physical environment that may impact their health and health care through numerous complex and interactive pathways. Thus, it is not advisable to use causal inference and unbiased estimates to guide model selection for risk adjustment models. In this case, techniques like penalized regression that accept some bias in model estimates for a reduction in variance can be appropriate for improving overall expected prediction error. A favorable bias-variance tradeoff was clearly demonstrated for penalized regression in this study. Although penalized regression models produced slightly increased bias as measured by a 1% to 3% increase in prediction ratios relative to that of OLS in the entire test set, overall, penalized regression clearly achieved better prediction performance than OLS with and without variable selection. Furthermore, penalized regression, especially lasso and elastic net regression, even considerably improved on prediction ratios across low and high levels of predicted risk compared to OLS.
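The trade-off described above can be made explicit with the standard decomposition of expected squared prediction error at a point x_0. The notation is assumed for illustration and does not appear in the study: costs are taken to follow y = f(x) + ε with mean-zero noise of variance σ², and \hat{f} denotes the fitted model.

```latex
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \sigma^2
  + \left(\mathbb{E}\left[\hat{f}(x_0)\right] - f(x_0)\right)^2
  + \operatorname{Var}\left(\hat{f}(x_0)\right)
```

The first term is irreducible noise, the second is squared bias and the third is variance. Under these assumptions, a penalized estimator can lower total expected error whenever the reduction in the variance term exceeds the increase in the squared bias term.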
Numerous machine learning techniques exist for regression in the supervised learning setting [17]. Although some machine learning methods such as super learner [3] and deep learning [21] may boost prediction accuracy, they are usually not easy to train nor to understand and interpret, and may require substantial computing power. A transparent modeling technique such as lasso regression is easier to train and scale and empirically demonstrated superior performance among all the other standard and penalized linear regression models tested in this study.
This study only used comorbidity indicators as predictors, derived from recorded diagnoses and filled prescription drugs, reflecting the information a primary care physician (PCP) may typically have access to. A PCP usually knows relatively well diseases and symptoms as well as prescription drugs of his/her patients. Even without complete information, lasso regression demonstrated that only a subset of comorbidities was important for predicting costs. In addition, despite the lack of comprehensive information on health care costs and utilization in EHR systems [22,23], EHRs provide unique data sources for risk stratification [24][25][26][27][28]. Thus, the findings of this study are potentially applicable to both provider and payer settings for practical risk adjustment applications [29].
Finally, although this study did not intend to develop a full risk adjustment model ready to use for payment purposes, it is still worth noting that estimating the impact of improved risk adjustment on actual outcomes such as adverse selection and overpayments to health plans is not as straightforward as it may first appear to be because of the need to consider endogenous response of payers to specific incentives created by a risk adjustment model [30]. For example, an increase in R² of a risk adjustment formula does not necessarily result in an increase or a decrease of government overpayments in the Medicare Advantage program [31]. An empirical investigation of the Hierarchical Condition Categories (HCCs) risk adjustment approach developed by the Centers for Medicare and Medicaid Services (CMS) found that the introduction of the more sophisticated risk adjustment did not alter favorable selection into Medicare Advantage [32]. But we expect that more accurate prediction of costs especially across different levels of risk as demonstrated by penalized regression such as lasso may reduce the room for possible adverse selection and thus make it more difficult to find ways to outmaneuver risk adjustment for financial gains. More research is needed to assess payment-specific issues including risk selection for future new risk adjustment models. Equally importantly, from the clinical care perspective, more accurate identification of patients with low and high future health care needs can help care management programs effectively target appropriate patients for interventions.
The study has a few limitations. The distribution of total health care costs is highly skewed with large outliers. We conducted a sensitivity analysis by assigning $134,074 (99th percentile of the distribution of total costs in 2013) to all cases with 2013 health care costs over that amount. The sensitivity analysis results did not alter the directions of our findings although the differences in model performance tended to be less pronounced. Although we used ICD-9-CM diagnosis codes in this study, EDCs derived from ICD-9-CM are consistent with those derived from ICD-10-CM. Thus, our study results are applicable to newer health care data with ICD-10-CM as well. This study did not test all regression techniques. However, we tested a generalized linear model with the log link and gamma distribution, which failed to show consistent advantages over standard linear regression. We also tested several more advanced machine learning techniques including random forests and neural networks and found no better overall performance than penalized regression. As the study used administrative claims from a particular large health plan in the IMS database, the results may not be generalizable to other health plans or to patients under 65 years old. The sample size of this study was limited. Further research is needed to confirm the findings in larger and more diverse samples and to further establish external validity using test data drawn from a different time period or from a different health plan. We also caution that the clear-cut favorable bias-variance trade-off of penalized regression observed in this study may change with a different outcome variable or even a different data source.
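The top-coding step used in the sensitivity analysis above can be sketched as follows. The function names and toy costs are hypothetical, and the nearest-rank percentile rule is an assumption; the study does not state which percentile definition was used, and other definitions interpolate between neighboring values.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    # Ceiling of pct * n / 100 using integer arithmetic.
    rank = max(1, -(-len(ordered) * pct // 100))
    return ordered[rank - 1]


def top_code(values, cap):
    """Replace every value above the cap with the cap itself."""
    return [min(v, cap) for v in values]


# Toy costs (illustrative only): 99 ordinary values and one extreme outlier.
costs = [float(c) for c in range(1, 100)] + [5000.0]
cap = percentile_nearest_rank(costs, 99)
capped_costs = top_code(costs, cap)
print(cap, max(costs), max(capped_costs))
```

Only observations above the cap are changed, so the ordering of all other costs is preserved while the influence of extreme outliers on squared-error measures is reduced.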
In conclusion, this study demonstrated the advantages of using transparent and easy-to-interpret penalized regression models for predicting future health care costs in older adults relative to standard linear regression. In particular, lasso regression showed better prediction performance across different levels of predicted risk. Such predictive analytic techniques, while incorporating underlying machine learning principles, still embody the familiar linear regression framework and provide transparency and interpretability with a gain in prediction performance. As digital data sources become ever more ubiquitous in the health care sector, it is imperative that advances in data science be considered and embraced as appropriate based on transparent and rigorous assessments. Health care insurers, providers and policy makers may benefit from adopting penalized regression such as lasso regression for cost prediction to improve risk adjustment and population health management and thus better address the underlying needs and risk of the populations they serve.
Supporting information S1