Using “Big Data” to Capture Overall Health Status: Properties and Predictive Value of a Claims-Based Health Risk Score

Background: Investigators across many fields often struggle with how best to capture an individual’s overall health status, with options including both subjective and objective measures. With the increasing availability of “big data,” researchers can now take advantage of novel metrics of health status. These predictive algorithms were initially developed to forecast and manage expenditures, yet they represent an underutilized tool that could contribute significantly to health research. In this paper, we describe the properties and possible applications of one such “health risk score,” the DxCG Intelligence tool.
Methods: We link claims and administrative datasets on a cohort of U.S. workers during the period 1996–2011 (N = 14,161). We examine the risk score’s association with incident diagnoses of five disease conditions, and we link employee data with the National Death Index to characterize its relationship with mortality. We review prior studies documenting the risk score’s association with other health and non-health outcomes, including healthcare utilization, early retirement, and occupational injury.
Results and Conclusions: We find that the risk score is associated with outcomes across a variety of health and non-health domains. These examples demonstrate the broad applicability of this tool in multiple fields of research and illustrate its utility as a measure of overall health status for epidemiologists and other health researchers.


Background
Researchers across many fields often struggle with how best to capture an individual's overall health status, with options including both subjective and objective measures. Simple self-report measures have proven to be surprisingly predictive of mortality, often more so than objective measures of health [1,2]. Yet for investigators who are unable to collect survey data due to the expense, or for those with access to only secondary data sources, such measures are not available for use.
With the increasing availability of "big data" sources in the form of linkable digitized claims and administrative records, epidemiologists and health researchers now have the opportunity to conduct studies using large longitudinal datasets, such as those from Medicare. Such claims data are increasingly used in academic research settings to determine outcomes such as health diagnoses and medication adherence [3,4]. A potential advantage of claims data is their ubiquity and relatively low cost, as they require little or no additional data collection. Yet the sheer volume of records and number of entries may pose a challenge for those seeking to condense an individual's chart into one marker of overall health status. In this context, the predictive algorithms, or "risk scores," developed in corporate settings are particularly valuable. These scores were initially used by actuaries and insurers to forecast health expenditures [5,6] and by the Centers for Medicare and Medicaid Services (CMS) to determine payments to health maintenance organizations [7]. Yet they have tremendous potential to be useful in population health, health economics, and other fields of research. Studies suggest that these algorithms are better at predicting health expenditures compared with simple measures of number of comorbidities or functional status [8,9]. Earlier objective measures relied on simply abstracting the seriousness or number of medical conditions from an individual's medical chart [2], while these risk scores employ more complex algorithms.
There are a handful of risk scores that have been adopted in a limited fashion by academic health researchers. The algorithm inputs differ in each, including the use of prescription drug claims data [10], inpatient and outpatient diagnostic codes [11,12], or some combination of these in addition to healthcare utilization data [13,14]. These risk scores have been used primarily in health services research, particularly in studies of the U.S. health insurance market [15-18] and to predict health expenditures [19,20].
Given the increasing availability of claims data and the limited predictive value of prior objective measures of health, risk scores represent an underutilized tool that could advance the sophistication of health research. In this paper, we describe the properties, predictive value, and possible applications of one such risk score to formally introduce this novel metric to the academic health research community. Our goal is to demonstrate that such risk scores are valuable objective markers of overall health status for health researchers with access to claims data. We show that this risk score is predictive of a range of diverse short-term and long-term health outcomes, including mortality, as well as several non-health outcomes, demonstrating its broad applicability in health research. By illustrating its associations with a wide variety of health outcomes, we demonstrate its utility as an objective marker of overall health status that can be used in future studies that employ claims data.
Competing Interests: MRC serves as a senior medical advisor to Alcoa, under the terms of a research contract between Stanford University and Alcoa. All other authors receive some percentage of their salary support from a grant from Alcoa. Verisk Health did not provide funding or other resources for this study. This does not alter the authors' adherence to PLOS ONE policies on sharing data and materials.
management tool for the development of clinical intervention and quality programs and as a method to forecast expenditures and utilization. This score is computed using an individual's age and gender, as well as Current Procedural Terminology (CPT) and International Classification of Diseases (ICD) codes and use of healthcare services from the previous year. These inputs are then used to predict an individual's health expenditures in the coming year.
The scores are standardized such that a score of 1 indicates that the individual's health expenditures are likely to fall at the mean in the following year, in a nationally representative population defined by Verisk. Each unit increase predicts a one-fold increase in expenditures above the mean; for example, a score of 2 predicts expenditures twice the mean. The specific inputs into the predictive model developed by Verisk are proprietary and not described in this manuscript.
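To make the standardization concrete, the short sketch below converts a risk score into the expenditures it implies relative to a reference mean. The function name and the dollar figure are hypothetical and chosen only for illustration; the actual reference mean and the model that produces the score are Verisk's and are not reproduced here.

```python
def implied_expenditure(risk_score: float, reference_mean: float) -> float:
    """Next-year expenditures implied by a standardized risk score.

    Assumes the linear reading given in the text: a score of 1 corresponds
    to the reference-population mean, and each additional unit adds one
    more multiple of that mean.
    """
    if risk_score < 0 or reference_mean < 0:
        raise ValueError("risk_score and reference_mean must be non-negative")
    return risk_score * reference_mean


# Hypothetical reference mean of 5,000 (currency units per year).
at_mean = implied_expenditure(1.0, 5000.0)      # score of 1: at the mean
twice_mean = implied_expenditure(2.0, 5000.0)   # score of 2: twice the mean
```

Under this reading, the score can be interpreted as a ratio to the reference mean rather than as an absolute amount.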

Study Sample
The sample in which we demonstrate the properties and predictive value of this risk score is a cohort of manufacturing workers at Alcoa, a large U.S. employer for whom we have complete claims data. This includes all individuals who were working at the firm on January 1, 1996 with at least one risk score calculated during the period 1996-2011 (N = 14,161). This longitudinal dataset contains repeated observations per person, ranging from 1 to 16 depending on when an individual drops out of the sample. This yields 151,931 risk score (person-year) observations during this time period, or an average of 10.7 years per person. Observations with missing values in a given year are omitted from the relevant analyses. By 2011, there are 5,962 individuals remaining in the sample. We link these data with other datasets, including personnel and administrative information provided by Alcoa (Table 1). While this sample is not representative of the U.S. population or the U.S. workforce, we selected these individuals because of the extensive data available for this population that enable us to conduct the analyses we present here, and because these employees are all covered by similar insurance plans with comprehensive benefits, so that findings will not be confounded by insurance status.

Health Conditions
We use the claims data to identify incident (i.e., new) cases of five disease conditions: diabetes, hypertension, asthma/chronic obstructive pulmonary disease (COPD), depression, and ischemic heart disease (IHD). For each of these conditions, individuals with one or more inpatient claims or two or more outpatient claims with a relevant ICD diagnosis code in a 365-day period are considered to have a new diagnosis of the disease in question. To rule out prevalent (i.e., pre-existing) cases, we require the individual to have no claims related to the diagnosis for the first two years of the study period. As our dataset includes claims data beginning on January 1, 1996, for each disease we exclude individuals with diagnoses in 1996-1997, such that the earliest possible date of diagnosis for a given disease is January 1, 1998. If the disease diagnosis is established based on two outpatient claims, the date of diagnosis is considered to be the date of the first claim. On the other hand, if an individual has two outpatient claims separated by more than 365 days early during the study period, and subsequently has two claims within a 365-day period later during the study period, the date of diagnosis will be based on the later claims, as the first two claims do not meet the criteria for diagnosis. This strategy, while imperfect, is similar to methods frequently used with claims data, and is unlikely to affect our study findings [4,22].
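The case definition above can be sketched as a small function. This is an illustrative reading of the rule, not the study's actual code: the claim representation, the function name, the treatment of any condition-related claim in 1996-1997 as a prevalent case, and the choice to count two outpatient claims exactly 365 days apart (or on the same day) as qualifying are all assumptions.

```python
from datetime import date, timedelta
from typing import Iterable, Optional, Tuple

# A claim for one condition: (service date, "inpatient" or "outpatient").
Claim = Tuple[date, str]

WASHOUT_END = date(1997, 12, 31)   # last day of the two-year washout
WINDOW = timedelta(days=365)       # assumed inclusive 365-day window


def incident_diagnosis_date(claims: Iterable[Claim]) -> Optional[date]:
    """Return the incident diagnosis date for one person and one condition.

    Returns None for prevalent cases (any claim during the washout) and for
    people who never meet the case definition.
    """
    ordered = sorted(claims)
    if any(day <= WASHOUT_END for day, _ in ordered):
        return None  # treated as a pre-existing case and excluded

    # A single inpatient claim is sufficient on its own.
    candidates = [day for day, setting in ordered if setting == "inpatient"]

    # Two outpatient claims within the window: date of the first of the pair.
    # Checking consecutive claims is enough, because any qualifying pair
    # implies a qualifying consecutive pair starting no later.
    outpatient = [day for day, setting in ordered if setting == "outpatient"]
    for first, second in zip(outpatient, outpatient[1:]):
        if second - first <= WINDOW:
            candidates.append(first)
            break

    return min(candidates) if candidates else None
```

For instance, outpatient claims in January 2000 and March 2001 are more than 365 days apart and do not qualify; if the same person then has outpatient claims in June and September 2005, the function returns the June 2005 date, matching the rule described in the text.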
To examine mortality, we link our dataset with the National Death Index to obtain the date of death for individuals in the sample who died (N = 1,155), including those who left Alcoa at any point during the study period.

Data Analysis
We conduct several analyses to illustrate the properties of the risk score, to examine its demographic correlates, and to demonstrate its relationship to a variety of health and non-health outcomes.
First, we present the risk score's overall distribution and examine individual-level intraclass correlation coefficients and year-to-year correlation. To examine the extent to which the risk score is correlated with age, race, and gender, we conduct multivariable linear regression with individual-level random effects, clustering robust standard errors at the individual level to account for interdependence of the observations. We also control for year to adjust for secular trends.
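As an illustration of what an individual-level intraclass correlation measures, the sketch below computes the one-way ANOVA estimator of the ICC from repeated scores per person, with the usual adjustment for unequal numbers of observations. This estimator is a stand-in chosen for transparency; the text does not say which estimator was used, and a random-effects model would typically give a similar but not identical value. The function name and data layout are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def icc_oneway(scores_by_person: Dict[str, List[float]]) -> float:
    """One-way ANOVA intraclass correlation for unbalanced repeated measures.

    Requires at least two people and at least one person with repeated
    observations, and some variation in the data.
    """
    groups = list(scores_by_person.values())
    k = len(groups)                       # number of people
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)                  # number of person-year observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Effective number of observations per person for unbalanced data.
    n0 = (n_total - sum(n * n for n in sizes) / n_total) / (k - 1)

    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)
```

A value near 1 means that most of the variance lies between people (each person's scores are similar from year to year), while a value near 0 means that most of the variance lies within people.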
We employ linear probability models with individual-level random effects to identify the degree to which an individual's risk score in a given year is associated with the probability of being newly diagnosed with a disease condition in the following year. For example, an individual's risk score in 2003 is used as the predictor variable for health outcomes in 2004; risk scores in 2011 are therefore not included in these analyses, as our dataset does not include health outcomes beyond 2011. As described above, these conditions include diabetes, hypertension, asthma/COPD, depression, and IHD. These analyses control for age, gender, race, and a dummy variable for each year to account for secular trends. Standard errors are clustered at the individual level to account for interdependence of the observations.
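The one-year lag described above can be sketched as a data-preparation step that pairs each year's risk score with the following year's outcome. The function and its layout are hypothetical, and it covers only the construction of person-year rows, not the regression itself. It also assumes that a person stops contributing rows once diagnosed, which the text does not state explicitly.

```python
from typing import Dict, List, Optional, Tuple

# One analysis row: (score year, risk score, new diagnosis in the next year).
Row = Tuple[int, float, int]


def lagged_rows(
    scores_by_year: Dict[int, float],
    diagnosis_year: Optional[int],
    last_outcome_year: int = 2011,
) -> List[Row]:
    """Pair each year's risk score with the next year's incident diagnosis."""
    rows: List[Row] = []
    for year in sorted(scores_by_year):
        outcome_year = year + 1
        if outcome_year > last_outcome_year:
            continue  # e.g., 2011 scores have no observed 2012 outcome
        if diagnosis_year is not None and outcome_year > diagnosis_year:
            continue  # assumed: no longer at risk after the diagnosis
        outcome = int(diagnosis_year == outcome_year)
        rows.append((year, scores_by_year[year], outcome))
    return rows
```

For a person with scores in 2003, 2004, 2005, and 2011 who is diagnosed in 2005, this yields two rows: the 2003 score with outcome 0 and the 2004 score with outcome 1.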
To assess the risk score's association with long-term disease outcomes, we perform a time-to-event analysis for each of the disease conditions. Using the personnel dataset, we identify the last date that each individual was active at the firm, after which that individual is censored. We present unadjusted Kaplan-Meier survival curves by risk score quintile. We then estimate Cox proportional hazards models, controlling for race, gender, and age group (20-30 years old, 30-40 years old, etc.). We estimate two sets of Cox proportional hazards models: (1) with risk score as a continuous variable, and (2) with risk score by quintile.
For mortality analyses, we present unadjusted Kaplan-Meier survival curves by risk score quintile in 1996. The National Death Index includes deaths through September 1, 2011, after which we censor surviving individuals. We estimate Cox proportional hazards models, controlling for race, gender, and age group. As above, we estimate two sets of Cox proportional hazards models: (1) with risk score as a continuous variable, and (2) with risk score by quintile.
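For readers less familiar with the time-to-event analyses described above, the following is a minimal sketch of the Kaplan-Meier product-limit estimator for a single group, such as one risk score quintile. It is a generic textbook implementation written for illustration, not the software used in this study, and the function name and input layout are assumptions.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(
    durations: Sequence[float], events: Sequence[int]
) -> List[Tuple[float, float]]:
    """Kaplan-Meier survival estimates at each distinct event time.

    durations: follow-up time for each person (to event or censoring).
    events: 1 if the event (diagnosis or death) occurred, 0 if censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / at_risk
            curve.append((t, survival))
        # Events and censored observations both leave the risk set after t.
        at_risk -= n_leaving
    return curve
```

Censored individuals, such as those who leave the firm or survive past the end of National Death Index coverage, contribute to the risk set up to their censoring time without lowering the survival estimate.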

Ethics Approval
The Stanford University Institutional Review Board provided ethics approval for this study (Protocol 16281). Individual informed written consent was waived by the Institutional Review Board based on an epidemiological exemption.

Risk Score Properties
The mean risk score in this population in 1996 is 1.12, with a standard deviation of 1.36 (Table 2). Fig 1 illustrates the distribution of the risk score in this population in 1996. For clarity of presentation of this figure, we omit those with risk scores greater than four, representing 2.1% of individuals (with mean and maximum scores of 7.73 and 33.19, respectively).
Risk scores are fairly stable over time. Year-to-year correlation for an individual is 0.49, with an individual-level intraclass correlation coefficient of 0.67. That is, 67% of the observed variance is between rather than within individuals. We next examine demographic factors that are correlated with the risk score (Table 3). Risk factors for higher risk score include age (β = 0.51 per 10-year increment, p < 0.001), being female (β = 0.12, p = 0.005), and being black (β = 0.45, p < 0.001). After controlling for these covariates, we observe an annual increase in the average risk score of 0.025 units during the study period (p < 0.001). Given the aging of the sample and these secular trends, the mean risk score in 2011 is 1.83 with a standard deviation of 2.72.

Associations with Disease and Mortality
Each increment of 1 in the risk score is associated on average with an increased likelihood of receiving a new diagnosis of asthma (0.04%, p < 0.001), depression (0.02%, p < 0.001), diabetes (0.05%, p < 0.001), and IHD (0.04%, p < 0.001) in the following year (Table 4). We next examine the risk score's long-term predictive abilities, using the baseline value of the risk score in 1996 to examine time-to-diagnosis for each condition. Kaplan-Meier survival curves demonstrate a monotonic increase in likelihood of diagnosis for every condition with higher risk score quintiles (Fig 2). Cox proportional hazards models show that higher risk scores are associated with increased risk of asthma (HR 1.09, p = 0.001), diabetes (HR 1.09, p < 0.001), hypertension (HR 1.05, p = 0.007), and IHD (HR 1.10, p < 0.001) (Table 5). Using risk score quintiles as a predictor in Cox models confirms the relationship observed in the Kaplan-Meier curves, i.e., there is a monotonic increase in likelihood of diagnosis with higher risk score quintiles for most health conditions (Table 6).
Similarly, we find that higher risk scores in 1996 are associated with increased mortality during the follow-up period (Fig 3), with a hazard ratio of 1.21 (p < 0.001) (Table 5). When using risk score quintile rather than continuous risk score as the primary predictor, we find that the relationship between risk score and mortality is largely driven by those in the highest quintile (HR 2.24, p < 0.001), the only group with a significantly elevated HR (Table 6). Within this top quintile, we find that individuals in the 95th-100th percentile had a higher risk of mortality compared to those in the 80th-95th percentiles (HR 2.38, p < 0.001) (data not shown).
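As a reading aid for the continuous-score hazard ratios, the sketch below shows how a per-unit hazard ratio scales to larger differences in risk score under the log-linear form of a Cox model, where a difference of d units corresponds to a hazard ratio of HR raised to the power d. The function name is hypothetical. Because the quintile results suggest that the mortality association is concentrated in the highest quintile, this extrapolation should be taken as an illustration of the model's arithmetic rather than as an additional finding.

```python
def hazard_ratio_for_difference(hr_per_unit: float, difference: float) -> float:
    """Hazard ratio implied by a given difference in a continuous covariate.

    Assumes a Cox model in which the log hazard is linear in the covariate,
    so that HR(difference) = hr_per_unit ** difference.
    """
    if hr_per_unit <= 0:
        raise ValueError("hr_per_unit must be positive")
    return hr_per_unit ** difference


# Mortality HR of 1.21 per unit, applied to a two-unit difference in score.
two_unit_hr = hazard_ratio_for_difference(1.21, 2.0)
```

With the reported mortality hazard ratio of 1.21 per unit, a two-unit difference in baseline risk score corresponds to a hazard ratio of about 1.46 under this assumption.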

Discussion
The actuarial and insurance industries have long employed predictive algorithms to produce health risk scores for the purposes of medical management and cost prediction. In this paper we use one such risk score as a marker of overall health status for a broad array of applications in epidemiology. We demonstrate that the score displays within-individual stability across time. Even after controlling for age, race, and gender, the risk scores increase over time, which may reflect changes in physician coding behavior or secular health utilization trends. This risk score is associated with multiple short-term health outcomes. It is possible that this correlation reflects the increased utilization of healthcare that may immediately precede a new diagnosis, but we also demonstrate its predictive ability for several long-term health outcomes, including mortality at higher quintiles. The size of the associations is modest, with likely limited clinical relevance at the individual level. The value instead lies in the risk score's potential use as a marker of overall health status in research studies and in its short- and long-term prediction ability at the population level. Interestingly, we find that individuals in the second and third risk score quintiles at baseline demonstrate increased long-term likelihood of being diagnosed with several different diseases compared to the lowest quintile, even though they are healthier than average (with risk scores between 0.53 and 0.93). This provides evidence of the sensitivity of the risk score in identifying individuals at high risk of developing chronic disease, even at low values.
In prior research by our group, we have found that this risk score is associated with several other health-related outcomes. For example, in a study of the impacts of the Great Recession of 2007-2009 on healthcare utilization among a panel of employees, individuals with higher risk scores at baseline in 2006 utilized more outpatient, emergency room, and inpatient services at baseline, as would be expected based on the manner in which the risk score is constructed. However, after a period of several years in which there was reversion to the mean, individuals with higher risk scores responded to the recession with greater increases in utilization, compared with those with lower risk scores at baseline [23]. In a prior study examining predictors of complications among diabetic patients, higher risk score quartiles predicted increased risk of complications including coronary artery disease, stroke, heart failure, and renal disease [24].
Higher risk scores are associated with a variety of non-health related outcomes, illustrating the broad applicability of the risk score to multiple fields of research. For example, those with higher risk scores were more likely to be laid off during the Great Recession [22,25]. In another study, individuals with higher risk score deciles were more likely to experience occupational injury, even controlling for other demographic and job-related factors [26]. Those with higher baseline risk scores were also more likely to enter retirement at younger ages [27], and were more likely to have a delayed return to work after a hospitalization (unpublished).
A number of studies by other groups have found that other claims-based risk scores predict mortality, long-term care utilization, and 30-day readmission after hospitalization [28-30]. Risk scores have also been employed in studies of moral hazard, generalized risk aversion, and adverse selection in the U.S. health insurance market [17,18,31]. For implementation of causal g-methods, such as marginal structural models or g-estimation, the risk score could serve as a longitudinal measure of health status. In this case, it serves as a time-varying measure of health status to address the healthy worker survivor effect. This bias arises if workers in better health tend to accrue more exposure than less healthy workers who are more likely to transfer to lower exposed jobs or leave work. In most occupational studies, time off work has been the only available surrogate for health status to address this bias and is only weakly related to exposure and health. In a recent study of exposure to particulate pollution and IHD incidence in this workforce, the authors reduced this bias by applying marginal structural models and treating risk score as a time-varying measure of comprehensive health status [32].
While these examples are not intended to illustrate a causal role for the risk score, they demonstrate the broad utility of this measure across a variety of health and non-health domains, and in particular its utility in mitigating confounding by health status.
Our group has found that this software is simple to use. Verisk offers the package at a discount to academic researchers, making it accessible to those who wish to apply this versatile tool. As health researchers strive to take advantage of the increasingly available vast quantity of claims data, including those from Medicare and private insurers, employing risk scores presents the possibility of collapsing large quantities of data into one index of overall health status. This technique is also relatively inexpensive compared to surveys required to capture subjective measures, if claims data are readily available [19].
While researchers wishing to predict a particular health outcome as accurately as possible may consider developing their own predictive algorithm, this requires large amounts of data and expertise that may not be available. In contrast, for those interested in a broadly applicable measure of overall health status that is available "off the shelf," existing risk score algorithms such as this one may be well suited for this purpose. Similarly, while a single composite measure is likely to explain less variance than multiple measures, in this case it is impractical to include the richness of an individual's entire claims history as individual variables. Moreover, a composite measure consumes fewer degrees of freedom. Given the variety of claims-based risk scores produced in private settings and those available for public use (e.g., through CMS and the Agency for Healthcare Research and Quality), researchers have several options in selecting among these tools. While a handful of prior studies have compared the predictive value of these risk scores for expenditures [6,7,19,33], fewer have compared their predictive values for health outcomes [34,35]. Moreover, different measures are likely to predict different outcomes to varying degrees [36,37]. These are questions that can be explored in future studies.
This study has several limitations. The risk score we employ here is in part capturing health behaviors (i.e., willingness and ability to access healthcare services) rather than health itself. As we show, it is nevertheless associated with a variety of health outcomes. Given differences in the collection of claims data in other countries, the potential use of this particular risk score in international settings is limited. Additionally, such algorithms are often proprietary, meaning that the specific inputs that form the components of the risk score are unknown to those researchers who choose to use it. Finally, this study is limited in its application of this risk score among a non-representative sample of manufacturing employees. Employed individuals have been shown to be healthier than the general population [38], which limits the external validity of the specific findings that we describe here, and this sample has fewer women and minorities than the general population. The availability of extensive claims and other linkable data for this population, however, enabled us to conduct the diverse set of analyses we present here. Future studies should test this tool in other populations, including the non-employed, the elderly, and others.
In this paper, we describe the properties and possible applications of a claims-based health risk score, demonstrating its associations with mortality, incident disease diagnosis, and healthcare utilization, in addition to a range of non-health outcomes. These examples demonstrate the broad applicability of this tool across a variety of domains, and illustrate its utility as a measure of overall health status for epidemiologists and other academic health researchers.