Improving the Rank Precision of Population Health Measures for Small Areas with Longitudinal and Joint Outcome Models

Objectives The University of Wisconsin Population Health Institute has published the County Health Rankings since 2010. These rankings use population-based data to highlight health outcomes and the multiple determinants of these outcomes and to encourage in-depth health assessment for all United States counties. A significant methodological limitation, however, is the uncertainty of rank estimates, particularly for small counties. To address this challenge, we explore the use of longitudinal and pooled outcome data in hierarchical Bayesian models to generate county ranks with greater precision. Methods In our models we used pooled outcome data for three measure groups: (1) Poor physical and poor mental health days; (2) percent of births with low birth weight and fair or poor health prevalence; and (3) age-specific mortality rates for nine age groups. We used the fixed and random effects components of these models to generate posterior samples of rates for each measure. We also used time-series data in longitudinal random effects models for age-specific mortality. Based on the posterior samples from these models, we estimate ranks and rank quartiles for each measure, as well as the probability of a county ranking in its assigned quartile. Rank quartile probabilities for univariate, joint outcome, and/or longitudinal models were compared to assess improvements in rank precision. Results The joint outcome model for poor physical and poor mental health days resulted in improved rank precision, as did the longitudinal model for age-specific mortality rates. Rank precision for low birth weight births and fair/poor health prevalence based on the univariate and joint outcome models were equivalent. Conclusion Incorporating longitudinal or pooled outcome data may improve rank certainty, depending on characteristics of the measures selected. For measures with different determinants, joint modeling neither improved nor degraded rank precision. This approach suggests a simple way to use existing information to improve the precision of small-area measures of population health.


Methods
In our models we used pooled outcome data for three measure groups: (1) Poor physical and poor mental health days; (2) percent of births with low birth weight and fair or poor health prevalence; and (3) age-specific mortality rates for nine age groups. We used the fixed and random effects components of these models to generate posterior samples of rates for each measure. We also used time-series data in longitudinal random effects models for age-specific mortality. Based on the posterior samples from these models, we estimate ranks and rank quartiles for each measure, as well as the probability of a county ranking in its assigned quartile. Rank quartile probabilities for univariate, joint outcome, and/ or longitudinal models were compared to assess improvements in rank precision.

Results
The joint outcome model for poor physical and poor mental health days resulted in improved rank precision, as did the longitudinal model for age-specific mortality rates. Rank precision for low birth weight births and fair/poor health prevalence based on the univariate and joint outcome models were equivalent.

Introduction
The County Health Rankings, first published in 2010 by the University of Wisconsin Population Health Institute, provide population health measures for nearly all United States counties. The Rankings are designed to direct media and policy-maker attention toward the multiple determinants of health and encourage in-depth community health assessment [1]. The community engagement and discussion about health and its determinants is intended to motivate the implementation of evidence-based programs to improve population health.
One challenge to the Rankings is its reliance on small-area estimates, which are commonly affected by small sample sizes, large standard errors, and statistical outliers. These features of small-area estimates lead to uncertainty regarding the health of counties-especially for small counties-in the Rankings. Hierarchical Bayesian models, however, can be used to improve small-area estimates of health-related measures and the resulting ranks. The benefits of Bayesian estimates are well-known: they draw in extreme values that are often statistical artifacts due to data sparsity by using information from related units [2]. Furthermore, Bayesian estimates allow us to estimate more adequately uncertainty in performance across units-a critical feature when comparing (or ranking) entities [3].
Bayesian estimates can be generated from hierarchical models with no fixed effects (empty models), or estimates can be informed by adding covariates to the model which, in most cases, improve their precision [4]. Longitudinal and joint outcome models are another way to leverage information and estimate more accurately county-level health measures and ranks. Longitudinal data are ideally suited for hierarchical models, as data from different time points are nested within counties. Joint outcome or shared component models are a generalization of the longitudinal model. Two or more measures that share common determinants (i.e., risk factors) can be modeled together in order to pool data and "borrow strength" for more reliable point and rank estimation [5,6]. These features of joint or longitudinal modeling could prove valuable for the Rankings, as earlier work has demonstrated that applying univariate hierarchical Bayesian models result in limited improvement for rank precision relative to ranks based on observed data, particularly for middle-ranked counties [4].

Data
The County Health Rankings report data for more than 25 health-related measures. Five of these measures represent the "health outcomes" or the current health of a community. Health outcomes measures include premature mortality, self-reported fair or poor health prevalence, average poor mental health days per month, average poor physical health days per month, and percent of live births with low birth weight. The remaining measures in the Rankings model represent the determinants of these health outcomes and are divided into four categories: health care, health behaviors, socioeconomic factors, and the physical environment [7]. Not all of the measures have longitudinal data available or are appropriate for modeling together, so we restricted our analyses to the five health outcome measures. More information on the measures and methods used in the County Health Rankings is available at www. countyhealthrankings.org.
The measures and data sources for health outcomes in the 2010 Rankings are listed in Table 1. For all measures but premature mortality, the analyses used National Vital Statistics System (NVSS) and Behavioral Risk Factor Surveillance System (BRFSS) data aggregated to the county level and reported to the Institute.
The Rankings represent premature mortality by the rate of years of potential life lost before the age of 75 (YPLL-75) per 100,000 population, where each death in a county is weighted based on the difference between age 75 and the age at death [8]. Because YPLL-75 rates are a composite of age-specific death rates, we directly modeled the underlying mortality rates for nine age groups and used the resulting output to calculate posterior samples of YPLL-75. Raw mortality and population counts for age < 1, ages 1-4, and 10-year age groups between the ages of 5 to 74, were accessed from the CDC WONDER underlying cause of death query system for four 3-year time periods: 1995-97, 1998-2000, 2001-03, and 2004-06 [9].
The percentage of counties with missing health outcome data ranged from 3.1% to 13.7%. Counties with missing data were disproportionately rural. Values for vital statistics measures-YPLL-75 and low birth weight births-were suppressed if based on five or fewer events. BRFSS censored values for counties with fewer than 50 respondents or a 95% confidence interval width greater than 20% of the point estimate.

Longitudinal model
Mortality data for each of nine age groups included number of events (y) and population denominator (n) for the years 1995-97 (t = -3), 1998-2000 (t = -2), 2001-03 (t = -1), and 2004-06 (t = 0). Estimates from the years 2004-06 were our primary interest, so we set that time period as our intercept. For each age group, we fit the following generalized linear mixed effects log-linear regression models for age-specific mortality with state-and county-level random intercepts and slopes. y jkt $ Poissonðr jkt ; n jkt Þ Where y jkt = number of deaths in county j in state k during year t n jkt = population for age group in county j in state k during year t ρ jkt = mortality rate for age group in county j in state k during year t α = fixed effects intercept parameter for mortality at time = 0 (2004-06) a k = state-specific random intercept parameter s 2 k = variance of state-specific random intercept parameters a jk = county-specific random intercept parameter s 2 jk = variance of county-specific random intercept parameters β = fixed effects slope parameter for time b k = state-specific random slope parameter t 2 k = variance of state-specific random slope parameters b jk = county-specific random slope parameter t 2 jk = variance of county-specific random slope parameters for j = 1, 2, . . ., m k ; k = 1, 2, . . We used the resulting posterior samples of age-specific death rates in 2004-06 (ρ jkt ) to calculate premature mortality (YPLL-75) rates by county. For more information on the calculation and use of YPLL rates, see Wise et al., 1988 [8] and Gardner and Sanborn, 1990 [10].

Joint outcome models
We also examined the performance of joint outcome models for 2004-06 age-specific mortality data-borrowing strength across age groups rather than over time-and for two additional sets of measures: (1) average poor physical and poor mental health days per month and (2) percent reporting fair or poor health and percent of live births with low birth weight. In our measure dyads (1) and (2) we chose to jointly model measures that followed the same distribution (e.g., Gaussian, binomial). Although one could jointly model measures from different families of distributions [5,11], we do not consider such models in this paper.
The first measure pair was selected due to the positive correlation between average poor mental health days and poor physical health days per month. High county-level averages for poor mental and poor physical health days are both strongly associated with low levels of physical activity, high poverty rates, and inadequate social support [12]. Allostatic load or stress response is considered a common determinant for mental and physical health [13]. Thoughunlike the model for mortality rates-this joint modeling does not fundamentally increase sample size, we hypothesized that including more information per respondent in a joint outcome model could improve the accuracy of the resulting estimates for each measure. The second measure pair-fair/poor health prevalence and low birth weight births-was chosen to demonstrate that, for unrelated or weakly related measures, the joint modeling procedure will neither improve nor degrade the accuracy of the resulting estimates.
Premature mortality. Using cross-sectional mortality data (2004-06), we created the following joint outcome model for premature mortality across the 9 age groups.
Where y ajk = number of deaths for age group a in county j in state k n ajk = population for age group a in county j in state k ρ ajk = mortality rate for age group a in county j in state k α = fixed effects intercept parameter for reference age group (age <1 year) a k = state-specific random intercept parameter for reference age group s 2 k = variance of state-specific random intercept parameters for reference age group a jk = county-specific random intercept parameter for reference age group s 2 jk = variance of county-specific random intercept parameters for reference age group β a = fixed effects slope parameter for age group a b ak = state-specific random slope parameter for age group a t 2 ak = variance of state-specific random slope parameters for age group a b ajk = county-specific random slope parameter for age group a t 2 ajk = variance of county-specific random slope parameters for age group a for a = 1, 2, . . ., 9; j = 1, 2, . . ., m k ; k = 1, 2, . . ., 51 Average poor physical and poor mental health days. Average number of poor physical health days per month and the average number of poor mental health days per month were reported from the BRFSS. Data reported included the (observed) mean value by county (m), the number of respondents (n), and the 95% confidence limits for the county-specific means, from which standard errors (s) could be calculated. Based on the Central Limit Theorem, the sampling distribution of the observed county-specific mean value is approximately normal regardless of the underlying distribution of poor physical and poor mental health days. Using these data, we fit the following joint model for these outcomes.
Where m ijk = observed mean for outcome i in county j in state k μ ijk = population mean for outcome i in county j in state k s 2 ijk = sampling variance for log(m ijk ) α = fixed effects intercept parameter for reference measure (poor physical health days) a k = state-specific random intercept parameter for reference measure s 2 k = variance of state-specific random intercept parameters for reference measure a jk = county-specific random intercept parameter for reference measure s 2 jk = variance of county-specific random intercept parameters for reference measure β i = fixed effects slope parameter for poor mental health days b k = state-specific random slope parameter for poor mental health days t 2 k = variance of state-specific random slope parameters for poor mental health days b jk = county-specific random slope parameter for poor mental health days t 2 jk = variance of county-specific random slope parameters for poor mental health days for i = 0, 1; j = 1, 2, . . ., m k ; k = 1, 2, . . ., 51 Fair or poor health prevalence and percent low birth weight births. Data on fair or poor health prevalence, reported from BRFSS, consisted of a point estimate (p 0jk ) with nominal 95% confidence limits, which were used to calculate the county-specific standard error of p 0jk (s 0jk ). From the estimated standard error, we determined the effective sample size: n 0jk ¼ p 0jk ð1 À p 0jk Þ=s 2 0jk . Data on low birth weight births consisted of a census of all live births (n 1jk ), a count of births for which birth weight was less than 2500 grams (y 1jk ) and the observed proportion of low birth weight births (p 1jk = y 1jk /n 1jk ). The following joint logistic regression model was fit for these outcomes.
Where y ijk = count of outcome i in county j in state k π ijk = population proportion for outcome i in county j in state k n ijk = (effective) sample size for outcome i in county j in state k α = fixed effects intercept parameter for reference measure (fair or poor health prevalence) a k = state-specific random intercept parameter for reference measure s 2 k = variance of state-specific random intercept parameters for reference measure a jk = county-specific random intercept parameter for reference measure s 2 jk = variance of county-specific random intercept parameters for reference measure β i = fixed effects slope parameter for percent low birth weight births b k = state-specific random slope parameter for low birth weight births t 2 k = variance of state-specific random slope parameters for low birth weight births b jk = county-specific random slope parameter for low birth weight births t 2 jk = variance of county-specific random slope parameters for low birth weight births for i = 0, 1; j = 1, 2, . . ., m k ; k = 1, 2, . . ., 51

Cross-sectional models
In addition to the longitudinal and joint outcome models described above, each of the five health outcomes measures were estimated using cross-sectional, univariate data in a generalized linear mixed effects model with an intercept as fixed effect, and state-and county-level random effects. An example of the Poisson model specification is below.
Where y jk = number of events in county j within state k n jk = population county j within state k ρ jk = event rate for county j in state k β 0 = Intercept, or the average log event rate across all counties in all states e k = state-specific random effect parameter s 2 k = variance of state-specific random effects e jk = county-specific random effect parameter s 2 jk = variance of county-specific random effects For j = 1, 2, . . ., m k ; k = 1, 2, . . ., 51

Estimation and model fit
Model parameters (regression coefficients and variances of the state-and county-level random effects) were estimated by maximum likelihood. Empirical Bayes estimates of the random effects were obtained by conditioning on the estimated variance parameters. Samples from the joint posterior distribution of the regression coefficients, state-level random effects, and county-level random effects were drawn from (multivariate) normal approximations to their conditionally independent posterior distributions. Posterior samples of the county-specific measures were then derived from these samples. Estimation and simulations were performed in R (version 2.15.1), using the lme4 and MASS libraries [14][15][16].
To test our models' performance, we used random number generation procedures for binomial, Poisson, and normal distributions to produce posterior predictive data sets based on the posterior samples and the reported population data for each measure. Similarities between parameters calculated from the posterior predictive data sets and the original data were quantified by posterior predictive p-values. These p-values are similar to those in frequentist statistics, in which a probability is set as a threshold below which the null hypothesis is rejected. In this case, a significant p-value (p < 0.05) suggested a poor fit [17,18]. Mean values would be replicated even with a poorly fitted model, so we selected inter-quartile range and skew as summary measures of a variable's distribution.

Ranking
Posterior samples of county-specific ranks in the entire nation were obtained by ranking each draw of the county-specific means or rates. Point estimates of ranks were obtained by ranking the posterior mean ranks, which is equivalent to minimizing squared error loss on the ranks. (See Louis, 2001 for the comparative advantages of different loss functions using ranks [19].) We used these ranks to assign counties to (national) quartiles of performance: quartile 1 represents the healthiest counties, quartile 4 the least healthy. The probability of a county ranking in its assigned quartile across the posterior samples was also calculated as a means to represent rank certainty. We then mapped the posterior probability of a county ranking in its assigned quartile using the maps library in R (2.15.1) [20].

Estimates and model fit
For premature deaths, our posterior predictive checks initially suggested that the joint outcome model-modeling mortality rates for all age groups simultaneously-best captured the distribution of the original data. The posterior predictive p-values for inter-quartile range and skew were both insignificant (p > 0.05), suggesting good model fit. The cross-sectional and longitudinal models, conversely, were unable to capture the dispersion of observed premature mortality rates, represented by highly significant posterior predictive p-values for inter-quartile range. However, posterior predictive p-values for individual age-group mortality rates indicate that the joint outcome model performed poorly at replicating the original data for the following age groups: ages 1-4, ages 5-14, and ages 65-74. In contrast, posterior predictive p-values for individual age-group mortality rates based on the cross-sectional and longitudinal models were all > 0.05 and indicative of reasonable model fit.
In our two other joint outcomes models, we saw some improvement in model fit for poor physical health days using the joint outcome model. The posterior predictive checks indicated that samples from the joint outcomes model better captured the dispersion in the observed data for poor physical health days, but there was little to no improvement in model fit for poor mental health days. For the final measure pair, fair/poor health prevalence and percent of births with low birth weight, the model fits were equivalent between the cross-sectional and joint outcome models. For fair/poor health prevalence, both models successfully replicated the interquartile range and skew of the original data; for low birth weight births, the models captured the dispersion of the observed measure, but not the skew. (See Tables 2-9 for estimates of fixed and random effects from all measure-model combinations.)

Rank estimates
We used the posterior ranks to determine the probability of a county ranking in a quartile as a means to explore rank precision. For each measure-model combination, we assigned counties to a quartile based on their mean percentile rank and calculated the probability with which each county ranked in its assigned quartile. This probability typically ranged between 0.30 and 1.0, and reflects the degree of certainty with which we know a county's rank. Among counties within a quartile, we divided them into three classes based on the probability (certainty) with which they ranked in that quartile: low certainty (p < 0.50), medium certainty (p = 0.50-0.75), and high certainty (p ! 0.75).
Estimates of county ranks for premature death are similar across the three models considered (cross-sectional, longitudinal and joint outcomes). Based on the posterior probabilities of counties ranking in their assigned quartile, ranks are most precise using the longitudinal model and least precise using the joint outcomes model. Table 10 shows that the percent of counties ranking in their assigned quartiles with high probability (p ! 0.75) is highest for the longitudinal model (71.1% of all counties) and lowest for the joint outcomes model (46.8% of all counties) ( Table 10). The improvement in precision with the longitudinal model over the cross-sectional model is expected. The degradation in performance of the joint outcomes model relative to the univariate model, though surprising, may reflect challenges in model fit for the age groups with low mortality rates.
Joint modeling of poor physical and poor mental health days produced marginally better rank precision for poor mental health days compared with the univariate model. Using a joint outcomes approach, approximately 50% of counties ranked with high probability in their assigned quartile for poor mental health days, compared to 44% of counties with a univariate approach. However, improvements across counties for poor physical health days are marginal. Exploiting the association between these two measures improved our ability to differentiate the counties in terms of performance on poor mental health days and, to a lesser degree, on poor physical health days (Table 10).
As expected, joint modeling of fair or poor health prevalence and percent of births with low birth weight had little, if any, impact on the precision of the estimated ranks relative to separate univariate models for the two outcomes. The percent of counties ranking with high certainty in their assigned quartiles is the similar between the joint outcome and univariate models, indicating no improvement-and importantly no degradation-in rank precision.

Maps of rank performance
In the following choropleth maps, we used four colors (purple, blue, orange, red) to represent the quartiles. The color ramps indicate precision; greater saturation indicates greater certainty that the county ranks in its assigned quartile. Maps of quartiles for all measure-model combinations are shown in Figs 1 through 4B. Fig 1A-1C show U.S. counties mapped by quartile for premature death (quartile 1 representing the "healthiest" counties, quartile 4 representing the "least healthy"), based on the independent, longitudinal, and joint outcome models. These maps show differences in our ability to distinguish among counties based on our modeling approach. Ideally, a map would have very few light-colored counties, indicating that counties ranked in the same quartile across the posterior samples. The map representing rank performance from the longitudinal model (Fig 1B), though similar to the independent model (Fig 1A), shows greater rank precision. Seventyone percent of counties rank with high certainty (p ! 0.75) in their assigned quartiles using posterior samples from the longitudinal model; using posterior samples from the cross-sectional model, only 59% of counties rank with high certainty (p ! 0.75) in a quartile. Conversely, Fig 1C demonstrates how differences among counties are attenuated when using posterior samples from the joint outcome model, with only 47% of counties ranking in their respective quartiles with high certainty (p ! 0.75). In these three maps, counties in the Southeastern United States rank among the least healthy, whereas the Midwest, New England, and coastal California rank among the healthiest regions. Fig 2A and 2B show U.S. counties mapped by quartile and precision for poor mental health days, based on posterior ranks from the independent and joint outcome models. The precision of results from the independent model are poor for this measure, with nearly 20% of counties Improving the Rank Precision of Health Measures ranking in their respective quartiles with low certainty (p < 0.5). Only 14% of counties rank with such uncertainty based on posterior samples from the joint outcome model. Poor physical health days do not demonstrate notable improvement with a joint modeling approach, as greater than 20% of counties rank with low certainty in their assigned quartiles in both the univariate and joint outcome models (maps not shown). For both measures, high-performing (healthier) counties tend to cluster in the Plains states and along the mid-Atlantic seaboard. Low-performing counties are grouped in the states of Oklahoma, Kentucky, Mississippi, and Alabama.
In contrast to improvements in rank certainty that result from the longitudinal models for premature death and the joint outcomes model for poor mental health days, rank certainty based on the independent and joint outcome models is similar for percent reporting fair or poor health and percent of births with low birth weight (Figs 3 and 4 show the map results for Improving the Rank Precision of Health Measures the univariate models for fair/poor health and low birth weight births, respectively). The geographic distribution of counties ranking in the top and bottom quartiles for fair or poor health mimics that of the premature death measure: New England and the Midwest tend to rank best, with Southeastern counties disproportionately ranking in the lower quartile. High-performing counties in the low birth weight measure are clustered in the Upper Midwest and along the

Discussion
Overall, the expanded models in this paper-using either longitudinal or pooled data-can improve rank estimation and precision compared to cross-sectional or univariate models (for motivations for national rankings versus in-state rankings, see Athens et al. 2013 [4]). In the case of pooled or joint outcome models, advantages are not necessarily observed when the measures modeled jointly are insufficiently related, such as fair or poor health prevalence and percent of births with low birth weight. In this case, however, the quality of the estimates and ranks mimic that of the univariate models, and there is no degradation of rank precision through a joint modeling approach. The finding that posterior samples for premature deaths from the joint outcome model resulted in less rank precision over the univariate models was not anticipated. One cause of this rank degradation is poor model fit for age groups 1-4, 5-14, and 65-74. Mortality rates among the young age groups are particularly low; a number of counties had zero events over the 2004-06 time period. Though we did not allow for distributions to vary within a joint outcome model, using a negative binomial or zero-inflated Poisson model would better replicate the observed data in these age groups [21]. For the oldest age group, ages 65-74, the joint outcome model overestimated the variance in mortality rates. The independent and longitudinal models estimate almost one-third of the variance for the oldest age group compared to the joint outcome model. Though a death among the 65-74 age group is given little weight in the composite YPLL measure, deaths in this age group comprise about 10% of total years of potential life lost, given the very high mortality rates at these ages. Consequently, the variance from the joint outcome model fit for age 65-74 mortality is reflected in the overall premature mortality estimates. Refinement in rank estimates was the goal of the expanded models explored in this paper. For select measures, we saw improvement in the certainty with which we could assign counties to a specific quartile. Even so, the quartiles for which the most counties are placed with "high certainty" (p ! 0.75) are the top and bottom quartiles (Table 10). Ranks for middle-ranking counties remain more volatile. These results are consistent with Hall and Miller's examination of rank performance in which a small group of "highly performing" (or, alternately, poorly performing) entities tends to remain fixed in rank [22]. Adding more information confirms the rank location of counties at the extremes, but improving precision for middle-ranking counties is more difficult. Despite these limitations, these models may be useful in addressing another aspect of data sparseness reflected in the Rankings. The Institute currently aggregates up to seven years of data for its measures to improve estimate stability. This approach, however, limits the reactivity Improving the Rank Precision of Health Measures of the Rankings to changes in community health over time. A potential alternative is to report Bayes estimates and ranks based on data averaged over fewer years. Data from earlier years (or from other, related measures) could be leveraged in the modeling stage to provide reasonable estimates over a shorter time period.