Investigating associations between COVID-19 mortality and population-level health and socioeconomic indicators in the United States: A modeling study

Background With the availability of multiple Coronavirus Disease 2019 (COVID-19) vaccines and the predicted shortages in supply for the near future, it is necessary to allocate vaccines in a manner that minimizes severe outcomes, particularly deaths. To date, vaccination strategies in the United States have focused on individual characteristics such as age and occupation. Here, we assess the utility of population-level health and socioeconomic indicators as additional criteria for geographical allocation of vaccines. Methods and findings County-level estimates of 14 indicators associated with COVID-19 mortality were extracted from public data sources. Effect estimates of the individual indicators were calculated with univariate models. Presence of spatial autocorrelation was established using Moran’s I statistic. Spatial simultaneous autoregressive (SAR) models that account for spatial autocorrelation in response and predictors were used to assess (i) the proportion of variance in county-level COVID-19 mortality that can explained by identified health/socioeconomic indicators (R2); and (ii) effect estimates of each predictor. Adjusting for case rates, the selected indicators individually explain 24%–29% of the variability in mortality. Prevalence of chronic kidney disease and proportion of population residing in nursing homes have the highest R2. Mortality is estimated to increase by 43 per thousand residents (95% CI: 37–49; p < 0.001) with a 1% increase in the prevalence of chronic kidney disease and by 39 deaths per thousand (95% CI: 34–44; p < 0.001) with 1% increase in population living in nursing homes. SAR models using multiple health/socioeconomic indicators explain 43% of the variability in COVID-19 mortality in US counties, adjusting for case rates. R2 was found to be not sensitive to the choice of SAR model form. Study limitations include the use of mortality rates that are not age standardized, a spatial adjacency matrix that does not capture human flows among counties, and insufficient accounting for interaction among predictors. Conclusions Significant spatial autocorrelation exists in COVID-19 mortality in the US, and population health/socioeconomic indicators account for a considerable variability in county-level mortality. In the context of vaccine rollout in the US and globally, national and subnational estimates of burden of disease could inform optimal geographical allocation of vaccines.


Methods and findings
County-level estimates of 14 indicators associated with COVID-19 mortality were extracted from public data sources. Effect estimates of the individual indicators were calculated with univariate models. Presence of spatial autocorrelation was established using Moran's I statistic. Spatial simultaneous autoregressive (SAR) models that account for spatial autocorrelation in response and predictors were used to assess (i) the proportion of variance in county-level COVID-19 mortality that can explained by identified health/socioeconomic indicators (R 2 ); and (ii) effect estimates of each predictor.
Adjusting for case rates, the selected indicators individually explain 24%-29% of the variability in mortality. Prevalence of chronic kidney disease and proportion of population residing in nursing homes have the highest R 2 . Mortality is estimated to increase by 43 per thousand residents (95% CI: 37-49; p < 0.001) with a 1% increase in the prevalence of chronic kidney disease and by 39 deaths per thousand (95% CI: 34-44; p < 0.001) with 1% increase in population living in nursing homes. SAR models using multiple health/socioeconomic indicators explain 43% of the variability in COVID-19 mortality in US counties, adjusting for case rates. R 2 was found to be not sensitive to the choice of SAR model form. Study limitations include the use of mortality rates that are not age standardized, a spatial adjacency matrix that does not capture human flows among counties, and insufficient accounting for interaction among predictors. PLOS

Introduction
By the end of 2020, the Coronavirus Disease 2019 (COVID- 19) pandemic has resulted in 81.5 million documented cases and 1.8 million deaths globally [1]. The US has contributed nearly a quarter of these cases and has lost 1 in every 1,000 residents to COVID-19 [2]. The outbreak has affected all states in the US but with considerable differences in the trajectory and severity of individual outbreaks. Besides this inter-and intrastate geographical variability, the likelihood of adverse outcomes among those infected is reported to be associated with individual's age, gender, race/ethnicity, and underlying health conditions [3][4][5][6]. An estimated 22% of the global population and 28% of the US have 1 or more of the underlying conditions that pose increased risk of severe outcomes from COVID-19 [7]. Early studies on clinical characteristics of severe outcomes from COVID-19 were reported from China [5,8], after the first large outbreak in Wuhan, and concurring estimates were subsequently published from the United Kingdom, France, US, and elsewhere [3,4,[9][10][11][12]. Guan and colleagues [8] reported that among 1,100 of the earliest laboratory confirmed cases of COVID-19 in China, the presence of comorbidities such as diabetes, hypertension, and chronic obstructive pulmonary disease (COPD) were more prevalent in those with severe outcomes (admission to ICU, requiring mechanical ventilation, or death), along with a slightly elevated risk among men and by now well-established risk with increasing age. Using a larger data sample of 45,000 cases, Deng and colleagues [5] reported that mortality was associated (relative risk (RR) or hazard ratio (HR)) with cardiovascular disease (RR = 6.75, 95% CI = 5.40 to 8.43), hypertension (HR = 4.48, 95% CI = 3.69 to 5.45), diabetes (RR = 4.43, 95% CI = 3.49 to 5.61), and respiratory disease (RR = 3.43, 95% CI = 2.42 to 4.87, p < 0.001). In Italy, Grasselli and colleagues [13] reported associations with COPD (HR = 1.68; 95% CI = 1.28 to 2.19), hypercholesterolemia (HR = 1.25; 95% CI = 1.02 to 1.52), and diabetes (HR = 1.18; 95% CI = 1.01 to 1.39). Relatedly, Palmieri and colleagues reported differences in prevalence of comorbidities between younger (<65 years) and older (65+ years) deceased [14] as well as between the first 2 waves (March to May and June to August of 2020) of the pandemic in Italy [15].

PLOS MEDICINE
A later, more extensive study [9] from the UK linking 17 million cases to 11,000 deaths also found association between COVID-19 deaths and kidney disease (HR = 2.5, 95% CI = 2.3 to 2.7), diabetes (HR = 1.95, 95% CI = 1.8 to 2.1), extreme obesity (HR = 1.9, 95% CI = 1.7 to 2.1), and several other comorbidities. From a pooled analysis of 75 studies from multiple countries, Popkin and colleagues [11] summarized that individuals with obesity are at increased risk of death (odds ratio [OR] = 1.48; 95% CI = 1.22 to 1.80), hospitalization (OR = 2.13; 95% CI = 1.74 to 2.60), and ICU admission (OR = 1.74; 95% CI = 1.46 to 2.08). Based on these findings and the known prevalence of comorbidities that existed in the population before the emergence of the pandemic, the populations at risk of severe COVID-19 outcomes at county level in the US [16] and in several countries have been estimated [7]. Other studies have examined the associations of socioeconomic characteristics including poverty, income, and race/ ethnicity [17][18][19].
Over the past year, public health attempts to reduce transmission largely centered on nonpharmaceutical interventions such as social distancing, face coverings, and hand hygiene. In the US, these interventions have had limited success, and part of this failure stems from their dependence on collective compliance. The recent availability of high-efficacy vaccines gives individuals an additional tool to protect themselves (vaccine supply permitting), and, importantly, does not require cooperation from collective public.
The availability of vaccines also implies an opportunity to refocus our efforts from reducing infections to reducing severe outcomes by prioritizing vaccination for those at a higher risk of severe outcomes. To date, such strategies have been largely guided by individual characteristics such as age and occupation. We hypothesize that population-level characteristics can also guide the optimal allocation and distribution of vaccines geographically. This points to a potential 2-layered approach of first identifying high-risk communities within which high-risk individuals can be prioritized.
Here, we assess the feasibility of the first part of such an approach and evaluate the extent to which the geographical variability of mortality in US can be explained by population characteristics that predate the epidemic. Our outcome of interest is COVID-19-associated mortality rates at county resolutions, which we attempt to model as a function of population health and socioeconomic indicators. An initial set of indicators associated with COVID-19 mortality as reported in peer-reviewed studies, and data sources for estimates of these indicators were identified. A smaller subset of the variables were selected based on the correlation between the variables and their independent effects on the response.
Conventional regression models assume that observations are independent of one another, which in the case of spatial data translates to assuming observations in nearby locations are no more closely related than those farther away. Given the transmission dynamics of COVID-19, counties nearby are likely to be have similar case and death rates, and spatial dependence rather than spatial independence is a more appropriate assumption. This spatial dependence also extends to health and socioeconomic indicators and potentially latent and unobservable characteristics that effect mortality.
Spatial simultaneous autoregressive (SAR) models offer a parsimonious way to augment basic regression models with spatial dependence between locations [20] and are an extensively studied family of analytical approaches with applications ranging from econometrics, environmental studies, and health sciences [21][22][23]. In the current study, we first establish the presence of spatial autocorrelation in the response and explanatory variables, thus motivating the need for spatial models. We apply 3 forms of SAR models, show that they explain a greater proportion of the variability in mortality than linear models, and report effect estimates from each.

Data and methods
County-level indicators of population's health and social status were retrieved from public sources including the US census and large population surveys. In cases where the survey data are not available at county resolutions, data from prior studies on small-area estimates were used. We tried to limit the number of source dependencies, and, when alternative estimates were available from multiple sources, we preferred estimates from the US Centers for Disease Control and Prevention (CDC). See Table 1

The New York Times
Counts for cumulative cases and deaths through December 31, 2020 were retrieved from The New York Times public repository [24]. These data included both confirmed and probable cases and deaths at the US county level and is based on Times' monitoring and analyses of news conferences, data releases, and communications with public officials. The determination of cases and deaths as either confirmed or probable is made per definitions laid out in the position statement of the Council of State and Territorial Epidemiologists [25]. But as the application can vary across local agencies, here, we treat both confirmed and probable categories identically and use total cases and deaths. Case and death rates as a proportion of residents are based on county population estimates from the American Community Survey (ACS) 2014 to 2018 [26].
County-specific data for the 5 counties in New York City were retrieved from USAFACTS [27] as the Times' data source was found to combine counts for these 5 counties into a single entity. Fig 2 shows maps of reported county case and death rates.

Population Level Analysis and Community Estimates (PLACES)
From the PLACES study [29], a collaboration between the CDC and Robert Wood Johnson Foundation, estimates for population-level health and behavioral indicators were retrieved. These small-area estimates of population health outcomes across the US at county resolutions were generated using data collected through the Behavioral Risk Factor Surveillance System (BRFSS) [30], the US decennial 2010 census and the ACS, following a multilevel regression and post-stratification approach [31,32].
Of the 27 indicators available in PLACES, we extracted 5 measures of population-level prevalence of health conditions that are reported to have individual level associations with

PLOS MEDICINE
Investigating associations between COVID-19 mortality and population level indicators: A modeling study COVID-19 outcomes, namely obesity, diabetes, COPD, and chronic heart and kidney diseases. In addition, 3 related health indicators, the prevalence of high blood pressure and high cholesterol, and proportion of residents uninsured were also included.

Social Vulnerability Index
CDC's Social Vulnerability Index (SVI) is a measure of a county's relative vulnerability to hazardous events [33,34] and is intended to help public officials and planners better prepare for such events. Overall, county ranks are based on 15 socioeconomic indicators collected in the ACS. Three of the factors in the SVI, namely county population density, median per capita income, and proportion of the population that is older than 65 years of age, are hypothesized to be associated with COVID-19 mortality [17,18,35,36]. As association between the other variables in SVI and COVID-19 is uncertain, we limited inclusion to the raw estimates of these 3 variables and ignore the other variables in SVI and the overall index.

County Health Rankings
Two additional variables derived from the ACS 2014 to 2018 and available through the County Health Rankings (CHR) [37] are hypothesized to be measures of socioeconomic disparities in a county and included in this study: ratio of the 80th percentile income to 20th percentile as a measure of income inequality, and the proportion of non-white to white residents as a measure of racial diversity. We observed that estimates for these variables in a small percentage (approximately 1.5%) of counties were missing and used the following 3-step process to

US 2010 Census
It has also been reported that COVID-19 clusters occur in facilities in which people live in group quarters, where the increased vulnerability can result from either the living conditions in such facilities (difficulty to social distance in correctional facilities or on college campuses, for example) or the characteristics of the residents (elderly in nursing homes with underlying health conditions) [38,39]. As mortality from COVID-19 is known to be less likely in younger populations, we focused instead on elderly living in group quarters. An estimate of proportion of the population living in nursing homes or facilities with skilled nursing in each county was included in this analysis ( Table 1). As all data used here are routinely collected aggregate surveillance data, no ethics approval was deemed necessary for this study. A prespecified analysis plan has not been filed; a Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) checklist is available as Supporting information (S1 STROBE Checklist).

Methods overview
We first built linear univariate models for each predictor with county-level COVID-19 mortality as outcome, adjusting for county case rates. These models inform both the individual effects and the proportion of variance in mortality explained by each of these predictors. We followed this with a linear multivariate model, again adjusting for case rates. In both univariate and multivariate models, observational independence is inappropriate because of spatial autocorrelation in both the response and predictors. We verify this by standard tests on the residual of the multivariate model. We finally build spatial SAR models and report effect estimates.

Spatial weight matrix and spatial autocorrelation
As introduced in an earlier section, a key assumption in standard ordinary least square (OLS) regression models is the independence of observations that does not hold because COVID-19 cases and deaths in a county are related to cases and deaths in other counties (spatial dependence) and often counties adjacent to it (spatial autocorrelation). Models that do not account for spatial dependence and autocorrelation are shown to have inflated type I errors [23,40].
To establish adjacency of counties in the US, we define a simple spatial n x n matrix, W, using shape files that list county boundaries as an ordered set of geocoded reference points [41]. County adjacency is defined by queen congruity (at least 1 shared boundary point), and the spatial weight matrix is row standardized, i.e., for each county i, the weight of link to county j, w ij, is the inverse of the number of neighbors of i, if j is adjacent to i, and 0 otherwise; ∑ j w ij = 1. A county is assumed to not be a neighbor of itself, i.e., w ij = 0 when i = j.
Moran's I [42,43], a commonly used measure of global spatial autocorrelation, is calculated as follows: where n is the number of counties, x i is the variable of interest for county i, � x is the mean across all counties, and w ij is as defined by the spatial weight matrix, W. Here, as W is row standardized, ∑ i ∑ j w ij = n and the above equation can be simplified. The significance of the statistic was tested under the randomization assumption, i.e., x i are draws from a random distribution and there is no spatial association. A related measure to identify specific regions within the study region that exhibit spatial autocorrelation, the Local Moran's I, was also estimated.   [44]. Global Moran's I statistic is denoted by the label in each subpanel and found to be statistically significant for all variables. Maps generated with usmap R package [28].
We were also interested in determining whether spatial autocorrelation, if present, resided in the response or in the residual, as this also informs the choice of the spatial model. To identify this, we used robust Lagrange Multiplier (LM) tests that can detect possible autocorrelated residuals in the presence of an omitted lagged response and vice versa [45,46]. The statistics reported here are from implementations of these tests in the spdep [43,47] R [48] library.

Variable pruning
As the variables selected for inclusion are related, we calculated Spearman correlation between pairs of variables (Fig 4) and found some of the variables to be very highly correlated. Hence, it would not be appropriate to include these pairs together in models. We used the results of the univariate analysis to aid variable selection by only retaining those variables that have a correlation of less than 0.75 with variables of a higher R 2 . This led to the elimination of 5 variablesprevalence indicators for diabetes, heart disease, high blood pressure (all highly collinear with kidney disease), high cholesterol (collinear with COPD), and median per capita income. The linear multivariate model and the spatial models were built using this smaller set of predictors (n = 9).

Spatial simultaneous autoregressive models
The general form of an autoregressive model in spatial statistics is given by [20,23,49]: where y is a n x 1 vector of the response variable, X is a n x k matrix of k predictors for n counties, W is the n x n spatial weight matrix, ρ is the SAR lag coefficient and λ the spatial error coefficient, and β, u the coefficient and error vectors, respectively. When λ = 0, the autoregressive process is assumed to occur in the response only (captured by ρW) and the model is referred to as a spatial lag model. When ρ = 0, the autoregressive process is assumed to occur only in the errors (captured by λW), and the model referred to as spatial error model. Model implementations are per spatialreg [47,49,50] library in R.
Among socioeconomic indicators, the largest association was seen with the nursing home variable (adjusted R 2 : 29%) with an estimated increase of 39 deaths per thousand (95% CI: 34 to 44; p < 0.001) for every 1% increase in percent living in nursing homes. Mortality rates are estimated to increase by 2.8 (95% CI: 2.3 to 3.4; p < 0.001) and 2.4 (95% CI: 2 to 2.9; p < 0.001) for each 1% increase in percentage of the population who are elderly (65+ years) and uninsured 18 to 64 year olds, respectively. In contrast, mortality rate is estimated to decrease by 1.5 (95% CI: 1.05 to 1.87; p < 0.001) for every thousand dollar increase in per capita income. On average, the R 2 estimates for socioeconomic indicators are lower than for health indicators.
Following variable pruning to correct for collinearity, the multivariate model explained 38% of the variability in mortality with a few changes in effect estimates. Obesity's association is not statistically significant in the presence of kidney disease, and COPD's association is counterintuitively negative ( Table 2).
Moran's I test for spatial autocorrelation in residuals of the above model was found to be statistically significant (18.4, p < 0.001). Both robust LM tests were found to be significant indicating possible autocorrelation in both the error (28.7, p < 0.001) and response (33.5, p < 0.001). Hence, 3 model forms, the general SAR model, spatial lag, and spatial error models, were attempted.
The proportion of variability explained by the SAR models is about 14% higher than the linear model (Fig 6). The spatial error model had an Nagelkerke R 2 [51] of 43.5% with an estimated autocorrelation error coefficient (λ) of 0.418 (95% CI: 0.37 to 0.46; p < 0.001). The spatial lag model and the general model were observed to have an R 2 nearly identical to that of the error model. The autocorrelation coefficient in response (ρ) was found to be 0.347 (95% CI: 0.31 to 0.39; p < 0.001) for the spatial model, but when both coefficients were estimated simultaneously in a general model, the lag coefficient was found to be not significant: λ = 0.336 (95% CI: 0.244 to 0.429; p < 0.001); ρ = 0.083 (95% CI: −0.007 to 0.174; p = 0.07). Fig 6 demonstrates that the effect estimates of SAR models are generally smaller than the linear model, i.e., accounting for spatial autocorrelation reduces the magnitude of associations. Focusing on the spatial lag model, the most strongly associated health and socioeconomic indicators are the prevalence of chronic kidney disease (49; 95% CI: 37 to 61; p < 0.001) and proportion of nursing home residents (38; 95% CI: 33 to 43; p < 0.001), respectively, consistent with the univariate analysis. The negative association of COPD seen in the univariate model is also observed with the spatial models. On the other hand, inequality and obesity had significant association in the univariate model, but, after accounting for spatial autocorrelation and in the presence of other indicators, their association tends to be no longer significant (p > 0.05). The Global Moran's test on the residuals of all 3 models found no significant spatial autocorrelation (p > 0.05). Fig 7 shows the spatial lag model's fit and residuals. To test for sensitivity of models' R 2 to the variable pruning method, we additionally subset variables using alternative spearman correlation thresholds of 0.5, 0.65, and 0.85 and built linear and spatial models with each. Fig 8 shows that R 2 was not sensitive to the value of threshold, and the spatial models have a consistently higher R 2 than the linear model.

Discussion
We have built models to estimate COVID-19 mortality rates for given case rates and population health and socioeconomic characteristics. Our results indicate that, together, these indicators can explain 43% of the variability in US county mortality rates, when spatial autocorrelation is accounted for. We found that among health indicators considered, the prevalence of chronic kidney disease, and among socioeconomic indicators, the proportion living in nursing homes have the largest associations with mortality.
The choice and timeliness of control strategies in response to an outbreak do affect its progress and caseload. Our findings here show that differential risks of severe outcomes from COVID-19 across populations can be in part estimated from the structures and contexts in which the outbreak occurs, for example, a population's quality of health, its access to healthcare, and the disparities therein. With the availability of vaccines, these population-level indicators can serve as criteria for prioritizing geographical allocation of vaccines.
These findings may also be relevant to low-and middle-income countries (LMICs). It has been reported that almost all of the Pfizer-BioNTech and Moderna vaccine doses to be manufactured through the end of 2021 have been purchased and are reserved for distribution in the US, Canada, UK, and the European Union [52,53]. Of the 42 countries that have rolled out vaccines by early January 2021, only 6 are middle-income countries, and none are low-income countries [54]. The COVAX initiative with participation from governments of several LMIC countries, WHO, and partner nongovernmental organizations aims to achieve equitable and affordable access to vaccines globally through a common vaccine purchase and allocation framework [55]. When allocation decisions need to span multiple countries, national and subnational socioeconomic indicators and burden of disease estimates can potentially be leveraged to reduce overall risk of severe outcomes from COVID-19 as our findings demonstrate.
This study has a few limitations. Case and death counts were retrieved a week after the end of the study period. Given the lags in data reporting, particularly with deaths, events occurring at the end of the study period may not have been recorded, and the rates used are underestimates. Similarly, the outcomes may not yet be known for cases recorded near the end of the study period. Additionally, the case and mortality rates used in this study are crude rates that do not account for differences in age distribution among county populations. County rates standardized to US national age distribution would be more appropriate, but as age-stratified case and deaths counts at county scales are not publicly available, age standardization has not been possible. However, a supplementary analysis (S1 Text and Figs A-D in S1 Text) using

PLOS MEDICINE
Investigating associations between COVID-19 mortality and population level indicators: A modeling study age-specific rates at the state level as proxy for county rates indicate that the results presented here may be robust except for the 2 indicators directly related to age (proportion of populations in nursing home and proportion 65+ years of age).
Second, the adjacency based spatial weight matrix that was used in this study does not sufficiently capture the spread of COVID-19. Cases that occur in a county are not only correlated with those in counties geographically adjacent to it, but also with counties with which it has strong population mixing; for example, counties with metropolitan centers into which commuters travel from the suburbs or counties with major airports. Spatial weight matrices that capture mobility patterns may be more appropriate and lead to better spatial models. Similarly, methods that can explicitly account for spatial autocorrelation in predictors remain to be explored. Finally, the model structure presented may not be parsimonious in the number of predictors. Although we dropped a third of the predictors initially considered (to correct observed collinearity), model forms with a smaller subset of independent variables may yield near identical R 2 and need to be explored. This is also belied by the lack of significance of some of the predictors included in the spatial models. One approach could start with a minimal set of predictors and incrementally add predictors, while evaluating goodness of the resulting model in each iteration and terminating when the improvement is below a threshold. Similarly, the variable pruning discussed above is ad hoc; the variables included in the model may be interchangeable with those discarded with only marginal change in model performance. Exploration of interaction between indicators and inclusion of significant interactions in the SAR models could be a potential extension to the analysis presented.
Supporting information S1 STROBE Checklist. Checklist for STROBE guidelines for observational studies. STROBE, Strengthening the Reporting of Observational Studies in Epidemiology. (DOC) S1 Text. Supplementary analysis to check for changes in model effect estimates with approximate age-standardized mortality rates. Fig A: Distribution of crude COVID-19 mortality rates (deaths per thousand residents) for each of 8 age groups in US states. Each data point indicates a US state, and the bounded region indicates the distribution. Note that the yaxis is on log 10 scale. Fig B: Scatter plot of crude (y-axis) and age-standardized (x-axis) mortality rates in US states. A state below the diagonal (black dashed line) indicates that the mortality rate increases when standardized. Fig C: Effect estimates (95% CI) with a linear univariate linear model using crude mortality rate (red) and age-stratified mortality rate (in green) as response variables. Labels indicate adjusted R 2 . Inset magnifies select variables of smaller estimates. All estimates are significant (p < 0.05). The slight difference between the effect estimates with crude rates here and Fig 5 of the main text is due to the use of different data sources (The New York Times in the main text and NCHS provisional counts here). Fig  D: Variables estimates of spatial lag models built using crude (orange) and age-standardized (brown) rates as the response variable.