Spatial epidemiological study of the distribution, clustering, and risk factors associated with early COVID-19 mortality in Mexico

COVID-19 is a respiratory disease caused by SARS-CoV-2, which has significantly impacted economic and public healthcare systems worldwide. SARS-CoV-2 is highly lethal in older adults (>65 years old) and in cases with underlying medical conditions, including chronic respiratory diseases, immunosuppression, and cardio-metabolic diseases, including severe obesity, diabetes, and hypertension. The course of the COVID-19 pandemic in Mexico has led to many fatal cases in younger patients attributable to cardio-metabolic conditions. Thus, in the present study, we aimed to perform an early spatial epidemiological analysis for the COVID-19 outbreak in Mexico. Firstly, to evaluate how mortality risk from COVID-19 among tested individuals (MRt) is geographically distributed and secondly, to analyze the association of spatial predictors of MRt across different states in Mexico, controlling for the severity of the disease. Among health-related variables, diabetes and obesity were positively associated with COVID-19 fatality. When analyzing Mexico as a whole, we identified that both the percentages of external and internal migration had positive associations with early COVID-19 mortality risk with external migration having the second-highest positive association. As an indirect measure of urbanicity, population density, and overcrowding in households, the physicians-to-population ratio has the highest positive association with MRt. In contrast, the percentage of individuals in the age group between 10 to 39 years had a negative association with MRt. Geographically, Quintana Roo, Baja California, Chihuahua, and Tabasco (until April 2020) had higher MRt and standardized mortality ratios, suggesting that risks in these states were above what was nationally expected. Additionally, the strength of the association between some spatial predictors and the COVID-19 fatality risk varied by zone.


Introduction
risks. We used statistical methods, including spatial clustering through local indicators of spatial autocorrelation and generalized geographically weighted regression to accomplish these objectives. Additionally, we aimed to identify spatial units or regions in Mexico that should be examined in more detail to better understand the propagation of the disease and its associated fatality risk in the early stages of COVID-19 spread.

Data sources
We obtained state-level variables concerning the 32 states of Mexico (Tables 1 and 2). Individual-level variables were obtained from Mexico's epidemiological surveillance entity (Direccion General de Vigilancia Epidemiologica, Secretaria de Salud), the data set corresponding to observations until April 21 st 2020 [14]. These variables concern health and very general socioeconomic information associated with people who were suspected of having COVID-19 in Mexico (people who by their concerns were suspected of having the disease, presenting symptoms, or who were in contact with someone with COVID-19) and underwent real-time RT-PCR for SARS-CoV-2 confirmation. Available variables include the presence of diabetes, obesity, chronic renal problems (CKD), chronic obstructive pulmonary disease (COPD), pregnancy, hypertension, immunosuppression, cardiovascular disease, pneumonia, as well as age, and whether the patient was hospitalized, admitted to an intensive care unit (ICU), or required intubation. We grouped age into groups, as explained in the model selection process below, and we finally used three age groups for subsequent analyses: 10-39, 40-69, and 70 years old and over. We also computed the risk of death due to the disease on individuals who were tested for SARS-CoV-2, or mortality risk from COVID-19 among tested individuals (MRt), by considering as a death that which was recorded after having a positive test (the information concerning positive tests and death is available in the epidemiological data set), a method consistent with the official numbers. We aggregated all variables at a state level as counts and used them as relative frequencies (risks for the health variables and proportion of individuals for each age group) in all analyses.
Additional variables concerning socio-demographic, economic, mobility, and climatic features were obtained at a state-level. In terms of socio-demographic variables, from the National Institute of Geography and Statistics (INEGI), we extracted information concerning population density in various contexts: density as the number of people per km 2 in 2015 and the proportion of people in a household living in a crowded place in 2017; literacy rate of population aged �15 years in 2015; people settled in rural areas in 2010 (%) (a location was considered as rural when there were <2,500 habitants); and the number of physicians available for every 1000 people in 2015, which was obtained from the National Institute of Public Health (INSP). In terms of economic variables, we obtained from INEGI the state contribution to gross domestic product (GDP) in 2018; however, we used the multiplicative effect between a dummy variable concerning the states with the biggest cities in Mexico and GDP, in which the linearity assumption with the transformed response was improved. We also obtained information concerning people living in poverty in 2018 (%), as it is defined and calculated by CONE-VAL according to a multidimensional index obtained from per capita income and an index of social deprivation [15]. In terms of mobility, we extracted from INEGI information concerning internal migration, as the rate of people aged five years and over living in another state five years before 2014, and external migration, as the rate of people aged �5 years living in another country five years before 2014, both proxies of internal and external mobility, respectively. Concerning mobility, we also used the number of flights in 2019 by state, which we calculated from information associated with the number of flights by the airport in Mexico (Ministry of Communication and Transport) [16]. Finally, information concerning average temperature (˚C) in March 2020 was obtained from the National Council of Water (CONAGUA).
The risk of mortality in tested individuals (MRt) was chosen as the dependent variable based on relevant indicators of COVID-19 epidemiology in Mexico. The number of tests is associated with detection rates. Given the likely under-detection of mild SARS-CoV-2 cases, particularly at the beginning of the outbreak, standardizing deaths by tested cases considers the extent of detection, which could similarly be influenced by structural factors [17]. The remaining variables obtained and calculated from the different data sources were treated as explanatory, except for hospitalization, ICU, and intubation, which we considered as control variables, being an approximate measure of the presence of severe COVID-19 cases and possibly access to services attending COVID-19 in a region.

COVID-19 MRt estimation by state
To determine the mortality distribution throughout the country, we obtained quantile maps associated with raw and smoothed MRts of COVID-19 cases, and, to consider the different age and sex structures in each state the risks were adjusted for these variables using the average of the 32 states population age and sex compositions as the standard [18]. The risks by state adjusted for sex and age were smoothed using an empirical Bayes estimator, a biased estimator that improves variance instability proper of risks estimated in small-sized spatial units [19]. However, the analyses were performed with both raw and smoothed risks to compare results. We also obtained maps concerning the standardized mortality ratio (SMR) adjusted for sex and age, understanding an SMR, as a comparison of the observed number of events by state to a national standard, the latter using the expected number of events considering as if risks in a state were the same as those at a national level. In all standardization processes, the age structure corresponded to the age groups: 0-9,10-39, 40-69, and 70+, which were derived from the process described in the model selection step.

Spatial weight estimation and spatial autocorrelation
To determine how much the MRt is spatially associated, first, we obtained queen contiguity weights [20], which consider as neighbours those states sharing at least a point in common, obtaining a squared matrix of dimension 32 with all entries equal to zero or one, where one indicates that two states are neighbours. From these neighbours, weights are calculated by integrating a matrix in a row-standardized form. Moran's I statistic [21] was obtained as a measure of global spatial autocorrelation, and its significance was assessed through a random permutation inference technique based on simulations. To determine regions with similar mortality behaviours local indicators of spatial autocorrelation (LISA) adjusted for sex and age were obtained [22] and used to derive significant spatial clustering through four cluster types: High-High, Low-Low, High-Low, and Low-High. For instance, the Low-Low cluster indicates states with low values of a variable that are significantly surrounded by regions with similarly low values.

Spatial multivariable linear model
To determine which variables are associated with the MRt and the direction of the association, considering at the same time the spatial nature of the data, a multivariable Generalized Geographically Weighted Regression (GGWR) was fitted [23]. In this model, a dependent variable is measured for each spatial unit. Independent variables (inputs) are simultaneously considered as well, such that the corresponding parameters depend on the coordinates in which the state is spatially located (centroids in the case of polygons); therefore, a parameter is associated with each state and independent variable. To estimate such a model a weighting diagonal matrix is considered; we used Gaussian spatial weighting to generate it. These weights determine the relationship from any state to another in terms of the distance (Euclidean) between the centroids of states and a bandwidth. The bandwidth determines which spatial units are similar under the GGWR and can be selected using automatized methods; we used a cross-validation (CV) method with an adaptive scheme, i.e. a different bandwidth was used for each unit. Since the response variable is a count (number of deaths), a GGWR with a Poisson distribution and logarithm link function was used, including as offset term the number of people tested with COVID-19 in a logarithmic scale, thus modelling the MRt instead of just the number of deaths.
A global multivariable model for all states, a Poisson multivariable linear model (generalized linear model or GLM) with offset and a logarithmic link function, was also fitted, and significant variables were identified [24]. To obtain the best possible model, including variables with the greatest effect on the risks and satisfying as much as possible all statistical assumptions, we used the following selection process: 1) We fitted univariable Poisson models with offset and logarithmic link function, identified significant effects, and ordered them in absolute value from highest to lowest; 2) We identified variables with good and acceptable linear association with the log-transformed mortality risk by obtaining scatter plots between variables and the transformed risk, including a smoothed LOESS (locally estimated scatterplot smoothing) curve; and, when possible, variables were transformed to improve this assumption, as for GDP as explained above; 3) VIF was used to assess multicollinearity; thus, we fitted a model including variables with acceptable and good linear association, and eliminated any variable with a VIF>10; 4) We added to the resulting model one by one significant variables in the univariable models. First, we added the variable with the highest effect and fitted the associated model identifying whether VIF>10. If not, we added the variable, and if VIF>10, the model was not modified. We proceeded with the resulting model repeating the same process with the second-highest effect; and so on; 5) From this process, we ended with a model consisting of all variables with the greatest univariable effects (0.08 and above in absolute value), better linear behaviour, and without multicollinearity (VIF< = 10); 6) Associations between inputs were observed, mainly, between some explanatory variables and the confounders, but we wanted to obtain the pure effect associated with each explanatory variable after controlling for the confounders. Hence, we used residuals associated with three Poisson models in which ICU, intubated, and hospitalization were used as outputs, including a subset of the variables contained in the fatality risk model chosen in 5) as inputs. This process allowed us to obtain the effects each confounder has over the mortality risk eliminating the effects that explanatory variables have over them. These were confounders as in econometrics or experimental design analyses [25]. Thus, the interpretation of the estimated parameters is not of interest, but they are included to avoid spurious effects associated with the other variables we are interested in once controlling for the severity of the disease, 7) We evaluated goodness-of-fit and validated all model assumptions. For age, we obtained age groups: 0-1, 2-9, 10-19, . . ., 80-89, and 90 years old and over, fitted the corresponding univariable Poisson models with offset, and identified significant age groups and the direction of the association to obtain a new set of age groups: 10-39, 40-69, and 70 and more; and proceeded with these variables as with the others. Moran's I and its significance were obtained for each explanatory variable and confounder in the model to determine if each variable's values were spatially correlated.
The GGWR with the same variables as in the GLM was fitted, and multiplicative effects over the MRts (i.e. the exponentiated estimated parameters) associated with each variable were calculated. Maps associated with these effects were obtained, presenting only those associated with the explanatory variables significant in the global model. To compare the GGWR and GLM models, the squared sum of residuals (SSR), the coefficient of determination R 2 , and the Akaike Information Criterion (AIC) were obtained. All statistical analyses were conducted using R version 3.6.2 through the spdep, rgdal, and spgwr packages for spatial analyses and car package for the correlation analysis and GeoDa 1.14.0 were also used for some spatial analyses. The significance level for all analyses was 5% (i.e., alpha = 0.05).

MRt description and spatial autocorrelation of COVID-19 MRt between states
Maps for quartiles corresponding to the raw and smoothed MRts are similar, except for six states, Durango, Nayarit, and Zacatecas, which from a category using the raw risks move to the next higher quartile in the smoothed risks, the opposite occurring for Coahuila, Mexico, and Veracruz. Only the map concerning the smoothed values is shown (Fig 1A). We observed the largest MRts (raw and smoothed risks above 3.5% and 2.5%, respectively) and the greatest SMR (2.00-4.00) in Baja California, Chihuahua, Quintana Roo, and Tabasco ( Fig 1B). Globally, there is a non-significant spatial autocorrelation (Moran's I = -0.054, p = 0.390); however, there is a noticeable Low-Low cluster in the Northeast around San Luis Potosi, two Low-High clusters around Yucatan and Sonora, and one High-Low cluster around Colima (Fig 2). These Low-High clusters make sense since the associated states, especially Sonora, are surrounded by the states with the highest MRts in all the country. We verified that the spatial autocorrelation value and clustering were precisely the same using both the raw values or those obtained with the Bayes spatial technique.

Fit of multivariable generalized global and linear geographical models for mortality risk from COVID-19 among tested individuals
We found the presence of serious multicollinearity problems through a preliminary analysis obtained by fitting a Poisson multivariable linear model with offset and including all variables since we obtained Variance Inflation Factors (VIF) with values above 50 for some variables. Additionally, a correlation analysis between all input variables was performed (Fig 3A), identifying very correlated variables. Thus, we followed the model selection process, as explained above. Our final global model included as inputs: diabetes (%), obesity (%), GDP, internal and external migration (%), age group of 10 to 39 years (%), physicians-to-population ratio, cardiovascular disease (%), ICU (%), hospitalization (%), and intubated (%). The latter three variables are confounders and, as discussed in the Methods section, to eliminate effects of other variables on them, they are used as residuals associated with appropriate Poisson models with  independent variables: diabetes (%), obesity (%), age (%), physicians-to-population ratio, cardiovascular disease (%), plus ICU (%) for the model associated with the intubated variable. Inputs with a significant spatial autocorrelation were the percentages associated with external migration (I = 0.214, p-value = 0.027), age group of 10 to 39 years (I = 0.2, p-value = 0.037), intubated (I = 0.245, p-value = 0.014), and cardiovascular disease (I = 0.359, p-value < 0.001). We found that all variables were jointly significant (LR = 472.19, p-value<0.001). Additionally, through a PP-plot associated with the standardized Pearson residuals and associated Anderson-Darling test (A = 0.260, p-value = 0.689) and scatter plots between the fitted values and standardized residuals, and a similar plot using the root of the residuals instead, we determined that the link function and the way the explanatory variables are related with the response seems correct and there is a good fit. However, we found some overdispersion according to the Chi-squared statistic divided by its degrees of freedom of 1.972, but no significant overdispersion according to an analysis fitting a negative binomial model (p-value = 0.5). There were not any significant pairwise interaction effects between explanatory variables: those between diabetes and all health-related variables, physicians-to-population ratio, and age; those between obesity and all health-related variables; those between GDP and all non-health-related variables; those between migration (internal and external) and non-health-related variables; those between age and all variables and similarly for the physicians-to-population ratio.
A multivariable GGWR with a Poisson distribution, adaptive kernel, and the same input variables was also fitted. The R 2 was greater in the GGWR (0.871 vs 0.860) than in the GLM, and the SSR and AIC were both lower (3.35 vs 3.50 in the logarithmic scale and 287.15 vs 308.42, respectively). In Table 3, we summarise the multiplicative effects over the risks, exponentiated parameters, i.e. minimum, quartiles, and maximum, associated with each variable for all states in the GGWR and the global values corresponding to the GLM. Pearson residuals associated with the GGWR are shown in Fig 3B, showing that the worst fit corresponded to two states: Veracruz and Yucatan.

Predictors of COVID-19 spatial mortality in Mexico
The exponentiated estimated parameters under the GGWR for each state were obtained for each variable (not shown), and maps were obtained based on the.shp file provided in [26]. Nevertheless, we only present the maps for those explanatory variables significantly associated with the log-transformed MRt in the global model (GLM) (Fig 4). These significant variables were diabetes (%), obesity (%), GDP, internal and external migration (%), age group of 10 to 39 years (%), and physicians-to-population ratio. However, care should be taken when the map for GDP is interpreted considering this variable corresponds to GDP on states not having the biggest cities, as an interaction term between GDP and a binary variable, thus having a fixed value of zero in four states, which are represented as blank spaces in the map. All estimated terms were interpreted by considering fixed values for all variables except the one being interpreted.  Yucatan peninsula. On the other hand, the proportion of individuals between 10 and 39 years old has a significant negative association with the COVID-19 MRts (0.907; 95%CI 0.873-0.944), locally the effect is between 0.907 in Chiapas to 0.910 in Chihuahua, thus having a similar association in all the country.

Mobility and socio-economic predictors of MRt in Mexico
The percentage of internal migration in the spatial unit a patient comes from has a significant positive association with the COVID-

Discussion
In the present study, we demonstrate the relevance of spatial statistical analyses to understand how MRts are distributed throughout Mexico, the presence of spatial clusters related to COVID-19 MRts, the type of associations (negative or positive) with some independent variables and how these associations varied among states. Through our analysis, we were able to identify spatial units or regions in which care could have been considered in terms of mortality due to SARS-CoV-2. Identifying such areas could allow us to understand how these risks were distributed in the early stages of the outbreak.
Considering the global results, variables significantly associated with an increase in the COVID-19 MRt include variables associated with diabetes, obesity, external and internal migration, physicians-to-population ratio, and GDP in states that do not include one of the four largest cities in Mexico. Meanwhile, the proportion of individuals between 10 to 39 years old is significantly associated with a decrease in the MRt of COVID-19, which agrees with previous results analyzed using Mexican data. Cardio-metabolic diseases are associated with adverse COVID-19 outcomes and worst prognosis, probably since they are linked to chronic inflammation, which may synergize with the cytokine storm [27].
Regarding internal and external migration (%), these variables have a solid association with COVID-19 mortality among tested individuals, being external migration the variable with the second-highest positive association. This finding could be related to infectious diseases incidence given that greater movements of people transport pathogens from a geographical region to another, such as for Mycobacterium tuberculosis and/or HIV outbreaks [28][29][30]. Another important variable is the physicians-to-population ratio, which is related to urbanization, population density, and poverty. In this context, from the correlation plot, it can be seen that there is a large positive association between physicians-to-population ratio, population density, and overcrowded households (%), and a large negative association between this variable and both the rural and poverty proportions by state.
On the other hand, the positive association between GDP in states that do not include one of the four largest cities and the MRt might be related to economic activities of these states that prevent adequate self-isolation measures and, unlike the states with the largest cities, do not have the health infrastructure to prevent death among clinical cases.
The greatest MRts, both raw and smoothed, corresponded to the states of Chihuahua, Quintana Roo, Tabasco, and Baja California (raw values of 4.84%, 4.55%, 4.27%, and 3.96%, respectively, observations until April 21 st , 2020). It is noticeable that SMRs for such states are between 2 and 3, suggesting that the risks are above what is nationally expected in these states. We observed a spatial cluster concerning states with low risks; however, at least until the period of our study, there was no significant clustering of states with high risks using the LISA technique.
The fact that Quintana Roo is an important touristic centre might explain the higher number of COVID-19 cases. However, according to our models, the elevated MRt in this state is most strongly associated with the levels of diabetes, internal migration, and the physicians-topopulation ratio, whose possible interpretation was discussed above. In Chihuahua and Baja California, the variables with a particularly strong positive association with MRt were external migration (%), obesity (%), and GDP. Meanwhile, in Tabasco, these variables corresponded to diabetes (%) and internal migration (%).
In terms of local effects, the physicians-to-population ratio (heavily associated with urbanicity, overcrowding households, population density, and less poverty) is the variable with the highest positive association with the MRt, but it has a relatively similar strength of association across all of Mexico. External migration (%) has the second-highest association with the MRts, particularly in those states on the north side of the country, in which the risks were the highest. In contrast, internal migration has the fourth-highest association, particularly in the Centre, South, and Yucatan peninsula. The association of the percentages of obesity and diabetes with MRts is similar for both; for obesity, there is relatively a similar association in all the country. Nevertheless, for diabetes, there is a slightly higher association in the Centre, South, and Yucatan peninsula. The associations have also been reported in other studies in other geographical regions [31][32][33], and although it is still under study, they may be associated with inflammatory stages [34]. Finally, the proportion of individuals between 10 and 39 years old has a relatively similar effect throughout Mexico.
Economic effects, such as poverty, might be better studied in a disaggregated model, including it as a measure at the individual level. Unfortunately, such information is unavailable in the epidemiological data set, and at most, information concerning poverty at a municipality level could be attached to each individual, since the state is the spatial unit containing municipalities. However, we would still be using aggregated values, and in the methodology used to calculate poverty in Mexico, the state values are estimators obtained from a representative sample, whereas the municipality values are estimated through small area estimation techniques; thus, making the former more reliable. In this sense, all analyses were performed at a state instead of at a municipal level. The reason behind this decision is that there were many municipalities with zero values in the early stages of the pandemic, which is the focus of our analysis, and this would have resulted in many issues concerning model fit and convergence considering the probability distributions available for GGWR and other spatial linear models. To obtain similar results, we would have required different tools, such as zero-inflated geographical models, that are not currently available. Future studies could focus on the characteristics of municipalities with zero cases and mortalities, and focus on spatio-temporal analyses of these COVID-19 data.
Our results are robust in terms of the models fitted since most of the statistical assumptions were satisfied; and, though numerically there could be some overdispersion, after fitting a quasi-Poisson and a negative binomial linear model, we obtained similar results and no significant overdispersion according to the latter model. It is essential to notice that we are studying fatality risks associated with those individuals tested for the disease (MRt); thus, care should be taken if results want to be extrapolated. However, using the projected population in 2020 [35], instead of the tested individuals to model the mortality risks, we obtained similar results. We think that an analysis over the population is somewhat inaccurate because all health-related variables correspond to prevalence in individuals in the data set, which do not necessarily agree with those in the population, the same issue was present for age distribution. If we used population values by state, we would waste all the epidemiological data, except for the number of deaths. Additionally, in the analysis, the number of infected and/or deaths is even more poorly estimated when analyzing the early spread of the disease since, as in many countries, there was a highly selected nature of early tests.
It is essential to mention that a limitation in our results is that the associations should not be extrapolated to lower aggregation levels or individuals due to the potential for ecological bias [36]. Despite these limitations, we were able to identify some spatial predictors of mortality risks associated with COVID-19 at an early stage of the pandemic.
In conclusion, metabolic diseases (%), internal and external migration (%), physicians-topopulation ratio, GDP per capita in states without the biggest cities, and the proportion of individuals in the age group between 10 to 39 years old were significantly associated with early COVID-19 mortality risks in Mexico. These predictors likely influence the growth of the pandemic moving forward, but variables, such as the prevalence of metabolic diseases, cannot be easily modified in the short-term. However, identifying variables in Mexico associated with the risks and in specific geographical areas could help in the identification of public policies that could limit the impact of future epidemics. Even though our study focused on the early stages of SARS-CoV-2 spread, its results allow us to understand how the pandemic evolved within Mexico and the possible measures that should be taken to control additional waves of COVID-19 or similar diseases in Mexico and specific zones of the country.