Spatial analysis of COVID-19 spread in Iran: Insights into geographical and structural transmission determinants at a province level

The Islamic Republic of Iran reported its first COVID-19 cases by 19th February 2020, since then it has become one of the most affected countries, with more than 73,000 cases and 4,585 deaths to this date. Spatial modeling could be used to approach an understanding of structural and sociodemographic factors that have impacted COVID-19 spread at a province-level in Iran. Therefore, in the present paper, we developed a spatial statistical approach to describe how COVID-19 cases are spatially distributed and to identify significant spatial clusters of cases and how socioeconomic and climatic features of Iranian provinces might predict the number of cases. The analyses are applied to cumulative cases of the disease from February 19th to March 18th. They correspond to obtaining maps associated with quartiles for rates of COVID-19 cases smoothed through a Bayesian technique and relative risks, the calculation of global (Moran’s I) and local indicators of spatial autocorrelation (LISA), both univariate and bivariate, to derive significant clustering, and the fit of a multivariate spatial lag model considering a set of variables potentially affecting the presence of the disease. We identified a cluster of provinces with significantly higher rates of COVID-19 cases around Tehran (p-value< 0.05), indicating that the COVID-19 spread within Iran was spatially correlated. Urbanized, highly connected provinces with older population structures and higher average temperatures were the most susceptible to present a higher number of COVID-19 cases (p-value < 0.05). Interestingly, literacy is a factor that is associated with a decrease in the number of cases (p-value < 0.05), which might be directly related to health literacy and compliance with public health measures. These features indicate that social distancing, protecting older adults, and vulnerable populations, as well as promoting health literacy, might be useful to reduce SARS-CoV-2 spread in Iran. One limitation of our analysis is that the most updated information we found concerning socioeconomic and climatic features is not for 2020, or even for a same year, so that the obtained associations should be interpreted with caution. Our approach could be applied to model COVID-19 outbreaks in other countries with similar characteristics or in case of an upturn in COVID-19 within Iran.

Introduction On 11 th March 2020, the General Director of the World Health Organization (WHO), Dr. Tedros Adhanom Ghebreyesus, declared the new infectious respiratory disease COVID-19, caused by the infection of novel coronavirus SARS-CoV-2 as a pandemic, due to the rate of growth of new cases, the number of affected people, and the number of deaths [1]. As of the time of this writing (April 15 th , 2020), the number of infected cases world-wide corresponded to more than 1 million, being the most affected countries: Italy (16,523 deaths), Spain (13,341 deaths), USA (10,792 deaths), France (8,911 deaths), United Kingdom (5,373 deaths), and Iran (3,739 deaths) [2,3].
Iran was among the first countries outside of China to report a rapid increase in the number of COVID-19 cases and associated deaths; its first confirmed cases were reported on 19th February 2020 in the province of Qom imported from Wuhan, China [4]. Nevertheless, some reports suggest that the outbreak may have happened two or six weeks before the government official announcement [5]. Iran had one of the highest COVID-19 mortality rates early in the pandemic, and its rate of spread has been amongst the highest. However, as with other countries, it may be a sub-estimation of cases, and there may be other cases not officially reported [6].
The large count of COVID-19 cases and mortality in Iran are multifactorial. Iran's response to the epidemic has been highly affected by several imposed economic sanctions and armed conflicts within the last 20 years. Moreover, its difficult economic situation due to a recession, having inflation rates that are among the highest in the region, has taken a toll on its public health system [7,8]. Although there are approximately 184,000 hospitals and primary healthcare staff, limitations in the availability of COVID-19 testing kits, protective equipment, and ventilators are quite important. On the other hand, over the last years, Iran has slowed the rate of mortality associated with infectious and maternal diseases. It is currently undergoing an epidemiological transition where infectious diseases interact with chronic conditions [2]. In this sense, Iran may represent other similar developing world countries with poor health systems and an increased prevalence of chronic diseases.
Spatial statistics have emerged as a useful tool for the analysis of spatial epidemiology, concerning mapping and statistical analyses of spatial and spatio-temporal incidences of different pathogens. The aim of this paper is to perform spatial analyses which allow us to better understand the COVID-19 outbreak in Iran, not only in terms of the strength of its presence and socioeconomic and structural factors which facilitate the disease spread within Iranian provinces, but also in terms of how the disease is spatially distributed and which variables are spatially related with it considering the spatial effect to obtain adequate inferences. Given the role of climate and socio-economic factors in determining the distribution of cases and its impact world-wide, we also aimed to incorporate said factors as predictors of SARS-CoV-2 spread [9,10]. This could aid to understand the burden of COVID-19, its distribution in the country, and its implication on public health within Iran and similar countries [11] and could contribute to public health measures by providing insight to inform the implementation of interventions or to understand socio-demographic factors associated with the SARS-CoV-2 spread and COVID-19 heterogeneity as it has been applied to previous infectious diseases [12][13][14][15][16].

Data sources
We obtained province-specific data considering 31 provinces or polygons in Iran (Table 1). From the Statistical Centre of Iran [17], we extracted information concerning: 1) people settled in urban areas in 2016 (%), calculated from the population and household of Iran by province and sub-province information of the census, 2) people aged �60 years in 2016 calculated from the population disaggregated by age groups, sex, and province information of the census, 3) population density (people per km 2 ) in 2016, 4) literacy rate of population aged �6 years in 2016, obtained from the document of selected results from the 2016 census, 5) the Consumer Price Index percent changes on March 2020 for the national households in contrast to the corresponding month of the previous year (point-to-point inflation), and 6) the average temperature (˚C) of provincial capitals and 7) annual precipitation levels (mm) in 2015, both part of the climate and environment information. From the Iran data portal [18], we obtained, from the health section: 1) the number of physicians employed by the ministry of health and medical education in 2006 and 2) the number of beds in operating medical establishments in 2006 and from the government finance section: 3) the province contribution to gross domestic product (GDP) in 2004. We used a Transportation Efficiency Index (TEI) [19], constructed through Data Evelopment Analysis, being an indicator of the extent in which each province efficiently utilize their transportation infrastructure. The TEI has values between zero and one, values near to one indicate provinces better communicated, but we standardized it (values of each province minus its mean divided by the associated standard deviation) to allow interpretations in a better scale in terms of how the increase in a certain number of standard deviations of the TEI is associated with the number of COVID-19 cases. The cumulative number of cases with confirmed COVID-19 by Province from February 19 th to March 18 th , 2020, was also obtained [20]. It is important to notice that, in order to obtain more accurate rates of cases with COVID-19, population size in 2020 by province was derived by using mathematical projection methods (arithmetic, geometric, exponential, and logistic or saturation methods) using information contained in the population and housing censuses from 2006, 2011, and 2016. Since all methods provided similar results, we show here only those associated with the arithmetic method. Shapefiles were obtained from the Stanford Libraries Earthworks: https://earthworks. stanford.edu/catalog/stanford-dv126wm3595, in which files are freely available for academic use and other non-commercial use [21].

COVID-19 rate estimation by Iranian provinces
We obtained quantile maps associated with raw rates of COVID-19 cases, as well as smoothed case rates by province using an empirical Bayes estimator, which is a biased estimator that improves variance instability proper of rates estimated in small-sized spatial units [22] (i.e. provinces with a larger population size have lower variance than provinces with a smaller population size). Since raw and smoothed rates were surprisingly similar, only results of smoothed rates are reported. We also obtained maps concerning excess or relative risk, serving as a comparison of the observed number of cases by the province to a national standard. For the variable concerning the number of people aged �60 years by province, raw, and smoothed rates were obtained using empirical Bayes, and the latter were used in all analyses.

Spatial weight estimation and spatial autocorrelation
Since all spatial analyses require spatial weights, we obtained queen contiguity weights [23]. Provinces were considered as neighbors when they share at least a point or vertex in common, obtaining a squared matrix of dimension 31 (31x31 matrix) with all entries equal to zero or one, the latter value indicating that two provinces are neighbors. From these neighbors, weights are calculated by integrating a matrix in a row-standardized form, i.e., equal weights for each neighbor and summing one for each row. Moran's I statistic was obtained as an indicator of global spatial autocorrelation [24], and its significance was assessed through a random permutation inference technique based on randomly permuting the observed values over the spatial units [25]. Local indicators of spatial autocorrelation (LISA) were obtained, being these a decomposition of Moran's I used to identify the contribution of each province in the statistic [26]. LISA was used to derive significant spatial clustering through four cluster types: High-High, Low-Low, High-Low, and Low-High. For instance, the High-High cluster indicates provinces with high values of a variable that are significantly surrounded by regions with similarly high values. Analogous to Moran's I and LISA, estimates can be calculated to identify the spatial correlation between two variables and to identify bivariate clustering [27]. For instance, to identify provinces with high values in a first variable surrounded by provinces with high values for a second variable (cluster High-High). Bivariate clustering and quartile maps were obtained for each of the significant variables in a linear spatial model, to have a better understanding of the individual spatial effect of each of these variables over the smoothed rates associated with the disease.

Spatial multivariate linear models
Spatial multivariate linear models were fitted to identify variables that significantly impact the number of log-transformed COVID-19 cases [28]. This response variable was chosen since the corresponding model better satisfies all statistical assumptions, the other variables introduced in the Data section were simultaneously introduced as explanatory, first removing from the model all variables generating multicollinearity. Ordinary Least Squares (OLS) estimation was used to identify whether a linear spatial model was necessary by using a Lagrange Multiplier (LM) and a robust LM statistics to compare the non-spatial model with spatial models [29]. Two kinds of spatial models were compared; the spatial-lag model considers the spatially lagged response as an additional explanatory variable, whereas the spatial-error model considers that the error is a linear function of a spatially lagged error plus another error term. Another model was obtained by performing a backward selection process, considering the elimination of the most non-significant variable in each step and the minimization of the Akaike Information Criterion (AIC). This process allowed us to identify whether the associations obtained through this model were similar as those obtained through the model including all variables. For significant variables in the linear spatial models, interpretations in the original scale (i.e., as counts) were derived and we performed bivariate LISA significant clustering between each of these significant variables and the rate of cases with COVID-19, as explained above. All statistical analyses were conducted using GeoDa version 1.14.0. A two-tailed p-value<0.05 was considered as the significance threshold.

Rates description and spatial autocorrelation of COVID-19 case rates between provinces
Maps for quartiles corresponding to the smoothed rates and excess risk of COVID-19 cases are shown in Fig 1. We observed that the highest rates of COVID-19 and excess risk values were located in the Northern region of Iran corresponding to the provinces of Qom, Marzaki, Mazandaran, and Semnan. There were also high rates associated with the provinces of Alborz, Gilan, Qazvin, and Yazd (last quartile). We observed significant spatial autocorrelation (Moran's I = 0.426, p = 0.002), indicating that COVID-19 rates between provinces are significantly spatially related. From the heat and significance maps corresponding to significant clusters using an empirical Bayes spatial technique, we delimited a High-High cluster in red, indicating a northern zone around Tehran and Alborz with significant high COVID-19 rates surrounded by areas with similarly high rates. Conversely, we delimited a Low-Low cluster in blue indicating southern provinces with small rates surrounded by areas with similarly lower rates, which includes the provinces of Bushehr, Homozgan, Sistan, and Baluschestan. Interestingly, Golestan showed in light purple, has significantly lower COVID-19 rates despite being surrounded by a cluster of provinces with higher rates (Fig 2).

Selection of multivariate linear spatial model for COVID-19 spread
Since the variable hospital beds is strongly associated with variables GDP and number of physicians (Kendall correlation coefficients above 0.55), we eliminated it from all models to avoid multicollinearity (S1 Fig). We confirmed that a spatial model was necessary since the error term from the OLS fitting showed significant spatial autocorrelation (Moran's I = 0.134, pvalue = 0.025). Additionally, the LM and Robust LM statistics indicated that a spatial lag model was required since the spatial parameter (ρ) associated with the spatially lagged response was significant (LM = 10.669, p-value = 0.001; Robust LM = 13.557, pvalue < 0.001), which did not occur with the spatial error model since the corresponding spatial parameter was not consistently significant (LM = 1.256, p-value = 0.262; Robust LM = 4.144, p-value = 0.042). Thus, only the spatial lag model was fitted obtaining a significant spatial parameter (ρ = 0.723, p-value<0.001), which indicated that the rate of an area in the linear model is affected by COVID-19 rates in neighboring areas (R 2 = 0.877, σ 2 = 0.146). Normality and homoscedasticity assumptions were reasonably satisfied.

Predictors of COVID-19 spatial spread in Iranian provinces
The significant variables associated with the model obtained through the selection scheme were the same as those associated with the model including all variables. This simplified model excluded population density, Consumer Price Index, and annual precipitation. The estimated coefficients were similar for both models; however, we analyzed the estimations associated with the model including all variables to consider effects controlled for these three variables.

Spatial lag predictors and province clusters
Quartile maps for each of the significant variables in the spatial lag model are shown in Fig 3. Finally, concerning bivariate LISA significant clustering (Fig 4), we observed a positive spatial relationship (Moran's I = 0.341, p-value = 0.002) between urban population and COVID-19 rates; provinces with high urban rates surrounded by areas with high COVID-19 rates are the same as the ones in the High-High cluster for COVID-19, except for Mazandaran, and similarly for the Low-Low cluster, except for Bushehr. There is also a positive spatial relationship Both High-High and Low-Low clusters include similar provinces as the ones in the clusters for COVID-19, except for Qom and Alborz, which have significantly lower rates of people aged �60 years but are spatially surrounded by areas with high disease rates. Concerning literacy rates, we also identified a positive spatial relationship between literacy and disease rates (Moran's I = 0.362, p-value = 0.005). The associated High-High cluster and that obtained for COVID-19 rates are formed by the same provinces, whereas in the south, Hormozgan and Bushehr have high literacy rates but are surrounded by areas with low disease rates. Concerning average temperature levels, the global spatial autocorrelation is negative (Moran's I = -0.107, p-value = 0.103). The High-High clusters for temperature and COVID-19 rates are similar, except for Marzaki and Alborz, where there is significantly lower temperature surrounded by areas with high COVID-19 rates. In the south, there is a significantly high temperature with spatially lower disease rates.
There is a positive spatial relationship (Moran's I = 0.302, p-value = 0.003) between the number of physicians and the COVID-19 rate. There is a High-High cluster in the north with a High-Low zone between formed by Marzaki, Qom, and Semnan, with a significantly lower number of physicians; however, they are spatially surrounded by areas with higher disease rates. Concerning the TEI, the spatial correlation is close to zero (Moran's I = -0.096, pvalue = 0.112), indicating a particular random global spatial relationship between TEI and the disease rate. The High-High cluster is the same as the High-High cluster for the disease, except for Mazandaran and Alborz, which have significantly low TEI but are surrounded by areas Table 2 with high disease rates. In the south, two provinces, which formed a Low-Low cluster for COVID-19 cases, are now areas with high TEI spatially associated with areas with low disease rates.

Discussion
Here, we demonstrate that the rates of COVID-19 cases within Iranian provinces are spatially correlated. This could be due to the origin of the outbreak, which started on the north of Iran, and can be seen through an important province cluster with the highest number of COVID-19 cases that we found around Tehran and Qom. Several mathematical models have been used to model the COVID-19 outbreak, mostly focused on forecasting the number of cases and assessing the capacity of country-level healthcare systems to manage disease burden [30][31][32]. In the present report, we demonstrate that the spatial relationship and socio-demographic factors associated with the provinces must be considered to model the disease adequately, and this report also highlights structural factors that may lead to inequities in COVID-19 spread. Of relevance, we highlight the role of social determinants of health in sustaining SARS-CoV-2 transmission and provide additional evidence that human mobility or province interconnectedness might be associated in favoring disease spread [33]. Importantly, our approach demonstrates that urbanization, aging population, education, average temperatures, number of physicians, and inter-province communications are associated with the case numbers amongst Iranian provinces. The obtained results do not consider the spatial effect, which is accumulated since the spatially lagged response is part of the explanatory variables, and they consider fixed values for all variables except the one being interpreted. Overall, these variables spatially correlate with the COVID-19 province clustering indicating a consistent association with the observed variables. The greatest increase in the number of COVID-19 cases is associated with people aged �60 years, urban population, and how well the provinces are communicated, with age having one of the most important associations, an increase of 1% in the corresponding rate implies a percentage increase of 46.65% over the number of cases. Of relevance, mortality attributable to COVID-19 complications is higher in this age group, and age increases the likelihood of developing the symptomatic disease and increased disease severity [34,35]. Nevertheless, the association with older age could have different meanings depending on the number of comorbidities, with some reports labeling COVID-19 as an age-related disease [36]. Our data demonstrate that the spatial spread of COVID-19 has a relationship with population aging structures, a concept that must be explored in this setting to obtain population-specific estimates and lethality and which could represent a significant structural inequality related to COVID-19 burden [37].
Urbanization rates also are associated with a percentage increase over the number of COVID-19 cases; we observed a similar association regarding province interconnectedness, which goes in line with recent information on human mobility and its effect in decreasing disease spread through social distancing [33]. Urbanization, as a demographic phenomenon, leads to increased interconnectedness and human mobility as well as increased population density; these two factors facilitate disease spread. Emerging zoonotic diseases similar to SARS-CoV-2 have been linked to major structural factors that have been reported in other studies, including population growth, climate change, urbanization, and pollution [38,39]. Thus, communication and the degree of urbanity, and what this implies in terms of pollution, overcrowding, among other factors, seem to be relevant to determine the number of COVID-19 cases and should imply geographical targets for public health interventions to monitor disease spread and disease containment [40].
The only effect associated with a decrease in the number of COVID19 cases in our study was attributed to literacy, which might reflect several factors that ultimately influence disease spread. Data from several countries, including Iran, identified that higher health literacy was associated with a lower number of COVID-19 cases, probably reflecting attitudes towards public health measures including social distancing, early disease detection, and hand hygiene [41,42]. Interestingly, this poses a potential public health intervention given that individuals with reduced health literacy, not only might have higher rates of COVID-19, but also increased likelihood for depression and impaired quality of life in suspected cases. Literacy's protective effect on disease spread also indicates a strong influence on social inequity and vulnerability as risk factors for COVID-19 spread, particularly on the influence of health equity, which will likely define the long-term impact of COVID-19 in many developing countries [43].
Concerning average temperature levels, we were able to obtain information associated only with the capitals and not the provinces, being a limitation of the analyzed information, obtaining some inconclusive results. On one hand, the global spatial autocorrelation was negative, though not statistically significant, indicating that global areas with higher temperatures are spatially related to areas with lower disease rates. On the other hand, on the spatial linear model, we derived that more temperature is associated with more cases. However; the former result does not contradict the latter since the direct effect in each province of a variable over the response is different from the spatial relationship between two variables. The latter considers one of the variables as spatially lagged (COVID- 19), and thus the direct effect between variables in the same province is not included. In fact, this problem occurs in all the bivariate analysis, so care should be taken in all the interpretations. Notably, our results are consistent with previous analyses which have analyzed the impact of climate on SARS-CoV-2 stability and spread [44]. However, these results should be further studied considering the climate data limitations, that we obtained mixed results, and that some studies suggest there is no evidence that spread rates of the disease decline with higher temperatures [45].
Our study had some strengths and limitations. We approached COVID-19 using spatial analysis, which allowed us to identify province-level factors that are associated with the disease spread and which may be shared by other countries with similar socioeconomic or geographic structures by potentially identifying targets for country-wide public health interventions. This approach considers disease spread beyond individual-specific factors and could also be used to monitor areas of a potentially high number of undiagnosed cases that could facilitate disease spread and the surge of delayed waves of COVID-19 after initial mitigation [46,47]. Methodologically, all our analyses consider the spatial nature of the data. We identified significant spatial clustering and in terms of the spatial multivariate linear model, by including a spatial effect, we consider that the number of cases in an area is affected by those in neighboring areas. In this way, a lack of independency between spatial units is considered, being independence assumed in a usual linear model, thus obtaining more precise estimations. Of course, other statistical methods are available for this task, as generalized linear mixed models or geographically weighted regression; however, they do not use spatial weights, making our results more comparable with the Moran's I or spatial clustering, which are based on such weights. A limitation of our approach is that most of the variables used to explain COVID-19 disease rates were taken from previous years and not updates, given the unavailability of recent estimates. Furthermore, smoothed COVID-19 rates were calculated using a projection of the population in 2020 since the most recent census corresponds to 2016, thus rates could have slightly different values. In this sense, the explanatory variables were not projected since information of previous years was not always available; however, precise projections for each variable were out of the scope of this work; and, it is also possible that some variables have a time lagged effect over the response. However; the time lagged effect we included was unintentional and dependent on the information available and not considering a lagged time effect as defined by experts; for instance, for GDP we used a time lag of 16 years, when perhaps it should have been of fewer years. When obtaining estimates using both projected and population size in 2016, we observed no significant changes in the results, which confirms the robustness of our approach. In fact, with all the mathematical projection methods similar results were obtained. However, the projections by province could be improved by considering a demographic balance equation and probabilistic projection methods as the ones obtained by country by the UN [48]. In this sense, we suspect similar results would still be obtained since our projected values by country are similar to those obtained by the UN. We also observed that the smoothed and raw rates of COVID-19 cases were similar, with an absolute difference between them of at most 0.607 (considering rates for every 1000 individuals), this was probably due to Iran not having provinces with extremely small or large population size. Future work could be focused on evaluating spatio-temporal modeling, which could be useful to monitor disease spread and identify additional factors relating not only to transmission rates but also to transmission dynamics. Since COVID-19 is currently challenging health systems all over the world, science-centered public health decisions could benefit from spatial modeling to investigate larger factors targeted for public health interventions.
In conclusion, COVID-19 spread within Iranian provinces is spatially correlated. The main factors associated with a high number of cases are older age, high degrees of urbanization, province interconnectedness, higher average temperatures, lower literacy rates, and the number of physicians. Structural determinants for the spread of emerging zoonotic diseases, including SARS-CoV-2, must be understood in order to implement evidence-based regional public health policies aimed at improving mitigation policies and diminishing the likelihood of disease re-emergence.