Social distancing, a non-pharmaceutical tactic aimed at reducing the spread of COVID-19, can arise because individuals voluntarily distance from others to avoid contracting the disease. Alternatively, it can arise because of jurisdictional restrictions imposed by local authorities. We run reduced form models of social distancing as a function of county-level exogenous demographic variables and jurisdictional fixed effects for 49 states to assess the relative contributions of demographic and jurisdictional effects in explaining social distancing behavior. To allow for possible spatial aspects of a contagious disease, we also model the spillovers associated with demographic variables in surrounding counties as well as allow for disturbances that depend upon those in surrounding counties. We run our models weekly and examine the evolution of the estimated coefficients over time since the onset of the COVID-19 pandemic in the United States. These estimated coefficients express the revealed preferences of individuals who were able to and chose to stay at home to avoid the disease. Stay-at-home behavior measured using cell phone tracking data exhibits considerable cross-sectional variation, increasing over nine-fold from the end of January 2020 to the end of March 2020, and then decreasing by about 50% through mid-June 2020. Our estimation results show that demographic exogenous variables explain substantially more of this variation than predictions from jurisdictional fixed effects. Moreover, the explanations from demographic exogenous variables and jurisdictional fixed effects show an evolving correlation over the sample period, initially partially offsetting, and eventually reinforcing each other. Furthermore, the predicted social distance from demographic exogenous variables shows substantial spatial autoregressive dependence, indicating clustering in social distancing behavior. 
The increased variance of stay-at-home behavior coupled with the high level of spatial dependence can result in relatively intense hotspots and coldspots of social distance, which has implications for disease spread and mitigation.
Citation: Narayanan RP, Nordlund J, Pace RK, Ratnadiwakara D (2020) Demographic, jurisdictional, and spatial effects on social distancing in the United States during the COVID-19 pandemic. PLoS ONE 15(9): e0239572. https://doi.org/10.1371/journal.pone.0239572
Editor: Jaymie Meliker, Stony Brook University, Graduate Program in Public Health, UNITED STATES
Received: May 25, 2020; Accepted: September 10, 2020; Published: September 22, 2020
Copyright: © 2020 Narayanan et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: Our social distancing data is provided by SafeGraph. Their data platform is free for use for academic research and is available to any researcher who requests access. SafeGraph provides access only to de-identified data, aggregating individual behavior to the level of a Census block group (CBG). Social distancing at the CBG level is what is available to researchers for free. Per our contract with SafeGraph, we are not legally able to provide a copy of this dataset with our paper. The streamlined portal for researchers to request access to the data is available here: https://www.safegraph.com/covid-19-data-consortium. All other data used is publicly available from the U.S. Census.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
The transmission of an infectious disease such as COVID-19 depends on its basic reproduction number R0, the expected number of cases directly generated from contact with an infected person. Social distancing, a non-pharmaceutical tactic, can reduce R0 by limiting contact with infected persons. Social distancing may arise naturally if individuals prefer staying at home to avoid infection and have the capacity to stay at home. Or it can arise by governmental fiat as a non-pharmaceutical intervention to curb the spread of the disease through restrictions on individual activities over a jurisdiction. In this paper, we examine how these factors explain social distancing using U.S. cell phone tracking data. We also examine whether individual behaviors differ across jurisdictions or exhibit other spatial patterns.
Modeling social distancing requires confronting two issues. First, social distancing depends upon the prevalence of COVID-19, and the spread of COVID-19 depends upon social distancing. Therefore, social distancing and the spread of COVID-19 are simultaneously determined, which complicates estimation of their individual effects [1, p. 151]. Second, the various ways individual jurisdictions track the disease may introduce measurement error issues that further complicate identification [1, p. 134]. Insofar as the process of obtaining a test may depend upon demographic variables such as population density (rural versus urban), age, education, race, and income, this can lead to measurement error correlated with variables in the model. For example, drive-through testing has become popular, but this may select against poorer residents without a car. Measurement error correlated with variables in the model is termed differential measurement error, and this can lead to bias in estimation .
To avoid the issues of simultaneity and measurement error, we focus on reduced form modeling of social distancing as gauged by the change in the log of the percentage of individuals staying at home each week using cell phone tracking data. The reduced form involves a regression of the dependent variable as a function of exogenous demographic variables (population density, age, education, race, and income) as well as jurisdictional fixed effects for 49 states (Hawaii excluded due to data coverage). Because of the spatial nature of a contagious disease, we also model the spillovers associated with demographic variables in surrounding counties, as well as allow for spatially dependent disturbances at the county level. This also mirrors the possible spatial aspects of natural disasters [3, 4]. We run a separate reduced form regression for each week of data and examine the evolution of the estimated coefficients over time since the onset of the COVID-19 pandemic in the United States. These estimated coefficients express the revealed preferences of individuals who chose to stay at home to avoid the disease, subject to their constraints.
The estimation results show that predicted social distance from demographic exogenous variables explains substantially more variation in individual stay-at-home behavior than predictions from jurisdictional fixed effects. In addition, the explanations from demographic exogenous variables and jurisdictional fixed effects show an evolving correlation over the sample period. For the 12th week of the data (corresponding to the second full week of April), when all states had restrictions in place following the national emergency declaration, the predictions from the demographic variables explain 43.83% of the variance while the fixed effect predictions explain 13.11% of the variance, with a correlation between these two predictive components of −0.0130. By the last week of the data in our sample (the second week of June), demographic variable predictions explain 33.18% of the variance while fixed effect predictions explain 8.83% of the variance, with a correlation between these two predictive components of 0.1778. A negative correlation implies that predictions from fixed effects and demographic effects partially cancel or offset. A zero correlation implies that the fixed effect and demographic effect predictions operate independently. A positive correlation implies that fixed effect and demographic effect predictions reinforce or complement each other.
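The variance shares and the correlation between the two predictive components can be illustrated with a short sketch. It assumes that "explains x% of the variance" means the variance of a component's prediction divided by the variance of the dependent variable; the exact definition used in the paper is not stated here, and the toy numbers below are hypothetical.

```python
from statistics import mean, pvariance

def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

def decompose(delta, demo_pred, fe_pred):
    """Variance share of each predictive component and their correlation.

    Assumption: share = var(component prediction) / var(dependent variable).
    """
    total = pvariance(delta)
    return {
        "demo_share": pvariance(demo_pred) / total,
        "fe_share": pvariance(fe_pred) / total,
        "corr": correlation(demo_pred, fe_pred),
    }

# Hypothetical toy data for five counties (not values from the paper).
demo_pred = [0.10, 0.30, 0.20, 0.40, 0.25]
fe_pred = [0.05, 0.02, 0.06, 0.01, 0.04]
resid = [0.01, -0.02, 0.00, 0.02, -0.01]
delta = [d + f + e for d, f, e in zip(demo_pred, fe_pred, resid)]
result = decompose(delta, demo_pred, fe_pred)
```

In this toy example the two components are negatively correlated, so their predictions partially offset each other, as in the week 12 results.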
Regression models on spatial data often result in clustered residuals or spatial error dependence. For example, consultation among officials of nearby counties could lead to them adopting similar policies and thus obtaining similar results. Indeed, we find statistically significant spatial error dependence. In contrast, we do not find statistically significant spillovers from the demographic exogenous variables in surrounding counties. In other words, the racial composition or average income of surrounding counties does not appear to significantly affect the observed social distance in a county. In addition, the underlying exogenous demographic variables, and thus the predictions based on them, show substantial spatial autoregressive dependence (between 0.851 and 0.914, depending on the week), which indicates that social distances will also cluster. Over the course of the data, the cross-sectional variation in social distances increased over nine-fold from 0.0050 in week 1 to 0.0474 in week 9, and then fell by about half to 0.0234 in week 20. The increased variance coupled with the high level of spatial dependence can result in relatively intense hotspots and coldspots of social distances.
An emerging literature surveyed by  and  points out that the constant basic reproduction number assumed in compartmental models (the Susceptible-Infectious-Recovered, or SIR models) used in epidemiology to model disease transmission is untenable given that behavioral responses to the fear of infection [7–12], underlying health conditions [13, 14], and work environment [15–18] all influence the decision to social distance. Our results show that individual behavioral responses matter, even when jurisdictional restrictions are in place, in affecting the basic reproduction number.
Models that allow for the basic reproduction number to vary over time to capture changes in behavior or policy such as  show a better fit with the data on the spread of COVID-19. Our results add to this line of work by suggesting that individuals with the capacity and desire to avoid risk in hard hit areas may lead to models overestimating the spread of disease based on initial infections. Conversely, individuals in areas without much incidence of the disease may not engage in social distancing and thus be quite susceptible to the disease when it finally spreads. They also suggest that the worst cases for disease spread may exist in areas where the residents have reduced capacity to avoid risk through voluntary distancing such as in poor, densely populated neighborhoods. In this setting the constant R0 assumption may hold.
Data and background
Demographic data is collected from the American Community Survey (ACS) data provided by the U.S. Census. We use the 5-year estimates that conclude in the year 2018, because this is the most recent ACS data that includes coverage for every U.S. county. We take a county’s population size to be the population estimated by the U.S. Census for July 1, 2019. To convert the population level into a population density, we obtain county land area from the 2019 release of the U.S. Census Gazetteer data.
Table 1 reports summary statistics of the exogenous variables used in this study. Population density (Pop) is the most highly skewed variable, with a minimum of 0.04 people per square mile (Yukon-Koyukuk, AK) and a maximum of 71,872.6 people per square mile (New York, NY). In the median county, 29.3 percent of households have at least one child (Child) under 18 years of age. In the median county, 87 percent of the population have not obtained a bachelor’s degree or higher (HS), and 89.9 percent of the population is white (White). The median household income (Income) for the median county is $49,871.8.
We infer county-level rates of individuals staying at home using GPS data from SafeGraph. The company collects anonymized location data of individuals via GPS pings from cell phones. In response to the pandemic, SafeGraph released their aggregated and anonymized data to researchers, and began publishing additional data, including a Social Distancing Metrics dataset. This dataset covers all 50 U.S. states as well as Washington, D.C. We find incomplete county coverage in Hawaii, and therefore exclude it from our analysis. The Social Distancing Metrics data record, aggregated to a Census Block Group (CBG) level, the number of active devices in the CBG on a given day and the count of those devices that never leave the user’s inferred home location on that day. We aggregate these CBG-level numbers to the county-level and then define our proxy for social distancing by dividing the number of devices completely at home by the total number of active devices. The average rate of devices staying at home over a given week is then taken to be the level of social distancing in the county.
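The aggregation described above can be sketched as follows. The row layout and field names are hypothetical and do not reproduce SafeGraph's schema; the sketch only mirrors the steps in the text (sum device counts over CBGs within a county, divide at-home devices by active devices, then average over the week).

```python
from collections import defaultdict

def county_stay_home_rate(cbg_rows):
    """Aggregate CBG-level device counts to counties and return the at-home share.

    Each row is (county_fips, active_devices, devices_completely_home).
    These field names are hypothetical, not SafeGraph's actual schema.
    """
    totals = defaultdict(lambda: [0, 0])
    for county, active, at_home in cbg_rows:
        totals[county][0] += active
        totals[county][1] += at_home
    # Counties with no active devices are skipped to avoid division by zero.
    return {c: home / active for c, (active, home) in totals.items() if active > 0}

def weekly_level(daily_rates):
    """Average the daily county rates over one week."""
    return sum(daily_rates) / len(daily_rates)

# Hypothetical counts for one day: two CBGs in one county, one CBG in another.
rows = [("01001", 100, 20), ("01001", 300, 100), ("01003", 50, 10)]
rates = county_stay_home_rate(rows)
```

Summing counts before dividing weights each CBG by its number of active devices, which differs from averaging CBG-level rates.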
Table 2 presents the evolution of this measure of social distancing over the 21 weeks of data in our sample beginning with the third week of January 2020. The median percentage of people social distancing is about 25% prior to President Trump’s National Emergency Declaration on March 13, 2020 (week 7). Following the declaration, there is a sharp increase in individuals staying at home in weeks 8 through 12. The percentage begins to decline in mid-April when individual states relaxed their stay-at-home restrictions, but stays elevated relative to pre-emergency levels. The variation in the range of social distancing shows a similar pattern across the weeks.
The presidential emergency declaration was not the only governmental response to the pandemic. States responded to the health care crisis by issuing non-pharmaceutical interventions (NPI), such as mandated closure of certain businesses. The responses varied by state, and were implemented at varying points over our sample period. Moreover, these de jure NPIs have heterogeneous de facto implementation/enforcement. As  document, an extensive variety of the actual policies were implemented at different times over the outbreak. This heterogeneity impedes direct modeling of specific measures. To address these issues, we include state fixed effects along with other control variables. The efficacy of the underlying restrictions will implicitly appear as evolving parameter estimates by week.
We also obtain county-level disease data from the New York Times and report summary statistics of new disease counts in Table 3 for illustrative purposes. The highly decentralized nature of disease tracking leads to significant issues in U.S. disease case data. Substantial heterogeneity exists at the county level due to differences in testing eligibility (whether an individual can receive a test), testing access (whether the county had test kits available), testing rules (e.g., whether tests are required on the deceased), and reporting rules (e.g., classification of results for non-residents). As Table 3 shows, fewer than five percent of counties reported new cases in the week of the national emergency declaration. This is not because cases were rare at this time, but rather because of limitations in testing availability during the first months of the pandemic [21, 22]. The various sources of measurement error or selection mechanisms present in these data motivate the approach we use which avoids case data. We discuss the statistical justifications for this in the following section.
As mentioned earlier, the virus contagiousness depends on social distancing, and social distancing may depend upon both the contagiousness and virulence of the virus. Ideally, one could model these as simultaneous equations where the system dependent variable Yt is an n × 2 matrix. Specifically, Yt contains an n × 1 vector of observations on the virus vt and an n × 1 vector of observations on the social distancing variable, ht, which could represent the propensity of residents to stay in their homes at time t as shown in (1) and (2). In (2) X represents the exogenous explanatory variables, Γt represents the parameters governing the influence of each endogenous variable on the other at time t, and Et represents the disturbances at time t. (1) (2)
The equation representation in (2) is referred to as a structural form of the model [23, p. 528] and these require special techniques or data to avoid inconsistent estimation. Specifically, if the virus vt and the social distancing variable ht are simultaneously determined, estimating a single equation of social distancing as a function of the virus will lead to inconsistent parameter estimates of the effect of the virus vt on social distancing ht. In other words, estimating an equation such as ht = x1 β(t)1 + ⋯ + xp β(t)p + vt β(t)p+1 + ε(t) will lead to an inconsistent parameter estimate of β(t)p+1, the effect of the virus vt on social distancing ht [1, p. 151]. A similar situation exists in economics where modeling supply and demand but ignoring simultaneity led to incorrect signs on the effects of price on supply and demand. This motivated many of the techniques and ideas associated with simultaneous equations from the 1930s onward .
However, the identification and estimation of the structural equations in (2) sometimes prove challenging. The most common means of obtaining identification requires a variable (instrument) that affects one of the endogenous variables, but not the other. Unfortunately, this condition cannot generally be verified empirically and becomes a potential weak point in the analysis. Moreover, the non-uniform nature of reporting of COVID-19 tests, hospitalization, and deaths across jurisdictions raises issues about measurement error concerning the virus. In a simultaneous equation system, measurement error in one equation (virus) could affect the other equation (social distancing). To avoid this, we resort to modeling the unrestricted reduced form [23, p. 528] as shown in (3). In an unrestricted reduced form each dependent variable can be estimated consistently when using all the exogenous variables in X. The resulting parameter estimates Πt in (4) show the effect of a particular exogenous variable on the dependent variable. (3) (4)
Because the exogenous variables, X, do not change over the period, any differences across the weekly cross-sectional estimates of Πt reflect changes in behavior rather than changes in the regressors. In this way choices by individuals concerning their desire to avoid the virus can express an evolving response to the presence of the disease. Staying at home reveals a preference for safety (subject to the capacity to stay at home). Our interest concerns the changes in behavior over the rise of the disease, and to capture this we difference the reduced form equations at time t and time 0 as shown in (5). (5)
Note, for the unrestricted reduced form one can estimate this equation by equation. In this case, since the focus lies on social distancing as a function of the exogenous variables, we can move to a simpler single equation model, a subject taken up in the next section.
In summary, social distancing and disease are determined simultaneously and a simple single equation estimation that treats one of the endogenous variables as a typical right hand side variable can lead to inconsistent estimates. Although one could attempt to simultaneously model these variables using a structural form, identification of these systems can be challenging. In addition, the disease data have many non-uniform reporting issues that may adversely affect estimation. To avoid this we resort to unrestricted reduced form modeling. The repercussions of the prevalence of the disease show up in the evolving parameter estimates since the exogenous variables stay fixed over the period of the disease. By differencing the reduced forms at time 0 and time t we see the change in individual behavior as a function of the exogenous variables.
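One conventional way to write the system summarized above follows the textbook simultaneous equations setup. The sketch below is an illustration only: the coefficient matrix $B_t$ and the reduced form disturbance $U_t$ are symbols assumed here, since the text names only $X$, $\Gamma_t$, $E_t$, and $\Pi_t$, and the notation in the original equations may differ.

```latex
\begin{align*}
Y_t &= \begin{bmatrix} v_t & h_t \end{bmatrix}
  && \text{(endogenous variables, } n \times 2\text{)} \\
Y_t \Gamma_t &= X B_t + E_t
  && \text{(structural form)} \\
Y_t &= X \Pi_t + U_t, \quad \Pi_t = B_t \Gamma_t^{-1}, \quad U_t = E_t \Gamma_t^{-1}
  && \text{(reduced form)} \\
Y_t - Y_0 &= X (\Pi_t - \Pi_0) + (U_t - U_0)
  && \text{(differenced reduced form)}
\end{align*}
```

Because $X$ is the same at time $0$ and time $t$, the differenced equation attributes the change in the dependent variables to the change in the reduced form parameters.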
In this section we describe the actual single equation model used in estimation based on the reduced form interpretation in the previous section. We begin by defining y(t) as the log of the percentage of the population staying at home at week t, h(t), as in (6). In contrast to the previous section, we switch to a notation that places the index t in parentheses because this allows use of both subscripts and powers for the various components in the models. The indexing of the parameters, disturbances, and dependent variable by t in the model indicates a series of cross-sectional regressions where the values of the parameters over time reflect individuals’ revealed preference for safety by staying at home. Because our interest focuses on the relative change in y(t) over time, our dependent variable Δ(t) in (7) equals y(t) − y(0). Using changes may help reduce the influence of omitted variables that have a similar effect in period 0 and period t. Because our period of differencing is relatively short (up to 20 weeks), many influences from omitted variables may be reduced. (6) (7)
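As a small worked example of the dependent variable, the following sketch computes the log change for a single county. The function name and the input shares are hypothetical.

```python
import math

def log_change(h_t, h_0):
    """Delta(t) = ln h(t) - ln h(0) for one county."""
    return math.log(h_t) - math.log(h_0)

# Hypothetical shares: 25% of devices at home in week 0, 40% in week t.
delta = log_change(0.40, 0.25)
```

Whether h is expressed as a share (0.40) or as a percentage (40) does not matter for Δ(t), because the scaling constant cancels in the difference of logs.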
We selected variables to represent the basic demographic characteristics of population density, age, education, race, and income. For ease of interpretation, we selected only one variable from each major category. Specifically, the explanatory variables include the log of the population density (Pop), log of the percentage of households with children under 18 (Child), log of the percentage of the population without a college degree (HS), log of percentage White in the population (White), and log of median income (Income) as shown in (8) to (12). Because some counties have no minority residents, we coded the race variable as White to avoid taking the log of 0. Similarly, some counties have no college graduates and so we coded the education variable as the percentage of the population without a college degree. This would include individuals that did not complete high school, but for simplicity we denoted this as HS. The Child variable captures some element of age, but also captures the effect of school shutdowns which may have forced many parents to stay at home. (8) (9) (10) (11) (12)
We use the demographic explanatory variables in (8) to (12) to compose the n × 5 matrix Z in (13) and also create an n × 50 matrix S in (14) that contains dichotomous variables for 49 states as well as a constant vector (intercept). Specifically, we include a constant term ιn defined in (15) and exclude the District of Columbia from the dichotomous variables to avoid perfect multicollinearity in (14). (13) (14) (15)
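A minimal sketch of how matrices like Z and S could be assembled is given below. The dictionary keys and the three example counties are hypothetical; only the structure (five logged demographic columns, an intercept, and state indicators with the District of Columbia omitted) follows the text.

```python
import math

def build_design(counties, reference="DC"):
    """Return (Z, S, states): logged demographics, and intercept plus state dummies.

    `counties` is a list of dicts with hypothetical keys. `reference` is the
    jurisdiction dropped from the dummies to avoid perfect multicollinearity.
    """
    keys = ["density", "pct_child", "pct_no_degree", "pct_white", "median_income"]
    states = sorted({c["state"] for c in counties} - {reference})
    Z = [[math.log(c[k]) for k in keys] for c in counties]
    # First column is the intercept; the remaining columns are 0/1 state indicators.
    S = [[1.0] + [1.0 if c["state"] == s else 0.0 for s in states] for c in counties]
    return Z, S, states

# Hypothetical values for three counties (not taken from the ACS data).
counties = [
    {"state": "DC", "density": 11000.0, "pct_child": 20.0, "pct_no_degree": 40.0,
     "pct_white": 45.0, "median_income": 85000.0},
    {"state": "LA", "density": 100.0, "pct_child": 30.0, "pct_no_degree": 85.0,
     "pct_white": 60.0, "median_income": 45000.0},
    {"state": "TX", "density": 50.0, "pct_child": 35.0, "pct_no_degree": 80.0,
     "pct_white": 75.0, "median_income": 55000.0},
]
Z, S, states = build_design(counties)
```

With 49 states in the data, the same construction yields 1 + 49 = 50 columns in S.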
The contagious nature of COVID-19 suggests the need to explicitly incorporate spatial aspects into the model. We do so in four ways: (1) we examine changes in social distancing behavior, which should reduce the influence of omitted variables with relatively constant importance; (2) we incorporate fixed jurisdictional effects (S); (3) we allow for spillovers from exogenous demographic variables from surrounding counties to enter the model; and (4) we allow for spatial dependence in the disturbances which can capture nearby omitted variables.
To specify spillovers and dependence in the disturbances, spatial econometrics often uses an n × n weight matrix W to create spatial lags of variables so that Wv becomes the spatial lag of an n × 1 variable v. This matrix contains positive, fixed elements if observations neighbor each other and zero elements for non-neighboring observations as shown in (16). Also, to prevent neighbors from predicting themselves the diagonal elements of W equal 0 as specified in (17). Most commonly, W is row normalized so that each row sums to 1 as in (18). Because each row sums to 1 and W contains only non-negative elements, W is a row-stochastic matrix [25, 26]. Note, a matrix can be row-stochastic in the linear algebra sense while containing only non-stochastic elements in the probabilistic sense. Therefore, Wv becomes an average of observations that lie nearby each element of v and (Wv)i does not contain vi. Having λ(t)∈(−1, 1) as shown in (18) leads to a sufficient condition for a positive definite covariance matrix in many spatial econometric models using row-stochastic W. (16) (17) (18)
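The properties of W described above can be made concrete with a small sketch that builds a row-normalized nearest-neighbor weight matrix and a spatial lag. It assumes planar coordinates and Euclidean distance purely for illustration; the distance measure is an assumption, and the coordinates are hypothetical.

```python
def knn_weights(coords, k):
    """Row-normalized k-nearest-neighbor weight matrix with a zero diagonal."""
    n = len(coords)
    W = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(coords):
        # Squared Euclidean distance to every other observation (self excluded).
        dists = sorted(
            ((xi - xj) ** 2 + (yi - yj) ** 2, j)
            for j, (xj, yj) in enumerate(coords) if j != i
        )
        for _, j in dists[:k]:
            W[i][j] = 1.0 / k  # equal weights, so each row sums to 1
    return W

def spatial_lag(W, v):
    """Compute Wv: the average of v over each observation's neighbors."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

# Hypothetical centroids for four observations.
coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
W = knn_weights(coords, k=2)
lag = spatial_lag(W, [1.0, 2.0, 3.0, 4.0])
```

Note that a nearest-neighbor W is generally not symmetric: the isolated fourth observation has neighbors, but it is not a neighbor of any other observation.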
To incorporate all of these spatial aspects into a model, we selected the spatial Durbin model (SDM) in (19) and (20) as well as the spatial Durbin error model (SDEM) with jurisdictional fixed effects as general specifications as shown in (21) and (22). The SDM and SDEM subsume non-spatial ordinary least squares (OLS) with and without jurisdictional fixed effects, the spatial lag of X model (SLX) with and without jurisdictional fixed effects, and the spatial error model (SEM) with and without jurisdictional fixed effects. However, they do not nest each other, and the SDM also nests the spatial autoregressive model (SAR). Note, the acronym SAR stands for a spatial error model in spatial statistics such as in Orr (1975), but stands for a spatially lagged dependent variable model in spatial econometrics. The SDM can yield models with global spatial spillovers while the SDEM only yields local spatial spillovers. For untransformed dependent and independent explanatory variables, the estimates from OLS, SLX, SEM, and SDEM also equal the respective marginal effects. For the SDM and SAR the parameter estimates do not equal the marginal effects. For a discussion of these two models and the restrictions that lead to OLS, SLX, and SEM see , , as well as . Note, other research using these spatial econometric models for investigating the virus, but not social distance, has found significant spatial influences on the disease [29, 30]. (19) (20) (21) (22)
To show the ability of the SDEM to nest other well known spatial models in more detail, we present these more specific models in (23) to (28). The SDEM with fixed effects subsumes the non-spatial OLS with and without jurisdictional fixed effects (23, 24), spatial lag of X (SLX) with and without jurisdictional fixed effects (25, 26), and the spatial error model (SEM) with and without jurisdictional fixed effects (27, 28). In the models without jurisdictional fixed effects, the scalar κ(t) represents the effect at time t of the constant vector ιn or intercept. (23) (24) (25) (26) (27) (28)
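The nesting described above can be seen from a generic statement of the two general models. This is a sketch in standard spatial econometric notation rather than a reproduction of the numbered equations; in particular, the symbol $\delta^{(t)}$ for the coefficients on $S$ is an assumption made here.

```latex
\begin{align*}
\text{SDEM:}\quad \Delta^{(t)} &= Z\beta^{(t)} + WZ\theta^{(t)} + S\delta^{(t)} + u^{(t)},
  \qquad u^{(t)} = \lambda^{(t)} W u^{(t)} + \varepsilon^{(t)} \\
\text{SDM:}\quad \Delta^{(t)} &= \lambda^{(t)} W \Delta^{(t)} + Z\beta^{(t)} + WZ\theta^{(t)} + S\delta^{(t)} + \varepsilon^{(t)}
\end{align*}
```

Under this notation, setting $\theta^{(t)} = 0$ in the SDEM gives the SEM, setting $\lambda^{(t)} = 0$ gives the SLX, and imposing both restrictions gives OLS. Replacing $S\delta^{(t)}$ with $\iota_n \kappa^{(t)}$ gives the corresponding models without jurisdictional fixed effects.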
Note, if θ(t) is non-zero and Z and WZ exhibit correlation, estimates of β(t) in a model without WZ can exhibit bias [26, 28]. If λ(t) ≠ 0, inference based on the assumptions of independent disturbances can be inconsistent [26, 31]. Estimation of the independent error models in (23) to (26) can use ordinary least squares. Estimation of the SDM, SDEM and the SEM in (19), (21), (27) and (28) requires other techniques such as maximum likelihood [26, 31]. Note, minimizing the sum of squared errors will typically lead to overly large values of λ(t), whereas maximum likelihood involves a log-determinant term ln|In − λ(t)W| that penalizes such large values [26, 31]. We used maximum likelihood to produce the estimates in the empirical section.
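The role of the log-determinant can be seen in the textbook Gaussian log-likelihood for a spatial error model, shown here as an illustration rather than as the exact expression used in estimation. Here $X$ is shorthand, assumed for this sketch, for the full set of regressors (for example $[Z \;\; S]$), and the week index is suppressed.

```latex
\begin{align*}
\ln L(\beta, \lambda, \sigma^2)
  &= -\frac{n}{2}\ln(2\pi\sigma^2) + \ln\lvert I_n - \lambda W\rvert
     - \frac{1}{2\sigma^2}\, e'e, \\
e &= (I_n - \lambda W)(\Delta - X\beta).
\end{align*}
```

For a row-stochastic $W$ with zero diagonal, $\ln\lvert I_n - \lambda W\rvert$ equals 0 at $\lambda = 0$ and tends to $-\infty$ as $\lambda \to 1$, so it offsets the reduction in $e'e$ that large values of $\lambda$ would otherwise produce.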
In terms of interpretation, β(t) from the non-spatial OLS and SEM has the usual partial derivative interpretation where changes to an explanatory variable’s value for an observation only affect that observation’s dependent variable. However, the SLX and SDEM contain the term WZ, and this allows the values of an observation’s neighbors to have an effect on the observation. In the SLX and SDEM β(t) measures the direct effects and θ(t) measures the indirect effects of a neighbor on an observation [26, 27]. Because of the double log specification of the empirical model, one can interpret the estimates as elasticities.
The incorporation of the spatially lagged dependent variable for these models means that the estimates do not equal the marginal effects, in contrast to the error models (SLX, SEM, and SDEM) presented earlier. To use the example of SDM with fixed effects from (32), solving for Δ(t) yields an expression in (33) with the endogenous variable on the left-hand side and the exogenous variables on the right-hand side. Note, (In − λ(t)W)^−1 = In + λ(t)W + (λ(t))^2 W^2 + … and thus this has the effect of incorporating the immediate spatial lag of the exogenous variables, a spatial second order lag of the exogenous variables, and so forth. Terms such as W^2 provide a second-order spatial lag: W^2 v equals W(Wv), which gives the spatial average of the spatial average of v. Higher order powers of W act in a similar way. A high order power of W will often have all positive elements and therefore each observation affects every other observation, which means that models using such high powers of W have a global nature. Therefore, the SDM contains both local and global spatial lags of the exogenous variables. Note that if the magnitude of λ(t) is low, the series expansion of (In − λ(t)W)^−1 might converge in just a few terms and therefore the model still has a local nature. This sometimes makes it difficult to distinguish SEM and SAR or SDEM and SDM when low levels of spatial dependence exist. (33)
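The convergence behavior of the series expansion can be checked numerically. The following sketch uses a hypothetical three-observation row-stochastic W and compares a four-term partial sum with a long partial sum for a low and a high value of λ; the helper names are illustrative.

```python
def mat_mul(A, B):
    """Multiply two matrices stored as lists of rows."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)] for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def neumann_partial_sum(W, lam, terms):
    """Sum of lam**q * W**q for q = 0 .. terms - 1."""
    n = len(W)
    total = identity(n)
    power = identity(n)
    for q in range(1, terms):
        power = mat_mul(power, W)
        total = [[total[i][j] + lam ** q * power[i][j] for j in range(n)] for i in range(n)]
    return total

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical row-stochastic W: three mutually neighboring observations.
W = [[0.0, 0.5, 0.5], [0.5, 0.0, 0.5], [0.5, 0.5, 0.0]]
reference = neumann_partial_sum(W, 0.2, 200)  # long sum, close to the inverse
gap_low = max_abs_diff(neumann_partial_sum(W, 0.2, 4), reference)
gap_high = max_abs_diff(neumann_partial_sum(W, 0.9, 4), neumann_partial_sum(W, 0.9, 200))
```

With λ = 0.2 the first four terms already approximate the inverse closely, whereas with λ = 0.9 the higher order (global) terms carry most of the weight.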
To summarize, we use differencing of the dependent variable, jurisdictional fixed effects, demographic variables, spatial spillovers, and spatial error dependence to explain social distancing. First, differencing of the dependent variable may reduce the influence of any omitted variable whose importance does not change much over the sample time period (20 weeks). Second, since states have set forth restrictions due to the disease, jurisdictional fixed effects should play a role in social distancing. Third, most social cross-sectional phenomena vary with demographics and thus the inclusion of demographic exogenous variables. Specifically, the nature of the disease suggests that population density matters. Also, having a child present in the household may require an adult in residence with school and day care closures. Education often dictates the types of jobs that individuals hold and many of the jobs that allow remote work require more education. Income also allows individuals the ability to socially isolate. Fourth, a contagious disease suggests spatial spillovers and so these should be present in the more general models. Fifth, cross-sectional data models often exhibit spatially dependent residuals and so a model should be capable of handling them. This still leaves a number of possible specifications. Given the array of possible spatial specifications, we wish to examine their performance in the next section. We also examine a number of alternative specifications in a later section in which we explore the robustness of our principal model.
Because the choice of the model as well as the form of W has potential to affect the results, we conducted a specification search using the well known Bayesian Information Criterion  or BIC. For the BIC, the optimal model has the lowest value out of a series of candidate models. Specifically, we examined SAR, SEM, SDEM, and SDM with and without jurisdictional fixed effects. The results appear in Table 4 (note the BIC that we report is scaled by n to facilitate formatting). For the models with fixed effects, during the first five weeks, SAR had the lowest BIC, during week six SDM had the lowest BIC, and during weeks 7–20 SEM displayed the lowest BIC. For the models using only a constant, SEM showed the lowest BIC for weeks 1–7 and 14–20, SDM showed the lowest BIC for weeks 8–11 and 13, while SDEM showed the lowest BIC in week 12. Taken as a whole the SEM seemed to display the lowest levels of BIC across the weeks and so we will take this as our base model. Note, the models without fixed effects provided a lower BIC than the models with fixed effects. However, the role of the fixed effects is theoretically important in this application and so we will leave these in our base model. We also examined different numbers of neighbors. Using 15 nearest neighbors resulted in the lowest BIC in week 20 for SEM with fixed effects (3.7288). Using 14 and 16 nearest neighbors yielded higher values of 3.7294 and 3.7304, respectively. In summary, we will use SEM with fixed effects (SEM-FE) with 15 nearest neighbors as our base model.
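The selection rule can be sketched as follows. The sketch uses the usual definition BIC = k ln(n) − 2 ln L and assumes that "scaled by n" means division by the number of observations; the log-likelihoods, parameter counts, and sample size below are hypothetical and are not values from Table 4.

```python
import math

def bic_per_obs(log_lik, n_params, n_obs):
    """BIC = k ln(n) - 2 ln L, divided by n (an assumed scaling)."""
    return (n_params * math.log(n_obs) - 2.0 * log_lik) / n_obs

def best_model(candidates, n_obs):
    """Return the name of the candidate with the lowest scaled BIC, plus all scores."""
    scores = {name: bic_per_obs(ll, k, n_obs) for name, (ll, k) in candidates.items()}
    return min(scores, key=scores.get), scores

# Hypothetical (log-likelihood, number of parameters) pairs for one week.
candidates = {
    "SEM-FE": (-5600.0, 57),
    "SDEM-FE": (-5590.0, 62),
    "SDM-FE": (-5592.0, 62),
}
winner, scores = best_model(candidates, n_obs=3100)
```

In this toy example the richer models fit slightly better, but the ln(n) penalty on their five extra parameters leaves the more parsimonious SEM-FE with the lowest BIC.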
The various information criteria such as the Akaike Information Criterion (AIC) [33, 34] and the BIC differ on the weight given to parsimony versus performance, and a criterion such as the AIC, which does not penalize model complexity as much as the BIC, could result in different rankings of the models. To further examine the performance of alternative models, in a later section we fit the SAR, SEM, SDEM, and SDM with and without jurisdictional fixed effects, SEM-FE with 10 as well as 20 nearest neighbors, and SEM-FE applied to an alternate measure of social distance. We subsequently examine the correlation among the marginal effects of these 11 specifications across the 20 weeks of the data. In the next section, we provide our base model SEM-FE estimates.
We use the spatial error model (SEM) with fixed effects for each jurisdiction as the baseline model as described previously in (28) and (22). Estimation uses a W matrix based on 15 nearest neighbors. We examine the sensitivity to choices of W in the following section. The results of estimation of the demographic exogenous variables appear in Table 5 where t-values appear below their respective estimates. The results of estimation of the significant fixed effect coefficients appear in Table 6.
Over the course of time the estimated coefficient on population density (Pop) started at a significant −0.007 in week 1 indicating that urban dwellers stayed home less than rural dwellers at the beginning of the rise of the disease. However, this changed over time until population density showed a positive 0.025 coefficient indicating that urban dwellers stayed home more than rural dwellers by week 20. The break in behavior occurred between weeks 5 and 6 where population density switched from a significant −0.014 to a significant 0.014. Percentage of households with children (Child) also showed an evolution from significant −0.021 to 0.046, with a large break occurring from week 6 to week 8. This may represent the repercussions of school closures in many areas. The Child variable coefficient reached a peak of 0.125 in week 9 and declined to 0.046 in week 20. The coefficient associated with the log of the proportion of individuals without a college degree (HS) went from a −0.068 to a peak of −0.954 in week 15, with a break happening between week 7 and 8. The coefficient declined to a significant −0.768 by week 20. This could have happened because individuals with a college degree may have been able to work at home which increased the percentage of the time they stayed at home relative to their less-educated counterparts. Areas with a higher log proportion of whites (White) showed an insignificant effect of staying at home in week 1 of −0.006, which rose to a significant 0.089 by week 9 and declined back to insignificance (−0.011) by week 20. Again, the largest break happened between weeks 7 and 8. Finally, median income (Inc) also began with an insignificant effect in week 1 of −0.007, then rose by a factor of 22 by week 13 to a significant 0.502, again with a large break between weeks 7 and 8. From the peak it declined in magnitude to a significant 0.046 in week 20. 
Note that the estimates of the spatial error dependence parameter λ(t) were always significant but generally fell from early in the crisis (λ(t) = 0.542, 0.657 in weeks 1, 2) to later in the crisis (λ(t) = 0.349, 0.399 in weeks 18, 20). This indicates that the influence of spatially related omitted variables fell over this time period.
The fixed effect estimates across jurisdictions in Table 6 show some interesting features. Strikingly, at the one percent level (two-tailed test), none of the 49 jurisdictions shows a statistically significant effect in weeks 1 through 8 relative to the base jurisdiction of the District of Columbia. In weeks 10 and 11, eight and five of the 49 states, respectively, exceeded the critical value of 2.58. Despite the paucity of large t-values, a joint likelihood ratio test of the fixed effects model against a single-intercept model tells a different story: twice the difference in log-likelihood between SEM-FE and SEM ranges from 84.8 to 199.2 over the 20 weeks. The critical value at one percent is 74.92 (based on 49 d.f.). Because every weekly statistic exceeds this value, we reject the hypothesis that the fixed effects are jointly zero; that is, the fixed effects are jointly significant. Interestingly, the intercept in Table 6 differs significantly from zero in only 3 of the 20 weeks.
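The joint test can be sketched as follows. This is a minimal illustration, not the authors' code: it uses the Wilson–Hilferty approximation to the chi-square quantile (a stand-in chosen so the example needs only the Python standard library), and the function names are hypothetical.

```python
import math
from statistics import NormalDist

def chi2_critical_wilson_hilferty(df, alpha):
    """Approximate upper-alpha chi-square critical value (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)  # standard normal quantile
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3

def likelihood_ratio_stat(loglik_unrestricted, loglik_restricted):
    """Twice the log-likelihood gain of the unrestricted model."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)

# 49 restrictions at the one percent level, as in the text.
critical = chi2_critical_wilson_hilferty(df=49, alpha=0.01)

# Smallest and largest weekly statistics reported in the text.
reported_range = (84.8, 199.2)
rejects_all_weeks = min(reported_range) > critical
print(round(critical, 2), rejects_all_weeks)
```

The approximate critical value lands close to the 74.92 quoted above, and even the smallest reported statistic exceeds it, so the null of jointly zero fixed effects is rejected in every week.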
To obtain some insight into the relative contributions of jurisdictional fixed effects and of demographic exogenous variables, we break apart the overall prediction in (34) into a fixed effects component in (35) and a demographic component in (36). As is well known, the variance of the sum of two random variables equals the sum of their individual variances plus twice their covariance, as in (37). (34) (35) (36) (37)
To make (37) more intuitive, we restate (37) in terms of R²(t) in (38), with the individual components R²_FE(t) in (39) and R²_D(t) in (40). This allows writing the overall R²(t) in terms of the individual components R²_FE(t) and R²_D(t) as well as the correlation between the predictions in (41). (38) (39) (40) (41)
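One way to write the decomposition described in the two paragraphs above is sketched below. The notation is an assumption rather than the authors' own: \hat{y}(t) denotes the overall prediction, \hat{y}_{FE}(t) and \hat{y}_{D}(t) its fixed effect and demographic components, \rho(t) the correlation between the two components, and R^2(t) is taken to be the variance of the prediction relative to the variance of y(t).

```latex
\begin{align*}
\hat{y}(t) &= \hat{y}_{FE}(t) + \hat{y}_{D}(t), \\
\operatorname{var}\bigl(\hat{y}(t)\bigr)
  &= \operatorname{var}\bigl(\hat{y}_{FE}(t)\bigr)
   + \operatorname{var}\bigl(\hat{y}_{D}(t)\bigr)
   + 2\operatorname{cov}\bigl(\hat{y}_{FE}(t), \hat{y}_{D}(t)\bigr), \\
R^2(t) &= \frac{\operatorname{var}\bigl(\hat{y}(t)\bigr)}{\operatorname{var}\bigl(y(t)\bigr)}, \qquad
R^2_{FE}(t) = \frac{\operatorname{var}\bigl(\hat{y}_{FE}(t)\bigr)}{\operatorname{var}\bigl(y(t)\bigr)}, \qquad
R^2_{D}(t) = \frac{\operatorname{var}\bigl(\hat{y}_{D}(t)\bigr)}{\operatorname{var}\bigl(y(t)\bigr)}, \\
R^2(t) &= R^2_{FE}(t) + R^2_{D}(t) + 2\,\rho(t)\sqrt{R^2_{FE}(t)\,R^2_{D}(t)}.
\end{align*}
```

The last line follows from dividing the variance identity by var(y(t)) and writing the covariance as \rho(t) times the product of the two standard deviations.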
From (41), when the correlations between the fixed effect and demographic predictions are negative, the sum of the relative variances overstates R²(t). In other words, when correlations are negative, the predictions from fixed effects and demographic variables work partially in opposite directions and thus the overall variance of the prediction is lower than the sum of the parts. When correlations are positive, the predictions from fixed effects and demographic variables work partially in the same direction and thus the overall variance of the prediction is higher than the sum of the parts. Note that when the correlation between the fixed effect predictions and the demographic predictions equals 0, the overall R²(t) just equals the sum of the fixed effect R²_FE(t) and the demographic variable R²_D(t).
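The behavior described above can be checked numerically. The sketch below uses synthetic series (not the paper's predictions) in which the two components are negatively correlated, and it assumes that each R² is the variance of a prediction divided by the variance of y; all function and variable names are hypothetical.

```python
import math
import random

def variance(x):
    """Population variance of a list of numbers."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)

def correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / math.sqrt(variance(x) * variance(y))

def r2_decomposition(y, pred_fe, pred_d):
    """Relative variances of the two components, their correlation, and overall R2."""
    var_y = variance(y)
    total = [a + b for a, b in zip(pred_fe, pred_d)]
    return {
        "r2_fe": variance(pred_fe) / var_y,
        "r2_d": variance(pred_d) / var_y,
        "rho": correlation(pred_fe, pred_d),
        "r2": variance(total) / var_y,
    }

# Synthetic data: the fixed effect component partially offsets the demographic one.
random.seed(1)
n = 500
pred_d = [random.gauss(0, 1) for _ in range(n)]
pred_fe = [-0.3 * d + random.gauss(0, 0.5) for d in pred_d]
y = [f + d + random.gauss(0, 1) for f, d in zip(pred_fe, pred_d)]

parts = r2_decomposition(y, pred_fe, pred_d)
# Rebuild the overall R2 from the components and the correlation.
recombined = (parts["r2_fe"] + parts["r2_d"]
              + 2.0 * parts["rho"] * math.sqrt(parts["r2_fe"] * parts["r2_d"]))
print(parts, recombined)
```

Because the synthetic components are negatively correlated, the sum of the two relative variances exceeds the overall R², mirroring the partial cancellation discussed in the text.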
In the context of social distancing, a positive correlation between the fixed effect predictions and the demographic predictions suggests that counties in more restrictive jurisdictions also had individuals reducing their exposure, while counties in less restrictive jurisdictions had individuals taking on relatively more exposure. A negative correlation suggests that the actions of individuals and jurisdictions partially cancel.
Table 7 shows the variance of y by week as well as R²_FE(t), R²_D(t), R²(t), and the correlation between the fixed effect predictions and the demographic predictions. Several patterns emerge from Table 7. First, the variance increases over eight-fold from week 1 to its peak in week 9, before falling by about 50% by the end of the data in week 20. Second, the overall R²(t) increases from 0.1594 in week 1 to 0.5685 in week 9, before falling to 0.4809 in week 20. Third, the correlation between the fixed effect and demographic predictions is materially negative from weeks 1 through 5, almost zero from weeks 6 to 13, and materially positive from weeks 14 to 20. Notwithstanding their magnitudes, all the correlations, except those in weeks 6, 12, and 13, differed significantly from 0 at the 1% level. Therefore, the additive property of the relative R²(t) components approximately holds from week 6 through 13. Fourth, over weeks 8 through 20 the explanatory power of demographic variables R²_D(t) rises relative to its values over weeks 1 through 7. Note that during this latter half of the sample, states invoked varying portfolios of non-pharmaceutical interventions to encourage social distancing. These actions, as well as a national declaration of a state of emergency and increased media coverage of the pandemic, simultaneously increased the salience of the risk of COVID-19. By week 14, more positive correlations between the fixed effect predictions and the demographic variable predictions emerged, which suggests some reinforcing alignment between jurisdictional and individual behavior.
We now examine the potential for spatial clustering of social distances. Table 8 shows the week-by-week cross-sectional spatial autoregressive coefficients for the change in social distancing (Δ(t)), the overall predicted change in social distancing, the fixed effect predictions (FE), the demographic variable predictions (D), and the regression residuals from the SEM fixed effect model. The overall predicted change in social distance displays more spatial autocorrelation than the actual change in social distance. Breaking this down further into the fixed effect and the demographic predictions, the fixed effect predictions show more spatial autocorrelation than the overall prediction, with an autoregressive coefficient of over 0.98 in each week. Thus, the fixed effect predictions vary slowly over space. This occurs because the effects appear in blocks for all the counties in a state and only show transitions at borders. The demographic predictions also display substantial spatial autocorrelation, from 0.851 to 0.914, and show the potential for peaks and valleys of changes in social distance. For completeness, we present the spatial autoregressive coefficients for the residuals from the SEM model with fixed effects. These are much lower, but still material.
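To see why block-constant predictions look so smooth over space, the following sketch compares a variable that is constant within hypothetical "states" to white noise. It uses Moran's I with row-standardized nearest-neighbor weights as a simpler stand-in for the spatial autoregressive coefficients of Table 8, and it places the hypothetical counties on a line purely for illustration.

```python
import random

def line_knn(n, m):
    """Row-standardized weights: each unit's m nearest neighbors on a line."""
    rows = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: (abs(i - j), j))
        rows.append({j: 1.0 / m for j in order[:m]})
    return rows

def morans_i(z, weights):
    """Moran's I for row-standardized weights: z'Wz / z'z on demeaned z."""
    mean = sum(z) / len(z)
    d = [v - mean for v in z]
    num = sum(d[i] * sum(w * d[j] for j, w in row.items())
              for i, row in enumerate(weights))
    return num / sum(v * v for v in d)

# Hypothetical layout: 10 "states" of 20 contiguous "counties" each.
random.seed(2)
n_states, per_state = 10, 20
n = n_states * per_state
W = line_knn(n, 4)

# Block-constant variable (like a fixed effect prediction) versus white noise.
state_effect = [random.gauss(0, 1) for _ in range(n_states)]
block_constant = [state_effect[i // per_state] for i in range(n)]
white_noise = [random.gauss(0, 1) for _ in range(n)]

i_block = morans_i(block_constant, W)
i_noise = morans_i(white_noise, W)
print(i_block, i_noise)
```

The block-constant series differs from its neighbor average only near state borders, so its spatial autocorrelation is high, whereas the white noise series shows almost none.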
In summary, the change in social distancing shows substantial spatial dependence, which leads to hotspots (peaks) and coldspots (valleys) in social distancing behavior, and the fixed effect predictions are too smooth (spatially autocorrelated) to account for this variation. Of the various components forming the prediction, the demographic predictions come closest to matching the spatial character of actual changes in social distance.
In this section we examine the robustness of the findings to alternative formulations of the model. Specifically, in the following five subsections, we (a) fit SEM without fixed effects; (b) calculate marginal effects from SDM with fixed effects; (c) estimate SEM-FE employing an alternative measure of social distance; (d) examine the correlations among the marginal effects of the demographic variables from various models using different W and different dependent variables; and (e) examine the sensitivity of the results to alternative specifications of the explanatory variables.
SEM without fixed effects
In the specification search section of our paper, we computed the BIC from SEM, SAR, SDEM, and SDM with and without fixed effects. SEM without fixed effects achieved the lowest BIC in 14 out of the 20 weeks. Nonetheless, we selected SEM with fixed effects as the base model because theoretical considerations indicated that the model should have fixed effects. In this subsection, we explore the performance of SEM without fixed effects. Specifically, we fit (27) using maximum likelihood. The results appear in Table 9. The signs and magnitudes of the coefficients seem similar to those for the base model, and we will take up a more detailed comparison in a later subsection. However, we note that the autoregressive coefficient λ(t) has increased substantially from the base model estimates and now ranges from 0.637 to 0.766 versus 0.349 to 0.657 for the base regression in Table 5. This points to spatial error dependence as a more parsimonious substitute for fixed effects across jurisdictions.
SDM with fixed effects
The SDM with fixed effects often resulted in the highest likelihood across weeks. As discussed in the model section of our paper, the SDM in (33) results in marginal effects that do not equal parameter estimates. Accordingly, we estimated (33) and computed the direct and indirect average marginal effects, which appear in Tables 10 and 11. We omit the estimates of fixed effects as these were not significant at the one percent level. As a casual summary, the estimated direct effects from the SDM with fixed effects closely match the estimates from SEM with fixed effects (for the SEM, the estimates are the marginal effects). As before, we will take up a more detailed comparison between the base regression, SEM without fixed effects, and the direct effects from SDM with fixed effects in a later subsection.
Different measure of social distance
We now turn to an alternative measure of social distance. We examine median home time as opposed to the previous analysis of percent of time spent at home. Table 12 shows the percentiles of this variable by week. Like percent of time spent at home, median home time started at a lower level, reached a peak around week 12, and then declined.
Table 13 provides estimates of SEM with fixed effects for the exogenous variables. The results qualitatively match those from estimating SEM with fixed effects when using the percentage stayed at home dependent variable. Namely, Pop, Child, and Income have a positive effect on social distancing while HS has a negative effect on social distancing. The White variable shows some differences from the previous analysis, as it goes from significant and positive in the early weeks to significant and negative in the last two weeks. We will go into a further analysis of the similarity of the estimates from different methods, different W, and the two dependent variables in the following section.
Correlations of marginal effects across models
We now compare eight specifications: (1) the base case SEM with fixed effects using 15 nearest neighbors; (2) direct effects from the SDM with fixed effects using 15 nearest neighbors; (3) SEM without fixed effects (intercept only) using 15 nearest neighbors; (4) direct effects from SDEM with fixed effects using 15 nearest neighbors; (5) direct effects from SAR with fixed effects using 15 nearest neighbors; (6) SEM with fixed effects using W based on 10 nearest neighbors; (7) SEM with fixed effects using W based on 20 nearest neighbors; and (8) SEM with fixed effects and 15 nearest neighbors using the alternative dependent variable of median home time. We do this for four variables: Child, HS, White, and Income. To avoid an additional table, we did not do this for Pop, but the results are similar.
We compute the correlations between the methods across weeks and the results appear in Table 14. The upper triangle gives correlations among estimated coefficients across specifications for the Child variable and the lower triangle does the same for the HS variable. We see that the various methods appear to substantially agree with each other, with the minimum correlation across methods of 0.59 for the Child variable estimated by SAR with fixed effects using log of percent home and SEM with fixed effects using log of median home time. The largest correlations were 1.0000 for the HS variable estimated with SEM with fixed effects and SDM with fixed effects.
We perform a similar exercise when examining the White and Income variables in Table 15. The lowest correlation for White of 0.7436 is between SEM with fixed effects using log of median home time as the dependent variable and SAR with fixed effects when using log of percent home as the dependent variable. For income the lowest correlation is 0.9473 between SAR with fixed effects when using log of percent home as the dependent variable and SEM with fixed effects when using log of median home time as the dependent variable. Overall, the correlations between the estimated coefficients are quite high.
The high correlations among the marginal effects across the models provide indirect support for the basic specification. Models that differ only in the specification of the disturbances should yield very similar estimates for large data sets when the explanatory variable part of the model is correctly specified. Formalizing this intuition, Pace and LeSage [35] proposed a Hausman test for the spatial error model relative to OLS. To the degree that the SDM and SEM models showed relatively weak explanatory power of indirect effects, these models also come closer to being error models, and thus the agreement between their results and those from other models should not be surprising. In addition, LeSage and Pace [36] showed that small changes in W may have less effect than commonly thought because a 20 nearest neighbor weight matrix and a 19 nearest neighbor weight matrix have 95% of the same positive elements when basing the choice of neighbors on the same metric. For two weight matrices W_a with m_a neighbors and W_b with m_b neighbors, where m_a ≤ m_b and v ∼ N(0, I_n), corr(W_a v, W_b v) = (m_a/m_b)^(1/2). This also appears to happen here since we obtained virtually the same results from using 10, 15, and 20 nearest neighbors.
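The overlap and correlation results quoted above can be illustrated with a short sketch. It assumes row-standardized nearest-neighbor weights built from the same Euclidean metric on hypothetical random locations, and it evaluates the correlation element by element: for v ∼ N(0, I_n), the correlation between the i-th elements of W_a v and W_b v is the cosine of the angle between the i-th rows of the two matrices.

```python
import math
import random

def knn_weights(points, m):
    """Row-standardized m-nearest-neighbor weights as a list of {j: w} dicts."""
    rows = []
    for i, (xi, yi) in enumerate(points):
        # Rank all other points by squared Euclidean distance to point i.
        order = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (points[j][0] - xi) ** 2 + (points[j][1] - yi) ** 2,
        )
        rows.append({j: 1.0 / m for j in order[:m]})
    return rows

def row_correlation(row_a, row_b):
    """corr((Wa v)_i, (Wb v)_i) for v ~ N(0, I): cosine of the two weight rows."""
    cov = sum(w * row_b.get(j, 0.0) for j, w in row_a.items())
    var_a = sum(w * w for w in row_a.values())
    var_b = sum(w * w for w in row_b.values())
    return cov / math.sqrt(var_a * var_b)

# Hypothetical locations for illustration only.
random.seed(0)
points = [(random.random(), random.random()) for _ in range(200)]
W10, W19, W20 = (knn_weights(points, m) for m in (10, 19, 20))

# Correlation between 10- and 20-neighbor spatial averages, row by row.
corr_10_20 = [row_correlation(a, b) for a, b in zip(W10, W20)]
# Share of the 20-neighbor positive elements also present with 19 neighbors.
shared_19_20 = [len(a.keys() & b.keys()) / len(b) for a, b in zip(W19, W20)]
print(corr_10_20[0], math.sqrt(10 / 20), shared_19_20[0])
```

Because the smaller neighbor set is nested in the larger one, every row reproduces (m_a/m_b)^(1/2), and the 19- and 20-neighbor matrices share 95% of their positive elements.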
Explanatory variable choice and R² decomposition
The base model used five variables measuring population density, age, education, race, and income (all logged). Of course, many other specifications could have been used, which raises the question of how robust the decomposition into fixed effect and demographic predictions is. In this subsection we provide an augmented model and examine the resulting decomposition. In the alternative model we augment the base model with (logged) non-black (coded this way to avoid 0), high school education (which differs from HS in the base model because not having a college degree does not imply having a high school degree), household size, percent on social security, percent with insurance, percent of children, and three variables measuring the percent of individuals of age 60 and older. The augmented model provides some alternative proxies for basic demographics. To keep the dimension of the model the same, we took all of the augmented variables, extracted the first five principal components, and repeated the R²(t) decomposition in Table 16. Relative to Table 7, the R²(t) for the principal components model is usually higher by a small amount in the later periods, and the variance of the demographic predictions relative to the fixed effect predictions is somewhat higher. The principal components reach their highest relative variance in week 18 at 5.1758 versus 4.7717 for the base model in week 14. As this illustrates, better models have the potential to show that demographic variables have a greater influence on social distancing behavior than the base model indicates. This suggests that, as more demographic variables are added and the model is further optimized, the importance of demographic variables will rise relative to fixed effects.
We employed cell phone tracking data to obtain a picture of how stay-at-home behavior evolved during the initial spread of COVID-19 in the United States. Because of simultaneity issues associated with stay-at-home behavior and the spread of the disease, and the problems of decentralized disease measurement in the United States, we posited a reduced form model that uses changes in stay-at-home behavior by county over time as a function of exogenous demographic variables such as population density, households with children, education, race, and income. Due to the contagious nature of COVID-19, we focused on possible spatial aspects of the behavior. We modeled spatial aspects using: (1) changes in social distancing behavior that reduce the influence of omitted variables, spatial or otherwise; (2) fixed effects for each jurisdiction; (3) spatial spillovers as coming from the exogenous demographic variables from surrounding counties; and (4) spatial autoregressive behavior of disturbances from surrounding counties.
Our research produced three main results. First, we found that as the crisis progressed the demographic exogenous variables explained an increasing proportion of the overall variance of behavior. By week 12, demographic exogenous variables explained over three times as much variance as jurisdictional fixed effects; by week 18 demographic variables explained over four times as much variance as jurisdictional fixed effects, before falling to 3.76 times in week 20. Second, over the span of the data, the correlation between fixed effect predictions and demographic variable predictions went from significant and negative in weeks 1–5 to relatively low magnitude correlations in periods 6–13, then finished with significant and positive correlations in periods 14–20. A negative correlation signifies fixed effect predictions and demographic variable predictions partially cancel each other. A zero correlation signifies that the fixed effect predictions and demographic variable predictions operate independently. A positive correlation signifies that fixed effects predictions and demographic variable predictions reinforce or complement each other. In other words, states with positive fixed effects also had larger amounts of social distancing based on demographic variables. Third, we found that observed social distancing and the demographic variables showed high levels of spatial autocorrelation or clustering which result in hotspots or peaks as well as coldspots or valleys in social distances. In summary: (1) fixed effects explain less of the variance in social distancing behavior relative to demographic variables; (2) correlations exist between fixed effect predictions and demographic variable predictions that result in partial cancellation or reinforcement in the overall predicted social distances; and (3) observed social distances and demographic variables display high levels of spatial autocorrelation or clustering.
What are the implications of our results? First, the strong tendency of social distances to cluster has a number of ramifications. Clustering of low social distance counties could show increasing levels of disease while clusters or coldspots of high social distance counties could show decreasing levels of disease. Even if this leads to an aggregate decrease over some period, having clusters of low social distance counties with increasing incidence of disease may impede economic recovery, even in clusters of high social distance counties, because of the potential for contagion between low and high social distance county clusters. Second, although relatively severe restrictions have worked in various places, in the absence of binding jurisdictional restrictions, persuading individuals to increase their social distancing voluntarily may provide a lower cost means to reducing disease incidence. Third, such persuasion may also improve acceptance of restrictions and result in a positive correlation between individual and jurisdictional actions that reinforces the amount of social distancing. This potentiation or reinforcement provides additional returns to NPIs that raise actual or effective (e.g., masks) social distancing.
We thank James P. LeSage, Texas State University, and Xiaoyu Zhou, SAS Institute, for their detailed comments and Frank Miele for providing valuable editorial suggestions. We also thank SafeGraph for providing social distancing data, free of charge, for academic research.
- 1. Kennedy P. A guide to econometrics. Third Ed., Cambridge: MIT Press; 1982.
- 2. Carroll RJ, Ruppert D, Stefanski LA, Crainiceanu CM. Measurement error in nonlinear models, a modern perspective. Boca Raton: Chapman and Hall. 2006.
- 3. Lam NS, Arenas H, Pace RK, LeSage JP, Campanella R. Predictors of business return in New Orleans after Hurricane Katrina. PLoS One. 2012;7(10): e47935.
- 4. Lam NS, Pace RK, Campanella R, LeSage JP, Arenas H. Business return in New Orleans: decision making amid post-Katrina uncertainty. PLoS One. 2009;4(8): e6765.
- 5. Stock JH. Data gaps and the policy response to the novel coronavirus. NBER [Preprint]. 2020 NBER 26902 [posted May 2020; cited 25 May 2020]. Available from: https://www.nber.org/papers/w26902.
- 6. Avery C, Bossert W, Clark A, Ellison G, Ellison S. Policy implications of models of the spread of coronavirus: perspectives and opportunities for economists. NBER [Preprint]. 2020 NBER 27007 [posted April 2020; cited 25 May 2020]. Available from: https://www.nber.org/papers/w27007.
- 7. Allcott H, Boxell L, Conway J, Gentzkow M, Thaler M, Yang D. Polarization and public health: partisan differences in social distancing during the coronavirus pandemic. SSRN [Preprint]. 2020 SSRN 3570274 [posted 7 Apr 2020; revised 19 May 2020; cited 25 May 2020]. Available from: https://ssrn.com/abstract=3570274.
- 8. Aum S, Lee SY, Shin Y. Inequality of fear and self-quarantine: is there a trade-off between GDP and public health? NBER [Preprint]. 2020 NBER 27100 [posted May 2020; cited 25 May 2020]. Available from: https://www.nber.org/papers/w27100.
- 9. Barrios J, Hochberg Y. Risk perception through the lens of politics in the time of the COVID-19 pandemic. NBER [Preprint]. 2020 NBER 27008 [posted April 2020; cited 25 May 2020]. Available from: https://www.nber.org/papers/w27008.
- 10. Brzezinski A, Kecht V, Van Dijcke D, Wright A. Belief in science influences physical distancing in response to COVID-19 lockdown policies. SSRN [Preprint]. 2020 SSRN 3587990 [posted 4 May 2020; cited 25 May 2020]. Available from: https://ssrn.com/abstract=3587990.
- 11. McFadden S, Malik A, Aguolo O, Willebrand K, Omer S. Perceptions of the adult US population regarding the novel coronavirus outbreak. PLoS One. 2020;15(4): e0231808.
- 12. Painter M, Qiu T. Political beliefs affect compliance with COVID-19 social distancing orders. SSRN [Preprint]. 2020 SSRN 3569098 [posted 6 April 2020; cited 25 May 2020]. Available from: https://ssrn.com/abstract=3569098.
- 13. Rampini A. Sequential lifting of COVID-19 interventions with population heterogeneity. NBER [Preprint]. 2020 NBER 27063 [posted April 2020; cited 25 May 2020]. Available from: https://www.nber.org/papers/w27063.
- 14. Acemoglu D, Chernozhukov V, Werning I, Whinston M. A multi-risk SIR model with optimally targeted lockdown. NBER [Preprint]. 2020 NBER 27102 [posted May 2020; cited 25 May 2020]. Available from: https://www.nber.org/papers/w27102.
- 15. Anderson M, Maclean J, Pesko M, Simon K. Effect of a federal paid sick leave mandate on working and staying at home: evidence from cellular device data. NBER [Preprint]. 2020 NBER 27138 [posted May 2020; cited 25 May 2020]. Available from: https://www.nber.org/papers/w27138.
- 16. Baker MG, Peckham TK, Seixas NS. Estimating the burden of United States workers exposed to infection or disease: a key factor in containing risk of COVID-19 infection. PLoS One. 2020;15(4): e0232452. https://doi.org/10.1371/journal.pone.0232452.
- 17. Chiou L, Tucker C. Social distancing, internet access and inequality. NBER [Preprint]. 2020 NBER 26982 [posted April 2020; cited 25 May 2020]. Available from: https://www.nber.org/papers/w26982.
- 18. Mongey S, Pilossoph L, Weinberg A. Which workers bear the burden of social distancing policies? NBER [Preprint]. 2020 NBER 27085 [posted May 2020; cited 25 May 2020]. Available from: https://www.nber.org/papers/w27085.
- 19. Fernández-Villaverde J, Jones C. Estimating and simulating a SIRD model of COVID-19 for many countries, states, and cities. NBER [Preprint]. 2020 NBER 27128 [posted May 2020; cited 25 May 2020]. Available from: https://www.nber.org/papers/w27128.
- 20. Raifman J, Nocka K, Jones D, Bor J, Lipson S, Jay J, et al. COVID-19 US state policy database. 2020 [cited 17 July 2020]. Available from: www.tinyurl.com/statepolicies.
- 21. Cheng M, Papenburg J, Desjardins M, Kanjilal S, Quach C, Libman M, et al. Diagnostic testing for severe acute respiratory syndrome–related Coronavirus 2. Annals of Internal Medicine. 2020;172(11): 726–734.
- 22. Sharfstein J, Becker S, Mello M. Diagnostic testing for the novel Coronavirus. JAMA. 2020;323(15): 1437–1438.
- 23. Davidson R, MacKinnon J. Estimation and inference in econometrics. New York: Oxford University Press. 1993.
- 24. Christ C. The Cowles commission’s contributions to econometrics at Chicago, 1939–1955. Journal of Economic Literature. 1994;32(1): 30–59.
- 25. Bavaud F. Models for spatial weights: a systematic look. Geographical Analysis. 1998;30: 153–171.
- 26. LeSage JP, Pace RK. Introduction to spatial econometrics. Boca Raton: Taylor and Francis, CRC Press; 2009.
- 27. Elhorst JP. Applied spatial econometrics: raising the bar. Spatial Economic Analysis. 2010;5(1): 9–28.
- 28. Vega SH, Elhorst JP. The SLX model. Journal of Regional Science. 2015;55(3): 339–363.
- 29. Guliyev H. Determining the spatial effects of COVID-19 using the spatial panel data model. Spatial Statistics. 2020; 100443. Advance online publication. https://doi.org/10.1016/j.spasta.2020.100443.
- 30. Krisztin T, Piribauer P, Wögerer M. The spatial econometrics of the coronavirus pandemic. 2020 Working paper [posted April 2020; cited 25 May 2020]. Available from: https://www.researchgate.net/publication/340620046_The_spatial_econometrics_of_the_coronavirus_pandemic.
- 31. Ord JK. Estimation methods for models of spatial interaction. Journal of the American Statistical Association. 1975;70: 120–126.
- 32. Schwarz G. Estimating the dimension of a model. The Annals of Statistics. 1978;6(2): 461–464.
- 33. Akaike H. A new look at the statistical model identification. IEEE Transactions on Automatic Control. 1974;19(6): 716–723.
- 34. Bozdogan H. Model selection and Akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika. 1987;52(3): 345–370.
- 35. Pace RK, LeSage JP. A spatial Hausman test. Economics Letters. 2008;101(3): 282–284.
- 36. LeSage JP, Pace RK. The biggest myth in spatial econometrics. Econometrics. 2014;2(4): 217–249.