Space-Time Covid-19 Bayesian SIR modeling in South Carolina

The Covid-19 pandemic has spread across the world since the beginning of 2020. Many regions have experienced its effects. The state of South Carolina in the USA has seen cases since early March 2020 and a primary peak in early April 2020. A lockdown was imposed on April 6th but lifting of restrictions started on April 24th. The daily case and death data as reported by NCHS (deaths) via the New York Times GitHUB repository have been analyzed and approaches to modeling of the data are presented. Prediction is also considered and the role of asymptomatic transmission is assessed as a latent unobserved effect. Two different time periods are examined and one step prediction is provided.


1) Introduction
The Covid 19 pandemic has spread across the world since the beginning of 2020. In the US there has been a large increase of cases (symptomatic) and deaths attributable to Covid19. Many serious issues remain about predictive modeling, interventions and vaccines (for a recent overview see Banks et al, 2020). Various data sources are now available that provide daily counts of cases and fatalities for this disease. In this report we demonstrate the use of a Bayesian susceptible-Infect-removed model with spatial geo-referencing at the county level in South Carolina. The model specifies a transmission rate within each county as well as optionally including neighboring county transmissions. Asymptomatic cases are assumed to be a proportion of symptomatic and are included in the transmission model. An accounting equation is assumed where the susceptible population is decremented by infectives and removals. Predictions based on this model are possible, but the focus of this paper is mainly on the model development and fitting. As the pandemic of Covid-19 develops, a variety of approaches to modeling its spread are being presented ( e. g. Calavetti et al, 2020;Calafiore, 2020;Rouabah et a;. Many of these models are time series-based, and do not incorporate spatial structure in the modeling process. Within Bayesian modeling, few published examples exist where the asymptomatic process is modelled (Zhou et al , 2020). However it is clear that cross boundary transmission must occur (at least at the county level) and so ignoring this must limit the ability of the models to account for this feature. In addition the use of random effects to allow for confounding is often ignored. In the present approach we examine a range of models with different assumptions about transmission and spatial dependence both in terms of fixed predictors and

2) Data sources
Both Johns Hopkins University CSSE (https://coronavirus.jhu.edu/map.html) and New York Times (NYT) (https://github.com/nytimes/covid-19-data ) provide data resources at the county level within states of the US. In this report we have analysed the raw data hosted by NYT provided from NCHS (deaths) and state health departments (cases). The data in general is under-ascertained and potentially recorded at time of analysis with error due to misreporting and changes in recording. This is especially true for death data where originally only hospital deaths from Covid19 were recorded and reporting lag time is also present.
It is known that Covid19 has associated asymptomatic cases as well as symptomatic. Asymptomatic cases are not reported usually, could be pre-symptomatic or not, and so cannot be observed unless widespread repeated surveys are carried out in the population. Hence, as these cases can infect others, they must be accounted for in any model. Deaths from Covid-19 for asymptomatic cases are not recorded at all, as they are not tested.
In what follows we assume that the case count (symptomatic) and death count at county level for each day, are the data to be modelled. Of course the under-ascertainment of both case count and deaths is a caveat in the subsequent interpretation. Handling of asymptomatic and under-ascertainment is discussed in the modelling section.

3) Model Development
Susceptible-Infected-removed (SIR) compartment models have been used extensively for infection modeling (Keeling and Rohani, 2008). Here the seasonal influenza SIR models by Lawson and Song (2010) (L&S) which were applied spatially to county level biweekly count data (C+ve laboratory notifications) for a single flu season, have been modified and employed to model the Covid-19 count data. These were developed for a spatial application from the time series models of Morton and Finkenstädt (2004)

 
We also assume that the county susceptible population ( ij S ) at any given time is available. As this does not vary much over a period of a few months then we can assume that the susceptible population at the beginning of the monitoring can be used as a baseline. We do not have access to disease data on sub groups within the population, such as age or gender, and so we must assume a common population base.
The susceptible population in each spatial region (county) is quite large compared to the case count and so it is reasonable to assume a Poisson model for the symptomatic counts, i.e.
(1) Note that the conditional independence in the Bayesian hierarchical formulation allows us to assume independence and hence a Poisson data model. An alternative to use a binomial spatial model would be possible also. This doesn't prohibit any extra variation or overdispersion which often arises, as in the Bayesian formulation such variation can be accommodated at higher levels of the hierarchy via prior distributions.
The Poisson mean (1) ( as for M&F) is assumed to be a function of current susceptibles and some function of previous infectives as well as other propagating effects (e. g. random effects or socio-economic predictors/covariates). Note that we also assume that the observed symptomatic cases are to be modeled as opposed to the unobserved total case count. M&F assumed that you can use a binomial model with unobserved total count as denominator (such as ( , ) s y m t r i j i j where  is a scaling factor) and the total (latent) count is subsequently modelled. However this is a computationally very expensive undertaking, as none of the t r i j y are observed and the complete latent field would have to be estimated. It is simpler to model the observed count and use a scaling factor to estimate the true count at a later stage. This was the approach adopted by L&S and is commonly assumed by health departments during conventional influenza seasons.
In addition to the data model for new counts we also assume an accounting equation which is used to update the susceptible population at each time point. We assume a susceptible -Infected -removed (SIR) model, which is the simplest compartment model. SEIR models including an exposed category are infeasible due to the lack of information about exposure transitions. We assume that susceptibles, if infected, move to the infected category (symptomatic and asymptomatic cases) and either are subsequently removed via death or other forms such as recovery. In the Covid19 data available currently for the US we do not have information on recovered individuals nor direct evidence of asymptomatic cases. As these can affect the progression of the infection we must make assumptions about them. We assume in what follows that the recovery rate is a proportion of the infective count.
The accounting equation is: where the current susceptible population is updated by decrementing the previous population by (true) infected count and removal. In our modeling we replace the , 1 t r i j y  by the sum of asymptomatic and symptomatic cases for that period ( , , sym asym i j i j y y -- where is the proportion of asymptomatics in the population ). This is equivalent to the extended latent influx process of M&F. Also removal is defined as Note that b 0 is a transmission rate (on the log scale) and we could estimate R 0 based on but when variants are used such as time dependent j b 0 or spatially dependent i b 0 then this is inappropriate. Model 1 has simple prior dependence on the last infective count in the region, and a spatially structured effect i b . Model 2 has an added neighborhood dependence, which could represent cross-boundary infection. Model 3 is as for model 1 but with a county level poverty covariate, and model 4 and 5 include that covariate but also have different transmission parameterisations. Model 5 drops the spatial random effect as the transmission is spatially dependent, but includes a uncorrelated spatial effect ( i v ). The inclusion of the spatial effects is really to allow for residual confounding whether spatially structured or not.

Prior distributions
The following prior distributions were assumed for the free model parameters. For fixed b 0 and b 1 and b 2 we assume a zero mean Gaussian prior distribution with precision having a gamma (2,0.5) . The spatially structured effect is assumed to have a markov random field prior distribution (ICAR) whereby The precision has a gamma prior distribution also. The ICAR model was also assumed for the i b 0 in model 5. The temporally dependent j b 0 was allowed to have an uncorrelated zero mean Gaussian prior distribution with precision which also had a gamma (2,0.5) distribution. Prior distribution choice is as to be as non-informative as possible, where appropriate. The gamma prior distributions on precisions are weakly informative (STAN wiki 2020).

Computation
Our Bayesian hierarchical models and variants can be conveniently implemented via the Bayesian modeling package nimble. This package is available on R and allows the modeling of relatively complex hierarchical models with spatial signatures. The package includes both ICAR and PCAR variants and is based on parsed versions of BUGS code but composes its own C++ samplers. Its benchmark speed is exceptional for a range of problems.
All models were run to convergence, and checked using single chain Geweke diagnostics.  The deviance (D) based measure DIC (Spiegelhalter et al, 2014) was used, with the effective number of parameters computed from the conservative ˆv ar( ) / pD D = 2 , which is always available from MCMC samples. Lower values of DIC suggest improved fit.

Daily Data: January 22 nd to April 12 th
The first analysis undertaken is for data from the period January 22 nd to April 12 th . This consists of 82 days. Most of the early incidence is sparse and so many days through January and February have zero counts for virtually all counties. Figure1 displays the county map of South Carolina. The main urban areas in the state are in Greenville, Spartanburg, Richland, Charleston and Horry counties. As early as March 7 th cases appeared in Kershaw county, which is close to the state capital county of Richland. After this, most counties experience incidence which quickly rises to various peaks in early April. Figure 2 displays the symptomatic count and death time profiles for four counties.

Figure 2 Daily counts of Covid-19 positive symptomatic cases and deaths for four SC counties for the period January 22 nd to April 12 th .
It is notable that the coastal communities of Horry and Beaufort counties, which have an older demographic, have earlier initiation of infection and earlier deaths than the main urban areas of Charleston and Richland counties.
The comparison of the models 1) -5) yielded the following DICs in Table 1. It is clearly the case that the lowest DIC is for model 4 and that models 1, 2, 3, 5 are not competitive. Model 4 with its time dependent transmission rate is clearly better in describing these data than other models. Model 4 variants were examined also. Model 4b allowed the asymptomatic rate (j ) to change from 0.25 to 0.5 in the model. While model 4c assumes a rate of 10%. The lowest DIC is for model with 50% asymptomatic, although the 25% rate has only a slightly higher DIC. Model 4d variant simply removed the ICAR spatial component and also yielded a larger DIC. It's also notable that the ICAR component seems to be important in the description of the residual confounding in these data. The Appendix displays the nimble code for the model 4).

Figure 3 Posterior mean infection risk profiles for Richland and Kershaw counties for model 4
and Kershaw counties. Figure 4 displays the posterior mean risk profiles for Beaufort and Charleston counties. It is clear that the SIR model with time dependent transmission rate and spatial confounding appears to fit the observed infection profiles reasonably well.

Figure 4 Posterior mean infection risk profiles for Beaufort and Charleston counties for model 4.
The overall DIC for the model is 14428.8, with pD of 128.2 which is considerably lower that alternate models considered.

Daily Data: April 2 nd to June 29 th
The early period data demonstrates the beginnings of the outbreak where there were zero cases until March and then a large increase during March and April. Crucially, the state was not completely locked down until April 6 th when a stay-at-home order was issued. By this time there had been a large increase in cases. Partial lifting occurred on April 24 th when some businesses were allowed to open. Restaurants opened on May 1 st . Following this date a gradual rise in cases numbers occurred as social distancing and face covering recommendations were apparently not adhered to, both in the general population and in hospitality industry venues.
A range of models as per the earlier time period were fitted to these data. Table 2 displays the goodness of fit results. For these data model 3 provides the lowest DIC, although model 1 and model 5 have close values. Model 3 assumes an asymptomatic rate of 25% and includes the poverty covariate. Variants of model 3 with 50% asymptomatic and also no ICAR component had much higher DICs. The posterior mean parameter estimates (and SDs) for the different components are 0 b : 8.036 (0.064), 1 b : -0.542 (0.0052), 2 b : -0.800 (0.0031). These are well estimated and suggests an average transmission of 0.6 over the whole period, and a dependence on % poverty of 0.449. It is notable that the asymptomatic rate assumption of 25% proved to yield a better fit than 50%.  Figures 5 and 6 display the case count profiles and posterior mean estimates for a selection of counties around the state, for model 3. It is notable that the estimates track the case counts in a smooth manner but tend to adjust upward the lower counts during April and May but are downward biased during June, when the counts increases markedly. Figure 7 displays the estimated mean square error profile for the fitted model 3, computed at each time point. This displays the variation in fit and suggests a good fit in early days but also supports that the model fits progressively less well when counts spike.

Prediction
Both within sample and beyond sample prediction could be important in the context of infection modeling. The mean square predictive error is a measure of loss between predicted model and observed outcomes within sample (Gelfand and Ghosh, 1998). Figure 8 displays the MSPE for the fitted model 3 for these data, computed at each time point. It is noticable that the error is reasonably small until a large peak or jump in risk occurs although the error does increase with increasing spiking of risk. As part of the predictive capability of the Bayesian space-time models it is possible to consider 'step ahead' forecasts for these data. There are a number of ways that predictive inference can be used for temporal prediction. Here a simple one step (one day) prediction is made using the final day (June 29 th ) to provide accounting update and model dependence on the final observed counts. Figure 9 displays the prediction of SC counties for June 30 th , with associated 95% credible interval (shaded).areas. In particular Charleston, Greenville, Horry, Lexington, Richland and Spartanburg have wide ranges. Its clear that the prediction interval is highly asymmetric with a long tail in the higher risk levels.
Any prediction beyond one step that would involve extra variation due to the use of estimated quantities in the modeling process and hence uncertainty would be propagated.

Models fitted to smoothed 3 day averages
Smoothed 3-day average data provides for some removal of anomalies in the count data. Anomalies such as weekend reporting delays and changes in recording, could be ameliorated to some degree by such smoothing. The smoothed counts could be assumed to be Poisson distributed (even if averaged) and the smoothing will induce temporal correlation that can be accommodated by the models. However, for consistency, the 3 day average of the centered counts was computed, which produced smooth estimates at each daily time point. This should also reduce the differentiation of neighboring areas and anomalies relates to weekend nonreporting. Figure 10 displays the smoothed case data and death count for 4 SC counties for the first period January 22 nd to April 12 th . Figure 11 displays the smoothed case and death data for April 2 nd -to June 29 th . Assume that sm ij y is the 3 day smoothed symptomatic case average. As this is now continuous and strictly positive, a different data model must be assumed. A log normal model for the 3 day average was adopted so that . A range of models for ij m have been examined as in the non-smoothed case.
Both the original January-April and April-June data have been analyzed using the models 1) -5) The results are found in Table 3. It is notable that for the earlier data the model with the spatially varying intercept parameter (model 5) has lowest DIC. This suggests that spatial structure plays an extra role in the earlier period. In contrast in the later period the model with only the poverty covariate and spatial CAR effect (model 3) has lowest DIC. For the later period a one step ahead forecast has also been obtained from the smoothed model (3). Figure 12 displays the smoothed 3 day prediction for the later data set. It is clear that Charleston, Greenville, Horry and Richland have the largest intervals now and largest predicted counts, and the influence of smoothing has reduced the prediction for Lexington and Spartanburg.

Discussion
The main results from the different data sources and models are quite revealing. First it is clear that for the early unsmoothed data the large increase in risk in late March-early April favored a large asymptomatic rate and time dependence. This is likely due to the large variation over time from zero counts to high incidence. For the later period from April 2 nd to June 29 th an asymptomatic rate of 25% yielded a better fit and a model with both the poverty covariate and spatial residual CAR effect were best for either the crude data or the smoothed data. In general the smoothed data support a lower asymptomatic rate of 25%, inclusion of the county level poverty covariate and spatial structure, either as a spatially-structured intercept (model 5) or as a residual confounding (model 3). This further suggests that in the earlier period the time dependence in the risk in the unsmoothed data was heterogeneous with respect to county differences and so for the smoothed data spatial differentiation of risk was more obvious.  Figure 13 displays the residual spatial ICAR effect for model 3 for the smoothed April-June data set. It is clear that the counties with greatest residual risk unexplained are rural in nature: the urban areas have much lower and negative residual effect. Figure 14 displays the spatially structured ICAR intercept effect for model 5 for the Jan-April smoothed data. This suggests that transmission is differentiated spatially and is higher in rural areas. smoothing. Smoothed data could induce greater spatial correlation and this was highlighted above. 2) The asymptomatic rate was varied in these models and was found to best fit when it was high (50%) for early unsmoothed data with a large spike in risk. More commonly 25% rate was favored for later data and the smoothed 3 day average data in general. CDC2020 lists a range of rates but cites 40% as the current best estimate 3) Poverty does provide better explanation of risk variation in these county level examples, and hence its surrogates of unemployment, low median income, and ethnicity differences could also be assumed to play a role. These additional predictors have not been included here but may help to reduce the residual spatial structure in the risk variation. 4) The value of spatio-temporal Bayesian modeling lies in the ability to flexible add both observed explanatory variables and unobserved latent effects to model disease risk. In our models we have achieved a reasonable measure of fit and made comparisons of different models with different assumptions, provided insight into asymptomatic rate and spatial differentiation of risk. The role of poverty is also clear. The ability to add crossboundary transmission as a gravity effect is also a feature of the modeling, although in this case this was not selected in the best model.