Predicting increases in COVID-19 incidence to identify locations for targeted testing in West Virginia: A machine learning enhanced approach

During the COVID-19 pandemic, West Virginia developed an aggressive SARS-CoV-2 testing strategy which included utilizing pop-up mobile testing in locations anticipated to have near-term increases in SARS-CXoV-2 infections. In this study, we describe and compare two methods for predicting near-term SARS-CoV-2 incidence in West Virginia counties. The first method, Rt Only, is solely based on producing forecasts for each county using the daily instantaneous reproductive numbers, Rt. The second method, ML+ Rt, is a machine learning approach that uses a Long Short-Term Memory network to predict the near-term number of cases for each county using epidemiological statistics such as Rt, county population information, and time series trends including information on major holidays, as well as leveraging statewide COVID-19 trends across counties and county population size. Both approaches used daily county-level SARS-CoV-2 incidence data provided by the West Virginia Department Health and Human Resources beginning April 2020. The methods are compared on the accuracy of near-term SARS-CoV-2 increases predictions by county over 17 weeks from January 1, 2021- April 30, 2021. Both methods performed well (correlation between forecasted number of cases and the actual number of cases week over week is 0.872 for the ML+Rt method and 0.867 for the Rt Only method) but differ in performance at various time points. Over the 17-week assessment period, the ML+Rt method outperforms the Rt Only method in identifying larger spikes. We also find that both methods perform adequately in both rural and non-rural predictions. Finally, we provide a detailed discussion on practical issues regarding implementing forecasting models for public health action based on Rt, and the potential for further development of machine learning methods that are enhanced by Rt.


44
Rural communities in the United States (US) have been heavily impacted by the novel coronavirus 45 (SARS-CoV-2) pandemic. SARS-CoV-2 related deaths have occurred disproportionately among 46 rural areas of the US, and negative impacts on health and economic well-being have been 47 Two strategies to predict near-term increases in SARS-CoV-2 cases were developed using recent 65 county-specific incidence of infections and Rt -one method is a dynamical algorithm-based 66 prediction using Rt and the serial interval while the second method uses a long short-term 67 memory (LSTM) machine learning strategy. The goal was to recommend counties of outbreak for 68 targeted testing. Here we compare accuracy of the two methods to predict short-term increases 69 in county-specific SARS-CoV-2 incidence and discuss conditions favoring one method or the the Rt Only method, a forecast based on the reproduction number and associated serial interval 79 that predicts the future Rt that is then extrapolated to estimate the number of future cases; 2) a 80 Long Short-Term Memory (LSTM) machine learning model (ML+Rt) that utilizes the reproduction 81 number from the Rt Only method as an input, but also utilizes total cases and population, among 82 other inputs, to predict the total number of cases for a given period of time. 83 84 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 7, 2021. ; https://doi.org/10.1101/2021.10.06.21264569 doi: medRxiv preprint We received daily reports of all daily COVID-19 polymerase chain reaction (PCR) and antigen 85 testing results conducted in WV since March 2020 directly from the WV Department of Health 86 and Human Resources (WVDHHR). Noteworthy is that all SARS-COV-2 testing data are required 87 to be reported to WVDHHR. Information for each unique patient is collected and contains test 88 procurement date, test result date, patient zip code, patient county of residence, testing site 89 name, county where the test is obtained, and test result. As patients who test positive may be 90 tested multiple times, we only consider the first positive tests on a patient. When applying this 91 filter, we consider data obtained from all testing sites (i.e., hospital, clinic, pharmacy, drive-92 through, mobile van). The number of daily cases for each county is calculated by adding the lab 93 confirmed cases and clinical confirmed cases after filtering out repeated tests or COVID-19 94 diagnoses. This daily incidence data on first diagnosed infection is the basis for calculation of Rt. al. (Thompson, et al., 2019). This approach relates the daily incidence (number of new cases) to 100 past cases through an instantaneous reproduction number Rt which characterizes the daily 101 dynamics of transmission reflects a multitude of factors relating to individual and group behavior 102 the rate parameter Γ coincides with the average daily incidence, ⟨ ⟩ = Γ. In the instantaneous 107 % framework, the expected incidence on day is a product of two quantities, the infection 108 potential and the reproduction number, Γ % =Λ % % . The infection potential Λ % summarizes the 109 record of past cases in the community and the typical variation of the infectiousness of an 110 individual over time. 111

112
The infection potential Λ % is determined by the incidence %$& on prior days = 1,2, ⋯ and the 113 infections on day , due to past cases in the community, under ideal "steady state" conditions, 124 such that over time, each primary case causes exactly one secondary case. 125 126 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 7, 2021. ; https://doi.org/10.1101/2021.10.06.21264569 doi: medRxiv preprint The time varying reproduction number, % , captures conditions of transmission that are external 127 to the infected individuals and reflect community behavior. In this framework, % is a random 128 variable with a Gamma distribution ( ) = ) * % +(-) -$) $//* . The parameters % , % are 129 determined for each day through Bayesian (maximum a posteriori probability) estimation. The 130 parameters of interest are estimated using incidence data up to and including the current day, 131 ) , 1 , ⋯ % as follows: 132 This estimated % distribution applies to the most recent days, but it requires the values of % & 134 for 8 ≤ going back to 8 = − 9:; where 9:; is the length of the serial interval distribution. 135 For the serial interval & we used the discretized gamma distribution with mean and standard 136 deviation of t_s = 7.0 ± 4 days, provided in the software similar to Cori The semi-deterministic model for future incidence, based on Cori's method regards the daily 143 distributions of % (values of % , % ) as inputs that summarize the current conditions for disease 144 transmission within the community of interest. The serial interval distribution & , which is fixed 145 with regard to time, is also an input. Thus, on day we have access to the distribution of % that 146 applies to this day (assessed using the most recent days, similar to a trailing moving average). 147 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 7, 2021.  , of ( %?) | , ) are known, the probability of exactly new infections on day + 1 is: 152 For the purpose of predicting a likelihood range for the daily incidence, we use the CDF of %?) : 158 We obtain a [5% -95%] credibility interval for the daily incidence using the inverse CDF for and 160 multiplying by the corresponding infection potential. This provides a smaller variance than the 161 discrete distribution ( ) but is a more practical indication of the incidence rate. 162 163 Extrapolation over multiple days: To go beyond the "next" day, we iterate the one-day 164 prediction, using predicted values to expand the incidence data. One can reasonably extrapolate 165 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 7, 2021. ; https://doi.org/10.1101/2021.10.06.21264569 doi: medRxiv preprint the current distribution of % to + 1 and any number of days in the future. For the short term 166 (7 day) predictions discussed here, we assumed the value of the most recent available % remains 167 the same over the prediction interval, _ %?! = % . 168

169
The estimated incidence for day + 1 requires the infection potential on that day Λ %?) , which 170 is computed based on incidence up to the preceding day . 171 Predictions for day + 2 and beyond can be obtained using the predictions for preceding days 174 for the incidence and iteratively applying the approach for any number of k days into the future. 175 We estimate credibility intervals similar to the one-day case, using only the corresponding range 177 for the reproduction number % , and not compounding with uncertainty for each estimated 178 incidence ̅ %?! or with the additional uncertainty due to the Poisson distribution of the daily 179 (integer) incidence. While this provides a narrower range, the credible interval serves as a relative 180 measure of the uncertainty affecting the prediction. 181 182 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 7, 2021. ; https://doi.org/10.1101/2021.10.06.21264569 doi: medRxiv preprint Correction for imported cases: Not accounting for imported SARS-CoV-2 cases into a county will 183 lead to over estimation of Rt. In practice, we are not able to directly identify imported cases, so 184 an adjustment must be made to identify them. Assuming the daily incidence % can be separated 185 into imported and community-spread parts: 186 Then, imported cases are an additional input to the model. Imported cases are included in the 188 infection potential because they contribute to new local infections, but are not included in the 189 number of new cases when estimating the reproduction number: 190 Turning to predictions, the reproduction number and infection potential computed in the 192 standard framework can only predict the local cases: 193 By definition, imported cases cannot be predicted in the % model; however, we can identify 195 events when the observed number of new cases vastly exceeds the expectation from local 196 transmission. We use this hindsight to improve our estimate of the reproduction number as 197

follows. 198
We estimate the likely number of imported cases on a given day by comparing the actual 199 incidence to the Bayesian credible interval for new local cases estimated from the previous days. 200 This estimated past incidence is then incorporated in a corrected estimate for % . 201 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 7, 2021. ; In an initial pass we compute the % , % parameters for time point based on the incidence time 202 series { 2 } 2(3,),⋯%$) . We compute the one-day predicted incidence on day as described above, 203 using the infection potential Λ % and the distribution of _ % ≡ %$) (so we do not rely on the 204 knowledge of % ). We take the value corresponding to the upper = 95% credible interval as a 205 cutoff and identify the part of the incidence that exceeds the cutoff with imported cases. 206 We use the estimated local incidence % (F7G:F,IRH) to provide revised estimates for the reproduction 208 number as described above (also consistent with Cori and Thompson's approach). Finally, we use 209 the resulting % parameters for the most recent day and the full incidence to provide revised 210 estimates for days + 1, + 2, … + . 211 212 ML+R t Method: Using Long Short-Term Memory (LSTM) Network to Forecast Outbreaks

213
As previously mentioned, the LSTM method implemented in this project is meant to build on the 214 widely used Rt Only approach described in the previous section. The novelty of this LSTM 215 approach is that it provides for the input of epidemiological modeling while taking advantage of 216 cutting-edge machine learning techniques. The combination of the two allows the LSTM model 217 to incorporate the epidemiological principles used to produce the Rt estimate while adding 218 additional information such as temporal and demographic information that can be leveraged 219 with traditional machine learning models. Further, the calculation of Rt using the Rt Only method 220 uses independent data sets for each county in turn creating a unique model for each county that 221 does not consider the impact of possible relationships between counties. By contrast, the ML+Rt 222 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 7, 2021. ; https://doi.org/10.1101/2021.10.06.21264569 doi: medRxiv preprint approach uses global trends across counties. By training on all the data, we are not only able to 223 take advantage of global trends, but by including spatial information, we are also able to show 224 how these trends impact specific counties. 225 226 Daily county-specific Rt, summary statistic information on the estimated Rt such as standard 227 deviation, confidence intervals, and the probability of Rt >1 are also provided. We include values 228 of Rt computed using both 7 and 14 day intervals. All these factors along with temporal 229 information such as daily information on whether it is a weekend or not, holiday or not, days 230 passed from last major holiday, days to the next major holiday were utilized as inputs for our 231 model. 232

233
As mentioned previously, due to the length of time it takes to receive a test result (lag time), we 234 had to consider the deflated effect on the positive cases when considering test procurement 235 date. We observed an average lag of 3 days for results to achieve close to actual levels. To 236 mitigate the effect of the testing lag we impute day t, t-1, t-2 with the actual SARS-CoV-2 cases 237 for days t-3, t-4, t-5 respectively. 238

239
We utilize a Long-Short Term Memory (LSTM) recurrent neural network (Hochreiter & 240 Schmidthuber, 1997), implemented in Python with an Adam optimizer, as our model of interest 241 for this analysis, permitting consideration of all available county-specific input information for 242 the past 7 days with a prediction of the number of positive cases for the county as an output. 243 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 7, 2021. ; https://doi.org/10.1101/2021.10.06.21264569 doi: medRxiv preprint Other advantages of the LSTM approach are the ability to exploit autocorrelation between time 244 points and the utilization of dropout layers to remove redundant information. 245

246
In general, the LSTM models are more complex versions of recursive neural networks (RNNs). 247 The multi-layer LSTM method deployed here follows the framework described in Figure 1 where of the cell, % , and output of the cell, ℎ % , is calculated with the following formulas given input 255 data, % : 256 Where, w are the weight variables (traditionally thought of like regression coefficients), and b are 262 the bias variables (traditionally thought of as intercept terms). Activation functions, and 8 are 263 non-linear transformation functions such as, sigmoid and hyperbolic tangent. A feature of each 264 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 7, 2021. ; https://doi.org/10.1101/2021.10.06.21264569 doi: medRxiv preprint cell is input, output, and forget gates. These gates are what give the LSTM, the memory property 265 which allows it to account and adjust for auto correlation. We define: 266 The

275
In addition, as we cannot guarantee the importance of the inputs (including Rt and associated 276 summary statistics), we add dropout layers which allow for the identification of important inputs. 277 Using these dropout layers, we filtered out inputs that would be considered insignificant in order 278 to detect the important signals coming from the input data and also protect against overfitting. As the closeness of the predicted number of cases to the actual case number, i.e., the "quality" 323 of the prediction, we considered a second metric to consider the quality of the prediction rather 324 than just considering a binary outcome. To accomplish this, we define Spike DCG as: 325 is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

352
The daily number of tests from April 2020-April 2021 were highly variable (Figure 2 with some 353 weeks having very low testing rates as illustrated by Figure 3). Each of the two prediction 354 methods utilized all available data and was updated weekly to obtain county level predictions. 355 We note that this study specifically focuses on evaluating predictions in the latter part of this 356 time frame, and coincided with vaccinations becoming available to different demographics of 357 residents of West Virginia residents, though data from the entire study was used to train each of 358 the methods. 359

389
Critically important is analysis on the performance of the two forecasting strategies in rural 390 compared with more urban counties in WV. Correlations between predicted 7-day positive case 391 totals and actual 7-day positive case totals are higher for non-rural counties than rural counties 392 for both methods (Table 2). 393 394 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 7, 2021. ; https://doi.org/10.1101/2021.10.06.21264569 doi: medRxiv preprint For rural areas, the two methods perform similarly with the ML+Rt method slightly 395 outperforming Rt Only in regard to Spike DCG ( Figure 6). For non-rural areas, we observe that 396 ML+Rt outperforms Rt Only for both DCG metrics ( Table 2). The Rt Only Method performs well 397 when identifying counties in the top 10, but ML+Rt method identifies larger spikes in the top 10 398 recommendations. 399  lower daily incidence this alleviates any concern of bias of the ML method on rural counties. 407 408

409
In this study, we deployed two methods to predict short term incidence of SARS-CoV-2 infection 410 for purposes of identifying West Virginia counties that might benefit from enhanced SARS-CoV-2 411 testing. One method, Rt Only, utilizes the Cori model [5], assuming that all positive cases are 412 known. In contrast, the ML+Rt method utilizes Rt as an input value, but bases predictions on an 413 LSTM framework that utilizes other factors such as population size. 414 415 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 7, 2021.  et al., 2020) ). However, this raises a practical issue in that data for 432 day is typically incomplete on day and is reported gradually over several days. To address this 433 issue, we estimate SARS-CoV-2 incidence using data from 3 days prior ( 5I475H = 438 A second issue with the Rt Only method is that we do not have access to a reliable record of 439 imported cases as they are a theoretical concept in this model. In practical settings, the term 440 "imported" is to be taken in a (very) broad sense. There are a number of situations that have a 441 similar effect. 442 1. True "exogenous" cases likely occurred due to county residents traveling for school or 443 To address these issues, we use the Bayesian credible interval to better define the number of 459 imported cases in the Rt Only method. By the iterative fitting technique proposed we are able 460 to better estimate the number of imported cases that will be observed. 461

462
The ML+Rt value suffers from issues with practical implementation as well. The same issues with 463 data quality from testing lags can be found when using any data driven method to forecast cases. The approaches proposed in this work provide a framework for forecasting outbreaks at a local 495 level that utilizes two different approaches. The first is a model based on epidemiological theory, 496 while the second is a machine learning approach that simultaneously considers historic trends 497 and other inputs. Both methods are useful specifically the Rt Only method when data is limited, 498 while the ML+Rt method performs well when data has been collected and a historic perspective 499 can be presented. 500 501 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

502
This study addressed the West Virginia SARS-CoV-2 epidemic from January -April 2021. At that 503 time, only one case of the Delta variant had been detected, therefore, our models do not address 504 prediction of new SARS-CoV-2 incidence when Delta is the prevalent variant. As the Delta variant 505 has unique epidemiologic characteristics compared to earlier SARS-CoV-2 variants such as a 506 shortened serial interval which influences calculation of Rt, models must be adjusted as new more 507 virulent strains of SARS-CoV-2 appear in the population (Baisheng, et al., 2021). 508 509

510
This study provides important information on strategies for predicting near-term increases in 511 SARS-CoV-2 incidence, and hence, for targeting SARS-CoV-2 testing. We provide a new approach, 512 Rt Only, that utilizes the estimation of the reproduction number to provide recommendations on 513 county-specific areas where outbreaks will likely occur. We also describe a second approach, 514 ML+Rt, utilizing long short-term memory models that consider epidemiological statistics such as 515 Rt, county population information, and time series trends including information on major 516 holidays to forecast outbreaks and create county recommendations. Comparison of the two 517 approaches shows the top 10 recommendations produced by the ML+Rt method outperform the 518 Rt Only method over the period of this study. Our data suggest that traditional epidemiological 519 modeling can be enhanced by modern machine learning tools to inform decisions on where to 520 target SARS-CoV2 testing. 521 522 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted October 7, 2021. ; https://doi.org/10.1101/2021.10.06.21264569 doi: medRxiv preprint SARS-CoV-2 serial interval and the impact of parameter uncertainty on the COVID-19 543 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 7, 2021.  . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. The PMF of the expected number of cases is obtained by integrating over the values of % : 605 . CC-BY-NC-ND 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted October 7, 2021. ; https://doi.org/10.1101/2021.10.06.21264569 doi: medRxiv preprint