Predicting and analyzing the COVID-19 epidemic in China: Based on SEIRD, LSTM and GWR models

In December 2019, the novel coronavirus pneumonia (COVID-19) occurred in Wuhan, Hubei Province, China. The epidemic quickly broke out and spread throughout the country. Now it becomes a pandemic that affects the whole world. In this study, three models were used to fit and predict the epidemic situation in China: a modified SEIRD (Susceptible-Exposed-Infected-Recovered-Dead) dynamic model, a neural network method LSTM (Long Short-Term Memory), and a GWR (Geographically Weighted Regression) model reflecting spatial heterogeneity. Overall, all the three models performed well with great accuracy. The dynamic SEIRD prediction APE (absolute percent error) of China had been ≤ 1.0% since Mid-February. The LSTM model showed comparable accuracy. The GWR model took into account the influence of geographical differences, with R2 = 99.98% in fitting and 97.95% in prediction. Wilcoxon test showed that none of the three models outperformed the other two at the significance level of 0.05. The parametric analysis of the infectious rate and recovery rate demonstrated that China's national policies had effectively slowed down the spread of the epidemic. Furthermore, the models in this study provided a wide range of implications for other countries to predict the short-term and long-term trend of COVID-19, and to evaluate the intensity and effect of their interventions.


Introduction
Novel coronavirus pneumonia (coronavirus disease 2019, COVID-19) break out firstly in Wuhan, Hubei Province, China in December 2019, then the epidemic became prevalent in the rest of the world. With the research on COVID-19 so far, through the comparison of the gene sequence of the virus with that of the mammalian coronavirus, some studies found that its source may be related to bat, snake, mink, Malayan pangolins, turtle and other wild animals [1][2][3][4]. COVID-19 can also cause severe respiratory diseases such as fever and cough [5], and there is a possibility of transmission after symptoms of lower respiratory diseases [6]. However, unlike SARS-CoV and MERS-CoV, COVID-19 is separated from airway epithelial cells of patients [6], yet the mechanism of receptor recognition is not consistent with SARS [7]. Therefore, the pathogenicity of COVID-19 is less than that of SARS [8], and its transmissibility is higher than that of SARS [9]. In addition, this new coronavirus presents human-to-human transmission [10], and close contact could lead to group outbreaks [11]. As of July 7th, 2020, 85,359 confirmed cases and 4,648 deaths had been reported in China [12]. In addition to China, there are over 200 countries and regions in the world with a total of 11,630,898 of confirmed cases and 538,512 of deaths [12]. The outbreak of COVID-19 happened right before the Lunar New Year, which is typical Chinese Spring Festival transportation period. With a population of over 11 million, Wuhan is one of the major transportation hubs in China as well as a core city of the Yangtze River Economic Belt. The time and location of the outbreak further led to the rapid spread of the epidemic in China [13]. Since there is still no vaccine or antiviral drug specifically for COVID-19, the government's policies or actions play an important role in flatting the epidemic curve [14]. From the perspective of public health, the interventions of Wuhan government have achieved the purpose of reducing the flow of people and the risk of exposure to the diagnosed patients, and also effectively slowed down the spread of the epidemic [15]. Nevertheless, COVID-19 can be transmitted by asymptomatic carriers [16], and some of the recovered patients may still be virus carriers [17]. In order to implement non-pharmaceutical interventions more effectively, we used a combination of epidemiological methods, mathematical or statistical modeling tools to provide valuable insights and predictions as benchmarks.
For the study of infectious diseases like COVID-19, SARS, and Ebola, most of the literature used descriptive research or model methods to assess indicators and analyze the effect of interventions, such as combining migration data to evaluate the potential infection rate [18,19], understanding the impact of factors like environmental temperature and vaccines that might be potentially linked to the diseases [20,21], using basic and time-varying reproduction number (R 0 & R t ) to estimate changeable transmission dynamics of epidemic conditions [22][23][24][25][26][27], calculating and predicting the fatal risk to display any stage of outbreak [28][29][30], or providing suggestions and interventions from risk management and other related aspects based on the results of modeling tools or historical lessons [31][32][33][34][35][36][37][38][39]. Some literature only used one kind of model to simulate and predict the course of diseases. For instance, to use relatively common epidemiological dynamics models like SEIR or SIRD to forecast epidemic trends and peaks in certain provinces, even the world [9,[40][41][42][43][44]; to apply some other types of statistical models such as the logistic growth models or time series approaches to analyze the epidemic situation [45,46], or to develop new models to support more complex trajectories of epidemics or to predict the number of confirmed cases and the spatial progression of outbreaks [47][48][49]. Several studies were further expanded based on the basic epidemic dynamic models. For example, joining the border protection mechanism with the SEIR model to better identify high-risk groups and infected cases [50]; adding the effect of media or awareness into basic models to assess whether these outside influences would possible change the transmission mode of infectious diseases [51,52]; or according to transmission routes contained in dynamic models, using a multiplex network model or transmission network topology to analyze the outbreak scale and epidemic spread more accurately [53,54]. A small number of studies combined the analysis capabilities of two types of models, like SEIR model and the recurrent neural networks model (RNN), to determine whether certain interventions could affect the results of outbreak control [55]. However, we did not find any analysis method using geographically weighted regression (GWR) on COVID-19 study based on our literature research. There is also a lack of understanding the model efficacy of predicting the epidemic curve among different algorithms.
In this study, an SEIR's extended model SEIRD was used to simulate the epidemic situation in China and to predict the number of confirmed and cured cases in each province and several major Chinese cities. An LSTM model combined with traffic data and a GWR model were used to predict the number of confirmed patients. Specifically, GWR Model showing geographical differences was used to predict the development of epidemic situation and analyze the impact of geographical factors. This paper also compares the characteristics and prediction ability of these models. In the absence of vaccines and drugs for COVID-19, it makes sense to use multiple models to show the situation and intensity of non-pharmaceutical interventions needed to simulate and guide the control of outbreaks.

Data sources
Daily updated COVID-19 epidemiological data used in this study were retrieved from National Health Commission of China [12] and accessed via https://github.com/wybert/openwuhan-ncov-illness-data. The daily number of outbound from Wuhan city and relevant migration indice from January to March were collected from an online platform called Baidu Qianxi [56]. The demographic data and medical resources data were from China urban statistical yearbook published by the National Bureau of Statistics as shown in S1 Table.

Modified SEIRD model
This study used SEIRD model and the changes in the status of the susceptible (S), exposed (E), infected (I), recovered (R) and dead (D) population in the total population (N) are shown in Fig 1. According to the medical characteristics and clinical trials of COVID-19, both confirmed patients and asymptomatic carriers have the ability to transmit the virus. Therefore, susceptible people have a certain chance to become infected after they come into contact with exposed or infected individuals [43]. Carriers in the exposed status may develop obvious symptoms after the incubation period and become diagnosed or they may be recovered. The final status of individuals can be basically divided into two categories: one is the recovery from the combined effects of treatment in hospital and autoimmunity, and the other is the death without effective treatment. In the model formula, the infectious rate β needs to be adjusted in real time to adapt to the trend of disease development. In the middle and late stages of the epidemic, the number of daily new cases decreased significantly due to the positive influence of government policies. Thus, to better fit the model, we added an attenuation factor desc to β. Based on the basic SEIRD model formulas [57,58], our modified model was shown as Eqs (1)(2)(3)(4)(5)(6).
Here, the parameter t denotes the time; β is the infectious rate; α is the rate for the exposed to be infected; γ 1 is recovery rate for the exposed; γ 2 is the recovery rate for the infected; k is the mortality rate; "desc" is the attenuation factor for β, so that β decays exponentially when 0<desc<1, and β is a constant when desc = 1.

LSTM model
LSTM (Long Short-Term Memory) architecture for recurrent neural networks was first proposed in 1997 [59]. A LSTM block is illustrated in Fig 2. It features three gates (input, forget, and output), a block input and an output. The output of the block is recurrently connected to the input of the block.
Here, z t , i t , f t , c t , o t and h t denote the block input, input gate, forget gate, cell state, output gate and block output, respectively. And x t represents the input vector at time t, ⊙ is the pointwise multiplication operator of two vectors, the W z , W i , W f , and W o are input weight matrices, is used as the activation function of the gates and ReLU is used as the activation function of the block input and output.

GWR model
Epidemic situations and medical resources in different geographic situations may have different extents of influence on the development of the epidemic. Ordinary least squares fitting method for regression may not be applicable in this case. Geographically weighted regression model (GWR) was proposed in 1996 [60], which extended the ordinary linear regression model and embedded the geographic location data into the regression parameters as shown below: where y i is the i th dependent variable, x ik is the k th independent variable in location i, p is the total number of independent variables, β i0 is the intercept parameter in location i, β ik is the regression coefficient for the k th independent variable in location i, which varies with the geographical location, and ε i is the error term in location i. The spatial weight matrix in this study uses the bi-square kernel function shown below: if d ij <b, otherwise w ij = 0, where b is the bandwidth, a non-negative attenuation parameter and d ij denotes the distance between the i th and j th observation points. The bandwidth is calculated by optimizing the root mean square prediction error of cross-validation [61,62].

SEIRD model
In this study, we used the modified SEIRD model to make predictions of the number of cumulative confirmed cases in the next day for all provinces, province-level municipalities and autonomous regions in China as well as Wuhan City. The parameters were adjusted daily in our dynamic SEIRD model based on the daily updated epidemic data. The comparison of the actual data on February 14 th and February 25 th with the forecast results of our models is shown in Table 1. The percent error was calculated using the formula: (predicted number-actual number)/ actual number × 100%. On February 14 th , the absolute percent errors of all provinces were < 5%. The percent error for Wuhan City, Hubei Province and China were -3.00%, -1.60% and 1.00%, respectively. On February 25 th , the absolute percent error of prediction of cumulative confirmed cases in China was < 0.10%. The absolute percent errors of most provinces were < 0.10%, among which the absolute percent errors in Wuhan City was < 0.10% and that of Hubei province was less than 0.10%. Regarding the number of recovered cases, Wuhan City and Hubei Province had percent errors of -6.03% and -3.12%, respectively. The overall prediction of recovered of the whole country was consistent with the actual situation with percent error of -2.46%. The predicted number of deaths in Hubei province was off by 1.40% (forecast 2,599 vs. actual 2,563). Actual and predicted number of confirmed cases using the modified SEIRD model for China, Hubei province and Wuhan city are shown in Fig 4 (Hubei province and Wuhan City adjusted the criteria for diagnosis on February 13 th , and the number of confirmed cases increased by about 10,000 on that day [63]. In order to smooth the sudden change, the number of cumulative cases before February 12 th in Hubei City and Wuhan province was proportionally enlarged according to the new criteria. The same for Fig 5). The actual and calculated values of these three regions provided satisfying fitting curves, indicating that the situation simulated by the model was basically in line with the actual situation of the epidemic development. In this study, the inflection point was defined as the date when the number of existing confirmed cases has the largest slope. According to the SEIRD dynamic model, the inflection points of all provinces appeared generally in February, while the specific time varied from region to region. The results of model simulation revealed that the inflection point in Wuhan city and Hubei province showed up in early February, and that of the whole country roughly in the first half of February, which basically conformed to the spread of COVID-19 in China.
Using data on March 5 th , the model predicted the long-term trends in the number of confirmed, cured and deaths for China, Hubei province and Wuhan city (Fig 5). Again, the model used adjusted historical data as discussed above. Under the various social non-pharmaceutical interventions and not allowing for the imported cases from foreign countries, the cumulative number of confirmed nationwide was expected to reach about 83,000 at the end of the epidemic. Hubei Province was expected to have a total of about 70,000 confirmed cases and Wuhan City about 50,000.

LSTM model
Data from four regions, Zhejiang, Guangdong, Beijing, and Shanghai were selected to train the LSTM neural network to predict the number of cumulative confirmed cases of the next day. Since the LSTM model had a memory function, the first feature included in the model was the number of cumulative confirmed cases on the previous day. Considering that the number of migrants from Wuhan also affected the studied city, thus the number of migrants from Wuhan was also included in the analysis. There was a certain probability that some migrants from Wuhan may be patients because of the virus's incubation period, and the inference of this probability was based on the number of confirmed cases in Wuhan. Therefore, the second feature considered the number of migrants from Wuhan on the previous day, and the confirmed number of patients in Wuhan on the previous day. The feature was calculated as the cumulative number of immigrants from Wuhan multiplied by the incidence of COVID-19 in Wuhan on the previous day. This LSTM architecture was designed into 4 layers: an input layer, an LSTM layer (hidden layer), a fully-connected layer and an output. Each LSTM neuron had 10 hidden features, and the activation function was ReLU. The loss function was MSE, and the optimizer was "Adam". The model structure diagram is as Fig 6. This study used the grid search method to set different hyperparameters for data in different regions.
The model was trained and the predicted results for latest 8 consecutive days as shown in

GWR model
In this study, the data of 220 cities that had confirmed cases on February 2 nd were selected to predict the number of confirmed cases on February 3 rd . The number of confirmed cases, the number of deaths and the number of cured cases are main indicators for the epidemic. Among them, the number of confirmed cases was the mostly used and reflected the severity of COVID-19 epidemic. Therefore, this study used the cumulative number of confirmed cases in different places released by the National Health Commission as dependent variable. In this study we select the population of each city, the number of hospitals per 10,000 people, the number of doctors per 10,000 people, the number of inpatient beds per 10,000 people, the number of confirmed cases, the number of cured cases, and the number of deaths one day and 2 days ago as independent variables.
The GWR model was fitted using the data of February 2 nd , and we further made forecast for the number of the confirmed cases on February 3 rd . The R 2 of GWR regression on February 2 nd was 99.98% and the R 2 of the prediction of the data on February 3 rd was 97.95%. The percent errors of fitting and prediction varied for different cities: for Beijing were 11.67% and 3.95%, respectively; for Shanghai were 2.24% and -5.88%, respectively, for Xiaogan in Hubei Province were -1.27% and 1.70%, respectively, and for Wuhan were 0.00% and 14.57%, respectively.   The summary of the intercept and coefficients of the independent variables were listed in Table 4. It shows that the coefficients of the demographic data, and the medical resources data have larger variations than those of epidemic data. The coefficients of population, number of hospitals per 10,000 people, number of doctors per 10,000 people, dead_lag1, confirmed_lag2, cured_lag2 were negative, showing that these factors have negative influence on the dependent variable. While the other independent variables, number of inpatient beds per 10,000 people, confirmed_lag1, cured_lag1, dead_lag2 have positive coefficients, indicating positive influence on the dependent variable as shown in Table 4.

Sensitivity analysis of parameters
As of mid-March 2020, more than 60,000 people had been cured in 31 provinces, provincelevel municipalities, and autonomous regions in China, and new cases of infection were mainly led by overseas imports. Although the COVID-19 epidemic was not over, the traffic in the low-and medium-risk areas in Hubei province had been gradually resuming, indicating that the government's non-pharmaceutical interventions had significantly positive effects. In this study, the modified SEIRD model was used to conduct parameter sensitivity analysis of β, desc, and γ 2 based on data before March 5 th , so as to simulate the impact of prevention and control measures on real-time infections for China, Hubei Province, Wuhan city, and Beijing city (Fig  9).
The decrease of the infectious rate β would promote the reduction of infections during the entire epidemic stage with other conditions being equal (Fig 9A). The shape of the epidemic curve was basically unchanged, but the duration of the epidemic increase as the infectious rate itself increases. The number of cases increased obviously, and the peak of real-time infections was postponed as the infectious rate increases. When the infectious rate increased to 125%, the epidemic size doubled with the delay of the peak of real-time infections by about 10 days ( Fig  9A).  Moreover, increasing the attenuation factor of infectious rate could lead to a significant slowdown in the spread of the epidemic and the shape of the epidemic curve changed (Fig 9B). In the beginning, the growth of attenuation factor changed the number of confirmed cases little, but the number had changed dramatically over time, the peak of the epidemic moved forward with the increase in the attenuation factor (Fig 9B). The duration of the epidemic also advanced correspondingly. A combination of the changes in the infectious rate β itself and the changes in the attenuation factor of β could reflect the effects of the measures such as timely isolation of confirmed or suspected patients and reduction of population mobility. Coupled with the community containment measure, the number of exposed, infected and susceptible individuals outside were greatly reduced, so that the extent of the epidemic in China had been under control. Implemented metropolitan-wide quarantine of Wuhan city itself could also interfere with the change of infectious rate. The decrease in the number of daily new confirmed cases since late February showed that the corresponding policies had effectively blocked the spread of the epidemic.
The change in the recovery rate of infected γ 2 had little effect in the early stage of the epidemic. As time went by, the growth of recovery rate could significantly raise the number of recovered, thus advancing the peak time of the real-time confirmed cases (Fig 9C). When the recovery rate raised from 75% to 125%, the whole country, Hubei province, Wuhan city and Beijing city could reach the time of maximum real-time infections about 6-15 days in advance, and the scale of the epidemic could be reduced as well (Fig 9C). In fact, China transported advantage medical resources of more than 20,000 people to Hubei province [5] in order to achieve the goal of early detection, early reporting, early diagnosis, and early isolation. Besides, the measure of "one province helping one city" established provincial counterparts to support the rescue work in Hubei province except Wuhan [5], so as to rationally allocate advanced resources. These interventions could improve the treatment and medical level of key provinces and cities, thereby increasing the recovery rate of infected and reducing the mortality rate. By March 13 th , 2020, more than a thousand people each day have been cured and discharged for 29 consecutive days [6], indicating the effectiveness of related policies. Although the COVID-19 has been effectively controlled in China, it has spread rapidly in other countries. Italy, the United States and Spain have become the focused areas of the outbreak. By May 2 nd , 2020, the United States, as the country with the largest number of confirmed cases, has over 1.1 million cases, and Spain had 216,582 cases, and Italy ranked the third with 207,428 confirmed patients [12]. In order to control the spread of coronavirus, America took measures to reduce the mobility of the population, built hospitals and facilitate the treatment of the coronavirus [64][65][66][67]. Similar to the US, Italy and Spain also tried to limit the movement and gathering of the crowds, improve the protection level and provide more medical resources [64,[68][69][70].
In conclusion, all three countries have implemented various interventions to slow down the spread of the COVID-19 disease. The measures could be basically divided into two categories: reducing the infection rate and increasing the recovery rate. However, according to the recent large-scale outbreak in the United States and Spain, it could be found that a part of the people in these two countries might have insufficient awareness of prevention and control of the epidemic [64]. The supervision of those prevention and control measures needs further improvement. Thanks to the joint efforts of the people across the Italy, while the number of confirmed cases in Italy is still large, this country, which was called "the second Hubei province" in the early stage of the epidemic, has a trend of declining new cases of infection and death [12].
In order to test the capability of the SEIRD model in foreign countries, data before June 29 th , 2020 of Italy were used to calculate the epidemic curve. The results of the model also fitted well with the actual data as shown in Fig 10. Although some other countries successfully controlled the epidemic using similar measures with China [71], they may not always work in other countries because the effect depends on the public attitudes towards the measures and commitment to the intervention as debated in [72]. Therefore, in the face of the same epidemic situation and similar crises, our SEIRD dynamics model can be potentially applied to other countries to evaluate the intensity and effect of policies implemented by simulating and forecasting the situation of the epidemic, but the effect may be limited by the attitudes and action of the public.

Spatial distribution of coefficients in GWR model
To better understand the spatial distribution of the coefficients of the independent variables in the GWR model, four parameters and their correlations in the model of February 2 nd have been studied to evaluation the heterogeneity of their coefficients in space. There was a strong negative correlation between the number of hospitals per 10,000 people and the number of confirmed cases (Fig 11A). This can be explained as that the isolation of confirmed cases in the hospital can prevent contagion. From the perspective of the spatial distribution of the regression coefficients, it has a trend of gradual decline from the northeast to the southwest and northwest of China (Fig 11A). The most influenced areas are located in the northeast of China, while the least influenced areas are in southwest and northwest of China.
There was a negative correlation between the number of doctors per 10,000 people and the number of confirmed cases (Fig 11B). From the perspective of the spatial distribution of regression coefficient, it shows a gradually decreasing trend from the northeast and northwest of China to the south (Fig 11B). The regions that are influenced the most are concentrated in northeast and northwest of China, while the least influenced regions are in the south.
There was a positive correlation between the number of confirmed cases and the confirmed cases one day ago (Fig 11C). This suggests that the more cases confirmed the day before, the more confirmed cases would emerge the next day. Effective local quarantine measures can be used to prevent a pandemic. From the perspective of the spatial distribution of the regression coefficient, it shows a trend of gradual decline from the northeast to the southwest and northwest of China (Fig 11C). This trend is not significant, which shows a universal pattern across the country. There was a positive correlation between the number of cured case and the number of confirmed cases one day ago (Fig 11D). From the perspective of the spatial distribution of regression coefficient, it shows a gradually decreasing trend from the northeast and northwest of China to the south, with the most influenced areas in the northeast and northwest, and the least influenced areas in the south (Fig 11D).

Comparison of SEIRD, LSTM and GWR models
By comparing the prediction capabilities of these three types of models, the modified SEIRD, LSTM and GWR model could effectively predict the epidemic data for the next day generally. The percent errors of the SEIRD model to predict confirmed cases were within ±5.0% in all of these four selected regions (Beijing, Wuhan, Hubei and China) shown in Table 5. The LSTM model also fit well to the real curve by incorporating traffic big data, indicating good simulation and prediction effects. The average percent error of LSTM model predictions for the four selected provinces and cities was within ±1.0% on February 14 th (Table 5). GWR model could reflect spatial heterogeneity but larger percent errors showed than the other two models in some cases ( Table 5). The MAPE (Mean Absolute Percentage Error) for the SEIRD, LSTM and GWR models in the selected areas were 1.70%, 1.51%, 3.44%, respectively. In order to compare the APE (Absolute Percent Error) of the three models, we ran Wilcoxon Signed Rank Test for the paired observations in Table 5. The p-values for the hypotheses: the APE of GWR> that of LSTM, the APE of GWR > that of SEIRD and the APE of SEIRD> that of LSTM were 0.173, 0.187 and 0.459, respectively, thus not significant at the level of 0.05. Overall, the prediction efficacy of GWR model was inferior to those of SEIRD and LSTM models according to the MAPE and p-values.

Conclusions
In this study, the modified SEIRD model, the LSTM model with traffic data and the GWR model reflecting the geographical environment were used to make forecasts for the development of COVID-19 in China. These three types of models all showed remarkable prediction capabilities. The parameter sensitivity analysis reflected the effectiveness of non- pharmaceutical interventions. Now the epidemic quickly spread abroad, in the absence of targeted pharmaceutical treatment such as vaccines, the interventions implemented in various countries were basically similar to those in China, which were based on the two aspects: reducing the infectious rate and improving the recovery rate. As the number of daily new cases continues to increase globally, models in this study shows potential being used for epidemic curve prediction and prevention of COVID-19 in other countries.
Supporting information S1