Prediction and analysis of Corona Virus Disease 2019

The outbreak of Corona Virus Disease 2019 (COVID-19) in Wuhan has significantly impacted the economy and society globally. Countries are in a strict state of prevention and control of this pandemic. In this study, the development trends of cumulative confirmed cases, cumulative deaths, and cumulative cured cases were analyzed based on data from Wuhan, Hubei Province, China from January 23, 2020 to April 6, 2020 using an Elman neural network, long short-term memory (LSTM), and a support vector machine (SVM). An SVM with fuzzy granulation was used to predict the growth range of new confirmed cases, new deaths, and new cured cases. The experimental results showed that the Elman neural network and SVM used in this study can predict the development trend of cumulative confirmed cases, deaths, and cured cases, whereas LSTM is more suitable for predicting cumulative confirmed cases. The SVM with fuzzy granulation can successfully predict the growth range of new confirmed cases and new cured cases, although the average predicted values are slightly high. Currently, the United States is the epicenter of the COVID-19 pandemic. We also modeled data from the United States to further verify the validity of the proposed models.


Introduction
A number of unexplained pneumonia cases have been discovered in China since December 2019 and have been confirmed to be an acute respiratory infectious disease caused by a novel coronavirus. The outbreak of COVID-19 has passed through three stages since mid-December 2019: local outbreak, community transmission and large-scale transmission. ① Local outbreak stage: before the end of December 2019, a local outbreak formed mainly among people exposed to the seafood market, and most cases at this stage were related to that exposure. ② Community transmission stage: the virus spread to communities through people infected early, and interpersonal and cluster transmission occurred in multiple communities and families in Wuhan. ③ Large-scale transmission stage: the epidemic expanded rapidly from Hubei Province to other parts of China because of the great mobility of people during the Chinese Lunar New Year, while the number of COVID-19 cases in other countries gradually increased.
As of 24:00 on February 29, 2020, China had reported a total of 79,824 confirmed cases of COVID-19 and 2,870 deaths 1. Wuhan accounted for 61.5% of the country's cumulative confirmed cases and 76.5% of its deaths, making it the priority area for epidemic prevention and control. At the same time, countries and regions outside China had reported 7,661 confirmed cases and a total of 121 deaths. Infectious diseases cause disastrous harm to human society and are among the important factors that seriously threaten human life and health, restrict social and economic development, and endanger national security and stability. Economic globalization, the internationalization of production, more convenient transportation, and faster flows of people and cargo have created favorable conditions for infectious diseases to spread faster and more widely 2. Infectious diseases of recent years, such as COVID-19, SARS in 2003, and influenza H1N1 and H5N1, have greatly affected human health and social life. How to contain outbreaks of infectious diseases and slow their spread is an urgent issue facing society at present 3.
Predicting infectious diseases requires theoretical analysis, quantitative analysis and simulation, none of which can be carried out without models of the diseases in question.
Infectious disease transmission is a complicated diffusion process that takes place within a population. Models of this process make it possible to analyze and study the transmission of infectious diseases theoretically 4 and thus to predict their future development trend more accurately 5.
Therefore, in order to control or reduce the harm of infectious diseases, the research and analysis of infectious disease prediction models has become a hot research topic 6.

Traditional infectious disease prediction model
Traditional infectious disease prediction models mainly include differential equation prediction models and time series prediction models based on statistics and random processes.
Differential equation prediction models establish differential equations that reflect the dynamic characteristics of an infectious disease according to the characteristics of population growth, the occurrence of the disease and the laws of transmission within the population. Through qualitative and quantitative analysis and numerical simulation of the model dynamics, they display the course of the disease, reveal the laws of transmission, predict trends, analyze the causes and key factors of transmission, seek optimal prevention and control strategies, and provide a theoretical and quantitative basis for prevention and control decisions. Common differential equation models of infectious disease dynamics include ordinary differential systems, which directly relate the instantaneous rate of change of the number of individuals in each compartment to the state of all compartments at the same time. Partial differential systems are commonly used when age structure is considered.
Delay differential systems arise when stage structure is considered (e.g. the infected person has a definite infectious period, the latent person has a definite incubation period, the immunized person has a definite immune period, etc.). The currently widely studied and applied models include the SI, SIS, SIR and SEIR models 7. Individuals in the system are divided into categories, each corresponding to one state: S (Susceptible), E (Exposed), I (Infected) and R (Removed).
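As an illustration of the compartment models listed above, the following minimal Python sketch integrates the classical SIR equations with a forward Euler scheme. The function name `simulate_sir`, the parameter names `beta` (transmission rate) and `gamma` (removal rate), and all numerical values are assumptions chosen for illustration; they are not estimates from this study.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, steps_per_day=100):
    """Integrate dS/dt = -beta*S*I/N, dI/dt = beta*S*I/N - gamma*I,
    dR/dt = gamma*I with forward Euler steps.

    Returns a list of (S, I, R) tuples, one per day, starting at day 0.
    """
    n = s0 + i0 + r0                 # total population, assumed constant
    dt = 1.0 / steps_per_day         # Euler step size in days
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_removals = gamma * i * dt
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        history.append((s, i, r))
    return history


# Hypothetical parameters: 1 infected person in a population of 1000.
history = simulate_sir(beta=0.3, gamma=0.1, s0=999, i0=1, r0=0, days=160)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print("day with the most infected individuals:", peak_day)
```

Because every individual who leaves one compartment enters another, S + I + R stays equal to the initial population, which is the constant-population assumption of the classical model.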
The classical differential equation prediction model assumes that the total population of an area is constant. It can reflect the natural transmission process of infectious diseases, describe how the different types of nodes evolve over time, and reveal the overall law of transmission. In practice, however, the population changes over time, and there is always some form of interaction with other populations in terms of food, resources and living space. The model also treats contacts between individuals as random and ignores differences between the individuals who spread the disease, which limits its scope of application. Time series prediction models, based on statistics and random processes, predict infectious diseases by analyzing one-dimensional time series of disease incidence; they mainly include the Autoregressive Integrated Moving Average model (ARIMA), the Exponential Smoothing method (ES), the Grey Model (GM) and the Markov chain method (MC). The most widely used is the ARIMA model, which differences the series as many times as needed to make it stationary and then represents the resulting series as an autoregressive combination of its values up to a certain point in the past 8.
An infectious disease prediction model established by this method relies on curve fitting and parameter estimation from the available time series data, so it is difficult to apply to large volumes of irregular data.
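The differencing step that ARIMA applies before fitting can be sketched as follows. This is only the "make the series stationary" part, not a full ARIMA implementation; the function name `difference` and the toy series are assumptions for illustration.

```python
def difference(series, order=1):
    """Return the series differenced `order` times.

    Each pass replaces the values with the changes between consecutive
    observations, so the result is `order` elements shorter than the input.
    """
    values = list(series)
    for _ in range(order):
        values = [later - earlier for earlier, later in zip(values, values[1:])]
    return values


# Toy series with a quadratic trend: two rounds of differencing remove it.
toy_series = [1, 4, 9, 16, 25]
print(difference(toy_series, order=1))
print(difference(toy_series, order=2))
```

Applied once to a cumulative case series, the same operation yields the series of daily new cases.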

Internet-based infectious disease prediction model
Internet-based infectious disease surveillance research has been growing since the mid-1990s 9. It can provide information services for public health management institutions, medical workers and the public. After analysis and processing, it can provide users with early warning and situational awareness information about infectious diseases 10.
In early research, traditional web page information (for example, related news topics and authoritative organizations) was the main data source. With the development of the Internet, however, research in recent years has expanded its data sources to social media (such as Twitter, Facebook and microblogs) and multimedia information. Owing to the global spread of the Internet, researchers use Internet search engines, social networks and online map tools to track the frequency and location of query keywords, integrate information on social issues, public concerns and hot topics, monitor diseases through search engines and social media, and predict the incidence of infectious diseases, which can provide an important reference for decisions on and management of infectious disease prevention and control 11.
In theory, Internet search tracking is efficient, and can reflect the real-time status of infectious diseases.
Therefore, infectious disease prediction models based on the Internet and search engines are a good supplement to the traditional infectious disease prediction models 12. U.S. scientists compared flu estimates for different countries and regions from 2004 to 2009 with official flu surveillance data and found that the estimates from the Google search engine were close to the historical flu epidemics 13. In 2016, Google Flu Trends (GFT) and other tools quantitatively tracked the spread of infectious diseases such as dengue fever and influenza in multiple regions of the world according to Google's search patterns 15.
Compared with the traditional prediction models, Internet-based infectious disease prediction models are fast and work in real time: they can predict the incidence trend of infectious diseases as early as possible and are suitable for analyzing data from large populations. However, the sensitivity, spatial resolution and accuracy of their predictions need further improvement. Internet-based infectious disease prediction models therefore cannot replace the traditional prediction models and can only be used as an extension of them 16.

Early Prediction Model of Infectious Diseases Based on Machine Learning
In short, machine learning learns useful information from a large amount of data by applying an algorithmic model to a specific problem. Machine learning spans a variety of fields, such as medicine, computer science, statistics, engineering technology and psychology 17. For example, the neural network, a relatively mature machine learning algorithm, can simulate any high-dimensional non-linear optimal mapping between input and output by imitating the processing function of the biological nervous system. When faced with complex data relations, traditional statistical methods are less effective and may not yield results as accurate as those of a neural network 18.
Since most new infectious diseases in humans are of animal origin (animal infectious diseases), determining the common intrinsic characteristics of species and the environmental conditions that lead to the spillover of new infections is an effective prerequisite for predicting diseases. By analyzing the intrinsic characteristics of wild species through machine learning, new reservoirs (mammals) and carriers (insects) of zoonotic diseases can be accurately predicted 19. The overall goal of the machine learning-based approach is to extend causal inference theory and machine learning to identify and quantify the most important factors that cause zoonotic disease outbreaks, and to generate visual tools that illustrate the complex causal relationships of animal infectious diseases and their correlation with zoonotic diseases 20. However, the highly nonlinear and complex problems analyzed in machine learning-based early prediction models of infectious diseases often cause the optimization to settle in local rather than global minima, which limits the machine learning model.
Infectious disease prediction models mainly include differential equation prediction models based on dynamics, time series prediction models based on statistics and random processes, Internet-based infectious disease prediction models and machine learning methods. Some models are too complicated and consider too many factors, which often leads to over-fitting. In this paper, the Logistic model, Bertalanffy model and Gompertz model, which are relatively simple but accord with the statistical laws of epidemiology, are selected to predict the epidemic situation of COVID-19. After the model is selected, the least squares method is used for curve fitting. The least squares method is a mathematical optimization technique that finds the best-matching function for the data by minimizing the sum of squared errors. With it, the unknown values can be obtained easily, such that the sum of squared errors between the fitted values and the actual data is minimized.
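The least squares criterion described above can be sketched in a few lines of Python. The helper names (`sum_squared_error`, `fit_line`, `line`) and the toy data are assumptions for illustration. The growth curves used in this paper are nonlinear in their parameters and need an iterative solver, which is not shown here; the straight-line case is used only because its minimiser has a closed form.

```python
def sum_squared_error(model, params, ts, ys):
    """Sum of squared differences between model(t, *params) and the data."""
    return sum((model(t, *params) - y) ** 2 for t, y in zip(ts, ys))


def line(t, slope, intercept):
    """Straight-line model used as the simplest possible fitting target."""
    return slope * t + intercept


def fit_line(ts, ys):
    """Closed-form least squares estimate of (slope, intercept)."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    if sxx == 0:
        raise ValueError("all t values are identical; the slope is undefined")
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    return slope, intercept


# Toy data, not taken from the paper.
ts = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
best = fit_line(ts, ys)
print("fitted parameters:", best)
print("sum of squared errors:", sum_squared_error(line, best, ts, ys))
```

Any other parameter pair gives a sum of squared errors at least as large as the fitted one, which is what "best function match" means under this criterion.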

Model Selection
(1) Logistic model The Logistic model is mainly used in epidemiology. It is commonly used to explore the risk factors of a disease and to predict the probability of its occurrence from those risk factors. Logistic regression analysis allows the development and transmission of an epidemic to be predicted approximately.
Qt is the cumulative confirmed cases (deaths); a is the predicted maximum of confirmed cases (deaths).
b and c are fitting coefficients. t is the number of days since the first case. t0 is the time when the first case occurred.
(2) Bertalanffy model The Bertalanffy model is often used as a growth model. It is mainly used to study the factors that control and affect growth. It has been used to describe the growth characteristics of fish, and it can also describe the growth of other animals, such as pigs, horses, cattle and sheep, as well as the development of infectious diseases.
The development of infectious diseases is similar to the growth of individuals and populations. In this paper, the Bertalanffy model is selected to describe the law of spread of infectious diseases and to study the factors that control and affect the spread of COVID-19.
Qt is the cumulative confirmed cases (deaths); a is the predicted maximum of confirmed cases (deaths).
b and c are fitting coefficients. t is the number of days since the first case. t0 is the time when the first case occurred.
(3) Gompertz model The model was originally proposed by Gompertz (Gompertz, 1825) as an animal population growth model to describe the extinction law of the population. The development of infectious diseases is similar to the growth of individuals and populations. In this paper, the Gompertz model is selected to describe the law of spread of infectious diseases and to study the factors that control and affect the spread of COVID-19.
Qt is the cumulative confirmed cases (deaths); a is the predicted maximum of confirmed cases (deaths).
b and c are fitting coefficients. t is the number of days since the first case. t0 is the time when the first case occurred.
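The equations of the three models did not survive in this copy of the text. The Python sketch below uses parameterizations commonly given for these curves, written with the symbols defined above (a, b, c, t, t0). The exact forms are an assumption and may differ from the ones the authors fitted; the function names are likewise hypothetical.

```python
import math


def logistic(t, a, b, c, t0=0.0):
    """Assumed form: Q_t = a / (1 + exp(b - c * (t - t0)))."""
    return a / (1.0 + math.exp(b - c * (t - t0)))


def bertalanffy(t, a, b, c, t0=0.0):
    """Assumed form: Q_t = a * (1 - exp(-b * (t - t0))) ** c, for t >= t0."""
    if t < t0:
        raise ValueError("this form is only defined for t >= t0")
    return a * (1.0 - math.exp(-b * (t - t0))) ** c


def gompertz(t, a, b, c, t0=0.0):
    """Assumed form: Q_t = a * exp(-b * exp(-c * (t - t0)))."""
    return a * math.exp(-b * math.exp(-c * (t - t0)))


# Hypothetical coefficients, chosen only to show the shape of each curve.
for day in (0, 10, 20, 40, 80):
    print(day,
          round(logistic(day, 100, 4, 0.5), 2),
          round(bertalanffy(day, 100, 0.1, 3), 2),
          round(gompertz(day, 100, 2, 0.1), 2))
```

In all three assumed forms the curve rises monotonically and levels off at a, which matches the description of a as the predicted maximum of confirmed cases (deaths).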

Model Evaluation
The regression coefficient (R²), also known as the coefficient of determination, is used to evaluate the fitting ability of the various methods and can be obtained by the following equation.
$$R^2 = 1 - \frac{\sum_t (Q_t - \hat{Q}_t)^2}{\sum_t (Q_t - \bar{Q})^2}$$
where $Q_t$ is the actual cumulative confirmed COVID-19 cases; $\hat{Q}_t$ is the predicted cumulative confirmed COVID-19 cases; and $\bar{Q}$ is the average of the actual cumulative confirmed COVID-19 cases. The closer the fitting coefficient is to 1, the more accurate the prediction.
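A minimal Python sketch of this evaluation, assuming the standard definition of R² as one minus the ratio of the residual sum of squares to the total sum of squares, is given below. The function name `r_squared` and the toy numbers are illustrative assumptions.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all actual values are equal")
    return 1.0 - ss_res / ss_tot


# Toy cumulative counts, not taken from the paper.
actual = [10, 25, 60, 120, 180]
predicted = [12, 24, 58, 125, 176]
print("R^2 =", r_squared(actual, predicted))
```

A model that always predicts the mean of the actual values scores 0 under this definition, and a perfect fit scores 1.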

Fitting and analysis of SARS epidemic
As the viruses that cause COVID-19 and SARS are both coronaviruses, their infection patterns may be similar. We therefore first used SARS data to verify the rationality of our models.

Fitting and analysis of COVID-19 epidemic

Number of Confirmed Cases
Among the three models, the Logistic model fits all the data in Wuhan better than the other two, while the Gompertz model fits the data outside Wuhan better.
It is worth noting that, for various reasons, the number of confirmed cases suddenly increased by 13,332 on February 12, 2020. This jump in the data clearly does not originate from the mechanism of the virus, so we removed its effect (13,332 people) from the data before fitting. The impact of the sudden increase in confirmed cases is then taken into account in the later fitting analysis.
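One way to remove such a one-off reporting jump is sketched below, assuming it is done by subtracting the jump from every cumulative value from the day of the jump onward. The text does not spell out the exact procedure, so this is an illustration rather than the authors' method; the function name `remove_reporting_jump` and the toy series are hypothetical.

```python
def remove_reporting_jump(cumulative, jump_index, jump_size):
    """Return a copy of a cumulative series with a one-off jump removed.

    Every value from `jump_index` onward is lowered by `jump_size`, so the
    daily increments before and after the jump are left unchanged.
    """
    if not 0 <= jump_index < len(cumulative):
        raise IndexError("jump_index is outside the series")
    return [value - jump_size if day >= jump_index else value
            for day, value in enumerate(cumulative)]


# Toy series with an artificial jump of 100 on day 2 (not real data).
toy_cumulative = [10, 20, 130, 140, 155]
adjusted = remove_reporting_jump(toy_cumulative, jump_index=2, jump_size=100)
print(adjusted)
```

Under this assumption the removed count has to be added back to the fitted maximum afterwards, which is consistent with the Table 1 note that the final predicted cumulative confirmed number equals a + 13332.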
According to the daily real-time updated data of COVID-19, we used the above three mathematical models to fit the epidemic of COVID-19.
In order to predict the turning point, we used the above three models to fit the new confirmed cases in Wuhan, China mainland and non-Hubei areas. As can be seen in Figure 4, the turning points in Wuhan, China mainland and non-Hubei areas are February 9, February 6 and February 2, 2020, respectively. For newly confirmed cases, the three models predict the COVID-19 epidemic well in both its early and late stages. Among them, the Logistic model fits all the data better than the other two models.
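The turning point can be located numerically as sketched below, assuming it means the day on which the (fitted) number of daily new cases peaks. The function names and the toy series are hypothetical.

```python
def daily_new_cases(cumulative):
    """Differences between consecutive cumulative counts."""
    return [later - earlier for earlier, later in zip(cumulative, cumulative[1:])]


def turning_point(cumulative):
    """Index of the day on which the largest daily increase was recorded.

    Day k's increase is cumulative[k] - cumulative[k - 1], so the result is
    always at least 1. Ties are resolved in favour of the earliest day.
    """
    increments = daily_new_cases(cumulative)
    if not increments:
        raise ValueError("at least two cumulative values are needed")
    peak = max(range(len(increments)), key=increments.__getitem__)
    return peak + 1


# Toy cumulative series (not real data): growth speeds up, then slows down.
toy_cumulative = [0, 1, 3, 7, 12, 15, 16]
print("turning point at day index", turning_point(toy_cumulative))
```

Applying the function to a fitted curve instead of the raw counts avoids picking up day-to-day reporting noise.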
(a) Three models for predicting new COVID-19 cases in Wuhan since January 15, 2020
The fitting parameters of each model can be seen in Table 2, in which a is the predicted death toll; b and c are fitting coefficients; and t is the number of days since the first case. R² (DC) denotes the goodness of fit for cumulative deaths.
According to the available data, the death toll predicted by the three models is 2502-5108 in Wuhan, 107-125 in non-Hubei areas, and 3150-6286 in China mainland. The death tolls predicted by the three models thus differ considerably. This may be because the death rate during the epidemic is affected by more factors than the cumulative and newly confirmed numbers are, such as the continuous improvement of treatment, emergency equipment and measures. However, judging from the fitting precision of the models in Figure 4, the Logistic model is clearly better than the other two models. Considering that the fit of a mathematical model to the later data is more important than its fit to the earlier data, the results of the Logistic model may be the more accurate, that is, the total death toll of COVID-19 is about 2502 in Wuhan, 107 in non-Hubei areas and 3150 in China mainland.

Discussion
The Logistic, Gompertz and Bertalanffy models predict in a similar way, but their mathematical forms differ. For the cumulative number of confirmed cases, all three models predict the development trend of the COVID-19 epidemic better in its later stages. Among them, the Logistic model fits all the data in Wuhan better than the other two models, while the Gompertz model fits the data in non-Hubei areas better. For newly confirmed cases, the three models all predict the COVID-19 epidemic well in both its early and late stages, and the Logistic model fits all the data in Wuhan and non-Hubei areas better than the other two. For the cumulative death toll, the fitting coefficients of the three models are relatively high, and the death toll can be predicted well in the later stage of the epidemic. In the later period, medical resources became more abundant, the capabilities of medical personnel improved, support from across the country increased, and management and treatment became more refined. These factors are likely to reduce the mortality rate of COVID-19 rapidly. They may also have some influence on the cumulative number of confirmed cases, but because the number of confirmed cases is large, the influence of these favorable human factors on it may be small.
At present, there are only a few papers on the prediction of the COVID-19 epidemic. We have collected some COVID-19 epidemic predictions by other researchers, as shown in Table 3. It can be seen from Table 3 that the overall predictions of different models differ considerably. According to the predictions of this article, the cumulative number of confirmed cases will reach its maximum in Wuhan and in the country around the end of March 2020 at the earliest and around April 2020 at the latest. This is basically consistent with the result of Zhong's team 22 that the epidemic would be basically under control at the end of April. The total number of confirmed diagnoses is predicted to be 49852-57447 in Wuhan, 12972-13405 in non-Hubei areas, and 80261-85140 in China mainland. The predicted total death toll is 2502-5108 in Wuhan, 107-125 in non-Hubei areas, and 3150-6286 in China mainland. It should be noted that the China mainland data in this article cover 31 provinces (autonomous regions, municipalities) and the Xinjiang Production and Construction Corps. Another question of concern is when the COVID-19 epidemic will end. Judging from the SARS situation in 2003, the date on which the cumulative number of diagnoses reached its maximum was basically the date when the epidemic ended. According to the results and data of this article, the COVID-19 epidemic is estimated to end at the end of April 2020 in Wuhan and at the end of March 2020 in non-Hubei areas. These results and conclusions hold only on the precondition that the prevention and control measures for the COVID-19 epidemic remain stable and reliable, that foreign cases are not imported into China on a large scale, and that the virus does not produce new and serious acute variations.
Jiwei et al. filtered the Twitter data stream, retained flu-related information, and tagged it with geographic location to show where the flu-related Twitter information came from and how it changed over a certain period of time. They counted 3.6 million flu-related Twitter messages published by about 1 million users from June 2008 to June 2010, showing a highly positive correlation between Twitter's influenza information and the influenza outbreak data provided by the U.S. Centers for Disease Control and Prevention 14. In 2011, Google launched Google Dengue Trends (GDT).
This paper uses 2003 SARS data to verify three mathematical models (Logistic model, Bertalanffy model and Gompertz model) for predicting the development trend of the virus, and then uses these three models to fit and analyze the epidemic trend of COVID-19 in Wuhan, mainland China and non-Hubei areas, including the total number of confirmed cases, the number of deaths and the end time of the epidemic.
The three models (Logistic model, Bertalanffy model and Gompertz model) were used to carry out fitting analysis on the epidemic of COVID-19. The prediction results are shown in Figure 3.

Figure 1 The prediction of cumulative number of confirmed SARS cases fitted by Gompertz, Logistic and Bertalanffy models

Figure 2 The prediction of SARS death toll in mainland China fitted by Gompertz, Logistic and Bertalanffy models

Figure 3 The prediction of cumulative number of COVID-19 cases fitted by Gompertz, Logistic and Bertalanffy models

Figure 5 Three Models for Predicting COVID-19 death toll

Table 1
According to the calculation results of the three models, the final cumulative number of confirmed cases of COVID-19 is estimated to be 49852-57447 in Wuhan, 12972-13405 in non-Hubei areas, and 80261-85140 in China mainland.
In Table 1, a is the predicted cumulative confirmed number (the final predicted cumulative confirmed number = a + 13332); b and c are fitting coefficients; and t is the number of days since the first case. R² (C) denotes the goodness of fit for cumulative confirmed cases, and R² (N) denotes the goodness of fit for new confirmed cases.