Modeling and prediction of the 2019 coronavirus disease spreading in China incorporating human migration data

This study integrates the daily intercity migration data with the classic Susceptible-Exposed-Infected-Removed (SEIR) model to construct a new model suitable for describing the dynamics of epidemic spreading of Coronavirus Disease 2019 (COVID-19) in China. Daily intercity migration data for 367 cities in China were collected from Baidu Migration, a mobile-app based human migration tracking data system. Early outbreak data of infected, recovered and death cases from official source (from January 24 to February 16, 2020) were used for model fitting. The set of model parameters obtained from best data fitting using a constrained nonlinear optimisation procedure was used for estimation of the dynamics of epidemic spreading in the following months. The work was completed on February 19, 2020. Our results showed that the number of infections in most cities in China would peak between mid February to early March 2020, with about 0.8%, less than 0.1% and less than 0.01% of the population eventually infected in Wuhan, Hubei Province and the rest of China, respectively. Moreover, for most cities outside and within Hubei Province (except Wuhan), the total number of infected individuals is expected to be less than 300 and 4000, respectively.


Introduction
The Novel Coronavirus Disease 2019 (COVID-19) (known earlier as New Coronavirus Infected Pneumonia) began to spread since December 2019 from Wuhan, which has been widely regarded as the epicenter of the epidemic, to almost all provinces throughout China and 200 other countries. Up to July 23, 2020, a total of 15,416,529 cases of COVID-19 infection have been confirmed in 213 countries, and the death toll has reached 631,177. In the early phase of the outbreak, China was almost the only country affected by the virus, and on February 19, 2020 (when this work was completed), a total of 74,579 cases were confirmed in China, and the death toll was 2,119. Moreover, as human-to-human transmission had been found to occur in some early Wuhan cases in mid December [1], the high volume and frequency of movement of people from Wuhan to other cities and between cities was an obvious cause for the wide and rapid spread of the disease throughout the country. Studies also suggested strong correlation between the spreading of infectious diseases with intercity travel [2]. The Susceptible-Exposed-Infected-Removed (SEIR) model has traditionally been used to study epidemic spreading with various forms of networks of transmission which define the contact topology [3], such as scalefree networks [4][5][6], small-world networks [7,8], Oregon graph [9,10], and adaptive networks [11]. Moreover, in most studies, the contact process assumes that the contagion expands at a certain rate from an infected individual to his/her neighbor, and that the spreading process takes place in a single population (network). The COVID-19 outbreak, however, began to occur in China and escalated in a special holiday period (about 20 days surrounding the Lunar New Year), during which a huge volume of intercity travel took place, resulting in outbreaks in multiple regions connected by an active transportation network. Thus, in order to understand the early transmission process of COVID-19 in China, it was essential to examine the human migration dynamics, especially between the epicenter Wuhan and other Chinese cities. Recent studies have also revealed the risk of transmission of the virus from Wuhan to other cities [12].
In this paper, we utilized the human migration data collected from Baidu Migration [13], which provided historical indicative daily volume of travellers to/from and between 367 cities in China [14,15]. To demonstrate the impact of intercity traffic on the COVID-19 epidemic spreading, we plot in Fig 1 the number of infected individuals in different cities versus the inflow traffic volume from Wuhan, which clearly shows that for cities farther away from Wuhan, the number of infected individuals almost increases linearly with the inflow traffic from Wuhan. In view of the importance of human migration dynamics to the disease spreading process, we combine, in this study, intercity travel data collected from Baidu Migration [13] with the traditional SEIR model [3] to build a new dynamic model for the spreading of COVID-19 in China. Using official historical data of infected, recovered and death cases in 367 cities, we performed fitting of the data to estimate the best set of model parameters, which were then used to estimate the number of individuals exposed to the virus in each city and to predict the extent of spreading in the coming months. It should be noted that since January 24, 2020, very strict migration control had been imposed in various provinces and cities to restrict travel and hence to curb the spreading of the virus. Based on the early data, our study showed that provided such migration control and other stringent measures continued to be in place, the number of infected cases in various Chinese cities would peak between mid February to early March 2020, with about 0.8%, less than 0.1% and less than 0.01% of the population eventually infected in Wuhan, Hubei Province, and the rest of China, respectively, and no new cases to be expected from mid March. Moreover, for most cities in and outside Hubei Province (except Wuhan), the total number of infected individuals would be less than 4000 and 300, respectively. Finally, as the effectiveness of treatment improved, the recovery rate should increase and the epidemic in China was expected to end by June 2020. It should be stressed that our prediction, completed on February 19, 2020, used the early and relatively small amount of data, and thus verified effectiveness of the model using limited initial outbreak data in predicting pandemic progression.
In the remainder of the paper, we first introduce the official daily infection data and the intercity migration data used in this study. The SEIR model is modified to incorporate the human migration dynamics, giving a realistic model suitable for studying the COVID-19 epidemic spreading dynamics. Historical data of infected, recovered and death cases from official source and data of daily intercity traffic (number of travellers between cities) extracted from Baidu Migration were used to generate the model parameters, which then enabled estimation of the propagation of the epidemic in the following months. We will conclude with a brief discussion of our estimation of the propagation and the reasonableness of our estimation in view of the measures taken by the Chinese authorities in controlling the spreading of this new disease.

Official data of COVID-19 cases
The availability of official data of infected cases in China varies from city to city. Wuhan, being the epicenter, had the first officially confirmed case of COVID-19 infection in China on December 8, 2019 [1]. Most other cities in China began to report cases of COVID-19 infections around mid January 2020. Our data of daily infected and recovered cases, and death tolls, were based on the official data released by the National Health Commission of China, and the daily data used in our study were from January 24, 2020, to February 16, 2020, including the daily total number of confirmed cases in each city, daily total cumulative number of confirmed cases in each city, daily cumulative number of recovered cases in each city, and daily cumulative death toll in each city. It should be emphasized that the official data may not be the actual (true) data. Although the earliest confirmed case in China appeared on December 8, 2019, subsequent missing cases were expected to be significant in Hubei Province in the early stage of the epidemic outbreak. Systematic updates of infection data in other cities began after January

Intercity travel data
As human-to-human transmission had been confirmed to occur in the spreading of COVID-19, gatherings of people and intercity travel of infected and exposed individuals within China were identified as the main drives that escalated the spreading of the virus. The period (around 20 days) surrounding the Lunar New Year (mid January to early February in 2020) was the most important holiday period in China. Migrant workers and students traveled from major cities to country towns for family reunions, and returned to the cities at the end of the holiday period. Holiday goers also traveled to and from tourist cities. China's Ministry of Transport estimated around 3 billion trips to be taken during this period. Wuhan, being a major transport hub and having a large number of higher education institutions as well as manufacturing plants, was among the cities with the largest outflow and inflow traffic before and after the Chinese New Year festival. Our study aimed to incorporate these important human migration dynamics in the construction of the spreading model. We collected daily intercity travel data in China from Baidu Migration, which was a mobile-app based big data system recording movements of mobile phone users. Specifically, we collected Baidu Migration data for 367 cities (or administrative regions) in China over the period of January 1, 2020, to February 13, 2020. Moreover, Baidu Migration data were expected to be inexact and only indicative of the relative volume of movement of people from one city to another. Thus, the migration strengths of cities served as indicative measures of the human traffic volume moving in and out of individual cities and administrative regions, as depicted by the inflow and outflow networks • M records migration from one city to another. Movement within a city is not counted, i.e., m ii (t) = 0 for all i.
• M is non-symmetric as traffic from one city to another is not necessarily reciprocal at any given time, i.e., m ij (t) 6 ¼ m ji (t).

PLOS ONE
The right panels in Fig 3 plot the daily total inflow and outflow migration strengths of Wuhan, showing the abrupt decrease of migration strengths after the city shut down all inbound and outbound traffic from January 24, 2020.

Method
In the SEIR model, each individual in a population may assume one of four possible states at any time in the dynamic process of epidemic spreading, namely, susceptible (S), exposed (E), infected (I) and recovered/removed (R). The dynamics of the epidemic can be described by the following set of equations: EðtÞ ¼ bSðtÞIðtÞ À kEðtÞ; _ IðtÞ ¼ kEðtÞ À gIðtÞ; and _ RðtÞ ¼ gIðtÞ; where S(t), E(t), I(t) and R(t) are, respectively, the number of people susceptible to the disease, exposed (being able to infect others but having no symptoms), infected (diagnosed as confirmed cases), and recovered (including death cases); β is the exposition rate (infection rate of susceptible individuals); κ is the infection rate of exposed individuals; and γ is the recovery rate. For simplicity, recovered individuals include patients recovered from the disease and death tolls. In discrete form, the SEIR model can be represented by DEðtÞ ¼ bSðt À 1ÞIðt À 1Þ À kEðt À 1Þ; DIðtÞ ¼ kEðt À 1Þ À gIðt À 1Þ; where , with t being a daily count. As the incubation period for COVID-19 can be up to 14 days, the number of exposed individuals (who show no symptom but are able to infect others) plays a crucial role in the spreading of the disease. The state E, which is not available from the official data, is thus an important state in our model. Furthermore, combining death toll with the recovered number as state R will simplify the computation without affecting the accuracy of our data fitting and subsequent estimation.

Model
Suppose, for city i, the four states are S i (t), E i (t), I i (t) and R i (t), at time t. Here, we also define a total susceptible population, N s i , which is the eventual number of infected individuals in city i. Thus, N s i represents the size of the group of susceptible, infected, exposed and removed individuals. Moreover, if city i has a population of P i and the eventual percentage of infection is The classic SEIR model would give ΔI i as the difference between the number of exposed individuals who become infected and the number of removed individuals. However, the onset of the COVID-19 epidemic has occurred in a special period of time in China, during which a huge migration traffic is being carried among cities, leading to a highly rapid transmission of the disease throughout the country. In view of this special migration factor, the SEIR model should incorporate the human migration dynamics in order to capture the essential features of the dynamics of the spreading. In particular, for city i, in addition to the abovementioned classic interpretation, the daily increase in the number of infected cases should also include the inflow of infected individuals from other cities, less the outflow of removed cases from city i. In reality, inflow and outflow of exposed individuals to and from the city are also important and to be estimated in the model. Thus, if m ij (t) people move from city i to city j on day t, and the population of city i is P i (t), then the number of infected individuals moving from city i to city j is DI in ij ðtÞ ¼ Also, the number of migrants leaving from city j is P N i¼1 m ji , and the number of infected cases that have migrated out of city j is DI out j ðtÞ ¼ where P j (t) is the population of city j on day t. Thus, the increase in infected cases on day t in city j is given by where ΔI j (t) = I j (t + 1) − I j (t) and κ j (t) is the infection rate in city j on day t, i.e., the rate at which exposed individuals become infected. Moreover, infected individuals, once confirmed, would unlikely be able to migrate to another city. We thus implement this condition by writing (8) as where 0 < k I � 1 is a constant representing the possibility of an infected individual moving from one city to another.
Likewise, incorporating the migrant dynamics, the increase in exposed individuals on day t in city j is where ΔE j (t) = E j (t + 1) − E j (t), β j is the infection rate of susceptible individuals in city j, and α j is the infection rate of exposed individuals in city j. In a likewise fashion, we have where ΔS j (t) = S j (t + 1) − S j (t). Finally, we have where ΔR j (t) = R j (t + 1) − R j (t). In the above derivation, we should note that • the recovered individuals are assumed to stay in city j; • the recovery rates in different cities are assumed to be different due to varied quality of treatments and availability of medical facilities; • the recovery rates increase as time goes, as treatment methods are expected to improve gradually (i.e., taking γ j (t) as a monotonically increasing function); • the eventual recovery rates in all cities will converge to the same constant Γ � 1.
In addition, due to intercity migration, the population of city j on day t would increase or decrease according to  where ΔP j (t) = P j (t + 1) − P j (t). Thus, the total susceptible population should be DN where subscript j denotes the city itself, and subscript i denotes another city from/to which people migrate on day t. Letting X j (t) be the extended state vector, i.e., X j ðtÞ ¼ ½S j ðtÞ E j ðtÞ I j ðtÞ R j ðtÞ P j ðtÞ N s j ðtÞ� T , we write the above difference equation as where f(x) is the right side of (15), and μ j is the set of parameters including α j , β j , γ j , κ j and δ j . For computational convenience, we write (15) as In performing the data fitting, we assume α j (t), β j (t), γ j (t), κ j (t), and δ j are constants throughout the period of spreading, and the spreading begins at t 0 , at which N s j ðt 0 Þ ¼ d j P j ðt 0 Þ.

Parameter identification
The model represented by (17) describes the dynamics of the epidemic propagation with consideration of human migration dynamics. The parameters in model (17) are unknown and to be estimated from historical data. We solve this parameter identification problem via constrained nonlinear programming (CNLP), with the objective of finding an estimated growth trajectory that fits the data. An estimated number of infected cases of each city can be generated from (15) with unknown set θ j , i.e., where I j,0 = I j (t 0 ) is the initial number of infections in city j, and {α j , β j , γ j , κ j , δ j } are parameters that determine the rates of spreading and recovery in city j. Then, the unknown set is Θ = {θ 1 , θ 2 , � � �, θ K } essentially has 5K unknowns, where K is the number of cities, thus requiring an enormous effort of computation. Here, to gain computational efficiency, we assume that • all cities share one parameter set θ = {α, β, κ, γ}; • the numbers of initial infected and exposed individuals in city i are λ I I i (t 0 ) and λ E I i (t 0 ), respectively, where λ I and λ E are constant. Here, I i (t 0 ) represents the actual infected number at time t 0 , while λ I I i (t 0 ) represents the initial infection number used in the model; • each city has an independent δ i .
Then, the size of the unknown set becomes computationally manageable, i.e., Y ¼ fa; b; k; g; d i ; l I ; l E g: Finally, the parameter estimation problem can be formulated as the following constrained nonlinear optimisation problem: where F(�) represents model (15) andxðtÞ ¼ ½ÎðtÞ;RðtÞ; EðtÞ; SðtÞ; PðtÞ; N s ðtÞ� is the set of estimated variables, with unknown set Θ, which is bounded between Θ L and Θ U . In this work, an inverse approach is taken to find the unknown parameters and states by solving (19).
The Root Mean Square Percentage Error (RMSPE) is adopted as the criterion, i.e., fitting error, to measure the difference between the number of infected individuals generated by the model and the official daily infection data.

RMSPE ¼
ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffiffi where K is the number of cities to be evaluated.

Results
We perform data fitting of the model, described by (17), using historical daily infection data provided by the National Health Commission of China, from January 24, 2020 to February 13, 2020. Our approach, as described in the previous section, is to apply constrained nonlinear programming to find the best set of estimates for the unknown parameters and states. Data fitting for all 367 cities are performed. Values are updated iteratively in the optimisation process. Moreover, since all parameters, like infection rates, are to be generated by fitting data with the model, the integrity of the data becomes crucial. As the official Wuhan data are expected to deviate from the true values quite significantly during the early outbreak stage due to uncertainty in diagnosis and other issues related to reporting of the epidemic by the local government, we have allowed the fitting errors for Wuhan to expand over a reasonable range, while the fitting errors for most other cities remain small. In addition, as the epidemic propagates in time, effective control measures and improved public education would reduce the infection rates for the susceptible and exposed individuals, making these parameters time varying in reality. Nonetheless, our fitting assumes these parameters being constant during the short fitting period for computational simplicity. The propagation profiles, in terms of the number of infected individuals and estimated number of exposed individuals, for all 367 cities are estimated. As limited by space, we only show in Fig 4 the results for 20 selected cities. This model can also provide projections of the number of infected and exposed individuals in the next 200 days, as shown in Fig 5, which clearly show that the daily infection would reach a peak sooner or later. By running the identification algorithm, we identified the optimal parameter set as α = 0.5869, β = 0.8949, κ = 0.1008, γ = 0.0602, λ I = 1.9407, and λ E = 1.5144. From the estimated propagation profiles of the COVID-19 epidemic for all 367 cities, we have the following findings:

Discussions
Opinions diverged on the estimated extent of the outbreak of the new coronavirus disease (COVID-19). While there were pure speculations, there were also predictions based on rigorous study of the spreading dynamics. Different models used for prediction and different assumptions made regarding the transmission process would lead to different results and quite diverged conclusions. For instance, an AI-powered simulation run had predicted 2.5 billion people to be infected in 45 days [17]. Academics in Hong Kong expected 1.4 million eventually infected in the city of 7.5 million people. Our results, however, did not seem to agree with such predictions. In fact, our results were expected to be optimistic, under normal circumstances, in the sense that the projected severity and duration of the epidemic were valid provided stringent measures continued to be in place to curb the spreading of the virus, especially before PLOS ONE mid March. Moreover, the effectiveness of medical treatment was expected to improve and the recovery rate was expected to increase in the following months. As our simulation was based on the data collected in the early outbreak phase, the recovery rate could be under-estimated. Should the recovery rate increase by 0.0005 each day, namely, the number of daily recovered individuals increases by 1% of the total number of infected individuals every 20 days, most cities in China would have zero infection case by June 2020. However, as the world is connected and unless strict travel bans were in place (currently most countries still allow their own citizens to return), possibility exists for infected individuals including those who are asymptomatic to move from city to city, however small in quantity. Second and third waves of outbreaks could not be ruled out! A high level of vigilance should be maintained to prevent the continuous spread of the virus, especially via the active transportation network. Furthermore, since this work was completed on February 19, 2020 (medRxiv 10.1101/2020.02.18.20024570), we used a short historical epidemic data and migration data to develop the model and the corresponding system identification algorithm. At the time of performing this work, there was no attempt in combining SEIR model, migration data and system identification techniques to analyze and predict the spread of COVID-19. The results thus have important indicative values on the effectiveness of using limited initial outbreak data in predicting pandemic progression.

Conclusion
The Novel Coronavirus Disease 2019 (COVID-19) epidemic has initially hit China hard. While the virus began to spread to other countries from February 2020, the extent of the outbreak in China remained to be severe in comparison to other countries for much of March and April 2020. Prediction of the severity and duration of the epidemic provided essential information for illuminating social and non-pharmaceutical interventions. However, prediction with the needed level of accuracy was a non-trivial task. In this work, we employed human migration data to provide information on intercity travel that was crucial to the transmission of the novel coronavirus disease from its epicenter Wuhan to other parts of China. The model described in this paper was essentially the classic SEIR model, with intercity travel data supplying the essential information about the number of infected, exposed and recovered individuals moving between different cities. All parameters of the model, including infection rates, recovery rates, and eventual percentage of infected population for 367 cities in China, were identified by fitting the official data collected up to mid-February with the model using a constrained nonlinear programming procedure. Using these parameters, predictions of the number of exposed individuals in 367 cities as well as projections into the next 200 days were made. Our model, however, did not consider the contact network topology that would be necessary if details of the transmission process, such as superspreading events, were to be captured. Nonetheless, our model provided a highly consistent estimation of the propagation of average numbers of exposed, infected and recovered individuals, despite missing details of fluctuation (e.g., sudden surge due to a superspreading event).
Our prediction in mid February 2020 was that provided stringent control measures including travel restriction continue to be in place, the COVID-19 epidemic spreading would peak between mid February to early March 2020, with about 0.8%, less than 0.1% and less than 0.01% of the population eventually infected in Wuhan, Hubei Province and the rest of China, respectively. Moreover, as the effectiveness of treatment improved, the COVID-19 epidemic was expected to end by June 2020. However, possibilities of a second or third wave of outbreaks may exist as intercity travel is still permitted, e.g., homebound travel from regions which are still at different stages of the pandemic progression. It is thus advisable to maintain a high level of vigilance by the public as well as a high level of preparedness for reactivating stringent control measures by government authorities.