Prediction of cross-border spread of the COVID-19 pandemic: A predictive model for imported cases outside China

  • Ying Wang,

    Contributed equally to this work with: Ying Wang, Fang Yuan, Yueqian Song

    Roles Conceptualization, Formal analysis, Investigation, Software, Visualization, Writing – original draft

    Affiliations Science and Technology Research Center of China Customs, Beijing, China, School of Epidemiology and Public Health, Shanxi Medical University, Taiyuan, China, Department of Preventive Medicine, Changzhi Medical College, Changzhi, China

  • Fang Yuan,

    Contributed equally to this work with: Ying Wang, Fang Yuan, Yueqian Song

    Roles Formal analysis, Investigation, Software, Visualization, Writing – original draft

    Affiliation Science and Technology Research Center of China Customs, Beijing, China

  • Yueqian Song,

    Contributed equally to this work with: Ying Wang, Fang Yuan, Yueqian Song

    Roles Conceptualization, Methodology, Supervision

    Affiliation Science and Technology Research Center of China Customs, Beijing, China

  • Huaxiang Rao,

    Roles Data curation, Formal analysis, Writing – review & editing

    Affiliation Department of Preventive Medicine, Changzhi Medical College, Changzhi, China

  • Lili Xiao,

    Roles Formal analysis, Methodology, Validation

    Affiliation Science and Technology Research Center of China Customs, Beijing, China

  • Huilin Guo,

    Roles Methodology, Validation

    Affiliation Science and Technology Research Center of China Customs, Beijing, China

  • Xiaolong Zhang,

    Roles Formal analysis, Validation

    Affiliation Science and Technology Research Center of China Customs, Beijing, China

  • Mufan Li,

    Roles Investigation

    Affiliation School of Epidemiology and Public Health, Shanxi Medical University, Taiyuan, China

  • Jiayu Wang,

    Roles Investigation

    Affiliation School of Epidemiology and Public Health, Shanxi Medical University, Taiyuan, China

  • Yi zhou Ren,

    Roles Investigation

    Affiliation School of Epidemiology and Public Health, Shanxi Medical University, Taiyuan, China

  • Jie Tian,

    Roles Conceptualization, Funding acquisition, Project administration, Resources, Supervision, Writing – review & editing

    jzyang@aliyun.com (JY); tianjie790808@163.com (JT)

    Affiliation Science and Technology Research Center of China Customs, Beijing, China

  • Jianzhou Yang

    Roles Conceptualization, Funding acquisition, Project administration, Supervision, Writing – review & editing

    jzyang@aliyun.com (JY); tianjie790808@163.com (JT)

    Affiliation Department of Preventive Medicine, Changzhi Medical College, Changzhi, China

Abstract

The COVID-19 pandemic has been present globally for more than three years, and cross-border transmission has played an important role in its spread. Currently, most predictions of COVID-19 spread are limited to a single country (or region), and models for assessing cross-border transmission risk remain lacking. Information on imported COVID-19 cases reported from March 2020 to June 2022 was collected from the National Health Commission of China, and COVID-19 epidemic data for the countries of origin of the imported cases were collected from data websites such as WHO and Our World in Data. We propose a prediction model suited to the prevention and control of overseas importation of COVID-19. First, the SIR model was used to fit the epidemic infection status of the countries from which cases were exported; most of the R² values of the fitted curves were above 0.75, indicating that the SIR model fits the infection status of different countries and regions well. On this basis, a combined SIR-multiple linear regression model was established to predict the risk of overseas case importation. The overall model was significant (P < 0.05) with an adjusted R² of 0.7, indicating that the combined model achieves good prediction results. Our model effectively estimates the risk of imported cases of COVID-19 from abroad.

Introduction

Coronavirus disease 2019 (COVID-19), formerly known as novel coronavirus pneumonia, is an infectious pneumonia caused by the emerging pathogen novel coronavirus ("SARS-CoV-2"); its primary transmission routes include droplets, contact, and, under specific conditions, aerosols [1]. Since the outbreak of COVID-19 in December 2019, 762,200,000 infections and 6,890,000 deaths had been reported as of April 6, 2023 [2], seriously affecting the lives, health, lifestyles, economic activities, and social order of people across the world.

Zhan C predicted the development of the epidemic and concluded that the high contagion rate of COVID-19 [3] and new SARS-CoV-2 variants are among the main factors leading to multiple waves of the pandemic [4]. In the face of a highly contagious and highly mutable virus, if a country’s public health system is unprepared for cross-border spread, or if quarantine management and isolation measures for imported cases are underdeveloped, the risk of COVID-19 spread will increase [5]. A UK report based on SARS-CoV-2 sequencing among international travelers showed severe cross-border transmission [6] of high-risk variants in that country. Another report based on phylogenetic analysis showed that the novel SARS-CoV-2 variant 20E (EU1) spread to other European countries in the summer of 2020 immediately after it was first identified in Spain [7]. Such rapid spread of EU1 suggests that European travel guidelines and restrictions were largely inadequate to reduce the risk of cross-border transmission. In addition, it was estimated that Belgium, the Netherlands, and Norway had more imported cases than exported cases in the summer of 2020, which highlights the role of imported cases in the COVID-19 outbreaks in these countries. In Switzerland, COVID-19 cases associated with cross-border transmission had a considerable impact on the dynamics of the local epidemic and could account for its steady increase in the summers of 2020 and 2021 [8]. Therefore, the risk of COVID-19 transmission between countries cannot be ignored.

To predict the epidemic situation of COVID-19 in a timely, accurate, and reliable manner, scholars have conducted numerous studies on the prediction, prevention, and control of COVID-19 transmission [9–13], and infectious disease dynamics models have been proposed. As a tool for both epidemic prediction and practical application, such a model considers the transmission speed, transmission mode, and prevention and control measures of an infectious disease, together with other factors, as a whole [14], and thus has significant value for early warning of infectious diseases and for assessing the effects of prevention and control measures. After reviewing a large body of literature [15–17], and drawing on the information released daily by the Chinese Health Commission and on Hesheng’s modeling analysis and prediction of the controlled and free-transmission stages of the Wuhan epidemic [18], we believe that modeling and analyzing the cross-border transmission of COVID-19 in two stages is currently the best research approach. In the first stage, the epidemic infection status of the exporting country before the entry of imported cases was simulated and predicted: the classic SIR model and the random coefficient method were used to calculate the daily series of existing infections and the SIR fitting curve, and the fitted values were then compared with the actual values to verify the effectiveness of the model. The second stage covered the prediction and analysis of the number of imported cases after the entry of overseas personnel. Specifically, multiple linear regression models were employed to characterize and analyze the impact of the epidemic infection status, the total number of people entering the country, and the effectiveness of prevention and control measures at the airport on the imported cases. The model proposed in this study was then used to predict the imported case data for the first half of 2022, which in turn tested the applicability of the model.

Materials and methods

SIR model of epidemic infection status in exporting countries before entry

COVID-19 is a highly contagious respiratory infectious disease, and its pathogen, SARS-CoV-2, poses a threat to human life and safety, with high transmission efficiency and serious consequences of infection despite a low mortality [19]. The incubation period of COVID-19 is generally 7 days, and patients will not be reinfected shortly after they are cured.

Based on this information, we made the following assumptions: (1) the information on overseas imported patients reported by the health committees of provinces and cities in China and the information on the epidemic situation of the exporting countries provided by the WHO were true and credible; (2) only the existing infected population was predicted and analyzed, without considering potential and suspected patients, and a confirmed patient was infectious to some degree once in contact with the susceptible population; (3) people who recovered from COVID-19 would not be reinfected and were no longer infectious; (4) natural births and natural deaths in the population during the transmission of COVID-19 were not considered; and (5) all patients recovered within 14 days after diagnosis and were then no longer infectious.

The symbols used in the model are as follows: (1) S (susceptible) represents the susceptible population; (2) I (infectious) represents the infected population; and (3) R (recovered) represents the recovered population, i.e., those who are cured. At any time t, the total population N is expressed as N(t) = S(t) + I(t) + R(t). β denotes the probability that a member of the susceptible population S is infected by an infected person I, and γ denotes the probability that an infected person I moves to the recovered population R.

Establishment of differential equations based on the classic SIR model

The SIR model was first proposed in 1926 and was originally used to study the epidemic patterns of the Black Death and plague in London and Mumbai [20]. It is a classical dynamic transmission model that can be used to study the transmission not only of diseases [21] but also of information and computer viruses [22, 23]. Drawing on the classical SIR model for infectious diseases, we express the spread of COVID-19 with the following differential equations:

$$\frac{dS(t)}{dt} = -\beta S(t) I(t) \tag{1}$$

$$\frac{dI(t)}{dt} = \beta S(t) I(t) - \gamma I(t) \tag{2}$$

$$\frac{dR(t)}{dt} = \gamma I(t) \tag{3}$$
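A minimal sketch of how the three compartments evolve, written in Python rather than the R setup the authors describe later. It assumes the mass-action form of the classic model (dS/dt = −βSI, dI/dt = βSI − γI, dR/dt = γI) and a simple forward Euler step; the function name `simulate_sir` and every numerical value are illustrative assumptions, not values from the study.

```python
def simulate_sir(s0, i0, beta, gamma, days, steps_per_day=10):
    """Integrate the classic SIR equations with the forward Euler method.

    Returns one (S, I, R) tuple per day, starting with day 0.
    """
    dt = 1.0 / steps_per_day
    s, i, r = float(s0), float(i0), 0.0
    daily = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i * dt   # flow S -> I
            new_recoveries = gamma * i * dt      # flow I -> R
            s -= new_infections
            i += new_infections - new_recoveries
            r += new_recoveries
        daily.append((s, i, r))
    return daily


# Hypothetical values, chosen only to produce a single epidemic wave.
trajectory = simulate_sir(s0=10_000, i0=10, beta=5e-5, gamma=1 / 14, days=120)
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][1])
print("peak day:", peak_day)
```

Because births and deaths are ignored (assumption 4), S + I + R stays constant along the whole trajectory, which is a quick sanity check for any implementation.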

The propagation process is shown in Fig 1:

Data sources

Cases imported from overseas.

Information on imported cases, including entry time, place of entry, time of diagnosis, flight taken, and exporting country, was collected from the announcements of the health committees of 13 provinces (municipalities) in China, including Beijing and Shanghai.

Flight type and number of passengers taken.

The passenger capacity of the corresponding aircraft model and the number of arrivals were obtained through the port of entry information system.

The daily cumulative number of confirmed cases in the exporting country.

Global data on confirmed cases of novel coronavirus infection were obtained from the World Health Organization (WHO) COVID-19 topic page (https://www.who.int/emergencies/diseases/novel-coronavirus-2019), and the daily cumulative number of confirmed cases was extracted for each exporting country.

Total country population.

The latest published total population of each country was collected from the Our World in Data website (https://ourworldindata.org/).

COVID-19 vaccination status.

The number and percentage of daily vaccinations in each country were collected from the Our World in Data website.

Epidemic prevention measures before flight in exporting countries.

To classify the prevention and control level of entry policies, the level of epidemic prevention measures at the pre-entry airport was graded uniformly according to the requirements for nucleic acid testing, IgG or IgM antibody testing, and antigen testing.

The prevalence of COVID-19 virus variants at different stages in the countries exporting cases to China.

Information on variant strains was collected from the Our World in Data website.

Data processing

Dataset classification.

Data on imported cases from 51 countries were collected. For SIR model fitting, the dataset was divided as follows. First, the original data on the cumulative number of confirmed cases in the 51 countries from March 1, 2020 to December 31, 2021 were summarized and sorted into time series. Then, the total number of people entering from each of the 51 countries, the number of confirmed cases among them, the vaccination rate of the exporting country, the infection rates of variant strains, and the epidemic prevention measures taken at the airport were sorted along the same time series to form the training set. The data from January 1, 2022 to June 30, 2022 formed the test data and were organized in the same way.

Data splicing and preprocessing.

The original data provide the cumulative confirmed numbers for 2020, 2021, and the first half of 2022 as separate sequences, and a time series I(t) of the daily number of existing infections in each country during the epidemic must be generated from them.

First, an inner join was performed on the 2020 and 2021 data, with country as the key, to obtain a wide table of the historical number of infected people in the 51 countries over the two years. The numerical matrix is denoted $H_{i,j}$ (e.g., i = 20200824, j = ALB), where the row index i represents the date (such as 20200824, 20200825, …, 20210923) and the column index j represents the country; the historical infection sequence $H_{\cdot j}$ of a single country is written H(t). The data H(t) were then preprocessed to obtain the time series I(t) of the daily number of infected people in each country. The conversion from the historical cumulative confirmed number H(t) is as follows:

First, let H(t) be the historical cumulative number of confirmed cases on day t and R(t) the cumulative number of recoveries up to day t. Because the SIR model does not consider births and deaths during disease transmission, every confirmed case is either still infected or recovered:

$$H(t) = I(t) + R(t) \tag{4}$$

Second, the data for R(t) are missing, so we assumed that infected patients recover within 14 days after infection, which gives Eq (5):

$$R(t) = H(t-14) \tag{5}$$

Finally, the cumulative number of confirmed cases 14 days earlier, H(t-14), is subtracted from the cumulative number of confirmed cases H(t), as in Eq (6):

$$I(t) = H(t) - R(t) \tag{6}$$

Therefore, the resulting sequence of the number of existing infections I(t) can be expressed as Eq (7):

$$I(t) = H(t) - H(t-14) \tag{7}$$
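The conversion from cumulative confirmed cases to existing infections amounts to a lagged difference. The Python sketch below illustrates it; the function name `active_infections` is hypothetical, and treating H as 0 before the start of the series (so that the first 14 values equal H(t)) is an assumption the text does not spell out.

```python
def active_infections(cumulative, recovery_days=14):
    """Existing infections I(t) = H(t) - H(t - recovery_days).

    `cumulative` is the daily cumulative confirmed count H(t) for one country.
    Assumption: H is taken as 0 before the first recorded day.
    """
    active = []
    for t, h_t in enumerate(cumulative):
        h_past = cumulative[t - recovery_days] if t >= recovery_days else 0
        active.append(h_t - h_past)
    return active


# Toy example with a 2-day recovery window so the arithmetic is easy to follow.
toy_cumulative = [1, 3, 6, 6, 6, 7]
toy_active = active_infections(toy_cumulative, recovery_days=2)
```

With the toy series, day 2 gives 6 − 1 = 5 existing infections and day 4 gives 6 − 6 = 0, which shows how a flat stretch in H(t) drains I(t) to zero after the recovery window.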

Data screening and missing value handling.

The data screening and missing value handling steps are as follows:

Step1: Countries whose sequence I(t) contains 365 or fewer non-zero values of existing infections were excluded, because the sequences of these countries are too short for a valid analysis.

Step2: Sequences of existing infections whose minimum value is too large or whose maximum value is too small were excluded. After this step, the national sequences suitable for the following analyses have essentially been selected.

Step3: Leading zeros in the sequence I(t) of existing infections of each country were removed.

Step4: Trailing zeros in the sequence I(t) of existing infections of each country were removed.

The deleted data either cannot be analyzed or are of minor significance. In the first step, sequences with too few non-zero values were removed because they are difficult to fit with the SIR model. In the second step, sequences whose minimum values were too large or whose maximum values were too small were removed because such data account for only a very small proportion and are therefore of little meaning for subsequent analyses. The third and fourth steps removed leading and trailing zeros because these values cannot be fitted by the SIR model and their removal does not affect the subsequent analysis.
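The four screening steps can be sketched as a small Python filter. The function names are hypothetical. The thresholds for Step2 are not given in the text, so they appear here as optional parameters with no default value, and applying them to the non-zero values only is an assumption.

```python
def trim_zeros(series):
    """Steps 3 and 4: drop leading and trailing zeros, keep zeros inside the wave."""
    start, end = 0, len(series)
    while start < end and series[start] == 0:
        start += 1
    while end > start and series[end - 1] == 0:
        end -= 1
    return series[start:end]


def screen_series(series, min_nonzero_days=365,
                  max_allowed_minimum=None, min_required_maximum=None):
    """Return the trimmed series, or None if the country is excluded."""
    nonzero = [value for value in series if value != 0]
    # Step 1: too few non-zero days to fit an SIR curve.
    if len(nonzero) <= min_nonzero_days:
        return None
    # Step 2: thresholds are unspecified in the source, so they are optional here.
    if max_allowed_minimum is not None and min(nonzero) > max_allowed_minimum:
        return None
    if min_required_maximum is not None and max(nonzero) < min_required_maximum:
        return None
    return trim_zeros(series)
```

Returning `None` for excluded countries keeps the exclusion explicit, so a caller can count how many of the 51 sequences survive each rule.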

After the above steps, the cumulative numbers of confirmed cases H(t) over the past three years, collected from the WHO website, were converted into the usable sequence I(t) of existing infections by subtracting the cumulative number of cases confirmed 14 days earlier, H(t-14).

The infection processes of most epidemics are complex: several infection processes may overlap in the same time period, or errors may occur during data collection. These phenomena show up as abnormal fluctuations in the number of infected people, as in the (t2, t3) interval of Fig 2. In our study, the existing infection sequence I(t) of each country was plotted as a curve, then screened and divided into waves. Only the curve segments that could be simulated by the SIR propagation process with a satisfactory fitting effect were finally selected for simulation and calculation.

Fig 2. One of the schematic diagrams of the I(t) peak interval of MMR.

https://doi.org/10.1371/journal.pone.0301420.g002

As shown in Fig 2, (t1, t2) and (t3, t4) are the well-fitted portions of the curve for the selected available SIR propagation process. The dates corresponding to t1 and t3 are the starting dates of the intervals, and the dates corresponding to t2 and t4 are the termination dates, respectively. The ordinate shows the number of existing infections. The (t2, t3) interval is an abnormal fluctuation that needs to be eliminated.

Peak pre-segmentation during infection.

During SIR fitting, the peak of the number of the infected people can convey the information on the beginning or the end of an infection process. Therefore, it is an appropriate way to divide the infection process by detecting the peak value. The specific steps are as follows:

Step1: Search for the local maximum points of the infection curve I(t) to obtain the point set P of all peaks;

Step2: Set a maximum threshold and a minimum threshold for the peak value, and screen the point set P to obtain the point set p0 of retained peaks;

Step3: For each point in the peak point set p0, set the maximum range for forward and backward detection and the thresholds marking the beginning and end of the propagation, then perform forward and backward detection to find the corresponding segmentation points;

Step4: Segment the curve I(t) at the corresponding segmentation points, and identify and filter out the segments that are too short to be fitted.

To seek local maximum points, the following approach is taken:

Step1: Calculate the first-order difference of the curve I(t) to obtain diff(I);

Step2: Search diff(I) for the rise-then-fall pattern, i.e., find each fragment (piece) in which the curve increases and then decreases;

Step3: Take the maximum of the piece as the local maximum.

This search for local maximum points follows the approach of the findPeaks function in R.
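The first-difference search and the threshold screening of peaks can be sketched in Python as follows. This is an illustration of the idea rather than a port of the R function mentioned above; the function names and the toy thresholds are hypothetical.

```python
def local_maxima(series):
    """Indices where the first difference switches from rising to falling.

    On a flat top, the first index of the plateau is reported.
    """
    peaks = []
    candidate = None  # index reached by the most recent rise
    for k in range(1, len(series)):
        diff = series[k] - series[k - 1]
        if diff > 0:
            candidate = k
        elif diff < 0 and candidate is not None:
            peaks.append(candidate)   # a rise followed by a fall: local maximum
            candidate = None
    return peaks


def filter_peaks(series, peaks, min_height, max_height):
    """Keep only the peaks whose height lies within the two thresholds."""
    return [p for p in peaks if min_height <= series[p] <= max_height]


toy_curve = [0, 2, 5, 3, 1, 4, 4, 2, 6]
toy_peaks = local_maxima(toy_curve)
tall_peaks = filter_peaks(toy_curve, toy_peaks, min_height=5, max_height=10)
```

A rise at the very end of the series (the final value 6 in the toy curve) is not reported, because no fall has been observed yet; this matches the idea that an unfinished wave has no confirmed peak.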

SIR model building

In our study, the SIR model was constructed with reference to Florence Débarre [24], using the deSolve package in R, and it offered good interpretability for the spread of the epidemic. Constructing the model required four parameters: S (susceptible population), β (infection rate), γ (recovery rate), and I (initial number of infections). For the susceptible population S, we took a complete infection waveform and used up to 10 times the maximum number of infected as the range of the susceptible number. The initial number of infections I was taken from the sequence of existing infections I(t) obtained from data processing. Parameters β and γ were generated automatically in R.

The parameters of the SIR model were obtained with the stochastic coefficient method [25], in which the parameters of the SIR epidemic model are regarded as random variables with specific distributions. A system of differential equations was obtained from the stochastic spectral representations of the parameters and integrated numerically to obtain the corresponding parameters.

To search efficiently for the optimal parameter vector and improve the fit of the SIR model, this study used different parameter combinations as initial values and searched across them for the best fit. N groups were randomly selected from the determined vector of initial parameter values, and the optimization was run from each.

The optimization used the optim function in R, and the objective function was the residual between the infected number sequence of the fitted SIR model and the actual infected number sequence (calculated with the dist function in R). The parameter combination constructed for the SIR model was used as the initial value input, with L-BFGS-B as the optimization method, and the parameters S, β, and γ were optimized to output the optimal parameters. For each data segment, the optimization was run from N initial values, and the parameters with the smallest residual were taken as the output. The entire optimization process can be described by the following equation:

$$(S^{*}, \beta^{*}, \gamma^{*}) = \arg\min_{S,\,\beta,\,\gamma} \left\| I_{SIR}(S, \beta, \gamma, I) - I_{real} \right\|_{2} \tag{8}$$

where $I_{SIR}(S, \beta, \gamma, I)$ is the infection sequence of the SIR model determined by the parameters, $I_{real}$ is the real infection sequence, and the objective function is the residual between the two (i.e., the L2 norm of their difference). Because running 25 groups of initial values improved the stability of the optimization results in our experience, we set N = 25 and kept the result with the smallest residual.
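The multi-start idea can be illustrated with a short Python sketch. It is not the authors' R code: a crude random perturbation search stands in for L-BFGS-B, the SIR curve is advanced with one Euler step per day, and the ranges used to draw the initial values are assumptions (only the "up to 10 times the peak" range for S echoes the text). All function names are hypothetical.

```python
import math
import random


def sir_infected(s0, i0, beta, gamma, days):
    """Daily infected counts from a forward-Euler SIR run (one step per day)."""
    s, i = float(s0), float(i0)
    series = [i]
    for _ in range(days - 1):
        new_inf = min(beta * s * i, s)   # cannot infect more people than remain
        new_rec = gamma * i
        s, i = s - new_inf, i + new_inf - new_rec
        series.append(i)
    return series


def l2_residual(params, i0, observed):
    """L2 norm of the difference between fitted and observed infections."""
    s0, beta, gamma = params
    fitted = sir_infected(s0, i0, beta, gamma, len(observed))
    return math.sqrt(sum((f - o) ** 2 for f, o in zip(fitted, observed)))


def local_search(start, i0, observed, rng, iterations=300):
    """Stand-in for L-BFGS-B: keep random multiplicative steps that reduce the residual."""
    best, best_res = list(start), l2_residual(start, i0, observed)
    for _ in range(iterations):
        trial = [p * math.exp(rng.gauss(0.0, 0.1)) for p in best]
        if trial[2] >= 1.0:              # keep the daily recovery rate below 1
            continue
        res = l2_residual(trial, i0, observed)
        if res < best_res:
            best, best_res = trial, res
    return best, best_res


def multi_start_fit(i0, observed, n_starts=25, seed=0):
    """Run the local search from n_starts random initial vectors, keep the best."""
    rng = random.Random(seed)
    peak = max(observed)
    results = []
    for _ in range(n_starts):
        start = [peak * rng.uniform(2, 10),     # S: up to 10x the peak
                 10 ** rng.uniform(-6, -3),     # beta: assumed search range
                 rng.uniform(0.03, 0.3)]        # gamma: assumed search range
        results.append(local_search(start, i0, observed, rng))
    return min(results, key=lambda result: result[1])
```

A typical call would be `best_params, best_residual = multi_start_fit(i0, observed_segment)` for one wave segment. Running several starts guards against a single poor initial vector, which is the stability argument the text gives for N = 25.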

A schematic diagram of the parameter optimization process for the entire SIR model is shown in Fig 3:

Fig 3. Schematic diagram of optimizing the SIR model parameters.

https://doi.org/10.1371/journal.pone.0301420.g003

During SIR model fitting, the numerical precision of the calculation must be checked, and the constant terms rescaled if the fit is poor. In the parameter fitting process, parameter b (β) needs to be multiplied by 10000, while the other parameters need less scaling. After optimization, the optimal S, β, γ, and I of each interval and the coordinates of the peaks are output, as shown in Fig 4:

Fig 4. Schematic diagram of the parameter-optimized fitting results.

https://doi.org/10.1371/journal.pone.0301420.g004

The solid red line in Fig 4 represents the fitted curve, the dots represent the intercepted original interval, the horizontal axis is time, and the vertical axis is the number of infected persons. When the model is applied for prediction, curves that the current SIR fitting reproduces inaccurately should be recorded, and newly emerging curves with prediction errors should be added to the configuration table promptly. Meanwhile, SIR curves that can be predicted but are not predicted accurately should be refitted or labeled for the next step of analysis.

Multiple linear regression model for predicting the number of imported cases after entry

After SIR model fitting, the epidemic data of each country were obtained. The dataset included the number of existing infections in the exporting country, the number of inbound persons from the exporting country, the number of imported cases, the epidemic infection period in the exporting country, the country’s name, and the corresponding time series. Data on vaccination coverage, strain infection efficiency, and epidemic prevention policies were summarized for each country along the same time series. This information was entered into the multiple linear regression model to construct the prediction model.

To train the multiple linear regression model, the feature data sets of each country obtained from SIR model fitting were divided by random sampling into a training set and a test set at a ratio of 7:3. The model was then trained and tested, and appropriate result outputs were selected.
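A minimal Python sketch of the 7:3 random split and an ordinary least squares fit follows. The paper does not state which fitting routine was used, so solving the normal equations with Gaussian elimination is an assumption, as are the function names and the fixed random seed.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and cut them at the requested training fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear_regression(features, targets):
    """Least squares with an intercept; returns [intercept, coef_1, coef_2, ...]."""
    design = [[1.0] + list(row) for row in features]
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefficients, row):
    """Predicted target for one feature row."""
    return coefficients[0] + sum(c * v for c, v in zip(coefficients[1:], row))
```

Each feature row would hold the values of the predictors for one country and period, and the target would be the number of imported cases; the fit fails with a division by zero if two predictors are perfectly collinear, which a production routine would need to handle.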

Combined model prediction process

To describe the import risk more directly, the overall prediction process is described as follows:

Prediction target: the number of cases imported from overseas within a time period T, caseNum;

Predictor variables: the descriptive variable injectFeatureList for the infection status of a country within the time period T;

the descriptive variable stageList for the epidemic stage of the country within T;

the descriptive variable totalFeatureList for the number of people entering from the country during T.

From the viewpoint of practical application, two prediction processes can be used in our study, because an SIR curve as a whole is determined by a small number of parameters.

Prediction process 1.

Variables matching those of the multiple linear regression prediction model were constructed and entered into the regression model directly. The specific process is as follows:

Step1: Identify the prediction formula as follows:

$$\text{cases}_t = f(\text{Inject}_t,\ \text{Input}_t,\ \text{Stage}_t) \tag{9}$$

where f denotes the fitted multiple linear regression model.

Step2: Construct the following variables:

Inject_t: Summary of the number of domestic infections in the exporting country during the same time period;

Input_t: Summary of the number of people arriving from abroad by flight during the same time period;

Stage_t (predicted epidemic stage): If there are enough observed data (which can be judged from the original data) to define the epidemic stage, the stage is entered directly into the multiple linear regression prediction model; if the observed data are insufficient and the infection waveform is incomplete, the SIR model can be fitted first to determine the epidemic stage (as described in prediction process 2), which is then entered into the multiple linear regression model.

Step3: Predict the number of infected people cases_t, i.e., the import risk, with the multiple linear regression model.

Prediction process 2.

The feature table (the table of initial values for SIR fitting) was used together with the SIR model to make the prediction. The specific process is as follows:

Step1: Identify the initial infection segment Inject[t, T] of the exporting country, where [t, T] is the observed interval (i.e., the adjacent time period for which the risk of imported infection will be predicted);

Step2: Feed the data for the observation interval into the existing R program for SIR modeling, and use the SIR model parameters already stored in the feature table (Table 2) to simulate and calculate the curve I(t);

Step3: Among the well-fitted SIR curves, search for the part I(t, T) that is close to the curve I(t) obtained in Step2, calculate its distance (residual) to the initial infection segment Inject[t, T], and output the curve with the smallest distance;

Step4: Using the curve with the smallest distance to the initial infection segment Inject[t, T], summarize its features to obtain a dataset in the same form as the feature table (Table 2). Then run prediction process 1 and output the predicted number of infections imported from overseas.
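Step3 of prediction process 2 can be read as a nearest-segment search. The Python sketch below slides the observed fragment along each stored curve and keeps the window with the smallest Euclidean distance. The dictionary of curves, its keys, and the function names are hypothetical, and the use of a sliding window is one possible interpretation of "the part I(t, T) that is close".

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally long sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def closest_segment(fragment, fitted_curves):
    """Return (distance, curve_name, start_index) of the best-matching window.

    `fitted_curves` maps a label to a well-fitted SIR infection curve.
    Returns None if no curve is at least as long as the fragment.
    """
    width = len(fragment)
    best = None
    for name, curve in fitted_curves.items():
        for start in range(len(curve) - width + 1):
            distance = euclidean(fragment, curve[start:start + width])
            if best is None or distance < best[0]:
                best = (distance, name, start)
    return best


# Toy library of two curves and one observed fragment.
library = {"A": [1, 2, 4, 8, 16, 8, 4], "B": [5, 5, 5, 5]}
match = closest_segment([4, 8, 15], library)
```

The returned start index tells where on the stored curve the observed fragment sits, which is what allows the epidemic stage to be read off before the regression step.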

A schematic diagram of the prediction process is shown in Fig 5:

Results

SIR fitting results for 51 countries are shown in Figs 6–9

Fig 6. Comparison of the actual infected persons among different Asian countries based on SIR fitting results.

https://doi.org/10.1371/journal.pone.0301420.g006

Fig 7. Comparison of the actual infected persons among different European countries based on SIR fitting results.

https://doi.org/10.1371/journal.pone.0301420.g007

Fig 8. Comparison of the actual infected persons among different African countries based on SIR fitting results.

https://doi.org/10.1371/journal.pone.0301420.g008

Fig 9. Comparison of the actual infected persons among different Americas and Oceania countries based on SIR fitting results.

https://doi.org/10.1371/journal.pone.0301420.g009

SIR model fitting effect assessment

With the parameter solution method, the optimal parameters for the SIR model were obtained, and these parameters were then used to predict the development of the epidemic in each country. The fitting effect of each curve was evaluated by R², which is formulated as follows:

$$R^{2} = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^{2}}{\sum_{i}(y_i - \bar{y})^{2}} \tag{10}$$

where $y_i$, $\hat{y}_i$, and $\bar{y}$ represent each true value, the corresponding predicted value, and the mean of the sequence, respectively. The closer R² is to 1, the better the curve is considered to be fitted.
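For a single fitted wave, the coefficient of determination can be computed in a few lines of Python. The sketch assumes the standard definition (one minus the ratio of the residual sum of squares to the total sum of squares); the function name is hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination between observed and fitted values."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot


perfect_fit = r_squared([1, 2, 3], [1, 2, 3])
mean_only_fit = r_squared([1, 2, 3], [2, 2, 2])
```

A fit that reproduces the data exactly scores 1, a fit no better than the mean of the data scores 0, and a fit worse than the mean gives a negative value under this definition.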

To test the accuracy of the model, R² was calculated for each fitted epidemic and a density plot was drawn, as shown in Fig 10:

Fig 10. R² density plot of SIR model goodness of fit for all infection curves in 51 countries.

https://doi.org/10.1371/journal.pone.0301420.g010

As shown in Fig 10, most of the infection curves in the 51 countries have R² above 0.75, and only a few have R² < 0.4. Overall, the mean R² is about 0.86, which indicates that the infection status of COVID-19 in the different exporting countries before entry into China is described well by the SIR model.

Multiple regression analysis of the SIR model outputs

After preliminary data processing and SIR fitting, the datasets that were finally entered into the multiple linear regression model were summarized. An explanation table of the variables in this dataset and some examples of the sample are shown in Tables 1 and 2:

Table 1. Interpretation table of the sample data for regression analysis.

https://doi.org/10.1371/journal.pone.0301420.t001

Table 2. Examples of the regression analysis sample part for SIR model fitting.

https://doi.org/10.1371/journal.pone.0301420.t002

Construction results of the multiple linear regression model

After the multiple linear regression model was trained on the feature data set output by the SIR model (Table 2), the results summarized in Table 3 were obtained:

Table 3. Significance test results of the optimized model.

https://doi.org/10.1371/journal.pone.0301420.t003

Table 3 shows that the number of infections in the exporting country (injectFeatureList) and the total number of people arriving from the exporting country (totalFeatureList) had statistically significant effects on the number of cases imported from overseas (P < 0.05). Across the whole model, the influencing factors ranked from strongest to weakest were the total number of people arriving from the exporting country, the number of infected cases in the exporting country, and the epidemic stage of the exporting country.

Finally, the trained model had the following structure: (11) From the above formula, all parameters except stageList were positively correlated with the target value. The formula can be simply understood as follows: the greater the number of infected people in a country, the greater the total number of people imported from overseas, and the more advanced the stage of the epidemic, the higher the risk of importation. The slight negative correlation between the numerical codes assigned to the epidemic stage (stageList) and the target value may be due to a further reduction in infectivity later in the epidemic.
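To make the structure of such a model concrete, the sketch below fits an ordinary least squares regression with three predictors named after the variables in the text (injectFeatureList, totalFeatureList, stageList). It is an illustration under stated assumptions: the data are synthetic, the coefficients in `assumed_coefs` are hypothetical and are not the fitted values of Eq (11), and the helper functions are not taken from the study.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[k]] for k, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[k][n] / m[k][k] for k in range(n)]


def fit_ols(rows, targets):
    """Return [intercept, b1, ..., bk] from the normal equations X'X b = X'y."""
    x = [[1.0] + list(row) for row in rows]
    k = len(x[0])
    xtx = [[sum(r[p] * r[q] for r in x) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * t for r, t in zip(x, targets)) for p in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coefs, row):
    """Linear prediction: intercept plus weighted sum of the features."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Synthetic rows: (injectFeatureList, totalFeatureList, stageList).
synthetic_rows = [(120, 300, 1), (450, 280, 2), (900, 150, 3), (300, 500, 2),
                  (60, 90, 1), (700, 420, 3), (200, 350, 4), (520, 60, 4)]
# Hypothetical coefficients (intercept, inject, total, stage); the negative
# stage coefficient only mirrors the sign pattern described in the text.
assumed_coefs = [0.5, 0.002, 0.01, -0.1]
synthetic_targets = [predict(assumed_coefs, row) for row in synthetic_rows]

fitted_coefs = fit_ols(synthetic_rows, synthetic_targets)
print([round(c, 4) for c in fitted_coefs])
```

Because the synthetic targets are generated without noise, the fit recovers the assumed coefficients; with real data the estimates would carry sampling error and would need the significance tests reported in Table 3.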

Multiple linear regression model residual analysis

A residual is the difference between the actual value and the predicted value. Residual analysis can be used to check whether the model assumptions are reasonable and whether the data are reliable, and it is an effective tool for checking how well the data conform to the model. The distributions of the residuals of the above model for the training set and the test set are shown in Figs 11 and 12, respectively.

Fig 11. Q-Q plot of the residuals for the training set.

https://doi.org/10.1371/journal.pone.0301420.g011

As shown in the figures above, the residuals of the test set and the middle part of the residuals of the training set approximately followed a normal distribution, indicating that the model fits well over these ranges. Outliers in both the training set and the test set caused parts of the plots to appear skewed.
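For readers unfamiliar with Q-Q plots, the following sketch shows how the plotted points can be derived from a list of residuals. It is not the authors' plotting code: the function name `qq_points`, the plotting position (k - 0.5)/n, and the example residuals are assumptions made for illustration.

```python
from statistics import NormalDist, mean, stdev


def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot.

    Sample quantiles are the standardized, sorted residuals; theoretical
    quantiles come from the standard normal distribution.
    """
    ordered = sorted(residuals)
    n = len(ordered)
    mu, sigma = mean(ordered), stdev(ordered)
    std_normal = NormalDist()
    pairs = []
    for k, value in enumerate(ordered, start=1):
        p = (k - 0.5) / n  # plotting position (an assumed, common choice)
        pairs.append((std_normal.inv_cdf(p), (value - mu) / sigma))
    return pairs


# Hypothetical residuals, used only to demonstrate the calculation.
example_pairs = qq_points([-3.0, -1.0, 0.0, 1.0, 3.0])
for theoretical, sample in example_pairs:
    print(round(theoretical, 3), round(sample, 3))
```

If the residuals are close to normal, the pairs lie near the line y = x; points that bend away from the line at either end correspond to the outliers and skewness described above.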

As shown in Table 4, the test set was consistent with the training set at the median of the residuals, indicating relative stability. Up to the 75% quantile, the predicted values of both the test and training sets were slightly larger than the actual values (negative residuals). Beyond the 75% quantile, the predicted values gradually became smaller than the actual values, and the maximum deviation of the test set was noticeably greater than that of the training set.

Table 4. Comparison of the residuals between the test set and the training set (real-predict).

https://doi.org/10.1371/journal.pone.0301420.t004
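A residual summary of the kind reported in Table 4 (actual minus predicted, described by its quantiles) can be computed as in the sketch below. The function name, the choice of the "inclusive" quantile method, and the example numbers are assumptions for illustration and are not values from the study.

```python
from statistics import quantiles


def residual_summary(actual, predicted):
    """Summarize residuals (actual - predicted) by min, quartiles and max."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    q1, q2, q3 = quantiles(residuals, n=4, method="inclusive")
    return {"min": min(residuals), "25%": q1, "50%": q2,
            "75%": q3, "max": max(residuals)}


# Hypothetical actual and predicted counts, for demonstration only.
summary = residual_summary(actual=[0, 2, 1, 5, 9], predicted=[1, 2, 2, 3, 4])
print(summary)
```

Computing the same summary separately for the training set and the test set allows the two to be compared quantile by quantile: negative entries mean the model over-predicts, and positive entries mean it under-predicts.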

Multiple linear regression model validation

The 2022 test set data was used to predict overseas imported cases, and the results were as follows:

As shown in Fig 13, the model we established predicts overall more overseas imported cases than were actually observed. A possible reason is that international flights were still restricted when the data were collected, so flights did not arrive every day. As a result, the total number of inbound travelers was 0 on some days, which makes the predicted value higher than the actual value. Flight restrictions have since been fully relaxed, and the prediction model established in this research is expected to perform better in future applications.

Fig 13. Forecast results of the SIR-multiple linear regression model on the 2022 forecast set.

https://doi.org/10.1371/journal.pone.0301420.g013

Predicted result evaluation

As shown in Fig 14, the residuals of the prediction set followed a normal distribution, indicating that the model performed very well on the 2022 prediction set.

Discussion

Since the first outbreak of COVID-19, efforts to predict its development trend have never stopped, and a number of prediction models have been proposed. These include a combined prediction model [26] based on complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN), extreme gradient boosting (XGBoost), and web search data (WSD), a long short-term memory (LSTM) neural network with Dropout [27], and an optimized SEIR model [28]. However, the combination of CEEMDAN, XGBoost and WSD is complex and requires highly timely data [26]. For the LSTM approach [27], the error interval becomes too large when it is used to predict cumulative confirmed numbers with a large base number over a long statistical period. The SEIR model considers the impact of the latency of novel coronavirus infection on the development of the epidemic, but in the actual transmission process the differences in onset time among latent individuals are difficult to count [29], and objective factors such as isolation measures have a considerable impact, so the prediction model cannot accurately obtain the real epidemic parameters [28]. In the actual spread of the epidemic, it is reasonable to assume that changes in the numbers of infected and latent persons in a country are eventually reflected in the total number of confirmed cases in that country, from which a better prediction of the epidemic's development can be obtained; this assumption is supported by Kremer et al. [30]. Therefore, the classical SIR model of infectious diseases was selected in our study for epidemic evolution prediction. In addition, existing studies have shown that direct use of the SIR model [31] is effective for such prediction tasks and offers good interpretability for the current spread of the epidemic.

For the final prediction of overseas imported cases, the multiple linear regression model used in our study may appear slightly weaker than other prediction models. However, Rath et al. used a multiple linear regression model to predict the COVID-19 epidemic situation in India; after reviewing historical applications of the model and assessing the Indian epidemic data, they found that the model achieved good prediction results [32]. Their findings indicate that multiple linear regression can be a satisfactory tool for predicting COVID-19. Similarly, Singh and Bawa predicted the number of deaths caused by COVID-19 using a regression model based on data collected from the Hopkins Data website [33]. Multiple linear regression models can achieve high predictive accuracy, as evidenced by the regression model of Bakhtiarvand et al. for analyzing and predicting the severity of COVID-19 patients [34]. Considering that the data used in our study are linear, the multiple linear regression model should be suitable for our prediction. The comprehensive significance test showed that the influencing factors on the number of imported cases from abroad, from strongest to weakest, were the number of infections in the exporting country, the number of imports from the exporting country, and the period of the epidemic in the exporting country. The multiple linear regression model established in our study produced slightly higher prediction values, which might be related to the fact that, during the data collection period, China's international flights were managed under the "Five Ones" policy (one airline retains one route to one country, with at most one flight per week) [35, 36]. In this study, the multiple linear regression model was used to predict the number of inbound persons within a period of time, so the restriction on the number of flights during the control period might partially account for the higher predictions. Furthermore, the data collected from the WHO website on the COVID-19 epidemic in the exporting countries may suffer from a time lag between epidemiological discovery and data uploading, which might also affect the accuracy of the prediction results to some extent. Nevertheless, with the gradual removal of China's national epidemic prevention and control policies, the cross-border transmission prediction model established in our study can be further tested and verified with larger sample sizes.

Conclusions

In our study, the infection status of imported cases before entering the country and the cross-border transmission of COVID-19 after entry were modeled and analyzed. The SIR model predicted the overseas epidemic situation well, and an overseas import combined prediction model was constructed on this basis using the overseas imported case data. The SIR-multiple linear regression combined model showed a good overall prediction effect when used to predict the risk of imported cases from abroad.

Limitations and future recommendation

Our study has the following limitations. First, the data collected in this study covered relatively few dimensions, and data on other variables that affect model fitting (such as the rate of mask wearing in the country, the number of open public places, government regulatory measures, etc.) were difficult to collect. Excluding these factors from modeling might affect the accuracy of the model. Second, the multiple linear regression model itself has limited fitting capacity. Consequently, the actual fit might be deficient, which might affect the final prediction results of this study. Third, the influence of different factors on the prediction results varies considerably, and some factors exert too weak an influence to be reflected by the correlation analysis in the model.

The SIR model was used to fit the epidemic transmission status of a country, and the resulting domestic epidemic parameters were then entered into the multiple linear regression model to predict the number of imported cases. The SIR-multiple linear regression combined prediction model transforms the magnitude of the cross-border import likelihood into more intuitive and visible values and thus provides a reference for judging the risk of overseas importation at ports. When used to predict the risk of overseas importation after the liberalization of epidemic control in China, the model is expected to have a better prediction effect.

Supporting information

S1 Dataset. These data are available at WHO coronavirus disease (COVID-19).

This website contains daily epidemic information of overseas imported countries. https://www.who.int/emergencies/diseases/novel-coronavirus-2019.

https://doi.org/10.1371/journal.pone.0301420.s001

(CSV)

S2 Dataset. Summary of datasets used in multiple linear regression models.

This data is used to predict cases imported from abroad.

https://doi.org/10.1371/journal.pone.0301420.s002

(XLSX)

S3 Dataset. Summary of datasets fitted by the SIR models.

This data is used to predict the prevalence of COVID-19 abroad.

https://doi.org/10.1371/journal.pone.0301420.s003

(CSV)

S4 Dataset. Summary of datasets used in multiple linear regression models.

This data is used to train the Multiple Linear Regression Models.

https://doi.org/10.1371/journal.pone.0301420.s004

(CSV)

References

  1. Wang Z, Fu Y, Guo Z, Li J, Li J, Cheng H, et al. Transmission and prevention of SARS-CoV-2. Biochemical Society Transactions. 2020;48(5):2307–16. pmid:33084885
  2. World Health Organization. WHO Coronavirus (COVID-19) Dashboard With Vaccination Data. https://covid19.who.int/
  3. Zhan C, Tse CK, Fu Y, Lai Z, Zhang H. Modeling and prediction of the 2019 coronavirus disease spreading in China incorporating human migration data. PLoS One. 2020;15(10):e0241171. pmid:33108386
  4. Zhan C, Zheng Y, Shao L, Chen G, Zhang H. Modeling the spread dynamics of multiple-variant coronavirus disease under public health interventions: A general framework. Inf Sci (N Y). 2023;628:469–487. pmid:36777698
  5. Nyasulu JCY, Munthali RJ, Nyondo-Mipando AL, Pandya H, Nyirenda L, Nyasulu PS, et al. COVID-19 pandemic in Malawi: Did public sociopolitical events gatherings contribute to its first-wave local transmission? Int J Infect Dis. 2021;106:269–75. pmid:33771674
  6. Williams GH, Llewelyn A, Brandao R, Chowdhary K, Hardisty KM, Loddo M. SARS-CoV-2 testing and sequencing for international arrivals reveals significant cross border transmission of high risk variants into the United Kingdom. EClinicalMedicine. 2021;38:101021. pmid:34278277
  7. Hodcroft EB, Zuber M, Nadeau S, Vaughan TG, Crawford KHD, Althaus CL, et al. Spread of a SARS-CoV-2 variant through Europe in the summer of 2020. Nature. 2021;595(7869):707–12. pmid:34098568
  8. Reichmuth ML, Hodcroft EB, Riou J, Neher RA, Hens N, Althaus CL. Impact of cross-border-associated cases on the SARS-CoV-2 epidemic in Switzerland during summer 2020 and 2021. Epidemics. 2022;41:100654. pmid:36444785
  9. Kumar S, Sharma R, Tsunoda T, Kumarevel T, Sharma A. Forecasting the spread of COVID-19 using LSTM network. BMC Bioinformatics. 2021;22(S6). pmid:34112086
  10. Liu M, Thomadsen R, Yao S. Forecasting the spread of COVID-19 under different reopening strategies. Scientific Reports. 2020;10(1). pmid:33230234
  11. Giuliani D, Dickson MM, Espa G, Santi F. Modelling and predicting the spatio-temporal spread of COVID-19 in Italy. BMC Infect Dis. 2020;20(1):700. pmid:32967639
  12. Majeed B, Li A, Peng J, Lin Y. A Multi-Period Curve Fitting Model for Short-Term Prediction of the COVID-19 Spread in the U.S. Metropolitans. Front Public Health. 2021;9:809877. pmid:35118046
  13. Hassan F, Albahli S, Javed A, Irtaza A. A Robust Framework for Epidemic Analysis, Prediction and Detection of COVID-19. Front Public Health. 2022;10:805086. pmid:35602122
  14. Kong L, Duan M, Shi J, Hong J, Chang Z, Zhang Z. Compartmental structures used in modeling COVID-19: a scoping review. Infect Dis Poverty. 2022;11(1):72. pmid:35729655
  15. Shen SP, Wei YY, Zhao Y, Jiang Y, Guan JX, Chen F. Zhonghua Liu Xing Bing Xue Za Zhi. 2020;41(10):1582–1587.
  16. Wu JT, Leung K, Leung GM. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: a modelling study [published correction appears in Lancet. 2020 Feb 4]. Lancet. 2020;395(10225):689–697. pmid:32014114
  17. Wu CH, Chou YC, Lin FH, et al. Epidemiological features of domestic and imported cases with COVID-19 between January 2020 and March 2021 in Taiwan. Medicine (Baltimore). 2021;100(39):e27360. pmid:34596146
  18. Sheng H, Wu L, Xiao C. Modeling analysis and prediction of the spread of new coronary pneumonia epidemic. Journal of System Simulation. 2020;32(5):759–66.
  19. Xu L, Zhang H, Xu H, Yang H, Zhang L, Zhang W, et al. The coSIR model predicts effective strategies to limit the spread of SARS-CoV-2 variants with low severity and high transmissibility. Nonlinear Dyn. 2021;105(3):2757–73. pmid:34334951
  20. Kermack W, McKendrick A. Contribution to the mathematical theory of epidemics--I. 1927. Bull Math Biol. 1991;53(1–2):33–55. pmid:2059741
  21. Huarachi Olivera RE, Lazarte RAM. SIR model of the pandemic trend of COVID-19 in Peru. Rev Fac Cien Med Univ Nac Cordoba. 2021;78(3):236–42. pmid:34617709
  22. Chen T, Chen Z, Jin X. A multiple information model incorporating limited attention and information environment. PLoS One. 2021;16(10):e0257844. pmid:34618813
  23. Zimeras S, Diomidous M. Computer Virus Models - The Susceptible Infected Removed (SIR) Model. Studies in Health Technology and Informatics. 2018;251:75–7. pmid:29968605
  24. Débarre F. SIR models of epidemics. Level 1 module in "Modelling course in population and evolutionary biology".
  25. Chen-Charpentier BM, Stanescu D. Epidemic models with random coefficients. Mathematical and Computer Modelling. 2010;52(7–8):1004–10.
  26. Li S, Wang X. Research application of XGBoost model in novel crown epidemic prediction. Small microcomputer system. 2021;42(12):2465–72.
  27. Wang R, Yan F, Lu J, Yang W. Prediction of new coronavirus trends using the Dropout-LSTM model. Journal of University of Electronic Science and Technology of China. 2021;50(3):414–21.
  28. Qiu Z, Sun Y, He X, Wei J, Zhou R, Bai J, et al. Application of genetic algorithm combined with improved SEIR model in predicting the epidemic trend of COVID-19, China. Sci Rep. 2022;12(1):8910. pmid:35618751
  29. Salzberger B, Buder F, Lampl BT, Ehrenstein B, Hitzenbichler F, Holzmann T, et al. SARS-CoV-2/COVID-19-epidemiology and prevention. Der Nephrologe. 2021;16(1):3–9. pmid:33343742
  30. Kremer C, Ganyani T, Chen D, Torneri A, Faes C, Wallinga J, et al. Authors' response: Estimating the generation interval for COVID-19 based on symptom onset data. Euro Surveill. 2020;25(29). pmid:32720639
  31. Saxena R, Jadeja M, Bhateja V. Propagation Analysis of COVID-19: An SIR Model-Based Investigation of the Pandemic. Arabian Journal for Science and Engineering. 2021. pmid:34395158
  32. Rath S, Tripathy A, Tripathy AR. Prediction of new active cases of coronavirus disease (COVID-19) pandemic using multiple linear regression model. Diabetes Metab Syndr. 2020;14(5):1467–74. pmid:32771920
  33. Singh H, Bawa S. Predicting COVID-19 statistics using machine learning regression model: Li-MuLi-Poly. Multimed Syst. 2022;28(1):113–20. pmid:33976474
  34. Bakhtiarvand N, Khashei M, Mahnam M, Hajiahmadi S. A novel reliability-based regression model to analyze and forecast the severity of COVID-19 patients. BMC Med Inform Decis Mak. 2022;22(1):123. pmid:35513811
  35. Civil Aviation Administration of China. The Civil Aviation Administration of China issued a notice on "Several Measures for the Resumption of International Passenger Flights". 2022. http://www.caac.gov.cn/
  36. Civil Aviation Administration of China. Civil Aviation Authority Holds First Press Conference in April 2020. 2020. http://www.caac.gov.cn/