The authors have declared that no competing interests exist.

Conceived and designed the experiments: XZ. Performed the experiments: XZ TZ. Analyzed the data: XZ TZ XL. Contributed reagents/materials/analysis tools: XZ XL. Wrote the paper: XZ TZ AY XL.

Public health surveillance systems provide valuable data for reliable prediction of future epidemic events. This paper describes a study that used nine types of infectious disease data collected through a national public health surveillance system in mainland China to evaluate and compare the performances of four time series methods, namely, two decomposition methods (regression and exponential smoothing), autoregressive integrated moving average (ARIMA) and support vector machine (SVM). The data obtained from 2005 to 2011 and in 2012 were used as modeling and forecasting samples, respectively. The performances were evaluated based on three metrics: mean absolute error (MAE), mean absolute percentage error (MAPE), and mean square error (MSE). The accuracy of the statistical models in forecasting future epidemic disease incidence demonstrated their effectiveness in epidemiological surveillance. Although the comparisons found that no single method is completely superior to the others, the present study highlighted that SVM outperforms the ARIMA model and the decomposition methods in most cases.

Public health surveillance is an important way to continuously collect, analyze, interpret and disseminate health data essential to prevention and control

The decomposition methods are generally the most traditional methods in time series analysis

ARIMA models are among the most widely used methods

In recent years, machine learning based time series models such as artificial neural networks have been successfully applied for modeling infectious disease incidence time series

The objectives of the present paper are to compare four typical time series methods, namely, two decomposition methods (regression and exponential smoothing), the ARIMA model and SVMs, in theory and in practice, and to assess their real forecasting efficacy on epidemic time series. This comparison may help epidemiologists choose the most suitable methodology in a given situation.

We gathered the available monthly incidence time series of nine typical infectious diseases reported by the Chinese Center for Disease Control and Prevention (CDC). The data were collected from the Chinese National Surveillance System established in 2004. The incidence time series of brucellosis, gonorrhea, hemorrhagic fever with renal syndrome (HFRS), hepatitis A (HA), hepatitis B (HB), scarlet fever, schistosomiasis, syphilis and typhoid fever from 2005 to 2012 were collected.

The decomposition methods try to separate the underlying pattern in the data series from randomness. The underlying pattern can then be used to predict future trends and make forecasts. It can also be broken down into sub-patterns to identify the component factors that influence each of the values in a series. Decomposition methods usually identify two separate components of the underlying pattern that tend to characterize infectious disease time series: the trend cycle and the seasonal factors. The trend cycle represents long-term changes. The seasonal factor represents periodic fluctuations of constant length that are usually caused by known factors such as rainfall, month of the year, temperature and the timing of holidays. The decomposition model assumes that the data have the following form:

Time series = Pattern + Error = Trend cycle + Seasonality + Error

The seasonality part of the time series is usually expressed with the seasonal indices

Once the seasonal indices are calculated, the data can be deseasonalized by dividing each value by the corresponding index.

Deseasonalized data = Raw data/Seasonal Index
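The deseasonalization step can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: it assumes a simple ratio-to-overall-mean estimator for the indices (the text does not state which estimator was used), and the names `seasonal_indices` and `deseasonalize` are hypothetical.

```python
from statistics import mean

def seasonal_indices(series, period=12):
    """Return one index per season: season mean divided by overall mean.

    Assumption: a simple ratio-to-overall-mean estimator; other estimators
    (for example ratio-to-moving-average) are equally plausible.
    """
    overall = mean(series)
    return [mean(series[s::period]) / overall for s in range(period)]

def deseasonalize(series, indices):
    """Divide each observation by the index of its season."""
    period = len(indices)
    return [x / indices[i % period] for i, x in enumerate(series)]

# Toy data: two "years" of four "seasons" each (not surveillance data).
toy = [1.0, 2.0, 3.0, 2.0, 1.0, 2.0, 3.0, 2.0]
idx = seasonal_indices(toy, period=4)
flat = deseasonalize(toy, idx)
```

On this toy series the seasonal pattern is removed completely, so every deseasonalized value equals the overall mean.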

The long-term trend is estimated from the deseasonalized data. There are many ways to estimate the long-term trend, such as moving average, exponential smoothing, and linear regression. In simple moving average methods, the current value is calculated as the mean of its previous

The linear regression method is another simple way to express the long term trend in which a common linear regression model is established between the incidence and time

The ARIMA model originates from the autoregressive (AR) model, the moving average (MA) model, and their combination, the ARMA model

MA models express the current value of the time series

ARMA models are a combination of AR and MA models, in which the current value of the time series is expressed linearly in terms of its previous values as well as current and previous residual series. It can be expressed as:
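For reference, an ARMA(p, q) model is commonly written as below. The symbols (x_t for the series, \varphi_i and \theta_j for the AR and MA coefficients, \varepsilon_t for the residual series) and the minus signs on the MA terms are assumptions following one common convention, since the original expression is not reproduced in this text.

```latex
x_t = \varphi_1 x_{t-1} + \cdots + \varphi_p x_{t-p}
      + \varepsilon_t - \theta_1 \varepsilon_{t-1} - \cdots - \theta_q \varepsilon_{t-q}
```

Setting q = 0 gives the pure AR model, and setting p = 0 gives the pure MA model.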

The ARIMA model extends the ARMA model to non-stationary time series through differencing. The series is differenced until it is stationary, and the differenced series is then modeled as an ARMA process, which yields the ARIMA model.

The ARIMA model is usually denoted ARIMA(p,d,q)×(P,D,Q)_{S}

The ARIMA modeling procedure consists of three iterative steps: identification, estimation, and diagnostic checking. Prior to fitting the ARIMA model, an appropriate difference of the series is usually performed to make the series stationary. Identification is the process of determining seasonal and non-seasonal orders using the autocorrelation functions (ACF) and partial autocorrelation functions (PACF) of the transformed data
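The differencing and identification steps can be illustrated with a short pure-Python sketch. It covers only seasonal differencing and the sample ACF; the PACF, estimation and diagnostic checking are omitted. The function names and the toy series are assumptions made for illustration.

```python
def difference(series, lag=1):
    """Return the lag-differenced series x[t] - x[t - lag]."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

def acf(series, max_lag):
    """Sample autocorrelation function for lags 0..max_lag."""
    n = len(series)
    m = sum(series) / n
    dev = [x - m for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

# Toy monthly series: linear trend plus a fixed 12-month pattern.
pattern = [0, 1, 3, 5, 6, 5, 3, 1, 0, -1, -2, -1]
toy = [0.1 * t + pattern[t % 12] for t in range(48)]

seasonal_diff = difference(toy, lag=12)  # one seasonal difference (D = 1)
rho = acf(toy, max_lag=12)               # ACF of the undifferenced series
```

Here one seasonal difference removes both the fixed seasonal pattern and the linear trend, leaving a constant series, which is the kind of behavior that motivates differencing before identification.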

SVMs estimate the regression using a set of linear functions that are defined in a high dimensional space. SVMs carry out the regression estimation by using Vapnik's

Assume that

In

To estimate

Subjected to

Finally, by introducing Lagrange multipliers and exploiting the optimality constraints, the decision function given by

In
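The fragments above follow the outline of the usual \varepsilon-insensitive support vector regression derivation. The standard textbook formulation is sketched below; the symbols (w, b, \phi, C, \varepsilon, \xi_i, \xi_i^{*}, \alpha_i, \alpha_i^{*}, K) are assumptions and may not match the notation of the original equations, which are not reproduced in this text.

```latex
% Primal problem
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad & \tfrac{1}{2}\lVert w\rVert^{2}
    + C\sum_{i=1}^{n}\left(\xi_i + \xi_i^{*}\right) \\
\text{subject to} \quad & y_i - w^{\top}\phi(x_i) - b \le \varepsilon + \xi_i, \\
& w^{\top}\phi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i,\ \xi_i^{*} \ge 0, \qquad i = 1,\dots,n .
\end{aligned}

% Decision function after introducing Lagrange multipliers
f(x) = \sum_{i=1}^{n}\left(\alpha_i - \alpha_i^{*}\right) K(x_i, x) + b
```

Here K(x_i, x) = \phi(x_i)^{\top}\phi(x) is the kernel function, and only training points with \alpha_i - \alpha_i^{*} \neq 0 (the support vectors) contribute to the forecast.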

The observed values of the raw series were compared with the predicted values obtained through the four methods to determine the efficacy of the four forecasting methods used in the present study. The mean absolute error (MAE), mean absolute percentage error (MAPE), and root mean square error (RMSE) were selected as the evaluation measures because they are widely used empirical measures of bias and accuracy when combining and selecting forecasts

These measures were calculated using
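The three measures follow their standard definitions, which can be sketched as below. MAPE is returned as a fraction rather than a percentage, which appears consistent with the magnitudes reported in the results table; that reading is an assumption, as are the function names.

```python
from math import sqrt

def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mape(observed, predicted):
    """Mean absolute percentage error as a fraction; observed values must be nonzero."""
    return sum(abs((o - p) / o) for o, p in zip(observed, predicted)) / len(observed)

def rmse(observed, predicted):
    """Root mean square error."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

# Toy values, not surveillance data.
obs = [2.0, 4.0, 5.0]
pred = [1.0, 5.0, 5.0]
scores = (mae(obs, pred), mape(obs, pred), rmse(obs, pred))
```

Because MAPE divides by the observed value, it is scale-free but becomes unstable when incidence is close to zero.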

To take into account the variability of MAE, MAPE and RMSE, the block bootstrap technique

where
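One way to obtain a standard error for an error measure is a moving-block bootstrap, sketched below. The text does not state which block bootstrap variant, block length or number of replicates was used, so `block_len`, `n_boot`, `seed` and the function name are all assumptions made for illustration.

```python
import random
from math import sqrt

def moving_block_bootstrap_se(errors, statistic, block_len=3, n_boot=500, seed=0):
    """Standard error of `statistic` over moving-block resamples of `errors`."""
    rng = random.Random(seed)
    n = len(errors)
    starts = range(n - block_len + 1)   # every admissible block start
    n_blocks = -(-n // block_len)       # ceiling division
    stats = []
    for _ in range(n_boot):
        sample = []
        for _ in range(n_blocks):
            s = rng.choice(starts)
            sample.extend(errors[s:s + block_len])
        stats.append(statistic(sample[:n]))  # trim to the original length
    m = sum(stats) / n_boot
    return sqrt(sum((v - m) ** 2 for v in stats) / (n_boot - 1))

def mean_abs(values):
    """MAE computed from a list of forecast errors."""
    return sum(abs(v) for v in values) / len(values)

# Toy forecast errors for 12 months (not study data).
toy_errors = [0.02, -0.01, 0.03, 0.00, -0.02, 0.01,
              0.04, -0.03, 0.02, 0.01, -0.01, 0.00]
se_mae = moving_block_bootstrap_se(toy_errors, mean_abs)
```

Resampling contiguous blocks rather than single points keeps short-range dependence between neighboring months inside each block.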

Seasonal indices of different types of infectious diseases were extracted from the original time series, which are listed in

Disease | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |

Brucellosis | 0.34 | 0.40 | 1.01 | 1.41 | 1.57 | 1.78 | 1.65 | 1.34 | 0.83 | 0.54 | 0.58 | 0.56 |

Gonorrhea | 0.95 | 0.77 | 0.99 | 1.00 | 1.03 | 1.07 | 1.07 | 1.09 | 1.02 | 0.97 | 1.01 | 1.01 |

Hemorrhagic Fever | 1.03 | 0.70 | 0.86 | 0.93 | 1.06 | 1.04 | 0.79 | 0.48 | 0.43 | 0.92 | 2.01 | 1.76 |

Hepatitis A | 0.83 | 0.73 | 1.06 | 1.03 | 1.03 | 1.04 | 1.08 | 1.16 | 1.09 | 1.01 | 0.99 | 0.97 |

Hepatitis B | 0.92 | 0.84 | 1.12 | 1.05 | 1.00 | 1.00 | 1.07 | 1.09 | 0.97 | 0.95 | 1.00 | 0.98 |

Scarlet Fever | 0.80 | 0.33 | 0.69 | 1.15 | 1.65 | 1.71 | 0.90 | 0.45 | 0.56 | 0.84 | 1.34 | 1.59 |

Schistosomiasis | 0.48 | 0.46 | 0.78 | 0.88 | 0.95 | 1.17 | 1.63 | 1.61 | 1.19 | 1.16 | 0.88 | 0.82 |

Syphilis | 0.76 | 0.68 | 1.01 | 0.99 | 1.03 | 1.09 | 1.13 | 1.13 | 1.09 | 1.01 | 1.03 | 1.05 |

Typhoid fever | 0.56 | 0.49 | 0.71 | 0.82 | 1.05 | 1.20 | 1.37 | 1.52 | 1.33 | 1.13 | 0.96 | 0.88 |

After the extraction of seasonal indices, linear regression models were fitted to the deseasonalized incidence time series. The form of the regression model is:

Deseasonalized value at time t = Constant + Coefficient × t

The parameters of the established models are listed in the table below.

Disease | Constant | Coefficient | R^{2} | |

Brucellosis | 0.0931 | 0.0022 | 0.7404 | 0.0011 |

Gonorrhea | 1.1952 | −0.0077 | 0.9220 | 0.0030 |

Hemorrhagic Fever | 0.1218 | −0.0010 | 0.5295 | 0.0005 |

Hepatitis A | 0.5496 | −0.0044 | 0.7991 | 0.0030 |

Hepatitis B | 7.9058 | 0.0012 | 0.0019 | 0.4313 |

Scarlet Fever | 0.1457 | 0.0013 | 0.1340 | 0.0067 |

Schistosomiasis | 0.0193 | 0.0001 | 0.1975 | 0.0001 |

Syphilis | 0.6958 | 0.0242 | 0.9464 | 0.0201 |

Typhoid fever | 0.2121 | −0.0019 | 0.8049 | 0.0005 |
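As a worked example of how a regression-based forecast could be assembled from the two tables, the sketch below multiplies the fitted trend value by the seasonal index of the target month. The time origin (t = 1 for January 2005, so January 2012 is t = 85) and the multiplicative recombination are assumptions; the function name is hypothetical.

```python
def regression_forecast(constant, coefficient, t, seasonal_index):
    """Trend value at time t, re-seasonalized by the month's index."""
    deseasonalized = constant + coefficient * t
    return deseasonalized * seasonal_index

# Brucellosis row of the parameter table and its January seasonal index.
# Assumption: t = 1 is January 2005, so January 2012 is t = 85.
jan_2012 = regression_forecast(0.0931, 0.0022, t=85, seasonal_index=0.34)
```

With these inputs the trend value is 0.0931 + 0.0022 × 85 = 0.2801, and multiplying by the index 0.34 gives about 0.095 per 100,000.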

We also used exponential smoothing to extract the long-term trend after the extraction of seasonal indices. Smoothing factors from 0.1 to 0.9 were tested in steps of 0.1, and the factor giving the minimum MSE in the modeling process was selected.
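The grid search over smoothing factors can be sketched as follows. Simple (single) exponential smoothing with the level initialized to the first observation is assumed here; the text does not specify the smoothing variant or the initialization, and the function names are hypothetical.

```python
def ses_fitted(series, alpha):
    """One-step-ahead simple exponential smoothing forecasts.

    fitted[t] is the forecast of series[t] made from data up to t - 1.
    Assumption: the level is initialized to the first observation.
    """
    level = series[0]
    fitted = [level]
    for x in series[:-1]:
        level = alpha * x + (1 - alpha) * level
        fitted.append(level)
    return fitted

def mse(observed, predicted):
    """Mean square error."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)

def best_alpha(series, alphas=None):
    """Return the smoothing factor with the smallest in-sample MSE."""
    alphas = alphas or [round(0.1 * k, 1) for k in range(1, 10)]  # 0.1 .. 0.9
    return min(alphas, key=lambda a: mse(series, ses_fitted(series, a)))

# Toy deseasonalized series with a steady upward trend (not study data).
toy = [float(v) for v in range(1, 11)]
chosen = best_alpha(toy)
```

On a steadily trending series the largest factor wins because it tracks recent values most closely; on a noisy, flat series a smaller factor would usually be selected.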

ARIMA models were fitted to the nine types of infectious diseases from 2005 to 2011 and tested by predicting the incidence for the year 2012. Different ARIMA models were tested to determine the best fitting models.

Disease | Identification | AIC | SBC |

Brucellosis | ARIMA(0,0,0)×(0,1,1) | −282.39 | −280.13 |

Gonorrhea | ARIMA(0,0,1)×(0,1,0) | −152.05 | −149.79 |

Gonorrhea | ARIMA(0,0,1)×(0,1,0) | −165.46 | −163.08 |

Gonorrhea | ARIMA(1,0,0)×(0,1,1) | −160.70 | −156.18 |

Hemorrhagic Fever | ARIMA(1,0,0)×(0,1,0) | −354.63 | −352.32 |

Hemorrhagic Fever | ARIMA(0,0,1)×(0,1,0) | −358.07 | −355.80 |

Hepatitis A | ARIMA(1,0,1)×(0,1,0) | −227.38 | −222.85 |

Hepatitis A | ARIMA(0,0,0)×(1,1,0) | −237.12 | −234.85 |

Hepatitis A | ARIMA(1,0,1)×(0,1,1) | −241.85 | −235.06 |

Hepatitis B | ARIMA(1,0,0)×(0,1,0) | 168.09 | 170.35 |

Hepatitis B | ARIMA(0,0,1)×(0,1,0) | 160.53 | 162.79 |

Hepatitis B | ARIMA(1,0,1)×(0,1,0) | 157.30 | 161.82 |

Hepatitis B | ARIMA(2,0,0)×(0,1,0) | 161.89 | 166.41 |

Hepatitis B | ARIMA(3,0,0)×(0,1,0) | 157.38 | 164.17 |

Hepatitis B | ARIMA(0,0,2)×(0,1,0) | 155.18 | 159.70 |

Hepatitis B | ARIMA(1,0,0)×(1,1,0) | 151.74 | 156.27 |

Scarlet Fever | ARIMA(1,0,0)×(0,1,0) | −169.68 | −167.41 |

Scarlet Fever | ARIMA(0,0,1)×(0,1,0) | −172.22 | −169.96 |

Scarlet Fever | ARIMA(0,0,1)×(0,1,1) | −190.10 | −185.57 |

Scarlet Fever | ARIMA(1,0,0)×(1,1,0) | −173.30 | −168.77 |

Scarlet Fever | ARIMA(0,0,1)×(1,1,0) | −176.39 | −171.87 |

Scarlet Fever | ARIMA(2,0,0)×(0,1,0) | −179.34 | −174.81 |

Schistosomiasis | ARIMA(1,0,0)×(0,1,0) | −517.69 | −515.43 |

Schistosomiasis | ARIMA(1,0,1)×(0,1,0) | −524.15 | −519.62 |

Schistosomiasis | ARIMA(1,0,0)×(0,1,1) | −520.67 | −516.15 |

Syphilis | ARIMA(1,0,0)×(0,1,0) | −55.74 | −53.48 |

Syphilis | ARIMA(0,0,1)×(0,1,0) | −67.99 | −65.74 |

Syphilis | ARIMA(1,0,1)×(0,1,0) | −72.93 | −68.41 |

Syphilis | ARIMA(1,0,0)×(0,1,1) | −74.20 | −69.67 |

Syphilis | ARIMA(0,0,1)×(1,1,0) | −76.85 | −72.33 |

Syphilis | ARIMA(1,0,1)×(1,1,0) | −81.47 | −74.69 |

Syphilis | ARIMA(2,0,0)×(0,1,0) | −60.71 | −56.19 |

Syphilis | ARIMA(3,0,0)×(0,1,0) | −68.56 | −61.77 |

Syphilis | ARIMA(0,0,2)×(0,1,0) | −72.07 | −67.54 |

Syphilis | ARIMA(2,0,0)×(1,1,0) | −74.93 | −68.14 |

Syphilis | ARIMA(2,0,0)×(0,1,1) | −79.18 | −72.39 |

Typhoid fever | ARIMA(0,0,1)×(0,1,0) | −369.44 | −367.17 |
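When several candidate models are available, a common convention is to prefer the one with the lowest information criterion. The sketch below applies that convention to the hepatitis B candidates from the table; the text does not state the selection rule explicitly, so the rule and the helper name `pick` are assumptions.

```python
# (AIC, SBC) for the hepatitis B candidates, copied from the table above.
hepatitis_b = {
    "ARIMA(1,0,0)×(0,1,0)": (168.09, 170.35),
    "ARIMA(0,0,1)×(0,1,0)": (160.53, 162.79),
    "ARIMA(1,0,1)×(0,1,0)": (157.30, 161.82),
    "ARIMA(2,0,0)×(0,1,0)": (161.89, 166.41),
    "ARIMA(3,0,0)×(0,1,0)": (157.38, 164.17),
    "ARIMA(0,0,2)×(0,1,0)": (155.18, 159.70),
    "ARIMA(1,0,0)×(1,1,0)": (151.74, 156.27),
}

def pick(candidates, criterion="AIC"):
    """Return the candidate name with the smallest value of the criterion."""
    col = {"AIC": 0, "SBC": 1}[criterion]
    return min(candidates, key=lambda name: candidates[name][col])

by_aic = pick(hepatitis_b, "AIC")
by_sbc = pick(hepatitis_b, "SBC")
```

For hepatitis B both criteria point to the same candidate; when they disagree, SBC favors the model with fewer parameters.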

The training number of the SVM based time series model needed to be determined. In previous studies _{t}

The input matrix is fed to the SVM for training, with the corresponding output matrix as the training target. Once the parameters are determined, the trained model is used to forecast the incidence in 2012 iteratively.
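The construction of the input and output matrices and the iterative forecast can be sketched as below. The window length `n_lags`, the function names, and the `mean_predictor` stand-in (used in place of a trained SVM so the example stays self-contained) are all assumptions made for illustration.

```python
def make_lagged(series, n_lags):
    """Each input row holds n_lags consecutive values; the target is the next value."""
    rows = range(len(series) - n_lags)
    X = [series[i:i + n_lags] for i in rows]
    y = [series[i + n_lags] for i in rows]
    return X, y

def iterative_forecast(history, predict, n_lags, steps):
    """Forecast `steps` values, feeding each forecast back in as an input."""
    window = list(history[-n_lags:])
    out = []
    for _ in range(steps):
        nxt = predict(window)
        out.append(nxt)
        window = window[1:] + [nxt]  # slide the window forward
    return out

def mean_predictor(window):
    """Stand-in for a trained regressor; NOT a support vector machine."""
    return sum(window) / len(window)

X, y = make_lagged([1, 2, 3, 4, 5], n_lags=2)
forecasts = iterative_forecast([1.0, 3.0], mean_predictor, n_lags=2, steps=2)
```

Because later forecasts depend on earlier ones, errors can accumulate over a 12-month horizon.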

Several parameters needed to be determined. Candidate values ranged from 2^{−10} to 2^{10} in 2 increments. There is no structured way to determine the optimal parameters of SVMs. In the present study, cross validation was applied to determine the proper SVMs. The training samples were randomly divided into
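A parameter grid and random cross-validation folds of this kind might be generated as follows. The sketch reads the grid as powers of two with the exponent stepped by 2, and uses five folds; both choices, and the function names, are assumptions because the text does not make the step or the number of folds explicit.

```python
import random

def power_of_two_grid(low_exp=-10, high_exp=10, step=2):
    """Candidate values 2**low_exp, ..., 2**high_exp (exponent step is assumed)."""
    return [2.0 ** e for e in range(low_exp, high_exp + 1, step)]

def k_fold_indices(n_samples, k, seed=0):
    """Randomly split sample indices into k (train, validation) pairs."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(f)), sorted(f)) for f in folds]

grid = power_of_two_grid()
splits = k_fold_indices(n_samples=10, k=5)
```

Each candidate parameter combination would then be scored by its average validation error across the folds, and the best-scoring combination retained.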

Disease | Method | Modeling MAE | SE (MAE) | Modeling MAPE | SE (MAPE) | Modeling RMSE | SE (RMSE) | Prediction MAE | SE (MAE) | Prediction MAPE | SE (MAPE) | Prediction RMSE | SE (RMSE) |

Brucellosis | Regression | 0.0240 | 0.0075 | 0.1345 | 0.0873 | 0.0313 | 0.0016 | 0.0520 | 0.0206 | 0.1970 | 0.3565 | 0.0642 | 0.0057 |

Brucellosis | Exponential Smoothing | 0.0183 | 0.0080 | 0.1084 | 0.0994 | 0.0261 | 0.0018 | 0.0414 | 0.0219 | 0.1546 | 0.3367 | 0.0540 | 0.0060 |

Brucellosis | ARIMA | 0.0247 | 0.0135 | 0.1505 | 0.1210 | 0.0327 | 0.0147 | 0.0285 | 0.0369 | 0.1464 | 0.2396 | 0.0341 | 0.0403 |

Brucellosis | SVM | 0.0045 | 0.0161 | 0.0402 | 0.1269 | 0.0077 | 0.0033 | 0.0355 | 0.0161 | 0.1667 | 0.3271 | 0.0428 | 0.0051 |

Gonorrhea | Regression | 0.0356 | 0.0139 | 0.0481 | 0.0182 | 0.0460 | 0.0056 | 0.0898 | 0.0380 | 0.1510 | 0.0578 | 0.1026 | 0.0190 |

Gonorrhea | Exponential Smoothing | 0.0321 | 0.0139 | 0.0431 | 0.0193 | 0.0457 | 0.0061 | 0.0700 | 0.0401 | 0.1273 | 0.0593 | 0.0835 | 0.0199 |

Gonorrhea | ARIMA | 0.0446 | 0.0098 | 0.0570 | 0.0136 | 0.0718 | 0.0128 | 0.0345 | 0.0193 | 0.0659 | 0.0402 | 0.0515 | 0.0238 |

Gonorrhea | SVM | 0.0334 | 0.0397 | 0.0485 | 0.0468 | 0.0570 | 0.0098 | 0.0281 | 0.0547 | 0.0542 | 0.0562 | 0.0436 | 0.0250 |

Hemorrhagic Fever | Regression | 0.0180 | 0.0041 | 0.2580 | 0.0602 | 0.0245 | 0.0004 | 0.0524 | 0.0104 | 0.5528 | 0.2075 | 0.0700 | 0.0014 |

Hemorrhagic Fever | Exponential Smoothing | 0.0081 | 0.0043 | 0.1145 | 0.0683 | 0.0110 | 0.0005 | 0.0170 | 0.0105 | 0.1822 | 0.2084 | 0.0240 | 0.0015 |

Hemorrhagic Fever | ARIMA | 0.0119 | 0.0039 | 0.1628 | 0.0682 | 0.0184 | 0.0050 | 0.0129 | 0.0135 | 0.1246 | 0.2605 | 0.0200 | 0.0188 |

Hemorrhagic Fever | SVM | 0.0052 | 0.0049 | 0.0689 | 0.0257 | 0.0105 | 0.0007 | 0.0189 | 0.0148 | 0.1758 | 0.1100 | 0.0285 | 0.0024 |

Hepatitis A | Regression | 0.0382 | 0.0332 | 0.1111 | 0.0233 | 0.0539 | 0.0019 | 0.0141 | 0.0071 | 0.0898 | 0.1284 | 0.0176 | 0.0079 |

Hepatitis A | Exponential Smoothing | 0.0209 | 0.0083 | 0.0637 | 0.0378 | 0.0286 | 0.0021 | 0.0482 | 0.0236 | 0.3110 | 0.1283 | 0.0501 | 0.0074 |

Hepatitis A | ARIMA | 0.0296 | 0.0101 | 0.0910 | 0.0336 | 0.0435 | 0.0115 | 0.0294 | 0.0090 | 0.1854 | 0.0631 | 0.0319 | 0.0096 |

Hepatitis A | SVM | 0.0313 | 0.0390 | 0.0941 | 0.1278 | 0.0432 | 0.0074 | 0.0132 | 0.0218 | 0.0887 | 0.1352 | 0.0158 | 0.0055 |

Hepatitis B | Regression | 0.4468 | 0.0089 | 0.0553 | 0.1670 | 0.5660 | 0.1179 | 0.7544 | 0.3892 | 0.0966 | 0.0261 | 0.9321 | 0.0584 |

Hepatitis B | Exponential Smoothing | 0.3033 | 0.0622 | 0.0384 | 0.0093 | 0.4548 | 0.1309 | 0.6530 | 0.1655 | 0.0817 | 0.0257 | 0.8938 | 0.3839 |

Hepatitis B | ARIMA | 0.3922 | 0.0758 | 0.0498 | 0.0099 | 0.6070 | 0.0925 | 0.6425 | 0.1559 | 0.0813 | 0.0186 | 0.8714 | 0.1943 |

Hepatitis B | SVM | 0.4529 | 0.2727 | 0.0583 | 0.0363 | 0.6238 | 0.3054 | 0.7206 | 0.1986 | 0.0942 | 0.0312 | 0.8379 | 0.3997 |

Scarlet Fever | Regression | 0.0718 | 0.0125 | 0.4192 | 0.0785 | 0.1066 | 0.0045 | 0.0514 | 0.0292 | 0.1832 | 0.2896 | 0.0623 | 0.0142 |

Scarlet Fever | Exponential Smoothing | 0.0239 | 0.0131 | 0.1321 | 0.0912 | 0.0365 | 0.0049 | 0.1650 | 0.0323 | 0.5909 | 0.2896 | 0.1924 | 0.0151 |

Scarlet Fever | ARIMA | 0.0416 | 0.0090 | 0.2614 | 0.0738 | 0.0628 | 0.0112 | 0.3888 | 0.0821 | 1.7556 | 0.5018 | 0.3933 | 0.0861 |

Scarlet Fever | SVM | 0.0206 | 0.0228 | 0.1214 | 0.1452 | 0.0352 | 0.0061 | 0.0712 | 0.0219 | 0.3278 | 0.2373 | 0.0847 | 0.0110 |

Schistosomiasis | Regression | 0.0032 | 0.0065 | 0.1521 | 0.3112 | 0.0045 | 0.0001 | 0.0095 | 0.0094 | 0.2997 | 0.2923 | 0.0114 | 0.0001 |

Schistosomiasis | Exponential Smoothing | 0.0031 | 0.0007 | 0.1440 | 0.0574 | 0.0039 | 0.0000 | 0.0092 | 0.0019 | 0.2882 | 0.1981 | 0.0112 | 0.0000 |

Schistosomiasis | ARIMA | 0.0048 | 0.0008 | 0.2312 | 0.0499 | 0.0063 | 0.0010 | 0.0088 | 0.0025 | 0.2707 | 0.0692 | 0.0118 | 0.0031 |

Schistosomiasis | SVM | 0.0027 | 0.0007 | 0.1242 | 0.0571 | 0.0045 | 0.0000 | 0.0083 | 0.0019 | 0.2490 | 0.1978 | 0.0114 | 0.0000 |

Syphilis | Regression | 0.0987 | 0.0392 | 0.0538 | 0.0398 | 0.1319 | 0.0472 | 0.3557 | 0.1178 | 0.1355 | 0.1446 | 0.4120 | 0.1819 |

Syphilis | Exponential Smoothing | 0.0888 | 0.0417 | 0.0501 | 0.0426 | 0.1356 | 0.0544 | 0.2021 | 0.1158 | 0.0741 | 0.1415 | 0.3165 | 0.1762 |

Syphilis | ARIMA | 0.0999 | 0.0200 | 0.0593 | 0.0126 | 0.1286 | 0.0279 | 0.2722 | 0.0886 | 0.1090 | 0.0370 | 0.3110 | 0.1006 |

Syphilis | SVM | 0.0740 | 0.0098 | 0.0477 | 0.0053 | 0.1378 | 0.0029 | 0.2038 | 0.1864 | 0.0828 | 0.1922 | 0.2414 | 0.2675 |

Typhoid Fever | Regression | 0.0145 | 0.0053 | 0.1466 | 0.0508 | 0.0179 | 0.0007 | 0.0389 | 0.0133 | 0.5074 | 0.1687 | 0.0397 | 0.0024 |

Typhoid Fever | Exponential Smoothing | 0.0081 | 0.0051 | 0.0813 | 0.0567 | 0.0105 | 0.0007 | 0.0080 | 0.0130 | 0.1086 | 0.1721 | 0.0096 | 0.0024 |

Typhoid Fever | ARIMA | 0.0133 | 0.0038 | 0.1319 | 0.0429 | 0.0176 | 0.0057 | 0.0121 | 0.0077 | 0.1766 | 0.1070 | 0.0152 | 0.0089 |

Typhoid Fever | SVM | 0.0087 | 0.0040 | 0.0797 | 0.0143 | 0.0122 | 0.0010 | 0.0111 | 0.0130 | 0.1435 | 0.0878 | 0.0130 | 0.0032 |

MAPE is a relative index among the three evaluation indices. We used MAPE to evaluate the general performance of the models in forecasting each disease. The MAPEs for each model obtained for each disease in both the modeling process and the prediction process are shown in

To compare the performance of the different models across diseases, a different evaluation index was emphasized for each incidence level. MAPE was emphasized for diseases with low incidence (annual mean incidence <0.1/100,000), such as schistosomiasis (0.0245/100,000) and hemorrhagic fever (0.0814/100,000). RMSE was emphasized for diseases with high incidence (mean incidence >1/100,000), such as hepatitis B (7.9335/100,000) and syphilis (1.8461/100,000). MAE was emphasized for diseases with medium incidence (0.1/100,000< mean incidence <1/100,000), including hepatitis A (0.3356/100,000), gonorrhea (0.8329/100,000), scarlet fever (0.2131/100,000), typhoid fever (0.1242/100,000) and brucellosis (0.1975/100,000).

Based on the emphasized index in the prediction process, the four methods ranked from best to worst as follows. For gonorrhea, hepatitis B, schistosomiasis and syphilis: SVM, ARIMA, exponential smoothing and regression. For hepatitis A: SVM, regression, ARIMA and exponential smoothing. For brucellosis and hemorrhagic fever: ARIMA, SVM, exponential smoothing and regression. For scarlet fever: regression, SVM, exponential smoothing and ARIMA. For typhoid fever: exponential smoothing, SVM, ARIMA and regression.

SVMs performed best in forecasting gonorrhea, hepatitis A, hepatitis B, schistosomiasis and syphilis. ARIMA performed best in forecasting brucellosis and hemorrhagic fever and worst in forecasting scarlet fever. Exponential smoothing performed best in forecasting typhoid fever but worst in forecasting hepatitis A. The regression method performed best in forecasting scarlet fever but worst in forecasting brucellosis, gonorrhea, hemorrhagic fever, hepatitis B, schistosomiasis, syphilis and typhoid fever. The exponential smoothing method performed better than the regression decomposition method except in the cases of hepatitis A and scarlet fever.
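The rule for choosing the emphasized index, and the resulting ranking, can be written down compactly. The sketch below uses the thresholds stated above and the gonorrhea prediction MAE values from the results table; the function names are hypothetical, and assigning boundary values (exactly 0.1 or 1) to MAE is an assumption because the text leaves them unspecified.

```python
def emphasized_metric(mean_incidence):
    """Pick the index to emphasize from the mean incidence per 100,000."""
    if mean_incidence < 0.1:
        return "MAPE"
    if mean_incidence > 1.0:
        return "RMSE"
    return "MAE"  # medium incidence; boundary values fall here by assumption

def rank_methods(scores):
    """Order methods from best (smallest error) to worst."""
    return sorted(scores, key=scores.get)

# Gonorrhea: mean incidence 0.8329/100,000, prediction MAE from the results table.
gonorrhea_mae = {
    "Regression": 0.0898,
    "Exponential Smoothing": 0.0700,
    "ARIMA": 0.0345,
    "SVM": 0.0281,
}
metric = emphasized_metric(0.8329)
ranking = rank_methods(gonorrhea_mae)
```

For gonorrhea this selects MAE and reproduces the ordering SVM, ARIMA, exponential smoothing, regression.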

The early recognition of epidemic behavior is important for epidemic disease control and prevention. Statistical models have proved useful in forecasting future epidemic disease incidence

In principle, the decomposition method breaks the original series down into different parts. The seasonal factor can be expressed in the form of seasonal indices. The series that remains after the seasonal pattern is removed can be modeled with regression, exponential smoothing or similar methods. Time series decomposition models do not involve a lot of mathematics or statistics, so they are relatively easy to explain to the end user. The ARIMA model captures historical information through (1) the AR part, which considers past values, and (2) the MA part, which considers the current and previous residual series. The ARIMA model is popular because of its known statistical properties and the well-known Box–Jenkins methodology for model building. It is one of the most effective linear models for seasonal time series forecasting. In contrast, SVM time series models capture historical information through nonlinear functions. With its flexible nonlinear mapping capability, a support vector machine can approximate any continuous measurable function with arbitrarily desired accuracy.

In practical matters, the building of the decomposition methods generally involves two parts: (1) extraction of the seasonal indices to express the seasonal pattern hidden in the infectious disease time series, and (2) regression methods to model the long trend pattern. The building of the ARIMA model requires the determination of differencing orders (

Based on the three forecasting error measures (MAE, MAPE, MSE) and the visualization of the forecasted values, the empirical evidence is that no single method completely dominated the others. However, the present study shows that the support vector machine generally outperforms the conventional ARIMA model and the decomposition methods. The ARIMA model has proved to be an effective linear model for capturing a linear trend in the infectious disease series. The decomposition methods generally perform better when the series conform to the decomposition hypothesis. The linear regression hypothesis seems to be more rigid than exponential smoothing when applied to the deseasonalized series.

The advantage of decomposition is that decomposition models do not involve a lot of mathematics or statistics, so they are relatively easy to explain to the end user. This is a major advantage because an end user who understands how the forecast was developed may have more confidence in using it for decision making. The disadvantage of decomposition methods is that the hypothesis may be too strong for the epidemic behavior, so that the model may sometimes perform poorly.

The ARIMA model has advantages in its well-known statistical properties and effective modeling process. It can be easily implemented in mainstream statistical software. The model can be used when the seasonal time series are stationary and have no missing data. The disadvantage of the ARIMA model is that it can only extract linear relationships within the time series data. It may not work well for an infectious disease whose occurrence is affected by various factors, including many meteorological and social factors; that is, the occurrence of the disease is not necessarily linearly related to the historical data. Our study suggested that nonlinear relationships may exist among the monthly incidences of many diseases such as scarlet fever, so that the ARIMA model did not efficiently extract the full relationship hidden in the historical data.

Support vector machines are potentially useful epidemic time series forecasting methods because of their strong nonlinear mapping ability and tolerance to complexity in forecasting data. SVMs have very good learning ability in time series modeling. SVMs have unique advantages compared with other machine learning methods, such as neural networks. For example, SVMs implement the structural risk minimization principle, which leads to better generalization than neural networks that implement the empirical risk minimization principle. SVMs also have fewer free parameters than neural networks

What is more, the scarlet fever incidence shown in

The limitations of the study should also be acknowledged. First, only eight years of incidence data were obtained because the Chinese National Surveillance System for Infectious Disease was established only in 2004. The relatively short length of the series may influence the forecasting efficacy of the different methods. Second, we predicted the infectious disease incidence with only four typical forecasting methods. The findings based on a specific disease may not be repeatable when applied to other cases. Moreover, there are other hypotheses on the long-term trend in decomposition methods, such as generalized models that assume a nonlinear function of time. Many other models, such as GARCH, were developed to make up for the deficiencies of ARIMA, and SVM is only one of the typical machine learning techniques. In this paper, we chose only four commonly used time series methods for comparison.

Infectious diseases pose a significant threat to human health. The establishment of epidemiological surveillance systems greatly facilitates the implementation of strategic health planning, such as planning vaccination costs and stocks. More research on the accurate prediction of epidemiological events based on surveillance data should be conducted, and more sophisticated forecasting techniques should be applied and compared in practice.