Applications and Comparisons of Four Time Series Models in Epidemiological Surveillance Data

  • Xingyu Zhang ,

    Contributed equally to this work with: Xingyu Zhang, Tao Zhang

    Affiliations: Department of Medical Statistics, West China School of Public Health, Sichuan University, Chengdu, Sichuan, P.R. China, Department of Anatomy with Radiology, University of Auckland, Auckland, New Zealand

  • Tao Zhang ,

    Contributed equally to this work with: Xingyu Zhang, Tao Zhang

    Affiliation: Department of Medical Statistics, West China School of Public Health, Sichuan University, Chengdu, Sichuan, P.R. China

  • Alistair A. Young,

    Affiliation: Department of Anatomy with Radiology, University of Auckland, Auckland, New Zealand

  • Xiaosong Li

    lixiaosong1101@126.com

    Affiliation: Department of Medical Statistics, West China School of Public Health, Sichuan University, Chengdu, Sichuan, P.R. China

  • Published: February 5, 2014
  • DOI: 10.1371/journal.pone.0088075

Correction

28 Feb 2014: The PLOS ONE Staff (2014) Correction: Applications and Comparisons of Four Time Series Models in Epidemiological Surveillance Data. PLoS ONE 9(2): e91629. doi: 10.1371/journal.pone.0091629

Abstract

Public health surveillance systems provide valuable data for reliable prediction of future epidemic events. This paper describes a study that used nine types of infectious disease data collected through a national public health surveillance system in mainland China to evaluate and compare the performances of four time series methods, namely, two decomposition methods (regression and exponential smoothing), autoregressive integrated moving average (ARIMA) and support vector machine (SVM). The data obtained from 2005 to 2011 and in 2012 were used as modeling and forecasting samples, respectively. The performances were evaluated based on three metrics: mean absolute error (MAE), mean absolute percentage error (MAPE), and mean square error (MSE). The accuracy of the statistical models in forecasting future epidemic disease proved their effectiveness in epidemiological surveillance. Although the comparisons found that no single method is completely superior to the others, the present study highlighted that the SVM outperforms the ARIMA model and the decomposition methods in most cases.

Introduction

Public health surveillance is an important way to continuously collect, analyze, interpret and disseminate health data essential to prevention and control [1]. Public health surveillance systems are designed to facilitate the detection of abnormal behavior of infectious diseases and other adverse health events. To achieve this goal, different statistical methods have been used to forecast infectious disease incidence. Time series models have long been of interest in the literature. The time series models try to predict epidemiological behaviors by modeling historical surveillance data. Many researchers have applied different time series models to forecasting epidemic incidence in previous studies. Exponential smoothing [2] and generalized regression [3] methods were used to forecast in-hospital infection and incidence of cryptosporidiosis respectively. Decomposition methods [4] and multilevel time series models [5] were used to forecast respiratory syncytial virus. Autoregressive integrated moving average (ARIMA) models have been widely used for epidemic time series forecasting including the hemorrhagic fever with renal syndrome [6], [7], dengue fever [8], [9], and tuberculosis [10]. Models based on artificial neural networks were also used to predict the incidence of hepatitis A [11], [12] and typhoid fever [13].

The decomposition methods are among the most traditional methods in time series analysis [14], [15]. These methods try to break down the original series into a long-term trend pattern, a seasonal pattern and residuals. Seasonal indices are extracted to express the seasonal pattern, a regression model is established to express the long-term trend pattern, and the residuals are ignored. Because the decomposition time series methods do not involve a lot of mathematics or statistics, they are relatively easy to explain to the end user. This is a major advantage because if the end user understands how the forecast was developed, he or she may have more confidence in its use for decision making.

The ARIMA models are among the most widely used methods [16], [17]. They are generally derived from three basic time series models: (1) autoregressive (AR), (2) moving average (MA), and (3) autoregressive moving average (ARMA). In the AR model, the current value of the time series is a linear function of its previous values and random noise, whereas in the MA model the current value is a linear function of the current and previous residuals. The ARMA model combines AR and MA and so considers both the historical values and the residuals. The AR, MA, and ARMA models require the time series to be a stationary process, meaning that the mean and the covariance of the series do not change with time. A non-stationary time series must first be transformed into a stationary one. The ARIMA model fits the time series data using the ARMA model together with a differencing process, which effectively transforms non-stationary data into stationary data.

In recent years, machine learning based time series models such as artificial neural networks have been successfully applied for modeling infectious disease incidence time series [18]. Support vector machines (SVMs) are a new type of machine learning method based on statistical learning theory [19]. They could lead to greater potential and better performance in practical applications. This is due to the structural risk minimization principle employed in SVMs, which has greater generalization ability and is superior to the empirical risk minimization principle adopted by traditional neural networks. SVMs have been successfully applied to different time series prediction problems, such as forecasting production value in the machinery industry [20], predicting engine reliability [21] and economic time series prediction [22], [23]. The successful use of support vector machines in time series prediction motivated our use of support vector machines for epidemic time series forecasting.

The objectives of the present paper are to compare four typical time series methods, namely, two decomposition methods (regression and exponential smoothing), ARIMA model and SVMs in theory and practice as well as their real forecasting efficacy in epidemic time series. This comparison may be helpful for the epidemiologist to choose the most suitable methodology in a given situation.

Materials and Methods

Materials

We gathered the available monthly incidence time series of nine typical infectious diseases reported by the Chinese Center for Disease Prevention and Control (CDC). The data were collected from the Chinese National Surveillance System, established in 2004. The incidence time series of brucellosis, gonorrhea, hemorrhagic fever with renal syndrome (HFRS), hepatitis A (HA), hepatitis B (HB), scarlet fever, schistosomiasis, syphilis and typhoid fever from 2005 to 2012 were collected.

Methods

Decomposition methods.

The decomposition methods try to extract the underlying pattern in the data series from randomness. The underlying pattern then can be employed to predict future trends and make forecasts. The underlying pattern can also be broken down into sub patterns to identify the component factors that influence each of the values in a series. Two separate components of the basic underlying pattern that tend to characterize the infectious disease time series are usually identified in decomposition methods. They are the trend cycle and seasonal factors. The trend cycle represents long term changes, and the seasonal factor is the periodic fluctuations with constant length that is usually caused by known factors such as rainfall, month of the year, temperature, timing of the holidays, etc. The decomposition model assumes that the data has the following form:

Time series = Pattern + Error = Trend cycle + Seasonality + Error

The seasonality part of the time series is usually expressed with seasonal indices [24]. To arrive at the seasonal indices, the incidences of the entire training sample are averaged first, and then the mean incidence for each month is divided by this overall average. If the seasonal index is greater than 1, the incidence in that month is usually higher than the average level; otherwise, it is usually lower than the average level.

Once the seasonal indices are calculated, one can deseasonalize the data by dividing each value by the corresponding index.

Deseasonalized data = Raw data/Seasonal Index
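The seasonal index calculation and the deseasonalizing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the index for each month is taken as that month's mean incidence divided by the overall mean, the function names are hypothetical, and the demonstration series is invented (with a period of 4 rather than 12 to keep it short).

```python
from statistics import mean


def seasonal_indices(series, period=12):
    """Mean of each position in the cycle divided by the overall mean."""
    overall = mean(series)
    return [mean(series[m::period]) / overall for m in range(period)]


def deseasonalize(series, indices):
    """Divide every observation by the index of its position in the cycle."""
    period = len(indices)
    return [x / indices[t % period] for t, x in enumerate(series)]


# Invented series covering two cycles of length 4 (not surveillance data).
demo = [2.0, 4.0, 6.0, 4.0, 2.0, 4.0, 6.0, 4.0]
idx = seasonal_indices(demo, period=4)
flat = deseasonalize(demo, idx)
```

An index above 1 marks a position in the cycle with above-average incidence; after division by the indices, the purely seasonal demonstration series becomes flat.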

The long-term trend is estimated from the deseasonalized data. There are many ways to estimate the long-term trend, such as moving average, exponential smoothing, and linear regression. In simple moving average methods, the current value is calculated as the mean of its previous k values, whereas exponential smoothing assigns exponentially decreasing weights over time. When the time series x(t) begins at time t = 0, the simplest form of exponential smoothing is given by the formulae:

$$s(0) = x(0), \qquad s(t) = \alpha\, x(t) + (1 - \alpha)\, s(t-1), \quad t > 0,$$

where $\alpha$ is the smoothing factor ($0 < \alpha < 1$) and $s(t)$ is the output of the exponential smoothing algorithm.
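As a minimal sketch of simple exponential smoothing, assuming the standard recursion in which the first smoothed value equals the first observation, the calculation could look as follows. The function name and the three-point input are illustrative only.

```python
def exponential_smoothing(x, alpha):
    """Simple exponential smoothing: s(0) = x(0), s(t) = a*x(t) + (1-a)*s(t-1)."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    smoothed = [x[0]]
    for value in x[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


# Invented input: each output is halfway between the new value and the old level.
smoothed = exponential_smoothing([10.0, 20.0, 30.0], alpha=0.5)
```

A larger smoothing factor follows recent observations more closely, while a smaller one gives a smoother but slower-reacting trend.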

The linear regression method is another simple way to express the long term trend in which a common linear regression model is established between the incidence and time t.

ARIMA model.

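The ARIMA model originated from the AR model, the MA model, and their combination, the ARMA model [25]. AR models express the current value of the time series $X(t)$ linearly in terms of its previous values $X(t-1), X(t-2), \ldots$ and the current residual $\varepsilon(t)$, which can be expressed as:

$$X(t) = \phi_1 X(t-1) + \phi_2 X(t-2) + \cdots + \phi_p X(t-p) + \varepsilon(t) \tag{1}$$

MA models express the current value of the time series $X(t)$ linearly in terms of its current and previous residuals $\varepsilon(t), \varepsilon(t-1), \ldots$. The model can be expressed as:

$$X(t) = \varepsilon(t) - \theta_1 \varepsilon(t-1) - \theta_2 \varepsilon(t-2) - \cdots - \theta_q \varepsilon(t-q) \tag{2}$$

ARMA models are a combination of AR and MA models, in which the current value of the time series is expressed linearly in terms of its previous values as well as current and previous residuals. It can be expressed as:

$$X(t) = \phi_1 X(t-1) + \cdots + \phi_p X(t-p) + \varepsilon(t) - \theta_1 \varepsilon(t-1) - \cdots - \theta_q \varepsilon(t-q) \tag{3}$$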

The ARIMA model deals with non-stationary time series with differencing process based on the ARMA model. The differenced stationary time series can be modeled as ARMA model to yield ARIMA model.

The ARIMA model is usually written as $\mathrm{ARIMA}(p, d, q) \times (P, D, Q)_s$. In this expression, P is the seasonal autoregressive order, p the non-seasonal autoregressive order, Q the seasonal moving average order, q the non-seasonal moving average order, d the order of regular differencing and D the order of seasonal differencing. The subscript s indicates the length of the seasonal period. Because the incidence of infectious disease follows an annual cycle in monthly data, s = 12 in the present study.
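The roles of d, D and s can be illustrated with a short differencing sketch. It assumes the usual definitions (regular differencing subtracts the previous value, seasonal differencing subtracts the value s steps earlier); the function names and the demonstration series are hypothetical, and a period of 4 is used instead of 12 for brevity.

```python
def difference(series, lag=1):
    """Return x(t) - x(t - lag) for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


def arima_differencing(series, d=1, D=1, s=12):
    """Apply D seasonal differences of lag s, then d regular differences."""
    out = list(series)
    for _ in range(D):
        out = difference(out, lag=s)
    for _ in range(d):
        out = difference(out, lag=1)
    return out


# Invented series: a linear trend plus a repeating pattern of period 4.
pattern = [0.0, 3.0, 1.0, -2.0]
demo = [0.5 * t + pattern[t % 4] for t in range(16)]
stationary = arima_differencing(demo, d=1, D=1, s=4)
```

In this toy case the seasonal difference removes the repeating pattern and leaves a constant, and the regular difference then removes that constant, so the result is a series of zeros.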

The ARIMA modeling procedure consists of three iterative steps: identification, estimation, and diagnostic checking. Prior to fitting the ARIMA model, the series is usually differenced appropriately to make it stationary. Identification is the process of determining the seasonal and non-seasonal orders using the autocorrelation functions (ACF) and partial autocorrelation functions (PACF) of the transformed data [26]. The ACF is a statistical tool that measures whether earlier values in the series are related to later values. The PACF captures the amount of correlation between a variable and a lag of that variable that is not explained by correlations at all lower-order lags. After the identification step, the parameters of the ARIMA model(s) are estimated with the conditional least squares (CLS) method [27]. Finally, the adequacy of the established model for the series is verified by employing white noise tests [28] to check whether the residuals are independent and normally distributed. Several ARIMA models may be identified, in which case an optimum model must be selected. Such selection is usually based on the Akaike Information Criterion (AIC) and the Schwarz Bayesian Criterion (SBC) [29].
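To make the identification step more concrete, the sample ACF can be computed directly from its textbook definition. The sketch below is an assumption-laden illustration rather than the procedure used in the study: the function name is hypothetical, and the input is an invented wave of period 4.

```python
def sample_acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag (biased estimator)."""
    n = len(series)
    centre = sum(series) / n
    c0 = sum((x - centre) ** 2 for x in series)
    acf = []
    for k in range(max_lag + 1):
        ck = sum((series[t] - centre) * (series[t - k] - centre)
                 for t in range(k, n))
        acf.append(ck / c0)
    return acf


# Invented periodic series: six repetitions of a cycle of length 4.
wave = [1.0, 0.0, -1.0, 0.0] * 6
acf = sample_acf(wave, max_lag=4)
```

A strong positive spike at the seasonal lag (here lag 4, or lag 12 for monthly data with an annual cycle) is the kind of pattern that suggests a seasonal term or seasonal differencing.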

Support Vector Machine.

SVMs estimate the regression using a set of linear functions that are defined in a high dimensional space. SVMs carry out the regression estimation by using Vapnik's $\varepsilon$-insensitive loss function. SVMs use a risk function consisting of the empirical error and a regularization term [30].

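Assume that $\{(x_i, d_i)\}_{i=1}^{n}$ is a set of data points, where $x_i$ is the input sample, $d_i$ is the desired value and $n$ is the total number of data points. The SVMs calculate the function using the following:

$$y = f(x) = w \cdot \phi(x) + b \tag{4}$$

where $\phi(x)$ is the high dimensional feature space which is non-linearly mapped from the input space $x$. The coefficients $w$ and $b$ are calculated by minimizing

$$R(C) = C \sum_{i=1}^{n} L_{\varepsilon}(d_i, y_i) + \frac{1}{2} \lVert w \rVert^2 \tag{5}$$

$$L_{\varepsilon}(d, y) = \begin{cases} |d - y| - \varepsilon, & |d - y| \ge \varepsilon \\ 0, & \text{otherwise} \end{cases} \tag{6}$$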

In eq. (5), the first term represents the empirical error (risk), which is measured by the $\varepsilon$-insensitive loss function in eq. (6). The second term is the regularization term. C is the regularization constant, which determines the trade-off between the empirical risk and the regularization term. Changing the value of C changes the relative importance of the two terms: increasing C increases the weight of the empirical risk relative to the regularization term. $\varepsilon$ is called the tube size, which is equivalent to the approximation accuracy placed on the training data sample. Both C and $\varepsilon$ are user-prescribed parameters [31].
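The interplay of the tube size and the constant C can be illustrated numerically. The sketch below assumes the common form of the regularized risk, C times the summed $\varepsilon$-insensitive losses plus half the squared norm of the weights, and uses a linear model (identity feature map) to stay short. All names and numbers are hypothetical.

```python
def eps_insensitive_loss(d, y, eps):
    """Zero inside the tube |d - y| < eps, linear outside it."""
    return max(abs(d - y) - eps, 0.0)


def regularized_risk(w, b, inputs, targets, C, eps):
    """C * sum of eps-insensitive losses + 0.5 * ||w||^2 for a linear model."""
    empirical = 0.0
    for x, d in zip(inputs, targets):
        y = sum(wi * xi for wi, xi in zip(w, x)) + b
        empirical += eps_insensitive_loss(d, y, eps)
    regularization = 0.5 * sum(wi * wi for wi in w)
    return C * empirical + regularization


# Invented data: the first point lies inside the tube, the second outside.
inputs = [[1.0], [2.0]]
targets = [1.0, 2.5]
risk = regularized_risk([1.0], 0.0, inputs, targets, C=10.0, eps=0.1)
```

Only the second point contributes to the empirical term (its error of 0.5 exceeds the tube by 0.4), and a larger C makes that contribution weigh more heavily against the regularization term.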

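To estimate $w$ and $b$, eq. (5) is transformed to the primal function given by eq. (7) by introducing the positive slack variables $\xi_i$ and $\xi_i^*$ as follows:

$$\text{minimize} \quad R(w, \xi, \xi^*) = \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \tag{7}$$

subject to

$$d_i - w \cdot \phi(x_i) - b \le \varepsilon + \xi_i, \qquad w \cdot \phi(x_i) + b - d_i \le \varepsilon + \xi_i^*, \qquad \xi_i, \xi_i^* \ge 0, \quad i = 1, \ldots, n.$$

Finally, by introducing Lagrange multipliers and exploiting the optimality constraints, the decision function given by Eq. (4) has the following explicit form:

$$f(x) = \sum_{i=1}^{n} (a_i - a_i^*) K(x, x_i) + b \tag{8}$$

In Eq. (8), $a_i$ and $a_i^*$ are the so-called Lagrange multipliers. They satisfy the conditions $a_i \cdot a_i^* = 0$, $a_i \ge 0$ and $a_i^* \ge 0$, where $i = 1, \ldots, n$, and are obtained by maximizing the dual function of eq. (7), which has the following form:

$$R(a, a^*) = \sum_{i=1}^{n} d_i (a_i - a_i^*) - \varepsilon \sum_{i=1}^{n} (a_i + a_i^*) - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} (a_i - a_i^*)(a_j - a_j^*) K(x_i, x_j) \tag{9}$$

with the constraints

$$\sum_{i=1}^{n} (a_i - a_i^*) = 0, \qquad 0 \le a_i \le C, \qquad 0 \le a_i^* \le C, \quad i = 1, \ldots, n.$$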

$K(x_i, x_j)$ is called the kernel function. The value of the kernel is equal to the inner product of the two vectors $x_i$ and $x_j$ in the feature space, $\phi(x_i)$ and $\phi(x_j)$, that is, $K(x_i, x_j) = \phi(x_i) \cdot \phi(x_j)$. The elegance of using the kernel function is that one can deal with feature spaces of arbitrary dimensionality without having to compute the map $\phi(x)$ explicitly. A typical example of a kernel function is the Gaussian kernel $K(x_i, x_j) = \exp\left(-\lVert x_i - x_j \rVert^2 / (2\sigma^2)\right)$, where $\sigma$ is the bandwidth of the Gaussian kernel [32]. The kernel parameter should be carefully chosen as it implicitly defines the structure of the high dimensional feature space and thus controls the complexity of the final solution. From the implementation point of view, training SVMs is equivalent to solving a linearly constrained quadratic programming (QP) problem with twice as many variables as training data points. The sequential minimal optimization algorithm propounded by Schölkopf and Smola [33], [34] is reported to be very effective in training SVMs for solving regression problems.
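A small sketch may help show how a trained model evaluates a new input through the kernel alone, without ever computing the feature map. It assumes one common parametrization of the Gaussian kernel (squared distance divided by twice the squared bandwidth); the support vectors, coefficients and offset below are invented rather than obtained by training.

```python
import math


def gaussian_kernel(x_i, x_j, sigma):
    """exp(-||x_i - x_j||^2 / (2 * sigma^2)), one common Gaussian kernel form."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def decision_function(x, support_vectors, coefficients, b, sigma):
    """Kernel expansion: sum of coefficient_i * K(x, x_i), plus the offset b.

    Each coefficient stands for the difference of a pair of Lagrange
    multipliers found during training.
    """
    return sum(c * gaussian_kernel(x, sv, sigma)
               for c, sv in zip(coefficients, support_vectors)) + b


# Hypothetical trained quantities, chosen only for illustration.
support_vectors = [[0.0], [1.0]]
coefficients = [2.0, -1.0]
value = decision_function([0.0], support_vectors, coefficients, b=0.5, sigma=1.0)
```

The kernel equals 1 when its two arguments coincide and decays towards 0 as they move apart, with the bandwidth controlling how quickly.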

Model selection criterion and evaluation indices.

The contrasts between the observed value of the raw series and the predicted values obtained through the four methods were compared to determine the efficacy of the four forecasting methods used in the present study. The mean absolute error (MAE), mean absolute percentage error (MAPE), and the root mean square error (RMSE) were selected as the measures of evaluation because as empirical methods they are widely used in combining and selecting forecasts for measuring bias and accuracy of models [35].

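These measures were calculated using Equations (10), (11), and (12), where $P_t$ is the predicted value at time t, $Z_t$ is the observed value at time t and T is the number of predictions.

$$\mathrm{MAE} = \frac{1}{T} \sum_{t=1}^{T} \left| P_t - Z_t \right| \tag{10}$$

$$\mathrm{MAPE} = \frac{1}{T} \sum_{t=1}^{T} \frac{\left| P_t - Z_t \right|}{Z_t} \times 100\% \tag{11}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} \left( P_t - Z_t \right)^2} \tag{12}$$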
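The three measures are simple to compute. The following sketch uses their standard definitions, with MAPE expressed as a percentage and observed values assumed to be positive; the function names and the three-point example are illustrative only.

```python
import math


def mae(predicted, observed):
    """Mean absolute error."""
    return sum(abs(p - z) for p, z in zip(predicted, observed)) / len(observed)


def mape(predicted, observed):
    """Mean absolute percentage error, in percent (observed values must be non-zero)."""
    return 100.0 * sum(abs(p - z) / z
                       for p, z in zip(predicted, observed)) / len(observed)


def rmse(predicted, observed):
    """Root mean square error."""
    return math.sqrt(sum((p - z) ** 2
                         for p, z in zip(predicted, observed)) / len(observed))


# Invented values with absolute errors of 1, 0 and 2.
predicted = [1.0, 2.0, 4.0]
observed = [2.0, 2.0, 2.0]
scores = (mae(predicted, observed), mape(predicted, observed), rmse(predicted, observed))
```

Because RMSE squares the errors before averaging, the single error of 2 pulls it above the MAE, while MAPE rescales each error by the size of the observed value.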

To take into account the variability of MAE, MAPE and RMSE, the block bootstrap technique [36] was adopted to calculate their standard errors. All of the incidence time series in the current research have a one-year period of seasonality (D = 1). Therefore, in our block bootstrap simulations, the block length was set to 12 months so that the autocorrelation structure within seasonal blocks was preserved. We first simulated 10000 replications by block bootstrap sampling, and then calculated the MAE, MAPE and RMSE for each replication. Finally, the standard errors were obtained by the following formula:

$$SE(\mathrm{Index}) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( \mathrm{Index}_i - \overline{\mathrm{Index}} \right)^2}$$

where n is the number of replications (10000) and Index can be MAE, MAPE or RMSE. Taking MAE as an example, $\mathrm{Index}_i$ represents the value of MAE in the i-th replication and $\overline{\mathrm{Index}}$ is the mean value of MAE over all replications. The same applies to MAPE and RMSE.
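One way such a block bootstrap could be set up is sketched below. The details are assumptions, not the study's procedure: the paired predicted and observed values are cut into consecutive non-overlapping blocks, blocks are drawn with replacement until the original number of blocks is reached, and the standard error is the sample standard deviation of the metric over the replications. The names and the tiny data set are hypothetical, with a block length of 2 instead of 12 for brevity.

```python
import math
import random


def mae(predicted, observed):
    """Mean absolute error."""
    return sum(abs(p - z) for p, z in zip(predicted, observed)) / len(observed)


def block_bootstrap_se(predicted, observed, metric,
                       block_length=12, replications=10000, seed=0):
    """Standard error of a metric from resampled non-overlapping blocks."""
    rng = random.Random(seed)
    pairs = list(zip(predicted, observed))
    blocks = [pairs[i:i + block_length]
              for i in range(0, len(pairs), block_length)]
    values = []
    for _ in range(replications):
        # Draw as many blocks as the original series has, with replacement.
        sample = [pair for _ in blocks for pair in rng.choice(blocks)]
        p, z = zip(*sample)
        values.append(metric(p, z))
    centre = sum(values) / len(values)
    return math.sqrt(sum((v - centre) ** 2 for v in values) / (len(values) - 1))


# Invented data: absolute errors are 1 in the first block and 3 in the second.
observed = [1.0, 1.0, 1.0, 1.0]
predicted = [2.0, 2.0, 4.0, 4.0]
se = block_bootstrap_se(predicted, observed, mae,
                        block_length=2, replications=2000, seed=0)
```

Keeping whole blocks together means that values within a seasonal cycle are always resampled as a unit, which is the reason for matching the block length to the seasonal period.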

Time Series Modeling Results

Decomposition Methods

Seasonal indices of the different types of infectious diseases were extracted from the original time series; they are listed in Table 1 (Seasonal index of each type of infectious disease) and shown in Figures 1 and 2 (Seasonal index of each type of infectious disease). The seasonality of the incidence of each infectious disease can be seen from the seasonal indices. All the selected infectious diseases show a seasonal pattern, as the occurrence of infectious disease can be more or less influenced by temperature, rainfall, sunshine, etc. However, the extent of the seasonality differs among them. Figure 1 shows the five types of diseases whose seasonal indices vary markedly over the 12 months. Figure 2 shows the four types of disease whose seasonal indices vary little. Brucellosis, hemorrhagic fever, scarlet fever, schistosomiasis and typhoid fever show stronger seasonality than the others, as the variances of their seasonal indices are larger. The incidence of brucellosis is higher in summer and lower in winter, with the crest in June. Hemorrhagic fever has the highest seasonal index in November and the lowest in September. Scarlet fever has higher seasonal indices in May, June and December and a lower index in August. The incidence of schistosomiasis is higher in summer and lower in winter, with the crest in July. The incidence of typhoid fever is higher in summer and lower in winter, with the crest in August. The other diseases, namely hepatitis A, hepatitis B, gonorrhea and syphilis, have relatively smooth seasonal index curves.

Figure 1. Seasonal index of each type of infectious disease (1).

doi:10.1371/journal.pone.0088075.g001

Figure 2. Seasonal index of each type of infectious disease (2).

doi:10.1371/journal.pone.0088075.g002

Table 1. Seasonal index of each type of infectious disease.

doi:10.1371/journal.pone.0088075.t001

After the extraction of seasonal indices, linear regression models were fitted to the deseasonalized incidence time series. The form of the regression model is:

Deseasonalized value at time t = Constant + Coefficient * t

The parameters of the established models are listed in Table 2 (Regression results of each series removed seasonality). $R^2$ is the coefficient of determination. It ranges between 0 and 1 and describes how well a regression line fits a set of data. An $R^2$ near 1 indicates that the regression line fits the data well, while an $R^2$ closer to 0 indicates that it does not fit the data well. It can be seen from Table 2 that the regression models on the seasonality-removed incidence data of brucellosis, gonorrhea, hepatitis A, syphilis and typhoid fever generally fit well. The regression model on the seasonality-removed incidence data of hepatitis B fits poorly, with a P value over 0.05. However, the model was still used to forecast the incidence in the study because it showed acceptable fitting and forecasting efficacy.
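The trend regression and its coefficient of determination can be sketched with ordinary least squares. The code assumes a time index running from 1 to n and the usual definition of $R^2$ as one minus the ratio of residual to total sum of squares; the function name and the input values are invented.

```python
def fit_linear_trend(values):
    """Least squares fit of value = constant + coefficient * t, with t = 1..n."""
    n = len(values)
    t = list(range(1, n + 1))
    t_mean = sum(t) / n
    v_mean = sum(values) / n
    sxy = sum((ti - t_mean) * (vi - v_mean) for ti, vi in zip(t, values))
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    coefficient = sxy / sxx
    constant = v_mean - coefficient * t_mean
    fitted = [constant + coefficient * ti for ti in t]
    ss_res = sum((vi - fi) ** 2 for vi, fi in zip(values, fitted))
    ss_tot = sum((vi - v_mean) ** 2 for vi in values)
    r_squared = 1.0 - ss_res / ss_tot
    return constant, coefficient, r_squared


# Invented deseasonalized values lying exactly on the line 1 + 2t.
constant, coefficient, r_squared = fit_linear_trend([3.0, 5.0, 7.0, 9.0])
```

A forecast for a future month is then the fitted trend value at that time index multiplied by the seasonal index of the corresponding calendar month.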

Table 2. Regression results of each series removed seasonality.

doi:10.1371/journal.pone.0088075.t002

We also used exponential smoothing to extract the long-term trend after the extraction of seasonal indices. Smoothing factors from 0.1 to 0.9 were tested in steps of 0.1, and the factor giving the minimum MSE in the modeling process was selected.
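The selection of the smoothing factor can be sketched as a small grid search. How the modeling MSE was defined is not stated, so the sketch assumes a one-step-ahead error in which the smoothed level at time t-1 serves as the forecast for time t. The function names and the demonstration series are hypothetical.

```python
def one_step_mse(x, alpha):
    """MSE when the smoothed level at t-1 is used as the forecast for t."""
    level = x[0]
    squared_errors = []
    for value in x[1:]:
        squared_errors.append((value - level) ** 2)
        level = alpha * value + (1 - alpha) * level
    return sum(squared_errors) / len(squared_errors)


def select_alpha(x, candidates=None):
    """Return the candidate smoothing factor with the smallest one-step MSE."""
    if candidates is None:
        candidates = [k / 10 for k in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    return min(candidates, key=lambda alpha: one_step_mse(x, alpha))


# Invented steadily rising series: a fast-reacting level tracks it best.
rising = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
best_alpha = select_alpha(rising)
```

For a series with a persistent trend the largest candidate wins, whereas a noisy series without trend would favor a smaller factor.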

ARIMA model

ARIMA models were fitted to the incidence series of the nine types of infectious diseases from 2005 to 2011 and tested by predicting the incidence for the year 2012. Different ARIMA models were tested to determine the best fitting models. Table 3 (Estimation of available ARIMA models for each disease) presents the results of the estimations using various ARIMA processes for the nine disease incidence time series. The best models were selected according to the AIC and SBC. The final selected ARIMA models are highlighted in yellow in Table 3. The parameter significance test and the white noise diagnostic check on the residuals of the selected model were performed to ensure that the data were fully modeled.

Table 3. Estimation of available ARIMA models for each disease.

doi:10.1371/journal.pone.0088075.t003

Support Vector Machine

The training number of the SVM based time series model, that is, the number of preceding values used as input, needed to be determined. In previous studies [13], the training number for periodic series was usually the period of the series. In the present study, the period of all the selected infectious disease incidence series is twelve. Therefore, twelve was selected as the training number for the SVM based models, in which the preceding 12 months of data served as the input for forecasting the current value. The data series must be transformed appropriately to determine the input and the output data before the training process. Supposing that $X_t$ represents the value at time t and N is the number of observations in the sample, the input matrix and the corresponding output matrix of the training and validation sample used in our study are written as follows:

$$\text{Input} = \begin{bmatrix} X_1 & X_2 & \cdots & X_{12} \\ X_2 & X_3 & \cdots & X_{13} \\ \vdots & \vdots & & \vdots \\ X_{N-12} & X_{N-11} & \cdots & X_{N-1} \end{bmatrix}, \qquad \text{Output} = \begin{bmatrix} X_{13} \\ X_{14} \\ \vdots \\ X_{N} \end{bmatrix}$$

The input matrix is sent into SVM for training, and its corresponding output matrix is its training goal. Once the parameters are determined, they are used to forecast the incidence in 2012 iteratively.
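The construction of the input and output samples, and one possible reading of iterative forecasting, can be sketched as follows. It is assumed here that "iteratively" means each forecast is appended to the input window for the next step. The function names are hypothetical, and a seasonal naive rule (repeat the value from 12 months earlier) stands in for a trained SVM.

```python
def make_lagged_samples(series, window=12):
    """Each input row holds `window` consecutive values; the output is the next value."""
    count = len(series) - window
    inputs = [series[i:i + window] for i in range(count)]
    outputs = [series[i + window] for i in range(count)]
    return inputs, outputs


def iterative_forecast(model, history, steps, window=12):
    """Forecast step by step, feeding each prediction back into the input window."""
    buffer = list(history[-window:])
    forecasts = []
    for _ in range(steps):
        prediction = model(buffer)
        forecasts.append(prediction)
        buffer = buffer[1:] + [prediction]
    return forecasts


def seasonal_naive(window_values):
    """Stand-in for a trained model: repeat the oldest value in the window."""
    return window_values[0]


# Invented series of 24 monthly values, 1.0 to 24.0.
series = [float(v) for v in range(1, 25)]
inputs, outputs = make_lagged_samples(series, window=12)
forecasts = iterative_forecast(seasonal_naive, series, steps=3)
```

With 24 observations and a window of 12, the sketch yields 12 training rows, the first pairing months 1 to 12 with month 13.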

Several parameters needed to be determined. They are C, $\varepsilon$ and the kernel parameter $\sigma$. The accuracy of SVMs is reportedly not sensitive to the value of $\varepsilon$. In the present study, the value of $\varepsilon$ was prescribed as 0.01. Different values of C and $\sigma$ were examined from $2^{-10}$ to $2^{10}$ in increments of 2. There is no systematic way to determine the optimal parameters of SVMs. In the present study, cross validation methods were applied to determine the proper SVMs. The training samples were randomly divided into k parts in the training process; each part in turn was used for testing and the others were used for training. The MSE obtained in each test was recorded, and the mean of the MSEs acted as the selection criterion for the optimal parameters.
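The parameter search could be organized as in the sketch below. Several points are assumptions: the grid is built from powers of two with the exponent stepped by 2 (the source wording is ambiguous), k is set to 5 because no value is given, and `make_fit` is a hypothetical factory that would wrap an actual SVM trainer. To keep the example runnable without an SVM library, a stand-in that predicts the training mean is used.

```python
import random


def power_of_two_grid(low=-10, high=10, step=2):
    """Candidate values 2**low, 2**(low + step), ..., up to 2**high."""
    return [2.0 ** e for e in range(low, high + 1, step)]


def k_fold_indices(n, k, seed=0):
    """Randomly split the indices 0..n-1 into k parts of near-equal size."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[i::k] for i in range(k)]


def cross_validated_mse(fit, inputs, outputs, k=5, seed=0):
    """Mean over folds of the MSE on the held-out part."""
    scores = []
    for held_out in k_fold_indices(len(inputs), k, seed):
        held = set(held_out)
        train_x = [x for i, x in enumerate(inputs) if i not in held]
        train_y = [y for i, y in enumerate(outputs) if i not in held]
        predict = fit(train_x, train_y)
        errors = [(predict(inputs[i]) - outputs[i]) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return sum(scores) / len(scores)


def grid_search(make_fit, inputs, outputs, c_values, sigma_values, k=5):
    """Return the (C, sigma) pair with the lowest cross-validated MSE."""
    candidates = [(c, s) for c in c_values for s in sigma_values]
    return min(candidates,
               key=lambda p: cross_validated_mse(make_fit(*p), inputs, outputs, k))


def make_mean_fit(c, sigma):
    """Stand-in factory: ignores C and sigma and predicts the training mean."""
    def fit(train_x, train_y):
        average = sum(train_y) / len(train_y)
        return lambda x: average
    return fit


# Invented data with a constant target, so every candidate scores zero.
grid = power_of_two_grid()
demo_inputs = [[float(i)] for i in range(10)]
demo_outputs = [2.0] * 10
score = cross_validated_mse(make_mean_fit(1.0, 1.0), demo_inputs, demo_outputs, k=5)
best = grid_search(make_mean_fit, demo_inputs, demo_outputs, grid, grid, k=5)
```

With a real SVM trainer in place of the stand-in, `grid_search` would return the pair of C and kernel parameter whose models generalize best across the folds.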

Comparisons of the forecasting performance

Table 4 (Comparison of the performance of the four different methods), Figures 3 to 6 (MAPE for each model) and Figures 7 to 9 (Comparison of the performances of the four different methods) show the modeling and prediction performances of the four methods. Residual plots of the four methods were made for each disease. The residual plots of brucellosis and typhoid fever are presented in this paper as examples (Figures 10 and 11). The fitted and forecast incidences of the four methods over seven years are graphed in Figures 12 to 20. Generally, the fitted values and predicted values obtained by all four methods reasonably matched the real incidence of the infectious diseases. It can also be seen that the performances of the four methods differ among the different diseases. The standard errors of the MAE, MAPE and MSE are quite small, indicating that these index values are quite stable.

Figure 3. MAPE for Decomposition method (Regression).

doi:10.1371/journal.pone.0088075.g003

Figure 4. MAPE for Decomposition method (Exponential Smoothing).

doi:10.1371/journal.pone.0088075.g004

Figure 5. MAPE for ARIMA model.

doi:10.1371/journal.pone.0088075.g005

Figure 6. MAPE for SVM model.

doi:10.1371/journal.pone.0088075.g006

Figure 7. Comparison of the performances of the four different methods (1).

doi:10.1371/journal.pone.0088075.g007

Figure 8. Comparison of the performances of the four different methods (2).

doi:10.1371/journal.pone.0088075.g008

Figure 9. Comparison of the performances of the four different methods (3).

doi:10.1371/journal.pone.0088075.g009

Figure 10. Residual plot of the four methods modeling Brucellosis.

doi:10.1371/journal.pone.0088075.g010

Figure 11. Residual plot of the four methods modeling Typhoid fever.

doi:10.1371/journal.pone.0088075.g011

Figure 12. Brucellosis incidence and fitting values predicted by the four methods.

doi:10.1371/journal.pone.0088075.g012

Figure 13. Gonorrhea incidence and fitting values predicted by the four methods.

doi:10.1371/journal.pone.0088075.g013

Figure 14. Hemorrhagic fever incidence and fitting values predicted by the four methods.

doi:10.1371/journal.pone.0088075.g014

Figure 15. Hepatitis A incidence and fitting values predicted by the four methods.

doi:10.1371/journal.pone.0088075.g015

Figure 16. Hepatitis B incidence and fitting values predicted by the four methods.

doi:10.1371/journal.pone.0088075.g016

Figure 17. Scarlet fever incidence and fitting values predicted by the four methods.

doi:10.1371/journal.pone.0088075.g017

Figure 18. Schistosomiasis incidence and fitting values predicted by the four methods.

doi:10.1371/journal.pone.0088075.g018

Figure 19. Syphilis incidence and fitting values predicted by the four methods.

doi:10.1371/journal.pone.0088075.g019

Figure 20. Typhoid fever incidence and fitting values predicted by the four methods.

doi:10.1371/journal.pone.0088075.g020

Table 4. Comparison of the performance of the four different methods.

doi:10.1371/journal.pone.0088075.t004

MAPE is a relative index among the three evaluation indices. We used MAPE to evaluate the general performance of the models in forecasting each disease. The MAPEs obtained by each model for each disease in both the modeling process and the prediction process are shown in Figures 3 to 6.

For the decomposition (regression) method, most of the MAPEs in the modeling process were within 30%, the exception being scarlet fever (42%). In the prediction process, the MAPEs for all infectious diseases were within 30% except hemorrhagic fever (55%) and typhoid fever (51%). The decomposition (regression) method therefore performed poorly in fitting scarlet fever incidence and in predicting the incidences of hemorrhagic fever and typhoid fever.

All of the MAPEs obtained by the decomposition (exponential smoothing) method in the modeling process were within 15%, so the method generally fitted well in the modeling process. In the prediction process, the MAPEs for all infectious diseases were within 30% except scarlet fever (59%) and hepatitis A (31%). The decomposition (exponential smoothing) method performed poorly in predicting scarlet fever incidence.

The MAPEs obtained by the ARIMA model in the modeling process were within 30%. In the prediction process, the MAPEs for the nine kinds of infectious diseases were within 30% except scarlet fever (175%). The ARIMA model performed well in fitting all the selected infectious diseases, but performed poorly in forecasting scarlet fever.

The MAPEs obtained by the SVM model in the modeling process were within 15%. In the prediction process, the MAPEs for the nine kinds of infectious diseases were within 20% except scarlet fever (33%) and schistosomiasis (25%). The SVM based model performed well in both the fitting process and the prediction process for all the selected infectious diseases.

To compare the performance of the different models for different diseases, different evaluation indices were emphasized. MAPE was emphasized for diseases with a lower level of incidence (annual mean incidence <0.1/100,000), such as schistosomiasis (0.0245/100,000) and hemorrhagic fever (0.0814/100,000). RMSE was emphasized for diseases with a higher level of incidence (mean incidence >1/100,000), such as hepatitis B (7.9335/100,000) and syphilis (1.8461/100,000). MAE was emphasized for diseases with a medium level of incidence (0.1/100,000 < mean incidence < 1/100,000), including hepatitis A (0.3356/100,000), gonorrhea (0.8329/100,000), scarlet fever (0.2131/100,000), typhoid fever (0.1242/100,000) and brucellosis (0.1975/100,000). The performances of the four methods for gonorrhea, hepatitis B, schistosomiasis and syphilis, ranked in descending order, were: SVM, ARIMA, exponential smoothing and regression. The performances of the four methods for hepatitis A, ranked in descending order, were: SVM, regression, exponential smoothing and ARIMA. The performances of the four methods for brucellosis and hemorrhagic fever, ranked in descending order, were: ARIMA, SVM, exponential smoothing and regression. The performances of the four methods for scarlet fever, ranked in descending order, were: regression, SVM, exponential smoothing and ARIMA. The performances of the four methods for typhoid fever, ranked in descending order, were: exponential smoothing, ARIMA, SVM and regression. SVMs performed best in forecasting gonorrhea, hepatitis A, hepatitis B, schistosomiasis and syphilis. ARIMA performed best in forecasting brucellosis and hemorrhagic fever and performed the worst in forecasting scarlet fever. Exponential smoothing performed best in forecasting typhoid fever, but worst in hepatitis A. The regression method performed best in forecasting scarlet fever, but worst in brucellosis, gonorrhea, hemorrhagic fever, schistosomiasis, syphilis and typhoid fever. The exponential smoothing method performed better than the regression decomposition method except in the cases of hepatitis A and scarlet fever.

Discussion

The early recognition of epidemic behavior is important for epidemic disease control and prevention. Statistical models have been shown to be effective in forecasting future epidemic disease incidence [37]. The surveillance system is a good way to collect and analyze infectious disease data. With high quality surveillance data, epidemic behavior may be accurately detected and forecasted, so discussion of the forecasting techniques is important. In the present study, we conducted a comparative study of four typical time series methods in forecasting the epidemic pattern of nine types of infectious diseases, namely two decomposition methods (regression and exponential smoothing), the ARIMA model, and the SVM based model. We have also compared the differences among these methods in both principle and practical aspects.

In principle, the decomposition method can break down the original series into different parts. The seasonal factor can be expressed in the form of seasonal indices. The series after seasonal pattern removal can be modeled with regression methods, exponential smoothing, etc. Time series decomposition models do not involve a lot of mathematics or statistics, so they are relatively easy to explain to the end user. The ARIMA model can capture the historical information through (1) AR terms, which consider the past values, and (2) MA terms, which consider the current and previous residuals. The ARIMA model is popular because of its known statistical properties and the well-known Box–Jenkins methodology in the modeling process. It is one of the most effective linear models for seasonal time series forecasting. In contrast, the SVM time series models capture the historical information by nonlinear functions. With flexible nonlinear function mapping capability, the support vector machine can approximate any continuous measurable function with arbitrarily desired accuracy.

In practice, building the decomposition models generally involves two parts: (1) extraction of the seasonal indices that express the seasonal pattern hidden in the infectious disease time series, and (2) modeling of the long-term trend with regression methods. Building the ARIMA model requires the determination of the differencing orders (d, D) and the operator orders (p, q, P, Q), as well as the estimation of the model parameters in the autoregressive and moving average polynomials. Constructing the SVMs requires the determination of three parameters, one of which is the regularization constant C. The time series data should be transformed into an input matrix and an output matrix, which are then fed into the support vector machine. Certain training accuracy goals should be assigned before training.
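The transformation of a series into an input matrix and an output matrix can be illustrated as follows. This is a minimal sketch that assumes a sliding window of lagged values as the inputs; the window length `n_lags` and the function name `make_lagged_matrices` are hypothetical, and the text does not state which lags were used in the study.

```python
def make_lagged_matrices(series, n_lags):
    """Turn a univariate series into (X, y) for supervised learning.

    Each row of X holds n_lags consecutive observations, and the matching
    entry of y is the observation that immediately follows them.
    """
    if n_lags < 1 or n_lags >= len(series):
        raise ValueError("n_lags must be between 1 and len(series) - 1")
    n_rows = len(series) - n_lags
    X = [list(series[i:i + n_lags]) for i in range(n_rows)]
    y = [series[i + n_lags] for i in range(n_rows)]
    return X, y
```

For example, `make_lagged_matrices([1, 2, 3, 4, 5], 2)` pairs the windows `[1, 2]`, `[2, 3]`, and `[3, 4]` with the targets `3`, `4`, and `5`. The resulting pair could then be passed to a support vector regression routine together with the chosen parameter values.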

Based on the three forecasting error measures (MAE, MAPE, and MSE) and on visual inspection of the forecasted values, the empirical evidence is that no single method completely dominated the others. However, the present study shows that the support vector machine generally outperforms the conventional ARIMA model and the decomposition methods. The ARIMA model has proved to be an effective linear model for capturing a linear trend in the infectious disease series. The decomposition methods generally perform better when the series conform to the decomposition hypothesis. The linear regression hypothesis seems to be more rigid than exponential smoothing when applied to the seasonally adjusted series.
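For reference, the three error measures can be written out under their usual textbook definitions. The sketch below assumes that MAPE is reported as a fraction (multiply by 100 for a percentage) and that no observed value is zero; the exact conventions used for the tables in this paper may differ.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error, returned as a fraction.

    Undefined when any actual value is zero.
    """
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mse(actual, predicted):
    """Mean square error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
```

MAE and MSE depend on the scale of the incidence series, whereas MAPE is scale free, which is one reason the three measures can rank the methods differently for the same disease.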

The advantage of the decomposition methods is that they do not involve a lot of mathematics or statistics and are relatively easy to explain to the end user. This is a major advantage because an end user who appreciates how the forecast was developed may have more confidence in using it for decision making. The disadvantage of the decomposition methods is that the hypothesis may be too strong for the epidemic behavior, so the model may sometimes perform poorly.

The ARIMA model has advantages in its well-known statistical properties and effective modeling process, and it is easily implemented in mainstream statistical software. The model can be used when the seasonal time series are stationary, or can be made stationary by differencing, and have no missing data. The disadvantage of the ARIMA model is that it can only extract linear relationships within the time series data. It may not work well for an infectious disease whose occurrence is affected by various factors, including many meteorological and social factors; that is, the occurrence of the disease is not necessarily linearly related to the historical data. Our study suggested that nonlinear relationships may exist among the monthly incidences of many diseases, such as scarlet fever, so the ARIMA model did not efficiently extract the full relationship hidden in the historical data.

Support vector machines are potentially useful methods for epidemic time series forecasting because of their strong nonlinear mapping ability and tolerance of complexity in the forecasting data. SVMs have very good learning ability in time series modeling, and they have distinctive advantages over other machine learning methods, such as neural networks. For example, SVMs implement the structural risk minimization principle, which leads to better generalization than neural networks, which implement the empirical risk minimization principle. SVMs also have fewer free parameters than neural networks [38].

Moreover, the scarlet fever incidence shown in Figure 17 (scarlet fever incidence and fitted values predicted by the four methods) indicates that the average incidence from 2011 to 2012 was higher than that in the previous six years (2005–2010). Tsay [39] called a large change in the level of a series over time a level shift. Because the ARIMA model is in effect a regression of the present incidence value on past values and residuals, a level shift is likely to degrade its forecasting performance. Statisticians and time series analysts have therefore tried for many years to overcome the effect of level shifts. In our study, as presented in Table 4, the MAE, MAPE, and RMSE of the ARIMA model, together with their standard errors, are larger than those of the regression decomposition method, the exponential smoothing method, and the SVM. This result suggests that the other three methods may be better than the SARIMA model for analyzing time series in the presence of a level shift.
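A toy example may help to show what a level shift does to a forecast. The sketch below is not the analysis reported in Table 4: it uses an artificial series whose level jumps from 10 to 20, and it assumes simple (single) exponential smoothing with a hypothetical smoothing constant `alpha`, which may differ from the smoothing variant used in the study.

```python
def exponential_smoothing_forecasts(series, alpha):
    """One-step-ahead forecasts from simple exponential smoothing.

    forecasts[t] is the forecast of series[t] built from series[:t];
    the first forecast is initialized to the first observation.
    """
    level = series[0]
    forecasts = [level]
    for value in series[:-1]:
        # Move the level part of the way toward the newest observation.
        level = alpha * value + (1 - alpha) * level
        forecasts.append(level)
    return forecasts


# Artificial series with a level shift halfway through (illustrative only).
shifted = [10.0] * 6 + [20.0] * 6
smoothed = exponential_smoothing_forecasts(shifted, alpha=0.5)
```

After the shift, the one-step forecasts in `smoothed` move toward the new level with every observation, because the recursion discounts older values geometrically. A model whose parameters were estimated mainly on the data before the shift has no comparable mechanism unless the shift is modeled explicitly.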

The limitations of the study should also be acknowledged. First, only eight years of incidence data were obtained because the Chinese National Surveillance System for Infectious Disease was established only in 2004. The relatively short length of the series may influence the forecasting efficacy of the different methods. Second, we predicted infectious disease incidence with only four typical forecasting methods, and findings based on a specific disease may not be reproducible in other cases. Moreover, the decomposition methods admit other hypotheses about the long-term trend, such as generalized models that assume a nonlinear trend. Many other models, such as GARCH, have been developed to make up for the deficiencies of ARIMA, and SVM is only one of the typical machine learning techniques. In this paper, we chose only four commonly used time series methods for comparison.

Infectious diseases pose a significant threat to human health. The establishment of epidemiological surveillance systems greatly facilitates the implementation of strategic health planning, such as the planning of vaccination costs and stocks. More research on the accurate prediction of epidemiological events based on surveillance data should be conducted, and more sophisticated forecasting techniques should be applied and compared in practice.

Author Contributions

Conceived and designed the experiments: XZ. Performed the experiments: XZ TZ. Analyzed the data: XZ TZ XL. Contributed reagents/materials/analysis tools: XZ XL. Wrote the paper: XZ TZ AY XL.

References

  1. Nobre FF, Monteiro ABS, Telles PR, Williamson GD (2001) Dynamic linear model and SARIMA: a comparison of their forecasting performance in epidemiology. Statistics in medicine 20: 3051–3069. doi: 10.1002/sim.963
  2. Farrington C, Andrews N (2003) Outbreak detection: application to infectious disease surveillance. Monitoring the Health of Populations: Statistical Principles and Methods for Public Health Surveillance 2003: 203–231. doi: 10.1093/acprof:oso/9780195146493.003.0008
  3. Chadwick D, Arch B, Wilder-Smith A, Paton N (2006) Distinguishing dengue fever from other infections on the basis of simple clinical and laboratory features: application of logistic regression analysis. Journal of Clinical Virology 35: 147–153. doi: 10.1016/j.jcv.2005.06.002
  4. Gonzalez-Parra G, Arenas AJ, Jodar L (2009) Piecewise finite series solutions of seasonal diseases models using multistage Adomian method. Communications in Nonlinear Science and Numerical Simulation 14: 3967–3977. doi: 10.1016/j.cnsns.2009.02.023
  5. Spaeder MC, Fackler JC (2012) A multi-tiered time-series modelling approach to forecasting respiratory syncytial virus incidence at the local level. Epidemiology and Infection 140: 602–607. doi: 10.1017/s0950268811001026
  6. Li Q, Guo N-N, Han Z-Y, Zhang Y-B, Qi S-X, et al. (2012) Application of an autoregressive integrated moving average model for predicting the incidence of hemorrhagic Fever with renal syndrome. The American journal of tropical medicine and hygiene 87: 364–370. doi: 10.4269/ajtmh.2012.11-0472
  7. Liu Q, Liu X, Jiang B, Yang W (2011) Forecasting incidence of hemorrhagic fever with renal syndrome in China using ARIMA model. BMC Infectious Diseases 11.
  8. Wongkoon S, Jaroensutasinee M, Jaroensutasinee K (2012) Development of temporal modeling for prediction of dengue infection in Northeastern Thailand. Asian Pacific Journal of Tropical Medicine 5: 249–252. doi: 10.1016/s1995-7645(12)60034-0
  9. Luz PM, Mendes BVM, Codeco CT, Struchiner CJ, Galvani AP (2008) Time Series Analysis of Dengue Incidence in Rio de Janeiro, Brazil. American Journal of Tropical Medicine and Hygiene 79: 933–939.
  10. Rios M, Garcia JM, Sanchez JA, Perez D (2000) A statistical analysis of the seasonality in pulmonary tuberculosis. European Journal of Epidemiology 16: 483–488. doi: 10.1023/a:1007653329972
  11. Gonzalez-Parra G, Arenas AJ, Jodar L (2009) Piecewise finite series solutions of seasonal diseases models using multistage Adomian method. Communications in Nonlinear Science and Numerical Simulation 14: 3967–3977. doi: 10.1016/j.cnsns.2009.02.023
  12. Ture M, Kurt I (2006) Comparison of four different time series methods to forecast hepatitis A virus infection. Expert Systems with Applications 31: 41–46. doi: 10.1016/j.eswa.2005.09.002
  13. Zhang X, Liu Y, Yang M, Zhang T, Young AA, et al. (2013) Comparative Study of Four Time Series Methods in Forecasting Typhoid Fever Incidence in China. PloS one 8: e63116. doi: 10.1371/journal.pone.0063116
  14. Bowerman BL, O'Connell RT, Richard T (1993) Forecasting and time series: An applied approach: Belmont CA Wadsworth.
  15. Hamilton JD (1994) Time series analysis: Cambridge Univ Press.
  16. Zhang GP (2003) Time series forecasting using a hybrid ARIMA and neural network model. Neurocomputing 50: 159–175. doi: 10.1016/s0925-2312(01)00702-0
  17. Pai P-F, Lin C-S (2005) A hybrid ARIMA and support vector machines model in stock price forecasting. Omega 33: 497–505. doi: 10.1016/j.omega.2004.07.024
  18. Chang C-C, Lin C-J (2011) LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST) 2: 27. doi: 10.1145/1961189.1961199
  19. Thissen U, Van Brakel R, De Weijer A, Melssen W, Buydens L (2003) Using support vector machines for time series prediction. Chemometrics and intelligent laboratory systems 69: 35–49. doi: 10.1016/s0169-7439(03)00111-4
  20. Pai P-F, Lin C-S (2005) Using support vector machines to forecast the production values of the machinery industry in Taiwan. The International Journal of Advanced Manufacturing Technology 27: 205–210. doi: 10.1007/s00170-004-2139-y
  21. Hong W-C, Pai P-F (2006) Predicting engine reliability by support vector machines. The International Journal of Advanced Manufacturing Technology 28: 154–161. doi: 10.1007/s00170-004-2340-z
  22. Müller K-R, Smola AJ, Rätsch G, Schölkopf B, Kohlmorgen J, et al. (1997) Predicting time series with support vector machines. Artificial Neural Networks–ICANN'97: Springer. pp. 999–1004.
  23. Tay FE, Cao L (2002) Modified support vector machines in financial time series forecasting. Neurocomputing 48: 847–861. doi: 10.1016/s0925-2312(01)00676-2
  24. Wei WW-S (1994) Time series analysis: Addison-Wesley Redwood City, California.
  25. Moghram I, Rahman S (1989) Analysis and evaluation of five short-term load forecasting techniques. Power Systems, IEEE Transactions on 4: 1484–1491. doi: 10.1109/59.41700
  26. Grahn T (1995) A CONDITIONAL LEAST SQUARES APPROACH TO BILINEAR TIME SERIES ESTIMATION. Journal of Time Series Analysis 16: 509–529. doi: 10.1111/j.1467-9892.1995.tb00251.x
  27. Ho SL, Xie M, Goh TN (2002) A comparative study of neural network and Box-Jenkins ARIMA modeling in time series prediction. Computers & Industrial Engineering 42: 371–375. doi: 10.1016/s0360-8352(02)00036-0
  28. Galbraith J, Zinde-Walsh V (1999) On the distributions of Augmented Dickey–Fuller statistics in processes with moving average components. Journal of Econometrics 93: 25–47. doi: 10.1016/s0304-4076(98)00097-9
  29. Koehler AB, Murphree ES (1988) A Comparison of the Akaike and Schwarz Criteria for Selecting Model Order. Journal of the Royal Statistical Society Series C (Applied Statistics) 37: 187–195. doi: 10.2307/2347338
  30. Wu C-H, Ho J-M, Lee D-T (2004) Travel-time prediction with support vector regression. Intelligent Transportation Systems, IEEE Transactions on 5: 276–281. doi: 10.1109/tits.2004.837813
  31. Xuegong Z (2000) Introduction to statistical learning theory and support vector machines. Acta Automatica Sinica 26: 32–42.
  32. Thissen U, Van Brakel R, De Weijer A, Melssen W, Buydens L (2003) Using support vector machines for time series prediction. Chemometrics and intelligent laboratory systems 69: 35–49. doi: 10.1016/s0169-7439(03)00111-4
  33. Smola AJ, Schölkopf B (2004) A tutorial on support vector regression. Statistics and computing 14: 199–222. doi: 10.1023/b:stco.0000035301.49549.88
  34. Thomason M (1999) The practitioner methods and tool. Journal of Computational Intelligence in Finance 7(3): 36–45.
  35. Christodoulos C, Michalakelis C, Varoutas D (2011) On the combination of exponential smoothing and diffusion forecasts: An application to broadband diffusion in the OECD area. Technological Forecasting and Social Change 78: 163–170. doi: 10.1016/j.techfore.2010.08.007
  36. Davison AC, Hinkley DV (1997) Bootstrap methods and their application. Cambridge: Cambridge University Press. pp. 23.
  37. Yan W, Xu Y, Yang X, Zhou Y (2010) A Hybrid Model for Short-Term Bacillary Dysentery Prediction in Yichang City, China. Japanese Journal of Infectious Diseases 63: 264–270.
  38. Tay FE, Cao L (2001) Application of support vector machines in financial time series forecasting. Omega 29: 309–317. doi: 10.1016/s0305-0483(01)00026-3
  39. Tsay RS (1988) Outliers, level shifts, and variance changes in time series. Journal of forecasting 7(1): 1–20. doi: 10.1002/for.3980070102