A new Seasonal Difference Space-Time Autoregressive Integrated Moving Average (SD-STARIMA) model and spatiotemporal trend prediction analysis for Hemorrhagic Fever with Renal Syndrome (HFRS)

Hemorrhagic fever with renal syndrome (HFRS) is a naturally-occurring, fecally transmitted disease caused by a Hantavirus (HV). It is extremely damaging to human health and results in many deaths annually, especially in Hubei Province, China. One of the primary characteristics of HFRS is the spatiotemporal heterogeneity of its occurrence, with notable seasonal differences. In view of this heterogeneity, the present study suggests that there is a need to focus on trend simulation and the spatiotemporal prediction of HFRS outbreaks. To facilitate this, we constructed a new Seasonal Difference Space-Time Autoregressive Integrated Moving Average (SD-STARIMA) model. The SD-STARIMA model is based on the spatial and temporal characteristics of the Space-Time Autoregressive Integrated Moving Average (STARMA) model first developed by Cliff and Ord in 1974, which has proven useful in modelling the temporal aspects of spatially located data. This model can simulate the trends in HFRS epidemics, taking into consideration both spatial and temporal variations. The SD-STARIMA model is also able to make seasonal difference calculations to eliminate temporally non-stationary problems that are present in the HFRS data. Experiments have demonstrated that the proposed SD-STARIMA model offers notably better prediction accuracy, especially for spatiotemporal series data with seasonal distribution characteristics.


Introduction
Hemorrhagic Fever with Renal Syndrome (HFRS) is a serious infectious disease that is mainly caused by a Hantavirus (HTNV) and the Seoul virus (SEOV) [1][2][3][4]. The clinical symptoms for HFRS are fever, hemorrhaging and renal dysfunction and it can result in long-term kidney damage, hypotension and even death. HFRS has s distribution across a number of countries.
China is the most seriously affected, accounting for more than 90% of the world's cases of HFRS [5][6][7][8][9]. Within China, however, one province in particular, Hubei, has become the most seriously affected area of all in recent years. Since the first case of HFRS was reported in Hubei in 1957, HFRS epidemics have expanded and reached a high point in 1983 with 23,943 cases. From 1980 to 2009, the number of HFRS cases in Hubei Province totaled 104,467. The spread of HFRS has had a significant impact on social stability and human health [10][11][12].
Spatial and temporal statistical methods have been used to discover the spatial and temporal distribution and clustering characteristics of HFRS across a number of different locations [13], including Buenos Aires in Argentina [14], Germany [15] and Brussels in Belgium [16]. In China, a Kulldorff spatial scan statistic has been used to try and identify the clustering of HFRS, drawing upon data spanning the period 1980 to 2009 [17]. A Gaussian GWR model has also been used to try and identify the factors influencing HFRS transmission (such as meteorological factors, rodent density, surface mean elevation, water area and human population density) drawing upon data from Hubei that was collected between 2011 and 2015 [18]. Moran's I index was adopted for a global spatial autocorrelation analysis that sought to identify the overall spatiotemporal pattern of HFRS outbreaks in Hubei between 2005 and 2014, and Spearman's rank correlation analysis was used at the same time to explore the possible factors influencing the epidemics, such as the weather and the area's geography [19]. Cross-correlation analysis has also been used to assess a possible association with meteorological variables and a time-series Poisson regression model was adopted to examine the independent contribution of meteorological variables to HFRS transmission in both Elunchun and Molidawahaner counties in Northeastern China between 1997 and 2007 [20]. Alongside of this, a generalized additive model with penalized smoothing splines has been used to examine the effect of meteorological factors on the occurrence of HFRS in Jiaonan between 2006 and 2011 [21].
Identifying the spatial and temporal distribution of HFRS can help with analyzing and evaluating the trends in HFRS outbreaks, thus leading to the adoption of more effective measures for the prevention and control of the disease. HFRS, however, has a frustrating degree of spatiotemporal heterogeneity and seasonal variation [19]. So, in order to conduct a better analysis of HFRS distribution and to acquire a more accurate means of prediction, the construction of a space-time model seems to be called for. Space-time modeling refers to the process of finding an analytical method to model and predict the value of an unrecorded space-time position based on given spatiotemporal data [22]. Space-time modeling is a spatial expansion of time series modelling and the factors influencing the attribute values of unobserved space-time positions bring together the spatial and temporal factors associated with single time series modeling, single spatial modeling and spatiotemporal modeling.
The most representative single time series model is Autoregressive Integrated Moving Averages (ARIMA). This analyzes the time series of historical data and obtains the model with the optimal fit for predicting events that will occur in the short term [23] [24]. An ARIMA model shows time series data that is related to both sequentially lagged variables and their errors. ARIMA models have been used several times for the prediction of HFRS outbreaks [25,23,[26][27][28], which indicates that this model is a good fit here as well for the forecasting of outbreaks.
For the single spatial modeling, there are space autoregressive models and space moving average models. Based on a spatial weight matrix, these models study the quantization measure of neighboring spatial units [29].
Drawing upon time and space series modeling, the geographer A.D. Cliff and the statistician J.K. Ord, originally proposed in 1974 a space-time series modeling framework [22] that is essentially a spatial expansion of the time series model. It combines Spatial Autocorrelation (SAR), a Spatial Moving Average (SMA) and Spatial Regression (SR). A large number of studies had shown that, whilst the ARIMA model provided better fitting results for data with a relatively stable temporal distribution and no strong spatial autocorrelation, its effectiveness for prediction relating to spatiotemporally heterogenous sample data was much weaker [30][31][32]. Cliff and Ord's Spatiotemporal Autoregressive Integrated Moving Average (STARIMA) extended beyond the ARIMA model [33]. The STARIMA model provides a space-time autocorrelation function (ST-ACF) and a space-time partial correlation function (ST-PACF) to address the problem of measuring spatiotemporal correlations. It also introduced a spatiotemporal lag operator that makes it capable of simultaneously extrapolating and predicting multiple spatial units [34]. The STARIMA model was subsequently proved to offer high estimation performance when applied to a case study of the regional deposits of commercial banks operating in Turkey using non-linear estimators [35]. The STARIMA model has also been applied to rainfall and waterlogging process simulation and to short-term forecasting. Here, it offers improved prediction accuracy and reliability when compared to traditional hydro model simulation and prediction [36]. Outside of this, STARIMA models have been applied to traffic prediction, environment variable prediction and in social and economic analyses [37][38][39][40][41].
Research has indicated that HFRS has a characteristic seasonal or cyclic time series-based occurrence [42,11]. In our previous work, a Seasonal Difference-Geographically and Temporally Weighted Regression (SD-GTWR) model was developed as an extension of the GTWR model that sought to use seasonal difference to get stabilized data [43]. Seasonal difference was used to deal with a non-stationary time series with seasonal distribution characteristics. Following on from this research, we constructed a Seasonal Difference-Space-Time Auto Regressive Integrated Moving Average (SD-STARIMA) model that is based on STARIMA. Time serials analysis and autocorrelation analysis were conducted to ensure the feasibility of using a seasonal difference approach. The STARIMA model is a prerequisite for advanced seasonal difference modeling and analysis. In our previous research, we found that from 1980 to 2000 [17] and from 2005 to 2014 [19] the HFRS cases in Hubei Province displayed a bimodal seasonal distribution pattern rather than a linear distribution. Seasonal difference calculations for HFRS incidence in Hubei using SD-STARIMA offer the prospect of improving the accuracy of previous space-time series models. The main contribution of this paper is the development of a new SD-STARIMA model that is able to bring seasonal difference calculations to bear in a way that will eliminate the non-stationary temporality problem found in HFRS data. Estimation results from the SD-STARIMA model show it to be more accurate than other models such as ARIMA and STARIMA. This confirms its potential to contribute to the prevention and control of HFRS.

Study data
The area focused on in this study is Hubei Province in central-southern China. The data covers the period from 2005 to 2014. In the past 30 years, the data during this decade is the most representative and 2014 is the most recent year for which detailed data is available. Basic geographic data about Hubei Province was collected from the Chinese National Administrator of Surveying, Mapping and Geo-Information. HFRS case data was provided by the Hubei Province Center for Disease Control and Prevention and the Chinese Center for Disease Control and Prevention. The HFRS case data contains the monthly case values for each county. Meteorological data was obtained from the National Center for Environmental Prediction and the Hubei Meteorological Bureau. Human population density data was extracted from the Hubei Statistical Yearbook, which includes the annual population for each county.

Seasonal characteristic analysis
The monthly distribution pattern of HFRS in Hubei Province from 2005 to 2014 is shown in Fig 1. It can be seen that HFRS epidemics appear to have a bimodal distribution for each year (12 months), occurring around March and September. As a result, the time frame for the range of seasonal differences for each year has been narrowed down to 6 months for this study [44].

Stationarity analysis of the HFRS incidence data
To arrive at a more effective time series analysis, it is necessary to identify the spatial and temporal series of the HFRS case data. Figs 2 and 3 show that the HFRS outbreak incidence in Hubei is clustered and does not meet the requirements of a normal distribution. In order to look for significant correlations in the HFRS outbreak distribution across the time series, an autocorrelation of the HFRS incidence time series data was undertaken using an autocorrelation graph. The autocorrelation graph and partial autocorrelation graph are plotted according to the autocorrelation and partial autocorrelation coefficients. In Fig 4(A), the abscissa is the number of lags and the ordinate is the ACF (autocorrelation function) value. The two lines in this figure represent the autocorrelation coefficient confidence interval of 95%. If there is no autocorrelation, the distribution pattern should be randomly distributed within the 95% confidence interval, without any fixed pattern and with the ACF values gradually tending to zero as the lag k increases. However, it can be seen from    there is periodicity in the time series. That being so, the time and space series for the HFRS case data in Hubei does not have a smooth time series.
Thus, according to the seasonal characteristics and stationarity analysis of the HFRS outbreaks presented above, the series for HFRS incidence distribution in Hubei Province is temporally unstable. As previously mentioned, a large number of studies have shown that ARIMA models are better able to fit data with a relatively stable time distribution and no strong spatial autocorrelation, but they are not so effective when there is spatiotemporal heterogeneity in the sample data [30][31][32]. This was the original reason for the development of Cliff and Ord's, Spatiotemporal Autoregressive Integrated Moving Average (STARIMA) model [33]. However, the accuracy of this model is still limited for non-stationary series. In that case, there is a need for a new spatiotemporal series model that is capable of analyzing the seasonal characteristics and stationary distribution of the HFRS outbreaks in Hubei to improve the precision of the predictions.

Construction of a seasonal difference Spatio-temporal autoregressive integrated moving average (SD-STARIMA) Model
By building upon both the ARIMA model and the STARIMA model, the SD-STARIMA model not only inherits the functions of STARIMA, but also has its own particular advantages. In this paper, the ARIMA analysis was conducted using SPSS 22 and the STARIMA analysis was conducted using R package. Construction and analysis of the SD-STARIMA model was conducted using MATLAB.
Principles of the ARIMA model ARIMA models are able to take into account changing trends, periodic changes, and random disturbances in a time series, so they are very useful for modeling a time series' time dependence structure. In epidemiology, ARIMA models have been successfully applied to predict the incidence of a number of infectious diseases, such as influenza [45] and malaria [46], to mention but a few [47,48]. ARIMA (p,d,q) modeling of time series originated with the work of Box-Jenkins [24]. The model-building process was designed to take advantage of associations in the sequentially-lagged relationships that usually exist in periodically collected data [49]. The following were the parameters selected when fitting the ARIMA model: p, the order of autoregression; d, the integration parameter; and q, the order of the moving average. Autocorrelation function (ACF) and Partial autocorrelation function (PACF) graphs were used to identify the order of the moving average (MA) and the autoregressive (AR) terms included in the ARIMA model.  Table 1 indicate the spatial autocorrelation results for Moran's Index I. From this it can be concluded that the distribution of HFRS incidence in Hubei has spatial autocorrelation characteristics, so the trends for HFRS cannot be simulated using just time. Specifically, if you let z(t) be the N×1 vector of observations at time t, the STARIMA model class can be expressed as follows [50]:

Construction of the STARIMA model
where p is the autoregressive order; q is the moving average order;λ k is the spatial order of the k th autoregressive term; m k is the spatial order of the k th moving average term;F kl is the autoregressive parameter at temporal lag kand spatial lag l; θ kl is the moving average parameter at temporal lag k and spatial lag l; W (t) is the N×N matrix of weights for spatial order l; and Ɛ(t) is  the random normally distributed error vector at time t [51].
This specific model is referred to as the STARIMA (p l 1 ;l 2 ;...;l p ; q m 1 ;m 2 ;...m q ) model. Two special subclasses of the STARIMA model are of note. When q = 0, only autoregressive terms remain, in which case the model is called a space-time autoregressive or STAR model. Models that contain no autoregressive terms (p = 0) are referred to as STMA models.

Construction of the SD-STARIMA model
By building upon the ARIMA model, the STARIMA model is able to evaluate the space functions pertaining to ARIMA. In essence, STARIMA is an extended linear regression model, so it can only describe linear autocorrelation results. That being so, STARIMA models are not well-suited to the prediction of the incidence of diseases with a seasonal epidemic pattern. Our above analysis of the time series results for HFRS incidence in Hubei suggests that HFRS incidence does not have a stationary temporal distribution. ARIMA or ARIMA-based models need a stationary distribution of time series data as a prerequisite. In view of this, a seasonal difference method was used to eliminate the disruptive tendencies and get a stationary time series. The seasonal difference method amounts to being a way of getting a new time series by calculating the difference between various circles labeled L: A new series model can be obtained after the d-order difference calculation has finished. This can be formally defined as: As previously mentioned, HFRS has specific spatially-distributed epidemics, with the seasonal epidemic pattern in Hubei being characteristically bimodal. The time frame for the seasonal difference calculation was set to 6 months. So, for the purposes of data stabilization by differential, the interval for each order of difference in the time series should be set to 6 months. The stationary series data used to establish the STARIMA model has three steps: identification; estimation; and diagnostic checking [52]. The novel SD-STARMA model proposed in this paper can be formally expressed as follows (9):

Selection of the order of difference
The Augmented Dickey-Fuller test for unit root in level is conducted, the results are demonstrated in Table 2. It can be conducted in Table 2 that p value for ADF test is 0.07 which indicate that HFRS cases series is non-stationary distributed which p<0.07.
The results of the time series for the HFRS outbreaks data in Hubei Province from 2005 to 2014 using a first-order difference are shown in Fig 6  Fig 7 presents the stationarity analysis results relating to HFRS incidence after using a firstorder difference. The time series fluctuates around the value 0, indicating an overall uniform distribution. The ACF and PACF appear to be tailing off. It can be inferred from Fig 7. that, after taking the first-order difference into account, the time series shown in Fig 6. is a stationary time series. Therefore, for this paper we have chosen to use the first order difference to preprocess the data.

Construction of the SD-STARIMA model and comparison with the ARIMA and STARIMA models
In this section we construct ARIMA, STARIMA and SD-STARIMA models using first-order difference for the time series relating to the HFRS incidence data.
ARIMA model. On the basis of first-order difference, the ARIMA(p,q) model can be defined as: On the basis of the ACF and PACF across different time lag values, p = 4 and q = 2 were selected as the values for this model. The autoregressive coefficient, moving average coefficient and test parameters are shown in Table 3.
STARIMA model. For the STARIMA model, a spatial weight matrix had to be established first of all. First-order spatial neighborhood matrices and second-order spatial domain matrices of 73 � 73 were obtained according to the spatial neighborhood relationship of 73 counties in Hubei Province (there are actually 76 counties, but 73 were used as samples and the other 3 for validation). The core diagonal elements of the first-order adjacency matrix are 0. There are no adjacent spatial units if the non-diagonal elements are 0. 1 indicates that there are adjacent spatial units. The first-and second-order spatial neighborhood matrix can be obtained on the basis of the specific adjacency unit according to the row and column identifying the elements and the line standardization.
The space-time autocorrelation coefficients and space-time partial autocorrelation coefficients are then calculated for the HFRS outbreaks data incidence series (before seasonal difference). The calculated results are shown in Tables 4 and 5.
The ACF values are truncated after time lag 4 and for all of the spatial lags. The PACF values are truncated after time lag 3 and for all of the spatial lags. In that case, a STARIMA (4,3) model can be constructed using the results in Table 6. SD-STARIMA model. Tables 7 and 8 present the calculated values for the space-time autocorrelation coefficient and space-time partial autocorrelation coefficient after applying the first-order difference series to the HFRS outbreaks data.
Looking at the results in Tables 7 and 8, it can be seen that both the AFC and PACF are tailing off. This confirms that this is a STARIMA model. A candidate time autocorrelation average moving model as in STARIMA (1,1) can now be got using the transformation status of the AFC and PACF. The STARIMA (1,1) model can be expressed formally as: zðtÞ ¼ φ 10 zðt À 1Þ þ φ 11 W 1 zðt À 1Þ þ φ 12 W 2 zðt À 1Þ þ εðtÞ À y 10 εðt À 1Þ À y 11 W 1 εðt À 1Þ À y 12 W 2 εðt À 1Þ A maximum likelihood estimate is made for the STARIMA (1,1) model to obtain its parameter estimation values and hypothesis test values. The results are shown in Table 9.

HFRS incidence prediction
Highly representative areas or areas with a high incidence of the disease were used to validate the model. Luotian, Zhongxiang and Yicheng counties were used to undertake a comparison.
The observed values and predicted values for these three counties are presented in Fig 8. It can be seen that the two values are very close, indicating that the prediction results are reliable.  We also evaluated the prediction results and general performance of the ARIMA, STAR-IMA and SD-STARIMA models to assess their relative effectiveness. Table 10 shows the correlation coefficient (R), Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), Average Absolute Error (MAE) and Classic Akaike Information Criterion(AIC) for each of the models. It can be seen from the table that the SD-STARIMA model is more reliable and that the error between its predicted values and actual observed values is smaller. Overall, then, we can conclude as follows: 1. The data relating to HFRS incidence in Hubei Province has a fluctuating distribution curve and is quite different from other statistically sampled data in terms of its space and time distribution features, which are characterized by an obvious seasonal distribution. An SD-STARIMA model was therefore introduced that is able to adjust for seasonal difference and thus fit the data incorporating seasonal distribution trends more effectively.
2. The data for HFRS incidence in Hubei Province has both spatial and temporal characteristics. The SD-STARIMA model has both spatial and temporal features that are thus able to explain and simulate the HFRS tendencies in Hubei, with a spatial-temporal weight matrix being used to quantify the influence from the neighboring counties. We found that the SD-STARMA model has a higher degree of fit as a result of its implementation of timespace autocorrelation than would be the case with time autocorrelation alone.
3. Although the overall trend for HFRS incidence is consistent across every county in Hubei Province, the time series for the different counties is still different because of various impacting factors such as the local environment and human demography. The SD-STAR-IMA model is able to combine not only historical influences, but also the spatial and temporal impact from neighboring counties to evaluate the tendencies for HFRS incidence for any one specific county.
Having arrived at our results, we also compared them, to previous studies relating to HFRS analysis and prediction. Zhang at al., for instance, used a basic Poisson regression method to examine the potential impact of climate variability on the transmission of HFRS [20]. They incorporated climatic variables across a range of lags into a basic Poisson regression model that effectively eliminated the lagged effect of the climatic variables on the number of HFRS cases. However, spatial influences and spatial lag for the HFRS data were not considered, potentially overlooking a significant set of influencing factors.
Li et al. have used a GWR (geographically weighted regression) model to identify the impact of environmental factors and social-economic factors on the spatiotemporal heterogeneity of HFRS in China [42]. In this model, spatial characteristics are taken into account when undertaking the GWR-based analysis. However, this model suffers from the opposite flaw to the one above: temporal correlation is also a key influencing factor for HFRS cases in Hubei Province. Thus, by overlooking the temporal factors, this may similarly undermine the accuracy of the estimated results.

Conclusion
Time series-based approaches have commonly been used in the past to predict the trends in HFRS epidemics, with ARIMA models standing as prime example. As a result of their capacity to capture both spatial and temporal variation, simulation results based on STARIMA models have been found to be more accurate than the results provided by non-spatial models like ARIMA. However, because there are also seasonal characteristics relating to the HFRS epidemics in Hubei Province, we developed a new model named SD-STARIMA that is able to incorporate adjustments for seasonal differences into space-time series analysis of HFRS outbreaks. We compared the estimates produced by ARIMA, STARIMA and SD-STARIMA for HFRS incidence data for Hubei Province and found that the SD-STARIMA model more closely predicted observed trends.
In conclusion, our examination of various possible models in this paper demonstrated the importance of analyzing seasonal differences in relation to HFRS epidemics because of the disease's seasonal characteristics. On top of this, we found that first-order differences most closely reflect the stability data and bimodal distribution characteristics of the disease. We then constructed a first-order difference based SD-STARIMA model that is able to make accurate predictions using both space-time autocorrelation coefficients and space-time partial autocorrelation coefficients. To validate the proposed approach, we used data relating to three counties that have a higher incidence of HFRS in Hubei Province (Luotian, Zhongxiang and Yicheng). According to the results, the SD-STARIMA model is more accurate than the ARIMA and STARIMA models and is generally much better for counties that are consistent with overall distribution trends. In that case, the SD-STARIMA model proposed in this paper has been proven to be more reliable for predicting HFRS epidemics in Hubei Province and has the potential to be more widely used for the prediction of epidemics.