Forecasting seasonal influenza-like illness in South Korea after 2 and 30 weeks using Google Trends and influenza data from Argentina

We aimed to identify variables for forecasting seasonal and short-term targets for influenza-like illness (ILI) in South Korea, and other input variables through weekly time-series of the variables. We also aimed to suggest prediction models for ILI activity using a seasonal autoregressive integrated moving average, including exogenous variables (SARIMAX) models. We collected ILI, FluNet surveillance data, Google Trends (GT), weather, and air-pollution data from 2010 to 2019, applying cross-correlation analysis to identify the time lag between the two respective time-series. The relationship between ILI in South Korea and the input variables were evaluated with Linear regression models. To validate selected input variables, the autoregressive moving average, including exogenous variables (ARMAX) models were used to forecast seasonal ILI after 2 and 30 weeks with a three-year window for the training set used in the fixed rolling window analysis. Moreover, a final SARIMAX model was constructed. Influenza A virus activity peaks in South Korea were roughly divided between the 51st and the 7th week, while those of influenza B were divided between the 3rd and 14th week. GT showed the highest correlation coefficient with forecasts from a week ahead, and seasonal influenza outbreak patterns in Argentina showed a high correlation with those 30 weeks ahead in South Korea. The prediction models after 2 and 30 weeks using ARMAX models had R2 values of 0.789 and 0.621, respectively, indicating that reference models using only the previous seasonal ILI could be improved. The currently eligible input variables selected by the cross-correlation analysis helped propose short-term and long-term predictions for ILI in Korea. Our findings indicate that influenza surveillance in Argentina can help predict seasonal ILI patterns after 30 weeks in South Korea, and these can help the Korea Centers for Disease Control and Prevention determine vaccine strategies for the next ILI season.

Introduction Seasonal influenza forecasts can provide data-driven information that supports influenza prevention and mitigation strategies [1]. Various statistical and machine learning methods have been used to predict influenza patterns using variables related to influenza surveillance, such as internet search query data [2]. The burden of 2013-2014 seasonal influenza in Korea was estimated at 125 million USD, higher than the burden observed in the past [3]. However, there have been few studies on influenza prediction in South Korea; current literature on the subject focuses on short-term forecasts using Google Trends (GT) or data from social network services [4,5]. Moreover, forecasting seasonal influenza in South Korea has proven difficult, due to the two peaks of influenza activity observed in a season.
Forecast targets for influenza should be chosen with quantitative and meaningful definitions that reflect public health needs [1]. The FluSight series of influenza forecasting challenges in the United States has forecast targets that include both seasonal (onset, peak week, and peak intensity) and short-term targets (forecasts up to 4 weeks), which are selected by the Centers for Disease Control and Prevention in the United States to understand the characteristics of seasonal influenza [6]. However, there are no such official influenza forecasting challenges and systems for forecasting seasonal targets in South Korea. Previously, we forecasted seasonal influenza patterns after 26 weeks in the United States using influenza activity in Australia and Chile, where the seasonal patterns and influenza outbreaks were similar to but preceded those observed in the United States [7].
Time series data have internal structures such as autocorrelation and seasonal variation, which can be used to forecast future values of the data using analyzed patterns in it [8]. The autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) models have been selected to analyze time series data in the clinical domain. Zhang et al. presented the prediction models for seasonal influenza using Google Trends and temperature [9]. They used cross-correlation analysis for the selection of input variables and seasonal ARIMA (SARIMA) models for prediction [9]. However, Zhang's model could predict relatively short-term lags than our previous model for the United States using surveillance data in other countries. Basile et al. presented ARMA models for short-term prediction of influenzalike illness (ILI) [10], and the ARIMA analysis was employed to forecast malaria incidence in Afghanistan with time-series patterns [11].
In this study, we aimed to identify variables that aid in forecasting seasonal and short-term targets for ILI in South Korea. We were able to achieve this using FluNet surveillance data, Google Trends, and weather and air pollution data. We also successfully suggested prediction models for ILI activity for seasonal targets using SARIMA, including exogenous variables (SARIMAX) models.

Data collection
ILI data were collected from the Korea Centers for Disease Control and Prevention (KCDC) [12]. ILI is defined by the KCDC as the quantitative number of people per week with a fever of 38˚C, and a cough or a sore throat per 1,000 outpatients. Influenza surveillance data were collected from the FluNet database of the WHO Global Influenza Surveillance Network [13][14][15]. These data are uploaded to the FluNet database every week by the countries in the network [13]. The FluNet database contains the following variables reported by 160 countries: influenza transmission zone, number of specimens, number of influenza A and B viruses detected by subtype, and number of influenza-positive viruses [15]. We collected surveillance data from the 160 countries from the 40 th week of 2010 until the 52 nd week of 2019. The starting point for the study period was 2010, due to a novel pandemic strain (H1N1 pdm09) in 2009 [16]. Missing data were replaced with a zero. Total influenza (INF) was defined as the sum of the number of influenza A and B viruses detected among processed specimens [7].
GT demonstrates the people's interest in near real-time using Google search engine queries [17]. Further, GT provides information on the volume of searches by country. We included the search keywords "A hyeong dokgam" and "B hyeong dokgam", which are Korean words for influenza A virus and influenza B virus, respectively, in South Korea from October 2010 to December 2019, with reference to Woo et al [5]. We defined the total GT as the sum of query data for "A hyeong dokgam" and "B hyeong dokgam". We obtained weekly search query data from GT. To reduce noise, the GT values that did not exceed zero for more than two consecutive weeks were replaced with zeros.
Weather data in South Korea were obtained from the National Weather Data Release Portal [18]. We included weekly temperature data and the average values for Seoul in South Korea during each week [19]. The air pollution data for South Korea were obtained from the Seoul Information Communication Plaza [20]. We included weekly air pollution data and the average values for Seoul in South Korea during each week [21].

Statistical analysis
For the variables included in the two time-series, cross-correlations were analyzed using Pearson's correlation, with a time lag range of ±30 weeks from the 40 th week of 2010 until the 52 nd week of 2019, with Bonferroni's correction. Cross-correlation allows for the time lag between two time-series to be identified [22]. If the higher cross-correlation value was found to have a negative lag, the values of the first series (ILI in South Korea) were correlated with the values of the second series (other variables), and the second series was made to precede the first in lag weeks [7]. The ILI in South Korea was compared to the input variables, and we selected variables with a time lag of -20 weeks or less and a correlation coefficient of 0.6 or more for seasonal forecast targets.
Linear regression analyses (LR) were used to evaluate the relationship between the ILI in South Korea and ILI in selected variables by cross-correlation analysis with a time lag from the 40 th week of 2010 to the 52 nd week of 2019. LR 1 used the ILI in South Korea after the time lag as the dependent variable and previous seasonal data from South Korea as the independent variable. Two univariate LR 2 models used the ILI in South Korea as the dependent variable; the total GT in South Korea for LR 2 of the forecast after 2 weeks and influenza activity in Argentina for LR2 of the forecast after 30 weeks were designated as the independent variables, respectively. Univariate LR 3 used the ILI in South Korea as the dependent variable and the average vapor pressure for the forecast after 30 weeks as the independent variable. The input variables in LR 4 were selected by cross-correlation analysis.
For time-series modeling, the ILI data were categorized into three terms: the trend, seasonal, and resid attributes [23]. Dickey-Fuller Test and seasonal Mann-Kendall test were performed to verify the stationary and seasonal trend term of the ILI time series, respectively [24]. The periodical terms were investigated using an autocorrelation function (ACF) and partial ACF (PACF) diagrams of ILI data. All statistical analyses were performed using Python 3.6.2 (Python Software Foundation), and p-values < 0.05 were considered statistically significant.

Prediction model
Our prediction models included forecasting the ILI in South Korea after 2 weeks and 30 weeks, separately. ILI forecasts after 2 weeks were defined as hindcasts and nowcasts; the former is the forecast of past conditions due to delays in reporting and data accrual, while the latter is the forecast of the current point [1]. The input variable that functioned as a reference in the ARMAX models after 2 and 30 weeks was the previous seasonal ILI. The input variables of the prediction model for ILI after 2 weeks in South Korea were previous seasonal ILI and total GT. The input variables of a prediction model for ILI after 30 weeks in South Korea were previous seasonal ILI, average vapor pressure, and total INF in Argentina.
The autocorrelation function and partial autocorrelation function were used to determine the autoregressive (AR) and moving average (MA) order. Reference 25 contains a complete description of the ARMA analysis [25]. An ARMA model includes parameters such as p of the AR order and q of the MA order [26]. After validation of the input variables, we found the final model for forecasting ILI after 30 weeks with the best parameters using the "arma_or-der_select_ic" from the "statsmodels" package in Python 3.6.2. ARMAX, ARIMAX, and SARI-MAX were adopted to select the final model. A SARIMA model includes parameters such as p of the AR order, d of the differencing, q of the MA order, and 52 weeks for seasonality. The coefficient of determination, R 2 , which corresponds to the percentage of the variance of the observed time-series that is explained by the model, was calculated. Root-mean-square error (RMSE) was calculated using real and predicted values for ILI activity in the validation set from the 41 st week of 2014 to the 52 nd week of 2019.

Validation for the prediction model
The final model for forecasting ILI was validated using the Akaike Information Criterion (AIC) index. The smaller AIC values corresponded to a better fitting [28]. Ljung-Box test was used to examine the independence of residuals, and Jarque-Bera test was utilized to examine whether the residual of the model followed a normal distribution [29]. Moreover, standardized residual, histogram plus estimated density, normal Q-Q, and ACF of the residual were drawn using the "plot_diagnostics" from the "statsmodels" package in Python 3.6.2. The requirement for ethical approval of this study was waived as we used open-source data that can be downloaded online without login.  Table 2 shows the adjusted R-squared values, calculated by LR analyses using ILI in South Korea as dependent variables, and the input variables as independent variables, which were previous seasonal ILI in South Korea, total GT, total INF in Argentina, and average vapor pressure. In LR 1, for the forecasts after 2 weeks and 30 weeks using previous seasonal ILI, the adjusted R-squared was 0.331, which was used as a reference point. The total GT for the forecast after 2 weeks was reported to be 0.744 in LR 2 for forecast after 2 weeks. In LR 2 for the forecast after 30 weeks, the total INF in Argentina shows 0.513, which is higher than the reference point of 0.331. LR 4 reported adjusted R-squared values of 0.794 and 0.712 for the input variables for the forecasts after 2 and 30 weeks, respectively.

Prediction models
The parameters (p,q) of the ARMAX models for prediction after 2 and 30 weeks were selected 1 of p and 0 of q because we want to validate the input variables without bias of difference in the parameters. Moreover, the fixed rolling window analysis performed 575 models of ARMAX because of two models for the forecast values after 2 and 30 weeks per each day. We selected a simple parameter for ARMAX rather than searching the optimal parameters. The   exogenous variables for ARMAX were the selected input variables by cross-correlation and linear regression analyses. Table 3 shows the performance of the prediction models for seasonal ILI in South Korea after 2 and 30 weeks using ARMAX. The AIC values in Table 3 are the mean values of the ARMA models for the fixed rolling window analysis. The AIC values of the ARMA models for forecasting after 2 and 30 weeks with selected exogenous variables were 849.1 and 696.0, respectively, which were lower than those of the reference models. The R 2 score of the prediction models after 2 weeks using the total GT was 0.789, higher than the reference value of 0.623. The R 2 score of the prediction models after 30 weeks, using average vapor pressure and  To construct the final model for forecasting ILI after 2 and 30 weeks at the 50 th week of 2019, the training set was the ILI data from the 50 th week of 2016 to the 50 th week of 2019. Fig  5 shows the seasonal components and coefficients of correlation of the training set for the final model, which are the trend, seasonal, resid attributes, ACF, and PACF. The results of the Dickey-Fuller test of the training set were -3.6 for test statistic and 0.006 for P-value; so, the ILI data was stationary time series, and we did not use the Box-Cox transformation for normalization. The test statistic value and P-value of the seasonal Mann-Kendall test were -0.871 and 0.383; there was no trend in the ILI data during the last three years. The automatically selected parameter (p, q) by AIC was 4 of p and 1 of q. Among ARMAX (p, q), ARIMAX (p, d, q), and SARIMAX (p, d, q) (p, d, q) seasonal , we selected a final model with the smallest AIC value, which was SARIMAX (4,1,1) (1,0,0) 52 model in Table 4.

Discussion
The current study aimed to identify variables for forecasting seasonal and short-term targets for ILI in South Korea and suggest prediction models for forecasting after 2 and 30 weeks using the variables. Among the input variables from various domains, the total GT showed the highest correlation coefficient with a week ahead. Moreover, the high correlation of seasonal influenza outbreak patterns in Argentina with those of South Korea when considering the 30-week forecast indicated that Argentina's seasonal influenza patterns 30 weeks prior were highly correlated with the current seasonal ILI patterns in South Korea. These variables were more useful for forecasting seasonal and short-term targets than previous seasonal ILI in South Korea. The R 2 values reported in the prediction models after 2 and 30 weeks using ARMAX indicated that the reference models could be improved using only the previous seasonal ILI.
In this study, total INF in Argentina showed a high correlation with ILI after 30 weeks in South Korea. In our previous study, total INF in Australia and Chile showed a high correlation with ILI after 22 and 28 weeks, respectively, in the United States [7]. Countries in the southern hemisphere showed different influenza patterns, and we identified those with seasonal patterns and influenza outbreaks similar to but preceding those of the reference country [7]. In Table 1, the absolute values of the time lag for Argentina and Chile in the Southern Hemisphere are higher than those for Ireland and China in the Northern Hemisphere. Therefore, the time lag for total INF in Argentina using the cross-correlation analysis was related to the latitude difference between Argentina and South Korea. Moreover, the correlation coefficient of the crosscorrelation analysis for Argentina (0.717) in Table 1 was higher than those for China (0.699) and Japan (0.517), which are neighboring countries of South Korea. Although this study did not prove the causality of the correlations between countries for seasonal influenza, the seasonal influenza patterns between Argentina and South Korea were similar, indicating that influenza surveillance in Argentina can be used to predict seasonal ILI patterns after 30 weeks in South Korea. The total GT had the highest correlation coefficient for ILI in South Korea and proved to be a powerful variable for estimating peak timing for the short-term targets, but not peak amplitude. For example, Google Flu Trends overestimated the prevalence of influenza in the 2011-2012 and 2012-2013 seasons by more than 50% in the United States [30]. During the peak flu season in 2014-2015, Google Flu Trends reported that 11% of the United States had influenza, nearly double the actual 6% reported by the CDC [30]. When flu-related events occur and are reported in the news or media, people without flu symptoms may search for flu-related keywords, which can lead to bias due to the increased volume of searches [31]. However, among variables that can be obtained in real-time, it is difficult to find a variable with a high correlation with seasonal flu patterns. Therefore, it is necessary to supplement the GT data with keyword combinations and machine learning. The variables for air pollution were not selected to forecast ILI in South Korea. Liu et al. demonstrated that SO 2 was positively associated with laboratory-confirmed influenza, which indicated that the number of confirmed influenza cases increased when the air concentration of SO 2 was high [32]. The average vapor pressure has a higher correlation coefficient with ILI than those of temperature and relative humidity in South Korea. Bai et al. reported that vapor  pressure was significantly associated with ILI in China; however, its correlation coefficient was lower than the correlation coefficient for temperature [33]. The real ILI values after the 5 th week of 2020 were lower than the forecast result of the final model in Fig 7. When patients with Coronavirus Disease-19 (COVID-19) were reported in South Korea, the Korean government recommended wearing a mask, maintaining social distance, and closing schools. These policies could have lowered influenza outbreaks as well as COVID-19.
The current study had several limitations. There were insufficient data on other potential covariates, such as the standard of the medical facilities, economic levels in South Korea and Argentina, and medical records related to the influenza virus. Further research to explain the underlying mechanisms of the relationship of influenza activity between these countries is warranted.

Conclusions
This study identified input variables suitable for short-term and long-term predictions of ILI in South Korea by cross-correlation analysis. Although total GT had a negative time lag, it is eligible as an input variable for the two-week forecast, since the KCDC releases the ILI value of the week before the present time. Short-term predictions performed better than long-term predictions but were not suitable for predicting peak timing and amplitude. Therefore, we suggest the prediction model be used after 30 weeks in South Korea, using influenza surveillance and average vapor pressure data from Argentina. Improved predictions for seasonal ILI after 2 and 30 weeks could help the KCDC determine vaccine strategies for the next season of ILI.