Forecasting type-specific seasonal influenza after 26 weeks in the United States using influenza activities in other countries

To identify countries that have seasonal patterns similar to the time series of influenza surveillance data in the United States and other countries, and to forecast the 2018–2019 seasonal influenza outbreak in the U.S., we collected the surveillance data of 164 countries using the FluNet database, search queries from Google Trends, and temperature from 2010 to 2018. Data for influenza-like illness (ILI) in the U.S. were collected from the Fluview database. We identified the time lag between two time-series which were weekly surveillances for ILI, total influenza (Total INF), influenza A (INF A), and influenza B (INF B) viruses between two countries using cross-correlation analysis. In order to forecast ILI, Total INF, INF A, and INF B of next season (after 26 weeks) in the U.S., we developed prediction models using linear regression, auto regressive integrated moving average, and an artificial neural network (ANN). As a result of cross-correlation analysis between the countries located in northern and southern hemisphere, the seasonal influenza patterns in Australia and Chile showed a high correlation with those of the U.S. 22 weeks and 28 weeks earlier, respectively. The R2 score of ANN models for ILI for validation set in 2015–2019 was 0.758 despite how hard it is to forecast 26 weeks ahead. Our prediction models forecast that the ILI for the U.S. in 2018–2019 may be later and less severe than those in 2017–2018, judging from the influenza activity for Australia and Chile in 2018. It allows to estimate peak timing, peak intensity, and type-specific influenza activities for next season at 40th week. The correlation between seasonal influenza patterns in the U.S., Australia, and Chile could be used to forecast the next seasonal influenza pattern, which can help to determine influenza vaccine strategy approximately six months ahead in the U.S.


Introduction
Seasonal influenza viruses are a significant public-health problem that causes a great many deaths worldwide every year [1]. The annual recurrence of seasonal epidemics is attributed to the continued evolution of seasonal influenza viruses, which enables them to escape the PLOS

Data collection
Data for ILI in the U.S. were collected from the Fluview database of the Centers for Disease Control and Prevention in the U.S. The agency's website (https://gis.cdc.gov/grasp/fluview/ fluportaldashboard.html) provides both new and historical data. The CDC's ILI is freely distributed and available through ILInet [17]. From this website, we can obtain the CDC's dataset on weighted ILI (%). Influenza surveillance data were collected from the FluNet database of the WHO Global Influenza Surveillance Network (URL: http://apps.who.int/flumart/Default?ReportNo=12) [3,18]. These data are uploaded to the FluNet database every week by the countries in the network [3]. The FluNet database contains the following variables, reported by 164 countries: influenza transmission zone, number of specimens, number of INF A and INF B detected by subtype, and number of influenza-positive viruses. We collected the surveillance data of the 164 countries from the 40 th week in 2010 to the 40 th week in 2018. We set the starting point for our research data in 2010, because influenza season during 2009 was an atypical season, with the introduction of a novel pandemic strain (INF A H1N1 pdm09) [19]. Missing data were replaced by a zero because some countries conducted surveillance only during weeks with high influenza activity. Total INF, INF A, and INF B were defined as the number of influenza subtypes detected among processed specimens.
Google Trends (GT) (https://trends.google.com/trends/) is the principal tool used to study the trends and patterns of Google search engine queries [20]. Google Trends provides search activity for each country and specific time periods. We included the search topics, "influenza", "influenza A virus", and "influenza B virus" in the U.S., Australia, and Chile from October 2010 to October 2018. The search queries data from Google Trends were linearly interpolated from monthly data to weekly data points.
The daily temperature data in the U.S. was obtained from the National Oceanic and Atmospheric Administration (ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/by_year/). We included weekly temperature data and the average values for all stations in the U.S. during a week. The overall flow chart for this study is presented in Fig 1.

Statistical analysis
In two time periods, cross-correlations were analyzed using Pearson's correlation, with a time lag range of ± 30 weeks, with Bonferroni's correction. Cross-correlation allows the time lag between two time series to be identified [21]. If the blue waveform of the reference country correlates with the green waveform of country A with a time lag of -2 weeks, the peak or onset of the reference country can be identified as occurring 2 weeks later than that of country A (Fig 2). Time-series for ILI, Total INF, INF A, and INF B in the U.S. as reference were analyzed  with the Total INF, Total INF, INF A, and INF B, respectively, in the comparison country using cross-correlation. For example, ILI in the U.S. was compared to Total INF in each comparison country. Moreover, cross-correlation analysis was also performed on the GT and temperature data. We selected countries with a time lag between -20 and -30 weeks and a correlation coefficient of 0.7 or more in Table 1.
Linear regression analysis (LR) was used to evaluate the relationship between influenza surveillances in the U.S. after 26 weeks and those in selected countries, and GT and temperature from the 40 th week in 2010 to the 40 th week in 2018. LR 1 used influenza data from the U.S. after 26 weeks as dependent variable and previous seasonal data from the U.S. as independent variable. LR 2 used influenza activities from selected countries, and LR 3 used GT with the keyword "influenza A virus" in selected countries. LR 4 used temperature in the U.S. The input variables in LR 5 were those of LR models with the adjusted R 2 values higher than the LR 1, and these input variables were used to forecast seasonal influenza after 26 weeks in the U.S.   All statistical analyses were performed using Python 3.6.2 (Python Software Foundation), and p < 0.05 was considered statistically significant.

Prediction model
Forecasting seasonal influenza after 26 weeks was defined as forecasting influenza pattern after six months (26 weeks) with data available only at the current point (Fig 3). Four seasons for 2015-2019 were selected to validate the prediction models for seasonal influenza patterns. This is done because the training set accounted for 50% of eight seasons in the whole dataset for 2010-2018. For example, the prediction model for the seasonal influenza pattern during 2015-2016 used the influenza surveillance data from the 40 th week in 2010 to the 40 th week in 2015 as the training set (Fig 4).
The prediction model is a single-output model for seasonal influenza patterns after 26 weeks using the four historical values that indicate a month (4 weeks): we predict Y t+26 based on X t , X t-1 , X t-2 , and X t-3 : Y, X, and t are the predicted value, input variable, and week, respectively [22].
In order to forecast ILI, Total INF, INF A, and INF B of next season (after 26 weeks) in the U.S., we developed prediction models using an artificial neural network (ANN), LR, and ARIMA including exogenous variables (ARIMAX). ANN is an artificial intelligence technology inspired by the architecture of biological neurons, such as that in the human brain [23]. The input layer receives an input signal, which moves to the next layer as a modified version of the input signal. It passes through several layers composed of multiple transformations, and last passes through the output layer as an output signal [24,25]. We implemented ANN using the python library Keras (version 2.2.0) with TensorFlow (version 1.8.0) backend. The scikitlearn library (https://scikit-learn.org/stable/) was used for data management and preprocessing. In this study, we used a three-layer ANN network with a 10% dropout rate for the first Forecasting type-specific influenza after 26 weeks in the U.S. layer for the overfitting problem. The models were optimized using the Adam optimizer with a loss function of mean squared error. Neuron activation functions were rectified linear units for the second layer. We selected 100 epochs and one batch size for the ANN model.
Prediction models of LR were calculated to forecast influenza surveillance after 26 weeks in the U.S. per each year. The influenza time series is characterized by an autocorrelation, so we also employed the ARIMAX. The autocorrelation function and partial autocorrelation function is used to determine the autoregressive (AR) and moving average (MR) order. An ARIMA model is notated as ARIMA (p, d, q), where p indicates the AR order, d the differencing order and q the MA order [26]. For a complete description of the ARIMA analysis, see [27]. Prediction models of LR and ARIMAX were used the same input variables for ANN in S1 and S2 Tables, respectively.

Validation for the prediction model
The coefficient of determination, R 2 , which corresponds to the percentage of the variance of the observed time-series that is explained by the model, was calculated. Root-mean-square error (RMSE) was calculated using real values and predicted values for influenza activities in the validation set from the 41 th week in 2015 to 14 th week in 2019. The onset of influenza weeks for ILI is defined as the weighted percentage that exceeds the national baseline [28]. The peak amplitude and the peak timing are defined as the maximum value and that week in seasonal influenza week.  Table 1 shows the maximum correlation coefficient and time lag between the seasonal influenza outbreaks in the U.S. and the input variables. The correlation coefficients were calculated by cross-correlation between ILI in the U.S. and the number of positive influenza viruses worldwide. In Table 1 Although the correlation coefficients for GT with the keywords "influenza", "influenza A virus", and "influenza B virus in the U.S. showed high values, time lags for those ranged from -4 to 2 weeks, which were not eligible variables to forecast influenza after 26 weeks. However, Forecasting type-specific influenza after 26 weeks in the U.S.

Cross-correlation analysis
GT with the keyword "influenza A virus" in Australia and Chile showed high correlation coefficients with -23 and -30 week time lags, respectively. The correlation coefficient for the ILI of temperature in the U.S. was more than 0.7 with a -27 week time lag.  Forecasting type-specific influenza after 26 weeks in the U.S. Fig 5A and 5B, showing the INF A and INF B in the U.S. and Australia, shows that the values for Australia were shifted forward 22 weeks from the 40 th week in 2010 to the 40 th week in 2018. Interestingly, the waveforms for the U.S. and Australia are similar in peak timing and amplitude. Fig 5C, showing the INF A in the U.S. and Chile, shows that the values for Chile was shifted forward 28 weeks, the waveforms also similar. Therefore, influenza surveillance data in Australia and Chile could be valuable for predicting an influenza outbreak after 26 weeks in the U.S.  Table 3 shows the performance of the prediction models for seasonal influenza outbreaks in the U.S. after 26 weeks using previous season data, LR, ARIMAX, and ANN. The R 2 scores of LR, ARIMAX, and ANN for ILI, Total INF, INF A, and INF B showed better performance than those of the previous season. The R 2 score for the prediction model of ANN for ILI was 0.758. The R 2 score of ARIMAX for Total INF was 0.806, which was the highest.

Discussion
The aim of this study was to identify countries with seasonal patterns and influenza outbreaks that were similar to but preceded those of the United States. We used influenza activities in Australia and Chile, GT with the keyword "influenza A virus" in Australia and Chile, and temperature in the U.S. to forecast for seasonal influenza after 26 weeks in the U.S. The seasonal influenza patterns in Australia before 22 weeks and Chile before 28 weeks showed a high correlation with those of the U.S. In Table 2, influenza surveillance and GT in Australia and Chile at present were more useful than previous seasonal influenza in the U.S. for forecasting next seasonal influenza in the U.S. ANN models showed better performance for forecasting influenza activities after 26 weeks in the U.S than previous season data. The LR and ARIMAX models also showed high performance which can be interpretable. Our prediction models forecast that the ILI for the U.S. in 2018-2019 may be later and less severe than that in 2017-2018.
The seasonal influenza patterns in the U.S. were highly correlated with those in Canada, Australia, Chile, and the United Kingdom. The correlation coefficients for these countries were higher than those for their neighboring countries or for other countries in the Northern Hemisphere, for example, Mexico (0.412 for Total INF), Russia (0.329), Cuba (0.121), Spain (0.775), and Japan (0.600), which are not shown in Table 1. Moreover, the period of the primary influenza peak in the Southern Hemisphere countries was July and August [13], so the absolute values of the time lag for the U.S. in the Southern Hemisphere countries were more than 20 weeks in Table 1. However, the correlation coefficients for Argentina, Australia, Chile, and Uruguay with similar latitudes were different, at 0.516, 0.892, 0.778, and 0.303 for Total INF in the U.S., respectively, which are not shown in Table 1. For these reasons, the correlations for seasonal influenza between countries could be caused by characteristics of influenza, Economic Status, Educational Status, or Access to Health Care as well as by Season environmental factors in the Northern Hemisphere [3,29]. However, this study did not prove the  H1N1 and B viruses coincided with slower rates of antigenic evolution, lower ages of infection, and smaller, less-frequent epidemics than for A/H3N2 viruses [30]. Their study analyzed the correlation of peak timing for seasonal influenza between countries, and gave a time-series graph for influenza surveillance from 2000 to 2012 in the U.S. and Australia [30]. The timeseries of virological characterizations for A/H3N2 in the U.S. and Australia were consistent with the time-series graph in our Fig 5A. Bedford et al. suggested that differences in ages of infection could explain patterns of global circulation across a variety of human viruses [30].
Viboud et al. analyzed correlations for influenza epidemics from 1972 to 1997 in the U.S., France, and Australia [31]. In their study, France and the U.S. had a high correlation for influenza epidemics, but there was no significant correlation between the U.S. and Australia. In the scenario in which the influenza season in Australia was systematically six months in advance of that in the Northern Hemisphere, the median time lag between the peaks in Australia and in the United States was 27 weeks (range 14-39) [31]. Our study analyzed correlations for INF A and INF B, but the study of Viboud et al. did not [31].
The previous studies for prediction models for seasonal influenza have focused on social networking service data, search engine query data, and environmental factors [32][33][34]. These predictors are correlated with present influenza cases with a relatively short-term gap, of about one to four weeks. However, the influenza surveillance data in Australia had a time gap of 22 weeks from those in the U.S., which can help to establish a data-driven influenza vaccine strategy about six months ahead. Forecasting type-specific influenza after 26 weeks in the U.S.
Kandula et al. analyzed whether forecasts targeted to predict influenza by type and subtype during 2003-2015 in the U.S. were more or less accurate than forecasts targeted to predict Total INF using four compartmental models [35]. They found that forecasts separated by type/ subtype generally produced more accurate predictions and, when summed, produced more accurate predictions of Total INF [35]. Our prediction models for type-specific influenza is valuable as well as those for ILI, which could provide an important, richer picture of influenza activity.
To our best knowledge, this is the first study to use influenza surveillance in Australia and Chile for predicting influenza cases after 26 weeks in the U.S. However, our study has several limitations. For the ILI study, we could not use ILI data in other countries, instead had to use data for Total INF, because collecting ILI data separately for the 164 countries was difficult. There were not enough data on other potential covariates to show relationships between influenza outbreaks in Australia, Chile, and the U.S., such as the standard of the medical facilities, economic level, and medical records of the influenza virus. Furthermore, we included only data from laboratory-confirmed cases, which may underestimate the true incidence of influenza in the population [36]. Further research to explain underlying mechanisms for the relationship of influenza activities between these countries is warranted.

Conclusions
Our study forecasts the 2018-2019 seasonal influenza after 26 weeks in the U.S. using the 2018 seasonal influenza in Australia and Chile. The correlation between the seasonal influenza patterns in the U.S., Australia, and Chile could be used to forecast the next seasonal influenza pattern, which can help to determine influenza vaccine strategy approximately six months ahead in the U.S. Our prediction model allows to estimate peak timing, peak intensity, and type-specific influenza activities for next season at 40 th week.
Supporting information S1