Fig 1.
The whole working flow of our study.
As shown in this figure, we first prepare two datasets for model training and testing: Baidu Index data, which serves as the independent variable of the regression model, and ILI data, which serves as the dependent variable. Second, through data preprocessing, we divide the two datasets into a training set and a test set while preserving their correspondence in the time series; the Baidu Index data are also standardized to improve training. Third, all models under comparison, including our proposed attention-based LSTM model, are trained on the same training set and tested on the same test set, so that we obtain test results for every model under each evaluation index. Fourth, we compare these test results across the evaluation indices to analyze whether our proposed model indeed has better fitting and prediction capabilities. Finally, if the attention-based LSTM model proves superior in this evaluation, we output and visualize its prediction results on the test set, including the correlation between the predicted and actual values, and then discuss the findings and draw conclusions.
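The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the study's actual code: the function and variable names are ours, and the 57-week test window is taken from the test-set description later in the text. Note that the standardization statistics are computed from the training period only, so the chronological correspondence is preserved and no test-period information leaks into training.

```python
import numpy as np

def chronological_split(X, y, test_weeks=57):
    """Split features (Baidu Index data, X) and target (ILI data, y)
    into train/test sets, preserving their time-series alignment:
    the last `test_weeks` weeks form the test set."""
    X_train, X_test = X[:-test_weeks], X[-test_weeks:]
    y_train, y_test = y[:-test_weeks], y[-test_weeks:]
    # Standardize the Baidu Index features using training-set
    # statistics only, to avoid leaking test-period information.
    mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
    X_train = (X_train - mu) / sigma
    X_test = (X_test - mu) / sigma
    return X_train, X_test, y_train, y_test
```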
Table 1.
Search terms primarily selected in this study.
Fig 2.
The structure diagram of our proposed attention-based LSTM model.
The main structure of our proposed attention-based LSTM model consists of Input, NN Cell, Attention Cell, Softmax, LSTM Cell, and Output modules. Corresponding to the mathematical expressions above, the input sequence X(t) is first weighted and summed with the weighted Hidden and State outputs of the LSTM, and then passes through the NN Cell. Second, the data flow into the Attention Cell to generate the attention-weighted data and then into the Softmax layer. Third, the LSTM Cell operates on the input data and sends its Hidden and State outputs in two directions: one back to the preceding NN layer, to be combined with the next element of the input sequence, and the other forward to the following NN layer, to compute the model output Y(t). To obtain a better training result, an optional step can concatenate the previous output Y(t-1) with the LSTM Cell output in the last NN layer to produce the latest output Y(t).
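One step of the attention path described above can be sketched schematically in NumPy. This is an illustrative approximation under our own assumptions, not the study's exact implementation: the weight-matrix names (W_x, W_h, V) and the tanh activation in the NN Cell are ours, and the LSTM Cell itself is omitted.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_step(x_t, h_prev, W_x, W_h, V):
    """One attention step: the input x_t and the previous LSTM hidden
    state h_prev are weighted and summed, passed through the NN Cell,
    scored by the Attention Cell, and normalized by the Softmax layer
    to yield attention weights over the input features."""
    nn_out = np.tanh(x_t @ W_x + h_prev @ W_h)  # NN Cell
    scores = nn_out @ V                         # Attention Cell
    alpha = softmax(scores)                     # Softmax layer
    return alpha * x_t                          # weighted input for the LSTM Cell
```

The attention-weighted input is then fed to the LSTM Cell, whose Hidden and State outputs feed both the next attention step and the output layer.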
Table 2.
The training models and their abbreviations.
Table 3.
Test results of the training models.
Fig 3.
Fitting results of the attention-based LSTM model.
The abscissa of this line chart is the time span in weeks, covering the last 315 weeks of the training set, from the 49th week of 2014 (December 1st to December 7th) to the 50th week of 2020 (December 7th to December 13th); the ordinate is the numerical value of ILI. In this chart, the real mean ILI values over the time span are shown as the blue line, and the ILI values returned by the ATLSTM model when fitting the training set over the same span are shown as the brown line. The two lines agree closely, which shows that the characteristics of the real ILI data are well captured by the ATLSTM trained on the Baidu Index data.
Fig 4.
Prediction results of the attention-based LSTM model.
The abscissa of this line chart is the time span in weeks, covering the last 350 weeks of the whole dataset, from the 52nd week of 2015 (December 21st to December 27th) to the 52nd week of 2021 (December 27th, 2021 to January 2nd, 2022); the ordinate is the numerical value of ILI. Since we use the last 57 weeks of the whole dataset as the test set, we draw a dotted line, labeled the prediction line, to separate the training period from the test period, namely the 49th week of 2020 (November 30th to December 6th) to the 52nd week of 2021. As in Fig 3, the real mean ILI values over the time span are shown as the blue line, and the ILI values returned by the ATLSTM model, fitting the training set and predicting on the test set over the same span, are shown as the brown line. The two lines again agree closely, which shows that the characteristics of the real ILI data are well retained by the ATLSTM trained on the Baidu Index data, allowing it to return a good prediction.
Fig 5.
Correlation of the predicted values of the attention-based LSTM model and the actual values.
All data presented in this figure are time-series data over the same interval, namely the testing interval from the 49th week of 2020 (November 30th to December 6th) to the 52nd week of 2021. The abscissa of the scatter plot in the main figure is the actual ILI values, and the ordinate is the values predicted by our proposed ATLSTM model. The regression line fitted to the scatter points is also shown in the main figure, with the shaded band around it representing the confidence interval. The two auxiliary plots on the top and on the right are the distributions of the actual and predicted values, respectively. Complementing the reported R-square of 0.752 for our proposed ATLSTM model, this figure also shows a high degree of correlation between the predicted and actual values, indicating that the ATLSTM model returns a good prediction.