A machine learning method to monitor China’s AIDS epidemics with data from Baidu trends

Background AIDS is a worrying public health issue in China and lacks timely and effective surveillance. With the diffusion and adoption of the Internet, the ‘big data’ aggregated from Internet search engines, which contain users’ information on the concern or reality of their health status, provide a new opportunity for AIDS surveillance. This paper uses search engine data to monitor and forecast AIDS in China. Methods A machine learning method, artificial neural networks (ANNs), is used to forecast AIDS incidences and deaths. Search trend data related to AIDS from the largest Chinese search engine, Baidu.com, are collected and selected as the input variables of ANNs, and officially reported actual AIDS incidences and deaths are used as the output variable. Three criteria, the mean absolute percentage error, the root mean squared percentage error, and the index of agreement, are used to test the forecasting performance of the ANN method. Results Based on the monthly time series data from January 2011 to June 2017, this article finds that, under the three criteria, the ANN method can lead to satisfactory forecasting of AIDS incidences and deaths, regardless of the change in the number of search queries. Conclusions Despite the inability to self-detect HIV/AIDS through online searching, Internet-based data should be adopted as a timely, cost-effective complement to a traditional AIDS surveillance system.


Introduction
inaccessible in some rural and western regions and noisy irrelevant information, web search trend data have been explored as a low-cost, nearly real-time approach to monitoring disease activity. Recently, Internet search query data have been shown to be a promising surveillance method for specific and sensitive monitoring of infectious diseases, harnessing data that are easily processed, aggregated, and visualized in near real time [12]. This paper aims to forecast and monitor AIDS deaths and incidences with a machine learning method, artificial neural networks (ANNs), in which monthly official reports on AIDS incidences and deaths are used as the output variable and online search query data are used as inputs. Admittedly, the volume of Baidu searches for HIV- and AIDS-related terms may not reflect the HIV incidence or prevalence rates at the population level. People may search for HIV- and AIDS-related terms for many reasons other than personal health-seeking behaviours, including: student assignments; health information seeking by medical staff, researchers, friends, and relatives of PLHIV (people living with HIV); and searches by the general public in response to media reporting about HIV/AIDS. However, at the population level, researchers studying HIV and other communicable diseases have consistently identified positive spatial and temporal correlations between specific disease-related searches and laboratory-confirmed disease incidence and prevalence [8,13,14]. Our findings show that search trend surveillance can be a novel, cost-effective complement to traditional disease surveillance systems and enable DCDs or the public to achieve accurate and timely AIDS monitoring.

Data sources
AIDS data: We use two indicators to measure the actual AIDS outcome: monthly AIDS incidences and AIDS deaths. The data are collected from the monthly report of the National Health and Family Planning Commission of China (http://www.moh.gov.cn/), ranging from January 2011 to June 2017. The data are aggregate data, open to the public, and thus display no personal information.
Search trend data: The Internet and online search engines have been broadly used in China. A report issued by the China Internet Network Information Centre indicated that there are 566 million Chinese search engine users, and the Internet usage rate reached 82.3% at the end of 2015 [15]. We select the Baidu search trend, the weighted sum of the search frequency of each keyword, as the main data source, since Baidu.com (https://www.baidu.com/) is the largest Chinese search engine, holding the largest market share in China with a market penetration rate of 93.1%, followed by Google with a market penetration rate of only 18.0%. We obtain all the relevant search query volumes from the Baidu Index (http://index.baidu.com/), which is publicly released by Baidu.com daily. The volumes from both PC users and mobile users are integrated to make full use of the information available. Since the official AIDS data are reported monthly, we aggregate the daily search volumes into monthly ones.
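The aggregation step described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `aggregate_monthly`, the record layout, and the sample values are hypothetical, and the use of a simple sum (both for combining PC and mobile volumes and for rolling days up into months) is an assumption, since the text does not state the exact aggregation rule.

```python
from collections import defaultdict
from datetime import date


def aggregate_monthly(daily_records):
    """Sum daily PC and mobile search volumes into monthly totals.

    daily_records: iterable of (day, pc_volume, mobile_volume) tuples.
    Returns a dict keyed by (year, month), sorted chronologically.
    Summation is an assumption; the source only says volumes are aggregated.
    """
    monthly = defaultdict(int)
    for day, pc_volume, mobile_volume in daily_records:
        monthly[(day.year, day.month)] += pc_volume + mobile_volume
    return dict(sorted(monthly.items()))


# Hypothetical daily index values for one query (not real Baidu Index data).
daily = [
    (date(2011, 1, 30), 70, 50),
    (date(2011, 1, 31), 50, 30),
    (date(2011, 2, 1), 60, 40),
]
monthly = aggregate_monthly(daily)
print(monthly)
```

Keying the result by (year, month) keeps the search series aligned with the monthly official reports it is paired with.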

Search queries
Since thousands of keywords are potentially related to AIDS, we perform the following three procedures to choose the search queries and mine the Baidu search trend data: determine the query base, extract and rectify the data, and filter the queries.
Query base: Previous studies have generally selected the names or clinical symptoms of target diseases as their core keywords but have not mined relevant queries further [16,17]. Since there are notable differences in users' search habits, preferences, and concerns, this practice of choosing relevant search queries may be arbitrary and omit some crucial search queries. Therefore, we first choose AIDS as the basic search query (in Chinese) and then use three keyword-mining tools, Webmaster tools (http://tool.chinaz.com/), Qucha (http://www.7c.com/keyword/), and Aizhan (http://ci.aizhan.com/), to obtain the related queries. As a result, we obtain a query base converging to 201 search queries. After entering these keywords into the Baidu Index and dropping some queries with insufficient search volumes, we finally acquire a search query library (SQL) with 195 search queries regarding AIDS/HIV. These related search queries recommended by mining tools cover almost all aspects of AIDS, including causes, symptoms, treatment, and prevention, and thus contain more comprehensive information on AIDS and AIDS-related activities.
Data extraction and rectification: We extract candidate data series from the Baidu Index (https://index.baidu.com/) based on the above-defined search keywords.
Query filtering: More search queries do not necessarily assure better-fitting results, and the marginal contribution of adding terms to a 'saturated' model is limited but costly [13]. Here we filter search queries with the Pearson correlation coefficient (PCC) between the AIDS incidences (or AIDS deaths) and the monthly search volumes of each search query. However, if too large a PCC threshold is chosen, some valuable queries will be removed, which may reduce the forecasting performance of the machine learning method that we use. In our case, if the PCC threshold is set at 0.65, only two search queries are left, too few to measure searching behaviours regarding HIV/AIDS. For comparison, a study forecasting influenza with search queries used 0.4 as the PCC threshold to filter the queries [18].
Therefore, we first adopt 0.5 as the PCC threshold value to select search queries. If the PCC between a search query's volume and AIDS incidences or deaths is larger than 0.5, this query is included in the list of predictors. As a result, 14 queries are retained to forecast AIDS incidences and 31 queries to forecast AIDS deaths, which are listed in S1 and S2 Tables, respectively. As an example, S1 Fig also shows that the aggregate monthly search volume of 'Regulation on the Prevention and Treatment of AIDS' and the official AIDS statistics display similar contemporaneous trends over time. To gauge the sensitivity of the forecasting performance, the PCC threshold value is then increased in steps of 0.05 up to 0.65, leaving fewer search queries but a higher PCC of these queries with AIDS incidences and deaths.
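The filtering rule can be illustrated with a short sketch. The helper names `pearson` and `filter_queries`, the query labels, and all numbers below are hypothetical; only the rule itself (keep a query when its PCC with the official series is larger than the threshold, with thresholds from 0.5 to 0.65 in steps of 0.05) comes from the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def filter_queries(query_volumes, target, threshold=0.5):
    """Keep queries whose PCC with the target series exceeds the threshold."""
    return [
        name
        for name, volumes in query_volumes.items()
        if pearson(volumes, target) > threshold
    ]


# Made-up monthly series for illustration only.
target = [10, 12, 15, 20, 26]  # e.g. official monthly counts
query_volumes = {
    "query_a": [1, 2, 3, 4, 5],  # rises with the target
    "query_b": [5, 1, 4, 2, 3],  # unrelated to the target
}

# Sensitivity check over the thresholds mentioned in the text.
for threshold in (0.50, 0.55, 0.60, 0.65):
    print(threshold, filter_queries(query_volumes, target, threshold))
```

Running the same filter separately against incidences and against deaths would give the two predictor lists.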

Method
A machine learning method, artificial neural networks (ANNs), is used to forecast AIDS incidences and deaths in the next one to six months, for which the input variables are the search queries from the Baidu trend data and the output variables are the officially reported actual AIDS incidences and deaths. ANNs are branches of artificial intelligence developed in the 1950s with the aim of imitating the biological brain architecture, which consists of many simple neural cells [19]. ANNs are parallel-distributed systems made of many interconnected nonlinear processing elements (PEs) called neurons [20]. The model of an artificial neuron is shown in Fig 2. Neuron $j$ sums the input signals $x_i$, weighted by connection weights $w_{ij}$ from neighbouring neurons. These weighted signals constitute the neuron's net input $net_j$. Then, the activation threshold $\vartheta_j$ is added to the net input. After the activation function $f(\cdot)$ is applied to the result, the output value $y_j$ can finally be computed and sent to other neurons. A multilayer perceptron (MLP) model is one of the most widely used nonlinear, nonparametric ANNs. MLP networks consist of an input layer, one or more hidden layers, and an output layer. The architecture of an MLP model is shown in Fig 3. For more details about ANNs and the MLP, readers can refer to the book written by Tadeusiewicz [21].
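The neuron model above can be written out in a few lines. This is a sketch under assumptions: the text does not name the activation function, so `tanh` for hidden neurons and an identity output node are illustrative choices, and the function names and weights are made up.

```python
import math


def neuron_output(inputs, weights, threshold, activation=math.tanh):
    """Output y_j of one neuron: f(sum_i w_ij * x_i + theta_j)."""
    net_input = sum(w * x for w, x in zip(weights, inputs))
    return activation(net_input + threshold)


def mlp_forward(inputs, hidden_layer, output_neuron):
    """Forward pass of a one-hidden-layer MLP with one linear output node.

    hidden_layer: list of (weights, threshold) pairs, one per hidden neuron.
    output_neuron: a single (weights, threshold) pair.
    """
    hidden = [neuron_output(inputs, w, t) for w, t in hidden_layer]
    out_weights, out_threshold = output_neuron
    return neuron_output(hidden, out_weights, out_threshold,
                         activation=lambda v: v)


# Made-up weights for a tiny (2:2:1) network, for illustration only.
hidden_layer = [([0.5, -0.25], 0.0), ([1.0, 1.0], -3.0)]
output_neuron = ([2.0, 0.0], 1.0)
y = mlp_forward([1.0, 2.0], hidden_layer, output_neuron)
print(y)
```

Training would adjust the weights and thresholds; this sketch only shows how a fixed network maps inputs to an output.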
The relationship between Internet query searching and actual HIV/AIDS epidemics may be influenced by many factors, such as the regional Internet penetration rate, the health care-seeking behaviour of Internet users, or the media's influence. In other words, the relationship may be nonlinear, and a linear regression model may not fit it well. Since no specific assumption about this relationship needs to be made, an MLP model of ANNs is used in this paper. In terms of hidden layers and nodes, a one-hidden-layer back-propagation (BP) network is used, and the number of hidden nodes is determined by trial and error [22]. Moreover, the Levenberg-Marquardt backpropagation method is adopted to update the weight and bias states in each iteration, and the MATLAB Neural Networks Toolbox is used to run the MLP model. To conduct the learning process of neural networks, the MLP model randomly divides the data pairs into three subsets (the training set, the validation set, and the test set), accounting for 80%, 10%, and 10% of the observations, respectively. To accelerate convergence, all the inputs and outputs are normalized by $x_{\text{new}} = 2\,\frac{x - x_{\min}}{x_{\max} - x_{\min}} - 1$, where $x$ denotes either any input or the output.
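The normalization and the random 80%/10%/10% split can be sketched as below. This is not the toolbox's own routine: the function names, the fixed seed, and the rounding of subset sizes are assumptions made for illustration; 78 is the number of monthly observations from January 2011 to June 2017.

```python
import random


def normalize(values):
    """Rescale a series to [-1, 1] via 2 * (x - x_min) / (x_max - x_min) - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize a constant series")
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]


def split_indices(n, train=0.8, val=0.1, seed=0):
    """Randomly split range(n) into training, validation, and test indices.

    Subset sizes are rounded; the test set receives whatever remains.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = round(n * train)
    n_val = round(n * val)
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])


scaled = normalize([10, 15, 20])
train_idx, val_idx, test_idx = split_indices(78)
print(scaled, len(train_idx), len(val_idx), len(test_idx))
```

With 78 observations this rounding yields 62 training, 8 validation, and 8 test pairs, so the test set is small, which is one reason the full-set results are also worth checking.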

Forecasting performance
To evaluate the forecasting performance of ANNs with AIDS search trend data, we employ three criteria: the mean absolute percentage error (MAPE), the root mean squared percentage error (RMSPE), and the index of agreement (IA). In their definitions, $n$ is the total period, $y_i$ is the actual value, and $\hat{y}_i$ denotes the predicted value from the MLP model of ANNs. The MAPE and RMSPE measure the difference between the real AIDS indicators and their predicted values estimated from the MLP model, and in an ideal case they take the value of zero. Thus, the smaller the value is, the better the forecasting performance is. Meanwhile, the IA is a dimensionless index within the range of 0-1, and IA = 1 indicates perfect agreement between predictions and observations, that is, perfect forecasting. In practice, extant studies have broadly used 0.05 as the threshold value of the MAPE and RMSPE and 0.6 as the threshold value of the IA for a qualified forecasting performance [23][24][25][26].
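For reference, the three criteria can be computed as in the sketch below. It assumes their standard textbook definitions, expressed as fractions rather than percentages so that they are comparable with the 0.05 threshold, and Willmott's form of the index of agreement; whether these match the paper's own equations exactly is an assumption. The sample numbers are made up.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error: (1/n) * sum(|y - y_hat| / |y|)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmspe(actual, predicted):
    """Root mean squared percentage error: sqrt((1/n) * sum(((y - y_hat) / y)^2))."""
    squared = sum(((a - p) / a) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def index_of_agreement(actual, predicted):
    """Willmott's index of agreement.

    IA = 1 - sum((y - y_hat)^2) / sum((|y_hat - y_bar| + |y - y_bar|)^2),
    where y_bar is the mean of the actual values.
    """
    mean_actual = sum(actual) / len(actual)
    numerator = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    denominator = sum(
        (abs(p - mean_actual) + abs(a - mean_actual)) ** 2
        for a, p in zip(actual, predicted)
    )
    return 1 - numerator / denominator


# Made-up actual and predicted values for illustration only.
actual = [100, 200, 300]
predicted = [110, 190, 300]
print(mape(actual, predicted), rmspe(actual, predicted),
      index_of_agreement(actual, predicted))
```

Because the MAPE and RMSPE divide by the actual value, they are undefined for months in which the actual count is zero.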

Results
The forecasting performance of the MLP model of ANNs for AIDS is calculated on the basis of the test set, which accounts for 10% of all the observations, and the full set, that is, all the observations. In this section, the 80% training set is used to train the networks, and the 10% test set and the full set are used to evaluate the forecasting performance of the trained networks. The topology structure (4:4:1), that is, four nodes in the input layer, four neurons in the hidden layer, and one node in the output layer, produces the best forecasting of AIDS incidences in the test set, and the topology structure (18:18:1) leads to the best forecasting of AIDS deaths in the test set. It should also be noted that the forecasting performance is robust to the change in PCC threshold values and is very good, since both the MAPE and the RMSPE are smaller than 0.05 and the IA is larger than 0.68, all representing a highly accurate prediction [23][24][25].
After training the ANNs with the training set to realize the best forecasting of AIDS in the test set, we further use the optimal networks to predict AIDS in the full set to determine whether the forecasting performance remains unchanged. Table 2 reports the full-set forecasting performance. The results are reassuring. When the PCC threshold value is set at 0.6 and 0.5, the well-trained MLP model leads to the best forecasting of AIDS incidences and deaths, respectively, according to the MAPE and the RMSPE for AIDS incidence forecasting and the MAPE and the IA for AIDS death forecasting. [...] Table 1 as accurate forecasting. This best-fitting performance can also be shown by graphing two-way scatters of the AIDS statistics and their predicted values from the MLP model, in which the predicted values come closest to the actual values of AIDS incidences when the PCC threshold value is 0.6 and to those of AIDS deaths when that threshold value is 0.5 (see S2 Fig).
Meanwhile, we also report the optimal prediction horizons of AIDS incidences and deaths resulting from the MLP model of the ANN method. First, the optimal net is obtained based on [...] Table. We find that, under the MAPE and IA criteria, the MLP model best forecasts AIDS incidences two months ahead, but, under the RMSPE criterion, it best forecasts AIDS incidences five months ahead. In terms of AIDS deaths, both the MAPE and the RMSPE indicate that the model produces the best forecasting one month ahead, and the IA indicates the best forecasting three months ahead. However, despite these differences across the three criteria, the MLP model performs quite well in forecasting both AIDS incidences and AIDS deaths one to four months ahead, since all three evaluation criteria have acceptable values: smaller than 0.05 for the MAPE and RMSPE and larger than 0.7 for the IA. Better forecasting performance at shorter horizons can also be observed in S3 Fig.

Discussion
The Internet provides almost limitless real-time, cheap data on consumers' searching behaviours, while ANNs provide solid novel methods for accurate forecasting. However, our study has several limitations. First, the Baidu search trend data regarding AIDS might lack representativeness, since they are restricted to users of the Baidu search engine. Internet access is uneven across China, particularly between rural and urban areas. Search query data represent only part of AIDS victims' searching behaviours and thus are just a complementary but important method for monitoring real-time AIDS incidences and deaths. On the other hand, due to social discrimination against AIDS victims, HIV-infected people are usually unwilling to report their disease, or they resort to informal treatments. Besides, social ignorance regarding HIV and AIDS means that some HIV carriers fail to realize that they are infected. As a result, the incidences and deaths may be under-reported in the official statistics. Second, we did not compare the forecasting performance between the ANN method and traditional time series models, such as the autoregressive and moving average, vector autoregressive, and vector error correction models, since the data contain only 78 observations and the variables are not necessarily stationary, which may reduce the credibility of the estimations produced by these time series models. However, a recent study has suggested that the ANN method can be a promising alternative to traditional linear methods [27]. An ordinary least squares regression also shows that the predicted values from the optimal MLP model can significantly explain actual AIDS incidences and deaths with a goodness of fit of 0.681 and 0.754, respectively (see S2(i)-(C) and S2(ii)-(A) Fig).
Third, some search behaviours may be generated by news reports, special events, or a 'celebrity effect' rather than actual disease activity. For example, the search volumes of some search queries, such as 'AIDS workers are advised to rest' and 'NBA famous player Johnson is infected with AIDS', are due to news reports that an AIDS victim had been fired by his employer and that a famous basketball player named Johnson had been infected with HIV, respectively. However, these queries generally have a low correlation coefficient with actual AIDS incidences and deaths and were removed from the list of inputs of the MLP model. Another concern is that the sharp increase in the search trend of AIDS queries on particular days, say, 1 December, that is, World AIDS Day, does not reflect the actual AIDS incidences or deaths but rather indicates people's attention to AIDS. However, it is apparent from Fig 4 that actual AIDS incidences and deaths also increase around this day. The synchronous increases may occur because AIDS awareness activities generate more related searching behaviours; less discrimination against and greater caring for AIDS victims happen around the day as a result of the information campaign; and more knowledge about AIDS makes AIDS victims more willing to disclose their disease.
Finally, because of the lack of provincial or city-level data, we only conducted a study at the national level. China is a large country with diversified geographies, cultures, ethnicities, and population distribution. Thus, our results may have limited policy implications for how local governments can use search trend data to monitor local AIDS incidences and deaths. Moreover, we only use monthly data, since weekly and daily data are unavailable. We acknowledge that a study with weekly or daily data would likely lead to more accurate forecasting of AIDS because such data are closer to real time and observed at a higher frequency. As an exploratory study, our framework using Baidu search trend data to forecast AIDS incidences and deaths can easily be extended to the provincial or city level.
However, despite the limitations of online search data, the ANN method, as an information-processing system inspired by biological neural networks [28], does not require any prior assumptions regarding the relationship between the input variables and an outcome variable. It can process various fuzzy, nonlinear, noisy data through neuron simulation, memory, and association, and it performs its analysis through self-adapting pattern recognition [29]. Moreover, three fundamental features of ANNs, that is, parallel processing, distributed memory, and adaptability, enable them to tolerate errors and noise and thereby provide robust forecasting. Thus, the ANN method can make full use of the information behind the search trends to realize satisfactory forecasting of AIDS incidences and deaths. Of course, to extend our study, it is necessary to collect data in different regions or with a longer time horizon and then compare the ANN method with traditional time series models.

Conclusion
ANNs perform well in forecasting China's monthly AIDS incidences and deaths with Baidu trend data. Three criteria of forecasting performance, the MAPE, the RMSPE, and the IA, all indicate that the MLP model of ANNs can result in accurate forecasting of concurrent AIDS incidences and deaths with Baidu search trend data. Moreover, the forecasting performance is relatively robust to a change in the number of search queries and forecasting horizons.
Our results have strong policy implications for both AIDS surveillance and the forecasting of other diseases. As people increasingly engage in their work and lives online, their traces on the Internet can be used to monitor their behaviours. Although HIV/AIDS can only be detected by professional laboratory instruments, potentially infected victims' searches through search engines with keywords relating to AIDS reveal the future risk of an AIDS/HIV outbreak. Thus, AIDS/HIV surveillance based on online search queries is becoming increasingly urgent. It costs less and is more immediate than traditional surveillance based on hospital data, which are lagged by days, weeks, or even longer [30]. In our case, forecasting AIDS one to four months ahead of time could enable DCDs or the public to prioritize the detection and prevention of the spread of AIDS/HIV. As people increasingly access the Internet, search queries will become more representative and can be used to monitor other emerging and re-emerging diseases in other countries and regions with high AIDS prevalence, given that data on actual AIDS are available.