
Incorporating connectivity among Internet search data for enhanced influenza-like illness tracking

  • Shaoyang Ning,

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Supervision, Validation, Visualization, Writing – original draft, Writing – review & editing

    sn9@williams.edu

    Affiliation Department of Mathematics and Statistics, Williams College, Williamstown, MA, United States of America

  • Ahmed Hussain,

    Roles Data curation, Investigation, Methodology, Software, Validation, Visualization, Writing – review & editing

    Affiliation Department of Mathematics and Statistics, Williams College, Williamstown, MA, United States of America

  • Qing Wang

    Roles Conceptualization, Formal analysis, Investigation, Methodology, Writing – original draft, Writing – review & editing

    Affiliation Department of Mathematics, Wellesley College, Wellesley, MA, United States of America

Abstract

Big data collected from the Internet possess great potential to reveal the ever-changing trends in society. In particular, accurate infectious disease tracking with Internet data has grown in popularity, providing invaluable information for public health decision makers and the general public. However, much of the complex connectivity among the Internet search data is not effectively addressed among existing disease tracking frameworks. To this end, we propose ARGO-C (Augmented Regression with Clustered GOogle data), an integrative, statistically principled approach that incorporates the clustering structure of Internet search data to enhance the accuracy and interpretability of disease tracking. Focusing on multi-resolution %ILI (influenza-like illness) tracking, we demonstrate the improved performance and robustness of ARGO-C over benchmark methods at various geographical resolutions. We also highlight the adaptability of ARGO-C to track various diseases in addition to influenza, and to track other social or economic trends.

Introduction

Big data collected from the Internet, recording billions of people’s digital footprints, possess great potential to reveal the ever-changing trends in society. A growing number of attempts have been made to harness the potential of Internet data to address issues in a wide range of fields, including public health [1–11], economics [12–23], business [24–26], finance [17, 27, 28], social policy [29], and popular culture trends [30], among others. In particular, digital disease detection, which utilizes big data from online sources to provide accurate and up-to-date tracking of infectious diseases, has grown in popularity, especially since the 2009 H1N1 influenza pandemic and the 2020 global pandemic of COVID-19 [31–36]. However, despite existing efforts to enhance infectious disease tracking with Internet search data, many challenges and limitations remain. In particular, existing digital disease tracking frameworks lack a statistically rooted, integrative approach that effectively accounts for the connectivity within the Internet search data. The purpose of our paper is to pioneer a statistical learning method, ARGO-C (Augmented Regression with Clustered GOogle data). ARGO-C takes advantage of the interconnections among the Internet search data and aims to improve the accuracy and interpretability of the disease tracking framework. Our method focuses on influenza (flu) tracking, but has the generality and flexibility to be adapted to tracking other infectious diseases or other social/economic trends.

Influenza (flu) epidemics occur every year with varying timing and intensity. They may cause up to 650,000 deaths per year worldwide [37] and, on average, result in 610,660 life-years lost in the US [38]. Our ability to prepare for and respond to epidemics or pandemics depends on the timely tracking and forecasting of infectious disease activities [39–41]. Traditionally, the tracking and surveillance of flu activities in the US rely mainly on the US Outpatient Influenza-like Illness Surveillance Network (ILINet) run by the Centers for Disease Control and Prevention (CDC). Each week, ILINet collects outpatient information from thousands of healthcare providers across the nation and reports the percentage of influenza-like illness patients (%ILI). However, due to the time incurred by data collection, aggregation, and administrative processing, the CDC’s weekly flu report usually lags behind real time by 1–2 weeks, which is far from optimal for tracking a fast-spreading, ever-changing disease epidemic such as flu.

In order to eliminate the time lag between the CDC’s flu report and real-time events, digital disease detection [42], a new disease tracking framework based on Internet data, was proposed and has since revolutionized the landscape of flu tracking. In particular, methods for digital flu detection employ statistical or mechanistic models to harness Internet-derived data from various sources [1, 2, 6, 43–55] to provide current estimation or future prediction of flu activity (usually in terms of %ILI). Current estimation is also referred to as “nowcasting”, in contrast to forecasting (i.e., predicting the future). Among these approaches, Google Flu Trends (GFT) [2], which uses the volume of selected Google search terms to estimate current influenza-like illness (ILI) activity, has attracted the most attention. However, GFT’s significant prediction errors in subsequent flu seasons, as well as its lack of transparency and reproducibility, have drawn much criticism. This has also inspired a growing literature [31, 56–60] in digital disease detection, which aims to identify what Google had done wrong and to improve from there.

In particular, the ARGO framework (AutoRegression with GOogle search data) [61] provides robust and highly accurate ILI estimates at the national level by directly addressing the limitations of GFT. Through a linear model design that is justified by a hidden Markov model, the ARGO framework effectively integrates multi-source information from the CDC’s flu reports and Google’s search volume data while accounting for dynamics in flu epidemics and people’s search patterns. Due to its flexibility and generality, ARGO has been well-adapted to multi-resolution, multi-disease tracking based on multi-source data [36, 52, 62–67].

Nevertheless, among the existing methods for digital disease tracking, few have directly addressed the complex connectivity observed within the Internet data. Particularly, many of the Internet search terms included in ARGO share semantic similarities, such as phrases like “treat flu”, “how to treat the flu”, and “treat the flu”, and consequently, their search volumes may be closely related. Such connection among Internet data has never been explicitly investigated in the existing ARGO-derived frameworks. Our goal here is to propose ARGO-C, a general framework that incorporates underlying interconnections among the Internet data and improves flu tracking’s accuracy and interpretability. Our contribution is significant in that (i) ARGO-C provides an innovative statistical learning framework that explicitly models the connectivity among Internet search data and utilizes the information to improve the accuracy of disease tracking; (ii) it enhances the interpretability of the predictive model by revealing the clustering structure of search terms and each cluster’s contribution to the model; (iii) it provides a general framework that is readily adaptable to tracking other diseases and/or social/economic trends with Internet search data.

Materials and methods

The ARGO model

The ARGO model targets the time series of the logit-transformed ILI activity level $Y_t$ (i.e., the logit-transformed %ILI) at the national level from the CDC’s flu report at week $t$. It assumes a Markovian structure over a period $M$ of flu activities $Y_{(t-M+1):t}$ (note that $\{(t-M+1):t\} = \{t-M+1, t-M+2, \ldots, t-1, t\}$, the set of all integer indices in between), which takes the form of an autoregressive model. Meanwhile, $X_t = (X_{1t}, X_{2t}, \ldots, X_{pt})$, a vector of $p$ log-transformed search volumes of flu-related queries from Google, depends solely on the flu activities at time $t$. Jointly, the structure of the search volumes $X_t$ and recent flu activities $Y_{(t-M+1):t}$ can be summarized by a hidden Markov model, as presented in Eq (1).

$$
\begin{array}{ccccccccc}
\cdots & \rightarrow & Y_{(t-M):(t-1)} & \rightarrow & Y_{(t-M+1):t} & \rightarrow & Y_{(t-M+2):(t+1)} & \rightarrow & \cdots \\
 & & \downarrow & & \downarrow & & \downarrow & & \\
 & & X_{t-1} & & X_{t} & & X_{t+1} & &
\end{array}
\tag{1}
$$

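With further assumptions on stationarity, normality of the observations, and linear dependence of the search volumes $X_t$ on the flu activities $Y_{(t-M+1):t}$, the predictive distribution $Y_t \mid Y_{(t-M+1):(t-1)}, X_t$ is Normal with a mean linear in $Y_{(t-M+1):(t-1)}$ and $X_t$ and a constant variance. This leads to a linear predictive model for ARGO:

$$
Y_t = \mu_y + \sum_{j=1}^{M} \alpha_j Y_{t-j} + \sum_{i=1}^{p} \beta_i X_{it} + \epsilon_t, \tag{2}
$$

where the random errors $\epsilon_t$ are i.i.d., with mean 0 and constant variance $\sigma^2$. Given the large number of predictors in model (2), ARGO adopts an L1-regularized regression [68] to achieve an adaptive selection of predictors. To further account for the dynamic changes in search patterns and flu epidemics, ARGO employs a rolling-window prediction scheme, with a sliding training set of $N = 104$ weeks. Therefore, at a given week $T$, the coefficients $\mu_y$, $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$, and $\alpha = (\alpha_1, \alpha_2, \ldots, \alpha_M)$ are estimated as follows:

$$
(\hat{\mu}_y, \hat{\alpha}, \hat{\beta}) = \arg\min_{\mu_y, \alpha, \beta} \sum_{t=T-N}^{T-1} \Big( Y_t - \mu_y - \sum_{j=1}^{M} \alpha_j Y_{t-j} - \sum_{i=1}^{p} \beta_i X_{it} \Big)^2 + \lambda \big( \|\alpha\|_1 + \|\beta\|_1 \big), \tag{3}
$$

where $\|\cdot\|_1$ represents the L1 norm, and $\lambda$ ($\lambda \geq 0$) is the tuning parameter for penalization.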
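As a hedged illustration of the rolling-window scheme described above, the sketch below assembles the training rows for one prediction week: each row holds the lagged (logit-transformed) %ILI values followed by that week's log search volumes. The function names, the 0-based week indexing, and the toy values of N and M in the usage note are assumptions made for illustration; the L1-regularized fit itself is not shown.

```python
import math


def logit(p):
    """Logit transform of a proportion p in (0, 1)."""
    return math.log(p / (1.0 - p))


def build_training_window(y, x, T, N, M):
    """Assemble predictors and targets for training weeks T-N, ..., T-1.

    y: list where y[t] is the logit-transformed %ILI of week t
    x: list where x[t] is the list of p log search volumes of week t
    T: index of the week to be nowcast (y[T] is not used)
    N: training window length in weeks
    M: number of autoregressive lags
    Each row is [Y_{t-1}, ..., Y_{t-M}, X_{1t}, ..., X_{pt}].
    """
    start = max(T - N, M)  # every training week needs M earlier values
    rows, targets = [], []
    for t in range(start, T):
        lags = [y[t - j] for j in range(1, M + 1)]
        rows.append(lags + list(x[t]))
        targets.append(y[t])
    return rows, targets
```

For example, with ten weeks of toy data, `build_training_window(y, x, T=9, N=4, M=2)` returns four rows (weeks 5 to 8), each with two lags and the search volumes of that week. Sliding T forward by one week and refitting reproduces the rolling-window idea.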

The ARGO2 and ARGOX models for localized flu tracking

ARGO2 [63] generalizes the national ARGO model to localized, regional flu tracking (for the US Health and Human Services (HHS) regions). It operates in two steps:

  • Step One extracts Internet search information. It applies the ARGO model to each region individually (with the autoregressive terms left out) to obtain a preliminary raw estimate of each region’s %ILI for the week.
  • Step Two integrates multi-source, multi-resolution information to boost the regional %ILI prediction. Specifically, the best linear predictor based on a structured covariance matrix is used to provide joint %ILI prediction for all 10 regions, which incorporates Google search information from Step One’s raw estimates, the national flu baseline level estimated by the ARGO model, and the recent flu time series trends from the latest CDC’s flu reports.

ARGOX [65] further extends the ARGO framework to the state level, thereby establishing a coherent multi-resolution framework for digital flu tracking. Specifically, ARGOX first dichotomizes all the states into two groups, i.e., the epidemically connected and disconnected, and then customizes different prediction models in the Step Two algorithm accordingly.

Penalization methods with group-wise sparsity

As discussed in the previous sections, due to the high dimension of the Google search terms, a penalization technique is integrated into the ARGO, ARGO2, and ARGOX algorithms. In this paper, we are interested in adapting penalization methods with group-wise regularization to these frameworks.

In linear and generalized linear regressions, penalization methods are popular practical tools that aim at reducing the variability of parameter estimates obtained by conventional methods, such as the ordinary least squares (OLS) method or the maximum likelihood (ML) algorithm, or at decreasing the dimensionality of a given feature space. Over the past several decades, a number of penalization techniques have been developed. For instance, Hoerl and Kennard [69] proposed the ridge estimator, which shrinks the OLS estimator towards zero so as to alleviate its large sampling variation. Later, the Lasso and elastic net algorithms were proposed [68, 70–73], both of which can realize simultaneous variable selection and parameter estimation. Each of these methods is defined by a different penalty function. However, they all impose regularization on individual model parameters and intend to drive each estimated parameter towards zero.
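To make the contrast between shrinkage and selection concrete, here is a minimal sketch for a single coefficient, assuming the simplified one-dimensional objectives stated in the docstrings (these scalar forms and the function names are illustrative assumptions, not taken from the cited papers). The ridge-type penalty only scales the estimate, whereas the L1 penalty can set it exactly to zero.

```python
def ridge_shrink(z, lam):
    """Minimizer over b of 0.5 * (z - b)**2 + 0.5 * lam * b**2."""
    return z / (1.0 + lam)


def soft_threshold(z, lam):
    """Minimizer over b of 0.5 * (z - b)**2 + lam * abs(b)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0  # small effects are removed entirely
```

With `z = 0.3` and `lam = 0.5`, `ridge_shrink` returns a smaller but nonzero value, while `soft_threshold` returns exactly zero, which is the mechanism behind simultaneous variable selection and estimation.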

Motivated by multi-factor analysis of variance (ANOVA), Yuan and Lin [74] proposed a group lasso method that is designed to select groups of indicator variables associated with the same factor. They combine the dummy indicator variables defined for a given factor together, and then select a subset of important factors through a penalty term imposed on the corresponding grouped model parameters. More specifically, suppose there are $p$ predictor variables which can be partitioned into $K$ groups with group sizes $p_k$ ($k = 1, \ldots, K$; $\sum_{k=1}^{K} p_k = p$). In the context of linear regression, the group lasso solution for the regression coefficients $\beta$ can be expressed as

$$
\hat{\beta} = \arg\min_{\beta} \sum_{i} \Big( Y_i - \sum_{k=1}^{K} X_i^{(k)\top} \beta^{(k)} \Big)^2 + \lambda \sum_{k=1}^{K} \sqrt{p_k}\, \|\beta^{(k)}\|_2,
$$

where $Y_i$ is the response variable for the $i$-th observation, $X_i^{(k)}$ is the $i$-th observation’s corresponding predictors in group $k$, $\beta^{(k)}$ is the vector of coefficients in group $k$, and $\lambda$ ($\lambda \geq 0$) is the tuning parameter for penalization. Later, Meier et al. [75] applied the group lasso method to logistic regression and other generalized linear regression models.

More recently, Simon et al. [76] studied a sparse group lasso (SGL) algorithm, imposing a convex combination of the L2- and L1-norm penalties on the grouped and individual parameters, respectively. In linear regression, the SGL solution for the regression coefficients $\beta$ is

$$
\hat{\beta} = \arg\min_{\beta} \sum_{i} \Big( Y_i - \sum_{k=1}^{K} X_i^{(k)\top} \beta^{(k)} \Big)^2 + (1-\alpha)\lambda \sum_{k=1}^{K} \sqrt{p_k}\, \|\beta^{(k)}\|_2 + \alpha\lambda \|\beta\|_1,
$$

where $\alpha \in [0, 1]$ is a weight tuning parameter, $\lambda \geq 0$, and $\|\cdot\|_1$ is the L1 norm. When $\alpha = 0$, the solution reduces to the group lasso solution; when $\alpha = 1$, it becomes the lasso solution. Simon et al. [76] also discussed the application of SGL in other model settings, such as generalized linear regression.
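The role of the weight α can be checked numerically with a small sketch of the penalty term alone. It assumes the common parametrization in which the group term carries weight (1 − α) with a square-root group-size factor and the L1 term carries weight α; the function name and the toy coefficients are hypothetical.

```python
import math


def sgl_penalty(beta, groups, lam, alpha):
    """Sparse group lasso penalty for coefficients `beta`.

    groups: list of index lists partitioning range(len(beta))
    Returns (1 - alpha) * lam * sum_k sqrt(p_k) * ||beta^(k)||_2
            + alpha * lam * ||beta||_1.
    """
    group_term = 0.0
    for idx in groups:
        l2_norm = math.sqrt(sum(beta[j] ** 2 for j in idx))
        group_term += math.sqrt(len(idx)) * l2_norm
    l1_term = sum(abs(b) for b in beta)
    return (1.0 - alpha) * lam * group_term + alpha * lam * l1_term
```

Setting `alpha=0` leaves only the group-wise term and `alpha=1` leaves only the L1 term, matching the two limiting cases stated above; intermediate values blend the two.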

ARGO-C: The proposed method

In this subsection, we illustrate in detail the methodology of our proposed approach, ARGO-C.

We first recognize that some of the flu-related Google search terms often share a common theme. For example, search terms concerning flu treatments may include phrases such as “treat flu”, “how to treat the flu”, and “treat the flu”, while terms concerning a specific sub-type of flu may include “influenza a”, “influenza type a”, etc. It is then natural to cluster similar search terms together and fit a penalized linear regression model with a group-wise penalty. This direction motivated our project.

Our proposed approach aims at enhancing the infrastructure of ARGO. Here, we focus on ILI tracking at the national level as an illustration. Additionally, to showcase the generality of our approach, we provide an exemplary integration of our approach into the ARGO2 and ARGOX frameworks [63] for localized (regional and state-level) flu tracking. Our methodology can be readily integrated into existing methods to track other infectious diseases [36, 62, 66, 67], or other social and/or economic trends [23].

National level.

For national level %ILI prediction, our proposed ARGO-C is realized in two steps: in Step 1, we identify the connectivity structure among the candidate flu-related Google search terms by unsupervised statistical learning; in Step 2, we integrate the identified cluster structure into the predictive model of weekly flu activities (in terms of %ILI) using a penalized regression with group-wise regularization.

We start by defining the notation used in the rest of the paper. Let Yt be the logit-transformed percentage of influenza-like illness (%ILI) at the national level at week t from the CDC’s weekly flu reports, and let Xt = (X1t, X2t, …, Xpt) be the log-transformed Google search volumes of p flu-related terms at week t. Due to the delay of the CDC’s reports, at the current week T, we are only able to observe Y1:(T−1), up to the previous week; the Google search data are instead up-to-date, with X1:T all available. Our proposed method can be summarized as follows:

  • Step 1: clustering Internet search terms.
    Partition the $p$ search terms into $K$ groups, denoted by $G_1, \ldots, G_K$ ($K \geq 2$). This can be done using a standard clustering method, such as hierarchical clustering [77], k-means [78], or model-based clustering [79]. Standard statistical programming languages, such as R [80] or Python [81], have existing packages that implement these methods. The number of clusters $K$ can be determined by examining measures such as the within-group variance, the silhouette [82], or the gap statistic [83]. For the ILI tracking tasks reported in this study, we recommend hierarchical clustering with correlation as the distance metric between search terms’ time series, i.e., the correlation distance between two search terms $i$ and $j$ is given by $d_{ij} = 1 - \rho_{ij}$, where $\rho_{ij}$ is the correlation coefficient between their search volumes. The number of clusters $K$ is selected based on the “Elbow Method”: examining the trend of the within-group variation (i.e., the Within-cluster Sum of Squares, or WSS) against the number of groups and then selecting candidate values for $K$ at change points. The final number of clusters $K$ is determined with consideration of interpretability (avoiding too many or too few clusters) while cross-referencing other criteria such as the gap statistic and the silhouette.
  • Step 2: nowcasting using group-structured, penalized regression.
    Predict the current week’s (logit-transformed) %ILI, $Y_T$, by

    $$
    \hat{Y}_T = \mu_y + \sum_{j=1}^{m} \gamma_j Y_{T-j} + \sum_{i=1}^{p} \beta_i X_{iT}, \tag{4}
    $$

    where $m$ is the number of lagged time series terms, $\mu_y$ is the intercept, $\gamma = \gamma_{m:1}$ are the autoregressive coefficients, and $\beta = \beta_{1:p}$ are the exogenous coefficients of the Google search terms. The coefficients $\beta$ are partitioned into $K$ groups as identified in the previous step. Then, the model parameters in Eq (4), $(\mu_y, \beta, \gamma)$, can be estimated by minimizing the following penalized sum of squares,

    $$
    \sum_{t=T-N}^{T-1} \Big( Y_t - \mu_y - \sum_{j=1}^{m} \gamma_j Y_{t-j} - \sum_{i=1}^{p} \beta_i X_{it} \Big)^2 + (1-\alpha)\lambda \sum_{k=1}^{K} \sqrt{p_k}\, \|\beta^{(k)}\|_2 + \alpha\lambda \|\beta\|_1,
    $$

    where $N$ is the training window length, $\beta^{(k)} = \{\beta_j, j \in G_k\}$ are the coefficients of the search terms in cluster $k$, and $p_k$ is the size of cluster $k$. Specifically, the cluster structure of the search terms is incorporated through the sparse group lasso (SGL) regularization, which intends to impose sparsity on all the coefficients of terms in each cluster simultaneously. Note that $\alpha$ and $\lambda$ are the tuning parameters, with $\alpha$ determining the weights between the individual and group-wise regularization and $\lambda$ controlling the strength of regularization. In practice, we use the default setting $\alpha = 0.95$ in the R package SGL [84] and use cross-validation to select $\lambda$. We also follow ARGO’s default setting and set the training window to two years, i.e., $N = 104$ weeks. Notably, the model is trained dynamically in a rolling-window scheme to address the evolution in people’s search patterns [61]; it is also trained only on data available up to the time of prediction, thus enabling real-time flu tracking.
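Step 1 above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the correlation distance 1 − ρ and average linkage; the function names and the toy search-volume series are hypothetical, and a production analysis would more likely rely on an established clustering package.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length series (neither constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def average_linkage_clusters(series, K):
    """Agglomerative clustering of search terms into K groups.

    series: dict mapping term name -> list of weekly search volumes
    Distance between two terms is 1 - correlation; the distance between
    two clusters is the average over all cross-cluster pairs.
    """
    names = list(series)
    dist = {(a, b): 1.0 - pearson(series[a], series[b])
            for a in names for b in names if a != b}
    clusters = [[name] for name in names]
    while len(clusters) > K:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                total = sum(dist[(a, b)]
                            for a in clusters[i] for b in clusters[j])
                avg = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy series (made-up numbers) for four of the terms mentioned in the text.
toy = {
    "treat flu": [1, 2, 3, 4, 5],
    "how to treat the flu": [2, 4, 6, 8, 11],
    "influenza a": [5, 3, 4, 1, 2],
    "influenza type a": [10, 7, 8, 3, 3],
}
groups = average_linkage_clusters(toy, K=2)
```

On these toy series the two treatment terms end up in one group and the two sub-type terms in the other, because their volumes move together. The resulting index sets play the role of G1, …, GK in the group-wise penalty of Step 2.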

Regional level.

For ARGO-C’s regional flu tracking, we break down ARGO2’s Internet data extraction step (i.e., Step 1) into two sub-steps.

  • Step 1.1: clustering Internet search terms.
    For each region $r$ ($r = 1, \ldots, 10$), follow the same procedure as in Step 1 of the national %ILI prediction to partition the $p$ search terms into $K$ groups, denoted by $G_1^{(r)}, \ldots, G_K^{(r)}$. Here we keep the cluster number $K$ the same as at the national level for consistency but allow the clustering to vary across regions to account for the distinct connectivity of search information in each region.
  • Step 1.2: extracting regional Google information based on clustered search terms.
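    Obtain preliminary estimates for the regional-level %ILI based completely on the region-wise Google search data and the clustering structure learned in Step 1.1. Specifically, the raw estimate for region $r$’s (log-transformed) %ILI at the current week $T$, $\hat{Y}_T^{(r)}$, is given by

    $$
    \hat{Y}_T^{(r)} = \mu_y^{(r)} + \sum_{i=1}^{p} \beta_i^{(r)} X_{iT}^{(r)}, \tag{5}
    $$

    where the superscript $(r)$ indicates the region-specific parameters and data. Similar to Step 2 of ARGO-C at the national level, the parameters are estimated through a sparse group lasso regularization, i.e., by minimizing

    $$
    \sum_{t=T-N}^{T-1} \Big( Y_t^{(r)} - \mu_y^{(r)} - \sum_{i=1}^{p} \beta_i^{(r)} X_{it}^{(r)} \Big)^2 + (1-\alpha)\lambda \sum_{k=1}^{K} \sqrt{p_k}\, \|\beta^{(r,k)}\|_2 + \alpha\lambda \|\beta^{(r)}\|_1,
    $$

    where $N$ is the training window length, $\beta^{(r,k)} = \{\beta_j^{(r)}, j \in G_k^{(r)}\}$ are the region-specific coefficients of the search terms in cluster $k$, and $p_k$ is the size of cluster $k$. Similar to the national level, we use the default setting $\alpha = 0.95$ in the R package SGL [84] and employ 10-fold cross-validation to select $\lambda$ for each individual region.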
  • Step 2: cross-regional boosting.
    This step is exactly the same as the original Step 2 in ARGO2, predicting the 10 regions’ %ILI jointly based on multi-source, multi-resolution information. More details can be found in [63].

State level.

At the state level, ARGO-C is readily adaptable to fit into the ARGOX framework. Steps 1.1 and 1.2 follow the regional-level procedure above but are applied to each state; Step 2 directly inherits the original Step 2 of ARGOX [65], with the same dichotomic treatment of the 51 states/district/city.

Data

CDC’s %ILI data.

The CDC’s weekly flu report is released every Friday, listing the percent of outpatient visits with influenza-like illness (%ILI) in the previous week [85] (https://www.cdc.gov/flu/weekly/overview.htm). Therefore, the CDC’s %ILI always lags behind real time by at least one week. The CDC’s report includes %ILI at the national level, for the 10 US Health and Human Services (HHS) regions, and for the 51 states/district/city (50 states plus Washington DC, excluding Florida but including New York City). The CDC’s %ILI data for this study were collected on January 29, 2023.

Google search data.

The Internet search data from Google are publicly available through Google Trends (trends.google.com). Once a user specifies a desired query term (or a topic), a geographical indicator, and a time range on Google Trends, the website will return a time series of the term’s weekly search volumes. With Google Trends API, we are able to obtain the un-normalized search frequencies for the specified term, which includes all the searches that contain the entire term.

The search query terms in this study follow those established by previous works [61, 63, 65]. Specifically, 161 flu-related search terms/topics were identified, with the first batch returned by Google Correlate on March 29, 2009 and the remainder on May 22, 2010 to account for the 2009 H1N1 outbreak. We also included additional topics/queries obtained from “Related queries” and “Related topics” when searching flu-related terms on Google Trends after Google Correlate was discontinued in December 2019. S8 Table lists these search terms. As verified by previous works, this thoroughly screened set of search queries provides a relatively comprehensive characterization of people’s search patterns relating to ILI. We welcome future endeavors to further enrich this set, potentially with large language models and text mining techniques.

The state-level Google search volumes are enriched by the corresponding regional-level data, following the ARGOX framework [65], to alleviate the missing data issue (zero frequencies).

We admit that the Google search data may only be representative of the search interests among Google users rather than the entire population. The ARGO (including ARGO2 and ARGOX) framework [65] attempts to correct for such potential bias in the modeling.

As one benchmark method for comparison, we downloaded the discontinued Google Flu Trends (GFT) data (https://www.google.org/flutrends/about/data/flu/us/data.txt). GFT has the weekly %ILI prediction from January 1, 2004 to August 9, 2015.

Evaluation metrics

We use three metrics to evaluate the accuracy of an estimate against the actual %ILI released by the CDC: the root mean squared error (RMSE), the mean absolute error (MAE), and the Pearson correlation (Correlation). The RMSE between an estimate $\hat{p}_t$ and the true value $p_t$ over period $t = 1, \ldots, N$ is given by

$$
\mathrm{RMSE} = \sqrt{ \frac{1}{N} \sum_{t=1}^{N} (\hat{p}_t - p_t)^2 }.
$$

The MAE between an estimate $\hat{p}_t$ and the true value $p_t$ over period $t = 1, \ldots, N$ is defined as

$$
\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} |\hat{p}_t - p_t|.
$$

The correlation we consider is the Pearson correlation coefficient between $\hat{p} = (\hat{p}_1, \ldots, \hat{p}_N)$ and $p = (p_1, \ldots, p_N)$.
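The three metrics are simple to compute directly from their definitions. The following is a minimal sketch in plain Python; the function names are our own and the inputs are assumed to be equal-length lists of estimated and reported %ILI values.

```python
import math


def rmse(est, truth):
    """Root mean squared error between estimates and true values."""
    n = len(truth)
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(est, truth)) / n)


def mae(est, truth):
    """Mean absolute error between estimates and true values."""
    n = len(truth)
    return sum(abs(e - t) for e, t in zip(est, truth)) / n


def pearson_corr(est, truth):
    """Pearson correlation coefficient (inputs must not be constant)."""
    n = len(truth)
    mean_e, mean_t = sum(est) / n, sum(truth) / n
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(est, truth))
    var_e = sum((e - mean_e) ** 2 for e in est)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    return cov / math.sqrt(var_e * var_t)
```

Lower RMSE and MAE indicate smaller errors, while a correlation closer to 1 indicates that the estimate follows the rises and falls of the reported series more closely.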

Results

We first applied the ARGO-C model to retrospectively estimate the weekly %ILI at the US national level from March 29, 2009 to January 28, 2023. S1 and S2 Figs illustrate the clustering structures identified among the flu-related Google search terms by ARGO-C’s Step 1. The results were obtained via hierarchical clustering with an average linkage function using correlation as the distance metric, based on the Google data available prior to the earliest prediction date (see the Discussion section and S1 Table for a more detailed discussion of the choice of clustering method). In particular, we performed the clustering analysis twice, using two sets of search terms of 71 and 161 terms/topics, respectively (see S8 Table). Then, the identified clusters of search terms were incorporated into ARGO-C’s Step 2 to predict %ILI after the collection date of the corresponding set of search terms (March 29, 2009 to May 21, 2010 for the first 71 terms, and May 22, 2010 onward for all 161 terms). More specifically, 53 clusters were identified among the 71 search terms originally collected by March 29, 2009 (see S1 Fig, where the number of clusters was determined by minimizing the sum of within-group variation while preserving interpretability). In the same fashion, based on the second set of 161 flu-related search terms/topics, 45 clusters were identified, as illustrated in S2 Fig. As can be seen, the clustering method successfully grouped search terms with close semantics into the same group. For example, one cluster consists of search terms related to flu treatments, containing phrases such as “treat flu”, “how to treat the flu”, and “treat the flu”. In addition, another cluster includes search terms about respiratory illness related to flu, such as “sinus”, “bronchitis”, “pneumonia”, and “walking pneumonia” (53 clusters among 71 terms, in S1 Fig).
We note that some search terms with close semantic meanings are not necessarily highly correlated in their search volumes, and thus may not be clustered together, such as “flu treatment” and “treat flu” (S1 Fig). On the other hand, semantically less relevant search terms could be grouped together potentially due to people’s shared search interests, such as “flu Texas” and “flu report” (S2 Fig). This exemplifies the data-driven nature of our approach. Besides empirical evaluation, the effectiveness of the clustering results is also supported by quantitative evidence, where our method gives leading performance in terms of silhouette and within-group variance (WSS, S1 Table).

Fig 1 and Table 1 summarize the national-level prediction performance of our proposed model, ARGO-C, in comparison with benchmark methods, including Google’s original GFT (discontinued on July 11, 2015), vector autoregression model with lag 1 (VAR1), the original ARGO model [61], and the naive method which simply carries over the previous week’s %ILI to predict the current week. We first focus on the period prior to the influence of COVID-19. That is, we exclude the period when the %ILI reported by CDC was highly confounded and contaminated by COVID-19 symptoms and cases, thus not accurately reflecting the flu activities anymore [67, 85]. During this whole period from 2009 to 2020 (Fig 1), our method ARGO-C shows the leading prediction accuracy compared to all other benchmarks across all three performance metrics. In particular, the improvement of ARGO-C from ARGO confirms the potential of integrating the interconnection among Internet search data into the modeling process while showcasing the effectiveness of ARGO-C in utilizing such information to enhance disease prediction.
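For reference, the naive benchmark and the kind of relative error reduction used when comparing methods can be written down in a few lines. This is a sketch under our own naming; the toy %ILI values in the usage note are made up and are not results from the study.

```python
def naive_estimates(ili):
    """Naive benchmark: the estimate for week t is the reported %ILI of
    week t-1. Returns estimates for weeks 1, ..., len(ili)-1."""
    return list(ili[:-1])


def percent_error_reduction(err_model, err_benchmark):
    """Percent reduction of an error metric relative to a benchmark."""
    return 100.0 * (1.0 - err_model / err_benchmark)
```

For a toy series `[1.2, 1.5, 2.0, 1.8]`, `naive_estimates` yields `[1.2, 1.5, 2.0]`, to be compared against the targets `[1.5, 2.0, 1.8]`. Any error metric computed for a model and for this benchmark can then be passed to `percent_error_reduction`.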

Fig 1. Comparison of %ILI estimation between ARGO-C and other benchmarks.

The evaluation is based on the national level %ILI in three accuracy metrics: RMSE, MAE, and correlation; the evaluation period is the overall period from March 29, 2009 to March 21, 2020 (excluding the influence of COVID-19). Detailed numbers can be found in Table 1.

https://doi.org/10.1371/journal.pone.0305579.g001

Table 1. Comparison of national %ILI estimation between ARGO-C and other benchmarks.

https://doi.org/10.1371/journal.pone.0305579.t001

To verify the significance of ARGO-C’s advantage, we further conducted ablation analyses comparing ARGO-C to several other benchmarks. S3 Table compares the performance of ARGO-C under varying α, the weight parameter between the group penalty and the individual penalty for the search terms; S5 Table compares ARGO-C to the benchmarks under random clustering and single clustering of the search terms, as well as the benchmark with group-aggregated search frequencies (i.e., simply adding up the search frequencies in a group as a new predictor) instead of group-structured penalization. The ablation studies confirm the effectiveness of incorporating the connectivity learned from Internet data to improve the prediction accuracy and highlight the advantages of adopting a structured group-wise penalization for extracting connectivity information from the Internet data. Additionally, we illustrated the significance of ARGO-C’s improvement over ARGO with bootstrapped relative efficiency in S4 Table (the relative RMSE of ARGO-C/ARGO is 0.929 (0.050), the relative MAE is 0.972 (0.036), and the relative correlation is 1.003 (0.001). Values in parentheses are simulated standard errors of the measures).
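One way such a bootstrapped relative efficiency could be computed is sketched below. It assumes a stationary (geometric-block) bootstrap over weeks; the resampling scheme for this particular table, the mean block length, the number of replicates, and all function names are assumptions for illustration rather than the settings used in the study.

```python
import math
import random


def rmse(est, truth):
    """Root mean squared error."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(est, truth)) / len(truth))


def stationary_bootstrap_indices(n, mean_block, rng):
    """Resample n time indices in blocks of geometric length (mean
    `mean_block`), wrapping around the end of the series."""
    idx = []
    pos = rng.randrange(n)
    while len(idx) < n:
        idx.append(pos)
        if rng.random() < 1.0 / mean_block:
            pos = rng.randrange(n)  # start a new block at a random week
        else:
            pos = (pos + 1) % n     # continue the current block
    return idx


def relative_rmse_bootstrap(est_a, est_b, truth, reps=200, mean_block=52, seed=0):
    """Bootstrap mean and standard error of RMSE(est_a) / RMSE(est_b)."""
    rng = random.Random(seed)
    n = len(truth)
    ratios = []
    for _ in range(reps):
        idx = stationary_bootstrap_indices(n, mean_block, rng)
        y = [truth[i] for i in idx]
        num = rmse([est_a[i] for i in idx], y)
        den = rmse([est_b[i] for i in idx], y)
        ratios.append(num / den)
    mean = sum(ratios) / reps
    se = math.sqrt(sum((r - mean) ** 2 for r in ratios) / (reps - 1))
    return mean, se
```

A ratio below 1 favors the first method. Resampling whole blocks of consecutive weeks, rather than individual weeks, is meant to preserve the serial dependence of the prediction errors.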

The effective use of connectivity among Internet data is further confirmed by a closer look at the evolving patterns of search terms included in and excluded from the ARGO-C model over time. Among the three exemplary clusters highlighted in S3 Fig, we observe that, thanks to the introduced group-penalty structure, ARGO-C frequently selects or filters out an entire cluster of search terms, and thus fully takes advantage of the interconnection among Internet data. On the other hand, each search term may also be selected or filtered out individually within a cluster, indicating a good balance between individual and group-wise penalization. Breaking down the results by flu season, ARGO-C’s performance is consistent, giving the most accurate predictions in the majority of the flu seasons. Notably, ARGO-C is the single leading method in every flu season since 2010 by the measure of correlation (Table 1). This highlights ARGO-C’s strength in predicting flu epidemic trends. Moreover, ARGO-C’s advantage over the vanilla ARGO is especially evident in certain difficult flu seasons, yielding an error reduction of up to 30% (e.g., there are error reductions of 30.7% for ’10-’11, 18.4% for ’14-’15, and 18.9% for ’16-’17 in terms of RMSE). Additionally, the 95% prediction interval given by ARGO-C (based on the stationary bootstrap [61]) has an empirical coverage of 95.08%, given a nominal level of 95%.
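Empirical coverage is simply the share of weeks in which the reported value falls inside the prediction interval. A minimal sketch, with a hypothetical function name and made-up interval bounds in the usage note:

```python
def empirical_coverage(lower, upper, truth):
    """Fraction of weeks whose true value lies within [lower, upper]."""
    hits = sum(1 for lo, hi, y in zip(lower, upper, truth) if lo <= y <= hi)
    return hits / len(truth)
```

For instance, with intervals `[1, 2]` in each of four weeks and reported values `[1.5, 2.5, 1.0, 0.9]`, two of the four values are covered. A well-calibrated 95% interval should give a value close to 0.95 over a long evaluation period.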

We also applied the proposed ARGO-C model to localized flu tracking. At the regional level, ARGO-C again shows the strongest performance across all 10 regions in all three accuracy metrics. More specifically, compared to the naive method, ARGO-C reduces the RMSE by 12% to 29% and the MAE by 11% to 25% (Table 2). In addition, the correlation measure based on the ARGO-C method is uniformly higher than that of the other benchmarks in all 10 regions. Breaking down the results by individual flu season (S9 Table), ARGO-C still leads in the vast majority of the evaluated periods. The strengths of ARGO-C at the regional level also reaffirm our expectation that the interconnection among Internet search data contributes effectively to improving flu tracking performance. Additionally, Table 3 reports the empirical interval coverage for the regional %ILI prediction (ranging from 93% to 96%, given a nominal level of 95%), further confirming the reliability of ARGO-C.

Table 2. Comparison of regional %ILI estimation between ARGO-C and other benchmarks.

https://doi.org/10.1371/journal.pone.0305579.t002

Table 3. Actual coverage of prediction intervals by ARGO-C for regional %ILI prediction.

https://doi.org/10.1371/journal.pone.0305579.t003

Table 4 and S6 Table summarize ARGO-C’s state-level flu tracking performance in comparison with the benchmarks. Averaging over the 51 states, ARGO-C is again the best-performing method among all benchmarks, showcasing its adaptability and robustness in high-resolution disease tracking. The strength of ARGO-C is attributable to our effective modeling of the interconnection among search terms, which is even more relevant for efficiently extracting information from low-quality, high-noise Internet data at high resolution. More detailed reports on each individual state and each flu season are given in S10 Table. During the period from 2014 to 2020 (pre-COVID, Table 4), ARGO-C leads the chart for the majority of the states (ARGO-C outperforms other models for 42 states in terms of MSE). Notably, after including the irregular flu seasons since COVID (2014–2023, Table), ARGO-C shows even more remarkable strength over the benchmarks (ARGO-C outperforms other models for 47 states in terms of MSE). To further confirm the reliability of ARGO-C, we report the actual coverage rate of the 95% prediction interval given by ARGO-C for each state in S7 Table: overall, the average coverage rate over the 51 states is 92.6%, fairly close to the nominal level.

Table 4. Comparison of state-level %ILI estimation between ARGO-C and other benchmarks.

https://doi.org/10.1371/journal.pone.0305579.t004

Discussion

In summary, to account for the interconnection among the Internet search data, we proposed an innovative statistical learning framework, ARGO-C. By applying ARGO-C to both national and localized flu tracking, we observe that ARGO-C enhances the original ARGO/ARGO2/ARGOX framework by effectively and efficiently extracting and utilizing the inherent grouping structure of Google search terms.

The first-step model of our proposed ARGO-C identifies the interconnection structure among Google search terms through clustering. In general, any clustering technique may be employed at this stage, and for each given method, several fitting criteria may also be considered, such as different choices of the distance metric and/or other tuning parameters (e.g., the linkage function for hierarchical clustering). Specifically for the task of flu prediction, we recommend using hierarchical clustering based on the correlation distance metric and the average linkage function. This recommendation is based on our empirical exploration of multiple classic clustering methods, including hierarchical clustering, k-means, and PAM [86], with various tuning configurations (linkage, distance metric, etc., detailed in S1 Table). In practice, one may also explore a few different clustering methods and choose the one that gives the most interpretable results in the given context. Other preprocessing techniques for time series data, such as transformation, could be explored as well. It is also possible to perform clustering in a more dynamic fashion, updating the search term clusters periodically over time, which may improve the accuracy of the %ILI predictions in the subsequent modeling step. In addition, we explored various criteria to determine the number of clusters K. Our final choice of K was a joint decision based on the within-cluster distance and on model interpretability. In practice, we recommend choosing a relatively large K for effective incorporation of the clustering information.
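The recommended clustering configuration can be sketched with SciPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `cluster_search_terms` and the synthetic series are hypothetical, and the correlation distance is taken here to be one minus the Pearson correlation, which is SciPy's definition and may differ from the exact variant used in the paper.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_search_terms(series, n_clusters):
    """Group search terms by the similarity of their time series.

    series: array of shape (n_weeks, n_terms), one column per search term.
    Returns an integer cluster label for each term.
    """
    # pdist compares rows, so transpose to get one row per search term.
    # metric="correlation" is 1 - Pearson correlation (an assumption here).
    distances = pdist(series.T, metric="correlation")
    tree = linkage(distances, method="average")  # average linkage
    # Cut the tree so that at most n_clusters flat clusters are formed.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic example: two pairs of terms following two different seasonal shapes.
rng = np.random.default_rng(0)
weeks = np.arange(200)
shape_a = np.sin(2 * np.pi * weeks / 52)
shape_b = np.cos(2 * np.pi * weeks / 52)
series = np.column_stack([
    shape_a + 0.05 * rng.standard_normal(200),
    2.0 * shape_a + 0.05 * rng.standard_normal(200),
    shape_b + 0.05 * rng.standard_normal(200),
    3.0 * shape_b + 0.05 * rng.standard_normal(200),
])

labels = cluster_search_terms(series, n_clusters=2)
```

Because the correlation distance ignores scale, the two terms that share a seasonal shape end up in the same cluster even though their amplitudes differ. Re-running the function on a rolling window would be one way to approximate the periodic re-clustering mentioned above.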

Our ARGO-C framework mainly employs statistically principled learning approaches to extract Google search information and to predict %ILI. This design aims to retain the interpretability and robustness of the model while achieving efficient information extraction from complex Internet data and satisfactory prediction accuracy, as similarly discussed in [61, 63, 67]. ARGO-C preserves the effectiveness of the ARGO framework in handling noise and missingness within the Internet data. Specifically, in the data preprocessing step, ARGO-C enriches the state-level Google search data with regional-level information for each search term to alleviate the missing data issue (zero frequencies), as detailed in [65]. ARGO-C also employs regularized regression to dynamically and automatically filter out noisy search terms [63, 65], which further accounts for the evolution of people’s search patterns and of search engine algorithms. The multi-step design of ARGO-C ensures flexibility in variable selection and model fitting at each geographical level and unit (state/region/national) while effectively integrating Google search information between and across various geographical resolutions and units. There is a growing literature in natural language processing (NLP) that involves more sophisticated, black-box-type algorithms for text mining, such as deep neural networks (DNNs), word embeddings, and large language models (LLMs). Given the moderate scale and delicate structure of our data, we chose to focus on statistically principled learning approaches for their efficiency, interpretability, and robustness. We also note, however, that the ARGO-C framework is readily adaptable to integrate these NLP/DNN-based text mining techniques in the Step 1 model. Additionally, other penalization techniques in the literature could be readily adapted to ARGO-C, as could other non-linear, black-box-type techniques such as neural networks. These could be of great interest in future expansions of our framework.
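To make the role of the regularization concrete, the sketch below evaluates a sparse-group lasso penalty of the kind introduced in [76], in which a weight α balances the Lasso penalty against the group Lasso penalty, with α = 1 reducing to the plain Lasso. The exact ARGO-C objective is not restated in this section, so the functional form, the square-root group-size weighting, and all names here are assumptions for illustration.

```python
import math


def sparse_group_lasso_penalty(beta, groups, lam, alpha):
    """Penalty lam * (alpha * L1 term + (1 - alpha) * group term).

    beta:   coefficient values, one per predictor.
    groups: cluster label of each predictor (same length as beta).
    lam:    overall penalty strength.
    alpha:  weight on the Lasso part; alpha = 1 leaves only the L1 term.
    """
    l1_term = sum(abs(b) for b in beta)

    # Collect the coefficients belonging to each cluster.
    by_group = {}
    for b, g in zip(beta, groups):
        by_group.setdefault(g, []).append(b)

    # Group term: L2 norm of each cluster's coefficients, weighted by
    # sqrt(cluster size) (the convention in [76]; an assumption for ARGO-C).
    group_term = sum(
        math.sqrt(len(coefs)) * math.sqrt(sum(c * c for c in coefs))
        for coefs in by_group.values()
    )
    return lam * (alpha * l1_term + (1.0 - alpha) * group_term)


# Hypothetical coefficients: one active cluster and one fully zeroed cluster.
beta = [3.0, -4.0, 0.0, 0.0]
groups = [0, 0, 1, 1]

lasso_only = sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=1.0)
group_only = sparse_group_lasso_penalty(beta, groups, lam=1.0, alpha=0.0)
```

A cluster whose coefficients are all zero contributes nothing to either term, which is how a group penalty can exclude an entire cluster of search terms from the model at once.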

We acknowledge that there are limitations in using Google search data for %ILI prediction during the COVID-19 pandemic and in the post-COVID period in the near future. First of all, %ILI is only a proxy for the actual flu incidence in the population. Since seasonal flu and COVID-19 share many symptoms, the reported %ILI may well include visits due to COVID-19. Consequently, the %ILI predictions may be strongly influenced by concurrent COVID-19 surges or underlying COVID-19 cases. In the main focus of our data analysis, we excluded the period that was possibly contaminated by COVID-19, as we suspect that the %ILI target and the existing set of Google search terms may not represent flu activity well during that period. Nevertheless, the %ILI surveillance data can still provide valuable insights into the general trend of influenza activity [85]; we therefore also present the results for the post-COVID period in the Supporting information, which further showcase the robustness of ARGO-C. Future work could update the Google search terms to account for the effect of COVID-19, and/or target alternative flu indicators [41], such as laboratory-confirmed influenza hospital admissions [85]. It would also be an interesting future project to explore predicting seasonal flu cases and COVID-19 cases simultaneously by accounting for their interactive effects on each other [67].

Although such applications are not presented in this paper, ARGO-C is robust and can be readily adapted to the digital tracking of other diseases or social/economic trends, in the same way that we applied it to %ILI prediction at various geographical resolutions. We hope that our proposed framework can improve the real-time tracking of various infectious diseases and potentially contribute to public health and to saving lives.

Supporting information

S1 Fig. Clustering of the 71 Google search terms, collected by March 29, 2009.

53 clusters were identified. Hierarchical clustering with average linkage and the correlation distance metric was used. The clustering was conducted based on the time series of Google search data from January 10, 2004 (the earliest available Google Trends data) to March 29, 2009 (the earliest prediction date by these 71 terms).

https://doi.org/10.1371/journal.pone.0305579.s001

(TIF)

S2 Fig. Clustering of the 161 Google search terms, collected by May 22, 2010.

45 clusters were identified. Hierarchical clustering with average linkage and the correlation distance metric was used. The clustering was conducted based on the time series of Google search data from January 10, 2004 (the earliest available Google Trends data) to May 22, 2010 (the earliest prediction date by these 161 terms).

https://doi.org/10.1371/journal.pone.0305579.s002

(TIF)

S3 Fig. Traceplots of clustered predictors included in the ARGO-C model at the national level over time.

The heatmaps indicate whether each predictor was included in the predictive ARGO-C model at each week. Three exemplary clusters were highlighted. Each of the top two clusters contains three search terms (identified among 71 search terms by March 29, 2009, used for predictions before 2010), while the last one includes four search terms (identified among 161 topics/terms by May 22, 2010 and used for predictions since 2010). The entire cluster was penalized and excluded from the model when an entire column in the traceplot is colored grey.

https://doi.org/10.1371/journal.pone.0305579.s003

(TIF)

S1 Table. Evaluation of various clustering methods for grouping search terms.

Clustering is based on the 161 Google search terms collected by May 22, 2010, with 45 clusters. Methods in comparison include hierarchical clustering (HC) with average linkage (ave), complete linkage (comp), and single linkage (single), based on the correlation/Pearson distance metric, as well as k-means and PAM. Metrics for comparison include within-cluster sum of squares (WSS), silhouette, and gap statistics. Note that for correlation-based distance, WSS and gap statistics (defined based on Euclidean distance) are less relevant. All clustering methods can be readily applied in ARGO-C. Based on its superior performance in this evaluation, we use hierarchical clustering with average linkage as the default in this paper.

https://doi.org/10.1371/journal.pone.0305579.s004

(PDF)

S2 Table. Comparison of %ILI estimation between ARGO-C and other benchmarks at the national level, for flu seasons since COVID-19.

The evaluation is based on the national-level %ILI over multiple periods and multiple metrics. RMSE, MAE, and correlation are reported. The method with the best performance is highlighted in boldface for each metric in each period. Methods considered here include ARGO-C, VAR1, GFT, the original ARGO, and the naive method. All comparisons are conducted on the original scale of the CDC’s %ILI. The overall period ’09-’23 is March 29, 2009 to January 28, 2023, including the period since COVID. Each regular flu season is from week 40 to week 20 of the next year, as defined by CDC’s Morbidity and Mortality Weekly Report. (The ’22-’23 season is up to January 28, 2023.)

https://doi.org/10.1371/journal.pone.0305579.s005

(PDF)

S3 Table. Comparison of %ILI estimation by ARGO-C and other benchmarks at the national level, with varying tuning parameter α.

The evaluation period is March 29, 2009 to February 29, 2020, before COVID. RMSE is reported for varying α, the weight between Lasso penalty and group Lasso penalty in the ARGO-C model, with α = 1 corresponding to vanilla ARGO without group penalty.

https://doi.org/10.1371/journal.pone.0305579.s006

(PDF)

S4 Table. Relative accuracy between ARGO-C and ARGO on the national level.

The evaluation period is March 29, 2009 to February 29, 2020, before COVID. The relative accuracy, defined as the ratio of the prediction accuracy of ARGO-C to that of ARGO, is reported for RMSE, MAE, and correlation. A bootstrap is conducted to estimate the relative accuracy, with the bootstrapped SE and quantiles reported (based on 50 bootstrapped samples). For RMSE and MAE, a relative accuracy <1 indicates an advantage for ARGO-C; for correlation, a relative accuracy >1 indicates an advantage.

https://doi.org/10.1371/journal.pone.0305579.s007

(PDF)

S5 Table. Comparison of %ILI estimation by ARGO-C and additional benchmarks at the national level.

The evaluation period is March 29, 2009 to February 29, 2020, before COVID. RMSE is reported. ARGO-C (random) is based on randomly assigned clusters in Step 1 of ARGO-C, with the same number of clusters as identified by unsupervised learning in ARGO-C; the RMSE is an average over 10 random assignments. ARGO (single) is based on one single cluster including all search terms. ARGO (group) is based on group-aggregated search frequencies, where each predictor is simply the sum of the frequencies of the search terms in each cluster.

https://doi.org/10.1371/journal.pone.0305579.s008

(PDF)

S6 Table. Comparison of %ILI estimation between ARGO-C and other benchmarks at the state level, for flu seasons since COVID-19.

The evaluation is based on the average over 51 US states/district in multiple periods and multiple metrics. RMSE, MAE, and correlation are reported. The method with the best performance is highlighted in boldface for each metric in each period. Methods considered here include ARGO-C, VAR1, GFT, the original ARGOX, and the naive method. All comparisons are conducted on the original scale of the CDC’s %ILI. The overall period ’14-’23 is January 10, 2014 (first available estimate by the ARGO framework) to January 28, 2023, including the period since COVID. The post-COVID period is March 21, 2020 to January 28, 2023. Each regular flu season is from week 40 to week 20 of the next year, as defined by CDC’s Morbidity and Mortality Weekly Report. (The ’22-’23 season is up to January 28, 2023.)

https://doi.org/10.1371/journal.pone.0305579.s009

(PDF)

S7 Table. Actual coverage of prediction intervals by ARGO-C for state-level %ILI prediction.

The coverage is for 95% nominal level. The average coverage over 51 states/city/district is 92.6%. The evaluation period is January 10, 2014 to March 21, 2020, excluding the period with COVID-19 influence.

https://doi.org/10.1371/journal.pone.0305579.s010

(PDF)

S8 Table. All search query terms used in this study.

The first 71 terms were collected on March 29, 2009; the remaining terms/topics were identified on May 22, 2010. The last 21 terms, separated by a horizontal line from the first 140 terms, are “Related topics” from Google Trends.

https://doi.org/10.1371/journal.pone.0305579.s011

(PDF)

S9 Table. Comparison of different methods for regional level %ILI estimation in Region 1-10.

The RMSE, MAE, and correlation measures are reported. The method with the best performance is highlighted in boldface for each metric in each period.

https://doi.org/10.1371/journal.pone.0305579.s012

(PDF)

S10 Table. Comparison of different methods for state-level %ILI estimation in all 51 states/city/district.

The MSE, MAE, and correlation are reported. The method with the best performance is highlighted in boldface for each metric in each period.

https://doi.org/10.1371/journal.pone.0305579.s013

(PDF)

Acknowledgments

Ahmed Hussain contributed to the project as an undergraduate research assistant at Williams College.

References

  1. Polgreen PM, Chen Y, Pennock DM, Nelson FD, Weinstein RA. Using internet searches for influenza surveillance. Clinical infectious diseases. 2008;47(11):1443–1448. pmid:18954267
  2. Ginsberg J, Mohebbi MH, Patel RS, Brammer L, Smolinski MS, Brilliant L. Detecting influenza epidemics using search engine query data. Nature. 2009;457(7232):1012–1014. pmid:19020500
  3. Althouse BM, Ng YY, Cummings DA. Prediction of dengue incidence using search query surveillance. PLoS neglected tropical diseases. 2011;5(8):e1258. pmid:21829744
  4. Chan EH, Sahai V, Conrad C, Brownstein JS. Using web search query data to monitor dengue epidemics: a new model for neglected tropical disease surveillance. PLoS Neglected Tropical Diseases. 2011;5(5):e1206. pmid:21647308
  5. Murdoch TB, Detsky AS. The inevitable application of big data to health care. Jama. 2013;309(13):1351–1352. pmid:23549579
  6. Lee K, Agrawal A, Choudhary A. Real-time disease surveillance using twitter data: demonstration on flu and cancer. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining; 2013. p. 1474–1477.
  7. Khoury MJ, Ioannidis JP. Big data meets public health. Science. 2014;346(6213):1054–1055. pmid:25430753
  8. Rufai SR, Bunce C. World leaders’ usage of Twitter in response to the COVID-19 pandemic: a content analysis. Journal of public health. 2020;42(3):510–516. pmid:32309854
  9. Effenberger M, Kronbichler A, Shin JI, Mayer G, Tilg H, Perco P. Association of the COVID-19 pandemic with internet search volumes: a Google Trends™ analysis. International Journal of Infectious Diseases. 2020;95:192–197. pmid:32305520
  10. Aiello AE, Renson A, Zivich P. Social media-and internet-based disease surveillance for public health. Annual review of public health. 2020;41:101. pmid:31905322
  11. Lampos V, Majumder MS, Yom-Tov E, Edelstein M, Moura S, Hamada Y, et al. Tracking COVID-19 using online search. NPJ digital medicine. 2021;4(1):17. pmid:33558607
  12. Ettredge M, Gerdes J, Karuga G. Using web-based search data to predict macroeconomic statistics. Communications of the ACM. 2005;48(11):87–92.
  13. Goel S, Hofman JM, Lahaie S, Pennock DM, Watts DJ. Predicting consumer behavior with Web search. Proceedings of the National Academy of Sciences. 2010;107(41):17486–17490. pmid:20876140
  14. McLaren N, Shanbhogue R. Using internet search data as economic indicators. Bank of England Quarterly Bulletin. 2011;(2011):Q2.
  15. Bollen J, Mao H, Zeng X. Twitter mood predicts the stock market. Journal of computational science. 2011;2(1):1–8.
  16. Choi H, Varian H. Predicting the present with Google Trends. Economic Record. 2012;88(s1):2–9.
  17. Preis T, Moat HS, Stanley HE. Quantifying trading behavior in financial markets using Google Trends. Scientific reports. 2013;3(1):1–6. pmid:23619126
  18. Scott SL, Varian HR. Predicting the present with bayesian structural time series. International Journal of Mathematical Modelling and Numerical Optimisation. 2014;5(1-2):4–23.
  19. Einav L, Levin J. Economics in the age of big data. Science. 2014;346(6210):1243089. pmid:25378629
  20. Wu L, Brynjolfsson E. The future of prediction: How Google searches foreshadow housing prices and sales. In: Economic analysis of the digital economy. University of Chicago Press; 2015. p. 89–118.
  21. Vicente MR, López-Menéndez AJ, Pérez R. Forecasting unemployment with internet search data: Does it help to improve predictions when job destruction is skyrocketing? Technological Forecasting and Social Change. 2015;92:132–139.
  22. Scott SL, Varian HR. Bayesian variable selection for nowcasting economic time series. In: Economic analysis of the digital economy. University of Chicago Press; 2015. p. 119–135.
  23. Yi D, Ning S, Chang CJ, Kou S. Forecasting unemployment using Internet search data via PRISM. Journal of the American Statistical Association. 2021;116(536):1662–1673.
  24. Manyika J, Chui M, Brown B, Bughin J, Dobbs R, Roxburgh C, et al. Big data: The next frontier for innovation, competition, and productivity. McKinsey & Company; 2011.
  25. McAfee A, Brynjolfsson E. Big data: The management revolution. Harvard Business Review. 2012;90(10):60–68. pmid:23074865
  26. Chen H, Chiang RH, Storey VC. Business intelligence and analytics: From big data to big impact. MIS Quarterly. 2012;36(4):1165–1188.
  27. Risteski D, Davcev D. Can we use daily Internet search query data to improve predicting power of EGARCH models for financial time series volatility. In: Proceedings of the International Conference on Computer Science and Information Systems (ICSIS’2014), October 17–18, 2014, Dubai (United Arab Emirates); 2014.
  28. Zhu C. Big data as a governance mechanism. The Review of Financial Studies. 2019;32(5):2021–2061.
  29. Kim GH, Trimi S, Chung JH. Big-data applications in the government sector. Communications of the ACM. 2014;57(3):78–85.
  30. Bennett J, Lanning S. The netflix prize. In: Proceedings of KDD Cup and Workshop 2007; 2007.
  31. Santillana M, Zhang DW, Althouse BM, Ayers JW. What can digital disease detection learn from (an external revision to) Google Flu Trends? American journal of preventive medicine. 2014;47(3):341–347. pmid:24997572
  32. Wójcik OP, Brownstein JS, Chunara R, Johansson MA. Public health for the people: participatory infectious disease surveillance in the digital age. Emerging themes in epidemiology. 2014;11(1):1–7. pmid:24991229
  33. Bates M. Tracking disease: digital epidemiology offers new promise in predicting outbreaks. IEEE pulse. 2017;8(1):18–22. pmid:28129137
  34. Li C, Chen LJ, Chen X, Zhang M, Pang CP, Chen H. Retrospective analysis of the possibility of predicting the COVID-19 outbreak from Internet searches and social media data, China, 2020. Eurosurveillance. 2020;25(10):2000199. pmid:32183935
  35. Ma S, Sun Y, Yang S. Using Internet Search Data to Forecast COVID-19 Trends: A Systematic Review. Analytics. 2022;1(2):210–227.
  36. Ma S, Yang S. Covid-19 forecasts using internet search information in the united states. Scientific Reports. 2022;12(1):11539. pmid:35798774
  37. Iuliano AD, Roguski KM, Chang HH, Muscatello DJ, Palekar R, Tempia S, et al. Estimates of global seasonal influenza-associated respiratory mortality: a modelling study. The Lancet. 2018;391(10127):1285–1300. pmid:29248255
  38. Molinari NAM, Ortega-Sanchez IR, Messonnier ML, Thompson WW, Wortley PM, Weintraub E, et al. The annual impact of seasonal influenza in the US: measuring disease burden and costs. Vaccine. 2007;25(27):5086–5096. pmid:17544181
  39. Lipsitch M, Finelli L, Heffernan RT, Leung GM, Redd SC; for the 2009 H1N1 Surveillance Group. Improving the evidence base for decision making during a pandemic: the example of 2009 influenza A/H1N1. Biosecurity and bioterrorism: biodefense strategy, practice, and science. 2011;9(2):89–115. pmid:21612363
  40. Nsoesie EO, Brownstein JS, Ramakrishnan N, Marathe MV. A systematic review of studies on forecasting the dynamics of influenza outbreaks. Influenza and other respiratory viruses. 2014;8(3):309–316. pmid:24373466
  41. Chretien JP, George D, Shaman J, Chitale RA, McKenzie FE. Influenza Forecasting in Human Populations: a Scoping Review. PloS One. 2014;9(4):e94130. pmid:24714027
  42. Brownstein JS, Freifeld CC, Madoff LC. Digital disease detection—harnessing the Web for public health surveillance. The New England journal of medicine. 2009;360(21):2153. pmid:19423867
  43. Dalton C, Durrheim D, Fejsa J, Francis L, Carlson S, d’Espaignet ET, et al. Flutracking: a weekly Australian community online survey of influenza-like illness in 2006, 2007 and 2008. Communicable diseases intelligence quarterly report. 2009;33(3):316–322. pmid:20043602
  44. Achrekar H, Gandhe A, Lazarus R, Yu SH, Liu B. Predicting flu trends using twitter data. In: 2011 IEEE conference on computer communications workshops (INFOCOM WKSHPS). IEEE; 2011. p. 702–707.
  45. Yuan Q, Nsoesie EO, Lv B, Peng G, Chunara R, Brownstein JS. Monitoring influenza epidemics in china with search query from baidu. PloS one. 2013;8(5):e64323. pmid:23750192
  46. Paul MJ, Dredze M, Broniatowski D. Twitter improves influenza forecasting. PLoS currents. 2014;6. pmid:25642377
  47. McIver DJ, Brownstein JS. Wikipedia usage estimates prevalence of influenza-like illness in the United States in near real-time. PLoS computational biology. 2014;10(4):e1003581. pmid:24743682
  48. Santillana M, Nsoesie EO, Mekaru SR, Scales D, Brownstein JS. Using clinicians’ search query data to monitor influenza epidemics. Clinical Infectious Diseases. 2014;59(10):1446–1450. pmid:25115873
  49. Paolotti D, Carnahan A, Colizza V, Eames K, Edmunds J, Gomes G, et al. Web-based participatory surveillance of infectious diseases: the Influenzanet participatory surveillance experience. Clinical Microbiology and Infection. 2014;20(1):17–21. pmid:24350723
  50. Smolinski MS, Crawley AW, Baltrusaitis K, Chunara R, Olsen JM, Wójcik O, et al. Flu near you: crowdsourced symptom reporting spanning 2 influenza seasons. American journal of public health. 2015;105(10):2124–2130. pmid:26270299
  51. Santillana M, Nguyen AT, Dredze M, Paul MJ, Nsoesie EO, Brownstein JS. Combining search, social media, and traditional data sources to improve influenza surveillance. PLoS Comput Biol. 2015;11(10):e1004513. pmid:26513245
  52. Yang S, Santillana M, Brownstein JS, Gray J, Richardson S, Kou S. Using electronic health records and Internet search information for accurate influenza forecasting. BMC infectious diseases. 2017;17(1):1–9. pmid:28482810
  53. Bradshaw B, Konty KJ, Ramirez E, Lee WN, Signorini A, Foschini L. Influenza surveillance using wearable mobile health devices. Online Journal of Public Health Informatics. 2019;11(1).
  54. Hassan Zadeh A, Zolbanin HM, Sharda R, Delen D. Social media for nowcasting flu activity: Spatio-temporal big data analysis. Information Systems Frontiers. 2019;21:743–760.
  55. Viboud C, Santillana M. Fitbit-informed influenza forecasts. The Lancet Digital Health. 2020;2(2):e54–e55. pmid:33334559
  56. Cook S, Conrad C, Fowlkes AL, Mohebbi MH. Assessing Google flu trends performance in the United States during the 2009 influenza virus A (H1N1) pandemic. PloS one. 2011;6(8):e23610. pmid:21886802
  57. Pervaiz F, Pervaiz M, A Rehman N, Saif U, et al. FluBreaks: early epidemic detection from Google flu trends. Journal of medical Internet research. 2012;14(5):e2102. pmid:23037553
  58. Butler D. When Google got flu wrong: US outbreak foxes a leading web-based method for tracking seasonal flu. Nature. 2013;494(7436):155–157. pmid:23407515
  59. Olson DR, Konty KJ, Paladini M, Viboud C, Simonsen L. Reassessing Google Flu Trends data for detection of seasonal and pandemic influenza: a comparative epidemiological study at three geographic scales. PLoS computational biology. 2013;9(10):e1003256. pmid:24146603
  60. Lazer D, Kennedy R, King G, Vespignani A. The parable of Google Flu: traps in big data analysis. Science. 2014;343(6176):1203–1205. pmid:24626916
  61. Yang S, Santillana M, Kou SC. Accurate estimation of influenza epidemics using Google search data via ARGO. Proceedings of the National Academy of Sciences. 2015;112(47):14473–14478. pmid:26553980
  62. Yang S, Kou SC, Lu F, Brownstein JS, Brooke N, Santillana M. Advances in using Internet searches to track dengue. PLoS computational biology. 2017;13(7):e1005607. pmid:28727821
  63. Ning S, Yang S, Kou S. Accurate regional influenza epidemics tracking using Internet search data. Scientific reports. 2019;9(1):5238. pmid:30918276
  64. Lu FS, Hattab MW, Clemente CL, Biggerstaff M, Santillana M. Improved state-level influenza nowcasting in the United States leveraging Internet-based data and network approaches. Nature communications. 2019;10(1):147. pmid:30635558
  65. Yang S, Ning S, Kou S. Use Internet search data to accurately track state level influenza epidemics. Scientific reports. 2021;11(1):1–10. pmid:33597556
  66. Wang T, Ma S, Baek S, Yang S. COVID-19 hospitalizations forecasts using internet search data. Scientific Reports. 2022;12(1):9661. pmid:35690619
  67. Ma S, Ning S, Yang S. Joint COVID-19 and influenza-like illness forecasts in the United States using internet search information. Communications Medicine. 2023;3(1):39. pmid:36964311
  68. Tibshirani R. Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society-Series B. 1996;58(1):267–288.
  69. Hoerl AE, Kennard RW. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12(1):55–67.
  70. Tibshirani R. The LASSO method for variable selection in the Cox model. Statistics in Medicine. 1997;16:385–395. pmid:9044528
  71. Lokhorst J. The lasso and generalized linear models. University of Adelaide; 1999.
  72. Roth V. The generalized LASSO. IEEE Transactions on Neural Networks. 2004;15(1):16–28. pmid:15387244
  73. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society-Series B. 2005;67(2):301–320.
  74. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society-Series B. 2006;68(1):49–67.
  75. Meier L, van de Geer S, Bühlmann P. The grouped lasso for logistic regression. Journal of the Royal Statistical Society-Series B. 2008;70(1):53–71.
  76. Simon N, Friedman J, Hastie T, Tibshirani R. A sparse-group lasso. Journal of Computational and Graphical Statistics. 2013;22(2):231–245.
  77. Ward JH. Hierarchical Grouping to Optimize an Objective Function. Journal of the American Statistical Association. 1963;58(301):236–244.
  78. MacQueen J. Classification and analysis of multivariate observations. In: 5th Berkeley Symp. Math. Statist. Probability. University of California Los Angeles LA USA; 1967. p. 281–297.
  79. Banerjee A, Shan H. Model-Based Clustering. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning. Springer; 2011.
  80. R Core Team. R: A Language and Environment for Statistical Computing; 2022. Available from: https://www.R-project.org/.
  81. Van Rossum G, Drake FL, et al. Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam; 1995.
  82. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of computational and applied mathematics. 1987;20:53–65.
  83. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2001;63(2):411–423.
  84. Simon N, Friedman J, Hastie T, Tibshirani R. SGL: Fit a GLM (or Cox Model) with a Combination of Lasso and Group Lasso Regularization; 2019. Available from: https://CRAN.R-project.org/package=SGL.
  85. Centers for Disease Control and Prevention. Flu Activity & Surveillance; 2023.
  86. Kaufman L. Partitioning around medoids (program pam). Finding groups in data. 1990;344:68–125.