Using Google Health Trends to investigate COVID-19 incidence in Africa

The COVID-19 pandemic has caused over 500 million cases and over six million deaths globally. Of these, over 12 million cases and over 250 thousand deaths had occurred on the African continent as of May 2022. Prevention and surveillance remain the cornerstones of interventions to halt the further spread of COVID-19. Google Health Trends (GHT), a free Internet tool, may be valuable to help anticipate outbreaks, identify disease hotspots, or understand the patterns of disease surveillance. We collected COVID-19 case and death incidence for 54 African countries and obtained averages for four five-month study periods in 2020–2021. Average case and death incidences were calculated during these four time periods to measure disease severity. We used GHT to characterize COVID-19 incidence across Africa, collecting numbers of searches from GHT related to COVID-19 using four terms: ‘coronavirus’, ‘coronavirus symptoms’, ‘COVID19’, and ‘pandemic’. The terms were related to weekly COVID-19 case incidences for the entire study period via multiple linear and weighted linear regression analyses. We also assembled 72 variables assessing Internet accessibility, demographics, economics, health, and others, for each country, to summarize potential mechanisms linking GHT searches and COVID-19 incidence. COVID-19 burden in Africa increased steadily during the study period. Notable increases in COVID-19 death incidence were observed for Seychelles and Tunisia. Our study demonstrated a weak correlation between GHT and COVID-19 incidence for most African countries. Several variables seemed useful in explaining the pattern of GHT statistics and their relationship to COVID-19, including the log of average weekly cases, the log of cumulative total deaths, and the log of fixed total number of broadband subscriptions in a country. GHT may be best suited to surveillance of diseases that are diagnosed more consistently.
Overall, GHT-based surveillance showed little applicability in the studied countries. GHT for an ongoing epidemic might be useful in specific situations, such as when countries have significant levels of infection with low variability. Future studies might assess the algorithm in different epidemic contexts.

Introduction
keywords had a strong correlation with COVID-19 cases, and concluded that GT may be a useful tool for predicting COVID-19 outbreaks. Brodeur et al. (2021) used GT to see how lockdowns affected well-being in the U.S. [31]. Once lockdowns were implemented, well-being likely decreased, as searches for terms such as 'stress,' 'suicide,' and 'worry' increased over the lockdown period. Ahmad et al. (2020) used gastrointestinal-related symptom search terms to determine whether GT could predict COVID-19 incidence, and found correlations between the search terms and increases in COVID-19 cases in multiple regions across the U.S. with a four-week lag [32].
Here, we explored whether GHT search query data correlate with COVID-19 incidence at the country level in Africa, as a potential complementary source for more customary forms of COVID-19 surveillance. We decided to use GHT instead of GT given the semi-quantitative nature of the information recovered by GHT. We collected case and death data for 54 African countries, and used four COVID-19-related search terms (see below) for each country. We then assessed whether Internet access, demography, economic information, or health variables, were associated with GHT searches. Lastly, we calculated a standardized volatility index to illuminate whether variability in the signal of case incidence led to less accurate predictions by GHT.

COVID-19 incidence data
Daily COVID-19 new case and death counts were obtained for all 54 African countries from 2 February 2020 to 25 September 2021. Country-level case data were obtained via the Johns Hopkins COVID-19 global time series on the pandemic [33]; data were constrained to laboratory-confirmed cases only. We explored the progression of average daily COVID-19 case and death incidence per 100,000 people in Africa in four time periods, each roughly five months (~150 days) long: (a) 2 February to 30 June 2020, (b) 1 July to 30 November 2020, (c) 1 December 2020 to 30 April 2021, and (d) 1 May to 25 September 2021. We then converted daily new cases into weekly new cases for each of the countries to match the weekly GHT data up to 25 September 2021, for a total of 86 observations. We calculated weekly incidence rates by dividing the number of cases per week by the total population per country in millions [34]. Country-level population data were collected from the forecasted midyear 2020 estimates from the U.S. Census Bureau [35].
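The conversion from daily counts to weekly incidence per million can be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the flat daily series, and the population of five million are hypothetical, and the assumption that weeks are consecutive 7-day bins starting on 2 February 2020 is ours. Under that assumption the study window covers 602 days, that is, exactly 86 weeks, which agrees with the 86 observations reported above.

```python
from datetime import date, timedelta

def weekly_incidence_per_million(daily_cases, start, population):
    """Sum daily new cases into consecutive 7-day bins beginning at `start`
    and divide each weekly total by the population expressed in millions."""
    pop_millions = population / 1_000_000
    full_days = len(daily_cases) - len(daily_cases) % 7  # drop a trailing partial week
    weekly = []
    for i in range(0, full_days, 7):
        week_start = start + timedelta(days=i)
        weekly.append((week_start, sum(daily_cases[i:i + 7]) / pop_millions))
    return weekly

start, end = date(2020, 2, 2), date(2021, 9, 25)
n_days = (end - start).days + 1      # both endpoints are included
daily = [10] * n_days                # hypothetical: 10 new cases every day
weekly = weekly_incidence_per_million(daily, start, population=5_000_000)
print(n_days, len(weekly), weekly[0])
```

Binning from the first study date keeps every week complete; a different week alignment (for example, to match the GHT reporting weeks) would only require changing `start`.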

Ethics
Human data included in this study were collected from publicly available repositories of anonymized COVID-19 case counts from 54 African countries [1,2]. Thus, the present research did not require review by a bioethics committee.

Google Health Trends data
We downloaded data corresponding to four English terms from the GHT application programming interface (API): 'coronavirus,' 'coronavirus symptoms,' 'COVID19,' and 'pandemic'. Although the four terms are related conceptually, they have the potential to capture a broad spectrum of information specifically related to the studied disease, avoiding non-informative data from less specific words, as has been previously demonstrated [25,36]. We addressed potential language barriers by collecting data for the latter two terms in French and Portuguese. The former two search terms are spelled the same in French and Portuguese, aside from accents, so the English versions of those terms captured a majority of individuals searching those terms in those other languages. We matched the relative search proportions of these words, which are the raw output provided by GHT [25], with the weekly COVID-19 case incidence for the selected time period.

Statistical analysis
We used a multiple linear regression model fitted with the four GHT English search terms as predictors of COVID-19 incidence at the country level for each of the 54 African countries being evaluated. We then performed the same analysis, substituting the latter two terms for their equivalents in French or Portuguese, based on the official or spoken language as determined by Nations Online [37]. The primary outcome measure was the adjusted R² statistic as a measure of the best-fitting models. We recorded the largest adjusted R² value (in absolute value) from the models with all English or English and French/Portuguese terms. We assumed that an R² value >0.5 was the minimum threshold to show relevant associations. If one or more of the four terms chosen did not retrieve search counts from GHT, it was removed from the analysis for that country. At least two terms were included for each region. While multicollinearity may exist in our time series, it will minimally affect the goodness-of-fit as we rely on R² instead of p-values for interpreting the association between dependent and independent variables [38]. Finally, to address possible autocorrelation and heteroskedasticity issues in our time series, we performed first-order differencing and ran another analysis with a weighted least squares regression model, giving larger weight to those observations with lower variance. We also recorded the results from this weighted regression as a more conservative measure.
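The two model fits described above can be illustrated with a small, dependency-free sketch. It is not the authors' R code: the helper names, the synthetic search proportions, and the weighting rule (weights that decrease with the incidence level, used here as a stand-in for "lower variance") are all assumptions made for illustration. The weighted adjusted R² uses the weighted mean of the response, which is the usual convention for weighted least squares.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def adjusted_r2(X, y, w=None):
    """Adjusted R^2 of an ordinary (w=None) or weighted least-squares fit with intercept."""
    n, p = len(y), len(X[0])
    w = w or [1.0] * n
    Z = [[1.0] + list(row) for row in X]          # prepend the intercept column
    k = p + 1
    xtwx = [[sum(w[i] * Z[i][a] * Z[i][b] for i in range(n)) for b in range(k)]
            for a in range(k)]
    xtwy = [sum(w[i] * Z[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(xtwx, xtwy)                      # normal equations
    fitted = [sum(bj * zj for bj, zj in zip(beta, Z[i])) for i in range(n)]
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    ss_res = sum(w[i] * (y[i] - fitted[i]) ** 2 for i in range(n))
    ss_tot = sum(w[i] * (y[i] - y_bar) ** 2 for i in range(n))
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

def diff(series):
    """First-order differences of a sequence."""
    return [b - a for a, b in zip(series, series[1:])]

random.seed(0)
weeks = 86
terms = [[random.random() for _ in range(4)] for _ in range(weeks)]  # hypothetical GHT proportions
cases = [5 + 3 * t[0] + random.gauss(0, 1) for t in terms]           # hypothetical weekly incidence

basic = adjusted_r2(terms, cases)

# Weighted fit on first-differenced series; the weights are a hypothetical variance proxy.
d_cases = diff(cases)
d_terms = [diff(col) for col in zip(*terms)]
d_terms = [list(row) for row in zip(*d_terms)]
weights = [1.0 / (1.0 + abs(c)) for c in cases[1:]]
weighted = adjusted_r2(d_terms, d_cases, weights)
print(round(basic, 3), round(weighted, 3))
```

Because the adjusted statistic penalizes the number of predictors, it can be negative when the search terms explain almost none of the variation.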
Next, we used the adjusted R² statistics collected from the 54 African countries as a dependent variable to explore whether different socio-economic variables might explain the patterns. This analysis was conducted separately for the adjusted R² statistics collected from the basic fitted regression and the weighted regression models, respectively. Socio-economic variables for the African countries included Internet access, demographic, economic, and health indicators (Table 1); data were gathered from the World Bank [39]. We explored logarithmic transformations of each of these variables to determine whether normalization of the indicators led to stronger correlations. We also included a standardized volatility score calculated using the standardized normalized case incidence data of each country as follows:
$$V = \frac{1}{n-1}\sum_{t=2}^{n}\left|Y_t - Y_{t-1}\right|$$
in which n is the total number of observations and Y is the normalized case incidence per country. The average of the absolute difference (i.e., volatility) summarizes the COVID-19 case incidence signal, reflecting whether it is relatively constant or fluctuates broadly from week to week [25]. Overall, we explored a total of 72 potential explanatory variables (Table 1 and S1 Table).
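A volatility score of this kind can be sketched in a few lines. The sketch assumes that "standardized" means a z-score of the weekly incidence series and that the score is the mean absolute difference between consecutive weeks (dividing by the number of differences, n − 1); both readings are our assumptions, and the function name and the two toy series are hypothetical.

```python
from statistics import mean, stdev

def volatility(incidence):
    """Mean absolute week-to-week change of the z-scored incidence series."""
    mu, sd = mean(incidence), stdev(incidence)
    z = [(v - mu) / sd for v in incidence]
    return sum(abs(b - a) for a, b in zip(z, z[1:])) / (len(z) - 1)

smooth = [1, 2, 3, 4, 5, 6]   # hypothetical steadily rising signal
jumpy = [1, 6, 2, 5, 3, 4]    # same values, reordered to fluctuate widely
print(volatility(smooth), volatility(jumpy))
```

Standardizing first makes scores comparable across countries with very different case loads: the two toy series share the same mean and spread, so the difference between their scores reflects only how erratically the signal moves from week to week.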
Variables were analyzed individually using a pair-wise univariate linear regression and collectively in a multivariate stepwise regression, in which variables were added and removed iteratively to obtain a subset of variables providing the best model outcome according to the Akaike Information Criterion (AIC). In addition, variables were analyzed using a least absolute shrinkage and selection operator (i.e., LASSO) regression for both untransformed and log-adjusted data to avoid overfitting and produce simpler models. Countries with missing variable information were removed from the univariate regression with that particular variable (38/72; 53% of variables had at least one country removed, S1 Table), and only variables with information for every country were used in the stepwise and LASSO regressions. All analyses were done for both adjusted R² values collected from the basic regression and the weighted regression. All analyses were performed in R [40]. Data and scripts to replicate the results of this study are available in a GitHub repository accompanying this publication (https://github.com/alxjfulk/GHT-and-COVID19-code).
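To illustrate how LASSO yields simpler models, the following sketch implements a generic coordinate-descent LASSO on standardized predictors. It is not the authors' R implementation: the function names, the penalty value, and the synthetic indicators for 54 hypothetical countries are assumptions chosen only to show how the L1 penalty shrinks weak coefficients to exactly zero.

```python
import random
from statistics import mean, pstdev

def soft_threshold(rho, lam):
    """Shrink rho toward zero by lam; values within [-lam, lam] become exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso(X, y, lam, n_iter=200):
    """Coordinate-descent LASSO on z-scored predictors and a centered response.
    Minimizes (1/2n) * sum(residual^2) + lam * sum(|beta_j|)."""
    n, p = len(y), len(X[0])
    y_mean = mean(y)
    resid = [v - y_mean for v in y]               # residuals when all betas are 0
    cols = []
    for j in range(p):
        raw = [row[j] for row in X]
        mu, sd = mean(raw), pstdev(raw)
        cols.append([(v - mu) / sd if sd > 0 else 0.0 for v in raw])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            scale = sum(c * c for c in col) / n
            if scale == 0:                        # constant predictor: leave at 0
                continue
            # Correlation of predictor j with the partial residual (its own effect added back)
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / scale
            delta = new - beta[j]
            if delta:
                resid = [r - c * delta for r, c in zip(resid, col)]
                beta[j] = new
    return beta

random.seed(1)
indicators = [[random.gauss(0, 1) for _ in range(3)] for _ in range(54)]   # hypothetical
adj_r2 = [0.1 + 0.05 * row[0] + random.gauss(0, 0.02) for row in indicators]
coefs = lasso(indicators, adj_r2, lam=0.01)
print([round(c, 3) for c in coefs])
```

The coefficients are reported on the standardized scale. In practice the penalty `lam` would be chosen by cross-validation rather than fixed by hand.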

Results
Examining the distribution of first cases among the 54 African countries, we observed that dates of first reported COVID-19 cases were centered around March 2020. Egypt (EGY) reported the first case of COVID-19 on the continent on 14 February 2020, 15 days after the World Health Organization (WHO) declared the COVID-19 epidemic an emergency of international concern [41]. Comoros (COM) and Lesotho (LSO) were the last countries to report COVID-19 introductions, with first cases on 30 April and 13 May 2020, respectively (Fig 1). Countries with highest COVID-19 case incidences for the first time period include Djibouti (3.39 cases per 100,000 people), São Tomé and Príncipe (2.25), and South Africa (1.79) (Fig 2). During the second period, Cameroon (10.7), Libya (7.78), and South Africa (7.39) were most affected (Fig 2). For the third and fourth periods, countries across the continent reported increased COVID-19 incidences, with Seychelles (third period = 39.1; fourth period = 109), Tunisia (third period = 12.0; fourth period = 22.8), Botswana (third period = 10.3; fourth (Fig 2). Tanzania had an incidence of 0 for the second and third time periods, which will be discussed below.

Table 1. Independent variables explored in the present study. Different categories were selected based on their perceived potential to explain patterns of Google Health Trends and COVID-19 regression models. We also evaluated the log of each variable, for a total of 72 variables.
Category Indicator
Internet access 1. Percentage of population with access to electricity.
2. Fixed total number of broadband subscriptions in a country.
11. Percentage of people using at least basic drinking water services.
12. Percentage of people using at least basic sanitation services.
13. Percentage of people using safely managed drinking water services.
14. Percentage of people using safely managed sanitation services.

COVID-19 death incidence was recorded for all the African countries in the second time period except for Eritrea, Seychelles, Comoros, Mauritius, Tanzania, and Burundi, although the latter four reported 5.51 × 10⁻³, 4.83 × 10⁻³, 2.39 × 10⁻⁴, and 5.62 × 10⁻⁵ death incidences per 100,000 people during the first period, respectively. Further, South Africa (0.219 deaths per 100,000 people) and Tunisia (0.179) reported the highest death incidence in the second period. For the third period, highest death incidences were again reported in South Africa (0.385) and Tunisia (0.422); for the fourth period, highest incidences were recorded in Tunisia (0.808), Namibia (0.732), and Seychelles (0.584).
A few countries lacked information for one or two of the chosen English terms (6/54; 11.1%); only 'coronavirus' and 'COVID19' always recovered search query counts. Several countries that had French or Portuguese listed as an official language returned no information for either one or both language-specific terms (8/32; 25%, S2 Table). Overall, the adjusted R² values collected to depict the relationship between GHT search queries and COVID-19 weekly incidence were low, never above 0.4 for any of the countries in either the basic regression or the weighted regression (Fig 3). The largest adjusted R² results from the basic regression were for Algeria (0.33), Ethiopia (0.20), and Kenya (0.19; Fig 4). The countries with the lowest adjusted R² results from the basic regression included Burkina Faso (-0.028), Sierra Leone (-0.030), and Sudan (-0.031; Figs 3 and 4, S2 Table). For the weighted regression analysis on the first-order differenced case incidence and GHT data, the countries that returned the largest adjusted R² results were Guinea-Bissau (0.24), Lesotho (0.08), and Niger (0.07; Fig 3). The lowest adjusted R² results came from Zimbabwe, Egypt, and Mauritania, each with an adjusted R² value of -0.05 (rounded; see S1 Fig). Several of the 72 variables were correlated, at least in part, with the pattern of adjusted R² statistics obtained for the 54 African countries. Almost all univariate linear analyses from the basic regression yielded adjusted R² values of 0.25 or less, except for the log of average weekly cases (0.37), log of cumulative total deaths (0.30), and log of fixed total number of broadband subscriptions in a country (0.26, S3 Table). The only adjusted R² value greater than 0.25 from the weighted regression analyses came from the number of community health workers per 1,000 people, although that variable was only available for 26 countries.
The log of average weekly cases, the log of GDP, and the log of the volatility scores yielded low adjusted R² values (R² = 0.120, 0.057, and 0.080, respectively). The stepwise regression analysis on the untransformed data showed that a model including percentage of GDP for current health expenditure, life expectancy (years) at birth, mobile cellular subscriptions per 100 people, total population, GDP per capita, percentage of people using the Internet, total urban population, total number of mobile cellular subscriptions in a country, average weekly cases over the studied period, and (notably) volatility score for a country calculated using weekly incidence yielded an adjusted  Examining the adjusted R² values collected from the weighted regression model, the volatility score for a country calculated using weekly incidence was deemed most useful for both stepwise and LASSO regressions, though the latter method also returned percentage of the population with access to electricity as an important variable (R² = 0.051 and 0.063, respectively). Conversely, using logarithmically transformed variables, a stepwise regression model including average weekly cases over the studied period, percentage of GDP for current health expenditure, and life expectancy (years) at birth yielded an adjusted R² value of 0.47. Using the adjusted R² values collected from the weighted regression analyses, a stepwise regression model selected the average weekly cases over the time period studied as an important variable. LASSO regression analysis of logarithmically transformed variables indicated that a model including percentage of individuals with access to electricity, life expectancy (years), average weekly cases over the period studied, cumulative deaths, percentage of GDP for current healthcare expenditure, and total population gave the highest adjusted R² of 0.45.
Using the adjusted R² collected from the weighted regression analyses, the volatility score for a country calculated using weekly incidence and the average weekly cases were returned, with R² = 0.13. These models yielded adjusted R² values larger than most of the univariate analyses (S3 Table); thus, although we report the variables showing the largest association with the GHT and COVID-19 incidence outputs in Africa, we are cautious in interpreting any of the variables as explanatory, given that R² remained below 0.5 across all models [25,42,43].

Discussion
Despite successful demonstrations of the GHT algorithm to aid infectious disease surveillance for influenza, dengue, and other diseases [26,43,44], our study demonstrates that, in the context of the COVID-19 epidemic, GHT appeared to be difficult to implement as a surveillance tool for COVID-19 incidence and impact. Average weekly cases over the period studied was an important variable when analyzing possible patterns in the adjusted R² values collected from both the basic regression and weighted regression analyses. The volatility score for a country was also an important variable for the applicability of GHT, as demonstrated in our univariate, stepwise, and LASSO models (S3 Table). Finally, indicators related to Internet access (mobile cellular subscriptions per 100 people, total number of mobile cellular subscriptions in a country, percentage of individuals using the Internet, percentage of individuals with access to electricity), health (life expectancy (years) at birth), demographics (total population, total urban population), and economics (percentage of GDP for current health expenditure, GDP, GDP per capita) emerged as important factors in the patterns of GHT and COVID-19 incidence, but they did so inconsistently across modeling approaches and are therefore difficult to interpret (S3 Table).
The top three ranking countries based on adjusted R² values in the basic regression (Algeria, Ethiopia, and Kenya) all seemed to have a similar type of COVID-19 incidence signal (Fig 4, upper panels). Cases begin at zero, spike, and subsequently drop to a lower, but still elevated, level of incidence, followed by additional waves, potentially reflecting an exhaustion of susceptible individuals or dynamics of new variants [45,46]. Algeria, Ethiopia, and Kenya all had strong responses to initial outbreaks of COVID-19 and invested significantly in preventative measures against COVID-19 such as testing, vaccination, and healthcare [47–49]. These three countries also ranked within the top 10 when looking at the total number of mobile cellular subscriptions in a country and GDP (S1 Table). Conversely, Burkina Faso, Sierra Leone, and Sudan are lower-income countries, and have struggled to combat COVID-19 [50–52]; according to World Bank data, they ranked lower than the top-ranking countries in terms of total number of mobile cellular subscriptions in a country (S1 Table). Furthermore, while four out of these six countries had an extremely low percentage of individuals using the Internet (< 20% as of 2017), we found no clear association between GHT patterns and Internet accessibility variables, which may indicate that the way Internet access is currently measured reflects GHT behavior poorly. Interestingly, the three countries with the lowest GHT R² values (Fig 4, lower panels) showed fewer cases and greater variability in their incidence signal compared to the best-performing countries. The combination of these results may indicate that some consistent level of infection is required to sustain the interest of communities searching for information through Google search engines. This gives GHT a chance to match cases, and it may perform better when a rapid growth of infection coincides with interest in the topic and Internet search volume for disease-specific terms is likely to be high, regardless of the level of Internet access.
As in the rest of the world, incidence of both COVID-19 cases and COVID-19 related deaths increased steadily across Africa during the study period. However, in the second and third periods of our study, Tanzania showed zero COVID-19 cases (Fig 2). Upon closer examination, the country stopped reporting coronavirus cases and deaths in April of 2020, so any patterns that might be observed for Tanzania (0.012) actually reflect a lack of data [53].
Although infodemiology approaches represent the next frontier of infectious disease surveillance [19,23], the present modeling effort demonstrates that search queries from GHT are difficult to correlate with incidence of disease in the context of an emerging epidemic. In contrast with diseases such as influenza or dengue that are studied consistently in a seasonal pattern or are endemic to multiple regions [25,43,59], COVID-19 represented an unprecedented case study that might render Google-based information mining ineffective for several reasons: (a) partial or incomplete COVID-19 case detection and reporting [8,60], (b) media-induced search behavior [61], or even (c) information fatigue [36]. Thus, we encourage caution regarding interpretation of COVID-19 modeling experiments based on Google search engines. For example, Ahmad et al. (2020) found a correlation between gastrointestinal search terms obtained through GT and COVID-19 cases and suggested that Internet searches may be useful in predicting COVID-19 cases using a four-week lag in the U.S. [32]. This correlation, however, might be an artifact, since none of the gastrointestinal terms is specific to COVID-19, and the only COVID-19 specific term ('ageusia') increased during the time that the pandemic was declared (i.e., 11 March) and decreased while cases started to increase (Fig 1 in [32]). The U.S. showed an increase in case numbers driven by increasing test capacity; thus, these case numbers reflected disease incidence inaccurately [62].
Thus, although our findings are based on the GHT algorithm, we are cautious about interpreting our results and those of others in characterizing COVID-19 via Google search engines. Similar to our findings, Asseo et al. (2020) found correlations between GT search queries related to smell and taste at the beginning of the pandemic in Italy and the U.S., which faded in succeeding epidemiological weeks [36]. More importantly, Asseo et al. (2020) also showed how correlation patterns break down when analyzing Google search queries and COVID-19 incidence in nonconsecutive weeks (e.g., 11–17 March vs. 1–7 April 2020 in [36]).
We acknowledge some limitations of the present research. Because of the timeframe of the study and the availability of GHT data as weekly counts, we had to convert daily cases to weekly cases, limiting our analysis to only 86 observations and decreasing the statistical power of our approach. Moreover, the four terms related to COVID-19 that were selected might not be as popular in the region as expected. Language might be an important although permeable barrier [25,27]. Still, in the present study the addition of French and Portuguese translations of search terms did not yield significantly higher adjusted R² values (S2 Table). We did not explore the role of media coverage of COVID-19 in web search behavior in Africa, which might be an important confounder for infodemiology studies [63]; however, the lack of GHT and COVID-19 associations found in the present research demonstrates that even in the context of well-covered epidemics, GHT should be used with caution. Finally, we lacked complete data for some of the variables explored (e.g., prevalence of severe food insecurity in the population; S3 Table), which hinders interpretation of several of the indicators used; however, those that were available for all the countries showed some explanatory power, as in other research studies (e.g., total population, signal volatility, disease incidence) [25,43].

Conclusions
Surveillance for an ongoing epidemic via GHT might be useful in specific situations in which accurate case counts can be retrieved and there is a sustained level of disease incidence, as in the case of dengue or influenza; surveillance via GHT for COVID-19 in Africa seems difficult to implement. Google instruments to recover population search counts (GT and GHT) are potentially powerful digital epidemiology tools that can lead to greater insight into disease dynamics, and should be studied and implemented depending on the particular context of an outbreak [25,30,64–67]. Future directions to examine GHT in COVID-19 research include expansion of the analysis to a larger dataset in both time and space. Other refinements can be implemented, for example combining other forms of digital data (e.g., Twitter, Wikipedia) to determine whether the addition of more information improves the predictive power of the model.