Predicting seasonal influenza using supermarket retail records

Increased availability of epidemiological data, novel digital data streams, and the rise of powerful machine learning approaches have generated a surge of research activity on real-time epidemic forecast systems. In this paper, we propose the use of a novel data source, namely retail market data to improve seasonal influenza forecasting. Specifically, we consider supermarket retail data as a proxy signal for influenza, through the identification of sentinel baskets, i.e., products bought together by a population of selected customers. We develop a nowcasting and forecasting framework that provides estimates for influenza incidence in Italy up to 4 weeks ahead. We make use of the Support Vector Regression (SVR) model to produce the predictions of seasonal flu incidence. Our predictions outperform both a baseline autoregressive model and a second baseline based on product purchases. The results show quantitatively the value of incorporating retail market data in forecasting models, acting as a proxy that can be used for the real-time analysis of epidemics.


Introduction
Recent years have seen a growing interest in generating real-time epidemic forecasts through novel digital data streams and machine learning approaches. Seasonal influenza forecasting approaches are leading the way in this rapidly advancing research landscape. Seasonal influenza is still a major burden to the health care systems of countries with 3 to 5 million infected, and 290,000 -650,000 deaths caused by influenza worldwide every year 1 . For this reason, the US Centers for Disease Control and Prevention (CDC) formally pioneered infectious disease forecasting by starting the Flusight consortium focused on prediction of seasonal flu incidence. The CDC seasonal influenza challenge has been remarkably successful in maintaining momentum for a coordinated focus on the operational implementation of disease forecasting. Simultaneously, it fuels the research on developing forecasting models based both on traditional surveillance systems such as influenza-like illness (ILI) incidence captured by the network of outpatient clinics, and novel digital data streams such as search engine queries and social media [1][2][3][4][5][6]. In this context the use of machine learning techniques has received considerable attention [7], and although the use of novel digital data streams as proxy data for disease forecasting did show evident limitations in early approaches, the use of multiple data sources and ensemble of models is now defining the second generation of forecasting tools defining the state of the art in the field.
The pioneer in the use of machine learning and proxy data for flu forecasting has been the famous Google Flu Trends (GFT) platform. The platform was providing forecasts of the current level of influenza-like illness (ILI) incidence in the USA by using search engine queries associated our framework is the data proxy -sentinel baskets -and that any other forecasting method can be applied in this framework.
Our results show that exploiting the information hidden in the retail market data can contribute to predicting the future incidence of influenza. Our findings indicate that the seasonal influenza forecast accuracy improves with the use of retail records and our predictive framework outperforms the baseline autoregressive model with historical ILI reports. More specifically, with two-week and three-week forecasts ahead, forecast performance indicators improve consistently with error estimates decreasing of about 50%. In order to support the rationale behind our choice of sentinel baskets as a proxy for predicting seasonal influenza, we introduce a second baseline using single products' time series of retail market data. Forecasts obtained by using sentinel baskets are significantly more accurate than those obtained using single products' time series. It's not the predictive power of our framework that is important, but rather the increase of the predictive power when we add the sentinel baskets that capture hidden human behaviors adapted to ongoing influenza epidemics. The presented work shows quantitatively the value of incorporating retail market data in forecasting approaches, adding one more dataset to the armory of proxy signals that can be used for the real-time analysis of epidemics. The framework developed in this paper has shed lights on the great potential of combining other predictive approaches (e.g., mechanistic models and/or deep learning models) and assimilating algorithms based on different proxy data [57], thus defining ensemble forecasting methodologies that have proven to achieve the reliability required in the policy-making process.

Results
Our main goal is to study whether retail market data can act as a proxy for predicting influenza. Specifically, our aim is the development of influenza incidence forecasts 4 weeks in advance of the latest ground truth data released from the regular surveillance system. Generally, the release date of the ground truth is delayed by one week, according to the value of k, where k be the k week ahead (k = 1, 2, 3 or 4). Therefore a distinction can be made between hindcast targets (k = 1), i.e. inferring the present influenza incidence value of a week that has already passed by, nowcasting (k = 2), i.e. predicting the influenza incidence value during the week in which the forecast is prepared, and forecasting (k > 2), i.e. predicting the flu activity in the future weeks from the moment in time the analysis is performed.
The novelty of our work lies in the framework we design to tackle this task. We built a data-driven approach, exploiting information extracted from the retail market data, using data mining and machine learning techniques, leveraging customers' behavioral changes during the influenza peak as observed from the items they purchased in their shopping carts. We develop a nowcasting and forecasting framework that makes use of sentinel baskets, i.e., products bought together, to provide estimates for seasonal flu incidence in Italy up to 4 weeks ahead of the latest ground truth data.
We base our analysis on real-world data describing the purchases of the customers of COOP, one of the largest supermarket chains in Italy. This source of data has been used for different purposes, such as identifying successful innovations, meant to be a success later on [58], introducing an alternative metric to GDP by quantification of the average sophistication of satisfied needs of a population [59], creating a personal cart assistant that suggests to the customer the items to put in her shopping list based on a innovative clustering method [60] and finally, describing the buying behavior of different classes of customers, as highly ranked customers that have more sophisticated needs tend to buy niche products, i.e., low-ranked products, and on the other hand, low-ranked, low purchase volume customers tend to buy only high-ranked products, very popular products that everyone buys [61].
We generate influenza activity forecasts for the 2011/12, 2012/13, 2013/14, and 2014/15 influenza seasons. Influenza activity in Italy is officially monitored by the Italian National Institute of Health, "Istituto Superiore di Sanitá" (ISS) and the Interuniversity Research Centre on Influenza (Ciri), through a system called Influnet. As ground truth data and forecast targets, we consider the ILI incidence defined as the number of patients presenting ILI symptoms over all the persons seeking medical attention during a specific week in the network of about 900 sentinel General Practitioners (GPs) and pediatricians of the Influnet system. Fig. 1 displays the predictions against the reported influenza activity level for the four time horizons, 1, 2, 3, and 4 weeks ahead. Overall predictions track the influenza activity level very accurately, as shown in the top panel of the figure. Close inspection shows that the 1 week ahead predictions from the regression model with the sentinel baskets and the reported influenza activity level is very similar, with small errors. For 2, 3, and 4 weeks ahead, our sentinel baskets continue to track rather closely the influenza activity level with some overshooting in some cases. In order to evaluate the forecast performance of our approach we consider standard indicators such as the Pearson correlation, the mean absolute percent error (MAPE) and the root mean square error (RMSE) of the 1-4 weeks ahead forecast time series with respect to the ground truth provided by the Influnet system. In Table 1, we report these indicators calculated over all the influenza seasons considered here. Specifically, we report the performance of forecasts obtained by considering the top 1 and top 5 most correlated sentinel baskets called Basket-1 and Basket-5, respectively. We test numerically that the performance remains stable, increasing the number of baskets considered in the predictive framework. Along with the results from our forecast framework augmented with the sentinel basket data, we report the performance of two baseline forecast approaches: i) autoreg: this approach only uses historical Influnet data via the autoregressive model; ii) Product-5 : this baseline forecast method integrates as proxy data the time series of the most correlated products put together in a basket in the same prediction model of our main approach.
From Table 1, it is evident the added value of using the sentinel baskets over a simple historical autoregressive approach and simple product purchases. Forecasts obtained with the sentinel basket approach are significantly more accurate compared to the baseline approaches, especially in the 3 and 4 weeks ahead, time horizon. It is worth remarking that the autoreg baseline has a better performance in comparison to the Product-5 baseline. As expected, we also remark that the performance of forecasts deteriorates as the time horizon increases. We report the results of individual influenza seasons 2011/12 to 2014/15 in the Appendix.
To assess the statistical significance of the improved prediction power of the sentinel basket approach, we report its relative efficiency with respect to the baseline approaches in Table 2. The relative efficiency between two approaches is defined here as the ratio of the mean-squared error of Approach 2 to that of Approach 1 [62]: We also report the 95% confidence interval for the relative efficiency. The relative efficiency can be estimated by the time series stationary bootstrap method [63], where the replicated time series of the error residual is generated using random blocks with mean length 14 (which corresponds to the on-season weeks with an ILI rate value greater than the threshold of 0.02). Table 2 shows that our approach is estimated to be almost twice as efficient as the autoregressive baseline, and the improvement in accuracy is highly statistically significant. We do not include the second baseline of the single products' time series, as its predictive power proved to be rather low. Comparing our results with [57], the only influenza nowcasting and forecasting approach in Italy, where the authors used data extracted from a Web-based participatory surveillance system, we succeed in predicting influenza incidence with higher accuracy reducing the error significantly.

Discussion
In this study we propose the use of a novel data source, namely retail market data, as a proxy for predicting seasonal influenza. The rationale behind our choice is that customers' behavioral changes are reflected in the items purchased in a shopping basket, thus providing a valuable proxy for the spread of seasonal influenza. We make use of a regression model (SVR) to produce our forecasts for 1 to 4 weeks ahead. We need to emphasize that the most important component in our framework is the data proxy -sentinel baskets -and that any other forecasting method can be used instead of the SVR model. We compare the results obtained with the sentinel basket approach with a baseline autoregressive model (autoreg) that considers only historical influenza data from the traditional surveillance Influnet.
The analysis of the results obtained for the Italian flu season from 2011 to 2015 shows the superiority of the sentinel basket approach. The forecasts consistently outperform the baseline autoregressive model, thus proving the added value of incorporating retail market data quantitatively. The retail market data we use for our approach are in the form of sentinel baskets (Basket-1 and Basket-5) and not just a basket of simple time series of single products, such as in the second baseline (Product-5 ), where we use the most correlated products with the influenza adoption trend. We demonstrate that the use of single products' time series does not produce the same results as using our sentinel baskets. The predictions of our approach are significantly more accurate than the predictions with the use of a basket of single products in all four-week forecasts. We need to stress the fact that we obtain a noticeable increase in the predictive power of our approach when we add the sentinel baskets.
The results we present here are for influenza-like illnesses at the national level within Italy. Nevertheless, our approach shows promise to be easily extended to accurately track not only influenza in other countries where similar data sources are available but also other infectious diseases. Although the predictive framework is outperforming the baseline approaches, it is possible to envision the use of retail market data in the context of multi-data and ensemble approaches, thus contributing to state of the art performing forecasting schemes. Furthermore, retail data are available at the very fine geographical resolution, thus opening to the definition of proxy data for forecasting at a regional and urban level where ground truth for Influenza Incidence data are available.

Materials and methods
In this section, we describe the data used in our study, highlighting their main characteristics. Additionally, we describe our predictive framework and its main components.

Data Description
First, we describe the influenza activity data in Italy as captured by Influnet. In addition, we describe the retail market data describing the purchases of the customers of COOP supermarkets all over Italy.

Influenza Data
In developed and developing countries, there are national syndromic (i.e., based on observed symptoms) surveillance systems for influenza-like illness (ILI). These systems monitor levels of ILI cases among the general population by gathering information from physicians, known as sentinel doctors, who record the number of people seeking medical attention and presenting ILI symptoms. Influenza activity in Italy is officially monitored by the Italian National Institute of Health, "Istituto Superiore di Sanitá" (ISS) and the Interuniversity Research Centre on Influenza (Ciri), through a system called Influnet. The Influnet system collects data from a network of about 900 sentinel General Practitioners (GPs) and pediatricians. It compiles a weekly report in which the national and regional incidence rates by age group are published during the winter season, generally from week 42 to the last week of April of the following year (around week 17). The data cover about 2% of the Italian population. Doctors who participate in the monitoring are required to identify and write down daily, on their register, each new case of influenza. Each week, they transmit the aggregate number of cases seen by any physician (divided by age groups and by risk category) to the relevant Reference Center. The ISS processes the data at the national level and produces a weekly report. Data are published with at least one-week lag, and typically new reports provide a first estimate of the weekly ILI incidence, which is then updated in the following weeks as more data from sentinel GPs are recorded. We collected the Influnet reports for five influenza seasons, from 2011/12 to 2014/15, from week 42 to week 17. The reports are publicly available at the website of Influnet 2 .
We have to mention that our analysis is performed on national influenza data because regional influenza data are not reliable enough. This is equivalent to consider that influenza spreads in a relatively homogeneous way all over the country, which for a small country as Italy is a reasonable assumption.

Retail Market Data
We base our analysis on real-world data about customer behavior. We use a retail market dataset describing the purchases of the customers of COOP, one of the largest supermarket chains in Italy. An important dimension of the data regards the company's classification of products: there is a tree organization, and the hierarchy is built on the product typologies. The top-level of this hierarchy is called "Area" that splits the products into three fundamental categories: "Food", "No Food", and "Other" that refers to medical products. The leaves of the tree are at the bottom level of the hierarchy, called "Item". The marketing hierarchy goes like that: i) Area (3 values There are several conceptual issues in using the lower level of the hierarchies of the product typologies. For instance, the distinction between different packages of the same product as specified at the "Item" level, e.g., different sizes of bottles containing the same liquid, is not of interest in our study. Equally, the distinction between products of different brands, e.g., milk from company A or B, is not of interest in our study ("Segment" level). A way to solve this issue is to use the marketing hierarchy, substituting the item with its marketing "Subcategory" value. As a result, we reduce the cardinality of the dimension of the products (from 571,092 to 2,665), aggregating logically equivalent products. Throughout our study, we will refer to those subcategories as products.
We analyzed a dataset of 30M shopping sessions that occurred in Livorno province, one of the best-represented areas of Italy, with regards to the number of shops in the area, as well as the number of loyal users, over 2010-2015, corresponding to about 150,000 active and recognizable customers. A customer is active if there is at least one purchase during the data time window, and she is recognizable if the purchase has been made using a loyalty card. Customers are provided with a loyalty card that allows linking different shopping sessions, and therefore reconstruct their personal shopping history. The 138 stores of the company cover the whole west coast of Italy, selling 571,092 different items. For each customer, we have N∼150 baskets, D∼100 different items, and an average basket length T of ∼8 items.

Predictive Framework
Two main algorithmic components compose the forecasting approach proposed here: i) the sentinel baskets discovery from the previous influenza season s − 1; and ii) the use of the sentinel baskets for prediction in the current influenza season s. In Fig. 2, we provide a diagrammatic illustration of the predictive framework. Proposed approach workflow. We consider the time series of all the products purchases (volume) and rank them according to their correlation with the flu incidence time series as obtained from the Influnet surveillance system in Italy. Then for each product we identify the customers that bought it during the influenza peak, and construct all their basket purchases during the same period. We obtain the composite time series for the most frequent baskets, and use as our sentinels those with the highest correlation with the previous year's flu season. Finally, we feed these signals into the prediction model and we produce the forecasts for the flu incidence. The following steps summarize the definition of the sentinel baskets (see Algorithm 1): • We construct the time series S p of the volume of purchases at a weekly level for each product p ∈ P . We select the sentinel products P that are more correlated with the influenza adoption trend I calculating the Pearson correlation measure between {S p , I}∀p ∈ P (see Algorithm 1, lines 2-4).
• For each of the sentinel products we identify the sentinel customers, customers C P that bought December 18, 2020 8/17 them during the influenza peak [T − 2, T + 2] (see Algorithm 1, line 5-9).
• For all sentinel customers c ∈ C P , we obtain all their purchases during the same period, and we create a pool of all their baskets B. We apply the Apriori algorithm to identify the most frequent baskets B f . We select the baskets that are more correlated with the influenza adoption trend I to be sentinel baskets, B (see Algorithm 1, lines [10][11][12][13][14].

Algorithm 1: Sentinel Baskets Discovery
Data: Sp-products' time series, I-influenza time series, R-receipts Result: B-sentinel baskets 1 P ← ∅; C P ← ∅; B ← ∅; B f ← ∅; B ← ∅ ; // initialize the sentinel products, sentinel customers, pool of baskets, frequent baskets, and sentinel baskets 2 for p ∈ P do // for each product Once the sentinel baskets have been identified, during each week of the current influenza season, t s , we use their corresponding volume time series along with the past influenza incidence data in order to train a regression model of the future incidence values of influenza for 1 to 4 weeks ahead. More precisely, we proceed according to the following steps: • We construct the composite time series, S, for each of the sentinel baskets B, where we add the volume of purchases at a weekly level for each product p ∈ B up to week t.
• We introduce the regression model whose coefficients are solved by Support Vector Regression (SVR). For each forecasting week t s and forecasting target of k week ahead, the SVR model is trained by the data starting from the first week in the previous season s − 1 to the last week t s − 1.
• For the prediction, the regression model makes use of the historical ILI reports available till week t − 1 and sentinel baskets data available till week t.
Details on the various components of the proposed forecast framework are reported in the following sections.

Sentinel Products
The first necessary step to learn the sentinel baskets from the previous influenza season s − 1 is the discovery of the sentinel products. We need to define the time granularity of our observation period for the retail market data. We choose to use a weekly aggregation mainly because influenza reports are on a weekly base. We prepare the retail market data in order to correspond to the weekly reports of influenza, and we work on a "Subcategory" level. We report the weekly sales for each of the products p ∈ P for all the weeks of interest (42nd week of the year until the last week of April of the following year), producing the final retail time series S p . It is crucial to notice that even working at an aggregated level in the retail hierarchy, our time series are still 2,665. So it is imperative to filter out the products that are not correlated with the influenza adoption trend I, so we can work mainly with products that have a similar adoption trend. We choose to use the Pearson Correlation, as it is one of the most commonly used correlation measures. In statistics, the Pearson correlation coefficient [64], also referred to as Pearson's r, is a measure of the linear correlation between two variables x and y. It has a value between +1 and -1, where 1 is total positive linear correlation, 0 is no linear correlation, and -1 is total negative linear correlation. Using Pearson correlation coefficient we calculate the correlation r between each product's time series S p with the influenza time series I and we filter out the time series with a low correlation in order to identify the products that have adoption trend similar to the influenza trend, the most correlated sentinel products P = {p|p ∈ P, r(S p , I) > δ}, where δ > 0.2 to exclude products with weak or no correlation.

Sentinel Customers
We are interested in studying human behavior mainly during the influenza peak of the previous influenza season s − 1. We identify the influenza peak week at time T and we define the period of interest [T − δ, T + δ] where [δ] is the width of the time window. We used [δ] = 2 in our experiments so we have a period of interest of 4 weeks (∼1 month) which is the typical length of the period that the influenza is at its peak. Using the sentinel products in P, we trace their sales during the period of interest, and we identify the customers that bought them through the receipts matching each customer with her corresponding purchases. These customers become our sentinel customers denoted with C P . We are interested in the purchases of these specific customers since those individuals would have a higher possibility to be either infected or close to an infected individual. We have to notice that customers are using loyalty cards, linking them with their purchases throughout the whole period of interest and that a loyalty card normally represents the whole household, with the probability of more than a person per household.

Sentinel Baskets
In the final step of discovering the sentinel baskets from the previous influenza season s − 1 and preparing their time series for the current influenza season s, we are working backwards. Using the sentinel customers C P , we track all their purchases during the period of interest, through their receipts, and we obtain their corresponding baskets, where each basket b contains products bought together under the same receipt b = {p 1 , p 2 , ..., p n |p i ∈ P }. We obtain the baskets for each customer c ∈ C P , and we create a pool of baskets B, discarding the information of who bought what. It is worth stressing that since we are interested in the information contained in the products bought together and in the patterns we can extract through customers behaving similarly, a key component of our approach is the Apriori algorithm [65].
The Apriori algorithm is an algorithm for frequent itemset mining and association rule learning over transactional databases. The algorithm uses a bottom-up approach where it identifies the most frequent individual items in the database and extends them to larger and larger itemsets as long as those itemsets satisfy a minimum threshold frequency. The algorithm terminates at the moment that no further successful extensions are found. It uses a breadth-first search and a Hash tree structure to count candidate itemsets efficiently. It generates candidate itemsets of length k from itemsets of length k-1. Then it prunes the candidates who do not have a frequent sub pattern. According to the downward closure lemma, the candidate set contains all frequent k -length itemsets. After that, it scans the database to determine frequent itemsets among the candidates.
Using the Apriori algorithm, we extract the most frequent baskets B f in our pool. For every product in each of the most frequent baskets, we obtain the corresponding time series. Then for each basket, we create a cumulative value of all the products that belong to it, and we create the corresponding composite time series S B f . We use the measure presented in Step 1, and we calculate the Pearson correlation between the ILI time series I and the time series for each of the baskets S B f , and we keep the 5-most correlated baskets, the sentinel baskets B ∈ B f .
In order to construct the sentinel basket time series for the current influenza season s, we extract the time series for each of the products that belong to the sentinel baskets, p ∈ B that we had obtained from the previous season. We repeat the procedure mentioned before, as for each sentinel basket, we create a cumulative value of all the products in it, thus creating its corresponding composite time series S B , S for simplicity. We will incorporate these time series with the historical ILI reports in the prediction model described below.
We should also note that we learn the sentinel baskets only from one previous season and not more to avoid introducing biases from changes that may occur in the retail market database, as new products may appear and older disappear.

Forecast Models
The baseline model is inspired by Autoregression (AR) which suggests a linear relation between the current and previous values of a time series. Let I t s be the logit transformed ILI at week t in season s, k be the k week ahead (k = 1, 2, 3 or 4). The baseline model could be written as where h is the window size, α and a i are the regression coefficients. We include the sentinel baskets' time series S as an exogenous signal, where S t s be the value of S at week t in season s, yielding: Note that j starts from −1 because the retail market data is up-to-date while ILI has one week lag such that S t has an extra week of data than I t .
Further more, to test the sensitivity of the model on the number of sentinels, we expand the model as where N S is the total number of sentinels, and S n is the nth sentinel. It is essential to notice that by extending the model to incorporate more sentinel baskets, we can capture more shopping behaviors and with greater variance. In the forecast model (5), I t s +k is the dependent variable and {I t s −i }, {S n t s −j } are the the explanatory variables. The model makes use of the historical ILI reports available till week t − 1 and sentinel baskets data available till week t. Table 3  We make use of the Support Vector Regression (SVR) model with radial basis function (rbf) kernel in order to solve the coefficients of the above autoregression and regression models. Since the sentinel baskets B S for t s are generated from season s − 1, the data earlier than that is not included in the training data. For each forecasting week t s and forecasting target k, the SVR model is trained by the data starting from the first week in the previous season s − 1 to the last week t s − 1. For SVR model with rbf kernel, there are two hyperparameters which are regularization parameter C and kernel width γ that need to be defined [66]. We set the range of parameters as C ∈ [1, 1e4], γ ∈ [0.01, 2.0] and window size h ∈ [2,6]. We select the window size h and SVR-related hyperparameters by Grid Search and 5-Fold Cross-validation.

Performance Indicators
We consider the following indicators to assess the performance of the forecast approaches with respect to the ground truth influenza incidence. Our notation is as follows: y t denotes the observed value of the influenza at time t, x t denotes the predicted value by the model at time t,ȳ denotes the mean or average of the values y t and similarlyx denotes the mean or average of the values x t .
Pearson Correlation, a measure of the linear dependence between two variables during a time period [t 1 , t n ], is defined as: | y t − x t y t |) × 100 (7) Root Mean Square Error (RMSE), a measure of prediction accuracy that represents the square root of the second sample moment of the differences between predicted values and true values, is defined as: In order to be similar to MAPE we multiply RMSE with 100 to make it percentage error as well.

Data and Code Availability
We provide the data and code of our study for reproducibility in https://github.com/jeannetm/ predict_influenza_with_retail_records. However, we do not include the COOP files regarding the retail records as they are not publicly available data.