SE, KH, JD, and JK are paid employees of IBM Almaden Research Center. All other authors have declared that no competing interests exist.

Conceived and designed the experiments: SE KH JK. Performed the experiments: KH MF CT SE AH. Analyzed the data: BA CT KH AH AK JL. Contributed reagents/materials/analysis tools: SE AH JK MF CT JL. Wrote the paper: SE KH JD JK. Preprocessed/organized raw food sales data: CT BA AK. Designed/wrote algorithm: SE. Planned/clustered data: AH CT. Proposed use of retail data with public health data: JK.

Current address: Department of Statistics, Purdue University, West Lafayette, Indiana, United States of America

Foodborne disease outbreaks of recent years demonstrate that, due to increasingly interconnected supply chains, these types of crisis situations have the potential to affect thousands of people, leading to significant healthcare costs, loss of revenue for food companies, and, in the worst cases, death. When a disease outbreak is detected, identifying the contaminated food quickly is vital to minimize suffering and limit economic losses. Here we present a likelihood-based approach that has the potential to reduce the time needed to identify possibly contaminated food products. The approach exploits food product sales data and the distribution of foodborne illness case reports. Using a real world food sales data set and artificially generated outbreak scenarios, we show that this method performs very well for contamination scenarios originating from a single “guilty” food product. As it is neither always possible nor necessary to identify the

Response to foodborne disease outbreaks is complicated by the globalization of our food supply chains. Rapid identification of contaminated products is essential to limit the damage caused by foodborne disease. Worldwide, foodborne disease outbreaks are responsible for $9B a year in medical costs and over $75B in economic losses. Yet the data required to accelerate the identification of suspicious food already exists as part of the inventory control systems used by retailers and distributors today. Combining this retail data with public health case reports has the potential to hasten outbreak investigations and provide public health investigators with better information on suspected products to test. This paper demonstrates the feasibility and efficiency of this approach in principle. Based on these findings, we conclude that in foodborne disease outbreaks retail data could be used to speed and target public health investigations, and consequently reduce the number of illnesses and deaths as well as the economic losses to the industry.

In recent years global trade has significantly altered the topology of food supply chains

In a previous study, as a possible strategy to achieve this goal, we proposed a likelihood-based method that could be applied as an early response system to help determine the product most likely to be associated with a foodborne disease outbreak

In the work reported here, we test our likelihood-based method using raw food sales data. As a simplifying assumption, we model food consumption at the point of sale region. In future work, we will test this assumption by applying Huff's “gravity model” for retail shopping to smooth the sales distribution over other regions

In applying the likelihood-based method to real world sales data, we use a ROC (receiver operating characteristic) analysis to quantify the performance of the method, comparing two different classifiers. This analysis also identifies the optimal discrimination threshold to maximize performance as a function of both the sensitivity and specificity of the likelihood-based analysis. Additionally, we explore how the method's performance may depend on “structural” properties of the sales data distribution, as this understanding is essential for efforts to proactively predict which contaminated foods or food groups might be hard to pinpoint in the event of an outbreak.

We apply product specific retail sales data from stores of a German food retail company covering 3,513 of Germany's 8,235 postal zones. The dataset lists the weekly sales of 580 anonymous food products (N = 580). For application in this analysis, sales data were aggregated per postal zone and product over the three-year period 01/2008 to 12/2010. Let _{s}(n, r)

The underlying assumption of outbreak pattern generation is that for each product the distribution of sales across the postal codes reflects the true consumption pattern for that food _{c}(n, r)

Notice that for a given product _{c} (n, r)

We take advantage of this when generating synthetic outbreak case reports for a selected “contaminated” product _{c} (x, r)_{c} (x, r)
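The outbreak generation step can be illustrated with a minimal sketch. It assumes, as stated above, that a product's sales distribution across postal zones stands in for its consumption distribution, and it additionally assumes that case locations are drawn independently from that distribution. The function names, the toy sales figures, and the postal zones are hypothetical and are not taken from the study data.

```python
import random
from collections import Counter

def consumption_distribution(sales_by_region):
    """Normalize one product's sales per postal zone into probabilities."""
    total = sum(sales_by_region.values())
    if total <= 0:
        raise ValueError("product has no recorded sales")
    return {region: units / total for region, units in sales_by_region.items()}

def simulate_outbreak(distribution, n_cases, rng):
    """Draw n_cases case locations independently from the distribution
    (independence of cases is an assumption of this sketch)."""
    regions = list(distribution)
    weights = [distribution[r] for r in regions]
    return rng.choices(regions, weights=weights, k=n_cases)

# Hypothetical toy sales for one "contaminated" product (units per postal zone).
toy_sales = {"10115": 600, "20095": 300, "80331": 100}
p_c = consumption_distribution(toy_sales)

# A seeded generator keeps the synthetic outbreak reproducible.
cases = simulate_outbreak(p_c, n_cases=20, rng=random.Random(0))
case_counts = Counter(cases)
```

Repeating the draw with different seeds would give the kind of per-trial outbreak patterns that are averaged in the analysis.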

For each product the results are averaged over 50 trials. For each trial, the x axis is sorted from most to least frequently occurring location to show the outbreak pattern.

An outbreak can be described by the set of locations $\{r_i\}$, where $r_i$ is the location of the $i^{th}$ reported case.

Let _{i}_{k}(m)

Let

In this method, normalization of _{th}
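As a rough illustration of how a likelihood-based ranking of this kind might be computed, the sketch below scores each product by the sum of log probabilities of the observed case locations under that product's consumption distribution. This is an assumed form, not a reproduction of the exact expressions used in the paper; the `floor` smoothing constant, the function names, and the toy distributions are hypothetical.

```python
import math

def log_likelihood(distribution, case_locations, floor=1e-9):
    """Sum of log probabilities of the case locations under one product.

    `floor` is a hypothetical smoothing constant: a product with no sales
    in a case location is penalized heavily but not ruled out entirely.
    """
    return sum(math.log(max(distribution.get(loc, 0.0), floor))
               for loc in case_locations)

def rank_products(distributions, case_locations):
    """Return (product, log-likelihood) pairs, most likely product first."""
    scores = {name: log_likelihood(dist, case_locations)
              for name, dist in distributions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical consumption distributions over three postal zones.
toy_distributions = {
    "product_A": {"10115": 0.6, "20095": 0.3, "80331": 0.1},
    "product_B": {"10115": 0.1, "20095": 0.3, "80331": 0.6},
}
toy_cases = ["10115", "10115", "20095", "10115"]
ranking = rank_products(toy_distributions, toy_cases)
```

Working in log space avoids numerical underflow when the number of cases grows, which is one reason this form is a common choice for such rankings.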

We run the analysis varying the contaminated product, _{th}_{x,m}

Statistic

Here we assume the ≥ test returns 1 when satisfied, 0 otherwise. Essentially we sum the total number of outcomes where the ratio of “guilty” product

To define the false positive rate for a contaminated product

Next we compute the number of false positives:

The average false positive rate is now defined as:

In the analysis, we use the thresholds
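The counting logic behind a success rate and an average false positive rate can be sketched as follows. This is only an illustration under stated assumptions: each product's likelihood is normalized to a relative probability, a product is flagged when that value meets the threshold, and the false positive rate is the flagged fraction of innocent products averaged over trials. The exact statistics in the paper may differ; all names, the threshold value, and the toy log-likelihoods are hypothetical.

```python
import math

def relative_probabilities(log_likelihoods):
    """Normalize per-product log-likelihoods so the results sum to 1."""
    peak = max(log_likelihoods.values())
    weights = {k: math.exp(v - peak) for k, v in log_likelihoods.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}

def rates(trials, guilty, threshold):
    """Return (success rate, average false positive rate) over trials.

    Each trial is a dict of per-product log-likelihoods. The >= test
    counts as 1 when satisfied and 0 otherwise.
    """
    hits = 0
    fp_fraction = 0.0
    for log_liks in trials:
        probs = relative_probabilities(log_liks)
        hits += int(probs[guilty] >= threshold)
        innocents = [k for k in probs if k != guilty]
        flagged = sum(int(probs[k] >= threshold) for k in innocents)
        fp_fraction += flagged / len(innocents)
    return hits / len(trials), fp_fraction / len(trials)

# Two hypothetical trials with product "A" as the guilty product.
toy_trials = [
    {"A": -3.0, "B": -6.0, "C": -9.0},
    {"A": -5.0, "B": -4.0, "C": -9.0},
]
success_rate, false_positive_rate = rates(toy_trials, guilty="A", threshold=0.5)
```

Sweeping `threshold` over a range of values and plotting the two rates against each other would trace out a ROC curve for the classifier.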

In order to analyze how different food distribution patterns can influence the performance of the likelihood-based method, the similarity of the distribution patterns of the food products was measured by calculating the pair-wise Spearman's rank correlation coefficient,

Since Spearman's
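For readers who want to reproduce the similarity measure, a minimal sketch of Spearman's rank correlation is given below. It computes the Pearson correlation of the rank-transformed sales vectors, assigning tied values the mean of their ranks. The function names and the toy sales vectors are hypothetical; the coefficient is undefined (division by zero) when a product has identical sales in every zone.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical sales of three products over the same four postal zones.
sales_a = [600, 300, 100, 0]
sales_b = [550, 320, 90, 5]    # same ordering of zones as product a
sales_c = [0, 100, 300, 600]   # reversed ordering of zones
rho_ab = spearman(sales_a, sales_b)
rho_ac = spearman(sales_a, sales_c)
```

Because only the ordering of zones matters, two products with very different sales volumes can still be perfectly rank-correlated.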

In order to evaluate its performance the method has been applied to a real world dataset of 580 food products with known distribution patterns across Germany

To assess the performance of the likelihood-based method statistic

Taking advantage of the likelihood-based approach we can also assess the relative probability for

To visualize the performance statistic

Using the Spearman's rank correlation coefficient

For large correlation, the contaminated product cannot always be uniquely determined.

The data in

Consider ‘Y’ products with

As noted, a high degree of similarity between the distribution patterns of the food products under investigation and the spatial pattern of the contaminated “guilty” product implies that it will be difficult to correctly identify the causative food item. To describe and visualize this property of the food data set, we calculate the correlation matrix and apply hierarchical clustering algorithms.
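The clustering step can be sketched in a few lines. The example below is an assumption-laden illustration rather than the procedure used for the figures: it uses average-linkage agglomerative clustering on the distance 1 minus the correlation, and it stops merging once the closest pair of clusters is farther apart than a cut-off. The linkage choice, the interpretation of the cut-off as a distance, and the toy correlation matrix are all hypothetical.

```python
def cluster_products(names, corr, cutoff):
    """Agglomerative average-linkage clustering on distance 1 - correlation.

    Merging stops once the closest pair of clusters is farther apart than
    `cutoff`. Returns a list of clusters, each a sorted list of names.
    """
    clusters = [[i] for i in range(len(names))]

    def distance(a, b):
        # Average pairwise distance between members of two clusters.
        pairs = [(i, j) for i in a for j in b]
        return sum(1.0 - corr[i][j] for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        best = min(
            ((distance(clusters[p], clusters[q]), p, q)
             for p in range(len(clusters))
             for q in range(p + 1, len(clusters))),
            key=lambda t: t[0],
        )
        if best[0] > cutoff:
            break
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(names[i] for i in c) for c in clusters]

# Hypothetical pair-wise rank correlations between four products.
toy_names = ["A", "B", "C", "D"]
toy_corr = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.0, 0.1, 0.8, 1.0],
]
toy_clusters = cluster_products(toy_names, toy_corr, cutoff=0.25)
```

In this toy example, products A and B end up in one cluster and C and D in another, which mirrors the idea that products within a cluster are hard to tell apart from case locations alone. For a data set with hundreds of products, a library implementation would be a more practical choice than this quadratic-per-merge sketch.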

This figure depicts the correlation matrix map sorted by clusters.

Different colors indicate different clusters, defined by a cut-off value of 0.25. (Note that colors were used multiple times, i.e., non-adjacent clusters of the same color are not related in any special way.)

The

To characterize the clusters observed in the dendrogram in

For illustration purposes, all product clusters containing exactly three products are displayed. Clusters are arranged in two columns of seven clusters each. Other cluster sizes exhibit similar correlations between product distribution patterns. This image is published with permission from Esri and its data providers, and from Michael Bauer Research GmbH, Nürnberg, Germany; Data Source: Microm 2013.

This analysis shows how, when information on the food distribution channels is available, likelihood-based methods can quickly identify those products likely to be causing an outbreak using the geographic locations of even relatively few cases. However, these methods assume that food distribution channels are well characterized, which may rarely be the case. Nevertheless, our methods could be extremely useful for retail companies that want to assess which of their own products could potentially be involved in an ongoing disease outbreak, or to identify chains or individual stores that should be prioritized for investigation. In practice, multiple products may be contaminated by a single food ingredient. Here we use a very simple model of the probability of individuals consuming food from particular shops, which may be quite different from real consumption patterns.

In this paper we also make the simplifying assumption that food is consumed where it is sold. In fact, people travel. In the future, it is possible to extend the current work by adding Huff's “gravity model” for retail shopping behavior

This analysis also provided some fundamental insights into the relationship between the method's performance and inherent properties of the analyzed food sales data. We could confirm that the degree of similarity of the spatial food distribution patterns determines how quickly the likelihood method will converge on a finite suspect product set size. Generally, the maximum pair-wise correlation with the actual contaminated product is negatively related to success rate, and positively related to the number of cases required for a perfect prediction. This suggests that it may be beneficial to identify groups of products as likely to contain the tainted food, rather than focusing on finding one product.

Additionally, it has been shown that relevant intrinsic properties of the food sales data can be visualized by applying hierarchical clustering algorithms. This method provides a helpful graphical summary of the spatial similarity of food distributions. Further, on the basis of clusters generated by this algorithm, it is shown that log cluster size has a negative, linear relationship with success rate. This suggests that, as the number of products distributed similarly to the contaminated product increases, our ability to consistently identify the contaminated food from a small number of cases decreases. Products with highly correlated distributions are harder to identify than products with uncorrelated distributions. Since correlated product clusters can be identified proactively, suspect products can also be grouped for analysis, accelerating an outbreak investigation.

Retail sales data were provided by SymphonyIRI Group GmbH, Germany.