A Likelihood-Based Approach to Identifying Contaminated Food Products Using Sales Data: Performance and Challenges

James Kaufman; Justin Lessler; April Harry; Stefan Edlund; Kun Hu; Judith Douglas; Christian Thoens; Bernd Appel; Annemarie Käsbohrer; Matthias Filter

doi:10.1371/journal.pcbi.1003692

Reader Comments

Supplementing epidemiological investigations in foodborne outbreaks

Posted by hoehle on 24 Jul 2014 at 09:23 GMT

I congratulate the authors on a nice paper showing alternative ways
for detecting the source in a foodborne outbreak. Recent outbreaks
have shown the importance of understanding the flow of food from
production to consumer and several suggestions have been made on how
to supplement the epidemiological investigation with such
information. However, most of these approaches have focused on the
cases only and thus try to find alternative ways for establishing a
base-line pattern to compare with. The present paper uses the regional
variation of food sales data for trying to detect the food source of
an outbreak.

Altogether, it is a welcome paper which by simulation studies
investigates the potential of such an approach and analyses the
available actual food sales data in more detail. Having read the paper
with interest for its potential in detecting the source of a foodborne
outbreak, I would, however, like to make a number of points.

* Outbreaks consist of individuals and individuals are
heterogeneous. One way to be heterogeneous is to live in different
places, but, with respect to food sales, other variables describe
heterogeneity even better than where you live (age, sex, income,
vegetarian, ...). Furthermore, just because you buy something does
not mean you actually consume it. The authors are good at pointing
out the latter limitation, but I miss a discussion of the
former. Had the food sales data been available at, e.g., age/sex
resolution this would probably be an even better marker than
region. One trend in epidemiological methods is to try to use food
surveys to quickly determine a base-line to compare food consumption
of cases with (Keene et al., 1997). Supermarket sales data would be
one good addition in this respect and I hope that in the future
it is possible to anonymize such data sets appropriately such that
more detailed consumer information becomes available (consumer cards
are aimed at gathering such information for marketing purposes
already).

* Since outbreaks are about individuals, but the sales data are only
available as quantities not linked to individuals, the authors have
to make the assumption that the probability that an individual
buys/consumes a food item j is proportional to the number of units
sold in the region of the individual. However, this might fail to be
a good proxy for whether an individual really consumes the food item
(thinking of different pack sizes, shelf-life, seasonal variation,
etc. but also due to the above described individual heterogeneity, e.g.,
kids usually don't buy their own food). Furthermore, people
might acquire/consume food items in different ways than buying them at
the supermarket (e.g. restaurant/canteen/catering), which usually
have very different distribution channels. Altogether, one strength
of epidemiological case-control studies is that they focus on the
individual by contrasting diseased individuals with healthy
individuals. This eliminates the need for (coarse) assumptions about
the probability of being exposed to a food item.

* To simplify calculations the authors assume that the probability
$\xi$ that an individual becomes sick from food item j and is
reported/detectable as a case is homogeneous over food items and
regions. However, once the particular type of disease/pathogen in an
outbreak is known (say STEC, Salmonella, Listeriosis), $\xi$ is
definitely going to vary by food item (for listeriosis gummi bears
are less likely to be the source than raw milk cheese). Furthermore,
when using a routine monitoring system cases might be more likely to
be reported in one region than in another. Since the method is all
about trying to relate case-patterns to sales-patterns such regional
reporting differences might ruin any hope of identifying the food
item whose sales pattern best matches the case pattern.

* It would be interesting (even imperative!) to see how the suggested
approach works with actual outbreak data from a supermarket oriented
foodborne outbreak. The authors should, as a next step, try to
substantiate their analysis by retrospectively applying, in
co-operation with relevant public health authorities, the suggested
approach to actual outbreak data. I think many potential caveats
like the ones discussed above are only revealed when looking at real
life data.

Altogether, the above should not distract from the fact that I think
such approaches, if used correctly and with care for their limitations,
could be a good supplement to classical shoe-leather
epidemiology. In case the authors are interested, I'd be happy to
discuss my points in more detail with them.

Technical comments:
===================

Likelihood vs. Bayes
----------------------------
Altogether, a Bayesian approach appears more natural to apply than the
authors' suggested likelihood approach. The (at places notationally
inconsistent) likelihood derivations in the supplementary text S1 can
be replaced by deriving the posterior probability (where i denotes an
individual and j a food item):

P(source is j| i is a case) = P(i is a case | source is j) * P(source is j) / P(i is a case)

under the assumption of a uniform prior on the food items,
i.e. P(source is j) = 1/N, this posterior probability becomes

P(source is j| i is a case) = f_c(j, region(i)) / \sum_{k=1}^N f_c(k, region(i)).

Given data D consisting of m cases and their regions, this becomes, up to a normalising constant,

P(source is j|D) \propto \prod_{i=1}^m [ f_c(j, region(i)) / \sum_{k=1}^N f_c(k, region(i)) ],

which is very similar to the \overline{P}(m) expression used in the
paper's Method 1. The suspected item would then be the item with the
highest posterior. Similarly, the posterior allows one to look at all
food items with a posterior probability higher than a specific cut-off
(the authors did this by thresholding the likelihood ratio, which is
equivalent, but here the threshold is less intuitive). Since the above
are probabilities it is also possible to norm them s.t. they reflect a
multinomial experiment. This would allow for an intuitive weighting
when doing computations with the "suspect product set".
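
The step from Bayes' rule to the ratio above can be written out
explicitly. The sketch below assumes that P(i is a case | source is j)
= c f_c(j, region(i)) for a constant c that does not depend on j, and
that cases are conditionally independent given the source. The
shorthand r_i for region(i) and the dummy item index k are introduced
here for illustration only.

```latex
\begin{align*}
P(\text{source is } j \mid i \text{ is a case})
  &= \frac{c\, f_c(j, r_i)\cdot \tfrac{1}{N}}
          {\sum_{k=1}^{N} c\, f_c(k, r_i)\cdot \tfrac{1}{N}}
   = \frac{f_c(j, r_i)}{\sum_{k=1}^{N} f_c(k, r_i)}, \\[4pt]
P(\text{source is } j \mid D)
  &= \frac{\prod_{i=1}^{m} f_c(j, r_i)}
          {\sum_{k=1}^{N} \prod_{i=1}^{m} f_c(k, r_i)} .
\end{align*}
```

The per-case denominators do not depend on j, so the product of
single-case posteriors ranks the items in the same order as this
normalised posterior.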

Especially once the specific disease/pathogen of an outbreak is known
(say STEC, Salmonella, etc.) it appears natural to assign different
priors to different food items. However, an even more prudent approach
would be to vary the attack probability (denoted $\xi$ in text S1)
by food item. That is, higher probability for eggs than for gummi bears
when looking at, e.g., salmonella cases. Different priors could also
originate from other sources, e.g. concurrent case-control studies,
previous studies/outbreaks, etc. The Bayesian framework is very
flexible here.
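
As a rough illustration of the posterior computation discussed in this
section, here is a minimal Python sketch. It is not the authors'
implementation. It assumes that f_c is available as a strictly positive
array with one row per food item and one column per region, and the
function names, the toy numbers and the example prior are all
hypothetical.

```python
import numpy as np

def source_posterior(f_c, case_regions, prior=None):
    """Posterior probability that each food item is the source.

    f_c          : array (n_items, n_regions), assumed strictly positive
    case_regions : region index of each observed case
    prior        : optional prior over items (uniform if None)
    """
    f_c = np.asarray(f_c, dtype=float)
    n_items = f_c.shape[0]
    if prior is None:
        prior = np.full(n_items, 1.0 / n_items)  # uniform prior 1/N
    prior = np.asarray(prior, dtype=float)
    # log prior plus the sum over cases of log f_c(j, region(i))
    log_post = np.log(prior) + np.log(f_c[:, case_regions]).sum(axis=1)
    log_post -= log_post.max()  # guard against underflow for many cases
    post = np.exp(log_post)
    return post / post.sum()    # normalise over items

def suspect_set(post, cutoff):
    """Indices of items whose posterior is at least `cutoff`."""
    return np.flatnonzero(post >= cutoff)

# Toy data (hypothetical): 3 items, 2 regions, 3 cases.
f_c = np.array([[0.8, 0.2],
                [0.5, 0.5],
                [0.2, 0.8]])
cases = [0, 0, 1]

post_uniform = source_posterior(f_c, cases)
# Hypothetical pathogen-specific prior favouring item 1.
post_informed = source_posterior(f_c, cases, prior=[0.1, 0.8, 0.1])
```

With these toy numbers the uniform prior ranks item 0 highest, while
the informative prior moves the highest posterior to item 1, which is
the kind of re-weighting a pathogen-specific prior would provide.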

Cases vs. controls
-------------------------

In your analysis you decide to look at cases only, but since there is
a sequence of Binomial random variables determining whether you will
end up being a case, I wonder if the non-cases are not worth taking
into account. Assuming the source is item j, the number of cases Y_r
in region r, 1<=r<=R (with R the number of regions), could be assumed to be

Y_r \sim Po( \epsilon population(r) f_c(j,r)),

which makes the likelihood contribution for item j become

L_j \propto \prod_{r=1}^R f_c(j,r)^{y_r} \exp(- \epsilon population(r) f_c(j,r)).

One observes that the first term equals the authors' equation (4), but
note the additional \exp term which is going to be close to one for a
rare disease, but not equal to 1. Would the above expression give a
more efficient inference approach? At the very least, such a model would be
more in line with classical spatial epidemiology approaches and would
provide a more flexible framework for additional statistical
modelling.
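
To make the comparison concrete, the following Python sketch evaluates
the Poisson log-likelihood above next to its case-only first term. It
is a sketch under stated assumptions, not a tested method: \epsilon is
treated as known, f_c is assumed strictly positive, and the function
names and all numbers are hypothetical.

```python
import numpy as np

def case_only_loglik(f_c, y):
    """Sum over regions of y_r * log f_c(j, r), for each item j."""
    f_c = np.asarray(f_c, dtype=float)
    y = np.asarray(y, dtype=float)
    return (y * np.log(f_c)).sum(axis=1)

def poisson_loglik(f_c, y, population, eps):
    """log L_j up to an additive constant that does not depend on j.

    Adds the term -sum_r eps * population(r) * f_c(j, r), which comes
    from the exp(.) factor of the Poisson likelihood.
    """
    f_c = np.asarray(f_c, dtype=float)
    population = np.asarray(population, dtype=float)
    expected = eps * population * f_c      # shape (n_items, n_regions)
    return case_only_loglik(f_c, y) - expected.sum(axis=1)

# Toy data (hypothetical): 3 items, 2 regions.
f_c = np.array([[0.8, 0.2],
                [0.5, 0.5],
                [0.2, 0.8]])
y = np.array([2, 1])               # observed cases per region
population = np.array([4000, 1000])

best_case_only = int(np.argmax(case_only_loglik(f_c, y)))
best_poisson = int(np.argmax(poisson_loglik(f_c, y, population, eps=1e-3)))
```

In this toy setting the case-only term favours item 0, whereas the
Poisson version favours item 1, because item 0 would predict more cases
than were observed. As \epsilon shrinks, the extra term vanishes and the
two rankings coincide, which matches the rare-disease remark above.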

Literature:
===========

Keene WE, Hedberg K, Herriott DE, et al. A prolonged outbreak of
Escherichia coli O157:H7 infections caused by commercially distributed
raw milk. J Infect Dis. 1997;176(3): 815–818.

No competing interests declared.

RE: Supplementing epidemiological investigations in foodborne outbreaks

jhkauf replied to hoehle on 30 Jul 2014 at 17:06 GMT

Professor Höhle,
Thank you very much for your comments on our paper. Your suggestions for validation and extension, as well as your identification of limitations, are all excellent. This is an initial study exploring a novel technique applied to an underutilized data source. We hope to do further work exploring extensions to the technique. In particular, we think that taking a Bayesian approach that accounts for the differential risk of foodborne infection would greatly improve the utility of this approach. You are also correct that, for wider adoption, this method must be integrated with existing case-control methodologies, which have been the standard for outbreak investigations for decades.

J. Kaufman and J. Lessler

No competing interests declared.

RE: Supplementing epidemiological investigations in foodborne outbreaks

mfilter replied to hoehle on 28 Aug 2014 at 11:59 GMT

Dear Professor Höhle,

Thank you very much for your comments and ideas, which we really appreciate.
At the highest level, this research attempts to identify and evaluate a new way to support outbreak investigations. When establishing the cause of disease outbreaks attributable to food, a number of different methods are currently used. Apart from testing for pathogens in food products directly, these include, for example, epidemiological methods such as patient interviews and subsequent tracing of food products back along the food supply chains. The probability-based method for identifying foods that may be contaminated with pathogens is to be seen as an additional tool to help establish the cause of an outbreak. In our understanding, the classical approaches are and will remain the key pillars of most outbreak investigations. As you correctly pointed out, several methods have been proposed to complement them, and we believe the approach presented here might be useful in certain specific outbreak situations. Specifically, our method can currently only help under the assumption that an outbreak is caused by a single food item for which sales data are available.
As outlined in the paper, we plan to continue this research, e.g. to investigate how to extend the approach to situations where an outbreak is caused by a contaminated ingredient or by products without a unique product ID. We also have several ideas for follow-on work, including introducing shopping behavior and studying the effects of noise on the method, which could then also evaluate the effect of the regional reporting differences you raised. We further agree with the comment on the need to evaluate the method on real-world outbreak data. However, acquiring suitable data is challenging given the limited range of scenarios the algorithm currently covers. We expect this to become easier with future versions of the algorithm that can be applied in a wider range of outbreak scenarios.
Since BfR is in charge of risk assessments in food-borne outbreak situations in Germany, we continuously work on matching results from scientific research with real-world outbreaks.

To comment on some of your remarks more specifically:

• heterogeneity of individuals
We agree with you that it would be very relevant to consider specific characteristics of the cases if this information is available. In most situations, only very little information (e.g. age, sex) is available that can be used to refine the hypothesis regarding the consumption pattern.

• Buying foods vs. consumption of foods
From the data currently available we deduce the region in which the food was most probably consumed. None of the datasets that might be available will reflect who in a household consumed which of the purchased products. Thus, although we agree that it would be better to have detailed knowledge of the consuming persons, this information can only be collected through interviews with the cases and matching controls. Some hypotheses on probable food items generated by our approach could support case-control studies. At that point, some specific questions could be added. And certainly, better knowledge of consumption patterns, e.g. based on supermarket sales data, and the use of such data in our approach would improve the capability of the method.

• Number of units sold in a region
As regards the volumes sold, these can be ‘translated’ into numbers of portions (based on knowledge of an average portion) and thus, in our understanding, can still be considered an appropriate estimate of exposure. As regards shelf-life and season, this is more or less a matter of having detailed data on when the products were sold and when the cases became sick, together with knowledge of the period during which the product is on the market (and can be consumed) and of the incubation period of the respective causative agent.

• Homemade foods vs. foods consumed at restaurants etc.
We fully agree that the distribution channels differ. In food-borne outbreaks where several people are linked by the place of food consumption, the traditional case-control study (linked with the food items offered during the respective time interval) will be a very useful tool for identifying commonly consumed food items. Our approach, which focuses on a ‘single food product’, mainly addresses products consumed at home. Such products usually result in scattered sporadic cases, and whether the link to a common food-borne outbreak is identified depends very much on the type of disease and the surveillance system in place (including microbiological testing). Typically, case-control studies are not regularly performed in these situations.

• Specific risks related to the food items taking into account the causative agent involved.
We fully agree that, besides the agent-specific incubation period, knowledge of higher-risk foods might be considered. This could be a second step, in which the identified matching products are assessed against current knowledge on the food items, the cases …. But we also have to remain open to new exposure pathways (e.g. at the beginning, nobody expected fenugreek seeds to be the source of EHEC cases). We also agree that differences between surveillance systems have to be considered, but this might be true of any type of approach used to investigate food-borne infections.

M. Filter, A. Käsbohrer and B. Appel

No competing interests declared.
