Estimating animal abundance in industrial scale batches of ground meat is important for mapping meat products through the manufacturing process and for effectively tracing the finished product during a food safety recall. The processing of ground beef involves a potentially large number of animals from diverse sources in a single product batch, which produces a high heterogeneity in capture probability. In order to estimate animal abundance through DNA profiling of ground beef constituents, two parameter-based statistical models were developed for incidence data. Simulations were applied to evaluate the maximum likelihood estimate (MLE) of a joint likelihood function from multiple surveys, showing superiority in the presence of high capture heterogeneity with small sample sizes, or comparable estimation in the presence of low capture heterogeneity with a large sample size when compared to other existing models. Our model employs the full information on the pattern of the capture-recapture frequencies from multiple samples. We applied the proposed models to estimate animal abundance in six manufacturing beef batches, genotyped using 30 single nucleotide polymorphism (SNP) markers, from a large scale beef grinding facility. Results show that between 411∼1367 animals were present in six manufacturing beef batches. These estimates are informative as a reference for improving recall processes and tracing finished meat products back to source.
Citation: Hu X-S, Simila J, Platz SS, Moore SS, Plastow G, Meghen CN (2012) Estimating Animal Abundance in Ground Beef Batches Assayed with Molecular Markers. PLoS ONE 7(3): e34191. https://doi.org/10.1371/journal.pone.0034191
Editor: Rongling Wu, Pennsylvania State University, United States of America
Received: November 16, 2011; Accepted: February 24, 2012; Published: March 30, 2012
Copyright: © 2012 Hu et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This project is funded by the Genome Alberta and Alberta Livestock and Meat Agency (ALMA) Applied Livestock Genomics Program (http://www.genomealberta.ca/livestock/genetic-identification-techniques-to-improve-food-safety-recall-of-ground-meat-products.aspx). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: Dr. Simila, Dr. Platz and Dr. Meghen are employed by IdentiGEN North America Inc. This study was funded by the Alberta Livestock and Meat Agency Ltd. There are no patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLoS ONE policies on sharing data and materials.
Estimating animal abundance in manufactured batches of fresh ground meat is an important phase in traceability and certification in meat supply chains .This is of a particular value in the event of a microbial contamination incident given that fresh ground beef accounts for more than 40% of all beef consumed in Canada  and 42% in the United States . To identify whole muscle meat products, such as steaks and joints, throughout the supply chain, DNA profiling is currently applied through the use of reference animal or carcass databases, analogous to the DNA databases widely used in human forensics. In a large scale industrial manufacturer, a single ground beef batch may consist of many hundreds of animals from diverse sources, which may include more than one country of origin. Characterizing the distribution of these individuals in large grind batches informs the possibility of developing a recall management tool based on DNA profiling. Estimating animal abundance has been widely applied in ecology and wild life conservation , , , . However, the mixture in ground beef batches complicates the application of this technique, including isolation of individual DNA profiles and the selection of an appropriate statistical model. The objective of this study is to focus on the statistical model for a preliminary estimate of animal abundances in grind meat batches, given the heterogeneity arising from different manufacturing systems and the absence of a reference DNA profile database.
We employ the conventional mark-recapture methodology to estimate animal abundance, with multiple surveys in individual manufacturing batches for estimating capture and recapture frequencies. Samples are taken from the finished ground beef batch and individual animal contributors identified by subdividing the sample into constituent discrete muscle fibres for DNA extraction and single nucleotide polymorphisms (SNP) genotyping . Matching DNA profiles among samples, analogous to the case of sampling with replacement, are used to estimate recapture frequency. Two specific features are crucial for statistical modelling in ground meat batches. One is the presence of a highly heterogeneous capture probability among individuals in a single batch. This can arise where an unequal amount of useable carcass from distinct animals is blended into individual batches for ground beef. This forms the biological basis for generating unequal capture probability among distinct animals. The other is that the number of animals in a single beef batch could be very large in industrial scale manufacturing. This can result in a large number of animals not being captured or captured at a low frequency, in addition only a few animals may be captured at a relatively high frequency. These two features limit the suitability of most existing models for estimating population size in ground meat batches.
Methodologically, many statistical models have been developed using the mark-recapture framework for population size estimation, including the non-parameter and parameter estimators, the models for equal and unequal capture probability, and the models for discrete- and continuous-time surveys (for comprehensive reviews, see , , ). The well-known non-parameter estimators include Lincoln-Petersen's estimator, the jackknife estimator , , the bootstrap estimator , the moment estimator , , and the sample-coverage (SC) estimator . Most non-parameter estimators underestimate population size when a small proportion of animals are captured. The jackknife estimator can produce appropriate estimates when many individuals are captured multiple times . Chao's estimator (Chao-1) performs well for a lower level of heterogeneity in capture probability or when a majority of individuals are captured . Xu et al.  recently proposed an alternative non-parameter estimator that slightly modifies Chao-1 estimator using a different moment approach. The commonality is that these estimators (except the high-order jackknife and SC estimator) mainly employ partial information on the observed capture and recapture frequencies in multiple surveys.
With a reference to the parameter-based estimators, a few methods have been developed to derive maximum likelihood estimate (MLE) of population size since Fisher's logarithm series model , , , . These methods are mainly based on the abundance data (frequency count) although connections are available for a few abundance and incidence models . Crucial to the parameter-based methodology is to select an appropriate function to describe the pattern of capture-recapture frequencies. Chao and Bunge  used a Gamma-mixed Poisson or negative binomial distribution to derive MLE. Shen and He  used a modified beta function to derive MLE for species richness. The commonality is that these methods employ the full information on the pattern of capture-recapture frequencies. These methods have limited performance when the heterogeneity in capture probability is large or when most individuals are not captured in multiple surveys. This motivated us to develop alternative estimators that are suitable for the population with a high heterogeneity in capture probability.
We developed two parametric models for incidence data to estimate population size: Model I is based on a function similar to a modified continuous version of Fisher's logarithm series model, which can deal with the population with a high heterogeneity in capture probability; Model II is a modified beta function, with an alternative zero-truncated function to the modified function of Shen and He . Model II can deal with the population with a relatively low heterogeneity in capture probability. In the following sections, the proposed models are described, including the detailed procedure of deriving MLE. The proposed estimators are then compared with other existing non-parameter estimators through simulations with different survey schemes and the use of previously published empirical datasets. Finally, we apply the proposed models to estimating the number of animals in six manufacturing beef batches, each of approximately 1 metric tonne in weight, genotyped with 30 SNP markers, selected for identification . Inferences on population sizes in each batch of fresh ground beef are drawn from comprehensive analyses with multiple estimators.
The Model and Estimator
We begin by briefly summarizing Burnham and Overton's model and then proceed to propose an alternative method to estimate population size. Consider a closed population with constant N unique individuals that are indexed by 1, …, and N. There are t surveys through non-invasive genetic samples (analogous to the sampling with replacement). Let (i = 1, 2, …, N) be the capture probability of the ith individual at each survey (constant capture probability assumption). Here, we assume that the capture probability for each individual is nonzero at each survey () and that unequal capture probabilities exist among different individuals, i.e. . The capture probabilities, , are a random sample from a probability density distribution . Note that is equivalent to the notation of Burnham and Overton . Like previous studies , the multiple samples can be arranged in a matrix () where is the observed frequency of the ith individual in the jth survey. Let n be the total number of observed distinct individuals caught in the t samples, which can be expressed as where I(A), the indicator function, is equal to 1 when event A occurs and 0 otherwise. Let be the number of individuals captured exactly k times (k = 0, 1,…, t) in the t samples, which can be expressed as (, p 784). Thus, we get the captured individuals in total, , and the population size N is where f0 is the number of individuals that are not captured in the t samples.
According to Burnham and Overton , the joint likelihood function for the whole t samples can be expressed as(1)where . is the probability for the t samples with i unique individuals in the multinomial distribution. The integration in removes the impacts of a random sample of . Based on the above general framework, Burnham and Overton  developed a kth-order jackknife estimator for population size N. Using the same framework, Chao  developed an alternative non-parameter estimator (moment estimator) of N. Here, we proceed with the same framework to develop an unconditional MLE of N by hypothesizing two different types of capture probability density distributions .
Since a non-zero capture probability for each individual (, i = 1,…, N) is assumed at each survey, the zero point as the lower bound must be eliminated in calculating probability . In the absence of prior information about individual capture probabilities, it is difficult to determine the exact capture probability and probability density function . Biologically, different sources of uncontrollable and unobservable variations can generate heterogeneity in capture probability among individuals or the relative occurrences of different individuals at each survey. This variation may arise from behavioural difference among individuals or different foraging areas or different exposures to traps . How to determine such impacts on the capture probability density distributions remains to be explored. In this study, we consider two capture probability density distributions that are suitable for a large population.
In Model I, we assume that the probability density function (pdf), for a capture probability p has the following expression:(2)where , the total number of captures given the minimum non-zero capture probability , and is the lower bound of capture probability. The biological meaning of parameter θ (>1) is termed as the average capture change per individual per unit time. This setting is based on the biological phenomenon that the observed abundance distribution, , frequently exhibits an “opposite J-shape” pattern (Figure 1A). Many individuals are captured once and a few individuals are captured more than once. can be used to describe the phenomena in mark-recapture experiments where the number of captured individuals decreases with the capture probability p (without a long tail of frequency distribution).
A: Capture probability density function (pdf) for Model I, , given a population size N = 500: line for θ = 1.5, dashed line for θ = 2.5, and dot dashed line for θ = 3.5. The skew of the capture probability distribution increases as the parameter θ increases. B: Capture probability density function (pdf) for Model II, , given a population N = 500: line for α = 1, β = 3.0; dotted line for α = 2, β = 5 (skewed bell-shape); thick dashed line (bell-shape) for α = 2, β = 2; dashed line (U-shape) for α = 0.5, β = 0.5; and dot dashed line for α = 5, β = 1.0. An array of capture probability distributions can be generated by changing parameters α and β.
Several considerations are needed in setting in Eq. (2). First, the proper pdf is derived by normalizing the function by considering p as the capture probability for incidence data rather than the relative frequency for abundance data (e.g., allele frequency in a population or the relative species abundance in a community; ). Here, we borrow the function from the neutral theory (the infinite number of allele model) in molecular population genetics , . is the expected number of unique individuals whose capture probabilities (p) fall within the range (p, p+dp) and . In population genetics, the function is the well known function for describing the abundance distribution of neutral alleles in a closed population, where θ is the average number of alleles generated by mutation per generation. Again, the conceptual difference is that p is not the gene frequency (abundance data) but the capture probability (incidence data) in this study. The capture distribution for an array of capture probabilities is analogous in distribution pattern to but different in biological meaning from the abundance distribution of an array of gene frequencies ,  (pp. 205–206). Second, for the abundance model, (not pdf) is the same as the well-known Fisher's logarithmic series (discrete) distribution  except that is the version for a continuous distribution (, p 250). Leigh (, Appendix 8.2) transformed Fisher's logarithm series into . Fisher's α parameter in the logarithm series function is analogous to θ here, which is also analogous to Hubbell's θ in describing the pattern of species richness and relative abundances in a neutral metacommunity (the fundamental biodiversity parameter; , ). Chao and Bunge  also employed this kind of function (gamma-mixed Poisson) to derive the probability for the t samples with i unique individuals for the abundance data, analogous in concept to but different in expression to here. In this study, in Eq. (2) for the incidence data can be seen as the model similar to the zero-truncated continuous version of Fisher's logarithmic series model. Third, the lower bound for an individual capture probability must be nonzero in biology except for the case of extinction, although a zero bound is allowed from the statistics point of view. One feature of the function is that its integration value becomes substantially large as becomes smaller, given a constant population size (, p 210). How to determine the lower bound remains to be explored in biology. In practice, it is difficult to even catch the individual with the capture probability of 1%. One way is to directly estimate by considering as one additional parameter. However, extensive simulations indicate that this consideration leads to the difficulty of obtaining convergent estimates (results not shown here). In the following parts, we set = 1/N, and this lower bound becomes sufficiently small () as the population size increases. Thus, Model I with 1/N as the lower bound is suitable for a large population. It is noteworthy that, for the abundance data, a setting similar to the above but with different biological meanings exists in population genetics (, p 210; , p 398) or in community ecology , , where represents the total number of existent alleles in a population or existent individuals in a metacommunity, respectively.
Since a non-zero capture probability is considered for each of N individuals in the population, the lower bound in is correspondingly changed, i.e.(3)Here, in is set as 1/N, and the model is suitable for a large population. The sum of remains 1, i.e. .
The general likelihood function can be decomposed into two sub-likelihood functions , , , , i.e. where and . The difference from previous models is that each sub-likelihood function (L1 or L2) is the function of two parameters (N and θ). Calculation of conditional MLE remains to be explored. To derive the MLE of N and , we simply use the global likelihood function instead of decomposing it into two different components. Like Stollenwerk and Jansen (, pp 185–191), we approximate the population size N as a continuous variable in derivation.
Let , , and . can be expressed as . Let , the digamma function , and , the trigamma function ()) . Note that the first term in is Euler's constant γ = 0.5772156649. The first- and second-order partial differentials of the log likelihood function with respect to N and are derived in Appendix S1. Population size N and the parameter can be solved using Newton and Raphson's iterative method (with a fast convergent speed):(4)
The initial values for N and in iteration can be set as n and 0, respectively. Iterative calculations are continued till convergence for each estimate is achieved. Note that no failure convergence existed in all simulations described in the next section. The variances for estimates N and can be calculated from the diagonal elements of the inverse variance-covariance matrix (inverse of Fisher's information matrix) at convergence: and .
In Model II, , is set as a zero-truncated beta distribution function:(5)where (). The biological meanings of parameters α and β are termed as the average capture changes per individual per unit time for individuals with capture probabilities p and 1−p, respectively. This type of capture probability density function, similar to Pearson's Type I model (, p 248), can be used to represent a variety of patterns of distributions under different parameter settings, including the opposite J-shape pattern (Figure 1B). The difference from Model I is that the pattern for the capture-recapture frequencies generated by Model II is not as highly skewed as that generated by Model I, i.e. a relative lower heterogeneity in capture probability. When , Model II reduces to Model I. When , Model II reduces to the model of a beta-binomial distribution mixture . Shen and He  recently also employed the beta function to describe species richness distribution, but used a different zero- truncated transformation by changing . One constraint in Shen and He's model is that the setting of can lead their constant to an infinite value, violating the condition in setting their (equivalent to here). Again, in and is set as 1/N. Thus, Model II is suitable for a large population.
Like in Model I, Eq. (3) remains unaltered after changing the lower bound in by 1/N. To derive MLE, let , , , , , and . can be expressed as . The first- and second-order partial differentials of the log likelihood function with respect to N, , and are derived in Appendix S2. Similarly, these three parameters can be estimated using Newton and Raphson's iterative method:(6)
The initial values during the iterative calculation can be set as n, 0, and 0 for N, , and , respectively. Iterative calculations are continued till convergence for each parameter. Note that non convergence can occur under some parameter settings, such as the case of α = 1 and β = 3.0 in simulations described in the next section. The variances for estimates N, , and can be calculated from the diagonal elements of the inverse variance-covariance matrix at convergence.
Monte Carlo Simulations and Comparisons
Simulation Data Generation.
To examine the properties of the proposed models, we analyzed several sampling schemes based on the distribution pattern of , generated by different parameter settings in capture probability density function . The aims are (i) to look at the impacts of different sampling schemes (the number of surveys) under a known population size N and parameters, and (ii) to look if some non-parameter estimators perform well with the capture probability distribution assumed in Models I and II since estimates of population size are sensitive to the assumption of . Similar to Shen and He , three non-parameter estimators were selected: the first-order jackknife estimator , , the bootstrap estimator , , and Chao-1 estimator , . The jackknife and Chao-1 estimators only employ partial information of capture-recapture frequencies; while the bootstrap estimator employs the full capture-recapture frequencies in the t surveys in a way different from the proposed models. These three non-parameter estimators have been extensively assessed in previous studies from the literature.
Given the population size N, the setting for a sample size is constrained by the fixed sum ( = N) of the observed unique individuals in total and the unobserved individuals. An arbitrary setting of sample size n could result in the total population size exceeding N according to the distribution . Thus, the simulated samples for the proposed two- and three-parameter models are generated in the following steps. Given a population size N, t surveys, and parameter for Model I, or parameters α and β for Model II, calculate each probability (i = 0,1,…, t). Then, use these probabilities (multinomial distribution) to generate the numbers of individuals with different capture-recapture frequencies 's (i = 0, 1, …, t; ). Note that the samples of capture-recapture frequencies, generated by this way are equivalent to those generated by Otis et al.'s  method that is based on assigning each individual a certain capture probability based on . The routine of Press et al. (, pp 210–211) was used to generate random numbers with uniform distribution within (0, 1) for sampling purpose. The observed frequencies, 's (i = 1,…,t), were then used to estimate parameters according to Eq.(4) for Model I and Eq. (6) for Model II. We consider that the convergence is reached when the absolute difference between two consecutive iterative values is less than 10−5 for each parameter although an even smaller number can be set at the expense of long-time iterations. Three non-parameter estimators were also calculated from the observed 's (i = 1,…, t). One hundred independent data sets were created, and each was used to estimate all parameters. Means and standard deviations (Sd) of estimated parameters were calculated from these replicated datasets. The standard deviations for N, θ, α, and β were also calculated from averaged Fisher information index, in addition to empirical standard deviations.
Several sampling schemes were simulated in Model I, with the number of surveys increasing from 2 to 10 under three different patterns of capture probability distributions (θ = 1.5, 2.5, and 3.5; Figure 1A). The distribution becomes more skewed as parameter θ increases from 1.5 to 3.5. In Model II, five different patterns of capture probability distributions were simulated (Figure 1B): α = 1, β = 3 (opposite J-shape); α = 2, β = 5 (skewed bell-shape); α = 2, β = 2 (bell-shape); α = 0.5, β = 0.5 (U-shape); and α = 5, β = 1 (J-shape) for the known parameter settings. These distribution patterns may occur for the capture-recapture frequencies in different animal species in trapping experiments or for plant species in spatiotemporal quadrat surveys in ecology. Four sampling schemes were simulated in each of the five patterns, with the number of surveys increasing from 4 to 10. Programs in C are available upon request from Hu.
In Model I, the average estimates of population size and parameter in each of the three capture frequency distributions are generally in good agreement with their actual values (Table 1). The actual population size N and parameter θ are within the ranges of one standard deviation of estimates in each case. The standard deviations for and calculated from the inverse of the Fisher information matrix (not shown in Table 1) are consistent with the empirical values for a large sample size (n). Generally, the standard deviations for each parameter estimate decrease as the number of surveys increases. Based on the distribution of probability , the average number of sample size per survey () decreases as the number of surveys increases from t = 2 to 10. The observable sample size in total (n) generated from the probability distribution (, i = 1,…, t) decreases as the capture probability distribution becomes more skewed (θ changing from 1.5 to 3.5; Figure 1A). The results indicate that the combination of more surveys with a small sample size per survey can produce better estimates than the combination of a small number of surveys with a large sample size per survey (Table 1). The three non-parameter estimators substantially underestimate population size N although the average estimates of population size increase with an increased number of surveys (detailed data not shown here). When the capture probability distribution becomes more skewed, the non-parameter estimators produce severe underestimates of N. Standard deviations exhibit different patterns for different non-parameter estimators, but each is related to the extent of skewness of the capture probability distribution. Thus, these non-parameter estimators are not suitable for the population with the capture probability distribution assumed in Model I where a high heterogeneity of capture probability exists .
With Model II, the average estimates of N, α, and β become closer to the actual values as the sampling scheme changes from t = 4 to 10 in each of the five capture probability distributions (Table 2). The actual population size and parameters (, and ) are within the ranges of one standard deviation of estimates in each case. Again, the standard deviations of each parameter (, , and ) calculated from the inverse of the Fisher information matrix (not shown in Table 2) are very close to the empirical values. The standard deviations for each parameter estimate decrease as the number of surveys increases from t = 4 to 10. The average observed sample sizes (n) are closely related to the capture probability distribution and exhibit considerable variation among the five distributions. A trade-off relationship does not exist between the number of surveys and the average number of individuals captured per survey. In each case, the standard deviations for observed sample sizes decrease as the number of surveys increases from t = 4 to 10. Given a sampling scheme, the observable sample size in total (n) is the smallest in the case α = 1 and β = 3, but the largest in the case of α = 5 and β = 1 (Figure 1B).The observable sample size in total (n) reaches the maximum in the case α = 5 and β = 1 since almost all individuals can be captured in this distribution (Figure 1B).
Unlike the results in Model I, Model II has a comparable performance to the non-parameter estimators in four of the five types of distributions, the exception being α = 1 and β = 3, where underestimates are obtained (Table 2). The scheme with more surveys can produce better estimates in each case. The results indicate that the three non-parameter estimators generally perform well for the capture probability distribution assumed in Model II.
Comparisons Using Published Empirical Data Sets
Here, we use two published datasets to demonstrate the application of the proposed models. The first example is the well-known Fisher's butterfly data that was collected in Malaya . The paper provided the observed distribution of frequencies of butterflies for species abundance ranging from 1 to 24 (, p 43). This dataset has been examined for estimating species richness by several researchers with different models, including the Poisson-lognormal model (; = 815±43), the Poisson-inverse Gaussian model (; = 719), the Poisson-generalized inverse Gaussian model (; = 1000), and the mixed Gamma-Poisson model . Chao and Bunge  extensively analyzed this dataset by using the cut-off point from t = 10 to 24 and compared six different estimators. They concluded that a stable value of = 850 species was expected under the cut-off point below 24 (t24). Like Chao and Bunge , we estimated population size using the same array of cut-off points. As summarized in Table 3 the estimate obtained from Model II, = 825 (the average over all cut-off points) is close to Bulmer's (; = 815) and Chao and Bunge's (; = 850) results.
The second example is the experimental cottontail abundance determined from two sets of live trapping data with known population sizes. The first dataset was collected in the Olentangy Wildlife experimental Station, Delaware County, Ohio, in 1961. The second dataset was collected in 1963 at Robert Allerton Park, Monticello, Illinois. In the first dataset (Ohio), the observed capture-recapture frequencies from f1 to f7 were 43, 16, 8, 6, 0, 2, and 1. This dataset was also examined by several researchers using different models, including Schnabel's estimate , Schumacher and Eschmeyer's method , MLE and the regression method based on the geometric model , and Chao-1 estimator . The results obtained from both the regression method based on the geometric model and Chao's non-parameter estimator ( = 133.8±24.0 for Chao-1 estimator; ) are consistent with the actual population size. Analysis with Model II produces a negative α estimate, demonstrating a poor fit to the capture-recapture frequency pattern assumed in Model II. Analysis with Model I produces MLE = 211.3±31.7 and = 2.49±0.52. is overestimated (actual value N = 135) because of a low heterogeneity (the coefficient of variation (CV) for the low captured individuals = 0.619; ). This indicates that the actual capture probability distribution in this population (a low heterogeneity and a small population size) is biased from assumed in Model I (a large population and a high heterogeneity, say CV>0.8; ). In the second dataset (Illinois), the observed capture-recapture frequencies from f1 to f6 were 36, 15, 13, 3, 1, and 1. Chao-1 estimator gives = 112.2±19.4 with a low to moderate heterogeneity (CV = 0.382). Model II produces = 136.9±47.6, = 0.55±0.78, and = 3.57±2.13, which is fairly close to the actual population size (N = 130; ).
Applications to Ground Beef Batches
We now apply the proposed models to estimate the number of unique animals in ground beef batches (one batch is considered as one population). We had 57 time sequenced ground beef samples (each sample ∼250 g) taken from six 1 tonne batches from a single manufacturing line during a single production shift. There are 10 samples, analogous to the field surveys (sampling with replacement) in animal ecology , from Batches I to IV (manufacturing ID: 5.2, 5.3, 5.7, and 5.9), 9 samples from Batch V (ID: 5.11), and 8 samples from Batch VI (ID: 5.13). In each sample, we dissected 94 individual muscle fiber sub-samples, yielding 752∼940 sub-samples, extracted DNA, and genotyped over 30 SNP markers (∼160,000 genotypes in total). Missing genotypes were marked but excluded in analysis.
Several methods were applied to estimating the unique number of animals in individual batches and samples. One is the use of Genecap  where pairwise matching probabilities, in terms of the probability of identity (PI) were calculated assuming both Hardy-Weinberg equilibrium (HWE) for genotypic frequencies and linkage equilibrium. HWE was tested using Genepop software , showing that 6 out of 180 tests (∼3% in total) were not in HWE (see results below). Linkage disequilibria (LD) for all pairwise SNPs in each batch were tested using Genepop software as well, showing that all used SNPs were essentially in linkage equilibrium (see results below). The average multilocus PI in each batch is much smaller than 10−5 by using 25–30 SNP markers, which ensures the appropriate use of these markers for identifying individuals for estimating population size (mark-recapture method) , ,, . The modified Lincoln–Petersen method with the assumption of equal capture probability (homogeneous) was used to estimate population size N . Each batch was separated in half for estimating recapture frequencies between two pooled samples. Population size N and its variance are calculated by and , where M is the total number of animals captured and marked in one pooled sample, C is the total number of animals captured in the second pooled sample, and R is the number of animals recaptured in the second pooled sample.
In order to apply the proposed models to estimating N, we need to calculate the observed capture-recapture frequencies, fi's (i = 1,…,t). The following steps were conducted. First, we identified the number of unique animals based on the statistical test (Pearson's correlation with student's t-test) of multilocus genotype matches with 30 SNP genotypes, removing the HWE assumption for calculating PI. Note that all pairs of SNPs were essentially independent from each other in each batch (see LD tests below). In order to identify unique individuals in a given sample, the individual SNP genotypes were transformed into numerical values. For example genotypes AA, AT, and TT were assigned 2, 1, and 0, respectively. Missing genotypes were designated another number and removed from the calculation. Pearson's correlation for each pair of individuals was tested using the significant level by Bonferroni correction (the type I error for the entire test was controlled at 1%). Two individuals are considered to be identical when they matched exactly, and replicates were removed from the analysis. Second, using the above described method, we identified the number of unique animals in each batch, i.e. n ( = ) in the proposed model, by pooling all t samples that consisted of unique individuals. Third, using the same method as in the first step, we compared each of the t samples with the batch population (n individuals) and calculated the capture-recapture frequency of each of the n individuals in the batch, i.e. the estimates of fi (i = 1,…, t). In fact, our observations indicated that all exactly matched individuals (within or among samples) in our data sets of this study were identical in each of all genotypes (Pearson's correlation coefficients = 1.0). Once the observed frequencies ( fi's) are available, the proposed models are then applied for estimating N. Two programs were written in SAS codes for this purpose and are available upon request from Hu.
As references, additional non-parameter estimators for unequal capture probability models were also applied, including Chao-1 estimator , the abundance-based coverage estimator (ACE)  and the first- and second-order jackknife estimators . MLE based on the mixed Gamma-Poisson model was employed where fi (i = 1, 2, …, t) was assumed to follow Poisson distribution while p in was assumed to follow a gamma distribution . To measure the degree of heterogeneity among capture probabilities, the coefficient of variation (CV) for the low captured individuals was calculated (for formula, see ). Population size N with all these non-parameter estimators and Chao and Bunge's estimator can be estimated using SPADE software .
Population genetic analysis indicates that gene diversity ( = , is the frequency of the ith allele at a SNP site) was about 0.46 per SNP for all six batches (Table 4). Among the total of 180 tests of the selected 30 SNPs in all batches, only six tests were in Hardy-Weinberg disequilibrium (Table 4; P-value<0.0003), indicating that most batches were essentially in HWE. Batch-based LD tests indicate that only two pairs of SNPs in Batch 6 (SNPs 14 and 19, SNPs 21 and 24; P-value<2.210−5) were in LD. Thus, SNP-17 in Batch 3, SNP-19 in all six batches, and SNP-21 were removed for further analyses. All SNPs eventually used in this study were independent from each other and in HWE.
Table 5 summarises the observed capture-recapture frequencies, fi's(i = 1,…,t), in all six batches, showing that all batches except Batch 1 displayed a highly skewed distribution of capture-recapture frequencies. CV estimates were 0.586, 0.893, 1.255, 0.836, 1.003, and 0.732 for Batches 1, 2, 3, 4, 5, and 6, respectively, indicating a high heterogeneity in capture probability in Batches 2, 3, 4, and 5 (CV>0.8), but not in Batches 1 and 6 (CV<0.8; ). As expected, Lincoln–Petersen's estimator severely underestimated population size due to the presence of heterogeneous capture probability in each batch that violated the assumption of homogeneous capture probability in this method. As suggested by Chao and Shen , the Chao-1 estimator (for a low to moderate heterogeneity in capture probability) produced the lower bound estimates of population size, but its estimates were greater than those obtained with Lincoln-Petersen's estimator. The first- and second-order jackknife estimators provided comparable estimates to Chao-1 estimator. Chao and Shen  recommended the use of ACE-1 for the population with a high heterogeneity (CV>0.8) since this estimator uses the information on a highly heterogeneous capture probability in estimation. The ACE-1 estimator produced higher estimates of population size for Batches 2, 3, 4, and 5, = 576.8∼1011.3, but not for Batches 1 and 6. Batch 3 had the largest population size, followed by Batch 5, which was consistent with the rank of CV values.
The mixed Gamma-Poisson model  provided larger estimates of population size for Batches 2 ( = 821.8±287.3), 4( = 771.4±231.5), and 6 ( = 667.7±264.7). Iterations were not convergent for Batches 3 and 5 due to the high heterogeneity in capture probability (Table 5).
With application of the proposed models in this study, we first applied Model II to obtain MLE of N, α, and β because Model I is the specific case of Model II. If the estimate of α is negative, we then apply Model I. Results indicate that α estimates were negative in all batches except Batch 1. Thus, we used Model II to analyze Batch 1 data and Model I to analyze the other batches. The population size in Batch 1 was 411.4±56.3, but the 95%CI (confidence interval) overlapped with the 95%CI obtained from the second jackknife estimator. The average population sizes were greater than 1000 (1011∼1367) in the remaining batches. Since a very high heterogeneity in capture probability exists in Batches 2–6, all the examined non-parameter methods produce severe underestimates of population , as indicated from the simulation results in the preceding section. The capture probability distributions in these batches more likely follows the assumption of in Model I, and the estimates of population size are close to their actual sizes (see simulation results for N = 1000 and t>6 in Table 1). Estimates in Batches 2, 4, and 6 with Model I were mainly distributed within the 95%CI obtained from the mixed Gamma-Poisson estimator. Estimates, 's, were positively related to the CV values, reflecting the extent of heterogeneity in capture probability.
In this study, we proposed two related statistical models for estimating the number of animals in a population. One uses a model similar to the modified continuous version of Fisher's logarithmic series model to describe capture probability density function (Model I); while the other uses the modified beta function to describe (Model II). Model I is the specific case of Model II. In each model, the lower bound for capture probability is truncated by 1/N, and this lower bound approaches zero as the population size increases. This way of removing the non-captured probability is more meaningful since the capture probability for each individual must be nonzero in biology (each individual must be obtainable in theory) although the lower bound may be allowed to be zero in statistics. The idea is different in biological meaning from Wright's thinking in calculating the existent alleles in a population (, p 398) or a similar way in calculating existent individuals in community ecology ,  for the abundance data. Good (, pp 251–252) discussed the truncated distribution related to Model I for the abundance data, but did not discuss how to determine the lower bound. In general, Model I is suitable for the population with a very high heterogeneity in capture probability (say, CV>0.8) and a large population size; while Model II is suitable for the population with a relatively lower heterogeneity in capture probability (say, a moderate heterogeneity; ) and a relatively smaller population size. Both Models I and II provide new additions to the incidence-based methods of estimating population size.
Selection of appropriate model is important for analyzing empirical data since each model has its own strength and limitation. Estimates of population size for parametric models are sensitive to model assumptions about the capture probability density distribution . Bunge and Barger  reviewed several parametric models for the abundance data and discussed the connection between abundance and incidence models. Our proposed two models are based on incidence data samples. The strength of Model I is its suitability to the population of a very high heterogeneity in capture probability and its better performance over the existing non-parameter estimators. One caution is that a slightly positive bias for the mean estimate may occur although the actual parameters are not significantly different from estimates (the actual values are within the ranges of one standard deviation). The weakness of Model I is that a substantially biased estimate can be produced when the heterogeneity in capture probability is low or moderate, as indicated from the example of experimental cottontail abundance. Model II has comparable performances with other existing non-parameter models in the presence of a relatively lower heterogeneity in capture probability or in the case of capturing a large proportion of population. The setting of 1/N as the lower bound predicts a better performance of Model II for a large population, as indicated from the example of Fisher's butterfly datasets.
It is important to understand that many distinct processes may be involved in generating a highly heterogeneous capture probability in a single manufacturing batch. Most meat in a ground beef batch comes from off-cuts or trimmings. These raw materials are usually blended during processing as it would be entirely impractical and uneconomic to process, label or tag each component separately . Because different animals exhibit wide variation in meat and fat content, the quantity and quality of trimmings varies considerably among animals. Thus, different animals have quite variable contributions to a single beef batch. This forms the biological basis for generating heterogeneous capture probability although sampling process or animal behaviours could likely modify . Many thousands of animals are processed per day in large scale slaughterhouses, and this may subsequently result in a large number of animals in a single grind batch. In addition, the number of animals in a single batch is affected by several factors in the supply chain, including the specific grind manufacturing process, the number of diverse farms providing cattle to the processors, the scale of production and the use of lean finely textured beef (LFTB). These processes could explain the highly skewed pattern of capture-recapture frequencies in the five batches. Many animals can be captured with a low frequency (e.g., once) and a few animals can be captured multiple times.
The observed capture-recapture frequencies, fi's (i = 1,…, t), in six manufacturing batches indicate a high heterogeneity in capture probability in a single ground beef batch. A highly skewed opposite J-shape in five batches (Batches 2–6) implies that a potentially large number of individuals are present in them. An average of 411 to1367 animals was present in the six grind batches. These estimates indicate high variation in the number of animals among different batches from the same manufacturer on a single production line during a single production shift. From the manufacturing records, the batches examined here were compounded from raw materials consisting of 3 grades of fresh and frozen beef trim with unequal weights of components among batches. In addition up to 10% of each batch was comprised of LFTB and rework. Animal abundance in each raw material is unknown a priori. The estimates derived here are informative as a reference in decision-making in the case of food safety recalls.
It is of interest to compare the similarity and difference in mark-recapture experiments between the conventional field of animal ecology  and the laboratory or non-invasive DNA-profile detection in a ground beef batch. Both animal abundance and habitats/behaviours can affect the capture probability distribution in field animal surveys. With the ground beef batch, population composition can affect the heterogeneity in capture probability if the samples for DNA profile testing are randomly taken. Further, use of DNA profiles to identify individuals can result in false positive capture if the number of markers is small . One striking difference is that multiple copies of the same animal can occur in one survey in a grind batch, but infrequently take place in the field animal survey. The marked animals are not recorded twice in a single survey. In a single grind batch, the same DNA profiles from different parts of one animal could be sampled, similar to DNA samples from multiple shed hair samples of animals . Thus, to employ the standard mark-recapture method, the duplicated DNA profiles must be removed in a single survey. Lukas et al.  proposed an alternative likelihood function that can use the duplicated DNA profiles in a single survey, but the proposed algorithm is too complex for application.
So far, the mechanisms for generating the pattern of capture probability distribution have not been fully examined except for the application of Fisher's logarithmic series . Also, the conventional mark-recapture framework has not been linked to the relevant biological mechanisms for maintaining a closed population and the relationship between the capture probability distribution and population composition or animal activities. In most situations, the assumption of a closed population holds in multiple field surveys within short-time intervals (no change of population size through births, deaths, immigration and emigration). Fisher's logarithm series model or the more explicitly continuous version indeed refers to the case of neutral metacommunity or completely isolated community with a fixed size , . The capture probability distribution in Model I reflects the pattern in a closed population. Unlike Model I, Model II is probably more flexible for a closed population or an open population (e.g., the carry-over between batches in the same manufacturer) with a fixed population size N. Previous theories in population genetics demonstrate that the beta function can be used to describe the distribution of gene frequency (abundance data) in a local open population with a fixed size (, p 362), given the presence of a constant ratio of effective (Ne) versus real (N) population sizes. Bunge and Barger  have discussed the connection of the beta-distribution for incidence model to the log-beta distribution for the abundance models. Such a connection needs further exploration from the zero-truncated beta function for incidence model to the function for the abundance model. It cannot be excluded that exchanges of individuals may generate an array of patterns of capture probability distributions in an open population (Figure 1B). Different from the model of Jolly  and Seber and Manly , Model II can deal with the case of heterogeneous capture probability. Previous models for an open population assumed constant homogeneous capture probability , , but their comparisons with Model II need empirical evaluations.
To apply the proposed models for estimating animal abundance in a single batch, the following steps are needed. First, we need to select appropriate markers to identify individual profiles. For a single marker, a large gene diversity or heterozygosity should be selected. For multiple markers, linkage equilibria among them should be required so as to avoid redundant information. The number of markers can be decided by their joint PI (), or more conservatively by the joint PI of sibs as the reference. Waits et al.  suggested that the number of markers generating a joint PI<0.0001 can be used for mark-recapture analysis. The present study sufficiently meets these two criteria. Second, we need to decide an appropriate survey scheme. Our simulation results recommend that the scheme of multiple surveys, each with a relatively small sample size, is better than the scheme with limited surveys, each with a relatively large sample size. Multiple surveys with small sample sizes are better in reflecting the true pattern of capture-recapture frequency. However, this is not the case for the non-parameter estimators that rely on the frequencies of one- and two-time captures (e.g., f1 and f2 in Chao's estimator ). Third, the capture-recapture frequencies, fi's (i = 1,…, t), can be calculated by either Genecap (HWE and without LD; ) or the Pearson's coefficients (without LD) used in this study. Fourth, once all capture and recapture frequencies ( fi's) are available, MLE can be obtained by applying the proposed models. The advantage of the proposed model over some non-parameter models lies in that the full information on capture-recapture frequency is employed. Further, MLE becomes unbiased as the total number of captured individuals (n) increases in multiple surveys.
Finally, in the phase of meat processing, tracing the finished ground meat products inevitably involves decision-making on tracing within and between batches. Our results recognize the complexity of tracking and tracing ground meat batches based on the trimmings since more than 1000 animals could be included in a single grind batch. Grinding operations are the last phase before the market or end-users in the meat supply chain . The existing meat traceability systems are primarily documented in regards to the primal cuts  and have inadequate tracing of the mixed trimmings. Also, analysis with Genepop indicates that population (batch) differentiation was very small among these six batches, with the 95%CI for multilocus Fst being within [0.1%, 0.2%] (detailed results not shown here). Further extensive analysis is needed to investigate batch differentiation using measures differing in sensitivity to population differentiation. With the use of Models I and II, a large number of animals comprise each batch of ground meat. Based on this premise, a sampling scheme can be implemented which provides sufficient DNA information to effectively differentiate ground meat batches. Development of additional statistical models to establish a reliable framework for the genetic characterization of individual ground beef batches is undertaken. Establishing methods by which individual ground beef batches can be identified can significantly reduce the scope of a product recall in the event of a contamination incident. For instance, contamination with E. coli 0157:H7 accounts for 24% of FSIS recalls in the United States in 2009 .This would have a significant impact on the economics and efficiency of the recall process.
Partial differentials of the log likelihood function for Model I.
We sincerely appreciate Rongling Wu and one referee for very helpful comments that substantially improved the earlier version of this article.
Conceived and designed the experiments: SSM CNM GP. Performed the experiments: SSM CNM GP. Analyzed the data: XSH JS SSP. Contributed reagents/materials/analysis tools: SSM CNM GP SSP XSH JS. Wrote the paper: XSH CNM SSP GP. Developed method and conducted simulations: XSH.
- 1. Loftus RT, Meghen CN (2011) Tracing Meat Products through the Production and Distribution Chain Farm to Consumer. In: Budowle B, Schutzer SE, Breeze RG, Keim PS, Morse SA, editors. Microbial Forensics, 2nd Edition. pp. 59–73.
- 2. The Meat-Like protein market in Canada-2004. Available: www.soy2020.ca/pdfs/MeatLikeProteinMarketCanada.pdf. Accessed 2010 May 13.
- 3. Davis CG, Lin BH (2005) (2005) Factors Affecting U.S. Beef Consumption/LDP-M-135-02 Economic Research Service/USDA.
- 4. Seber GAF (1982) The estimation of animal abundance and related parameters. London: Griffin.
- 5. Seber GAF (1986) A review of estimating animal abundance. Biometrics 42: 267–292.
- 6. Chao A (2001) An overview of closed capture-recapture models. Journal of Agricultural, Biological, and Environmental Statistics 6: 158–175.
- 7. Chao A (2005) Species estimation and applications. In: Balakrishnan N, Read CB, Vidakovic B, editors. Encyclopaedia of Statistical Sciences. 12. : 7907–7916. Wiley, New York.
- 8. Burnham KP, Overton WS (1978) Estimation of the size of a closed population when capture probabilities vary among animals. Biometrics 65: 625–633.
- 9. Smith EP, van Belle GV (1984) Nonparametric estimation of species richness. Biometrics 40: 119–129.
- 10. Chao A (1987) Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43: 783–791.
- 11. Xu C, Xu L, Yu F, Tan W, Moroz LL, et al. (2006) Nonparametric estimation of the number of unique sequences in biological samples. IEEE Transactions on Signal Processing 54: 3759–3767.
- 12. Chao A, Lee SM (1992) Estimating the number of classes via sample coverage. Journal of the American statistical Association 87: 210–217.
- 13. Otis DL, Burnham KP, White GC, Anderson DR (1978) Statistical inference from capture data on closed animal populations. Wildlife Monographs 62: 1–135.
- 14. Fisher RA, Corbet AS, Williams CB (1943) The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology 12: 42–58.
- 15. Sanathanan L (1972) Estimating the size of a multinormial population. Annals of Mathematical Statistics 42: 58–69.
- 16. Sanathanan L (1977) Estimating the size of a truncated sample. Journal of the American Statistical Association 72: 669–672.
- 17. Bunge J, Barger K (2008) Parametric models for estimating the number of classes. Biometrical Journal 50: 971–982.
- 18. Chao A, Bunge J (2002) Estimating the number of species in a stochastic abundance model. Biometrics 58: 531–539.
- 19. Shen TJ, He F (2008) An incidence-based richness estimator for quadrats sampled without replacement. Ecology 89: 2052–2060.
- 20. Heaton MP, Harhay GP, Bennett GL, Stone RT, Grosse WM, et al. (2002) Selection and use of SNP markers for animal identification and paternity analysis in U.S. beef cattle. Mammalian Genome 13: 272–281.
- 21. Dorazio RM, Royle JA (2003) Mixture models for estimating the size of a closed population when capture rates vary among individuals. Biometrics 59: 351–364.
- 22. He FL, Hu XS (2005) Hubbell's fundamental species diversity and Simpson's index. Ecology Letters 8: 386–390.
- 23. Wright S (1938) The distribution of gene frequencies under irreversible mutation. Proceedings of the National Academy of Sciences of the U.S.A. 24: 253–259.
- 24. Kimura M, Crow JF (1964) The number of alleles that can be maintained in a finite population. Genetics 49: 725–738.
- 25. Kimura M (1983) The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge.
- 26. Good IJ (1953) The population frequencies of species and the estimation of population parameters. Biometrika 40: 237–264.
- 27. Leigh EG (1999) Tropical Forest Ecology. Oxford University Press, Oxford.
- 28. Hubbell SP (2001) The Unified Neutral Theory of Biodiversity and Biogeography. Princeton University Press, Princeton, NJ.
- 29. Wright S (1969) Evolution and the Genetics of Populations. Vol. 2: The Theory of Gene Frequencies. Chicago, IL: The University of Chicago Press.
- 30. Hu XS, He FL, Hubbell SP (2007) Species diversity in neutral local communities. The American Naturalist 170: 844–853.
- 31. Stollenwerk N, Jansen V (2011) Population Biology and Criticality: From critical birth-death process to self-organized criticality in mutation pathogen system. Imperial College Press, London.
- 32. Abramowitz M, Stegun IA (1972) Psi (digamma) function. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th printing. New York: Dover. pp. 258–259.
- 33. Heltshe JF, Forrester NE (1983) Estimating species richness using the jackknife procedure. Biometrics 39: 1–11.
- 34. Press WH, Flannery BP, Teukolsky SA, Vetterling WT (1991) Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge.
- 35. Chao A, Shen TJ (2003) User's Guide for Program SPADE (species prediction and diversity estimation). Updated May 2010. Available: http://chao.stat.nthu.edu.tw/. Accessed 2011 September 10.
- 36. Bulmer M (1974) On fitting the Poisson lognormal distribution of species-abundance data. Biometrics 30: 101–110.
- 37. Ord JK, Whitmore GA (1986) The Poisson-inverse Gaussian distribution as a model for species abundance. Communications in Statistics-theory and Methods 15: 853–871.
- 38. Sichel HS (1986) Parameter estimation for a word frequency distribution based on occupancy theory. Communications in Statistics-Theory and Methods 15: 935–949.
- 39. Edwards WR, Eberhardt L (1967) Estimating Cottontail abundance from livetrapping data. The Journal of Wildlife Management 31: 87–96.
- 40. Schnabel ZE (1938) Estimation of total fish population of a lake. The American Mathematical Monthly 45: 348–352.
- 41. Schumacher FX, Eschmeyer RW (1943) The estimate of fish population in lakes or ponds. Journal of Tennessee Academy of Science 18: 228–249.
- 42. Chao A, Hwang WH, Chen YC, Kuo CY (2000) Estimating the number of shared species in two communities. Statistica Sinica 10: 227–246.
- 43. Wilberg MJ, Dreher BP (2004) Genecap: a program for analysis of multilocus genotype data for non-invasive sampling and capture-recapture population estimation. Molecular Ecology Resources 4: 783–785.
- 44. Raymond M, Rousset F (1995) GENEPOP (version 1.2): population genetics software for exact tests and ecumenicism. Journal of Heredity 86: 248–249.
- 45. Meghen CN, Scott CS, Bradley DG, MacHugh DE, Loftus RT, et al. (1998) DNA based traceability techniques for the beef industry. Animal Genetics 29S1: 48–49.
- 46. Mills LS, Citta JJ, Lair KP, Schwartz MK, Tallmon DA (2000) Estimating animal abundance using onoinvasive DNA sampling: promise and pitfalls. Ecological Applications 10: 283–294.
- 47. Cunningham EP, Meghen CN (2001) Biological identification systems: genetic markers. Scientific and Technical Review of the Office International des Épizooties 20: 491–499.
- 48. Waits LP, Luikart G, Taberlet P (2001) Estimating the probability of identity among genotypes in natural populations: cautions and guidelines. Molecular Ecology 10: 249–256.
- 49. Shackell GH, Dodds KG (2008) DNA-based traceability of meat. In: Toldrá F, editor. Meat Biotechnology. Springer Science Business Media, LLC. pp. 61–88.
- 50. Lukacs PM, Eggert LS, Burnham KP (2007) Estimating population size from multiple detections with non-invasive genetic data. Wildlife Biology in Practice 3: 83–92.
- 51. Jolly GM (1981) Mark-recapture-what next? In: Ralph CJ, Scott JM, editors. Estimating the Number of Terrestrial Birds. Studies in Avian Biology. 6. Oxford: Pergamon. pp. 137–138.
- 52. Seber GAF, Manly BFJ (1985) Approximately unbiased variance estimation for the Jolly-Seber mark-recapture model: population size. In: Morgan BJT, North PM, editors. Statistics in Ornithology, Lecture Notes in Statistics, No. 29. New York: Springer-Verlag. pp. 363–371.
- 53. Jolly GM (1982) Mark-recapture models with parameters constant in time. Biometrics 38: 301–321.
- 54. Mousavi A, Sarhadi M, Lenk A, Fawcett S (2002) Tracking and traceability in the meat processing industry: a solution. British Food Journal 104: 7–19.
- 55. Food Safety Inspection Service, Recall Case Archive. Available: www.fsis.usda.gov/fsis_recalls/Recall_Case_Archive_2009/index.asp. Accessed 2010 May 13.