## Correction

22 Jan 2014: Hampton J, Lladser ME (2014) Correction: Estimation of Distribution Overlap of Urn Models. PLOS ONE 9(1): 10.1371/annotation/cd4610af-3595-4ea2-9409-71770712f2a9. doi: 10.1371/annotation/cd4610af-3595-4ea2-9409-71770712f2a9 View correction

## Figures

## Abstract

A classical problem in statistics is estimating the expected coverage of a sample, which has had applications in gene expression, microbial ecology, optimization, and even numismatics. Here we consider a related extension of this problem to random samples of two discrete distributions. Specifically, we estimate what we call the dissimilarity probability of a sample, i.e., the probability of a draw from one distribution not being observed in draws from another distribution. We show our estimator of dissimilarity to be a -statistic and a uniformly minimum variance unbiased estimator of dissimilarity over the largest appropriate range of . Furthermore, despite the non-Markovian nature of our estimator when applied sequentially over , we show it converges uniformly in probability to the dissimilarity parameter, and we present criteria when it is approximately normally distributed and admits a consistent jackknife estimator of its variance. As proof of concept, we analyze V35 16S rRNA data to discern between various microbial environments. Other potential applications concern any situation where dissimilarity of two discrete distributions may be of interest. For instance, in SELEX experiments, each urn could represent a random RNA pool and each draw a possible solution to a particular binding site problem over that pool. The dissimilarity of these pools is then related to the probability of finding binding site solutions in one pool that are absent in the other.

**Citation: **Hampton J, Lladser ME (2012) Estimation of Distribution Overlap of Urn Models. PLoS ONE 7(11):
e42368.
doi:10.1371/journal.pone.0042368

**Editor: **Yu Zhang,
Pennsylvania State University, United States of America

**Received: **March 19, 2012; **Accepted: **July 6, 2012; **Published: ** November 6, 2012

**Copyright: ** © 2012 Hampton, Lladser. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Funding: **This work was partially supported by the National Science Foundation grant DMS #0805950, the National Institutes of Health (HG4872), and the Crohns and Colitis Foundation of America. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

**Competing interests: ** The authors have declared that no competing interests exist.

## Introduction

An inescapable problem in microbial ecology is that a sample from an environment typically does not observe all species present in that environment. In [14], this problem has 1been recently linked to the concepts of *coverage probability* (i.e. the probability that a member from the environment is represented in the sample) and the closely related *discovery* or *unobserved probability* (i.e. the probability that a previously unobserved species is seen with another random observation from that environment). The mathematical treatment of coverage is not limited, however, to microbial ecology and has found applications in varied contexts, including gene expression, microbial ecology, optimization, and even numismatics.

The point estimation of coverage and discovery probability seem to have been first addressed by Turing and Good [2] to help decipher the Enigma Code, and subsequent work has provided point predictors and prediction intervals for these quantities under various assumptions [1], [3]–[5].

Following Robbins [6] and in more generality Starr [7], an unbiased estimator of the expected discovery probability of a sample of size is(1)where is the number of species observed exactly -times in a sample with replacement of size . Using the theory of U-statistics developed by Halmos [8], Clayton and Frees [9] show that the above estimator is the *uniformly minimum variance unbiased estimator* (UMVUE) of the expected discovery probability of a sample of size based on an enlarged sample of size .

A quantity analogous to the discovery probability of a sample from a single environment but in the context of two environments is *dissimilarity*, which we broadly define as the probability that a draw in one environment is not represented in a random sample (of a given size) from a possibly different environment. Estimating the dissimilarity of two microbial environments is therefore closely related to the problem of assessing the species that are unique to each environment, and the concept of dissimilarity may find applications to measure sample quality and allocate additional sampling resources, for example, for a more robust and reliable estimation of the UniFrac distance [10], [11] between pairs of environments. Dissimilarity may find applications in other and very different contexts. For instance, in SELEX experiments [12]–a laboratory technique in which an initial pool of synthesized random RNA sequences is repeatedly screened to yield a pool containing only sequences with given biological functions–the dissimilarity of two RNA pools corresponds to the probability of finding binding site solutions in one pool that are absent in the other.

In this manuscript, we study an estimator of dissimilarity probability similar to Robbins' and Starr's statistic for discovery probability. Our estimator is optimal among the appropriate class of unbiased statistics, while being approximately normally distributed in a general case. The variance of this statistic is estimated using a consistent jackknife. As proof of concept, we analyze samples of processed V35 16S rRNA data from the Human Microbiome Project [13].

### Probabilistic Formulation and Inference Problem

To study dissimilarity probability, we use the mathematical model of a pair of urns, where each urn has an unknown composition of balls of different colors, and where there is no a priori knowledge of the contents of either urn. Information concerning the urn composition is inferred from repeated draws with replacement from that urn.

In what follows, and are independent sequences of independent and identically distributed (i.i.d.) discrete random variables with probability mass functions and , respectively. Without loss of generality we assume that and are supported over possibly infinite subsets of , and think of outcomes from these distributions as “colors”: i.e. we speak of color-, color-, etc. Let denote the set of colors such that , and similarly define . Under this perspective, denotes the color of the -th ball drawn with replacement from urn-. Similarly, is the color of the -th ball drawn with replacement from urn-. Note that based on our formulation, distinct draws are always independent.

The mathematical analysis that follows was motivated by the problem of estimating the fraction of balls in urn- with a color that is absent in urn-. We can write this parameter as(2)where

(3)The parameter measures the proportion of urn- which is unique from urn-. On the other hand, is a measure of the effectiveness of -samples from urn- to determine uniqueness in urn-. This motivates us to refer to the quantity in (2) as the *dissimilarity of urn-** from urn-*, and to the quantity in (3) as the *average dissimilarity of urn-** relative to **-draws from urn-*. Note that these parameters are in general asymmetric in the roles of the urns. In what follows, urns- and - are assumed fixed, which motivates us to remove subscripts and write instead of .

Unfortunately, one cannot estimate unbiasedly the dissimilarity of one urn from another based on finite samples, as stated in the following result. (See the Materials and Methods section for the proofs of all of our results).

**Theorem 1**
* (No unbiased estimator of dissimilarity.) There is no unbiased estimator of based on finite samples from two arbitrary urns- and -.*

Furthermore, estimating accurately without further assumptions on the compositions of urns- and - seems a difficult if not impossible task. For instance, arbitrarily small perturbations of urn- are likely to be unnoticed in a sample of a given size from this urn but may drastically affect the dissimilarity of other urns from urn-. To demonstrate this idea, consider a parameter and let , and . If then while, for each , .

In contrast with the above, for fixed , depends continuously on e.g. under the metricwhere denotes the total variation of a signed measure over such that . This is the case because

The above implies that is continuous with respect to any metric equivalent to . Many such metrics can be conceived. For instance, if denotes the probability measure associated with samples with replacement from urn- that are independent of samples with replacement from urn- then is also continuous with respect to any of the metrics , with , because

Because of the above considerations, we discourage the direct estimation of and focus on the problem of estimating accurately.

## Results

Consider a finite number of draws with replacement and , from urn- and urn-, respectively, where are assumed fixed. Using this data we can estimate , for , via the estimator:(4)where

(5)We refer to as the -statistics summarizing the data from both urns. Due to the well-known relation: , at most of these estimators are non-zero. This sparsity may be exploited in the calculation of the right-hand side of (4) over a large range of 's.

Our statistic in is the U-statistic associated with the kernel , where is used to denote the indicator function of the event within the brackets (Iverson's bracket notation). Following the approach by Halmos in [8], we can show that this U-statistic is optimal amongst the unbiased estimators of for . We note that no additional samples from either urn are necessary to estimate unbiasedly over this range when . This contrasts with the estimator in equation (1), which requires sample enlargement for unbiased estimation of discovery probability of a sample of size .

**Theorem 2** *(Minimum variance unbiased estimator.) If ** and ** then* * is the unique uniformly minimum variance unbiased estimator of **. Further, no unbiased estimator of ** exists for ** or *.

Our next result shows that converges uniformly in probability to over the largest possible range where unbiased estimation of the later parameter is possible, despite the non-Markovian nature of when applied sequentially over . The result asserts that is likely to be a good approximation of , uniformly for , when and are large. The method of proof uses an approach by Hoeffding [14] for the exact calculation of the variance of a -statistic.

**Theorem 3** *(Uniform convergence in probability.) Independently of **how ** and ** tend to infinity, it follows for each that*

We may estimate the variance of for via a leave-one-out or also called delete- jackknife estimator, using an approach studied by Efron and Stein [15] and Shao and Wu [16].

To account for variability in the -data through a leave-one-out jackknife estimate, we require that and let(7)

On the other hand, to account for variability in the -data, consider for and the statistics(8)

Clearly, ; in particular, the -statistics are a refinement of the -statistics. Define and, for , define(9)where

Our estimator of the variance of is obtained by summing the variance attributable to the -data and the -data and is given by(12)for ; in particular, is our jackknife estimate of the standard deviation of .

To assess the quality of as an estimate of the variance of and the asymptotic distribution of the later statistic, we require a few assumptions that rule out degenerate cases. The following conditions are used in the remaining theorems in this section:

- .
- there are at least two colors in that occur in different proportions in urn-; in particular, the conditional probability is not a uniform distribution.
- urn- contains at least one color that is absent in urn-; in particular, .
- and grow to infinity at a comparable rate i.e. , which means that there exist finite constants such that , as tend to infinity.

Conditions (a–c) imply that has a strictly positive variance and that a projection random variable, intermediate between and , has also a strictly positive variance. The idea of projection is motivated by the analysis of Grams and Serfling in [17].

Condition (d) is technical and only used to show that the result in Theorem 5 holds for the largest possible range of values of namely, for . See [18] for results with uniformity related to Theorem 4, as well as uniformity results when condition (d) is not assumed.

Because the variance of , from now on denoted , and its estimate tend to zero as and increase, the unnormalized consistency result is unsatisfactory. As an alternative, we can show that is a consistent estimator relative to , as stated next.

**Theorem 4** *(Asymptotic consistency of variance estimation.) If conditions (a)–(c) are satisfied then, for each ** and , it applies that*

Finally, under conditions (a)–(d), we show that is asymptotically normally distributed for all , as and increase at a comparable rate.

**Theorem 5** *(Asymptotic normality.) Let ** i.e. ** has a standard normal distribution. If conditions (a)–(d) are satisfied then*(14)*for all real number* .

The non-trivial aspect of the above result is the asymptotic normality of when , e.g. , as the results we have found in the literature [14],[19],[20] only guarantee the asymptotic normality of our estimator of for fixed . We note that, due to Slutsky's theorem [21], it follows from (13) and (14) that the ratiohas, for fixed , approximately a standard normal distribution when and are large and of a comparable order of magnitude.

## Discussion

As proof of concept, we use our estimators to analyze data from the Human Microbiome Project (HMP) [13]. In particular, our samples are V35 16S rRNA data, processed by Qiime into an operational taxonomic unit (OTU) count table format (see File S1). Each of the samples analyzed have more than successfully identified bacteria (see File S2). We sort these samples by the body location metadata describing the origin of the sample. This sorting yields the assignments displayed in Table 1.

We present our estimates of for all possible sample comparisons in Figure 1, i.e., we estimate the average dissimilarity of sample- relative to the full sample-. Due to (4), observe that . At the given sample sizes, we can differentiate four broad groups of environments: stool, vagina, oral/throat and skin/nostril. We differentiate a larger proportion of oral/throat bacteria found in stool than stool bacteria found in the oral/throat environments. We may also differentiate the throat, gingival and saliva samples, but cannot reliably differentiate between tongue and throat samples or between the subgingival and supragingival plaques. On the other hand, the stool samples have larger proportions of unique bacteria relative to other stool samples of the same type, and vaginal samples also have this property. In contrast the skin/nostril samples have relatively few bacteria that are not identified in other skin/nostril samples.

Heat map of sorted by site location metadata. Here, the -axis denotes the sample from the environment corresponding to urn-, and similarly for the -axis. The entries on the diagonal are set to zero.

The above effects may be a property of the environments from which samples are taken, or an effect of noise from inaccurate estimates due to sampling. To rule out the later interpretation, we show estimates of the standard deviation of based on the jackknife estimator from (12) in Figure 2. As is zero, the error estimate is given by . We see from (7), with , that

Heat map of sorted by site location metadata. Here, the -axis also denotes the sample from the environment corresponding to urn-, and similarly for the -axis, and the entries on the diagonal are again set to zero.

Assuming a normal distribution and an accurate jackknife estimate of variance, will be in the interval with at least approximately 95% confidence, for any choice of sample comparisons in our data; in particular, on a linear scale, we expect at least 95% of the estimates in Figure 1 to be accurate in at least the first two digits.

As we mentioned earlier, estimating accurately is a difficult problem. We end this section with two heuristics to assess how representative is of , when urn- has at least two colors and at least one color in common with urn-. First, observe that:(15)

In particular, is a strictly concave-up and monotonically decreasing function of the real-variable . Hence, if is close to the asymptotic value , then should be of small magnitude. We call the later quantity the *discrete derivative* of at . Since we may estimate the discrete derivative from our data, the following heuristic arises: *relatively large values of* *are evidence that* *is not a good approximation of *.

Figure 3 shows the heat map of for each pair of samples. These estimates are of order for the majority of the comparisons, and spike to for several sample- of varied environment types, when sample- is associated with a skin or vaginal sample. In particular, further sampling effort from environments associated with certain vaginal, oral or stool samples are likely to reveal bacteria associated with broadly defined skin or vaginal environments.

Heat map of , sorted by site location metadata, following the same conventions as in the previous figures.

Another heuristic may be more useful to assess how close is to , particularly when the previous heuristic is inconclusive. As motivation, observe that , because of the identity in (15), where

Furthermore, , where is certain finite constant. We can justify this approximation only when is well approximated by a linear function of , in which case we let denote the estimated value for obtained from the linear regression. Since , the following more precise heuristic comes to light: *is a good approximation of ** if the linear regression of ** for ** near ** gives a good fit, ** is small relative to **, and ** is also small.*

To fix ideas we have applied the above heuristic to three pairs of samples: , and , with each ordered pair denoting urn- and urn-, respectively. As seen in Table 2 for these three cases, is at least 14-times larger than ; in particular, due to the asymptotic normality of the later statistic, an appropriate use of the heuristic is reduced to a good linear fit and a small value. In all three cases, was computed from the estimates , with .

For the -pair, and the regression error, measured as the largest absolute residual associated with the best linear fit, are zero to machine precision, suggesting that is a good approximation of . This is reinforced by the blue plot in Figure 4. On the other hand, for the -pair, the regression error is small, suggesting that the linear approximation is good for . However, because , we cannot guarantee that is a good approximation of . In fact, as seen in the red-plot in Figure 4, , with , exposes a steady and almost linear decay that suggests that may be much smaller than . Finally, for the -pair, the regression error is large and the heuristic is therefore inconclusive. Due to the green-plot in Figure 4, the lack of fit indicates that the exponential rate of decay of to has not yet been captured by the data from these urns. Note that the heuristic based on the discrete derivative shows no evidence that is far from .

Plots of , with , for three pairs of samples of the HMP data.

## Materials and Methods

Here we prove the theorems given in the Results section. The key idea to prove each theorem may be summarized as follows.

To show Theorem 1, we identify pairs of urns for which unbiased estimation of is impossible for any statistic. To show Theorem 2, we exploit the diversity of possible urn distributions to show that there are relatively few unbiased estimators of and, in fact, there is a single unbiased estimator that is symmetric on the data. The uniqueness of the symmetric estimator is obtained via a completeness argument: a symmetric statistic having expected value zero is shown to correspond to a polynomial with identically zero coefficients, which themselves correspond to values returned by the statistic when presented with specific data. The symmetric estimator is a U-statistic in that it corresponds to an average of unbiased estimates of , based on all possible sub-samples of size and from the samples of urn- and -, respectively. As any asymmetric estimator has higher variance than a corresponding symmetric estimator, the symmetric estimator must be the UMVUE.

To show Theorem 3 we use bounds on the variance of the U-statistic and show that, uniformly for relatively small , converges to in the -norm. In contrast, for relatively large values of , we exploit the monotonicity of and to show uniform convergence.

Finally, theorems 4 and 5 are shown using an approximation of by sums i.i.d. random variables, as well as results concerning the variance of both and its approximation. In particular, the approximation satisfies the hypotheses the Central Limit Theorem and Law of Large Numbers, which we use to transfer these results to .

In what follows, denotes the set of all probability distributions that are finitely supported over .

### Proof of Theorem 1

Consider in probability distributions of the form , and , where is a given parameter. Any statistic which takes as input draws from urn- and draws from urn- has that is a polynomial of degree at most in the variable ; in particular, it is a continuous function of over the interval . Since has a discontinuity at over this interval, there exists no estimator of that is unbiased over pairs of distributions in .

We use lemmas 6–11 to first show Theorem 2. The method of proof of this theorem follows an approach similar to the one used by Halmos [8] for single distributions, which we extend here naturally to the setting of two distributions.

Our next result implies that no uniformly unbiased estimator of is possible when using less than one sample from urn- and samples from urn-.

**Lemma 6** *If ** is unbiased for* * for all **, then ** and *.

*Proof*. Consider in probability distributions of the form , , and , where are arbitrary real numbers. Clearly, is a linear combination of polynomials of degree in and in and, as a result, it is a polynomial of degree at most in and in . Since has degree in and in , and is unbiased for , we conclude that and .

The form of given in equation (4) is convenient for computation but, for mathematical analysis, we prefer its -statistic form associated with the kernel function .

In what follows, denotes the set of all functions that are one-to-one.

*Proof*. Fix and suppose that color occurs -times in . If then any sublist of size of contains , hence , for all . On the other hand, if then . Since the rightmost sum only depends on the number of times that color was observed in , we may use the -statistics defined in equation (5) to rewrite:

The right-hand side above now corresponds to the definition of given in equation (4).

In what follows, we say that a function is *-symmetric* whenfor all and permutations and of and , respectively. Alternatively, is -symmetric if and only if it may be regarded a function of , where and correspond to the order statistics and , respectively. Accordingly, a statistic of is called *-symmetric* when it may be represented in the form , for some -symmetric function . It is immediate from Lemma 7 that is -symmetric.

The next result asserts that the variance of any non-symmetric unbiased estimator of may be reduced by a corresponding symmetric unbiased estimator. The proof is based on the well-known fact that conditioning preserves the mean of a statistic and cannot increase its variance.

**Lemma 8** *An asymmetric unbiased estimator of ** that is square-integrable has a strictly larger variance than a corresponding **-symmetric unbiased estimator.*

*Proof*. Let denote the sigma-field generated by the random vector and suppose that the statistic is unbiased for and square-integrable. In particular, is a well-defined statistic and there is an -symmetric function such that . Clearly, is unbiased for and -symmetric. Since , Jensen's inequality for conditional expectations [22] implies that , with equality if and only if is -symmetric.

Since is -symmetric and bounded, the above lemma implies that if an UMVUE for exists then it must be -symmetric. Next, we show that there is a unique symmetric and unbiased estimator of , which immediately implies that is the UMVUE.

In what follows, denote integers. We say that a polynomial is *-homogeneous* when it is a linear combination of polynomials of the form , with and . Furthermore, we say that satisfies the *partial vanishing condition* if whenever , and .

The next lemma is an intermediate step to show that a -homogeneous polynomial which satisfies the partial vanishing condition is the zero polynomial, which is shown in Lemma 10.

**Lemma 9** *If ** is a **-homogeneous polynomial in the real variables **, with **, that satisfies the partial vanishing condition, then* * whenever **, ** and *.

*Proof*. Fix such that and and observe thatbecause is a -homogeneous polynomial. Notice now that the right hand-side above is zero because satisfies the partial vanishing condition.

**Lemma 10** *Let ** be a **-homogeneous polynomial in the real variables **, with **. If* * satisfies the partial vanishing condition then ** identically.*

*Proof*. We prove the lemma using structural induction on for all .

If then a -homogeneous polynomial must be of the form , for an appropriate constant . As such a polynomial satisfies the partial-vanishing condition only when , the base case for induction is established.

Next, consider a -homogeneous polynomial , with , that satisfies the partial vanishing condition, and let denote its degree with respect to the variable . In particular, there are polynomials in the variables such that

Now fix such that and . Because satisfies the partial vanishing condition, Lemma 9 implies that for all . In particular, for each , whenever , and . Thus each satisfies the partial vanishing condition. Since is a -homogeneous polynomial, the inductive hypothesis implies that identically and hence identically. The same argument shows that if , with , is a -homogeneous polynomial that satisfies the partial vanishing condition then identically, completing the inductive proof of the lemma.

Our final resultbefore proving Theorem 2 implies that cannot admit more than one symmetric and unbiased estimator. Its proof depends on the variety of distributions in , and uses the requirement that our estimator must be unbiased for any pair of distributions chosen from .

**Lemma 11** *If ** is an **-symmetric function such that* , for all , *then * *identically.*

*Proof*. Consider a point and define and as the cardinalities of the sets and , respectively. Furthermore, let denote the distinct elements in the set and define to be the number of times that appears in this set. Furthermore, let be a probability distribution such that and define . In a completely analogous manner define , , and .

Notice that is a polynomial in the real variables that satisfies the hypothesis of Lemma 10; in particular, this polynomial is identically zero. However, because is -symmetric, the coefficient of in isimplying that .

### Proof of Theorem 2

From Lemma 8, as we mentioned already, if the UMVUE for exists then it must be -symmetric. Suppose there are two -symmetric functions such that and are unbiased for . Applying Lemma 11 to shows that , and admits therefore a unique symmetric and unbiased estimator. From Lemma 7, is -symmetric and unbiased for hence it is the UMVUE for . From Lemma 6, it follows that no unbiased estimator of exists for or .

Our next goal is to show Theorem 3, for which we prove first lemmas 12–13. We note that the later lemma applies in a much more general context than our treatment of dissimilarity.

**Lemma 12** *If, for each **, ** is an integer such that ** then*(17)(18)*uniformly for* *as* .

*Proof*. First observe that for all sufficiently large and , it applies that

Note that , for all . As a result, we may bound the exponential factor on the right-hand side above as follows:

Since and , uniformly for all as , (17) follows.

To show (18), first note the combinatorial identity(19)

Proceeding in an analogous manner as we did to show (17), we see now that the term associated with the index in the above summation satisfies thatfor all sufficiently large and . Since and , the above inequalities together with (17) and (19) establish (18).

**Lemma 13** *Define **, where* * is a bounded **-symmetric function, and let**be the U-statistic of* *associated with* *draws from urn*- *and* *draws from urn*-; *in particular*, . *Furthermore, assume that*

- ,
*there is a function**such that*,- ;
*in particular*, .

*Under the above assumptions, it follows that*

*Proof*. Define ; in particular, and , and , for any , as . The proof of the theorem is reduced to show that(20)(21)

Next, we compute the variance of following an approach similar to Hoeffding [14]. Because is -symmetric, a tedious yet standard calculation shows that(22)where

Clearly, . On the other hand, if is any random variable with finite expectation and are sigma-fields then , due to well-known properties of conditional expectations [22]. In particular, for each , we have that(25)

Consequently, (22) implies that(26)

Indeed, using an argument similar as above, we find that

Due to assumptions (*i*)-(*ii*) and the Bounded Convergence Theorem, the right-hand side above tends to , and the claim follows.

It follows from (26) and (27) that

Finally, because of assumption (*iii*),

Since each term on the right-hand side above tends to zero as , (20) follows.

We now show (21). As and , it follows by (22) and Lemma 12 thatuniformly for as . In particular,

Due to the definition of the coefficients , the right-hand side above tends to zero, and (21) follows.

### Proof of Theorem 3

Note that , with . We show that the kernel function and the U-statistics satisfy the hypotheses of Lemma 13. From this the theorem is immediate because -convergence implies convergence in probability.

Clearly is -symmetric and , which shows assumption (i) in Lemma 13. On the other hand, due to the Law of Large Numbers, almost surely, from which assumption (ii) in the lemma also follows.

Finally, to show assumption (iii), recall that is the set of one-to-one functions from into ; in particular, . Now note that for each indicator of the form , with , there are choices of outside the set . Because , it follows that for all . This shows condition (iii) in Lemma 13, and Theorem 3 follows.

### Proof of equation (7)

The jackknife estimate of the variance of obtained from removing a single -data is, by definition, the quantity(28)

Note that removing a color from the -data which would otherwise add to , decrements this quantity by one unit. Let denote the -statistics associated with the data when observation from urn- is removed from the sample. Note that as each draw from urn- contributes to exactly one , for all except for some where . We have therefore that

Since there are draws from urn- which contribute to , the above sum may be now rewritten in the form given in (7).

### Proof of equation (9)

Similarly, corresponds to the jackknife summed over each possible deletion of a single -data, which is more precisely given by(29)where is the set of one-to-one functions from into .

Recall that is the number of colors seen times in draws from urn- and times in draws from urn-, giving that .

Fix and suppose that is of a color that contributes to , for some Removing from the data decrements and increments by one unit. Proceeding similarly as in the case for , if is used to denote the -statistics when observation is removed from sample-, thenwhere and are as defined in (10) and (11). Noting that for each there are draws from urn- that contribute to , the form in (9) follows.

In what follows, we specialize the coefficients in (23) and (24) to the kernel function of dissimilarity, . From now on, for each and , define(30)(31)

Above it is understood that the sigma-field generated by when is ; in particular, , for all .

The following asymptotic properties of are useful in the remaining proofs.

**Lemma 14** *Assume that conditions (a)-(c) are satisfied and define **. It follows that ** and* . *Furthermore*(32)(33)(34)

*Proof*. Observe that conditions (a)–(b) imply that . In addition, condition (b) implies that , whereas condition (c) implies that .

Next, consider the seti.e. is the set of rarest colors in urn- which are also in urn-. Also note that

(35)As an intermediate step before showing (32), we prove that(36)

Hencefrom which (36) now easily follows.

To show (32) note that (36) implieswhich establishes (32).

Now note thatwhich establishes (33).

Next we show (34), which we note gives more precise information than (27). Consider the random variable defined as the smallest such that . We may bound the probability of being large by , where is finite because of condition (a). On the other hand, note that

Define and observe that, over the event , . Since , we obtain that

The identity in equation (34) is now a direct consequence of (35).

Our next goal is to show Theorems 4 and 5. To do so we rely on the method of projection by Grams and Serfling [17]. This approach approximates by the random variable

The projection is the best approximation in terms of mean squared error to that is a linear combination of individual functions of each datapoint.

Under the stated conditions, is the sum of two independent sums of non-degenerate i.i.d. random variables and therefore satisfies the hypotheses of the classical central limit theorem. The variance of the projection is easier to analyze and estimate than the -statistic directly, which is relevant in establishing consistency for the jackknife estimation of variance.

Letbe the remainder of that is not accounted for by its projection. When is small relative to , is mostly explained by in relative terms.

The next lemma summarizes results about the asymptotic properties of , particularly with relation to the scale of as given by its variance.

*Under assumptions (a)-(c), for a fixed* , *we have that*(39)

*Furthermore, under assumptions (a)–(d) we have that*
(40)

*for all* .

*Proof*. A direct calculation from the form given in (16) gives that(42)(43)

As , (37) follows.

To show (38), first observe that(44)

Next, using the definition of the projection, we obtain thatfrom which (38) follows, due to the identity in (44). Note that the last identity implies that and are uncorrelated.

Before continuing, we note that (41) is a direct consequence of (38), (40) and Chebyshev's inequality [22]. To complete the proof of the lemma all reduces therefore to show (41) under conditions (a)–(d). Indeed, if and we let then due to the identities in (22) and (37) and Lemma 12, we obtain under (a)–(c) thatuniformly for all , as . Since for all , we have thus shown (39). Furthermore, note that for all ; in particular, due to (33) and conditions (a)–(d), we can assert that . Since , the above identity together with the one in (37) let us conclude that

as . Because of condition (d), the big-O term above tends to . As a result:

(45)On the other hand, (38) implies that . Hence, using (19) and (25) to bound from above the variance of the U-statistic, we obtain:as , where for the last identity we have used (32) and (34). Since , it follows from the above identity that

In particular, if the base- in the logarithm is selected to satisfy that , then(46)

The identities in equation (45) and (46) show (41), which completes the proof of the lemma.

### Proof of Theorem 5

For a fixed , note that is the sum of two independent sums of non-degenerate i.i.d. random variables and thus,is asymptotically a standard Normal random variable as by the classical Central Limit Theorem. We would like to show however that this convergence also applies if we let vary with and . We do so using the Berry-Esseen inequality [23]. Motivated by this we define the random variables

We need to show that(47)uniformly for , as .

It follows from (37) that

But note that . Since, according to Lemma 14, decreases exponentially fast, we obtainuniformly for all , as . On the other hand, . Furthermore, (33) implies that . Since , for some finite constant we find that

The above establishes convergence in distribution of to a standard normal random variable uniformly for , as . The end of the proof is an adaptation of the proof of Slutsky's Theorem [21]. Indeed, note that(48)

From this identity, it follows for any fixed that

The first term on the right-hand side of the above inequality can be made as close to as wanted, uniformly for , as , because of (40). On the other hand, the second term tends to uniformly for because of (41). Letting , shows that

Similarly, using (48), we have:

and a similar argument as before shows now that

which completes the proof of the theorem.

We finally show Theorem 4, for which we first show the following result.

**Lemma 16** *Let ** be the set of one-to-one functions from ** into **. Consider the kernel **, and define*(49)(50)(51)(52)*Then, for each* *and* ,

*Proof*. Fix . We first use a result by Sen [24] to show that, for each :(55)(56)(57)(58)

in an almost sure sense. Indeed, assume without loss of generality that . As the kernel functions found in (49) and (51) are bounded, the hypotheses of Theorem 1 in [24] are satisfied, from which (55) and (57) are immediate. Similarly, because and are discrete random variables, (56) and (58) also follow from [24].

Define(59)(60)and observe that

Furthermore, due to (55)–(58), we have that(61)(62)

But note that, for , and are independent and hence uncorrelated. Similarly, the random variables and are independent. Since and , it follows from (61) and (62), and the Bounded Convergence Theorem [22] that(63)(64)as .

Finally, by (30) and (31) it follows that

In particular, again by the Bounded Convergence Theorem, we have that and . Sincethe lemma is now a direct consequence of (63) and (64), and Theorem 1.5.4 of Durrett [22].

### Proof of Theorem 4

It follows by (28) and (29) that(65)(66)where and are as in (59) and (60), respectively. Furthermore, observe that

In particular, due to (37), we obtain that

By Lemma 16, converges in probability to , while similarly converges in probability to ; in particular, the first two terms on the right-hand side of the inequality converge to in probability. Since and , the same can be said about the last two terms of the inequality. Consequently, converges to in probability, as . As stated in (39), however, conditions (a)–(c) imply that and are asymptotically equivalent as , from which the theorem follows.

## Supporting Information

## Acknowledgments

We thank Rob Knight for insightful discussions and comments about this manuscript, and Antonio Gonzalez for providing processed OTU tables from the Human Microbiome Project.

## Author Contributions

Conceived and designed the experiments: JH ML. Performed the experiments: JH. Analyzed the data: JH ML. Contributed reagents/materials/analysis tools: JH ML. Wrote the paper: JH ML. Directed the research project: ML.

## References

- 1. Lladser ME, Gouet R, Reeder J (2011) Extrapolation of urn models via poissonization: Accurate measurements of the microbial unknown. PLoS ONE 6: e21105.
- 2. Good IJ (1953) The population frequencies of species and the estimation of population parameters. Biometrika 40: 237–264.
- 3. Esty WW (1983) A Normal limit law for a nonparametric estimator of the coverage of a sample. Ann Stat 11: 905–912.
- 4. Mao CX (2004) Predicting the conditional probability of finding a new class. J Am Stat Assoc 99: 1108–1118.
- 5. Lijoi A, Mena RH, Prünster I (2007) Bayesian nonparametric estimation of the probability of discovering new species. Biometrika 94: 769–786.
- 6. Robbins HE (1968) Estimating the total probability of the unobserved outcomes of an experiment. Ann Math Statist 39: 256–257.
- 7. Starr N (1979) Linear estimation of discovering a new species. Ann Stat 7: 644–652.
- 8. Halmos PR (1946) The theory of unbiased estimation. Ann Math Statist 17: 34–43.
- 9. Clayton MK, Frees EM (1987) Linear estimation of discovering a new species. J Am Stat Assoc 82: 305–311.
- 10. Lozupone C, Knight R (2005) UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol 71: 8228–8235.
- 11. Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R (2011) UniFrac: An e_ective distance metric for microbial community comparison. ISME J 5: 169–172.
- 12. Tuerk C, Gold L (1990) Systematic evolution of ligands by exponential enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science 249: 505–510.
- 13. Huttenhower C, Gevers D, Sathirapongsasuti JF, Segata N, Earl AM, et al. (2012) Structure, function and diversity of the healthy human microbiome. Nature 486: 207–214.
- 14. Hoeffding W (1948) A class of statistics with asymptotically normal distribution. Ann Math Statist 19: 293–325.
- 15. Efron B, Stein C (1981) The jackknife estimate of variance. Ann Stat 9: 586–596.
- 16. Shao J, Wu C (1989) A general theory for jackknife variance estimation. Ann Stat 17: 1176–1197.
- 17. Grams WF, Sering RJ (1973) Convergence rate for U-statistics and related statistics. Ann Stat 1: 153–160.
- 18.
Hampton JD (2012) Dissimilarity and Optimal Sampling in Urn Ensemble Model. Ph.D. thesis, University of Colorado, Boulder, Colorado.
- 19. Ahmad AA (1980) On the Berry-Esseen theorem for random U-statistics. Ann Stat 8: 1395–1398.
- 20. Callaert H, Janssen P (1978) The Berry-Esseen theorem for U-statistics. Ann Stat 6: 417–421.
- 21. Slutsky E (1925) Über stochastische asymptoten und grenzwerte. Metron 5: 3–89.
- 22.
Durrett R (2010) Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press. URL http://books.google.com/books?id=evbGTPhuvSoC.
- 23. Shevtsova IG (2010) An improvement of convergence rate estimates in the Lyapunov theorem. Doklady Mathematics 82: 862–864.
- 24. Sen PK (1977) Almost sure convergence of generalized U-statistics. Ann Probab 5: 287–290.