How Many Genes Are Expressed in a Transcriptome? Estimation and Results for RNA-Seq

RNA-seq experiments estimate the number of genes expressed in a transcriptome as well as their relative frequencies. However, an undetermined number of genes can remain undetected due to their low expression relative to the sample size (sequence depth). Estimation of the true number of genes expressed in a transcriptome is essential in order to determine which genes are exclusively expressed in specific tissues or under particular conditions. A reliable estimate of the true number of expressed genes is also required to accurately measure transcriptome changes and to predict the sequencing depth needed to increase the proportion of detected genes. This problem is analogous to ecological sampling problems such as estimating the number of species at a given site. Here we present a non-parametric estimator for the number of undetected genes as well as for the extra sample size needed to detect a given proportion of the undetected genes. Our estimators are superior to ones already published by having smaller standard errors and biases. We applied our method to a set of 32 publicly available RNA-seq experiments, including the evaluation of 311 individually sequenced libraries. We found that in the majority of the cases more than one thousand genes are undetected, and that on average approximately 6% of the expressed genes per accession remain undetected. This figure increases to approximately 10% if individual sequencing libraries are analyzed. Our method is also applicable to metagenomic experiments. Using our method, the number of undetected genes as well as the sample size needed to detect them can be calculated, leading to more accurate and complete gene expression studies.


Introduction
The transcriptome The transcriptome can be considered as the set of all RNA molecules, including mRNAs, rRNAs, tRNAs, and other non-coding RNAs such as small RNAs, present in a cell under specific conditions (see for example [1]). In the present work, we specifically refer to the mRNA transcriptome, but the ideas and methods discussed are applicable to other types of RNA or, in fact, to any situation where a similar sampling scheme is employed. RNA-seq [2] is a method to explore and quantitate the transcriptome, usually by high-throughput sequencing. In these experiments, total RNA is isolated from a cell population and then the mRNA fraction is converted to cDNA which is fragmented and sequenced massively in parallel, obtaining a large number of small gene tags, that are associated with specific gene transcripts. These sequences are then mapped to a reference (generally a reference genome or a de novo assembled reference transcriptome), obtaining quantitative data concerning transcript abundance for genes in the reference. The final result of an RNA-seq experiment performed over a given sample of mRNA is a vector, such that y = (y 1 , y 2 , Á Á Á, y g ), where each y i > 0; i = 1, 2, Á Á Á, g is the count of the number of tags found for the i − th gene and g is the total number of genes detected or estimated in the sample of effective size N = ∑y i . It is important to underline that the number of genes detected in a given RNA-seq experiment, g, is only an estimate of the true number of genes that are being expressed in that case, say G; G ! g. Thus, after performing an RNA-seq experiment we can only affirm that there are at least g genes expressed for that case, but we cannot rule out the possibility that more genes are being expressed, but were missed by our sample of size N, i.e., the case when G > g. An important goal of transcriptomics research is to obtain complete transcriptome-level information for each cell type that comprises the organism being studied. However, what is generally feasible is to extract mRNA from a large number of cells. Under these conditions sampling is conducted with replacement from a conceptually infinite population of molecules. In other words, the probability of mapping a tag to a specific gene does not alter the probabilities of mapping tags for each gene as the sampling proceeds. Although single-cell RNA-seq is becoming feasible [3], the majority of RNA-seq experiments performed to date use RNA extracted from heterogeneous mixtures of cells such as tissues, organs or even complete individuals. This increases the complexity of the population sampled by increasing the number of distinct transcripts present, for example derived from the set of genes whose expression is restricted to only a particular cell type(s).
When a gene is not detected by RNA-seq in a particular treatment, it can be due to the fact that it is not expressed or, alternatively, it was expressed but was not detected because the sample size was too small. In the former case no error is committed, but the later leads to an incorrect conclusion, if the lack of detection is taken as absolute evidence of no expression. RNA-seq literature is full of cases where the authors claim that some genes are 'exclusively expressed' in a particular condition. For example, in some cancer related studies [4][5][6][7] the authors affirm that a set of genes are exclusively expressed in the malignant tissues, while in many more cases the same claim is done about exclusive expression at some treatment or condition. We estimated, by searching the literature, that in around 600 papers such claims are made; see [8][9][10][11][12][13][14][15][16][17][18] for particular examples. To claim that a particular gene is exclusively expressed under a given condition, the researchers must show that the collection of expressed genes is reasonable complete and thus there is unlikely that the undetected gene was missed as a consequence of a small sample size.

Sampling genes is analogous to sampling species
The problem studied here, namely the estimation of undetected genes in a transcriptome, is analogous to the problem of estimating the number of species (in ecology) or classes (in statistics) [19]. Concrete examples include the ecological question of how many species exist in a delimited area, or to the estimation of the number of words known by a writer [20]. This is a difficult problem because it involves an estimate of how many things (classes, species, genes) are missing in a sample, using only the information contained in the sample itself.
To present the problem, as well as putative solutions in a formal framework, denote as f r the number of cases in which the counts of the class i, i.e., the y i 's have exactly the value of r; r = 1, 2, Á Á Á. Thus, for example, if 5 of the y i 's are equal to 1, then f 1 = 5. In this way, and without losing any relevant information, the original data y can be represented by the vector f = (f 1 , f 2 , Á Á Á) and the sample size, N, can be expressed as X r¼1 r¼1 rf r ¼ N This notation was apparently proposed in [21], in which it was used to estimate the number of classes in a population of known size. This notation was followed by [22], in which f r was defined as the frequency of the frequency r. Note that ∑ i y i = ∑ r rf r = N and in any particular case the maximum value of r is finite, however we use 'r = 1' above simply to indicate that the sum must be performed for all values of r. 'f 0 ' can then be used to represent the number of genes that are not present in a sample, or in more general terms, the number of classes that are missing in a particular sample (see Section A of S1 File for more details).
Consider a sampling experiment in which a biologist is interested in knowing how many species of fish live in a pond. After catching a fish and noting its species, say i, the frequency corresponding to that species, y i , increases by one. The fish is sent back to the pond (sampling with replacement considering an 'infinite' population) and the procedure is repeated. In this scenario important variables change as N, the number of times that the procedure is repeated or sample size, increases. In the first stages the rate of discovery of species is large and g (the number of detected species) increases rapidly; at the same time the number of species represented by a single individual, the frequency f 1 , is large: f 1 % N. As the process continues, the discovery of new species becomes less and less frequent such that g tends to stability converging on the true number of species in the pond, G. Precisely the same logic applies to the analysis of transcriptome data in an RNA-seq experiment, where the probability of mapping to a previously unsampled gene is large when N is small. As N increases, the probability of mapping a sequence to a previously unsampled gene decreases. This process can be plotted as a rarefaction curve, and used to estimate G (see for example [23]). Intuitively, a 'stopping rule' can be established for the sampling procedure; for example, "stop sampling when f 1 = 0", when all of the y i 's are larger than one. This is a reasonable rule of thumb because when f 1 = 0 it is assumed that the sample has covered the complete population, including all or nearly all of the species present. For increased confidence, "stop sampling when f 1 = 0 and f 2 = 0", etc. In summary, the values of f r when r is small, say r = 1, 2, Á Á Á, 6 contain most of the information about the 'completeness' of the sample. In the case of transcriptome data, note that the total number of genes, G, is equal to the number of genes detected in the sample, g, plus the number of genes missing, f 0 , The need for an estimate of the number of missing genes (f 0 ) To model the frequencies of expression of genes, the number of classes must first be fixed to a given value G = c. After doing this, a multinomial distribution with c parameters, or a negative binomial or a set of c independent Poisson distributions, etc. can be assumed. In fact, all current RNA-seq analysis algorithms such as edgeR [24] and DEGseq [25] assume that the number of genes expressed in a sample is equal to the number of genes found in the sample, G = g, and only then model the frequencies under a specific distribution. This is a rational assumption, given the impossibility of estimating frequencies of expression for genes that are not detected in the sample. However, there are important statistical and biological repercussions to this problem that have been under-appreciated in the literature (see for example [26]). From a statistical point of view, when the true value of G is unknown, the parameter space is open. In other words, we do not know how many parameters need to be estimated, and thus the method of maximum likelihood fails to give proper estimators [22]. On the practical side, if information concerning the completeness or richness of a sample is unknown, then it is impossible to evaluate the possibility that a gene was indeed expressed but was missed during sampling. This implies that for several classes of genes, particularly those that are only weakly expressed, it is impossible to determine whether their expression is restricted to a particular cell type, developmental stage, or environmental condition. This is a crucial consideration given that genes with important regulatory roles, as for example those encoding transcription factors, are usually expressed at lower frequencies [27] and thus have larger probabilities to remain undetected in the sample than are other types of genes.
The estimation (using DNA evidence) of the number of microorganism species in metagenomics experiments is a problem with the identical sampling and statistical framework as the one presented here for the estimation of undetected genes, and is amply represented in the literature [28][29][30][31][32][33][34][35]. A solution to the problem of f 0 estimation in RNA-seq is likely to be directly applicable to metagenomic experiments.

Non-parametric estimators for the number of missing genes (f 0 )
The estimation of the number of missing or undetected classes, f 0 , can be performed by different methods, depending on the structure of the population and the sampling scheme employed [19]. RNA-seq employs sampling with replacement, thus assuming a population of infinite size in which G is unknown. In this case, selecting a discrete distribution, such as multinomial or negative binomial, is impossible without conditioning to a known value of G, for example G = g. However, it is important to note that G itself is a random variable in that the realized value G = g holds only after the sample had been obtained; assuming a priori a particular distribution for G, as for example log-normal [23], is risky and without empirical foundations given that RNA-seq samples arise from a wide range of situations (heterogeneous mixtures of distinct cell types as tissues or organs in distinct environmental conditions or developmental stages, etc. [26]). For this wide range of possibilities it appears unrealistic to impose a given statistical law in the form of a distribution for the number of classes. At least for RNA-seq, it appears safer to use non-parametrical estimation procedures, assuming very little about the distribution, as has been done for example in [22]. Here we present only the most common non-parametric estimators for f 0 proposed in the literature. A more comprehensive list, including different methods of estimation can be found in [19].
In a seminal work, I. J. Good [22] studied the problem of the estimation of the relative frequency of occurrence of species, which as previously described, is directly applicable to the estimation of the relative frequency of detection of genes in a transcriptome. He showed that the usual relative frequency in a sample, y i /N or r/N, is a sensible estimator of the corresponding relative frequencies only when the true number of classes G is known. In that case those are Maximum Likelihood Estimators (MLE) of the corresponding parameters. However, when G is unknown, which is the case in all RNA-seq experiments, these estimators are inappropriate for small r, i.e., for genes only weakly expressed. In [22] Good presented an approximate recurrent expression for the expected value of r, This relation was first discovered by Alan M. Turing [36], and has been the basis of estimators for the coverage of a sample and, in particular, for estimators of the number of classes under different frameworks. Anne Chao in [37] used an approximate and asymptotic result to propose as estimator of f 0 , the functionf This estimator is generally called 'Chao1' in the literature. This estimator set the foundation for variants to estimate f 0 under distinct sampling schemes in the framework of species richness estimation, as for example the 'Chao2' estimator, which is bias corrected and is always obtainable [38]. Chao1 is a lower bound and thus a biased estimator of f 0 , a fact already noted in [37]. However, resampling procedures, such as Jackknife or bootstrap [39] can be employed to reduce the bias and obtain non-parametric confidence intervals, as proposed in [37,38]. The Chao1 and Chao2 estimators have been used to derive nonparametric lower bounds for the number of species shared by multiple communities [40], and evaluated taking into account the influence of rare species [41] as well as in comparisons of these estimators' performance [42,43].
Recently the group of Anne Chao developed an improved estimator for the number of missing classes in the framework of species estimation [44]. This estimator, also based in the Good-Turing recurrent expression (Eq 1), is called 'iChao1' and given bŷ The problem of the estimation of the number of expressed genes has been previously studied for the sampling of 'Expressed Sequence Tags' or 'EST' libraries [45]. Experiments quantifying the relative expression of genes based on the frequencies of their corresponding ESTs are similar in this respect to RNA-seq experiments but differ in that EST experiments usually involve much smaller sample sizes (many fewer ESTs in a sample library) and longer gene tags. The estimation of species richness in this framework has been treated in [46][47][48]. The probability of discovering a new class (gene) in this framework is presented in [46]. In [47] the concept of gene capture prediction and overlap estimation is expanded from one to multiple libraries and [48] gives a penalized non-parametric maximum likelihood estimator for species richness while [49] discusses which sequencing depth might be sufficient to interrogate gene expression profiling in chicken libraries by RNA-Seq.
In the context of uniquely expressed genes (or mRNAs) in specific cells and tissues, [50] presents an estimator for f 0 . This estimator, named the 'Medial' estimator is given bŷ Note that the factor (N − 1)/N, is not relevant in the context of RNA-seq where the sample sizes, N, are in the order of millions. Asymptotic expression for the variances of the Chao1 (Eq 2) and Medial (Eq 4) estimators are presented in [51] and [50], respectively. However, we consider that the bootstrap approach [39] gives a more robust approach for the estimation of the variance of these estimators than the asymptotic approximations. The uncorrected forms of the Chao1, iChao1 and Medial estimators (Eqs 2, 3 and 4) do not have a finite expectation. This is due to the fact that the denominators of the equations can take a value of zero with non-zero probabilities, and thus the sum that defines the corresponding expectations diverges.
Here we propose and evaluate a set of new non-parametric estimators for f 0 . We present an estimator of f 0 that is superior to the Chao1, iChao1 and Medial estimators in the framework of RNA-seq. We use the selected function to estimate the number of missing genes in a set of RNA-seq experiments, demonstrating that, in many cases, a substantial number of genes is not represented in these studies. We also propose and test estimators for the extra sample size needed to complete the estimated gene set to include an arbitrary large proportion of the genes expressed.

Results and Discussion
A possible answer to the question that titles this paper 'How many genes are expressed in a transcriptome?' is simply the number of genes detected in the sample(s). This naive answer generates an estimator (the naive estimatorf 0 0 for any sample) that will almost always underestimate the true value of the parameter, because the probability of missing one or more genes can be very large, approaching 100% in almost all real cases (see Section B of S1 File). We sought to identify more robust estimators for f 0 than the Chao and Medial estimators, at least for the framework of RNA-seq studies.
Better estimators of f 0 for RNA-seq As noted among others by Good [22] and Chao [37,38], the frequencies of rare classes (f 1 , f 2 , Á Á Á, f r with r small) carry most of the information about the number of missing classes, f 0 . This leads to the Chao1 estimator (Eq 2), which uses only the singletons (f 1 ) and doubletons (f 2 ) to estimate the number of missing classes [37], and very recently to the iChao1 estimator (Eq 3), which apart from f 1 and f 2 uses the information from f 3 and f 4 [44]. In the ecological framework of species estimation, there is no point in exploring estimators that use the information of frequencies of frequencies with larger order, say, f r with r > 4, because in that context the sample sizes are limited to relatively small values, say N 1000, and thus the observed values of f r , r > 4 are very frequently equal to zero. In contrast, in RNA-seq experiments the sample sizes are much larger; from hundreds of thousands to tens of millions of mapped gene tags. As a consequence, in RNA-seq datasets the observed values of f r , r = 4, 5, Á Á Á, 10 are, in most of the cases, larger than zero and thus can be used for the estimation of f 0 . We heuristically explored the use of functions that employ, apart from the observed values of f 1 and f 2 , the values of f 3 , f 4 , Á Á Á, f 10 . We reasoned that these small frequencies carry information about f 0 . In particular we explored, among others, functions of the form where the constant u is an scalar to be determined and the function c() is a measure of central tendency for f 2 , f 3 , Á Á Á, f 10 or a subset of these quantities. As putative functions of central tendency, c(), we used the Pythagorean means, i.e., the arithmetic mean or average as well as the geometric and harmonic means.
To evaluate putative estimators of f 0 we required to have an RNA-seq dataset that could be considered 'complete' in the sense that every gene expressed was detected by one or more tags, i.e., a sample with not missing genes. As evidence that a dataset could be considered complete, we employed a rule that all genes must be represented by at least two tags, such that f 1 = 0. This criterion has been proposed as a 'stopping rule' for sampling in various studies, for example in [52]. Note that in such cases the Chao1, Medial and iChao1 estimators (Eqs 2, 3 and 4, respectively), as well as any estimator defined by (5) return values of zero as estimates of the number of missing genes.
Accepting a given sample as complete is equivalent to assuming that the true value for the number of expressed genes is equal to the number of genes observed in that sample, say, G = g, and this implies that f 0 = 0 because G = g + f 0 . A complete sample can be used to take sub-samples of smaller size in which we know the true value of the number of missing genes, f 0 , and this implies that we can test different estimators of the parameter and study their statistical properties by repeating the process of sub-sampling.
Many RNA-seq datasets are deposited in the GEO [53] and ArrayExpress [54] public databases of gene expression profiles. We explored these datasets by downloading the auxiliary files that include the counts for each sequenced library in the accession, i.e., the vectors of gene tag counts y. The accession with identifier GSE1581, corresponding to the 'MPSS mouse transcriptome analysis project' has been used in several studies (see [55][56][57][58][59]). For our purposes, this dataset fulfilled the criterion f 1 = 0 when adding 35 libraries from different organs, and thus was considered a complete sampling of the mouse transcriptome (see Analysis). This accession comprises data for a total of g = 23332 expressed genes with a total sample size of N = 160552086 mapped gene tags.

Selection of an f 0 estimator
Having a complete sample, we evaluated distinct estimators of f 0 by resampling the original distribution via the bootstrap procedure and measuring the standard error of each estimator in each pseudo replicate. The formula for the estimated standard error is given by where B is the number of pseudo-replicates andf 0i ; f 0i are the estimated and true values of f 0 in the i − th, replicate, i = 1, 2, Á Á Á, B. We considered the best estimator to be the one with the smallest standard error over a large number of pseudo replicates obtained, assuming a wide range of sample sizes. This procedure mimics what happens in reality when sampling the transcriptome, due to the fact that a complete sample allows for the probabilities of expression to be properly estimated by maximum likelihood. To obtain pairs ff 0i ; f 0i g we used the parametric bootstrap procedure under the multinomial (equivalent to non-parametric bootstrap) or Poisson distributions. We assumed a random sample size, N i , uniformly distributed in the interval [m N,N], where N was the sample size in the complete sample, i.e., N = 160552086 and the constant m was set to m = 1/160.552086 % 0.006228508, in such a way that the minimum sample size tested was mN % 1e6, or one million. This minimum sample size was decided after pilot tests indicated that the behavior of the estimators was erratic for smaller samples. A large number, B = 100000, bootstrap samples was used to test all putative estimators, including varying the functions c() and empirically estimating the best value of the constant u. The statistical behavior of the error,f 0 À f 0 , as well as correlations between the sample size, estimated values, errors etc. for all estimators were tested (see details in Section C of S1 File).
The best estimator of f 0 , obtained by the procedure outlined above, and presented in detail in Section C of S1 File, was where the function H(f 2 ,f 3 , Á Á Á, f 6 ) is the harmonic mean of f 2 up to f 6 , i.e., thus we call this estimator h 6 or harmonic estimator of degree 6 of f 0 . As for the the Chao1, Medial and iChao1 estimators, the expectation of h 6 do not exist, because the harmonic mean, H(f 2 , f 3 , Á Á Á, f 6 ), can take a value of zero with non-zero probability and a value of zero in the denominator leads to indeterminacy. However, for large sample sizes, the probability P[H(f 2 , f 3 , Á Á Á, f 6 ) = 0] is negligible, and thus we can approximate the expectation by the mean of a large number of bootstrap replicates. Table 1 presents a numerical comparison of the Chao1, iChao1, Medial and h 6 estimators of f 0 (Eqs 2, 3, 4 and 7, respectively), evaluated in B = 100000 bootstrap replicates of the complete dataset (GSE1581). Details of the comparisons evaluated in an independent set of replicates can be consulted in Section C of S1 File.
From Table 1 we can see that h 6 exhibits better behavior than the other three estimators in an ample interval of sample sizes, going from 1 to 160.5 million tags (this last figure is the sample size of the complete RNA-seq dataset). The h 6 estimator is superior to Chao1, iChao1 and Medial in having an estimated standard error much smaller than either of the three, a raw value of 85 representing only 22%, 28% and 60% of the standard errors of the Chao1, iChao1 and Medial estimators, respectively. The value of Pearson's determination coefficient between f 0 and f 0 , r 2 , is % 0.99 for h 6 , while it is smaller, % 0.97, for the Chao1 and Medial estimators.
This means that, on average, h 6 explains a larger proportion of the variance off 0 as a linear function of f 0 than either the Chao1, iChao1 or Medial estimators. Importantly, the statistics for the estimated errors of the estimators, ðf 0 À f 0 Þ, are better centered around zero for h 6 than for Chao1, iChao1 or the Medial, having values of 3 and -3 for the median and mean in the case of h 6 and values much farther from zero for Chao1, iChao1 and the Medial estimators. The minimum and maximum of the estimated errors are also both smaller for the h 6 than for Chao1 and Medial estimators.
To appreciate the behavior of the estimators,    only values up to 1000 for the pairs ff 0 ;f 0 g. All four estimators tend to underestimate the value of f 0 when this value is large, say, when the number of missing genes is larger than approximately 3000 (value of 3000 in the X-axis of Fig 1). This happens when the sample size, say, N i , is relatively small in comparison to the size of the complete sample, N = 160552086. For example, values of f 0 ! 3000 were obtained by sample sizes N i ranging from a minimum of approximately one million (0.6% of N) up to 3.5 million (2% of N) and a mean of 2.3 million (1.4% of N). The complete range of variation in N i extends from up to 160.5 million, with a mean of 81 million. Even in such small sample sizes, for example between 0.6 and 2% of the complete sample, the least biased estimator is h 6 when comparing (Fig 1). In Panel B of Fig 1 we examine the behavior of the estimators in large sample sizes, when the true value of f 0 1000. These points correspond to cases where the sample size N i is between 14 and 160.5 million, representing between 9% and 100% of the original sample size N. In these cases, h 6 behaves consistently better than the Chao1, iChao1 and Medial estimators, by having estimated values closer to the valuef 0 ¼ f 0 which is indicated in both panels of Fig 1 by a grey line. In summary, from Table 1 and Fig 1 we conclude that the h 6 estimator is more effective than the Chao1, iChao1 and Medial estimators. More detailed analyses, including comparisons with other putative estimators, are found in Sections C and F of S1 File.

Validation of the h 6 estimator in independent datasets
It could be argued that the estimator h 6 was tailored for a specific (complete) dataset, and thus a priori there is no guarantee that the behavior of h 6 will be preserved in different RNA-seq datasets. Although the sequencing depth (N) of RNA-seq experiments has been growing due to advances in high throughput sequencing technologies, we were unable to discover additional examples of complete RNA-seq samples in the public databases in order to further test the estimators. In other words, we did not find other publicly available data in which the criterion f 1 = 0 was fulfilled. However, we found three datasets near completion in which f 1 was small, and consequently the estimated number of missing genes,f 0 , is likely to be small by any of the estimators proposed. We repeated the evaluation of the Chao1, Medial and all putative f 0 estimators -including h 6 , in these three datasets.
The three almost complete datasets selected to verify the behavior of the h 6 estimator were a comprehensive study of the human transcriptome by MPSS [60] which has been also re-analyzed by us when defining parameters including transcriptome diversity and specialization [61], and the accessions E-GEOD-38298 (using the fungus Candida albicans) and E-GEOD-46953 (using Mus musculus). In each case the datasets were subjected to the same procedure as the one explained above with the complete sample; details of the procedure as well as extra analyses can be consulted in Section C of S1 File. Table 2 presents the main characteristics of the almost complete accessions employed to validate the h 6 estimator.
We also examined the standard error of the estimators, evaluated in each case using  Statistics for three RNA-seq datasets including estimated standard errors, seðf 0 Þ for the Chao1, Medial and h 6 estimators. Table presents   the expression of only a small number of genes: 0, 7 and 25 for the human MPSS, E-GEOD-38298 and E-GEOD-46953 respectively, using the h 6 estimate (column 7). In Table 2  or Medial estimators imply that the h 6 estimator results in a less biased and more robust estimation of the parameter of interest over a large range of sample sizes and conditions (see Section C in S1 File for more details of the comparisons). Our estimator, h6, also resulted better than the iChao1 estimator in all comparisons performed; see details in Section F of S1 File.

Discussing the validity and optimality of h 6
A possible objection to the h 6 estimator is that the procedure to obtain it was purely heuristic, i.e., without employing analytical statistical theory either exact or asymptotic. Our modeling approach can be justified by the intractability of the exact moments for f r ; r > 0 without making assumptions about the distribution or even under any reasonable assumed distribution. Technical difficulties arise from the impossibility of deciding on a single reasonable distribution for G, the number of expressed genes, in each and every RNA-seq experiment that can be performed. From the pioneering work of Fisher [62], who proposed the Poisson series and negative binomial distributions, to modern approaches [47] that use the log normal distribution or even [23] which proposes mixtures of distributions, particular samples do not always follow a specific parametric model, and thus the non-parametric framework appears more sensible.
Within transcriptomes, as well as in ecological communities, a few transcripts (or species) are particularly abundant, whereas most are rare. In large assemblages such as the complete samples used here, there are more rare species than the log normal model predicts (see [23] for an ecological example). Another interesting model is presented in [63], where the authors postulate a Pareto-like probability function for gene expression, which appears to be invariant among eukaryotic cell types. However, this model predicts an unlimited increase in the number of species (i.e. distinct genes) as the sample size approaches infinity, and thus this empirical parametric approach is not useful for the estimation of f 0 . Our approach to obtain the estimator h 6 explored a limited number of functional forms, given in (Eq 5) and motivated by the Chao1 estimator (Eq 2). We demonstrate that the harmonic mean of f 2 , f 3 , Á Á Á, f 6 includes valuable additional information that is not used by the Chao1, iChao1 or Medial estimators, giving a less biased and more robust estimator. Logically, we cannot guarantee that h 6 is the best of all putative estimators for all possible RNA-seq samples; its optimality is naturally restricted to the functions explored, and was validated with independent RNA-seq datasets.

Estimation of the number of undetected expressed genes in public datasets
The problem of estimating the number of expressed genes that remain undetected in RNA-seq experiments is largely irrelevant if this figure is either zero or very small in most RNA-seq experiments. This is an intuitive possibility, given that the number of gene tags obtained with current high-throughput sequencing technologies is large, ranging from hundreds of thousands up to hundreds of millions. However, the estimated number of missing genes in public RNA-seq experiments can be in the order of thousands, even when large sample sizes are employed. This source of uncertainty can lead researchers to falsely conclude that some genes are not expressed in a given tissue or condition.
We explored RNA-seq experiments deposited in the GEO [53] and ArrayExpress [54] public databases, and downloaded a total of 31 accessions which consisted of files with counts of reads per gene. Additionally, we included files from a sunflower experiment conducted in our laboratory [64]. Each accession included a variable number of libraries that in total yielded 311 vectors of gene counts (see Methods). In RNA-seq experiments there exists no homogeneous criterion for what constitutes a 'gene'; i.e., what is to be taken as the unit of expression. For example, some studies take different splicing variants of transcripts derived from the same locus as different 'genes' for measuring expression, while in other cases all splicing variants are taken as a single 'gene'. Alternatively, all close paralogs can be grouped as the same 'gene' [1]. It is important to take into account that when we are estimating undetected genes we are doing so in the particular framework of a given RNA-seq experiment and that it is difficult to make a general inference, even for the same species, using different datasets. Given the different definitions of 'gene' in different studies, it could be more precise to talk about the estimation of 'undetected classes'; however, for consistency, we will keep discussing the concept as 'undetected genes'.
To study the number of undetected genes not only in each one of the individual libraries, but also in the full accession or total library, we collapsed the data for all libraries in each accession in a single sample, adding the tags by gene; i.e., the count for gene i in the total library, y t i , was calculated as where y ij ; i = 1, 2, Á Á Á, g; j = 1, 2, Á Á Á, v are the counts for gene i in library j and v is the total number of libraries for the accession. This procedure was possible for 31 of the 32 accessions, given that one of the accessions had not the same genes identifiers in the library files, and thus these libraries were analyzed only independently. By collapsing all libraries of one accession in a single total library, we are analyzing all genes that were detected in such a collection, some of which could be present only in some of the individual libraries. This total library is not usually analyzed by the researchers, given that the aim of many experiments is to detect differential gene expression between treatments (sets of libraries). However the total library contains all genes detected in the experiment, and by estimating the number of undetected genes in this library we can estimate the total number of relevant entities that were missing. Table 3 presents the results of the analysis of undetected genes for 31 total libraries in the same number of accessions.
Representatives were included from a wide range of living organisms, from protozoa (Tetrahymena thermophila), fungi (Candida albicans, Neurospora crassa), slime molds (Physarum polycephalum), a brown alga (Saccharina japonica, Table 4), plants (Zea mays, Glycine max, Capsicum annuum, Helianthus annuus), insects (Drosophila melanogaster) up to mammals (Homo sapiens, Mus musculus, Sus scrofa and Bos taurus). The sample size, N, varied in the range of 2.39 to 579.73 million tags, with a median of approximately 100 million tags. The number of detected genes, g, varied from approximately 6 thousand in the fungus Candida albicans to a very large value of more than 54 thousand in soybean (accession E-GEOD-29163). This wide variation in the number of genes estimated can be explained not only by differences between the genome sizes of the organisms, but also by the lack of homogeneity in the N-Sample size in millions, g-Number of genes detected,ĥ 6 -Estimated number of missing genes, seðĥ 6 Þ-Estimated standard error forĥ 6 , 95% approximated confidence intervals forĥ 6 (lower and upper bounds) and estimated percentage of missing genes, %h 6 /G. genes,ĥ 6 , is estimated as 0, 1 and 7 respectively, 95% confidence intervals forĥ 6 include zero. These were the accessions used to test and validate the h 6 estimator. In contrast, for the majority of the accessions (rows 14 to 31) the estimated number of undetected genes is larger than one thousand, indicating that a substantial proportion of the expressed genes, ranging from 7% to 25% of the total number of expressed genes, remained undetected by RNA-seq. The last column of Table 3 (%h 6 /G) presents the percentage of undetected genes with reference to the total, say 100 Âĥ 6 =Ĝ ¼ 100 Âĥ 6 =ðg þĥ 6 Þ. This percentage ranges from approximately zero for nine accessions (rows 1 to 9) up to 25% for the accession E-GEOD-33793 (row 30), and has an estimated median and average of 7% and 6%, respectively. From these analyses we can infer that, on average, existing RNA-seq experiments fail to detect approximately 7% of the genes expressed in the organisms studied with the sample sizes usually employed. Note that there is no positive correlation between the sample size, N, and the estimated number of undetected genes,ĥ 6 , the estimated value being r = −0.1721. For example, one of the accessions with a large sample size (N > 546 million, row 31) is one with a large number of undetected genes (ĥ 6 ¼ 7; 131). Conversely, two of the samples considered to be complete (rows 2 and 3) were obtained with small samples, N = 31.4 and 35 million, respectively. The discordance between the number of undetected genes estimated in the two human accessions, HumanMPSS and E-GEOD-44384, (rows 2 and 31 in Table 3) can be explained by the fact that different units of measure were taken as 'genes'. In the first case (HumanMPSS, row 2, [60]) canonical human genes were used as units of expression, while in the second (E-GEOD-44384, row 31, [65]), a study of RNA methylation targets, small RNAs are the units of measure, i.e., the 'genes'. Finally, and maximum (max.) for the values of missing genes,ĥ 6 , and estimated percentage of missing genes, %h 6 /G.
from Table 3 we note that the standard errors ofĥ 6 (column seðĥ 6 Þ), estimated by the bootstrap procedure are relatively small, 2 seðĥ 6 Þ 34, leading to small 95% approximate confidence intervals for this parameter (but see Analysis and Section D in S1 File). Table 4 presents a summary of the statistics for undetected genes in the individual libraries of each accession, grouped by organism.
This table includes data derived from 14 organisms, 8 of them (rows 1 to 8) represented by a single accession and the remaining represented by a minimum of 2 and up to 8 accessions. The number of libraries analyzed by organism ranges from 2 for Saccharina japonica (a brown algae, row 2), up to 99 for mouse (Mus musculus, row 14). The minimum, average and maximum values for h 6 and %h 6 /G represent the variation within an organism. For the cases where more than one accession was analyzed (rows 9 to 14), the variation between accessions can be very large, as discussed above. From the sets of libraries representing a single accession (rows 1 to 8), the one with the largest average number of undetected genes is the one corresponding to chili pepper (Capsicum annuum, row 4, [1]), having an average of 21482 missing genes that represent approximately 50% of the total number of estimated genes. However, the number of genes detected in this accession when the individual libraries were amalgamated (total; row 28 of Table 3) was 34066 and the estimated number of undetected genes in that total library was 4786, representing only 12% of the total genes. This large difference between analyses of total and individual libraries regarding undetected genes is explained by the presence of specific genes that are expressed only in one of the libraries or conditions studied within an accession.

Estimation of the extra sample needed for a comprehensive coverage
Having an estimation of the number of genes that remain undetected in a given sample, sayf 0 , we can calculate the extra sample size (increased sequencing depth), say m ψ (given in gene tags), needed to increase the number of observed genes from the value of g in the current sample up to g þ cf 0 , where ψ is a proportion 0 < ψ < 1. In [52] the authors propose that the size of the extra sample, m ψ , can be obtained by a numerical procedure including bootstrap, and yields as approximate solutions to the formula where f 1 , f 2 are, as before, the estimated numbers of singletons and doubletons in the sample, f 0 is the estimated number of undetected genes (Chao1, iChao1 or Chao2 in this context), and G ¼ g þf 0 represents the estimated total number of genes or classes. We found that a more realistic value for m ψ in the case of RNA-seq is given by where h 6 is our harmonic estimator of degree 6 for f 0 and consistently,Ĝ ¼ g þ h 6 , is the estimate of the total number of genes using the estimator h 6 . Eqs 8 and 9 are subjected to the conditionf 0 =ðĜð1 À cÞÞ > 1 to give positive values of the extra sample size. In general the researcher will be interested in values of ψ near 1, for example ψ = 0.95, 0.99, etc.
Note that Eqs 8 and 9 return a value of m ψ = 0 when the sample is complete, i.e., when f 1 = 0 and thusf 0 ¼ 0, and diverge to infinity when ψ = 1, indicating the impossibility to obtain a sample in which can be assured that there will be no undetected genes. Section B in S1 File presents a derivation of the probability of non-missing genes for samples of different sizes under different conditions. To compare the performance of m ψ and m 0 c (Eqs 8 and 9), we obtained a large set of bootstrap replicates from the complete sample (accession GSE1581), with sample sizes ranking from 0.5 to 10 million tags and calculated the predicted and realized gain in number of extra genes observed. We found that m 0 c is much more accurate and precise than m ψ to calculate the extra sample size needed. The weighted square error for m 0 c was approximately 50% smaller that the one for m ψ in the same samples, thus we conclude that m 0 c must be preferred over m ψ for the estimation of the extra sample size. Details of the process to compare these estimators are presented in Section E of S1 File. S1 Table presents the extra sample needed to obtain 95% of the number of undetected genes employing m ψ and m 0 c (columns "m_Chao" and "m_h6", respectively) for the cases wherê f 0 =ð0:05 ÂĜÞ > 1, i.e., when the condition to use the functions is fulfilled. In all comparable cases the estimation of undetected genes by Chao1 (column "Chao1") is smaller than the estimation usingĥ 6 (column "h6") and consequently the estimation of the extra sample size using the Chao1 estimator (column "m_Chao") is always smaller that the estimation of extra sample size using m 0 c (column "m_h6"). On average, the ratio of extra sample sizes, m 0 c =m c is approximately 18, while the ratio of estimated undetected genesĥ 6 /Chao1 is around 2. Given that the Chao1 estimator of undetected genes frequently underestimates the target parameter, our estimator of extra sample size needed to complete the sample, m 0 c , returns a more realistic value (see Section E of S1 File).
Current methods of RNA-seq analysis allow researchers to carry out one or more sequencing runs for the same library. A reliable estimate of the number of undetected genes and extra sample needed to observe a given proportion of the undetected genes can be employed to decide, in an informed way, whether additional sequencing runs of existent libraries are needed or not. The first sequencing run of a RNA-seq library can be used as a 'pilot' test to decide if more sequencing runs are needed. For example, in our laboratory we performed an RNA-seq experiment exploring changes in the transcriptome of chili pepper fruit during development [1]. In that case we estimated that there are between 4565 and 5008 genes that remained undetected in the sequencing libraries (Accession GSE54123, see Table 3). We calculated that, to observe 95% of those undetected genes, approximately 9 million additional sequences (see S1 Table) would be required. In contrast, for a gene expression study in sunflower that was also performed in our laboratory [64], only a small number of genes remained undetected (between 13 and 33 by a 95% confidence interval forĥ 6 in row "Sunflower" in Table 3).
The analyses presented in S1 Table and summarized in tables 3 and 4 can be used as a guide for sequencing depths required by RNA-seq experiments. In particular, if the experimental aims and design, as well as sequencing technology and bioinformatic pipeline are similar to the ones used in the datasets we analyzed, our results provide guidelines for the sample size needed in future studies.

Conclusions
The problem studied here is to decide if the genes observed in an RNA-seq library are in fact all the ones expressed, or if there is certain number of expressed genes that were not observed in the sample (missing or undetected genes). The estimation of the number of undetected genes is an essential question, both, to conclude that an unobserved gene is in fact not expressed in a given condition, as well as to predict the sample size (sequencing depth) needed for an RNA--Seq experiment.
We present a non-parametric estimator, h 6 , of the number of genes that remain undetected in RNA-seq experiments that is superior to the estimators previously reported. We demonstrate that h 6 is less biased and consequently has a smaller standard error than the Chao1, iChao1 and Medial estimators for a wide range of sample sizes in the context of RNA-seq. We also present a function to estimate the extra sample size needed to observe a given proportion of the undetected genes that is more precise and accurate than the function presented in [52].
By analyzing a total of 342 vectors of gene counts from 32 accessions (311 individually sequenced libraries plus the total vectors for each one of 31 accessions) we conclude that there are very few RNA-seq studies that can be considered as complete, defined as experiments in which all genes are detected. On average we estimate that, given the sequencing depths currently employed in most RNA-seq studies, approximately 6% of genes per accession and 10% of the genes per library within an accession are undetected.
The statistical tools presented here will help to evaluate the inferences of RNA-seq analyses by estimating the completeness of the samples obtained and helping to decide if extra sampling is needed.

Analysis Datasets
The RNA-seq data analyzed here was downloaded from the NCBI GEO [53,66] and EMBL ArrayExpress [54,67] repositories. The inclusion criterion for the data consisted of raw data for gene tag counts ordered by 'gene' (where 'gene' was an identifier). Accessions found with count data for sequences (instead of 'genes') or in which the counts were normalized were rejected. This resulted in the selection of 30 accessions from 14 different organisms with a total of 272 gene count vectors. Additionally we included two more RNA-seq experiments, the previously reported set of human MPSS data [60,61] comprising gene counts for 32 human tissues, and a study of the sunflower transcriptome, which comprised 7 libraries [64]. The full dataset therefore included 32 RNA-seq experiments with a total of 311 individual libraries (see Tables 3 and 4 and S1 Table). All these data were input into a relational database and processed with R [68] to form 'data frame' objects in which genes are presented in rows and columns represent individual libraries.

Design and selection of the f 0 estimators
The functional form of the putative f 0 estimators, presented in Eq 5, was motivated by the estimator Chao1, presented in [37], which uses information contained in only f 1 and f 2 . We reasoned that additional information about undetected genes exist in the frequencies of frequencies f r where r = 3, 4, Á Á Á, 10. To specify putative estimators we systematically substituted the function c() in Eq 5 by the arithmetic, geometric or harmonic means of f 2 to f r ; r = 3, 4, Á Á Á, 10. This yielded a set of 3 × 8 = 24 putative estimators to be evaluated (see Section C in S1 File). Other functional forms were also evaluated, but the results were unfavorable, and thus they are not presented.
To test the putative estimators of f 0 we employed the total count of the accession GSE1581 [55][56][57][58][59], which has a sample size N = 160, 552, 086 gene tags. In this experiment, the number of expressed genes detected was 23332 and can be considered complete by having f 1 = 0; i.e., all genes were represented by at least two gene tags. In this complete sample we set G = 23332 and thus if we take a subsample and observe the number of genes obtained, g, the true number of missing genes in that subsample, say, f 0 = G − g = 23332 − g, and using the observable frequencies f 1 , f 2 , Á Á Á,, we can try all putative estimators,f 0 , calculating in each case the error of each estimator,f 0 À f 0 and, by repeating this process a large number of times, estimate the standard error of each estimator using Eq 6. The process of resampling was carried out in B = 100,000 subsamples obtained by assuming the multinomial distribution and sample sizes, N, uniformly distributed between 1 and 160.5 million tags. The estimator with better statistical properties, including a smaller standard error, was the harmonic estimator of degree 6, h 6 (Eq 7). The use of h 6 was validated using other nearly complete datasets. Details of the selection and validation process, including extra tables and figures, are presented in Sections C of S1 File. All analyses were performed in R [68].
Design and testing of the estimator of extra sample size m 0 c To design an estimator for the extra sample size, m 0 c , required to observe a proportion ψ of the estimated missing genes, cf 0 , we first tried substituting the estimator of f 0 (Chao1) in Eq 8 by our estimator h 6 . However, by trying other functional forms of the quotient f 1 /2f 2 in Eq 6 we obtained the quotient h 6 /f 1 which is part of Eq 9 and yielded better results than the original equation presented in [52]. To test different functional forms of the estimator of extra sample size we used the weighted squared error, defined as where g þ cf 0 j N is the number of genes predicted by the estimator and E[GjN+m ψ ] is the expected number under the distribution of the data. se(m ψ ) was evaluated for a large set of bootstrap samples with sample sizes, N, uniformly distributed between 0.5 and 10 million tags. Section E in S1 File presents the details of the results obtained.

Analyzes of public RNA-seq datasets
All R datasets containing the gene counts (see 'Data' above) were processed in R [68] to obtain the basic statistics of the samples, punctual estimations of missing genes by the Chao1, iChao1, Medial and h 6 estimators, as well as standard error and 95% approximate confidence limits for h 6 (see Section D of S1 File for details of the method to obtain the approximate confidence limits). S1 Table presents the full results.
Software to calculate h 6 and related statistics An R [68] function, 'h6', included here as S1 Text, was programed and tested, implementing the estimation off 0 by our estimator h6, as well as related statistics, including approximate standard error, bias and confidence intervals forf 0 , as well as the estimate of the extra sample size needed to estimate a proportion, ψ, of the undetected genes, m 0 c . The R package 'Unde-tectedGenes', containing the function 'h6' as well as examples of analysis is available at Computational Biology, Langebio. To install the package in R type 'R CMD install file_name' (where 'file_name' is the name of the downloaded file) at the command line and in the directory where 'file_name' is located. After installation, to use the package in R type 'library(UndetectedGenes);? UndetectedGenes' at the R prompt (>). See also Section G of S1 File.
Supporting Information S1 File. Supporting results with additional tables and figures. Sampling framework and notation (Section A). The probability of missing genes (Section B). Comparing f 0 estimators (Section C). Approximate confidence intervals for f 0 (Section D). Calculating extra sample needed to estimate some of the missing genes (Section E). Comparing h 6 with iChao1 and other estimators (Section F). R functions (Section G). (PDF) S1 Table. Missing genes statistics for all analyzed datasets. Presents all statistics relevant to missing genes estimation in all datasets analyzed in an Excel file. (XLSX) S1 Text. 'h6' R function. A documented R function (in plain text format) for the estimation of undetected genes and related statistics. (TXT)