Neutral Theory Predicts the Relative Abundance and Diversity of Genetic Elements in a Broad Array of Eukaryotic Genomes

It is universally true in ecological communities, terrestrial or aquatic, temperate or tropical, that some species are very abundant, others are moderately common, and the majority are rare. Likewise, eukaryotic genomes also contain classes or “species” of genetic elements that vary greatly in abundance: DNA transposons, retrotransposons, satellite sequences, simple repeats and their less abundant functional sequences such as RNA or genes. Are the patterns of relative species abundance and diversity similar among ecological communities and genomes? Previous dynamical models of genomic diversity have focused on the selective forces shaping the abundance and diversity of transposable elements (TEs). However, ideally, models of genome dynamics should consider not only TEs, but also the diversity of all genetic classes or “species” populating eukaryotic genomes. Here, in an analysis of the diversity and abundance of genetic elements in >500 eukaryotic chromosomes, we show that the patterns are consistent with a neutral hypothesis of genome assembly in virtually all chromosomes tested. The distributions of relative abundance of genetic elements are quite precisely predicted by the dynamics of an ecological model for which the principle of functional equivalence is the main assumption. We hypothesize that at large temporal scales an overarching neutral or nearly neutral process governs the evolution of abundance and diversity of genetic elements in eukaryotic genomes.


Introduction
Species are unevenly represented in ecosystems. In no environment whether terrestrial or aquatic, temperate or tropical, all species are equally common [1]. Species diversity and their relative abundance have always intrigued ecologists [2]. Ecological models of species abundance are basically of two kinds: descriptive (statistical-based) or mechanistic (niche-based or neutral). While many mechanistic approaches assume niche differences as the main cause driving community composition, neutral models consider niche differences among species irrelevant. The unified neutral theory of biodiversity (UNTB) [3,4], originally inspired by neutral population genetics [5,6], assumes interactions among trophically similar species as equivalent on an individual or ''per capita'' basis. This provocative assumption means that the fate of individuals, regardless of the species, appear to be controlled by similar birth, death, dispersal, and speciation rates. Because each species follows a random walk, biodiversity and patterns of relative species abundance emerge by a process of drift in the community. The fundamental biodiversity number (h), analogous to 4Nm of population genetics, governs species richness at large spatial and temporal scales. UNTB is thus a useful null model against to test alternative biological hypotheses for the origin and maintenance of relative species abundance distributions [7,8].
One can draw an analogy between the distribution of relative species abundance in ecological communities and the distribution of relative abundance of genetic sequences, unique or repetitive, that populate eukaryotic genomes: long terminal repeat (LTR) retrotransposons, non-LTR retrotransposons, cut-and-paste DNA transposons, rolling-circle DNA transposons, self-synthesizing DNA transposons, satellites, simple repeats, tRNA, miRNA, snoRNA and genes among others. For simplicity we will refer to these sequence classes as ''genetic species'', the genomic analogue of biological species in community ecology [9]. Dynamical models of genomic diversity have focused on the selective forces shaping the abundance and diversity of transposable elements (TEs) [9][10][11]. Moreover, ecological models of genomes have already been formalized [12,13], and complex models invoking the genomic equivalent of parasitism, competition and cooperation between TEs [14] have been simulated. However ideally, models of genome dynamics should consider not only TEs, but also the diversity of all genetic species populating eukaryotic genomes. Such model has never been formalized, just as genetic species abundance and diversity has never been predicted under neutrality in genomics. Does a neutral model with applications in ecology fit the pattern of species abundance and diversity in eukaryotic genomes? How well does such a model perform over the great diversity of eukaryotic genome sizes? To what extent are the abundance and diversity of genome components the result of adaptive or stochastic processes? Here we test the statistical fit of the UNTB predictions in 31 genomes ranging from unicellular eukaryotic species to mammals (Table S1). We organize the results into four sections. First we analyze the relative species abundance (RSA) curves of genomes and chromosomes. Second we simulate the random distribution of genetic elements in chromosomes to test the role of chance in chromosome organization. Third, we test the statistical fit of the UNTB to the relative abundance and diversity of genome elements in chromosomes. Finally, assuming that the chromosome length is the genomic analogue of area in ecology, we ask if the universal ecological species-area relationship is also observed in genomes.

RSA curves in genomes and chromosomes
Ecologists frequently use RSA curves to compare the richness, the degree of dominance, and the number of rare species in communities. An interesting property of RSA curves is that species are unlabeled so that the RSA distributions of different ecosystems can be compared whatever species they contain. We took advantage of current automatic methods of genome annotation and sequence recognition to build RSA curves for chromosomes (Table S2 and External Dataset 1 at EDS). Genetic species were defined according to biotypes [15,16], and the ''classes'' of the RepBase [17] nomenclature. Figure 1 displays RSA curves for a subset of genomes and their largest chromosomes ( Figure S1 shows the complete set of RSA curves). Although the curves differ in many ways, two patterns are evident: (1) RSA curves of genomes and chromosomes are very similar for each eukaryotic species, and (2) all RSA curves display the universal S-shape observed in ecology [2,3], where few species are dominating, many are common, while the majority are rare. Both observations suggest a common mechanism of distribution of genetic elements in genomes and chromosomes.
To what extent does a chromosome's subset of genetic elements represent a random sample of the complete set of elements of the genome? To answer this question, we simulated the random distribution of the full set of elements of a genome among the chromosomes. Based on 1,000 simulation runs, we created expected abundances and standard deviations for each genetic species in each chromosome according to its frequency in the genome. Statistical tests (t-test, FDR,0.05) established that less than 4% of all genetic species showed abundances in chromosomes that agreed with their random expected distribution (External Dataset S2 at EDS). Therefore, a homogeneous random process cannot account for the observed abundances of genetic species in chromosomes, a result just pointed out for TE's in D. melanogaster [9,10]. However, if the observed RSA curve of each chromosome is superimposed on the mean distribution of 1,000 simulated RSA curves taken from all elements of the genome, a notable agreement is detected. Figure 2 shows this fit for two chromosomes (see External Dataset S3 at EDS for complete results). This remarkable concurrence arises because ''genetic species'' are unlabeled in the RSA plots, and hence allows compensation by shifts in the ranking order of abundances. Statistical differences between RSA curves were detected for 76 out of 548 chromosomes tested (KS-test, P,0.05; see External Dataset S2 at EDS2). What is the dynamical mechanism responsible for the 86% agreement of chromosomes tested? Could a neutral dynamical model predict this shared demographic pattern of genomes?

Testing Hubbell's Neutral Theory in Genomes
Stochastic models are not restricted to physics [18]. Analogous to the kinetic theory of ideal gases, Hubbell's neutral theory of biodiversity is a stochastic theory assuming equivalence among interacting individuals. The theory assumes that diversity in a local community of individuals is in steady state, maintained by a balance between local extinction and immigration from the metacommunity at a constant stochastic rate (m). Births and deaths in the local community occur at constant per capita rates regardless of the species. This community reaches a dynamic equilibrium between stochastic extinction and the immigration of new species, while in the metacommunity diversity is maintained by speciation at a single constant rate (u) [4,8].
For genomes, we assume that each chromosome is the physical arena in which genetic elements spread, degenerate and are replaced by other elements of the same or different species. Genetic elements could come from the same chromosome, or from any other chromosome of the genome. As a first approximation, we assume that each chromosome represents a local community of J elements and S different genetic species while the rest of chromosomes correspond to the metacommunity of size J M . Given the total number of genetic elements in each chromosome we optimized the neutral parameters m and h ( = 2J M n) of Ewens [19] and Etienne's [20] sampling formulas using maximum-likelihood estimation ( Figure 3). These models were optimized using EcoloPy-UNTBgen program, a new software based in previous programs [20][21][22], and especially designed to test for neutrality with large genomic datasets (see methods). Since Ewens' model is a special case of the Etiennes' model, we compared the parameter of likelihood-ratio test (LRT) [23] to choose the best-fit model to the data. With best-fitted parameters we simulated 10,000 neutral distributions of genetic elements for each chromosome and test for neutrality.
To test the UNTB, we computed the evenness (H) of each simulated distribution according to the methodology of Jabot and Chave [21] (see methods). We confronted the set of H values with the simulated distribution to assess whether the empirical H values of chromosomes lie outside the confidence limits of the neutral expectation ( Figure 4). We considered chromosomes to be significantly neutral if the empirical evenness of the chromosome was within the 95% of the 10,000 random neutral values. We observed deviations from neutrality in 31 out of 548 chromosomes (5.6%) However, significant deviations vanished after correction for multiple testing (FDR,0.05, n = 548) [24]. This result was robust even when genetic species were defined at different levels of the hierarchy of the RepBase ontology, and even statistically, when an alternative test of neutrality was implemented [25] (see methods). Table 1 displays a representative sample of the results (for all results see External Dataset S4 at EDS). According to these results, Hubbell's neutral model is a sufficient model to explain the abundance and diversity of genetic elements in each chromosome of all the 31 eukaryotic genomes we analyzed.
To assess the power of this conclusion we simulated neutral and log-normal distributions as null and alternative hypotheses along a wide range of species (S) and individuals (J). The proportion of times the test failed to reject the null hypothesis being false was 50% for small chromosomes and communities (J,25,000, and S,30, Figure S2A). Conversely, the percentage of times the test failed to accept the null hypothesis being true was low (,5%) for a wide range of S and J values ( Figure S2B). Therefore, for large neutral communities the neutral test employed provided a robust positive answer with high power.

Genome species-area relationship
If a purely neutral stochastic process controls the abundance and diversity of genetic elements in chromosomes, it is expected that S, the number of genetic species, will increase with  chromosome length. In ecology it is universally observed that larger areas contain more species. Does this pattern hold true for genomes and chromosomes? The standard species-area relationship in ecology is the Arrhenius power law [26] S = cA z , where S is the species number, A is the area and c and z are constants. Considering a fixed length of chromosome, others things being equal, it is expected that non-polyploid species show a higher number of genetic species than polyploids, and therefore a better statistical fit to the species-area relationship. After least square fit of the power function we effectively observed c = 0.28, z = 0.27 (R 2 = 0.64 n = 548) for all chromosomes studied including polyploids (figured, for simplicity, by fish and plant species), and c = 0.50, z = 0.25 (R 2 = 0.81, n = 412) excluding them. For relatively recent polyploid species, there was no sufficient time of divergence to produce genetic species according to the neutral expectations. Figure 5 displays the log-log transformation of both curves. In both cases, the adjustment was statistically significant by a linear regression model (Pearson, P,,0.001). As in community ecology, eukaryotic chromosomes display the universal speciesarea relationship with z values analogous to regional spatial scales [27].

Conclusion
Almost one hundred years ago ecologists recognized the universal uneven distribution of species abundance, and the increase in number of species with increasing area [1]. Just a decade ago, however, neutral demographic processes emerged as the simplest mechanical explanation behind both patterns in communities [3]. More recently, Lynch and Conery [28] hypothesized that complexity of eukaryotic genomes emerged passively during evolution as a consequence of population size reduction. Here, we demonstrated that a simple stochastic process associated to a few number of parameters fits the pattern of abundance and diversity of genetic species along a great diversity of eukaryotic genomes.
We are certainly aware that the fit of a neutral pattern does not necessarily imply the existence of a neutral process behind the pattern, but it does offer the simplest explanation consistent with current data. The excellent, taxonomically broad fit of neutral theory to genomic element diversity and abundance raises the unavoidable question: why is there not a stronger signature of natural selection in ecological communities or in genomes at large scales? Ecologists have recognized the existence of many kinds of trade-offs, for instance species with high dispersal rates are not good competitors. However, it is not yet known to what extent such trade-offs maintain diversity or is consistent with, neutral dynamics. For genomes, the mechanisms that maintain element diversity, and whether these involve trade-offs, are not yet understood. Which mechanisms operate will also depend on whether genome size is under strong or weak selection [29]. More likely, element diversity of genomes results from some combination of neutral drift and weak selection on different genetic species [30].
Independently of the answer, the model tested here should be the null hypothesis to test for alternative mechanisms explaining the community dynamics of genetic elements in eukaryotic genomes.

Genomes, Genetic Species and Elements
Genome sequences of 31 species ranging from unicellular eukaryotes to mammals were downloaded from Ensembl [16] ( Table S1). Genetic elements of genomes were divided in biotypes and repeated elements. Repeated elements in chromosomes were  RepeatMasker program (http://www. repeatmasker.org) with default parameters. Biotypes were retrieved using the Biomart API [15], and each type was used as genetic species (GS). Pseudogenes were removed from the analysis. For repetitive elements, the ''class'' level of the hierarchy of the RepBase ontology [17], also referred to as ''superfamily'', according to the International Committee on Classification of Transposable Elements (http://giri.org) was considered here a ''genetic species'' (GS). Each of the elements of a GS (repetitive element or biotype) was considered a genetic element of a genome. Table S2 summarizes the information of species and number of elements for two chromosomes. The complete set of genetic species and the corresponding number of elements for each   chromosome in all genomes is available in External Dataset S1 at EDS.

Randomization of genetic elements in chromosomes
To test for the random allocation of genetic elements in chromosomes we generated 1,000 random distributions of genetic elements in each chromosome for all genomes. Simulations considered chromosome size as the sum of all 10 kb windows containing at least 1 genetic element; although this methodology is simple, it ensures to avoid wired regions as centromeres or not fully sequenced regions. All genetic elements of each genome were distributed among chromosomes, according to a probability dependent of the size of the chromosome. Therefore, for genetic elements of the human genome there is a six times greater likelihood of belonging to chromosome 1 than to chromosome 22 (since their respective lengths are 225 Mb and 35 Mb).

EcoloPy-UNTBgen Package
There are specific tools for the statistical analysis of species abundances data in ecology. Many of them implement statistical functions fitting data to models and testing for neutrality [21,22]. However none of these programs are able to deal with the large dataset characteristic of genomes. The high number of repetitive elements in genomes makes impractical most algorithms, especially the computation of the Etienne model described below. In order to adapt the algorithm to large samples, we developed UNTBgen in EcoloPy package (http://bioinfo.cipf.es/ecolopy/). UNTBgen implements all functions of UNTB [22] and the Etiennes's programs [20] which were written in R language and in PARI/GP system, respectively. However, UNTBgen takes advantage of the GNU Multiple Precision (GMP) library [31], and the Multiple-Precision Binary Floating-Point (MPFR) library [32], through the fast multiprecision GMPY module.

Ewens and Etienne models
Ewens sampling formula (ESF) [19] describes the probability distribution of a configuration of alleles in a sample of genes under the infinite-alleles model. By defining h~2J M u (3) it is possible to apply ESF and to compute its likelihood given a community of individual of different number of species, Pr(S,n 1 ,n 2 ,:::,n s Dh)~J M !h S 1 w 1 2 w 2 :::J hJ M M w 1 !w 2 !::: where, S is the total number of species, n i corresponds to the abundance of species i and w a the number of species with abundance a. UNTBgen is able to generate, using this sampling formula a neutral distribution of species abundances given sample size J and h. The number of species generated is free according to the formula but can be fixed by keeping only those random abundances generated with the desired number of species. The likelihood function: where I is the number of immigrants that compete with the local individuals for vacant spots in a fixed area size after the death of individual in the local community. Etienne's sampling formula is: where S (n,k) is the stirling number of the generic n and k values. This computation is the main computation bottleneck in the resolution of the equation, as was mentioned by Etienne [20]. UNTBgen performs the same calculation implemented by Jabot and Chave in the program Tetame [33] that takes advantage of the recurrence function of stirling numbers: This function allows building a table of stirling numbers instead of computing them directly (in order to compensate the memory requirements of genomic data, this table may be dynamically reduced to the needed values of n and k). After computation, UNTBgen computes the likelihood of Etienne's sampling formula according to the parameters h and m for a given dataset allowing parameters optimization.

Model optimization and testing
Ewens and Etienne models were optimized through different optimization strategies. In the case of the Ewens' formula, h is the only parameter to be considered, and its estimation was achieved with the golden optimization strategy [34]. For Etienne's model, two parameters were optimized, h and m. In this case we used different strategies implemented in Scipy [31], the best solution among the downhill simplex algorithm [35], the L-BFGS-B algorithm [36], the truncated Newton algorithm and the Sequential Least Squares Programming.
Parameter optimization is a critical step especially under Etienne's model. Under this model the optimization of the parameters may lead to several optima [37]; a simple way to check if our estimation of the global optima is true consists in exploring the likelihood surface representative of the variation of the parameters studied. We thus followed this procedure given a range of values of h and m for representative chromosomes in our dataset. The inspection of the likelihood surface generated ( Figure  S2) allows us to find the best solution for both parameters. Since Ewens' model is a special case of Etienne's model, we compared the delta parameter of the likelihood ratio test (LRT) against the value of a chi-squared distribution with one degree of freedom. Additionally, given the large number of statistical tests performed, we applied a Bonferoni correction to the number of results showing statistical significance to correct for Type I error (FDR,0.05) [38]. Etienne's model was thus kept as best fit model for only those chromosomes that pass the LRT after FDR adjustment; otherwise the null model using Ewens' formula was selected.

Test of neutrality
Recently two exact tests were developed to accept or reject the neutral community ecology hypothesis [21,25]. Both tests are based on the simulation of a given number of random neutral communities, with parameters estimated from the actual data, and the posterior comparison with the observed distribution of abundance.
The first test [25] compares likelihood distributions. This corresponding distribution of random neutral abundances is compared to the likelihood of the observed data. The major problem of this test is technical. The machine computation time needed to calculate the K(D,A) from thousands of neutral simulations with genomic data make this approximation impractical. However, in order to ensure the robustness of our results, we applied this test using 100 simulations (Table S3).
The second test [21] compares, instead of likelihood, the Shannon's entropy [39] index (sometimes called evenness): where, n is the total number of genetic species in chromosome and pi is the fraction of individuals belonging to the species i in that chromosome. This approach is faster because random abundances do not need to be optimized in the neutral model.  'm'' column shows the migration parameter according to Etienne's model. ''Best model'' column shows which model was found to have a better fit for each chromosome. ''lnL p(q)-val'' column corresponds to the p-value (q-value) of the neutrality test comparing the likelihood to fit Etienne's model of simulated data to the likelihood of observed data. ''N'' column shows the number of simulations that where generated for each chromosome (in the context of the neutrality test comparing likelihoods). ''delta-H'' represents the difference in shannon entropy (H) between 1,000 simulation and the observed values. ''shannon p(q)-val'' column shows the p-value(q-value) of the neutrality test that consists in comparing the shannon entropy of simulated distribution of species' abundances to the observed data. (DOCX)