The mode and tempo of genome size evolution in the subgenus Sophophora

Genome size varies widely across organisms, with no apparent tie to organismal complexity. While genome size is inherited, there is no established evolutionary model for this trait. Hypotheses have been postulated for the observed variation in genome sizes across species, most notably the effective population size hypothesis, the mutational equilibrium hypothesis, and the adaptive hypothesis. While much data has been collected on genome size, the above hypotheses have largely ignored impacts from phylogenetic relationships. In order to test these competing hypotheses, genome sizes of 87 Sophophora species were analyzed in a comparative phylogenetic approach using Pagel’s parameters of evolution, Blomberg’s K, Abouheif’s Cmean and Moran’s I. In addition to testing the mode and rate of genome size evolution in Sophophora species, the effect of number of taxa on detection of phylogenetic signal was analyzed for each of these comparative phylogenetic methods. Sophophora genome size was found to be dependent on the phylogeny, indicating that evolutionary time was important for predicting the variation among species. Genome size was found to evolve gradually on branches of the tree, with a rapid burst of change early in the phylogeny. These results suggest that Sophophora genome size has experienced gradual changes, which support the largely theoretical mutational equilibrium hypothesis. While some methods (Abouheif’s Cmean and Moran’s I) were found to be affected by increasing taxa numbers, more commonly used methods (λ and Blomberg’s K) were found to have increasing reliability with increasing taxa number, with significantly more support with fifteen or more taxa. Our results suggest that these comparative phylogenetic methods, with adequate taxon sampling, can be a powerful way to uncover the enigma that is genome size variation through incorporation of phylogenetic relationships.


Introduction
When considering trait evolution, sequence of the genotype is traditionally inspected for evidence of selection or drift, through methods such as DN/DS ratios. These tests, however, are not easily applied to genome size. Genome size, like gene expression, is an intermediate phenotype; while the trait is directly influenced by the sequences in the genome, there is not a specific sequence tied to it, and it must therefore be analyzed in a phenotypic fashion. Genome size has a1111111111 a1111111111 a1111111111 a1111111111 a1111111111 genome size change [6,20]. The proposed significant relationship between effective population size and genome size is lost when accounting for the phylogeny, leaving Lynch's effective population size hypothesis for genome size evolution conjectural. In general, the lack of phylogenetic consideration has resulted in a lack of knowledge about how changes in genome size have occurred throughout evolutionary history, whether random or adaptive and selected.
In an effort to address the lack of consideration of phylogenetic relationships among species when analyzing genome size variation, we produced a phylogeny of Drosophilidae, with a focus on Sophophora, using aligned sequence data for 87 species. The resulting tree and associated branch lengths was used to generate Pagel's parameters of evolution, Blomberg's K, Moran's I, and Abouheif's C mean , for genome size evolution in Sophophora [21][22][23][24]. If Pagel's λ and Blomberg's K are approaching one, the presence of phylogenetic signal would suggest that genome size is not evolving according to the adaptive hypothesis. If there is signal, and Pagel's κ and δ values are approaching 1, this would suggest gradual change of genome size throughout time, supporting the mutational equilibrium hypotheses. If there is signal, with δ and κ values below one, this would suggest early change in branches and the tree, supporting the low effective population hypothesis.
The complete analysis utilized a relatively large number of species (87) and a 3X range of genome size values to generate estimates. To determine the reliability of these phylogenetic analyses with different numbers of taxa, we generated the same parameter estimates with reduced taxa numbers and reduced ranges of genome size. Several, but not all of the parameters are sensitive to taxon number; genome size range had little effect on the results.

Genome size database
Genome sizes for species were obtained from published datasets [12], with additional data from the laboratory database of J. Spencer Johnston. Genome sizes were estimated using the flow cytometric method [25] for species obtained from the UC San Diego Species Stock Center (http://stockcenter.ucsd.edu) ( Table 1).
Genome sizes were obtained from published literature and the laboratory database of J. Spencer Johnston. Species were obtained from the UC San Diego Stock Center. Genome size in these species were found to have a mean of 215.5 Mbp, a median of 210.8 Mbp, a minimum of 139.9 Mbp, and a maximum of 395.2 Mbp.

Model testing
Each sequence alignment was analyzed in JModelTest 2.1.4 to determine the model of sequence evolution that provided the best likelihood value [26]. The likelihood search assumed 11 possible substitution schemes, allowing for both invariant sites and gamma distributions. A fixed BIONJ-JC tree was used for the likelihood calculations. All runs returned the same

Data file preparation and tree construction
All sequences were interleaved to produce a 10,382 bp alignment. Missing sequence data was imputed for taxa that did not have gene sequences for every gene. Missing data does not influence the results of branch lengths or phylogenetic relationships. This resulted in an average of 7 genes per taxa, with a maximum of 15 and a minimum of 3 genes. A phylogeny for Sophophora was reconstructed using a supermatrix model of phylogeny construction utilizing MrBayes 3.2.3 on the CIPRES supercomputer (http://www.phylo.org/) with four chains and four runs and a GTR gamma + I evolutionary model for 32,835,000 generations (sampling every 1,000) using a Dirichlet prior of (1, 0.5, 1, 1) [27,28]. Parameter output was visualized in Tracer v 1.6 to assure the four runs had reached convergence and to determine burn-in. The consensus tree was then visualized in FigTree v.1.4.2. Genome size was mapped onto the phylogeny using the ContMap function from the phytools package from R3.3.0 [29]. Multiple trees were constructed with varying Bayesian priors to test if there were any issues with branch lengths (Dirichlet (1,1,1,1), exponential (10)) and a maximum likelihood tree.

Tree manipulation and significance tests with different numbers of taxa
To test for the effect of taxa number on significance levels in a phylogenetic signal analyses, multiple reduced taxa phylogenies were made (5, 10, 15, 20, 30 taxa) with 20 different trees for each group. Trees were constructed by randomly trimming the taxa from the original Drosophila tree utilizing the drop.tip function in the package ape from R 3.3.0 [30], while maintaining tree topology and branch lengths. Taxa retained for each tree were chosen by random number generation.

Trait analyses
Comparative phylogenetic analyses (Pagel's parameters of evolution, Blomberg's K, Moran's I, and Abouheif's C mean ) were run on both the full phylogeny and each reduced taxa phylogeny with genome size a continuous trait. Pagel's lambda (λ) and Blomberg's K test for phylogenetic signal assuming Brownian motion. Pagel's kappa (κ) tests how traits evolve along branch lengths (κ < 1, early change; κ = 1, gradual change). Pagel's delta (δ) tests how traits change from the overall path on the tree, from root to tip (δ < 1, rapid early change, δ = 1, gradual change; δ > 1, increasing rate of change). All comparative phylogenetic analyses were completed using functions and packages available in R. Pagel's parameters of evolution were measured using the function PGLS from package caper [31]. Blomberg's K was estimated using the phylosignal function from package picante [32]. Moran's I and Abouheif's C mean values were calculated using the function abouheif.moran with 999 permutations from package adephylo [33].

Alternative test for adaptive hypothesis of genome size evolution
In order to test the alternative adaptive hypothesis for genome size, climatic data for these Sophophora species (critical thermal maximum, maximum temperature, minimum temperature, annual mean temperature, annual precipitation, precipitation from the wettest month, precipitation from the driest month, and latitude) were mined from two Kellerman et al. papers on phylogenetic constraint of climatic variables in Drosophila [34,35]. This totaled 38 species of Sophophora. These variables were analyzed with multiple regression and phylogenetic generalized least squares (PGLS) analyses utilizing the function pgls from package caper [31].

Statistical analyses
Statistical tests of fit for each comparative phylogenetic analysis is provided with output of each. For taxa number analysis, the phylogenetic values for each of the 20 runs with taxa number dataset were visualized with boxplots. The effect of taxa number on estimated phylogenetic values was tested using ANOVA in R with number of taxa treated as a random variate. In order to understand how Pagel's λ was affected by increasing taxa number, statistical differences (p-values for H o : λ = 1.0 or 0.0) were plotted in a boxplot using values from each of the 20 runs at each level of reduction. These p-values were then compared with ANOVA using number of taxa as a random variate. Genome size range was also used as a covariate in an ANCOVA in R (λ p-value = Genome size range + taxa number +genome size range Ã taxa number) in order to see if there was an interaction between range in genome size and taxa number effect on λ significance values.

Genome size
Genome size for the female of each species is given in Table 1. Genome size for the Sophophora subgenus of Drosophila, plus a few outgroups from Drosophila subgenera (Zaprionus, Scaptodrosophila, Scaptomyza, Hirtodrosophila and Chymomyza) ranges from 139.9 Mb to 395.2 Mb with an average of 215.5 and a median of 210.8 (Table 1).

Sophophora phylogeny
The overall Sophophora phylogeny is well supported with high posterior probability values at each node, most being 1, with the lowest support value being 0.53 (Fig 1). The relationships in this phylogeny are supported by other large Drosophila phylogenies in the literature [12,[36][37][38]. These results suggest that the constructed phylogeny is representative of the true relationships found in Sophophora, and should have reliable branch lengths. No significant differences were found among trees constructed with varying Bayesian priors.

Tests of rate and mode of genome size variation
Model fit of the complete dataset (87 genome sizes) in a phylogenetic context using the above phylogeny with branch lengths shows that phylogenetic relatedness is a significant component of genome size variation among Sophophora. All tests for phylogenetic signal/dependence (λ, Blomberg's K, Abouheif's C mean , and Moran's I) indicate complete signal with high significance values (Table 2), most notably λ = 0.987. Genome size across the phylogeny was also found to have a κ value of 0.971 and a δ value of 0.589.

Multiple regression and PGLS for climatic variables
After incorporating phylogenetic relationships, climatic variables failed to be significantly related to genome size variation in 38 species of Sophophora. Multiple regression analysis indicated that genome size was significantly influenced by the climatic variables (p = 0.015, Adj. Rsquared = 0.30). When this model was analyzed utilizing PGLS to incorporate phylogenetic relationships, the pattern disappeared (p = 0.602, Adj. R-squared = 0.044, Table 3).

Taxa number analyses
Effects on mean values. When subsets of taxa are analyzed, means for λ, Blomberg's K, Moran's I, and Abouheif's C mean all increased with an increase in taxon number, indicating an increased signal of phylogenetic dependence with increased taxa number (S2 Table). However, a significant differences in the estimated parameter value for different taxon numbers was only found in Moran's I (p = 1.67e-05) and Abouheif's C mean (p = 0.0469). No significant effect was found with increasing taxa number for λ values (Fig 2A).
Effects on significance. In contrast to the effect of taxon number on the means, the number of significant differences of the λ estimate from the boundary conditions, 0.0 or 1.0, increased with taxa number. As shown in the boxplots, λ p-values decreased and had lower variation with increasing taxa numbers, indicating that higher taxa numbers convey higher confidence in the results for each test (Fig 3). The variation among p-values for different taxa numbers was statistically significant (p = 0.000771, ANOVA, S3 Table). A significant decrease of the p-value for Blomberg's K was also (Fig 3) observed with increasing taxa number (p = 3.65e-07, S3 Table).
Because experimental error is expected to make up an ever larger proportion of the total variation when the measured range of genome sizes is small, we tested whether the tests of phylogenetic signal are sensitive to genome size variation among taxa. Genome size range was used as a covariate in an ANCOVA in order to determine if the range in genome size contributed to the significance of the λ results among the taxa number datasets. While the ANCOVA model was significant (p < 0.01), there was no significant interaction between genome size range and taxa number (p = 0.263). Genome size range did not contribute significantly to the model (p = 0.516), while the taxa number contribution was highly significantly (p < 0.001, Table 4).

Discussion
Here, we use a comparative phylogenetic approach to investigate genome size in Sophophora species. We specifically look at measures of phylogenetic signal (Pagel's λ, Blomberg's K, Abouheif's C mean , and Moran's I) and at measures of mode and tempo of evolution (Pagel's δ and κ) in order to test three hypotheses of genome size evolution: low effective population size hypothesis, mutational equilibrium hypothesis, and an adaptive hypothesis. Genome size was found to have complete phylogenetic signal for Sophophora (λ = 0.987, Blomberg's K = 1.373, Abouheif's C mean = 0.240, Moran's I = 0.180, Table 2). Based on our expectations for the adaptive hypothesis, the presence of complete phylogenetic signal suggests that little, if any, of the genome size variation is evolving in an adaptive fashion. This conclusion is also supported by the results of the PGLS analysis, which found that genome size is not significantly related to climatic variables (Table 3). Interestingly, when these climatic variables were phylogenetically analyzed by Kellerman et al., they were found to have complete phylogenetic signal [34,35]. Since, genome size and climatic variables both have phylogenetic signal, we can assume that any patterns we see between these characteristics in Drosophila in a non-phylogenetic aspects are due to constraints of the phylogeny, not a direct relationship. Also, when inspecting the trait mapped on the phylogeny, there is not significant evidence for bursts of change in concordance with the adaptive hypothesis, aside from the decrease in the D. melanogaster clade (Fig 1). Rather, genome size evolution is reliant upon phylogenetic patterns. These results are supported by recent work on genome size evolution in Drosophila species [39]. Here, Sessegolo et al. investigated the impact of phylogeny on genome size and transposable elements for 26 species of Drosophila utilizing available sequences and a de novo transposable element assembly approach. They found a significant correlation between genome size and global transposable element content, with strong phylogenetic signal for each. While simple repeats accounted for up to 1% of the repeatome, LTRs and LINE elements There is no clear pattern with increasing taxa number for Pagel's parameters of evolution or Blomberg's K; however, there is an increase in values for both Moran's I and Abouheif's C mean . These differences are tested statistically in S2 Table. doi:10.1371/journal.pone.0173505.g002 were found to be major components. These data suggest that the genome size variation of Drosophila species are largely driven by transposable elements. The current study, while not including information on proportions or dynamics of repeat sequences, has largely expanded the number of taxa used an earlier study from 26 to 87. By increasing the number of taxa, we can hope to determine if the overarching patterns of genome size evolution in Drosophila remain consistent and better identify which species may be of interest for full sequence investigation.
Genome size was also found to have a κ value of 0.971 (Table 2), indicating that genome size evolves in a gradual fashion and reflects individual branch lengths. The phylogenetic signal, the gradual manner in which genome size is changing, and the relationship of branch length and amount of change supports the mutational equilibrium hypothesis [8]. However, there have been concerns expressed about this hypothesis, as it seems to be largely theoretical and has yet to have a large enough dataset to support it [15,16]. Small imbalances between insertions and deletions are not likely to move fast enough to change genome size dramatically,   Table. doi:10.1371/journal.pone.0173505.g003 especially when it seems as if genome size is being driven by relatively large insertions of transposable elements [39]. The recent accordion model shows that dynamic changes in transposons may be associated with large deletions and lead to apparent stasis of genome size [17]. However, transposition and large deletions, if imbalanced, would drive genome size evolution at an accelerated rate. However, neither stasis nor an imbalance of transposition and large deletions would necessarily produce the phylogenetic signal observed for these 87 species. Interestingly, a δ value of 0.589 (Table 2) indicates that the rate of genome size change was not always constant. This δ value suggests change occurred rapidly early in the phylogeny, likely at the formation of the Drosophila genus, with a decrease in that rate as time went on throughout Sophophora. The early change in genome size could be due to low population sizes, which would appear to support the effective population size hypothesis [6]. However, genome size in this group has moved towards smaller sizes rather than increased sizes [14], contradicting the hypothesis that lower effective population sizes lead to larger genome sizes. The original burst could therefore be adaptive. If so, the smaller genome sizes could have been due to selection on one of their phenotypic correlates [40]. Specifically, selection could have acted on smaller cell and body sizes or shorter development and cell cycle times [41][42][43][44][45]. It is also possible that the low effective population was inefficient at selecting out a slightly deleterious, non-adaptive trait, such as increased deletion rate. After early, relatively large genome size changes, the rate of evolution could have slowed to the current, gradual rate. It is important to note that the change in rate would have had to happen quickly to not be reflected in the κ values.
It is important to ask if further sampling will change the conclusions above. If the change in Sophophora genome size does actually fit the mutational equilibrium hypothesis, it is possible that heavier sampling of the genus or subgenus could fill in the gaps for the large change early in the tree. While this is a possibility, taxon sampling issues addressed in this study, suggest that the significance values and the magnitude of these values vary little when overall study size reaches n = 30. The number of taxa examined here (n = 87) is well above that. The importance of a large enough sample size for tests of phylogenetic signal cannot be ignored. Increases in significant measures of phylogenetic signal with taxa number were found to increase with taxon number in both Abouheif's C mean and Moran's I. This suggests the results of these two methods are sensitive to taxa number and they should be used sparingly, more so as preliminary tests for comparative studies. At the same time, while Abouheif's C mean and Moran's I were sensitive to increasing taxa number, there were no significant effects of taxa number on either Pagel's λ or Blomberg's K, suggesting the results of these methods are less sensitive to taxa number. At the same time, while there are no clear patterns for the magnitude of the parameter values of phylogenetic signal, there is a significant difference in the p-values obtained at different taxa numbers. Since the p-values are measures of significant differences from the bounds (signal vs. no signal), they can be considered proxies for test reliability. Based on these analyses, sample sizes of at least 15 are necessary to achieve reliable results in terms of significance for Pagel's λ and Blomberg's K (Fig 3). The pattern of increased reliability (statistical p-value from the bounds) continues as the taxa number increases; the best results are obtained with larger taxon sampling. These results are supported by a previous study tested the effectiveness of detecting phylogenetic signal using simulated taxa with ranges of Brownian motion [46].
The number of taxa is important, yet the range in the trait value across the tree could also affect the reliability of the phylogenetic signal results. Narrow or wide ranges in variation could skew the interpretation of these comparative results. However, we found that sampling from the range of genome size in Sophophora, had a non-significant effect (p = 0.5164) and no significant interaction was found between genome size range and taxa numbers (p = 0.263).
Only taxa number was found to be significantly contributing to the fit of the ANCOVA model (Table 4). Reduced taxa results, in conjunction with previous results using simulated datasets [46], show the strength of these tested methods for calculating phylogenetic signal. Most emphasis should be put into Blomberg's K and Pagel's parameters of evolution, as they are least sensitive to taxa number affecting the calculated phylogenetic signal value. However, these two methods must have at least a minimal sample size (15)(16)(17)(18)(19)(20) to achieve reliable results. While there are some taxa number effects on phylogenetic signal estimates for Abouheif's C mean and Moran's I, they still are good quick, preliminary measures for phylogenetic signal before the use of more robust comparative methods, such as Pagel's λ and Blomberg's K.
While the signal detected here rejects a non-phylogenetic model of change, it has yet to fully support one of the proposed phylogenetic patterns of change (effective population size vs. mutational equilibrium). The early burst of change (δ = 0.589) would seem to fit the small species effective size hypothesis, yet the trend is to a decrease rather than an increase in genome size, suggesting that this change could be due to adaptation or selection. The gradual change (κ = 0.971) in genome size after that burst suggests a model similar to the mutational equilibrium hypothesis with large deletions balancing out the large insertions due to transposable elements. We argue therefore, that the rapid early change in Sophophora may represent an increase in deletion rate, and possibly an adaptive radiation associated with selection for rapid development rate and small size. Subsequent change is gradual as expected of a deletion insertion balance.
Supporting information S1  Table. ANOVA results and means for each phylogenetic analysis. Phylogenetic value means from each phylogenetic analyses for taxa number (5,10,15,20, and 30 taxa) datasets were compared using ANOVA. Given the results, Moran's I and Abouheif's Cmean are significantly affected by taxa number. (XLSX) S3 Table. ANOVA and means for P-values for λ and Blomberg's K analyses. P-values from the opposite bounds for Pagel's λ and Blomberg's K for the taxa number (5,10,15,20, and 30 taxa) datasets were compared using ANOVA. Given the results, the reliability of these phylogenetic signal tests increase as taxa number increases. (XLSX)