So many genes, so little time: comments on divergence-time estimation in the genomic era

Phylogenomic datasets have emerged as an important tool and have been used for addressing questions involving evolutionary relationships, patterns of genome structure, signatures of selection, and gene and genome duplications. Here, we examine these data sources for their utility in divergence-time analyses. Divergence-time estimation can be complicated by the heterogeneity of rates among lineages and through time. Despite the recent explosion of phylogenomic data, it is still unclear what the distribution of gene- and lineage-specific rate heterogeneity is over these genomic and transcriptomic datasets. Here, we examine rate heterogeneity across genes and determine whether clock-like or nearly clock-like genes are present in phylogenomic datasets that could be used to reduce error in divergence-time estimation. We address these questions with six published phylogenomic datasets including Birds, carnivorous Caryophyllales, broad Caryophyllales, Millipedes, Hymenoptera, and Vitales. We introduce a simple and fast method for identifying useful genes for constructing divergence-time estimates and conduct exemplar Bayesian analyses under both clock and uncorrelated log-normal (UCLN) models. We used a “gene shopping” method (implemented in SortaDate) to identify genes with minimal conflict, lower root-to-tip variance, and discernible amounts of molecular evolution. We find that every empirical dataset examined includes genes with clock-like, or nearly clock-like, behavior. Many datasets have genes that are not only clock-like, but also have reasonable evolutionary rates and are mostly compatible with the species tree. We used these data to conduct basic divergence-time analyses under strict clock and UCLN models. These exemplar divergence-time analyses show overlap in age estimates when using either clock or UCLN models, but with much larger credibility intervals for UCLN models. We find that “gene shopping” can be productive and successful in finding gene regions that minimize lineage-specific heterogeneity. By doing relatively simple assessments of root-to-tip variance and bipartition conflict, we not only explore datasets more thoroughly but also may estimate ages on phylogenies with lower error. We also suggest the need to explore more detailed and informative approaches to determine fit and deviation from a molecular clock, as existing approaches are exceedingly strict.


42
Datasets based on thousands of genes from genomes and transcriptomes have emerged as a 43 major tool in addressing broad evolutionary questions including, but not limited to, 44 phylogenetic reconstruction, gene and genome duplication, and inference of molecular 45 evolutionary patterns and processes. And while these datasets have been used for 46 divergence-time estimation (e.g., Jarvis et al. 2014b;Prum et al. 2015), their overall utility 47 for divergence-time analyses has not been fully examined. In particular, it is unclear whether within these enormous datasets there exist nearly clock-like gene regions that may 49 aid in producing lower error divergence-time estimates. While some authors, such as Jarvis 50 et al. (2014b), have suggested choosing clock-like genes, a repeatable and fast procedure to 51 identify these genes has not been developed for phylogenomics and an examination of the 52 frequency of these genes in empirical datasets has not been conducted. 53 Divergence-time estimation is a complicated, but often essential, step for many sites and can contribute to extensive deviations from the molecular clock. As a result, 64 complex models have been developed to accommodate for these deviations (Sanderson Mirarab et al. 2014;Roch and Warnow 2015)? How many genes support the dominant species tree signal (e.g., Salichos et al. 2014;Smith et al. 2015)? Genomic datasets also 105 allow us to examine the extent of molecular rate variation in genes, genomes, and lineages analysis on their genomic data, a thorough examination of lineage-specific rate 113 heterogeneity for divergence-time estimation has not been conducted. Nevertheless, the 114 availability of full genomes and transcriptomes makes identifying genes with lower rate 115 variation possible and so are more suitable for divergence-time estimation.

116
Here, we examine six genomic and transcriptomic datasets across animals and 117 plants and with different temporal and taxonomic scopes to examine the extent of 118 lineage-specific rate heterogeneity. We investigate the distribution of variation in the 119 branchwise rates of evolution across thousands of genes to understand whether these new 120 genomic resources may improve divergence-time estimation by allowing for simpler models 121 of molecular evolution. Finally, we introduce a simple sorting procedure to identify 122 informative and nearly clock-like genes. This procedure can be used to examine the overall 123 distribution of evolution, rate heterogeneity, bipartition concordance, and potential utility 124 of genes for divergence-time analysis.  The range in datasets spans different taxonomic groups, datasets sizes (e.g., CAR vs variance of root-to-tip lengths for each tree. This was performed on each rooted ortholog,  Simulations.-We conducted simulations to examine expectations of rate variation given 152 clock-like, noisy clock-like, and uncorrelated lognormal data. We first generated simulated 153 clock-like data using Indelible v1.03 (Fletcher and Yang 2009) using the WAG model 154 with 500 characters for amino acid datasets, and JC with 1500 characters for nucleotide 155 datasets, on each of the empirical species tree topologies. For these data simulations, 156 because the species tree often had no branch lengths available, node heights were first 157 simulated randomly using Indelible and then the tree was rescaled to a height of 0.25, 158 0.5, or 0.75. We used the trees generated by Indelible to further simulate 100 noisy clock (rate=1.0, noise=0.25, and rate=1.0, noise=0.75) and uncorrelated lognormal (UCLN) 160 trees (mean.log=-0.5, sd.log=0.5, and mean.log=-0.5, sd.log=1.0) using NELSI v0.21 (Ho 161 et al. 2015). We note the 'noise' in NELSI corresponds to the standard deviation of a 162 normal distribution with mean = 0. For the noisy clock, branch-specific rates are a sum of 163 the global rate (here, 1.0) and a draw from this normal distribution. The simulations with 164 noise=0.75 thus are only loosely clock-like, and serve as a comparison between the more 165 clock-like (noise=0.25) and UCLN analyses. We used RAxML v8.2.3 (Stamatakis 2014) to 166 reconstruct each of these datasets. For each simulation, we examined the rate variation and 167 the root-to-tip length variation on the reconstructed phylograms.

168
While the focus of this study is not the performance of divergence-time estimation 169 methods, we still wanted to examined an exemplar from the simulations to ascertain the 170 variation in the results given different clock models. We used one random realization of 171 node heights as simulated from the Indelible analyses as mentioned above to generate 172 two datasets with NELSI. One dataset had three genes generated from a clock rate of 1 and 173 noise at 0.25, and the other dataset had three genes generated from a UCLN model and 174 mean.log at -0.5 and sd.log=1. As above, each amino acid gene consisted of 500 residues, 175 while DNA genes consisted of 1500 nucleotides. For both the noisy clock-like and UCLN 176 datasets, we conducted BEAST analyses with both a clock model and a UCLN model. A 177 birth-death tree prior was used as the prior for all node heights, and runs were conducted 178 for 10,000,000 generations with the first 10% discarded as burnin. Results were 179 summarized using treeannotator from the BEAST package. Median node heights as well as 180 95% HPD node heights were compared between the simulated datasets and the tree used 181 to generate these datasets.

182
Sorting and dating analyses on real data.-In addition to these analyses on simulated 183 datasets, we conducted divergence-time analyses on a subset of the empirical datasets.

184
Because these datasets consist of hundreds to thousands of genes, we developed a sorting 185 procedure intended to mimic that which would be performed as a "gene shopping" 186 analysis. The sorting procedure relies on the root-to-tip variance statistic, bipartition calculation to determine the similarity to the species tree, and total treelength. We sorted 188 first by the similarity to the species tree, then root-to-tip variance, and finally treelength. 189 We limited the results to the top three genes reported from the sorting procedure. Because 190 we filtered for genes that were consistent with the species tree, these genes were then 191 concatenated and the topology was fixed to be consistent with the species tree. For each of 192 these datasets, we conducted two BEAST analyses, one assuming a strict clock and the other 193 assuming a UCLN model. Because specific dates were not the focus of this examination, 194 the birth-death tree prior was used instead of fossil priors for nodes. The analyses were run 195 for 10,000,000 generations with the first 1,000,000 discarded as burnin. As above, results

196
were summarized using treeannotator from the BEAST package. Median node heights and 197 95% HPD node heights were compared between the clock and UCLN runs as the node 198 heights on the true phylogeny are unknown. clock test for the VIT dataset to 17% for the MIL dataset (see Table 1). The variation in 208 the percentage of deviation from the clock may reflect dataset size and the age of the clade 209 involved. As for size, the CAR dataset has 7 taxa that are not included in the CARY 210 dataset but otherwise overlaps partially and has far fewer taxa in total. The CARY 211 dataset, in addition to being much larger, also contains known shifts in life history (Yang 212 et al. 2015). These differences may account for the variation between these two datasets.

213
As for clade age, HYM and MIL are significantly older than the other datasets, which may 214 account for their rate variation. Nevertheless, each dataset indeed had at least a few 215 orthologs that passed a strict clock test even if these orthologs were in the small minority.

216
Because passing a clock test does not necessarily indicate that the gene would be 217 good for phylogenetic reconstruction, we also measured treelength and root-to-tip variance 218 for each ortholog (see Figures 2-3). Clock tests are stringent in their need to conform to 219 the clock (see below) and so by examining the root-to-tip variation and lineage-specific were excluded for clock tests and in determining root-to-tip variance for "gene shopping", 236 we allowed outgroups to remain for lineage-specific rate variation analyses as in the right 237 handed plots of Figure 3. The VIT dataset was an exception with several lineages other 238 than the outgroup having high rates. In each dataset, there were genes that fell within the 239 distribution of simulated trees that are clock-like or clock-like with low noise.

240
One potential benefit of identifying orthologs with lower lineage-specific rate 241 variation within phylogenomic datasets is to use these, or a subset of these, orthologs to conduct divergence-time analyses. The hope is that by using clock-like genes, we may 243 overcome or lessen the impact of lineage-specific rate variation on the error of divergence 244 time analyses. The non-identifiability of rates and dates (e.g., longer branch lengths may 245 be the result of a long time or fast evolution) is exacerbated by lineage-specific rate 246 heterogeneity. We used a subset of orthologs to conduct divergence time analyses and we 247 implemented a sorting procedure (packed in SortaDate) to (i) filter the genes that best 248 reflect the species tree (i.e., higher bipartition concordance witht he species tree), (ii) have 249 lower root-tot-ip variance (i.e., most clock-like), and discernible amounts of molecular 250 evolution (i.e., greater tree length; Figure 1). For each empirical dataset, we generated 251 such an alignment (see Table 2). The genes that were filtered and used for divergence-time 252 analyses for the BIR, CARY, VIT, and HYM datasets rejected the clock. The genes for the 253 CAR and MIL datasets either didn't or weakly rejected the clock. Resulting HPD trees 254 were rescaled so that the root heights were equivalent to allow for easier comparisons 255 between datasets. Typically, fossil placements would be used for scaling but because these 256 are not intended to be runs for future use, we eliminated fossil placements as one source of 257 variation. We found rough correspondence of node heights between the clock and UCLN 258 analyses, especially for the four smallest datasets (see Figure 4). The UCLN analyses, as 259 expected, had far greater variance in the 95% HPDs for node ages. We found the greatest 260 differences in the larger BIR and CARY datasets (see Table 3). For both datasets, we saw 261 major differences in tree heights, especially for CARY. This may reflect the size of the 262 dataset or the underlying rate variation in the datasets. In general, strict clock estimates 263 resulted in younger median node ages than analogous UCLN estimates, as well as younger 264 maximum and older minimum 95% HPD values (see Table 3). The covariance statistics for 265 UCLN runs ranged from the lowest mean values of -0.002 (stdev=0.04) in MIL to the 266 highest of 0.246 (stdev=0.12) in BIR.

267
As is always the problem with real datasets, the true divergence-times are unknown.

268
So we conducted exemplar analyses. For each empirical dataset, we simulated data for 269 three genes under both noisy clock and UCLN models to examine the variation in the 270 resulting divergence-time analyses where the true dates were known. For these simulated 271 datasets, a strict clock was rejected in each case, including those datasets that were 272 simulated under a clock with noise. We compared the resulting node heights from the 273 divergence time analyses under clock and UCLN models with the tree used for simulation 274 (see Tables 4-5 and Figure 5). For the datasets generated under a noisy clock model, more 275 of the true node heights were found in the 95% HPD interval when using the UCLN model 276 for inference than the strict clock model for inference. However, the precision as measured 277 by the total width of the 95% HPD interval for the UCLN runs were much lower than the 278 clock runs (see Tables 4 -5). Those nodes that were not within the interval of the 95% HPD 279 when using the strict clock model for reconstruction, were close to the true value. So, while 280 fewer true node ages were contained in the strict clock HPDs, the overall error rate was 281 lower. For example, in the CARY dataset, while fewer nodes in the clock estimate were 282 found to be within the interval (52 vs 67 for the UCLN), the distance of the interval from 283 the estimate was lower for the clock dataset for both the high and low value for the 95% 284 HPD. Stated another way, the UCLN intervals were large enough that the true age was 285 often included, but this was at the cost of far lower precision. Because of this error relative 286 to the strict clock, the UCLN perhaps should not be the preferred model, especially if the 287 researcher is going to use a single summary tree for future analyses.

288
Several gene trees from the examples discussed fail a standard strict clock test but 289 have low tip-to-root variance. To explore this further, we simulated strict clock amino acid 290 and nucleotide data on orthologs from each empirical dataset and examined the frequency 291 of incorrectly rejecting a strict clock. The false rejection rate for clock tests using amino 292 acid data and a strict clock were between 5% and 8%. For the two nucleotide datasets, the 293 rejection rate was much higher at 23% and 46%. This suggests that for amino acid data, 294 the false rejection rate was near the nominal value, while for the nucleotide datasets the 295 false rejection rate was unreliable. Both nucleotide datasets (BIR and CARY) also had the 296 largest number of species and so the rejection rate may be a function of the number of 297 taxa. Sensitivity of the clock-test to nucleotide data is not the focus of this study, but 298 should be examined in more detail. Also, it would be more informative to examine the 299 deviation from the clock instead of a boolean test of significant fit. In regard to divergence 300 time estimation, if a strict or stricter clock can be used, molecular phylogenies may be 301 dated with significantly lower error. As an added benefit, fewer fossils would be necessary 302 to calibrate nodes (and indirectly, rates). We suggest that the community explore model fit 303 to relaxed clock models as well as potential alternatives to the prevailing strict clock test 304 that may be more beneficial for divergence time estimates and more informative in regard 305 to rate heterogeneity in phylogenomic datasets. 306 * Prum, R. O., Berv, J. S., Dornburg, A., Field, D. J., Townsend, J. P., Lemmon, E. M., and Tamura, K., Battistuzzi, F. U., Billing-Ross, P., Murillo, O., Filipski, A., and Kumar, S.