Trees of Unusual Size: Biased Inference of Early Bursts from Large Molecular Phylogenies

An early burst of speciation followed by a subsequent slowdown in the rate of diversification is commonly inferred from molecular phylogenies. This pattern is consistent with some verbal theory of ecological opportunity and adaptive radiations. One often-overlooked source of bias in these studies is that of sampling at the level of whole clades, as researchers tend to choose large, speciose clades to study. In this paper, we investigate the performance of common methods across the distribution of clade sizes that can be generated by a constant-rate birth-death process. Clades which are larger than expected for a given constant-rate branching process tend to show a pattern of an early burst even when both speciation and extinction rates are constant through time. All methods evaluated were susceptible to detecting this false signature when extinction was low. Under moderate extinction, both the -statistic and diversity-dependent models did not detect such a slowdown but only because the signature of a slowdown was masked by subsequent extinction. Some models which estimate time-varying speciation rates are able to detect early bursts under higher extinction rates, but are extremely prone to sampling bias. We suggest that examining clades in isolation may result in spurious inferences that rates of diversification have changed through time.


Introduction
The branching patterns of reconstructed molecular phylogenies contain information about the tempo and mode of evolution [1], [2]. This insight has been invaluable to our understanding of many evolutionary processes and patterns. Recent studies have identified a common pattern of apparent slowdowns in the rate of diversification through time (reviewed in [3], [4]). Such a pattern is consistent with some theoretical work on adaptive radiations, based on the idea that diversification rates will decrease as species fill available niches [5], [6], [7] (but see [8] for an alternate interpretation). Phylogenies with slowdowns have been inferred for diverse groups of organisms (e.g. lizards [9], birds [10], [11], fish [12], and many groups of plants [13]).
The most commonly used metric for detecting shifts in the rate of diversification is the c-statistic introduced by Pybus and Harvey [14]. This statistic quantifies how internode distances (i.e. waiting times to speciation) vary through time compared to what one would expect under a pure-birth model of diversification. Under the pure-birth null expectation, c is distributed according to a standard normal distribution. A c-value less than 21.645 (onetailed test) represents a statistically significant slowdown in diversification rate. Positive values of c can be caused by either speed-ups in diversification rates or species turnover as recently diverged lineages have have not been around long enough to have been ''pruned'' by extinction, creating an overabundance of nodes closer to the present (''pull of the present''; [2]). These two scenarios cannot be distinguished with the c-statistic, so most studies follow the authors' [14] original recommendation to disregard significantly positive c-values.
There has been considerable controversy surrounding the interpretations of slowdowns using the c-statistic on molecular phylogenies. A preponderance of studies have inferred significantly negative cvalues across a wide range of taxonomic groups.
McPeek [8] collected a large number of phylogenies from the literature and found that 80% of clades had cv0 and 42% had cv{1:96. However, a number of other studies have pointed out that negative c-values can result from factors other than slowdowns in diversification rates. It has been shown that c can be biased by under-parameterization of the model of sequence evolution [15] and non-random taxon sampling [16] (see [17] and [18] for potential solutions to this problem). Recent speciation events will have a similar effect; if speciation is modelled as a process rather than as a singular event, this can lead to apparent slowdowns in diversification rates [19], [20].
A recent simulation study, conducted by Liow et al. [21], found that c tended to be positively biased when trees were simulated under a birth-death process with variable rates of speciation. They found that, even when present, slowdowns in net diversification rates were very difficult to detect using c, as short branches near the tips of the tree tend to obscure the signal of the slowdown. Under the conditions simulated by Liow et al. [21], it should be only possible to observe a slowdown for a small subset of values of speciation rate (l) and extinction rate (m) and if lineages are sampled at a particular time in the clade's history. It seems unrealistic to propose that these two conditions are met for the majority of phylogenies. Nevertheless slowdowns are commonly inferred from empirical data. The reasons for the discordance between the analysis of empirical data and expectations derived from simulation studies are poorly understood and warrant further investigation.
Several new modeling approaches have been developed to detect non-homogeneous diversification. One is to use a model in which speciation and/or extinction vary through time [22], [22], [23], [24], [25]. Diversity-dependence (in which speciation rate varies as a function of the number of taxa at a given time) has also been proposed as an explanation for the patterns of species richness [26]. This has recently been modeled in a number of studies (e.g. [27], [28]) to look for signatures of an adaptive radiation. Diversity-dependence is a useful approach as it is seemingly consistent with theory on adaptive radiation, which posits that diversification should slow as niches become filled [6], [7]. There is also some evidence from the fossil record [29] (but see [30] for an opposing perspective) to support this modeling approach. However, such models from both paleobiology and comparative biology are susceptible to the criticism that the ecological mechanisms by which diversity dependence would operate on the scale of a clade are not entirely clear [31]. While the mathematics of the various methods differ, they all involve fitting non-homogeneous birth-death models to phylogenies rather than using summary statistics such as the c-statistic. The behavior and performance of these methods have not been explored in as much detail as the c-statistic.
One important consideration that is often overlooked in studies of clade diversification is that our inferences may be biased by the way we choose clades to investigate. There are several types of sampling bias that are likely to be important. Cusimano and Renner [16], Brock et al. [17] and Cusimano et al. [18] investigated systematists' tendency to sample representative taxa, leading to an overabundance of nodes deep in the tree. However, we should also consider that we are only sampling clades that survive to the present day. This means that the observable distribution of surviving trees is only a subset of all possible trees [32], [33]. Another form of bias is that researchers interested in the adaptive radiations are likely to be interested in the speciose clades. These clades likely reside in the 'tails' of the distribution of possible clade sizes and thus give a biased sample for the inference of diversification rates [11]. This effect of sampling clades has seldom been addressed in the literature (but see [32], [33]). Inferences of early bursts are particularly problematic because even under a constant pure-birth or birth-death process, large clades are likely to show patterns of a rapid diversification early in their history. This is simply due to the fact that in order to be large, they are more likely to have undergone a stochastically high rate of speciation early in the process and subsequently regressed to the mean speciation rate [34], [11]. Analyzing large clades in isolation may lead to inferring ecological processes (such as adaptive radiations) attributable solely to these stochastic processes.
Phillimore and Price [11] investigated the influence of clade size and age on the c-statistic though their simulations were limited to a small set of parameter space. Specifically, they conditioned their simulations on tree age to investigate the correlation between c and clade size and on clade size to investigate the correlation between c and tree age. As a result, the simulations tended to produce trees where the distribution of the parameter of interest was centered on the expected value. Here, we are specifically interested in investigating the statistical properties of trees that are unexpected, that is, unusually large. By simultaneously conditioning our phylogenies on both tree age and size, we were able to obtain trees from the 'tails' of the distribution. In addition to using the cstatistic, we investigate the effects of the non-random sampling of clades using a model selection approach. We show that the effect of this bias can be severe but it affects different methods in different ways, and we make recommendations on the appropriate use of these methods.

Results and Discussion
For all phylogenies simulated, regardless of the value of extinction rate used, large phylogenies are disproportionally likely to have undergone an initial burst of speciation events [35], [11]. This can be visualized with a lineages-through-time plot (Figure 1). This result has consequences for the inference of diversification rate patterns; as the clades we choose to examine are not a random sample of clades but often tend to be interesting and speciose and, at least for molecular phylogenies, have survived to the present day. It is also the case that these large clades are the ones most amenable to statistical analysis.
We simulated phylogenies under constant-rate diversification, conditioning on both the number of taxa in the clade (N) and the age of the tree (t). We did this in order to investigate the statistical properties of trees drawn from different parts of the distribution of possible tree sizes that can result from a constant-rate process. Consistent with previous work [35], [11], trees which are large at the present day (i.e. those drawn from the 'right tail' of the distribution) are more likely to harbor a signal of a slowdown using the c-statistic even when the phylogeny is generated under a constant-rate process with low extinction. This is evident in Figure 2, where data points corresponding to tree sizes larger than the expectation (denoted by a dashed line) show elevated Type-1 error rates when simulated under low extinction rates. However this is not the case when background extinction rates are higher as extinction alters the distribution of branch lengths on the reconstructed phylogeny. The resulting distribution effectively obscures any signal of a burst (in our simulations, the burst is entirely attributable to stochastic processes) that might have occured deep in the tree [10], [36], [21]. This is especially true for large phylogenies ( Figure 2); c is a summary statistic for all nodes in the tree and therefore the few early branching events will have an even smaller influence on the calculation of the c-statistic for large trees with proportionally more nodes.
Our results suggest two things about the use of the c-statistic. First, the test has very low power to detect changes in speciation rates when species turnover rates are even modestly high [21]. Second, while there is some concern that significantly negative cvalues may potentially be misleading [15], [16], [37], bias in sampling large clades does not tend to create false signatures of a slowdown under modest extinction rates. We know from the fossil record that extinction rates have been moderately high for most groups. This implies that, even if slowdowns in diversification rates are relatively common, significantly negative c-values should be rare. However, this is not the case [8]. The widely reported evidence for slowdowns in studies using c is thus somewhat paradoxical and deserves explanation. Pessimistically, we may attribute this to some yet unstudied artifact such as sampling effort. More intensive sampling of lineages would have the effect of increasing the number of nodes close to the present. This would deteriorate any signal of an early burst detectable by the c-statistic. It should be noted however, that a number of the phylogenies that provide evidence for a slowdown are near complete at the species level (e.g. [11]), though there may still be additional lineages that should actually be recognized as species. It may be the case that if more lineages were included, many of the signatures of early bursts in the literature would no longer be present. This is difficult to evaluate except on a case-by-case basis.
We found that the diversity-dependent models are prone to bias due to sampling in a similar manner as the c-statistic ( Figure 3); when death rates are low, the diversity-dependent model was preferred to a constant birth-death model more frequently for clades at the large end of the distribution of clade size. But even with moderate extinction, the diversity-dependent model was rarely preferred. In fact, for higher extinction rates, the reverse was true-larger clades were less likely to fit a diversity-dependent model than smaller clades ( Figure 3). However, it should be noted that the model of diversity-dependence we used here did not explicity model extinction. Consequently the model estimates  diversification rate as a function of contemporaneous lineages in the reconstructed phylogeny rather than as a function of contemporaneous lineages in the true (unobserved) phylogeny [38]. Recently Etienne et al. [28] provided a full likelihood solution for diversity-dependence with extinction which uses a Hidden-Markov approach to fit the model to a phylogeny. We justify the use of a diversity-dependent model which does not estimate extinction in our simulation study due to the fact that it is currently the more commonly employed approach. The diversitydependent model thus has the same drawback as the c-statistic. Extinction will tend to erode the signal of the early diversification events, especially so for large trees. The results from the timevarying speciation rate model (Figures 4 and 5; discussed below) suggest that this result will not hold if both extinction rates and speciation rates are explicitly modeled. However, it should be noted that there is some evidence that estimates of extinction can be inconsistent when there is rate variation across clades [39].
A correction that has been employed to increase the power to detect slowdowns for both the c-statistic and diversity-dependent models is to collapse some nodes close to the present. Species delimitation is a problem that has recently received a great deal of attention from molecular systematists (e.g. [40], [41], [42]). Subdividing lineages into subspecies will necessarily increase the effect of the pull of the present and further obscure evidence of a slowdown. Collapsing these recent nodes will certainly influence our ability to infer slowdowns and this procedure has been done by some researchers (e.g. [11], [43]), by regarding recently-diverged species as not being ''good'' species. However, there is currently a lack of theory to guide such decision-making. We therefore recommend that authors should be cautious in collapsing nodes as we do not fully understand how species delimitation affects these methods.
The time-varying speciation models [36], [23] were preferred for data generated under a constant-rate process with increasing frequency as tree size increased. We found this to be true even when extinction was present (Figures 4 and 5), though the effect was dampened as extinction increased. As stated above, large trees generated under a constant rate birth-death process are subject to having undergone stochastic bursts of speciation in order to obtain their current size. The time-varying speciation rate models are sensitive to these bursts, which is both a positive and negative; positive because they have more power to detect changes in diversification through time and negative because these models are more prone to inferring spurious results as a consequence of sampling bias. We took two alternative approaches to compare model fits of a constant-rate birth-death model versus a timevarying speciation model: the full-likelihood approach of Rabosky and Lovette [36] and an approximate approach based on the coalescent process, recently derived by Morlon et al. [23] (see Methods for details). Contrary to our expectations, we found that these two approaches differed substantially in terms of which model was prefered when using Akaike Information Criterion (AIC). For trees from the larger end of the distribution, the coalescent approach tended to provide support for a time-varying speciation model over a constant-rate model at much higher frequencies than the full likelihood approach, especially when extinction rates were low ( Figure 5). The reasons for these differences are not entirely clear. While many of the large trees do show 'early bursts' due to stochastically high rates early in the process (see Figure 1 for an example), the proneness of the coalescent-based approach to favor a model more complex than the generating model is worrisome as sampling bias appears to very strongly influence model choice. Both approaches are relatively new and their respective statistical properties have not been explored at all beyond the publications in which they were presented [36], [23]. The statistical properties of these and other related models is something that certainly warrants further investigation. Researchers are increasingly fitting more complex models of diversification to study diversity dynamics through time and across clades (e.g. [44], [25], [24]) and it is imperative that we have a good understanding of these if we are to make meaningful inferences.
Stochastic bursts, which are more likely to have occured in 'unusually large' trees, may be falsely inferred to have been caused by ecological processes (e.g. adaptive radiations). While the bursts are certainly 'real', they are not attributable to any biological phenomena. If the background extinction is close to zero, then all methods investigated in this paper are susceptible to this false inference. However, when background extinction is relatively high, these stochastically generated slowdowns are not likely to be detected by the c-statistic as the subsequent extinction removes the signal. As a summary statistic for the whole phylogeny, the power of the c-statistic to detect slowdowns (stochastically or ecologically produced) when species turnover is present is very low [21], [45]. It is fair to suggest that while there are some reasons to believe that significantly negative c-values in the empirical literature may result from known statistical artifacts [15], [16], [37], the cause of their ubiquity remains unclear. We suggest that failing to sample cryptic species may contribute to this. Methods that do not deal directly with extinction and extinct lineages, like the diversity-dependent model [27] used here, show similar patterns of performance to the c-statistic.
Our findings are also of relevance to studies examining the efficacy of diversification rate models and statistics. There are a multitude of ways to conduct simulations of birth-death models and Stadler [46] has discussed the statistical properties of various conditioning schemes in detail. We caution theoreticians to pay close attention to these differences as conditioning simultaneously on the number of taxa and the age of the tree can produce phylogenies drawn from the 'tails' of the distribution and thus prone to be bottom heavy. Such trees may lead to biased estimation of parameters and misconstrued inferences.
We suggest three future directions to address this issue of sampling bias. First, performing diversification analysis on megaphylogenies, including the clade of interest, may be advisable instead of examining clades in isolation. Investigating large clades while ignoring closely related groups that are less speciose may lead to spurious patterns of slowdowns in diversification. This will require further development of methods that relax the assumption of uniformity of the diversity dynamics across the tree (e.g. [44]). We appreciate that for many groups, such an approach may not always be feasible so we urge researchers to at least be cautious in their interpretation of their results. Second, increased attention should be given to how lineages are defined; lineages that are currently denoted as subspecies may soon be considered ''good'' species [20]. Whether these lineages are included or not will influence the inference of slowdowns. Third, further investigation  [36]) is preferred over a constant-rate birthdeath (BD) model using AIC to select amongst the models. Only the TVS and BD models were compared. We used a DAIC cutoff of 4 to favor a TVS model when the generating model was a constant-rate process. Extinction rate, e~m=l, varies across the plots (e~0,0:1,0:25,0:5); speciation rate, l, and total tree-depth, t are held constant (l~1 and t~5).   [23]) is preferred over a constant-rate birth-death model (BD; Model 3 of [23]) using AIC to select amongst the models. Both the models were formulated according to a coalescent-based approximation of the likelihood [23]. We used a DAIC cutoff of 4 to favor the TVS model when the generating model was a constant-rate process. Extinction rate, e~m=l, varies across the plots (e~0,0:1,0:25,0:5); speciation rate, l, and total tree-depth, t are held constant (l~1 and t~5). of the statistical properties of these models may allow researchers to be more confident that the patterns they observe represent truly meaningful variation.

Methods
All of our simulations were carried out using constant rates of speciation and extinction through time, so that any detection of a slowdown can be considered a Type-1 error. Tree simulations were conducted with Stadler's [46] TreeSim R package, modified slightly for the purposes of computational efficiency. For each simulated phylogeny, we calculated the c-statistic [14] and fit five models of diversification: 1) a constant-rate birth-death model [47]; 2) a diversity-dependent model [27]; 3) a time-varying speciation rate model [36]; 4) a constant-rate birth-death model using a coalescent approach [23]; and 5) a time-varying speciation rate model, also using a coalescent approach [23]. Models 1-3 were fit using the R package laser [48]. Models 4 and 5 were fit using code from Morlon et al. [23], also modified for computational efficiency.
We simulated phylogenies conditioned on both the age of the tree t and the number of taxa N (for a discussion on various methods of conditioning simulated phylogenies, see [46]). In order to simulate trees from different regions of the cumulative probability distribution (i.e. 1 st percentile, 2 nd percentile, . . ., 99 th percentile), we used the equations of Foote et al. [49] for calculating Pr(NDl,m,t) where l is the instantaneous speciation rate per unit time and m is the instantaneous extinction rate per unit time. The probability of observing a phylogenetic tree of age t, containing at least n taxa, is given by the equations: for l~m and b~l a m and for l=m Note that in their formulation of these equations, Foote et al. [49] use the paleontological convention of denoting l as p and m as q. We set l~1, t~5 and varied m (m~f0,0:1,0:25,0:5,0:75,1:0g). Conditioning on the survival of at least one lineage to the present day, we used (1) and (2) to calculate the number of taxa associated with the percentiles of the cumulative distribution as discussed above. For each value of m and each percentile (p~0:01,0:02, . . . ,0:99), we simulated 1000 phylogenies under a constant-rate birth-death model (or in the case of m~0, purebirth). We also varied l and t but the results of our analysis did not differ qualitatively.
As stated above, the c-statistic [14] quantifies the temporal distribution of internode distances. We calculated c for each phylogeny using the R package ape [50]. Following the author's original recommendation [14] and the conventions of the literature, we used a one-tailed test, such that c was considered to be significant if it was v{1:645. Again, as all phylogenies were simulated under a constant-rate process, we considered it to be a Type-1 Error if c was significantly negative.
The diversity-dependent (DD) model we used was the exponential declining model of Rabosky and Lovette [27] where speciation rate is modeled as a function of the number of contemporaneous lineages in the reconstructed phylogeny. If N(t) the number of lineages at time t, l(0) is the initial (background) speciation rate and k describes the strength of diversity-dependence, l(t) can be described as the function In the DD model, background extinction rates are assumed to be 0. The decline in diversification rates at higher densities is thus only due to a decrease in speciation rate.
To model time-varying speciation without explicitly considering diversity-dependence, we used the 'SPVAR' model of Rabosky and Lovette [36] where speciation rate is an exponential function of time such that where x is a constant describing the relationship. Here, extinction rate m is explicitly modeled but assumed to be constant across the phylogeny.
As an alternative to using the full-likelihood equation for a timevarying speciation model [36], we also employed a recently derived approach [23] for fitting birth-death models, based on the coalescent process from population genetics [51]. Morlon et al. [23] model a population of species evolving under the Wright-Fisher process. Following Morlon et al. [23], the likelihood L of observing the internode distances g i between node i and node iz1 can be written as where u i is the distance between node i and the present and N(t) is the number of species (population size in population genetics) at time t in the past. Note that this is an approximation of the likelihood as N(t) is approximated by its expected value E N(t) ½ [23]; this was done to make the model analytically tractable. The constant-rate birth-death model we use corresponds to Model 3 from [23]. The time-varying speciation model was specified in the same way in the full-likelihood approach described above, such that l(t)~l(0)e {xt . This corresponds to Model 4a from [23].
For the model-based approaches, we compared a constant-rate birth-death model to each alternative model in our set. We used an AIC approach (sensu [52]) to select the best fit model between a constant-rate birth-death model and a time-varying speciation rate/diversity-dependent model. We emphasize for clarity that only 2 models were considered at a time. We used a DAIC cutoff of 4 for the more parameter-rich model to be favored. While a DAIC of 4 is an arbitrary value, we justify its use as it is conventionally used in phylogenetic comparative methods, following recomendations in [52]. All analyses were conducted in the R programming environment [53]. The code modified from Stadler [46] and Morlon et al. [23] will be available upon acceptance on M.W.P.'s personal website http://mwpennell. wordpress.com/