AJD and AR conceived the original idea, developed the software, and performed the marsupial and virus data analyses. SYWH developed the simulation software, performed the simulation analysis, developed the use of prior distributions for calibrating node ages, and performed the analyses on the bacteria, yeast, plant, metazoan, and primate datasets. MJP collected and curated the marsupial dataset and provided expert calibration information. AJD, SYWH, MJP, and AR contributed to the writing of the article.

¤ Current address: Department of Computer Science, University of Auckland, Auckland, New Zealand

The authors have declared that no competing interests exist.

In phylogenetics, the unrooted model of phylogeny and the strict molecular clock model are two extremes of a continuum. Despite their dominance in phylogenetic inference, it is evident that both are biologically unrealistic and that the real evolutionary process lies between these two extremes. Fortunately, intermediate models employing relaxed molecular clocks have been described. These models open the gate to a new field of “relaxed phylogenetics.” Here we introduce a new approach to performing relaxed phylogenetic analysis. We describe how it can be used to estimate phylogenies and divergence times in the face of uncertainty in evolutionary rates and calibration times. Our approach also provides a means for measuring the clocklikeness of datasets and comparing this measure between different genes and phylogenies. We find no significant rate autocorrelation among branches in three large datasets, suggesting that autocorrelated models are not necessarily suitable for these data. In addition, we place these datasets on the continuum of clocklikeness between a strict molecular clock and the alternative unrooted extreme. Finally, we present analyses of 102 bacterial, 106 yeast, 61 plant, 99 metazoan, and 500 primate alignments. From these we conclude that our method is phylogenetically more accurate and precise than the traditional unrooted model while adding the ability to infer a timescale to evolution.

This new method can simultaneously infer phylogeny and estimate the molecular clock. The authors run their method on several large alignments to show its phylogenetic accuracy and ability to infer a timescale to evolution.

From obscure beginnings, phylogenetics has become an essential tool for understanding
molecular sequence variation. In the past decade, huge progress has been made in developing
methods for inferring phylogenies and estimating divergence dates. This development has been
characterized by increases, both in the complexity of the models used to describe molecular
sequence evolution, and in the sophistication of the methods for analyzing these new models.
Nevertheless, a well-known problem that has persistently troubled phylogenetic inference is
that of substitution rate variation among lineages. In order to infer divergence dates, it
is convenient to assume a constant rate of evolution throughout the tree [

Such problems with the molecular clock hypothesis have resulted in it being abandoned
almost entirely for phylogenetic inference in favor of a model that assumes that every
branch has an independent rate of molecular evolution. Under such an assumption, it is
possible to infer phylogenies (e.g., [

Recently, it has been realized that less drastic alternatives to the unrooted model of
phylogeny may exist. Instead of dispensing with the molecular clock entirely, attempts have
been made to relax the molecular clock assumption by allowing the rate to vary across the
tree [

Bayesian relaxed-clock methods, including those published by Thorne et al. [

Autocorrelation of rates from ancestral to descendant lineages will occur whenever the largest component of rate variation is due to inherited factors, whether these are life-history traits or biochemical mechanisms. As one looks over smaller and smaller timescales, the differences in such inherited factors become smaller relative to the variance caused by stochastic and uninherited factors (such as environmental or chance events). An alternative way of considering this is that the autocorrelation is so strong that very little of the variation in rate can be attributed to inherited factors. At the other extreme, over very long timescales, we might expect so much variation in the inherited determinants of rate that the autocorrelation from lineage to lineage begins to break down, especially with sparse taxon sampling. However, it is difficult to predict where the boundaries between these effects are and thus to specify what the degree of autocorrelation will be.

Relaxed-clock models present a potentially useful method for removing the assumption of a
strict molecular clock, but a major shortcoming of the methods that have been proposed thus
far is that they require the user to specify the tree topology. This is a problem because in
many cases, important parts of the tree may be uncertain or unresolved, resulting in a
number of plausible tree topologies. Furthermore, a molecular clock may have been assumed
when estimating the input tree (for example to find a root), but rate variation among
lineages can adversely affect phylogenetic inference (e.g., [

Here we present a Bayesian Markov chain Monte Carlo (MCMC) [

We generated alignments of nine nucleotide sequences, each 1,000 nucleotides in length,
on the rooted tree in

The timescale is drawn in arbitrary time units. Apart from the branch leading to the outgroup, sequence O, all branches are five time units in length.

Fifty sequence alignments were generated under each of five sets of rate variation
models: (1) Rates were fixed at 0.01 average substitutions per site per time unit
throughout the tree (i.e., rates conformed to a molecular clock) (CLOC); (2) rates were
lognormally autocorrelated among branches, with an ancestral rate of 0.01 average
substitutions per site per time unit and a variance parameter (
^{2}) of 0.1, so that ^{2}

A normally distributed calibration prior with mean 20.0 and standard deviation 1.0 was
specified for the age of the root of the tree, and the tree topology was fixed. Each
alignment was analyzed using BEAST [

In four of the five cases, the uncorrelated relaxed-clock approach to estimating rates
performed well (

When the sequences were simulated under a molecular clock, the 95% HPD
interval of the posterior rate estimate almost always contained the true rate under all
three analysis models (

For the data generated under lognormal models (ACLN and UCLN), both of the uncorrelated models (UCED and UCLN) performed well with respect to coverage, with the 95% HPD containing the true rate between 93% and 100% of the time for individual branches. However, for the UCED model this was at the expense of power, with the average size of the HPDs being twice as large as those for the UCLN model.

For data generated under UCED, the UCED model performed better than the UCLN model: both models gave the same average HPD size, but the UCLN model included the true rates in the HPDs slightly less often (82%). Neither model performed as well when the data were generated under the ACED model, with the 95% HPD containing the true rate between 36% and 90% of the time.
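The coverage figures quoted above (how often a 95% HPD interval contains the true rate) can be illustrated with a minimal sketch. It assumes the posterior for a single rate is available as a list of MCMC samples and is roughly unimodal; the function names `hpd_interval` and `coverage` are hypothetical and are not taken from BEAST.

```python
def hpd_interval(samples, mass=0.95):
    """Shortest interval containing `mass` of the samples (unimodal case)."""
    xs = sorted(samples)
    n = len(xs)
    k = max(1, int(round(mass * n)))  # number of samples inside the interval
    # Slide a window of k consecutive sorted samples and keep the narrowest one.
    best = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[best], xs[best + k - 1]


def coverage(true_values, sample_sets, mass=0.95):
    """Fraction of replicates whose HPD interval contains the true value."""
    hits = 0
    for truth, samples in zip(true_values, sample_sets):
        lo, hi = hpd_interval(samples, mass)
        hits += lo <= truth <= hi  # a bool counts as 0 or 1
    return hits / len(true_values)
```

The average width `hi - lo` across replicates would then play the role of the "average size of the HPDs" used above as a measure of power.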

The accurate estimation of molecular rates is important because it has a direct impact on
the estimation of branch lengths, which can in turn affect the inferred tree topology.
Collectively, the results provide a strong recommendation against assuming a molecular
clock when analyzing data that have not evolved under clocklike conditions, but the
uncorrelated relaxed-clock models also perform well when the data are clocklike. The
results favor the use of the UCLN model in that it has an accuracy comparable to the UCED
model, but it results in considerably smaller HPDs. In particular, because the UCLN model
has the variance of the lognormal distribution as a parameter, it can better accommodate
data that are close to being clocklike. This does not contradict the findings of a
previous simulation-based study [

We selected two virus datasets that were matched in the number of sequences (

Both datasets were analyzed under the strict molecular clock and the UCLN and UCED
models. For all analyses the HKY (Hasegawa-Kishino-Yano) model of nucleotide substitution
[

The estimated coefficient of variation, σ _{r}
_{r}
_{r}

The divergence times correspond to the mean posterior estimate of their age in years. The yellow bars represent the 95% HPD interval for the divergence time estimates. Both the mean and 95% HPD of the divergence times were calculated conditional on the existence of the clade defined by the divergence. Each node in the tree that has a posterior probability greater than 0.5 is labeled with its posterior probability. The sampling times of the tips were assumed to be known exactly. Branches colored in red had a posterior rate greater than the average rate, whereas branches colored in blue had a lower-than-average rate.

In addition to the viral sequences, we analyzed a marsupial dataset. The alignment
contained concatenated nuclear protein–coding genes (

The extensions to BEAST for inferring divergence times, described here, are well suited
to the marsupial dataset. It possesses some phylogenetic uncertainty, so it is more
reasonable to integrate over the posterior distribution of topologies than to assume a
single true topology. Furthermore, the dataset includes taxa that have evolved to
substantially different sizes, life histories, and niches, which are all hypothesized
predictors of molecular rate variation [

The early fossil record of marsupials [

First, to ascertain the joint prior distribution on the nodes of interest, the four
calibration points, the Yule prior, and the reciprocal monophyly constraints were analyzed
without any sequence data. The combined results of two runs of 10,000,000 steps are given
in

In order to analyze the marsupial data, we assumed a general time-reversible
[

There was a slight tendency toward a positive correlation in the rate of parent and child
branches but this was not significant (zero was included in the 95% HPD). The
coefficient of variation was estimated to be 0.32 (95% HPD:
0.23–0.43), suggesting that the marsupial dataset is more clocklike than both of
the virus datasets.
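As a rough illustration of how such a clocklikeness summary could be computed, the sketch below takes the coefficient of variation of the branch rates (standard deviation divided by mean) within each posterior sample and then averages over samples. The use of the population standard deviation and the function names are assumptions made for illustration only.

```python
from statistics import mean, pstdev


def coefficient_of_variation(branch_rates):
    """Standard deviation of the branch rates divided by their mean."""
    return pstdev(branch_rates) / mean(branch_rates)


def posterior_mean_cov(rate_samples):
    """Average the per-sample coefficient of variation over MCMC samples.

    `rate_samples` is a list of rate vectors, one vector per sampled state.
    """
    return mean(coefficient_of_variation(rates) for rates in rate_samples)
```

A value of zero corresponds to identical rates on every branch (a strict clock), and larger values indicate greater rate heterogeneity among lineages.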

(A) The combined prior distribution of divergence times for the MAP tree topology. The green bars represent the 95% HPD interval for the divergence times. (B) The posterior distribution of the divergence times. The divergence times correspond to the mean posterior estimate of their age in millions of years. The yellow bars represent the 95% HPD interval for the divergence time estimates. Both the mean and 95% HPD of the divergence times were calculated conditional on the existence of the clade defined by the divergence. Each node in the tree is labeled with its posterior probability if it is greater than 0.5. The three nodes with normally distributed calibration priors are indicated by orange bars. Branches colored in red had a posterior rate greater than the average rate, whereas branches colored in blue had a lower-than-average rate.

We see no autocorrelation for the viruses we analyzed (the HPD interval of the covariance of parent and child branches was [−0.17,0.15] and [−0.18,0.15] for influenza and dengue-4 datasets respectively under the lognormally distributed model of rate variation and [−0.2,0.13] and [−0.19,0.13] for the exponentially distributed model of rate variation). For the marsupial dataset there is a small degree of autocorrelation suggested by the mean estimate, but it is not significantly different from zero (mean: 0.07, HPD: [−0.256, 0.4]). We would expect that larger datasets, particularly of diverse organisms that vary considerably in life-history traits or proofreading mechanisms, might exhibit substantial autocorrelation.
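A parent-child covariance of this kind could be computed for one sampled tree along the following lines. This is only a sketch: the exact statistic used in the analyses (for example, whether rates are rescaled first) is not specified here, and the dictionary-based tree representation is a hypothetical choice.

```python
def parent_child_covariance(rates, parent):
    """Sample covariance between each branch's rate and its parent branch's rate.

    rates  : dict mapping branch id -> rate on that branch
    parent : dict mapping branch id -> parent branch id
             (branches attached to the root have no entry)
    """
    pairs = [(rates[parent[b]], rates[b]) for b in rates if b in parent]
    n = len(pairs)
    mean_p = sum(p for p, _ in pairs) / n
    mean_c = sum(c for _, c in pairs) / n
    return sum((p - mean_p) * (c - mean_c) for p, c in pairs) / (n - 1)
```

Evaluating this for every tree in the posterior sample gives a distribution whose HPD interval can be checked for whether it includes zero.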

Five large datasets were obtained from previous studies: (1) amino acid alignments of 102
genes from eight bacterial species; (2) nucleotide alignments of 106 genes from eight
yeast species [

To assess the accuracy of the phylogenetic methods being tested, estimates of the
phylogeny need to be tested against the true phylogeny for each dataset. In order to
obtain the best possible estimates of the phylogeny for each dataset, the alignments in
each of the five datasets were concatenated. The five concatenated alignments were
analyzed under the HKY model of nucleotide substitution with gamma-distributed rate
variation among sites and a proportion of invariant sites. Each analysis was run for
5,000,000 MCMC steps, with a discarded burn-in of 500,000 steps. Identical trees were
obtained using BEAST with a UCLN model and with MrBayes (

The datasets are as follows: (A) bacterial, (B) yeast, (C) plant, (D) metazoan, and (E) primate.

For each of the five groups of data, each alignment was analyzed using MrBayes (unrooted
Felsenstein [UF] model), BEAST with a strict molecular clock (CLOC), and BEAST with an
uncorrelated lognormal relaxed clock (UCLN). The HKY model of nucleotide substitution was
assumed, with gamma-distributed rate variation among sites and a proportion of invariant
sites. Most analyses were run for 500,000 MCMC steps with 50,000 burn-in steps, although
some datasets required 1,000,000 steps with 100,000 burn-in steps. All analyses were
checked for convergence using the program Tracer 1.2 [

All three methods performed poorly in analyses of the bacterial and metazoan datasets.
This result is not surprising, however, considering the substantial time depth of these
trees. The uncorrelated relaxed-clock method produced the most accurate estimates of
phylogeny overall (

In the case of the primate data, all three methods were similarly accurate in estimating phylogenies. This is probably because the data were relatively clocklike, with the molecular clock assumption rejected for fewer than a third of the alignments. For all of the datasets that were analyzed, the phylogenetic estimates made using a strict molecular clock were the most precise. As expected, the average size of the 95% credible set of trees was always the smallest for the molecular clock method, and nearly always greatest for the unrooted method. Under conditions in which the data more or less conform to a molecular clock, such as the primate data examined in this study, the molecular clock method should be used due to its superior precision.

The relaxed phylogenetics methods described here co-estimate phylogeny and divergence times
under a relaxed molecular clock model, thus providing an integrated framework for biologists
interested in reconstructing ancestral divergence dates and phylogenetic relationships. The
method presented here naturally incorporates the time-dependent nature of the evolutionary
process without assuming a strict molecular clock. One of the byproducts of estimating a
phylogeny using a relaxed clock is an estimate of the position of the root of the tree, even
in the absence of a non-reversible model of substitution [

Recently, a number of authors have begun to investigate the impact of various forms of
model misspecification on the accuracy of posterior probabilities of clade support
[

We have presented a large analysis of 102 bacterial, 106 yeast, 61 plant, 99 metazoan, and 500 primate alignments, which suggests that the relaxed-clock models are both more accurate and more precise at estimating phylogenetic relationships than the current unrooted methods implemented in MrBayes and other programs. These initial results suggest that a relaxed phylogenetic approach may be the most appropriate even when phylogenetic relationships are of primary concern and the rooting and dating of the tree are of less interest.

The molecular clock assumption can be relaxed in a variety of ways [_{1}, _{2},…,
_{2 n−1 }} and a
corresponding vector of node heights _{1}, _{2},…,
_{2 n−1 }} in units of time.
The node height vector, in conjunction with an edge graph, _{R}
_{2} in the tree given the ancestral rate
_{A}
_{( i) } and the time Δ _{i}

The first such model to be described [_{A},

In the autocorrelated relaxed-clock models that have been described, including the commonly
used lognormal model [

We present an alternative to the autocorrelated prior in which there is, a priori, no correlation of the rates on adjacent branches of the tree. Instead we propose a model in which the rate on each branch of the tree is drawn independently and identically from an underlying rate distribution. We investigate two candidates for the rate distribution among branches: the exponential distribution and the lognormal distribution.

These uncorrelated priors can be framed in a hierarchical Bayesian framework, as with the autocorrelated priors. In this scenario the exponential version of uncorrelated relaxed clock would have a prior probability on the rate vector of:

This model corresponds to an exponential prior distribution on rate r_{i} with mean
λ^{−1} and no dependence on either the rate of the previous
branch or the time between the two rates. The parameter λ is a hyperparameter
that is fixed and not estimated via MCMC, and represents a prior statement about both the
mean and the variance of branch rates. This prior reflects a punctuated view of change in
evolutionary rate, so that the prior expectation of the rate at all branches is the same,
with no autocorrelation between adjacent branches. Notice that the posterior distribution
of rates among branches need not be the same as the prior in this setup and that
autocorrelation may exist in the posterior, even though it is not specified in the prior.
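A minimal sketch of such an uncorrelated exponential prior is given below. It assumes the prior density on the rate vector is a product of independent exponential densities with rate parameter λ (so each branch rate has prior mean 1/λ); the function names are hypothetical.

```python
import math
import random


def exponential_log_prior(rates, lam):
    """Log prior of a rate vector: sum over branches of log(lam * exp(-lam * r))."""
    if any(r < 0 for r in rates):
        return float("-inf")  # negative rates have zero prior density
    return len(rates) * math.log(lam) - lam * sum(rates)


def draw_uncorrelated_rates(n_branches, lam, rng):
    """Draw one rate per branch, independently of the tree and of other branches."""
    return [rng.expovariate(lam) for _ in range(n_branches)]


# Example: a prior mean rate of 0.01 substitutions per site per time unit.
rng = random.Random(1)
example_rates = draw_uncorrelated_rates(16, lam=100.0, rng=rng)
example_log_prior = exponential_log_prior(example_rates, lam=100.0)
```

Because each draw ignores the rate on the parent branch and the elapsed time, any autocorrelation seen in the posterior must come from the data rather than from the prior.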

Instead of framing Equations

A particular requirement of Bayesian phylogenetic inference is the responsibility given
to users to specify a prior probability distribution on the shape of the phylogeny (node
ages and branching order). This can be either a benefit or a burden, largely depending on
whether an obvious prior distribution presents itself for the data at hand. For example,
the coalescent prior [

In some cases, the choice of prior on the phylogenetic tree can exert a strong influence
on inferences made from a given dataset [

The full Bayesian sequence analysis with an uncorrelated relaxed-clock model allows the co-estimation of substitution parameters, relaxed-clock parameters, and the ancestral phylogeny. The posterior distribution is of the following form:

The vector Φ contains the parameters of the relaxed-clock model (e.g., μ
and σ^{2} in the case of lognormally distributed rates among branches).
The term Pr _{G}
_{inv}

We summarize the posterior density in

The formulation in

The function ^{−1}(

Each of the 12 categories has equal probability (1/12). The i^{th} rate category (numbered from left to right)
corresponds to the (i − 0.5)/12 quantile of the lognormal distribution.
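The discretization described in this caption can be sketched as follows, using only the Python standard library. The parameters `mu` and `sigma` are assumed here to be the mean and standard deviation of the logarithm of the rate; this parameterization and the function name are illustrative assumptions.

```python
import math
from statistics import NormalDist


def lognormal_rate_categories(mu, sigma, n_categories=12):
    """Rate for category i (1-based) is the (i - 0.5)/n quantile of the lognormal."""
    std_normal = NormalDist()  # standard normal, used for its inverse CDF
    return [
        math.exp(mu + sigma * std_normal.inv_cdf((i - 0.5) / n_categories))
        for i in range(1, n_categories + 1)
    ]


# Twelve equally probable rate categories, ordered from slowest to fastest.
categories = lognormal_rate_categories(mu=0.0, sigma=0.5)
```

Taking the midpoint quantile of each equal-probability bin keeps the categories strictly increasing and avoids the infinite values at the 0 and 1 quantiles.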

One issue that remains largely unresolved in this work is model comparison and model selection. Within a Bayesian framework, Bayes factors are usually regarded as the correct way to deal with model selection. Typically this involves a technique known as reversible-jump MCMC. We have not implemented this, but we plan to develop a reversible-jump MCMC version of this framework in the future. Model selection is easy when one model produces a much better fit. Because all of the models for rate variation examined here differ by one free parameter at most, a simple comparison of the average log posterior probabilities will usually be revealing. It is only when the log posteriors are very similar and the results are qualitatively different between the two models that model selection becomes an issue. This combination of conditions did not occur in any of our real datasets.

The MCMC must sample the tree topology, the divergence times, and the individual
parameters of the substitution model and tree prior(s). Therefore, a series of proposal
distributions (often called “moves”) needs to be employed. Our MCMC
implementation employs an array of moves, each of which is designed to explore a certain
subspace in the overall parameter/model space being explored. For example, some moves
propose local changes to the tree topology while keeping the coalescent interval and all
the other parameters constant. Some moves propose a change to a single substitution
parameter (such as the shape parameter of the gamma distribution) while keeping everything
else constant. The general scheme is to (1) choose a random move with a probability
proportional to a specified weight, then (2) apply the move to the current state, and (3)
assess the relative score of the new state. The new state is adopted if it has a higher
posterior probability; otherwise it is adopted with probability equal to the ratio of its
posterior probability to the posterior probability of the previous state. The weights
allow the researcher to favor certain moves which can help with the performance of the
MCMC, but generally the default weights give good results. Most of the moves used in our
MCMC implementation have been previously described [
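The general scheme of weighted move selection followed by an accept/reject step can be sketched on a toy one-dimensional target. This is not the BEAST implementation: the moves below are assumed to be symmetric, so no Hastings correction is applied, and all names are hypothetical.

```python
import math
import random


def metropolis(log_posterior, state, moves, weights, n_steps, rng):
    """Metropolis sampler that picks among several symmetric moves by weight."""
    current_lp = log_posterior(state)
    samples = []
    for _ in range(n_steps):
        move = rng.choices(moves, weights=weights, k=1)[0]  # (1) choose a move
        proposal = move(state, rng)                          # (2) apply it
        proposal_lp = log_posterior(proposal)                # (3) score the new state
        # Accept if the posterior is higher; otherwise accept with
        # probability equal to the ratio of the posterior densities.
        if proposal_lp >= current_lp or rng.random() < math.exp(proposal_lp - current_lp):
            state, current_lp = proposal, proposal_lp
        samples.append(state)
    return samples


def log_post(x):
    """Toy target: standard normal log density, up to a constant."""
    return -0.5 * x * x


def small_step(x, rng):
    return x + rng.uniform(-0.5, 0.5)


def large_step(x, rng):
    return x + rng.uniform(-3.0, 3.0)


# The small move is chosen three times as often as the large move.
chain = metropolis(log_post, 0.0, [small_step, large_step], [3, 1], 20000, random.Random(42))
```

In a phylogenetic setting the state would hold the tree, divergence times, and model parameters, and each move would perturb only one of those components.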

The output of an MCMC analysis is a set of samples from the posterior distribution. In
the case of the uncorrelated relaxed-clock models described above, the posterior
distribution is a distribution over tree topologies, dates of divergence, branch rates,
and parameters of the rate and substitution models. This complex set of samples can be
summarized in many ways. One of the simplest summaries of the branch rate distribution is
to sample the coefficient of variation (σ _{r}
_{r} = 1 _{r}
_{r}
_{r}

This is the simple average of the calculated

Some subtlety in the interpretation of the posterior distribution of rates is required
because both the amount of time a branch represents, _{j}, _{j}
_{j}t_{j}
_{j}
_{j}

rather than the simple unweighted average ^{( T) }|
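The difference between a time-weighted mean rate and a simple unweighted mean can be illustrated with a short sketch. It assumes the weighted mean is the sum of rate times duration over branches divided by the total duration; the function names are hypothetical.

```python
def weighted_mean_rate(rates, durations):
    """Mean rate with each branch weighted by the amount of time it represents."""
    total_time = sum(durations)
    return sum(r * t for r, t in zip(rates, durations)) / total_time


def unweighted_mean_rate(rates):
    """Simple average over branches, ignoring how long each branch is."""
    return sum(rates) / len(rates)


# A long slow branch and a short fast branch.
rates = [0.01, 0.03]
durations = [9.0, 1.0]
weighted = weighted_mean_rate(rates, durations)
unweighted = unweighted_mean_rate(rates)
```

In this example the short fast branch pulls the unweighted mean up to 0.02, whereas the weighted mean stays close to the rate that applies for most of the elapsed time.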

In the above discussion on rate models, it was assumed that it is possible to estimate
absolute rates of evolution and the variance in absolute rates. In fact, even under a
molecular clock assumption, the divergence times and the overall substitution rate can
only be separately estimated if there is a source of external calibration information. In
the framework described here, this information can come from one of three sources: (1)
Prior information on the age of internal nodes: In a phylogenetic context, calibration
information is often obtained by assigning the age of a known fossil to a particular
internal node [

All of these forms of calibration information can be incorporated into our MCMC implementation either on their own or in any combination, as appropriate.


The authors would like to thank S.-M. Chaw and H. Philippe for providing data, and Lindell
Bromham for coining the phrase “dating with confidence.” All of the
methods described above have been implemented in the BEAST software package (

ACED, autocorrelated exponential distribution

ACLN, autocorrelated lognormal distribution

bp, base pairs

CLOC, strict molecular clock

HPD, highest posterior density

MAP, maximum a posteriori

MCMC, Markov chain Monte Carlo

UCED, uncorrelated exponential distribution

UCLN, uncorrelated lognormal distribution

UF, unrooted Felsenstein