Abstract
Bayesian phylogenetic analysis with MCMC algorithms generates an estimate of the posterior distribution of phylogenetic trees in the form of a sample of phylogenetic trees and related parameters. The high dimensionality and non-Euclidean nature of tree space complicates summarizing the central tendency and variance of the posterior distribution in tree space. Here we introduce a new tractable tree distribution and associated point estimator that can be constructed from a posterior sample of trees. Through simulation studies we show that this point estimator performs at least as well and often better than standard methods of producing Bayesian posterior summary trees. We also show that the method of summary that performs best depends on the sample size and dimensionality of the problem in non-trivial ways.
Author summary
Our research introduces novel methods to analyse a set of phylogenetic tree topologies, such as those generated by Bayesian Markov Chain Monte Carlo algorithms. We define a new model for a distribution on trees that is based on observed clade frequencies. We study it together with closely related models that are based on observed clade split frequencies. These distributions are easy to work with and, as we show experimentally, provide excellent estimates of the true posterior distribution. Furthermore, we demonstrate that they enable us to find the tree with the highest posterior probability, which acts as a summary tree or point estimate of the distribution. In simulation studies, we show that the new methods perform at least as well as or better than existing methods. Additionally, we highlight that choosing the best method for summarizing sets of trees remains challenging, as it depends on the sample size and complexity of the problem in non-trivial ways. This work has the potential to improve the accuracy of phylogenetic studies.
Citation: Berling L, Klawitter J, Bouckaert R, Xie D, Gavryushkin A, Drummond AJ (2025) Accurate Bayesian phylogenetic point estimation using a tree distribution parameterized by clade probabilities. PLoS Comput Biol 21(2): e1012789. https://doi.org/10.1371/journal.pcbi.1012789
Editor: Natalia L. Komarova
Received: February 21, 2024; Accepted: January 13, 2025; Published: February 13, 2025
Copyright: © 2025 Berling et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The CCD-MAP trees present a significant advancement over the current standard method (MCC) for point estimators in BEAST1/2, offering fast, efficient, and notably improved performance. These new point estimators are freely available at https://github.com/CompEvol/CCD/ as a package for BEAST2. Installation instructions and a user manual are also provided on GitHub. The simulated datasets and DS1 to DS4 used for the evaluation are also freely available under doi: 10.17608/k6.auckland.c.7102354.
Competing interests: The authors have declared that no competing interests exist.
Introduction
One of the main inference paradigms in phylogenetics is Bayesian inference using Markov Chain Monte Carlo (MCMC)[1–3]. The distinguishing characteristic of phylogenetic models is the tree topology describing the evolutionary relationships for a set of taxa. Bayesian inference is based on a statistical model that describes the probability of a set of sequences given a phylogenetic tree, consisting of a topology with associated divergence times (or branch lengths) and model parameters. The MCMC algorithm iteratively samples a state space that, if set up with appropriate length and sampling interval, returns a sample that is a representation of the true underlying posterior distribution. In the case of phylogenetic MCMC algorithms, the output of such an analysis is a sample of phylogenetic trees, typically numbering in the thousands.
In a Bayesian phylogenetic analysis, the posterior distributions of many continuous parameters (e.g. kappa, base frequencies, molecular clock rate, population size) are easily summarised by considering statistics of the marginal distribution of the parameter of interest from the samples obtained by MCMC. On the other hand, one of the most crucial parameters—the tree topology—is a discrete parameter whose central tendency and variance are harder to characterise due to the high-dimensional and non-Euclidean nature of tree space [4–6]. It has thus become standard practice to employ summary or consensus tree methods to condense the output into a single tree [7]. Although we focus on Bayesian phylogenetics in this paper, it is worth noting that this approach is not unique to it but rather commonly employed across the field of phylogenetics when analysing collections of trees. This single tree, which in this paper we refer to as a Bayesian point estimate, is then used for further representation and interpretation of an analysis. Despite considerable efforts dedicated to the development of summary methods [8], it remains unclear which method performs best for summarising collections of trees. Most summary methods construct a tree in two steps [7]: First, a tree topology is constructed or selected, and, second, this discrete topology is then annotated with divergence times (or branch lengths). In this paper we focus on the first step, the construction of a rooted binary tree topology.
The predominant challenge for many summary tree estimators is the complexity of the tree space they are operating on. This is particularly the case for methods trying to compute a mean in a high-dimensional, non-Euclidean space such as the Billera-Holmes-Vogtmann (BHV) space [5,9,10] or a space induced by rearrangement operations [6]. While good progress has been made, these methods suffer from the complexity of tree space geometry and are not tractable yet for large problems [6,10]. The two most popular methods in practice thus operate only on the sampled trees. First, consensus methods focus on finding a consensus among the given trees. The prevalent variant is the greedy majority-rule consensus (greedy consensus or MRC) tree, which builds up a tree by including clade after clade greedily (i.e., more frequent clades first) that are compatible with the current tree; ties are broken arbitrarily [8]. Consensus methods are however prone to polytomies (i.e., parts of the tree remain unresolved) and finding the most resolved greedy MRC tree is an NP-hard problem [11]. Second, the maximum clade credibility (MCC) tree picks the tree from the sample distribution with maximum product of (Monte Carlo) clade probabilities. While the computation of the MCC tree is fast and efficient, it comes at a cost in accuracy due to the restriction to the sampled trees. The equivalent of this tree outside the BEAST framework would be the sampled tree with highest posterior probability, commonly used within MrBayes [2].
A good estimate of the tree distribution is still needed for questions concerning, for example, the credibility set of trees and the information content (entropy) [12], as well as for applications such as Bayesian concordance analysis (BCA) [13]. Introduced by Höhna and Drummond [14] and improved by Larget [15], the conditional clade distribution (CCD) offers an advanced estimate of the posterior probability distribution of tree space. Based on simple statistics of the sample, it provides normalized probabilities of all represented trees and allows direct sampling from the distribution. CCDs have for example been used to measure the information content and detect conflict among data partitions [12], for species tree–gene tree reconciliation [16], and for guiding tree proposals for MCMC runs [14]. Constructing the CCD and performing these tasks can be done efficiently [12,15]. Zhang and Matsen [17,18] and Jun et al. [19] looked at a slightly more complex model than a CCD, called a subsplit directed acyclic graph (sDAG). While the probabilities in an sDAG can, similar to a CCD, be obtained from a sample of trees, they also discuss different methods to learn the model parameters [17–19].
In this paper we extend the applicability of CCDs by introducing a new parametrization for CCDs and describing fixed-parameter tractable algorithms to compute the tree with highest probability. We demonstrate the usefulness of the new distribution and these new point estimates for Bayesian phylogenetics by comparing them to existing methods in simulation studies. Particularly, we find that these point estimates generally outperform the MCC tree and are more robust to the random sampling process of MCMC.
Methods
In this section, we first discuss properties of tractable tree distributions and define CCDs with three different parametrizations. We then recall the definitions of the MCC and greedy consensus tree and show how CCDs give rise to new point estimators. Lastly, we describe the datasets we generated for our experiments. Throughout, we write tree instead of tree topology and further assume that all our trees are rooted and, unless mentioned otherwise, are binary.
Tractable tree distributions
We consider a probability distribution over a set of trees (on the same taxa) a tractable tree distribution if some common tasks can be performed efficiently in practice; while not a formal definition, the following are important desiderata for any such distribution to satisfy. Example tasks are computing the probability of a tree and retrieving the tree with maximum probability. As the main quality criterion for a tractable tree distribution we consider its accuracy, that is, how well it estimates the probability of trees, in particular of those in the 95% credibility set. In simulation studies we can also test whether a distribution contains the true tree. If we generate a type of distribution for the same data multiple times, we can consider the precision and the stability, that is, how much the probabilities of trees and how much the accuracy change, respectively. Since below we populate the parameters of CCDs deterministically from samples, we can only measure these indirectly through samples from different MCMC runs.
A simple example distribution is the set of sampled trees from an MCMC run; we call this a sample distribution. It offers Monte Carlo probabilities and, while some tasks can be performed efficiently, it has quite low accuracy, poor representativeness, and is in general not stable. In fact, since the space of trees increases super-exponentially with the number of taxa, a sample of several thousand trees typically misses the majority of trees with non-negligible posterior probability even for moderate size problems.
Reintroducing the concept of a CCD, we first define a graph, which we call a forest network, capable of representing a larger number of trees. Assigning probabilities to certain vertices (or edges), we obtain a CCD graph. The version of a CCD by Larget [15] is one possible parametrization of a CCD based on observed clade splits; we call this a CCD1. Our new parametrization, CCD0, is based on observed clades. Here we use the observed clade split and clade frequencies to populate these parameters. We also show how to efficiently sample trees from a CCD and how dynamic programming allows efficient computation of values such as the number of trees and its entropy.
Forest network.
Let X be a set of n taxa. A forest network N on X is a rooted bipartite digraph with vertex set ( C , S ) that satisfies the following properties:
- Each C ∈ C represents a clade on X. So for each C ∈ C, we have C ⊆ X; for each taxon ℓ ∈ X, { ℓ } ∈ C; and also X ∈ C.
- Each S ∈ S represents a clade split (also called subsplit in the context of unrooted trees [17]). So each S ∈ S has degree three with one incoming edge ( C , S ) and two outgoing edges ( S , C1 ) and ( S , C2 ) such that C1 ∪ C2 = C and C1 ∩ C2 = ∅ for some C1 , C2 ∈ C. Then { C1 , C2 } is a clade split of C. We also use the notation C1 ∥ C2 for S.
- Each non-leaf clade has outdegree at least one and each clade except X has indegree at least one.
Note that X is the root of N, the taxa in X are the leaves of N, and each non-leaf clade has at least one clade split. We use terms such as child and parent naturally to refer to relations between vertices of N. (For example, the root clade in the forest network in Fig 1B has three child clade splits and each clade split S has a parent clade C.) When talking about multiple graphs, we let C ( N ) and S ( N ) denote the clades and clade splits, respectively, of N. For a (rooted binary phylogenetic) tree T on X, we use analogous definitions for C ( T ) and S ( T ) (each pair of sibling clades in T forms a clade split of T). For a clade C, we define S ( C ) as the set of child clade splits of C.
A forest network N displays or contains a tree T if each clade split of T is in S ( N ) , i.e., S ( T ) ⊆ S ( N ) ; see Fig 1. For a clade C, define N(C) as the restriction of N to C, that is, the forest subnetwork rooted at C containing all vertices reachable from C. Analogously, for S ∈ S ( N ) , we define the forest subnetwork N(S) of N that is rooted at the parent clade C of S but contains only S as child of C and all vertices reachable from S. Note that, for a clade split C1 ∥ C2 of X, network N contains all trees composed (amalgamated) of one subtree from N ( C1 ) and one subtree from N ( C2 ) ; this holds recursively. Hence, a forest network is suitable to represent huge numbers of trees when all combinations of subtrees are included.
Fig 1. A CCD1 ((B), a forest network with clade split probabilities) based on a tree sample (A) smoothens the probabilities over all trees it displays: (A) Posterior sample of size seven consisting of three different trees sampled thrice, twice, and twice. Only the clades ABCDE and ABC are split in multiple ways. The resulting probabilities of the trees in the CCD1 are thus 9 ∕ 49, 8 ∕ 49, and 8 ∕ 49. (B) Truncated CCD graph (cherry splits and singletons omitted) based on the sample trees above also displays the unsampled trees below. (C) Unsampled trees with CCD1 probabilities 12 ∕ 49, 6 ∕ 49, and 6 ∕ 49, respectively.
CCD graph.
In order to turn a forest network into a tree distribution, we need to be able to compute a probability for a tree T. Larget [15] suggested to use the product of clade split probabilities over all clade splits in S ( T ) as the probability of T. We define a CCD graph as a forest network G where each clade split S in S ( G ) has an assigned probability Pr ( S ) such that, for each clade C ∈ C ( G ) , we have ∑_{S ∈ S ( C )} Pr ( S ) = 1; see again Fig 1B. In other words, we can randomly pick a clade split at C. From Larget [15, Appendix 2] we then get that G represents a tree distribution. So for a tree T displayed by G, we have

Pr ( T ) = ∏_{S ∈ S ( T )} Pr ( S )    (1)

and, for any other tree T′, we have Pr ( T′ ) = 0. Furthermore, the sum of probabilities of all trees displayed by G is one. We now show how CCD1 and CCD0 assign probabilities based on observed clade split and clade frequencies, respectively.
CCD1, observed clade splits.
CCD1 is a tree distribution over the space of trees on a fixed set of taxa X based on a CCD graph with clade split probabilities obtained as follows. Let T = { T1 , … , Tk } be a (multi-)set of trees on X, e.g., the samples of an MCMC run. Let C and S be the sets of clades and clade splits appearing in T, respectively. Then let G be the forest network induced by T, that is, G has vertex set C ∪ S and edges naturally induced by the clade splits S (we know the two child clades and the parent clade of each clade split). Furthermore, we assign clade split probabilities as follows to turn G into a CCD graph. For a clade C ∈ C and a clade split S ∈ S, let f (C) and f (S) denote the frequencies of C and S appearing in the sample T, respectively. Note that
- f ( S ) ≤ f ( C ) for all pairs of S , C with S ∈ S ( C ) ;
- f ( C ) = ∑_{S ∈ S ( C )} f ( S ) for each non-leaf clade C;
- f ( X ) = k and, for each ℓ ∈ X, f ( { ℓ } ) = k.
The conditional clade probability (CCP) Pr ( S ) of a clade split S of clade C is defined as the ratio of how often S is the split of C in the posterior sample, i.e.,

Pr ( S ) = f ( S ) ∕ f ( C ) .

Note that ∑_{S ∈ S ( C )} Pr ( S ) = 1 for each non-leaf clade C and Pr ( S ) = 1 if S ∈ S ( { a , b } ) for some leaves a , b. The resulting CCD graph is what we call a CCD1, the conditional clade distribution induced by the probability distributions of clade splits.
Example.
Let us consider the example shown in Fig 1 where the posterior sample consists of three trees with the first being sampled three times, and the others twice each. Observe that the root clade ABCDE is split in three different ways, namely, ABC ∥ DE, ABCD ∥ E, and ABCE ∥ D. The probabilities of these three clade splits are Pr ( ABC ∥ DE ) = 3 ∕ 7, Pr ( ABCD ∥ E ) = 2 ∕ 7, and Pr ( ABCE ∥ D ) = 2 ∕ 7. Furthermore, the clade ABC is split in two different ways with probabilities Pr ( AB ∥ C ) = 3 ∕ 7 and Pr ( A ∥ BC ) = 4 ∕ 7. All other clades are trivial or are only split in one way, e.g., the clade ABCD is always split into ABC ∥ D, so Pr ( ABC ∥ D ) = 1 ∕ 1.
The resulting CCD contains 6 different trees – the three sampled trees as well as three unsampled trees (Fig 1C). Note that the tree sampled most often still has the highest probability, with 3 ∕ 7 ⋅ 3 ∕ 7 = 9 ∕ 49, among the sampled trees, as the other two trees have a probability of 2 ∕ 7 ⋅ 4 ∕ 7 = 8 ∕ 49 each. Furthermore, the unsampled tree containing the most frequent clade split ABC ∥ DE of the root clade and the most frequent clade split A ∥ BC of ABC has an even higher probability of 3 ∕ 7 ⋅ 4 ∕ 7 = 12 ∕ 49.
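The example can be checked mechanically. The sketch below (a minimal Python illustration with our own encoding, not the BEAST2 package's implementation) builds clade and clade split frequencies from the Fig 1 sample and computes CCD1 probabilities as products of conditional clade probabilities:

```python
from collections import Counter
from fractions import Fraction

def splits_of(tree):
    """A tree is encoded as a list of clade splits (child1, child2);
    the parent clade is the union of the two children."""
    return [(frozenset(a), frozenset(b)) for a, b in tree]

# Fig 1 sample: three topologies on taxa A-E, sampled 3, 2, and 2 times.
t1 = [("ABC", "DE"), ("AB", "C"), ("A", "B"), ("D", "E")]
t2 = [("ABCD", "E"), ("ABC", "D"), ("A", "BC"), ("B", "C")]
t3 = [("ABCE", "D"), ("ABC", "E"), ("A", "BC"), ("B", "C")]
sample = [t1] * 3 + [t2] * 2 + [t3] * 2

split_freq, clade_freq = Counter(), Counter()
for tree in sample:
    for c1, c2 in splits_of(tree):
        split_freq[frozenset((c1, c2))] += 1  # f(S)
        clade_freq[c1 | c2] += 1              # f(C) for non-leaf C

def ccd1_prob(tree):
    """Pr(T): product of conditional clade probabilities f(S)/f(C)."""
    p = Fraction(1)
    for c1, c2 in splits_of(tree):
        p *= Fraction(split_freq[frozenset((c1, c2))], clade_freq[c1 | c2])
    return p
```

Running `ccd1_prob` on the sampled trees reproduces the probabilities 9 ∕ 49 and 8 ∕ 49 derived above, and on the unsampled tree with splits ABC ∥ DE and A ∥ BC it yields 12 ∕ 49.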
CCD0, observed clades.
For the new CCD0, our goal is to have a distribution where the probability of a tree is based on the product of its clades’ frequencies. We could derive the probability of a clade C from a posterior sample of k trees as Pr ( C ) = f ( C ) ∕ k. While in general this does not yield a distribution, as the tree probabilities do not sum to one, we can compute the normalizing factor; in fact, we can even compute the normalizing factor per clade split. Since for complex problems even large samples may not contain all plausible clade splits, another feature of CCD0 is that it also includes (some) non-observed clade splits.
A CCD0 is again based on a forest network G with clades C, as before, those appearing in T, and the clade splits S defined as follows. Let S be the set of all possible clade splits that can be formed from C, that is, for any three clades C , C1 , C2 ∈ C with C1 ∪ C2 = C and C1 ∩ C2 = ∅, we have C1 ∥ C2 ∈ S. (In the example above, there are no additional clade splits besides the observed ones for CCD1.) We turn G into a CCD graph by turning the clade frequencies into clade split probabilities (with an algorithm explained in Sect S1.1 of S1 Text). In particular, the clade split probabilities are set such that the probability of any tree T in G given by Eq (1) is equal to the product of its (Monte Carlo) clade probabilities, ∏_{C ∈ C ( T )} Pr ( C ), normalized over all trees in G.
Both CCD0 and CCD1 are estimates of the true posterior tree distribution. Their models assume that clades/clade splits in one part of a tree behave independently of other clades. So a CCD smoothens the probabilities of a sample distribution by moving probability of overrepresented sampled trees to trees that have not been sampled, but whose clades/clade splits appear within the samples. CCD0 provides a simpler model since it is only based on observed clades, whereas a CCD1 is based on clade splits and thus has more parameters. Here we use the observed frequencies to populate these parameters, but other methods such as Maximum Likelihood optimization and variational methods could be investigated [18,19].
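One way to realize this per-clade-split normalization (our sketch; the paper's actual algorithm is given in Sect S1.1 of S1 Text) is a bottom-up pass: for each clade C, let m(C) be the sum, over all subtrees on C in the CCD graph, of the product of their clades' Monte Carlo probabilities; setting Pr ( C1 ∥ C2 ) proportional to m(C1) ⋅ m(C2) at each parent clade then makes every tree's probability proportional to the product of its clade probabilities. In Python, on the Fig 1 example (where the CCD0 and CCD1 graphs coincide):

```python
from fractions import Fraction
from functools import lru_cache

# CCD graph of the Fig 1 example: each non-leaf clade maps to its clade
# splits; clade probabilities are the Monte Carlo estimates f(C)/k, k = 7.
SPLITS = {
    "ABCDE": [("ABC", "DE"), ("ABCD", "E"), ("ABCE", "D")],
    "ABCD": [("ABC", "D")], "ABCE": [("ABC", "E")],
    "ABC": [("AB", "C"), ("A", "BC")],
    "AB": [("A", "B")], "BC": [("B", "C")], "DE": [("D", "E")],
}
CLADE_PROB = {"ABCDE": Fraction(7, 7), "ABCD": Fraction(2, 7),
              "ABCE": Fraction(2, 7), "ABC": Fraction(7, 7),
              "AB": Fraction(3, 7), "BC": Fraction(4, 7), "DE": Fraction(3, 7)}

@lru_cache(maxsize=None)
def m(clade):
    """Sum over all subtrees on `clade` of the product of clade probabilities."""
    if len(clade) == 1:
        return Fraction(1)
    return CLADE_PROB[clade] * sum(m(c1) * m(c2) for c1, c2 in SPLITS[clade])

def split_prob(c1, c2):
    """CCD0 clade split probability, normalized per parent clade."""
    parent = "".join(sorted(c1 + c2))
    total = sum(m(a) * m(b) for a, b in SPLITS[parent])
    return m(c1) * m(c2) / total
```

The per-clade normalization telescopes along the tree, so the product of `split_prob` values over a tree's splits is exactly its normalized product of clade probabilities.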
Example, continued.
Note that the three sampled trees from Fig 1A result in the same CCD graph for CCD0 and CCD1, as no pair of observed child clades can be combined into an unobserved parent clade. In contrast, in the example in Fig 2, CCD0 and CCD1 are different, as CCD0 contains the clade split AB ∥ CD (we observe clades AB, CD, and ABCD) but CCD1 does not (we do not observe this clade split).
Fig 2. For this sample of trees, the CCD graphs of CCD0 and CCD1 differ since AB and CD can form an unobserved clade split.
CCD2 and further tree distributions.
Similar to a CCD graph, Zhang and Matsen [17] and Jun et al. [19] use a structure they call a subsplit directed acyclic graph (sDAG). Here the vertices are clade splits, as well as a root clade and leaf clades, with an edge whenever a clade (or both clades) of a clade split corresponds to one clade of the parent clade split. They then add probabilities to the edges to turn it into a distribution. An sDAG thus describes a model where the probability of a clade split not only depends on its parent clade C but also on the clade split that C is part of. This model thus has more parameters and is more complex than CCD1 and CCD0. Furthermore, we can represent the core structure of this model, which we call CCD2, with an extended CCD graph where each clade vertex is further distinguished by its sibling clade; see Fig 3 for the extended CCD graph of the example in Fig 1. In this paper we focus on populating the CCD2 parameters solely using the tree sample, whereas Zhang and Matsen [17,18] studied other approaches to compute sDAGs and their parameters, applying more advanced techniques such as regularization and variational methods.
Fig 3. The extended CCD graph of the example in Fig 1. Note that it only contains the three sampled trees. (While the clade vertices might seem redundant here, they have in general higher in- and outdegree.)
Dumm et al.[20] extended an sDAG to a history sDAG by adding labels (e.g. ancestral sequences) to each vertex, so that a clade/clade split can exist multiple times but with different labels. They use history sDAGs to represent and find maximally parsimonious trees.
Remark. When computing a CCD1 or CCD2 based on a tree sample T, it is important that T does not contain outlier trees that should have been discarded as burnin. Suppose otherwise, that there is an outlier tree T that does not share any clades (except X and the taxa) with the other trees in T. Then X has one clade split corresponding to T with probability 1 ∕ k; all other non-leaf clades of T have only one clade split and so probability 1. Therefore, Pr ( T ) = 1 ∕ k, which vastly overestimates its posterior probability. However, it is possible to build a simple heuristic to detect such outliers: For example, one could check whether removing a tree T from the CCD, and thus decreasing its clade and clade split frequencies by one, significantly changes the probability of T. Another option is to check whether T contains any clades or clade splits that have been observed only once. Nonetheless, this behaviour should be kept in mind when working with CCD1 and CCD2, and in particular when T contains only few different trees.
Utilizing CCDs.
With the CCD graph as the data structure underlying CCD0, CCD1, and CCD2, we can efficiently sample from a CCD and compute interesting values over a whole CCD. To sample a tree from a CCD, starting at the root clade X, pick a clade split S = C1 ∥ C2 among S ( X ) based on their probabilities; then proceed in the same fashion with C1 and C2 until a fully resolved tree is obtained.
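As a minimal illustration (our own encoding, hard-coding the Fig 1 CCD1), this top-down sampling looks as follows:

```python
import random

# CCD graph of the Fig 1 example: clade -> list of (clade split, Pr(S)).
CCD = {
    "ABCDE": [(("ABC", "DE"), 3 / 7), (("ABCD", "E"), 2 / 7), (("ABCE", "D"), 2 / 7)],
    "ABCD": [(("ABC", "D"), 1.0)], "ABCE": [(("ABC", "E"), 1.0)],
    "ABC": [(("AB", "C"), 3 / 7), (("A", "BC"), 4 / 7)],
    "AB": [(("A", "B"), 1.0)], "BC": [(("B", "C"), 1.0)], "DE": [(("D", "E"), 1.0)],
}

def sample_tree(clade="ABCDE", rng=random):
    """Draw a tree from the CCD: pick a clade split at each clade top-down."""
    if len(clade) == 1:          # a leaf clade needs no further resolution
        return []
    splits, probs = zip(*CCD[clade])
    c1, c2 = rng.choices(splits, weights=probs, k=1)[0]
    return [(c1, c2)] + sample_tree(c1, rng) + sample_tree(c2, rng)
```

Each call returns one tree, encoded as its list of clade splits, drawn with exactly the CCD1 probability given by Eq (1).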
We can also use dynamic programming to compute values such as the number of different trees (topologies) and the entropy of a CCD, or (as explained below) find the tree with maximum probability. For example, to compute the number of different trees in a CCD graph G, for a clade C, let t ( C ) be the number of different trees in G(C). For a leaf ℓ, we have t ( ℓ ) = 1, and for any other clade, we can use the following recursive formula:

t ( C ) = ∑_{C1 ∥ C2 ∈ S ( C )} t ( C1 ) ⋅ t ( C2 ) .

Using dynamic programming, we compute these values bottom-up through G. Then t ( X ) is the total number of different trees in G. Note that this calculation takes linear time in the number of clades and clade splits.
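For the Fig 1 example, this recursion can be checked directly (a sketch with our own encoding of the CCD graph; memoization stands in for the bottom-up pass):

```python
from functools import lru_cache

# Clade splits of the Fig 1 CCD graph (non-leaf clades only).
SPLITS = {
    "ABCDE": [("ABC", "DE"), ("ABCD", "E"), ("ABCE", "D")],
    "ABCD": [("ABC", "D")], "ABCE": [("ABC", "E")],
    "ABC": [("AB", "C"), ("A", "BC")],
    "AB": [("A", "B")], "BC": [("B", "C")], "DE": [("D", "E")],
}

@lru_cache(maxsize=None)
def num_trees(clade):
    """t(C) = sum over clade splits of t(C1) * t(C2); t(leaf) = 1."""
    if len(clade) == 1:
        return 1
    return sum(num_trees(c1) * num_trees(c2) for c1, c2 in SPLITS[clade])
```

For the Fig 1 CCD this gives the 6 trees mentioned in the example (2 resolutions of ABC times 3 resolutions at the root).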
Analogously, we can compute the entropy of the CCD by computing, for each clade C, the entropy of G(C); let H ( C ) denote this value. We can then use the formula by Lewis et al. [12]:

H ( C ) = ∑_{S = C1 ∥ C2 ∈ S ( C )} Pr ( S ) ⋅ ( − log Pr ( S ) + H ( C1 ) + H ( C2 ) ) ,

where for each leaf ℓ ∈ X we have H ( { ℓ } ) = 0. The entropy of the CCD is then H = H ( X ). Note that exp ( − H ) is the average probability of a tree in the CCD and we can define exp ( H ) as the number equivalent – the effective number of distinct topologies in the distribution.
Point estimators
We recall the definitions of the two most commonly used point estimators and define new point estimators based on CCD0 and CCD1. Let T be again a tree sample on k trees for which we can compute the frequencies for trees, clades, and clade splits.
MCC tree.
Let PrCC ( C ) denote the clade credibility (Monte Carlo probability) of clade C, i.e., PrCC ( C ) = f ( C ) ∕ k. The clade credibility PrCC ( T ) of a tree T ∈ T is the product of its clades’ clade credibilities:

PrCC ( T ) = ∏_{C ∈ C ( T )} PrCC ( C ) .

The maximum clade credibility (MCC) tree is the tree T in T that maximizes PrCC ( T ). Note that the MCC tree is restricted to be from the sample.
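On the Fig 1 sample, the MCC computation can be sketched as follows (our encoding of trees as lists of clade splits; leaf clades, having frequency k, contribute factor 1 and are omitted):

```python
from collections import Counter
from fractions import Fraction

def clades_of(tree):
    """Non-leaf clades of a tree encoded as a list of clade splits."""
    return [frozenset(a) | frozenset(b) for a, b in tree]

t1 = [("ABC", "DE"), ("AB", "C"), ("A", "B"), ("D", "E")]
t2 = [("ABCD", "E"), ("ABC", "D"), ("A", "BC"), ("B", "C")]
t3 = [("ABCE", "D"), ("ABC", "E"), ("A", "BC"), ("B", "C")]
sample = [t1] * 3 + [t2] * 2 + [t3] * 2
k = len(sample)

clade_freq = Counter(c for tree in sample for c in clades_of(tree))

def clade_credibility(tree):
    """PrCC(T): product of f(C)/k over the (non-leaf) clades of T."""
    p = Fraction(1)
    for c in clades_of(tree):
        p *= Fraction(clade_freq[c], k)
    return p

mcc_tree = max(sample, key=clade_credibility)
```

Here the most frequently sampled tree has clade credibility 9 ∕ 49 and is the MCC tree; note these values happen to coincide with the CCD1 probabilities in this small example.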
Greedy consensus tree.
Let C1 , … , Cm be the nontrivial clades appearing in T ordered by decreasing frequency; ties are broken arbitrarily. Starting with a star tree T0 with root X and leaves { ℓ } , ℓ ∈ X, we process the clades in order. For the next clade Ci, we test whether Ci is compatible with the current tree Ti−1, that is, whether there is a clade (vertex) C containing Ci in Ti−1 and with no child clade of C containing or properly intersecting Ci. If we find such a clade C, we refine Ti−1 by making Ci a new child of C and making all child clades of C that are contained in Ci child clades of Ci. After Cm, the resulting tree is the greedy consensus tree. For n taxa and k trees, the greedy consensus tree can be computed in O ( kn² ) time or Õ ( kn^1.5 ) time [11,21], or in Õ ( kn ) time [22] (Õ ignores logarithmic factors).
CCD-based point estimators.
For a CCD[i], i ∈ { 0 , 1 , 2 }, we call the tree T with maximum probability Pr ( T ) in the CCD[i] the CCD[i]-MAP tree. Using the recursive relationships for CCDs explained above, we can find the CCD[i]-MAP tree efficiently as follows. Let Pr⋆ ( C ) denote the maximum probability of any subtree rooted at clade C. With Pr⋆ ( ℓ ) = 1 for a leaf ℓ, we can compute Pr⋆ ( C ) with the following formula:

Pr⋆ ( C ) = max_{S = C1 ∥ C2 ∈ S ( C )} Pr ( S ) ⋅ Pr⋆ ( C1 ) ⋅ Pr⋆ ( C2 ) .

The maximum probability of any tree in the CCD[i] is then given by Pr⋆ ( X ). The tree T achieving this maximum probability can be obtained along with the corresponding value by dynamic programming.
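On the Fig 1 CCD1, the dynamic program and its backtracking can be sketched as follows (our encoding; the MAP tree found is the unsampled 12 ∕ 49 tree from the example):

```python
from fractions import Fraction
from functools import lru_cache

# Fig 1 CCD1: clade -> list of (clade split, Pr(S)).
CCD = {
    "ABCDE": [(("ABC", "DE"), Fraction(3, 7)), (("ABCD", "E"), Fraction(2, 7)),
              (("ABCE", "D"), Fraction(2, 7))],
    "ABCD": [(("ABC", "D"), Fraction(1))], "ABCE": [(("ABC", "E"), Fraction(1))],
    "ABC": [(("AB", "C"), Fraction(3, 7)), (("A", "BC"), Fraction(4, 7))],
    "AB": [(("A", "B"), Fraction(1))], "BC": [(("B", "C"), Fraction(1))],
    "DE": [(("D", "E"), Fraction(1))],
}

@lru_cache(maxsize=None)
def max_prob(clade="ABCDE"):
    """Pr*(C) = max over splits of Pr(S) * Pr*(C1) * Pr*(C2); Pr*(leaf) = 1."""
    if len(clade) == 1:
        return Fraction(1)
    return max(p * max_prob(c1) * max_prob(c2) for (c1, c2), p in CCD[clade])

def map_tree(clade="ABCDE"):
    """Backtrack the argmax splits to assemble the CCD-MAP tree."""
    if len(clade) == 1:
        return []
    (c1, c2), p = max(CCD[clade],
                      key=lambda sp: sp[1] * max_prob(sp[0][0]) * max_prob(sp[0][1]))
    return [(c1, c2)] + map_tree(c1) + map_tree(c2)
```

Here Pr⋆ ( X ) = 12 ∕ 49 and the backtracked tree contains the splits ABC ∥ DE and A ∥ BC, i.e., the CCD1-MAP tree is not among the sampled trees.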
Note that the CCD0-MAP tree is based on the same criteria as the MCC tree since the clade split probabilities in a CCD0 are based on clade frequencies. However, the choice for the CCD0-MAP tree is not restricted to the sample. Further note that the greedy consensus greedily picks clades based on their clade credibility. We combine these two ideas into another point estimator for CCD0. The CCD0-MSCC tree (‘S’ for ‘sum’) is the tree in the CCD0 that maximizes the sum of clade credibilities.
When annotating a tree T obtained with a CCD with clade support, an alternative to the Monte Carlo probabilities from the MCMC run is to use the probability of each clade of T to appear in a tree of the CCD. In fact, the probability of a clade C in a CCD1 equals the Monte Carlo probability of C in the sample used to construct the CCD1 (see Sect S1.1 of S1 Text for a proof). For a CCD0 or if the parameters were set differently, clade probabilities can be computed efficiently with the CCD graph.
Datasets
We performed well-calibrated simulation studies [23] using the LinguaPhylo packages LPhyStudio and LPhyBEAST [24] and BEAST2 [1] to obtain posterior samples. We used both Yule tree and time-stamped coalescent simulations. (See Figs A and B of S1 Text for graphical models.)
For our Yule tree simulations we generated two sets of 250 trees and alignments with 10 and 20 (n) taxa, as well as 100 trees and alignments with 50, 100, 200 and 400 taxa. For all simulations (except n = 20) the birth rate of the Yule [25] process was fixed to 25.0 (12.5 for n = 20). For the substitution model, we used the HKY+G model [26]. The shape parameter for the gamma distribution of site rates was modelled using a log-normal distribution, with a mean in log space of - 1.0 and a standard deviation in log space of 0.5. The transition/transversion rate ratio (κ) also followed a log-normal distribution, with a mean in log space of 1.0 and a standard deviation in log space of 1.25. The nucleotide base frequencies were independently simulated for each replicate from a Dirichlet distribution with a concentration parameter array of [5.0, 5.0, 5.0, 5.0]. The length of the sequence alignments was 300 sites (600 sites for n = 20) and the mutation rate was fixed at 1.0, so that divergence ages were in units of substitutions per site. In addition, we generated another set of simulations for 400 taxa where the only change is a four times longer sequence length of 1200 sites.
In our time-stamped coalescent [27] simulations, we generated 100 phylogenetic trees and alignments for each of four different taxa sizes n: 40, 80, 160, and 320. Each tree coalescent process had a population size parameter (θ) drawn from a log-normal distribution with a mean in log space of - 2.4276, representing a mean in real space of approximately 0.09, and a standard deviation in log space of 0.5. The alignments consisted of 250 sites each. The youngest leaf was assigned age 0. The remaining leaf ages were distributed uniformly at random between 0 and 0.2. All other parameters were as in the Yule simulations.
We refer to the resulting datasets as Coal40, …, Coal320, Yule10, …, Yule400, and Yule400-long. For each simulation, we ran 2 replicates with BEAST2 to obtain tree samples with 35k trees (50k trees for n = 10 and 20). In all cases, the replicates were checked to have run sufficiently long to ensure convergence, and excess burnin was discarded.
In addition, we performed an analysis on the DS datasets [28,29], specifically DS1 to DS4; see Sect S1.3.2 of S1 Text for details.
Results
We have presented a new tree distribution, CCD0, and introduced new point estimators. We now apply both CCD0 and CCD1 to the datasets described above to evaluate their point estimators and their performance as tractable tree distributions.
Tree distributions
To evaluate the accuracy and precision of the CCDs and sample distributions, we used the datasets Yule10 and Yule20. In Sect S1.3.2 of S1 Text, we also looked at DS1 to DS4 and included CCD2. For each simulation, we combined the 50k trees from the two replicates into one sample distribution of 100k trees, which acts as our (reference) golden distribution. These inference problems are relatively easy, and therefore, the probability of each tree (in particular, the high probability trees) is quite accurately estimated by the golden distributions.
However, for larger datasets with more taxa and higher complexity (in terms of entropy), achieving a “golden run” that accurately estimates tree probabilities within a reasonable time-frame is impossible. The size of tree space grows super-exponentially, and the probabilities of individual trees become exceedingly small, making it infeasible to estimate them based on their frequency in an MCMC sample. For instance, estimating the probability of a tree with a probability of 10⁻⁸ using an MCMC process that takes one second per sample would require at least 10⁸ samples, which would take over three years to complete. This estimate does not even account for the massive amounts of physical storage space needed to retain these tree samples, rendering it impossible to achieve such a golden run for problems more complex or notably larger than the Yule10 and Yule20 examples presented. In fact, this limitation of sample distributions is the very reason why we consider CCDs as estimates of the posterior tree distribution.
We used (sub)samples of size 3, 10, 30, 100, 300, 1k, 3k, 10k, and 30k to generate a CCD0, CCD1, CCD2, and sample distribution for each of the two replicates of all simulations – eight distributions per simulation. For each tree T in the golden distribution, we then calculated the probability of T in each of the eight distributions. Comparing these to the golden probabilities, we use different statistical measures to evaluate the accuracy of each distribution.
Accuracy.
For each sample size, we computed the mean absolute error (MAE) of tree probabilities for each distribution. For a specific distribution (Sample, CCD0, CCD1, CCD2), the MAE is calculated as the average of the absolute differences in probabilities between this distribution and the reference (golden) distribution, taken over all trees in the reference distribution. Note that the MAE weights the accuracy on high-probability trees more heavily than on lower-probability trees. We then counted how often each distribution type had the lowest MAE, their number of wins. We further divided the simulations into five equal-sized groups (each of size 100) based on their entropy [12], that is, the sum of −Pr(T) log Pr(T) over all trees in the golden distribution. (For Yule10 the entropy bounds are 0.41, 1.76, 2.5, 3.25, 4.30, 7.68 with means of 1.20, 2.09, 2.84, 3.67, 5.30, and for Yule20 they are 0.09, 2.29, 3.22, 4.03, 5.08, 7.73 with means of 1.70, 2.82, 3.61, 4.52, 5.93.) Heatmaps of the wins in these categories for Yule10 and Yule20 are shown in Fig 4, where each tile is colored by the distribution that has the majority of wins, and its win-% is given. A more detailed view of the number of wins can be found in Fig D in S1 Text.
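The two quantities just described can be sketched as follows; representing a tree distribution as a dictionary from topology identifiers to probabilities is our own simplification:

```python
import math

def mae(golden: dict, estimate: dict) -> float:
    """Mean absolute error of tree probabilities, averaged over all
    trees in the golden (reference) distribution; trees absent from
    the estimate have probability 0."""
    return sum(abs(p - estimate.get(t, 0.0)) for t, p in golden.items()) / len(golden)

def entropy(dist: dict) -> float:
    """Entropy of a tree distribution: the sum of -Pr(T) log Pr(T)."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

golden   = {"t1": 0.5, "t2": 0.3, "t3": 0.2}
estimate = {"t1": 0.6, "t2": 0.3}       # t3 unsampled -> probability 0
print(mae(golden, estimate))            # = (0.1 + 0.0 + 0.2) / 3
print(entropy(golden))                  # ≈ 1.03
```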
We observe that there are three regimes based on the sample size: roughly, from 3 to about 100 samples, CCD0 is the most accurate method; from 100 to 10k samples, CCD1 gives the best estimates; for the largest samples, CCD2 overtakes CCD1. A lower entropy seems to prolong the dominance of CCD0. The boundaries of the regimes also vary with the problem size. The experiment confirms the regimes we expected: CCD0 is the simplest model and quickly provides a good estimate; CCD1 has more parameters and so needs more samples to become saturated, at which point CCD0 starts to show its bias. The same is true for CCD2, which needs even more samples than CCD1. In the long run, the sample distributions provide the best estimate, which can be observed by taking a more detailed look at the best performing distribution per simulation (cf. Sect S1.3.1 in S1 Text).
Heatmap showing the majority wins based on MAE with simulations in five entropy categories (higher means noisier/harder); more saturated colors mean a larger winning margin for the respective distribution (CCD0, CCD1, CCD2 or the sample distribution).
We also observe the regimes when we look at the mean relative error (MRE) of tree and clade probabilities; see Fig 5. (Since the results look very similar for Yule10 and Yule20, those for Yule10 can be found in Sect S1.3 of S1 Text.) The MRE is defined as the mean, over all trees/clades, of the absolute difference in probability between the golden distribution and the generated distribution (CCD0, CCD1, CCD2, or Sample) divided by the golden probability. Note that for the MRE, a small absolute difference in probability for low-probability trees causes a large relative error. Since tree probabilities in the tail of the distribution are not well estimated, we consider only the trees in the 50% and 95% credibility intervals, that is, the minimum number of highest-ranked trees in the golden distribution whose probabilities sum up to 50%/95%. For clades we consider all clades in the golden distribution. For small sample sizes, CCD0 performs better than or on par with CCD1 up to sample sizes of about 30/300. Note that CCD0 then does not improve any further, indicating the limitations of the CCD0 model. The performance of CCD1 remains the best even for larger sample sizes, with CCD2 close behind and the sample distribution only slowly catching up. Note that in the case of clade probabilities, we have merged CCD1 and Sample because they are the same.
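The credible-interval construction and the MRE can be sketched as follows; as before, the dictionary representation of a tree distribution is our own simplification:

```python
def credible_set(golden: dict, level: float) -> list:
    """Smallest set of highest-probability trees in the golden
    distribution whose probabilities sum to at least `level`."""
    total, chosen = 0.0, []
    for tree, p in sorted(golden.items(), key=lambda kv: -kv[1]):
        chosen.append(tree)
        total += p
        if total >= level:
            break
    return chosen

def mre(golden: dict, estimate: dict, trees) -> float:
    """Mean relative error of tree probabilities over the given trees."""
    return sum(abs(golden[t] - estimate.get(t, 0.0)) / golden[t]
               for t in trees) / len(trees)

golden   = {"t1": 0.5, "t2": 0.3, "t3": 0.15, "t4": 0.05}
estimate = {"t1": 0.55, "t2": 0.25, "t3": 0.15}
ci95 = credible_set(golden, 0.95)   # ["t1", "t2", "t3"]
print(mre(golden, estimate, ci95))  # ≈ (0.1 + 0.167 + 0.0) / 3 ≈ 0.089
```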
Trees are separated into 50% and 95% credible sets.
Looking at the mean estimated rank of the top tree of the golden distribution in the other distributions for each simulation reveals a similar picture; see Fig 6. CCD0 is best for sample sizes up to and including 30, but above 100 CCD1 and CCD2 perform better on average; the sample distribution requires 1k samples to become competitive.
Precision.
To evaluate the precision, we computed the difference in the tree probabilities between the two replicates for each sample size. The mean over the 100 simulations for Yule20 are shown in Fig 7. We observe that the CCDs consistently show a higher precision than the sample distribution for all sample sizes. Note that high precision also implies a high stability.
Representativeness.
Note that, by construction, for a given MCMC run, if the sample distribution contains the true tree then so do the CCDs; analogously, if CCD1 (CCD2) contains the true tree then so does CCD0 (resp. CCD1 and CCD0). Table 1 shows the percentage of distributions (both replicates per simulation) that contain the true tree for the 250/100 simulations of the Yule20 and Yule50 datasets. For the former, we observe that CCD0 and CCD1 cross the 95% threshold already at 100 samples, while the sample distribution only does so at 3k samples. The difference becomes even more apparent for Yule50, where the sample distribution only reaches 3.5% with 30k sampled trees, while the CCDs quickly contain the true tree in the majority of simulations and also reach the 95% threshold.
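The containment hierarchy follows from how the supports nest: a CCD0, for instance, contains every tree all of whose clades were observed, which can include topologies never sampled. A minimal sketch, with trees encoded as nested tuples and clades as frozensets of taxa (our own encoding):

```python
def clades(tree):
    """Return (taxon set, clade set) of a rooted binary tree
    given in nested-tuple form, e.g. (("A", "B"), ("C", "D"))."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    lt, lc = clades(tree[0])
    rt, rc = clades(tree[1])
    taxa = lt | rt
    return taxa, lc | rc | {taxa}

def in_ccd0_support(tree, observed):
    """A tree lies in the CCD0 support iff all its clades were observed."""
    return clades(tree)[1] <= observed

sampled = [((("A", "B"), "C"), (("D", "E"), "F")),
           (("A", ("B", "C")), ("D", ("E", "F")))]
observed = set().union(*(clades(t)[1] for t in sampled))

# A topology that was never sampled, yet all of whose clades were
# observed, so it is contained in the CCD0 but not in the sample:
mixed = ((("A", "B"), "C"), ("D", ("E", "F")))
print(in_ccd0_support(mixed, observed))  # True
print(mixed in sampled)                  # False
```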
Percentage of the true tree being contained in a distribution for Yule20 and Yule50 (out of the 250/100 simulations with 2 replicates each).
Point estimators
We evaluated the point estimators based on the following properties. Firstly, we have the accuracy – a good point estimate should be close to the truth (low Root Mean Squared Error or average distance; testable in simulation studies). Further, we can measure the behaviour under different MCMC replicates; a good point estimator should be precise (small distance between estimates) and, related to that, stable (consistent distance to truth).
Holder et al. [30] argued for MRC trees as point estimates by showing that if we define a loss function with penalties for missed and wrong clades, then the MRC tree seeks to minimize the loss. In fact, if we only report fully-resolved trees, then this loss is equivalent to the well-known Robinson-Foulds (RF) distance [31]. Recall that the RF distance of two trees T and T′ equals the size of the symmetric difference of their clade sets C(T) and C(T′), which we here always divide by two. So for fully-resolved trees, the RF distance to the truth measures how many clades the point estimate gets wrong. As we discuss later, we obtain few non-fully-resolved trees from the greedy consensus method and thus do not compensate for the fact that such trees benefit from having fewer clades that can contribute to the symmetric difference.
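The halved RF distance can be computed directly from clade sets; the nested-tuple tree encoding below is our own simplification:

```python
def clade_set(tree):
    """Return (taxon set, clade set) of a rooted binary tree
    given in nested-tuple form."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    lt, lc = clade_set(tree[0])
    rt, rc = clade_set(tree[1])
    taxa = lt | rt
    return taxa, lc | rc | {taxa}

def rf_distance(t1, t2):
    """Robinson-Foulds distance: |C(T1) symmetric-difference C(T2)|,
    divided by two as in the text."""
    c1, c2 = clade_set(t1)[1], clade_set(t2)[1]
    return len(c1 ^ c2) / 2

true_tree = ((("A", "B"), "C"), ("D", "E"))
estimate  = (("A", ("B", "C")), ("D", "E"))
print(rf_distance(true_tree, estimate))      # 1.0: clade AB vs clade BC
n = 5
print(rf_distance(true_tree, estimate) / (n - 2))  # relative RF ≈ 0.33
```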
For our experiments, we used the datasets Yule50 to Yule400 and Coal40 to Coal320. For each simulation and each of the two replicates, we again used samples of size 3, 10, 30, 100, 300, 1k, 3k, 10k, and 30k to generate a CCD0, a CCD1, and a sample distribution. With the CCDs we computed the CCD-MAP trees and the CCD0-MSCC tree, and based on the sample distribution we computed the MCC tree and the greedy consensus tree. As reference we have the true tree of each simulation, the one used to generate the alignments. (We only show the results for the four larger datasets here; those for the four smaller datasets are very similar and thus only given in Sect S1.4 of S1 Text.) In addition, we also considered the tree topology of the state with the highest (sampled) posterior density as a topological point estimate (the posterior density of the state includes consideration of the divergence ages and other model parameters). However, since it performed consistently worse than even the MCC tree, and since the CCD0-MSCC tree behaved almost exactly like the CCD0-MAP tree, we excluded both from the figures to improve visual clarity.
Accuracy.
Fig 8 shows the mean relative RF distance of the point estimates to the true tree for different sample sizes. The relative RF distance describes the percentage of the n−2 clades of the true tree that an estimator got wrong. For example, for Yule400, a relative RF distance of 0.16 (0.1) means that about 64 (resp. 40) of 398 nontrivial clades differ from the true tree. We observe that CCD0-MAP performs best from 3 to 30k trees. At around 30 to 100 trees for the Yule simulations and around 100 to 300 trees for the coalescent simulations, greedy consensus catches up and performs equally well. CCD1-MAP gets close to this performance but does not fully catch up. MCC, on the other hand, remains at least 1% behind the top estimators.
Precision.
To evaluate the precision, we computed the mean distance between the point estimates of two corresponding replicates; see Fig 9. We observe that greedy consensus and the CCD-based methods have significantly higher precision than MCC, with CCD1-MAP lagging slightly behind the others. For 1k trees, the CCD0 estimators and greedy consensus differ in fewer than 10 clades between replicates, whereas MCC varies by five to ten times that. Note that a high precision also implies a high stability (low variance in distance to the true tree).
Running time.
We also want to report on the running times of our implementations for the largest dataset, Yule400. Constructing a CCD1 or CCD0 on samples of 30k trees (which requires parsing the file with 35k trees) took on average 90 seconds, the same as constructing the MCC tree. Computing any of the other point estimates took only a few milliseconds. The bottleneck thus seems to be parsing the large file and not the construction.
Resolvedness.
We also tested in how many simulations the greedy consensus tree was not fully resolved. The results (see Table A in S1 Text) show that for 300 trees and more, the greedy consensus tree was always fully resolved on our datasets; for 30 trees, the greedy consensus tree was not fully resolved in at most 4.5% of the simulations. Note that with better Monte Carlo estimates of clade frequencies, ties that can cause unresolved trees become less likely.
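The greedy consensus construction referred to throughout can be sketched as follows, assuming sampled trees are already reduced to clade sets (frozensets of taxa): clades are accepted in order of decreasing frequency whenever they are compatible (disjoint or nested) with all previously accepted clades. This is a minimal illustration, not the optimized algorithms from the literature; note that frequency ties, resolved here arbitrarily, are exactly what can leave the consensus unresolved.

```python
from collections import Counter

def compatible(c1, c2):
    """Two clades are compatible iff they are disjoint or nested."""
    return c1.isdisjoint(c2) or c1 <= c2 or c2 <= c1

def greedy_consensus(clade_sets):
    """Accept clades by decreasing sample frequency while they remain
    compatible with every clade accepted so far."""
    counts = Counter(c for cs in clade_sets for c in cs)
    accepted = []
    for clade, _ in counts.most_common():  # ties broken arbitrarily
        if all(compatible(clade, a) for a in accepted):
            accepted.append(clade)
    return accepted

AB, ABC, ABD = frozenset("AB"), frozenset("ABC"), frozenset("ABD")
samples = [{AB, ABC}, {AB, ABC}, {AB, ABD}]
print(greedy_consensus(samples))  # accepts AB then ABC; ABD conflicts with ABC
```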
Discussion
The CCD approach can be described as a bias-variance trade-off in the context of MCMC summarization. These tractable tree distributions exhibit a certain level of bias (due to the independence assumptions employed) in exchange for reduced variance in the estimates when faced with Monte Carlo error, particularly in cases of low ESS (relative to the posterior variance in tree space). The number of parameters of the models grows from CCD0 (clades) to CCD1 (clade splits) to CCD2 (pairs of clade splits) and finally to sample distributions (trees), demanding an increasing number of trees to estimate them accurately. This is confirmed by our experiments on easy and small simulated problems, where we observed these regimes in terms of the number of sampled trees: first, CCD0 is best for few samples in terms of accuracy, precision, and stability; then CCD1 catches up and becomes the best method in the mid range, while CCD2 and the sample distribution require a huge number of sampled trees to become competitive. Unsurprisingly, the bias of CCD0 becomes apparent with a large enough number of uncorrelated samples. On the real datasets DS1 and DS4 (which have low entropy), the richer models CCD1 and CCD2 can capture the structure of the posterior better than a CCD0, even with a small number of sampled trees. For all of DS1 to DS4, the sample distribution is even competitive or the best choice for a large number of trees. However, for non-trivial problems, sampling enough trees with MCMC to reach the regimes of CCD1, CCD2, or sample distributions may not be feasible. In such cases, advanced methods such as the regularization used by Zhang and Matsen [17] promise to be beneficial; they found that their sDAG model performs better than CCD1 in capturing posterior distributions of real datasets.
While our experiments suggest that CCD0 offers the overall best posterior estimate for hard simulated problems (in terms of entropy), the question remains of how to select the best model for real datasets. This is particularly relevant when datasets are relatively simple (in terms of entropy) or exhibit challenging characteristics such as multi-modality [32]. One way to assess this could be the use of the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or cross-validation. In our preliminary experiments (see Sect S1.3.2 in S1 Text), the AIC scores align well with the accuracy of the distributions. Note, though, that AIC requires independent samples to be meaningful, which in the context of MCMC necessitates good estimates of the effective sample size (ESS). However, computing the ESS of trees remains an ongoing challenge [33].
With an implementation that uses CCD graphs, many tasks related to tree distributions can be performed efficiently (fixed-parameter tractable in the number of clades and clade splits). This includes sampling a tree, computing the probability of a tree, as required for example for BCA [13,15], computing the MAP tree, and computing the entropy and the number of trees in the distribution. In practice, the running time is dominated by parsing the trees while building the CCD, whereas computing the MAP tree takes negligible time. While for large and very diffuse (more prior-like) distributions the construction of a CCD0 may take noticeable time (minutes), it would still only be a fraction of the days or weeks needed to compute such a distribution via MCMC.
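To illustrate how a CCD assigns a probability to a tree: under a CCD1, Pr(T) is the product, over all clade splits {C → C1, C2} of T, of the conditional probability of that split given its parent clade, estimated from observed frequencies. The counts below are hypothetical and the flat representation of a tree as a list of (clade, split) pairs is our own simplification of the CCD graph:

```python
def ccd1_probability(tree_splits, split_counts, clade_counts):
    """Pr(T) = product over clade splits (C -> C1, C2) of
    Pr(split | C) = count(split) / count(C)."""
    p = 1.0
    for clade, split in tree_splits:
        p *= split_counts[(clade, split)] / clade_counts[clade]
    return p

# Hypothetical counts observed in 100 sampled trees on taxa {A, B, C, D}:
ABCD, ABC, AB = frozenset("ABCD"), frozenset("ABC"), frozenset("AB")
clade_counts = {ABCD: 100, ABC: 80}
split_counts = {
    (ABCD, (ABC, frozenset("D"))): 80,  # root split ABC|D seen 80 times
    (ABC,  (AB, frozenset("C"))):  60,  # split AB|C seen 60 times
}
# Cherries like AB split deterministically, so they contribute factor 1.
tree = [(ABCD, (ABC, frozenset("D"))), (ABC, (AB, frozenset("C")))]
print(ccd1_probability(tree, split_counts, clade_counts))  # ≈ 0.8 * 0.75 = 0.6
```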
Concerning the point estimates, we demonstrated that the CCD-MAP trees and the greedy consensus tree outperform the commonly used MCC tree in terms of accuracy and precision. So not only do they produce better trees in general, but they are also more robust to the random sampling process of MCMC. This finding is concerning given that the MCC tree has been the standard point estimate used by almost every BEAST practitioner for decades. Additionally, we find that the CCD0-MAP tree performs as well as or better than the greedy consensus tree, with the added benefit that both variants of the CCD-MAP tree guarantee a fully resolved tree. While getting an unresolved greedy consensus tree may not be an issue for many problems (cf. Table A in S1 Text), we want to point out that (i) in viral phylodynamics, it is typical to encounter (near-)identical sequences resulting in partially diffuse posteriors, thus increasing the probability of encountering unresolved greedy consensus trees, and (ii) finding the most resolved greedy consensus tree is an NP-hard problem. The CCD1-MAP tree does not match the accuracy of the CCD0-MAP tree in our experiments on nontrivial problems, since even for large samples we do not reach the CCD1 regime observed in smaller analyses. On the Yule20 dataset, we could not observe a performance difference between the CCD1-MAP tree and the CCD0-MAP tree. For a sufficiently large number of uncorrelated samples, the CCD1-MAP tree is expected to perform as well as or even better than the CCD0-MAP tree.
Suppose that we had the true posterior distribution. This distribution is not only on tree topologies but also on branch lengths and other model parameters. Then we would take the actual MAP tree, that is, the tree with maximum posterior probability after marginalising (i.e., integrating) over the branch lengths and the other model parameters. When estimating the posterior distribution with an MCMC sample, the marginalising happens automatically if we just look at the frequencies of different tree topologies. However, if the probability of the most probable tree is less than 1/k (for a sample size of k), the Monte Carlo sample counts will not accurately reflect the posterior probabilities of individual tree topologies, since every tree topology is sampled either 0 or 1 times. Due to this limitation, we require other summary methods (as described in this paper). In the past, the tree topology from the sampled state with highest posterior density has been extracted (without marginalising over other parameters) [34]. However, our experiments showed that this point estimate of tree topology performs even worse than the MCC tree, another commonly used summary method. Our recommendation is thus to use the CCD0-MAP tree or the greedy consensus tree.
Despite the existence of various tree metrics, our evaluation focuses on the Robinson-Foulds distance. This choice is justified because all the point estimates compared in the paper – CCD-MAP, MCC, and greedy consensus – are primarily based on constructing a topology. Hence, the Robinson-Foulds distance is particularly suitable for evaluating their performance, especially in the context of systematics, where one of the primary goals of a phylogeny is to obtain accurate clade information [30]. In this context, the Robinson-Foulds metric directly quantifies the performance when comparing a point estimate to the true tree.
Conclusion
This research has shown that the CCD0-MAP tree and the greedy consensus should be the preferred point estimators for Bayesian phylogenetic inference of time-trees. The restriction to sampled trees comes at such a high cost that previous caution against using unsampled trees as point estimates is not warranted. We can thus retire the MCC-from-sample point estimator. Furthermore, CCDs offer better estimates of individual tree probabilities than the sample distribution for hard problems. However, picking the right CCD model for a particular dataset remains a tricky problem that requires further research.
Our conclusions are primarily drawn from well-calibrated simulation studies, which are the current standard for evaluating phylogenetic tree tools. However, the extent to which these simulations accurately represent real datasets and their posterior distributions remains uncertain. We have designed our simulations to capture a broad range of entropies observed in real datasets, aiming for greater realism. Nevertheless, these simulations do not account for more complex scenarios, such as multi-modality or non-standard distributional shapes. Addressing these challenges will require significant advances in the field, as the complexities of phylogenetic treespaces are not yet fully understood and demand further research.
While our approach was developed mainly for TreeAnnotator within the BEAST2 framework [1], our results are applicable to any sample of rooted tree topologies that represents a posterior distribution. Furthermore, it would be straightforward to incorporate support for unresolved trees. We have incorporated CCD-based point estimators into the existing TreeAnnotator software, providing the CCD package, which enables users to easily access and use this new method on their data.
In practice, time information of point estimates is also of great interest. The CCD-based point estimates fit in the commonly used framework of estimating the tree topology first followed by annotating it with divergence ages. These latter methods are independent from CCDs. It would be interesting to see how greedy consensus and the CCD0-MAP tree combined with an annotation method perform in comparison to other combined approaches and to methods that estimate the topology and branch lengths at the same time, like the matrix method [35].
We hope to use and further develop CCDs for other tasks when working with posterior distributions. This includes the computation of the credibility set of tree topologies, MCMC convergence analysis (cf. Berling et al.[29]), and detection of rogue taxa. One further interesting avenue is to investigate how the parameters of a CCD could be populated efficiently with other means than observed sample frequencies, for example with maximum likelihood or variational methods [18,19].
Supporting information
S1 Text. Supporting text with embedded additional materials
Containing 4 supporting sections describing further details and results, 15 additional figures, and a table.
https://doi.org/10.1371/journal.pcbi.1012789.s001
(PDF)
Acknowledgments
We would also like to thank Jordan Douglas for his help setting up simulation studies as well as Erick Matsen for his helpful comments and suggestions.
References
- 1. Bouckaert R, Vaughan TG, Barido-Sottani J, Duchene S, Fourment M, Gavryushkina A, et al. BEAST 2.5: an advanced software platform for Bayesian evolutionary analysis. PLOS Comput Biol. 2019;15(4):e1006650.
- 2. Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, et al. MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Systemat Biol. 2012;61(3):539–42. pmid:22357727
- 3. Höhna S, Landis MJ, Heath TA, Boussau B, Lartillot N, Moore BR, et al. RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Systemat Biol 2016;65(4):726–36.
- 4. Gavryushkin A, Drummond AJ. The space of ultrametric phylogenetic trees. J Theor Biol. 2016;403:197–208. pmid:27188249
- 5. Billera LJ, Holmes SP, Vogtmann K. Geometry of the space of phylogenetic trees. Adv Appl Math 2001;27(4):733–67.
- 6. Berling L, Collienne L, Gavryushkin A. Estimating the mean in the space of ranked phylogenetic trees. Bioinformatics. 2024;40(8):btae514. pmid:39177090
- 7. Heled J, Bouckaert RR. Looking for trees in the forest: summary tree from posterior samples. BMC Evolution Biol. 2013;13:221. pmid:24093883
- 8. Bryant D. A classification of consensus methods for phylogenetics. DIMACS Ser Discrete Math Theor Comput Sci. 2003;61:163–84.
- 9. Benner P, Bacak M, Bourguignon PY. Point estimates in phylogenetic reconstructions. Bioinformatics. 2014;30(17):i534–40.
- 10. Brown DG, Owen M. Mean and variance of phylogenetic trees. Systemat Biol 2019;69(1):139–54.
- 11. Sung WK. Greedy consensus tree and maximum greedy consensus tree problems. In: Das GK, Mandal PS, Mukhopadhyaya K, Nakano S, editors. WALCOM: algorithms and computation. vol. 11355. LNCS. Springer; 2019.
- 12. Lewis PO, Chen MH, Kuo L, Lewis LA, Fučíková K, Neupane S, et al. Estimating Bayesian phylogenetic information content. Systemat Biol 2016;65(6):1009–23.
- 13. Ane C, Larget B, Baum DA, Smith SD, Rokas A. Bayesian estimation of concordance among gene trees. Molecul Biol Evol 2006;24(2):412–426.
- 14. Hohna S, Drummond AJ. Guided tree topology proposals for Bayesian phylogenetic inference. Systemat Biol 2012;61(1):1–11.
- 15. Larget B. The estimation of tree posterior probabilities using conditional clade probability distributions. Systemat Biol 2013;62(4):501–11. pmid:23479066
- 16. Szollosi GJ, Rosikiewicz W, Boussau B, Tannier E, Daubin V. Efficient exploration of the space of reconciled gene trees. Systemat Biol 2013;62(6):901–12.
- 17. Zhang C, Matsen IV FA. Generalizing tree probability estimation via Bayesian networks. In: Bengio S, Wallach H, Larochelle H, Grauman K, Cesa-Bianchi N, Garnett R, editors. Advances in neural information processing systems, vol. 31; 2018.
- 18. Zhang C, Matsen IV FA. Variational Bayesian phylogenetic inference. In: International Conference on Learning Representations (ICLR); 2018.
- 19. Jun SH, Nasif H, Jennings-Shaffer C, Rich DH, Kooperberg A, Fourment M, et al. A topology-marginal composite likelihood via a generalized phylogenetic pruning algorithm. Algorith Molecul Biol 2023;18(1):10. pmid:37525243
- 20. Dumm W, Barker M, Howard-Snyder W, DeWitt III WS, Matsen IV FA. Representing and extending ensembles of parsimonious evolutionary histories with a directed acyclic graph. J Math Biol. 2023;87(75).
- 21. Gawrychowski P, Landau GM, Sung WK, Weimann O. A faster construction of greedy consensus trees. In: Chatzigiannakis I, Kaklamanis C, Marx D, Sannella D, editors. International colloquium on automata, languages, and programming (ICALP 2018), vol. 107 of LIPIcs. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik; 2018. p. 63:1–63:14.
- 22. Wu H. Near-optimal algorithm for constructing greedy consensus tree. In: Czumaj A, Dawar A, Merelli E, editors. 47th International Colloquium on Automata, Languages, and Programming (ICALP 2020). vol. 168. Leibniz International Proceedings in Informatics (LIPIcs). Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik; 2020. p. 105:1–105:14.
- 23. Mendes FK, Bouckaert R, Carvalho LM, Drummond AJ. How to validate a Bayesian evolutionary model. Systemat Biol. 2024:syae064.
- 24. Drummond AJ, Chen K, Mendes FK, Xie D. LinguaPhylo: a probabilistic model specification language for reproducible phylogenetic analyses. PLOS Comput Biol 2023;19(7):e1011226.
- 25. Yule GU. II.—A mathematical theory of evolution, based on the conclusions of Dr. J. C. Willis, F.R.S. Philos Trans Roy Soc Lond Ser B. 1925;213(402–410):21–87.
- 26. Hasegawa M, Kishino H, Yano T. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Molecul Evol. 1985;22:160–74.
- 27. Rodrigo AG, Felsenstein J. Coalescent approaches to HIV population genetics. The evolution of HIV. 1999; p. 233–72.
- 28. Lakner C, Van Der Mark P, Huelsenbeck JP, Larget B, Ronquist F. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Systemat Biol 2008;57(1):86–103. pmid:18278678
- 29. Berling L, Bouckaert R, Gavryushkin A. An automated convergence diagnostic for phylogenetic MCMC analyses. IEEE/ACM Trans Comput Biol Bioinform. 2024:1–13.
- 30. Holder MT, Sukumaran J, Lewis PO. A justification for reporting the majority-rule consensus tree in Bayesian phylogenetics. Systemat Biol 2008;57(5):814–21. pmid:18853367
- 31. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53(1–2):131–47.
- 32. Whidden C, Matsen FA IV. Quantifying MCMC exploration of phylogenetic tree space. Systemat Biol 2015;64(3):472–91.
- 33. Lanfear R, Hua X, Warren DL. Estimating the effective sample size of tree topologies from Bayesian phylogenetic analyses. Genome Biol Evol 2016;8(8):2319–32. pmid:27435794
- 34. Shapiro B, Drummond AJ, Rambaut A, Wilson MC, Matheus PE, Sher AV, et al. Rise and fall of the Beringian Steppe Bison. Science 2004;306(5701):1561–5. pmid:15567864
- 35. Bouckaert RR. Variational Bayesian phylogenies through matrix representation of tree space. PeerJ. 2024;12:e17276. pmid:38699195
- 36. Bouckaert RR. DensiTree: making sense of sets of phylogenetic trees. Bioinformatics 2010;26(10):1372–3. pmid:20228129