Statistical inconsistency of the unrooted minimize deep coalescence criterion

Species trees, which describe the evolutionary relationships between species, are often inferred from gene trees, which describe the ancestral relationships between sequences sampled at different loci from the species of interest. A common approach to inferring species trees from gene trees is motivated by supposing that gene tree variation is due to incomplete lineage sorting, also known as deep coalescence. One of the earliest methods motivated by deep coalescence is to find the species tree that minimizes the number of deep coalescent events needed to explain discrepancies between the species tree and input gene trees. This minimize deep coalescence (MDC) criterion can be applied in both rooted and unrooted settings. where either rooted or unrooted gene trees can be used to infer a rooted species tree. Previous work has shown that MDC is statistically inconsistent in the rooted setting, meaning that under a probabilistic model for deep coalescence, the multispecies coalescent, for some species trees, increasing the number of input gene trees does not make the method more likely to return a correct species tree. Here, we obtain analogous results in the unrooted setting, showing conditions leading to inconsistency of the MDC criterion using the multispecies coalescent model with unrooted gene trees for four taxa and five taxa.


Introduction
Evolutionary trees estimated at different loci, gene trees, vary from one another and from the species tree, which represents the history of speciation events. Although there are many causes of such gene tree discordance, one of the most commonly modeled is deep coalescence, the failure of two or more gene lineages to coalesce (i.e., be copied from the same gene in the population) in their most recent ancestral population [1,2]. This phenomenon, also called incomplete lineage sorting, is modeled by the multispecies coalescent, which makes probabilistic predictions for the probabilities of different gene tree topologies to be observed in a sample of gene trees [3][4][5]. Although other sources of gene tree heterogeneity are possible, such as gene duplication and loss, hybridization, recombination within genes, and ancient population structure [1,2,6,7] deep coalescence is thought to be quite common for species that underwent rapid radiations [8,9] and is often used to infer species relationships from gene trees [2,10,11]. Species tree inference methods can be based on sequence data or based on analyzing gene trees (e.g., consensus methods). These latter techniques are also called two-stage methods, meaning that a first stage is estimating the gene trees, and the second stage is combining the information in the gene trees to estimate the species tree. Two-stage techniques are typically computationally faster than sequence-based techniques and do not have issues with convergence of MCMC algorithms that have arisen for real data sets with many loci for Bayesian methods [12]. Consequently, two-stage methods have remained popular in spite of more sophisticated sequence-based methods. Whether or not a two-stage method is explicitly motivated by the multispecies coalescent, a central concern is whether it is statistically consistent under the multispecies coalescent model. A method is consistent in this setting if the probability that it returns the correct species tree topology tends to 1.0 as the number of loci tends to infinity.
Among two-stage methods for inferring species trees, some have been found to be consistent from known gene trees, while others have been shown to be inconsistent. Two-stage methods that have been shown to be statistically consistent (assuming gene trees are known without error) include rooted triple consensus [13], ASTRAL [14], MP-EST [15], NJ st (also called USTAR) [16,17], and STAR [18]. Some have been found to be statistically inconsistent, including democratic vote [19], greedy consensus [20], matrix representation with parsimony [21], and the minimize deep coalescence (MDC) method in the rooted setting [22]. The MDC method, the focus of this paper, infers the species tree which minimizes the total number of deep coalescence events needed to explain each gene tree, summed over all the input gene trees.
The idea behind MDC was introduced by Maddison (1997) and was an early method to be implemented for inferring rooted species trees from rooted gene trees motivated by deep coalescence [23][24][25]. Initially, the method was only applied with rooted gene trees as input, and implementations returned a rooted inferred species tree. To illustrate the idea, consider the species tree and gene trees in Fig 1. [26], which use approximate Bayesian computation (ABC), is the first method we are aware of to explicitly use the coalescent (i.e., using probabilities from the model) to estimate rooted species trees from unrooted gene trees; however, a version of the minimize deep coalescence method (MDC) was also developed to infer rooted species trees from unrooted gene trees [27]. The idea behind the method is to calculate the MDC score contributed by an unrooted gene tree for a candidate rooted species tree by minimizing the cost over all possible rootings of the gene trees.
Although MDC was one of the first methods to be implemented to infer species trees from gene trees [23], this criterion was found to be statistically inconsistent in the rooted setting (i.e., using rooted gene trees as input) in the same year that its unrooted extension was published [22,27]. For some species trees, the probability that MDC returns an incorrect species tree tends to 1.0 as the number of input rooted gene trees goes to infinity. Although more accurate methods for inferring species trees have been developed, MDC is still sometimes used to quickly estimate a candidate species tree or phylogenetic network [28], which motivates studying its properties. Currently, there are no fast methods for inferring rooted species trees from unrooted gene trees. Consequently, a possible application of MDC in this setting is to generate candidate trees (particularly because MDC can be used to find sub-optimal trees) to reduce the search time needed for other more computationally intensive methods.
We also note that PhyloNet can use unrooted MDC, which we call UMDC, to return a rooted species tree even in the case of four taxa, although four-taxon gene tree topologies do not identify the rooted species tree under the multispecies coalescent [29]. A theoretical result from Allman et al. (2011) is that the true distribution of unrooted genetic tree topologies can be used to infer rooted species tree when there are five or more taxa, but not when there are only four taxa.
Part of the argument for the identifiability of the rooted species tree from unrooted gene trees is that there are certain inequalities that hold in the gene tree probabilities. For instance, consider distinguishing the two rooted species trees ((((a, b), c), d), e) versus ((((a, b), c), e), d). Both species trees have the same unrooted topology but imply different inequalities in some of the unrooted gene tree topology probabilities. For the first species tree, lineage c is more likely to coalesce with d than e; consequently, for the species tree, the unrooted gene tree ((a, b), e, (c, d)) is more probable than the unrooted gene tree ((a, b), d, (c, e)). However, for species tree ((((a, b), c), e), d), these inequalities are reversed. Therefore, observing more unrooted gene trees with topology ((a, b), e, (c, d)) than topology ((a, b), d, (c, e)) gives evidence favoring the first species tree over the second species tree. Interestingly, for distinguishing the two species trees, the frequency of the matching unrooted gene tree ((a, b), c, (d, e)) is not helpful. Instead, it is frequencies of nonmatching unrooted gene trees that are useful for inferring the rooted species tree [26]. This paper examines features of the distribution of unrooted topological gene trees that occur under the multispecies coalescent model on a species tree, for deriving asymptotic (i.e., large numbers of loci) UMDC behavior from unrooted gene trees for four taxa and five taxa.

Results
Let S be a binary, rooted species tree on a taxon leaf set X, and let λ be a list of branch lengths on S measured in coalescent time units. Here, λ i = 1 means that branch i has a length of N e generations, where N e is the effective population size. Let R(X) be the set of all rooted, binary trees for taxon set X and let U(X) be the set of all unrooted, binary trees for the same taxon set. Let T denote a rooted gene tree, and S 0 a candidate species tree. Let α � (T, S 0 ) denote the rooted deep coalescence cost (the minimum number of extra lineages) for a rooted gene tree T and candidate species tree S 0 . For an unrooted tree U with possible rootings T 1 , T 2 , . . ., T k , where k = 2n − 3 and n is the number of taxa, the unrooted coalescence cost is The number of extra lineages at a species boundary is the number of lineages greater than 1 passing from a population to its immediate ancestor. The total number of extra lineages for a gene tree-species tree pair is the sum of extra lineages over the entire species tree (Fig 1).
If G is an observed set of unrooted gene trees, we can think of UMDC as returning the inferred treeŜ that minimizes the average UMDC score: For a given species tree S with branch lengths λ and candidate species tree topology S 0 , we can define the expected UMDC cost as In other words, we interpret α applied to a rooted gene tree as minimizing over all possible rootings of the gene tree. The equivalence is due to the fact that we can compute the probability of an unrooted tree by summing over the probabilities of all possible rootings [29].
We note that this expected value depends on the branch lengths of the species tree S, but that branch lengths for S 0 do not need to be specified since only the topology of S 0 is used (and estimated). If the expected UMDC score is minimized by some S 0 6 ¼ S, then UMDC is inconsistent since, by the Law of Large Numbers, as the number of loci tends to infinity, the UMDC score will be minimized by a tree other than the species tree with probability 1. To show inconsistency, it is sufficient to find a species trees S and S 0 and branch lengths λ such that E[α S,λ (T,

Trees with four leaves
Here are three unrooted, binary trees on four leaves. For the species tree, we can consider the two cases of symmetric or asymmetric binary species trees (Table 1). We indicate the i th distinctunrootedgenetreetopology, i = 1, 2, 3, asT i . The asymmetric species tree is also called a caterpillar, denoted S C , and we denote the balanced tree as S B , using one representative labeling for each case. Thus, the species tree is either (S To see how this implies that UMDC is inconsistent, let the true species tree be S B = (((a, b): x, (c, d):y), then the expected coalescence cost under candidate tree S C = (((a, b), c), d) is (4/3) exp(−(x + y)). Under candidate tree S B , the expected cost is (8/3)exp(−(x + y)). Thus, MDC will always give a lower cost to the tree S C when S B is the species tree. (Similarly, if S C is the Table 1. UMDC for 4-taxon unrooted gene trees. species tree, UMDC will also give a lower cost to S C than to S B , regardless of the data.) Thus, MDC is incapable of returning a balanced tree for this scenario. This means that UMDC is inconsistent on four taxa. We also see that if the candidate species tree is S 0 = (((a, b), d), c) (i.e., swapping taxa c and d in the species tree), then the deep coalescence costs are also 0, 1, and 1 for gene trees U 1 , U 2 , and U 3 , respectively. Thus, regardless of the unrooted gene trees observed, UMDC will give equal scores to S C and S 0 , so that there is no way to choose one versus the other except for an arbitrary (or random) tie-break.
These results suggest that UMDC should not be used on unrooted four-taxon gene tree topologies. However, this is not unreasonable because it has been shown that under the MSC, four-taxon rooted species trees are not identifiable from unrooted gene trees. Thus, identifying the rooted species tree from four-taxon unrooted gene tree topologies would also not be possible using maximum likelihood, for example. However, rooted species trees are identifiable from unrooted five-taxon gene trees, which we examine next.

Trees with five leaves
The 15 binary, unrooted trees with five leaves have only one possible shape, whereas rooted trees on five leaves have three shapes, which we call caterpillar, pseudocaterpillar [30], and balanced. We indicate the i th distinct unrooted gene tree topology, i = 1, . . ., 15, as T i . Although there are 15 possible gene tree topologies, there are 105 rooted species trees possible. An exhaustive approach to UMDC is to compute the UMDC score for all 105 candidate species trees and choose the species tree with the lowest score as the inferred tree.
There are three possible shapes to the rooted species tree when leaf-labels are ignored. The rooted species tree shape is called caterpillar, pseudocaterpillar, or balanced. We use S C , S P and S B , respectively to denote representative trees from each shape: Let D ijk (λ) denote the difference in expected UMDC scores for candidate trees S j and S k when the true species tree is S i . Here i, j, k 2 {C, P, B} to denote caterpillar, pseudocaterpillar, and balanced topologies. Thus, Generally, if D ijk (λ) > 0, then candidate tree S j has higher expected UMDC score than S k when the species tree is S i with branch lengths λ. In particular, if j = i, then D iik > 0 means that UMDC will tend to rank the incorrect candidate tree S k as better than the true tree S i .
The differences in expected values can be obtained from Table 2. To save space, we omit notating the dependence on λ. For example We note that the coefficients of P(U i |S C ) in D CBP are all positive, indicating that the balanced tree has a higher cost than the pseudocaterpillar when the species tree is S C . This holds regardless of the choice of branch lengths λ. Similarly, we see that for any branch lengths λ, The theoretical expected values show that, for example, if the species tree is a caterpillar, then at least one pseudocaterpillar has lower expected deep coalescence cost than the matching caterpillar species tree, and consequently UMDC is not consistent for recovering the true species tree. Similarly, if the species tree is balanced, then at least one pseudocaterpillar tree always has lower expected coalescence cost the true species tree. In both cases, UMDC will be misleading, tending to return an incorrect species tree as more loci are examined. We note that these relationships hold regardless of the branch lengths of the species tree. Remarkably, these inequalities hold not only asymptotically as the UMDC score approaches its expected value, but even for finite numbers of loci (but using non-strict inequalities). For example, let the number of 5-taxon trees in a sample be (n 1 , n 2 , . . ., n 15 ) where the subscript indexes the unrooted topologies from Table 2. If the species tree is S C , then the UMDC score for S 0 C minus Table 2. MDC for 5-taxa unrooted gene trees. ((a, b), d), (c, e))

PLOS ONE
that for S 0 P is S 0 C À S 0 P ¼ n 7 þ n 8 þ n 9 þ n 10 þ n 11 þ n 14 þ n 15 � 0 Since this is always greater than or equal to 0, the matching tree can never have better deep coalescence cost than S 0 P (if none of these 7 topologies are observed, the UMDC scores will be tied for these two candidate species trees). If the species tree is S B , then the situation is even worse, with S 0 B À S 0 P ¼ n 2 þ n 3 þ n 5 þ n 6 þ 2n 7 þ 2n 8 þ n 9 þ 2n 10 þ 2n 11 þ n 12 þ n 13 þ 2n 14 þ 2n 15 � 0: This shows that MDC is misleading for five taxa if the species tree does not have pseudocaterpillar topology. These examples are sufficient to show that UMDC is inconsistent for trees with five taxa. However, to better understand the behavior of UMDC, it is helpful to also understand the MDC costs for other true species tree and candidate species tree combinations (S1 and S2 Tables in S1 Appendix). An interesting question here is whether UMDC will tend to perform well within a particular unlabeled shape. For example, if it is known (or believed) that the species tree has a particular shape (for example, a caterpillar), will UMDC pick the correct species tree if it restricted to the correct unlabeled shape? This situation can arise in particular if alternative rootings lead to two candidate trees that have the same shape. In this case, we can examine in which cases UMDC might or might not be misleading. A potential use here is that UMDC is used to generate candidate species trees; it could return the best species tree for each tree shape, and more computationally intensive methods could then use these as starting trees.
From S1 and S2 Tables in S1 Appendix, we see that UMDC cannot distinguish two caterpillar trees with the outgroup swapped with the taxon that is an outgroup to all other ingroup taxa, for example species trees (((a, b), c), d), e) and ((((a, b), c), e), d) will have exactly the same UMDC score for any data set. To check if, for example, these are expected to be the best scoring caterpillar trees when the species tree is S C , we can use expected values. For example, to compare the expected score for S 0 1 ¼ ðððða; bÞ; cÞ; dÞ; eÞ with S 0 3 ¼ ðððða; bÞ; dÞ; cÞ; eÞ, we note that the MDC cost is the same for these candidate species trees for unrooted gene trees U 3 , U 6 , U 9 , U 10 , and U 15 . The difference in expected values therefore depends on the other 10 trees. Collecting terms, the difference is Because f(X, Y, Z) is decreasing in Z, and therefore f(X, Y, Z) < f(X, Y, 0) for all Z, a sufficient condition to show that that S C has lower expected score than S 0 3 is that Note that the expression is equivalent to Because −1 + Y < 0 and the term in parenthesis is positive, this shows that the difference in expected UMDC costs is negative; hence S C is expected to be preferred over S 0 3 when S C is the species tree. Similar arguments can be made, although tediously, to show that S C has the lowest expected UMDC score (although tied with S 0 2 ) among all caterpillar candidate species trees. Because of the ties in the UMDC costs for pairs of candidate caterpillar species trees, this requires evaluating 29 such inequalities.

Simulation
To illustrate the theoretical results, we simulated gene trees from the species trees S C , S P , and S B using hybrid-Lambda [31], where all internal branch lengths were 1.0 coalescent units, which allows a moderate amount of incomplete lineage sorting. The species trees were These species trees were also repeated with each branch length multiplied by 0.1 to investigate the effect of shorter branch lengths. For each species tree, independent simulations were run with 50, 100, 200, 400, and 800 loci. For each combination of species tree and number of loci, a set of gene trees was simulated using hybrid-Lambda [31]. Species trees were estimated directly from these known gene trees, and a second set of simulations was done using estimated gene trees. To estimate gene trees, DNA sequences were simulated from the gene trees using seq-gen version 1.3.2x [32] with 500 nucleotides per locus and base frequencies of 0.3, 0.2, 0.2, and 0.3 for nucleotides A, C, G, and T, respectively with a mutation rate of θ = 0.01 under a GTR + Γ + I model with four variable rates and 10% invariable sites. Gene trees were then estimated as unrooted using phyml version 20120412 [33] under the correct model. Each of these settings was repeated 100 times, and the proportion of times various tree topologies were inferred was recorded. The UMDC tree was obtained from phylonet using the command Infer_ST_MDC_UR. In case of a tie between highest scoring species trees, a tree was picked uniformly at random as the species tree estimate.
As predicted by the theory, regardless of the species tree, the pseudocaterpillar shape tends to be the tree inferred, with probability approaching 1.0 as the sample size (number of loci) increases (Fig 2). In cases where a non-pseudocaterpillar was inferred, this was due to randomly picking one of the trees tied for best UMDC score. Ties for the best MDC score were less likely with larger sample sizes, with 70% of cases (out of 500) with 50 loci having three trees tied for best when the species tree was S C , and 1.2% of cases being tied with 800 loci for S C . Results were similar for S B , but there were much fewer ties for best tree for the pseudocaterpillar species tree, S P . For S P , 7% of cases had a tree tied for best with 50 loci, and the best tree was always unique with 100 or more loci.
We investigated the effects of using shorter branch lengths (multiplying each branch length by 0.1) and of estimating gene trees from DNA sequences to investigate the effects of both greater gene tree heterogeneity due to shorter branches and gene tree estimation error. For the caterpillar species tree, there was very little effect of either gene tree estimation error or shortening species tree branches (Fig 2). For all species trees, estimation error had little effect on the inference. When branches were multiplied by 0.1, convergence to the incorrect species tree was more rapid for the balanced tree, while convergence to the correct tree was slightly slower when the species tree was a pseudocaterpillar.

Discussion
In this paper, we used a method of considering expected values of scores to show properties of the unrooted version of MDC. This approach of using expected values has also been used to show inconsistency of the original MDC criterion [22] and matrix representation with parsimony [21]. Although we only showed inconsistency for four-and five-taxon trees, the results apply straightforwardly to larger trees. For example, "caterpillarization" is a technique of making some branches long enough while keeping others short that the distribution of gene trees resembles the the distribution found from a caterpillar species tree, and all the results would apply to these larger trees as well lemmas 3 and 5 in [34]. This means that for larger species trees, there exist branch lengths for which UMDC will be misleading.  (((a,  b), c), (d, e)) was inferred. "BL 1.0" means that internal branches, as well as pendant branches leading to taxa a and b had length 1.0, while remaining pendant edges had lengths needed to make the trees ultrametric. "BL 0.1" had all branch lengths multiplied by 0.1 in the species tree. The terms "true" and "est" refer to whether gene trees were estimated from simulated DNA sequences or the actual simulated gene trees were used. https://doi.org/10.1371/journal.pone.0251107.g002 To explain this idea in more detail, suppose a six-taxon species tree has topology ((((a, b), c), d), (e, f)). If the branch leading from the most recent common ancestor (MRCA) of e and f to the root is long, then lineages sampled from e and f will almost certainly coalesce more recently than the root of the tree. Consequently, almost all gene trees from this species tree will have (e, f) as a cluster, and the species tree will have very similar properties as a species tree ((((a, b), c), d), x) with x replaced by (e, f). Consequently, if this branch is sufficiently long, unrooted gene tree distributions on taxa a-f are concentrated on just the 15 unrooted topologies that occur on five-taxon trees, and we can predict what species tree inference methods will do based on their behavior on 5-taxon trees. Another example is the 6-taxon pseudocaterpillar species tree (((a, b), (c, d), e), f). This tree can be caterpillarized by letting the branch leading from the MRCA of c and d to the MRCA of a, b, c, and d be sufficiently long. Replacing (c, d) with x in this tree, it resembles a five-taxon caterpillar ((((a, b), x), e), f) for which we can expect UMDC to prefer tree (((a, b), (e, f)), x) = ((a, b), ((c, d), (e, f)). This approach is sufficient to show that any tree that can be caterpillarized to a five-taxon caterpillar tree can be misleading in the ways shown in this paper given certain branch lengths. A more detailed proof for this particular six-taxon tree would show that of the 105 6-taxon topologies, for any given � > 0, branch lengths in the species tree can be chosen such that the probability is greater than 1 − � that the probability that the gene tree is one of the 15 trees concentrated on taxa a, b, x, e, and f.
Similarly, many species trees with more than five taxa can mimic the behavior of the 5-taxon balanced tree given certain branch lengths. For example, the species tree (((a, b), c), (d, (e, f))) will mimic the 5-taxon balanced tree in this paper if the branch leading from the MRCA of e and f to the MRCA of d, e, and f is sufficiently long. A general proof of inconsistency for UMDC for trees with 6-8 taxa would examine some special cases like these and show that there are branch lengths which could make a tree mimic the 5-taxon caterpillar or balanced trees. Trees with 9 or more taxa can always be caterpillarized to a 5-taxon caterpillar [21,34] Although the results are negative, MDC and UMDC can still be used to quickly generate starting trees when searching for species trees using other methods. An advantage of MDC and UMDC here is that they can rank trees by score, and therefore return suboptimal trees. Since MDC has a shape bias, tending to make it return more balanced trees [35], it is not surprising that there is a shape bias for UMDC as well. The bias for UMDC is surprisingly extreme however in that UMDC always prefers certain shapes (the pseudocaterpillar for five taxa) for any data. The results of this paper suggest that due to the shape bias of UMDC, a preferable method to generating starting trees might be to return the set of optimal trees within each unlabeled shape. The impact that this could have on species tree inference we leave to future work.