Abstract
For each given pair of (rooted or unrooted) topological trees with the same number of leaves, a strict upper bound is shown for the tree partition distance (also called symmetric difference metric and Robinson-Foulds distance) in the case of unrooted trees, and for the cluster distance (also called Robinson-Foulds distance) in the case of rooted trees, of corresponding phylogenetic trees. In particular, it is shown that there exist assignments of labels (e.g., species) to the leaves of both topological trees, where each label is assigned to exactly one leaf in each tree, such that: i) in the unrooted case, the tree partition distance between the corresponding phylogenetic trees equals the number of internal edges in both trees minus the number of nodes with degree 2 in both trees, ii) in the rooted case, the cluster distance between the corresponding phylogenetic trees equals the number of internal edges in both trees minus the number of nodes with degree 2 in both trees, and iii) the values in (i) and (ii) are also the maximum values with respect to all possible assignments. The shown strict worst case bounds are needed as normalization factors to compute normalized versions of the respective tree metrics.
Citation: Middendorf M, Wieseke N (2018) A strict upper bound for the partition distance and the cluster distance of phylogenetic trees for each fixed pair of topological trees. PLoS ONE 13(9): e0204907. https://doi.org/10.1371/journal.pone.0204907
Editor: Arndt von Haeseler, Max F Perutz Laboratories GmbH, AUSTRIA
Received: July 5, 2018; Accepted: September 10, 2018; Published: September 28, 2018
Copyright: © 2018 Middendorf, Wieseke. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the paper.
Funding: The authors received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Tree distance measures are used in many applications to compare trees. The most widely used difference measure between (unrooted) phylogenetic trees (i.e., trees where the leaves are labelled with species) is the tree partition distance (TPD). The TPD was introduced by Bourque [1] and is also called symmetric difference metric or Robinson-Foulds metric [2] (note that the latter name has also been used for another tree metric that was proposed in [3]). The TPD measures the size of the symmetric difference between the two sets of bipartitions (one set for each tree) of all labels that are obtained when the following is done for each inner edge of a tree: the edge is removed and, for each of the two emerging connected components, the set of labels that are assigned to its leaves forms one set of the bipartition. Several authors have studied the TPD (e.g., [4–7]).
For comparing the distances between pairs of phylogenetic trees of different sizes, normalized versions of tree distance measures are used. The most often used normalization factors are strict worst case bounds. Normalized versions of the TPD have been used in several evolutionary studies (e.g., [8, 9]). To define them let d(T1, T2) be the TPD between two phylogenetic trees T1 and T2 which have the same number of leaves. One normalized version of the TPD (used, e.g., in [9, 10]) is then to divide the value d(T1, T2) by the maximum TPD for phylogenetic trees with the same number of leaves as T1 and T2. It is easy to show that the maximum TPD for trees with n leaves is 2n − 6 (e.g., [6]). Instead of using a strict worst case bound as normalization factor, another possibility is to use the average TPD over pairs of trees with size n. Specifically, in [11] the normalized TPD of trees T1 and T2 is computed as (drand − d(T1, T2))/drand where drand is the average TPD computed empirically over 1000 pairs of random trees of size n (in [11] the trees were generated with the Yule model according to a general proposal for computing normalized tree distance metrics from [12]). It should be mentioned that sometimes TPD/2 is also called normalized Robinson-Foulds distance (e.g., in [13]). Observe that in all these definitions of normalized versions of the TPD the normalization factor does not depend on the topologies of the phylogenetic trees T1 and T2.
In [8] it was argued, however, that the normalization factor should consider the topologies of the phylogenetic trees since not all possible phylogenetic trees are biologically relevant. Thus, a new variant of normalized TPD was proposed in [8]. In this version, denoted nTPD, the value d(T1, T2) is divided by the worst case TPD between any two phylogenetic trees that have the same topology as T1 and T2 (but a possibly different assignment of labels to the leaves). Let w(T1, T2) denote this worst case value. Clearly, w(T1, T2) ≤ 2n − 6 holds. Unfortunately, and to the best of our knowledge, no explicit formula for w(T1, T2) is known, and it is not feasible for larger n to check all possible assignments of labels to the leaves of trees with the same topology as T1 and T2. Therefore, the nTPD is only approximated in the NELSI [14] R package (function dist.topo.normalised). This is done as follows. First, the TPD is computed for several pairs of phylogenetic trees that have been obtained from the two given topological trees by randomizing the assignment of labels to the leaves for one of the trees. Then the maximum TPD over all randomized pairs of phylogenetic trees is used as an approximation for the worst case TPD. In [8] the approximated nTPD was used to compare co-phylogenetic systems of different sizes and it was argued that the maximum over 1000 randomized pairs should give a good approximation. In particular, it was shown empirically for pairs of trees with sizes up to n = 142 that the maximum over 1000 randomizations is stable (i.e., fewer randomizations had already given the same maximum value).
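The randomization idea described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the NELSI implementation: the function names `bipartitions` and `approx_worst_case_tpd`, the edge-list tree encoding, and the fixed random seed are assumptions made for this example.

```python
import random
from collections import defaultdict

def bipartitions(edges, leaf_label):
    """Non-trivial bipartitions induced by the edges of a leaf-labelled tree."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    everything = frozenset(leaf_label.values())
    parts = set()
    for u, v in edges:
        # Collect the labels on the u-side of the removed edge {u, v}.
        seen, stack, side = {u, v}, [u], set()
        while stack:
            node = stack.pop()
            if node in leaf_label:
                side.add(leaf_label[node])
            for nxt in adj[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        rest = everything - side
        if len(side) >= 2 and len(rest) >= 2:
            parts.add(frozenset((frozenset(side), rest)))
    return parts

def approx_worst_case_tpd(edges1, leaves1, edges2, leaves2, trials=1000, seed=0):
    """Maximum TPD seen when the labels of the second tree are shuffled `trials` times."""
    rng = random.Random(seed)  # assumption: fixed seed for reproducibility
    labels = list(range(1, len(leaves1) + 1))
    fixed = bipartitions(edges1, dict(zip(leaves1, labels)))
    best = 0
    for _ in range(trials):
        rng.shuffle(labels)
        shuffled = bipartitions(edges2, dict(zip(leaves2, labels)))
        best = max(best, len(fixed ^ shuffled))
    return best

# Hypothetical example topologies with four leaves each.
leaves = ["l1", "l2", "l3", "l4"]
star = [("c", leaf) for leaf in leaves]
quartet = [("a", "l1"), ("a", "l2"), ("a", "b"), ("b", "l3"), ("b", "l4")]
estimate = approx_worst_case_tpd(star, leaves, quartet, leaves, trials=10)
```

Such an estimate is a lower bound on w(T1, T2) and can miss the true maximum when few labelings attain it.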
In this paper we present an explicit formula to compute w(T1, T2). It should be mentioned that an O(n^5) time method is given in [4] to compute, for a fully resolved (i.e., each inner node has degree three) phylogenetic tree T with n leaves and for all values m ∈ [0: 2n − 6], the number of phylogenetic trees that have TPD m to T. By taking the maximum value m for which this number is not zero one obtains the worst case TPD for T (with respect to all other fully resolved trees with n leaves). But this is different from the worst case bound shown in this paper where both topological trees are fixed. Moreover, we also consider the case of trees that are not fully resolved.
In addition to the case of unrooted phylogenetic trees we also consider the case of rooted phylogenetic trees. In the rooted case the corresponding distance measure is called cluster distance (CD) or Robinson-Foulds distance for rooted trees. Recently a Robinson-Foulds metric has been proposed in [15] to compare an unrooted phylogenetic tree T1 with a rooted phylogenetic tree T2 when both trees have n leaves. The idea of this measure is to root the unrooted tree T1 optimally in the sense that the obtained rooted tree has a minimal CD to T2. Similarly as for the TPD, we show a strict worst case bound for the CD and for the Robinson-Foulds distance between an unrooted phylogenetic tree and a rooted phylogenetic tree. Our results imply that the corresponding normalized distances can be computed efficiently.
Basic definitions
An (unrooted) tree is a connected graph T = (V, E) with n = |V| nodes and n − 1 = |E| edges. A rooted tree is a tree which has one distinguished node with degree ≥ 2 that is called root. For a tree T = (V, E) a node v ∈ V is connected to a node w ∈ V when {v, w} ∈ E. A leaf is a node in a (rooted or unrooted) tree with degree one, a leaf-edge is an edge that is incident to a leaf and all other edges are internal edges. For an unrooted tree T a node that is not a leaf is an internal node. For a rooted tree T a node that is neither a leaf nor the root is an internal node. A proper (rooted or unrooted) tree is a tree T where each internal node has degree ≥ 3. Let $\mathcal{T}_{n,m}$ ($\mathcal{T}^{r}_{n,m}$) be the set of all unrooted (respectively, rooted) trees with n leaves and m internal edges, n ≥ 3. For a proper unrooted tree $T \in \mathcal{T}_{n,m}$ it holds that 0 ≤ m ≤ n − 3 and for a proper rooted tree $T \in \mathcal{T}^{r}_{n,m}$ it holds that 0 ≤ m ≤ n − 2.
Let L be a set of n labels. In this paper we assume w.l.o.g. L = {1, 2, …, n}. A phylogenetic tree on L is a tree T with n leaves and where each leaf is labelled with exactly one element from L such that for each label l ∈ L there exists a leaf with label l. For a phylogenetic tree the underlying tree T is also called the topological tree, i.e., the topological tree is the phylogenetic tree ignoring the labels of the leaves. A rooted or unrooted phylogenetic tree is proper if its corresponding rooted, respectively, unrooted topological tree is proper. If the context is clear notation T is used for a phylogenetic tree and also for its corresponding topological tree.
The removal of an edge e from an unrooted phylogenetic tree T on L induces a two set partition of L—denoted by π(T, e)—where each set of the partition corresponds to the labels of all nodes of one of the two connected components of T − e. Observe that for each leaf-edge e of T one set of the partition π(T, e) is a singleton that contains the label of the corresponding leaf. A two set partition of {1, 2, …, n} where one set is a singleton is called trivial partition. For each internal edge e of a proper unrooted phylogenetic tree T each set of the partition π(T, e) has at least two elements. Let P(T) be the set of all non trivial two set partitions of an unrooted phylogenetic tree T. For two unrooted phylogenetic trees T1 and T2 with n leaves the tree partition distance (TPD) between T1 and T2—denoted by d(T1, T2)—is the size of the symmetric difference between P(T1) and P(T2), i.e., d(T1, T2) = |P(T1) ∪ P(T2) − (P(T1) ∩ P(T2))|.
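As a small illustration of this definition, the following Python sketch computes P(T) by removing each edge in turn and then takes the symmetric difference. The edge-list encoding, the node names, and the function names `bipartitions` and `tpd` are assumptions made for this example only.

```python
from collections import defaultdict

def bipartitions(edges, leaf_label):
    """Return P(T): the set of non-trivial two set partitions of the labels.

    edges: list of (u, v) node pairs; leaf_label: dict mapping leaf node -> label.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    all_labels = frozenset(leaf_label.values())

    def labels_behind(start, blocked):
        # Labels of the leaves reachable from `start` without passing `blocked`.
        seen, stack, found = {start, blocked}, [start], set()
        while stack:
            node = stack.pop()
            if node in leaf_label:
                found.add(leaf_label[node])
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return frozenset(found)

    parts = set()
    for u, v in edges:
        side = labels_behind(u, v)
        other = all_labels - side
        if len(side) >= 2 and len(other) >= 2:  # skip trivial partitions
            parts.add(frozenset((side, other)))
    return parts

def tpd(edges1, labels1, edges2, labels2):
    """Tree partition distance: size of the symmetric difference of P(T1) and P(T2)."""
    return len(bipartitions(edges1, labels1) ^ bipartitions(edges2, labels2))

# Two labelings of the same four-leaf topology: 12|34 versus 13|24.
quartet = [("a", "l1"), ("a", "l2"), ("a", "b"), ("b", "l3"), ("b", "l4")]
labels_a = {"l1": 1, "l2": 2, "l3": 3, "l4": 4}
labels_b = {"l1": 1, "l2": 3, "l3": 2, "l4": 4}
print(tpd(quartet, labels_a, quartet, labels_b))  # prints 2
```

Because the partitions are stored in a set, the two edges incident to a node of degree 2 contribute only one partition, in line with the remark on non-proper trees made later in the paper.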
For each node v of a rooted phylogenetic tree T let T(v) be the subtree of T with root v and let cl(T, v) be the subset of L that contains all labels of the leaves of T(v). Set cl(T, v) is called the cluster of v. Observe that for each leaf v the cluster cl(T, v) is a singleton that contains the label of v. If v is the root of T then cl(T, v) = L. A cluster that is a singleton or equals L is called a trivial cluster. For each internal node v of a proper rooted phylogenetic tree T the cluster of v has at least two elements. Let Cl(T) be the set of all non trivial clusters of a rooted phylogenetic tree T. For two rooted phylogenetic trees T1 and T2 with n leaves the cluster distance (CD) between T1 and T2—denoted by dr(T1, T2)—is the size of the symmetric difference between Cl(T1) and Cl(T2), i.e., dr(T1, T2) = |Cl(T1) ∪ Cl(T2) − (Cl(T1) ∩ Cl(T2))|.
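The cluster distance can be sketched analogously. In the following Python example a rooted tree is encoded, as an assumption for illustration, by a dictionary that maps each non-leaf node to its children; the function names `clusters` and `cluster_distance` are hypothetical.

```python
def clusters(children, root, leaf_label):
    """Return Cl(T): the set of non-trivial clusters of a rooted phylogenetic tree."""
    n = len(leaf_label)
    result = set()

    def collect(node):
        if node in leaf_label:
            return frozenset([leaf_label[node]])
        cl = frozenset().union(*(collect(child) for child in children[node]))
        if 1 < len(cl) < n:  # drop singletons and the full label set L
            result.add(cl)
        return cl

    collect(root)
    return result

def cluster_distance(children1, root1, labels1, children2, root2, labels2):
    """Cluster distance: size of the symmetric difference of Cl(T1) and Cl(T2)."""
    return len(clusters(children1, root1, labels1) ^ clusters(children2, root2, labels2))

# Three-leaf tree with one internal node "a" carrying two leaves.
tree = {"r": ["a", "l3"], "a": ["l1", "l2"]}
labels_a = {"l1": 1, "l2": 2, "l3": 3}  # non-trivial cluster {1, 2}
labels_b = {"l1": 1, "l2": 3, "l3": 2}  # non-trivial cluster {1, 3}
print(cluster_distance(tree, "r", labels_a, tree, "r", labels_b))  # prints 2
```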
Let T be an unrooted phylogenetic tree. Let E(T) be the edge set of T. A rooting of T is defined by choosing an edge e = {n, n′} of T on which the root is to be placed, i.e., the edge is removed from T, a new node n″ that is the root is added to T, and n″ is connected to n and to n′. The obtained rooted tree is denoted by T_e. According to [15], for an unrooted phylogenetic tree T1 and a rooted phylogenetic tree T2, both with n leaves, the unrooted cluster distance (urCD) between T1 and T2, denoted by dur(T1, T2), is defined as the minimum over all edges e ∈ E(T1) of |Cl(T1,e) ∪ Cl(T2) − (Cl(T1,e) ∩ Cl(T2))|, where T1,e denotes the tree T1 rooted on edge e.
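A direct, unoptimized reading of this definition is sketched below in Python: every edge of T1 is tried as the root position and the smallest cluster distance to T2 is kept. The encoding (edge lists for both trees plus a root node for T2) and the names `clusters_below` and `unrooted_cluster_distance` are assumptions made for this example; it is not the algorithm of [15].

```python
from collections import defaultdict

def adjacency(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def clusters_below(adj, starts, leaf_label):
    """Non-trivial clusters of the subtrees hanging below the given (node, parent) pairs."""
    n = len(leaf_label)
    found = set()

    def collect(node, parent):
        if node in leaf_label:
            return frozenset([leaf_label[node]])
        cl = frozenset().union(*(collect(c, node) for c in adj[node] if c != parent))
        if 1 < len(cl) < n:
            found.add(cl)
        return cl

    for node, parent in starts:
        collect(node, parent)
    return found

def unrooted_cluster_distance(edges1, labels1, edges2, root2, labels2):
    """urCD: minimum cluster distance to T2 over all rootings of T1 on an edge."""
    adj1, adj2 = adjacency(edges1), adjacency(edges2)
    target = clusters_below(adj2, [(root2, None)], labels2)
    # Rooting on edge {a, b}: the new root has exactly the two children a and b.
    return min(len(clusters_below(adj1, [(a, b), (b, a)], labels1) ^ target)
               for a, b in edges1)

# Unrooted quartet 12|34 and a rooted caterpillar with four leaves.
quartet = [("a", "l1"), ("a", "l2"), ("a", "b"), ("b", "l3"), ("b", "l4")]
quartet_labels = {"l1": 1, "l2": 2, "l3": 3, "l4": 4}
caterpillar = [("r", "x"), ("r", "t4"), ("x", "y"), ("x", "t3"), ("y", "t1"), ("y", "t2")]
same = {"t1": 1, "t2": 2, "t3": 3, "t4": 4}     # clusters {1,2}, {1,2,3}
shifted = {"t1": 1, "t2": 3, "t3": 2, "t4": 4}  # clusters {1,3}, {1,2,3}
print(unrooted_cluster_distance(quartet, quartet_labels, caterpillar, "r", same))     # prints 0
print(unrooted_cluster_distance(quartet, quartet_labels, caterpillar, "r", shifted))  # prints 2
```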
Results and discussion
For each two proper unrooted topological trees T1 and T2 with n leaves, where Ti has mi internal edges, i ∈ {1, 2}, the following upper bound on the TPD of two corresponding phylogenetic trees holds: d(T1, T2) ≤ m1 + m2. This is clear because each internal edge can lead to at most one two set partition. If both trees are not necessarily proper and Ti, i ∈ {1, 2}, has ki nodes with degree 2 then d(T1, T2) ≤ m1 − k1 + m2 − k2 holds. This result follows from the upper bound for proper trees and the fact that both edges that are incident to a node of degree 2 lead to the same two set partition of L when they are removed. Similarly, for two rooted topological trees T1 and T2 with n leaves, where Ti has mi internal edges and ki is the number of internal nodes of degree 2 in Ti, i ∈ {1, 2}, the following upper bound on the CD of two corresponding phylogenetic trees holds: dr(T1, T2) ≤ m1 − k1 + m2 − k2. In the rest of this section we show that these upper bounds are all strict for each two unrooted (respectively rooted) topological trees T1 and T2 in the sense that there exist labelings of the leaves of T1 and T2 such that the TPD (respectively CD) of the corresponding phylogenetic trees equals the upper bound.
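For unrooted trees the quantities in this bound can be read off the degrees of the nodes. The following Python sketch assumes an edge-list encoding; the function names `internal_edges_and_degree_two` and `tpd_upper_bound` are hypothetical.

```python
from collections import Counter

def internal_edges_and_degree_two(edges):
    """Return (m, k) for an unrooted tree: internal edges and nodes of degree 2."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    # An edge is internal when neither endpoint is a leaf (degree one).
    m = sum(1 for u, v in edges if degree[u] > 1 and degree[v] > 1)
    k = sum(1 for d in degree.values() if d == 2)
    return m, k

def tpd_upper_bound(edges1, edges2):
    """Upper bound m1 - k1 + m2 - k2 on the TPD for two unrooted topologies."""
    m1, k1 = internal_edges_and_degree_two(edges1)
    m2, k2 = internal_edges_and_degree_two(edges2)
    return m1 - k1 + m2 - k2

# A binary tree with five leaves, and the same tree with edge {a, b} subdivided by node d.
caterpillar = [("a", "l1"), ("a", "l2"), ("a", "b"), ("b", "l3"),
               ("b", "c"), ("c", "l4"), ("c", "l5")]
subdivided = [("a", "l1"), ("a", "l2"), ("a", "d"), ("d", "b"), ("b", "l3"),
              ("b", "c"), ("c", "l4"), ("c", "l5")]
print(internal_edges_and_degree_two(caterpillar))  # prints (2, 0)
print(internal_edges_and_degree_two(subdivided))   # prints (3, 1)
print(tpd_upper_bound(caterpillar, subdivided))    # prints 4
```

The subdivision adds one internal edge and one node of degree 2, so m − k is unchanged.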
Theorem 1. For each two proper unrooted topological trees T1 and T2 with n leaves and with m1 and m2 internal edges, respectively, where n ≥ 3, 0 ≤ m1 ≤ n − 3, and 0 ≤ m2 ≤ n − 3, there exist {1, 2, …, n}-labelings of the leaves of both trees such that d(T1, T2) = m1 + m2.
Proof. For the proof we show that there exist labelings for the leaves of T1 and T2 such that for each internal edge e1 of T1 the partition π(T1, e1) is not contained in P(T2) and for each internal edge e2 of T2 the partition π(T2, e2) is not contained in P(T1). The proof is done by induction on n.
Base case n = 3. Since each internal node has degree three there exists exactly one internal node in each of the trees T1 and T2. Thus, m1 = m2 = 0 holds and the result follows immediately.
For the inductive step consider two trees T1 and T2 with n ≥ 4 leaves and the following three cases:
- Case (1): (At least) one of the trees has only one internal node and therefore has no internal edge.
- Case (2): Both trees have (at least) one internal edge and (at least) one internal node that is connected to at least 3 leaves.
- Case (3): Both trees have (at least) one internal edge and for (at least) one of the trees each internal node is connected to at most two leaves.
Proof for case (1). W.l.o.g. let T1 be a tree that has only one internal node. Then, for any labelings of the leaves P(T1) = ∅ and it is clear that for each internal edge e of T2 the partition π(T2, e) is not in P(T1) and the theorem holds.
Proof for case (2). Let u (x) be an internal node in T1 (respectively T2) that is connected to at least 3 leaves. From each tree remove one of the leaves connected to u, respectively x. This does not change the number of internal edges, and the resulting trees T1′ and T2′ are both proper and have n − 1 leaves and m1 (respectively m2) internal edges. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for T1′ and T2′ such that d(T1′, T2′) = m1 + m2. Extend these labelings to {1, 2, …, n}-labelings for T1 and T2 by assigning the label n to both removed leaves. Clearly, then d(T1, T2) = m1 + m2 because the number of internal edges has not changed and the bipartitions of T1 (T2) are obtained from the bipartitions of T1′ (respectively T2′) by adding n to one of the sets of each bipartition. Hence, bipartitions of T1′ and T2′ that are different are also different after adding element n.
It remains to prove case (3). W.l.o.g. assume that in T1 each internal node is connected to at most two leaves. Since T1 is a proper tree there exists an internal node u of T1 that is neighbour to exactly two leaves u1 and u2. Since n ≥ 4 node u has a neighbour v that is an internal node. Thus, {u, v} is an internal edge. Consider the tree T1′ that is obtained from T1 by removing nodes u1, u and edges {u1, u}, {u, v} and by connecting node u2 to v. Then T1′ is a proper tree with n − 1 leaves and m1 − 1 internal edges.
First assume that T2 has a node x that is connected to at least three leaves x1, x2, and x3. Construct T2′ by removing leaf x1. Then, T2′ is a proper tree with n − 1 leaves and m2 internal edges. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for the reduced trees T1′ and T2′ such that d(T1′, T2′) = m1 − 1 + m2. Extend these labelings to {1, 2, …, n}-labelings for T1 and T2 by assigning leaves u1 and x1 the label n. Clearly, no bipartition of P(T2) is equal to a bipartition π(T1, e) where e is an internal edge with e ≠ {u, v}. Also, π(T1, {u, v}) is not in P(T2) because one set of π(T1, {u, v}) equals {n, i} for an i ∈ [1, n − 1] and all sets of bipartitions in P(T2) that include n have at least three elements. Thus, d(T1, T2) = m1 + m2 holds.
It remains to consider the case that all nodes in T2 are connected to at most two leaves. Then there must exist an internal node x in T2 that is connected to two leaves x1 and x2. Clearly, x is connected to an internal edge {x, y}. Similarly as for T1, create a tree T2′ by removing nodes x1, x and edges {x1, x}, {x, y} from T2 and by connecting x2 to y. By construction, T2′ has n − 1 leaves and m2 − 1 internal edges. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for the reduced trees T1′ and T2′ such that d(T1′, T2′) = m1 − 1 + m2 − 1. Now consider four subcases.
Case a: Both nodes v and y have at least 2 neighbouring leaves in T1′ respectively T2′. Then, we can assume w.l.o.g. that leaf u2 has label i in T1′ and leaf x2 has label j in T2′ with j ≠ i (because otherwise the label of u2 can be exchanged with the label of another leaf that is connected to v). Now, extend the labelings to {1, 2, …, n}-labelings for T1 and T2 by assigning both leaves u1 and x1 label n. Clearly, partition π(T1, {u, v}) is not in P(T2) and partition π(T2, {x, y}) is not in P(T1). Hence, it is not hard to see that d(T1, T2) = m1 + m2 holds.
Case b: Node v has at least two neighbouring leaves (one leaf is u2 and let v1 be the other leaf) in T1′ and node y has only one neighbouring leaf (i.e., leaf x2) in T2′. Assume that j is the label of x2. Then at least one of u2 and v1 has a label i ≠ j. Assume first that node u2 has label i and extend the labelings of T1′ and T2′ to {1, 2, …, n}-labelings for T1 and T2 by assigning label n to leaves u1 and x1. Then the partition π(T1, {u, v}) is not in P(T2) and the partition π(T2, {x, y}) is not in P(T1). It follows easily that d(T1, T2) = m1 + m2 holds. It remains to consider the case that node v1 has label i. Then exchange the labels of u2 and v1 in T1′. Clearly, for this labeling d(T1′, T2′) = m1 − 1 + m2 − 1 still holds. Now, proceed as before to show the result.
Case c: Node v has only one neighbouring leaf in T1′ and node y has at least 2 neighbouring leaves in T2′. This case is symmetric to Case (b) and the proof is analogous.
Case d: Nodes v and y have only one neighbouring leaf in T1′ respectively T2′. For the chosen {1, 2, …, n − 1}-labelings for T1′ and T2′ with d(T1′, T2′) = m1 − 1 + m2 − 1 let i and j be the labels of u2 respectively x2. Note that i = j is possible. If i ≠ j extend the labelings to {1, 2, …, n}-labelings for T1 and T2 by assigning label n to leaves u1 and x1. Then, partition π(T1, {u, v}) is not in P(T2) and partition π(T2, {x, y}) is not in P(T1). Now, it is not hard to show that d(T1, T2) = m1 + m2 holds. It remains to consider the case i = j. Let k ∈ {1, 2, …, n − 1} be a label with k ≠ i. First, extend the labelings of T1′ and T2′ to {1, 2, …, n}-labelings for T1 and T2 by assigning label n to leaves u1 and x1. Then change the labeling for T1 by exchanging the labels of the leaves with labels k and n. Now, partition π(T1, {u, v}) has one set {k, i} and is therefore not in P(T2). Similarly, partition π(T2, {x, y}) has one set {n, i} and is therefore not in P(T1). To show that no other partition of P(T1) can be in P(T2) and vice versa, assume the contrary, i.e., assume P(T1) ∩ P(T2) contains a partition other than these two. Let π(T1, e1) = π(T2, e2) be such a partition, with e1 ≠ {u, v} and e2 ≠ {x, y}. By construction i and k must be in the same set of the partition because the corresponding leaves are connected to the same internal node in T1. Similarly, by construction i and n must be in the same set of the partition because the corresponding leaves are connected to the same internal node in T2. Hence i, k, and n are all in the same set of the partition. Removing n from this set yields a partition that belongs to both P(T1′) and P(T2′) under the labelings chosen by the induction hypothesis. Altogether it follows that P(T1′) ∩ P(T2′) ≠ ∅, which contradicts the induction hypothesis that d(T1′, T2′) = m1 − 1 + m2 − 1.
A special case of the theorem is when both trees T1 and T2 are binary trees, i.e., both T1 and T2 have n − 3 internal edges. In this case there exist {1, 2, …, n}-labelings of the leaves such that d(T1, T2) = 2n − 6 for n ≥ 3.
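For very small n the statement of Theorem 1 can be checked by exhaustive search over all labelings. The following Python sketch does this for trees with five leaves; it is only a sanity check under an assumed edge-list encoding, with hypothetical function names `bipartitions` and `worst_case_tpd`, and it is feasible only for small n because it enumerates n! labelings.

```python
from collections import defaultdict
from itertools import permutations

def bipartitions(edges, leaf_label):
    """Non-trivial bipartitions induced by the edges of a leaf-labelled tree."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    everything = frozenset(leaf_label.values())
    parts = set()
    for u, v in edges:
        # Collect the labels on the u-side of the removed edge {u, v}.
        seen, stack, side = {u, v}, [u], set()
        while stack:
            node = stack.pop()
            if node in leaf_label:
                side.add(leaf_label[node])
            for nxt in adj[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        rest = everything - side
        if len(side) >= 2 and len(rest) >= 2:
            parts.add(frozenset((frozenset(side), rest)))
    return parts

def worst_case_tpd(edges1, leaves1, edges2, leaves2):
    """Maximum TPD over all labelings; the labeling of the first tree is fixed w.l.o.g."""
    n = len(leaves1)
    fixed = bipartitions(edges1, dict(zip(leaves1, range(1, n + 1))))
    return max(len(fixed ^ bipartitions(edges2, dict(zip(leaves2, perm))))
               for perm in permutations(range(1, n + 1)))

leaves = ["l1", "l2", "l3", "l4", "l5"]
binary = [("a", "l1"), ("a", "l2"), ("a", "b"), ("b", "l3"),
          ("b", "c"), ("c", "l4"), ("c", "l5")]   # m = 2 = n - 3
star = [("s", leaf) for leaf in leaves]           # m = 0
print(worst_case_tpd(binary, leaves, binary, leaves))  # prints 4, i.e., 2n - 6
print(worst_case_tpd(star, leaves, binary, leaves))    # prints 2, i.e., m1 + m2
```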
For unrooted trees T1 and T2 that are not necessarily proper, i.e., where it is possible that internal nodes have degree 2, Theorem 1 implies the following corollary.
Corollary 1. For each two unrooted trees Ti with n leaves and mi internal edges, n ≥ 3, i ∈ {1, 2}, where ki is the number of internal nodes with degree 2 in Ti, there exist {1, 2, …, n}-labelings of the leaves of both trees such that for the corresponding phylogenetic trees d(T1, T2) = m1 − k1 + m2 − k2.
Proof. To see that the corollary holds consider the case that in one of the trees Ti there exists a path of maximal length with internal nodes n1, n2, …, nk that all have degree 2. Then, there exist nodes n0 and nk+1 not in the path such that n0 is connected to n1 and nk is connected to nk+1. For each two edges e, e′ that are incident to (at least) one of the nodes in the path, π(Ti, e) = π(Ti, e′) holds. Hence, if the path n1, n2, …, nk is removed from Ti and replaced by a single edge (i.e., n0 is connected to nk+1), for the resulting tree T′ the equality P(T′) = P(Ti) holds. Iteratively apply this procedure until all k1 + k2 internal nodes with degree 2 have been removed in both trees and apply Theorem 1 to the resulting trees.
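The reduction used in this proof, replacing every maximal path of degree-2 nodes by a single edge, can be sketched as follows in Python. The edge-list encoding with string node names and the function name `suppress_degree_two` are assumptions made for this example.

```python
from collections import defaultdict

def suppress_degree_two(edges):
    """Return the edge list of the tree in which all nodes of degree 2 are bypassed."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for node in list(adj):
        if len(adj[node]) == 2:
            # Connect the two neighbours directly and drop the node.
            left, right = adj.pop(node)
            adj[left].discard(node)
            adj[right].discard(node)
            adj[left].add(right)
            adj[right].add(left)
    return sorted({tuple(sorted((u, w))) for u in adj for w in adj[u]})

# Quartet whose internal edge {a, b} and leaf-edge {a, l1} have been subdivided.
subdivided = [("a", "d1"), ("d1", "d2"), ("d2", "b"), ("a", "d3"), ("d3", "l1"),
              ("a", "l2"), ("b", "l3"), ("b", "l4")]
print(suppress_degree_two(subdivided))
# prints [('a', 'b'), ('a', 'l1'), ('a', 'l2'), ('b', 'l3'), ('b', 'l4')]
```

A single pass suffices because bypassing a node of degree 2 does not change the degree of any remaining node.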
For rooted topological trees we show in the following a theorem that is analogous to Theorem 1 for unrooted trees and gives a bound on the cluster distance of two corresponding phylogenetic trees.
Theorem 2. For each two proper rooted topological trees T1 and T2 with n leaves and with m1 and m2 internal edges, respectively, where n ≥ 3, 0 ≤ m1 ≤ n − 2, and 0 ≤ m2 ≤ n − 2, there exist {1, 2, …, n}-labelings of the leaves of both trees such that dr(T1, T2) = m1 + m2.
Proof. The proof is similar to the proof of Theorem 1. We show that there exist labelings for the leaves of T1 and T2 such that for each internal node v1 of T1 the cluster cl(T1, v1) is not contained in Cl(T2) and for each internal node v2 of T2 the cluster cl(T2, v2) is not contained in Cl(T1). The proof is done by induction on n.
Base case n = 3. In this case each tree has either no internal node (then all three leaves are connected to the root) or it has one internal node that is connected to two leaves and the other leaf is connected to the root. In the first case a tree has only trivial clusters and in the second case it has exactly one non-trivial cluster that contains the labels of the two leaves that are connected to the internal node. If the second case holds for both trees then the two leaves that are connected to the internal node can get labels 1 and 2 for T1 respectively labels 1 and 3 for T2 and the theorem holds. Otherwise, the theorem holds for all {1, 2, 3}-labelings of T1 and T2.
For the inductive step consider two trees T1 and T2 with n ≥ 4 leaves and the following three cases:
- Case (1): One of the trees has no internal node.
- Case (2): Both trees have (at least) one internal node that is connected to at least 3 leaves.
- Case (3): Both trees have (at least) one internal node and for (at least) one of the trees each internal node is connected to at most two leaves.
Proof for case (1). W.l.o.g. let T1 be a tree that has no internal node and therefore Cl(T1) contains only trivial clusters. Then, for each labeling of the leaves of T2 and any internal node v of T2 a non-trivial cl(T2, v) is not in Cl(T1) and the theorem holds.
Proof for case (2). Let u (x) be an internal node in T1 (respectively T2) that is connected to at least 3 leaves. From each tree remove one of the leaves connected to u, respectively x. The resulting trees T1′ and T2′ are both proper and have n − 1 leaves and m1 (respectively m2) internal nodes. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for T1′ and T2′ such that dr(T1′, T2′) = m1 + m2. Extend these labelings to {1, 2, …, n}-labelings for T1 and T2 by assigning the label n to both removed leaves. Since for each two internal nodes v of T1′ and y of T2′ the clusters cl(T1′, v) and cl(T2′, y) are different, the clusters cl(T1, v) and cl(T2, y) are also different. Hence, dr(T1, T2) = m1 + m2.
It remains to prove case (3). W.l.o.g. assume that in T1 each internal node is connected to at most two leaves. Since T1 is a proper tree there exists an internal node u of T1 that is neighbour to exactly two leaves u1 and u2 and has exactly one other neighbour v (which is an internal node or the root). Consider the tree T1′ that is obtained from T1 by removing nodes u1, u and edges {u1, u}, {u, v} and by connecting node u2 to v. Then T1′ is proper and has n − 1 leaves and m1 − 1 internal edges.
First assume that T2 has a node x that is connected to at least three leaves x1, x2, and x3. Construct T2′ by removing leaf x1. Then, T2′ is proper and has n − 1 leaves and m2 internal edges. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for the reduced trees T1′ and T2′ such that dr(T1′, T2′) = m1 − 1 + m2. Extend these labelings to {1, 2, …, n}-labelings for T1 and T2 by assigning leaves u1 and x1 the label n. Clearly, no cluster of Cl(T2) is equal to a cluster cl(T1, w) where w is an internal node with w ≠ u. Also, cl(T1, u) = {n, i} for an i ∈ [1, n − 1] is not in Cl(T2) because all clusters in Cl(T2) that include n have at least three elements. Thus, dr(T1, T2) = m1 + m2 holds.
It remains to consider the case that all internal nodes in T2 are connected to at most two leaves. Then there must exist an internal node x in T2 that is connected to two leaves x1 and x2 and to exactly one other node y that is an internal node or the root. Similarly as for T1, create a tree T2′ by removing nodes x1, x and edges {x1, x}, {x, y} from T2 and by connecting x2 to y. By construction, T2′ is proper and has n − 1 leaves and m2 − 1 internal edges. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for the reduced trees T1′ and T2′ such that dr(T1′, T2′) = m1 − 1 + m2 − 1. Now consider four subcases.
Case a: Both nodes v and y have at least 2 neighbouring leaves in T1′ respectively T2′. Then, we can assume w.l.o.g. that leaf u2 has label i in T1′ and leaf x2 has label j in T2′ with j ≠ i (because otherwise the label of u2 can be exchanged with the label of another leaf that is connected to v). Now, extend the labelings to {1, 2, …, n}-labelings for T1 and T2 by assigning both leaves u1 and x1 label n. Clearly, cluster cl(T1, u) is not in Cl(T2) and cluster cl(T2, x) is not in Cl(T1). Hence, dr(T1, T2) = m1 + m2 easily follows.
Case b: Node v has at least two neighbouring leaves (one leaf is u2 and let v1 be the other leaf) in T1′ and node y has only one neighbouring leaf (i.e., leaf x2) in T2′. Assume that j is the label of x2. Then at least one of u2 and v1 has a label i ≠ j. Assume first that node u2 has label i and extend the labelings of T1′ and T2′ to {1, 2, …, n}-labelings for T1 and T2 by assigning label n to leaves u1 and x1. Then the cluster cl(T1, u) is not in Cl(T2) and the cluster cl(T2, x) is not in Cl(T1). It follows easily that dr(T1, T2) = m1 + m2 holds. It remains to consider the case that node v1 has label i. Then exchange the labels of u2 and v1 in T1′. Clearly, for this labeling dr(T1′, T2′) = m1 − 1 + m2 − 1 still holds. Now, proceed as before to show the result.
Case c: Node v has only one neighbouring leaf in T1′ and node y has at least 2 neighbouring leaves in T2′. This case is symmetric to Case (b) and the proof is analogous.
Case d: Nodes v and y have only one neighbouring leaf in T1′ respectively T2′. For the chosen {1, 2, …, n − 1}-labelings for T1′ and T2′ with dr(T1′, T2′) = m1 − 1 + m2 − 1 let i and j be the labels of u2 respectively x2. If i ≠ j extend the labelings to {1, 2, …, n}-labelings for T1 and T2 by assigning label n to leaves u1 and x1. Then, cluster cl(T1, u) is not in Cl(T2) and cluster cl(T2, x) is not in Cl(T1). Now, it is not hard to show that dr(T1, T2) = m1 + m2 holds.
It remains to consider the case i = j. Let k ∈ {1, 2, …, n − 1} be a label with k ≠ i. First, extend the labelings of T1′ and T2′ to {1, 2, …, n}-labelings for T1 and T2 by assigning label n to leaves u1 and x1. Then change the labeling for T1 by exchanging the labels of the leaves with labels k and n and, accordingly, for T1′ by replacing the label k of the corresponding leaf by n. Now, cluster cl(T1, u) = {k, i} is not in Cl(T2). Similarly, cluster cl(T2, x) = {n, i} is not in Cl(T1). By the construction it holds for each cluster in Cl(T1) that it contains either both i and k or neither of them. Similarly, it holds for each cluster in Cl(T2) that it contains either both i and n or neither of them. Hence, for each internal node w ≠ u in T1 it holds: either i ∉ cl(T1, w) and cl(T1′, w) = cl(T1, w), or i, k ∈ cl(T1, w) and cl(T1′, w) = cl(T1, w) − {k}. Similarly, for each internal node z ≠ x in T2 it holds: either i ∉ cl(T2, z) and cl(T2′, z) = cl(T2, z), or i, n ∈ cl(T2, z) and cl(T2′, z) = cl(T2, z) − {n}.
Now, it remains to show that for each cl ∈ Cl(T1), cl ≠ {k, i} implies cl ∉ Cl(T2) and, vice versa, for each cl ∈ Cl(T2), cl ≠ {n, i} implies cl ∉ Cl(T1). To show the first statement, let cl ∈ Cl(T1), cl ≠ {k, i}. There exist four cases:
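- Case i) i, n ∉ cl. Then k ∉ cl and cl ∈ Cl(T1′); since cl contains neither k nor n, it is also a cluster of T1′ under the labeling from the induction hypothesis. Therefore, cl ∉ Cl(T2′). By the construction it follows that cl ∉ Cl(T2).
- Case ii) i ∉ cl, n ∈ cl. Then cl ∉ Cl(T2) because every cluster in Cl(T2) contains either both i and n or neither of them.
- Case iii) i ∈ cl, n ∉ cl. Similarly as in case (ii) it follows that cl ∉ Cl(T2).
- Case iv) i, n ∈ cl. Then k ∈ cl and cl − {k} is a cluster of T1′ once its leaf with label k has been relabelled n; under the labeling from the induction hypothesis this is the cluster cl − {n} of T1′. Therefore, cl − {n} ∉ Cl(T2′). By the construction it follows that cl ∉ Cl(T2).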
The second statement, i.e., cl ∈ Cl(T2), cl ≠ {n, i} implies cl ∉ Cl(T1), can be shown by symmetric arguments. Thus, the theorem holds.
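As for the unrooted case, the statement of Theorem 2 can be checked exhaustively for very small trees. The Python sketch below does so for two proper rooted topologies with four leaves; the children-dictionary encoding and the function names `clusters` and `worst_case_cd` are assumptions made for this example, and the search enumerates all n! labelings.

```python
from itertools import permutations

def clusters(children, root, leaf_label):
    """Return Cl(T): the set of non-trivial clusters of a rooted phylogenetic tree."""
    n = len(leaf_label)
    result = set()

    def collect(node):
        if node in leaf_label:
            return frozenset([leaf_label[node]])
        cl = frozenset().union(*(collect(child) for child in children[node]))
        if 1 < len(cl) < n:  # drop singletons and the full label set
            result.add(cl)
        return cl

    collect(root)
    return result

def worst_case_cd(children1, root1, leaves1, children2, root2, leaves2):
    """Maximum cluster distance over all labelings; the first labeling is fixed w.l.o.g."""
    n = len(leaves1)
    fixed = clusters(children1, root1, dict(zip(leaves1, range(1, n + 1))))
    return max(len(fixed ^ clusters(children2, root2, dict(zip(leaves2, perm))))
               for perm in permutations(range(1, n + 1)))

leaves = ["l1", "l2", "l3", "l4"]
caterpillar = {"r": ["x", "l4"], "x": ["y", "l3"], "y": ["l1", "l2"]}  # m = 2
balanced = {"r": ["a", "b"], "a": ["l1", "l2"], "b": ["l3", "l4"]}     # m = 2
print(worst_case_cd(caterpillar, "r", leaves, balanced, "r", leaves))  # prints 4
```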
Similarly as for unrooted trees, Theorem 2 implies the following corollary for rooted trees T1 and T2 that are not necessarily proper, i.e., where it is possible that internal nodes have degree 2.
Corollary 2. For each two rooted trees Ti with n leaves and mi internal edges, n ≥ 3, i ∈ {1, 2}, where ki is the number of internal nodes with degree 2 in Ti, there exist {1, 2, …, n}-labelings of the leaves of both trees such that for the corresponding phylogenetic trees dr(T1, T2) = m1 − k1 + m2 − k2.
Proof. To see that the corollary holds consider the case that in one of the trees Ti there exists a path n1, n2, …, nk of maximal length such that all nodes nj of the path have degree 2. Let nk be the node of the path that is farthest away from the root and let nk+1 be a node that is connected with nk and is not on the path. Clearly such a node nk+1 must exist and nk+1 is a leaf or an internal node with degree ≥ 3. Let n0 be a node that is connected to n1 and is not on the path. Clearly such a node n0 must exist and n0 is the root or an internal node with degree ≥ 3. For each node nj, 1 ≤ j ≤ k, cl(Ti, nj) = cl(Ti, nk+1) holds. Hence, if the path n1, n2, …, nk is removed from Ti and replaced by a single edge that connects n0 and nk+1, for the resulting tree T′ the equality Cl(T′) = Cl(Ti) holds. Iteratively apply this procedure until all k1 + k2 internal nodes with degree 2 have been removed from both trees and apply Theorem 2 to the resulting trees.
From Corollary 2 we obtain the following corollary on the worst case of the unrooted cluster distance between an unrooted phylogenetic tree and a rooted phylogenetic tree, both with n leaves.
Corollary 3. For each unrooted tree T1 with n leaves and m1 internal edges and each rooted tree T2 with n leaves and m2 internal edges, n ≥ 3, where ki is the number of internal nodes with degree 2 in Ti, i ∈ {1, 2}, there exist {1, 2, …, n}-labelings of the leaves of both trees such that for the corresponding phylogenetic trees dur(T1, T2) = m1 − k1 + m2 − k2.
Proof. For an edge e of T1 consider the rooted tree T1,e. By definition, T1,e is a rooted tree with n leaves and has k1 internal nodes that have degree 2. To see this, recall that the root is not an internal node. Now, the corollary follows immediately from Corollary 2.
Note that the proof of Corollary 3 has shown that for any edge e of T1 it holds that there exist {1, 2, …, n}-labelings of the leaves of trees T1,e and T2 such that dr(T1,e, T2) = m1 − k1 + m2 − k2.
Conclusion
It was shown that for two topological trees T1 and T2 with n leaves, mi internal edges in tree Ti, and ki nodes of degree 2 in Ti, i ∈ {1, 2}, there exist assignments of labels {1, 2, …, n} to the leaves of each tree such that the tree partition distance (TPD; also called Robinson-Foulds distance for unrooted trees) between the corresponding unrooted phylogenetic trees is m1 − k1 + m2 − k2. In addition, this number is an upper bound, i.e., there do not exist assignments of labels {1, 2, …, n} to the leaves such that the TPD between both trees is larger than m1 − k1 + m2 − k2. Moreover, it was shown that analogous results hold for the cluster distance (CD; also called Robinson-Foulds distance for rooted trees) of two rooted trees and for the unrooted cluster distance (urCD) of an unrooted tree and a rooted tree. Our results can be used to compute a normalized version of the corresponding distance measures.
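A normalized distance based on this bound can be sketched as follows in Python. The function name `normalized_distance` and the convention of returning 0.0 when the bound is zero (both trees then have no non-trivial partitions or clusters) are assumptions made for this example and are not prescribed by the paper.

```python
def normalized_distance(distance, m1, k1, m2, k2):
    """Divide a TPD or CD value by the worst case bound m1 - k1 + m2 - k2."""
    bound = m1 - k1 + m2 - k2
    if not 0 <= distance <= bound:
        raise ValueError("distance must lie between 0 and the worst case bound")
    if bound == 0:
        return 0.0  # assumption: a zero bound means the distance is always zero
    return distance / bound

# Two binary unrooted trees with five leaves (m1 = m2 = 2, no nodes of degree 2).
print(normalized_distance(2, 2, 0, 2, 0))  # prints 0.5
```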
References
- 1. Bourque M. Arbres de Steiner et reseaux dont certains sommets sont a localisation variable [dissertation]. Montreal: Université de Montreal; 1978.
- 2. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math. Biosci. 1981; 53:131–147.
- 3. Robinson DF, Foulds LR. Comparison of weighted labeled trees. In: Horadam AF, Wallis WD, editors. Combinatorial Mathematics VI, Lecture Notes in Mathematics, vol 748. Berlin, Heidelberg: Springer; 1979; 748:119–126.
- 4. Bryant D, Steel M. Computing the Distribution of a Tree Metric. IEEE/ACM Trans Comput Biol Bioinform, 2009; 6(3):420–426. pmid:19644170
- 5. Penny D, Hendy MD. The Use of Tree Comparison Metrics. Syst. Zool. 1985; 34(1):75–82.
- 6. Hendy MD, Little CHC, Penny D. Comparing Trees with Pendant Vertices Labelled. SIAM Journal on Applied Mathematics. 1984; 44(5):1054–1065.
- 7. Steel MA. Distribution of the Symmetric Difference Metric on Phylogenetic Trees. SIAM J. Discrete Math. 1988; 1(4):541–551.
- 8. Geoghegan JL, Duchêne S, Holmes EC. Comparative analysis estimates the relative frequencies of co-divergence and cross-species transmission within viral families. PLoS Pathogens. 2017; 13(2):e1006215. pmid:28178344
- 9. Kupczok A, Schmidt HA, von Haeseler A. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets. Algorithms for Molecular Biology. 2010; 5(37):17pp
- 10. Steel MA, Penny D. Distributions of Tree Comparison Metrics—Some New Results. Syst. Biol. 1993; 42(2):126–141.
- 11. Guillerme T, Cooper N. Effects of missing data on topological inference using a Total Evidence approach. Molecular Phylogenetics and Evolution. 2016; 94:146–158. pmid:26335040
- 12. Bogdanowicz D, Giaro K, Wrobel B. TreeCmp: Comparison of trees in polynomial time. Evolutionary Bioinformatics, 2012; 8:475–487.
- 13. Weyenberg G, Yoshida R. Phylogenetic Tree Distances. In: Kliman RM, editor. The Encyclopedia of Evolutionary Biology, Oxford: Academic Press; 2016; 3:285–290.
- 14. Ho SYW, Duchêne S, Duchêne D. Simulating and detecting autocorrelation of molecular evolutionary rates among lineages. Molecular Ecology Resources. 2015; 15(4):688–996. pmid:25155426
- 15. Górecki P, Eulenstein O. A Robinson-Foulds Measure to Compare Unrooted Trees with Rooted Trees. In: Bleris L, Mandoiu I, Schwartz R, Wang J, editors. Proc. 8th International Symposium on Bioinformatics Research and Applications (ISBRA 2012). Berlin: Springer. 2012; LNCS 7292:115–126.