A strict upper bound for the partition distance and the cluster distance of phylogenetic trees for each fixed pair of topological trees

Martin Middendorf; Nicolas Wieseke

doi:10.1371/journal.pone.0204907

Abstract

For each given pair of (rooted or unrooted) topological trees with the same number of leaves, a strict upper bound is shown for the tree partition distance (also called symmetric difference metric and Robinson-Foulds distance) in the case of unrooted trees, and for the cluster distance (also called Robinson-Foulds distance) in the case of rooted trees, of corresponding phylogenetic trees. In particular, it is shown that there exist assignments of labels (e.g., species) to the leaves of both topological trees, where each label is assigned to exactly one leaf in each tree, such that: i) in the unrooted case, the tree partition distance between the corresponding phylogenetic trees equals the number of internal edges in both trees minus the number of nodes with degree 2 in both trees, ii) in the rooted case, the cluster distance between the corresponding phylogenetic trees equals the number of internal edges in both trees minus the number of nodes with degree 2 in both trees, and iii) the values in (i) and (ii) are also the maximum values with respect to all possible assignments. The strict worst case bounds shown here are needed as normalization factors to compute normalized versions of the respective tree partition metrics.

Citation: Middendorf M, Wieseke N (2018) A strict upper bound for the partition distance and the cluster distance of phylogenetic trees for each fixed pair of topological trees. PLoS ONE 13(9): e0204907. https://doi.org/10.1371/journal.pone.0204907

Editor: Arndt von Haeseler, Max F Perutz Laboratories GmbH, AUSTRIA

Received: July 5, 2018; Accepted: September 10, 2018; Published: September 28, 2018

Copyright: © 2018 Middendorf, Wieseke. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: All relevant data are within the paper.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Introduction

Tree distance measures are used in many applications to compare trees. The most widely used distance measure between (unrooted) phylogenetic trees (i.e., trees where the leaves are labelled with species) is the tree partition distance (TPD). The TPD was introduced by Bourque [1] and is also called symmetric difference metric or Robinson-Foulds metric [2] (note that the latter name has also been used for another tree metric that was proposed in [3]). The TPD measures the size of the symmetric difference between the two sets of bipartitions (one set for each tree) of all labels that are obtained as follows: each inner edge of a tree is removed in turn, and the label sets of the leaves in the two emerging connected components form the two sets of the bipartition. Several authors have studied the TPD (e.g., [4–7]).

To compare distances between pairs of phylogenetic trees of different sizes, normalized versions of tree distance measures are used. The most frequently used normalization factors are strict worst case bounds. Normalized versions of the TPD have been used in several evolutionary studies (e.g., [8, 9]). To define them let d(T₁, T₂) be the TPD between two phylogenetic trees T₁ and T₂ which have the same number of leaves. One normalized version of the TPD (used, e.g., in [9, 10]) divides the value d(T₁, T₂) by the maximum TPD for phylogenetic trees with the same number of leaves as T₁ and T₂. It is easy to show that the maximum TPD for trees with n leaves is 2n − 6 (e.g., [6]). Instead of using a strict worst case bound as normalization factor, another possibility is to use the average TPD over pairs of trees with size n. Specifically, in [11] the normalized TPD of trees T₁ and T₂ is computed as (d_rand − d(T₁, T₂))/d_rand where d_rand is the average TPD computed empirically over 1000 pairs of random trees of size n (in [11] the trees were generated with the Yule model according to a general proposal for computing normalized tree distance metrics from [12]). It should be mentioned that sometimes TPD/2 is also called normalized Robinson-Foulds distance (e.g., in [13]). Observe that in all these definitions of normalized versions of the TPD the normalization factor does not depend on the topologies of the phylogenetic trees T₁ and T₂.
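The two topology-independent normalizations described above can be written down in a few lines. The following Python sketch is only an illustration; the function names are hypothetical, and the value d_rand is assumed to have been estimated beforehand (e.g., from random tree pairs).

```python
def normalized_by_global_bound(d, n):
    """Divide a TPD value d by the global worst case 2n - 6 (needs n >= 4)."""
    if n < 4:
        raise ValueError("2n - 6 is zero for n = 3, so the ratio is undefined")
    return d / (2 * n - 6)


def normalized_by_random_average(d, d_rand):
    """Variant described for [11]: (d_rand - d) / d_rand."""
    return (d_rand - d) / d_rand


# Two trees with n = 5 leaves and TPD 4 reach the global maximum 2n - 6 = 4.
print(normalized_by_global_bound(4, 5))      # 1.0
# A TPD of 3 against an empirical average of 6 random-pair distances.
print(normalized_by_random_average(3, 6))    # 0.5
```

Neither function looks at the topologies of the two trees, which is the limitation the topology-dependent normalization addresses.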

In [8] it was argued, however, that the normalization factor should consider the topologies of the phylogenetic trees since not all possible phylogenetic trees are biologically relevant. Thus, a new variant of normalized TPD was proposed in [8]. In this version, denoted nTPD, the value d(T₁, T₂) is divided by the worst case TPD between any two phylogenetic trees that have the same topology as T₁ and T₂ (but a possibly different assignment of labels to the leaves). Let w(T₁, T₂) denote this worst case value. Clearly, w(T₁, T₂) ≤ 2n − 6 holds. Unfortunately, and to the best of our knowledge, no explicit formula for w(T₁, T₂) is known, and for larger n it is not feasible to check all possible assignments of labels to the leaves of trees with the same topology as T₁ and T₂. Therefore, the computation of the nTPD in the NELSI [14] R package (function dist.topo.normalised) is only approximated. This is done as follows. First the TPD is computed for several pairs of phylogenetic trees that have been obtained from the two given topological trees by randomizing the assignment of labels to the leaves of one of the trees. Then the maximum TPD over all randomized pairs of phylogenetic trees is used as an approximation for the worst case TPD. In [8] the approximated nTPD was used to compare co-phylogenetic systems of different sizes and it was argued that the maximum over 1000 randomized pairs should give a good approximation. In particular, it was shown empirically for pairs of trees with sizes up to n = 142 that the maximum over 1000 randomizations is stable (i.e., fewer randomizations had already given the same maximum value).
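The randomization scheme described above can be sketched as follows. This is a minimal Python illustration and not the NELSI implementation; the edge-list representation of trees and the names `bipartitions` and `approx_worst_case` are assumptions made for this example.

```python
import random
from collections import defaultdict


def bipartitions(edges, labels):
    """Non-trivial bipartitions of an unrooted tree (edge list, leaf -> label)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    all_labels = frozenset(labels.values())
    result = set()
    for u, v in edges:
        # Labels reachable from u when edge {u, v} is removed.
        side, stack, seen = set(), [u], {u, v}
        while stack:
            node = stack.pop()
            if node in labels:
                side.add(labels[node])
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        side = frozenset(side)
        other = all_labels - side
        if len(side) >= 2 and len(other) >= 2:
            result.add(frozenset((side, other)))
    return result


def approx_worst_case(edges1, leaves1, edges2, leaves2, trials=1000, seed=0):
    """Maximum TPD seen over `trials` random relabelings of the second tree."""
    rng = random.Random(seed)
    n = len(leaves1)
    fixed = bipartitions(edges1, dict(zip(leaves1, range(1, n + 1))))
    perm = list(range(1, n + 1))
    best = 0
    for _ in range(trials):
        rng.shuffle(perm)
        candidate = bipartitions(edges2, dict(zip(leaves2, perm)))
        best = max(best, len(fixed ^ candidate))
    return best


# Quartet topology: leaves a, b hang off p; leaves c, d hang off q.
quartet = [("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"), ("d", "q")]
leaves = ["a", "b", "c", "d"]
print(approx_worst_case(quartet, leaves, quartet, leaves))
```

The result is a lower estimate of w(T₁, T₂): it can only miss the true maximum, never exceed it.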

In this paper we present an explicit formula to compute w(T₁, T₂). It should be mentioned that an O(n⁵) time method is given in [4] to compute, for a fully resolved (i.e., each inner node has degree three) phylogenetic tree T with n leaves and for all values m ∈ [0 : 2n − 6], the number of phylogenetic trees that have TPD m to T. By taking the maximum value m for which this number is not zero one obtains the worst case TPD for T (with respect to all other fully resolved trees with n leaves). But this is different from the worst case bound shown in this paper, where both topological trees are fixed. Moreover, we also consider the case of trees that are not fully resolved.

In addition to the case of unrooted phylogenetic trees we also consider the case of rooted phylogenetic trees. In the rooted case the corresponding distance measure is called cluster distance (CD) or Robinson-Foulds distance for rooted trees. Recently a Robinson-Foulds metric has been proposed in [15] to compare an unrooted phylogenetic tree T₁ with a rooted phylogenetic tree T₂ when both trees have n leaves. The idea of this measure is to root the unrooted tree T₁ optimally in the sense that the obtained rooted tree has a minimal CD to T₂. Similarly as for the TPD, we show a strict worst case bound for the CD and for the Robinson-Foulds metric between an unrooted phylogenetic tree and a rooted phylogenetic tree. Our results imply that the corresponding normalized distances can be computed efficiently.

Basic definitions

An (unrooted) tree is a connected graph T = (V, E) with |E| = |V| − 1 edges. A rooted tree is a tree which has one distinguished node with degree ≥ 2 that is called root. For a tree T = (V, E) a node v ∈ V is connected to a node w ∈ V when {v, w} ∈ E. A leaf is a node in a (rooted or unrooted) tree with degree one, a leaf-edge is an edge that is incident to a leaf and all other edges are internal edges. For an unrooted tree T a node that is not a leaf is an internal node. For a rooted tree T a node that is neither a leaf nor the root is an internal node. A proper (rooted or unrooted) tree is a tree T where each internal node has degree ≥ 3. In the following, n ≥ 3 denotes the number of leaves and m the number of internal edges of an unrooted (respectively, rooted) tree. For a proper unrooted tree it holds that 0 ≤ m ≤ n − 3 and for a proper rooted tree it holds that 0 ≤ m ≤ n − 2.

Let L be a set of n labels. In this paper we assume w.l.o.g. L = {1, 2, …, n}. A phylogenetic tree on L is a tree T with n leaves and where each leaf is labelled with exactly one element from L such that for each label l ∈ L there exists a leaf with label l. For a phylogenetic tree the underlying tree T is also called the topological tree, i.e., the topological tree is the phylogenetic tree ignoring the labels of the leaves. A rooted or unrooted phylogenetic tree is proper if its corresponding rooted, respectively, unrooted topological tree is proper. If the context is clear notation T is used for a phylogenetic tree and also for its corresponding topological tree.

The removal of an edge e from an unrooted phylogenetic tree T on L induces a two set partition of L, denoted by π(T, e), where each set of the partition consists of the labels of all leaves of one of the two connected components of T − e. Observe that for each leaf-edge e of T one set of the partition π(T, e) is a singleton that contains the label of the corresponding leaf. A two set partition of {1, 2, …, n} where one set is a singleton is called a trivial partition. For each internal edge e of a proper unrooted phylogenetic tree T each set of the partition π(T, e) has at least two elements. Let P(T) be the set of all non-trivial two set partitions π(T, e) induced by the edges e of an unrooted phylogenetic tree T. For two unrooted phylogenetic trees T₁ and T₂ with n leaves the tree partition distance (TPD) between T₁ and T₂, denoted by d(T₁, T₂), is the size of the symmetric difference between P(T₁) and P(T₂), i.e., d(T₁, T₂) = |P(T₁) ∪ P(T₂) − (P(T₁) ∩ P(T₂))|.
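The definition of the TPD translates directly into code. The sketch below is one possible Python rendering under stated assumptions: a tree is given as a list of node pairs plus a dictionary from leaf nodes to labels, and the names `bipartitions` and `tpd` are chosen for this example only.

```python
from collections import defaultdict


def bipartitions(edges, labels):
    """Return P(T): the non-trivial two set partitions of an unrooted tree.

    edges  -- list of (u, v) node pairs (hypothetical representation)
    labels -- dict mapping each leaf node to its label
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    all_labels = frozenset(labels.values())
    result = set()
    for u, v in edges:
        # Labels reachable from u when edge {u, v} is removed.
        side, stack, seen = set(), [u], {u, v}
        while stack:
            node = stack.pop()
            if node in labels:
                side.add(labels[node])
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        side = frozenset(side)
        other = all_labels - side
        if len(side) >= 2 and len(other) >= 2:  # skip trivial partitions
            result.add(frozenset((side, other)))
    return result


def tpd(edges1, labels1, edges2, labels2):
    """Tree partition distance: size of the symmetric difference."""
    return len(bipartitions(edges1, labels1) ^ bipartitions(edges2, labels2))


# Quartet topology: leaves a, b hang off p; leaves c, d hang off q.
quartet = [("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"), ("d", "q")]
t1 = {"a": 1, "b": 2, "c": 3, "d": 4}   # split {1,2} | {3,4}
t2 = {"a": 1, "b": 3, "c": 2, "d": 4}   # split {1,3} | {2,4}
print(tpd(quartet, t1, quartet, t2))     # 2
```

Storing each partition as a frozenset of two frozensets makes the comparison independent of the order of the two sides, and edges incident to a node of degree 2 collapse to a single partition automatically.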

For each node v of a rooted phylogenetic tree T let T(v) be the subtree of T with root v and let cl(T, v) be the subset of L that contains all labels of the leaves of T(v). Set cl(T, v) is called the cluster of v. Observe that for each leaf v the cluster cl(T, v) is a singleton that contains the label of v. If v is the root of T then cl(T, v) = L. A cluster that is a singleton or equals L is called a trivial cluster. For each internal node v of a proper rooted phylogenetic tree T the cluster of v has at least two elements. Let Cl(T) be the set of all non trivial clusters of a rooted phylogenetic tree T. For two rooted phylogenetic trees T₁ and T₂ with n leaves the cluster distance (CD) between T₁ and T₂—denoted by d_r(T₁, T₂)—is the size of the symmetric difference between Cl(T₁) and Cl(T₂), i.e., d_r(T₁, T₂) = |Cl(T₁) ∪ Cl(T₂) − (Cl(T₁) ∩ Cl(T₂))|.
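As an illustration of the cluster distance, the following Python sketch computes Cl(T) and d_r for rooted trees written as nested tuples (a leaf is its integer label, an inner node is a tuple of its children). This representation and the function names are assumptions for the example.

```python
def clusters(tree):
    """Return Cl(T): the non-trivial clusters of a rooted tree."""
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])          # cluster of a leaf
        cl = frozenset().union(*(visit(child) for child in node))
        found.add(cl)
        return cl

    all_labels = visit(tree)
    # Drop singletons and the full label set L, which are trivial.
    return {cl for cl in found if 2 <= len(cl) < len(all_labels)}


def cluster_distance(tree1, tree2):
    """Cluster distance: size of the symmetric difference of Cl(T1), Cl(T2)."""
    return len(clusters(tree1) ^ clusters(tree2))


# Two rooted trees on three leaves with one internal node each.
print(cluster_distance(((1, 2), 3), ((1, 3), 2)))   # 2
print(cluster_distance(((1, 2), 3), ((2, 1), 3)))   # 0
```

In the first call the non-trivial clusters are {1, 2} and {1, 3}, so both contribute to the symmetric difference.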

Let T be an unrooted phylogenetic tree. Let E(T) be the edge set of T. A rooting of T is defined by choosing an edge e = {n, n′} of T on which the root is to be placed, i.e., the edge is removed from T, a new node n″ that is the root is added to T, and n″ is connected to n and to n′. The obtained rooted tree is denoted by T_e (and by T_{1,e} when T = T₁). According to [15], for an unrooted phylogenetic tree T₁ and a rooted phylogenetic tree T₂, both with n leaves, the unrooted cluster distance (urCD) between T₁ and T₂, denoted by d_ur(T₁, T₂), is defined as d_ur(T₁, T₂) = min_{e ∈ E(T₁)} |Cl(T_{1,e}) ∪ Cl(T₂) − (Cl(T_{1,e}) ∩ Cl(T₂))|.
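The minimization over all rootings can be sketched as below. This Python example is an illustration only, not the method of [15]: the unrooted tree is an edge list with a leaf-to-label dictionary, the rooted tree is a nested tuple, and all function names are hypothetical.

```python
from collections import defaultdict


def clusters(tree):
    """Non-trivial clusters of a rooted tree given as nested tuples."""
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        cl = frozenset().union(*(visit(child) for child in node))
        found.add(cl)
        return cl

    all_labels = visit(tree)
    return {cl for cl in found if 2 <= len(cl) < len(all_labels)}


def clusters_after_rooting(edges, labels, root_edge):
    """Non-trivial clusters of T_e, the tree rooted on the edge root_edge."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    a, b = root_edge
    adj[a].discard(b)
    adj[b].discard(a)
    found = set()

    def visit(node, parent):
        if node in labels:
            return frozenset([labels[node]])
        cl = frozenset().union(
            *(visit(child, node) for child in adj[node] if child != parent)
        )
        found.add(cl)
        return cl

    # The new root has exactly the two children a and b.
    visit(a, None)
    visit(b, None)
    return {cl for cl in found if 2 <= len(cl) < len(labels)}


def ur_cd(edges, labels, rooted_tree):
    """Unrooted cluster distance: minimum CD over all rootings of the tree."""
    target = clusters(rooted_tree)
    return min(
        len(clusters_after_rooting(edges, labels, e) ^ target) for e in edges
    )


quartet = [("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"), ("d", "q")]
labels = {"a": 1, "b": 2, "c": 3, "d": 4}
print(ur_cd(quartet, labels, (((1, 2), 3), 4)))   # 0
```

Here the optimal rooting places the root on the leaf-edge of leaf 4, which yields the clusters {1, 2} and {1, 2, 3} of the rooted tree.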

Results and discussion

For each two proper unrooted topological trees T₁ and T₂ with n leaves and m₁, respectively m₂, internal edges the following upper bound on the TPD of two corresponding phylogenetic trees holds: d(T₁, T₂) ≤ m₁ + m₂. This is clear because each internal edge contributes at most one two set partition. If both trees are not necessarily proper and T_i, i ∈ {1, 2}, has k_i nodes with degree 2 then d(T₁, T₂) ≤ m₁ − k₁ + m₂ − k₂ holds. This result follows from the upper bound for proper trees and the fact that both edges that are incident to a node of degree 2 lead to the same two set partition of L when they are removed. Similarly, for two rooted topological trees T₁ and T₂ with n leaves and m₁, respectively m₂, internal edges, where k_i is the number of internal nodes of degree 2 in T_i, i ∈ {1, 2}, the following upper bound on the CD of two corresponding phylogenetic trees holds: d_r(T₁, T₂) ≤ m₁ − k₁ + m₂ − k₂. In the rest of this section we show that these upper bounds are all strict for each two unrooted (respectively rooted) topological trees T₁ and T₂ in the sense that there exist labelings of the leaves of T₁ and T₂ such that the TPD (respectively CD) of the corresponding phylogenetic trees equals the upper bound.
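Since the bound depends only on m_i and k_i, it can be read off the two topologies in linear time. The Python sketch below illustrates this under the assumption that a tree is given as an edge list; the function names and the optional `root` argument (used to exclude the root of a rooted tree from the degree-2 count) are hypothetical.

```python
from collections import Counter


def internal_edges_and_degree_two(edges, root=None):
    """Return (m, k) for a tree given as an edge list.

    m -- number of internal edges (no endpoint is a leaf)
    k -- number of internal nodes of degree 2 (the root is not internal)
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    leaves = {node for node, d in degree.items() if d == 1}
    m = sum(1 for u, v in edges if u not in leaves and v not in leaves)
    k = sum(1 for node, d in degree.items() if d == 2 and node != root)
    return m, k


def worst_case_bound(edges1, edges2, root1=None, root2=None):
    """Upper bound m1 - k1 + m2 - k2 for a fixed pair of topologies."""
    m1, k1 = internal_edges_and_degree_two(edges1, root1)
    m2, k2 = internal_edges_and_degree_two(edges2, root2)
    return m1 - k1 + m2 - k2


quartet = [("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"), ("d", "q")]
# Same tree with the internal edge subdivided by a degree-2 node r.
subdivided = [("a", "p"), ("b", "p"), ("p", "r"), ("r", "q"),
              ("c", "q"), ("d", "q")]
print(worst_case_bound(quartet, quartet))      # 2
print(worst_case_bound(quartet, subdivided))   # 2
```

The subdivided tree has two internal edges and one node of degree 2, so it contributes 2 − 1 = 1 to the bound, the same as the quartet itself.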

Theorem 1. For each two proper unrooted topological trees T₁ and T₂ with n ≥ 3 leaves and m₁, respectively m₂, internal edges (0 ≤ m₁ ≤ n − 3, 0 ≤ m₂ ≤ n − 3) there exist {1, 2, …, n}-labelings of the leaves of both trees such that d(T₁, T₂) = m₁ + m₂.

Proof. We show that there exist labelings for the leaves of T₁ and T₂ such that for each internal edge e₁ of T₁ the partition π(T₁, e₁) is not contained in P(T₂) and for each internal edge e₂ of T₂ the partition π(T₂, e₂) is not contained in P(T₁). The proof is done by induction on n.

Base case n = 3. Since each internal node has degree at least three, there exists exactly one internal node in each of the trees T₁ and T₂. Thus, m₁ = m₂ = 0 holds and the result follows immediately.

For the inductive step consider two trees T₁ and T₂ with n ≥ 4 leaves and the following three cases:

(1) (At least) one of the trees has only one internal node and therefore has no internal edge.
(2) Both trees have (at least) one internal edge and (at least) one internal node that is connected to at least 3 leaves.
(3) Both trees have (at least) one internal edge and for (at least) one of the trees each internal node is connected to at most two leaves.

Proof for case (1). W.l.o.g. let T₁ be a tree that has only one internal node. Then, for any labelings of the leaves P(T₁) = ∅ and it is clear that for each internal edge e of T₂ the partition π(T₂, e) is not in P(T₁) and the theorem holds.

Proof for case (2). Let u (x) be an internal node in T₁ (respectively T₂) that is connected to at least 3 leaves. From each tree remove one of the leaves connected to u, respectively x. This does not change the number of internal edges, and the resulting trees T₁′ and T₂′ are both proper and have n − 1 leaves and m₁ (respectively m₂) internal edges. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for T₁′ and T₂′ such that d(T₁′, T₂′) = m₁ + m₂. Extend these labelings to {1, 2, …, n}-labelings for T₁ and T₂ by assigning the label n to both removed leaves. Clearly, then d(T₁, T₂) = m₁ + m₂ because the number of internal edges has not changed and the bipartitions of T₁ (T₂) are obtained from the bipartitions of T₁′ (respectively T₂′) by adding n to one of the sets of each bipartition. Hence, bipartitions of T₁′ and T₂′ that are different are also different after adding element n.

It remains to prove case (3). W.l.o.g. assume that in T₁ each internal node is connected to at most two leaves. Since T₁ is a proper tree with at least one internal edge, there exists an internal node u of T₁ that is neighbour to exactly two leaves u₁ and u₂ and to exactly one internal node v (for example, any node that becomes a leaf when all leaves of T₁ are deleted). Thus, {u, v} is an internal edge. Consider the tree T₁′ that is obtained from T₁ by removing nodes u₁, u and edges {u₁, u}, {u, v} and by connecting node u₂ to v. Then T₁′ is proper and has n − 1 leaves and m₁ − 1 internal edges.

First assume that T₂ has a node x that is connected to at least three leaves x₁, x₂, and x₃. Construct T₂′ from T₂ by removing leaf x₁. Then T₂′ is proper and has n − 1 leaves and m₂ internal edges. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for T₂′ and for the reduced tree T₁′ of T₁ such that d(T₁′, T₂′) = m₁ − 1 + m₂. Extend these labelings to {1, 2, …, n}-labelings for T₁ and T₂ by assigning leaves u₁ and x₁ the label n. Clearly, no bipartition of P(T₂) is equal to a bipartition π(T₁, e) where e is an internal edge with e ≠ {u, v}. Also, π(T₁, {u, v}) is not in P(T₂) because one set of π(T₁, {u, v}) equals {n, i} for an i ∈ {1, 2, …, n − 1} and all sets of bipartitions in P(T₂) that include n have at least three elements. Thus, d(T₁, T₂) = m₁ + m₂ holds.

It remains to consider the case that all nodes in T₂ are connected to at most two leaves. Then, as for T₁, there must exist an internal node x in T₂ that is connected to exactly two leaves x₁ and x₂ and to exactly one internal node y, so that {x, y} is an internal edge. Similarly as for T₁, create a tree T₂′ by removing nodes x₁, x and edges {x₁, x}, {x, y} from T₂ and by connecting x₂ to y. By construction, T₂′ has n − 1 leaves and m₂ − 1 internal edges. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for T₂′ and for the reduced tree T₁′ of T₁ such that d(T₁′, T₂′) = m₁ − 1 + m₂ − 1. Now consider four subcases.

Case a: Both nodes v and y have at least 2 neighbouring leaves in the reduced trees T₁′, respectively T₂′. Then we can assume w.l.o.g. that leaf u₂ has label i in T₁′ and leaf x₂ has label j in T₂′ with j ≠ i (because otherwise the label of u₂ can be exchanged with the label of another leaf that is connected to v, which does not change P(T₁′)). Now, extend the labelings to {1, 2, …, n}-labelings for T₁ and T₂ by assigning both leaves u₁ and x₁ label n. Clearly, partition π(T₁, {u, v}) is not in P(T₂) and partition π(T₂, {x, y}) is not in P(T₁). Hence, it is not hard to see that d(T₁, T₂) = m₁ + m₂ holds.

Case b: Node v has at least two neighbouring leaves (one leaf is u₂ and let v₁ be the other leaf) in the reduced tree T₁′ and node y has only one neighbouring leaf (i.e., leaf x₂) in the reduced tree T₂′. Assume that j is the label of x₂. Then at least one of u₂ and v₁ has a label i ≠ j. Assume first that node u₂ has label i and extend the labelings of T₁′ and T₂′ to {1, 2, …, n}-labelings for T₁ and T₂ by assigning label n to leaves u₁ and x₁. Then the partition π(T₁, {u, v}) is not in P(T₂) and the partition π(T₂, {x, y}) is not in P(T₁). It follows easily that d(T₁, T₂) = m₁ + m₂ holds. It remains to consider the case that node v₁ has label i. Then exchange the labels of u₂ and v₁ in T₁′. Clearly, for this labeling d(T₁′, T₂′) = m₁ − 1 + m₂ − 1 still holds. Now, proceed as before to show the result.

Case c: Node v has only one neighbouring leaf in the reduced tree T₁′ and node y has at least 2 neighbouring leaves in the reduced tree T₂′. This case is symmetric to Case (b) and the proof is analogous.

Case d: Nodes v and y have only one neighbouring leaf in the reduced trees T₁′, respectively T₂′. For the chosen {1, 2, …, n − 1}-labelings for T₁′ and T₂′ with d(T₁′, T₂′) = m₁ − 1 + m₂ − 1 let i and j be the labels of u₂, respectively x₂. Note that i = j is possible. If i ≠ j extend the labelings to {1, 2, …, n}-labelings for T₁ and T₂ by assigning label n to leaves u₁ and x₁. Then partition π(T₁, {u, v}) is not in P(T₂) and partition π(T₂, {x, y}) is not in P(T₁). Now, it is not hard to show that d(T₁, T₂) = m₁ + m₂ holds. It remains to consider the case i = j. Let k ∈ {1, 2, …, n − 1} be a label with k ≠ i. First, extend the labelings of T₁′ and T₂′ to {1, 2, …, n}-labelings for T₁ and T₂ by assigning label n to leaves u₁ and x₁. Then change the labeling for T₁ by exchanging the labels of the leaves with labels k and n. Now, partition π(T₁, {u, v}) has one set {k, i} and is therefore not in P(T₂). Similarly, partition π(T₂, {x, y}) has one set {n, i} and is therefore not in P(T₁). To show that no other partition of P(T₁) can be in P(T₂) and vice versa, assume the contrary, i.e., assume that there is a partition π(T₁, e₁) = π(T₂, e₂) in P(T₁) ∩ P(T₂) with e₁ ≠ {u, v} and e₂ ≠ {x, y}. By construction i and k must be in the same set of the partition because the corresponding leaves are connected to the same internal node in T₁. Similarly, by construction i and n must be in the same set of the partition because the corresponding leaves are connected to the same internal node in T₂. Hence i, k, and n are in the same set, and removing n from this set yields a partition that is in both P(T₁′) and P(T₂′). Altogether it follows that P(T₁′) ∩ P(T₂′) ≠ ∅, which contradicts the inductive hypothesis that d(T₁′, T₂′) = m₁ − 1 + m₂ − 1.

A special case of the theorem is when both trees T₁ and T₂ are binary trees, i.e., T₁ and T₂ both have n − 3 internal edges. In this case there exist {1, 2, …, n}-labelings of the leaves such that d(T₁, T₂) = 2n − 6 for n ≥ 3.
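For small n the statement of Theorem 1 can be checked by exhaustive search over all labelings of one tree (the labeling of the other tree can be fixed without loss of generality). The following Python sketch does this for two caterpillar trees with five leaves; the edge-list representation and the function names are assumptions for this example, and the search is only practical for small n.

```python
from collections import defaultdict
from itertools import permutations


def bipartitions(edges, labels):
    """Non-trivial bipartitions of an unrooted tree (edge list, leaf -> label)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    all_labels = frozenset(labels.values())
    result = set()
    for u, v in edges:
        # Labels reachable from u when edge {u, v} is removed.
        side, stack, seen = set(), [u], {u, v}
        while stack:
            node = stack.pop()
            if node in labels:
                side.add(labels[node])
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        side = frozenset(side)
        other = all_labels - side
        if len(side) >= 2 and len(other) >= 2:
            result.add(frozenset((side, other)))
    return result


def max_tpd(edges1, leaves1, edges2, leaves2):
    """Exact worst case TPD by trying every labeling of the second tree."""
    n = len(leaves1)
    fixed = bipartitions(edges1, dict(zip(leaves1, range(1, n + 1))))
    return max(
        len(fixed ^ bipartitions(edges2, dict(zip(leaves2, perm))))
        for perm in permutations(range(1, n + 1))
    )


# Binary caterpillar with five leaves: m = 2 internal edges.
caterpillar = [("a", "p"), ("b", "p"), ("p", "q"), ("c", "q"),
               ("q", "r"), ("d", "r"), ("e", "r")]
leaves = ["a", "b", "c", "d", "e"]
print(max_tpd(caterpillar, leaves, caterpillar, leaves))   # 4
```

The value 4 equals m₁ + m₂ = 2n − 6 for n = 5; one labeling that attains it assigns 1, 4, 2, 3, 5 to the leaves a, b, c, d, e of the second tree.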

For unrooted trees T₁ and T₂ that are not necessarily proper, i.e., where it is possible that internal nodes have degree 2, Theorem 1 implies the following corollary.

Corollary 1. For each two unrooted trees T_i, i ∈ {1, 2}, with n ≥ 3 leaves and m_i internal edges, where k_i is the number of internal nodes with degree 2 in T_i, there exist {1, 2, …, n}-labelings of the leaves of both trees such that for the corresponding phylogenetic trees d(T₁, T₂) = m₁ − k₁ + m₂ − k₂.

Proof. To see that the corollary holds consider the case that in one of the trees T_i there exists a path of maximal length with internal nodes n₁, n₂, …, n_j that all have degree 2. Then there exist nodes n₀ and n_{j+1} not in the path such that n₀ is connected to n₁ and n_j is connected to n_{j+1}. For each two edges e, e′ that are incident to (at least) one of the nodes in the path, π(T_i, e) = π(T_i, e′) holds. Hence, if the path n₁, n₂, …, n_j is removed from T_i and replaced by a single edge (i.e., n₀ is connected to n_{j+1}), for the resulting tree T′ the equality P(T′) = P(T_i) holds. Iteratively apply this procedure until all k₁ + k₂ internal nodes with degree 2 have been removed in both trees, and apply Theorem 1 to the resulting trees.
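The reduction used in this proof, replacing every path of degree-2 nodes by a single edge, can be sketched in Python as follows. The edge-list representation, the function name, and the `keep` parameter (nodes that must not be removed, e.g., the root of a rooted tree) are assumptions for this illustration; node names are assumed to be mutually comparable so that edges can be sorted.

```python
from collections import defaultdict


def suppress_degree_two(edges, keep=()):
    """Return the edge list of the tree with all degree-2 nodes suppressed."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for node in list(adj):
        if node not in keep and len(adj[node]) == 2:
            # Remove the node and join its two neighbours directly.
            a, b = adj.pop(node)
            adj[a].discard(node)
            adj[b].discard(node)
            adj[a].add(b)
            adj[b].add(a)
    return sorted({tuple(sorted((u, v))) for u in adj for v in adj[u]})


# Quartet whose internal edge is subdivided by the degree-2 nodes r and s.
subdivided = [("a", "p"), ("b", "p"), ("p", "r"), ("r", "s"), ("s", "q"),
              ("c", "q"), ("d", "q")]
print(suppress_degree_two(subdivided))
# [('a', 'p'), ('b', 'p'), ('c', 'q'), ('d', 'q'), ('p', 'q')]
```

Removing a degree-2 node leaves the degrees of its two neighbours unchanged, so a single pass over the nodes suffices.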

For rooted topological trees we show in the following a theorem that is analogous to Theorem 1 for unrooted trees and gives a bound on the cluster distance of two corresponding phylogenetic trees.

Theorem 2. For each two proper rooted topological trees T₁ and T₂ with n ≥ 3 leaves and m₁, respectively m₂, internal edges (0 ≤ m₁ ≤ n − 2, 0 ≤ m₂ ≤ n − 2) there exist {1, 2, …, n}-labelings of the leaves of both trees such that d_r(T₁, T₂) = m₁ + m₂.

Proof. The proof is similar to the proof of Theorem 1. We show that there exist labelings for the leaves of T₁ and T₂ such that for each internal node v₁ of T₁ the cluster cl(T₁, v₁) is not contained in Cl(T₂) and for each internal node v₂ of T₂ the cluster cl(T₂, v₂) is not contained in Cl(T₁). The proof is done by induction on n.

Base case n = 3. In this case each tree has either no internal node (then all three leaves are connected to the root) or it has one internal node that is connected to two leaves and the other leaf is connected to the root. In the first case a tree has only trivial clusters and in the second case it has exactly one non-trivial cluster that contains the labels of the two leaves that are connected to the internal node. If the second case holds for both trees then the two leaves that are connected to the internal node can get labels 1 and 2 for T₁ respectively labels 1 and 3 for T₂ and the theorem holds. Otherwise, the theorem holds for all {1, 2, 3}-labelings of T₁ and T₂.

For the inductive step consider two trees T₁ and T₂ with n ≥ 4 leaves and the following three cases:

(1) One of the trees has no internal node.
(2) Both trees have (at least) one internal node that is connected to at least 3 leaves.
(3) Both trees have (at least) one internal node and for (at least) one of the trees each internal node is connected to at most two leaves.

Proof for case (1). W.l.o.g. let T₁ be a tree that has no internal node and therefore Cl(T₁) contains only trivial clusters. Then, for each labeling of the leaves of T₂ and any internal node v of T₂ a non-trivial cl(T₂, v) is not in Cl(T₁) and the theorem holds.

Proof for case (2). Let u (x) be an internal node in T₁ (respectively T₂) that is connected to at least 3 leaves. From each tree remove one of the leaves connected to u, respectively x. The resulting trees T₁′ and T₂′ are both proper and have n − 1 leaves and m₁ (respectively m₂) internal nodes. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for T₁′ and T₂′ such that d_r(T₁′, T₂′) = m₁ + m₂. Extend these labelings to {1, 2, …, n}-labelings for T₁ and T₂ by assigning the label n to both removed leaves. Since for each two internal nodes v of T₁′ and y of T₂′ the clusters cl(T₁′, v) and cl(T₂′, y) are different, the clusters cl(T₁, v) and cl(T₂, y) are also different. Hence, d_r(T₁, T₂) = m₁ + m₂.

It remains to prove case (3). W.l.o.g. assume that in T₁ each internal node is connected to at most two leaves. Since T₁ is a proper tree there exists an internal node u of T₁ that is neighbour to exactly two leaves u₁ and u₂ and has exactly one other neighbour v (which is an internal node or the root). Consider the tree T₁′ that is obtained from T₁ by removing nodes u₁, u and edges {u₁, u}, {u, v} and by connecting node u₂ to v. Then T₁′ is proper and has n − 1 leaves and m₁ − 1 internal edges.

First assume that T₂ has a node x that is connected to at least three leaves x₁, x₂, and x₃. Construct T₂′ from T₂ by removing leaf x₁. Then T₂′ is proper and has n − 1 leaves and m₂ internal edges. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for T₂′ and for the reduced tree T₁′ of T₁ such that d_r(T₁′, T₂′) = m₁ − 1 + m₂. Extend these labelings to {1, 2, …, n}-labelings for T₁ and T₂ by assigning leaves u₁ and x₁ the label n. Clearly, no cluster of Cl(T₂) is equal to a cluster cl(T₁, w) where w is an internal node with w ≠ u. Also, cl(T₁, u) = {n, i} for an i ∈ {1, 2, …, n − 1} is not in Cl(T₂) because all clusters in Cl(T₂) that include n have at least three elements. Thus, d_r(T₁, T₂) = m₁ + m₂ holds.

It remains to consider the case that all internal nodes in T₂ are connected to at most two leaves. Then there must exist an internal node x in T₂ that is connected to two leaves x₁ and x₂ and to exactly one other node y that is an internal node or the root. Similarly as for T₁, create a tree T₂′ by removing nodes x₁, x and edges {x₁, x}, {x, y} from T₂ and by connecting x₂ to y. By construction, T₂′ is proper and has n − 1 leaves and m₂ − 1 internal edges. By the induction hypothesis there exist {1, 2, …, n − 1}-labelings for T₂′ and for the reduced tree T₁′ of T₁ such that d_r(T₁′, T₂′) = m₁ − 1 + m₂ − 1. Now consider four subcases.

Case a: Both nodes v and y have at least 2 neighbouring leaves in the reduced trees T₁′, respectively T₂′. Then we can assume w.l.o.g. that leaf u₂ has label i in T₁′ and leaf x₂ has label j in T₂′ with j ≠ i (because otherwise the label of u₂ can be exchanged with the label of another leaf that is connected to v, which does not change Cl(T₁′)). Now, extend the labelings to {1, 2, …, n}-labelings for T₁ and T₂ by assigning both leaves u₁ and x₁ label n. Clearly, cluster cl(T₁, u) is not in Cl(T₂) and cluster cl(T₂, x) is not in Cl(T₁). Hence, d_r(T₁, T₂) = m₁ + m₂ easily follows.

Case b: Node v has at least two neighbouring leaves (one leaf is u₂ and let v₁ be the other leaf) in the reduced tree T₁′ and node y has only one neighbouring leaf (i.e., leaf x₂) in the reduced tree T₂′. Assume that j is the label of x₂. Then at least one of u₂ and v₁ has a label i ≠ j. Assume first that node u₂ has label i and extend the labelings of T₁′ and T₂′ to {1, 2, …, n}-labelings for T₁ and T₂ by assigning label n to leaves u₁ and x₁. Then the cluster cl(T₁, u) is not in Cl(T₂) and the cluster cl(T₂, x) is not in Cl(T₁). It follows easily that d_r(T₁, T₂) = m₁ + m₂ holds. It remains to consider the case that node v₁ has label i. Then exchange the labels of u₂ and v₁ in T₁′. Clearly, for this labeling d_r(T₁′, T₂′) = m₁ − 1 + m₂ − 1 still holds. Now, proceed as before to show the result.

Case c: Node v has only one neighbouring leaf in the reduced tree T₁′ and node y has at least 2 neighbouring leaves in the reduced tree T₂′. This case is symmetric to Case (b) and the proof is analogous.

Case d: Nodes v and y have only one neighbouring leaf in the reduced trees T₁′, respectively T₂′. For the chosen {1, 2, …, n − 1}-labelings for T₁′ and T₂′ with d_r(T₁′, T₂′) = m₁ − 1 + m₂ − 1 let i and j be the labels of u₂, respectively x₂. If i ≠ j extend the labelings to {1, 2, …, n}-labelings for T₁ and T₂ by assigning label n to leaves u₁ and x₁. Then cluster cl(T₁, u) is not in Cl(T₂) and cluster cl(T₂, x) is not in Cl(T₁). Now, it is not hard to show that d_r(T₁, T₂) = m₁ + m₂ holds.

It remains to consider the case i = j. Let k ∈ {1, 2, …, n − 1} be a label with k ≠ i. First, extend the labelings of the reduced trees T₁′ and T₂′ to {1, 2, …, n}-labelings for T₁ and T₂ by assigning label n to leaves u₁ and x₁. Then change the labeling for T₁ by exchanging the labels of the leaves with labels k and n, and for T₁′ by replacing the label of the leaf with label k by n. Now, cluster cl(T₁, u) = {k, i} is not in Cl(T₂). Similarly, cluster cl(T₂, x) = {n, i} is not in Cl(T₁). By the construction it holds for each cluster in Cl(T₁) that it contains either both i and k or none of them. Similarly, it holds for each cluster in Cl(T₂) that it contains either both i and n or none of them. Hence, for each internal node w ≠ u in T₁ it holds: either i ∉ cl(T₁, w) and cl(T₁, w) = cl(T₁′, w), or i, k ∈ cl(T₁, w) and cl(T₁, w) = cl(T₁′, w) ∪ {k}. Similarly, for each internal node z ≠ x in T₂ it holds: either i ∉ cl(T₂, z) and cl(T₂, z) = cl(T₂′, z), or i, n ∈ cl(T₂, z) and cl(T₂, z) = cl(T₂′, z) ∪ {n}.

Now, it remains to show that for each cl ∈ Cl(T₁), cl ≠ {k, i} implies cl ∉ Cl(T₂) and, vice versa, for each cl ∈ Cl(T₂), cl ≠ {n, i} implies cl ∉ Cl(T₁). To show the first statement, let cl ∈ Cl(T₁), cl ≠ {k, i}. There exist four cases:

Case i) i, n ∉ cl. Then k ∉ cl and cl ∈ Cl(T₁′). Therefore, by the induction hypothesis, cl ∉ Cl(T₂′). By the construction it follows that cl ∉ Cl(T₂).
Case ii) i ∉ cl, n ∈ cl. Then cl ∉ Cl(T₂) because every cluster in Cl(T₂) contains either both i and n or none of them.
Case iii) i ∈ cl, n ∉ cl. Similarly as in case (ii) it follows that cl ∉ Cl(T₂).
Case iv) i, n ∈ cl. Then k ∈ cl and cl − {n} is a cluster of T₁′ under the labeling from the induction hypothesis. Therefore, cl − {n} ∉ Cl(T₂′). By the construction it follows that cl ∉ Cl(T₂).

The second statement, i.e., cl ∈ Cl(T₂), cl ≠ {n, i} implies cl ∉ Cl(T₁), can be shown by symmetric arguments. Thus, the theorem holds.
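As for the unrooted case, the statement of Theorem 2 can be checked exhaustively for small trees. The Python sketch below uses nested tuples for rooted trees and tries every relabeling of the second tree; the representation and the function names are assumptions for this example.

```python
from itertools import permutations


def clusters(tree):
    """Non-trivial clusters of a rooted tree given as nested tuples."""
    found = set()

    def visit(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        cl = frozenset().union(*(visit(child) for child in node))
        found.add(cl)
        return cl

    all_labels = visit(tree)
    return {cl for cl in found if 2 <= len(cl) < len(all_labels)}


def relabel(tree, mapping):
    """Apply a label mapping to every leaf of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return mapping[tree]
    return tuple(relabel(child, mapping) for child in tree)


def max_cd(tree1, tree2, n):
    """Exact worst case CD by trying every labeling of the second tree."""
    fixed = clusters(tree1)
    best = 0
    for perm in permutations(range(1, n + 1)):
        mapping = dict(zip(range(1, n + 1), perm))
        best = max(best, len(fixed ^ clusters(relabel(tree2, mapping))))
    return best


# Rooted binary caterpillar with four leaves: m = 2 internal edges.
caterpillar = (((1, 2), 3), 4)
print(max_cd(caterpillar, caterpillar, 4))   # 4
```

The maximum 4 = m₁ + m₂ is attained, for example, by the labeling (((1, 3), 4), 2) of the second tree, whose clusters {1, 3} and {1, 3, 4} both differ from {1, 2} and {1, 2, 3}.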

Similarly as for unrooted trees, Theorem 2 implies the following corollary for rooted trees T₁ and T₂ that are not necessarily proper, i.e., where it is possible that internal nodes have degree 2.

Corollary 2. For each two rooted trees T_i, i ∈ {1, 2}, with n ≥ 3 leaves and m_i internal edges, where k_i is the number of internal nodes with degree 2 in T_i, there exist {1, 2, …, n}-labelings of the leaves of both trees such that for the corresponding phylogenetic trees d_r(T₁, T₂) = m₁ − k₁ + m₂ − k₂.

Proof. To see that the corollary holds consider the case that in one of the trees T_i there exists a path n₁, n₂, …, n_k of maximal length such that all nodes n_j of the path are internal nodes with degree 2. Let n_k be the node of the path that is farthest away from the root and let n_{k+1} be the node that is connected with n_k and is not on the path. Clearly such a node n_{k+1} must exist and n_{k+1} is a leaf or an internal node with degree ≥ 3. Let n₀ be the node that is connected to n₁ and is not on the path. Clearly such a node n₀ must exist and n₀ is the root or an internal node with degree ≥ 3. For each node n_j, 1 ≤ j ≤ k, cl(T_i, n_j) = cl(T_i, n_{k+1}) holds. Hence, if the path n₁, n₂, …, n_k is removed from T_i and replaced by a single edge that connects n₀ and n_{k+1}, for the resulting tree T′ the equality Cl(T′) = Cl(T_i) holds. Iteratively apply this procedure until all k₁ + k₂ internal nodes with degree 2 have been removed from both trees, and apply Theorem 2 to the resulting trees.

From Corollary 2 we obtain the following corollary on the worst case of the unrooted cluster distance between an unrooted phylogenetic tree and a rooted phylogenetic tree, both with n leaves.

Corollary 3. For each unrooted tree T₁ and rooted tree T₂, both with n ≥ 3 leaves and with m₁, respectively m₂, internal edges, where k_i is the number of internal nodes with degree 2 in T_i, i ∈ {1, 2}, there exist {1, 2, …, n}-labelings of the leaves of both trees such that for the corresponding phylogenetic trees d_ur(T₁, T₂) = m₁ − k₁ + m₂ − k₂.

Proof. For an edge e of T₁ consider the rooted tree T_{1,e}. By definition of T_{1,e}, it is a rooted tree with n leaves that has k₁ internal nodes of degree 2. To see this, recall that the root is not an inner node. Now, the corollary follows immediately from Corollary 2.

Note that the proof of Corollary 3 has shown that for any edge e of T₁ there exist {1, 2, …, n}-labelings of the leaves of trees T_{1,e} and T₂ such that d_r(T_{1,e}, T₂) = m₁ − k₁ + m₂ − k₂.

Conclusion

It was shown that for two topological trees T₁ and T₂ with n leaves, m_i internal edges in tree T_i, and k_i nodes of degree 2 in T_i, i ∈ {1, 2}, there exist assignments of labels {1, 2, …, n} to the leaves of each tree such that the tree partition distance (TPD; also called Robinson-Foulds distance for unrooted trees) between the corresponding unrooted phylogenetic trees is m₁ − k₁ + m₂ − k₂. In addition, this number is an upper bound, i.e., there do not exist assignments of labels {1, 2, …, n} to the leaves such that the TPD between both trees is larger than m₁ − k₁ + m₂ − k₂. Moreover, it was shown that analogous results hold for the cluster distance (CD; also called Robinson-Foulds distance for rooted trees) of two rooted trees and for the unrooted cluster distance (urCD) of an unrooted tree and a rooted tree. Our results can be used to compute a normalized version of the corresponding distance measures.

References

1. Bourque M. Arbres de Steiner et reseaux dont certains sommets sont a localisation variable [dissertation]. Montreal: Université de Montreal; 1978.
2. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math. Biosci. 1981; 53:131–147.
3. Robinson DF, Foulds LR. Comparison of weighted labeled trees. In: Horadam AF, Wallis WD, editors. Combinatorial Mathematics VI, Lecture Notes in Mathematics, vol 748. Berlin, Heidelberg: Springer; 1979; 748:119–126.
4. Bryant D, Steel M. Computing the Distribution of a Tree Metric. IEEE/ACM Trans Comput Biol Bioinform, 2009; 6(3):420–426. pmid:19644170
5. Penny D, Hendy MD. The Use of Tree Comparison Metrics. Syst. Zool. 1985; 34(1):75–82.
6. Hendy MD, Little CHC, Penny D. Comparing Trees with Pendant Vertices Labelled. SIAM Journal on Applied Mathematics. 1984; 44(5):1054–1065.
7. Steel MA. Distribution of the Symmetric Difference Metric on Phylogenetic Trees. SIAM J. Discrete Math. 1988; 1(4):541–551.
8. Geoghegan JL, Duchêne S, Holmes EC. Comparative analysis estimates the relative frequencies of co-divergence and cross-species transmission within viral families. PLoS Pathogens. 2017; 13(2):e1006215. pmid:28178344
9. Kupczok A, Schmidt HA, von Haeseler A. Accuracy of phylogeny reconstruction methods combining overlapping gene data sets. Algorithms for Molecular Biology. 2010; 5(37):17pp
10. Steel MA, Penny D. Distributions of Tree Comparison Metrics—Some New Results. Syst. Biol. 1993; 42(2):126–141.
11. Guillerme T, Cooper N. Effects of missing data on topological inference using a Total Evidence approach. Molecular Phylogenetics and Evolution. 2016; 94:146–158. pmid:26335040
12. Bogdanowicz D, Giaro K, Wrobel B. TreeCmp: Comparison of trees in polynomial time. Evolutionary Bioinformatics, 2012; 8:475–487.
13. Weyenberg G, Yoshida R. Phylogenetic Tree Distances. In: Kliman RM, editor. The Encyclopedia of Evolutionary Biology, Oxford: Academic Press; 2016; 3:285–290.
14. Ho SYW, Duchêne S, Duchêne D. Simulating and detecting autocorrelation of molecular evolutionary rates among lineages. Molecular Ecology Resources. 2015; 15(4):688–996. pmid:25155426
15. Górecki P, Eulenstein O. A Robinson-Foulds Measure to Compare Unrooted Trees with Rooted Trees. In: Bleris L, Mandoiu I, Schwartz R, Wang J, editors. Proc. 8th International Symposium on Bioinformatics Research and Applications (ISBRA 2012). Berlin: Springer. 2012; LNCS 7292:115–126.
