Abstract
The cophenetic distance is a well-established metric in biology used to compare pairs of trees represented in a vector format. This distance was introduced by Cardona and his co-authors, building on the foundational work of Sokal and Rohlf, which dates back over 60 years. It is widely recognized for its versatility since it can analyze trees with edge weights using various vector norms. However, when comparing large-scale trees, the quadratic runtime of the current best-known (i.e., naïve) algorithm for computing the cophenetic distance can become prohibitive. Recently, a new algorithmic framework with near-linear time complexity has been developed to calculate the distances of a generalized class of cophenetic distances, which are derived from the work of Sokal and Rohlf. This improvement not only allows the cophenetic distance to be utilized in large-scale studies but also enhances the versatility of these studies by incorporating generalized variants of the cophenetic distance. However, the framework is limited to applying only the L1 and L2 vector norms, which significantly restricts the versatility of generalized cophenetic distances in large-scale applications. To address this limitation, we present a near-linear time algorithmic framework for computing the generalized cophenetic distances across all Lp vector norms. In our scalability study, we showcase the practical performance of our unrestricted algorithmic framework. Furthermore, we investigate the applicability of the generalized cophenetic distances by analyzing the distributions of key components of these distances under various vector norms.
Author summary
Biological research often relies on large-scale comparisons of evolutionary trees to extract valuable insights across different subfields of biology. To effectively compare trees on a large scale, sensitive metrics are needed to assess differences in topology and branch lengths alongside efficient computational algorithms. In this study, we focus on the classic cophenetic distance, a metric that compares pairs of trees represented in vector format. The cophenetic distance can analyze trees with edge weights using various vector norms, making it highly versatile. Recently, a fast algorithm was developed to calculate generalized cophenetic distances, including the cophenetic distance itself, which has enabled large-scale studies and expanded the types of cophenetic distances applicable for tree pair comparisons. However, this algorithm is limited to L1 and L2 norms, which greatly reduces the versatility of the cophenetic distance. To overcome this limitation, we introduce a fast algorithm that computes the generalized cophenetic distance for all vector norms, making it suitable for large-scale studies. We also conduct a scalability study to demonstrate the effectiveness of our algorithm in practice, and we analyze the distributions of key representatives of cophenetic distances across various vector norms.
Citation: Górecki P, Markin A, Vijendran S, Eulenstein O (2025) Computing generalized cophenetic distances under all Lp norms: A near-linear time algorithmic framework. PLoS Comput Biol 21(6): e1013069. https://doi.org/10.1371/journal.pcbi.1013069
Editor: Joëlle Barido-Sottani, Ecole Normale Superieure, FRANCE
Received: February 5, 2025; Accepted: April 17, 2025; Published: June 10, 2025
Copyright: © 2025 Górecki et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: The source code and data used to produce the results and analyses presented in this manuscript are available from the GitHub repository “near-linear-cophenetic-distance”: https://github.com/sriram98v/near-linear-Cophenetic-distance.
Funding: The support to PG was provided by National Science Centre grant #2019/33/B/ST6/00737 (https://www.ncn.gov.pl/en). This project was funded in part by the United States Department of Agriculture (USDA), Agricultural Research Service (ARS, https://ars.usda.gov) project numbers 3022-32000-018-017-S, 5030-32000-231-095-S and with federal funds from the USDA Agricultural Research Service (5030-32000-231-095-S, and 3022-32000-018-017-S) to OE. The funding sources had no role in study design, data collection, and interpretation, or the decision to submit the work for publication. Mention of trade names or commercial products in this article is solely to provide specific information and does not imply recommendation or endorsement by the USDA. USDA is an equal opportunity provider and employer.
Competing interests: The authors have declared that no competing interests exist.
Introduction
The cophenetic distance, introduced by Cardona et al. [1], originates from the pioneering work of Sokal and Rohlf over 60 years ago [2] and has gained substantial recognition in biology due to its reputation for reliability in analysis. This distance is a vector-based metric that is more versatile than many other commonly used tree metrics [3–5], as it can be applied to tree pairs with edge weights and analyzed using various vector norms. This capability allows for a more in-depth analysis, particularly when comparing similar tree topologies. Consequently, the cophenetic distance has broad applicability across various fields, such as phylogenetics [6–8], genomics [9], ecology [10–12], epidemiology [13,14], and conservation biology [15].
The cophenetic distance between a pair of rooted binary trees is defined based on their representation as cophenetic vectors. A cophenetic vector for a rooted tree assigns a value to each pair of taxa within the tree, representing the depth of their least common ancestor in that particular tree. Cophenetic vectors encode their corresponding trees equivalently [1], allowing the distance between a pair of trees to be measured as an Lp norm of the difference between the cophenetic vectors of the two trees. Vector norms are beneficial not only for biological data analysis [16] but are also applicable in various fields, such as statistical analyses [17] and machine learning [18]. An example of various vector norms is depicted in Fig 1.
Fig 1. The vectors have values computed for the least common ancestor (lca, see the bottom row) of the corresponding taxon pairs. For example, the lca of the second taxon pair in T is v, and the second position in the cophenetic vector for T reflects the contribution of v, that is, the depth of v, which is 1.
With the advent of genome-scale data, the established reputation of the cophenetic distance and its analysis using various norms has led to increased interest among researchers in utilizing this metric for large-scale biological analyses. Such analyses include phylogenetics [19], median tree estimation [20,21], comparison of evolutionary process models [22], balancing of trees and networks [6,7], studies on theoretical properties of cophenetic norm [23], and representative sampling [10]. Calculating cophenetic distances using the naïve algorithm, which has a quadratic time complexity, is slow and impractical for large datasets. Fortunately, an efficient algorithmic framework now enables near-linear time computations for a generalized class of cophenetic distances for the L1 and L2 norms [24]. This advancement makes it feasible to conduct large-scale analyses. However, despite these advancements, algorithms for calculating cophenetic distances in sub-quadratic time for other Lp norms are still unknown.
In this paper, we present a near-linear time algorithmic framework for calculating the Lp norms for the generalized class of cophenetic distances. We demonstrate the performance of our algorithmic framework through a scalability study. Furthermore, we analyze the distributions of key representatives from the generalized class of cophenetic distances across various Lp norms. An implementation of the algorithmic framework can be found on GitHub [25].
Related work. The cophenetic distance is frequently used in phylogenetics to measure the similarity between two trees; both trees are encoded as vectors and are then compared in the corresponding vector space using different norms, such as the L1, L2, or $L_\infty$ norm. Another popular metric that utilizes vector encodings is the path-difference distance, and both metrics can be naïvely computed in $O(pn^2)$ time for a pair of trees with n taxa under the Lp norm. This quadratic runtime is highly restrictive in the realm of large-scale phylogenetic analysis. The works [26,27] addressed the quadratic barrier for the path-difference metric under the Lp norm by proposing near-linear time algorithms. For the cophenetic distances, [24] proposed a novel algorithmic framework to efficiently compute the distance between a pair of trees under the L1 and L2 norms in $O(n \log^2 n)$ time and $O(n \log n)$ time, respectively. Furthermore, [24] demonstrated that the framework can be applied to a broad class of cophenetic metrics constructed using path-monotonic mappings, which are monotonic on the paths of the input trees. For instance, the original cophenetic distance is derived from the depth of a node, which is a path-monotonic mapping. This condition enables the development of hybrid approaches, resulting in new cophenetic costs rather than metrics, by allowing distinct path-monotonic mappings for each input tree. Such an approach may be particularly suitable when the input trees originate from different sources.
Contribution. We introduce a near-linear time algorithmic framework for computing the Lp norms of a broad class of cophenetic distances. Our framework achieves the following time complexities: $O(pn \log n)$ for even values of p, $O(n \log^2 n + pn \log n)$ for odd values of p, and $O(n \log^2 n)$ under the $L_\infty$ norm.
Our approach begins by partitioning the input trees into four approximately equal-sized subtrees. Based on the placement of taxon pairs within these subtrees, we identify several categories of taxon pair locations. For each category, we develop a specific method to compute their contributions to the cophenetic distance. The first category requires four recursive calls on controlled (smaller) subtrees extracted from the input trees. Other categories can be computed in linear time. For the final category, the computation depends on the parity of p, resulting in time complexities of $O(n \log n + pn)$ and O(pn) for odd and even p, respectively. By integrating these methods, we develop a unified divide-and-conquer algorithm for computing cophenetic distances under Lp norms with finite p. Furthermore, we extend this approach to efficiently compute the $L_\infty$ norm of the cophenetic distance.
We demonstrate the efficiency of our algorithmic frameworks through experimental evaluations. We analyze the runtime of the proposed algorithm for tree pairs with varying numbers of taxa in a scalability study under multiple Lp norms. Our findings indicate that our implementation of the divide-and-conquer strategy significantly outperforms the best-known quadratic algorithm for pairs of trees with more than 400 taxa across all tested norms.
Lastly, we investigated various types of cophenetic distances under different Lp norms, treating them as distributions of pairwise distances. The trees for each distribution were randomly generated using the uniform model and the Yule model [28]. Our analyses reveal that the class of cophenetic distances provides a high degree of diversity under larger Lp norms, which can benefit numerous applications in comparative phylogenetics.
Methods
This section presents algorithms for computing the Lp norms of cophenetic distances. We begin with the necessary definitions and outline the method for identifying a median vertex in a rooted tree. Next, we describe how input trees are partitioned into four approximately equal subtrees using median vertices. Based on the positions of taxon pairs within these subtrees, we classify them into several categories. For each category, we propose distinct algorithms to calculate their contributions to the cophenetic distance. Finally, we introduce a unified algorithm for computing cophenetic distances for all finite Lp norms and the $L_\infty$ norm. We also analyze and present the time and space complexities of the proposed algorithms.
Definitions
We introduce the needed notation and terminology, partially following [24]. Let $T$ be a rooted binary tree, and let v and w be vertices in T. The root of T is denoted by $\mathrm{root}(T)$. The least common ancestor of v and w in T is denoted by $\mathrm{lca}_T(v, w)$. We use $v \preceq w$ to denote that w lies on the path between v and the root of T. Note that $v \prec w$ means $v \preceq w$ and $v \neq w$. A vertex v is called strictly internal in T if v is neither a leaf nor the root. For any non-root vertex v, $\mathrm{par}(v)$ and $\mathrm{sib}(v)$ denote the parent and sibling of v, respectively. The set of leaves in T is denoted by $L_T$, and the number of leaves by $|T|$. Similarly, $L_T(v)$ denotes the set of all leaves reachable from v, and $|v|$ represents the size of $L_T(v)$. A weighted tree T is a rooted binary tree with an edge weight function that assigns a numeric weight to each edge of T.
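We say that a function $\sigma$ that assigns a number to every vertex of T is path-monotonic if either $\sigma(v) \geq \sigma(w)$ for every v and w such that $v \preceq w$ (descending), or $\sigma(v) \leq \sigma(w)$ for every v and w such that $v \preceq w$ (ascending); here $v \preceq w$ means that w lies on the path from v to the root. In this article, we distinguish the following three contribution functions, defined for unweighted trees, where for a vertex v from T its contribution is: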
- the depth of v, i.e., the number of edges on the path from v to the root of T [1],
- the height of the subtree rooted at v, defined as the maximum number of edges in any path from v to a leaf in the subtree,
- the number of leaves in the subtree of T rooted at v.
Note that the first contribution function is descending, whereas the other two are ascending.
A similar definition applies to weighted depth and height in weighted trees, but instead of counting the number of edges, we sum the weights of edges. See examples in Fig 1. Note that our definition allows for the contribution of the root. Therefore, a tree may also include an additional rooting edge (with an associated weight), whose bottom vertex is the root, if necessary.
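The three contribution functions can be illustrated with a small sketch. It is an illustration only, not the authors' implementation: trees are assumed to be nested 2-tuples with string leaves, and the helper names (`vertex_contributions`, `is_descending`, `is_ascending`) are hypothetical. The monotonicity checks compare each vertex only with its parent, which suffices by transitivity along a path to the root.

```python
def vertex_contributions(tree):
    """Return one record per vertex with its depth, height, size and parent index.

    Assumption: `tree` is a nested 2-tuple; anything that is not a tuple is a leaf.
    """
    rows = []

    def visit(node, depth, parent):
        idx = len(rows)
        rows.append(None)  # reserve the slot so that children can refer to idx
        if isinstance(node, tuple):
            kids = [visit(child, depth + 1, idx) for child in node]
            height = 1 + max(rows[k]["height"] for k in kids)
            size = sum(rows[k]["size"] for k in kids)
        else:
            height, size = 0, 1
        rows[idx] = {"depth": depth, "height": height, "size": size, "parent": parent}
        return idx

    visit(tree, 0, None)
    return rows


def is_descending(rows, key):
    """True if the value never increases when moving from a vertex to its parent."""
    return all(r[key] >= rows[r["parent"]][key] for r in rows if r["parent"] is not None)


def is_ascending(rows, key):
    """True if the value never decreases when moving from a vertex to its parent."""
    return all(r[key] <= rows[r["parent"]][key] for r in rows if r["parent"] is not None)


rows = vertex_contributions((("a", "b"), "c"))
print(rows[0])  # record of the root
```

On this example the depth is descending, while the height and the size are ascending, in line with the note above.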
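Let $T$ and $T'$ be trees with the same set of leaves $\{l_1, \dots, l_n\}$, where the ordering of leaves is fixed. Let $\sigma$ and $\sigma'$ (both either ascending or descending) be two path-monotonic contribution functions defined on $T$ and $T'$, respectively.
The cophenetic vector of T is defined as $\varphi_\sigma(T) = \big(\sigma(\mathrm{lca}_T(l_i, l_j))\big)_{1 \le i \le j \le n}$, and similarly for $T'$. The Lp-cophenetic distance with respect to the contribution functions $\sigma$ and $\sigma'$ is defined for $p \in \{1, 2, \dots\} \cup \{\infty\}$ as:
$$d_p(T, T') = \lVert \varphi_\sigma(T) - \varphi_{\sigma'}(T') \rVert_p,$$
where $\lVert \cdot \rVert_p$ is the Lp norm. That is, dp is the Lp norm of the difference between the two cophenetic vectors induced by $\sigma$ and $\sigma'$.
Formally, in our case, if p is finite,
$$d_p(T, T') = \Big( \sum_{1 \le i \le j \le n} \big| \sigma(\mathrm{lca}_T(l_i, l_j)) - \sigma'(\mathrm{lca}_{T'}(l_i, l_j)) \big|^p \Big)^{1/p},$$
otherwise
$$d_\infty(T, T') = \max_{1 \le i \le j \le n} \big| \sigma(\mathrm{lca}_T(l_i, l_j)) - \sigma'(\mathrm{lca}_{T'}(l_i, l_j)) \big|.$$
It should be clear that dp is a metric [1] as long as both $\sigma$ and $\sigma'$ are based on the same type of contribution functions (e.g., depth).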
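For orientation, the definition can be evaluated directly with the naïve quadratic procedure. The following sketch is an assumption-laden illustration (nested 2-tuples as trees, depth as the contribution function in both trees, hypothetical function names) and is not the near-linear framework developed in this paper.

```python
import math


def cophenetic_vector(tree):
    """Map each unordered leaf pair {x, y} (x may equal y) to the depth of its lca.

    Assumption: `tree` is a nested 2-tuple with string leaves; the root has depth 0.
    """
    vec = {}

    def visit(node, depth):
        if not isinstance(node, tuple):
            vec[(node, node)] = depth  # the lca of (x, x) is the leaf x itself
            return [node]
        left = visit(node[0], depth + 1)
        right = visit(node[1], depth + 1)
        for x in left:
            for y in right:
                vec[tuple(sorted((x, y)))] = depth  # pairs separated at this vertex
        return left + right

    visit(tree, 0)
    return vec


def cophenetic_distance(t1, t2, p):
    """Naive Lp-cophenetic distance; p is a positive integer or math.inf."""
    v1, v2 = cophenetic_vector(t1), cophenetic_vector(t2)
    if v1.keys() != v2.keys():
        raise ValueError("both trees must have the same leaf set")
    diffs = [abs(v1[k] - v2[k]) for k in v1]
    if p == math.inf:
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


t1 = (("a", "b"), ("c", "d"))
t2 = (("a", "c"), ("b", "d"))
print(cophenetic_distance(t1, t2, 1), cophenetic_distance(t1, t2, 2))
```

The two nested loops over leaf pairs are what makes this approach quadratic in the number of taxa.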
Median vertex in a rooted tree
A vertex t of a rooted tree T divides the tree into two parts: the subtree of T rooted at t, denoted by $T_t$ and referred to as the lower tree with respect to t, and the tree $T^t$, called the upper tree, which is obtained by replacing $T_t$ with a leaf. The vertex t, as introduced in the next lemma, is called a median vertex and can be computed in O(n) time.
Lemma 1 (The existence of a median vertex; [24]). For every rooted binary tree T of size there is a vertex t such that
and
.
We will present the proof for Lemma 1, as it was not included in [24].
Proof: Let t be a vertex computed as follows: initialize t as the root of T, and repeatedly update t to its child with the largest subtree size until . We show that the resulting vertex t satisfies the condition.
Let be the sibling of t and p be the parent of t. We have |Tt| = |t|,
, and
. Hence,
. If
then
. Thus,
, which is a contradiction. Thus,
. Finally,
.
A median vertex is usually non-unique, e.g., a rooted tree has two median vertices b and c. Note that the concept of a median vertex is similar to the centroid of a tree, which partitions an unrooted tree into subtrees, each with a size of at most $n/2$. However, in our case, the median vertex divides a rooted binary tree into two subtrees, with the size of one subtree being between
and
, as shown in Lemma 1.
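The descent used in the proof of Lemma 1 can be sketched as follows. This is a hedged illustration: the tree encoding (nested 2-tuples), the function names, and in particular the stopping threshold (`fraction`, set here to one half of the leaves) are assumptions made for the example and may differ from the bound stated in Lemma 1.

```python
def annotate(node):
    """Return (leaf_count, original_node, annotated_children) for a nested-tuple tree."""
    if not isinstance(node, tuple):
        return (1, node, ())
    kids = tuple(annotate(child) for child in node)
    return (sum(k[0] for k in kids), node, kids)


def median_vertex(tree, fraction=0.5):
    """Walk from the root towards the larger child until the subtree is small enough.

    Assumption: `fraction` is an illustrative threshold, not the constant of Lemma 1.
    Returns the subtree rooted at the chosen vertex and its number of leaves.
    """
    current = annotate(tree)
    n = current[0]
    while current[0] > fraction * n and current[2]:
        current = max(current[2], key=lambda k: k[0])  # child with the largest subtree
    return current[1], current[0]


caterpillar = ((((("a", "b"), "c"), "d"), "e"), "f")
print(median_vertex(caterpillar))
```

With this threshold, the parent of the returned vertex has more than n/2 leaves and the returned vertex is its larger child, so the lower tree has more than n/4 and at most n/2 leaves. Subtree sizes are computed once, so the whole procedure runs in linear time.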
Classification of taxon pairs
In the remaining part of the article, we assume that $T$ and $T'$ are two trees that share the same fixed set of leaves. Let t and $t'$ represent fixed median vertices of T and $T'$, respectively. Additionally, let $\sigma$ and $\sigma'$ denote the contribution functions of T and $T'$, respectively. Without loss of generality, we further assume that all contribution functions $\sigma$ and $\sigma'$ are descending. That is, for any pair of vertices v and w such that $v \preceq w$, it holds that $\sigma(v) \geq \sigma(w)$.
Given two trees T and $T'$, we fix arbitrarily one median vertex in each tree. Then, the path connecting the median vertex with the root will be called a median path. We denote by A and B the sets of leaves excluding the median vertex in the upper and lower trees of T, respectively. Similarly, we denote $A'$ and $B'$ for $T'$. Note that $A \cup B = A' \cup B'$.
We now have four possible classes for a pair of leaves $(x, y)$, where x can be equal to y, from T, depending on their location:
- AA: if both leaves are located in the upper tree (i.e., $x, y \in A$),
- BB: if $x, y \in B$,
- AB: if $x \in A$ and $y \in B$,
- BA: if $x \in B$ and $y \in A$.
When considering the pair in both trees, there are 16 possible types of locations, denoted in the form $XY|X'Y'$, where $X, Y \in \{A, B\}$ and $X', Y' \in \{A', B'\}$. We say that the pair $(x, y)$ has type $XY|X'Y'$ if $x \in X \cap X'$ and $y \in Y \cap Y'$.
Since some of the types represent the same sets of taxon pairs due to symmetry, we introduce categories for the joint representation of types, as indicated in Table 1. There are 4 non-mixed categories that correspond uniquely to the types $AA|A'A'$, $AA|B'B'$, $BB|A'A'$, and $BB|B'B'$. Taking symmetry into account, there are 4 single-mixed categories $S_1, \dots, S_4$ (e.g., in S1 the type $AA|A'B'$ is equivalent to the $AA|B'A'$ type, since they involve the same taxon pairs) and 2 double-mixed categories: D1 (which includes $AB|A'B'$ and $BA|B'A'$) and D2 (which includes $AB|B'A'$ and $BA|A'B'$).
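The classification of a taxon pair can be sketched in a few lines. The sketch assumes that the upper-tree leaf sets of both trees are already known (as Python sets) and uses hypothetical function names; the kind of category (non-mixed, single-mixed, double-mixed) follows from how many of the two trees separate the pair.

```python
def pair_class(x, y, upper):
    """Class of the pair (x, y) in one tree: 'A' = upper tree, 'B' = lower tree."""
    return ("A" if x in upper else "B") + ("A" if y in upper else "B")


def pair_type(x, y, upper_first, upper_second):
    """Return the class in the first tree, the class in the second tree, and the kind.

    Assumption: `upper_first` and `upper_second` are the leaf sets of the upper
    trees of the two input trees; all remaining leaves belong to the lower trees.
    """
    c1 = pair_class(x, y, upper_first)
    c2 = pair_class(x, y, upper_second)
    mixed = (c1[0] != c1[1]) + (c2[0] != c2[1])  # number of trees separating x and y
    kind = ("non-mixed", "single-mixed", "double-mixed")[mixed]
    return c1, c2, kind


A_first = {"a", "b"}   # upper-tree leaves of the first tree (illustrative)
A_second = {"a", "c"}  # upper-tree leaves of the second tree (illustrative)
print(pair_type("a", "d", A_first, A_second))
```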
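To compute $d_p(T, T')$, it is sufficient to demonstrate how to compute it for one representative type $XY|X'Y'$ from each category, where $X, Y \in \{A, B\}$ and $X', Y' \in \{A', B'\}$. Specifically, we calculate the partial distances as:
$$d_p^{XY|X'Y'}(T, T') = \sum_{(x, y) \text{ of type } XY|X'Y'} \big| \sigma(\mathrm{lca}_T(x, y)) - \sigma'(\mathrm{lca}_{T'}(x, y)) \big|^p.$$
Then, the cophenetic distance is the $(1/p)$-th power of the sum of the partial distances computed for each representative of the category. In the next sections, we show algorithms to compute partial distances for each representative type.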
The partial distance of non-mixed types
According to Table 1, there are four types of non-mixed taxon pairs: $AA|A'A'$, $AA|B'B'$, $BB|A'A'$, and $BB|B'B'$. To calculate the partial distance for each non-mixed type, we begin by contracting the trees T and $T'$ to the set of leaves defined by the corresponding pair. For instance, for the type $AA|B'B'$, the trees T and $T'$ are contracted to the set $A \cap B'$.
We then compute the partial distance dp for these contracted trees recursively. See also lines 8-10 in Algorithm 5.
If f(n) represents the complexity of the algorithm for computing the distance between trees of size n, the total time to calculate these four partial distances is $f(n_1) + f(n_2) + f(n_3) + f(n_4)$, where $n_1 = |A \cap A'|$, $n_2 = |A \cap B'|$, $n_3 = |B \cap A'|$, and $n_4 = |B \cap B'|$ (note that $n_1 + n_2 + n_3 + n_4 = n$), plus the linear cost of performing the contractions.
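The contraction step can be sketched as a restriction of a tree to a subset of its leaves, with vertices that are left with a single child being suppressed. The encoding (nested tuples) and the function name are assumptions for illustration. Note that this sketch restricts the topology only; a full implementation would presumably have to carry the original contribution values of the retained vertices along, since contraction changes depths and subtree sizes.

```python
def contract(tree, keep):
    """Restrict a nested-tuple tree to the leaves in `keep`.

    Returns None if no leaf survives; vertices left with one child are suppressed.
    """
    if not isinstance(tree, tuple):
        return tree if tree in keep else None
    kids = [c for c in (contract(child, keep) for child in tree) if c is not None]
    if not kids:
        return None
    if len(kids) == 1:
        return kids[0]  # suppress the vertex that lost one of its children
    return tuple(kids)


tree = (("a", "b"), ("c", "d"))
print(contract(tree, {"a", "c", "d"}))
```

Each vertex is visited once, so the contraction takes linear time in the size of the input tree (set membership tests are assumed to be constant time).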
The partial distance of double-mixed types
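Here, we present algorithms for computing partial distances for the categories D1 and D2, represented by the two double-mixed types $AB|A'B'$ and $AB|B'A'$, respectively.
Double-mixed types $AB|A'B'$.
The type $AB|A'B'$ denotes pairs $(x, y)$, where x belongs to the upper trees and y to the lower trees of both T and $T'$. In this case, the least common ancestor of $(x, y)$ is positioned along the median path in both trees. Furthermore, $\mathrm{lca}_T(x, y)$ is the same as $\mathrm{lca}_T(x, t)$. Similarly, we have $\mathrm{lca}_{T'}(x, y) = \mathrm{lca}_{T'}(x, t')$.
Having this, the partial distance is
$$|B \cap B'| \cdot \sum_{x \in A \cap A'} \big| \sigma(\mathrm{lca}_T(x, t)) - \sigma'(\mathrm{lca}_{T'}(x, t')) \big|^p \qquad (1)$$
which can be computed in O(pn) time.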
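The key observation for this type is that the summand does not depend on y, so the double sum collapses into a single sum multiplied by the number of admissible y. The sketch below checks this factorization on a small example. It is an illustration under assumptions: trees are nested 2-tuples, depth is the contribution function, and the lca depths are obtained with a simple quadratic helper, so the sketch demonstrates the identity rather than the O(pn) running time.

```python
def lca_depths(tree):
    """Depth of the lca for every ordered leaf pair of a nested-tuple tree."""
    depths = {}

    def visit(node, depth):
        if not isinstance(node, tuple):
            depths[(node, node)] = depth
            return [node]
        left, right = visit(node[0], depth + 1), visit(node[1], depth + 1)
        for x in left:
            for y in right:
                depths[(x, y)] = depths[(y, x)] = depth
        return left + right

    visit(tree, 0)
    return depths


def d1_factored(t1, t2, upper_common, lower_common, p):
    """Single sum over x, multiplied by the number of leaves y in both lower trees."""
    if not lower_common:
        return 0
    d1, d2 = lca_depths(t1), lca_depths(t2)
    y0 = next(iter(lower_common))  # any lower leaf gives the same lca with x
    return len(lower_common) * sum(abs(d1[(x, y0)] - d2[(x, y0)]) ** p for x in upper_common)


def d1_direct(t1, t2, upper_common, lower_common, p):
    """Reference value: explicit double sum over all pairs (x, y)."""
    d1, d2 = lca_depths(t1), lca_depths(t2)
    return sum(abs(d1[(x, y)] - d2[(x, y)]) ** p for x in upper_common for y in lower_common)


# Illustrative input: the vertex ("c", "d") plays the role of the median vertex in both trees.
t1 = ((("c", "d"), "a"), "b")
t2 = ((("c", "d"), "b"), "a")
print(d1_factored(t1, t2, {"a", "b"}, {"c", "d"}, 3))
```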
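Double-mixed types $AB|B'A'$.
For $AB|B'A'$, the naïve approach requires $O(pn^2)$ steps. Below, we show an O(pn) time solution. We begin with the following problem.
Problem 1. Given two sequences of numbers: $a_1, \dots, a_m$ and $b_1, \dots, b_k$. Compute: $\sum_{i=1}^{m} \sum_{j=1}^{k} |a_i - b_j|^p$.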
Algorithm 1 Function (partially adapted from [24]).
1: Function , where g is a median vertex of G
2: Set for every vertex v on the median path of G
# Init counters
3: For every leaf l in X:
4: Initialize an empty sequence (a list)
5: For every vertex v on the median path: append to
, repeated
times
6: Return
Lemma 2. from Algorithm 2 computes
in
time and space.
Proof: Correctness: Note that the swap in the 5-th line ensures that . Let
be the number of elements from
that are smaller or equal to
. In particular we have
. The algorithm of
in the main loop computes
as
, i.e., it is the sum of the l-th powers of all elements from
that are smaller or equal to
. In particular,
. Then, for a fixed j,
Algorithm 2 Partial distances: type .
1: Input: T and $T'$ with median vertices t and $t'$, resp.
2: Output:
3: Function :
4:
5: If Then swap
and
6: Let for all
and
7: While :
8: If and
9: Then i = i + 1; For
10: Else j = j + 1; For
11: Return
12: Let and
13: Return
Now, the formula in the 11-th line is derived by applying above the following identities and
.
Time and space complexity: The main loop requires steps, while computing the formula in the 11-th line requires O(pm) steps plus precomputing all values of binomial coefficients
for
, which can be done once in O(p) time. In total, the overall time complexity is
, the same applies to space complexity.
Now, and similarly,
is computed by calling
, where
is the sequence of contributions of
’s for x in
and
is the sequence of contributions of
’s for y in
. Such sequences are inferred in O(n) steps by the function
in Algorithm 1.
Lemma 3. Algorithm 2 computes the partial distance of the double-mixed category D2 in O(pn) time.
Proof: The proof is similar to the proof of Lemma 3 from [24]. The difference is in the more general computation of the sum from Lemma 2. We leave out simple details.
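The idea behind Problem 1 can be sketched independently of the trees: once both sequences are sorted, a single merge pass that maintains sums of powers of the elements seen so far, combined with the binomial expansion of $(y - x)^p$, yields the sum of all $|a_i - b_j|^p$ without enumerating pairs. The code below is a simplified illustration with hypothetical names and is not Algorithm 2 itself. In particular, it sorts its inputs, whereas in the framework the sequences are presumably already ordered because contributions are monotonic along the median path.

```python
from math import comb


def add_powers(sums, x):
    """Add x**l to sums[l] for every l, using one multiplication per exponent."""
    power = 1
    for l in range(len(sums)):
        sums[l] += power
        power *= x


def pairwise_abs_power_sum(a, b, p):
    """Sum of |x - y|**p over all x in a and y in b, for an integer p >= 1."""
    a, b = sorted(a), sorted(b)
    binom = [comb(p, l) for l in range(p + 1)]
    total = [0] * (p + 1)  # total[l] = sum of x**l over all x in a
    for x in a:
        add_powers(total, x)
    below = [0] * (p + 1)  # below[l] = sum of x**l over the x <= current y
    i, result = 0, 0
    for y in b:
        while i < len(a) and a[i] <= y:
            add_powers(below, a[i])
            i += 1
        y_pow = [1] * (p + 1)
        for e in range(1, p + 1):
            y_pow[e] = y_pow[e - 1] * y
        for l in range(p + 1):
            # x <= y contributes (y - x)**p, x > y contributes (x - y)**p
            low = below[l] if l % 2 == 0 else -below[l]
            high = total[l] - below[l]
            if (p - l) % 2 == 1:
                high = -high
            result += binom[l] * y_pow[p - l] * (low + high)
    return result


print(pairwise_abs_power_sum([3, 1, 4], [2, 5], 3))
```

After sorting, the merge pass performs O(p) work per element of either sequence. With integer inputs the arithmetic is exact; for floating-point contributions the alternating signs in the expansion may cause cancellation errors, which this sketch does not address.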
The partial distance of single-mixed types
There are four single-mixed categories represented by the types $AA|A'B'$, $BB|A'B'$, $AB|A'A'$, and $AB|B'B'$ (see Table 1). Each of these variants can be solved similarly. Thus, we present only the algorithm for computing the partial distance for $AA|A'B'$.
Assume that the pair of taxa $(x, y)$ has the type $AA|A'B'$, meaning $x, y \in A$, $x \in A'$, and $y \in B'$. In this case, $\mathrm{lca}_T(x, y)$ is a vertex from the upper part of tree T, while $\mathrm{lca}_{T'}(x, y)$ lies on the median path of $T'$
. To apply our algorithm, we consider the tree to be ordered. For non-leaf vertices, the right child of a vertex v is denoted
, and the left child is denoted
.
Let . Our solution is divided depending on the parity of p. We start with the solution to odd norms.
Single-mixed types under odd Lp norms.
We start with the following definitions.
- and
.
In the next two lemmas, we prove several properties of vertex attributes from Algorithm 3.
Algorithm 3 Partial distances: type for odd p’s.
1: Input: T and $T'$ with median vertices t and $t'$, respectively; p is odd
2: Output:
3: For every in the upper tree of T:
4: ;
# the zero vector
5: For : # The preprocessing loop
6: ;
7: If
8: Then ;
9: If Then
Else
10: For every non-root in the upper tree of T in postfix
order: # The main loop
11: ;
12:
13:
14: Return
Lemma 4. If v is strictly internal, then after the preprocessing loop of Algorithm 3, we have
where .
Proof: In line 8 of Algorithm 3, is defined as the vertex such that
, and this inequality is not satisfied by the parent of
; that is,
. Combining these observations, we can conclude that
consists of vertices
for which
as specified in line 8. Consequently, from the assignment in line 8, we have
.
Lemma 5. If v is strictly internal, then after the main loop of Algorithm 3,
,
- and
.
Proof: We present an inductive proof for a fixed and for
, which denotes the
-th element of the vector
. If v is a leaf, then the equality follows directly from line 9 of Algorithm 3 and the definition of
. Now, let us consider the case where v is strictly internal. Note that
.
The proof for is similar. We omit details.
Lemma 6. If v is strictly internal in T, then after the main loop of Algorithm 3 we have
Proof: We can prove (2) by using Lemmas 4 and 5. Let v be strictly internal in the upper tree of T. Then, for a leaf and a leaf y from
(i.e., from
, see line 12), we have
and
. Let R be the right side of (2). Then,
The last equation is obtained by identities from Lemma 5 and binomial expansions. R equals the right side of the assignment from line 13.
Lemma 7. Algorithm 3 computes the partial distance of this single-mixed type in $O(n \log n + pn)$ time.
Proof: We show the algorithm’s correctness followed by the stated time complexity.
Correctness: Let I be the set of non-root vertices from the upper tree of T. Then, every pair of leaves of type
uniquely determines
such that
and
. It also follows that
. Let
denote such vertex v. For a given v,
is the set of all pairs
such that
. Hence,
By Lemma 6 the above sum equals the value returned in the last line of Algorithm 3.
Time complexity: The key aspect of the time complexity is found in line 8. We show that the vertex required there can be found by a binary search in $O(\log n)$ time that seeks the value in an ordered array composed of vertices on the path connecting a given leaf x with the root of T. An infix traversal of T can construct such an array. Then, a vertex is inserted into the array when visited for the first time. When a vertex is visited for the last time, it is removed from the array. Due to the monotonic ordering of paths, the array is always sorted, and its size is limited by n. The time complexity of the remaining loops is O(pn), which gives $O(n \log n + pn)$ time of Algorithm 3.
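The array-based binary search described in the time complexity argument can be sketched as follows. This is an illustration under assumptions, not line 8 of Algorithm 3: the tree is a nested tuple, depth serves as the path-monotonic contribution, and the query (the vertex closest to the root whose depth reaches a leaf-specific threshold) is a hypothetical stand-in for the condition used in the algorithm. The traversal keeps the vertices of the current root path in an array, pushing a vertex on entry and popping it on exit, so the array is always sorted by depth.

```python
from bisect import bisect_left


def first_vertex_reaching(tree, threshold):
    """For every leaf x, find the vertex closest to the root on the root-to-x
    path whose depth is at least threshold[x]; None if there is no such vertex.
    """
    result = {}
    path_nodes, path_depths = [], []  # current root path, depths strictly increasing

    def visit(node, depth):
        path_nodes.append(node)
        path_depths.append(depth)
        if isinstance(node, tuple):
            for child in node:
                visit(child, depth + 1)
        else:
            k = bisect_left(path_depths, threshold[node])  # O(log n) per leaf
            result[node] = path_nodes[k] if k < len(path_nodes) else None
        path_nodes.pop()
        path_depths.pop()

    visit(tree, 0)
    return result


tree = (("a", "b"), "c")
print(first_vertex_reaching(tree, {"a": 1, "b": 2, "c": 5}))
```

Each leaf triggers one binary search on an array of size at most n, which matches the $n \log n$ term discussed above.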
Single-mixed types under even Lp norms.
Algorithm 4 Partial distances: type for even p.
1: Input: T and $T'$ with median vertices t and $t'$, resp.
2: Output:
3: For every v in the upper tree of T: .
(the zero vector)
4: For : # Preprocessing
5:
6: For every non-root v in the upper tree of T in postfix
order:
7:
8:
9: Return
The advantage of even Lp norms is given by the relation $|a - b|^p = (a - b)^p$ for p even. This fact allows us to circumvent the extra complexity of Algorithm 3. Algorithm 4 outlines the computation of the partial distance of this single-mixed type when p is even.
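Why even norms are easier can be seen on plain sequences: since the absolute value can be dropped, the binomial expansion separates the two arguments, and the sum over all pairs reduces to products of power sums, with no sorting or comparison of values. The sketch below illustrates this identity with hypothetical names; it is a simplified setting and not Algorithm 4, which aggregates such power sums bottom-up over the tree.

```python
from math import comb


def even_power_pair_sum(a, b, p):
    """Sum of (x - y)**p over all x in a and y in b for even p, via power sums."""
    if p % 2 != 0:
        raise ValueError("the identity |x - y|**p == (x - y)**p needs an even p")
    sum_a = [sum(x ** l for x in a) for l in range(p + 1)]  # power sums of a
    sum_b = [sum(y ** l for y in b) for l in range(p + 1)]  # power sums of b
    # (x - y)**p = sum_l C(p, l) * x**l * (-y)**(p - l)
    return sum(comb(p, l) * (-1) ** (p - l) * sum_a[l] * sum_b[p - l] for l in range(p + 1))


print(even_power_pair_sum([3, 1, 4], [2, 5], 2))
```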
Lemma 8. For each non-root vertex v in the upper tree of T after Algorithm 4 we have
,
.
Proof: The relation for follows directly from the bottom-up nature of the algorithm and the relation for
can be observed using the following equation:
Lemma 9. Algorithm 4 computes the partial distance of this single-mixed type in O(pn) time for even p.
Proof: The correctness of the algorithm follows from Lemmas 8 and 7. Finally, the time complexity of Algorithm 4 is O(pn) since each line within the loops requires O(p) time and is executed at most 2n times.
Partial distances of the remaining single-mixed types.
As previously mentioned, Algorithm 3 and Algorithm 4 for computing the partial distances of the single-mixed type $AA|A'B'$ can be adapted to solve the other types as follows. For the type $BB|A'B'$, replace the term “upper” with “lower” and A with B in both algorithms. For the type $AB|A'A'$, swap the input trees and execute both algorithms. For the type $AB|B'B'$, swap the input trees and run the algorithm designed for the type $BB|A'B'$. These modifications are detailed in lines 14 through 16 of Algorithm 5.
Algorithm to compute the cophenetic distance under the Lp norm
The pseudo-code in Algorithm 5 summarizes the complete procedure for computing the cophenetic distance for finite p. The correctness of the algorithm follows from the results presented in the previous section. Below, we analyze the time complexity in two scenarios: when p is constant, and when p is considered as a parameter in the asymptotic analysis.
Algorithm 5 Computing cophenetic distance.
1: Input: T and with the same set of leaves of size n, two
ascending contribution functions and
,
and an integer
2: Output: The Lp-cophenetic distance between T and with
respect to and
3: Compute all binomial coefficients $a_i = \binom{p}{i}$, for all i, by $a_i = a_{i-1} (p - i + 1) / i$ and $a_0 = 1$
4: Function
5: Compute t and the median vertices of T and
,
resp.
6: Let A and be the set of all leaves in the upper
tree of T and , resp.
7: ;
;
8: For X, in : For
in
9: Let and
10: # Compute partial distances
for type
11: see Eq. (1) # Double-mixed
12: by Alg. 2 # Double-mixed
13: # All single-mixed types
14: using Alg. 3 if p is odd, or Alg. 4
otherwise
15: Similarly to line 14 , , but replace
“upper” with “lower” and A
with B in Alg. 3/4
16: For and
swap the input trees and
repeat lines 14 and 15
17: Return s # Return the sum of partial distances
18: Return # Return the cophenetic distance
First, consider the case where p is constant. Since $p = O(1)$, it follows from Lemmas 3 and 7 that computing the partial distances of mixed types requires $O(n \log n)$ time when p is odd, and O(n) time when p is even. Consequently, the overall time complexity of Algorithm 5, as derived in [24], is $O(n \log^2 n)$ for odd p, and $O(n \log n)$ for even p.
If p is a parameter, let $f(n, p)$ denote the worst-case time complexity of the complete algorithm. Then, the computation of partial distances of non-mixed types (see lines 8-10 in Algorithm 5) requires
time where
and
for each i (by Lemma 1), while the computation of mixed types requires
time if p is odd or O(pn) time if p is even, (Lemmas 3 and 7). Therefore, for some
we can write that,
if
,
, if n>5 and p is odd, and
, otherwise.
Theorem 10. The time complexity of the algorithm is $O(n \log^2 n + pn \log n)$ when p is odd and $O(pn \log n)$ when p is even.
Proof: The proof for odd norms. Assume that p is odd. We show that there are constants and d>0 such that for every n>0 and p>0,
. The proof is by induction on n. For
we have
and the inequality is satisfied. For n>5,
.
Let . Then, for n>5,
. Also,
and
. Finally, for n>5,
.
Even norms. If p is even, it suffices to show that , for some constants d>0 and
. The proof is similar and simpler compared to the odd case and follows the same reasoning. We omit details.
The last theorem shows that, despite the higher complexity of odd norms, the factor p only appears in the asymptotically minor term of $O(n \log^2 n + pn \log n)$. This suggests that the computational effort for cophenetic distances is more influenced by the size of the input trees than by the chosen norm level p. Additionally, this theoretical result highlights that the binary search step, which is required for computing single-mixed types under odd norms and represents the main distinction between the two algorithmic variants, introduces $O(n \log^2 n)$ more steps in the worst case than the algorithm for even norms. See the experimental section for a detailed discussion on scalability and a comparison of the algorithms for odd and even norms.
Cophenetic distances under the $L_\infty$ norm.
Algorithm 5 can be easily adapted to solve the last remaining case of cophenetic distances. The algorithm for the $L_\infty$ norm is similar to the case p = 1. The only difference is taking the maximum value instead of adding partial distances as follows. First, fix $p = 1$
. Then in line 7, replace
with
. In lines 10-16 instead of assignments s = s + r, where r is the right-hand expression, write
$s = \max(s, r)$. Now, the time complexity is $O(n \log^2 n)$, which follows from Theorem 10 for the odd case p = 1.
Results
In the following sections, we present the scalability study and distribution analysis results. The near-linear time algorithm, i.e., Algorithm 5, for computing the Lp norm cophenetic distance and the naïve (quadratic time) algorithm were implemented in Rust 1.79.0. The implementation of both algorithms, along with instructions on reproducing the simulation results below, is available on GitHub [25]. Note that the distances computed by the naïve and our algorithm are the same.
Scalability analysis
We investigate the scalability of our algorithm by comparing its runtime to that of the best-known quadratic-time algorithm for computing cophenetic distance.
Datasets. We generated three datasets consisting of pairs of random binary trees with n leaves using the classic Yule model, where for each n we generated q pairs. The first dataset was generated using and q = 1000, the second with
and q = 1000, and the last one with
and q = 100. Algorithms were executed on each pair of trees for various norms Lp with p values ranging from 1 to 100. The comparative runtime analysis for each dataset is illustrated in the diagrams of Figs 2, 3, and 4.
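For readers who wish to generate comparable inputs, a random tree under the Yule model can be sketched as below. This is a generic illustration and not the generator used for the datasets in this study: the function name, the integer leaf labels, and the nested-tuple output format are assumptions. The tree grows by repeatedly splitting a uniformly chosen leaf, after which the leaf labels are assigned uniformly at random.

```python
import random


def yule_tree(n, rng):
    """Random rooted binary tree with leaves labelled 0..n-1, as nested tuples."""
    children = {}  # internal vertex id -> (left id, right id)
    leaves = [0]   # ids of the current leaves; vertex 0 is the root
    next_id = 1
    for _ in range(n - 1):
        k = rng.randrange(len(leaves))  # pick a leaf uniformly at random
        v = leaves[k]
        children[v] = (next_id, next_id + 1)
        leaves[k] = next_id             # v is replaced by its two new children
        leaves.append(next_id + 1)
        next_id += 2
    labels = list(range(n))
    rng.shuffle(labels)
    label_of = dict(zip(leaves, labels))

    def build(v):
        if v in children:
            return tuple(build(c) for c in children[v])
        return label_of[v]

    return build(0)


rng = random.Random(2025)  # fixed seed, chosen arbitrarily for reproducibility
first, second = yule_tree(8, rng), yule_tree(8, rng)
print(first)
print(second)
```

The recursive conversion to nested tuples is adequate for small examples; for very large n an iterative construction would avoid deep recursion.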
Fig 2. The diagrams show the average runtime with standard deviation bands over 1,000 runs for each tree size on pairs of random trees, evaluated for several Lp norms. Crossing points indicate the tree sizes where the near-linear algorithm outperforms the naïve one.
Fig 3. The diagrams show the average runtime with standard deviation bands over 1,000 runs for each tree size on pairs of random trees, evaluated for several Lp norms.
Fig 4. The diagrams show the average runtime with standard deviation bands over 100 runs for each tree size on pairs of random trees, evaluated for several Lp norms.
Results. Our algorithm shows a significant improvement in efficiency compared to the quadratic solution for trees with more than 250 taxa and norms below 20, as illustrated in Fig 2. Furthermore, the crossing point at which our algorithm outperforms the quadratic solution increases gradually as p increases, reaching 366 at p = 100. This finding indicates the potential for developing a hybrid algorithm that utilizes the divide-and-conquer method for larger trees while reverting to the quadratic time algorithm for trees with fewer than 230–370 leaves during recursion.
Despite the differences in asymptotic worst-case time complexity between odd and even norms, these differences are not evident in the average runtime diagrams presented in Figs 3 and 4. Many of these differences can be attributed to the binary search step in Algorithm 3, as explained in Lemma 7, where the runtime is relatively straightforward to estimate. Notably, the computation of odd norms accounted for only 0.01% of the total average runtime in the binary search steps. In practical terms, this indicates that the time complexity of Algorithm 3 is closer to O(pn) than to the more conservative estimate provided in Lemma 7. Consequently, the overall runtime for calculating the cophenetic distance can be approximated by the same, tighter expression for any given norm, whether p is odd or even. Since the trees were generated randomly, we propose that this approximation is also a valid estimator of the average time complexity of our algorithm. However, as shown in Fig 4, there are no significant differences in runtime across various norms for a fixed tree size n. This indicates that the runtime is primarily affected by the overhead associated with maintaining supporting data structures, rather than by the computation of values, which involves loops over the range from 0 to p. We conjecture that for extremely large values of p, the computational cost of these loops will become increasingly noticeable in the runtime; however, such scenarios were not tested in this study.
Cophenetic distance distributions
We investigate the distributions of three key representatives from the generalized class of cophenetic distances. These representatives are (i) the original cophenetic distance metric defined by depths, (ii) the metric defined by the heights of the subtrees, and (iii) the metric defined by the number of taxa in the subtrees. For ease of reference, we refer to these metrics as the depth, subtree-height, and subtree-size cophenetic metrics, respectively.
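The three contribution functions can be made concrete with a small Python sketch that evaluates each of them at the lowest common ancestor of a pair of leaves. The nested-tuple encoding and the names `node_records` and `cophenetic_value` are assumptions of this illustration; the sketch is deliberately simple and makes no attempt at efficiency.

```python
def node_records(tree):
    """Return one record per node of a nested-tuple tree (leaves are strings)."""
    records = []

    def visit(node, depth):
        if isinstance(node, str):
            leaves, height = frozenset([node]), 0
        else:
            parts = [visit(child, depth + 1) for child in node]
            leaves = frozenset().union(*(p[0] for p in parts))
            height = 1 + max(p[1] for p in parts)
        records.append({"leaves": leaves, "depth": depth,
                        "height": height, "size": len(leaves)})
        return leaves, height

    visit(tree, 0)
    return records


def cophenetic_value(tree, a, b, contribution):
    """Value of `contribution` at the lowest common ancestor of leaves a and b."""
    shared = [r for r in node_records(tree)
              if a in r["leaves"] and b in r["leaves"]]
    # In a binary tree the LCA is the smallest subtree containing both leaves.
    lca = min(shared, key=lambda r: r["size"])
    return lca[contribution]
```

For the caterpillar tree `((("a", "b"), "c"), "d")`, the pair (a, b) receives depth 2, subtree height 1, and subtree size 2, whereas the pair (a, d), which is separated by the root, receives depth 0, subtree height 3, and subtree size 4.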
To the best of our knowledge, the specific distributions for any of the selected metrics remain unpublished. Therefore, we present here the sampled distributions of cophenetic distances based on two standard models of phylogenetic tree sampling: the uniform model and the Yule model [28,29]. The sampled distributions for the classical depth-based cophenetic distance were previously discussed in [1] for both the L1 norm and the L2 norm under the uniform model.
Data. We generated a dataset containing 10^6 pairs of trees, each with 100 leaves. Each tree was independently generated under the uniform model. Similarly, we generated another dataset based on the Yule model. It is important to note that the generated trees do not include edge lengths; therefore, both depths and subtree heights are measured in terms of the number of edges.
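One standard way to sample from the uniform model is sequential leaf insertion: each new leaf is attached to a position chosen uniformly among all edges of the current tree, counting a virtual edge above the root. Because a tree with k leaves offers 2k - 1 such positions and every labelled rooted binary tree arises from exactly one insertion sequence, all trees are equally likely. The Python sketch below illustrates this; the function names and tree encoding are assumptions made here and need not match the procedure used to produce the datasets.

```python
import random


def uniform_tree(n, rng):
    """Sample a rooted binary tree on leaves 0..n-1 uniformly at random."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A node is a list: [label] for a leaf, [left, right] for an internal node.
    nodes = [[0]]  # nodes[0] is always the root object
    for label in range(1, n):
        # Choosing a node is the same as choosing the edge above it
        # (the root stands for the virtual root edge).
        target = rng.choice(nodes)
        moved = list(target)      # the old subtree, pushed one level down
        new_leaf = [label]
        target[:] = [moved, new_leaf]
        nodes.extend([moved, new_leaf])

    def freeze(node):
        if len(node) == 1:
            return node[0]
        return (freeze(node[0]), freeze(node[1]))

    return freeze(nodes[0])


def leaf_labels(tree):
    """Collect the leaf labels of a nested-tuple tree."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaf_labels(tree[0]) + leaf_labels(tree[1])
```

For n = 3 the procedure produces each of the three possible trees with probability 1/3, which gives a quick sanity check of the sampler.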
Results. The sampled distributions are illustrated in Figs 5 and 6, with the corresponding mean and standard deviation statistics depicted in Fig 7.
Fig 5. Pairs of trees of size 100 were sampled according to the uniform model (left columns) and the Yule model (right columns). The frequencies are grouped into 200 bins. Note that the y-axis scale and range are the same across all diagrams. To enhance visibility, low frequencies (e.g., at low distances close to 0) are omitted and the width of the diagrams is appropriately adjusted.
Fig 6. (Cont. from Fig 5). Sampled distributions of three cophenetic distances are shown for Lp norms with p = 20, 50, 100, and ∞ (rows). Similar to Fig 5, the frequencies for finite p are grouped into 200 bins. Since all three contribution functions return integers, the L∞ norms are also integers. As a result, the frequency values for L∞ are presented without binning.
The distributions of the subtree-height and depth cophenetic metrics exhibit similar shapes under the uniform model for smaller values of p, as illustrated in Fig 5. A similar trend is observed for the depth under the Yule model. Furthermore, these distributions are positively skewed, which is consistent with the findings for the L1 and L2 norms of the depth cophenetic metric reported in [1]. On the contrary, the subtree-height distributions exhibit positive skewness. It is also worth noting that our study involved a significantly larger number of pairs for sampling. As a result, the diagrams in Fig 5 appear smoother compared to the sampled distributions of the depth metric from [1].
Since cophenetic vectors are finite, it is straightforward to prove that the Lp cophenetic distance converges to the L∞ cophenetic distance as p tends to infinity under any contribution function. Consequently, Lp cophenetic distributions converge to L∞ distributions as p tends to infinity. Furthermore, in our case, all three contribution functions return integer values, which implies that the L∞ distributions have an integer domain, as illustrated in Fig 6. This property is evident in the diagrams, where the distributions become increasingly irregular as p grows and increasingly similar to the corresponding discrete distributions under L∞. This effect is particularly noticeable for the depth and subtree-height metrics. By contrast, the distributions for the subtree-size metric remain relatively smooth. This smoothness arises from the fact that, under the L∞ norm, almost all frequencies under the uniform and Yule models are concentrated at 98 (see the low standard deviation in Fig 7). This value is derived from a pair of leaf labels that form a cherry in one tree but are separated by the root in the other tree, yielding subtree sizes of 2 and 100, respectively, and a corresponding distance of 98.
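The convergence towards the maximum norm follows from the elementary bound max|v| ≤ ||v||_p ≤ m^(1/p) · max|v| for a vector v with m entries, because m^(1/p) tends to 1 as p grows. The short Python sketch below checks this numerically on a made-up difference vector whose largest entry is 98; the vector and the helper name `lp_norm` are illustrative assumptions, not data from the study.

```python
def lp_norm(values, p):
    """Lp norm of a finite vector, scaled by the largest entry to avoid overflow."""
    largest = max(abs(v) for v in values)
    if largest == 0 or p == float("inf"):
        return float(largest)
    scaled = sum((abs(v) / largest) ** p for v in values)
    return largest * scaled ** (1.0 / p)


# Hypothetical entry-wise differences between two cophenetic vectors.
diffs = [98, 97, 97, 50, 3, 0]
for p in (1, 2, 20, 50, 100, float("inf")):
    print(p, lp_norm(diffs, p))
```

The printed values decrease towards 98 as p increases, which mirrors how the sampled Lp distributions approach the discrete L∞ distributions.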
The sampled distributions for the subtree-size cophenetic metric are notably distinct from those analyzed above. The histograms for this metric are negatively skewed, and the mean value under the Yule model is, in this case, larger than the mean value under the uniform model.
In comparing the depth and subtree-height cophenetic metrics, we find that the mean values under the Yule model are significantly lower than those under the uniform model. A similar bias towards the Yule model was previously noted in the sampled distributions for the path-difference distance [30,31].
Additionally, the mean and standard deviation of the depth metric under the L1 and L2 norms align with the exact values reported in [32].
Results and discussion
We introduced a novel algorithmic framework for computing the Lp norm cophenetic distance in near-linear time, with separate complexity bounds for odd values of p, for even values of p, and for the L∞ norm. This represents a substantial advancement over the previously best-known naïve algorithm, which requires quadratic time.
Additionally, our scalability studies suggest that the estimated runtime of our algorithm follows the same near-linear estimate under all Lp norms with finite p, in contrast to the larger asymptotic upper bound derived for odd values of p. These advancements greatly improve the usability of the cophenetic distance for large-scale phylogenetic studies and the median-tree inference of species trees from gene trees using this metric.
Distribution analyses of these three key representative metrics from the cophenetic class further enhance this work, offering practitioners valuable guidance in selecting appropriate metrics for their specific needs.
The framework demonstrates broad practical applicability by generalizing to all metrics that rely on the path-monotonic property, here referred to as the class of generalized cophenetic distances. This generalization can be achieved by either designing contribution functions that satisfy basic monotonicity properties or by forming linear, positively weighted combinations of existing contribution functions.
As a result, the class of generalized cophenetic distances includes the original cophenetic distance based on depth, as well as other metrics, particularly the subtree height and subtree size cophenetic distances. Additionally, this class incorporates the more recently proposed metric from [19], which combines both weighted and unweighted contribution functions. Consequently, this metric can be computed in near-linear time using our algorithm under any Lp norm.
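A positively weighted combination of contribution functions can be sketched as follows. The example mirrors the idea behind the metric from [19], which blends an unweighted depth (number of edges from the root) with a weighted depth (total branch length from the root) through a parameter λ. The helper `combine`, the dictionary keys, and the requirement of strictly positive weights are assumptions of this sketch rather than definitions from the framework.

```python
def combine(functions, weights):
    """Positively weighted linear combination of contribution functions."""
    if len(functions) != len(weights) or any(w <= 0 for w in weights):
        raise ValueError("need one strictly positive weight per function")
    return lambda node: sum(w * f(node) for f, w in zip(functions, weights))


def edge_depth(node):
    # Hypothetical per-node statistic: number of edges from the root.
    return node["edges_from_root"]


def length_depth(node):
    # Hypothetical per-node statistic: total branch length from the root.
    return node["length_from_root"]


lam = 0.25  # blending parameter, strictly between 0 and 1
mixed = combine([edge_depth, length_depth], [1 - lam, lam])

# Value of the combined contribution at one (made-up) ancestor node.
print(mixed({"edges_from_root": 2, "length_from_root": 0.6}))
```

Both ingredients in this example grow along every root-to-leaf path, so their positive combination does as well; combining functions that vary in opposite directions along a path would not be expected to preserve monotonicity.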
Furthermore, the framework can be applied to mixed scenarios where different types of cophenetic vectors are used for the trees — for example, depth in one tree and height in another. Although these mixed scenarios may not fully satisfy metric properties, they can still be useful for comparing trees from different origins by allowing asymmetry. For instance, this approach may be more suitable when comparing a gene tree, which represents the evolutionary history of a gene, to a species tree that illustrates the evolutionary history of the species from which the genes were sampled.
References
- 1. Cardona G, Mir A, Rosselló F, Rotger L, Sánchez D. Cophenetic metrics for phylogenetic trees, after Sokal and Rohlf. BMC Bioinformatics. 2013;14:3. https://doi.org/10.1186/1471-2105-14-3 pmid:23323711
- 2. Sokal RR, Rohlf FJ. The comparison of dendrograms by objective methods. Taxon. 1962;11(2):33–40.
- 3. Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53(1–2):131–47.
- 4. Sand A, Holt MK, Johansen J, Fagerberg R, Brodal GS, Pedersen CNS, et al. Algorithms for computing the triplet and quartet distances for binary and general trees. Biology (Basel). 2013;2(4):1189–209. https://doi.org/10.3390/biology2041189 pmid:24833220
- 5. Li M, Tromp J, Zhang L. On the nearest neighbour interchange distance between evolutionary trees. J Theor Biol. 1996;182(4):463–7. https://doi.org/10.1006/jtbi.1996.0188 pmid:8944893
- 6. Knüver L, Fischer M, Hellmuth M, Wicke K. The weighted total cophenetic index: a novel balance index for phylogenetic networks. Discrete Appl Math. 2024;359:89–142.
- 7. Fischer M, Herbst L, Kersting SJ, Kühn AL, Wicke K. Total cophenetic index. Springer. 2023. p. 81–7.
- 8. Elliott TL, Davies TJ. Jointly modeling niche width and phylogenetic distance to explain species co‐occurrence. Ecosphere. 2017;8(8).
- 9. Zogopoulos VL, Saxami G, Malatras A, Papadopoulos K, Tsotra I, Iconomidou VA, et al. Approaches in gene coexpression analysis in eukaryotes. Biology (Basel). 2022;11(7):1019. https://doi.org/10.3390/biology11071019 pmid:36101400
- 10. Kuramae EE, Robert V, Echavarri-Erasun C, Boekhout T. Cophenetic correlation analysis as a strategy to select phylogenetically informative proteins: an example from the fungal kingdom. BMC Evol Biol. 2007;7:134. https://doi.org/10.1186/1471-2148-7-134 pmid:17688684
- 11. Haji D, Vailionis J, Stukel M, Gordon E, Lemmon EM, Lemmon AR, et al. Lack of host phylogenetic structure in the gut bacterial communities of New Zealand cicadas and their interspecific hybrids. Sci Rep. 2022;12(1):20559. https://doi.org/10.1038/s41598-022-24723-3 pmid:36446872
- 12. Yang I, Woltemate S, Piazuelo MB, Bravo LE, Yepez MC, Romero-Gallo J, et al. Different gastric microbiota compositions in two human populations with high and low gastric cancer risk in Colombia. Sci Rep. 2016;6:18594. https://doi.org/10.1038/srep18594 pmid:26729566
- 13. Kusejko K, Kadelka C, Marzel A, Battegay M, Bernasconi E, Calmy A, et al. Inferring the age difference in HIV transmission pairs by applying phylogenetic methods on the HIV transmission network of the Swiss HIV Cohort Study. Virus Evol. 2018;4(2):vey024. https://doi.org/10.1093/ve/vey024 pmid:30250751
- 14. Shaw LP, Wang AD, Dylus D, Meier M, Pogacnik G, Dessimoz C, et al. The phylogenetic range of bacterial and viral pathogens of vertebrates. Mol Ecol. 2020;29(17):3361–79. https://doi.org/10.1111/mec.15463 pmid:32390272
- 15. Weisbecker V, Beck RMD, Guillerme T, Harrington AR, Lange-Hodgson L, Lee MSY, et al. Multiple modes of inference reveal less phylogenetic signal in marsupial basicranial shape compared with the rest of the cranium. Philos Trans R Soc Lond B Biol Sci. 2023;378(1880):20220085. https://doi.org/10.1098/rstb.2022.0085 pmid:37183893
- 16. Rong Z, Cai J, Qiu J, Xu P, Garmire LX, Lian Q, et al. L2 normalization and geodesic distance for enhanced information preservation in visualizing high-dimensional single-cell sequencing data. In: ACM-BCB 24. New York, NY, USA: ACM; 2024.
- 17. Wooldridge JM. Applications of generalized method of moments estimation. J Econ Perspect. 2001;15(4):87–100.
- 18. Lange M, Zühlke D, Holz O, Villmann T, Mittweida SG. Applications of norms and their smooth approximations for gradient based learning vector quantization. In: ESANN; 2014. p. 271–6.
- 19. Kendall M, Colijn C. Mapping phylogenetic trees to reveal distinct patterns of evolution. Mol Biol Evol. 2016;33(10):2735–43. https://doi.org/10.1093/molbev/msw124 pmid:27343287
- 20. Markin A, Eulenstein O. Cophenetic median trees under the Manhattan distance. In: ACM-BCB 17; 2017. p. 194–202.
- 21. Markin A, Eulenstein O. Cophenetic median trees. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(5):1459–70. https://doi.org/10.1109/TCBB.2018.2870173 pmid:30222583
- 22. Sánchez-Charles D, Muntés-Mulero V, Carmona J, Solé M. Process model comparison based on cophenetic distance. Springer; 2016. p. 141–58.
- 23. Munch E, Stefanou A. The l∞-cophenetic metric for phylogenetic trees as an interleaving distance. Springer; 2019. p. 109–27.
- 24. Górecki P, Markin A, Eulenstein O. Cophenetic distances: a near-linear time algorithmic framework. In: COCOON 2018. 2018. p. 168–79.
- 25. Vijendran S. Cophenetic distance in near-linear time. 2025. https://github.com/sriram98v/near-linear-cophenetic-distance
- 26. Wang B-F, Li C-Y. Fast algorithms for computing path-difference distances. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(2):569–82. https://doi.org/10.1109/TCBB.2018.2790957 pmid:29993953
- 27. Bryant D, Scornavacca C. An O(n log n) time algorithm for computing the path-length distance between trees. Algorithmica. 2019;81(9):3692–706.
- 28. Yule GU. A mathematical theory of evolution, based on the conclusions of Dr. JC Willis, FRS. Philos Trans R Soc Lond Ser B. 1925;213:21–87.
- 29. Harding EF. The probabilities of rooted tree-shapes generated by random bifurcation. Adv Appl Probab. 1971;3(1):44–77.
- 30. Markin A, Eulenstein O. Computing Manhattan path-difference median trees: a practical local search approach. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(4):1063–76. https://doi.org/10.1109/TCBB.2017.2718507 pmid:28650824
- 31. Markin A, Eulenstein O. Efficient local search for Euclidean path-difference median trees. IEEE/ACM Trans Comput Biol Bioinform. 2019;16(4):1374–85. https://doi.org/10.1109/TCBB.2017.2763137 pmid:29035224
- 32. Cardona G, Mir A, Rosselló F, Rotger L. The expected value of the squared cophenetic metric under the Yule and the uniform models. Math Biosci. 2018;295:73–85. https://doi.org/10.1016/j.mbs.2017.11.007 pmid:29155134