TreeCluster: Clustering biological sequences using phylogenetic trees

Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer scientist literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.

B (u l ) + w l + B (u r ) + w r ≥ B(u l ) + w l + B(u r ) + w r > α where the first inequality follows from the inductive hypothesis and the final inequality 763 shows that we will have to cut a branch in any alternative setting. Finally, we need to 764 show that an alternative solution with A (u) = A(u) but B (u) < B(u) is not possible. 765 The inequality requires that either B (u l ) < B(u l ) or B (u r ) < B(u r ). First, consider 766 the B (u l ) < B(u l ) case, which is possible only if A (u l ) = A(u l ) + 1. Note that 767 A (u) = A(u) requires A (u r ) = A(u r ) (and thus B (u r ) = B(u r )) and that 768 B (u l ) + w l + B(u r ) + w r < α, which is possible. Under this condition, we find: 769 B (u) = max(B (u l ) + w l , B(u r ) + w r ) ≥ B(u r ) + w r = B(u) If instead B (u r ) < B(u r ), similar conditions can be written, resulting in 770 B (u) = max(B(u l ) + w l , B (u r ) + w r ) ≥ B(u l ) + w l ≥ B(u r ) + w r = B(u) (5) Thus, A(u) and B(u) are optimal when B(u l ) + w l + B(u r ) + w r > α.    Proof. The proof uses induction. The base case for the induction is the simple rooted 788 tree with root u and two leaves u l and u r . If w l + w r > α, the algorithm cuts the longer 789 branch, whereas if w l + w r ≤ α, no branch is cut. In both cases, the theorem holds.

790
The inductive hypothesis is that, for a node u, the algorithm has computed A(u l ),

791
A(u r ), B(u l ), and B(u r ) optimally. We need to prove that a solution other than the 792 one computed by our algorithm i) cannot have a lower number of clusters, call it A (u), 793 and ii) when A (u) = A(u), cannot have a lower distance to the farthest connected leaf, 794 call it B (u).

795
When B(u l ) + w l + B(u r ) + w r ≤ α, we have A(u) = A(u l ) + A(u r ) − 1, which is the 796 minimum possible by the inductive hypothesis along with the fact that the number of 797 clusters cannot decrease by more than one on node u. Also, B(u) is optimal by 798 construction.

799
When B(u l ) + w l + B(u r ) + w r > α, without loss of generality, assume that B(u l ) + w l ≥ B(u r ) + w r , and thus, the algorithm cuts the (u, u l ) branch, resulting in A(u) = A(u l ) + A(u r ) and B(u) = B(u r ) + w r . Note that A (u) < A(u) is only possible if A (u l ) = A(u l ) and A (u r ) = A(u r ) and we do not cut any branch at u in the alternative clustering. However, this scenario is not possible because where the first inequality follows from the inductive hypothesis and the final inequality 800 shows that we will have to cut a branch in any alternative setting. Finally, we need to 801 show that an alternative solution with If, instead, B (u r ) < B(u r ), similar conditions can be written, resulting in August 8, 2019 22/28 Thus, A(u) and B(u) are optimal when B(u l ) + w l + B(u r ) + w r > α. (o, o r ) (w.l.o.g). Using this mapping, the optimal clustering (i.e., the optimal cut-set) on 815 T can be translated to an alternative Max-diameter min-cut partitioning on T o .

816
However, by Theorem 1, A(o) is optimal and cannot be improved by any alternative 817 partitioning. Since any admissible clustering on T o is also admissible on T , Algorithm 1 818 minimizes q.   840 and b, for each node j, we use the term support of j, denoted by s(j), to refer to the 841 unique node on all the three paths a b, a j, and b j. We refer to a group of 842 leaves that share a mutual support with respect to a and b as a bubble (e.g. triangles in 843 Fig SA). Among all bubbles branching out of a b, let the one with the closest 844 support to a be A . We name the leaf closest to a on A as a (Fig SA). 845 We start with the observation that if d(a, b) ≤ α holds, the algorithm will never cut 846 any edge on a b. For every internal node u on a b, let v and w be the adjacent 847 nodes on a u and u b, respectively. Also, let p a be the closest leaf to u whose 848 support s(p a ) is on a u, and let p b be the closest leaf to u whose support s(p b ) is on 849 u b. Now, note that d(p a , u) + d(u, p b ) ≤ d(a, u) + d(u, b) ≤ α holds, so regardless of 850 the rooting, (v, u) and (u, w) are never cut by Algorithm 2.

851
If a chain H exists, due to the previous observation, there are no cuts on c i c i+1 852 for every 0 ≤ i ≤ m. Consequently, a and b are connected through a path and are thus 853 in the same cluster.

854
Assume Algorithm 2 places a and b on the same cluster, i.e., it does not cut any edge 855 on a b. We present a procedure to generate a chain H as described in Definition 5. 856 But we first need some definitions. We define p 0 = a and p m = b. For 1 ≤ i ≤ m , let p i 857 denote the closest leaf to p i−1 whose support s(p i ) is on p i−1 b and s(p i ) = s(p i−1 ); 858 i.e., p i is in the bubble to the right of the bubble of p i−1 . Conversely, for 1 ≤ i ≤ m , let 859 π i denote the closest leaf to p i whose support is on a s(p i−1 ); i.e., is in a bubble to 860 the left of p i . Also define π 1 = a. We can also show that every π i ∈ {p 0 . . . p i−1 }. If a π i 861 is not equal to one of {p 0 . . . p i−1 }, then, s(π i ) has to be on s(p j−1 ) s(p j ) for some j. 862 However, we would have d(p j−1 , π i ) ≤ d(p j−1 , p j ), which contradicts the definition of p i . 863 Now we construct the chain. The fact that Algorithm 2 retains (a, s(a )) indicates 864 that min(d(a, a ), d(a, p 1 )) = d(a, p 1 ) ≤ α; therefore, we add a → p 1 to an auxiliary 865 graph H . Now, consider Algorithm 2 when it processes the node s(p i−1 ) for 1 < i. The 866 fact that the first edge on path s(p i−1 ) s(p i ) (shown in red in Fig SA) is not cut Depending on which is true, we 868 add a link from π i−1 → p i or p i−1 → p i to H . We repeat this process for all i until we 869 reach i = m , where we add an edge to p m = b.

874
All optimal solutions 875 Lemma A. Let {e 1 , e 2 , · · · , e m } be the set of edges in an unrooted tree T . Consider 876 the following algorithm: root T at e j and run Algorithm 1, and let S j denote the set of 877 edges cut by the algorithm in this run. Any optimal clustering for T has to draw its 878 cut-set from Σ = ∪ m j=1 S j .

879
Proof. The proof is by contradiction. Assume there is an optimal cut-set S that 880 contains an edge e i such that e i / ∈ Σ. Consider the rooting of T at e i . Denote the root 881 of this tree as v, the immediate left and right branches of v as e l and e r , and the left 882 and right child nodes of v as v l and v r . Note that the concatenation of e l and e r 883 corresponds to e i in T ; thus, e l / ∈ S j and e r / ∈ S j . When e i is removed from T , two new 884 trees form, called T l (the one containing the node v l ) and T r (the one containing the 885 node v r ). If p cuts in S are in T r , and if q cuts in S are in T l , then |S | = p + q + 1.

886
The number of cuts in S and S j are equal, and e l and e r are not cut, which implies 887 that either the tree rooted by v l or v r has an alternative clustering with one less cut. By 888 the design of Algorithm 1, if this was the case, the algorithm would have chosen the 889 alternative cut. Before performing maximum likelihood ancestral state reconstruction, we inferred 896 GTR parameters from the input tree using RAxML v8.2.12:

897
" h i v f r e q i 4 " : 0 ,

947
" h i v f r e q a 3 " : 0 ,

948
" h i v f r e q a 4 " : 0 ,

949
" h i v f r e q d " : 0 ,