Minimum variance rooting of phylogenetic trees and implications for species tree reconstruction

Phylogenetic trees inferred using commonly-used models of sequence evolution are unrooted, but the root position matters both for interpretation and downstream applications. This issue has been long recognized; however, whether the potential for discordance between the species tree and gene trees impacts methods of rooting a phylogenetic tree has not been extensively studied. In this paper, we introduce a new method of rooting a tree based on its branch length distribution; our method, which minimizes the variance of root to tip distances, is inspired by the traditional midpoint rerooting and is justified when deviations from the strict molecular clock are random. Like midpoint rerooting, the method can be implemented in a linear time algorithm. In extensive simulations that consider discordance between gene trees and the species tree, we show that the new method is more accurate than midpoint rerooting, but its relative accuracy compared to using outgroups to root gene trees depends on the size of the dataset and levels of deviations from the strict clock. We show high levels of error for all methods of rooting estimated gene trees due to factors that include effects of gene tree discordance, deviations from the clock, and gene tree estimation error. Our simulations, however, did not reveal significant differences between two equivalent methods for species tree estimation that use rooted and unrooted input, namely, STAR and NJst. Nevertheless, our results point to limitations of existing scalable rooting methods.

Proposition 6. Let $p$ be a point on an edge $(u, v)$ of tree $T$ with distance $d(p, u) = x$. If we let $p$ vary along edge $(u, v)$ and consider $\mathrm{var}(p)$ as a function of the variable $x$ with parameters $u$ and $v$, then Eq. S1 holds, in which

$$\alpha = \frac{2\,ST(u) - 4\,\bigl(SI(v) + |v|\,e_v\bigr)}{n} \quad \text{and} \quad \beta = 1 - \frac{2|v|}{n} \tag{S2}$$
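For reference, a direct expansion from the definitions suggests one way the quadratic in Proposition 6 can be written. This is a sketch under stated assumptions, not necessarily the authors' presentation: $\mathrm{var}(p)$ is taken to be the population variance of the leaf distances (dividing by $n$), $u$ is the parent of $v$, $|v|$ is the number of leaves below $v$, and $SI(v)$ is the sum of the distances from $v$ to the leaves below it.

```latex
\begin{align*}
\frac{ST(p)}{n} &= \frac{ST(u)}{n} + \beta x, \\
\frac{1}{n}\sum_{i \in L} d_i(p)^2 &= \frac{1}{n}\sum_{i \in L} d_i(u)^2 + \alpha x + x^2, \\
\mathrm{var}(p) &= \frac{1}{n}\sum_{i \in L} d_i(p)^2 - \left(\frac{ST(p)}{n}\right)^2 \\
 &= (1 - \beta^2)\,x^2 + \left(\alpha - \frac{2\beta\,ST(u)}{n}\right) x + \mathrm{var}(u).
\end{align*}
```

Under these assumptions the coefficient of $x^2$ is strictly positive whenever $1 \le |v| \le n - 1$, so the variance restricted to a single edge has a unique minimizer.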

Extra notations
For two points $p$ and $p'$, potentially on different edges, we let $\mathrm{path}(p, p')$ denote the directed path from $p$ to $p'$. For two nodes $p$ and $u$, we define $\mathrm{Cld}_p(u)$ as the clade under $u$ if the tree $T$ is rerooted at $p$. For ease of notation, we use $|p\,u|$ to denote the size of $\mathrm{Cld}_p(u)$. For a point $p$ on tree $T$ and another point $p'$ on either the same edge or an edge connected to $p$ (if $p$ is a node), we let $\overrightarrow{pp'}$ denote a direction of $p$. It is easy to see that any point on a tree has at least two directions, and any node that is not the root has at least three directions. We call $\overrightarrow{pp'}$ a dominant direction of $p$ if and only if the condition in Eq. S3 holds.

Proof of ST relation. Recall that $ST(v)$ is the sum of the distances of all leaves from the node $v$ (i.e. $ST(p) = \sum_{i \in \mathrm{Cld}(p)} d_i(p)$). We need to prove Eq. S4. Letting $p \equiv v$ in the expression for $ST(p)$, we get Eq. S4.
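The steps of this argument can be sketched as follows. The sketch assumes that $u$ is the parent of $v$, that $|v|$ is the number of leaves below $v$, that $e_v$ is the length of edge $(u, v)$, and that $p$ lies on that edge at distance $x$ from $u$. Whether the final line coincides exactly with the relation proved above is an assumption.

```latex
\begin{align*}
ST(p) &= \sum_{i \in \mathrm{Cld}(v)} \bigl(d_i(u) - x\bigr)
       + \sum_{i \notin \mathrm{Cld}(v)} \bigl(d_i(u) + x\bigr) \\
      &= ST(u) + (n - 2|v|)\,x, \\
ST(v) &= ST(u) + (n - 2|v|)\,e_v \qquad (p \equiv v,\ x = e_v).
\end{align*}
```

Moving from $u$ towards $v$ shortens the distance to each of the $|v|$ leaves below $v$ by $x$ and lengthens the distance to each of the other $n - |v|$ leaves by $x$, which accounts for the factor $n - 2|v|$.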
Proof of Proposition 6. Recall that $ST(p) = \sum_{i \in L} d_i(p)$.
The first term of the RHS of Eq. S6 can be expanded into Eq. S7, where the last line follows directly from the definition. Recall that $\beta = 1 - \frac{2|v|}{n}$; the second term can be expanded into Eq. S8. Substituting Eq. S7 and Eq. S8 into Eq. S6, we obtain Eq. S1.
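As a numerical sanity check, the following minimal sketch compares the variance computed directly from leaf distances with the quadratic $(1 - \beta^2)x^2 + (\alpha - 2\beta\,ST(u)/n)\,x + \mathrm{var}(u)$ that a direct expansion yields. The toy tree, the function names, and the reading of $SI(v)$ as the sum of distances from $v$ to the leaves below it are assumptions made for illustration, and the variance is taken to be the population variance.

```python
from statistics import pvariance

# Hypothetical toy tree: child -> (parent, edge length). "r" is the root.
PARENT = {
    "a": ("r", 1.0), "b": ("r", 2.0),
    "l1": ("a", 1.0), "l2": ("a", 3.0),
    "l3": ("b", 0.5), "l4": ("b", 0.5),
}
LEAVES = ["l1", "l2", "l3", "l4"]


def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]][0])
    return chain


def depth(node):
    """Sum of edge lengths from the root down to node."""
    return sum(PARENT[a][1] for a in ancestors(node) if a in PARENT)


def dist(x, y):
    """Path length between two nodes via their lowest common ancestor."""
    anc_x = set(ancestors(x))
    lca = next(a for a in ancestors(y) if a in anc_x)
    return depth(x) + depth(y) - 2 * depth(lca)


def var_direct(u, v, x):
    """Variance of leaf distances from the point at distance x from u on edge (u, v)."""
    below = {leaf for leaf in LEAVES if v in ancestors(leaf)}
    dists = [dist(u, leaf) + (-x if leaf in below else x) for leaf in LEAVES]
    return pvariance(dists)


def var_quadratic(u, v, x):
    """Same quantity from alpha and beta (Eq. S2) and the assumed quadratic form."""
    n = len(LEAVES)
    below = [leaf for leaf in LEAVES if v in ancestors(leaf)]
    st_u = sum(dist(u, leaf) for leaf in LEAVES)  # ST(u)
    si_v = sum(dist(v, leaf) for leaf in below)   # SI(v), assumed meaning
    e_v = PARENT[v][1]
    alpha = (2 * st_u - 4 * (si_v + len(below) * e_v)) / n
    beta = 1 - 2 * len(below) / n
    var_u = pvariance([dist(u, leaf) for leaf in LEAVES])
    return (1 - beta**2) * x**2 + (alpha - 2 * beta * st_u / n) * x + var_u


# Compare both computations at several positions along every edge.
for child, (parent, length) in PARENT.items():
    for frac in (0.0, 0.3, 1.0):
        x = frac * length
        assert abs(var_direct(parent, child, x) - var_quadratic(parent, child, x)) < 1e-9
```

Both functions recompute distances from scratch for clarity, so this check is quadratic in the number of leaves; it is meant only to confirm the algebra, not to illustrate the linear-time algorithm.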

Useful Lemmas
Below are useful lemmas that will be used later in the proofs.

Lemma 1. Any point on a tree either is a balance point or has at least one dominant direction.
Lemma 2. If a point $p_0$ is not a local MV of tree $T$, there exists at least one point $p'$ on $T$ such that $\mathrm{var}(p') < \mathrm{var}(p_0)$.

We start with some definitions and derivations that are used in the proofs of both Proposition 1 and Lemma 2. Consider a point $p_0$ on tree $T$ and an arbitrary point $p$ on the same edge as $p_0$, or on an edge adjacent to $p_0$ if $p_0$ is a node. Note that when $p_0$ is in the middle of an edge, $p$ can be a point above or below it on the same edge, but when $p_0$ is a node, $p$ can be a point on any of the three (or more) edges adjacent to $p_0$. We divide the leaf set $L$ of $T$ into two disjoint groups: the leaves inside $\mathrm{Cld}_{p_0}(p)$ (group 1), and the remaining leaves (group 2). Let $x = d(p_0, p)$, let $n$ be the size of $T$, and let $k$ be the size of group 1; the size of group 2 is therefore $n - k$. Let $d_1, d_2, \ldots, d_k$ be the distances of the leaves in group 1 to $p_0$, and $d_{k+1}, d_{k+2}, \ldots, d_n$ be the distances of the leaves in group 2 to $p_0$. Similarly, let $d'_1, d'_2, \ldots, d'_k$ be the distances of the leaves in group 1 to $p$, and $d'_{k+1}, d'_{k+2}, \ldots, d'_n$ be the distances of the leaves in group 2 to $p$. Also let $\mu$ and $\mu'$ be the averages of the leaf distances to $p_0$ and $p$, respectively. The resulting derivations lead to Eq. S15.

Proof of Proposition 1. We consider both directions.

a. Suppose $p_0$ is a local MV of $T$. Then, by Eq. S15, $p_0$ is also a balance point, which completes one direction of Proposition 1.

b. Suppose $p_0$ is a balance point of $T$; then it follows that $p_0$ is a local MV. This completes the proof of Proposition 1.
Proof of Lemma 2. Suppose $p_0$ is not a local MV. By Lemma 1, there is a point $p_1$ on the same edge as $p_0$ or on an adjacent edge such that $\overrightarrow{p_0 p_1}$ is a dominant direction of $p_0$. Letting $y = d(p_0, p_1)$ and replacing $p$ with $p_1$ in Eq. S15, we get an inequality that follows from the fact that $\overrightarrow{p_0 p_1}$ is a dominant direction (see Eq. S3). Because the derivative at $p_0$ approaching from $p_1$ is negative, there exists a point $p'$ in a small local neighborhood of $p_0$ towards $p_1$ such that $\mathrm{var}(p') < \mathrm{var}(p_0)$.
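The link between minimum variance and balance can be illustrated numerically. The sketch below scans every edge of a hypothetical toy tree, minimizes the per-edge quadratic in closed form, and reports the mean leaf distance on each side of the best point. It rests on assumptions that are not taken from the text: the variance is the population variance, "balance" is read as equal mean leaf distance on both sides of the point, and the exhaustive scan stands in for the linear-time algorithm, which is not reproduced here.

```python
from statistics import mean, pvariance

# Hypothetical toy tree: child -> (parent, edge length). "r" is the root.
PARENT = {
    "a": ("r", 1.0), "b": ("r", 2.0),
    "l1": ("a", 1.0), "l2": ("a", 3.0),
    "l3": ("b", 0.5), "l4": ("b", 0.5),
}
LEAVES = ["l1", "l2", "l3", "l4"]


def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]][0])
    return chain


def depth(node):
    """Sum of edge lengths from the root down to node."""
    return sum(PARENT[a][1] for a in ancestors(node) if a in PARENT)


def dist(x, y):
    """Path length between two nodes via their lowest common ancestor."""
    anc_x = set(ancestors(x))
    lca = next(a for a in ancestors(y) if a in anc_x)
    return depth(x) + depth(y) - 2 * depth(lca)


def best_point_on_edge(u, v):
    """Minimize the variance along edge (u, v); return (variance, x, side means)."""
    n = len(LEAVES)
    below = [leaf for leaf in LEAVES if v in ancestors(leaf)]
    k = len(below)
    d_u = {leaf: dist(u, leaf) for leaf in LEAVES}
    mu = mean(list(d_u.values()))
    a = 4 * k * (n - k) / n**2                               # curvature, always > 0
    b = 4 * (k * mu - sum(d_u[leaf] for leaf in below)) / n  # slope at x = 0
    x = min(max(-b / (2 * a), 0.0), PARENT[v][1])            # clamp the vertex to the edge
    d_p = {leaf: d_u[leaf] + (-x if leaf in below else x) for leaf in LEAVES}
    side_means = (
        mean([d_p[leaf] for leaf in below]),
        mean([d_p[leaf] for leaf in LEAVES if leaf not in below]),
    )
    return pvariance(list(d_p.values())), x, side_means


def min_var_root():
    """Return the best point over all edges as a dict."""
    candidates = []
    for v, (u, _) in PARENT.items():
        var, x, side_means = best_point_on_edge(u, v)
        candidates.append({"edge": (u, v), "x": x, "var": var, "side_means": side_means})
    return min(candidates, key=lambda c: c["var"])


best = min_var_root()
# For this toy tree the best point is on edge ("r", "a") at x = 0.25 with variance 0.5,
# and the mean leaf distance is 2.75 on both sides.
print(best)
```

A negative slope `b` at `x = 0` means the leaves below `v` are, on average, farther from `u` than the overall mean, so moving towards `v` lowers the variance. This matches the role a dominant direction plays in the proof, although the exact definition used by the authors is not reproduced here.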

Proofs of Propositions 2-5 and Lemma 3
Proof of Lemma 3. For the edge $(u, v)$ (where $u = p(v)$), let ..., and similarly, .... We have: ...

Proof of Proposition 2. Consider a tree $T$ rooted at $r_T$. If $r_T$ is a local MV, then the proof is complete. If $r_T$ is not a local MV, by Lemma 1 and Lemma 3, there exists an ...

Proof of Proposition 4. On tree $T$, let $p$ be the global MV and $x = d(p, r)$, let $w$ denote the child of $r$ that is on the same side as $p$, and let $d_i$ be shorthand for $d_i(r)$ (i.e. the distance from $r$ to leaf $i$ of tree $T$). We prove that $x \le (1 - \ldots)\,e_w$, and therefore $p \in e(r_0, w)$. Note that $T_0$ and $T$ have the same topology but differ in branch lengths. In this proof, we use $e_v$ to denote the length of the edge $(p(v), v)$ of $T_0$.
Following the lemma condition, ... By Propositions 1 and 3, $p$ is a balance point. Therefore, ... From Eq. S20 and Eq. S21, we have ... Recall that under our model, $T_0$ is an ultrametric tree, so that for each leaf $i$, $\sum_{v \in \mathrm{path}(i, r)} e_v = h$. Also, $T$ was obtained by multiplying each edge of $T_0$ by a random variable with support ... Hence, there exists a child $w$ of $r$ such that the global MV belongs to edge $(r, w)$.
Proof of Proposition 5. Let $D_i$ be the random variable corresponding to the distribution of $d_i(r)$ and $P$ be a random variable giving the position of the global MV root. Then, ... By the global balance property of $P$, we can compute ... and thus, ...

Random number generator seed: 9644.

Root-to-crown ratio and divergence from the strict clock are denoted by R/C and α, respectively. These parameters change for each model condition and are available in Table B.

The tests were performed on the subset of D1 where an outgroup exists. For true gene trees, the true root is known. For estimated gene trees, the Ideal is the rooting position that minimizes triplet error to the true gene trees. p-values are shown for the significance of differences between the error of the two methods specified in each row, and for the differences in error among the three levels of the clock divergence parameter, respectively.

Species trees are estimated on estimated gene trees. RF distance is shown for NJst and STAR with all three methods of rooting.