Network science inspires novel tree shape statistics

The shape of phylogenetic trees can be used to gain evolutionary insights. A tree’s shape specifies the connectivity of a tree, while its branch lengths reflect either the time or genetic distance between branching events; well-known measures of tree shape include the Colless and Sackin imbalance, which describe the asymmetry of a tree. In other contexts, network science has become an important paradigm for describing structural features of networks and using them to understand complex systems, ranging from protein interactions to social systems. Network science is thus a potential source of many novel ways to characterize tree shape, as trees are also networks. Here, we tailor tools from network science, including diameter, average path length, and betweenness, closeness, and eigenvector centrality, to summarize phylogenetic tree shapes. We thereby propose tree shape summaries that are complementary to both asymmetry and the frequencies of small configurations. These new statistics can be computed in linear time and scale well to describe the shapes of large trees. We apply these statistics, alongside some conventional tree statistics, to phylogenetic trees from three very different viruses (HIV, dengue fever and measles), from the same virus in different epidemiological scenarios (influenza A and HIV) and from simulation models known to produce trees with different shapes. Using mutual information and supervised learning algorithms, we find that the statistics adapted from network science perform as well as or better than conventional statistics. We describe their distributions and prove some basic results about their extreme values in a tree. We conclude that network science-based tree shape summaries are a promising addition to the toolkit of tree shape features. All our shape summaries, as well as functions to select the most discriminating ones for two sets of trees, are freely available as an R package at http://github.com/Leonardini/treeCentrality.

measure is an edit distance between trees with the same labels (i.e., relating the same taxa).

The resulting distance matrix for a set of tree topologies can then be interpreted as a measure space for supervised or unsupervised classifiers [1]. However, the requirement for shared labels in the Robinson-Foulds distance prevents its application to the present problem of comparing phylogenies from different viruses.

To overcome this limitation, we previously adapted a kernel method from natural language processing [2] to provide a similarity measure that operates on both the topology and branch lengths of trees [3]. Every tree is comprised of a large number of subset trees that can each act as a feature. A subset tree is a contiguous set of branches rooted at an internal node of its parent tree, which does not necessarily include all descendants of that internal node. In other words, the subset tree does not have to extend out to the tips of the parent tree.

A subset tree is completely defined by its branching order if we rotate branches of the tree (so-called "ladderization") so that all branching events occur preferentially to one side. For a tree of even modest size, the number of all possible subset trees is extremely large; the space of all possible subset trees is even more immense. Clearly, it is not feasible to exhaustively enumerate the appearance of every possible subset tree for a given observed tree.

The kernel trick is a well-established technique in machine learning that provides an efficient way to compare trees with respect to their subset trees by limiting the comparison to those features that appear in one or both trees [3]. Calculating the inner products over this restricted set of features for every pair of trees yields a distance matrix which defines a projection of the trees into a high dimensional space with convenient properties for machine learning.

We applied this kernel method to the data sets examined above with the following parameter [...]

Furthermore, applying the same method to trees from the three different samples of influenza A virus outbreaks was less promising (Figure S4).

While it is possible that adjusting the kernel tuning parameters (λ and σ) could yield better results, we would risk over-fitting this method to these data. In contrast, Colless and Sackin imbalance separate global flu from the other two groups, while closeness, diameter and mean path separate the five-year flu trees from the global and USA groups. A combination of these summary statistics separates the three groups, as shown in Figure S3 below.

We observed a similar situation in the case of the other three datasets, in which the first [...]

Consider a phylogenetic tree on $n$ tips as a graph. Its degree assortativity is the Pearson correlation between the degrees of the sources (tails) and targets (heads) of its edges. We direct each edge away from the root, towards the tips. We show below that for $n \geq 4$, only two values of the assortativity coefficient can occur (only one such value can obviously occur for $n \leq 3$). We make use of the following formula for the Pearson correlation between $x \in \mathbb{R}^N$ and $y \in \mathbb{R}^N$: [...]

Based on this formula, we note that only two options can occur; the first option is when the root's children are both internal nodes, and the second option is when one of its children is a tip and the other, an internal node (which occurs when, for instance, there is an outgroup in the data). In the first case, we use [...] calculation then yields [...] where $A_1$ denotes the directed degree assortativity, calculated by substituting $x$ and $y$ directly into the formula above, while its undirected counterpart is calculated by [...] We note that, as $n \to \infty$, the directed degree assortativity tends to $0$ in both cases, while the undirected degree assortativity tends to $-1/3$ in both cases.

Figure S3: PCA biplots illustrating the separation among influenza A virus phylogenies sampled at global (green), five-year (red) and regional (United States, blue) scopes by using tree shape statistics.
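As a numerical illustration of the degree assortativity discussed above, the following minimal Python sketch computes the directed and undirected versions for two extreme tree shapes. It rests on several assumptions that are not taken from the text: the "degree" of a node is taken to be its total number of incident edges, the undirected version is computed by counting every edge in both orientations, and the helper names (`pearson`, `caterpillar`, `balanced`, `degree_assortativity`) are hypothetical. It is not the implementation used in the R package.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def caterpillar(n):
    """(parent, child) edges of the caterpillar tree on n tips; node 0 is the root."""
    edges, parent, next_id = [], 0, 1
    for _ in range(n - 2):
        tip, internal = next_id, next_id + 1
        edges += [(parent, tip), (parent, internal)]
        parent, next_id = internal, next_id + 2
    edges += [(parent, next_id), (parent, next_id + 1)]
    return edges

def balanced(n):
    """(parent, child) edges of a heap-shaped binary tree on n tips; node 0 is the root."""
    return [(i, c) for i in range(n - 1) for c in (2 * i + 1, 2 * i + 2)]

def degree_assortativity(edges, directed=True):
    """Correlation between the degrees at the two ends of each edge.

    Assumption: the degree of a node is its number of incident edges.
    For the undirected version, each edge is counted in both orientations.
    """
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    pairs = list(edges)
    if not directed:
        pairs += [(v, u) for u, v in edges]
    x = [deg[u] for u, _ in pairs]
    y = [deg[v] for _, v in pairs]
    return pearson(x, y)

if __name__ == "__main__":
    for n in (8, 64, 512, 4096):
        for name, tree in (("caterpillar", caterpillar(n)), ("balanced", balanced(n))):
            print(n, name,
                  degree_assortativity(tree, directed=True),
                  degree_assortativity(tree, directed=False))
```

For $n \geq 4$ the caterpillar has a tip as one child of the root, while the heap-shaped tree has two internal children, so the two shapes cover the two cases distinguished in the text. Under the assumptions above, the printed values should drift towards $0$ (directed) and $-1/3$ (undirected) as $n$ grows.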
Figure S4: Kernel PCA plots illustrating the overall lack of separation among influenza A virus phylogenies sampled at global (green), five-year (red) and regional (United States, blue) scopes by using the tree kernel method.

The diameter for a phylogenetic tree on $n$ tips ranges from $\approx 2 \log_2(n)$ for the maximally [...] Figure S10 shows the distribution of all the diameters, together with one of the trees achieving the minimum and maximum values of the diameter, respectively.

The Wiener index for a phylogenetic tree on $n$ tips ranges from $\approx 4n^2 \log_2(n)$ to $\approx \frac{2}{3} n^3$, [...]

The distribution of maximum betweenness centrality values for $n = 23$ tips ranges from 505 to 645, with a mean of 587.592. Figure S13 shows the distribution of all the maximum betweenness centrality values, together with one of the trees achieving the minimum and maximum values, respectively.
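To make the diameter and the Wiener index concrete, here is a minimal Python sketch that computes both in linear time for a tree given as a list of (parent, child) edges. It assumes unit branch lengths (distances are counted in edges) and distances taken over all nodes, internal ones included; the function names are hypothetical, and the two-pass search and the edge-contribution formula are standard techniques for trees rather than the authors' confirmed implementation.

```python
from collections import deque

def farthest(adj, start):
    """Breadth-first search: return (a farthest node from start, its distance in edges)."""
    dist = {start: 0}
    queue = deque([start])
    last = start
    while queue:
        u = queue.popleft()
        last = u  # nodes leave the queue in order of non-decreasing distance
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return last, dist[last]

def diameter(edges):
    """Longest path (in edges) between any two nodes, via two searches."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    a, _ = farthest(adj, edges[0][0])
    _, d = farthest(adj, a)
    return d

def wiener_index(edges, root=0):
    """Sum of distances over all unordered pairs of nodes.

    Each edge (parent, child) lies on s * (N - s) shortest paths,
    where s is the number of nodes in the child's subtree.
    """
    children = {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
    order, stack = [], [root]
    while stack:  # pre-order traversal: parents appear before their children
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    n_nodes = len(order)
    size, total = {}, 0
    for u in reversed(order):  # children are processed before parents
        size[u] = 1 + sum(size[c] for c in children.get(u, []))
        if u != root:
            total += size[u] * (n_nodes - size[u])
    return total

# Two trees on 4 tips, node 0 being the root.
caterpillar4 = [(0, 1), (0, 2), (2, 3), (2, 4), (4, 5), (4, 6)]
balanced4 = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5), (2, 6)]
print(diameter(caterpillar4), wiener_index(caterpillar4))
print(diameter(balanced4), wiener_index(balanced4))
```

Weighted branch lengths could be handled in the same way by replacing the unit step in the search with the branch length and multiplying each edge contribution by its length.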

The distribution of maximum closeness centrality values for $n = 23$ tips ranges from 143 to 285, with a mean of 189.920. Figure S14 shows the distribution of all the maximum closeness [...] Figure S15 shows the distribution of all the maximum eigenvector [...]

Figure S7: Kernel PCA plot illustrating the separation among phylogenies simulated from a biased model (red), a Yule model (purple) and two birth-death models (green and blue) by using the kernel method.

and at least one of these inequalities is strict. We call a node which satisfies this condition [...]

Figure S8: The same kernel PCA plot as above, components 2 and 3.

$u$ itself). If we denote by $F(x)$ the farness of a node $x$, then [...] This recurrence is in fact the basis for a linear-time algorithm for computing $F(x)$ for all nodes $x$ in $T_0$, starting from the root. Recall that $w(uv) > 0$ by assumption. Therefore, in order for $F(u)$ to be the smallest farness, we must have [...] since $N = 2n - 1$ is odd, and in fact, the first two inequalities must be strict. It follows that the subtree $T_v$ of $T$ rooted at any child $v$ of $u$ must contain fewer than half of all the nodes; therefore, $u$ must satisfy the definition of a centroid. We note that two closely [...]

Figure S9: PCA biplots illustrating the separation among phylogenies simulated from a biased model (red), a Yule model (purple) and two birth-death models (green and blue) by using tree shape statistics.
For the uniqueness (in phylogenetic trees), suppose that there is another node in $T_0$, say $w \neq u$, that also has the defining property of a centroid. Consider the connected component $C_u$ of $T_0$ after $w$ is removed that contains $u$, and the connected component $C_w$ of $T_0$ after $u$ is removed that contains $w$. Since $u$ and $w$ are centroids, $C_u$ and $C_w$ each contain fewer than half the nodes, so their union does not contain some node $v$ of $T_0$. Since $v \notin C_u$, the path from $v$ to $u$ passes through $w$, so the path from $v$ to $w$ avoids $u$. But then $v \in C_w$, a contradiction.

Lastly, we note that for any internal node $v$, there is some tip $w$ such that the path from $v$ to $w$ strictly increases in farness; indeed, it is sufficient to always pick the child of the current node whose subtree contains fewer than half of all the nodes, by equation (1).

Therefore, the maximum farness is attained at a tip, not at an internal node.
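The farness argument above can be checked numerically with a short Python sketch of a linear-time computation of $F(x)$ for every node. It assumes that farness is the sum of weighted distances from a node to all $N$ nodes of the tree, and it uses the standard parent-to-child update $F(v) = F(u) + w(uv)\,(N - 2|T_v|)$, which follows from that definition: moving from $u$ to its child $v$ brings the $|T_v|$ nodes of the subtree closer by $w(uv)$ and pushes the other $N - |T_v|$ nodes away by the same amount. Whether this matches the exact form of equation (1) is an assumption, and the function name `farness_all` is hypothetical.

```python
def farness_all(edges, root=0):
    """Farness of every node of a rooted tree in linear time.

    edges: list of (parent, child, branch_length) triples.
    Assumption: farness is the sum of distances to all nodes of the tree.
    """
    children = {}
    for u, v, w in edges:
        children.setdefault(u, []).append((v, w))
    order, stack = [], [root]
    while stack:  # pre-order traversal: parents appear before their children
        u = stack.pop()
        order.append(u)
        stack.extend(v for v, _ in children.get(u, []))
    n_nodes = len(order)
    size = {u: 1 for u in order}
    down = {u: 0 for u in order}  # sum of distances from u to the nodes below it
    for u in reversed(order):     # bottom-up pass
        for v, w in children.get(u, []):
            size[u] += size[v]
            down[u] += down[v] + w * size[v]
    farness = {root: down[root]}
    for u in order:               # top-down pass using the update rule
        for v, w in children.get(u, []):
            farness[v] = farness[u] + w * (n_nodes - 2 * size[v])
    return farness

# Caterpillar tree on 4 tips with unit branch lengths; node 0 is the root.
example = [(0, 1, 1), (0, 2, 1), (2, 3, 1), (2, 4, 1), (4, 5, 1), (4, 6, 1)]
F = farness_all(example)
print(F)
print("smallest farness at node", min(F, key=F.get))
print("largest farness at node", max(F, key=F.get))
```

In this example the smallest farness is attained at internal node 2, whose removal leaves components of 2, 1 and 3 nodes (each fewer than half of the 7 nodes), and the largest farness is attained at tip 1.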

To connect the concepts of centroid and tricenter, we note that a tricenter, if one exists, is also a centroid, by definition; in particular, a tricenter, if it exists, is always unique; however, a phylogenetic tree always has a centroid, while it may not have a tricenter.

We give an example of a pair of trees on $n = 3$ tips, with different branch lengths, whose distance Laplacian matrices are co-spectral. We obtained them by exploring all possible integer branch lengths, in order of total branch length (from smallest to largest), and used Maple [8] to check, for each resulting weighted topology, the existence of a different set of branch lengths that would result in a distance Laplacian matrix having the exact same characteristic polynomial, and hence the exact same spectrum. The first tree is therefore the tree with integer branch lengths having the smallest possible total length and admitting a co-spectral tree, the second tree, with respect to the distance Laplacian matrix.
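The co-spectrality check described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Maple procedure used by the authors: the distance Laplacian is taken to be $\mathrm{diag}(\text{row sums of } D) - D$, where $D$ holds the path lengths between all nodes (tips and internal nodes alike), and the characteristic polynomial is obtained with the Faddeev-LeVerrier recurrence in exact rational arithmetic. All function names are hypothetical. The example tree uses the branch lengths $a = 2$, $b = 2$, $c = d = 1$ reported for the first tree.

```python
from fractions import Fraction

def distance_matrix(edges):
    """All-pairs path lengths in a tree given as (u, v, branch_length) triples."""
    nodes = sorted({x for u, v, _ in edges for x in (u, v)})
    adj = {u: [] for u in nodes}
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    rows = []
    for s in nodes:
        dist, stack = {s: 0}, [s]
        while stack:
            u = stack.pop()
            for v, w in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + w
                    stack.append(v)
        rows.append([dist[t] for t in nodes])
    return rows

def distance_laplacian(D):
    """Assumed definition: diagonal of row sums of D, minus D."""
    n = len(D)
    return [[sum(D[i]) if i == j else -D[i][j] for j in range(n)] for i in range(n)]

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def char_poly(A):
    """Coefficients [1, c_{n-1}, ..., c_0] of det(lambda*I - A), via Faddeev-LeVerrier."""
    n = len(A)
    A = [[Fraction(x) for x in row] for row in A]
    coeffs = [Fraction(1)]
    M = [[Fraction(0)] * n for _ in range(n)]
    for k in range(1, n + 1):
        AM = matmul(A, M)
        # M_k = A * M_{k-1} + (previous coefficient) * I
        M = [[AM[i][j] + (coeffs[-1] if i == j else 0) for j in range(n)] for i in range(n)]
        AMk = matmul(A, M)
        coeffs.append(-sum(AMk[i][i] for i in range(n)) / k)
    return coeffs

# First tree: root 0, tip 1 (a = 2), internal node 2 (b = 2), tips 3 and 4 (c = d = 1).
first_tree = [(0, 1, 2), (0, 2, 2), (2, 3, 1), (2, 4, 1)]
L = distance_laplacian(distance_matrix(first_tree))
print(char_poly(L))
```

Two sets of branch lengths on the same topology would be co-spectral in this sense exactly when `char_poly` returns the same list of coefficients for both.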

The first tree has branch lengths $a = 2$ from the root to the first tip, $b = 2$ from the root to the internal node, and $c = d = 1$ from the internal node to the remaining two [...]

Figure S16: The 6 phylogenetic trees on $n = 6$ tips, with their betweenness, closeness, and eigenvector centrality between [...]