Supplemental Materials: A convenient correspondence between k-mer-based metagenomic distances and phylogenetically-informed β-diversity measures



Blocked diagonal structure of $E(MM^T)$ in balanced binary tree
Given a root sequence of length $\ell$, each leaf sequence contains $\ell - k + 1$ k-mers for $1 \le k \le \ell$. $M$ can be decomposed directly with respect to those $\ell - k + 1$ k-mers: $M_i$ is the counting matrix of size $p$ by $4^k$ obtained when we look only at the nucleotides in positions $i$ to $i + k - 1$ of each leaf sequence, provided that $i + k - 1 \le \ell$.
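The decomposition can be illustrated with a minimal sketch. The toy sequences, the helper names (`kmer_index`, `position_matrix`, `count_matrix`) and the lexicographic column order are assumptions made for illustration only; positions are 0-based in the code, whereas the text uses 1-based positions.

```python
from itertools import product

ALPHABET = "ACGT"

def kmer_index(k):
    """Map each of the 4**k possible k-mers to a column index."""
    return {"".join(p): a for a, p in enumerate(product(ALPHABET, repeat=k))}

def position_matrix(seqs, i, k, index):
    """M_i: indicator matrix (leaves x 4**k) for the k-mer starting at 0-based position i."""
    M_i = [[0] * len(index) for _ in seqs]
    for n, s in enumerate(seqs):
        M_i[n][index[s[i:i + k]]] = 1  # exactly one k-mer matches per leaf
    return M_i

def count_matrix(seqs, k):
    """M: k-mer count matrix, accumulated as the sum of the M_i over all start positions."""
    index = kmer_index(k)
    ell = len(seqs[0])
    M = [[0] * len(index) for _ in seqs]
    for i in range(ell - k + 1):
        M_i = position_matrix(seqs, i, k, index)
        for n in range(len(seqs)):
            for a in range(len(index)):
                M[n][a] += M_i[n][a]
    return M, index

# Three hypothetical leaf sequences of length ell = 6.
seqs = ["ACGTAC", "ACGTTC", "TTGTAC"]
M, index = count_matrix(seqs, k=2)
```

Each row of `M` sums to $\ell - k + 1$, because every start position contributes exactly one indicator per leaf.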
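With this, we have

$$M = \sum_{j=1}^{\ell-k+1} M_j.$$

The subscript $i$ in $M_i$ indicates the starting position of a k-mer, and given two subscripts $i$ and $j$ we can determine whether the two k-mers overlap and in which pattern they overlap. For two starting positions $i$ and $j$, if $|i - j| \ge k$ the two k-mers are "disjoint", and if $|i - j| = h$ with $0 \le h \le k - 1$ the two k-mers share $k - h$ sites. Then $MM^T$ can be written as

$$MM^T = \sum_{i=1}^{\ell-k+1} \sum_{j=1}^{\ell-k+1} M_i M_j^T = \sum_{h=0}^{k-1} \sum_{|i-j|=h} M_i M_j^T + \sum_{|i-j| \ge k} M_i M_j^T.$$

Correspondingly, the expectation $E(MM^T)$ can be expressed as

$$E(MM^T) = \sum_{h=0}^{k-1} \sum_{|i-j|=h} E(M_i M_j^T) + \sum_{|i-j| \ge k} E(M_i M_j^T).$$

Each $M_i$ is filled with indicator random variables $m_{i,(n,g_a)} = \mathbb{1}(S_n^{(i,i+k-1)} = g_a)$, where $S_n^{(i,i+k-1)}$ is the random sequence from position $i$ to $i + k - 1$ for leaf node $n$ and $g_a$ is some k-mer. To visualize,

$$M_i = \begin{pmatrix} m_{i,(1,g_1)} & m_{i,(1,g_2)} & \cdots & m_{i,(1,g_{4^k})} \\ \vdots & \vdots & & \vdots \\ m_{i,(p,g_1)} & m_{i,(p,g_2)} & \cdots & m_{i,(p,g_{4^k})} \end{pmatrix}.$$

For each starting index $i$ and leaf node $u$, the $u$th diagonal entry of $E(M_i M_i^T)$ is

$$\sum_{a} E\big(m_{i,(u,g_a)}^2\big) = \sum_{a} P\big(S_u^{(i,i+k-1)} = g_a\big) = 1.$$

For a pair of different leaf nodes $u$ and $v$ whose closest common ancestral node is $c$, we have

$$E(M_i M_i^T)_{uv} = \sum_{a_2} P\big(S_c^{(i,i+k-1)} = g_{a_2}\big) \sum_{a_1} P\big(S_u^{(i,i+k-1)} = g_{a_1} \mid S_c^{(i,i+k-1)} = g_{a_2}\big)\, P\big(S_v^{(i,i+k-1)} = g_{a_1} \mid S_c^{(i,i+k-1)} = g_{a_2}\big).$$

Given any k-mer $g_{a_2}$ of $S_c^{(i,i+k-1)}$, summing over all possible k-mers $g_{a_1}$ gives a quantity that depends only on $k$ and $p_0$, where $p_0$ is the probability of no mutation based on the branch length from the non-leaf node $c$ to $u$ and $v$ in the JC69 model [1].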
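The claim that the sum over $g_{a_1}$ depends on the ancestral k-mer only through $p_0$ can be checked numerically. The sketch below assumes that $p_0$ is the per-site probability of no change along each of the two branches (equal for $u$ and $v$, as in an ultrametric tree), that each of the other three nucleotides has probability $(1 - p_0)/3$, and that sites evolve independently. The closed form $(p_0^2 + (1 - p_0)^2/3)^k$ is derived here under those assumptions and is not recovered from the source.

```python
from itertools import product

ALPHABET = "ACGT"

def site_prob(a, b, p0):
    """JC69-style probability that nucleotide a ends as b along one branch."""
    return p0 if a == b else (1.0 - p0) / 3.0

def kmer_prob(anc, desc, p0):
    """Probability that ancestral k-mer `anc` ends as `desc`, sites independent."""
    prob = 1.0
    for a, b in zip(anc, desc):
        prob *= site_prob(a, b, p0)
    return prob

def match_prob_bruteforce(anc, p0):
    """Sum over all k-mers g of P(u = g | c = anc) * P(v = g | c = anc)."""
    k = len(anc)
    return sum(kmer_prob(anc, "".join(g), p0) ** 2
               for g in product(ALPHABET, repeat=k))

def match_prob_closed_form(k, p0):
    """Assumed closed form: per-site match probability raised to the power k."""
    return (p0 ** 2 + (1.0 - p0) ** 2 / 3.0) ** k

p0 = 0.7
brute_acg = match_prob_bruteforce("ACG", p0)
brute_ttt = match_prob_bruteforce("TTT", p0)
closed = match_prob_closed_form(3, p0)
```

The brute-force sums agree for different ancestral k-mers, which is why the outer sum over $g_{a_2}$ collapses.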

Structure of $E(M_i M_j^T)$
Fix a starting index $i$. For any other starting position $j$ such that $|i - j| \ge k$, the $u$th diagonal entry of $E(M_i M_j^T)$ is

$$E(M_i M_j^T)_{uu} = P\big(S_u^{(i,i+k-1)} = S_u^{(j,j+k-1)}\big).$$

For a pair of different leaf nodes $u$ and $v$, assume w.l.o.g. that starting index $i$ is associated with $u$ and $j$ is associated with $v$. We have

$$E(M_i M_j^T)_{uv} = P\big(S_u^{(i,i+k-1)} = S_v^{(j,j+k-1)}\big).$$

Since $u$ and $v$ are both leaf nodes and the balanced binary tree is ultrametric, we have […].

Without loss of generality, fix indices $i$ and $j$ with $j - i = h$. Again, we examine the diagonal and off-diagonal entries of $E(M_i M_j^T)$. Conditional on $S_0^{(i,j+k-1)}$, the diagonals are a function only of the distance from the root to the tip $u$ in the phylogenetic tree. When the tree is ultrametric, those distances are all equal, and therefore all the diagonals are the same.
Consider a pair of different leaf nodes $u$ and $v$ whose closest common ancestral node is $c$. The off-diagonal entries are […] and […], where $g_{a_2}$ is the generic expression for any $(k + h)$-mer and superscripts denote positions. With the complete binary tree, and conditional on the corresponding ancestral subsequence, each term is a function only of the distance between tips $u$ and $v$, so that for each pair $i$ and $j$, $E(M_i M_j^T)$ is symmetric. Now consider a triplet of different leaf nodes $u$, $v_1$ and $v_2$, and denote the most recent common ancestral node of $u$ and $v_1$ by $c_1$ and that of $u$ and $v_2$ by $c_2$. The two quantities that we want to examine are […] and […].
Recall that with the ultrametric balanced binary phylogenetic tree, given a pair of tips $u$ and $v$ and their most recent common ancestral node $c$, we have […], where $d(\cdot, \cdot)$ is the distance function on the tree and $r$ is the root of the tree.
For each $g_{a_2}$ and $g_{a_1}$, $P$ is a function of $d(r, c_1)$ as well as a function of […]. From this perspective, we conclude that […], that is, when the tips $v_1$ and $v_2$ do not share the same most recent common ancestral node with respect to $u$. Therefore, by examining the distances between tips on the complete binary tree, we have that $E(M_i M_j^T)$ is in the blocked diagonal form. We have thus established that, with the JC69 model and a balanced binary phylogeny, $E(MM^T)$ is in the blocked diagonal form.
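The dependence on the most recent common ancestor can be made concrete with a small sketch. It assumes the leaves of a balanced binary tree are numbered 0 to $2^m - 1$ from left to right and that every branch has the same length; the function names and the unit branch length are illustrative choices, not taken from the source.

```python
def mrca_height(u, v):
    """Edges from leaf u (or v) up to their most recent common ancestor.

    Leaves are numbered left to right, so two leaves share an ancestor h
    levels up exactly when their indices agree after dropping the lowest
    h bits.
    """
    return (u ^ v).bit_length()

def tip_distance(u, v, branch=1.0):
    """Path length between tips u and v, assuming equal branch lengths."""
    return 2 * branch * mrca_height(u, v)

def height_matrix(m):
    """Matrix of MRCA heights for a balanced binary tree with 2**m leaves."""
    n = 2 ** m
    return [[mrca_height(u, v) for v in range(n)] for u in range(n)]

H = height_matrix(3)  # 8 tips
```

Any matrix whose $(u, v)$ entry is a function of `H[u][v]` alone has two identical diagonal blocks and constant off-diagonal blocks, and the same holds recursively inside each diagonal block.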

Eigenstructure of $E(MM^T)$ in balanced binary tree
Assume the balanced binary tree has depth $d$ (with the root $r$ at depth 1). Then the blocked diagonal matrix $D$, whose entries are denoted by $\mu$'s, can be expressed as

$$D = \begin{pmatrix} D_1 & \mu_{d-1} J_{2^{d-2}} \\ \mu_{d-1} J_{2^{d-2}} & D_1 \end{pmatrix},$$

in which $D_1$ is also a blocked diagonal matrix of the same form (with a smaller size) and $J_n$ is the all-ones square matrix of size $n$.
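A minimal sketch of the recursive construction follows. The list `mu` and its indexing are assumptions for illustration: `mu[0]` is the diagonal entry and `mu[h]` is the shared entry for leaf pairs whose most recent common ancestor lies `h` levels above the leaves, so the outermost off-diagonal blocks use the last element of `mu`.

```python
def build_D(mu):
    """Build the nested block matrix from its distinct entries.

    Returns a 2**m x 2**m nested list with m = len(mu) - 1. At each step
    the current matrix is placed twice on the diagonal and the two
    off-diagonal blocks are filled with the constant mu[h].
    """
    D = [[mu[0]]]
    for h in range(1, len(mu)):
        n = len(D)
        top = [row + [mu[h]] * n for row in D]
        bottom = [[mu[h]] * n + row for row in D]
        D = top + bottom
    return D

# Hypothetical values, chosen only to make the structure visible.
D = build_D([5, 3, 2, 1])
```

With these values the first row is `[5, 3, 2, 2, 1, 1, 1, 1]`, and every row has the same sum.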

Eigenvalues of D
$D$ is a non-negative matrix and every row of $D$ sums to the same number. By the Perron-Frobenius theorem, the leading eigenvalue $\lambda_1$ is the row sum. The determinant of $D$ is

$$\det(D) = \det\big(D_1 - \mu_{d-1} J_{2^{d-2}}\big)\, \det\big(D_1 + \mu_{d-1} J_{2^{d-2}}\big).$$

Since $D_1$ has the same matrix form (with a smaller size) and $\mu_{d-1} J_{2^{d-2}}$ is a constant matrix, both $D_1 - \mu_{d-1} J_{2^{d-2}}$ and $D_1 + \mu_{d-1} J_{2^{d-2}}$ are also in the same block form as $D_1$. Therefore, the determinant of $D$ can be found recursively. The characteristic polynomial $p_d(\lambda)$ of $D$ is […], from which all the eigenvalues and their multiplicities can be identified.
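Both facts used here, the constant row sum and the determinant factorization, can be checked exactly on a small case. The entries in `mu`, the helper `build_D` and the exact-arithmetic `det` routine are illustrative assumptions rather than part of the source.

```python
from fractions import Fraction

def build_D(mu):
    """Nested block matrix: mu[0] on the diagonal, mu[h] in the level-h off-diagonal blocks."""
    D = [[mu[0]]]
    for h in range(1, len(mu)):
        n = len(D)
        D = [row + [mu[h]] * n for row in D] + [[mu[h]] * n + row for row in D]
    return D

def det(A):
    """Determinant by exact Gaussian elimination over the rationals."""
    A = [[Fraction(x) for x in row] for row in A]
    n, sign, result = len(A), 1, Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if A[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            A[c], A[pivot] = A[pivot], A[c]
            sign = -sign
        result *= A[c][c]
        for r in range(c + 1, n):
            factor = A[r][c] / A[c][c]
            for j in range(c, n):
                A[r][j] -= factor * A[c][j]
    return sign * result

mu = [5, 3, 2, 1]          # hypothetical entries
D = build_D(mu)            # 8 x 8
D1 = build_D(mu[:-1])      # 4 x 4 diagonal block
n = len(D1)
D1_minus = [[D1[r][c] - mu[-1] for c in range(n)] for r in range(n)]
D1_plus = [[D1[r][c] + mu[-1] for c in range(n)] for r in range(n)]

row_sum = sum(D[0])
lhs = det(D)
rhs = det(D1_minus) * det(D1_plus)
```

Applying the same factorization to `D1_minus` and `D1_plus` gives the recursion described in the text.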

Eigenspace of D
Based on the block structure of $D \in \mathbb{R}^{2^j \times 2^j}$, where $2^j$ is the total number of leaves in the balanced binary tree, we show here how to write $D$ as a linear combination of matrices $D_{ij}$. $D_{ij}$ is a $2^j \times 2^j$ matrix with $2^i \times 2^i$ blocks of 1's on the diagonal.
Let $c_{ijk}$ denote a vector of length $2^j$ with a block of $2^{i+1}$ non-zero elements, the first $2^i$ of which are equal to 1 and the remaining $2^i$ of which are equal to $-1$. Here $k$ is the index for such vectors: for fixed values of $i$ and $j$, there are $2^{j-i-1}$ possibilities for $k$.
Notice that $c_{i_2 jk}$ is an eigenvector of $D_{i_1 j}$ if $i_2 \ge i_1$. Furthermore, the $c_{ijk}$'s form an orthogonal basis for $\mathbb{R}^{2^j}$. If $i < j - 1$, we can write […], and if $i = j$, we have […]. This means that for any $i < j$, we have […]. Finally, if the number of leaves on our tree is $2^j$, notice that we can write $E(MM^T)$ as […]. We can get the eigenvalues in two pieces. The eigenvalue corresponding to $c_{jj0}$ is […]. This tells us that the eigenvalue corresponding to the eigenvector $c_{ijk}$ is […].
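The eigenvector claims can be verified directly on a small example. Several choices below are assumptions made for illustration and are not taken from the source: the non-zero block of $c_{ijk}$ starts at position $k \cdot 2^{i+1}$, the all-ones vector is used in the role of $c_{jj0}$, and $D$ is written as $\sum_h (\mu_h - \mu_{h+1}) D_{hj}$ with hypothetical entries $\mu_h$ and $\mu_{j+1} = 0$.

```python
def D_block(i, j):
    """D_ij: 2**j x 2**j matrix with 2**i x 2**i all-ones blocks on the diagonal."""
    n, b = 2 ** j, 2 ** i
    return [[1 if r // b == c // b else 0 for c in range(n)] for r in range(n)]

def c_vec(i, j, k):
    """c_ijk: 2**i ones followed by 2**i minus-ones, starting at k * 2**(i+1)."""
    n, b = 2 ** j, 2 ** i
    v = [0] * n
    start = k * 2 * b
    for t in range(b):
        v[start + t] = 1
        v[start + b + t] = -1
    return v

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

j = 3
n = 2 ** j
mu = [5, 3, 2, 1]  # hypothetical: mu[h] for pairs whose MRCA is h levels up
coef = [mu[h] - (mu[h + 1] if h < j else 0) for h in range(j + 1)]
blocks = [D_block(h, j) for h in range(j + 1)]
D = [[sum(coef[h] * blocks[h][r][c] for h in range(j + 1)) for c in range(n)]
     for r in range(n)]

def eigenvalue(i):
    """Eigenvalue of D on c_ijk: only the D_hj with h <= i contribute, each with 2**h."""
    return sum(coef[h] * 2 ** h for h in range(i + 1))

basis = [(i, c_vec(i, j, k)) for i in range(j) for k in range(2 ** (j - i - 1))]
ones = [1] * n
```

Together with the all-ones vector, the $2^j - 1$ contrast vectors give $2^j$ mutually orthogonal eigenvectors, so the spectrum of this example is fully determined by `eigenvalue(i)` for $i = 0, \dots, j$.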

Eigenstructure of $E(MM^T)$ for trees with 128 tips
To better demonstrate our findings on the eigenvectors for the binary tree and the comb tree, we repeat the analysis on trees with more leaves.
Fig A: The second, third and fourth-largest eigenvectors of the balanced binary tree.
The features defined by the eigenvectors for a balanced binary tree with 128 tips follow the same pattern as in the paper: features with progressively smaller eigenvalues measure the difference in abundance between successively smaller clades and their sisters. The leading eigenvectors (features) behave the same as the features in Figure 6 of the paper and have the same interpretations. With more leaves on the comb tree, it is more apparent that features in the comb tree combine contrasts between close sister clades with contrasts between the most distinct taxa. For small k, between 1 and 3, the third-largest eigenvectors can be read both as a contrast between the taxa on the left-hand side and as a contrast between two close sister clades on the right-hand side. When k is between 4 and 7, the pattern of a contrast between one part of the tree and another is more evident.

Eigenstructure of Q in MPQ distances
The relationship between EKS distances and MPQ distances can be established by investigating the eigenstructure of the inner product matrix $Q_r$ used in the MPQ distances. The eigenstructure of $Q$ is similar to that of $E(MM^T)$ with the balanced binary tree. In other settings, although the eigenvectors do not look exactly the same, they build contrasts between closely related clades, contrasts between more distinct taxa, or both.

[Figure panel labels: r = 0, .50, .90, .95, 1.00; k = 1, 2, 3, 4, 5, 8, 10, 30, 50, 75.]