Scalable Methods for Post-Processing, Visualizing, and Analyzing Phylogenetic Placements

The exponential decrease in molecular sequencing cost generates unprecedented amounts of data. Hence, scalable methods to analyze these data are required. Phylogenetic (or Evolutionary) Placement methods identify the evolutionary provenance of anonymous sequences with respect to a given reference phylogeny. This increasingly popular method is deployed for scrutinizing metagenomic samples from environments such as water, soil, or the human gut. Here, we present novel and, more importantly, highly scalable methods for analyzing phylogenetic placements of metagenomic samples. More specifically, we introduce methods for visualizing differences between samples and their correlation with associated meta-data on the reference phylogeny, as well as for clustering similar samples using a variant of the k-means method. To demonstrate the scalability and utility of our methods, as well as to provide exemplary interpretations of our methods, we applied them to 3 publicly available datasets comprising 9782 samples with a total of approximately 168 million sequences. The results indicate that new biological insights can be attained via our methods.


Introduction
The availability of high-throughput DNA sequencing technologies has revolutionized Thus, for each branch of the tree, the process yields a so-called placement of the QS, 67 that is, an optimized position on the branch, along with a likelihood score for the whole 68 tree. The likelihood scores for a QS are then transformed into probabilities, which 69 quantify the uncertainty of placing the sequence on the respective branch [32,33]. Those 70 probabilities are called likelihood weight ratios (LWRs). The accumulated LWR sum 71 over all branches for a single QS is 1.0. Figure 1 shows an example depicting the 72 placements of one QS, including the respective LWRs. Each branch of the reference tree is tested as a potential insertion position, called a "placement" (blue dots). Note that placements have a specific position on their branch, due to the branch length optimization process. A probability of how likely it is that the sequence belongs to a specific branch is computed (numbers next to dots), which is called the likelihood weight ratio (LWR). The bold number (0.7) denotes the most probable placement of the sequence.
This process is repeated for every QS. Note that the placement process is conducted 74 independently for each QS. That is, for each QS, the algorithm starts calculating 75 placements from scratch on the original RT. 76 In summary, the result of a phylogenetic placement analysis is a mapping of the QSs 77 in a sample to positions on the branches of the RT. Each such position, along with the 78 corresponding LWR, is called a placement of the QS. 79 When placing multiple samples, for instance, from different locations, typically, the 80 same RT is used, in order to allow for comparisons of the phylogenetic composition of 81 these samples. In this context, it sis important to consider how to properly normalize 82 the samples. Normalization is required as the sample size (often also called library size), 83 that is, the number of sequences per sample, can vary by several orders of magnitude, 84 due to efficiency variations in the sequencing process or biases introduced by the 85 amplification process. Selecting an appropriate normalization strategy constitutes a 86 common problem in many metagenomic studies. The appropriateness depends on data 87 characteristics [34], but also on the biological question asked. For example, estimating 88 indices such as the species richness are often implemented via rarefaction and 89 rarefaction curves [35], which however ignores a potentially large amount of the 90 available valid data [36]. Furthermore, the specific type of input sequence data has to 91 be taken into account for normalization: Biases induced by the amplification process 92 can potentially be avoided if, instead of amplicons, data based on shotgun sequencing 93 are used, such as mi tags [37]. Moreover, the sequences can be clustered prior to 94 phylogenetic placement analysis, for instance, by constructing operational taxonomic 95 units (OTUs) [38][39][40][41]. Analyses using OTUs focus on species diversity instead of simple 96 abundances. OTU clustering substantially reduces the number of sequences, and hence 97 greatly decreases the computational cost for placement analyses. Lastly, one may 98 completely ignore the abundances (which are called "multiplicities" of placements) of 99 the placed sequences, reads, or OTUs, and only be interested in their presence/absence 100 3/36 when comparing samples. 101 Which of the above analysis strategies is deployed, depends on the specific design of 102 the study and the research question at hand. The common challenge is that the number 103 of sequences per sample differs, which affects most post-analysis methods. Before 104 introducing our methods, we therefore explain how the necessary normalizations of 105 sample sizes can be performed in the following. We also describe general techniques for 106 interpreting and working with phylogenetic placement data. Some of these techniques 107 have been used before as building blocks for methods like Edge PCA and Squash 108 Clustering [23,42]. 109 Edge Masses 110 Methods that compare samples directly based on their sequences, such as the UniFrac 111 distance [43,44], can benefit from rarefaction [34]. However, in the context of 112 phylogenetic placement, rarefaction is not necessary. Thus, more valid data can be kept. 113 To this end, it is convenient to think of the reference tree as a graph (when exploiting 114 graph properties of the tree, we refer to the branches of the tree as edges). Then, the 115 per-branch LWRs for a single QS can be interpreted as mass points distributed over the 116 edges of the RT, including their respective placement positions on the branches, 117 cf. Figure 1. This implies that each QS has a total accumulated mass of 1.0 on the RT. 118 We call this the mass interpretation of the placed QSs, and henceforth use mass and 119 LWR interchangeably. The mass of an edge refers to the sum of the LWRs on that edge 120 for all QSs of a sample, as shown in Figure 2(a). The total mass of a sample is then the 121 sum over all edge masses, which is identical to the number of QSs in the sample. 122 The key idea is to use the distribution of placement mass points over the edges of 123 the RT to characterize a sample. This allows for normalizing samples of different size by 124 scaling the total sample mass to unit mass 1.0. In other words, absolute abundances are 125 converted into relative abundances. This way, rare species, which might have been 126 removed by rarefaction, can be kept, as they only contribute a negligible mass to the 127 branches into which they have been placed. This approach is analogous to using 128 proportional values for methods based on OTU count tables, that is, scaling each 129 sample/column of the table by its sum of OTU counts [34]. Most of the methods 130 presented here use normalized samples, that is, they use relative abundances. As 131 relative abundances are compositional data, certain caveats occur [45,46], which we 132 discuss where appropriate. 133 When working with large numbers of QSs, the mass interpretation allows to further 134 simplify and reduce the data: The masses on each edge of the tree can be quantized into 135 b discrete bins, that is, each edge is divided into b intervals (or bins) of the 136 corresponding branch length. All mass points on that edge are then accumulated into 137 their respective nearest bin. The parameter b controls the resolution and accuracy of 138 this approximation. In the extreme case b := 1, all masses on an edge are grouped into 139 one single bin. This branch binning process drastically reduces the number of mass 140 points that need to be stored and analyzed in several methods we present, while only 141 inducing a negligible decrease in accuracy. As shown in Supplementary Table 1, branch 142 binning can yield a speedup of up to 75% for post-analysis run-times.
otherwise propagates the effects of differing sample sizes all across the tree. An example 163 of the imbalance calculation is shown in Figure 2(a). The edge imbalance relates the 164 masses on the two sides of an edge to each other. This implicitly captures the RT 165 topology and reveals information about its clades. Furthermore, this transformation can 166 also reveal differences in the placement mass distribution of nearby branches of the tree. 167 This is in contrast to the KR distance, which yields low values for masses that are close 168 to each other on the tree. An example that illustrates the different use cases for edge 169 mass and edge imbalance metrics is shown in Figure 3.

170
The edge masses and edge imbalances per sample can be summarized by two 171 matrices, which we use for all further downstream edge-and clade-related analyses, 172 respectively. In these matrices, each row corresponds to a sample, and each column to 173 an edge of the RT. Note that these matrices can either store absolute or relative 174 abundances, depending on whether the placement mass was normalized.

175
Furthermore, many studies provide meta-data for their samples, for instance, the pH 176 value or temperature of the samples' environment. Such meta-data features can also be 177 summarized in a per-sample matrix, where each column corresponds to one feature. The 178 three matrices are shown in Figure 2(b). Quantitative meta-data features are the most 179 suitable for our purposes, as they can be used to detect correlations with the placement 180 mass distributions of samples. For example, Edge principal components analysis (Edge 181 PCA) [23] is a method that utilizes the imbalance matrix to detect and visualize edges 182 with a high heterogeneity of mass difference between samples. Edge PCA further allows 183 to annotate its plots with meta-data variables, for instance, by coloring, thus 184 establishing a connection between differences in samples and differences in their 185 meta-data [14]. In the following, we propose several new techniques to analyze 186 placement data and their associated meta-data.

188
In this section, we introduce novel methods for analyzing and visualizing the 189 phylogenetic placement data of a set of environmental samples. Each such sample 190 represents a geographical location, a body site, a point in time, etc. In the following, we 191 represent a sample by the placement locations of its metagenomic QSs, including the 192 respective per-branch LWRs. Furthermore, for a specific analysis, we assume the 193 standard use case, that is, all placements were computed on the same fixed reference 194 tree (RT) and reference alignment. We initially describe the methods. We then assess 195 their application to real world data and their computational efficiency in Section scaling, as shown in [11].

209
These simple visualizations directly depict the placement masses on the tree. When 210 visualizing the accumulated masses of multiple samples at once, it is important to chose 211 the appropriate normalization strategy for the task at hand. For example, if samples 212 represent different locations, one might prefer to use normalized masses, as comparing 213 relative abundances is common for this type of data. On the other hand, if samples 214 from the same location are combined (e.g., from different points in time, or different size 215 fractions), it might be preferable to use absolute abundances instead, so that the total 216 number of sequences per sample can be visualized.

217
The visualizations provide an overview of the species abundances over the tree. They 218 6/36 can be regarded as a more detailed version of classic abundance pie charts. When 219 placing OTUs, or ignoring sequence abundances, the resulting visualizations depict 220 species diversity. Moreover, these visualizations can be used to assess the quality of the 221 RT. For example, placements into inner branches of the RT may indicate that 222 appropriate reference sequences (i) have not been included or (ii) are simply not yet 223 available.

224
Here, we introduce visualization methods that highlight (i) regions of the tree with a 225 high variance in their placement distribution (called Edge Dispersion), and (ii) regions 226 with a high correlation to meta-data features (called Edge Correlation). can be mapped back to the tree, and visualized, for instance, via color coding. This 232 allows to examine which edges exhibit a high heterogeneity of placement masses across 233 samples, and indicates which edges discriminate samples. As edge mass values can span 234 many orders of magnitude, it might be necessary to scale the variance logarithmically. 235 Often, one is more interested in the branches with high placement mass. In these cases, 236 using the standard deviation or variance is appropriate, as they also indicate the mean 237 mass per edge. On the other hand, by calculating the per-edge Index of Dispersion [51], 238 that is, the variance-mean-ratio σ 2 /µ, differences on edges with little mass also become 239 visible. As Edge Dispersion relates placement masses from different samples to each 240 other, the choice of the normalization strategy is important. When using normalized 241 masses, the magnitude of dispersion values needs to be cautiously interpreted [46]. The 242 Edge Dispersion can also be calculated for edge imbalances. As edge imbalances are 243 usually normalized to [−1.0, 1.0], their dispersion can be visualized directly without any 244 further normalization steps. An example for an Edge Dispersion visualization is shown 245 in Figure 3(a), and discussed in Section Visualization.

247
In addition to the per-edge masses, the Edge Correlation further takes a specific 248 meta-data feature into account, that is, a column of the meta-data matrix. The Edge 249 Correlation is calculated as the correlation between each edge column and the feature 250 column, for example by using the Pearson Correlation Coefficient or Spearman's Rank 251 Correlation Coefficient [51]. This yields a per-edge correlation of the placement masses 252 or imbalances with the meta-data feature, and can again be visualized via color coding 253 of the edges. It is inexpensive to calculate and hence scales well to large datasets. As shows an example of this method. In contrast to Edge PCA [23] that can use meta-data 258 features to annotate samples in its scatter plots, our Edge Correlation method directly 259 represents the influence of a feature on the branches or clades of the tree. It can thus, 260 for example, help to identify and visualize dependencies between species abundances 261 and environmental factors such as temperature or nutrient levels. Again, the choice of 262 normalization strategy is important to draw meaningful conclusions. However, the 263 correlation is not calculated between samples or sequence abundances. Hence, even 264 when using normalized samples, the pitfalls regarding correlations of compositional 265 data [46] do not apply here. 266

7/36
Clustering 267 Given a set of metagenomic samples, one key question is how much they differ from 268 each other. A common distance metric between microbial communities is the (weighted) 269 UniFrac distance [43,44]. It uses the fraction of unique and shared branch lengths 270 between phylogenetic trees to determine their difference. UniFrac has been generalized 271 and adapted to phylogenetic placements in form of the phylogenetic 272 Kantorovich-Rubinstein (KR) distance [23,42]. In other contexts, the KR distance is 273 also called Wasserstein distance, Mallows distance, or Earth Mover's distance [52][53][54][55].

274
The KR distance between two metagenomic samples is a metric that describes by at 275 least how much the normalized mass distribution of one sample has to be moved across 276 the RT to obtain the distribution of the other sample. The distance is symmetrical, and 277 becomes larger the more mass needs to be moved, and the larger the respective 278 displacement (moving distance) is. As the two samples being compared need to have 279 equal masses, the KR distance operates on normalized samples; that is, it compares 280 relative abundances.

281
Given such a distance measure between samples, a fundamental task consists in 282 clustering samples that are similar to each other. Standard linkage-based clustering 283 methods like UPGMA [56][57][58] are solely based on the distances between samples, that 284 is, they calculate the distances of clusters as a function of pairwise sample distances.

285
Squash Clustering [14,23] is a method that also takes into account the intrinsic 286 structure of phylogenetic placement data. It uses the KR distance to perform 287 agglomerative hierarchical clustering of samples. Instead of using pairwise sample 288 distances, however, it merges (squashes) clusters of samples by calculating their average 289 per-edge placement mass. Thus, in each step, it operates on the same type of data, that 290 is, on mass distributions on the RT. This results in a hierarchical clustering tree that 291 has meaningful, and hence interpretable, branch lengths.

292
Phylogenetic k-means 293 The number of tips in the resulting clustering tree obtained through Squash Clustering 294 is equal to the number n of samples that are being clustered. Thus, for datasets with 295 more than a few hundred samples, the clustering result becomes hard to inspect and 296 interpret visually. We propose a variant of k-means clustering [59] to address this 297 problem, which we call Phylogenetic k-means. It uses a similar approach as Squash 298 Clustering, but yields a predefined number of k clusters. It is hence able to work with 299 arbitrarily large datasets. Note that we are clustering samples here, instead of 300 sequences [60]. We discuss choosing a reasonable value for k later.

301
The underlying idea is to assign each of the n samples to one of k cluster centroids, 302 where each centroid represents the average mass distribution of all samples assigned to 303 it. Note that all samples and centroids are of the same data type, namely, they are mass 304 distributions on a fixed RT. It is thus possible to calculate distances between samples 305 and centroids, and to calculate their average mass distributions, as described earlier.

306
Our implementation follows Lloyd's algorithm [61], as shown in Algorithm 1. update Centroids as mass averages of their Samples

5: return Assignments and Centroids
By default, we use the k-means++ initialization algorithm [62] to obtain a set of k 308 initial centroids. It works by subsequent random selection of samples to be used as 309 initial centroids, until k centroids have been selected. In each step, the probability of 310 selecting a sample is proportional to its squared distance to the nearest already selected 311 sample. An alternative initialization is to select samples as initial clusters entirely at 312 random. This is however more likely to yield sub-optimal clusterings [63]. 313 Then, each sample is assigned to its nearest centroid, using the KR distance. Lastly, 314 the centroids are updated to represent the average mass distribution of all samples that 315 are currently assigned to them. This iterative process alternates between improving the 316 assignments and the centroids. Thus, the main difference to normal k-means is the use 317 of phylogenetic information: Instead of euclidean distances on vectors, we use the KR 318 distance, and instead of averaging vectors to obtain centroids, we use the average mass 319 distribution.

320
The process is repeated until it converges, that is, the cluster assignments do not 321 change any more, or until a maximum number of iterations have been executed. The 322 second stopping criterion is added to avoid the super-polynomial worst case running 323 time of k-means, which however almost never occurs in practice [64,65].

324
The result of the algorithm is an assignment of each sample to one of the k clusters. 325 As the algorithm relies on the KR distance, it clusters samples with similar relative 326 abundances. The cluster centroids can be visualized as trees with a mass distribution, 327 analogous to how Squash Clustering visualizes inner nodes of the clustering tree. That 328 is, each centroid can be represented as the average mass distribution of the samples that 329 were assigned to it. This allows to inspect the centroids and thus to interpret how the 330 samples were clustered. Examples of this are shown in Supplementary Figure 6.

331
The key question is how to select an appropriate k that reflects the number of 332 "natural" clusters in the data. There exist various suggestions in the literature [66][67][68][69][70][71]; 333 we assessed the Elbow method [66] as explained in Supplementary Figure 8, which is a 334 straight forward method that yielded reasonable results for our test datasets.

335
Additionally, for a quantitative evaluation of the clusterings, we used the k that arose 336 from the number of distinct labels based on the available meta-data for the data. For  In each assignment step of the algorithm, distances from all samples to all centroids are 341 calculated, which has a time complexity of O(n · k). In order to accelerate this step, we 342 can apply branch binning as introduced in Section Edge Masses. For the BV dataset, 343 we found that even using just 2 bins per edge does not alter the cluster assignments.

344
Branch binning reduces the number of mass points that have to be accessed in memory 345 during KR distance calculations; however, the costs for tree traversals remain. Thus, we 346 observed a maximal speedup of 75% when using one bin per branch, see Supplementary 347 Table 1 for details.

348
Furthermore, during the execution of the algorithm, empty clusters can occur, for 349 example, if k is greater than the number of natural clusters in the data. Although this 350 problem did not occur in our tests, we implemented the following solution: First, find 351 the cluster with the highest variance. Then, choose the sample of that cluster that is 352 furthest from its centroid, and assign it to the empty cluster instead. This process is 353 repeated if multiple empty clusters occur at once. 354

9/36
Imbalance k-means 355 We further propose Imbalance k-means, which is a variant of k-means that makes use of 356 the edge imbalance transformation, and thus also takes the clades of the tree into 357 account. In order to quantify the difference in imbalances between two samples, we use 358 the euclidean distance between their imbalance vectors (that is, rows of the imbalance 359 matrix). This is a suitable distance measure, as the imbalances implicitly capture the 360 tree topology as well as the placement mass distributions. As a consequence, the 361 expensive tree traversals required for Phylogenetic k-means are not necessary here. The 362 algorithm takes the edge imbalance matrix of normalized samples as input, as shown in 363 Figure 2(b), and performs a standard euclidean k-means clustering following Lloyd's 364 algorithm.

365
This variant of k-means tends to find clusters that are consistent with the results of 366 Edge PCA, as both use the same input data as well as the same distance measure.

367
Furthermore, as the method does not need to calculate KR distances, and thus does not 368 involve tree traversals, it is several orders of magnitude faster than the Phylogenetic 369 k-means. For example, on the HMP dataset, it runs in mere seconds, instead of several 370 hours needed for Phylogenetic k-means; see Section Performance for details.

372
We used three real world datasets to evaluate our methods:

373
• Bacterial Vaginosis (BV) [14]. This small dataset was already analyzed with 374 phylogenetic placement in the original publication. We used it as an example of 375 an established study to compare our results to. It has 220 samples with a total of 376 15 060 unique sequences.

377
• Tara Oceans (TO) [7,24,25]. This world-wide sequencing effort of the open oceans 378 provides a rich set of meta-data, such as geographic location, temperature, and 379 salinity. Unfortunately, the sample analysis for creating the official data repository 380 is still ongoing. We thus were only able to use 370 samples with 27 697 007 unique 381 sequences.

382
• Human Microbiome Project (HMP) [12,13]. This large data repository intends to 383 characterize the human microbiota. It contains 9192 samples from different body 384 sites with a total of 63 221 538 unique sequences. There is additional meta-data 385 such as age and medical history, which is available upon special request. We only 386 used the publicly available meta-data.  These datasets represent a wide range of environments, number of samples, and 394 sequence lengths. We use them to evaluate our methods and to exemplify which method 395 is applicable to what kind of data. To this end, the sequences of the datasets were 396 placed on appropriate phylogenetic RTs as explained in the supplement, in order to 397 obtain phylogenetic placements that our methods can be applied to. In the following, 398 we present the respective results, and also compare our methods to other methods 399 where applicable. As the amount and type of available meta-data differs for each 400 10/36 dataset, we could not apply all methods to all datasets. Lastly, we also report the 401 run-time performance of our methods on these data.

402
Visualization 403 BV Dataset 404 We re-analyzed the BV dataset by inferring a tree from the original reference sequence 405 set and conducting phylogenetic placement of the 220 samples. The characteristics of 406 this dataset were already explored in [14] and [23]. We use it here to give exemplary 407 interpretations of our Edge Dispersion and Edge Correlation methods, and to evaluate 408 them in comparison to existing methods. Vaginosis, ranging from 0 (healthy) to 10 (severe illness). The connection between the 413 Nugent score and the abundance of placements on particular edges was already explored 414 in [23], but only visualized indirectly (i.e., not on the RT itself). For example, Figure 6 415 of the original study plots the first two Edge PCA components colorized by the Nugent 416 score. We recalculated this figure for comparison in Supplementary Figure 5(i). In 417 contrast, our Edge Correlation measure directly reveals the connection between Nugent 418 score and placements on the reference tree: The clade on the left hand side of the tree, 419 to which the red and orange branches lead to, are Lactobacillus iners and Lactobacillus 420 crispatus, respectively, which were identified in [14] to be associated with a healthy 421 vaginal microbiome. Thus, their presence in a sample is anti-correlated with the Nugent 422 score, which is lower for healthy subjects. The branches leading to this clade are hence 423 colored in red. On the other hand, there are several other clades that exhibit a positive 424 correlation with the Nugent score, that is, were green and blue paths lead to in the 425 Figure, again a finding already reported in [14].

11/36
Both trees in Figure 3 highlight the same parts of the tree: The dark branches with 427 high deviation in (a) represent clades attached to either highly correlated (blue) or 428 anti-correlated (red) paths (b). This indicates that edges that have a high dispersion 429 also vary between samples of different Nugent score. 430 We further compared our methods to the visualization of Edge PCA components on 431 the reference tree. To this end, we recalculated  Figure 3(b), but does not distinguish further between its 438 sub-clades. This is because a high Nugent score is associated with a high abundance of 439 placements in either of the two relevant Lactobacillus clades.  Supplementary Figures 1 and 2. We also conducted Edge 442 Correlation using Amsel's criteria [73] and the vaginal pH value (data not shown), both 443 of which were used in [14] as additional indicators of Bacterial Vaginosis. We again 444 found similar correlations compared to the Nugent score. 446 We analyzed the TO dataset to provide further exemplary use cases for our 447 visualization methods. To this end, we used the unconstrained Eukaryota RT with 2059 448 taxa as provided by our Automatic Reference Tree method [26]. The meta-data features 449 of this dataset that best lend themselves to our methods are the sensor values for 450 chlorophyll, nitrate, and oxygen concentration, as well as the salinity and temperature 451 of the water samples. Other available meta-data features such as longitude and latitude 452 are available, but would require more involved methods. This is because geographical 453 coordinates yield pairwise distances between samples, whose integration into our  We selected the diatoms and the animals as two exemplary clades for closer 458 examination of the results. In particular, the diatoms show a high correlation with the 459 nitrate concentration, as well as an anti-correlation with salinity, which represent 460 well-known relationships [74,75]. See Supplementary Figure 4 for details. These findings 461 indicate that the method is able to identify known relationships. It will therefore also 462 be useful to investigate or discover insights of novel relationships between sequence 463 abundances and environmental parameters. implemented as a command in the guppy program [17]. For the BV dataset with 220 470 samples, guppy required 9 min and used 2.2 GB of memory, while our implementation 471 only required 33 s on a single core, using less than 600 MB of main memory. For the 472 HMP dataset, as it is only single-threaded, guppy took 11 days and 75.1 GB memory, 473 while our implementation needed 7.5 min on 16 cores and used 43.5 GB memory. 474

12/36
Clustering 475 We now evaluate our Phylogenetic k-means clustering (which uses edge masses and KR 476 distances) and Imbalance k-means clustering (which uses edge imbalances and euclidean 477 distances) methods in terms of their clustering accuracy. We used the BV as an 478 example of a small dataset to which methods such as Squash Clustering [23] are still 479 applicable, and the HMP dataset to showcase that our methods scale to datasets that 480 are too large for existing methods.

482
We again use the re-analyzed BV dataset to test whether our methods work as expected, 483 by comparing them to the existing analysis of the data in [14] and [23]. To this end, we 484 ran both Phylogenetic k-means and Imbalance k-means on the BV dataset. We chose 485 k := 3, inspired by the findings of [14]. They distinguish between subjects affected by Clustering, Edge PCA, and two alternative dimensionality reduction methods, colorized 495 by the cluster assignments PKM of Phylogenetic k-means (in red, green, and blue) and 496 IKM of the Imbalance k-means (in purple, orange, and gray), respectively. We use two 497 different color sets for the two methods, in order to make them distinguishable at first 498 glance. Note that the mapping of colors to clusters is arbitrary and depends on the 499 random initialization of the algorithm.

500
As can be seen in Figure 4 between each other, indicating the smaller KR distance between them, which is a result 504 of the dominance of Lactobacillus in healthy subjects. The same clusters are found by 505 Phylogenetic k-means: As it uses the KR distance, it assigns all healthy subjects with 506 short cluster tree branches to one cluster (shown in red). The green and blue clusters 507 are mostly the subjects affected by the disease.

508
The distinguishing features between the green and the blue cluster are not apparent 509 in the Squash cluster tree. This can however be seen in Figure 4(c), which shows a 510 Multidimensional scaling (MDS) plot of the pairwise KR distances between the samples. 511 MDS [51,77,78] is a dimensionality reduction method that can be used for visualizing 512 levels of similarity between data points. Given a pairwise distance matrix, it finds an 513 embedding into lower dimensions (in this case, 2 dimensions) that preserves higher 514 dimensional distances as well as possible. Here, the red cluster forms a dense region, 515 which is in agreement with its short branch lengths in the cluster tree. At the same 516 time, the green and blue cluster are separated in the MDS plot, but form a coherent 517 region of low density, indicating that k := 3 might be too large with Phylogenetic 518 k-means on this dataset. That is, the actual clustering just distinguishes healthy from 519 sick patients (Supplementary Figure 8).

520
A similar visualization of the pairwise KR distances is shown in Figure 4(d). It is a 521 recalculation of Figure 4 in the preprint [76], which did not appear in the final 522 published version [23]. The figure shows a standard Principal Components 523 Analysis (PCA) [51,78] applied to the distance matrix by interpreting it as a data We applied our variants of the k-means clustering method to the BV dataset in order to compare them to existing methods. See [14] for details of the dataset and its interpretation. We chose k := 3, as this best fits the features of the dataset. For each sample, we obtained two cluster assignments: First, by using Phylogenetic k-means, we obtained the cluster assignment PKM. Second, by using Imbalance k-means, we obtained assignment IKM. In each subfigure, the 220 samples are represented by colored circles: red, green, and blue show the cluster assignments PKM, while purple, orange, and gray show the cluster assignments IKM. (a) Hierarchical cluster tree of the samples, using Squash Clustering. The tree is a recalculation of Figure 1(A) of [14]. Each leaf represents a sample; branch lengths are KR distances. We added color coding for the samples, using PKM. The lower half of red samples are mostly healthy subjects, while the green and blue upper half are patients affected by Bacterial Vaginosis. (b) The same tree, but annotated by IKM. The tree is flipped horizontally for ease of comparison. The healthy subjects are split into two sub-classes, discriminated by the dominating species in their vaginal microbiome: orange and purple represent samples were Lactobacillus iners and Lactobacillus crispatus dominate the microbiome, respectively. The patients mostly affected by BV are clustered in gray. (c) Multidimensional scaling using the pairwise KR distance matrix of the samples, and colored by PKM. (d) Principal component analysis applied to the distance matrix by interpreting it as a data matrix. This is a recalculation of Figure 4 of [76], but colored by PKM. (e) Edge PCA applied to the samples, which is a recalculation of Figure 3 of [76], but colored by IKM.
matrix, and was previously used to motivate Edge PCA. However, although it is 525 mathematically sound, the direct application of PCA to a distance matrix lacks a 526 simple interpretation. Again, the red cluster is clearly separated from the rest, but this 527 time, the distinction between the green and the blue cluster is not as apparent.

528
In Figure 4(b), we compare Squash Clustering to Imbalance k-means. Here, the 529 14/36 distinction between the two Lactobacillus clades can be seen by the purple and orange 530 cluster assignments. The cluster tree also separates those clusters into clades. The 531 separate small group of orange samples above the purple clade is an artifact of the tree 532 ladderization. The diseased subjects are all assigned to the gray cluster, represented by 533 the upper half of the cluster tree. It is apparent that both methods separate the same 534 samples from each other.

535
Lastly, Figure 4(e) compares Imbalance k-means to Edge PCA. The plot is a 536 recalculation of Figure 3 of [76], which also appeared in Figure 6 in [23] and Figure 3 537 in [14], but colored using our cluster assignments. Because both methods work on edge 538 imbalances, they group the data in the same way, that is, they clearly separate the two 539 healthy groups and the diseased one from each other. Edge PCA forms a plot with 540 three corners, which are colored by the three Imbalance k-means cluster assignments.

547
The HMP dataset is used here as an example to show that our method scales to large 548 datasets. To this end, we used the unconstrained Bacteria RT with 1914 taxa as 549 provided by our Automatic Reference Tree method [26]. The tree represents a broad 550 taxonomic range of Bacteria, that is, the sequences were not explicitly selected for the 551 HMP dataset, in order to test the robustness of our clustering methods. We then placed 552 the 9192 samples of the HMP dataset with a total of 118 701 818 sequences on that tree, 553 and calculated Phylogenetic and Imbalance k-means on the samples. The freely  Ideally, all samples from one body site would be assigned to the same cluster, hence 561 forming a diagonal on the plot. However, as there are several nearby body sites, which 562 share a large fraction of their microbiome [12], we do not expect a perfect clustering.

563
Furthermore, we used a broad reference tree that might not be able to resolve details in 564 some clades. Nonetheless, the clustering is reasonable, which indicates a robustness The complexity of Phylogenetic k-means is in O(k · i · n · e), with k clusters, i iterations, 571 and n samples, and e being the number of tree edges, which corresponds to the number 572 of dimensions in standard euclidean k-means. As the centroids are randomly initialized, 573 the number of iterations can vary; in our tests, it was mostly below 100. For the BV   Fig 5. k-means cluster assignments of the HMP dataset with k := 18. Here, we show the cluster assignments as yielded by Phylogenetic k-means (a) and Imbalance k-means (b) of the HMP dataset. We used k := 18, which is the number of body site labels in the dataset, in order to compare the clusterings to this "ground truth". Each row represents a body site; each column one of the 18 clusters found by the algorithm.
The color values indicate how many samples of a body site were assigned to each cluster. Similar body sites are clearly grouped together in coherent blocks, indicated by darker colors. For example, the stool samples were split into two clusters (topmost row), while the three vaginal sites were all put into one cluster (rightmost column). However, the algorithm cannot always distinguish between nearby sites, as can be seen from the fuzziness of the clusters of oral samples. This might be caused by our broad reference tree, and could potentially be resolved by using a tree more specialized for the data/region (not tested). Lastly, the figure also lists how the body site labels were aggregated into regions as used in Supplementary Figure 7.
Although the plots of the two k-means variants generally exhibit similar characteristics, there are some differences. For example, the samples from the body surface (ear, nose, arm) form two relatively dense clusters (columns) in (a), whereas those sites are spread across four of five clusters in (b). On the other hand, the mouth samples are more densely clustered in (b).

16/36
In contrast to this, Imbalance k-means does not need to conduct any expensive tree 579 traversals, and instead operates on compact vectors, using euclidean distances. It is 580 hence several orders of magnitude faster than Phylogenetic k-means. For example, using 581 again k := 18 for the HMP dataset, the algorithm executed 75 iterations in 2 s. It is 582 thus also applicable to extremely large datasets. 583 Furthermore, as the KR distance is used in Phylogenetic k-means as well as other 584 methods such as Squash Clustering, our implementation is highly optimized and 585 outperforms the existing implementation in guppy [17] by orders of magnitude (see 586 below for details). The KR distance between two samples has a linear computational 587 complexity in both the number of QSs and the tree size. As a test case, we computed 588 the pairwise distance matrix between sets of samples. Calculating this matrix is 589 quadratic in the number of samples, and is thus expensive for large datasets. For 590 example, in order to calculate the matrix for the BV dataset with 220 samples, guppy 591 can only use a single core and required 86 min. Our KR distance implementation in 592 genesis is faster and also supports multiple cores. It only needed 90 s on a single core; 593 almost half of this time is used for reading input files. When using 32 cores, the matrix 594 calculation itself only took 8 s. This allows to process larger datasets: The distance 595 matrix of the HMP dataset with 9192 samples placed on a tree with 3824 branches for 596 instance took less than 10 h to calculate using 16 cores in genesis. In contrast, guppy 597 needed 43 days for this dataset. Lastly, branch binning can be used to achieve 598 additional speedups, as shown in Supplementary Table 1.  Furthermore, we presented adapted variants of the k-means method, which exploit the 610 structure of phylogenetic placement data to identify clusters of environmental samples. 611 The method builds upon ideas such as Squash Clustering and can be applied to 612 substantially larger datasets, as it uses a pre-defined number of clusters. For future 613 exploration, other forms of cluster analyses could be extended to work on phylogenetic 614 placement data, for example, soft k-means clustering [79,80] or density-based 615 methods [81]. The main challenge when adopting such methods consists in making them 616 phylogeny-aware, that is, to use mass distributions on trees instead of the typical R n 617 vectors.

618
The presented methods take either the edge masses or the edge imbalances as input, 619 and can hence analyze different aspects of the placements. While edge masses reveal 620 information about single branches, edge imbalances take entire reference tree clades into 621 account. Depending on the task at hand, either of them might be preferable, although 622 they generally exhibit similar properties. We emphasize again the importance of 623 appropriately normalizing the sample sizes as required. That is, depending on the type 624 of sequence data, using either absolute or relative abundances is critical to allow for 625 meaningful interpretation of the results. 626 We tested our novel methods on three real-world datasets and gave exemplary 627 interpretations of the results. We further showed that these results are consistent with 628 17/36 existing methods as well as empirical biological knowledge. Hence, our methods will also 629 be useful to unravel new, unexplored relationships in metagenomic data.

630
The methods are computationally inexpensive, and are thus, as we have 631 demonstrated, applicable to large datasets. They are implemented in our tool gappa, 632 which is freely available under GPLv3 at http://github.com/lczech/gappa. 633 Furthermore, scripts, data and other tools used for the tests and figures presented here 634 are available at http://github.com/lczech/placement-methods-paper. The analyses and figures presented here were conducted on distinct reference alignments 645 and trees. Firstly, for the BV dataset, we used the original set of reference sequences, 646 and re-inferred a tree on them. Secondly, for the TO and HMP datasets, we used our 647 Automatic Reference Tree (ART) method [26] to construct sets of suitable reference 648 sequences from the Silva database [82,83]. We used the 90% threshold consensus 649 sequences; see [26] for details.

650
For all analyses, we used the following software setup: Unconstrained maximum 651 likelihood trees were inferred using RAxML v8.2.8 [49]. For aligning reads against 652 reference alignments and reference trees, we used a custom MPI wrapper for PaPaRa 653 2.0 [27,28], which is available at [84]. We then applied the chunkify procedure as 654 explained in [26] to split the sequences into chunks of unique sequences prior to 655 conducting the phylogenetic placement, in order to minimize processing time.

656
Phylogenetic placement was conducted using EPA-ng [19], which is a faster and more 657 scalable phylogenetic placement implementation than RAxML-EPA [18] and 658 pplacer [17]. Lastly, given the per-chunk placement files produced by EPA-ng, we 659 executed the unchunkify procedure of [26] to obtain per-sample placement files. These 660 subsequently served as the input data for the methods presented here.

662
We used the Bacterial Vaginosis dataset [14] in order to compare our novel methods to 663 existing ones such as Edge PCA and Squash Clustering [23,42]. The dataset contains 664 metabarcoding sequences of the vaginal microbiome of 220 women, and was kindly 665 provided by Sujatha Srinivasan. This small dataset with a total of 426 612 query 666 sequences, thereof 15 060 unique, was already analyzed with phylogenetic placement 667 methods in the original publication. We re-inferred the reference tree of the original 668 publication using the original alignment, which contains 797 reference sequences 669 specifically selected to represent the vaginal microbiome. As the query sequences were 670 already prepared, no further preprocessing was applied prior to phylogenetic placement. 671 The available per-sample quantitative meta-data for this dataset comprises the Nugent 672 score [72], the value of Amsel's criteria [73], and the vaginal pH value. We used all three 673 meta-data types in our analyses.

677
At the time of download, there were 370 samples available with a total of 49 023 231 678 sequences. As the available data are raw fastq files, we followed [85] to generate 679 cleaned per-sample fasta files. For this, we used our tool PEAR [86] to merge the 680 paired-end reads; cutadapt [87] for trimming tags as well as forward and reverse 681 primers; and vsearch [41] for filtering erroneous sequences and generating per-sample 682 fasta files. We filtered out sequences below 95 bps and above 150 bps, to remove 683 potentially erroneous sequences. No further preprocessing (such as chimera detection) 684 was applied. This resulted in a total of 48 036 019 sequences, thereof 27 697 007 unique. 685 The sequences were then used for phylogenetic placement as explained above. We 686 placed the sequences on the unconstrained Eukaryota reference tree obtained via our 687 ART method [26], which comprises 2059 taxa, thereof 1795 eukaryotic sequences. The 688 remaining 264 taxa are Archaea and Bacteria, and were included as a broad outgroup. 689

19/36
The TO dataset has a rich variety of per-sample meta-data features; in the context of 690 this paper, we mainly focus on quantitative features such as temperature, salinity, as 691 well as oxygen, nitrate and chlorophyll content of the water. Furthermore, each sample 692 has meta-data features indicating the date, longitude and latitude, depth, etc. of the 693 sampling location. This data might be interesting for further correlation analyses based 694 on geographical information. We did not use them here, as for example longitude and 695 latitude would require a more involved method that also accounts for, e.g., ocean 696 currents. Furthermore, geographical coordinates yield pairwise distances between 697 samples, which are not readily usable with our correlation analysis. Lastly, in order to 698 use features such as the date, that is, in order to analyze samples over time, the same 699 sampling locations would need to be visited at different times during the year, which is 700 not the case for the Tara Oceans expedition.

702
We used the Human Microbiome Project (HMP) dataset [12,13] for testing the 703 scalability of our methods. In particular, we used the "HM16STR" data of the initial 704 phase "HMP1", which are available from http://www.hmpdacc.org/hmp/. The dataset 705 consists of trimmed 16S rRNA sequences of the V1V3, V3V5, and V6V9 regions. The data 706 are further divided into a "by_sample" set and a "healthy" set, which we merged in 707 order to obtain one large dataset, with a total of 9811 samples. We then removed 10 708 samples that were larger than 70 MB as well as 605 samples that had fewer than 1500 709 sequences, because we considered them as defective or unreliable outliers. Finally, we The methods described here are implemented in our tool gappa, which is freely 723 available under GPLv3 at http://github.com/lczech/gappa. gappa internally uses 724 our C++ library genesis, which offers functionality for working with phylogenies and 725 phylogenetic placement data, and also contains methods to work with taxonomies, 726 sequences and many other data types. genesis is also freely available under GPLv3 at 727 http://github.com/lczech/genesis. • phylogenetic-kmeans and imbalance-kmeans: Performs k-means clustering of a 737 set of jplace files according to our methods.

739
These are the gappa commands that are relevant for this paper. The tool also offers 740 additional commands that are useful for phylogenetic placement data, such as 741 visualization or filtering. At the time of writing this manuscript, gappa is under active 742 development, with more functions planned in the near future. Lastly, we provide 743 prototype implementations, scripts, data, and other tools used for the tests and figures 744 in this paper at http://github.com/lczech/placement-methods-paper. 745 21/36 S1 Table. Effect of Branch Binning on the KR Distance of the HMP Dataset. Here we show the effect of per-branch placement binning on the run-time and on the resulting relative error when calculating the pairwise KR distance matrix between samples, by example of the Human Microbiome Project (HMP) [12,13] dataset. Because of the size of the dataset (9192 samples) and reference tree (1914 taxa), we executed this evaluation in parallel on 16 cores. The first row shows the baseline performance, that is, without binning. When using fewer bins per branch, the run-time decreases, at the cost of slightly increasing the average relative error. Still, even when compressing the placement masses into only one bin per branch (that is, just using per-branch masses), the average relative error of the KR distances is around 1%, which is acceptable for most applications. However, considering that the run-time savings are not substantially better for a low number of bins, we recommend using a relatively large number of bins, e.g., 32 or more. This is because run-times of KR distance calculations also depend on other effects such as the necessary repeated tree traversals. We also conducted these tests on the BV dataset (data not shown), were the relative error is even smaller.  We re-analyzed the BV dataset to show variants of our Edge Dispersion method. All subfigures highlight the same branches and clades as found by other methods such as Edge PCA. The method is useful as a first exploratory tool to detect placement heterogeneity across samples. In contrast to Edge Correlation, it can however not explain the reasons of heterogeneity. Subfigure (a) shows the standard deviation of the absolute edge masses, without any further processing. It is striking that one outlier, marked with an arrow, is dominating, thus hiding the values on less variable edges. This outlier occurs at the species Prevotella bivia in one of the 220 samples, where 2781 out of 2782 sequences in the sample have placement mass on that branch. Upon close examination, this outlier can also be seen in Figure 1D of [14], but is less apparent there. Subfigure (b) is identical to Figure 3(a) of the main text and shows the standard deviation again, but this time using logarithmic scaling, thus revealing more details on the edges with lower placement mass variance. Furthermore, when comparing it to Supplementary Figure 2(c), we see that the same clades that exhibit a high correlation or anti-correlation with meta-data there are also highlighted here. Subfigure (c) shows the Index of Dispersion of the edge masses, that is, the variance normalized by the mean. Hence, edges with a higher number of placements are also allowed to have a higher variance. We again use a logarithmic scale because of the outlier. The figure reveals more details on the edges with lower variance, highlighted in medium green colors. Subfigure (d) shows the standard deviation of edge imbalances. Because we used imbalances of unit mass samples, the values are already normalized. The path to the Lactobacillus clade is again clearly visible, indicating that the placement mass in this clade has a high variance across samples. Note that imbalances can be negative; thus, the Index of Dispersion is not applicable to them.  Figure 3(b) of the main text. All subfigures show red edges or red paths at the Lactobacillus clade. This indicates that presence of placements in this clade is anti-correlated with the Nugent score, which is consistent with the findings of [14] and [23]. In other words, presence of Lactobacillus correlates with a healthy vaginal microbiome. On the other hand, blue and green edges, which indicate positive correlation, are indicative of edges that correlate to Bacterial Vaginosis. The extent of correlation is larger for Spearman's Coefficient, indicating that the correlation is monotonic, but not strictly linear. Diatoms are mainly photosynthetic, and thus depend on nitrates as key nutrients, which is clearly visible by the high correlation of the clade in (a). Furthermore, the diatoms exhibit positive correlation with the chlorophyll concentration (c), which again is indicative of their photosynthetic behavior. On the other hand, they show a high anti-correlation with the salt content (b). Salinity is a strong environmental factor which heavily affects community structures and species abundances [74], particularly diatoms [75]. The correlations of the animal clade are less pronounced. They exhibit a negative correlation with nitrate (a), as well as an increase in absolute abundance with higher temperatures (d). While these findings are not surprising, they show that the method is able to find meaningful relationships in the data. Here, we show and compare the dimensionality reduction methods MDS, PCA, and Edge PCA (one per row). MDS and PCA were calculated on the pairwise KR distance matrix of the BV dataset, Edge PCA was calculated using the placements on the re-inferred RT of the original publication [14]. The plots are colored by the cluster assignments as found by our k-means variants (first two columns), and by the Nugent score of the samples (last column). The Nugent score is included to allow comparison of the health status of patients with the clustering results. , respectively. This figure reveals additional details about how the k-means method works, that is, which samples are assigned to the same cluster. For example, the purple cluster found by Imbalance k-means forms a dense cluster of close-by samples on the left in (b) and (e), which is in accordance with the short branch lengths of this cluster as shown in Figure 4(b) of the main text.

Red Cluster
Phylogenetic k-means

Green Cluster Blue Cluster
Purple Cluster Orange Cluster Gray Cluster Imbalance k-means S6 Figure. Example of k-means cluster centroids visualization. Here we show the cluster centroids as found by our k-means variants using the BV dataset, visualized on the reference tree via color coding. The cluster assignments are the same as in Figure 4 of the main text; the first row show the three clusters found by Phylogenetic k-means, the second row the clusters found by Imbalance k-means. Each tree represents one centroid around which the samples were clustered, that is, it shows the combined masses of the samples that were assigned to that cluster. The edges are colored relative to each other, using a linear scaling of light blue (no mass), purple (half of the maximal mass) and black (maximal mass). As explained in the main text, the samples can be split into three groups: The diseased subjects, which have placement mass in various parts of the tree, as well as two groups of healthy subjects, with placement mass in one of two Lactobacillus clades (marked with black arcs on the left of the trees). This grouping is also clearly visible in these trees. The red cluster for example represents all healthy subjects, and thus most of its mass is located in the two Lactobacillus clades. The purple and orange clusters on the other hand show a difference in placement mass between those clades. Furthermore, the placement mass of the gray cluster is mostly a combination of the masses of the green and blue cluster, all of which represent diseased subjects. These observations are in accordance with previous findings as explained in the main text.  Figure 4 for the cluster assignment where k is set to the original number of labels; there, we also list how the labels were aggregated. Each row represents a body site; each column one of the 8 clusters. The color values indicate how many samples of a body site were assigned to each cluster. Some of the body sites can be clearly separated, while particularly the samples from the oral region are distributed over different clusters. This might be due to homogeneity of the oral samples. show the cluster variance, that is, the average squared distance of the samples to their assigned cluster centroids, for different values of k. The first row are clusterings of the BV dataset, the second row of the HMP dataset. They were clustered using Phylogenetic k-means (first column), and Imbalance k-means (second column), respectively. Accordingly, (a) and (c) use the KR distance, while (b) and (d) use the euclidean distance to measure the variance. These plots can be used for the Elbow method in order to find the appropriate number of clusters in a dataset [66]. Low values of k induce a high variance, because many samples exhibit a large distance from their assigned centroid. On the other hand, at a given point, higher values of k only yield a marginal gain by further splitting clusters. Thus, if the data has a natural number of clusters, the corresponding k produces an angle in the plot, called the "elbow". For example, (a) and (b) exhibit the elbow at k := 2 and 3, respectively, which are marked with orange circles. These values are consistent with previous findings, for instance, Figure 4: There, Phylogenetic k-means splits the samples into a distinct red cluster and the nearby green and blue clusters, while Imbalance k-means yields three separate clusters in purple, orange, and gray. For the HMP dataset, the elbow is less pronounced. We suspect that this is due to the broad reference tree not being able to adequately resolve fine-grained differences between samples, see Supplementary Section 1 for details. Likely candidates for k are 4 − 6 for (c) and around 7 for (d). These values are consistent with the number of coherent "blocks" of clusters, which can be observed in Figure 5. Clearer results for this dataset might be obtained with other methods for finding "good" values for k, although we did not test them here.