Scalable methods for analyzing and visualizing phylogenetic placement of metagenomic samples

doi:10.1371/journal.pone.0217050

Fig 1.

Phylogenetic placement of a single query sequence.

Each branch of the reference tree is tested as a potential insertion position, called a “placement” (blue dots). Note that placements have a specific position on their branch, due to the branch length optimization process. A probability of how likely it is that the sequence belongs to a specific branch is computed (numbers next to dots), which is called the likelihood weight ratio (LWR). The bold number (0.7) denotes the most probable placement of the sequence.
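As a minimal sketch of how such per-branch probabilities can be obtained, the following assumes that each candidate branch has a log-likelihood for the query sequence and that the LWR is that branch's likelihood divided by the sum over all branches. The function name and the input values are hypothetical and are not taken from the figure.

```python
import math

def likelihood_weight_ratios(log_likelihoods):
    """Normalize per-branch log-likelihoods into LWRs that sum to 1."""
    # Subtract the maximum before exponentiating for numerical stability.
    max_ll = max(log_likelihoods)
    weights = [math.exp(ll - max_ll) for ll in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log-likelihoods for three candidate branches.
lwrs = likelihood_weight_ratios([-1000.0, -1001.0, -1003.0])

# Index of the most probable placement (largest LWR).
best_branch = max(range(len(lwrs)), key=lwrs.__getitem__)
```

The branch with the largest LWR corresponds to the bold number in the figure.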

Fig 2.

Operations on placement masses.

(a) The edge mass or mass of an edge is the sum of likelihood weight ratios (LWRs) on that edge for all query sequences (QSs) in a sample. Here, three placements from three QSs on the branch are summarized. (b) In order to reduce time and memory of the computations, masses can be binned by summarizing them across QSs in intervals along the edges. (c) The masses on corresponding edges of the RT of two or more samples can be squashed to represent the average mass distribution of the samples. For simplicity, we here use equal weights, and show edge masses instead of individual LWRs. (d) The edge imbalance of an edge is the sum of masses on all edges on the root side of the edge (A+B, with the root in subtree A denoted as a gray dot) minus the sum of the masses on the edges on the non-root side (C+D), while ignoring the mass on the edge itself.
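Panels (a) and (d) can be sketched as follows. This is an illustration under assumptions: placements are given as hypothetical `(edge_index, lwr)` pairs, and the rooted tree is encoded as a hypothetical `children` map from an edge to the edges directly below it on the non-root side.

```python
def edge_masses(placements, num_edges):
    """Sum LWRs per edge; placements is a list of (edge_index, lwr)."""
    masses = [0.0] * num_edges
    for edge, lwr in placements:
        masses[edge] += lwr
    return masses

def edge_imbalance(masses, children, edge):
    """Root-side mass minus non-root-side mass, ignoring the edge itself."""
    # Accumulate the mass of all edges below `edge` (the non-root side).
    stack = list(children.get(edge, []))
    distal = 0.0
    while stack:
        e = stack.pop()
        distal += masses[e]
        stack.extend(children.get(e, []))
    # Everything else, except the edge itself, lies on the root side.
    proximal = sum(masses) - masses[edge] - distal
    return proximal - distal

# Toy tree: edges 1 and 2 lie below edge 0; edges 3 and 4 are on the root side.
children = {0: [1, 2]}
placements = [(0, 0.5), (1, 0.25), (1, 0.25), (2, 0.5), (3, 1.0), (4, 0.5)]
masses = edge_masses(placements, num_edges=5)
imbalance_0 = edge_imbalance(masses, children, edge=0)
```

In the toy example, the root side of edge 0 carries a mass of 1.5 and the non-root side a mass of 1.0, so the imbalance is 0.5.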

Fig 3.

Edge masses and imbalances.

(a) Reference tree where each edge is annotated with the normalized mass (first value, blue) and imbalance (second value, red) of the placements in a sample. The depicted tree is unrooted, hence, its top-level trifurcation (gray dot) is used as “root” node. (b) The masses and imbalances for the edges of a sample constitute the rows of the first two matrices. The third matrix contains the available meta-data features for each sample. These matrices are used to calculate, for instance, the Edge PCA or correlation coefficients.

Fig 4.

Examples of Edge Dispersion and Edge Correlation.

We applied our novel visualization methods to the BV dataset to compare them to the existing examinations of the data. (a) Edge Dispersion, measured as the standard deviation of the edge masses across samples, logarithmically scaled. (b) Edge Correlation, in form of Spearman’s Rank Correlation Coefficient between the edge imbalances and the Nugent score, which is a clinical standard for the diagnosis of Bacterial Vaginosis. Tip edges are gray, because they do not have a meaningful imbalance. This example also shows the characteristics of edge masses and edge imbalances: The former highlights individual edges, the latter paths to clades.
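The two per-edge quantities can be sketched in plain Python. The sketch assumes a samples-by-edges matrix given as a list of rows, uses the population standard deviation (the caption does not say which variant is meant), and omits the logarithmic scaling used for display. All function names and values are hypothetical.

```python
import statistics

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; undefined if either input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def edge_dispersion(masses_per_sample):
    """Standard deviation per edge (column) across samples (rows)."""
    return [statistics.pstdev(col) for col in zip(*masses_per_sample)]

# Hypothetical imbalances of one edge across four samples, and a per-sample score.
rho = spearman([0.1, 0.4, 0.35, 0.8], [0, 3, 2, 9])
dispersion = edge_dispersion([[0.2, 0.8], [0.4, 0.6]])
```

Applying `spearman` to each inner edge's column of imbalances yields one correlation value per edge, which can then be mapped to a color on the tree.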

Fig 5.

Squashing of edge masses.

Two trees are merged (squashed) by calculating the weighted average of the respective mass distributions on their branches. By squashing, a cluster of (similar) samples can be summarized and visualized. For simplicity, we here show the masses per edge and visualize them as branch widths. In practice, however, each placement location of each query sequence is considered individually. The figure is based on the similar Figure 3/2 of Matsen et al. [29]; see there for more details on squashing.
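At the per-edge level shown in the figure, squashing can be sketched as a weighted average of two mass vectors. This simplification assumes both samples share the same edge indexing and ignores the individual placement positions along each branch. The function name and the weights are hypothetical.

```python
def squash(masses_a, masses_b, weight_a=1.0, weight_b=1.0):
    """Weighted average of two per-edge mass vectors on the same tree."""
    total = weight_a + weight_b
    return [
        (weight_a * a + weight_b * b) / total
        for a, b in zip(masses_a, masses_b)
    ]

# Two samples with equal weight.
merged = squash([0.5, 0.5, 0.0], [0.0, 0.5, 0.5])

# A cluster that already represents three samples, merged with a single sample.
merged_weighted = squash([1.0, 0.0], [0.0, 1.0], weight_a=3.0, weight_b=1.0)
```

Using the number of samples already contained in each cluster as its weight is one plausible choice; the caption itself does not specify the weighting.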

Fig 6.

Example computation of the balances between two subtrees.

The figure shows how the balance is computed for the two subtrees induced by the dashed edge of the tree, for one sample. Numbers next to edges are the accumulated placement mass of the sequences in the sample. We call the left-hand side of the tree R, and the right-hand side S, as seen from the dashed edge. For simplicity, we do not use weighting here; that is, we assume p = (1, …, 1). First, the geometric means for both subtrees are calculated, and then their balance. The balance is positive, indicating that subtree R contains more placement mass on (geometric) average.
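The two-step computation can be sketched as below. The sketch assumes the standard unweighted isometric log-ratio balance, that is, a normalization factor times the log-ratio of the two geometric means. Whether the normalization matches the figure exactly is an assumption, and the mass values are hypothetical. All masses must be strictly positive for the logarithm to be defined.

```python
import math

def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def balance(masses_r, masses_s):
    """Unweighted balance between the edge masses of subtrees R and S."""
    n_r, n_s = len(masses_r), len(masses_s)
    # Normalization term of the isometric log-ratio balance.
    norm = math.sqrt(n_r * n_s / (n_r + n_s))
    return norm * math.log(geometric_mean(masses_r) / geometric_mean(masses_s))

# Hypothetical masses: geometric means are 6 (R) and 2 (S).
b = balance([4.0, 9.0], [1.0, 4.0])
```

Here the result is positive because the geometric mean of R exceeds that of S; swapping the two arguments flips the sign.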

Fig 7.

Input data and first two iterations of Placement-Factorization.

The figure resembles Figure 2 of [31]. It shows the adaptation of concepts from Phylofactorization to phylogenetic placement data. (a) The input data is a set of samples with placement masses on each edge of the tree. The tree is colorized by the total mass across all samples, that is, by the row sums of the heat map. The heat map then shows the detailed mass per edge (rows) and per sample (columns). Note that the heat map also contains rows for each inner edge of the tree, as phylogenetic placement also considers these edges. We show an example of this visualization for empirical data in S1 Fig. (b) In the first iteration, the objective function for all inner edges is evaluated. Here, e₁ is the winning edge that maximizes the objective function, which separates (A, B, C, D) from the rest of the tree. (c) In the second iteration, only the contrasts within the two subtrees are calculated, but not across the winning edges of previous iterations (here, e₁). That is, the winning edge e₂ maximizes the objective function that contrasts (F, G) with (E, H, I), but does not consider the edges in the subtree (A, B, C, D). Note that in our adaptation, edges that lead to a tree tip are not considered as potential factors.

Fig 8.

Comparison of k-means clustering to Squash Clustering and Edge PCA.

We applied our variants of the k-means clustering method to the Bacterial Vaginosis (BV) dataset in order to compare them to existing methods. See [18] for details of the dataset and its interpretation. We chose k ≔ 3, as this best fits the features of the dataset. For each sample, we obtained two cluster assignments: First, by using Phylogenetic k-means, we obtained a cluster assignment, which we here abbreviate as PKM. Second, by using Imbalance k-means, we obtained an assignment here abbreviated as IKM. In each subfigure, the 220 samples are represented by colored circles: red, green, and blue show the cluster assignments PKM, while purple, orange, and gray show the cluster assignments IKM. (a) Hierarchical cluster tree of the samples, using Squash Clustering. The tree is a recalculation of Figure 1(A) of Srinivasan et al. [18]. Each leaf represents a sample; branch lengths are Kantorovich-Rubinstein (KR) distances between the samples. We added color coding for the samples, using PKM. The lower half of red samples are mostly healthy subjects, while the green and blue upper half are patients affected by Bacterial Vaginosis. (b) The same tree, but annotated by IKM. The tree is flipped horizontally for ease of comparison. The healthy subjects are split into two sub-classes, discriminated by the dominating species in their vaginal microbiome: orange and purple represent samples where Lactobacillus iners and Lactobacillus crispatus dominate the microbiome, respectively. The patients mostly affected by BV are clustered in gray. (c) Multidimensional scaling using the pairwise KR distance matrix of the samples, and colored by PKM. (d) Principal component analysis (PCA) applied to the distance matrix by interpreting it as a data matrix. This is a recalculation of Figure 4 of [101], but colored by PKM. (e) Edge PCA applied to the samples, which is a recalculation of Figure 3 of Matsen et al. [101], but colored by IKM.

Fig 9.

k-means cluster assignments of the HMP dataset with k ≔ 18.

Here, we show the cluster assignments as yielded by Phylogenetic k-means (a) and Imbalance k-means (b) of the Human Microbiome Project (HMP) dataset. We used k ≔ 18, which is the number of body site labels in the dataset, in order to compare the clusterings to this “ground truth”. Each row represents a body site; each column one of the 18 clusters found by the algorithm. The color values indicate how many samples of a body site were assigned to each cluster. Similar body sites are clearly grouped together in coherent blocks, indicated by darker colors. For example, the stool samples were split into two clusters (topmost row), while the three vaginal sites were all put into one cluster (rightmost column). However, the algorithm cannot always distinguish between nearby sites, as can be seen from the fuzziness of the clusters of oral samples. This might be caused by our broad reference tree, and could potentially be resolved by using a tree more specialized for the data/region (not tested). Lastly, the figure also lists how the body site labels were aggregated into regions as used in S9 Fig. Although the plots of the two k-means variants generally exhibit similar characteristics, there are some differences. For example, the samples from the body surface (ear, nose, arm) form two relatively dense clusters (columns) in (a), whereas those sites are spread across four or five clusters in (b). On the other hand, the mouth samples are more densely clustered in (b).
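The counts behind such a heat map can be sketched as a simple cross-tabulation of body site labels against cluster assignments. The labels and cluster indices below are hypothetical and only illustrate the counting step, not the clustering itself.

```python
from collections import Counter

def site_cluster_counts(body_sites, cluster_ids):
    """Count how many samples of each body site fall into each cluster."""
    return Counter(zip(body_sites, cluster_ids))

# Hypothetical labels and cluster assignments for four samples.
sites = ["stool", "stool", "saliva", "stool"]
clusters = [0, 1, 2, 0]
counts = site_cluster_counts(sites, clusters)
```

Each `(body_site, cluster)` key corresponds to one cell of the heat map, and its count determines the color value of that cell.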

Fig 10.

Objective values of Placement-Factorization of the BV dataset.

Here, we show the values of the objective function for each inner edge, for the first two factors found by Placement-Factorization (with taxon weights) of the BV dataset. The winning edge of each iteration is marked by a black arrow. This novel visualization of phylogenetic factors helps to understand why a particular edge was chosen in an iteration: Here, the objective function of the first iteration in (a) yields high values for the path towards the Lactobacillus clade, consistent with previous findings. However, due to small differences, the winning edge of the first iteration is chosen to be relatively basal in the tree, meaning that a large clade is factored out. This obscures the fact that this factor mostly concerns the Lactobacillus clade, and not so much the remaining taxa in that clade. This visualization thus aids interpretation of the found clades, and makes it possible to identify the parts of a factored clade that are most relevant to the factor. In the second iteration in (b), the tree clearly shows the distinction between the two relevant clades of Lactobacillus again, consistent with previous findings. See also S15 Fig for the version of this visualization without taxon weights.

Fig 11.

Ordination of Placement-Factorization of the HMP dataset.

In this visualization of phylogenetic factors, we show the balances of the winning edge at different factors for all samples. Subfigure (a) shows the first 10 factors found by Placement-Factorization with taxon weighting on the oral/fecal subset of the HMP dataset. We call this a balance swarm plot. It can be understood as a multi-dimensional scatter plot, where each dimension is shown separately: Each column corresponds to a factor (PF1–PF10), with the vertical axis being the balances, and horizontal space within each column used to spread samples at nearby positions, revealing their distribution density. The balances were scaled to the [−1.0, 1.0] interval for better comparability across factors, while keeping the centering at 0. Subfigure (b) shows the first factor of Placement-Factorization with taxon weighting on the full HMP dataset. The violin plots in (b) extend the idea of balance swarm plots by separating different groups of samples, based on their body site. This makes the distribution of balances at the factor clearly visible for all groups of samples. The exhaustive versions of these plots, with and without taxon weighting, and for more factors, are shown in the context of the typical two- and three-dimensional scatter plots in S18, S19, and S20 Figs. See there for more details.
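One simple way to rescale the balances of a factor while leaving 0 fixed is to divide by the largest absolute value, so that the most extreme balance has magnitude 1. Whether this is the exact scaling used for the figure is an assumption, and the function name and values are hypothetical.

```python
def scale_balances(balances):
    """Scale balances by their largest magnitude, keeping 0 at 0."""
    max_abs = max(abs(b) for b in balances)
    if max_abs == 0.0:
        # All balances are zero; nothing to scale.
        return list(balances)
    return [b / max_abs for b in balances]

# Hypothetical balances of one factor across four samples.
scaled = scale_balances([-2.0, 0.0, 1.0, 4.0])
```

Because only a positive factor is applied, the sign of every balance and the ordering of the samples are preserved.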
