Edge Principal Components and Squash Clustering: Using the Special Structure of Phylogenetic Placement Data for Sample Comparison

doi:10.1371/journal.pone.0056859

Figure 1.

A graphical representation of the operation of edge principal components analysis (edge PCA).

The phylogenetic distribution of reads for a given sample determines its position in the principal components projection. For the first axis, reads that fall below edges with positive coefficients on that axis' tree (marked in orange on the tree) move the corresponding sample point to the right, while reads that land on edges with negative coefficients (marked in green on the tree) move the corresponding sample point to the left. The second axis is labeled with a subtree of the first tree (the position of which is marked with a star on the first principal component tree): reads below edges with positive coefficients move sample points up, while reads below edges with negative coefficients move sample points down. The principal components shown here are the actual principal components for the example shown in Figures 4, 5, and 6.

Figure 2.

A visual depiction of the squash clustering algorithm.

When two clusters are merged, their mass distributions are combined according to a weighted average. The edges of the reference tree in this figure are thickened in proportion to the mass distribution (for simplicity, just a subtree of the reference tree is shown here). In this example, the lower mass distribution is an equal-proportion average of the upper two mass distributions. Similarities between mass distributions, such as the similarity seen between the two clusters for the G. vaginalis clade shown here, are what cause clusters to be merged. Such similarities between internal nodes can be visualized for the squash clustering algorithm; the software implementation produces such a visualization for every internal node of the clustering tree. Note that in this figure only the number of reads placed on each edge is shown, although each placement has an associated location on each edge when performing computation.
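
The merge step described above can be sketched in a few lines. This is only an illustration, not the software implementation mentioned in the caption: the edge names (`e1`, `e2`, `e3`) and the function `squash` are hypothetical, each mass distribution is reduced to a per-edge proportion (ignoring the location of each placement along its edge), and the weights are assumed to be the number of samples in each cluster.

```python
def squash(mass_a, n_a, mass_b, n_b):
    """Weighted average of two per-edge mass distributions.

    mass_a, mass_b: dicts mapping edge name -> proportion of mass on that edge.
    n_a, n_b: weights, assumed here to be the number of samples per cluster.
    """
    total = n_a + n_b
    edges = sorted(set(mass_a) | set(mass_b))
    return {
        e: (n_a * mass_a.get(e, 0.0) + n_b * mass_b.get(e, 0.0)) / total
        for e in edges
    }


# Two hypothetical clusters that share mass on edge e1.
upper_left = {"e1": 0.5, "e2": 0.5}
upper_right = {"e1": 0.5, "e3": 0.5}

# Equal-proportion average, as in the figure.
merged_equal = squash(upper_left, 1, upper_right, 1)

# Unequal weights: the first cluster holds three samples, the second one.
merged_weighted = squash(upper_left, 3, upper_right, 1)

print(merged_equal)
print(merged_weighted)
```

Because each input sums to one and the weights are normalized by their total, the merged distribution also sums to one, so it can be compared against other clusters in the next round of clustering.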

Figure 3.

How the edge PCA algorithm works.

(a) For every edge of the tree, the difference is taken between the number of reads on the non-root side and the number of reads on the root side (root marked with a star). (b) The results of this are put into a matrix corresponding to the sample number (row) and the edge number (column). (c) The standard PCA algorithm is then applied, resulting in a collection of eigenvectors (the principal components) and eigenvalues. (d) These eigenvectors are indexed by the edges of the tree, and hence they can be mapped back onto the tree.
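
Steps (a) through (d) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tree, the sample counts, and all function names are hypothetical; an edge is identified by the node at its non-root end; reads placed on an edge are counted on that edge's non-root side; the sign convention is non-root minus root; and raw read counts are used without normalization.

```python
import numpy as np

# Hypothetical rooted tree: each non-root node maps to its parent.
PARENT = {"A": "n1", "B": "n1", "n1": "root", "C": "root"}
EDGES = sorted(PARENT)  # one edge per non-root node: ['A', 'B', 'C', 'n1']


def is_at_or_below(edge, ancestor):
    """True if `edge` is `ancestor` itself or lies in the subtree below it."""
    node = edge
    while node is not None:
        if node == ancestor:
            return True
        node = PARENT.get(node)  # None once we step past the root
    return False


def edge_difference_matrix(samples):
    """Steps (a) and (b): rows are samples, columns are edges."""
    rows = []
    for counts in samples:
        total = sum(counts.values())
        row = []
        for e in EDGES:
            non_root = sum(c for f, c in counts.items() if is_at_or_below(f, e))
            root_side = total - non_root
            row.append(non_root - root_side)
        rows.append(row)
    return np.array(rows, dtype=float)


def edge_pca(matrix):
    """Step (c): standard PCA via the covariance matrix of the columns."""
    centered = matrix - matrix.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]       # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return eigvals, eigvecs, centered @ eigvecs


# Three hypothetical samples: read counts per edge.
samples = [
    {"A": 8, "B": 1, "C": 1},
    {"A": 1, "B": 8, "C": 1},
    {"A": 1, "B": 1, "C": 8},
]

matrix = edge_difference_matrix(samples)
eigvals, eigvecs, projections = edge_pca(matrix)

# Step (d): the first eigenvector is indexed by edge, so it maps onto the tree.
loadings = dict(zip(EDGES, eigvecs[:, 0]))
print(matrix)
print(loadings)
```

The sign of each eigenvector is arbitrary, so only the relative signs of the edge coefficients within one component carry meaning when they are drawn on the tree.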

Figure 4.

The first principal component for the combined vaginal data, representing about 56 percent of the variance.

The reference tree is colored by principal component sign (positive colored orange, negative colored green) and thickened proportional to magnitude. The edges across which maximal between-sample heterogeneity is found are those leading to the Lactobacillus clade and those leading to the Sneathia and Prevotella clade. This axis corresponds to taxa that are important in the diagnosis of bacterial vaginosis, as Sneathia and Prevotella are associated with bacterial vaginosis, while Lactobacillus is associated with a healthy microbiome.

Figure 5.

The second principal component for the combined vaginal data, representing about 24 percent of the variance.

Low-weight regions of the tree are excluded from the figure. The edges across which maximal between-sample heterogeneity is found are those between two different Lactobacillus clades: L. iners and L. crispatus. Thus, the second important “axis” appears to correspond to the relative levels of these two species.

Figure 6.

Edge principal components analysis (edge PCA) applied to the combined Forney and Fredricks data set and plotted separately.

The axes for the edge principal components plot are described in Figures 4 (x-axis) and 5 (y-axis). The Nugent score is a diagnostic score for bacterial vaginosis, with a high score indicating bacterial vaginosis.

Figure 7.

Principal coordinates analysis applied to the Fredricks vaginal data set.

Figure 8.

The results of (a) squash clustering and (b) UPGMA as applied to the vaginal data.

The labels are not shown and they do not appear in the same order on the two trees. For a comparison of labeled trees, see Supplementary Figure S1.

Figure 9.

The results of the cluster accuracy simulation experiment using the rooted Robinson-Foulds (RF) metric.

This graphic shows very similar levels of topological accuracy for squash clustering and UPGMA, as well as high similarity between the topologies returned by the two methods. The figure is divided into panels by the value of the reconstructability parameter described in the text (a larger value implies easier reconstruction). The x-axis is the value of the parameter of the distance described in (1). The y-axis is the rooted Robinson-Foulds distance: for the “squash” and “UPGMA” lines it is the distance between the reconstructed tree and the original tree using these two algorithms (lower is more accurate), while the “between” line shows the distance between the results of the two clustering algorithms (lower is more similar). Note that the maximum rooted RF distance between two trees with six taxa is four.
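
The stated maximum of four for six taxa can be checked with a small sketch. It assumes one common convention for the rooted RF distance, namely half the number of nontrivial clades found in exactly one of the two trees; that convention is an assumption chosen because it agrees with the maximum given above. The nested-tuple tree encoding and the function names are hypothetical.

```python
def leaf_sets(tree, out):
    """Return the leaf set of `tree`, adding every internal node's clade to `out`."""
    if isinstance(tree, str):  # a leaf is a taxon name
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.add(leaves)
    return leaves


def nontrivial_clades(tree):
    """Clades of internal nodes, excluding the root clade (shared by all trees)."""
    out = set()
    all_leaves = leaf_sets(tree, out)
    out.discard(all_leaves)
    return out


def rooted_rf(tree_1, tree_2):
    """Half the size of the symmetric difference of the two clade sets."""
    return len(nontrivial_clades(tree_1) ^ nontrivial_clades(tree_2)) / 2


# Two six-taxon caterpillar trees that share no nontrivial clade.
tree_1 = ((((("a", "b"), "c"), "d"), "e"), "f")
tree_2 = ((((("f", "e"), "d"), "c"), "b"), "a")

print(rooted_rf(tree_1, tree_2))
print(rooted_rf(tree_1, tree_1))
```

A rooted binary tree on six taxa has four nontrivial clades, so two such trees that share none of them reach the maximum of four under this convention.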
