Figure 1.
Metagenomic deconvolution successfully predicted the length of each gene in the various species found in simple synthetic metagenomic samples (see text).
Actual (black squares) and predicted (blue circles) gene lengths for a given gene in each species (A) and for all the genes in one species (B). The specific gene and the specific species shown here were those with the median variation in abundance across samples. (C) The predicted gene length as a function of the actual length for all genes in all species. Different colors are used to indicate the number of copies in a species. The dashed line represents a perfect prediction. Note that predicted gene lengths can be negative, as predictions were made in this case using least-squares regression. Gene lengths can be restricted to positive values using alternative regression methods (see Supporting Text S1).
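As an illustration of why least-squares predictions can be negative, the following is a minimal sketch that assumes a linear mixing model, in which the total abundance of a gene in a sample is the sum over species of species abundance times gene length. The model form, the function name `deconvolve_gene_lengths`, and the toy numbers are assumptions made for illustration and are not taken from the text.

```python
import numpy as np

def deconvolve_gene_lengths(abundances, gene_totals):
    """Least-squares estimate of per-species gene lengths.

    abundances:  (n_samples, n_species) species abundance in each sample
    gene_totals: (n_samples, n_genes) total gene abundance in each sample
    Returns an (n_species, n_genes) array; entries are unconstrained,
    so noisy inputs can yield negative values.
    """
    abundances = np.asarray(abundances, dtype=float)
    gene_totals = np.asarray(gene_totals, dtype=float)
    lengths, _, _, _ = np.linalg.lstsq(abundances, gene_totals, rcond=None)
    return lengths

# Hypothetical toy data: 4 samples, 2 species, 2 genes.
true_lengths = np.array([[900.0, 0.0],
                         [1200.0, 600.0]])
abundances = np.array([[0.8, 0.2],
                       [0.5, 0.5],
                       [0.3, 0.7],
                       [0.6, 0.4]])
gene_totals = abundances @ true_lengths  # noise-free mixture of the two species
predicted = deconvolve_gene_lengths(abundances, gene_totals)
```

With noise-free input the true lengths are recovered. With noisy input some entries may fall below zero; a non-negative solver such as `scipy.optimize.nnls`, applied one gene column at a time, is one possible way to enforce the constraint.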
Figure 2.
Prediction accuracy is correlated with variation.
Average prediction error for each gene orthology group (red squares) as a function of the variation (standard deviation divided by the mean) across samples, R = −0.48, p < 4.3×10⁻⁷ (A), and across species, R = −0.53, p < 2.0×10⁻⁸ (B). Best-fit lines are shown. Error is calculated as the relative error in the length prediction for each gene orthology group.
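The quantities in this caption can be sketched as follows. The use of the population standard deviation, the form of the relative error (absolute difference divided by the actual value), and all function names are assumptions made for illustration; the caption does not specify them.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population SD assumed)."""
    return pstdev(values) / mean(values)

def relative_error(predicted, actual):
    """Relative error of one length prediction (assumed definition)."""
    return abs(predicted - actual) / actual

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

A negative `pearson_r` between per-group variation and per-group error corresponds to the trend in the figure: groups that vary more tend to be predicted with lower error.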
Figure 3.
Predicting the length of each KO in each species using deconvolution and the effect of annotation errors.
Predicted KO lengths vs. actual KO lengths, using BLAST-based annotation.
Figure 4.
Reconstructing the genomic content of reference genomes from simulated mixed metagenomic samples using metagenomic deconvolution.
(A) ROC curves (AUC = 0.93) for predicting KO presence and absence across all species as a function of the threshold used to predict the presence of a KO. The ROC curve for a naïve convolved prediction (AUC = 0.76) is illustrated for comparison. (B) Predicted genomic content of each species. KOs are partitioned into bins based on the set of genomes in which they are present (e.g., genes present only in the first species, genes present only in the second species, genes present in the first and second species but not in the third, etc.; see Venn diagram). The height of each bar represents the proportion of KOs in each bin and the color represents the presence of these KOs in each species. The black strip inside each bar represents the fraction of KOs from this bin predicted to be present in each species.
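The thresholding and AUC described in panel A can be sketched as below. The AUC is computed here through the pairwise ranking identity (the probability that a present KO scores higher than an absent one, with ties counted as one half), which equals the area under the ROC curve obtained by sweeping the threshold. The function names and the choice of `>=` at the threshold are assumptions made for illustration.

```python
def call_presence(scores, threshold):
    """Predict a KO as present when its predicted value reaches the threshold."""
    return [s >= threshold for s in scores]

def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: predicted values, one per (KO, species) pair
    labels: truthy where the KO is actually present, falsy where absent
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one present and one absent KO")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to a random ranking and 1.0 to a perfect separation of present and absent KOs.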
Figure 5.
Reconstructing the genomic content of genera from HMP tongue dorsum samples.
(A) The average similarity in KO content between each reconstructed genus and sequenced genomes from the various genera. Similarity is measured by the Jaccard similarity coefficient, over the set of the 500 KOs with the highest variation across samples. Genera are ordered by their mean abundance in the set of samples under study. Entries highlighted with a black border represent the highest similarity in each row. (B) The average similarity between sequenced genomes from the various genera. Similarity was measured as in panel A.
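For reference, the Jaccard similarity coefficient between two KO sets is the size of their intersection divided by the size of their union. A minimal sketch follows; the function name, the handling of two empty sets, and the KO identifiers in the test are illustrative assumptions.

```python
def jaccard_similarity(kos_a, kos_b):
    """Jaccard coefficient between two collections of KO identifiers."""
    a, b = set(kos_a), set(kos_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(union)
```

In the setting of this figure, both sets would first be restricted to the 500 most variable KOs before the comparison.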
Figure 6.
Predicting genus-specific KOs in genera from the HMP tongue dorsum samples.
To restrict our analysis to well-sampled genera, only genera for which at least 10 reference genomes are available and for which at least 5 genus-specific KOs were obtained are considered. The presence (or absence) of genus-specific KOs across the set of sequenced species from each genus is illustrated by the presence (or absence) of a black dot. Gray dots indicate that the KO was present in only a subset of the sequenced strains of that species. KOs predicted to be present by metagenomic deconvolution are shown using colored dots. Results are shown for several regression methods, including least squares (LS, 89% accuracy, 82% recall), non-negative least squares (NNLS, 90% accuracy, 82% recall), and lasso (90% accuracy, 92% recall). See also Supporting Text S1.
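The accuracy and recall figures quoted above can be read with the standard definitions sketched below: accuracy is the fraction of all presence or absence calls that are correct, and recall is the fraction of truly present KOs that are predicted present. That the caption uses exactly these definitions is an assumption; the function name is hypothetical.

```python
def accuracy_and_recall(actual, predicted):
    """Return (accuracy, recall) for paired presence/absence calls.

    actual, predicted: equal-length sequences of truthy/falsy values.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # present, predicted present
    tn = sum(1 for a, p in pairs if not a and not p)  # absent, predicted absent
    fn = sum(1 for a, p in pairs if a and not p)      # present, missed
    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return accuracy, recall
```

Two methods can share the same accuracy while differing in recall, as with NNLS and lasso here, when one of them misses fewer truly present KOs at the cost of more false positives.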