CoSMIC: A hybrid approach for large-scale, high-resolution microbial profiling of novel niches

doi:10.1371/journal.pone.0340349

Table 1.

LNA primers.

A. Three primer pairs (pairs 1-3) and their location along the E. coli 16S rRNA gene. Primers are available via the Qiagen catalog number. The specifics of modified bases are a Qiagen trade secret, yet only a handful of bases are modified. B. The number of amplified full-length 16S rRNA genes by each pair and by their intersection. About 360K are amplified by all three pairs, 570K by two pairs, and 250K by just a single pair.


Table 2.

Primers used in the Swift AMPLICON 16S+ITS PANEL.

Primers were applied to amplify the variable regions along the 16S rRNA gene as described in the kit’s protocol.


Table 3.

Swift primer pairs applied in SMURF-based reconstruction.

Multiple amplicon types are generated in a multiplex amplification with the nine Swift primer pairs. SMURF analysis considered nine forward-reverse pairs of adjacent regions. For example, two amplicon types starting with V3_f were considered, i.e., those that end with V4_r or V5_r.


Fig 1.

Microbial profiling of Aster poliothamnus samples depends on the applied amplicon.

Leaf, soil, and root samples of A. poliothamnus were profiled by the Qiagen kit that provides six amplicons along the 16S rRNA gene, and their reads were subsequently analyzed using the CLC Genomics platform. a. The six amplicons of the Qiagen kit, each amplifying two or three adjacent variable regions. Taken together, these amplicons cover all variable regions. b. The number of OTUs detected for soil, leaf, and root samples of A. poliothamnus for each of the six amplicons. c. Distribution of Jaccard distances among pairs of amplicons. In each pair, e.g., V1V2 vs. V3V4, a Jaccard distance is calculated between the lists of taxa detected by CLC Genomics in each region. d-f. Decay graphs of the number of unique taxa identified by the amplicons analyzed. Amplicons are ordered by the number of unique taxa each contributes beyond the amplicons that precede it. d. Leaf. e. Soil. f. Root.
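The quantities in panels c-f can be illustrated with a minimal Python sketch. It assumes each amplicon's detected taxa are available as a set of names. The function names, the greedy ordering rule, and the toy taxa below are illustrative assumptions, not the authors' code or data.

```python
def jaccard_distance(taxa_a, taxa_b):
    """Jaccard distance between two collections of taxon names."""
    a, b = set(taxa_a), set(taxa_b)
    union = a | b
    if not union:
        return 0.0  # two empty lists are treated as identical
    return 1.0 - len(a & b) / len(union)


def greedy_decay_order(taxa_by_amplicon):
    """Order amplicons by how many new taxa each adds to those already chosen.

    Returns a list of (amplicon, number_of_new_taxa) tuples. Ties are broken
    alphabetically by amplicon name (an arbitrary choice for this sketch).
    """
    remaining = {name: set(taxa) for name, taxa in taxa_by_amplicon.items()}
    seen = set()
    order = []
    while remaining:
        best = max(sorted(remaining), key=lambda name: len(remaining[name] - seen))
        order.append((best, len(remaining[best] - seen)))
        seen |= remaining.pop(best)
    return order


# Toy input: hypothetical taxa, not data from the study.
toy = {
    "V1V2": {"A", "B", "C"},
    "V3V4": {"B", "C", "D"},
    "V5V6": {"E"},
}
print(jaccard_distance(toy["V1V2"], toy["V3V4"]))
print(greedy_decay_order(toy))
```

A decay graph in this sense plots the second element of each tuple against the position of the amplicon in the ordering.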


Fig 2.

Evaluating the performance of sets of primer pairs across niches.

Twenty-four samples corresponding to the leaf, soil, and root of eight plants were profiled by SMURF, using all 63 combinations of the six amplicons. a. Ambiguity, i.e., the number of 16S rRNA gene sequences per group as detected by SMURF, as a function of the number of amplicons. Each dot corresponds to the average ambiguity in one of the 63 distinct cases. The blue line and gray background correspond to the average and standard deviation of the ambiguity across combinations of a specific number of regions. b. We consider SMURF’s solution based on all six regions the gold standard and display the fraction of species identified by subsets of the six regions. Shown is the fraction of species whose taxonomy matches the six-region solution. c. Pearson correlation between profiling based on all six regions and each set of regions. d. A comparison of amplicon sets for soil samples. Each dot corresponds to a specific region combination averaged across 24 samples, where the marker size scales with the ambiguity and axes correspond to Pearson correlation (horizontal) and mutual species (vertical). The marker’s color indicates the number of amplicons used for SMURF analysis. Hence, a single marker corresponds to six regions (yellow), and six dark green dots correspond to the different groups of five regions. The marker “x” corresponds to a specific set of 3 regions comprising V1V2, V2V3, and V4V5. e. The same for root samples.
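The count of 63 follows from taking every non-empty subset of six amplicons (2^6 - 1 = 63). The Python sketch below enumerates those subsets and shows one plausible way to compute the two comparison axes in panels b-d. The amplicon labels, the function names, and the simplification of "mutual species" to shared species names are assumptions made for illustration; SMURF itself is not invoked.

```python
from itertools import combinations
from math import sqrt

# Placeholder labels; the actual amplicon names are given in Fig 1a.
AMPLICONS = ["amp1", "amp2", "amp3", "amp4", "amp5", "amp6"]


def nonempty_subsets(items):
    """Yield every non-empty combination of the given items."""
    for size in range(1, len(items) + 1):
        yield from combinations(items, size)


def pearson(reference, candidate):
    """Pearson correlation of two profiles given as {species: frequency}."""
    species = sorted(set(reference) | set(candidate))
    x = [reference.get(s, 0.0) for s in species]
    y = [candidate.get(s, 0.0) for s in species]
    n = len(species)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant profile
    return cov / sqrt(var_x * var_y)


def shared_species_fraction(reference, candidate):
    """Fraction of reference species that also appear in the candidate."""
    ref = {s for s, f in reference.items() if f > 0}
    if not ref:
        return 0.0
    cand = {s for s, f in candidate.items() if f > 0}
    return len(ref & cand) / len(ref)


subsets = list(nonempty_subsets(AMPLICONS))
print(len(subsets))  # number of amplicon combinations
```

In a full analysis, each subset would be passed to the profiling tool and the resulting profile compared with the six-amplicon profile using the two helper functions.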


Fig 3.

Enriching the database with novel full-length 16S rRNA sequences.

We amplified the full-length 16S rRNA gene using pre-designed LNA primers and sequenced the resulting amplicons by PacBio long-read sequencing. Four samples were processed, each containing a pool of samples from a different habitat, i.e., plants from the Arava desert in Israel, characterized by a climate type BWh (defined as hot desert climate with annual rainfall <100 mm), according to Köppen climate classification; plants from the lowland region in Israel, characterized by a climate type Csa (defined as a Mediterranean climate with hot, dry summers and mild, wet winters); and pools of samples from the marine sponge S. officinalis collected in the Mediterranean Sea along the coast of either France or Israel. a. All non-redundant PacBio sequences (Plants BWh - 14852, Plants Csa - 14589, Sponges-France - 6507, Sponges-Israel - 5468) were aligned to SILVA using the CD-HIT algorithm [20]. The percentage of sequences that did not appear in SILVA is shown as a function of the identity threshold. The vertical dashed line corresponds to SILVA’s standard threshold for calling novel full-length 16S rRNA sequences (SILVA NR99). Colors indicate different biomes sampled. b. Histograms of read counts per 16S rRNA sequence for the plants’ pool (left panels) and the sponges’ pool (right panels). The upper and lower panels show histograms of SILVA and novel sequences, respectively. An identity threshold of 99% was used to call novel sequences with respect to SILVA. c. Individual samples from the abovementioned pools were sequenced via short reads and analyzed by SMURF using the augmented database to determine their microbial composition (Sponges-France – 6 samples, Sponges-Israel – 3 samples, Plants (BWh) – 10 samples, Plants (Csa) – 10 samples). The total relative frequency assigned by SMURF to novel 16S rRNA sequences per sample (dots) and across types (columns) is shown. d. The fraction of reads aligned to either SILVA or augmented SILVA for each sample depicted in c. Each panel corresponds to a different sample type. To evaluate the statistical significance, a paired t-test was performed for each sample type; * P<0.05, *** P<0.001, **** P<0.0001.
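The curve in panel a can be illustrated with a short Python sketch. It assumes that the best identity of each long-read sequence against SILVA has already been computed (the caption attributes this step to CD-HIT; the sketch does not run it) and that a sequence counts as novel when its best identity falls below the threshold. The function name and the identity values are hypothetical.

```python
def novel_percentage(best_identities, threshold):
    """Percentage of sequences whose best identity to the database is below threshold."""
    if not best_identities:
        return 0.0
    novel = sum(1 for identity in best_identities if identity < threshold)
    return 100.0 * novel / len(best_identities)


# Hypothetical best-hit identities (fractions), not values from the study.
identities = [1.0, 0.995, 0.98, 0.90]

for threshold in (0.97, 0.99):
    print(threshold, novel_percentage(identities, threshold))
```

Evaluating the function over a grid of thresholds gives one curve per pool; the 99% point corresponds to the dashed line described in the caption.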


Fig 4.

The suggested CoSMIC flowchart.

Gray - Perform standard DNA extraction from environmental samples. Orange - Establish a niche-relevant full-length 16S rRNA gene database by LNA-based amplification and long-read sequencing. Orange-purple - Test multiple primer sets to find an optimal primer pair combination. Purple - Perform high-throughput profiling in a large-scale project. Samples are amplified using the optimal primer combination and analyzed by SMURF based on the augmented database.


Fig 5.

Experimental evaluation of standard methods vs. CoSMIC.

Ten samples were sequenced: leaf, soil, and root samples of T. hirsuta, S. lycopersicum, and T. aestivum, and a mock mixture of 12 known species (see Methods). Each sample underwent standard shotgun metagenomic sequencing, from which we detected its full-length 16S rRNA sequences to be considered ground truth. In addition, as part of CoSMIC, we pooled extracted DNA from all ten samples, amplified by LNA primers, and sequenced via synthetic long reads (Loop Genomics) to create full-length 16S rRNA sequences that were added to the SILVA database. Standard amplicon sequencing was performed by a commercially available kit by Swift, which uses primers spanning the 16S rRNA gene to amplify several variable regions (see Methods). Analysis was then performed by CoSMIC using nine primer pairs and by a standard QIIME2 pipeline. The latter used either nine primer pairs (All) or the V3V4 region only (V3V4) and was performed using OTUs or by ASV calling. Hence, four QIIME2-based results are shown – OTU-All, OTU-V3V4, ASV-All, and ASV-V3V4. a. Alignment identity, i.e., the average alignment score between detected 16S rRNA sequences and ground truth 16S rRNA genes (gt16S) detected by shotgun metagenomics. Boxplots present the distribution of alignment identities for each tissue and analysis method. Leaf, soil, and root panels correspond to an average over three samples (one from each plant). A Mann-Whitney U test was performed for all comparisons, which resulted in very low p-values, with the highest being 0.0064. b. The total frequency of correctly detected gt16S by each method as a function of the identity threshold. The dotted horizontal line in the left panel (mock mixture) marks the results if we improve the ground truth sequences by MarkerMag. Insets present the total frequency achieved per sample at a cutoff of 99%; the dotted horizontal line is their average. The y-axis, shown for the leftmost inset, is relevant for all insets. c. Percentage of erroneous 16S rRNA sequences, i.e., those identified by an analysis method yet not detected by shotgun metagenomics. The percentage is a function of the identity threshold, where alignments below the threshold are considered erroneous. The dotted line marks the MarkerMag-based improvement for the mock community using CoSMIC for up to a 99% identity threshold. d. Histogram of the number of taxa matching an identification by each analysis type. e. Radar plot combining all samples over several criteria (normalized on a 0-1 scale, where large values correspond to better performance): Id, mean alignment identity for each analysis method as described in a; Cov-80, mean total ground truth frequency at 80% identity threshold as described in b; Cov-99, the same for 99% identity threshold; FP, one minus the mean false positive rate as described in c; Amb, mean ambiguity as described in d.
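The threshold-dependent measures in panels b and c can be sketched in Python under one reading of the caption: a ground-truth sequence counts as detected when its best alignment to any reported sequence reaches the identity threshold, and a reported sequence counts as erroneous when its best alignment to any ground-truth sequence falls below it. The data layout, function names, and numbers below are assumptions for illustration and do not come from the study.

```python
def covered_frequency(ground_truth, threshold):
    """Total frequency of ground-truth sequences matched at or above threshold.

    ground_truth: list of (frequency, best_identity_to_any_reported_sequence).
    """
    return sum(freq for freq, identity in ground_truth if identity >= threshold)


def erroneous_percentage(reported_identities, threshold):
    """Percentage of reported sequences whose best ground-truth identity is below threshold."""
    if not reported_identities:
        return 0.0
    wrong = sum(1 for identity in reported_identities if identity < threshold)
    return 100.0 * wrong / len(reported_identities)


# Hypothetical inputs: (frequency, best identity) per ground-truth sequence,
# and the best identity per reported sequence.
gt16s = [(0.5, 1.0), (0.3, 0.95), (0.2, 0.70)]
reported = [1.0, 0.97, 0.60, 0.85]

for threshold in (0.80, 0.99):
    print(threshold,
          covered_frequency(gt16s, threshold),
          erroneous_percentage(reported, threshold))
```

Under this reading, the Cov-80 and Cov-99 criteria of panel e correspond to `covered_frequency` evaluated at 0.80 and 0.99, averaged over samples.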
