The authors have declared that no competing interests exist.
Conceived and designed the experiments: SV. Analyzed the data: SV EP MA FC. Contributed reagents/materials/analysis tools: PSC MT. Wrote the paper: SV. Provided discussion-evaluation of report: EP MA FC PSC MT. Provided evaluation of original idea: MT.
The novel sequencing technologies that generate multiple millions of reads are very promising for resolving the immense diversity of soil bacterial 16S rRNA genes. Yet their maximum read length is limited, restricting studies to screening DNA stretches that span a single 16S rRNA gene hypervariable (V) region. The aim of the present study was to assess the effects of the properties of four consecutive V regions (V3-6) on analytical methodologies commonly applied in bacterial ecology studies. Using an
Use of the 16S rRNA gene as a bacterial evolution marker was a breakthrough for microbial ecology studies in the late 1980s
The methodologies from the 1990s, along with the new generation of high-throughput screening of the 16S rRNA gene, revealed that the bacterial diversity existing in just a few grams of soil was far greater than previously believed
The aim of the present study was to assess the use of Illumina sequencing for massively parallel screening of bacterial 16S rRNA gene diversity in soil environments, based on the information potential of such short reads (a single V region). The 16S rRNA gene of soil-derived sequences from the RDP database was explored for conservation, and potential primer design sites were proposed. Afterwards, four consecutive 16S rRNA gene hypervariable (V) regions were analyzed, namely V3, V4, V5 and V6. These sequences were examined for properties related to the limitations of contemporary Illumina technology. The tests performed included: (i) screening the suitability of V regions with respect to the read length of the sequencing technology; (ii) assessment of the conservation of sequence stretches flanking the examined V regions; (iii) estimation of pairwise sequence distances as a means of evaluating how representative the trimmed V region is of the full-length 16S rRNA gene sequence; and (iv) loss of taxonomy information in trimmed sequences compared with their full-length versions. Finally, a virtual experiment based on sequences and outcomes of previously performed studies was used to identify expected differences between V regions according to 16S rRNA gene sequence frequencies.
42,109 full or nearly full length 16S rRNA gene sequences derived from currently cultured and uncultured soil bacteria were used for performing the following analyses. Sequence conservation was examined using the Shannon entropy values (
Hypervariable regions indicated as designated by Baker
A) Nucleic acid base composition of the 16S rRNA gene consensus sequence of the 41,109 RDP database soil derived sequences for 90% conservation cutoff value. Red background positions include hypervariable stretches as reported in reference
Fragment lengths including the examined hypervariable regions for all screened (41,109) sequences. Sequence fragments were plotted in ascending order of length.
Effects of sequence length and V region variability patterns on obtained sequence distances were assessed by comparing distances of trimmed V region sequences with their full length variants (
All tests were significant (P<0.001). Correlation index (r) values and the linear models (solid lines) used to describe overall trends are provided above and below each plot. Local relationships between corresponding sequence distances of the FL and other datasets are shown by non-parametric LOWESS (locally weighted scatterplot smoothing) regression (dot-dashed lines), while the ideal y = x correlation is also plotted (dashed lines).
Classification depth testing indicated that all V region datasets showed a similar under-representation of existing sequences throughout all taxa per taxonomical level, with V6 performing worst of all (
Values are expressed as percentage of the taxonomical annotations obtained for the FL sequence variants. Taxa were characterized into groups according to the existing sequence numbers as indicated in
| Group | Taxon | FL | V3 | V4 | V5 | V6 |
| | | 15786 | 15875 | 15835 | 15678 | 15760 |
| | | 7388 | 7209 | 7431 | 7619 | 7144 |
| | | 7373 | 7380 | 7292 | 7275 | 7353 |
| | | 4607 | 4421 | 4350 | 4418 | 3781 |
| | | 1861 | 1870 | 1856 | 1836 | 1894 |
| | | 1521 | 2318 | 2303 | 2307 | 3383 |
| | | 808 | 752 | 706 | 641 | 725 |
| | | 759 | 759 | 730 | 749 | 688 |
| | | 690 | 493 | 560 | 391 | 477 |
| | | 446 | 226 | 304 | 392 | 124 |
| | | 319 | 318 | 309 | 319 | 280 |
| | | 162 | 146 | 154 | 153 | 169 |
| | | 114 | 96 | 95 | 106 | 101 |
| | | 97 | 91 | 89 | 98 | 58 |
| | | 59 | 40 | 19 | 33 | 41 |
| | | 45 | 36 | 19 | 22 | 9 |
| | | 12 | 13 | 7 | 13 | 12 |
| | | 11 | 9 | 9 | 6 | 8 |
| | | 9 | 11 | 8 | 7 | 7 |
| | | 7 | 9 | 4 | 12 | 12 |
| | | 7 | 3 | 3 | 7 | 6 |
| | | 7 | 7 | 6 | 6 | 6 |
| | | 5 | 5 | 5 | 7 | 6 |
| | | 4 | 4 | 4 | 4 | 4 |
| | | 4 | 3 | 3 | 3 | 3 |
| | | 4 | 4 | 2 | 2 | 4 |
| | | 2 | 8 | 2 | 2 | 3 |
| | | 2 | 2 | 2 | 2 | 2 |
| | | 0 | 1 | 0 | 0 | 0 |
| | | 0 | 0 | 0 | 0 | 2 |
Taxa were categorized according to included sequence numbers or annotation to known taxa in the RDP database and are denoted with different letters.
Published soil bacterial 16S rRNA gene diversity datasets were downloaded and used as templates for generating corresponding virtual samples. The latter were used to assess differences between V region trimmed fragments and their full-length sequences when used in taxonomy, OTU and phylotype screening approaches.
Dataset topologies based on sample distances showed an overall better approximation of the FL dataset by the longer stretch V region datasets, V3 and V4 (
A) PCA results for the matrix of sample distances based on classified sequence relative abundance (left) and presence/absence (right) for the V region and FL datasets. B) As in A, for OTU relative abundance (left) and presence/absence (right). C) PCA results for matrices generated from the weighted (left, based on phylotype relative abundance) and unweighted (right, based on phylotype occurrence) UniFrac distances between samples for the V region and FL datasets.
16S rRNA gene diversity screening using technologies like Illumina that produce multimillion sequence reads is a very appealing method for elucidating ecology concepts in complex environments such as soils. However, as indicated in the present study, there are several issues related to contemporary technology abilities and properties of screened environments that should be considered.
Sequence conservation is an important factor for determining the potential of screening depth of various taxa using an existing library. Our results (
Interconnected to the previous discussion point is the operational fragment length for an Illumina technology application. Current Illumina technology screening abilities according to the latest available (v4–v5) chemistries are maximized when using the Genome Analyzer IIx (GAIIx) and exploiting the paired-end reading ability (obtaining reads from both sequence fragment ends). It has been demonstrated that relatively good read quality results can be obtained for read-lengths of 125 nucleotides for each of two reads per fragment (with the second read showing lower qualities at the error prone read ends)
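The relationship between paired-end read length and usable fragment length can be sketched with simple arithmetic. The example below is illustrative only: the 125-nucleotide read length comes from the text above, whereas the minimum overlap of 20 nucleotides and the example fragment lengths are hypothetical values chosen for the demonstration, not figures from this study.

```python
def max_fragment_length(read_length, min_overlap):
    """Longest fragment whose two paired-end reads still overlap by min_overlap bases."""
    if not 0 <= min_overlap <= read_length:
        raise ValueError("min_overlap must lie between 0 and read_length")
    return 2 * read_length - min_overlap


def overlap_for_fragment(fragment_length, read_length):
    """Bases covered by both reads; a negative value is an unsequenced gap."""
    return 2 * read_length - fragment_length


READ_LENGTH = 125   # read length quoted in the text
MIN_OVERLAP = 20    # hypothetical assembly requirement (assumption)

limit = max_fragment_length(READ_LENGTH, MIN_OVERLAP)
# Hypothetical fragment lengths, used only to show both outcomes.
for fragment in (200, 300):
    overlap = overlap_for_fragment(fragment, READ_LENGTH)
    status = "can be assembled" if fragment <= limit else "cannot be assembled"
    print(f"{fragment} nt fragment: overlap {overlap} nt, {status}")
```

Shorter fragments leave a longer overlap, which is what allows base calls from the two reads to be cross-checked during assembly.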
The RDP database soil-derived sequences were further analyzed to assess how well sequence parts belonging to the tested V regions represent the full-length sequences in terms of the distances and taxonomy annotations obtained during sequence comparisons. Correlation tests between distances of the full-length sequences and distances of their V region variants (sequences belonging to the same strains) showed an overall superior performance for the V4 region dataset, followed by V5, both for the Pearson correlation values and for the dispersal of points around the applied linear model. However, closer examination of the V region datasets shows that, for FL dataset distances of 0–13%, distances appear to be overestimated for V3 and underestimated for V5 and V6. This indicates that more per-base variability is accumulated in the V3 region than in the other V regions and in the corresponding section of the FL sequence. Higher resolution of signature sequences can therefore be obtained with V3 at the referred OTU definitions.
Taxonomy classification of the V region and FL datasets indicated that there is some information loss along with sequence size reduction, particularly for the V6 dataset (
The performance of the simulated analysis provided an approximation of the effect that sequence relative abundance and richness in environmental soil samples would have on diversity assessment. Overall it was shown that datasets of V-regions encompassing longer sequence stretches (V3 and V4) generated sample distances more similar to the ones produced by the FL dataset compared to V5 and V6. Such differences between the V3, V4 and the V5 dataset were not indicated in the database screening analyses performed in the first part of this study. That is possibly because of the composition of the tested soil microbiomes, having increased relative abundance of sequences showing performance differences when the trimmed V5 or V6 regions are compared with their full-length sequence variants.
Combining Illumina sequencing technology with screening of partial 16S rRNA gene sequence reads in environmental samples can be a powerful tool for microbial ecology studies. However, this combination has some limitations as a result of the sequence screening length. The V3 region did not perform as well when the non-redundant soil-derived sequence dataset was screened, but it had a superior performance when sequence frequencies came close to those found in soil environments. V4 had a high overall performance, but compared to the rest it had reduced conservation of the sequence sites flanking the V region. This lack of conservation may restrict diversity screening depth. V5 had a desirable diversity screening depth and an overall good performance for the non-redundant dataset, but the information extracted from this region showed differences with the full-length 16S rRNA gene sequence variants in the non-redundant dataset, thus showing the effect of the composition of the tested bacterial communities on the outcome of the V5 selection approach. V6 was outperformed in all tests apart from that of flanking sequence conservation.
Collectively, these results suggest that partial 16S rRNA gene sequence reads corresponding to single V regions have flaws compared to their FL variants in soil bacterial community studies. Nevertheless, some appear to capture the FL sequence information to a great degree. The properties of V3 can match the demands of many total soil bacterial community screening studies. V5, on the other hand, is a relatively well-performing representative of the shorter V regions. The shorter V regions provide the opportunity to assess the sequencing quality of the reads used: longer parts of the reads from the two strands of the sequenced amplicon overlap during assembly, which is performed as part of the reconstruction of the screened V region sequence, so the agreement of base calls in the overlapping parts can be examined.
Incorporating database exploration into the initial stages of experimental setup is strongly suggested for improving the strategy towards the experimental goals. This holds especially true for the primer design phase, which is crucial for the quality of the data produced. Careful selection of template sequences for the primer design process can improve primer-set collections for highly diverse environments like soil. Potential for further methodological improvements can be found in approaches like screening more than a single V region or even using multiple housekeeping genes
An overview of the approach is provided in
Datasets description: 42109 full or nearly full-length (≥1200 bp) soil-derived, ribosomal database project (RDP) database
Analysis of 16S rRNA gene conservation and V region lengths: Positional variability of the aligned soil bacterial 16S rRNA gene sequences was assessed by estimating the Shannon entropy (
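As an illustration of per-position Shannon entropy on an alignment, a minimal sketch follows. The choices made here are assumptions and not necessarily those of the study: entropy is computed in bits (log base 2), gaps and ambiguity codes are ignored, and the four toy sequences are invented for the example.

```python
import math
from collections import Counter


def column_entropy(column, alphabet="ACGT"):
    """Shannon entropy in bits of one alignment column, ignoring gaps and ambiguity codes."""
    counts = Counter(base for base in column.upper() if base in alphabet)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum over observed bases of -p * log2(p)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(aligned_seqs):
    """Entropy per alignment position for aligned sequences of equal length."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return [column_entropy("".join(col)) for col in zip(*aligned_seqs)]


# Toy alignment (hypothetical): two conserved and two variable positions.
toy_alignment = ["ACGT", "ACGA", "ACTC", "AC-G"]
entropies = alignment_entropy(toy_alignment)
print([round(h, 3) for h in entropies])
```

Positions with entropy close to zero are conserved and are therefore candidate primer sites, while high-entropy positions mark variable stretches.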
Corresponding sequence distances and taxonomy comparisons between V region and FL datasets: Properties related to two major microbial diversity assessment approaches (OTU and taxonomy based) were examined in comparison to the respective properties of the full-length sequence variants. OTU and taxonomy based analyses were carried out using the Mothur software. Using the average linkage algorithm, distances between aligned sequences having the same identifiers were calculated and concomitantly compared for all V region datasets against the full-length sequences. Due to computational power limitations, a subset of ∼10,000 sequences per dataset (those derived from agricultural and grassland soils) was used, generating ∼100,000,000 pairwise distances. For each dataset, 1,000,000 randomly selected distances, matched by strain of origin, were used to perform Pearson correlation tests between each V region dataset and the full-length variant. Taxonomy information differences between all datasets and the full-length sequence annotations were assessed using the naïve Bayesian classifier at 50% confidence resulting from bootstrap analysis
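The comparison of full-length and trimmed-region distances can be sketched as follows. This is not the Mothur workflow used in the study: the distance is a plain uncorrected p-distance, the sequences are invented, and the `region` slice is a hypothetical stand-in for V region coordinates. Only the principle is shown, namely correlating the distance of each sequence pair in the full-length data with the distance of the same pair after trimming.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Fraction of mismatches over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)


def pairwise_distances(seqs):
    """Distances for every pair of sequence identifiers, keyed by (id_1, id_2)."""
    return {(i, j): p_distance(seqs[i], seqs[j])
            for i, j in combinations(sorted(seqs), 2)}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical aligned "full-length" sequences and a hypothetical region slice.
full_length = {"s1": "ACGTACGTAC", "s2": "ACGTACGTTT",
               "s3": "TCGAACGTAC", "s4": "ACGTTCGTAC"}
region = slice(3, 8)
trimmed = {name: seq[region] for name, seq in full_length.items()}

fl_dist = pairwise_distances(full_length)
vr_dist = pairwise_distances(trimmed)
pairs = sorted(fl_dist)  # same identifier pairs in both datasets
r = pearson_r([fl_dist[p] for p in pairs], [vr_dist[p] for p in pairs])
print(f"Pearson r between FL and trimmed distances: {r:.3f}")
```

Matching distances by identifier pair is what makes the correlation meaningful: each point compares the same two strains before and after trimming.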
Datasets description: Nine datasets in total, derived from soil bacterial 16S rRNA gene diversity screening results of previous studies using pyrosequencing, were used as templates for these analysis series. Major criteria for their selection were the range of sequence numbers per sample (26,000 to 54,000) and the read qualities. Studies with a corresponding dataset or sequence accession numbers used were: Roesch
Data analysis: V region performance was assessed by means of classification, operational taxonomic unit (OTU) and phylogenetic results for comparisons of each V region dataset with the FL dataset. For the classification-based analysis (performed with the parameters described above), sequences of each dataset were classified and sample distances were calculated using the Bray-Curtis transformation for relative abundance matrices and the Jaccard transformation for presence/absence matrices. The obtained pairwise distances were used as input for PCA of the corresponding sample distances between the generated datasets. Using the same methods, sample distances generated by an OTU approach with an OTU definition of 3% sequence distance were used to assess OTU differences between the generated datasets (V regions and FL) for relative abundance and presence/absence matrices. The phylogeny-based analysis included calculation of dataset distances based on the sample distances obtained per dataset, as calculated from the evolutionary relationships of the included sequences. The initial step was to run the relaxed neighbor joining algorithm implemented in the Clearcut application
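The two sample distance measures named above can be illustrated with a short sketch. The taxon counts are invented for the example and the function names are placeholders; the formulas are the standard Bray-Curtis dissimilarity on relative abundances and the Jaccard dissimilarity on presence/absence.

```python
def relative_abundance(counts):
    """Scale raw taxon counts so that they sum to 1."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample contains no sequences")
    return [c / total for c in counts]


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def jaccard(u, v):
    """Jaccard dissimilarity based on presence/absence of each taxon."""
    shared = sum(1 for a, b in zip(u, v) if a > 0 and b > 0)
    union = sum(1 for a, b in zip(u, v) if a > 0 or b > 0)
    return 1 - shared / union if union else 0.0


# Hypothetical counts of four taxa in two samples.
sample_a = [10, 0, 5, 5]
sample_b = [5, 5, 0, 10]

bc = bray_curtis(relative_abundance(sample_a), relative_abundance(sample_b))
jc = jaccard(sample_a, sample_b)
print(f"Bray-Curtis: {bc:.2f}, Jaccard: {jc:.2f}")
```

Computing both measures for every pair of samples yields the distance matrices that are then compared between the V region and FL datasets.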