Application of the stochastic labeling methods with random-sequence barcodes for simultaneous quantification and sequencing of environmental 16S rRNA genes

Next-generation sequencing (NGS) is a powerful tool for analyzing environmental DNA and provides the comprehensive molecular view of microbial communities. For obtaining the copy number of particular sequences in the NGS library, however, additional quantitative analysis as quantitative PCR (qPCR) or digital PCR (dPCR) is required. Furthermore, number of sequences in a sequence library does not always reflect the original copy number of a target gene because of biases caused by PCR amplification, making it difficult to convert the proportion of particular sequences in the NGS library to the copy number using the mass of input DNA. To address this issue, we applied stochastic labeling approach with random-tag sequences and developed a NGS-based quantification protocol, which enables simultaneous sequencing and quantification of the targeted DNA. This quantitative sequencing (qSeq) is initiated from single-primer extension (SPE) using a primer with random tag adjacent to the 5’ end of target-specific sequence. During SPE, each DNA molecule is stochastically labeled with the random tag. Subsequently, first-round PCR is conducted, specifically targeting the SPE product, followed by second-round PCR to index for NGS. The number of random tags is only determined during the SPE step and is therefore not affected by the two rounds of PCR that may introduce amplification biases. In the case of 16S rRNA genes, after NGS sequencing and taxonomic classification, the absolute number of target phylotypes 16S rRNA gene can be estimated by Poisson statistics by counting random tags incorporated at the end of sequence. To test the feasibility of this approach, the 16S rRNA gene of Sulfolobus tokodaii was subjected to qSeq, which resulted in accurate quantification of 5.0 × 103 to 5.0 × 104 copies of the 16S rRNA gene. Furthermore, qSeq was applied to mock microbial communities and environmental samples, and the results were comparable to those obtained using digital PCR and relative abundance based on a standard sequence library. We demonstrated that the qSeq protocol proposed here is advantageous for providing less-biased absolute copy numbers of each target DNA with NGS sequencing at one time. By this new experiment scheme in microbial ecology, microbial community compositions can be explored in more quantitative manner, thus expanding our knowledge of microbial ecosystems in natural environments.


Introduction 49
Quantifying and characterizing the taxonomic composition and diversity of 50 microbial communities in natural environments are primary foundations in microbial ecology. 51 Quantitative PCR (qPCR) using DNA-binding fluorescent dyes [1] or sequence-specific 52 probes (e.g., Taqman [2]) is a powerful and sensitive tool [3] for the quantification of a target 53 gene, which has been widely used in environmental microbiology (e.g., 16S rRNA genes) and 54 other biological research fields. However, these quantification methods use external standards 55 and sometimes result in inaccurate values due to differences in the efficiency of PCR with 56 "clean" standard DNA and "dirty" environmental DNA, which may also contain 57 PCR-inhibiting substances [4][5][6]. The efficiency of PCR can also be affected by the GC 58 content, secondary structure of the targeted sequence, bases adjacent to the 3' end of the 59 primers, and other factors [3,[7][8][9][10][11][12][13][14][15]. Those potential factors introducing biases always have 60 some risks to produce accurate and hence reliable quantification results for the study of 61 environmental microbial communities. Digital PCR (dPCR) is an approach that would 62 circumvent the above-mentioned issues, because it is less affected by the PCR efficiency and 63 provides the absolute copy number of DNAs without external standards [16,17]. However, 64 the both qPCR and dPCR quantification assay must be optimized for each target gene (or 65 taxa), necessitating the design of specific primers and standardized PCR conditions on a 66 taxon-by-taxon basis. Because the optimal condition (i.e. concentration of template DNA and 67 annealing temperature) is different among different primers specific for a taxa. In general, 68 such experimental processes are cumbersome and not likely amenable to high-throughput 69 analyses. 70 NGS of PCR-amplified 16S rRNA genes has been used to study microbial 71 community structures in a variety of environments, including the ocean [18,19], soils [20,21], 72 and the human body [22,23]. NGS enables the reading of tens of millions of sequences per 73 run, permitting the analysis of even "rare biosphere" members of a microbial community that 74 cannot be detected by conventional sequencing methods (e.g., Sanger method) [24,25]. This 75 advantage enables researchers to capture more comprehensive pictures of the naturally 76 occurring microbial communities. For quantification of particular sequences in the NGS 77

Method overview
107 Stochastic labeling of DNA molecules with a diverse array of random-sequence 108 tags during the procedure of tag-sequencing was employed to quantify 16S rRNA genes in 109 environmental DNA samples, allowing simultaneous sequencing and quantification. The 110 quantitave sequencing (qSeq) procedure is summarized in Fig 1. The first step involves 111 preparation of a primer for SPE that is composed of a sequence for indexing and sequencing, 112 a random barcode tag (RBT: random octamer [N8] in this study), and a sequence 113 complementary to that of the target genomic DNA in the 5' to 3' direction. This is followed 114 by enzymatic digestion of excess SPE primers to avoid incorporation of RBT during 115 subsequent PCR cycles. The SPE product is specifically amplified by PCR using primers 116 targeting the SPE primer sequence and genomic DNA. Subsequently, indexing and adapter 117 sequences are added to the PCR amplicon for Illumina sequencing. The resulting products 118 contain the RBT sequences at the beginning of reads that are incorporated during the SPE step. 119 Finally, the absolute number of target DNAs in the template environmental DNA sample is 120 estimated from the number of RBT obtained. This method has several advantages. First, counting the number of labels after 126 sequence classification enables simultaneous quantification of many different sequences; e.g., 127 at the level of bacterial or archaeal 16S rRNA genes or even finer levels of classification. 128 Second, the quantification is independent of external DNA standards, as are used in qPCR, 129 and because the number of random-barcode sequences is controlled by incorporation during 130 the SPE reaction, it is not affected by the efficiency of subsequent PCR steps. Hence, 131 quantification results are subject to less bias resulting from differences in PCR efficiency for 132 different target DNAs and/or sampling sites. 133 The incorporation of the random-tag sequence into the SPE product is a stochastic 134 process similar to dPCR [35,36]; i.e., the number of containers in dPCR is the same as the 135 number of random labels in the present method. The number of DNA molecules used as 136 templates for SPE and subsequent PCR can be estimated based on a Poisson distribution. 137 Thus, the number of target DNA molecules (N) is approximately given by equation 1, where 138 C represents the the number of unique RBT (e.g., in this study, C is 4 8 = 65,536, because 139 random octamers [N8] is used for the RBT) and H represents the unique random tags 140 incorporated into the SPE product, respectively: 141 The expected increment (P) in number of observed unique RBT per every increment 143 of a sequence at a given number (n) is determined using equation 2: 144 The observed number of unique RBT (S) is given by equation 3, because P n is an 146 arithmetic sequence where r = (H−1)/H: 147 With in crease in n, S n converges with H, in other word, by increasing the number 148 of sequence reads, the number of unique RBT observed increases and gets close to the the 149 number of incorporated unique random tags. Therefore, equation 4 is used to estimate the 150 number of target DNA molecules when n is large enough relative to H: to 5.0 × 10 5 copies were used to determine the dynamic range of the method. Precise 157 quantification was achieved across the entire range of DNA copy numbers examined. Please 158 note, however, that the accuracy at copy numbers over 1.0 × 10 5 was relatively low (Fig 2), 159 which may result in significant underestimation. This is most likely due to insufficient 160 labeling because the number of labels is lower than the number of template DNA molecules. 161 This relatively narrow dynamic range comparing to qPCR is a drawback of the current 162 protocol, however the upper quantification limit could be increased simply by increasing the 163 length of the random barcode; e.g., by adding 1 base to N8 would provide 4 9 = 2.62 × 10 5 164 barcode sequences. However, more than 10 6 reads per sample are required to recover 10 5 165 barcode sequences incorporated into the DNA. We anticipate that further developments in 166 sequencing technologies will enable much higher throughput and resolve the trade-off issue, 167 although applying right amount of template DNA below the upper limit would be 168 straightforward. More accurate quantification (78-109% of expected) was observed in the 169 DNA copy number ranging from 5.0 × 10 3 to 5.0 × 10 4 . In this study, therefore, we utilized 5 170 × 10 3 to 5 × 10 4 copies of the target DNA for each qSeq analysis to maximize the 171 quantification accuracy. We did not test fewer copy number than 5.0 × 10 3 in this study 172 because 5.0 × 10 3 is roughly equal to ~5 pg of DNA and using that few DNA in NGS might 173 not provide meaningful data upon 16S rRNA gene survey for natural microbial habitats. 174 Slight underestimations observed with lower copy numbers could be caused, in part, by 175 reduced reaction efficiency, especially the efficiency of SPE priming. To remedy this, we 176 tested longer reaction times, higher primer concentrations, and slower ramp rates. Changing 177 these parameters did not significantly improve the results, however. In future studies, 178 improvement might be achieved using DNA analogues as locked nucleic acids, which have 179 higher affinity for DNA in the SPE primer [37,38]. Errors in PCR and/or sequencing represent another issue of concern, because the 186 qSeq method counts any such error as a barcode sequence, potentially resulting in 187 overestimation of target DNA molecules. To reduce the effect of these errors, the number of 188 sequence reads used for the analysis should be kept to a minimum. In this study, we used 189 10-fold number of sequences for quantification compared with the expected copy number and 190 did not observed such an overestimation in the dynamic range test.

Methanocaldococcus jannaschii, Halomonas elongata, S. tokodaii, Bacillus subtilis, 195
Streptomyces avermitiis, and Paracoccus denitrificans was sequenced by the standard 196 Illumina sequencing and qSeq. The total copy number was 2.0 × 10 4 , which was within the 197 dynamic range as described above. The number of RBT was counted after sorting of the qSeq 198 results by phylogeny, and the 16S rRNA gene copy number was estimated using the equation 199 as described above. As a result, qSeq could provide copy number of each microbial species 200 and the quantified copy numbers were consistent with the expected value (Table 1). Based on 201 the estimated copy number, relative abundances of each species were calculated ( Fig 3A). In 202 the standard sequencing analysis, the proportion of each species was calculated based on the 203 relative ratio in the resulting sequence library. Standard sequencing library analysis 204 underestimated the relative abundance of the 16S rRNA gene of M. jannaschii, whereas the H. 205 elongata and S. tokodaii 16S rRNA were overestimated (Fig 3A). The GC content of these 206 16S rRNA genes, which is well known to affect PCR efficiency, is 31, 61, and 33% for M. 207 jannaschii, H. elongata and S. tokodaii, respectively. Quantification bias cannot be predicted 208 or explained based on GC content alone, and it is likely that other factors may be involved, 209 such as priming efficiency, secondary structure, and/or bases adjacent to primers [7-10]. In 210 contrast, the qSeq returned more accurate and reproducible data in good agreement with the 211 expected ratios of the 16S rRNA gene copy numbers of each species examined. 212 213 214 In addition, two mock communities consisting of genomic DNA of two microbial 219 species were tested. One was the mixture of the genomic DNA of M. jannaschii and H. 220 elongata with 16S rRNA gene ratios ranging from 1:9 to 9:1 and the other was the mixture of 221 S. avermitilis and B. subtilis. Both mock communities were subjected to the standard Illumina 222 sequencing and the qSeq. The ratio of sequence reads in the sequence library was calculated 223 and compared to the results obtained by the qSeq as described above. We found that 16S 224 rRNA gene of M. jannaschii was always underestimated in the sequence library (t-test, p < 225 0.05 for all 3 plots), whereas the qSeq approach provided accurate ratios over the entire range 226 tested ( Fig 3B). This result was consistent with the results obtained from the 6-species mock 227 community, indicating relative abundance in normal sequencing library can be biased as 228 previously reported [10][11][12][13]15]. We also tested the mock community of S. avermitilis and B. 229 subtilis of which relative abundances in the 6-species mock community were similar ( Fig 3A). 230 As expected, both the ratios of S. avermitilis to B. subtilis obtained by sequence library and 231 qSeq are consistent with expected ratio (Fig 3C). The qSeq approach was tested to some environmental samples, such as beach sand , 244 paddy field soil, biofilm, and hot spring water sample. Bacterial and archaeal 16S rRNA 245 genes were quantified separately for the reference using a microfluidic dPCR technique with 246 domain-specific primers. The microfluidic dPCR enables determination of the absolute target 247 DNA sequence copy number unaffected by biases introduced by differences in PCR 248 efficiency [16]. Compared with the dPCR data, qSeq provided slightly lower copy numbers 249 for bacterial and archaeal 16S rRNA genes in the paddy filed soil and biofilm sample (Fig  250   4A). In contrast, for the beach sand and hot spring water sample, qSeq provided higher copy 251 numbers for bacterial and archaeal 16S rRNA genes. Although the results were not directly 252 comparable due to differences in primer coverage, all the results for both techniques were 253 within the same order of magnitude, indicating that qSeq enables accurate and precise 254 quantification of those environmental DNA examined, at least as comparable to microfluidic 255 dPCR. Moreover, the bacteria to archaea copy number ratios, which were determined using 256 dPCR and qSeq as well as from the standard Illumina sequence library, were generally 257 consistent: 0.60, 0.60, and 0.49, respectively, for biofilm; 0.43, 0.25, and 0.18, respectively, 258 for hot spring; 0.024, 0.033, and 0.030, respectively, for paddy field soil; and 0.039, 0.040, 259 and 0.060, respectively, for beach sand (Fig 4B). The ratios obtained from dPCR and qSeq 260 were consistent each other in all the samples except for hot spring water, indicating, qSeq 261 could quantify 16S rRNA without bias regarding amplification efficiency. In the hot spring 262 sample, dPCR resulted in significantly higher ratio than the other two methods probably due 263 to difference in coverage of the primer sets used for amplification. Overall, the result obtained 264 from qSeq was shown to be as accurate as currently used methods for the analysis of 265 environmental samples. 266 The qSeq protocol allows us to compare quantity of 16S rRNA gene in different 267 environments. Table 2 shows an example for quantification in which the copy numbers of 16S 268 rRNA gene of selected 5 classes were estimated by counting random sequence tags after 269 classification of the sequence reads. Methanomicrobia, one of the classes containing 270 methanogenic Archaea mainly from paddy and lakes, is detected in the paddy files soil to be 271 (1.43±0.14) × 10 7 copies/g-sample, whereas it is absent in the beach sand. As for 272 Proteobacteria, the paddy field contains 5 times more β-Proteobacteria than the beach sand 273 while ɣ-Proteobacteria populations were in the same order of magnitude. These examples 274 demonstrates that qSeq has a great advantage that copy numbers can be directly quantified 275 after sequencing without additional quantitative approaches as qPCR or dPCR. for biofilm and hot spring sample, the ratio was calculated from single sample. 283

288
In summary, in analyses of a mock microbial community and actual environmental 289 samples, we demonstrate that the qSeq protocol optimized for environmental DNA enables 290 simultaneous sequencing and gene quantification. The absolute copy number of S. tokodaii 291 16S rRNA gene was accurately quantified by qSeq in the template copy range of 10 3 to 5 × 292 10 4 . The mock microbial community tests showed that qSeq enables accurate determinations 293 of the ratios of microbial community members and that the technique would be less affected 294 by possible biases due to differences in PCR efficiency. The quantification accuracy and 295 precision of bacterial and archaeal 16S rRNA genes in environmental samples assessed by 296 qSeq were comparable to microfluidic dPCR. The qSeq technique described here enables 297 parallel sequencing and quantification of many different target DNAs at one time, and we 298 anticipate that it will be of great value in future studies in environmental microbiology. 299 300

302
Beach sand and soil samples were collected from the coast and from a paddy field 303 near the Kochi Institute for Core Sample Research, JAMSTEC, in Nankoku, Kochi, Japan, 304 using a 50 ml tip-cut syringe for the each site. Biofilm sample and hot spring water sample 305 were collected at a stream in Garandake which is an active volcano located in Oita, Japan. 306 The collected hot spring water was filtrated by a polycarbonate filter at the sampling site. 307 All the sample including the filter were stored at −80°C prior to analysis. DNA was extracted 308 using a PowerMax Soil DNA Isolation Kit (MOBIO). The extracted DNA was further 309 concentrated by isopropanol precipitation and finally resuspended in TE buffer. To compare relative quantification results for library sequences using 328 random-labeling quantification, standard Illumina sequencing was performed as described 329 elsewhere [39,40], using the universal primers N8-515F and N8-806R targeting the V4 330 region of 16S rRNA genes (Table 1). After purification of the PCR product of the expected 331 size, indexing PCR was performed using a Nextera XT indexing kit (Illumina), followed by 332 sequencing using a MiSeq platform with MiSeq Reagent Kit v2 for 500 cycles or v3 for 600 333 cycles (Illumina). 334

Quantification by qSeq 335
For qSeq, SPE was performed using MightyAmp DNA polymerase ver. 2 (TaKaRa 336 Bio) with 0.3 µM N8-515F primer in a total reaction volume of 20 µl. The amount of 337 template DNA was 300 pg for environmental samples, and 2.0 × 10 4 copies of 16S rRNA 338 gene were used for the analysis of a mock microbial community sample. For dynamic range 339 testing, 5.0 × 10 3 to 5.0 × 10 5 copies of 16S rRNA gene from the genomic DNA of S. tokodaii 340 were used. The SPE conditions consisted of initial denaturation at 98°C for 2 min, cooling to 341 68°C at 0.3°C/s, and elongation at 68°C for 10 min. Subsequently, excess primers were 342 digested by addition of 4 of Exonuclease I (5 U/µl, TaKaRa Bio) and incubation at 37°C for 343 120 min, followed by inactivation of the enzyme by incubation at 80°C for 30 min. A 5-µl 344 volume of the SPE product was used for the first-round PCR; each 20-µl reaction contained 345 0.3 µM N8-U806R and F2-primers (Table 3), 1× MightyAmp buffer ver. 2, and 0.4 µl of 346 MightyAmp DNA polymerase. Thermal cycling consisted of initial denaturation at 98°C for 2 347 min, followed by 40 cycles of 98°C for 10 s, 55°C for 15 s, and 68°C for 30 s. Finally, 15 µl 348 of each PCR product was subjected to agarose gel electrophoresis, and the band of the 349 expected size was excised and purified using a Nucleospin column (TaKaRa Bio). The 350 purified PCR products were finally subjected to indexing PCR and sequencing by Miseq, as 351 described above in the section of Standard Illumina DNA sequencing. 352 353 The absolute copy numbers of bacterial and archaeal 16S rRNA genes in 357 environmental samples were determined by microfluidic dPCR using a BioMark Real-time 358 system and 12.765 Digital Array (Fludigm) as described previously [16]. The domain-specific 359 primer pairs B27F-B357R and A806F-A958R were used for bacterial and archaeal 16S rRNA 360 genes, respectively (Table 3). 361

362
All sequence data obtained in the study were processed using the Mothur software   Quantified copy number (Log10 copies)