Advantages of Single-Molecule Real-Time Sequencing in High-GC Content Genomes

Next-generation sequencing has become the most widely used sequencing technology in genomics research, but it has inherent drawbacks when dealing with high-GC content genomes. Recently, single-molecule real-time sequencing technology (SMRT) was introduced as a third-generation sequencing strategy to compensate for this drawback. Here, we report that the unbiased and longer read length of SMRT sequencing markedly improved genome assembly with high GC content via gap filling and repeat resolution.


Introduction
Technical advances in DNA sequencing are key to the current capacity to complete organismal genomes, especially microbial genomes, rapidly and at low cost using next-generation sequencing (NGS) technologies such as Illumina Genome Analyser, SOLiD and Roche 454 platforms.However, the low cost and highthroughput of NGS are still insufficient for complete sequencing of genomes with high GC contents because this technology relies on a template amplification phase prior to sequencing, leading to biased coverage when using a high-GC genome as the template [1,2,3].The Single-Molecule Real-Time (SMRT) sequencing technology recently developed by Pacific Biosciences (PacBio RS) avoids the amplification step and provides sequence data for individual template molecules, minimising the risk of introducing substitutions and/or low bias during amplification [4,5].Therefore, this method is expected to compensate for the major drawback of next-generation sequencing of high-GC content genomes.
PacBio RS is generally applied to two types of sequencing, i.e.Continuous Long Reads (CLRs) and Circular Consensus Sequencing (CCS) reads.CLR involves single-pass SMRT reads and ,10 kb in length with only 82.1%-84.4% base accuracy [4].CCS reads are consensus sequences obtained from multiple passes on a single sequence with relatively short read lengths (,2 kb) and a low error rate [6].Despite the low accuracy of CLRs, the longer read length and low bias have major advantages with regard to resolving complex repeats and filling the gaps in de novo assembly.Therefore, tools for correcting lowquality reads generated by PacBio RS have been developed, including LSC, p-errormodule of SMRT analysis (http://www.pacificbiosciences.com)and pacbioToCA [7,8].Using pacbio-ToCA, CLR sequences obtained by mapping high-quality shortread sequences were corrected with high-quality reads and achieved .99.9% accuracy.The statistics of assembly were markedly improved in de novo assembly with error-corrected reads.Also, CCS reads have been reported to improve yield and mean read length in comparison to Illumina short reads in error correction and in genome assembly with moderate GC content (http://www.pacificbiosciences.com).
Here, we evaluated the utility of the Pacific Biosciences RS platform for the sequencing of the high-GC content genome of Streptomyces sp., an endosymbiotic bacterium isolated from the Antarctic lichen Cladonia borealis with an estimated G+C content of 70.89%, to assess the advantages of unbiased single-molecule sequencing.

Genome sequencing
The endosymbiotic bacterium Streptomyces sp.PAMC 26508 was isolated from the Antarctic lichen Cladonia borealis.Genomic DNA for Streptomyces sp.PAMC 26508 was prepared according to Nikodinovic et al. [9].For PacBio RS sequencing, two types of libraries were made with 1.5-kb and 8-kb sheared genomic DNA, and prepared using the standard PacBio RS sample preparation methods with C1 chemistry specific to each insert size.The 8-kb sample was sequenced on 1 SMRT cell with a 1690 min collection protocol, and the 1.5-kb sample was sequenced on 8 SMRT cells with a 2645 min collection protocol.A 300-bp paired end library for Illumina Hiseq 2000 and 7-kb paired end library for GS-FLX titanium were prepared, and sequencing was performed according to the manufacturers' instructions.All sequencing processes were performed using the services of DNA Link, Inc.

Contig ordering of the assembly PBcR SR(506)+CCS +454, finishing of the genome, and circular map
To determine the order of contigs in the assembly PBcR SR(506)+CCS +454, we designed primers for the flanking region of ribosomal DNA at the end of each contig and performed PCR using primer combinations.In addition, we used the sequences of the resulting PCR products to close all the gaps in the assembly PBcR SR(506)+CCS +454.A circular map of contigs between assemblies and coverage plot of assembly with PBcR SR(506)+CCS was visualised using Circos [15].Coverage value across the contigs was calculated using the command genome-CoverageBed of BEDTools [16].

Data access
The raw data are available via NCBI.Accession numbers are SRA062237 for Short Read Archive, CP003990 for Chromosome, and CP003991 for Plasmid.

Results
We combined three sequencing platforms: PacBio RS, GS-FLX titanium and Illumina Hiseq 2000 (Table 1).First, CLRs were corrected with high-accuracy sequences of Illumina or CCS reads with the pacBio-corrected Read (PBcR) algorithm [7].Using 506 Illumina Short Reads (SRs) to correct 256 CLRs generated 146corrected reads with the PBcR algorithm (Fig. 1a and Table 1).However, when we corrected CLRs with 1006 and 2006 SRs, additional SRs did not increase the mean read length or total bases.We examined whether unbiased CCS reads with improved sequencing accuracy could increase the throughput in error correction with high GC content and found that the addition of 266CCS reads to 506SRs in error correction increased throughput with 16genome coverage and the average read length to 1.56 kb.In the PBcR algorithm, high-quality reads were aligned to CLRs, and the aligned regions were corrected with high quality.Then, CLRs were split into multiple fragments at unaligned regions.CCS reads improved the results through filling the regions that could not be aligned based on SRs (Fig. 1b, Fig. 2 and Table 2).After error correction, both CLRs corrected using 506SRs (PBcR SR(506) ) and CLRs corrected using 506SRs and 266CCS reads (PBcR SR(506)+CCS ) showed high quality (.99.9%) (Fig. 1c).We also estimated the true accuracies of PBcR SR(506) and PBcR SR(506)+CCS by mapping to the contigs of the assembly SRs(1006)+454 with BLAST, and the estimate accuracy was 99.97% and 99.95%, respectively (Table S2).
We compared the results of assemblies using SRs and PBcR with Celera Assembler [10]; 86454 reads, paired end library with an insert length of 7 kb, were used to produce longer and more accurate scaffolds in all assemblies (Table 3 and Fig. 3a).The SRs(2006)+454 assembly showed that use of more than 1006SRs could not increase the N50 contig size and filling the gap as much as the increment obtained in SRs, and also that the numbers of scaffolds and contigs were not increased.However, with assembly using only PBcR SR(506) , the contig number was reduced by half (54 contigs) and the N50 contig size was increased to 410 kb compared with the assembly of SRs(1006)+454.In addition, using error-corrected PBcR with a combination of 506 SRs and 266 CCS reads, the assembled results using PBcR SR(506)+CCS reducing the contig nember to 6, 5 contigs comprising chromosome and 1 contig comprising a plasmid and increased the N50 contig size to 1.43 Mb compared with the assembly of SRs(1006)+454.Most ends of contigs comprising chromosome sequences corresponded to the region of ribosomal RNA operons (.5.7 kb long); so PBcR SR(506)+CCS (mean 1.56 kb) was shown to be too short for spanning this repeat region in this assembly.We validated assemblies by aligning the contigs assembled using SRs to those assembled with PBcR SR(506) and PBcR SR(506)+CCS with the MUMmer sequence alignment tool (Fig. 4a and Fig. 4b).Although some regions of contigs showed different orders, the overall contig sequence identity of assemblies was estimated to be 99.99%.We also validated the regions showing disagreement in the alignment by PCR and Sanger sequencing (Fig. 4c, Fig. 4d, Fig. 4e and Figure S1), and demonstrate that PBcR with longer read length was more efficient for resolving interspersed and tandem repeats (Fig. 3b, Fig. 4e and Fig. 5).Assembly Likelihood Evaluation (ALE) framework, one of the recently pulished assembly likehood tools evaluating the accuracy of an assembly in a reference-independent manner, also showed that both PBcR and CCS increase the accuracy of an assembly (Figure S2).
Finally, we further investigated whether PBcR has an important role in gap filling in the assembly of a genome with high GC content.For example, in one of subset of the contigs, contig 93 of the assembly PBcR SR(506)+CCS +454 was split into 4 contigs in the assembly PBcR SR(506) +454 and 12 contigs in assembly SRs(1006)+454 (Fig. 3c).The graph of alignment coverage in the assembly SRs(1006)+454 showed that high-coverage Illumina reads could not fill the gaps between contigs, it appears that the high-GC content within the gaps led to bias in illumine sequence [1], but PBcR could fill these gaps with sufficient coverage (gaps 3, 6 and 11).These results indicated that unbiased SMRT sequencing may be sufficient to fill the gaps generated by nextgeneration sequencing technology in the assembly of a genome with high GC content.

Discussion
Recently, most genome sequencing projects are carried out using automated applications of the NGS sequencing techniques, these newly developed methodologies may enable even more rapid bacterial genome sequencing.A genome project progresses through phases of data acquisition, assembly of the sequence reads, and then annotation and exploitation of the assembled data.During the data analysis, the de novo genome assemblies are potentially two incompleteness.First, Individual reads from NGS platform can have errors because they require amplification of source DNA before sequencing, leading to amplification artifacts and biased coverage of the genome, also, they have shown frequently incorrect read in homopolymer and/or very shot repeat regions.Second, they produce relatively short reads (median lengths of 100 bp for Illumina and about 700 bp for 454), it make assembly and related analyses difficult leading to transcript variants, although more computational power and several assembler has been developed.Recently, the Pacific Biosciences technology, which is based on single-molecule realtime (SMRT) DNA sequencing and the lack of amplification in the library construction step, provides a fundamentally new data type that provides the potential to overcome these limitations by providing significantly longer reads (now averaging .1 kb).Especially, Next-generation sequencing produces more gaps in the assembly of a genome with high GC content than in a genome with moderate GC content.Many could not readily be amplified by PCR, even if the regions of gaps were amplified, and could not easily be sequenced.We showed that PacBio RS SMRT sequencing is advantageous to resolve this problem in genome assembly with unbiased high-throughput sequencing and longer read length.
The genome of Streptomyces sp.PAMC 26508 has a 7,526,197 base pair linear chromosome with 70.89% GC content, and it contained 1 plasmid with 104,048 base pair.PacBio read data (PBcR and CCS) can fill the 88 gaps of high-GC repeat region with sufficient coverage, and also it has shown efficiently resolve interspersed and short tandem repeats, which it cannot overcome with high coverage NGS data.
In summary, the results of assembly with PBcR showed higher advantage to those of a genome with high GC content and repetition geneome structure.Moreover, CCS reads were important in improving assembly using CLRs in the errorcorrection process and assembly.The numbers along the track indicate kilobase coordinates along the contig.The highlighted region H01 indicates the region of mis-assembled contig by repeat (Fig. 3b) and the highlighted region H02 indicates the representative region showing the differences in assemblies (Fig. 3c).(b) Red arrow indicates interspersed repeat sequences of the integrase gene.Contigs assembled from SRs(1006) with short read length were mis-assembled and split into three contigs by two integrase genes with identical sequences (600 bp long), but both PBcR SR(506) and PBcR SR(506)+CCS could resolve repeats due to their ability to span repeats.(c) The box indicates two types of gap: the black box indicates the gaps generated by assembly with both SRs(1006) and PBcRs reads, and the yellow box indicates the gaps generated by assembly with only SRs(1006) reads.Black line is GC content, and green, blue and red lines are each coverage, respectively.Each coverage and the average GC content for 25 base window of the flanking 1-kb region of gaps in assemblies.Gaps generated by assembly using short reads were filled with sufficient coverage of PBcRs, and PBcR SR(506)+CCS was able to span more gaps than PBcR SR(506) .The local GC content of gaps is relatively higher than contigs.doi:10.1371/journal.pone.0068824.g003

Figure 1 .
Figure 1.Statistics of error-corrected reads.(a) The length distribution of CLRs and PBcRs.Error correction of CLRs with Illumina short reads (506, 1006 and 2006 coverage) showed similar length distributions.Larger numbers of Illumina short reads did not improve the results of error correction in the mean length of reads and throughput, but CCS reads increased both in mean length and throughput.(b) CCS increased the throughput of error correction by joining the break positions with no short-read coverage.(c) Base qualities of CLRs and PBcRs, where the x-axis correspnds to base position and the y-axis to the average Phred quality score.doi:10.1371/journal.pone.0068824.g001

Figure 2 .
Figure 2. Results of error correction using 506 SR and 166 CCS reads.HAWKEYE indicated how to correct the errors of CLR with SRs (blue) and CCS reads (red).The numbers indicate the regions aligned with only CCS reads.CCS reads improved the throughput of error correction by spanning the unaligned region by SRs.doi:10.1371/journal.pone.0068824.g002

Figure 3 .
Figure 3. Streptomyces sp.PAMC 26508 assembly.(a) The outermost track (pink) represents the complete genome sequence of Streptomyces sp.PAMC 26508, the middle track (red) represents assembly with PBcR SR(506)+CCS , the inner track (blue) represents assembly with PBcR SR(506) and the next track (green) represents assembly with SR.The innermost track (red line) indicates the read coverage of assembled contigs with PBcR SR(506)+CCS .The numbers along the track indicate kilobase coordinates along the contig.The highlighted region H01 indicates the region of mis-assembled contig by repeat (Fig.3b) and the highlighted region H02 indicates the representative region showing the differences in assemblies (Fig.3c).(b) Red arrow indicates interspersed repeat sequences of the integrase gene.Contigs assembled from SRs(1006) with short read length were mis-assembled and split into three contigs by two integrase genes with identical sequences (600 bp long), but both PBcR SR(506) and PBcR SR(506)+CCS could resolve repeats due to their ability to span repeats.(c) The box indicates two types of gap: the black box indicates the gaps generated by assembly with both SRs(1006) and PBcRs reads, and the yellow box indicates the gaps generated by assembly with only SRs(1006) reads.Black line is GC content, and green, blue and red lines are each coverage, respectively.Each coverage and the average GC content for 25 base window of the flanking 1-kb region of gaps in assemblies.Gaps generated by assembly using short reads were filled with sufficient coverage of PBcRs, and PBcR SR(506)+CCS was able to span more gaps than PBcR SR(506) .The local GC content of gaps is relatively higher than contigs.doi:10.1371/journal.pone.0068824.g003