Single molecule sequencing of the M13 virus genome without amplification

  • Luyang Zhao,

    Roles Conceptualization, Data curation, Writing – original draft, Writing – review & editing

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Liwei Deng,

    Roles Investigation, Methodology

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Gailing Li,

    Roles Investigation

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Huan Jin,

    Roles Formal analysis, Software

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Jinsen Cai,

    Roles Investigation

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Huan Shang,

    Roles Investigation

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Yan Li,

    Roles Investigation

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Haomin Wu,

    Roles Investigation

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Weibin Xu,

    Roles Formal analysis, Software

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Lidong Zeng,

    Roles Formal analysis, Software

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Renli Zhang,

    Roles Validation

    Affiliation Reproductive Medical Center of Guangdong General Hospital & Guangdong Academy of Medical Sciences, Guangzhou, China

  • Huan Zhao,

    Roles Validation

    Affiliation Shenzhen Armed Police Hospital Reproductive Center, Luohu District, Shenzhen, China

  • Ping Wu,

    Roles Methodology

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Zhiliang Zhou,

    Roles Methodology

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Jiao Zheng,

    Roles Resources

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Pierre Ezanno,

    Roles Resources

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Andrew X. Yang,

    Roles Writing – review & editing

    Affiliation Department of Biology, South University of Science and Technology of China, Shenzhen, Guangdong, China

  • Qin Yan,

    Roles Project administration

    Affiliation Direct Genomics Co., Ltd., Shenzhen, Guangdong, China

  • Michael W. Deem,

    Roles Writing – review & editing

    Affiliation Departments of Bioengineering and Physics & Astronomy, Rice University, Houston, TX, United States of America

  • Jiankui He

    Roles Conceptualization, Data curation, Funding acquisition, Project administration, Writing – original draft, Writing – review & editing

    hejk@sustc.edu.cn

    Affiliation Department of Biology, South University of Science and Technology of China, Shenzhen, Guangdong, China


Abstract

Next generation sequencing (NGS) has revolutionized life sciences research. However, GC bias and costly, time-intensive library preparation make NGS an ill fit for increasing sequencing demands in the clinic. A new class of third-generation sequencing platforms has arrived to meet this need, capable of directly measuring DNA and RNA sequences at the single-molecule level without amplification. Here, we use the new GenoCare single-molecule sequencing platform from Direct Genomics to sequence the genome of the M13 virus. Our platform detects single-molecule fluorescence by total internal reflection microscopy, with sequencing-by-synthesis chemistry. We sequenced the genome of M13 to a depth of 316x, with 100% coverage. We determined a consensus sequence accuracy of 100%. In contrast to GC bias inherent to NGS results, we demonstrated that our single-molecule sequencing method yields minimal GC bias.

Introduction

The sequencing of the human genome [1, 2] and the ensuing development of next-generation sequencing (NGS) technologies have revolutionized the life sciences and brought new approaches to applications as diverse as pathogen detection [3, 4], forensics [5, 6], and clinical diagnosis [7, 8]. The advent of precision medicine [9] promises profound advances in the clinic, leveraging sequencing results for diagnosis of cancer [10, 11] and inherited disease [12, 13]. Despite the advantages of NGS platforms, the costly, time-intensive process of NGS sample library preparation and the use of polymerase chain reaction (PCR) amplification limit the efficiency and practicality of NGS in the clinic.

The preparation of DNA libraries in NGS generally requires a preliminary step based on PCR amplification. This process introduces bias and can result in incorrect interpretation of raw data [14, 15]. The popular Illumina sequencing platform produces data sets with uneven coverage and serious defects in GC-poor or GC-rich regions. Low coverage regions could be interpreted as sequencing errors by most current assemblers [16], and high coverage regions could be interpreted as repetitive sequences [17, 18], introducing hard-to-correct errors in NGS results. Much effort has gone into improving protocols for NGS library preparation to reduce or fully suppress GC bias [19, 20]. Single-molecule (SM) sequencing circumvents these library preparation issues by avoiding PCR amplification altogether.

First proposed in 1989 [21], SM sequencing is now seen as the successor to NGS [22] in the progression of sequencing technology development. Different SM sequencing technologies have rapidly developed over the past decade, with progress on read length, sequencing time, and data throughput. Three technologies are now well known, each with their unique characteristics: (i) the first true single molecule sequencing (tSMS) combined with sequencing-by-synthesis (SBS) [23] technology from Helicos Biosciences [24, 25]; (ii) single molecule real time (SMRT) sequencing technology from Pacific Biosciences producing super long read length (longer than 10k bases [26, 27]), but relatively low throughput; and (iii) Oxford Nanopore Technologies, producing long read length (6k bases [26]) but limited accuracy and low throughput. The GenoCare platform improves on principles from the Helicos Biosciences platform.

A combination of minimal, amplification-free sample preparation and efficient, massively parallel processing of short reads is ideal for the demands of sequencing-based clinical diagnosis. Advantages of GenoCare SM sequencing include (i) a simple and time-saving sample preparation consisting of DNA shearing followed by poly-A tailing and 3' end blocking steps, (ii) absence of PCR amplification and its associated base substitution errors, and (iii) potential for RNA SM sequencing for investigation of transcriptomic aspects of gene expression.

Our approach is devised to provide simple operation and high-throughput, unbiased data. Recently, we demonstrated a direct targeted sequencing of cancer related gene mutations at the SM level [28]. In this study, we describe the performance of our new GenoCare platform for SM sequencing without preliminary PCR amplification.

Results

Sequencing process

Our SBS scheme is shown in Fig 1. Sample preparation is simple, fast, and amplification-free. M13 genomic DNA was sheared into fragments of ~200 bp, poly-A tailed with a tail length of 50–100 nt, and blocked with ddATP-Cy3. Sequencing surfaces were chemically modified and covalently bound with poly(T) oligonucleotides, allowing for hybridization with target DNA. Once the templates were annealed, the remaining unpaired adenines of the poly-A tail were filled in with natural nucleotides, and the primer was locked with one reversible terminator.

Fig 1. Sample preparation and sequencing process for single molecule sequencing of biological samples.

https://doi.org/10.1371/journal.pone.0188181.g001

The single molecule SBS process has been described previously [28]. Each cycle includes terminator incorporation, imaging, fluorophore cleavage, and residual bond capping. The GenoCare platform uses total internal reflection fluorescence (TIRF) microscopy to observe single molecules. An integration time of 200 ms gave a good signal-to-noise ratio while limiting photobleaching of the dyes. Just 0.5% of one flow-cell channel was needed to resequence the M13 virus genome. We sequenced 80 cycles (20 quads of CTAG), and analyzed the images to perform base-calling (S1 File and S1 Scheme). Sequence data were uploaded to the NCBI Sequence Read Archive (SRA) with accession number SRR6168186. Sample preparation took 3 hours and instrument run time was 9 hours.
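To make the cycle structure concrete, the following is a minimal sketch of how per-cycle detections could be turned into a read when nucleotides are added in a repeating CTAG order. It is not the image-analysis pipeline described in S1 File; the function name `call_read` and the boolean `signals` input are assumptions introduced here for illustration only.

```python
from itertools import cycle, islice

# Terminator addition order from the text: one quad = C, T, A, G.
ADDITION_ORDER = "CTAG"


def call_read(signals, order=ADDITION_ORDER):
    """Build a read for one molecule from per-cycle detection flags.

    signals[i] is True if a fluorescent spot was seen at this molecule's
    position in cycle i (hypothetical input; the real pipeline starts
    from raw images). Each cycle offers a single base, so a detection in
    cycle i contributes the base offered in that cycle.
    """
    offered = islice(cycle(order), len(signals))
    return "".join(base for base, hit in zip(offered, signals) if hit)


# Two quads (8 cycles): bases offered are C T A G C T A G.
signals = [True, False, True, False, False, True, True, False]
print(call_read(signals))  # CATA
```

A spot that lights up in every cycle would yield `CTAGCTAG...`, that is, a read that simply reproduces the addition order.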

Genome coverage

A total of 104,802 reads were uniquely aligned to the reference genome, accounting for 25.4% of the total reads. Reads matching any of the following criteria were discarded: 1) reads shorter than 13 bases after alignment, 2) reads including a sequence exactly matching the terminator addition order, indicating non-specific adsorption, and 3) reads mapped to multiple locations on the reference genome. Among mapped reads, the dominant error was deletion (1.65%), followed by insertion (0.78%) and substitution (0.69%) (Table 1). We also calculated the error rates separately in homopolymer and non-homopolymer regions, where a homopolymer was defined as 3 or more identical bases in a row. In homopolymer regions, the substitution error rate was 1.23%, followed by insertion (1.04%) and deletion (0.86%). In non-homopolymer regions, the error rates were 0.60% for substitution, 0.71% for insertion, and 1.84% for deletion. The relatively low deletion rate in homopolymer regions indicates satisfactory blocking efficiency. Because a deletion followed by an insertion can also be called as a substitution, we compared the total error rates (3.13% in homopolymer regions vs 3.15% in non-homopolymer regions); the near-identical totals show that our method does not suffer from the homopolymer issue [29].
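The three discard criteria can be sketched as a single predicate. This is an illustration under stated assumptions rather than the authors' code: the `AlignedRead` record and its field names are hypothetical, and because the text does not say how long a match to the addition order must be, the `order_repeats` parameter (two consecutive CTAG quads by default) is an arbitrary choice made here.

```python
from dataclasses import dataclass

ADDITION_ORDER = "CTAG"   # terminator addition order from the text
MIN_ALIGNED_LENGTH = 13   # reads shorter than this after alignment are dropped


@dataclass
class AlignedRead:
    """Hypothetical record for one read after alignment."""
    sequence: str        # called bases
    aligned_length: int  # number of bases remaining after alignment
    n_locations: int     # number of reference positions the read maps to


def keep_read(read, order_repeats=2):
    """Return True if the read survives all three filters.

    order_repeats is an assumption: the number of consecutive CTAG quads
    treated as the signature of non-specific adsorption.
    """
    if read.aligned_length < MIN_ALIGNED_LENGTH:
        return False  # criterion 1: too short
    if ADDITION_ORDER * order_repeats in read.sequence:
        return False  # criterion 2: mirrors the addition order
    if read.n_locations != 1:
        return False  # criterion 3: unmapped or multiply mapped
    return True


reads = [
    AlignedRead("ACGTTGCAAGGCTTA", 15, 1),     # kept
    AlignedRead("ACGTTGCAAGGC", 12, 1),        # too short
    AlignedRead("AACTAGCTAGTTGCAAGG", 18, 1),  # contains CTAGCTAG
    AlignedRead("ACGTTGCAAGGCTTA", 15, 2),     # maps to two locations
]
print([keep_read(r) for r in reads])  # [True, False, False, False]
```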

Most reads (53,100) aligned perfectly to the reference with no errors, and aligned reads had at most 3 errors, as specified by our alignment algorithm (S1 Fig). The average coverage depth for each base was 316x, and the minimum coverage was 18x (Fig 2A). The variation in coverage depth has several causes: 1) fragmentation by DNase is not random; 2) non-uniquely mapped reads were filtered out, which may lower the coverage depth, and abnormal GC content also contributes to low coverage; and 3) the M13 genome contains some regions with sequences similar to the base addition order, which may artificially increase the coverage because of non-specific adsorption. The coverage depth profile can be seen in Fig 2B. The Integrative Genomics Viewer (IGV) gives a clear picture of mapping against the known M13 genome reference (S2 Fig).
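As a rough illustration of the quantities plotted in Fig 2, the sketch below computes a per-base depth profile from alignment intervals and the fraction of bases covered at or above a given depth. The function names and the half-open `(start, end)` interval format are assumptions made for this example; the circularity of the M13 genome is ignored. The last lines are a back-of-envelope check using only numbers quoted in this article (104,802 aligned reads, 22-base average read length, 7249 bp reference).

```python
def depth_profile(genome_length, alignments):
    """Per-base coverage depth from 0-based, half-open (start, end) intervals."""
    diff = [0] * (genome_length + 1)
    for start, end in alignments:
        diff[start] += 1   # read begins covering at start
        diff[end] -= 1     # read stops covering at end
    depth, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        depth.append(running)
    return depth


def coverage_rate(depth, threshold):
    """Fraction of bases covered at depth >= threshold (as in Fig 2B)."""
    return sum(d >= threshold for d in depth) / len(depth)


# Toy example: three reads on a 10-base reference.
depth = depth_profile(10, [(0, 5), (3, 8), (4, 10)])
print(depth)                    # [1, 1, 1, 2, 3, 2, 2, 2, 1, 1]
print(coverage_rate(depth, 2))  # 0.5

# Back-of-envelope mean depth: aligned reads x mean read length / genome size.
expected_mean_depth = 104_802 * 22 / 7249
print(round(expected_mean_depth))  # 318
```

The estimate of about 318x is consistent with the reported average depth of 316x, given that the 22-base read length is itself a rounded average.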

Fig 2. Coverage depth distribution.

(A) Coverage depth for each base on M13 reference. The average coverage depth is 316x±96x. (B) Coverage rate as a function of coverage depth.

https://doi.org/10.1371/journal.pone.0188181.g002

Read length

Read length for this M13 sequencing run is shown in Fig 3. After conducting 80 base incorporation cycles and filtering, the average read length was 22 bases (Table 1). Before filtering, a peak was observed in the read length distribution at 25 bases.

Fig 3. Read length distribution after length and repeat filters (blue bars) and after alignment (red bars).

https://doi.org/10.1371/journal.pone.0188181.g003

GC bias

No obvious GC bias was observed in the coverage depth of 100-base windows over a GC content range of 22–69% (Fig 4A). The distribution of base frequency in the reference as a function of GC content has an almost identical shape to the depth distribution calculated from the sequencing result (Fig 4B). The R² (goodness of fit) between those two curves is 0.9946, indicating minimal coverage bias in this experiment.
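A minimal sketch of this kind of GC-bias analysis follows. It assumes non-overlapping 100-base windows and the standard coefficient of determination for R²; the article does not state whether windows overlap or exactly how R² was computed, so both are assumptions here, as are the function names.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def depth_by_gc(reference, depth, window=100):
    """Mean depth of non-overlapping windows, grouped by GC percent."""
    bins = {}
    for start in range(0, len(reference) - window + 1, window):
        gc = round(100 * gc_fraction(reference[start:start + window]))
        mean_depth = sum(depth[start:start + window]) / window
        bins.setdefault(gc, []).append(mean_depth)
    return {gc: sum(v) / len(v) for gc, v in sorted(bins.items())}


def r_squared(observed, predicted):
    """Coefficient of determination between two equally long series."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# Toy example with 4-base windows: one GC-only and one AT-only window.
reference = "GCGCATAT"
depth = [10, 10, 10, 10, 12, 12, 12, 12]
print(depth_by_gc(reference, depth, window=4))  # {0: 12.0, 100: 10.0}
```

With no GC bias, the mean depth returned per GC bin would stay roughly flat across bins, which is the behavior reported for Fig 4A.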

Fig 4. GC content.

(A) Average depth distribution of all 100-base windows as a function of GC content. From GC content 22% to 69%, the average depth of each window in the genome fluctuates in a small range. (B) GC patterns of the reference genome and aligned reads.

https://doi.org/10.1371/journal.pone.0188181.g004

Discussion

Alignment profile

The cloning vector M13mp18 was sequenced on this new GenoCare platform. Similar deletion rates in homopolymer (1.87%) and non-homopolymer (1.46%) regions demonstrate reasonable blocking efficiency by the terminators. For the 7.2 kb M13 genome, the average read length of 22 bases was adequate for alignment. Before alignment, the length distribution showed a peak at 25 bases. Filtering high-error and non-uniquely mapped reads lowered throughput and average read length. The reported average read length can also be attributed to the relatively small number of cycles run (80); thus there is potential for longer read length on the GenoCare platform as cycle number is increased in future experiments. As predicted from the absence of PCR amplification in our platform, we observed minimal GC bias in this experiment, demonstrating a key advantage of SM sequencing over NGS.

Clinical applications

In this study, we demonstrated the new GenoCare platform’s SM sequencing capabilities. Overall sequencing took 12 hours including sample preparation, instrument run time and data analysis—a major improvement over NGS standards. This reduction in sequencing time is of great importance in the clinic, where timely results and diagnosis are critical. In sequencing the M13 genome, GenoCare used only 0.5% of one flow cell channel. Thus, GenoCare is capable of vastly increased throughput and has potential for whole human genome sequencing. Because our platform uses poly-T oligonucleotides to hybridize with poly-A tailed DNA, there is potential for GenoCare to handle naturally poly-A tailed RNA, and address needs for new technologies in transcriptomics. GenoCare is an automated desktop sequencer for dedicated use in the clinic, eclipsing NGS technologies with the potential to deliver faster and cheaper sequencing results with limited GC bias.
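The throughput argument can be made explicit with a naive linear extrapolation from the figures reported in this article. The sketch below assumes that read yield scales proportionally with imaged area and that all 16 channels behave identically; neither assumption is tested in this study, "0.5%" is an approximate figure, and the variable names are introduced here for illustration.

```python
# Figures quoted in this article.
aligned_reads = 104_802    # uniquely aligned reads in the M13 run
imaged_fraction = 0.005    # ~0.5% of one channel was imaged
channels = 16              # channels per flow cell
mean_read_length = 22      # bases, after filtering

# Naive linear scaling (assumption): yield is proportional to imaged area.
reads_per_channel = aligned_reads / imaged_fraction
reads_per_flow_cell = reads_per_channel * channels
bases_per_flow_cell = reads_per_flow_cell * mean_read_length

print(f"{reads_per_channel:,.0f} aligned reads per channel")
print(f"{reads_per_flow_cell:,.0f} aligned reads per flow cell")
print(f"{bases_per_flow_cell / 1e9:.1f} Gb of aligned bases per flow cell")
```

Under these assumptions a full channel would yield roughly 21 million aligned reads, and a full flow cell roughly 335 million aligned reads, or about 7.4 Gb at the current read length.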

Materials and methods

Sample preparation

The M13 genomic DNA preparation process is illustrated in Fig 1.

M13.

M13mp18 cloning vector was purchased from NEB, Beijing, China, and used as received. The sequence of the M13mp18 cloning vector is derived from the M13 phage [30] and contains 7249 bp. In this study, we used this cloning vector as DNA raw material to re-sequence, analyze, and compare with the reference sequence.

Oligonucleotide primers.

5’ amine functionalized Poly-T oligonucleotides were purchased from Sangon and used as received.

DNA fragmentation.

The M13mp18 cloning vector (from NEB, ref. N4018S) was used as raw DNA material to be sequenced by our platform. This cloning vector was first randomly fragmented into dsDNA fragments of about 200 bp using NEBNext® dsDNA Fragmentase® (from NEB, ref M0348S). Then, DNA fragments were purified using Agencourt AMPure XP beads (from Beckman, ref. A63881). The concentration of DNA was assessed by UV absorption using a Nanodrop 2000 device.

Poly-A tailing and blocking.

Multiple incorporations of 50–100 dATP at the 3' end of ssDNA fragments from the cloning vector resulted in a poly-A tail. This reaction completed within 20 minutes. In a second step, poly-A tailed 3' ends were blocked by incorporating the Cyanine 3 dideoxy ATP (Cy3-ddATP from PERKINELMER, ref. NEL586001EA). The blocking reaction completed within 30 minutes using the enzyme Terminal Transferase (from NEB, ref. M0315) such that the incorporation of reversible terminators at the 3' end of the template strands was prevented.

Surfaces and template capture

Surface chemistry.

Sequencing surfaces were prepared on 110×74 mm epoxy-coated glass coverslips (SCHOTT, Jena, Germany). Poly-T oligonucleotides were covalently bound to the surface.

Flow cells.

The above functionalized glass coverslip was assembled with a 1.0 mm thick glass slide by a pressure sensitive adhesive to form a flow cell. The flow cell has 16 channels, determined by the adhesive shape. For the M13 sequencing in this experiment, ~0.5% of one channel was imaged.

Template capture (hybridization).

The surface of the flow-cell was chemically modified by anchoring poly-T ssDNA strands at their 5' end, in order to capture poly-A tailed strands from the library once they were injected inside the flow-cell at 55°C. Then non-hybridized templates were washed away by 150 mM HEPES, 1X SSC and 0.1% SDS, followed by 150 mM HEPES and 150 mM NaCl.

Sequencing reactions

The GenoCare platform.

All sequencing reactions were performed on the GenoCare platform. GenoCare is an automated single-molecule sequencer with three major components: a fluorescence imaging system, a microfluidic system, and a stage that controls the movement of the sample. The imaging system is based on total internal reflection fluorescence (TIRF) microscopy [28]. GenoCare is designed for clinical applications and outperforms the previous platform developed by Harris et al. [29] in terms of read length, coverage depth, error rate, and sequencing time.

Fill and lock.

Because the hybridization of the poly-T primer with the poly-A tailed template may not be perfect, the remaining unpaired adenines on the template must be filled in with dTTP before the actual sequencing process starts. After hybridization, the temperature of the flow-cell was lowered to 37°C. The unpaired adenine nucleotides of the poly-A tailed template strand were paired by multiple incorporations of natural thymine nucleotides at the 3' end of the primer strands. A mixture of dATP, dCTP, and dGTP reversible terminators was then added to block further incorporation, so that the template was locked in place and ready for sequencing.

Nucleotide addition.

Reversible terminators were used in the sequencing-by-synthesis approach. They are modified nucleotides composed of a nucleotide triphosphate, a fluorophore (Atto647N), a disulfide linker, and an inhibitor group. The inhibitor is designed to block incorporation of the next nucleotide until the disulfide bond of the previous reversible terminator has been cleaved.

The DNA extension was carried out at 37°C in Tris buffer containing polymerase, one of the four nucleotides and other salts. The components of this system are available with the use instructions from Direct Genomics.

Supporting information

S1 File. Description of imaging analysis process.

https://doi.org/10.1371/journal.pone.0188181.s001

(DOCX)

S1 Fig. Error distribution for all unique mapped reads.

Most of those reads have zero or one error.

https://doi.org/10.1371/journal.pone.0188181.s003

(TIF)

S2 Fig. Part of an IGV view of mapping.

The sequence at the bottom is the reference sequence. Capital letters show the mismatch sites, black horizontal lines indicate deletion errors, and purple vertical lines denote insertions.

https://doi.org/10.1371/journal.pone.0188181.s004

(TIF)

Acknowledgments

We thank Dr. J. William Efcavitch for helping design experiments and interpret data.

References

  1. Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, et al. Initial sequencing and analysis of the human genome. Nature. 2001 Feb;409(6822):860–921. pmid:11237011
  2. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, et al. The sequence of the human genome. Science. 2001 Feb;291(5507):1304–1351. pmid:11181995
  3. Miller RR, Montoya V, Gardy JL, Patrick DM, Tang P. Metagenomics for pathogen detection in public health. Genome Med. 2013 Sep;5(9):81–94. pmid:24050114
  4. Thorburn F, Bennett S, Modha S, Murdoch D, Gunson R, Murcia PR. The use of next generation sequencing in the diagnosis and typing of respiratory infections. J. Clin. Virol. 2015 Aug;69:96–100. pmid:26209388
  5. Aly SM, Sabri DM. Next generation sequencing (NGS): a golden tool in forensic toolkit. Arch. Med. Sadowej Kryminol. 2015;65(4):260–271. pmid:27543959
  6. Børsting C, Morling N. Next generation sequencing and its applications in forensic genetics. Forensic. Sci. Int. Genet. 2015 Sep;18:78–89. pmid:25704953
  7. Voelkerding KV, Dames SA, Durtschi JD. Next-generation sequencing: from basic research to diagnostics. Clin. Chem. 2009;55(4):641–658. pmid:19246620
  8. Zhang J, Chiodini R, Badr A, Zhang G. The impact of next-generation sequencing on genomics. J. Genet Genomics. 2011 Mar;38(3):95–109. pmid:21477781
  9. Roden DM, Tyndale RF. Genomic medicine, precision medicine, personalized medicine: what's in a name?. Clin. Pharmacol. Ther. 2013 Aug;94(2):169–172. pmid:23872826
  10. Dong L, Wang W, Li A, Kansa R, Chen Y, Chen H, et al. Clinical Next Generation Sequencing for Precision Medicine in Cancer. Curr. Genomics. 2015 Aug;16(4):253–263. pmid:27006629
  11. Xue Y, Wilcox WR. Changing paradigm of cancer therapy: precision medicine by next-generation sequencing. Cancer Biol. Med. 2016 Mar;13(1):12–18. pmid:27144059
  12. Zhang W, Cui H, Wong LJ. Application of next generation sequencing to molecular diagnosis of inherited diseases. Top Curr. Chem. 2014;336:19–45. pmid:22576358
  13. Daoud H, Luco SM, Li R, Bareke E, Beaulieu C, Jarinova O, et al. Next-generation sequencing for diagnosis of rare diseases in the neonatal intensive care unit. CMAJ. 2016 Aug;188(11):254–260.
  14. van Dijk EL, Jaszczyszyn Y, Thermes C. Library preparation methods for next-generation sequencing: tone down the bias. Exp. Cell. Res. 2014 Mar;322(1):12–20. pmid:24440557
  15. Chen YC, Liu T, Yu CH, Chiang TY, Hwang CC. Effects of GC bias in next-generation-sequencing data on de novo genome assembly. PLoS ONE. 2013 Apr;8(4):62856–62875.
  16. Chitsaz H, Yee-greenbaum JL, Tesler G, Lombardo MJ, Dupont CL, Badger JH, et al. Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat. Biotechnol. 2011 Sep;29(10):915–921. pmid:21926975
  17. Dohm JC, Lottaz C, Borodina T, Himmelbauer H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008 Sep;36(16):105–114.
  18. Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011;12(2):18–29.
  19. Oyola SO, Otto TD, Gu Y, Maslen G, Manske M, Campino S, et al. Optimizing Illumina next-generation sequencing library preparation for extremely AT-biased genomes. BMC Genomics. 2012 Jan;13:1–12. pmid:22214261
  20. Kozarewa I, Ning Z, Quail MA, Sanders MJ, Berriman M, Turner DJ. Amplification-free Illumina sequencing-library preparation facilitates improved mapping and assembly of (G+C)-biased genomes. Nat. Methods. 2009 Apr;6(4):291–295. pmid:19287394
  21. Jett JH, Keller RA, Martin JC, Marrone BL, Moyzis RK, Ratliff RL, et al. High-speed DNA sequencing: an approach based upon fluorescence detection of single molecules. J. Biomol. Struct. Dyn. 1989 Oct;7(2):301–309. pmid:2557861
  22. Sam LT, Lipson D, Raz T, et al. A comparison of single molecule and amplification based sequencing of cancer transcriptomes. PLoS ONE. 2011 Mar;6(3):17305–17316.
  23. Braslavsky I, Hebert B, Kartalov E, Quake SR. Sequence information can be obtained from single DNA molecules. Proc. Natl. Acad. Sci. USA. 2003 Apr;100(7):3960–3964. pmid:12651960
  24. Thompson JF, Steinmann KE. Single molecule sequencing with a HeliScope genetic analysis system. Curr. Protoc. Mol. Biol. 2010 Oct; Chapter 7: Unit7.10.
  25. Milos P. Helicos BioSciences. Pharmacogenomics. 2008 Apr;9(4):477–480. pmid:18384261
  26. Reuter JA, Spacek DV, Snyder MP. High-throughput sequencing technologies. Mol. Cell. 2015 May;58(4):586–597. pmid:26000844
  27. Buermans HP, Den dunnen JT. Next generation sequencing technology: Advances and applications. Biochim. Biophys. Acta. 2014 Oct;1842(10):1932–1941. pmid:24995601
  28. Gao Y, Deng L, Yan Q, Gao YQ, Wu Z, Cai J, et al. Single molecule targeted sequencing for cancer gene mutation detection. Sci. Rep. 2016 May;6:26110–26120. pmid:27193446
  29. Harris TD, Buzby PR, Babcock H, Beer E, Bowers J, Braslavsky I, et al. Single-molecule DNA sequencing of a viral genome. Science. 2008 Apr;320(5872):106–109. pmid:18388294
  30. Yanisch-perron C, Vieira J, Messing J. Improved M13 phage cloning vectors and host strains: nucleotide sequences of the M13mp18 and pUC19 vectors. Gene. 1985;33(1):103–119. pmid:2985470