Conceived and designed the experiments: KDL JH DD FvN SB PC JV KC. Performed the experiments: KDL MB SDK. Analyzed the data: KDL JDS SL FP BDW JH LC. Wrote the paper: KDL JH JV KC. Bioinformatics support: KDL JDS SL FP BDW JH WvC. Statistical analyses: JH LC.
The authors have declared that no competing interests exist.
Despite improvements in sequence quality and price per base pair, Sanger sequencing remains restricted to the screening of individual disease genes. The development of massively parallel sequencing (MPS) technologies heralded an era in which molecular diagnostics for multigenic disorders becomes reality. Here, we outline different PCR amplification based strategies for the screening of a multitude of genes in a patient cohort. We performed a thorough evaluation in terms of set-up, coverage and sequencing variants on the data of 10 GS-FLX experiments (over 200 patients). Crucially, we determined the actual coverage that is required for reliable diagnostic results using MPS, and provide a tool to calculate the number of patients that can be screened in a single run. Finally, we give an overview of factors contributing to false negative or false positive mutation calls and suggest ways to maximize sensitivity and specificity, both of which are important in a routine setting. By describing practical strategies for the screening of multigenic disorders in large numbers of samples, and by providing answers to questions about the minimum required coverage, the number of patients that can be screened in a single run, and the factors that may affect sensitivity and specificity, we hope to facilitate the implementation of MPS technology in molecular diagnostics.
A multitude of laboratory technologies for the detection of DNA mutations has been developed over the last decades. In current diagnostic settings, a mutation scanning technique followed by Sanger sequencing of the abnormal DNA fragments is most frequently used. Well-known examples of widely used methods to identify the aberrant fragments are single strand conformation polymorphism (SSCP), conformation sensitive gel electrophoresis (CSGE), high performance liquid chromatography (HPLC) and, more recently, high resolution melting curve analysis (HRMCA)
In order for MPS to take over the role of Sanger sequencing and to evolve into the method of choice for next generation molecular diagnostics (NGMD), a number of hurdles need to be overcome and questions answered. The goal of this paper is to remove a number of these obstacles by describing strategies that enable mutation analysis through MPS, by presenting tools to determine the required coverage and the number of patients who can be screened in a single run, and by listing possible sources of false negative or false positive mutation calls along with possible solutions. The guidelines and tools provided in this study were formulated or calculated based on pyrosequencing data obtained on the GS-FLX instrument (454-Roche), but may provide useful insights for applications with other MPS chemistries as well.
The data presented in this article are derived from 10 GS-FLX sequencing runs (using both Standard and Titanium chemistries) on samples prepared with different approaches. In total, over 200 patient samples were evaluated in these 10 experiments. To pool different patients in a single experiment, multiplex identifier (MID) tags were attached to all patients' samples. Different approaches were evaluated to attach these tags:
Approach 1: the samples investigated for recessive congenital deafness (15 genes:
Approach 2: for hereditary breast cancer (2 genes:
During the first PCR, gene-specific amplicons were generated using primers modified at their 5′ end with a universal M13 linker sequence. In the first experiments (2 out of 10), singleplex reactions were equimolarly pooled. In further experiments, the first amplification step was replaced by a multiplex PCR in which several amplicons of the same patient were combined (we typically aimed for 10-plex PCR reactions) to reduce the workload and consumable cost. After a 1/1000 dilution of the PCR products, a second round of PCR was performed, in which primers containing the common A or B sequence, a patient-specific barcode sequence (MID) and the universal linker sequence (M13) were used to amplify the initial PCR products, thereby extending them with the sequences required to initiate sequencing and to distinguish reads from the different patients. Primer sequences, reaction conditions and the composition of the multiplex reactions are described by De Leeneer et al.
PCRs prior to pooling were performed in the presence of a saturating dye (LCgreen+, Idaho Technology Inc.) on a real-time PCR instrument (CFX384, Bio-Rad). PCRs were normalized and equimolarly pooled based on the RFU data (end-point fluorescence). This pool was purified with the High Pure PCR Cleanup Micro kit (Roche).
During optimization of the multiplex reactions FAM labeled MID primers were used to evaluate equimolarity between amplicons within one reaction and fluorescent peaks were separated on an ABI3730 capillary system.
Emulsion PCR and sequencing reactions on the GS-FLX (454-Roche) were performed according to the manufacturer's instructions. On average, 380,000 reads (range: 290,000–520,000) were obtained in a standard GS-FLX run and 1,000,000 (range: 800,000–1,200,000) when the Titanium chemistry was used. In each experiment, a minimum of 90% of all reads mapped to the reference sequence. FASTA files were analyzed with the in-house developed variant interpretation pipeline (VIP) software (version 1.3)
Distribution plots and log-normal curve fitting were performed using the GraphPad PRISM 5 software. Statistical analysis of the potential bias introduced during emulsion PCR and pyrosequencing was performed in R. The means of the relative coverages (obtained after sequencing on the GS-FLX) and of the relative fluorescent signals (obtained by capillary electrophoresis on an ABI3730) were used to center both data sets for each multiplex prior to principal component analysis, in order to remove the effect of the different multiplex sizes.
With Sanger sequencing, a two-fold (forward and reverse) coverage is considered sufficient for molecular diagnostics, provided that sequences are of high quality. At present, there is no clear consensus on the minimum coverage (MC) required to reliably detect heterozygous variants using MPS technologies. Current guidelines typically suggest a 20-fold coverage
|  |  | sequencing error filter level |  |  |  |  |  |  |
| Power (P) | (Phred score) | 5% | 10% | 15% | 20% | 25% | 30% | 35% |
| 90.00% | (10) | 4 | 4 | 4 | 7 | 7 | 12 | 24 |
| 95.00% | (13) | 5 | 5 | 8 | 8 | 11 | 18 | 30 |
| 99.00% | (20) | 7 | 7 | 11 | 17 | 19 | 35 | 54 |
| 99.50% | (23) | 8 | 12 | 12 | 18 | 26 | 42 | 71 |
| 99.90% | (30) | 10 | 14 | 18 | 27 | 38 | 61 | 110 |
| 99.95% | (33) | 11 | 15 | 19 | 28 | 42 | 68 | 117 |
| 99.99% | (40) | 14 | 18 | 28 | 34 | 54 | 83 | 148 |
| 99.995% | (43) | 15 | 19 | 30 | 42 | 61 | 92 | 165 |
| 100.00% | (50) | 17 | 25 | 36 | 51 | 70 | 109 | 194 |
Not surprisingly, MC values increase as the required power to detect heterozygous variants increases. There is also a strong dependency on the sequencing error filter level: if only variants present in at least 30% of the reads are considered true variants, a 61-fold MC is required, while a coverage depth of only 27 is needed if the filter threshold is lowered to 20% (both for P = 99.90%, corresponding to a Phred score of 30, as required for standard molecular diagnostics).
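The dependence of MC on power and filter level can be sketched with a simple binomial model: assuming reads sample both alleles of a heterozygote with equal probability (p = 0.5), the MC is the smallest depth at which the variant allele reaches the filter threshold with the requested probability. This is an illustrative sketch, not the exact calculation behind the table above (which may include additional corrections, e.g. for amplification bias), so at stringent settings its output can deviate by a few reads from the tabulated values.

```python
from math import ceil, comb

def detection_power(n, filter_level, p_het=0.5):
    """P(variant allele seen in >= filter_level of n reads) at a
    heterozygous position, modelling the variant read count as
    Binomial(n, p_het)."""
    k_min = ceil(filter_level * n)
    return sum(comb(n, k) * p_het**k * (1 - p_het)**(n - k)
               for k in range(k_min, n + 1))

def min_coverage(power, filter_level, p_het=0.5, n_max=500):
    """Smallest read depth whose detection power reaches `power`."""
    for n in range(1, n_max + 1):
        if detection_power(n, filter_level, p_het) >= power:
            return n
    raise ValueError("no depth up to n_max reaches the requested power")
```

For example, `min_coverage(0.90, 0.05)` gives 4 and `min_coverage(0.99, 0.05)` gives 7, matching the 5% column of the table.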
When plotting obtained variant frequencies vs. coverage of unfiltered data, the largest deviations from the binomial distribution are observed at the lower allele frequencies. Because the majority of such data points are sequencing errors, especially related to homopolymers (see below), dispersion can best be evaluated at frequencies above 50%. Allele specific amplification biases during sample preparation or emulsion PCR are the most likely cause of any remaining dispersion. A stepwise analysis starting from unfiltered variant data in one experiment (9721 variants) to determine the dispersion is shown in Supporting information
Determination of the required minimum coverage is not sufficient to calculate the number of sample-amplicon combinations (SAC) that can be analyzed with a given number of reads, because the coverage may differ between SAC. In an ideal experiment, all SAC have exactly the same coverage, matching the theoretically determined MC. In practice, some SAC will display a lower coverage than others. Since these must still reach at least the MC, other SAC will have a higher coverage than strictly required, wasting sequencing capacity. The correction factor to convert the minimum coverage into the required average coverage can be derived from an evaluation of the distribution of the coverage.
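If the per-SAC coverage follows a log-normal distribution, as suggested by the curve fitting above, this correction factor has a simple closed form: scale the distribution so that a chosen fraction of SAC still reaches the MC. A minimal sketch; the 95% target fraction and the `sigma_log` values are illustrative assumptions, not values from our data.

```python
import math
from statistics import NormalDist

def required_average_coverage(mc, sigma_log, frac_ok=0.95):
    """Average coverage needed so that a fraction `frac_ok` of SAC
    reach the minimum coverage `mc`, assuming relative coverage is
    log-normal with mean 1 and log-scale standard deviation sigma_log.

    The (1 - frac_ok) quantile of LogNormal(mu, sigma) is
    exp(mu + sigma * z) with z = Phi^-1(1 - frac_ok); choosing
    mu = -sigma**2 / 2 gives mean 1, and scaling so this quantile
    equals mc yields the required average coverage."""
    z = NormalDist().inv_cdf(1 - frac_ok)
    quantile = math.exp(-sigma_log**2 / 2 + sigma_log * z)
    return mc / quantile  # the correction factor is 1 / quantile
```

The wider the coverage distribution (larger `sigma_log`), the larger the correction factor, i.e. the more sequencing capacity is wasted on over-covered SAC.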
A) Distribution plot of the coverage observed in a pilot study representative for NGMD screening (full line) with 3300 sample amplicon combinations (SAC), derived from sequencing 30 patients for
Supplemental
We assumed that a narrower spread in coverage would be obtained by sequencing an equimolar pool of fragments or amplicons. To test the assumption that the emulsion PCR does not introduce a substantial bias, we compared the relative peak intensities (determined by fragment analysis on an ABI3730xl) of 9 different fluorescently labeled multiplex PCRs (6- to 11-plexes), amplified on 5 different samples (360 SACs in total), with the corresponding relative coverage after sequencing. Overall, there seems to be a good 1∶1 relationship between the relative fluorescence and the relative coverage, indicating that a certain increase in relative fluorescence on average induces an equal increase in relative coverage (
Nine different fluorescently labeled multiplex PCRs (6- to 11-plexes), amplified on 5 different samples, were analyzed on a capillary sequencer to determine relative amplicon abundances prior to emulsion PCR and sequencing on a GS-FLX. Relative fluorescent signals were compared to their corresponding coverage values. The top panel shows the relative coverage as a function of the relative fluorescence for the 360 SACs. The ellipse represents the 95% confidence region according to the multivariate normal distribution. The continuous line is the first principal component (PC), which indicates the direction of the largest variance in the sample: 92% of the variance can be explained by the first PC. The first PC lies very close to the first bisector (dashed line). Hence, there is a good 1∶1 relationship between the relative fluorescence and the relative coverage, indicating that a certain increase in relative fluorescence on average induces an equal increase in relative coverage. The table at the bottom summarizes the results across all 9 multiplex PCRs (360 SACs). It shows that the first PC explains a large proportion of the variance of each multiplex (84%–98%): the majority of the variation in coverage results from variations in input amounts (as determined by fragment analysis on a capillary sequencer).
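For paired 2-D data such as (relative fluorescence, relative coverage) per SAC, the first PC and the fraction of variance it explains follow directly from the 2 × 2 covariance matrix. The sketch below (plain Python, with our own function name and data layout) illustrates the check performed in the figure: an angle of 45 degrees corresponds to the 1∶1 relationship.

```python
import math

def first_pc(xs, ys):
    """First principal component of paired 2-D data.
    Returns (angle in degrees, fraction of variance explained);
    45 degrees corresponds to a 1:1 relationship between x and y."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # eigenvalues of the covariance matrix [[sxx, sxy], [sxy, syy]]
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    root = math.sqrt(tr ** 2 / 4 - det)
    l1, l2 = tr / 2 + root, tr / 2 - root
    if sxy == 0:
        angle = 0.0 if sxx >= syy else 90.0
    else:
        # an eigenvector for l1 is (sxy, l1 - sxx)
        angle = math.degrees(math.atan2(l1 - sxx, sxy))
    return angle, l1 / (l1 + l2)
```

Perfectly proportional inputs yield an angle of exactly 45 degrees with 100% of the variance on the first PC.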
Equimolarity can be achieved by optimizing amplification conditions or by normalizing PCR product concentrations. Although normalization can potentially increase sequencing efficiency, one may lose overall processing efficiency due to the effort required to normalize the SAC. With good primer design tools, one should be able to obtain similar DNA quantities (as measured by end-point fluorescence in a qPCR reaction with a saturating DNA binding dye) for the 90% best assays. For such screenings, the majority of amplicons do not require any normalization, and a significant portion of the remaining amplicons can be made equimolar by a simple normalization.
This graph represents the distribution of relative end-point fluorescence intensities (RFU, relative to the maximum fluorescence) across 627 different qPCR reactions on a single sample. About 90% of reactions have RFU values of at least 0.5. This implies that, if equal volumes of all PCR reactions are pooled, the concentration of 90% of amplicons will vary less than 2-fold. This fraction of amplicons can be increased to 96% by using a double volume for the PCRs in the 0.5–0.25 RFU range, and to 97% by using a quadruple volume for the PCRs in the 0.25–0.125 RFU range. The concentration of the remaining 3% of PCR reactions is too low to be efficiently used.
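The simple normalization scheme described in this legend can be expressed as a small volume lookup (the base volume of 1.0 is an arbitrary unit; the RFU cut-offs are those given above):

```python
def pooling_volume(rfu, base_volume=1.0):
    """Pooling volume for a PCR product, given its end-point
    fluorescence relative to the strongest reaction (RFU).
    Returns None when the product is too weak to use (RFU < 0.125)."""
    if rfu >= 0.5:
        return base_volume        # ~90% of reactions: pool as-is
    if rfu >= 0.25:
        return 2 * base_volume    # double volume for 0.25-0.5 RFU
    if rfu >= 0.125:
        return 4 * base_volume    # quadruple volume for 0.125-0.25 RFU
    return None                   # too dilute: repeat or drop the amplicon
```

Applying this to all 627 reactions keeps the pooled concentration of 97% of amplicons within a 2-fold range without any per-amplicon quantification beyond the qPCR end-point read-out.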
Sequence quality was determined using the GS-FLX basecaller. Quality scores per base were averaged across all reads within a single run (∼700,000 reads of 1 GS-FLX Titanium experiment for
a) Average quality score as a function of the position within the reads for a representative dataset (full Titanium run with amplicons for breast cancer and familial aortic aneurysm screenings). Across the first 400 bp there is an average quality of 35.3, corresponding to a predicted error rate of 0.029%. b) Comparison of the observed homopolymer length in a series of sequencing runs to the expected length based on the reference sequence. Results are plotted as the fraction of reads with a correct homopolymer length estimation (n), an underestimation of the homopolymer length (n−1, n−2, n−3) or an overestimation (n+1, n+2, n+3). The vast majority of reads for homopolymers of up to 6 repeats have a correct length estimation; less than 2% are overcalls and less than 10% are undercalls. For homopolymers of 7 repeats, three quarters of the reads are correctly called and over 20% of the reads are interpreted as missing one repeat. Only by filtering for low allele frequencies can these repeats be analyzed. At 8 repeats, only about half of the reads are correctly called; at even larger homopolymer lengths, only a minority of reads have a correct basecalling.
Pyrosequencing reactions are characterized by a low false call rate for substitutions, but also by a higher error rate for insertions and deletions – especially in homopolymeric regions
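The homopolymer comparison shown in the figure reduces to a simple tally of length offsets against the reference. A sketch (the function name and data layout are our own illustration, not part of the VIP pipeline):

```python
from collections import Counter

def homopolymer_call_profile(ref_len, called_lens):
    """Fraction of reads per length offset relative to the reference
    homopolymer length: 0 = correct call, -1 = one repeat undercalled,
    +1 = one repeat overcalled, and so on."""
    counts = Counter(called - ref_len for called in called_lens)
    total = len(called_lens)
    return {offset: c / total for offset, c in sorted(counts.items())}
```

For a 7-repeat homopolymer, a profile with a large mass at offset −1 reproduces the undercall pattern described above and flags the position as one where the allele frequency filter must be relaxed.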
As massively parallel sequencing has the potential to become the standard for next generation molecular diagnostics, more insight is urgently needed into the limitations of the technology, and tools are required to standardize the quality of the diagnostic tests offered in various laboratories. In this study, we thoroughly evaluated data obtained in 10 GS-FLX experiments, allowing us to shed light on a number of important issues and provide workarounds.
Current massively parallel sequencers offer a per-run throughput that is insufficient for complete genome sequencing at an affordable cost in a diagnostic setting, but that generally exceeds the requirements for targeted resequencing of single DNA samples. Strategies for next generation molecular diagnostics will therefore have to deal with both the selection of regions of interest and with sample multiplexing. Regions can be selected by either hybridization based enrichment or PCR amplification. Enrichment by capturing DNA fragments on oligonucleotides – on array (e.g. NimbleGen, Febit) or in solution (e.g. Agilent, Illumina) – has the advantage that many regions can be targeted in parallel (target multiplexing). While this allows enrichment of a high number of regions of interest (up to an entire human exome), it is well known to introduce large variations in coverage
Sample multiplexing can be achieved by physically separating samples in the sequencing reaction or by tagging the amplicons with different sample-specific sequences during library preparation. Physical separation on current MPS instruments offers limited flexibility in the number of samples that can be multiplexed (up to 16 on the GS-FLX) and may reduce the available sequencing capacity by blocking parts of the available sequencing space. Therefore, a sample tagging approach is preferred. For applications where different samples are analyzed for different genes, no special multiplexing modifications are needed, since reads can easily be attributed to the different samples based on correct alignment to the gene of interest. Four major amplification based approaches for NGMD are currently used worldwide: 1) PCR with fusion primers (GS-FLX), 2) PCR followed by adapter ligation (GS-FLX), 3) two consecutive rounds of PCR (GS-FLX), and 4) shearing of concatenated PCR products followed by adapter ligation (various MPS platforms). It must be noted that other approaches or variations on the methods described may be used as well. In this study, we evaluated approaches 2 and 3. The main advantages of approach 2 are its simplicity and ease of set-up. The drawback is the large number of individual PCR reactions that need to be performed. Hence, we concluded that this approach is best suited when a screening only needs to be performed a few times, or when results are quickly required and one cannot afford optimization. As soon as a few hundred samples need to be screened, approach 3 may be the preferred alternative. By multiplexing PCR reactions in approach 3, one can reduce the workload and consumable cost for sample preparation. Although optimization of multiplex PCR may be challenging, there is a good return in increased efficiency (in terms of cost and workload to prepare samples) for tests that will be run many times – as is the case in diagnostic sequencing.
Further optimization may be achieved if the first and second round of PCR can be combined into a single PCR containing the two types of primers (inner target specific and outer sample specific primers).
Because of fundamental differences between traditional and so-called next-generation sequencing methods, there is uncertainty about how to deal with coverage and how to interpret variants, errors and quality scores. Despite the availability of some guidelines on required coverage provided by sequencing instrument suppliers, there was no theoretical framework to actually calculate the required minimum coverage. We here provide such a framework and implement it in a spreadsheet template that can be used to determine the required coverage and the number of patients that can be screened in a single run.
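The core of such a calculation is simple: divide the usable reads per run by the reads each patient consumes. The sketch below is a simplified stand-in for the spreadsheet; the 90% mapped-read fraction matches our runs, while the example amplicon count and coverage target are purely illustrative.

```python
def patients_per_run(total_reads, amplicons_per_patient,
                     avg_coverage, mapped_fraction=0.9):
    """Number of patients that fit in one sequencing run.

    avg_coverage is the required *average* coverage per SAC, i.e. the
    minimum coverage already multiplied by the correction factor that
    accounts for the spread of the coverage distribution."""
    usable_reads = total_reads * mapped_fraction
    return int(usable_reads // (amplicons_per_patient * avg_coverage))
```

For a Titanium run of about 1,000,000 reads, with 100 amplicons per patient and a required average coverage of 60, this yields 150 patients per run.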
A number of sources of false positives and false negatives are identical for Sanger and massively parallel sequencing, and hence independent of the fold coverage. However, because MPS is based on the sequencing of single, clonally amplified molecules and uses a completely different sequencing chemistry, new types of error sources must be taken into account. Knowing the possible sources of error, one may optimize sample preparation and sequencing protocols, and adjust the data analysis pipeline for these new types of errors.
Based on the strategies and methods described in this paper we successfully developed and validated the screening of the complete coding region of the
NGMD calculator.
(PDF)
Allele frequency analysis.
(XLSX)