Assessment of Whole Genome Amplification for Sequence Capture and Massively Parallel Sequencing

Exome sequence capture and massively parallel sequencing can be combined to achieve inexpensive and rapid global analyses of the functional sections of the genome. The difficulties of working with relatively small quantities of genetic material, as may be necessary when sharing tumor biopsies between collaborators for instance, can be overcome using whole genome amplification. However, the potential drawbacks of using a whole genome amplification technology based on random primers in combination with sequence capture followed by massively parallel sequencing have not yet been examined in detail, especially in the context of mutation discovery in tumor material. In this work, we compare mutations detected in sequence data for unamplified DNA, whole genome amplified DNA, and RNA originating from the same tumor tissue samples from 16 patients diagnosed with non-small cell lung cancer. The results obtained provide a comprehensive overview of the merits of these techniques for mutation analysis. We evaluated the identified genetic variants, and found that most (74%) of them were observed in both the amplified and the unamplified sequence data. Eighty-nine percent of the variations found by WGA were shared with unamplified DNA. We demonstrate a strategy for avoiding allelic bias by including RNA-sequencing information.


Introduction
The introduction of massively parallel DNA sequencing has massively increased the amount of genetic information that can be generated from tissue and cell samples [1]. Genome-wide analyses of genetic structure are particularly valuable in cancer research, where they can provide important information on the origins of the disease and the optimal course of treatment. However, the quantity of tissue available for study is often limited. Therefore, to facilitate detailed analyses of tumor heterogeneity, there is a need for highly sensitive methods that can efficiently amplify the genomes of cancer cells from small samples and for sequencing the functional parts of the genome. This is not only true for cancer research, metagenomic studies of environmental viruses and microbial communities also deal with low-copy number and heterogeneous DNA composition where the biases of amplification techniques also are of importance [2,3,4,5]. Whole genome amplification (WGA) [6,7,8] and target enrichment [9,10,11] are valuable techniques that are finding increasingly common usage in established cancer research pipelines [12,13,14]. Numerous reagents and commercial sequence capture kits have been developed for these purposes, and comparative reviews indicate that most of them are very effective for targeted exome capture [15,16,17]. A recent study used WGA in conjunction with exome sequence capture to analyze genomic variation in kidney cancer cells at the single nucleotide level [18,19]. The results obtained demonstrated that the combination of these two methods provides a powerful tool for identifying new disease-causing mutations even when working with very small quantities of input genetic material. Multiple displacement amplification is suitable for mutation analysis because it has both high resolution and genome coverage and also high accuracy at the nucleotide level, making it superior to degenerate oligonucleotide primed amplification for identifying novel causative mutations. However, there is a need to fully investigate the limitations and scope of this combined approach. Still there areas that have not been explored and the use of an amplification method before target enrichment might induce more false positives, introduce a bias due to low copy number as starting material and therefore comparison of exome sequencing of unamplified and whole genome amplified material are warranted.
A number of approaches can be used to evaluate the bias introduced when using these two technologies together and to validate identified genetic variations. These include performing analyses on unamplified material (although the availability of a sufficiently large sample may prove limiting here), PCR cloning and Sanger sequencing of genome regions, performing replicate runs (possibly using alternate reagents), and the use of comple-mentary RNA sequencing, among others. The latter two methods are probably most suitable for validation on a global scale, and RNA sequencing has the added advantage that it can be used to confirm the expression of mutated alleles. However, it should be noted that Sanger sequencing of amplicons remains the gold standard in mutation analysis. This paper describes an investigation into the performance of WGA using the phi29 polymerase followed by exome sequence capture and massively parallel sequencing of lung cancer tumor material. To assess the biases of this approach, we sequenced unamplified material and also performed RNA sequencing. Based on our findings, we propose a strategy for identifying biologically relevant variations.

Samples
Samples from sixteen patients diagnosed with non-small cell lung cancer (NSCLC) were obtained from the Institut de Gustave Roussy and Institut Mutualiste Montsouris (IMM), Paris. All participants gave their written informed consent to participate in the study. Both tumor tissue samples and healthy samples were obtained from each patient and extracted to isolate their genomic DNA and RNA. Microscope analysis indicated that the tumor cell content of the tumor samples was .70%. Genomic DNA was isolated from peritumoral tissue using the QIAamp DNA Mini Kit (Qiagen, Hilden, Germany), which produced eluates with DNA concentrations ranging from 2 to 41.6 ng/ul. RNA samples were extracted according to the manufacturer's instructions using the Trizol Reagent from Invitrogen. The obtained RNA concentrations ranged from 102.2 to 2060.4 ng/ul. The ethical guidelines of the CHEMORES consortium (www.chemores.org) were observed during this work, which was conducted with the aim of generating a systems biology database for studying resistance to chemotherapy. The study was approved by the local ethics committee, the Institutional Ethics Committee of the IMM RECHERCHE, Institut Mutualiste Montsouris, Paris, France.

Whole Genome Amplification (WGA)
DNA amplification was performed using the Illustra GenomiPhi V2 DNA Amplification kit (GE Healthcare, Waukesha, Wisconsin) with random hexameric primers, according to the manufacturer's instructions [14], [15]. The samples were subjected to gel electrophoresis to confirm that the WGA reaction has been successful (data not shown).

Target Enrichment of WGA Material
Following WGA of the DNA from the tumor cells and the healthy samples, sequence capture was performed following the protocol provided with the solution-based Roche Nimblegen EZ kit (Roche-NimbleGen, Madison, WI, USA). Concentrations were measured using Qubit, after which libraries were prepared for Illumina sequencing. The samples were multiplexed and clustered on a cBot cluster generation system, using a paired-end read cluster generation kit according to the manufacturer's instructions. The samples were spiked with 1% PhiX Library for quality control purposes and sequenced on an Illumina HiSeq2000 instrument using paired end 26100 bp technology (Illumina, San Diego, CA, USA). Base conversion was achieved using Illumina's OLB v1.9 software.

Target Enrichment of Unamplified Material
For comparative purposes, sequence capture of unamplified tumor DNA material was achieved according to the protocol provided with the SureSelect Human All Exon V4 kit (Agilent, Santa Clara, CA, USA). Library preparation and sequencing were performed using the HiSeq2000 instrument as described above.

Alignment and Variant Calling of Genomic DNA
The reads obtained by sequencing the DNA samples were quality checked using FastQC, and aligned against the reference genome HG19 using Mosaik-aligner version 2.1 (http://code. google.com/p/mosaik-aligner/). The coverage and total number of reads were evaluated for each sample. Variant calling was performed using GATK version 1.5.2.1 and was based on positions having .10-fold coverage with a phred scaled quality score above 30 in order to reduce the number of false positives. The total numbers of heterozygous and homozygous variants as well as the ratio of the two were calculated for each sample. The ratio of transversions to transitions was calculated for each sample using custom scripts.

Comparison of WGA and Unamplified Sequencing of Tumor DNA
To evaluate the reliability of WGA as a sample preparation method for exome capture, we compared single nucleotide variant (SNV) calls obtained using WGA to those obtained using unamplified sequencing. First, the genetic variants were filtered to include only exonic regions (CCDS) covered by both the Nimblegen Sequence Capture and Agilent SureSelect. The numbers of genetic variants found by WGA, unamplified sequencing, or both were then extracted using custom scripts and compared using Venn diagrams for each patient. The coverage of genetic variants found uniquely by WGA and unamplified sequencing was also investigated, along with the coverage of shared genetic variants. In addition, we examined the ratio of variant reads to total reads for sequencing following WGA and unamplified sequencing.
To evaluate the effect of coverage filtering on the observed discrepancies between genetic variants identified using WGA and without amplification, the alignment files for the unamplified sequences (BAM files without coverage filtering) were inspected to check for the presence of variants identified only by WGA and vice versa.

Identification of Tumor Specific Mutations
Tumor specific mutations were identified by removing variants found by sequencing after WGA of the patient's normal tissue DNA from the variants found by sequencing of WGA tumor tissue DNA. The total number of exonic mutations for each patient was calculated. To evaluate the performance of sequencing following WGA in identifying mutations, the mutations found in each patient were compared to genetic variants found by sequencing unamplified tumor tissue.
The ratio of tumor specific mutations to the total number of genetic variants was also investigated for variants found exclusively by sequencing following WGA as well as for those identified by unamplified sequencing alone and those identified by both methods (shared variants).

RNA Sequencing, Alignment and Variant Calling
Eleven of the 16 tumor tissue samples provided RNA of sufficient quality and quantity for analysis. These samples were sequenced using the RNA TruSeq kit (Illumina) and a Illumina HiSeq2000 instrument, and aligned to the reference genome (HG19) using Tophat (version 1.0.14). Cufflinks (version 0.8.3) was used to compute FPKM values, and variants having .10-fold coverage and quality scores above 30 were called by mpileup (Samtools). Samtools was used since GATK gave inconclusive variant calls in this pipeline.
To further evaluate the performance of WGA in mutation identification, we analyzed the BAM files obtained during RNA sequencing to determine whether they reflected the genotype calls obtained by sequencing WGA DNA.
To evaluate the use of RNA sequencing for variant calling, we investigated the number and fraction of bi-allelically expressed genes for all variants found by sequencing both WGA and unamplified material. To ensure that only bi-allelically expressed genes were considered, genes were excluded from the comparison unless they had at least one heterozygous variant according to RNA sequencing and were in the top 75% of genes with an FPKM value of at least one.

Patients
The patients' characteristics are presented in Table 1. All patient samples were analyzed by sequence capture and massively parallel sequencing either after WGA or using unamplified material. In addition, RNA sequencing was performed on a subset of samples. Exome sequencing following WGA yielded an average of 36 M reads/patient with a sequence quality of at least 20 and 60X on target coverage for tumor samples, and an average of 34 M reads/patient with a sequence quality of at least 20 and 67X on target coverage for normal samples. Exome sequencing of unamplified tumor material generated a total of 39 M reads/ patient with a sequence quality of at least 20 and 74X on target coverage (see Table 2).

Variant Calling after DNA Sequencing
Genetic variants in all samples were called using GATK [20], giving heterozygote:homozygote ratios ranging from 1.7 to 2.3 in samples subjected to WGA, and from 1.7 to 4.9 for unamplified samples (see Table 2). For the unamplified material, four patient samples (127, 146, 225, and 344) yielded anomalously high heterozygote:homozygote ratios (ranging from 4.5-4.9) and were therefore excluded from further comparative analyses. The transition:transversion ratio ranged from 2.0 to 3.4 for all samples. The average total number of variants detected in samples following WGA was 8073 for tumor tissue and 8277 for normal tissue ( Table 2).

Comparison of WGA and Unamplified Sequencing of Tumor DNA
To identify biases that may be introduced by WGA, SNVs identified in the sequencing data obtained using WGA and unamplified material were compared using Venn diagrams (see Figure 1). The numbers of genetic variants called for WGA and unamplified material from tumor samples from two representative   Figures 1A and 1C. Most (74%) of the identified genetic variants were observed in both the amplified and the unamplified sequence data (Table 3). However, in most patients (83%) the number of unique variants identified in the unamplified material was greater than that observed following whole genome amplification. The coverage of genetic variants found uniquely by WGA and without amplification is presented using box plots for patients 140 and 295 in Figures 1B and 1D, along with coverage data for genetic variants observed in both sets of sequencing data (results for samples from the other patients  considered are presented in Figure S1). The coverage of unique variants was consistently lower than that for common variants that could explain some of the discrepancies. Figure 2 provides a representative comparison of the results obtained with WGA and without amplification, showing the distribution of coverage over the exons of the gene SPINK1. Notably, the peaks for the WGA sequence are somewhat broader than those for the unamplified material in line with the overall lower coverage on target as compared to unamplified material for similar number of reads.
To further evaluate the performance of the two methods in variant calling, we also compared the unamplified and WGA sequence data for patients 140 and 295 with respect to the number of variant reads divided by the total number of reads for each variant position (see Figure 3). Correlations for the remaining patients are presented in Figure S2 and Table S2. The ratios obtained using WGA were comparable to those for unamplified sequencing, with R 2 values ranging from 0.79 to 0.88 in all cases except the four excluded samples.
Since on average only 74% of the total number of genetic variants found for each patient were observed in both the WGA and unamplified sequence data, we further examined the effect of coverage filtering on the discrepancies between genetic variants identified in WGA and unamplified material. This was done by calculating the fraction of variants that were only identified by WGA but were also present in the alignment files (BAM files without coverage filtering) for the unamplified sequences, and vice  Table 3. Summary data for SNVs in positions identified in both WGA and unamplified sequence data, and using only one of the two methods. versa (Table 4). On average, 79% of the positions at which genetic variation was only observed in the sequence data for the unamplified material were identified in the aligned sequence BAM file for the WGA material with a coverage value of less than 10. For positions at which genetic variation was only observed in the WGA sequence data, only 36% were subsequently identified in the aligned sequence BAM files for the unamplified material with coverage values of less than 10, indicating a fraction of false positive variants in the WGA material.

Identification of Tumor Specific Mutations
A key goal when sequencing tumor samples is to identify tumor specific mutations. We therefore compared the tumor-specific variants detected in the WGA sequence data to those for the unamplified material. On average, the WGA sequence data for the tumor samples contained 1942 tumor specific mutations (i.e. positions with coverage .10 for which no variation was detected in the WGA sequence data for the healthy tissue samples). To study these mutations more in detail, we compared the WGA results to those obtained for unamplified tumor sample DNA and  were able to confirm the presence of just over half of them (50.2%; see Table 4).

Validation of Identified Genetic Variants using RNA Sequencing
To determine whether it might be possible to use RNAsequencing to validate mutation data, we calculated the fraction of tumor specific mutations (on average 1942) for which a corresponding position could be identified in the RNA sequence alignment files. On average, 71% of the mutations identified based on analysis of WGA sequence data were confirmed in this way (see the final column of Table 4).
To evaluate the use of RNA sequencing for variant calling, we used the number of bi-allelically expressed genes (i.e. genes with at least one heterozygous variant) and the list of the entire set of variants. We looked at the fraction of these genes for which analyses of the WGA and unamplified sequence data had also resulted in the identification of at least one genetic variant. In this comparison, we focused on transcripts with FPKM values above 2.8. To rule out mono-allelic expression, we chose transcripts for which two alleles were expressed (as indicated by the minor allele accounting for at least 20% of all transcripts) and compared the RNA sequence data to DNA sequence data for genes that exhibited variation when using either or both of the two tested sample preparation methods. Bi-allelic expression was observed for 37.6 percent of the expressed genes (representing the top 75% of all genes with FPKM.1). In total, 39% of the bi-allelic genes found by RNA sequencing also contained at least one genetic variant that was identified by sequencing of WGA and unamplified DNA. (See Table S2).

Discussion
In general the coverage, total number of heterozygotes, homozygotes, transition/transversion rate and total number of reads were comparable for sequences produced by WGA and when using unamplified material, indicating that both methods worked well on most samples. Most (74%) of the SNVs identified in this work were observed in both the WGA sequence data and that for the unamplified material. Conversely, 9% of the identified SNVs were observed only in the WGA sequence data and 18% were observed only in the sequence data for the unamplified material. Of the SNVs found by both methods the read ratios for variant and total reads were also equivalent (R 2 = 0.79-0.88, Table S1), but with some biased for the variant allele by WGA. The SNV that were identified by only one of the two sample preparation methods generally had lower coverage than those identified by both, indicating that these SNVs are located in regions that are more difficult to sequence. This is in accordance with the data presented by Indap et al. that also showed that down sampling bam files from sequencing of WGA samples results in reduction in accurately called variants [21]. When the search was expanded by removing the coverage filter, it was determined that 72% of the genetic variants found uniquely by sequencing unamplified material were present in the WGA alignment file. However a similar search expansion for the BAM file of the unamplified material identified only 36% of the genetic variants found in the filtered WGA data. This suggests that WGA might introduce some false positive genetic variation. Jiang et al. showed biased for WGA when comparing constitutional genetic variants called after WGA and shot-gun re-sequencing in a single sample, with an higher error rate for false negatives than false positives [22]. This difference compared to our results might be due to the presence of low frequency mutations in our material which might introduce a bias for the WGA.
All the WGA samples had satisfactory quality control results, including phred based quality scores, mapped reads values, and number of called variants, etc., and also had acceptable ratios of heterozygous/homozygous SNVs. However, four samples of the unamplified material did not satisfy this last criterion. Interestingly, of the four samples with higher heterozygous/homozygous ratios, sample 127 had a tumor content of 65%, which may have interfered with the relative abundance of heterozygotes and homozygotes. Samples 146 and 344 represented more morphologically aggressive cancers, with tumors at a later stage of development than was the case for the other samples. As a result, the sampled material may have been surrounded by tumor tissue with different levels of tumor progression. Since the unamplified method requires larger samples than are used with WGA, it is possible that they may have incorporated material with different levels of tumor development and would therefore exhibit a higher level of genetic variation than would be observed for the WGA samples. This highlights the importance of considering factors other than read depth and standard quality measures alone, even for samples that seem to satisfy conventional criteria. A more detailed analysis of these samples focusing on the read ratios and SNV coverage for the WGA and unamplified sequence data revealed an even more pronounced skew (see Figure S1 and S2). Also note that the difference in coverage and variants might also be due to the use of two different exome kits, however, we have tried to minimize the effects of this by only comparing the regions covered by both enrichment methodologies.
In cancer research, there is great interest in identifying tumor specific mutations [23,24]. This was achieved by comparing the variants found in the WGA sequence data for tumor samples and healthy tissue, and excluding all variants present in the latter. On average, 1942 tumor-specific variants were identified in this way for each tumor sample, although only 46% of these could be confirmed by sequencing the unamplified tumor sample DNA. This should be compared to the observation that 89% of all SNVs (i.e. both tumor-specific and non-tumor specific SNVs, average 11432 shared variants) identified by sequencing tumor samples following WGA could be confirmed by sequencing unamplified material. This suggests that WGA introduces some false positive calls that are identified as mutations when the patient's constitutional genetic variation is eliminated. Interestingly, 10% of the genetic variants found using both the unamplified and WGA protocols were mutations, i.e. variations identified by exome sequencing for both WGA and unamplified tumor DNA that were not present in the sequence of DNA from healthy samples. Of the SNVs found uniquely by WGA, 54% were identified as mutations. This value is comparable to the average number of confirmed mutations for this sample preparation method i.e. the fraction of WGA mutations (WGA pos and normal neg) found also by exome sequencing of unamplified tumor is almost the same as the fraction of mutations found only in the WGA exomes. Here we compared the coding genetic mutations found by standardized bioinformatic pipelines in WGA samples versus non-WGA samples. We have taken into account the coverage of the different capturing methods, and identified variants with and without a coverage filter. This has its limitation in that low coverage areas might be biased by inaccurate genotyping in these regions. However, in the coding region most bioinformatic pipelines do find mutations by comparing sequencing of both tumor and normal tissue. There are bioinformatic tools to deal with these issues like VarScan 2 [25] and pibase [26], but this might have the limitation of only studying high quality regions.
RNA-sequencing (RNA-seq) could potentially be useful for validating mutations identified in WGA sequence data. On average, we found that 71% of such mutations can be covered by RNA sequencing if the adequate sequencing depth can be achieved. However, the drawbacks of variant calling based on RNA sequence data are that it depends on bi-allelic expression and that there is a lack of variant calling tools that are designed for variant calling based on RNA-seq data. To eliminate the influence of bi-allelic expression on the variant calls, we determined the number of genes that were expressed bi-allelically (i.e. for which we found heterozygous SNVs) based on the RNA-seq data and also determined how many of these genes contained variants based on the WGA and unamplified tumor DNA sequence data. On average, 39% of the genes called with heterozygous variants in RNA-seq could also be verified by inspection of the WGA and unamplified tumor sequence data. This indicates that a large fraction of the variant calls obtained from RNA sequence data cannot be validated [27]. Although it is possible that the advent of rapid and affordable RNA sequencing might eliminate these discrepancies to some extent [28]. A further advantage of using RNA-seq data is that only genetic variants in highly expressed genes are identified.
Overall, the results presented herein indicate that the use of WGA in conjunction with sequence capture can identify a large fraction of the genetic variants found without using amplification when doing exome sequencing. However, a comparatively high fraction of the mutations identified using WGA alone cannot be confirmed using other methods. The main concern regarding the use of WGA in conjunction with sequence capture for mutation analysis is of course the introduction of false positives, which in this study was found to some extent. However, the use of WGA for identifying SNVs and constitutional genetic variants seems to be reliable, although caution should be taken when using WGA to identify mutations. Figure S1 Box-plots of the coverage of genetic variants found uniquely by WGA and without amplification as well as the coverage of shared genetic variants. The two leftmost boxes represent shared variant calls with coverage in those positions for WGA and without amplification, respectively. The two rightmost boxes represent coverage over unique positions for each method. (DOCX) Figure S2 The regression between the numbers of variant reads divided by total reads for SNV identified by sequencing of WGA and unamplified DNA per position.