Exome Sequencing Reveals Comprehensive Genomic Alterations across Eight Cancer Cell Lines

It is well established that genomic alterations play an essential role in oncogenesis, disease progression, and response of tumors to therapeutic intervention. The advances of next-generation sequencing technologies (NGS) provide unprecedented capabilities to scan genomes for changes such as mutations, deletions, and alterations of chromosomal copy number. However, the cost of full-genome sequencing still prevents the routine application of NGS in many areas. Capturing and sequencing the coding exons of genes (the “exome”) can be a cost-effective approach for identifying changes that result in alteration of protein sequences. We applied an exome-sequencing technology (Roche Nimblegen capture paired with 454 sequencing) to identify sequence variation and mutations in eight commonly used cancer cell lines from a variety of tissue origins (A2780, A549, Colo205, GTL16, NCI-H661, MDA-MB468, PC3, and RD). We showed that this technology can accurately identify sequence variation, providing ∼95% concordance with Affymetrix SNP Array 6.0 performed on the same cell lines. Furthermore, we detected 19 of the 21 mutations reported in Sanger COSMIC database for these cell lines. We identified an average of 2,779 potential novel sequence variations/mutations per cell line, of which 1,904 were non-synonymous. Many non-synonymous changes were identified in kinases and known cancer-related genes. In addition we confirmed that the read-depth of exome sequence data can be used to estimate high-level gene amplifications and identify homologous deletions. In summary, we demonstrate that exome sequencing can be a reliable and cost-effective way for identifying alterations in cancer genomes, and we have generated a comprehensive catalogue of genomic alterations in coding regions of eight cancer cell lines. These findings could provide important insights into cancer pathways and mechanisms of resistance to anti-cancer therapies.


Introduction
All cancer cells have somatic mutations in their genomes, such as single nucleotide mutations, insertions, deletions, and copy-number gain or loss. Genomic lesions in cancer cells disrupt normal functions and pathways such as proliferation and apoptosis, and are essential for tumor genesis, growth, and metastasis. In addition, each tumor carries a unique combination of mutations in its genome, leading to heterogeneity in cancer prognosis and responses to therapeutic intervention. Our limited understanding of the more common mutations has already affected therapeutic regimens. For example, treatment with small molecule inhibitors of the epidermal growth factor receptor (EGFR) has been shown to primarily benefit lung cancer patients that carry certain somatic mutations in their EGFR gene [1,2]. Similarly, certain antibody therapies directed against EGFR only show efficacy in the subset of colorectal cancer patients with a wild-type KRAS gene [3,4]. Deep systematic characterization of somatic mutations in cancer genomes promises to be a powerful tool for both understanding cancer pathways and developing targeted therapeutics.
Over the last two decades, focused studies on candidate genes have led to the identification of mutations occurring with high frequency in crucial cancer pathway genes such TP53, KRAS, and PTEN [5]. In recent years, the coding regions of breast, lung, colon, and brain tumor genomes have been analyzed using capillary-based sequencing technologies. These efforts have led to the identification of causative mutations in previously unsuspected genes such as IDH1, highlighting the power and importance of unbiased, genomic-scale mutation discovery [6,7,8]. However, large-scale capillary-based sequencing technologies are time consuming and expensive, and thus not feasible for wider use.
Next-generation sequencing (NGS) technologies have increased the throughput and decreased the cost of DNA sequencing by several orders of magnitude. A number of studies have applied NGS technologies to sequence cancer genomes, as summarized in recent reviews [9,10]. However, sequencing the whole genome is still cost-prohibitive for many potentially valuable applications.
One alternative to whole genome methods is exome sequencing, which captures and sequences only coding exons in the genome. Exome sequencing methods can deliver sequencing information for much of the functionally relevant genome at increased coverage and reduced cost. Recent studies have successfully applied exome sequencing to identify causal mutations of Mendelian diseases [11,12]. Large cancer genome initiatives such as The Cancer Genome Atlas project also include exome sequencing as part of their strategy to characterize cancer genomes [13].
Protein kinases are the most ubiquitous family of signaling molecules in human cells and play essential roles in regulating most cellular functions [14]. Since the protein kinase family is one of the most frequently mutated gene families in cancers [5], it has been subjected to several focused genomic sequencing studies. Bardelli et al. conducted the first systematic screen of mutations in the receptor tyrosine kinase subfamily of protein kinases, in colorectal cancer samples [15]. Since then, studies in primary tissues and cell lines have identified many mutations in protein kinases across multiple tumor types [16,17,18]. The interest in mutations of kinases has continued with recent genome-wide mutation discovery studies [13,19,20].
Cell line models of human cancer have played a critical role in our understanding of cancer disease pathways, identification and validation of cancer target genes, and our ability to screen potential anticancer drugs. These cell lines carry genomic mutations inherited from their source tumor cells, although additional mutations can be acquired during the course of cell line development and passage. In general, comparisons between cell lines reveal substantial heterogeneity in genomic mutations and reflect cancer pathways similar to those found in primary tumors. For example, comparison of a panel of breast cancer cell lines with a collection of primary breast samples showed that gene expression and copy number profiles in cell lines mirror those found the primary tumors [21]. Similarly, genomic mutations reported in the COSMIC database for cell lines have a similar spectrum to those in primary tumors [22]. As additional largescale tumor genome sequencing results become available, there is a growing need for corresponding cell models to determine how novel variants affect protein function. Comprehensive characterization of genomic alterations in cancer cell lines will advance our understanding of cancer biology, and could also provide a basis for choosing relevant cell line models to study a particular aspect of cancer disease biology, or to screen for antagonists of certain cancer pathways.
To evaluate NGS technologies and to characterize genomic mutations in cancer cell lines, we have analyzed data from the Roche Nimblegen exome capturing array and Roche 454 NGS technologies, applied to eight commonly used cell lines representing several major cancer types. We demonstrate that exome sequencing can be a reliable and cost effective way for identifying genomic alterations in cancer genome, and generated a comprehensive catalogue of genomic alterations in coding regions of eight cancer cell lines.

Exome capture and sequencing results
Exome capture and 454 sequencing technologies were applied to DNA samples from eight cancer cell lines (A2780, A549, COLO205, GTL16, NCI-H661, MDA-MB468, PC3, and RD, as described in Methods. The results of initial data processing are summarized in Table 1. For each cell line, about 1.9 million sequencing reads (688 million bases; 98.5% of total sequencing reads) could be successfully mapped to the human genome NCBI36/hg18 reference assembly (http://www.ncbi.nlm.nih.gov). The average read length across all cell lines is 364 bases, consistent with the long read length reported for the 454 sequencing technology. On average, 89.5% of the circa 180,000 exons on the Nimblegen 2.1 M human exome array (target regions) were covered with at least one sequencing read, and the average sequencing read depth for all cell lines is 7.3 in target regions. The exome capture and sequencing results are within the normal range of performance specified by the manufacturer and are comparable with published results using the same technology [23].
We detected on average 14,340 sequence variants (differences from the human reference genome) per cell line. The majority of these differences are known polymorphisms in normal human population (i.e. recorded in NCBI dbSNP database, build 130). On average 2,779 variants per cell line are not found in the dbSNP database, and therefore represent novel sequence variations and/or somatic mutations. On average 1,904 of the 2,779 novel variants are non-synonymous, i.e. they alter codon specificity. These variants are more likely to change protein functions and impact cellular phenotypes.

Concordance with genotyping results
As another means to assess the accuracy of exome sequencing, we compared the data with genotyping results across the eight cell lines ( Table 2). The Affymetrix Genome-Wide Human SNP Array 6.0 is  Table 2). It is expected that the accuracy of genotype detection by sequencing will be influenced both by sequencing read depth and by heterozygosity at a given genomic location. We calculated concordance of genotype calls at difference sequencing read depth, and separately for homozygous or heterozygous SNPs. As shown in Figure 1, concordance is high for homozygous SNPs (average 97%) regardless of sequencing read depth. Concordance for heterozygous alleles is lower, but increases with sequence read depth, starting with  31% concordance at a read depth of 3 and reaching . 90% at a read depth of 10 or higher. In theory, sequencing DNA fragments from a region that contains a heterozygous SNP is a process of random sampling. At lower sequencing depth, there is a higher chance of missing one of the two alleles. We calculated the theoretical rate of detecting both alleles by sequencing at different read depths, assuming no error in sequencing ( Figure 1, dashed line). At low read depths, our experimental observations are close to the theoretical rate, indicating that low concordance at low read depths is likely due to the random sampling process rather than poor quality of sequence data.
Comparison of exome sequencing to the COSMIC database of cancer mutations The protein-coding exons and immediate flanking intron sequences of 61 common cancer genes have previously been systematically determined in about 800 cell lines by the Welcome Trust Sanger Institute, using capillary-based sequencing [22]. Of the eight cell lines in this study, all except one (GTL16) have been screened in that project. We compared somatic mutation information from the Sanger COSMIC database with our exome sequencing results for the seven cell lines. As shown in Table 3, exome sequencing re-discovered most of the 21 mutations reported in the COSMIC database, including point mutations and small insertion/deletions. The two missing cases are due to lack of sequence coverage in the locus of interest: the documented STK11 mutation in A549 is not measurable due to lack of STK11 gene coverage in the Nimblegen 2.1 M human exome arrays, and the TP53 gene is covered by the Nimblegen array but lacks sufficient reads in the PC3 line to verify in this study (there are sufficient reads for the TP53 gene in other lines, as in Table 3).
Large homozygous deletions, such as the known deletions of the CDKN2A gene in A549 and SMAD4 in Colo205 cells, cannot be directly observed with exome sequencing. But a deletion of gene regions can be inferred where the read depth is zero for several consecutive exons (see next section for detailed discussion). All five genomic deletions reported in the COSMIC database are identifiable from exome sequencing results (Table 3). For example, in the A549 cell line we observed 14 consecutive regions around CDKN2A gene with a read depth of zero. In the Colo205 cell line, a documented 904-base deletion in the SMAD4 gene manifests as 4 consecutive target regions with a read depth of zero.

Detecting gene amplification and deletion
Deletions or amplifications of chromosomal segments are common alterations in cancer genomes. In principle, the sequencing read depth in a region should be proportional to its copy number. However, the relatively modest read depth of the current study could give undue weight to random variations in read depth. Variability in read depth could also arise from technical aspects of the exome sequencing process. For example, the exome capturing array could vary in efficiencies for different exon regions due to diverse sequence composition. To assess the possibility of estimating copy number information from our exome sequencing data, we compared average sequence read depths with copy-number data estimated from SNP6 platform. As show in Figure 2, there is a positive correlation between sequence read depth and copy-number, with Pearson correlation coefficient of 0.41. The variation in read depth makes it challenging to accurately detect low-level copy-number changes. On the other hand, we find that accurate detection of high-level gene amplifications and homozygous deletions is possible. Homozygous deletion of the SMAD4 gene region has been reported in the MDA-MB468 cell line (Sanger COSMIC database) and is thus illustrative for comparing deletion detection methods. The sequencing read depths of exon regions in SMAD4 gene and surrounding area were determined for MDA-MB468 and plotted according to their chromosomal location ( Figure 3A A read depth of zero could result from technical issues, such as probe design in the Nimblegen 2.1 M array. In fact, we identified 2,513 exon regions that have a read depth of zero for all 8 cell lines (Table S1). However, since the median read depth across all 8 cell lines is greater than zero for all of the 16 exon regions ( Figure 3A), it is unlikely that the observed depth of zero in the MDA-MB468 cell line is due to a systematic failure of exome capture. Random variation in read depth is another reason for lack of sequencing coverage. In the MDA-MB468 cell line, there are 17,161 exon regions with a read depth of zero (from 194,706 total regions, excluding the 2,513 regions mentioned above). It is highly unlikely that 16 consecutive exon regions around SMAD4 gene would have a read depth of zero due to random variation (p = 1.3e-17, calculated from the binomial distribution).
We were also able to re-identify previously documented gene amplification events using the read depth data. For example, amplification of EGFR1 in the MDA-MB468 cell line has been documented by fluorescence in situ hybridization and by quantitative PCR [24]. We observed that the 53 exon regions around the EGFR gene on chromosome 7 have very high read depths in the MDA-MB468 data ( Figure 4A; the exons between 55.58-55.73 Mb have an average read depth of 107). Our copy number analysis of the Affymetrix SNP array 6.0 data also indicated that the EGFR gene region is highly amplified in the MDA-MB468 line ( Figure 4B, genomic region 55.48-55.81 Mb).

Novel non-synonymous variants in protein kinases
Since mutations in protein kinases have important roles in cancer biology, we chose to examine the sequence data for protein kinases and focus on non-synonymous variations, which produce amino acid substitutions that may have functional consequences. As noted above, exome sequencing revealed circa 2,000 novel non-synonymous variants in each of the eight cell lines. After applying a stringent filter (as described in Methods), between 199 to 479 genes have novel non-synonymous variants, depending on the cell-line (Table S2). The Nimblegen 2.1 M capture array used in this study included exons for 440 of the 518 protein kinases in the human genome (Table S3) [25]. In each cell line, an average of 122 non-synonymous variations were detected in kinase genes. After removing likely germline variants (found in dbSNP) and applying a stringent filter described above, each cell line has an average of eight kinases with non-synonymous variations (Table 4). These sequence variations in protein kinases are listed in Table 5. Most of these sequence variations are not reported in the COSMIC database or reported in the literature, but several have independent confirmation. For example, we identified EGFR variant A1048V in the GTL16 gastric cell line. The same variant in EGFR has been reported in the MKN45 gastric cell line [26], which is the parental cell line of GTL16 [27]. A second example is the R796S variant of the insulin receptor gene (INSR) in the RD cell line (Table 5). We had previously identified this variant in the RD cell line using capillary sequencing technology (data not shown).

Discussion
Analysis of data from eight diverse cancer cell lines shows that Roche Nimblegen and 454 exome sequencing technologies can be successfully applied to identify variations in gene-coding regions. From sequencing data with an average of 7.3-fold coverage, variants from the NCBI36 reference genome were identified in about 8% (14,340 regions) of all target regions on the exome capture array. While the majority of these variants could be confirmed in dbSNP database, on average 0.16% (2,779) of total target regions carry a novel variant.
A comparison of SNP genotype calls from exome sequencing with data generated on the Affymetrix Genome-Wide Human SNP Array 6.0 showed that there is high concordance between the two technology platforms. The concordance is 97% for homozygous sites, and ranges from 30% to .90% at heterozygous positions, with accuracy dependent on sequencing read depth. Our analysis of the relationship between read depth and power of detection suggested that a minimum of ten-fold read depth is required for reliably detecting both alleles at heterozygous sites. These results provide guidance in planning future genome sequencing projects.
For the seven examined cell lines that are also present in the COSMIC database, we show that 19 of 21 known mutations can be re-discovered by exome sequencing. Two previously described mutations were missing due to lack of sequence coverage. In one case this was due to incomplete coverage of the human exome in the Nimblegen 2.1 M capture array, indicating a need for improvements in array design.
By successful re-identification of the EGFR amplification and the SMAD4 homozygous deletion in the MDA-MB468 cell line, we demonstrate that copy number alterations can be inferred from the sequencing read depth data. However, because of the stochastic nature of sequencing read depth and likely unevenness in the exome capturing process, in general it is not possible to reliably estimate copy-number information from our data. Applying the technology to more samples would help improve our ability to estimate and correct for systematic biases in the platform, and increasing the depth of sequencing reads would reduce the variance due to random fluctuation in read number.
To bring context to the genomic variation identified in this study, we chose to focus on protein kinases as an illustrative class. In this work, we identified with high confidence at least four novel variant protein kinases in each cell line. Most of the novel sequence variations in protein kinases identified in this study have not previously been reported, and probably reflect the high diversity of genomic alteration in cancer. Our results expand the knowledge of sequence variations in protein kinases and other potential cancer-related genes. These novel variants could be either germline SNPs not yet reported in the dbSNP database, or somatic mutations in these cancerous cells. Several large-scale human genome sequencing projects currently in progress will expand identification of germline SNPs and help to categorize the nature of novel variants found in tumors.
In conclusion, we showed that exome sequencing can be a reliable and cost-effective approach to identify genomic alterations in cancer cell lines, and suggest ways to further improve exome-sequencing technologies for applications in cancer genomics. A comprehensive catalogue of genomic alterations in the coding regions of eight cancer cell lines was generated, which should contribute not only to our knowledge of these models in particular, but also to our understanding of cancer genomics and cancer biology in general.

Array-based Genotyping and Copy-number Analysis
Two aliquots of 250 ng genomic DNA per sample were digested by restriction enzymes NspI and StyI, respectively. The resulted products were ligated to the corresponding adaptors and PCR amplified. The labeled PCR products were hybridized to the Affymetrix Genome-Wide Human SNP Array 6.0 according to the manufacturer's recommendations. The Birdseed algorithm   [28] implemented in Affymetrix Power Tools (APT) Software Package (version 1.10.0) was used for genotype determination. For copy-number analysis, the Cel files were processed using the aroma.affymetrix package [29] for the R-project. Segmentation of normalized raw copy number data was performed with the CBS algorithm [30] implemented in the aroma.affymetrix package.

Bioinformatics analysis
The Human genome NCBI36/hg18 reference assembly ( http://www.ncbi.nlm.nih.gov/ genome/ guide/ human/ release _notes.html#b36) was used as the framework for all analyses. Sequence data processing, mapping to the human genome, and initial calls of variation from the reference sequence were performed by Roche 454 Life Science using GS Reference Mapper software (Roche Inc.). To qualify as a variant from the reference genome sequence, there must be at least two independent reads that 1) show the difference, 2) have at least 5 bases on both sides of the difference, and 3) have few other isolated sequence differences in the read. Variants identified as 'high confidence' were subject to a more stringent filter, requiring at least three independent reads with the variant comprising at least 40% of all independent reads covering the allele genomic position. To identify non-synonymous variants, the impact of each variant on translated protein sequence was assessed by mapping its genomic coordinates back to genes in RefSeq collection [31] release 37, and identifying changes in codon specificity.
We calculated the theoretical rate of detection at heterozygous positions as a function of different read depth as follows: N sequencing reads covering a heterozygous position could be considered as random sampling of the two alleles repeated N times, thus should follow the binomial distribution. Assuming that allele A is reported in the human reference genome and allele B is the variant allele, we require at least two sequencing reads with the B allele for declaring the detection of allele B. The probability of detecting both A and B alleles at a heterozygous position can be calculated as: PAB = 12P12P2. P1 is the probability of finding 0 or 1 read with the A allele in N sequencing reads according to the binomial distribution, which would lead to a genotype call of AA. P2 is the probability of finding N reads with the B allele in N sequencing reads according to the binomial distribution, which will lead to a genotype call of BB.   Table 5. High confidence* non-synonymous variants in protein kinase genes in each of 8 cell-lines.