Conceived and designed the experiments: MJC NH BDO ZC HL BM SFN. Performed the experiments: MJC BDO ZC HL. Analyzed the data: MJC NH BDO AE HL BM SFN. Contributed reagents/materials/analysis tools: MJC NH BDO ZC HL BM SFN. Wrote the paper: MJC NH BDO ZC AE HL BM SFN.
The authors have declared that no competing interests exist.
U87MG is a commonly studied grade IV glioma cell line that has been analyzed in at least 1,700 publications over four decades. In order to comprehensively characterize the genome of this cell line and to serve as a model of broad cancer genome sequencing, we have generated greater than 30× genomic sequence coverage using a novel 50-base mate paired strategy with a 1.4kb mean insert library. A total of 1,014,984,286 mate-end and 120,691,623 single-end two-base encoded reads were generated from five slides. All data were aligned using a custom designed tool called BFAST, allowing optimal color space read alignment and accurate identification of DNA variants. The aligned sequence reads and mate-pair information identified 35 interchromosomal translocation events, 1,315 structural variations (>100 bp), 191,743 small (<21 bp) insertions and deletions (indels), and 2,384,470 single nucleotide variations (SNVs). Among these observations, the known homozygous mutation in
Glioblastoma has a particularly dismal prognosis with median survival time of less than fifteen months. Here, we describe the broad genome sequencing of U87MG, a commonly used and thus well-studied glioblastoma cell line. One of the major features of the U87MG genome is the large number of chromosomal abnormalities, which can be typical of cancer cell lines and primary cancers. The systematic, thorough, and accurate mutational analysis of the U87MG genome comprehensively identifies different classes of genetic mutations including single-nucleotide variations (SNVs), insertions/deletions (indels), and translocations. We found 2,384,470 SNVs, 191,743 small indels, and 1,314 large structural variations. Known gene models were used to predict the effect of these mutations on protein-coding sequence. Mutational analysis revealed 512 genes homozygously mutated, including 154 by SNVs, 178 by small indels, 145 by large microdeletions, and up to 35 by interchromosomal translocations. The major mutational mechanisms in this brain cancer cell line are small indels and large structural variations. The genomic landscape of U87MG is revealed to be much more complex than previously thought based on lower resolution techniques. This mutational analysis serves as a resource for past and future studies on U87MG, informing them with a thorough description of its mutational state.
Grade IV glioma, also called glioblastoma multiforme (GBM), is the most common primary malignant brain tumor with about 16,000 new diagnoses each year in the United States. While the number of cases is relatively small, comprising only 1.35% of primary malignant cancers in the US
To that end, numerous cell line models of GBM have been established and used in vast numbers of studies over the years. It is well recognized that cell line models of human disorders, especially cancers, are an important resource. While these cell lines are the basis of substantial biological insight, experiments are currently performed in the absence of genome-wide mutational status as no cell line that models a human disease has yet had its genome fully sequenced. Here, we have sequenced the genome of U87MG, a long established cell line derived from a human grade IV glioma used in over 1,700 publications
The first draft of the consensus sequence of the human genome was reported in 2001
For cancer sequencing, it is important to assess not only SNVs, but indels, structural variations and translocations, and it is preferable to extract this information from a common assay platform. A major characteristic of the U87MG cell line that differentiates it from the samples used in other whole genome sequencing projects published thus far is its highly aberrant genomic structure. Due to its heavily rearranged state, we thoroughly and accurately assessed each of these major classes of mutations and demonstrated that small indels, large microdeletions and interchromosomal translocations are actually the major categories of mutations that affect known genes in this cancer cell line. These analyses provide a model for other genome sequencing projects outside major genome centers of how to both thoroughly sequence and assess the mutational state of whole genomes.
From ten micrograms of input genomic DNA, we performed two and a half full sequencing runs on the ABI SOLiD Sequencing System, for a total of five full slides of data
1 | |
2.5 (5) | |
2×50 | |
1,014,984,286 (101.5Gb) | |
120,691,623 (6.0Gb) | |
390,064,184 (39.06Gb) | |
266,635,829 (13.33Gb) | |
62,336,824 (3.12Gb) | |
55.51Gb |
We also performed an exon capture approach designed to sequence the exons of 5,253 genes (10.7Mb) annotated in the Wellcome Trust Sanger Institute Catalogue of Somatic Mutations in Cancer (COSMIC) V38
The Blat-like Fast Accurate Search Tool (BFAST)
1 | |
1/8 (1) | |
2×76 | |
10,752,923 | |
9,948,782 (1.51Gb) | |
8,142,874 (1.2Gb) | |
1,097,000 (83Mb) | |
1.32Gb | |
317,017,503 | |
29.5× |
The overall pattern of base sequence coverage from the shotgun reads changes across the genome, and as expected is highly concordant with the copy number state as determined by Illumina 1M Duo and Affymetrix 6.0 SNP analysis (
Circos
Single nucleotide variants (SNVs) and small insertions and deletions ranging from 1 to 20 bases (indels) were identified from the alignment data using the MAQ consensus model
In total, we identified 2,384,470 SNVs meeting our filtering criteria. Of these, 2,140,848 (89.8%) were identified as exact matches to entries in dbSNP129
SNV Classification | Total | In dbSNP 129 | Not in dbSNP 129 |
2,384,470 | 2,140,848 | 243,622 | |
2,375,812 | 2,133,226 | 242,586 | |
8,658 | 7,622 | 1,036 | |
151 | 132 | 19 | |
62 | 47 | 15 | |
89 | 85 | 4 | |
134 | 93 | 41 | |
82 | 48 | 34 | |
52 | 45 | 7 | |
8,538 | 7,518 | 1,020 | |
4,005 | 3,134 | 871 | |
4,533 | 4,384 | 149 |
For small (<21bp) insertions and deletions, 191,743 events were detected, of which 116,964 were not previously documented in dbSNP 129. The same criteria used for SNVs were used to determine whether an indel was novel, and indels were further classified as homozygous or heterozygous using the SAMtools variant caller (
Indel Classification | Total | In dbSNP 129 | Not in dbSNP 129 |
191,743 | 74,779 | 116,964 | |
191,359 | 74,643 | 116,716 | |
384 | 136 | 248 | |
84 | 34 | 50 | |
20 | 7 | 13 | |
64 | 27 | 37 | |
91 | 15 | 76 | |
94 | 40 | 54 | |
168 | 45 | 123 | |
193 | 86 | 107 | |
141 | 33 | 108 | |
179 | 80 | 99 | |
26 | 11 | 15 | |
14 | 5 | 9 |
A subset of 38 variants meeting genome-wide filtering criteria, including a 20-base deletion, was tested by PCR and Sanger sequencing with 34 being validated. In summary, 85.2% of SNVs (23/27), and 100% of small insertions (3/3), deletions (4/4), translocations (3/3) and microdeletions (1/1) were validated in this manner (
The size distribution of indels identified in U87MG is generally consistent with previous studies on coding and non-coding indel sizes in non-cancer samples
(A) Distribution of small deletion sizes as a percent of total, comparing amino-acid encoding deletions (blue) with non-coding deletions (red). (B) Distribution of small insertion sizes as a percent of total, comparing amino-acid encoding insertions (blue) with non-coding insertions (red).
A similar trend is seen with insertions in non-coding sequence with the maximum observed insertion size of 17 bases (
In coding regions, there is a bias towards events that are multiples of 3-bases in length that maintain the reading frame despite variant alleles, suggesting that many of these are polymorphisms (
Observed SNV base substitution patterns were consistent with common mutational phenomena in both coding sequences and genome wide. As expected, the predominant nucleotide substitution seen in SNVs is a transition, changing purine for purine (A<->G) or pyrimidine for pyrimidine (C<->T). Previous studies have observed that two out of every three SNPs are transitions as opposed to transversions
Ratio of specific nucleotide substitutions as a percent of total single nucleotide variants, comparing SNVs in coding regions (blue) to SNVs genome-wide (red).
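The transition versus transversion distinction used above can be made concrete with a short Python sketch. It classifies a single-base substitution and computes the transition fraction for a list of SNVs. The function names and the toy input are illustrative assumptions and are not taken from the analysis pipeline used in this study.

```python
# Purines and pyrimidines; a transition stays within one chemical class.
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")
BASES = PURINES | PYRIMIDINES


def classify_substitution(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base change."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref!r}>{alt!r}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def transition_fraction(snvs) -> float:
    """Fraction of (ref, alt) pairs that are transitions."""
    labels = [classify_substitution(ref, alt) for ref, alt in snvs]
    return labels.count("transition") / len(labels)


# Toy input (hypothetical, not U87MG data): two transitions, one transversion.
toy_snvs = [("A", "G"), ("C", "T"), ("A", "C")]
toy_fraction = transition_fraction(toy_snvs)
```

On the toy list the fraction is two out of three, which happens to match the proportion reported by the previous studies cited above.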
To assess the coverage depth of the U87MG genome sequence, we followed Ley et al.
Notably, a variant allele was observed at every position called heterozygous by SNP chip, while a reference allele was observed at 201,414 (97.94%) positions. In other words, the SNV detection algorithm uniformly miscalled the homozygous variant allele. Filtering for quality causes a bias toward identifying SNVs at sites that have higher coverage. That said, after SNV quality filtering, diploid coverage of the cytogenetically normal portions of the genome was 10.85× for each allele, which is clearly adequate for calling over 90% of the base variant positions on each allele at high accuracy.
Because the positions of the genome included on SNP arrays are not a random sampling of the genome, we also assessed mapping coverage genome-wide. Of all bases in the haploid reference genome, 78.9% were covered by at least one reliably placed read. Of that portion of the genome, 91.9% of all bases were effectively sequenced based on passing variant calling filters (Phred>10, >4× coverage, <60× coverage). Thus, a total of 72.5% of the whole genome was sequenced, including repeats and duplicated regions, which is typical of short sequence shotgun approaches.
A total of 10.9Mb of genomic sequence, consisting of the amino acid encoding exons of 5,235 genes, was targeted and sequenced to a mean coverage of 30× using the Illumina GAII sequencer. Given the larger variability of coverage from the capture data, only a subset of these bases (8.5Mb) was evaluable to determine the false positive variant detection rate from the complete genomic sequence data. This region contained 1,621 SNPs present in dbSNP129. Within the 8.5Mb of common and well-covered sequence in the genomic sequence data and the capture sequence, there were 1,780 SNVs called from the genomic sequence. The same non-reference allele was concordantly observed at 1,631 positions within the capture data. At 149 positions, the non-reference allele was not observed in the capture data, but the reference allele was detected. However, the mean coverage at these 149 positions was significantly lower than that of the other 1,631 positions (p = 0.0003), suggesting that the non-reference allele was not adequately covered and is under-called in the capture data. Moreover, of the 1,621 dbSNPs in the region, the capture adequately covered only 1,515. In these data there was a bias for the pull-down data to under-observe the non-reference allele (
There were a total of 100 novel SNVs detected in the ABI genomic sequence dataset that were also very well evaluated in the Illumina pull-down data, with at least 20 high quality Illumina reads, such that the ABI sequence could be well validated. Of these 100 variants, 2 were not observed in the Illumina pull-down sequencing dataset. Thus, across the entire 8.5Mb interval there are 2 unconfirmed variants, for an estimated false positive error rate of about 3×10−7 for the whole interval. Alternatively viewed, there were 100 novel SNVs, with a 2% error rate in those novel positions. Thus, the de novo false discovery rate may be as high as 2%. Extrapolating to the whole set of 243,622 novel SNVs, we expect up to 4,872 false-positive SNVs. These observations are roughly concordant with a sampling of 37 novel SNVs (not in dbSNP) in the whole genome set selected for testing by Sanger sequencing. Of these, 34 out of 37 (92%) were validated.
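The extrapolation in the preceding paragraph is simple arithmetic and can be reproduced with a few lines of Python. The variable names are illustrative; the input numbers are the ones quoted in the text.

```python
# Counts quoted in the text for the 8.5Mb evaluable capture interval.
novel_snvs_checked = 100      # novel SNVs with at least 20 high-quality Illumina reads
unconfirmed = 2               # novel SNVs not seen in the capture data
interval_bases = 8.5e6        # evaluable interval size in bases
novel_snvs_genome = 243_622   # novel SNVs genome-wide

# Upper-bound false discovery rate among novel SNVs.
false_discovery_rate = unconfirmed / novel_snvs_checked

# Unconfirmed calls per evaluated base (on the order of 1e-7).
per_base_error = unconfirmed / interval_bases

# Expected number of false-positive novel SNVs genome-wide at that rate.
expected_false_positives = false_discovery_rate * novel_snvs_genome

# Independent check from Sanger sequencing of sampled novel SNVs.
sanger_validation_rate = 34 / 37
```

With these inputs the false discovery rate is 0.02 and the expected number of false positives rounds to 4,872, while the Sanger validation rate is close to 0.92.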
There are now several publicly available complete genomes sequenced on next generation platforms. We compared the SNVs discovered in U87MG to two of these published genomes: the James D. Watson genome
(A) Venn diagram showing overlap in SNVs among the U87 genome, the Watson genome, and dbSNP 129. (B) Venn diagram showing overlap in SNVs among the U87 genome, the YanHuang genome, and dbSNP 129.
We utilized the predictable insert distance of mate-paired sequence fragments to directly observe structural variations in U87MG. Our target insert size of 1.5kb gave us a normal distribution of paired end insert lengths ranging from 1kb to 2kb with median around 1.25kb and mean around 1.45kb in the actual sequence data (
Structural variations detected by whole genome sequencing in the U87MG genome are plotted in the Circos program. Orange lines linking two chromosomes represent the 35 interchromosomal translocations. Blue lines around the edge of the circle represent microdeletions and intrachromosomal translocations. The outermost histogram represents sequence coverage and demonstrates how the boundaries of changes in coverage typically coincide with a significant structural variation.
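Type | # of events | # that span genes (%) | # of affected genes (%) |
Complete microdeletions | 599 | 95 (15.9%) | 145 |
Heterozygous microdeletions | 361 | 58 (16.0%) | 91 |
Interchromosomal translocations | 35 | 32 (91.4%) | 35 |
Other intrachromosomal events | 319 | 146 (45.8%) | 166 |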
The thirty-five interchromosomal events often coincided with positions of copy number change based on the average base coverage (
Two genomic breakpoint events are highlighted between chromosomes 2 and 16. The outer ring represents the chromosomes displaying tick marks every 100 bases. The green plot shows base-coverage for each position. Each orange line represents a single mate-pair as a link between one end of a read and its mate-pair. Between the breakpoints on each chromosome (chr2:56792000–56953300 and chr16:8826200–8826700), base coverage drops to about half of what it is on the other side of the event, from two to one copy. This suggests an interchromosomal translocation between chromosomes 2 and 16 resulting in a loss of the genomic material between the translocation breakpoints.
A subset of 3 translocations were confirmed by amplifying DNA from the breakpoint-spanning region by polymerase chain reaction and sequencing by dideoxy Sanger sequencing (
The SNVs and indels identified in U87MG were assessed for their potential to affect protein-coding sequence. We considered variants predicted to be homozygous and to affect the coding sequence of a gene through a frameshift, early termination, intron splice site, or start/stop codon loss mutation as causing a complete loss of that protein. We chose to focus on homozygous null mutations for two major reasons. First, this is an interesting set of genes that we can predict from the whole genome data are non-functional within this commonly used cell line. Although heterozygous mutations can certainly affect gene products in multiple ways, it is difficult to assess their effect from genomic data alone. Second, by cross-referencing such null mutations with known regions of common mutation in gliomas we can pick out specific candidates that are of interest to the glioma community.
Of the 2,384,470 SNVs and 191,743 small indels in U87MG, a total of 332 genes are predicted to have loss-of-function, homozygous mutations as a consequence of small variants (
We further divided these homozygous mutant genes by variant type. Of genes mutated by SNVs, 146 contained variants present in dbSNP while only 8 were knocked out by variants not in dbSNP. The ratio of known SNPs causing loss-of-function mutations to total known SNPs (146/2,140,848 = 6.82×10−5) was not significantly different from the ratio of novel SNVs causing loss-of-function mutations over total novel SNVs (8/243,622 = 3.28×10−5; p = 0.04). This indicates that many of the possible de novo point mutations may indeed be rare inherited variants made homozygous by chromosomal loss of the normal allele.
In contrast to the trend in SNVs, small indels that homozygously mutated genes were more often novel. There were 79 genes predicted homozygously mutated by indel variants reported in dbSNP while 99 were predicted mutated by novel indels. Despite this trend, however, there was not a significant enrichment of deleterious indels among the novel indels (99/191,743 = 5.16×10−4) compared to the known indels (79/116,964 = 6.75×10−4; p = 0.08). This suggests that the difference in ratios of novel versus documented SNVs (8 vs. 146) and indels (99 vs. 79) is the result of compositional bias in dbSNP129, which contains a far greater number of SNPs compared to indels.
We also assessed the structural variants in U87MG for whether or not they were likely to affect a gene. Two different criteria were used to determine if translocations and microdeletions impacted a coding region, both predicted to produce an aberrant or nonfunctional protein. Using the UCSC known gene database, we identified 35 genes affected by interchromosomal translocations, 145 affected by complete deletions, 91 affected by heterozygous deletions and 166 affected by other intrachromosomal translocations (
Interchromosomal translocation events were significantly enriched for occurring at positions where they would affect genes with 32 out of 35 events (91.4%) occurring within 1kb of a gene (p<0.0001), while only 44.1% of the reference genome is within 1kb of a known gene. In total, intrachromosomal events did not display this enrichment with 145/319 (45.5%) falling within 1kb of a gene (p = 0.67). However, we ran a set of simulations to assess whether microdeletions were enriched to overlap exons because we noted that 585 of our 599 complete microdeletions were less than 10kb in length with a mean size of 1.8kb. We ran 100,000 simulations randomly placing 600 microdeletions of 2kb length and determined how many times a microdeletion spanned an exon. In this way, we demonstrated that complete (homozygous) microdeletions under 10kb in size spanned exons slightly more often than by chance with a simulated p-value of 0.046. Similar assessment of microdeletions greater than 10kb in size did not find evidence of enrichment. These findings suggest that small microdeletions may preferentially occur within genes as opposed to being randomly distributed across the genome, but the signal is not strong from the available data. Genes affected by structural variations are summarized in
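The placement simulation described above can be sketched as a small Monte Carlo test in Python. This is a minimal sketch under stated assumptions, not the simulation code used in this study: the genome is reduced to a single linear coordinate, exons are given as sorted, non-overlapping half-open intervals, and the function names and toy inputs are hypothetical.

```python
import bisect
import random


def overlaps_exon(start, length, exon_starts, exon_ends):
    """True if the half-open deletion [start, start+length) overlaps any exon.

    Assumes exons are sorted by start and do not overlap one another.
    """
    end = start + length
    i = bisect.bisect_left(exon_starts, end) - 1  # last exon starting before `end`
    return i >= 0 and exon_ends[i] > start


def simulated_p_value(observed_hits, n_deletions, length, genome_size,
                      exons, n_sims, seed=0):
    """Fraction of simulations with at least `observed_hits` exon-overlapping deletions."""
    rng = random.Random(seed)
    exon_starts = [s for s, _ in exons]
    exon_ends = [e for _, e in exons]
    at_least_as_many = 0
    for _ in range(n_sims):
        hits = sum(
            overlaps_exon(rng.randrange(genome_size - length + 1), length,
                          exon_starts, exon_ends)
            for _ in range(n_deletions)
        )
        if hits >= observed_hits:
            at_least_as_many += 1
    return at_least_as_many / n_sims


# Toy example (hypothetical coordinates): a 1Mb "genome" with three exons.
toy_exons = [(100_000, 100_300), (400_000, 400_500), (750_000, 750_200)]
toy_p = simulated_p_value(observed_hits=5, n_deletions=600, length=2_000,
                          genome_size=1_000_000, exons=toy_exons, n_sims=200)
```

The text reports 100,000 simulations of 600 deletions of 2kb each. The toy call uses far fewer simulations only to keep the example fast.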
The annotation tool DAVID was used to further examine the biological significance of the list of likely knockout mutations (including genes affected by SNVs, indels, microdeletions and translocation events) using the EASE analysis module. After gene ontology (GO) analysis, 18 GO terms were nominally enriched and associated with the mutated gene with a p-value ≤ 0.01 (
The list of genes was also compared to the list of cancer-associated genes maintained by the Cancer Gene Census project (
We also explored the overlap of genes with mutations in GBMs according to the Cancer Genome Atlas (TCGA) with those we predicted are homozygously loss-of-function mutated in U87MG (
Finally, in order to place the homozygous mutations of U87MG in context relative to GBM mutational patterns as a whole, the Genomic Identification of Significant Targets in Cancer (GISTIC) method
Reported individual human genome sequencing projects using massively parallel shotgun sequencing with alignment to the human reference genome clearly indicate the practicality of individual whole genome sequencing. However, the monetary cost of data generation, data analysis issues, and the time it takes to perform the experiments have remained substantial limitations to general application in many laboratories. Here we demonstrate enormous improvements in the throughput of data generation. Using a mate-pair strategy and only ten micrograms of input genomic DNA, we generated sufficient numbers of short sequence reads in approximately 5 weeks of machine operation with a total reagent cost of under $30,000. We believe this makes U87MG the least expensive published genome sequenced to date signaling that routine generation of whole genomes is feasible in individual laboratories. Further, the two-base encoding strategy employed within the ABI SOLiD system is a powerful approach for comprehensive analysis of genome sequences and, in concert with BFAST alignment software, is able to identify SNVs, indels, structural variants, and translocations.
Of particular interest in whole-genome resequencing studies such as this one is how much raw data must be produced to sequence both alleles using a shotgun strategy. Here, 107.5Gb of raw data was generated. Of this, 55.51Gb was mapped to unique positions in the reference genome. In effect, this results in a mean base coverage of 10.85× per allele within non-repetitive regions of the genome. Repetitive regions are of course undermapped, as their unique locations are more difficult to determine. This level of oversampling is adequate for high stringency variant calling (error rate less than 5×10−6) at 93.71% of heterozygous SNP positions. There may be some biases in library generation resulting in bases that are not successfully covered even if they are relatively unique, but solutions to this may be found in performing multiple sequencing runs with varied library designs, as suggested in other studies
With rapid advances in the generation of massively parallel shotgun short reads, one of the major computational problems faced is the rapid and sensitive alignment of greater than 1 billion paired end reads needed to resequence an individual genome. We demonstrate a practical solution using BFAST, which was able to perform fully gapped local alignment on the two-base encoded data to maximize variant calling in less than 4 days on a 20-node 8-core computer cluster.
Comparing U87MG SNVs with the James Watson
The genomic sequence demonstrates global differences in variant type across the coding and non-coding portions of the human genome. By increasing the sensitivity of indel detection, we revealed that small indels have mutated genes at a higher rate than SNVs. A larger proportion of the indels identified are predicted to cause a protein coding change compared to SNVs (178/191,743 indels vs. 154/2,384,470 SNVs).
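The comparison of rates quoted above can be made explicit with a short calculation. The variable names are illustrative; the counts are those given in the text.

```python
# Genes homozygously mutated per variant class, and total variants per class.
indel_knockouts, total_indels = 178, 191_743
snv_knockouts, total_snvs = 154, 2_384_470

indel_rate = indel_knockouts / total_indels   # roughly 9.3e-4
snv_rate = snv_knockouts / total_snvs         # roughly 6.5e-5

# How many times more often an indel is associated with a knockout than an SNV.
fold_difference = indel_rate / snv_rate
```

On these counts the per-variant rate for small indels is a little over fourteen times the rate for SNVs.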
In U87MG, there is a relative increase in 4-base indels genome-wide, which has been observed in other normal genomes
The resolution of genome-wide chromosomal rearrangements is substantially improved by the mate-pair strategy, coupled with sensitive and independent alignment of the short 50-base reads (
Resolution of karyotyping and SKY approaches is not high enough to see the complex nature of this translocation event between chr1 and chr16. With high-resolution whole-genome sequencing, the true structure of the translocation is revealed as mutual translocations between a small fragment of chr2 with chr1 and chr16 on either end.
Delving into the functional effects of the mutations in U87MG through gene ontology and cross-referencing the literature, we found a large number of known and predicted cancer mutations present in the cell line. There is always a concern when dealing with a cancer cell line that mutations will be more related to its status as a cell line than to the cancer it was derived from. While this remains a concern, the large number of predicted and known cancer genes present in U87MG suggests other genes mutated in it have relevance to cancer as well. Using GISTIC to find regions with common deletions in glioma samples, we highlight 60 genes that are mutated in U87MG and are located in regions that are commonly deleted in GBMs that are not included within the Cancer Gene Census list as potential candidate mutational targets in GBMs (
Cancer cell lines are commonly used as laboratory resources to study basic molecular and cellular biology. It is clearly preferable to have complete genomic sequence for these valuable resources. U87MG is the most commonly studied brain cancer cell line and is highly cytogenetically aberrant. While this made the sequencing and mutational analysis more challenging, it serves as a model for future cultured cell line genomic sequencing. Through custom analyses, we found that the mutational landscape of the U87MG genome is vastly more complicated than we would have expected based on the variants discovered in previously published genomes. It is our hope that the increased genomic resolution presented here will direct researchers and clinicians in their work with this brain cancer cell line to create more effective experiments and lead to a greater ability to draw meaningful conclusions in the future.
The NCBI reference genome (build 36.1, hg18, March 2006), genome annotations, and dbSNP version 129 were downloaded from the UCSC genome database located at
U87MG cells were ordered from ATCC (HTB-14) and cultured in a standard way. Genomic DNA was isolated from cultured U87MG cells using Qiagen Gentra Puregene reagents. DNA was stored at −20°C until library generation.
Long-Mate-Paired Library Construction: The U87MG genomic DNA 2× 50bp long mate-paired library construction was carried out using the reagents and protocol provided by Applied Biosystems (SOLiD 3 System Library Preparation Guide). A similar protocol was reported previously
We used an array pull-down capture strategy established in our lab
We used Blat-like Fast Accurate Search Tool version 0.5.3 (BFAST
We found candidate alignment locations (CALs) for each end independently. We utilized ten indexes to be robust to up to six color errors, equating to a 12% per-read error rate:
1111111111111111111111
111110100111110011111111111
10111111011001100011111000111111
1111111100101111000001100011111011
111111110001111110011111111
11111011010011000011000110011111111
1111111111110011101111111
111011000011111111001111011111
1110110001011010011100101111101111
111111001000110001011100110001100011111
We also set parameters to use only informative keys when looking up reads in each index (BFAST parameter -K 8), and to ignore reads with too many CALs aggregated across all indexes (BFAST parameter -M 384). If reads mapped to greater than 384 locations, then they were categorized as ‘unmapped’. We then performed local alignment for each of the returned CALs, simultaneously decoding the read from color space searching for color errors (encoding errors), base changes, insertions, and deletions
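To illustrate what the index masks listed above do, the following Python sketch applies one of them to a read as a spaced seed: only positions under a 1 contribute to the lookup key, so a difference under a 0 does not change the key. This is a simplified illustration of the spaced-seed idea and is not BFAST's internal implementation; the function name and the toy colour-space reads are assumptions.

```python
# One of the index masks listed above; '1' positions contribute to the key.
MASK = "111110100111110011111111111"


def mask_key(read: str, mask: str, offset: int = 0) -> str:
    """Concatenate the characters of `read` that sit under '1' in `mask`."""
    window = read[offset:offset + len(mask)]
    if len(window) < len(mask):
        raise ValueError("mask does not fit inside the read at this offset")
    return "".join(c for c, m in zip(window, mask) if m == "1")


# Hypothetical 50-colour reads (colours 0-3), not real SOLiD data.
read_a = ("0123" * 13)[:50]
read_b = read_a[:5] + "0" + read_a[6:]   # differs at index 5, under a '0' in MASK
read_c = "3" + read_a[1:]                # differs at index 0, under a '1' in MASK

same_key = mask_key(read_a, MASK) == mask_key(read_b, MASK)        # True
different_key = mask_key(read_a, MASK) != mask_key(read_c, MASK)   # True

# Six colour errors in a 50-colour read correspond to a 12% per-read error rate.
per_read_error_rate = 6 / 50
```

Using several masks with different patterns of ignored positions increases the chance that at least one key is free of errors for a given read, which is the motivation for the ten indexes described above.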
Illumina generated sequence was aligned to the NCBI human reference genome (build 36.1) using BFAST with the following parameters applied. Each end of the fragment library was mapped independently to identify CALs, utilizing ten indexes to be robust to errors and variants in the short (typically 36bp) reads:
1111111111111111111111
1111101110111010100101011011111
1011110101101001011000011010001111111
10111001101001100100111101010001011111
11111011011101111011111111
111111100101001000101111101110111
11110101110010100010101101010111111
111101101011011001100000101101001011101
1111011010001000110101100101100110100111
1111010010110110101110010110111011
We also set parameters to use only informative keys when looking up reads in each index (BFAST parameter -K 8), and to ignore reads with too many CALs aggregated across all indexes (BFAST parameter -M 1280). We then performed a standard local alignment for each CAL. Reads were declared mapped if a single unique best scoring alignment was identified within the genome. Duplicate reads were filtered out in the same manner as for the ABI SOLiD data.
To find SNVs including SNPs and small indels, we assumed the MAQ consensus-calling model
Structural variations were detected using custom algorithms designed to comprehensively search for groups of mate-pair reads with aberrant paired-end insert size distributions that are consistently identifying a unique structural variant in the genome. We utilized the “dtranslocations” utility in the DNAA package (
The structural variations were then separated into interchromosomal and intrachromosomal events. Intrachromosomal events of less than 1Mb are assessed for deletion status by averaging base coverage within the bounds of the event and comparing it to base coverage 200kb outside the event on both sides. Those that have average interior base coverage less than 25% of the average exterior base coverage are classified as “complete” deletions. Those with average interior base coverage between 25% and 75% that of average exterior base coverage are classified as “heterozygous deletions” (deletions of at least one copy of the region, but with at least one copy remaining).
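The coverage-based classification described above can be sketched in Python as follows. The thresholds (25% and 75%) and the flank size come from the text. The function names, the list-based coverage representation, and the treatment of the exact boundary values are assumptions made for illustration.

```python
def mean(values):
    return sum(values) / len(values)


def coverage_ratio(coverage, start, end, flank=200_000):
    """Mean coverage inside [start, end) divided by mean coverage in both flanks.

    `coverage` is a per-base coverage list for one chromosome (an assumed,
    simplified representation).
    """
    interior = coverage[start:end]
    exterior = coverage[max(0, start - flank):start] + coverage[end:end + flank]
    if not interior or not exterior:
        raise ValueError("event or flanks fall outside the coverage track")
    exterior_mean = mean(exterior)
    if exterior_mean == 0:
        raise ValueError("no coverage in flanking regions")
    return mean(interior) / exterior_mean


def classify_deletion(ratio):
    """Apply the 25% / 75% thresholds described in the text."""
    if ratio < 0.25:
        return "complete deletion"
    if ratio <= 0.75:
        return "heterozygous deletion"
    return "no deletion call"


# Toy coverage track (hypothetical): 20x flanks around a 10x interior.
toy_track = [20] * 10 + [10] * 5 + [20] * 10
toy_call = classify_deletion(coverage_ratio(toy_track, 10, 15, flank=10))
```

In the toy track the interior has half the flanking coverage, so the event is labelled a heterozygous deletion.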
Variant calls from the SAMtools pileup tool were first loaded into a SeqWare QueryEngine database and subsequently filtered to produce BED files. These filtering criteria required that a variant be seen at least 4 times and at most 60 times, with an observation occurring on each strand at least once. For SNVs we further required that SNVs be called only in reads lacking indels, and the last 5 bases of the reads were also ignored. This reduced the likelihood that spurious mismappings were used to predict SNVs and eliminated the lowest quality bases from consideration. For small indels (<21bp) we enforced a slightly different filter by requiring that any reads supporting an indel contain only one contiguous indel, and these reads were not considered if the indel occurred at either the beginning or end of the read. These criteria, like the SNV criteria, were used to reduce the likelihood of using mismapped reads or locally misaligned reads in the variant calling algorithm. The elimination of reads with indels at the beginning or end of the read was intended to remove potential alignment artifacts caused by ambiguous gap introduction due to lack of information at the ends to guide proper alignment. Together, these filtering criteria reduced the likelihood that sequencing errors were identified as SNV or indel variants. We used scripts available in the BFAST toolset and SeqWare Pipeline to filter and annotate the variant calls. Variants passing these filters were further annotated by their overlap with dbSNP version 129. Variants were required to share the same genomic position as a dbSNP entry along with matching the allele present in the database to be considered overlapping. Mapping to dbSNP allowed us to filter out known SNPs from de novo variants.
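The depth, strand, and indel-read criteria described above can be expressed as two small predicates. This Python sketch is illustrative only: the actual filtering was done with BFAST and SeqWare scripts, and the `VariantCall` record, the function names, and the use of CIGAR strings to describe read alignments are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class VariantCall:
    forward_obs: int   # observations of the variant on the forward strand
    reverse_obs: int   # observations of the variant on the reverse strand


def passes_depth_strand_filter(call, min_obs=4, max_obs=60):
    """Seen at least 4 and at most 60 times, with at least one observation per strand."""
    total = call.forward_obs + call.reverse_obs
    return (min_obs <= total <= max_obs
            and call.forward_obs >= 1 and call.reverse_obs >= 1)


def read_supports_indel(cigar):
    """One contiguous indel only, and not at the beginning or end of the read."""
    ops = [op for _, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    indel_ops = [op for op in ops if op in "ID"]
    return len(indel_ops) == 1 and ops[0] not in "ID" and ops[-1] not in "ID"


# Toy calls and alignments (hypothetical).
good_call = VariantCall(forward_obs=3, reverse_obs=2)
one_strand_call = VariantCall(forward_obs=6, reverse_obs=0)
```

For example, `good_call` passes, `one_strand_call` fails the strand requirement, and a read aligned as `20M2D30M` may support an indel while `2I48M` or `10M1I10M1D29M` may not.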
Filtered SNV and indel variants were then analyzed for their effect within the genome that is annotated with gene models. This analysis used scripts from the SeqWare Pipeline project and gene models downloaded from the UCSC hg18 human genome annotation database. Six different gene model sets from hg18 were considered: UCSC genes (knownGene), RefSeq genes (refGene,
Genes affected by structural variations were assessed in two ways depending on the structural variation type. For interchromosomal translocation events, a gene was considered “affected” when either end of an interchromosomal translocation event fell in a genic region (including the entire coding region plus 1kb up- or down-stream of the gene's coding region). The same criteria were used for all intrachromosomal translocation events. For events that were classified as complete or heterozygous deletions, a gene was considered affected if all or part of a coding exon was deleted.
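The two "affected gene" rules described above can be written as interval checks. This is a minimal Python sketch: the 1kb padding and the exon-overlap rule come from the text, while the function names, the tuple layouts, and the use of closed coordinates are assumptions.

```python
def in_padded_gene(chrom, pos, gene, pad=1_000):
    """True if a breakpoint lies within the coding region plus `pad` bases on either side.

    `gene` is (chromosome, coding_start, coding_end) in closed coordinates.
    """
    gene_chrom, coding_start, coding_end = gene
    return chrom == gene_chrom and coding_start - pad <= pos <= coding_end + pad


def translocation_affects_gene(end_a, end_b, gene, pad=1_000):
    """A gene is affected when either breakpoint end falls in the padded genic region."""
    return any(in_padded_gene(chrom, pos, gene, pad) for chrom, pos in (end_a, end_b))


def deletion_affects_gene(del_chrom, del_start, del_end, coding_exons):
    """A gene is affected when all or part of any coding exon is deleted."""
    return any(
        chrom == del_chrom and start <= del_end and end >= del_start
        for chrom, start, end in coding_exons
    )


# Toy gene (hypothetical coordinates) with two coding exons.
toy_gene = ("chr2", 10_000, 20_000)
toy_exons = [("chr2", 10_000, 10_400), ("chr2", 19_500, 20_000)]
```

With this toy gene, a breakpoint at chr2:9,500 counts as affecting the gene, one at chr2:8,999 does not, and a deletion spanning chr2:10,300 to 10,600 removes part of the first exon.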
Homozygous SNVs, small indels, large deletions, and translocation events for variants that included predicted coding sequence changes were tallied. This became a reference list of variants with serious homozygous mutations that likely completely disrupted, or “knocked out”, the normal function or synthesis of the target protein.
For the SNVs and small indels, a “knockout” variant was defined as a homozygous call by the SAMtools variant caller where the variant was predicted by the SeqWare Pipeline scripts to change coding sequence with one or more of the following annotations: “early-termination”, “frameshift”, “intron-splice-site-mutation”, “start-codon-loss”, and/or “stop-codon-loss”. The “early-termination” event represented a stop codon introduced upstream of the annotated stop codon. The “frameshift” represented an indel that resulted in a shifting of the reading frame of the gene resulting in, typically, early termination and non-sense coding sequence. The “intron-splice-site-mutation” referred to a mutation in the two consensus splice site intronic bases flanking exons (GT at the 5′ splice site and AG at the 3′ splice site). Finally, “stop-codon-loss” and “start-codon-loss” simply refer to variants that interrupt the stop or start codons. We chose to not include “coding-nonsynonymous” and “inframe-indel” annotations in this list of knocked out variants because, while potentially serious as these mutations are, they are not guaranteed to result in an unexpressed or non-functional protein. However, homozygous frameshift, early termination, splice site, and stop/start codon loss mutations are very likely to interrupt a gene's expression and translation to functional protein.
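The knockout definition above amounts to a zygosity check combined with a set membership test on annotation labels. The following Python sketch uses the annotation strings quoted in the text; the dictionary layout of a variant record, the function names, and the gene symbols are hypothetical.

```python
# Annotation labels that define a "knockout" variant, as quoted in the text.
KNOCKOUT_ANNOTATIONS = frozenset({
    "early-termination",
    "frameshift",
    "intron-splice-site-mutation",
    "start-codon-loss",
    "stop-codon-loss",
})


def is_knockout(zygosity, annotations):
    """Homozygous call carrying at least one knockout annotation."""
    return zygosity == "homozygous" and bool(KNOCKOUT_ANNOTATIONS & set(annotations))


def knockout_gene_symbols(variants):
    """Collapse knockout variants to a sorted list of unique gene symbols."""
    return sorted({
        v["gene"] for v in variants
        if is_knockout(v["zygosity"], v["annotations"])
    })


# Toy records (hypothetical gene symbols and annotations).
toy_variants = [
    {"gene": "GENE_A", "zygosity": "homozygous", "annotations": ["frameshift"]},
    {"gene": "GENE_A", "zygosity": "homozygous", "annotations": ["stop-codon-loss"]},
    {"gene": "GENE_B", "zygosity": "heterozygous", "annotations": ["frameshift"]},
    {"gene": "GENE_C", "zygosity": "homozygous", "annotations": ["coding-nonsynonymous"]},
]
toy_knockouts = knockout_gene_symbols(toy_variants)
```

Here only GENE_A is reported: GENE_B is heterozygous, and GENE_C carries an annotation that is deliberately excluded from the knockout list.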
As described above, large microdeletions that removed all or part of an exon and interchromosomal translocation events that fell within 1kb of a gene's coding region were also classified as mutated genes.
Once suspect knockout variants were identified, a mapping process was used to translate one or more variants to the gene symbol. This mapping allowed us to condense multiple variants affecting multiple gene models to a more abbreviated list of gene symbols likely to be affected by these knockout mutations. The mapping from variants to gene symbols used variants identified with gene models from the refGene and the knownGene tables in the UCSC hg18 database and mapped these variants to gene symbols using queries against the name field of the knownGene table and the alias field of the kgAlias table. The UCSC table browser was used to accomplish these queries and map the knownGene identifiers to gene symbols via the kgXref table. A similar approach was used for homozygous large-scale microdeletions and translocation events.
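The condensing step can be pictured as a many-to-many join followed by grouping on gene symbol. The following sketch uses in-memory dictionaries as stand-ins for the UCSC table lookups; the function name, the identifiers in the usage example, and the data layout are assumptions, and the actual queries were run through the UCSC table browser as described.

```python
from collections import defaultdict


def collapse_to_symbols(variant_to_models, model_to_symbol):
    """Group variants by the gene symbol of every gene model they affect.

    variant_to_models: dict mapping a variant id to an iterable of gene-model ids.
    model_to_symbol:   dict mapping a gene-model id to a gene symbol
                       (a stand-in for a kgXref-style lookup).
    Gene models without a known symbol are skipped.
    """
    symbol_to_variants = defaultdict(set)
    for variant_id, model_ids in variant_to_models.items():
        for model_id in model_ids:
            symbol = model_to_symbol.get(model_id)
            if symbol is not None:
                symbol_to_variants[symbol].add(variant_id)
    return dict(symbol_to_variants)
```

For example, two variants that hit three gene models belonging to a single gene collapse to one symbol with two supporting variants, which is the abbreviation the mapping is meant to achieve.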
The list of knockout genes was uploaded to the Database for Annotation, Visualization, and Integrated Discovery (DAVID, version 2008) to identify enriched Gene Ontology (GO) terms
The overlap between the Cancer Gene Census genes and those identified as knockouts in U87MG was assessed. The Cancer Gene Census project is an ongoing effort to catalog genes with mutations that have been implicated in cancer
The overlap between mutations in the Cancer Genome Atlas (TCGA) and those identified as knockouts in U87MG was analyzed. TCGA is an ongoing effort to understand the molecular basis of cancer through large-scale copy number analysis, expression profiling, genome sequencing, and methylation studies among other techniques
The Genomic Identification of Significant Targets in Cancer (GISTIC) method was used to find significant areas of deletion in 293 samples from the TCGA
The distribution of small indel sizes was examined for both deletions and insertions. Indels classified as affecting coding-sequence by the SeqWare Pipeline (see above) were compared to those outside coding regions. Raw counts were collected, recalculated as percents of total, and compared directly.
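The conversion from raw counts to percents of total is a simple normalization, sketched below. The function name and the input format (a flat list of indel lengths in bases) are assumptions made for illustration; the same function would be applied separately to coding and non-coding indels, and to insertions and deletions, before comparing the resulting distributions.

```python
from collections import Counter


def size_percentages(sizes):
    """Return {indel_size: percent_of_total} for a list of indel lengths."""
    counts = Counter(sizes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {size: 100.0 * n / total for size, n in sorted(counts.items())}
```

Expressing each class as percents makes the coding and non-coding distributions directly comparable even though the two classes differ greatly in absolute number.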
Similarly, nucleotide substitution frequency was examined for SNVs from U87MG both genome-wide and only in coding regions. Once binned appropriately, the SNV nucleotide substitutions were counted, tallied in a table, and graphed as percents of total.
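A tally of this kind can be sketched as follows. The text does not specify the binning scheme, so this example simply counts each ordered reference-to-alternate substitution as its own bin; that choice, the function name, and the `(ref, alt)` tuple input are assumptions.

```python
from collections import Counter


def substitution_percentages(snvs):
    """Return {'REF>ALT': percent_of_total} for an iterable of (ref, alt) base pairs.

    Pairs where ref equals alt are ignored because they are not substitutions.
    """
    counts = Counter(f"{ref}>{alt}" for ref, alt in snvs if ref != alt)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {sub: 100.0 * n / total for sub, n in sorted(counts.items())}
```

Running the function once on all SNVs and once on coding SNVs only would give the two percent tables that are compared in the text.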
Variants from the Watson and YanHuang genomes were downloaded from their respective projects at the following URLs:
Genomic DNA from U87MG was submitted to the Southern California Genotyping Consortium to be run on the Illumina Human 1M-Duo BeadChip, which consists of 1,199,187 probes distributed across the human genome. The Illumina BeadStudio program was used to analyze the resulting intensity data. Loss of heterozygosity (LOH) was determined by analyzing B-allele frequency as reported by BeadStudio. Normal two-copy regions of the genome are represented by long stretches of probes with B-allele frequencies of 0, 0.5 or 1. Regions of LOH, on the other hand, deviate significantly from this pattern. Copy number was determined from probe intensity.
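The B-allele frequency pattern described above can be illustrated with a simple windowed test. This is only a sketch of the idea and not the procedure used in the study, which relied on the Illumina software: the window size, the tolerance around 0.5, the heterozygous-fraction cutoff, and both function names are hypothetical. It flags only windows that lack probes near 0.5, which is one way a region can deviate from the normal pattern.

```python
def het_fraction(bafs, tol=0.1):
    """Fraction of probes whose B-allele frequency lies within `tol` of 0.5."""
    if not bafs:
        return 0.0
    return sum(abs(b - 0.5) <= tol for b in bafs) / len(bafs)


def candidate_loh_windows(bafs, window=50, max_het_fraction=0.02, tol=0.1):
    """Return (start, end) index pairs of non-overlapping windows with few heterozygous probes.

    `bafs` is a list of B-allele frequencies ordered by genomic position on
    one chromosome. A window is flagged when the fraction of probes near 0.5
    is at or below `max_het_fraction`.
    """
    flagged = []
    for start in range(0, len(bafs) - window + 1, window):
        chunk = bafs[start:start + window]
        if het_fraction(chunk, tol) <= max_het_fraction:
            flagged.append((start, start + window))
    return flagged
```

A stretch in which nearly every probe sits at 0 or 1 is flagged, whereas a normal two-copy stretch with a substantial share of probes at 0.5 is not.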
Primers for validation were designed by targeting regions immediately flanking the event predicted by our whole genome sequence analysis using the Primer3 tool (
Intensities, quality scores, and color space sequence for the SOLiD genomic sequence of U87MG were uploaded to the Sequence Read Archive under the accession SRA009912.1/Sequence of U87 Glioblastoma Cell-line. Intensities, quality scores, and nucleotide space sequence for the Illumina exon capture sequence of U87MG were also uploaded to the Sequence Read Archive under the same accession. For both datasets, alignment files have been uploaded to the Sequence Read Archive as additional analysis results.
Variant calls for both datasets are available via a SeqWare QueryEngine web service at
Most software used for this project is open-source and freely available. We created two software projects that were instrumental in the analysis of the U87MG data: BFAST and SeqWare. The color- and nucleotide-space alignment tool BFAST can be downloaded from
Concordance between Solexa capture data and SOLiD whole genome data. The left plot displays the SNP call concordance between each experiment (Solexa capture data in blue, SOLiD whole genome data in red) with the Illumina 1M Beadchip microarray for the 8.5Mb of sequence pulled down in the capture experiment. The right plot displays concordance of the non-reference (mutant) allele calls with the array data for those regions.
(0.43 MB TIF)
Paired end insert size distribution. Empirical paired end insert size distribution for reads where both ends aligned with duplicates removed.
(0.41 MB TIF)
Alignment is robust against genome-wide repeat elements. Circos plot
(0.21 MB TIF)
Commonly deleted regions in GBM according to GISTIC. This deletion plot shows significant regions of deletion in 293 GBM samples from the TCGA. The top of the plot shows the G-score and the bottom shows the q-values. The G-score reflects the frequency and amplitude of the deletion. Q-values less than 0.25 were considered significant. Genes mutated in U87MG via SNVs or indels that overlap broad regions of deletion are considered to be likely cancer targets. These regions include all or part of chromosomes 1, 6, 9, 10, 13, 14, 15, and 22.
(0.43 MB TIF)
PCR and dideoxy sequencing validation. A list of the variants that were validated by PCR and dideoxy sequencing, including primers used, variant location, and validation status.
(0.03 MB XLS)
Structural variants in U87MG. All structural variants listed as regions immediately flanking the genomic breakpoint.
(0.18 MB XLS)
Genes knocked out by SNVs/Indels. List of all genes predicted to be knocked out by SNVs and Indels in U87MG.
(0.20 MB XLS)
Genes affected by structural variants. List of all genes predicted to be affected by structural variants in U87MG.
(0.48 MB XLS)
Annotation of mutated genes. Lists of genes predicted to be mutated in U87MG annotated by various cancer-related gene databases.
(0.17 MB XLS)
We would like to acknowledge Bret Harry and Jordan Mendler for computational support and for maintaining our computer cluster and pipeline.