Browse Subject Areas

Click through the PLOS taxonomy to find articles in your field.

For more information about PLOS Subject Areas, click here.

  • Loading metrics

Genome Wide Identification of Recessive Cancer Genes by Combinatorial Mutation Analysis

  • Stefano Volinia,

    Affiliations Data Mining for Analysis of Microarrays, Università degli Studi, Ferrara, Italy, Department of Molecular Virology, Immunology and Molecular Genetics, Comprehensive Cancer Center, Ohio State University, Columbus, Ohio, United States of America

  • Nicoletta Mascellani,

    Affiliation Data Mining for Analysis of Microarrays, Università degli Studi, Ferrara, Italy

  • Jlenia Marchesini,

    Affiliation Data Mining for Analysis of Microarrays, Università degli Studi, Ferrara, Italy

  • Angelo Veronese,

    Affiliation Dipartimento di Medicina Sperimentale e Diagnostica, Centro Interdipartimentale di Ricerca sul Cancro, Università, Ferrara, Italy

  • Elizabeth Ormondroyd,

    Affiliation Institute of Cancer Research, Surrey, United Kingdom

  • Hansjuerg Alder,

    Affiliation Department of Molecular Virology, Immunology and Molecular Genetics, Comprehensive Cancer Center, Ohio State University, Columbus, Ohio, United States of America

  • Jeff Palatini,

    Affiliation Department of Molecular Virology, Immunology and Molecular Genetics, Comprehensive Cancer Center, Ohio State University, Columbus, Ohio, United States of America

  • Massimo Negrini,

    Affiliation Dipartimento di Medicina Sperimentale e Diagnostica, Centro Interdipartimentale di Ricerca sul Cancro, Università, Ferrara, Italy

  • Carlo M. Croce

    Affiliation Department of Molecular Virology, Immunology and Molecular Genetics, Comprehensive Cancer Center, Ohio State University, Columbus, Ohio, United States of America

Genome Wide Identification of Recessive Cancer Genes by Combinatorial Mutation Analysis

  • Stefano Volinia, 
  • Nicoletta Mascellani, 
  • Jlenia Marchesini, 
  • Angelo Veronese, 
  • Elizabeth Ormondroyd, 
  • Hansjuerg Alder, 
  • Jeff Palatini, 
  • Massimo Negrini, 
  • Carlo M. Croce


We devised a novel procedure to identify human cancer genes acting in a recessive manner. Our strategy was to combine the contributions of the different types of genetic alterations to loss of function: amino-acid substitutions, frame-shifts, gene deletions. We studied over 20,000 genes in 3 Gigabases of coding sequences and 700 array comparative genomic hybridizations. Recessive genes were scored according to nucleotide mismatches under positive selective pressure, frame-shifts and genomic deletions in cancer. Four different tests were combined together yielding a cancer recessive p-value for each studied gene. One hundred and fifty four candidate recessive cancer genes (p-value<1.5×10−7, FDR = 0.39) were identified. Strikingly, the prototypical cancer recessive genes TP53, PTEN and CDKN2A all ranked in the top 0.5% genes. The functions significantly affected by cancer mutations are exactly overlapping those of known cancer genes, with the critical exception for the absence of tyrosine kinases, as expected for a recessive gene-set.


A variety of approaches have been applied to the identification of cancer genes [1]. Procedures have been developed that allowed identification of genes causative of cellular transformation [2], [3], and of complex processes such as invasiveness and metastasis [4]. In vitro and in vivo methods, using cellular or animal models, led generally to the discovery of dominant cancer genes, or oncogenes. On the other hand, tumor suppressors have been discovered mainly by molecular genetics approaches. Such is the need of identifying additional tumor suppressors, or recessive cancer genes, that new tests for loss-of-function continue to be developed [5].

Many well-characterized cancer genes harbor somatic base substitutions or small insertion/deletions. For example, coding region frame-shifts and point mutations account for 75% of the somatic mutations in CDKN2A and TP53, two major tumor suppressor genes [6], [7], [8]. The oncogene B-raf, first described over 20 years ago, was also shown to be mutated in some human cancers [9], alongside PI3K and some tyrosine phosphatases [10]. Meanwhile, other cancer genes have been discovered through the phenomenon of inherited predisposition. Familial cancer is rare in comparison to non-hereditary cancer, but a number of recessive genes have been identified using linkage analysis [11], [12]. Large scale super-family sequencing projects, i.e. the kinome and phosphatome projects, followed and showed that, although missense mutations are found in some members of these two superfamilies, they are not a common ground for somatic cancer mutations. Greenman and co-workers [13] undertook comprehensive sequencing of 518 protein-kinase-encoding genes in 210 cancers. Kinases have been implicated in many aspects of tumorigenesis and several have now been validated as targets for drug therapy [14]. In their analysis of the collection of cellular kinases, the kinome, Greenman et al. [13] identified 1,000 mutations. Mutations were relatively common in cancers of the lung, stomach, ovary, colon and kidney, and rare in cancers of the testis and breast, and in carcinoid tumors, which are usually found in the gastrointestinal tract. Tumors with defects in DNA-mismatch repair harbored large numbers of mutations, whereas other types of tumor revealed no detectable mutations. To distinguish driver from passenger mutations, Greenman et al. used a statistical model comparing the observed-to-expected ratio of synonymous (no amino-acid change) mutations with that of non-synonymous (altered amino acid) mutations. An increased proportion of non-synonymous mutations implies selection pressure during tumorigenesis. Overall, they identified 158 predicted driver mutations in 120 kinase genes. In contrast to the recurrent mutations in BRAF in malignant melanomas [15] most kinase mutations identified across different tumor types were therefore single hits. More recently, Wood and co-workers [16] used a different strategy, but reached similar conclusions, with the complete sequencing of 20,857 transcripts from 18,191 genes in a limited number of tumors (11 breast and 11 colon). The high number of automatically detected DNA mutations provided immediately the following question: how to identify from a potentially high number of sequence mismatches those that are causative of cancer pathogenesis. A series of subsequent filters revealed that most of them were silent (did not result in amino acid change) and a similar amount were single nucleotide polymorphisms (SNPs). The final number of mutations which were defined as truly somatic affected more than 1000 genes. Interestingly, few common driver mutations were identified among the kinase genes in these studies. This is consistent, for example, with the finding that only 1 out of 18 members of the PI3K family had somatic mutations in cancer [17].

Interesting observations can be made from an accurate global study of the mutations reported in cancer. Futreal et al. [18] conducted such an extended census from bibliography indicating that as many as 299 genes contribute to human cancer. However 70% of these genes are associated with leukemias, lymphomas and mesenchymal tumors, which account for only 10% of cancer incidence. Furthermore about 75% of those genes are associated with translocations, and at least 90% of listed cancer genes are dominant at the cellular level (i.e. activated oncogenes, fusion oncoproteins). Nevertheless, it is generally recognized that the vast majority of germline mutations resulting in cancer predisposition are recessive [18]. Thus it seems likely that most of the cancer genes are recessive and remain still undiscovered.

For these reasons we devised a novel method for the identification of candidate recessive cancer genes from genome-scale datasets. We applied our novel procedure to mine data from sequences and comparative genomic hybridizations. Our method takes account of the different gene inactivation modes, ranging from point mutations to whole gene deletions. The assumption underlying our investigation was that, by studying cancer genes from different mutational perspectives and combining the respective probabilities, sequencing noise and polymorphisms could be filtered out and bona fide recessive cancer genes would be identified.


Harvesting candidate mutations from ESTs

In this paper, a novel method was applied to the identification of genes mutated in non-hereditary human cancers (Figure 1). The procedure gathered sequence information from the expression sequence tag (EST) database and an appropriate algorithm was tailored to extract information from “low quality” sequence data. The procedure analyzed more than 3×109 nucleotides of human coding sequence in over 5,600,000 ESTs derived from both healthy and cancerous tissues and cell lines. ESTs are potentially very valuable for mutation studies since they represent cloned single alleles, but are also unverified sequences, with a high rate of sequencing errors [19], [20]. Therefore, in order to exploit the full potential of ESTs we had to develop a method for the detection of bona fide “cancer” mutations in a context of frequent sequencing errors or, at best, polymorphisms. Although previous work [19] attempted to evaluate sequencing error rate in ESTs, we followed an alternate route. Our procedure was based on the assumption that the rate of sequencing errors was constant for each human gene, at each nucleotide position. As a corollary, we assumed that the “gene/position-specific sequencing error rate” was constant across normal and cancer EST libraries. Since base composition, context and sequence are by definition constant within each different human gene, we believed these assumptions were safe. Only exceptions would be due to the tumors harboring DNA repair defects.

Figure 1. The rationale for selection of candidate recessive cancer genes.

The diagram shows the steps in the procedure for the evaluation of mutation probabilities and the data flow towards the identification of candidate recessive cancer genes. Molecular data were extracted from public databases (dbEST and GEO at NCBI, and Stanford Microarray Database). A very large number of alignments (over 4.5 million) was obtained for over 24,000 human genes from BLAST analysis of 3 Gbases of EST sequences. The alignments were parsed to extract mismatches which were deposited in the Cancer Mutome local SQL database. The mismatches were then evaluated by specific procedures to associate mutational p-values to each human gene. In parallel, almost 20,000 human genes were assayed from 744 array CGH to define their propensity to deletion in cancer. The specific mutational p-values were combined to produce a recessive cancer p-value. A genome subset of 154 genes, among which TP53, PTEN, CDKN2A and CDKN2B were present, was selected (cancer p-value<1.5×10−7).

High sequencing noise was expected to be present in the heterogeneous EST database and cancer is a complex multi-faceted genetic disease, therefore a single statistical test would not result in reliable selection of cancer genes. Furthermore, we wanted to focus on recessive genes, inactivated by the occurring events. Thus, to assay the different mutational modes of recessive cancer gene, we accordingly devised a number of mutational tests. The statistical tests were eventually combined to identify the genes that are often inactivated in cancer.

Starting from the RefSeq human mRNA repository, 27,184 sequences (defined Queries) were aligned to more than 5.6 million human EST sequences, from 7574 different EST libraries, for a total of almost 3.0 Gbases of coding sequence. BLASTs [21] were run for each query versus the ESTs and 3,839,543 successful alignments were produced (stored in the Alignments SQL table of the Cancer Mutome database) for 24,932 human queries (Stats database table). An average of 150 hits (high scoring pairs, HSP. or sequences) was produced for each query (human gene or splicing variant). The quality control of the BLAST alignments was of the foremost importance for our strategy. In order to minimize the mining of technical errors we defined a stringent threshold for alignment quality (expect≤1E-21) and the low quality ends of alignments were discarded. All (43,965,904) nucleotide mismatches, and gaps/insertions, were recorded in the database Mutations table. Amino acid (AA) substitutions and premature stops (33,614,754 mismatches) were then selected from the alignments (AA_Mutation table). To reduce the complexity, and the expected number of false positives, we decided to evaluate only those genes with a high number of mismatches (irrespective of the samples cancer status). A pre-processing based on inter-quartile range (IQR) was therefore applied and 8,972 genes (IQR higher than 0.5) were retained for further cancer mutation assays. These genes were sufficiently rich in putative mutations (mismatches) to fulfill the role of potential cancer gene candidates.

The first component of our strategy was the identification of genes harboring inactivating point mutations. We evaluated the point mutations according frequency, location, capacity to alter the amino acid sequence, and consequences on the reading frame. Our procedure was thus tailored to consider statistically all the above features of a point mutation.

Data mining for amino-acid substitutions and premature terminations

We defined pAA as the probability that a gene displays an excess of amino acids substitutions in cancer when compared to non cancer samples. pNSSR, instead, indicates the probability that the significant amino acids substitutions in the cancer samples are under positive selection pressure. To detect short range clustering of cancer mutations, common in cancer recessive genes, and to balance out noise, i.e. sequencing errors, we chose a paired t test coupled to a sliding window. We normalized the counts of the mismatches in the two classes, cancer and control, by using a gene specific and position specific factor. Null mismatch counts were adjusted to unity, prior to normalization. The normalization values were obtained, for each gene and at each nucleotide position, as the local ratios of the sequenced nucleotides in the cancer and control samples. The paired t test (cancer vs. control, paired for codons) was applied to a sliding window with a length of 25 codons. To perform a robust assay a codon was evaluated only when aligned at least 10 times in each class (cancer and control). Gene specific confidence limits for T scores where generated by bootstrap analysis and a threshold p-value of 0.05 was used to select the significant amino acid positions. For each human gene, a p-value (pAA) was finally associated to the sum of the peaks corresponding to the significant T scores. A sequence mismatch was recorded only once for each EST library.

An over-estimation of pAA could be due to passenger mutations, such as those produced by altered DNA repair systems, prevalent in some cancer. Since passenger mutations should be randomly distributed over the genome, an additional test was therefore implemented to refine the pAA. The ratio of non-synonymous (NS) to synonymous (S) DNA mutations is a measure of the selective pressure during tumor progression, as synonymous alterations are unlikely to exert a growth advantage and will be selectively lost [17]. Furthermore, mismatches due to sequencing errors, as well as differential representation (cancer to normal differential expression), are all expected to be neutral with respect to the NS to S ratio. The codons significant for amino acid substitutions (p<0.05) were therefore assayed for positive pressure. As a proof-of-concept, the NS/S ratios in the TP53 mutated region were analyzed by paired t test (p<0.033, FDR = 0.092) and revealed higher values in cancer than in control. Thus we applied the NS to S ratio test to each gene, in cascade after that for the local mutation frequency (pAA) described above. Bootstrap was again used to define the p-values. The probability of a cancer protein having frequent amino acid changes (pAA) coupled to selective positive pressure in cancer (pNSSR), two events which are not independent, was defined as the average of the two respective p-values (pAA-NSSR).

Data mining for frame-shifts in cancer ESTs

Having defined for each human gene a p-value for causal amino acid substitutions in sporadic cancers, we needed a corresponding index for gene inactivation due to open reading frame shifts in exons. Cancer genes can be disrupted by micro-insertions or -deletions in their coding sequence, resulting in an altered primary structure. A genome wide survey of our mismatch database indicated that single nucleotide alterations were by far the most common insertions/deletions in ESTs. We indicated with pFrameshift the probability that a gene had an excess of frame-shifts, due to single nucleotide deletions/insertions in cancer, compared to control tissues. We tested the hypothesis that these mutations were frequent in cancer genes, by studying again TP53. Our assay showed that single nucleotide frame-shifts associated to cancer were non-randomly enriched in TP53. When looking for frame-shifts induced by 1 nucleotide insertions/deletions, an analogous test to that for pAA was designed, as detailed in Experimental Procedures, to generate pFrameshift.

Identification of deleted genes in cancer by high resolution array comparative genomic hybridization

Cancer genes can be affected in their genomic structure by large amplifications and deletions. Recessive cancer genes are expected to be deleted or otherwise inactivated and this component must be included in our mutational model. We therefore assigned to each human gene p-values for deletion in cancer. To obtain such p-values, we compiled data from high resolution comparative genomic hybridizations of 744 tumors into the GeoSoft database. We used array CGH (aCGH), obtained from GEO (NCBI) and SMD (Stanford Microarray Database), with sufficiently high resolution to distinguish the human genes (information for samples and datasets in supplemental Table S1). Each tumor sample was compared to a healthy control sample on a two channel oligonucleotide-based platform. The human genes were evaluated in each sample by using the normalized log2 ratio (tumor over control). Different probes related to the same gene were averaged. Gene symbols were used as keys to unequivocally identify a gene within and across platforms. Data were normalized according to the providers. As a pre-processing step we reduced the assay complexity by retaining only those genes with high variability (standard deviation of log2 ratio>0.2). Then, for each gene we computed the percentiles of the log2 ratios (only for genes measured in at least 300 samples). A gene affected by deletions in tumors would possess a low (negative) log2 ratio 5thpercentile, while one with amplifications would display a high (positive) 95th percentile.

Bootstrap analysis (random swap between the tumor and control channels) was used to simulate gene specific 5th and 95th percentiles. Then, gene specific p-values for deletions (pDeletion) were finally calculated as the percentage of simulated 5th percentiles exceeding the real 5th percentiles. At this stage, we had to take in consideration two phenomena, associated to aCGH but not linked to cancer: sex chromosomes and polymorphic structural copy number variations (CNVs). The control sample in aCGHs was frequently from male (more than 50% of aCGHs), while roughly half of the tumors were of female origin and thus lacked the Y-chromosome. Therefore the Y-chromosome genes were expected to appear as deleted, or better “pseudo-deleted”. Conversely, we expected the X chromosome genes, except for those belonging to the pseudo-autosomal region, to appear as “pseudo-amplified”. Genes located in the sex chromosomes indeed behaved correctly, as shown in detail for the pseudo-autosomal region 1 (PAR1) in Xp22 (supplemental Figure S1). Polymorphic CNVs, from normal population variability and not linked to cancer, should also lead to large fold-changes, resulting in high 95th or low 5th percentiles. However, we expected that polymorphic CNVs, not associated to cancer, would not display significant pDeletion values. In fact their 5th percentiles would not qualify as significant after the random swap simulation. CDKN2A and CDKN2B were identified as the most deleted genes in human cancers; PTEN, ATM, and TP53 were also identified as deleted (p-values<0.001). Three thousand and three hundred seventy four genes were significantly deleted (p<0.001).

Combination of mutation analyses: the candidate recessive cancer genes

Cancer genes are affected by different types of point mutations and of chromosomal alterations. We defined a candidate cancer gene as recessive when affected by mutations potentially leading to loss of function; i.e. when it was frequently mutated in its coding region and frequently altered in its genomic structure, in particular deleted. The combination of the different genome wide tests produced a p-value for recessive cancer genes. The recessive cancer gene (pRecessiveCancer) p-value was defined as the product of the three p-values (pAA-NSSR, pFrameshift, pDeletion). One hundred and fifty four human genes were included in the final candidate gene list after combinatorial mutation analysis was performed (pRecessiveCancer<1.5×10−7). The number of cancer recessive genes in a simulation by random association of the four mutation tests was of 60.5 (false detection rate of 0.39). The selection by the combinatorial approach appeared to be specific, since three classical recessive cancer genes, TP53 (16th position), PTEN (92nd) and CDKN2A (135th) were detected. When we compared the candidate gene-set to the whole genome, no major bias emerged towards gene size and structural polymorphisms, as expected from a well-behaved statistical procedure. The recessive cancer gene sizes did not differ significantly from that of the whole human genome (supplemental Figure S2). When we considered copy number variations, the cancer gene-set contained 15 polymorphic CNVs (15/154 or 10%) while 13.6% of all genes scored for pDeletion contained at least one CNV. This difference in proportion was not significant (p>>0.05), suggesting that there was no false enrichment for CNVs by our method, as expected by the design of the algorithm.

Gene ontology and functional analysis

The mechanisms and functional pathways associated with the cancer recessive genes were statistically evaluated. The enrichment in Gene Ontology (GO) terms was assessed by using EASE, at The biological processes significantly affected in the cancer gene set are listed in supplemental Table S2. The significant GO terms grouped by EASE functional clustering were: ATP/nucleotide binding, cell death/apoptosis, cell cycle, mitochondrion, RNA binding, methylation, tumor suppressor, DNA metabolism and DNA repair (EASE enrichment score >2, EASE P-value<1×10−4, Benjamini p-value<0.01). A highly overlapping functional spectrum was obtained for the Cancer Census genes [18]. The most notable exceptions to the overlapping ontologies in the two cancer gene-sets were related to “protein tyrosine kinases”, absent from the candidate recessive list. These proteins are one of the most represented classes of oncogenes, or dominant cancer genes. A functional classification similar to that of EASE was obtained with BinGO and Cytoscape (data not shown), where some of the most significant cellular processes identified were involved in cancer pathogenesis, such as cell cycle, cell death/apoptosis (corrected p-value<1×10−3). Finally, we generated a control set of human genes by random associating the p-values from the four mutation tests. When EASE and BinGO were applied to this control set no significant GO terms were identified.


We devised and applied a multi-tier genome-wide data mining assay towards the identification of genes prone to “recessive-type” mutations in cancer. The p-values resulting from each tier were combined to produce a “recessive cancer gene” p-value (Table 1 and 2). Three of the most notable cancer recessive genes, i.e. TP53, PTEN and CDKN2A, ranked 16th, 92nd and 135th, respectively, among all tested human genes. The block diagram of our rationale and the data flow are shown in Figure 1. The tests can be subdivided into two groups: one for detection of point mutations (amino acid substitutions and frame-shifts) and one for structural alterations (large deletions). In principle we could have also used a test for partial gene deletions, but in ESTs intra-gene rearrangements can be confounded with alternative exon splicing.

Table 2. The candidate recessive cancer genes with genomic location and associated copy number variations.

The probability of a protein having amino acid mutations and frame-shifts in cancer, events which are independent, was defined as the product of the respective p-values. Just using these two tests, the prototypical TP53 and PTEN cancer genes ranked 205th and 233rd out of 27,184 evaluated human transcripts (p-value<1×10−4). Additionally, two other well-known recessive cancer genes, CDKN2A and CDKN2B, also had significant p-values, albeit lower rankings (p<0.0025 and FDR = 0.019, respectively). This behavior was expected for genes with small coding regions, which might be more commonly deleted than mutated [6]. Their presence in the significant point mutations cancer gene-set, even at this intermediate stage, reassured us of the selection capabilities of our algorithm. Nevertheless this early classification, based entirely on point mutations, was compiled only from two mutation tests; thus, relying on EST sequencing data, it was still not reliable according to our model which incorporated an additional mutation mode. It should be noted that we did not set to identify translocations, alterations expected to be dominant at the cellular level and therefore not suited to our quest for recessive genes.

The last test, based on aCGH analysis, confirmed that a very large portion of the human genome is frequently deleted in cancer. As expected for our 2-channels aCGH procedure, we correctly detected sex chromosome genes as differentially represented in the genome screens. In particular, owing to the resolution of our structural assay, the genes from the pseudo-autosomal region 1 were identified as normal diploid (supplemental Figure S1). Most importantly, we would expect that polymorphic CNVs had not filtered through the aCGH assay. Indeed, only a small percentage of cancer genes coincided with polymorphic CNVs and this percentage is even smaller than expected by chance (Table 2).

The number of deletions detected by aCGH in the cancer genome is very high (more than 10% of human genes were deleted in cancer). Notwithstanding this deletion excess, when all mutation modes are included, the number of candidate genes is less than 0.5% of the analyzed human genome.

The cancer gene products are involved in biological processes such as cell cycle, DNA repair and apoptosis, in agreement with literature. The same functional terms are also associated to the genes in the COSMIC Cancer Census [18]. Strikingly, tyrosine kinases, dominant oncogenes, present in the Cancer Census, were absent from our cancer gene-set, in agreement with the selection for recessive genes.

Some strong limitations are inherent to our approach. It is unlikely that the recorded frame-shifts are polymorphisms, since they alter the primary structure of the gene products. Conversely, they might be very often results of sequencing errors. For this reason, we chose to filter out as much as possible the sequencing errors by using a paired t test over a sliding window. Another controversy might be related to the somatic character of the detected mutations. Since there are virtually no germ-line sequences corresponding to the tumor libraries in the EST database, there can not be any formal demonstration that the selected genes correspond to somatic mutation targets. We can not establish how many of the detected mismatches are real mutations, nor how many of them are truly of somatic origin. We could only attach to each human gene a p-value for the excess of mismatches with gene inactivating potential in cancer samples. The presence of TP53, PTEN and CDKN2A in the candidate gene-set and its functional characteristics, are evidences in favor of the hypothesis that we measured an excess of somatic cancer mutations. We will be able to refute this hypothesis by using various experimental protocols. On the other hand, it is possible that some of the candidate genes might bear germ-line mutations and thus constitute predisposition traits for cancer insurgence.

When we compared our results to those of the recently published massive sequencing project, some differences emerged. We used a larger amount of sequencing data, albeit of lower quality since we did not use second pass sequencing data. We obtained from dbEST a number of mismatches roughly 5 times higher than the genome wide sequencing screens. This excess could be due to the lower quality sequencing data in ESTs or the higher sensitivity of our approach compared to PCR based direct sequencing. Detection of under-represented mutations in often heterogeneous cancer biopsies can be a technical challenge for direct sequencing, but not for cloned ESTs.

ESTs were used in previous attempts to identify cancer related genes. Almost invariably these approaches were based on expression profiling, which in tumor samples is probably correlates and late events, among the steps leading to tumor development and progression. In a very different data mining effort on EST sequences in cancer, Qiu and co-workers [20] measured SNP-tumor association. Their analysis was highly focused on single nucleotide mismatches, and restricted to known mutations described in the SNP database and present in at least 50 EST hits. They identified 4,865 SNP frequent in tumors (p<0.05), out of which 327 induced amino acid substitution (cSNP). Many major histocompatibility complex (MHC) class II molecules were present among these coding SNPs, while none was present in our recessive cancer gene-set. Most importantly, no landmark cancer genes, such as TP53, PTEN and CDKN2A were present within cSNPs. Finally, none of the SNP genes detected by Qiu et al. [20] were present in our candidate recessive cancer gene set.

The minute cancer recessive sub-genome (<0.5%) we identified might represent a milestone towards the identification of novel markers for early diagnosis and prognosis. Additionally, our mining strategy can be applied to the data which will be available upon the sequencing of cancer genomes [22]. Finally, our work might lead to a different equilibrium within the pool of cancer genes, currently unbalanced towards dominant oncogenes.

Materials and Methods

EST data mining

All human coding sequences were extracted from RefSeq mRNA database at NCBI (27,184 sequences). The dbEST database contained more than 5.6 million human ESTs (exceeding 3,009 million nucleotides in length). The dbEST libraries (7574) were manually annotated corresponding to the biomaterial of origin and ESTs were subdivided in the following seven classes: cancer tissues and cell lines (Y, 4466 libraries), normal tissues (N, 2621), cell lines of uncertain origin (C, 193), hyperplasia (B, 32), normal tissues associated to cancer lesions (A, 33), matched normal controls from cancer patients (M, 70) and undetermined origin (U, 159). Only the library with clear cut origin was used: i.e. 4466 cancer tissues and cell lines (Y) vs. 2621 normal tissues (N). Tissues associated (A) or matched (M) to cancer, benign tumors (B) and other cell lines (C) were not used. The coding sequences for each RefSeq entry were aligned against the human dbEST database by using BLAST. The Cancer Mutome MySQL database was populated with a total of 43,965,904 mismatches and gaps extracted from 3,839,543 alignments. Perl was used to develop all the scripts and implement the system. BioPerl was used for the BLAST procedure and parsing. BLAST parameters were set to default (expect = 1E-21) with the exception of recovering up to a maximum of 500 alignments for each query.

Detection of point mutations in ESTs

To attenuate the problem of high sequencing error rate in ESTs, our procedure retrieved candidate mutations only in the region of maximum nucleotide identity to the query. Our assumption was that an identical error rate was present in the two EST populations, those derived from the control and those from the cancer cells. Therefore the frequencies of mismatches due to sequencing errors are expected to be comparable across ESTs for the same genes. The mismatches were considered for subsequent analysis only when present in the internal sequence (not in the first or last ten nucleotides of the BLAST alignments). Mismatches were then evaluated for their capabilities of changing the amino acid residue in the correspondent codon. A single candidate mutation was considered only once for each dbEST library, to avoid bias due to RNA copy number. The 8972 human genes most variable for number of mismatches (IQR>0.5) were retained for further testing. Statistics for amino acid substitutions, non-synonymous to synonymous nucleotide exchange rate and frame-shifts were calculated for each human coding sequence. Gene specific confidence limits for the respective paired t tests were calculated by bootstrap analysis. The two bootstrap classes were composed by random extracting 1000 times, with replacement, cancer or normal status from the library classes [23], [24].

In the first of three different measures, the frequencies of amino acid substitution were compared, for each gene in normal and cancerous tissues, by using paired t test over a 25-residues protein window. Normalization of mismatches for the control and cancer classes was attained by using a gene specific and local correction factor. The correction factor was derived by dividing the respective counts of ESTs in both classes at each nucleotide position of the query. The score assigned to each human RefSeq gene corresponded to the sum of the T scores values exceeding the gene-specific confidence limit (p<0.05) over the sliding window (i.e. the area of the peaks above the threshold).

The second measure, linked to the amino acid substitution frequency consisted in the evaluation of the selective pressure for amino acids changes. This filter was implemented to separate causal from bystander mutations and to further diminish the effects of sequencing errors. The ratios of non-synonymous (NS) to synonymous (S) nucleotide substitutions within the cancer and normal ESTs were calculated for each gene. A paired t test was used to compare the cancer and normal NS/S substitution ratios at different codons. When the number of synonymous substitutions at denominator was null, unity was added to both numerator and denominator. Only the amino acid positions significant for frequency of substitutions (in the amino acid substitution test above) were evaluated here. The gene specific confidence limit at 5%, the p-values and the FDR were again computed by bootstrap, as described above.

A third measure on point mutations was relative to the frequency of frame-shifts, which can produce premature protein termination or other major alterations in primary structure. In a paired t test, analogous to that for the pAA, a 25-nt sliding window based procedure was applied to the number of frame-shifts induced in cancer by 1 nucleotide insertions or deletions. Longer DNA alterations were not recorded, and were extremely rare. The gene p-values for such frame-shifts in cancer were again computed by bootstrap and defined as pFrameshift.

ESTs P-value and false detection rate calculation

Procedures were devised for calculation of gene-specific p-values and false detection rates in each one of the described approaches. Bootstrap analysis was used to compute the adjusted probability that a human gene was affected in cancer but not in normal ESTs [23], [24]. The resampling test allowed us to define confidence limits for each different gene and to effectively tackle local issues such as DNA composition, CpG occurrence, and protein or gene length. For the point mutation analyses, the resampling procedure was performed only on the protein residues found to be above T threshold (p<0.05). A range of bootstraps were performed to choose the lowest number of resampling cycles yielding stable p-values through a short gene list and 1000 cycles were found to be a satisfactory requirement. The ESTs belonging to cancer and normal classes were randomly subdivided to form two simulated classes with the same size as the original ones. The gene specific p-value was defined as the frequency at which the resampling test scored equal or better than the real test. Null p-values were set to half of the lowest p-value in the respective simulations.

Detection of deletions in array CGH

744 comparative genomic hybridization arrays were studied (537 samples from GEO and 207 from SMD). All platforms were 2-channel based, data were downloaded as normalized values, and probes were indexed by gene symbol. Gene data and annotations were stored in the GeoSoft database. All normalized log ratios were converted to log2 ratios, with the cancer value at the numerator and the control value at the denominator. Pre-filtering of genes was performed on standard deviation, to exclude the genes which did not show high variation of their genomic profiles (std dev<0.2). Genes were scored when measured in at least 300 tumors. Deleted cancer genes were expected to have log2 ratios lower than the 5th percentile of the bootstrapped log2 ratios; amplified genes log2 ratios higher than the 95th percentile of the bootstrapped values. Bootstrap analysis was used (10,000 random swaps of tumor and control channels) to obtain gene specific p-values and confidence limits for deletion and amplification.

Point Mutation and aCGH combined p-values

Finally, the p-values obtained by the three different tests: pAA-NSSR, pFrameshift and pDeletion were multiplied together to compute the global pRecessiveCancer p-value. This p-value was used to sort the human genes by their propensity to bear mutations in cancer. One hundred and 54 genes were selected with p-value below 1.5×10−7. One hundred resampling cycles were performed by randomly associating p-values for each mutation test and yielded a false detection rate of 39%. EASE ( and BinGO (Cytoscape plugin) were used for Gene Ontology analysis. Hyper-geometric test with Benjamini and Hochberg false discovery rate correction was used in BinGO [25].

Supporting Information

Figure S1.

Genomic structures are correctly identified by the aCGH protocol. Track analysis in UCSC Genome Browser of Xp22 Pseudo-Autosomal Region 1 (PAR1). The Pseudo-Autosomal Region 1 is correctly identified as normal (diploid) by the array CGH analysis, while the rest of X chromosome is reported, also as expected, “pseudo-amplified”. The chromosome X genes 3 prime of PAR1 appear as amplified because their DNA copy number is higher than expected when compared to the respective average DNA copy number in the whole, mixed sex, tumour population.


(0.31 MB TIF)

Figure S2.

Distribution of gene size in the candidate recessive cancer gene-set. The recessive cancer gene sizes do not differ significantly from the gene sizes in the human genome (most common genes range between 32 and 128 kb).


(0.28 MB TIF)

Table S1.

Array CGH datasets.


(0.04 MB DOC)

Table S2.

Functional (Gene ontology, biological process) chart of the candidate cancer recessive genes, FDR<0.5


(0.22 MB DOC)


We thank Prof. Italo Barrai for critical review of the manuscript.

Author Contributions

Conceived and designed the experiments: SV. Performed the experiments: SV JM. Analyzed the data: SV NM JM EO. Contributed reagents/materials/analysis tools: SV AV HA JP MN. Wrote the paper: SV NM EO CC.


  1. 1. Croce CM (2008) Oncogenes and cancer. N Engl J Med 358: 502–511.
  2. 2. Reddy EP, Reynolds RK, Santos E, Barbacid M (1982) A point mutation is responsible for the acquisition of transforming properties by the T24 human bladder carcinoma oncogene. Nature 300: 149–152.
  3. 3. Tabin CJ, Bradley SM, Bargmann CI, Weinberg RA, Papageorge AG, et al. (1982) Mechanism of activation of a human oncogene. Nature 300: 143–149.
  4. 4. Douma S, Van Laar T, Zevenhoven J, Meuwissen R, Van Garderen E, et al. (2004) Suppression of anoikis and induction of metastasis by the neurotrophic receptor TrkB. Nature 430: 1034–1039.
  5. 5. Schlabach MR, Luo J, Solimini NL, Hu G, Xu Q, et al. (2008) Cancer proliferation gene discovery through functional genomics. Science 319: 620–624.
  6. 6. Murphy JA, Barrantes-Reynolds R, Kocherlakota R, Bond JP, Greenblatt MS (2004) The CDKN2A database: Integrating allelic variants with evolution, structure, function, and disease association. Hum Mutat 24: 296–304.
  7. 7. Olivier M, Eeles R, Hollstein M, Khan MA, Harris CC, et al. (2002) The IARC TP53 database: new online mutation analysis and recommendations to users. Hum Mutat 19: 607–614.
  8. 8. Stenson PD, Ball EV, Mort M, Phillips AD, Shiel JA, et al. (2003) Human Gene Mutation Database (HGMD): 2003 update. Hum Mutat 21: 577–581.
  9. 9. Garnett MJ, Marais R (2004) Guilty as charged: B-RAF is a human oncogene. Cancer Cell 6: 313–319.
  10. 10. Wang Z, Shen D, Parsons DW, Bardelli A, Sager J, et al. (2004) Mutational analysis of the tyrosine phosphatome in colorectal cancers. Science 304: 1164–1166.
  11. 11. (1997) (1997) Pathology of familial breast cancer: differences between breast cancers in carriers of BRCA1 or BRCA2 mutations and sporadic cases. Breast Cancer Linkage Consortium. Lancet 349: 1505–1510.
  12. 12. Friend SH, Bernards R, Rogelj S, Weinberg RA, Rapaport JM, et al. (1986) A human DNA segment with properties of the gene that predisposes to retinoblastoma and osteosarcoma. Nature 323: 643–646.
  13. 13. Greenman C, Stephens P, Smith R, Dalgliesh GL, Hunter C, et al. (2007) Patterns of somatic mutation in human cancer genomes. Nature 446: 153–158.
  14. 14. Kantarjian H, Sawyers C, Hochhaus A, Guilhot F, Schiffer C, et al. (2002) Hematologic and cytogenetic responses to imatinib mesylate in chronic myelogenous leukemia. N Engl J Med 346: 645–652.
  15. 15. Davies H, Bignell GR, Cox C, Stephens P, Edkins S, et al. (2002) Mutations of the BRAF gene in human cancer. Nature 417: 949–954.
  16. 16. Wood LD, Parsons DW, Jones S, Lin J, Sjoblom T, et al. (2007) The genomic landscapes of human breast and colorectal cancers. Science 318: 1108–1113.
  17. 17. Samuels Y, Wang Z, Bardelli A, Silliman N, Ptak J, et al. (2004) High frequency of mutations of the PIK3CA gene in human cancers. Science 304: 554.
  18. 18. Futreal PA, Coin L, Marshall M, Down T, Hubbard T, et al. (2004) A census of human cancer genes. Nat Rev Cancer 4: 177–183.
  19. 19. Irizarry K, Kustanovich V, Li C, Brown N, Nelson S, et al. (2000) Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nat Genet 26: 233–236.
  20. 20. Qiu P, Wang L, Kostich M, Ding W, Simon JS, et al. (2004) Genome wide in silico SNP-tumor association analysis. BMC Cancer 4: 4.
  21. 21. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403–410.
  22. 22. Stratton M (2008) Genome resequencing and genetic variation. Nat Biotechnol 26: 65–66.
  23. 23. Davison AC, Hinkley DV (1997) Bootstrap Methods and Their Application. Press CU, editor. Cambridge.
  24. 24. Efron B, Tibshirani RJ (1993) An Introduction to the Bootstrap. Hall C, editor. London..
  25. 25. Maere S, Heymans K, Kuiper M (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of gene ontology categories in biological networks. Bioinformatics 21: 3448–3449.