Inferring the Temporal Order of Cancer Gene Mutations in Individual Tumor Samples

The temporal order of cancer gene mutations in tumors is essential for understanding and treating the disease. Existing methods are unable to infer the order of mutations that are identified at the same time in individual tumor samples, leaving the heterogeneity of the order unknown. Here, we show that through a complex network-based approach, which is based on the newly defined statistic –carcinogenesis information conductivity (CIC), the temporal order in individual samples can be effectively inferred. The results suggest that tumor-suppressor genes might more frequently initiate the order of mutations than oncogenes, and every type of cancer might have its own unique order of mutations. The initial mutations appear to be dedicated to acquiring the function of evading apoptosis, and some order constraints might reflect potential regularities. Our approach is completely data-driven without any parameter settings and can be expected to become more effective as more data will become available.


Introduction
Cancer is a genetic disease caused by the mutation of cancer genes consisting of oncogenes and tumor-suppressor genes. In most cancer cases, multiple mutations occur in a procedure known as tumor progression [1,2]. To understand tumor progression, studies have been performed to model general regularities on the temporal order of mutations for a given type of cancer using both experimental and computational approaches [3][4][5][6][7]. As a canonical model, the order of mutations for colorectal cancer was reconstructed through tumor size and grade [8]. The latest computational models infer the typical temporal order constraints for certain type of cancers by simulating tumor progression as a stochastic process [9][10][11]. Despite this progress, there is still no well-defined method to infer the order of mutations identified at the same time in individual samples, although this inference is necessary to reveal the heterogeneity of the order of mutations in a cancer. Recently, as new generation sequencing becomes widely applied, the mutation landscapes in various cancers are being revealed one by one. The results have shown that the mutations in a cancer frequently demonstrate statistical correlations with each other or even cause-and-effect linkages of induction between the former and the latter [12][13][14][15][16][17][18]. However, these correlations/ linkages have not been fully exploited in inferring the temporal order of mutations.
From an informatic perspective, this study defines a statistical measurement to assign value to the correlations or linkages mentioned above and model the mutations within a complex network, through which the temporal order of the mutations in individual samples can be inferred. We call the measurement the carcinogenesis information conductivity (CIC), which measures the reachability of transferring the information of a cancer gene having mutated to the transcription process of a given un-mutated cancer gene to induce its mutation. Statistically, the reachability can be estimated by the individual occurrence frequencies and the sequential co-occurrence frequency of the two genes' mutations in cancer samples. Additionally, competition among the information sent from multiple mutated genes to the given un-mutated gene should also be considered as any successful sending will cause the target gene to mutate, thus ending the mutation process. In this study, we call any two mutations found out in the same cancer sample co-occurrent mutations. While most genomic studies provide this quantity in an indirect way, here we aim at disentangling the sequence of occurrence of two mutational events from the simple co-occurrence. From these sequences of mutation occurrence, the sequential co-occurrence frequency can be calculated (Materials and Methods). Based on this idea, we have defined the CIC from cancer gene i to cancer gene j as: where f i (f j ) is the occurrence frequency of the mutation of gene i (j) in cancers, f ij is the sequential co-occurrence frequency of the mutation of gene i followed by the mutation of gene j, and p ij is the priority of gene i compared to other mutant genes to send the information to gene j. We have determined that p ij~1 jS ij j : P s[S ij (1z P k=i I(f kj wf ij js)). In this equation S ij is the set of cancer samples with mutant genes i and j, jS ij j is the number of samples in the set, and I(f kj wf ij js) is an indicator function that equals 1 if f kj wf ij for the mutant genes k, j and i in sample s. Otherwise it equals 0. Accordingly, the highest priority of one will be assigned if f ij is larger than f kj in every sample of the set, and the more times that f kj wf ij , the larger value the p ij . We regard formula (1) as a measurement of carcinogenesis information conductivity because the ratio (f ij =f i ) is an estimate of the maximum chance that gene i sends carcinogenesis information to gene j and causes its mutation, the ratio (f ij =f j ) is an estimate of the maximum chance that the mutation of gene j is caused by carcinogenesis information received from gene i, and p ij is the priority of the communication link compared with other links to gene j. The value of C ij ranges from 0 to 1. Like the definition of activation force, a measurement we previously proposed for weighting the links of complex networks [19], the definition of CIC follows the formula of gravity if we imagine the ratios (f ij =f i ) and (f ij =f j ) as masses and the priority p ij as distance. Statistics defined in this manner are likely to distribute their values in a power law, which is convenient for analyzing complex networks of intricate relationships including those in biology [20][21][22][23][24]. One challenge in computing the CICs is the lack of cancer samples that can be used as the source of the sequential cooccurrence frequencies of the cancer gene mutations because the mutations of different genes in a cancer sample are usually identified at the same time by sequencing. To tackle this challenge, we present an iterative procedure that couples CIC computation and the inference of the probability of every potential order of cancer gene mutation. The application of this procedure to the Catalogue of Somatic Mutations in Cancer (COSMIC) database [25,26] revealed that the iteration reached convergence within fewer than 10 loops, and the convergent results suggest significant conclusions.

Iterative inference scheme
To perform the iterative inference procedure, a large set of cancer samples with cancer gene mutations identified by genomewide sequencing is necessary. With the dataset, we determine the basic statistics of occurrence and non-sequential co-occurrence frequencies of cancer gene mutations. From these basic statistics, the iterative inference for the number of samples in question begins and the CIC results and probable orders of cancer gene mutation for each sample in question are determined when the iteration reaches convergence. Fig. 1 illustrates an overview of the procedure.

Iterative procedure of CIC computation and inference of mutation order
By definition, sequential co-occurrence frequencies are necessary to estimate the CIC value. However, this requirement cannot be satisfied by the current databases, including COSMIC. To overcome this difficulty, we adopt an iterative procedure to couple the inference of the occurring mutation orders and the computation of the CICs. First, we evenly divide a non-sequential cooccurrence frequency into the two possible sequential cooccurrence frequencies to calculate the initial CICs. We then infer the mutation orders with the initial CICs to repredict the sequential co-occurrence frequencies, repeat CIC computation and inference of the mutation orders until a convergent result is obtained.
Based on the principle of maximum entropy, we first use a uniform prior distribution of the occurrence orders, which means that for the non-sequential co-occurrence frequency of the mutation of two genes i and j, the two mutation orders of iRj and jRi occur with the same probability. Therefore, the necessary sequential co-occurrence frequency is set as a half of the corresponding non-sequential frequency. With this setting, we compute the initial CIC between every pair of cancer genes.
We then compute the CIC that an order of more than two mutant genes possesses. In this computation, we must consider that each of the preceding genes may send the carcinogenesis information in parallel to a target gene within the order. Therefore, we borrow the principle of computing resistance in a circuit, which is a parallel-by-serial procedure; we sum all the parallel CICs from the preceding genes to a target gene within the order to determine the phase CIC of the order and then formulate the order CIC by cascading all the phase CICs. Consider the order APCRATMRKRAS as an example; this order contains two phases of information sending, RATM and RKRAS. During the first phase, the information can be sent from only one source, APC. Therefore,C APC,ATM , the CIC from APC to ATM, simply becomes the CIC of the first phase. In the second phase, however, both APC and ATM can become the information source, requiring the summation of the two parallel CICs as the CIC of the second phase. After the parallel step of each phase, the reciprocals of phase CICs, regarded as resistances, are serially summed as the reciprocal of the order CIC. The steps are summarized as follows: Parallel step: The kth gene in the order is the information receiving gene at the (k-1)th phase and has k-1 senders of parallel information. An order consisting of n genes has n-1 phases of carcinogenesis information conduction. In general, we have the equation, the CIC from gene i(p) to gene i(kz1), and i(x) is the index of the gene at positionx in the order. Based on the definition of the CIC, a larger CIC value of a possible order implies easier carcinogenesis information conduction within the order. Among all competing orders, the larger the CIC value of an order, the greater probability the occurrence of the order. Therefore, we presume that the CIC of an order is positively proportional to the probability of that order occurring. When estimating the probability of every potential order by a linear mapping from the CICs of all potential orders for a given set of mutant genes, the total of the probabilities of all the potential orders is equal to one. Formally, for a sample with n mutant cancer genes, the number of potential orders is n!; we map the CIC of order m (m = 1, 2, …, n!) into its probability using the equation After determine the probabilities of every possible order of the mutations, we redetermine the predicted sequential co-occurrence frequencies as follows: : g ij where p l m is the probability of order m of sample l, l~1,2,:::,L, and L is the number of samples in question. I(i?jjo l m ) is an indicator function that equals 1 when gene i occurs before gene j in order m of sample l and equals 0 in all other cases, and g ij is the nonsequential co-occurrence frequency between gene i and gene j. If the redetermined f ij values are nearly identical to the old ones or become convergent, the computed CICs and thus the inferred order probabilities can be regarded as reliable outcomes. Otherwise, the CICs and the order probabilities have to be redetermined in a new loop. The iterative procedure continues in this manner until convergence is reached. In practice, the criterion of convergence can be regarded as satisfied when the absolute difference between the new and old values of f ij monotonically reduces to a sufficiently small value.
Because we begin the iterative procedure with an initial prediction of the sequential co-occurrence frequencies from nonsequential frequencies based on the maximum entropy principle, which provides the maximum modification potential of the sequential co-occurrence frequencies in the first iteration, the modification will decrease gradually and finally become insignificant. This premise was verified in the study; a satisfying convergence was reached within fewer than 10 loops of the The occurrence and co-occurrence frequencies of the cancer gene mutations f i and g ij are determined from available samples, where i,j~1,2:::,m, and m is the number of the cancer genes targeted in the study. An occurrence of a gene will be counted if it is mutated in one of the samples, and a co-occurrence of a pair of genes will be counted if both are mutated in one of the samples; therefore, g ij~gji and g ii~0 . (b) Based on the principle of maximum entropy, the initial values of the sequential co-occurrence frequencies are set as The carcinogenesis information conductivities, C ij , are calculated from the vector of f i and the matrix of f ij . It should be noted that C ij might not be equal to C ji , implying that the matrix of C ij represents a directed network. (d) For each of the L samples in question, the probabilities of every potential order of the mutant genes in sample P l are computed according to the CICs of each order (Methods). (e) The matrix of f (k) ij is redetermined by the matrix of g ij and the ratio of the probability-weighted number of the orders indicated that i occurs before j to the number of co-occurrence frequency, it is important to note that has not reached the criterion of convergence, the inferred orders will not be regarded as stable and a new loop of the calculation of C ij and P l will be performed. Otherwise (f), the orders with a probability higher than random chance and the corresponding probabilities fÔ O (k) l g and fP P (k) l g are regarded as the referred results. For example, of all 6 potential orders for a sample with three mutant cancer genes a, b and c, orders b?a?c and b?c?a are identified as the probable ones due to probabilities of 0.7 and 0.2 (higher than a random chance of 1/6). doi:10.1371/journal.pone.0089244.g001 inference procedure using a set of samples from the COSMIC database.
The iteration based on COSMIC data reaches convergence within 10 loops. Here, we use the computation of CIC from KRAS to APC to introduce the procedure in detail. Initially, we calculate the occurrence frequencies of f KRAS = 125 and f APC = 209 and a non-sequential co-occurrence frequency g KRAS,APC = 79 from the COSMIC database. By defining half of the non-sequential cooccurrence frequency (79) as the sequential frequency, we determine that f KRAS,APC = 39.5. When comparing with the sequential co-occurrence frequencies from genes other than KRAS to the gene APC in each of the 79 samples, f KRAS,APC is found to have an average order of 1.47. Therefore the priority p KRAS,APC = 1.47, and the initial value of C KRAS,APC = (39.5/ 125)*(39.5/209)/1.47 2 = 0.028.
Using the initial CICs between all cancer gene pairs, we estimate the probability of every potential mutation occurrence order in each sample in the manner described above. According to the probabilities, the non-sequential co-occurrence frequencies can be unevenly divided into sequential frequencies. For the 79 samples in this example, the ratio of KRASRAPC vs. APCRKRAS based on the corresponding total probability for each order is 0.28: 0.72. Therefore, we update the value of f KRAS,APC = 79*0.28 = 22.1, and the priority p KRAS,APC is then determined with the new f KRAS,APC . With these new values, we redetermine C KRAS,APC .
The convergence of C KRAS,APC and its counterpart C APC,KRAS during the iterations is shown in Fig. 2. This example demonstrates that the values reach a satisfying convergence after just 6 iterations. This example also represents the common situation, thus we ended the computation of CICs after 10 iterations in this study.
Complexity of the inference procedure CIC computation has a complexity of O(n 2 ) if the number of cancer genes in the study is n, and the inference of the probabilities of all potential orders for a sample with m mutant cancer genes has a complexity of O(m!m 2 ). In our study, n is equal to 397 and m ranges from 2 to 8. Therefore, the complexity of O(m!m 2 ) can differ greatly for different samples. In reality, during the inference for the 1,118 samples reported in the study, the majority of the time was consumed by a few samples with the maximum number of mutant cancer genes. It is worth noting that during the entire procedure, we only have to compute the CICs once in each loop to infer the order probabilities for all samples. The inference procedure with 10 iterations for the 1,118 samples was completed within 10 minutes on a platform consisting of a PC (4*2.66 GHz Quad CPU) and Matlab.

Study data
The results reported in this study were obtained from a recent COSMIC database (issued on September 12 th , 2012) on coding point mutations. It is a table file containing the names of the mutated cancer genes in each cancer sampled. Mutant genes in the same cancer have the same tumor ID (ID_tumour), and the fields of genome-wide-screen and primary side provide the necessary information used in this study.
Steps for determining the occurrence and co-occurrence frequencies of cancer gene mutations in the samples The occurrence and co-occurrence frequencies of cancer genes in the cancer samples were used to estimate the CICs in the study, and the basic statistics were determined using the following steps: 1) Download the source file CosmicMutantExport_v61_120912.tsv through ftp://ftp.sanger.ac.uk/pub/CGP/cosmic/data_export/; 2) Make a temporary file by obtaining the records with the value 'y' in the 'genome-wide screen' field from the source file; 3) Make a primary file by obtaining the records of cancer genes defined by the file Table_1_full_2012-03-15.xls in the Cosmic web site from the temporary file, and refining the records into sequences of Gene_name and ID_Sample; 4) Make a mutation_sequence file in which each record is a list of the mutated genes in the same sample based on the primary file, and discard the record that contains only one gene name in the mutation_sequence file; 5) Count the occurrence and co-occurrence frequencies of the cancer genes based on the mutation_sequence file.

Features of the estimated CICs
We performed the inference on cancer gene mutation data from genome-wide scanned samples collected in a recent version of the COSMIC database. A total of 1,212 samples harboring 6,281 mutations in 397 cancer genes was available for determining the basic occurrence and co-occurrence frequencies. From these, 1,118 samples, each harboring no more than 8 mutant cancer  genes, were used in the iterative procedure of CIC computation and order inference. Table S1 lists the 1,118 samples. The results were found to converge within 10 iterations. After convergence, CICs with a value greater than 1.0E-6 presented a power law-like distribution over the magnitudes, such that the overwhelming majority has a magnitude less than the average of 4.0E-4 and a very small portion has a larger than average magnitude (Fig. 3, Table S2). This feature is also true for the distribution of the magnitudes of the CICs from (or to) a given gene in most cases, which means that only a small number of partners are significant in terms of carcinogenesis information conduction for any given gene. In other words, the CICs identify the closest partners in carcinogenesis information conduction. Furthermore, the directed networks of cancer genes linked by the CICs were asymmetrical and small world-like. The CIC from gene i to gene j was usually unequal to that from gene j to gene i; the network has a number of hub genes with many more links than normal. This feature is consistent with the notion that the signaling network in cancer is analogous to the Internet, which constructs a small world with hub nodes [27][28][29]. Fig. 4 illustrates a CIC linked network covering 44 cancer genes, including the hub genes APC, TP53 and MLL3, and the links stronger than 1.0E-2 showing asymmetry. The asymmetry of the CICs implies the existence of a preference for certain mutation orders. Additionally, the three hub genes are all tumorsuppressor genes, and the strongest directed link, with a value of 0.136, is from APC to KRAS, one of the most frequently mutated oncogenes, suggesting a superior information channel from the mutation of APC to the mutation of KRAS.

The inference of probable orders
The inferred mutation orders with a probability higher than random chance, referred to hereafter as probable orders, provided more concrete insights into tumor progression. We analyzed the probable orders inferred for the 1,118 cancer samples in question to investigate a maximum of 8 mutation steps from initiation. The primary sites of the samples were mainly located in the ovary (256), large_intestine (LI, 180), haematopoietic_and_lymphoid_tissue (HLT, 148), prostate (100), breast (97), central_nervous_system (CNS, 86), and upper_aerodigestive_tract (UAT, 72). Table S3 lists all the probable orders and their probabilities in the analyzed samples, and Table 1 shows a selection of them. Based on the probable orders, we concluded that in a given sample only a small portion of all the potential orders has a probability higher than random chance, and the sum total of the probabilities of those orders is close to the number of samples with a ratio of 1034.4/1118. This indicates that the inference identified a small portion of all the potential orders permutated by the given set of mutant cancer genes as the probable orders. For a sample harboring two mutant cancer genes, the inference always strongly suggests one of the two potential orders. However, for the samples with more than two mutant cancer genes, some orders might have comparable high probabilities. Although we cannot judge the individual plausibilities of the inferred probable orders because of the lack of ground truth for the orders in most cases, their significance could be strongly suggested by evaluating the inference with samples of a certain cancer type that have been well studied in terms of order. For example, APC, KRAS and TP53 are the three most frequently mutated genes in colon cancers, and their mutation orders have been well modeled [30,31]. In our results, the sample with mutant cancer genes APC and KRAS, yielded an inferred probability of 0.95 for the order APCRKRAS, which was consistent with previous studies. For the sample with mutant APC, KRAS and TP53 cancer genes, three probable orders of APCRKRASRTP53 (0.33), APCRTP53RKRAS (0.32) and TP53RAPCRKRAS (0.19) were inferred from 6 potential ones, and this result was also consistent with previous studies. BRCA1 germline mutations confer a high risk of breast and ovarian cancer, but somatic loss of the wild-type BRCA1 allele has been shown to usually occur after mutation of TP53 [32]. In agreement with this observation, we inferred the somatic mutation order TP53RBRCA1 with a probability greater than 0.99. These examples provide evidence to support the inference validity.

Initiators of probable mutation orders
Identifying the initiators of mutation orders has been regarded as one of the major challenges in the study of tumor progression [1]. Our inferred probable orders of mutation provided informative hints to solving this challenge. By examining the genes that initiate the probable orders, we found that the initiators were dominated by tumor-suppressor genes. An overwhelming majority (more than 77.5%) of the probability-weighted number of the probable orders was inferred to be initiated by a tumor-suppressor gene rather than an oncogene. There were 368 cancer genes in the test cancer samples, among them only 92 were tumor suppressors. More specifically, there were 1,858 mutations of tumor suppressors among totally 3,823 mutations of all the cancer genes. Therefore the average chance for tumor suppressors to initiate the mutation orders was 48.6% (1858/3823). This demonstrates that the dominance of tumor suppressors in initiating the mutation orders could not be ascribed to chance. Additionally, the ratios of the number of times a gene was the initiator to its mutation frequency were generally different, implying that it is not certain that frequently mutated genes will mutate early (Table 2). Significantly, the probability-weighted number of the probable orders started by the top two tumor-suppressor gene initiators TP53 and APC, consisted of percentages as large as 46.9% and 11.4%, respectively. In contrast, the top two oncogene initiators, PIK3CA and KRAS, were found in percentages as small as 3.1% and 1.3%, respectively. The top initiators of mutation at the respective primary cancer sites suggested more details (Table 3). In general, all cancers at the major primary sites of the samples revealed a tumor-suppressor gene as their top initiator. In particular, TP53 was a common top initiator in four of the previously listed cancer types, ovary, UAT, breast and prostate, with percentages of 91.5%, 73.4%, 57.6% and 30.4%, respectively. In LI cancers, the top initiator was APC (57.5%), followed by TP53 (29.7%). Both CNS and HLT cancers had no obviously superior initiators, with CIC (13.6%), PIK3CA (10.1%) and TP53 (10.0%) as the top three initiators for the former, and TP53 (14.9%), NPM1 (10.4%) and MLL2 (9.9%) as the top three initiators for the latter. From the perspective of initiator distribution, ovary, LI, UAT and breast cancers were inferred to be dominated by a small number of tumor-suppressor genes, while HLT, CNS and prostate cancers were inferred to have more diverse significant initiators.
Previous studies have suggested a number of hallmark functions that need to be acquired for a cancer to generate, helping researchers understand the complexity in tumor progression in a way of logical, scientific manner [33,34]. Our inferred results point to a suggestion that goes one step further. In most cancers, the earliest acquired hallmark function might be evading apoptosis because the majority of first mutated genes in every cancer type in Table 3 (TP53, APC, KRAS, PIK3CA, NPM1 and CIC) have been found to encode apoptosis-regulating proteins, and the mutation of all of these genes has been shown to lead to deficient apoptosis functions. Specifically, the mutation of TP53 can result in the  removal of a key component of the DNA damage sensor, which functions to induce apoptosis [33,34], mutant forms of the APC protein can attenuate responses to apoptotic stimuli [35,36], the mutations in KRAS and PIK3CA can activate pathways that transmit anti-apoptotic survival signals [33], and the proteins encoded by NPM1 and CIC have been shown to function in apoptosis [37,38].

Informative transitions in the probable orders
The transitions in the probable orders provided additional information on tumor progression. Though mutations in BRCA1 and BRCA2 have been regarded as key markers for breast cancer occurrence, somatic mutations in the two genes in the breast cancer samples were not very frequent, with rates of 3/97 and 6/ 97, respectively, and both genes were inferred to have no chance of initiating a probable order. However, among all transitions in the probable orders of the breast cancers, TP53RBRCA2 and TP53RBRCA1 were identified as the second and fourth most frequent transitions, respectively, implying that mutations in these two genes tend to occur next to the mutation of TP53. Similarly, the transition of TP53RBRCA1was ranked as the third most frequent in the probable orders in ovarian cancer, supporting the conjecture mentioned above. In LI cancers, mutations in APC, TP53 and KRAS were found to occur at extraordinarily frequencies with rates of 146/180, 111/180 and 79/180, respectively, and their mutual transitions were the top six most frequent, implying that these three genes as a group play dominating roles in LI cancers. Liquid HLT cancers were inferred to have the 3 most frequent transitions that converged on one gene, TP53RBCL2, MLL2RBCL2 and EZH2RBCL2. Given that BCL2 is a key antiapoptotic gene [39] and was the most frequently mutated gene in HLT cancer samples, these convergent transitions suggest that HLT cancers might acquire the function of evading apoptosis in a unique way, mutation of the key anti-apoptotic gene BCL2 next to the mutations of certain tumor-suppressor genes. Informatively, among all the 36 BCL2 mutant samples, mutations at 179C and 392C of CDS (Coding DNA Sequence) were as frequent as 5 and 4 times, respectively, suggesting those to be hotspot mutations that might play a particular role in evading apoptosis. Because TP53, MLL2 and EZH2 were inferred to be the top initiators of mutation in HLT cancer samples, the function of evading apoptosis could be acquired in an early stage of tumor progression.

Discussion
The inferred results from individual samples firmly revealed the order heterogeneity in a given cancer type, showing the complexity of the disease. The results also highlighted the limited number of genes that are able to initiate the mutations and revealed that the hallmark function of evading apoptosis is acquired early. Other regularities implied in the results might also be significant in understanding and treating the disease. The proposed approach for inferring the temporal order of mutations is superior to existing methods in two ways: 1) it can be used to infer the order of mutation in individual samples with mutations in multiple genes which have been identified at the same time; and 2) it is completely data-driven, free from the difficulty in existing methods of setting proper parameters, such as fitness, mutation rate or waiting time [9][10][11]. When this approach is better supported by more sufficient data, it is expected to help discover more reliable information to understand the mechanism of carcinogenesis. Fortunately, the wide application of a new generation of sequencing will make this hope a reality.
The key to the success of this study is finding the statistical measurement of CIC, which proved to usually be asymmetrical between a pair of cancer genes, laying the foundation of mutation order inference. Meanwhile, the iterative procedure provides a feasible way to infer the CICs from non-sequential co-occurrence frequencies. With the CICs, the linkages between cancer gene mutations are modeled as a complex network with directed links. The small world-like nature of the complex network makes the inference of the temporal order of mutations effective.