Fig 1.
A. A compendium of cancer omics data is used as the training dataset. Three types of data from the 5,097 pan-cancer tumors were used in this study, including SM data (774,483 mutation events in 22,580 genes), SCNA data (1,612,667 copy number alteration events in 25,038 genes), and gene expression data (13,563,530 DEG events in 20,411 genes). SM and SCNA data were integrated as SGA data. Expression of each gene in each tumor was compared to a distribution of the same gene in the “normal control” samples, and, if a gene’s expression value was outside the significance boundary, it was designated as a DEG in the tumor. The final dataset included 5,097 tumors with 1,364,207 SGA events and 13,549,660 DEG events. B. A set of SGAs and a set of DEGs from an individual tumor serve as input for TCI modeling. C. The TCI algorithm infers the causal relationships between SGAs and DEGs for a given tumor t and outputs a tumor-specific causal model. D. A hypothetical model illustrates the results of TCI analysis. In this tumor, SGA_SETt has three SGAs plus the non-specific factor A0, and DEG_SETt has six DEG variables. Each Ei must have exactly one arc into it, which represents having one cause among the variables in SGA_SETt. In this model, E1 is caused by A0; E2, E3, E4 are caused by A1; E5, E6 are caused by A3; A2 does not have any regulatory impact.
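The hypothetical model in panel D can be written down as a small data structure. The sketch below is only an illustration of the "exactly one arc into each Ei" constraint; the variable and function names (`SGA_SET_T`, `CAUSE_OF`, `is_valid_model`, `targets_by_sga`) are assumptions made for this example and are not part of TCI itself.

```python
# Hypothetical encoding of the panel D example; names mirror the caption.
SGA_SET_T = ["A0", "A1", "A2", "A3"]  # A0 is the non-specific factor

# Mapping each DEG to a single cause enforces "exactly one arc into each Ei".
CAUSE_OF = {
    "E1": "A0",
    "E2": "A1", "E3": "A1", "E4": "A1",
    "E5": "A3", "E6": "A3",
}


def is_valid_model(sga_set, cause_of):
    """True if every DEG's single cause is a member of the SGA set."""
    return all(cause in sga_set for cause in cause_of.values())


def targets_by_sga(sga_set, cause_of):
    """Invert the DEG -> cause map into SGA -> sorted list of target DEGs."""
    targets = {sga: [] for sga in sga_set}
    for deg, cause in sorted(cause_of.items()):
        targets[cause].append(deg)
    return targets


TARGETS = targets_by_sga(SGA_SET_T, CAUSE_OF)
```

In this encoding, an SGA with no regulatory impact (A2 in the caption) simply ends up with an empty target list.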
Fig 2.
Estimation of the most probable causative SGAs for MPL by TCI and eQTL.
A. A diagram of the PI3K/AKT pathway, with PIK3CA, PTEN, PIK3R1 and AKT1 as key signaling proteins in the pathway. B. Results of TCI analysis of the most probable causes of the DEG MPL. On average, each tumor has ~300 SGAs and ~3,000 DEGs, which are organized as a bipartite graph. Solid green squares represent SGAs present in the current tumor; empty green squares represent SGAs not present in the current tumor. For a DEG observed in a tumor, e.g., MPL, TCI searches for the most probable cause among the SGAs observed in the tumor. An arrow represents a causal link between an SGA and a DEG, while the weight of an arrow represents the posterior probability that the SGA causes the DEG in the current tumor. PIK3CA is predicted to be the most probable cause for DEG MPL in 200 tumors; thus, we rank PIK3CA as 1st. PTEN is the most probable cause for DEG MPL in 140 tumors, ranking it as the 2nd most probable cause of DEG MPL. AKT1 is the most probable cause for DEG MPL in 11 tumors, and PIK3R1 is the most probable cause for DEG MPL in 10 tumors. C. eQTL analysis of the possible causes of DEG MPL. eQTL considers all SGAs (i.e., ~17,600 SGAs) as possible causes for DEG MPL. By p value, PIK3CA, PTEN, PIK3R1 and AKT1 ranked as having the 3rd, 5,690th, 7,661st and 14,563rd strongest associations with DEG MPL, respectively.
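The ranking described in panel B (count the tumors in which each SGA is the most probable cause of a DEG, then sort by that count) can be sketched as follows. This is a minimal illustration, not the authors' code: the input layout, the function names, and the toy posterior values are all assumptions made for the example.

```python
from collections import Counter


def most_probable_cause(posteriors):
    """Return the SGA with the highest posterior probability in one tumor."""
    return max(posteriors, key=posteriors.get)


def rank_causes(per_tumor_posteriors):
    """Rank SGAs by the number of tumors in which each is the top cause."""
    counts = Counter(
        most_probable_cause(p) for p in per_tumor_posteriors.values()
    )
    return counts.most_common()


# Toy posteriors for a single DEG; the numbers are made up for illustration.
toy = {
    "tumor_1": {"PIK3CA": 0.7, "PTEN": 0.2, "A0": 0.1},
    "tumor_2": {"PIK3CA": 0.6, "AKT1": 0.3, "A0": 0.1},
    "tumor_3": {"PTEN": 0.8, "A0": 0.2},
}
ranking = rank_causes(toy)
```

Because each tumor only contributes candidates that are actually altered in it, the candidate set per tumor stays small, in contrast to the eQTL setting in panel C where all SGAs are considered.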
Fig 3.
The landscape of SGAs and SGA-FIs.
A & B. The distributions of SGAs per tumor and SGA-FIs per tumor of different cancer types. Beneath the bar box plots, the distributions of different types of SGAs (SM, copy number amplification, and deletion) are shown. C. Distribution of SGA-FIs against the alteration frequency and protein length. Pink dots indicate SGA-FIs, and green dots represent SGAs that were not designated as SGA-FIs. A few commonly altered genes are indicated by their gene names, where genes labeled with blue font are well-known drivers, and those labeled with orange font are novel candidate drivers. D. Tumor-specific Bayesian prior distributions for the top 15 most frequent SGAs. The number above each box represents the number of tumors in which the corresponding SGA appears. E. A Circos plot shows SGA events and SGA-FI calls along the chromosomes. Different types of SGA events (SM, copy number amplification, deletion) are shown in tracks 2, 3, and 4, respectively. Track 1 shows the number of times that an SGA is labeled by TCI as an SGA-FI. The gene names denote the top 62 SGA-FIs (some are SGA units) that were called in over 300 tumors with a call rate > 0.8. Genes labeled with blue font are known drivers from two TCGA reports, and orange ones are novel candidate drivers. F. SGA-FIs that were called in fewer than 300 tumors and with a call rate > 0.9 are shown in this frequency-vs-call rate plot. As before, genes labeled with blue font are known drivers from TCGA studies, and orange ones are novel candidate drivers.
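The selection used for panel E (called in over 300 tumors with a call rate > 0.8) can be illustrated with a short sketch. The caption does not define "call rate"; the code below assumes it is the number of tumors in which a gene is called an SGA-FI divided by the number of tumors in which the gene is altered. That definition, the function names, and the toy counts are assumptions for illustration only.

```python
def call_rate(n_called, n_altered):
    """Assumed definition: SGA-FI calls divided by tumors carrying the SGA."""
    if n_altered == 0:
        raise ValueError("SGA does not occur in any tumor")
    return n_called / n_altered


def select_sga_fis(stats, min_calls, min_rate):
    """Keep genes called in more than min_calls tumors with rate > min_rate."""
    return sorted(
        gene
        for gene, (n_called, n_altered) in stats.items()
        if n_called > min_calls and call_rate(n_called, n_altered) > min_rate
    )


# Hypothetical counts: gene -> (tumors called as SGA-FI, tumors with the SGA).
toy_stats = {
    "GENE_A": (450, 500),  # frequent, high call rate
    "GENE_B": (320, 800),  # frequent, low call rate
    "GENE_C": (120, 125),  # infrequent, high call rate
}
frequent = select_sga_fis(toy_stats, min_calls=300, min_rate=0.8)
```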
Fig 4.
SM and SCNA perturbing a gene exert common functional impact.
A. Combining SM and SCNA data disrupts the correlation structure among genes enclosed in common SCNA fragments. The chromosome cytobands enclosing three example genes (PIK3CA, CSMD3, and ZFHX4) are shown. The bar charts show the frequency of SCNA (red, standing for amplification) and SM (green). The disequilibrium plots beneath the bar charts depict the correlation among genes within a cytoband. B-E. The SGA patterns, i.e., SM and CN amplification/deletion, across different cancer types for PIK3CA, CDKN2A, CSMD3 and ZFHX4. F-I. SGA-FI target DEG call rates in tumors with SM and in tumors with CN amplification/deletion for PIK3CA, CDKN2A, CSMD3 and ZFHX4. J-M. Venn diagrams illustrating the relationships of DEGs caused by CN amplification/deletion and SM for PIK3CA, CDKN2A, CSMD3 and ZFHX4.
Fig 5.
Statistical and experimental evaluation of TCI predictions.
A. The causal relationships inferred by TCI are statistically sound. Plots in this panel show the probability density distribution of the highest posterior probabilities assigned to each DEG in the TCGA dataset when the TCI algorithm was applied to real data (red) and to two random datasets, one in which DEGs were permuted across all tumors (blue) and one in which the corresponding SGAs were permuted across all tumors (green). The leftmost plot shows the posterior probabilities of all most probable candidate edges in the whole dataset; the remaining plots show the distributions of posterior probabilities of the most probable edges pointing from 3 specific SGAs to their predicted target DEGs. B. Boxplots of t-test q-values associated with predicted target DEGs for 8 SGA-FIs in different LINCS cell lines that were experimentally perturbed. Each box represents one SGA perturbed in one cell line. For example, APC-HA1E denotes that APC was perturbed in the HA1E cell line. Each black dot represents a q-value associated with a target DEG of an SGA-FI, obtained by comparing the expression value before and after genetic manipulation of the given SGA-FI gene with a t-test.
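A permuted control dataset of the kind described in panel A can be sketched as below. The caption does not state how the permutation was carried out; this example assumes each gene column is shuffled independently across tumors, which keeps the per-gene event frequency while breaking any tumor-level pairing with the other data type. The function name and the toy matrix are hypothetical.

```python
import random


def permute_across_tumors(matrix, seed=0):
    """Shuffle each gene column independently across tumors (rows).

    matrix is a list of rows (tumors), each a list of 0/1 event indicators.
    A new matrix is returned; the input is left unchanged.
    """
    rng = random.Random(seed)
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if matrix else 0
    permuted = [row[:] for row in matrix]
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        rng.shuffle(column)
        for i in range(n_rows):
            permuted[i][j] = column[i]
    return permuted


# Toy tumor-by-gene DEG matrix (4 tumors, 3 genes), made up for illustration.
toy_degs = [[1, 0, 1], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
shuffled = permute_across_tumors(toy_degs)
```

Running the same inference on such a matrix gives a null distribution of highest posterior probabilities against which the real-data distribution can be compared.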
Fig 6.
Cell biology evaluation of oncogenic properties of CSMD3 and ZFHX4.
A-B. The impact of knocking down CSMD3 and ZFHX4 on cell proliferation. C-D. The impact of knocking down CSMD3 and ZFHX4 on cell migration. E. Impact of ZFHX4 knockdown on apoptosis in the PC3 cell line, measured by Annexin V and propidium iodide (PI) staining.
Fig 7.
Detection of functional impact of SGA-FIs reveals functional connections among SGA-FIs.
A. Top 45 SGA-FIs (regulating the largest number of DEGs) and their relationships with 17 cancer hallmark gene sets. The value in a cell represents the fraction of genes in a hallmark gene set that is covered by the target DEGs of each SGA-FI. B-E. Top 15 SGA-FIs that share the most significant overlapping target DEGs with PIK3CA, TTN, CSMD3, and ZFHX4. An edge between a pair of SGA-FIs indicates that they share significantly overlapping target DEG sets, and the thickness of the line is proportional to the negative log of the p-value of the overlap between their target DEG sets. F. An “oncoprint” illustrating the causal relationships between the DEG RUNDC3B and its 3 main drivers according to TCI, namely, PIK3CA, CDKN2A, and PTEN. Each column corresponds to a tumor; green bars indicate tumors in which TCI designated each of the three SGA genes as a driver, regardless of what DEGs it was driving in a given tumor. The causal relationship is color-coded to show which SGA-FI is predicted by TCI to cause the RUNDC3B DEG event; the blue bar indicates the DEG events that were assigned to SGA-FIs other than the above 3 SGA-FIs; gray bars indicate a wild-type genomic and transcriptomic status.
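The cell values in panel A (the fraction of a hallmark gene set covered by the target DEGs of an SGA-FI) amount to a set intersection divided by the hallmark set size. A minimal sketch follows; the function names, gene identifiers, and hallmark name are placeholders chosen for this example.

```python
def hallmark_coverage(target_degs, hallmark_genes):
    """Fraction of a hallmark gene set found among an SGA-FI's target DEGs."""
    hallmark = set(hallmark_genes)
    if not hallmark:
        return 0.0
    return len(hallmark & set(target_degs)) / len(hallmark)


def coverage_table(targets_by_sga_fi, hallmark_sets):
    """Nested dict: SGA-FI -> hallmark name -> coverage fraction."""
    return {
        sga: {
            name: hallmark_coverage(degs, genes)
            for name, genes in hallmark_sets.items()
        }
        for sga, degs in targets_by_sga_fi.items()
    }


# Placeholder inputs, made up for illustration.
toy_targets = {"SGA_X": ["g1", "g2", "g3"], "SGA_Y": ["g9"]}
toy_hallmarks = {"HALLMARK_1": ["g1", "g2", "g4", "g5"]}
table = coverage_table(toy_targets, toy_hallmarks)
```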
Fig 8.
TCI predicts the SGA-FIs and their functional impact at the individual tumor level.
A. A graph produced by TCI for tumor TCGA-B1-A657 that predicts major SGA-FIs and their regulated cancer processes. Blue nodes represent SGA-FIs and red nodes (squares) represent oncogenic processes. A green directed link indicates that TCI predicts that the SGA-FI at the tail of the arrow regulates 10% or more of the DEGs in the cancer process at the head of the arrow. B. The same DEGs are regulated by distinct SGA-FIs in different tumors. DEGs in cancer processes shared between tumor TCGA-B1-A657 and tumor TCGA-HE-A5NL are shown as pie charts. Blue nodes denote SGA-FIs in tumor TCGA-B1-A657. Red nodes denote SGA-FIs in tumor TCGA-HE-A5NL. Yellow nodes (i.e., NEFH) are shared by both tumors. Each large node in the middle represents an oncogenic process. Within the circular nodes in the middle of the figure, the number in the purple area denotes the number of DEGs specific to TCGA-B1-A657. The number in the red area denotes the number of DEGs specific to TCGA-HE-A5NL. The number in the yellow area denotes the number of DEGs shared by both tumors. A green directed link indicates that an SGA-FI regulates 10% or more of the DEGs in the cancer process.
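The 10% rule for drawing a link, and the three counts shown in each pie chart, can be sketched as follows. This is an illustration under stated assumptions: the denominator is taken to be the DEGs of the process observed in the tumor, and the function names and toy inputs are hypothetical.

```python
def process_links(cause_of, process_degs, threshold=0.10):
    """Return (SGA-FI, process) pairs where the SGA-FI is the assigned cause
    of at least `threshold` of the process's DEGs in one tumor."""
    links = []
    for process, degs in sorted(process_degs.items()):
        degs = set(degs)
        if not degs:
            continue
        counts = {}
        for deg in degs:
            sga = cause_of.get(deg)
            if sga is not None:
                counts[sga] = counts.get(sga, 0) + 1
        for sga in sorted(counts):
            if counts[sga] / len(degs) >= threshold:
                links.append((sga, process))
    return links


def compare_tumors(degs_a, degs_b):
    """Counts of DEGs specific to tumor A, specific to tumor B, and shared."""
    a, b = set(degs_a), set(degs_b)
    return len(a - b), len(b - a), len(a & b)


# Toy tumor: 20 DEGs in one process, assigned to three hypothetical SGA-FIs.
toy_process = {"PROCESS_P": [f"d{i}" for i in range(20)]}
toy_causes = {f"d{i}": "S1" for i in range(15)}
toy_causes["d15"] = "S2"
toy_causes.update({f"d{i}": "S3" for i in range(16, 20)})
links = process_links(toy_causes, toy_process)
```

Here S2 causes only 1 of 20 DEGs (5%) and therefore gets no link, while S1 (75%) and S3 (20%) do.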