Integration of pan-cancer transcriptomics with RPPA proteomics reveals mechanisms of epithelial-mesenchymal transition

Integrating data from multiple regulatory layers across cancer types could elucidate additional mechanisms of oncogenesis. Using antibody-based protein profiling of 736 cancer cell lines, along with matching transcriptomic data, we show that pan-cancer bimodality in the amounts of mRNA, protein, and protein phosphorylation reveals mechanisms related to the epithelial-mesenchymal transition (EMT). Based on the bimodal expression of E-cadherin, we define an EMT signature consisting of 239 genes, many of which were not previously associated with EMT. By querying gene expression signatures collected from cancer cell lines after small-molecule perturbations, we identify enrichment for histone deacetylase (HDAC) inhibitors as inducers of EMT, and kinase inhibitors as mesenchymal-to-epithelial transition (MET) promoters. Causal modeling of protein-based signaling identifies putative drivers of EMT. In conclusion, integrative analysis of pan-cancer proteomic and transcriptomic data reveals key regulatory mechanisms of oncogenic transformation.


Introduction
Central to the understanding of cancer cells are their epithelial or mesenchymal traits, which are governed by epithelial-mesenchymal transition (EMT). Cells that have undergone EMT display increased invasiveness and metastatic potential [1]. The transition is reversible, in that cells can also undergo mesenchymal-to-epithelial transition (MET) [2]. This plasticity plays a role in cancer progression and metastasis by increasing the capacity of cancer cells to invade and colonize at remote tissue [3]. EMT is thought to be governed by a few master regulators that induce epigenetic and transcriptional reprogramming, affecting the expression of multiple downstream genes [4]. The transition is characterized by the down-regulation of E-cadherin, which has been the gene most extensively studied, resulting in disruption of adherens junctions [5]. The inhibition of E-cadherin expression is known to be mediated by the transcription factor Snail [6]. At the CDH1 loci, Snail recruits protein complexes containing histone deacetylases (HDACs) that deacetylate H3 and H4 histones, silencing the transcription of Ecadherin [7]. Other key transcription factors implicated in EMT are ZEB1/2 and TWIST [8]. The regulation of EMT-TFs by miR200 and miR34 constitutes a double-negative feedback mechanism [2], predicting a bistable system with binary transition between cellular states. Essentially, EMT is controlled by multiple interconnected regulatory networks, which include transcriptional and post-transcriptional mechanisms. Due to high regulatory complexity, proteomic and transcriptomic technologies provide an opportunity to obtain a more global understanding of EMT and MET, while possibly discovering additional molecular mechanisms with implications for targeted cancer therapeutics.
The reverse phase protein array (RPPA) is a high-throughput proteomics method that utilizes antibody binding to quantify protein expression and post-translational modifications including phosphorylation, acetylation, and protein cleavage. Compared to mass spectrometry proteomics, RPPA has higher sensitivity for low-abundance proteins and is characterized by increased throughput; however, it relies on high-quality antibodies, so it cannot identify proteins or post-translational modifications de novo [9]. Using RPPA, 736 cancer cell lines have been assayed for 450 proteins and phosphoproteins covering well-established cancer-related signaling pathways [10]. This data complements prior efforts to characterize basal mRNA expression across many of the same cancer cell lines for different cancer types [11]. In addition, tumor samples have been characterized by similar RPPA experiments for samples from the Cancer Genome Atlas (TCGA) [12], which are publicly available through the Cancer Proteomics Atlas (TCPA) [13].
Most genome-wide studies of EMT in cancer cell lines and tumors have focused on particular cancer types. Combining EMT signatures based on cell lines and tumors of multiple cancer types can identify general transcriptomic features of EMT in cancer cells, which are clinically relevant for multiple types of cancer. More recently, transcriptomic data from TCGA and Cancer Cell Line Encyclopedia (CCLE) have been used to define a pan-cancer EMT signature based on the expression of E-cadherin and Vimentin alone [14]. In this study, we integrate transcriptomics and RPPA data from multiple cancer cell lines to study pan-cancer cellular states associated with EMT.

Transcript and protein signatures of pan-cancer cell lines organize by Ecadherin expression
The Cancer Cell Line Encyclopedia (CCLE) contains 1037 cancer cell lines with profiled transcriptomes [11], and the MD Andersen Cell Line Project (MCLP) contains 736 cancer cell lines profiled by RPPA [10]. Out of these cancer cell lines, 381 have both available RPPA and microarray data. RPPA measurements are available for 450 proteins and phospho-proteins, of which 311 genes can be matched to mRNAs measured in CCLE. In the RPPA data, 79 proteins are measured both at the basal expression and phosphorylation levels ( Fig 1A). To our knowledge, this data set, although far from genome-wide at the protein level, represents the largest collection of cancer cell line data measured at the transcriptional, translational, and post-translational levels.
Transcriptional profiling of human tumor samples accurately predicts the tissue of origin for common cancer types [15]. This suggests that despite oncogenic transformation, cancer cells retain cellular identity and molecular features of their ancestral cell lineage, which is a key confounding factor in pan-cancer analyses. To assess how cancer cell lines relate based on transcript and protein expression, we visualized distances between cell lines using the t-Distributed Stochastic Neighbor Embedding (t-SNE) method [16]. As expected, the cell lines are clustered predominantly by their tissue of origin for both RPPA and transcriptomic data ( Fig  1B, top). Cancer types with ill-defined or multiple clusters included breast and ovary as well as cell lines from the most common targets of metastasis: liver, lung, and bone. Nonetheless, most cell lines were correctly classified by nearest neighbor classification (S1 Fig), even when the t-SNE perplexity parameter was varied widely. In addition, independently from the t-SNE analysis, Gap statistics [17] from the average linkage hierarchical clustering at different tree cuts resulting in clusters of varying cardinalities displays similar grouping of cell lines. Furthermore, the inflection points in the number of clusters formed at different thresholds demonstrate that the cell lines are organized into distinct clusters based on expression vector similarity (S3 Fig). Next, we asked: to what extent the RPPA and transcriptomic data is concurrent. Although the cell line distances for protein and mRNA data were correlated (r = 0.58), they were surprisingly different for particular pairs of cell lines ( Fig 1C). To quantify these differences and rank cell lines with the most characteristic protein or transcript signatures, we calculated the residuals of a linear regression between the protein and mRNA cell line distances. According to this model, we found that the 49 breast cancer cell lines had the largest distances at the protein vs. mRNA levels compared with other sets of cells from other tissues of origin, suggesting that the RPPA measurements better distinguish breast cancer subtypes. Interestingly, hierarchical clustering based on the RPPA data supports three luminal breast cancer subtypes compared with two subtypes identified by transcriptomic data (Fig 1D) here and elsewhere [12]. More broadly, combining data from microarray and RPPA data strengthened the cancer type clustering of cell lines, further suggesting that these measurements of global cellular states are complementary. Overall, clustering of cell lines by transcriptomic and RPPA data is consistent with some cancer types being well-defined and others spanning a wide spectrum of molecular states, while retaining few but important distinguishing differences at both the cancer type and subtype level.
To test the hypothesis that EMT governs the molecular states of cell lines across cancer types, we colored the z-scores of E-cadherin expression on the points on the t-SNE maps. For both transcripts and proteins, the cancer cell lines were globally organized by a gradient of Ecadherin expression (Fig 1B, middle). This organization indicates a central role for EMT in characterizing the molecular states of cancer cell lines. Most cancer types associated with common carcinomas had cell lines that spanned this E-cadherin gradient, with lung and breast cancer displaying the largest span. In contrast, cell lines from skin, bone, blood, and kidney were exclusively found in regions with low E-cadherin expression; whereas cell lines from pancreatic and large intestine cancers were found mostly in regions with high E-cadherin expression with only few cell lines expressing E-cadherin at low levels. To ensure the robustness of these findings, we ran independent t-SNE analyses by varying the perplexity parameter, which recapitulated both the E-cadherin gradient and the cancer type-specific clusters (S2 Fig). In comparison, principal component analysis (PCA) yielded less separation of cancer types and a less prominent gradient of E-cadherin expression (S4 Fig). Using cell line annotations from the Catalogue of Somatic Mutations in Cancer (COSMIC), we found no obvious association to whether the cell lines were derived from primary or metastatic tumors (Fig 1B, bottom). This suggests that the arrangement of cell-lines on the t-SNE plots, and thus global expression at the mRNA and protein levels, is dominated by tissue of origin much more than metastatic status. Nonetheless, we propose that collections of pan-cancer cell lines can be used to study aspects of EMT related to E-cadherin expression, which is also clearly bimodal (Fig 2A).

Bimodal protein expression and phosphorylation indicate oncogenic transitions
Oncogenesis is a multi-step process by which cells acquire cancerous traits, which often mimic physiological cellular processes such as embryogenesis [14]. Such processes are governed by molecular switches that turn on or off coordinated cellular programs. Hence, analyzing the bimodality of protein expression can potentially illuminate cellular states of oncogenesis. To evaluate this idea, we fit a univariate two-component Gaussian mixture model to each RPPA measurement using the expectation-maximization (EM) algorithm. We evaluated bimodality against unimodal distributions using the Bayesian Information Criterion (BIC). Out of the 450 antibody-based RPPA measurements, 260 were bimodal across 736 pan-cancer cell lines ( Fig  2B, S1 Table). Among the most bimodal proteins were E-cadherin, Claudin7, and Rab25, all of which have been previously associated with EMT or MET [18]. However, because of the preponderance of tissue-specific signatures among pan-cancer cell lines, bimodal protein expression could more simply be explained by cell type-specific expression. For example, LCK was highly expressed only in a subset of blood cancer cell lines ( Fig 2C) concordant with its specific roles in T cell development [19]. To account for residual effects of ancestral cell types, we quantified the tissue diversity of the cell lines assigned to the low-and high expression states by the Shannon entropy of the tissue distributions. We then excluded the lower tertile of the minimum tissue entropy of the low-and high expression states. This approach yielded 172 bimodal proteins and phosphosites (Fig 2B, S1 Table). Out of these, 90 had balanced bimodal distributions including E-cadherin, Claudin7, and Rab25, indicating common pan-cancer oncogenic switches, while 82 were classified as rare transitions ( Fig 2D). This filtering and classification is likely prone to false positives due to other confounding factors such as different stages of the circadian clock at time of measurement. Compared to non-bimodal proteins, the proteins associated with the common switches were uniquely enriched for 107 Gene Ontology terms (p < 0.05, after Benjamini-Hochberg correction) many of which can be linked to metastasis and invasion ( Fig 2E).

Coupled bimodality suggests transcript-and protein-based regulatory basis for oncogenic switches
To identify whether the observed bimodal protein expression across cancer cell lines correlate with transcriptional regulation, we evaluated the bimodality of matching transcripts from CCLE ( Fig 3A). We then defined bimodal coupling coefficients between mRNA and protein measurements as the Spearman's correlation between the posterior probabilities of the mixture model. Overall, 14.0% of proteins measured in MCLP had highly coupled (r b > 0.5) bimodal expression of mRNA and protein. Slightly fewer proteins (10.8%) were uniquely bimodal only at the protein level, including important cancer-related proteins such as MEK1, mTOR, E2F1, TTF1, EIF4G, and JAB1. Hence, these proteins are bimodally expressed due to post-transcriptional regulatory mechanisms such as protein translation and degradation. In addition, we compared the bimodality of proteins and their phosphosites as measured by antibody binding in the RPPA data ( Fig 3B). Here, we found weaker bimodal coupling, indicating that phosphosignaling leading to bimodal phosphorylation is mostly independent from basal protein expression. Interestingly, bimodal HER2 phosphorylation at Y1248 was moderately coupled to HER2 protein expression (r b = 0.46), most likely due to autophosphorylation on increased dimerization at higher expression [20]. The bimodal EMT proteins E-cadherin, Claudin7, and Rab25 all had high bimodal mRNA-protein coupling (Fig 3C), confirming that these EMT switches are mostly determined by transcriptional regulation. Nonetheless, 30 cancer cell lines had high expression of the E-cadherin transcript but low protein expression (Fig 3D and 3E), suggesting that E-cadherin could be translationally or post-translationally controlled in some cellular contexts. Among these cell lines, 3 out of 4 CDH1 genotyped cell lines in COSMIC had either nonsense (MDA-MB-453 and HT115) or frameshift (MDA-MB-134-VI) mutations in CDH1, which validate our ability to identify effects on E-cadherin translation. Inactivating mutations in CDH1 are frequently observed in breast and gastric cancers with cancer type-specific mutational patterns and are associated with loss of cell-cell adhesion and increased cell motility [21]. The nature of the low E-cadherin protein expression in the other 26 cell lines remains unknown, but likely includes inactivating mutations and possibly translational or post-translational regulation. Transcriptional mechanisms that determine molecular switches are regulated by upstream signaling, such as phosphorylation cascades, which leads to coordinated expression of multiple genes. To detect candidates for such signaling and further characterize the EMT-related states in cancer cell lines, we analyzed the network of bimodal coupling coefficients among bimodal protein and phosphosites associated with high tissue diversity. We first trimmed the protein network by including only significant bimodal coupling coefficients (FDR < 5%, Bonferroni) with |r b | > 0.3. This yielded a network of 172 protein nodes connected by 507 edges, from which network communities were defined based on the leading non-negative eigenvector [22]. In total, we detected 8 protein communities that likely reflects shared underlying signaling or cellular events (Fig 4A). One community (EMT1) was clearly linked to EMT, containing E-cadherin, β-catenin, Fibronectin and Twist among several other EMT-related proteins (Fig 4A). E-cadherin was connected to EPPK1, INPP4B, Stathmin, Jagged1, UGT1A, and PDCD1L1, indicating that these might be involved in EMT. The strong bimodal coupling to EPPK1 could help explain why loss of E-cadherin is associated with migratory phenotypes and not just loss of cell-cell adherence; in mice keratinocytes, EPPK1 knockout cells exhibit faster migration and increased wound healing [23]. E-cadherin was also positively coupled to the phosphosite EGFR pY1068, which was in turn positively coupled to SRC pY416 and STAT3 pY705, suggesting a role for phosphorylation of these sites in EMT. Strikingly, all detected communities contained multiple proteins with known mechanisms linking them to EMT but also identified potentially undiscovered components (Fig 4A). The dispersion of these EMT-related proteins among the identified protein communities suggests that they are either part of separate biological processes, or that their involvement in EMT depends on cancer subtypes. Another intriguing possibility is that the multiple protein communities associated with EMT reflect partial cellular states inbetween epithelial and mesenchymal phenotypes (Fig 4B). In support of this idea, P-cadherin has previously been suggested as a marker of metastable EMT states [24]. Here we find P-cadherin in the EMT2 community (Fig 4A). In addition, Claudin7 is highly coupled to E-cadherin (r b = 0.70), but found in a separate community (EMT3), along with Rab25 and N-cadherin. Looking closer at this correlation, cell lines had high E-cadherin and low Claudin7 expression but not conversely (Fig 4C). Two other communities are identified, EMT4 and EMT5. The EMT4 cluster contains key cell-cycle transcription-factors such as FoxM1, Cyclin-B1, and Elk1, together with protein kinases that are known to positively regulate their activity, including PLK1, MAPK, and MEK1. Consequently, this cluster indicates changes in cell proliferation regulation. The EMT5 cluster contains a clique made of 3 protein kinases known to regulate the protein translation machinery: RICTOR, P70S6K, and PDK1; and S6 a key protein in the 40S ribosomal subunit. Hence, this cluster likely represents changes in protein translation activity related to overall cell growth. It should also be noted that highly studied proteins and phosphoproteins such pAKT, Cyclin D1, PTEN, and PKC are known to be central to many other pathways, not just to EMT. Hence, labeling all identified clusters as EMT clusters needs to be considered with such general functions in mind. Altogether, it is possible that the protein communities EMT1 and EMT3 may reflect a two-step transition (Fig 4B). In summary, the quintessential EMT marker E-cadherin was found centrally in a large protein and phosphosite network community with clear associations to known EMT markers. For these reasons, we focused subsequent analyses around the expression of E-cadherin, arguing that this approach reflects core aspects of EMT that are invariant across cancer types.

E-cadherin and bimodally coupled mRNA define genome-wide EMT profile
Because E-cadherin is primarily transcriptionally controlled, we next sought to characterize the coordinated transcriptional program associated with E-cadherin down-regulation. First, we defined an EMT signature based on the bimodal coupling (|r b | > 0.5) between E-cadherin protein expression and transcriptomic measurements from CCLE, resulting in 239 transcripts -215 positively and 24 negatively coupled (Fig 5A, S2 Table). To our knowledge, these 239 genes include many novel epithelial and mesenchymal markers, while recovering many known EMT markers previously described (Fig 5C), for example, Axl which was reported for non-small cell lung carcinoma [25], or KPNA2 for ovarian carcinoma [26]. The preponderance of positively coupled transcripts suggests that the EMT signature is predominantly characterized by down-regulation of genes governing epithelial traits rather than by gain of mesenchymal traits. Nonetheless, the bimodal coupling coefficients were shifted towards negative values (Fig 5B) and we did find negatively coupled mesenchymal markers such as ZEB1/2 [4].
To further characterize the EMT signature, we performed enrichment analysis on the epithelial markers (Fig 5B). Enrichment analysis [27,28] for Gene Ontology (GO) cellular components and biological processes clearly demonstrated epithelial phenotypes (Fig 5D). We also found enrichment for localization to the perinuclear region, which is a cytosolic region next to the nuclear envelope with largely unknown composition and biological function. This suggests that the epithelial markers can be used to prioritize spatial cellular regions not widely considered to be affected by EMT such as the perinuclear region, where components of endocytosis aggeregate, although it is well established that endocytosis is central to cell migration. Enrichment for TF binding, using aggregated results from ChIP-seq studies [29], identified SOX2, SMAD2-4, TP63, GATA3-4, and GATA6, which likely act to down-regulate epithelial genes during EMT (Fig 5E). The identified enriched TFs OCT4, SOX2, NANOG, KLF4, and ESRRB are all known to be essential for maintaining pluripotency of human and mouse embryonic stem cells [30]. These TFs bind to super-enhancer regions and through the Mediator complex [31]. Therefore, large parts of the observed transcriptional bimodality could be explained by TF co-operation at super-enhancers resulting in switch-like regulation at numerous genomic loci. At the CDH1 loci, ENCODE ChIP-seq data supports the involvement of super-enhancers since the loci is marked by high H3K27ac correlated with E-cadherin expression (S5 Fig). Previously, super-enhancers have been proposed to control partial EMT through the putative master regulator TFs ETS2, HNF4A, and JUNB [32], the first two of which we also identified through the TF enrichment analyses. Taken together, pan-cancer bimodality uncovers oncogenic states and regulatory mechanisms of EMT and MET.

EMT and MET can potentially be induced by small molecules
Gene expression-based, high-throughput screening is a promising approach to identifying small-molecule candidates that can reverse or mimic changes in expression observed in transition to a disease state [33]. To detect small molecules that would maximally push cells toward the EMT or MET expression state, we queried the EMT signature against signatures from 20,000 small-molecule perturbations of~50 human cell lines generated by the library of network-based cellular signatures (LINCS) project L1000 dataset [34]. We searched for small molecules that down-regulate epithelial genes and up-regulate mesenchymal genes, resulting in candidate EMT inducers (Fig 5E). Small molecules with the opposite effects were interpreted as MET inducers. Strikingly, most small molecules predicted to induce EMT were HDAC inhibitors, whereas most small molecules predicted to induce MET were kinase inhibitors. The identified HDAC inhibitor Trichostatin A has been shown to induce EMT in prostate cancer cells through modification of H3 near promoters of EMT-related genes [35]. Of the candidate MET inducers, Selumetinib, Trametinib, and PD-0325901 are thought to inhibit MEK, while Saracatinib and Dasatinib to inhibit SRC among other kinome targets. In agreement with these findings, a prior high-content chemical screen aimed at identifying inhibitors of EMT has predominantly identified other similar kinase inhibitors based on cell growth and migration assays [36]. Hence, in summary, caution should be placed in utilizing HDAC inhibitors as therapeutics due to their putative potential to enhance EMT as predicted by chemogenomics screening.

Causal protein and phosphorylation models identify drivers of cancer signaling and progression
The bimodal coupling model we implemented to analyze EMT is essentially correlative and hence not causal. However, establishing causal interactions based on RPPA data is challenging without time-series or direct perturbation data such as gene knock-downs or knock-outs [37]. With sufficient sample size and coverage of diverse cell lines, it is in principle possible to identify causal, regulatory interactions between measured signaling components. Despite not satisfying the observation that cell signaling regulatory networks contain feedback loops [38], learning Bayesian network learning algorithms can be applied to construct causal models of cellular regulatory networks, including cell signaling networks, from observational data [39][40][41]. To infer causal relationships among proteins and phosphosites measured by RPPA, we used a Fast Greedy Search algorithm to estimate a Bayesian network over all 450 RPPA measurements (Fig 6). Based on the resulting directed acyclic graph, we calculated betweenness centrality, subgraph centrality, in-degree, and out-degree for each analyte. Bimodal phosphosites overall had higher betweenness centrality (p = 0.036, t-test). By considering measures of network influence, several proteins and phosphosites were identified as promising candidate drivers (Fig 6A and 6B). In particular, SRC pY416 had the highest out-degree. This phosphosite is known to be highly predictive of patient survival [42].
Furthermore, we analyzed the network neighborhood of E-cadherin, Claudin7, and Rab25, which spanned the proposed two-step transition: EMT3 followed by EMT1 (Fig 6C). Using a hierarchical network layout algorithm, E-cadherin was found downstream of Claudin7 and Rab25 concurrent with EMT3 preceding EMT1 in cancer cell lines. In addition, several proteins and phosphosites upstream of the EMT markers were plausible oncogenic drivers for specific cancer types, supported by prior reports. For example, MACC1 is associated with pancreatic EMT and metastasis [43], which the analysis found for both pancreatic and pleural cancer cell lines, while suggesting the opposite effect in lung, endometrium, and upper digestive tract cancers. Lastly, we also found LKB1, CHK1, and Stathmin upstream of EMT markers. The inactivation of LKB1, which is frequently mutated in lung adenocarcinomas, induces EMT in lung cancer cells through activation of ZEB1 [44], whereas CHK1 mediates DNA damage response as part of EMT by stabilizing ZEB1 [45]. Inhibiting the microtubule destabilizer Stathmin impedes EMT by increased microtubule formation [46]. In conclusion, inferred Bayesian protein networks based on pan-cancer cell lines can potentially identify key drivers of EMT.
To validate the causal models, we carried out bootstrapping for 200 iterations and considered the average network for both cell line and tumor data (Supporting Information, S3 Table, Fig 6E). Although the reproducibility of particular edges in the ensemble of Bayesian networks was relatively low for the cell line data, the connection from Claudin7 to Rab25 was present in 48% of the networks, and in all the networks inferred from tumor samples. In contrast, the statistical reproducibility of the tumor networks was higher, most likely due to the larger sample size (n = 3,161). Overall, the average Bayesian networks were significantly correlated between cell line and tumor data (Fig 6D). The bimodal coupling coefficients, however, were lower in the tumor data, indicating that bimodal expression is less pervasive in tumors compared to cell lines. This result might indicate that tumors, more-so than cell lines, contain mixtures of cell types that are in multiple cellular states. Overall, the learning Bayesian Network strategy employed here is exploratory and needs further evaluation, parameter tuning and validation.

Discussion
Cellular transitions from epithelial to mesenchymal phenotypes share common characteristics such as down-regulation of E-cadherin in a variety of tissues and cancer contexts [47]. In this study, we demonstrate that in pan-cancer cell lines, bimodal coupling of transcript, protein, and phosphosite expression reveal epithelial and mesenchymal states. However, many known EMT markers are dispersed across bimodally coupled network modules, suggesting that they are involved in distinct regulatory programs. Different modules likely correspond to intermediate cellular states of an EMT spectrum, whereby transcriptional down-regulation of E-cadherin, and other genes, represents a decisive loss of epithelial traits. In agreement, we find that E-cadherin expression is primarily transcriptionally controlled, possibly with context-dependent control at the translational or post-translational level. By anchoring the investigation around the transcriptional program associated with E-cadherin down-regulation, we identified 239 bimodal EMT markers, many of which have not previously been associated with EMT.
The observation that EMT markers are particularly bimodal suggests that cell lines are unequivocally either epithelial or mesenchymal in cell culture. It follows that the EMT decision for cells is determined by the growth medium and the genetics of the cancer cells, rather than by stochastic processes leading to heterogeneous mixtures of cells. Therefore, the identified bimodal switches likely reflect deterministic rather than stochastic architecture. However, in bulk experiments of cell lines, rare but important populations of cells such as mesenchymal stem cells could be neglected. The lack of obvious association between metastasis and E-cadherin expression raises some questions. Possibly, the in vitro conditions, lack of cues from tumor microenvironments, and cell culture passages might mask the original metastatic events from which the cell line is derived. More broadly, cell culture conditions may fail to model crucial aspects of how EMT occurs in complex tissue environments. Yet, the identified deterministic mechanisms may be valid only under the right conditions.
We find that cancer cell lines down-regulate epithelial and up-regulate mesenchymal genes when treated with HDAC inhibitors. This observation warrants caution for the use of HDAC inhibitors as cancer and other therapies. If HDAC inhibitors induce EMT in cancer cells, this could explain the disappointing outcomes in clinical trials of HDAC monotherapies for solid tumors [48]. Several protein kinase inhibitors were predicted to revert cancer cell lines to a more epithelial state, but most of these kinase inhibitors are not currently in clinical use. Therefore, these kinase inhibitors may be effective as metastatic repressors and could be under-investigated due to the contemporary focus on targeted drug treatments rather than broad functional effects. Furthermore, the mechanisms of action for the small-molecules may inform us about EMT or MET drivers. For example, the identification that the two SRC inhibitors Dasatinib and Saracatinib are potentially MET inducers, the co-clustering of SRC pY416 with Claudin7, and its large out-degree in the Bayesian network, all corroborate evidence to the importance of SRC activity for regulating EMT during cancer progression. In addition, the identification of kinase inhibitors rather than other classes of small-molecules suggests that phospho-signaling in general is particularly important for driving MET.
Lastly, we show that causal models of protein expression and phosphorylation in cancer cell lines identify known and putative drivers of EMT. Due to the promising preliminary results from the causal models that we constructed, identifying molecular drivers of EMT, despite the lack of statistical power to robustly detect individual causal interactions, it is clear that measuring more cell lines, under more conditions, would substantially increase the sensitivity and in turn quality of such models. Also, if a sufficient number of pan-cancer cell lines could be profiled by mass spectrometry proteomics, the developed bimodal methodology could be reapplied to confirm and discover novel associations between proteins and post-translational modifications that drive oncogenic state transitions.

Matching RPPA and CCLE cell line data
The RPPA data for 736 cancer cell lines were generated at the MD Anderson Cancer Center. The selection criteria of the 474 measured proteins were based on the aim to cover known cancer-related signaling pathways. We excluded antibodies with missing values across cell lines by requiring that each RPPA measurement is present in at least 40 cell lines. This resulted in a dataset with 450 antibody-based measurements. The CCLE mRNA data and cell line annotations of 1,037 cancer cell lines were retrieved from the CCLE portal at: https://portals. broadinstitute.org/ccle. We used the gene-centric RMA-normalized data.

Cell line distances and clustering
For all methods relying on geometric distances, Euclidean distances were computed considering only pairwise complete features. Sparse RPPA measurements were excluded, requiring that each protein is measured in at least 100 cell lines, which resulted in the inclusion of 263 protein measurements. To reduce the dimensionality of the RPPA and mRNA data, we used t-SNE implemented in the R package 'tsne' with perplexity value of 30 and at 5,000 iterations, and all other arguments at their default values [16]. Only cell lines with available RPPA and mRNA data were included. For the combined RPPA and CCLE mRNA embedding, the distance matrices for each data set were weighted by the sum of all distances. In this way, each data type contributed equally to the combined analysis. To more rigorously assess the number of clusters supported by the RPPA and CCLE data sets, we calculated the Gap statistic [17] from the average linkage hierarchical clustering at tree cuts resulting in clusters of varying cardinalities. PCA was performed using the R package 'pcaMethods' from data that were centered and scaled to unit variance, while imputing missing values with the 'svdImpute' method. Furthermore, we analyzed patterns in the classification and misclassification of the tissue of origin for the RPPA and mRNA data using 3-nearest-neighbor classification according to a leaveone-out cross-validation scheme. Linear regression was carried out between the RPPA and CCLE mRNA distance matrices with the RPPA distances considered the target variable. The residuals of the regression thus quantify the deviation from expected distance for RPPA data for each pairwise cell line distance. To compare the clustering of breast cancer cell lines, we computed tanglegrams using the dendextend R package. The tanglegram method uses a random search to rotate tree nodes minimizing the overlap of lines drawn between leaves of two trees.

Statistical models of bimodal expression
We fit univariate two-component Gaussian distributions using the expectation-maximization (EM) algorithm implemented in the 'mixtools' R package with default parameters. To compare the fitted distribution to unimodal Gaussian distributions, we calculated the difference between the Bayesian Information Criterion (BIC). The data were determined to be bimodal if the BIC difference was larger than 2. Based on the fitted Gaussian mixture model, we calculated, using Bayes' theorem, the posterior probabilities of measurements being generated from the high expression component. Note that the probability of belonging to the low component is 1-p. To estimate the tissue diversity of each bimodal fit, we first calculated the frequencies of tissues assigned to the low (p < 0.5) and the high (p > = 0.5) component. We then calculated the Shannon entropy of the tissue distributions associated with the low-and the high components. The bimodal RPPA measurements were classified into groups of low, medium, and high tissue diversity by the tertiles of the minimum tissue entropy associated with low-and high expression. The bimodal expression was considered common if the fitted mixture coefficients were above 1/4 and rare if below. Based on the posterior probabilities of the bimodal fits associated with high tissue diversity, we calculated a network of bimodal coupling coefficients defined as Spearman's correlations between the posterior probabilities. To detect robust communities in this network, we set a cutoff of |r b | > 0.3 and calculated the leading non-negative eigenvectors using the igraph R package. The network was visualized in Cytoscape with node size proportional to the mixing parameter of the two-component Gaussian fit and with edge coloring based on the coupling coefficients.

EMT signature enrichment and queries
The coupling coefficients between the E-cadherin RPPA measurements and matched CCLE transcript data were used to define an EMT signature (r b > 0.5). Enrichment analysis was performed with Enrichr [28]. L1000CDS2 was used to query small molecules as potential inducers or reversers of EMT [34]. We summarized the EMT and MET small-molecule predictions by reporting the top-50 small molecules identified using boxplots to aggregate small molecules with multiple experimental conditions such as cell lines, dosage, or timing.

Causal modeling
The causal models of the RPPA measurements across cancer cell lines were inferred using the Fast Greedy Search algorithm [49] implemented by the BD2K Center for Causal Discovery [50]. We used the rcausal R package version 0.99.5 to run the Java implementation with penalty discount 4 and depth 3. To visualize causal neighborhoods, we computed graph cuts and rendered the subnetwork in R using a Sugiyama layout of the directed acyclic graph. The tissue-specific correlations were layered on top of the edges as histograms. To estimate the robustness of the resulting causal network, we ran the algorithm several times in a bootstrap scheme (M = 200) by sampling with replacement.