Molecular characterization of breast and lung tumors by integration of multiple data types with sparse-factor analysis

Effective cancer treatment is crucially dependent on the identification of the biological processes that drive a tumor. However, multiple processes may be active simultaneously in a tumor. Clustering is inherently unsuitable to this task as it assigns a tumor to a single cluster. In addition, the wide availability of multiple data types per tumor provides the opportunity to profile the processes driving a tumor more comprehensively. Here we introduce Functional Sparse-Factor Analysis (funcSFA) to address these challenges. FuncSFA integrates multiple data types to define a lower dimensional space capturing the relevant variation. A tailor-made module associates biological processes with these factors. FuncSFA is inspired by iCluster, which we improve in several key aspects. First, we increase the convergence efficiency significantly, allowing the analysis of multiple molecular datasets that have not been pre-matched to contain only concordant features. Second, FuncSFA does not assign tumors to discrete clusters, but identifies the dominant driver processes active in each tumor. This is achieved by a regression of the factors on the RNA expression data followed by a functional enrichment analysis and manual curation step. We apply FuncSFA to the TCGA breast and lung datasets. We identify EMT and Immune processes common to both cancer types. In the breast cancer dataset we recover the known intrinsic subtypes and identify additional processes. These include immune infiltration and EMT, and processes driven by copy number gains on the 8q chromosome arm. In lung cancer we recover the major types (adenocarcinoma and squamous cell carcinoma) and processes active in both of these types. These include EMT, two immune processes, and the activity of the NFE2L2 transcription factor. In summary, FuncSFA is a robust method to perform discovery of key driver processes in a collection of tumors through unsupervised integration of multiple molecular data types and functional annotation. Author Summary In order to select effective cancer treatment, we need to determine which biological processes are active in a tumor. To this end, tumors have been quantified by high dimensional molecular measurements such as RNA sequencing and DNA copy number profiling. In order to support decision making, these measurements need to be condensed into interpretable summaries. Such summaries can be made interpretable by connecting them to biological processes. Biological process activity is continuous and multiple biological processes are taking place in a single tumor. Therefore, the biological processes associated with a tumor are misrepresented by clustering, which tries to put every tumor in a single cluster. In the method introduced in this paper (funcSFA), molecular measurements are summarized into a small number factors. A factor is a continuous value per tumor that aims to represent the activity of a biological process. When applied to breast and lung cancer, funcSFA identifies factors covering well known biology of these tumor types. FuncSFA also finds novel factors covering biology whose importance is not yet widely recognized in these tumor types. Some of the factors suggest treatment opportunities that can be further investigated in cell lines and mice.


Introduction
Cancer is a heterogeneous disease, both at the molecular level and in response to 2 treatment. If we can better understand the variation between tumors, we may get a 3 better understanding of why tumors respond differently to treatment. This could, in 4 turn, lead to better treatment selection for patients. 5 To chart the variation across tumors, projects such as The Cancer Genome Atlas 6 (TCGA) have collected a variety of molecular data from thousands of tumors [1][2][3]. 7 Analyses of these data provide a better understanding of the underlying biological 8 processes associated with the cancer. For example, recurrent copy number 9 aberrations or recurrent point mutations may reveal the drivers of carcinogenesis. 10 Complementary to this, RNA expression or protein phosphorylation can reveal 11 downstream changes involving many genes, even if the upstream driver of those 12 changes is unknown. Hence, the different data types are reflections of the same 13 biological state, yet each of them encodes information not present or only partially 14 present in the others. Therefore, a comprehensive characterization of the molecular 15 variation across tumors requires the integration of multiple data types. 16 A popular approach to characterizing tumors is clustering of RNA expression data. 17 Examples include the PAM50 subtypes [4] in breast cancer and the consensus 18 subtypes in colorectal cancer [5]. Since these subtypes are only based on the RNA 19 expression of tumors, they will fail to capture differences in tumors that are more 20 clearly, or even exclusively observed on other molecular levels, such as protein 21 expression, or DNA copy number. 22 Integrative clustering approaches such as Bayesian consensus clustering [6], patient 23 specific data fusion [7] and iCluster [8] do take multiple data types into account. 24 However, clusters are unsuitable models of biological processes for at least two 25 2/30 reasons. First, a biological process can be activated in multiple contexts and multiple 26 independent biological processes can be active simultaneously. However, as clusters 27 cannot overlap, it becomes a challenge to represent this variation in a discrete 28 clustering. For example, immune infiltration occurs in both ER+ and ER-negative 29 breast tumors, but once a tumor is assigned to the ER+ cluster it cannot be assigned 30 to an immune cluster that spans all breast cancer tumors. Second, the variation in 31 activity of a biological process is often more complex than can be captured by a simple 32 distinction between absent or present. Instead, it is more naturally expressed along a 33 continuous scale of activity levels. This cannot be captured by discrete clusters. 34 Paradigm [9] improves upon the abovementioned approaches by integrating multiple 35 data types to infer activity levels of biological processes. Activity levels of biological 36 processes in tumors are assigned independently of each other, avoiding the limitation 37 of cluster analysis. To estimate these activity levels, Paradigm leverages existing 38 knowledge available from pathway databases. A limitation of this approach is that 39 using existing knowledge a priori limits discovery of new biological processes. More 40 importantly, it also limits the discovery of biological processes in new contexts 41 (e.g. tumor types) because activity of a process in a new context might involve a set of 42 genes that is only partially overlapping with the genes currently annotated to that 43 process. 44 Here we introduce FuncSFA, a sparse-factor analysis with a tailored gene-set 45 enrichment analysis (GSEA) [10] that integrates multiple data types to provide both a 46 continuous characterization and a functional interpretation of the variation across 47 tumors at the molecular level (Fig 1). The sparse-factor analysis identifies factors 48 explaining variation in multiple data types such as RNA expression, protein 49 expression, and DNA copy number data. Subsequently, the factors are interpreted 50 and linked to known biology using a gene-set enrichment analysis of the factors on 51 the RNA expression data. The interpretation obtained from the gene-set enrichment 52 analysis is validated by comparison of the genes, epitopes and copy number 53 aberrations in the factor to external resources. Together this not only provides insight 54 into variation across tumors but also into the biology underlying the molecular data. 55 The sparse-factor analysis is based on a reinterpretation of the mathematical 56 framework behind the iCluster method [8,11]. Our reinterpretation improves upon the 57 iCluster method in several key aspects. First, through proper factor rescaling and 58 more efficient optimization approaches, we ensure convergence of FuncSFA on 59 multiple molecular datasets that have not been pre-matched. An example of 60 pre-matching is in the iC10 subtypes [12], where the authors only selected genes 61 where RNA expression correlates with DNA copy number. Second, in contrast to 62 iCluster, FuncSFA does not assign tumors to discrete clusters, but identifies the 63 dominant driver processes across multiple molecular data types and across all 64 tumors, and then represents, for each tumor, the spectrum of processes active in that 65 tumor. Taken together, FuncSFA represents, for the first time, a robust method to 66 perform discovery of the key biological processes driving the most important 67 phenotypic differences across a set of tumors through unsupervised integration of 68 multiple molecular data types. 69 We applied FuncSFA to the TCGA breast and lung cohorts and identified 10 factors in   . The green circles represent the factors, and the red, blue and yellow circles at the bottom represent the observed variables, with the colors representing the data types and each circle representing an individual variable (i.e. the expression of a gene or protein, or the copy number of a gene). The black lines connecting the individual variables to the factors represent the regression coefficients. B: Graphical representation of the mathematical concepts of SFA with X representing the N × n data matrix, Z the N × k obtained factor matrix and B the k × n factor coefficients. C: Graphical representation of the computations of the factor expression coefficients. The coefficients represented by the k × n m matrix C are obtained by regressing the N × n m RNA expression matrix, X m , on the N × k factor matrix Z. D: The gene-set enrichment analysis designed to assign biological processes or pathways to the obtained factors. E: Application of the factors to determine the activity of the factors (or associated biological processes) in a new tumor. (N: number of tumors; n: number of features; k: number of factors; n m : number of mRNA features; Z: factor matrix; X: data matrix (concatenation of mRNA, copy number and Reverse Phase Protein Array (RPPA) data); B: Sparse factor coefficients; C: Factor regression coefficients; GSEA: Gene-set enrichment analysis).

4/30
represented by an independent factor. In lung cancer, which has remained largely 75 uncharacterized, we also identified an EMT and Immune factor, as well as a factor 76 associated with the main lung subtypes-Adenocarcinoma and Squamous Cell 77 Carcinoma. We also identified a factor which captures the activation of the 78 transcription factor NFE2L2. Here the power of integration of multiple data types is 79 highlighted by the fact that the activity of this factor is associated with mutations in 80 NFE2L2 as well as its inhibitor KEAP. We expect that the identified factors not only 81 provide a more complete characterization of the biological processes active in the 82 different tumors, but will also provide a starting point for the development of better 83 treatment strategies.  Sparse-factor analysis 94 In our sparse-factor analysis, we assume that each tumor type is characterized by a 95 set of key driving factors, or biological processes, that give rise to molecular 96 phenotypes. These driving factors cannot be measured directly, but are observed 97 indirectly through noisy measurements of multiple molecular data types such as 98 mRNA expression, copy number aberrations and protein expression and modification.

99
The challenge is to employ all these measured data types simultaneously to identify 100 this unobserved structure in the data.  The factors are identified by maximizing the joint likelihood of the the measured data 112 and the factors, selecting factors and regression coefficients explaining the measured 113 data. This is accomplished in a methodology that is akin to probabilistic principal 114 component analysis (PCA) [13]. We employ elastic net regularization [14] to create 115 parsimonious models by enforcing sparsity on the regression coefficients associated analysis. Consequently, a factor is defined by the subset of molecular variables that 120 contribute to that factor as well as the regression coefficients that model the degree to 121 which each molecular variable contributes to the factor.  The joint likelihood is maximized using an expectation maximization algorithm. This 129 algorithm improves over the iCluster2 algorithm [11] by rescaling the factors to unit 130 variance and by estimating coefficients with coordinate descent [15]). On the TCGA

144
The first step of the gene-set enrichment analysis is to regress the factors on the gene 145 expression matrix (Fig 1C). This may sound counter-intuitive, as the purpose of the 146 SFA is to pinpoint the genes contributing to the factor, hence revealing the underlying 147 drivers. However, the motivation for this is two-fold. First, the sparsity constraints on 148 the coefficients introduce zeros in the coefficients. Although this is beneficial for 149 human interpretation and driver identification, genes with zero coefficients can not be 150 ranked in terms of their contribution to the factors. This is a major complicating factor 151 as such a gene ranking is an essential part of the gene-set enrichment analysis.

152
Second, RNA expression remains a data type that captures most of the variation in 153 the cell, and can, as such be quite informative regarding the activity of biological 154 processes and hence for the interpretation of the factors. So, having gone through the 155 process of identifying driving factors that are robust, in the sense that they are 156 common to all data types, we employ the regression of the factors to the complete 157 RNA expression data set to identify all genes with expression patterns that show 158 association with the factors. This association is captured in the regression coefficients 159 (Matrix C in Fig 1C). We then normalize each coefficient by the gene standard 160 deviance to obtain the 'factor expression coefficients'.

161
The second step of the gene-set enrichment analysis is to rank the genes based on 162 6/30 the factor expression coefficients and compute the enrichment statistic for every 163 gene-set factor pair ( Fig 1D) [10]. We determined statistical significance of the 164 enrichments by a sample permutation test [10]. The enrichment results provide input 165 for a manual curation and verification process to identify the most likely biological 166 process that gives rise to the identified factors.  Application to breast and lung cancer 174 We have applied FuncSFA to the breast cancer [1] and lung cancer [2,3] data sets 175 from TCGA. Breast cancer is arguably the most exhaustively subtyped type of cancer, 176 and hence serves as a very good positive control for FuncSFA [1,12,16,17]. Lung 177 cancer has not been studied so extensively, even though at the moment it is the   Second, most subtyping approaches that have been applied to date revealed ten or 207 fewer subtypes: five intrinsic subtypes in breast cancer [19,20] ten IC10 subtypes in 208 breast cancer [12], four consensus subtypes in colorectal cancer [5], four in 209 squamous cell carcinoma [21] and three in lung adenocarcinoma [22]). Therefore, our 210 assumption was that ten factors would be sufficient to capture the strongest (known) 211 factors, while leaving room for the discovery of new biological processes without 212 running the risk of compromising the robustness of the factors being discovered by 213 having a too large factor-to-sample-size ratio.

214
Breast Cancer 215 We applied FuncSFA to the breast cancer data employing 10 factors based on the 216 arguments given above. We performed functional annotation of the factors, as  Table). This resulted in the following 10 factors: ER (Estrogen Receptor),  First, the ER factor is strongly associated with both mRNA and protein expression of 228 ESR1, GATA3, PGR and AR, as expected. Second, the EMT factor shows strong 229 association with THBS2 and COL11A1 expression-the genes identified by 230 Anastassiou and colleagues [23] (see also the more detailed description of the EMT 231 factor below). In addition, the EMT factor shows association with many collagens.

232
Third, the HER2 factor shows the expected strong association with ERBB2 copy explained by a given factor mostly being the variation in gene expression. There are a 242 few exceptions. The HER2, Luminal proliferative and 8q-gained factors, explain more 243 variation in copy number data than in RNA expression data. For HER2 and 8q-gained 244 tumors this is not surprising as they are clearly copy number driven (Fig 2 and 4). explains technical variation in the RNAseq data. The ER factor is unique in the sense 251 that it is the single factor that explains most variation in RNA expression and RPPA   Table). This suggests that this factor represents the  Second, the HER2 factor shows large SFA coefficients for ERBB2 protein expression 284 and copy number gain (Fig 2). The gene-set enrichment analysis shows enrichment 285 for signatures of ERBB2 amplification and the HER2-enriched PAM50 subtype.

286
Importantly, this factor is strongly associated with the amplicon on Chromosome 17 287 harboring ERBB2 and GRB7, as evidenced by the large coefficients identified for this 288 amplicon in the factor analysis (Fig 2 and 4). Taken together, this suggest that the 289 HER2-factor indeed identifies the HER2+ tumors. This is confirmed by the strong 290 association of this factor with the HER2-enriched PAM50 subtype (Fig 3B.3,

291
AUC=0.96) and by the large weights of the HER2/GRB7 locus on DNA copy number 292 (Fig 4).     EMT is a process frequently associated with cancer [24] and it involves multiple 313 regulators, including SNAI1, SNAI2, TWIST1 and ZEB1 as well as their targets [25].

314
The gene-set enrichment analysis revealed an association of one of the factors with 315 EMT and the extracellular matrix (S1 Table and

343
The Immune factor shows enrichment for Interferon-Alpha Response and other 344 immune related signatures (S1 Table and S1 Table). In order to shed further light on 345 13/30 this factor we employed publicly available Cibersort scores [29]  FuncSFA to the lung tumor data from the TCGA in order to further assess its 380 usefulness in uncovering new and clinically relevant biology. We merged the TCGA 381 lung adenocarcinoma and lung squamous cell carcinoma datasets to obtain a data set 382 that is comparable in size to the breast cancer set. We employed the same data types 383 (gene expression, RPPA and copy number) as we employed in the breast cancer 384 analysis. As the dataset is of comparable size to the breast cancer dataset, and as 385 Wilkerson and colleagues defined a total of 7 lung cancer subtypes [21,22] we set out 386 to identify 10 factors. This allows the capturing of known variation with some room for 387 the discovery of novel subtypes.  Fig 1) for the three data types are represented in Fig 6. As before we will first 394 provide some general observations of the results and then provide a detailed 395 description and analysis of each factor.

396
The coefficients depicted in Fig 6 reveal  In the remainder of this section, we will first discuss the Wilkerson subtypes that have 424 been proposed for lung cancer. Then we will discuss each of the identified factors in 425 greater detail.

426
The Wilkerson molecular subtypes of lung cancer 427 The molecular subtypes of lung adenocarcinoma and lung squamous carcinoma were 428 defined by Wilkerson and colleagues [21,22], and also found to be present on the 429 larger TCGA datasets [2,3]. Fig 7B.2

444
The TRU subtype is characterized by highly expressed asthma, excretion and 445 surfactant genes, the PP subtype with the overexpression of defense response genes 446 (chemokines) and the PI subtype with high expression of DNA-repair genes [22]. The 447 Wilkerson squamous cell carcinoma subtypes are termed 'Basal', 'Classical', Adenocarcinoma factor (Fig 7B.2 and B.3, and S10 Fig), which is consistent with the 455 fact that NKX2-1 is expressed in both the Secretory subtype and in adenocarcinomas. 456 Finally, the Primitive subtype has previously been associated with proliferation and 457 DNA processing and repair.

458
Adenocarcinoma and squamous cell carcinoma 459 The differences between adenocarcinoma and squamous cell carcinoma are captured 460 by the Adenocarcinoma factor, which is high in adenocarcinomas and low in 461 19/30 squamous cell carcinomas. Interestingly, in the primitive subtype of squamous cell 462 carcinoma, tumors with high and low values for this factor exist (Fig 7B.2 and B.3, and 463  S10 Fig). In addition, copy number is dominated by a particularly strong negative 464 association with the gain of Chromosome 3q26-29, implying that the 465 adenocarcinomas show absence of gains while squamous cell carcinomas carry this 466 gain (Fig 9). We also observe the expected associations between this factor and 467 mutations in NFE2L2, STK11, KEAP1. KRAS, EGFR, PTEN, TP53, PIK3CA, BRAF 468 and ARID1A (Fig 8A). As expected, the Adenocarcinoma factor is high in all the 469 Wilkerson adenocarcinoma subtypes, as compared to the squamous cell carcinoma 470 subtypes. More specifically, the Adenocarcinoma factor is very low in the Basal and 471 Classical subtypes but higher in the Secretory and Primitive subtypes.

472
Two factors strongly associated with RPPA: BSCC and 8p11 gained 473 The largest fraction of the variance explained by the BSCC and 8p11-gained factors is 474 associated with the RPPA data ( Fig 7A). In fact, the amount of variation explained in 475 the RPPA data by these two factors is the highest across all factors and all data types. 476 The 8p11-gained factor is associated with increased gain of recurrent aberrations 477 containing the genes NDS3, LETM2 and, FGFR1, which are located on the 8p11 478 chromosomal band (Fig 9), and shows high scores in the PI subtype. As expected, 479 the BSCC factor shows, in general, higher scores within the Basal subtype as 480 compared to other squamous cell carcinoma subtypes (S7 Fig and S10 Fig).  correlates with signatures of γδ T-cells and M1 macrophages (Fig 8B). In addition, this 513 factor is associated with an increased loss of 9p23-21. The Infiltrating B-cells factor 514 shows strong association with interferon associated signatures and T-cell pathways in 515 the gene-set enrichment analysis. However, this association is weaker than for the 516 Immune factor. When considering the RNA expression coefficients of this factor we 517 see large coefficients for immunoglobulins, but in contrast to the Immune factor, 518 neither the HLA nor the complement system is represented. This factor has a strong 519 positive correlation with plasma B-cells ( Fig 8B) and is associated with a higher 520 mutation rate of STK11 (Fig 8A).

521
As the PP subtype shows overexpression of defense response genes (chemokines),  association is also strongly supported by the large coefficients of genes encoding 535 ribosomal proteins (Fig 6). Finally, as in breast cancer, we also identified an EMT 536 factor in lung. As in breast, the EMT lung factor is strongly correlated with the  The NFE2L2 factor shows association with NFE2L2 targets in the gene-set 542 enrichment analysis, suggesting this factor captures the activation of NFE2L2. This 543 conclusion is confirmed by a strong association of this factor with mutations in 544 NFE2L2 and in its inhibitor KEAP1 (Fig 8A). Interestingly, the factor is depleted for 545 EGFR mutations, although EGFR activates the NFE2L2 pathway by inhibiting 546 KEAP1 [34]. Still, these three genes in the NFE2L2 pathway show a mutually 547 21/30 exclusive mutation pattern (Fig 8C, p<0.05, DISCOVER groupwise test [35]), 548 suggesting that they independently activate the NFE2L2 pathway effecting the 549 common downstream gene expression changes captured by this factor. So, this factor 550 could be a readout of NFE2L2 pathway activation, which in some conditions has been 551 suggested to increase resistance to EGFR inhibitors 552 In addition to the clear NFE2L2 association, this factor also shows large values for the 553 coefficients for genes related to xenobiotic detoxification, such as CES1, CYP4F3 and 554 CYP4F11. Hence, it is not surprising that the NFE2L2 factor shows high scores in the 555 Classical subtype, which is also associated with this process.

556
In summary, we have identified factors that do show overall association, but not very 557 clear one-to-one correspondence with the previously defined Wilkerson subtypes. Our 558 factors do capture the major lung cancer subtypes (Adenocarcinoma and Squamous 559 cell carcinoma), and identified processes that occur in both of these types. These 560 include epithelial to mesenchymal transition, two immune processes, and, most 561 interestingly, a factor that reports the activity of the NFE2L2 pathway. The latter may 562 have interesting therapeutic consequences.

564
In both breast and lung cancer the sparse-factor analysis was able to recover known 565 biology. Additionally, we find factors that capture biology that traditional clustering 566 methods have not been able to find, such as the Immune, EMT, and infiltrating B-cells 567 factors. Our results illustrate the need for continuous factors as opposed to discrete  Several factors may potentially have an application in predicting a tumors' response to 587 treatment. The NFE2L2 factor might be indicative of response to dimethyl fumarate 588 (DMF) treatment, which targets the NFE2L2 pathway and can inhibit 589 carcinogenesis [36]. Because the factor measures activation of the transcription factor 590 through its downstream targets, it could be more predictive than only looking at 591 mutations in the transcription factor itself. This hypothesis could be tested on patient predictive for response to immune checkpoint inhibitors. If the immune system is 595 already highly active, a clinically relevant response might be easier to achieve. These 596 hypotheses can be tested by obtaining RNA expression data from pre-treatment 597 biopsies of patients treated with immune checkpoint inhibitors, scoring these factors 598 based on these data and correlating that with their response to checkpoint inhibitors.

599
In the lung cancer data set we found two highly correlated factors. Without the 600 sparsity penalties, the algorithm should find orthogonal factors. Orthogonal factors 601 explain the most variance, but the sparsity penalties prevent full orthogonality. An 602 explanation of the highly correlating factors can be that the optimal sparsity penalty is 603 different per factor, where one biological process involves thousands of genes, and 604 another only a few.

605
In this study we used only the tumors which have all data types available. With 606 increasing tumors sizes and an increasing number of data types this results in a 607 situation where ever growing parts of the data can not be incorporated in an 608 integrative analysis. The EM framework quite naturally allows for missing data, so this 609 might be an interesting future extension.

610
In addition to the data types we used here, other molecular data types that could be 611 included are DNA methylation and miRNA expression data. As the mutation data 612 includes driver events it would be desirable to include it as well. This would require 613 transforming the mutation data such that it has approximately Gaussian error. One 614 way to do this is by smoothing mutations over an interaction network [37]. The 615 inclusion of higher level tumor phenotypes, such as features obtained from MR 616 imaging or pathology slides, could guide the method towards finding biological 617 processes that lead to clinically relevant differences. 618 We have shown FuncSFA is able to find biological processes that are active across 619 otherwise very different tumors, such as the ER+ and ER-subtypes in breast cancer.

620
Applying this method to a pan-cancer cohort might find biological processes that are 621 activated in a large number of tumor types and provide insight into how tumors of 622 different origin relate to each other.

623
In summary, we have shown that FuncSFA is able to integrate at least three data 624 types. It identifies continuous-valued factors that could be simultaneously active in the 625 same tumor removing the necessity to assign tumors to clusters. Our results illustrate 626 the advantage of factors over clusters with several biologically or clinically relevant 627 examples. The identified factors represent the heterogeneity both within and between 628 cancer types, and represent the activation of biological processes in a patient specific 629 manner. Considering the more complete and fine-grained characterization they allow, 630 these factors could benefit the personalized treatment of cancer.

632
The FuncSFA method consists of two steps. In the first step, a sparse-factor analysis 633 integrates multiple data types into a small number of factors. In the second step, the 634 factors are linked to existing knowledge of biology by doing a gene-set enrichment 635 analysis. 636

637
The data of a single data type i with N tumors and n i features is represented in a 638 n i × N data matrix X i . The data matrices of t data types are stacked together into a 639 n × N data matrix X = [X T 1 , . . . , X T t ] T . This data matrix is factorized into a n × k 640 coefficient matrix B and k × N factor matrix Z, where the number of factors k is much 641 smaller than the total number of features n.

642
The probabilistic model has been described previously by Shen and colleagues. [8]. 643 We briefly summarize this probabilistic model here. A tumor sample vector x with n The penalties λ 1,i and λ 2,i are kept the same for all features 651 in a data type.

653
To optimize the penalized complete data log-likelihood we use an iterative algorithm, 654 improving over the iCluster2 algorithm [11] at several key points. It consists of the 655 following steps.  where S is the soft-thresholding operator S(b, l 1 ) = sign(b)(|b| − l 1 ) + . In a single 667 iteration of the main algorithm the coefficients are updated once in sequential 668 order. 669 4. Residual variance is estimated per data type: for data type i. To find the optimal penalty weight parameters λ i,1 , λ i,2 for the L1 and L2 penalties 676 respectively, we reparameterize. calculating the BIC, is calculated using the method of Tibshirani and Taylor [38]. RNA expression data from RNA sequencing was preprocessed using 691 limma-voom [39]. We selected the 1000 genes with the highest median absolute 692 deviation. The mean variance-trend in RNA sequencing data was taken into account 693 by multiplying the RNA expression with the precision weights obtained from voom.

694
DNA copy number was sampled at regions with recurrent copy number aberrations.

695
Recurrently aberrated copy number regions were taken from the RUBIC paper [18] for 696 breast cancer or obtained by running RUBIC for lung cancer. To obtain the copy 697 number of region in a tumor, we took the median copy number of all segments in the 698 tumor sample overlapping that region. This results in a sample by region matrix with 699 copy number data. Protein expression data from RPPA was used directly as 700 processed by TCGA. To remove differences in variance between data types we 701 divided every data type by its standard deviation. 702

25/30
Additional methods 703 t-SNE [40] was used to summarize the factors in two non-linear dimensions, yielding a 704 high-level map the factors and other tumor variables can be projected on. These 705 t-SNE maps were calculated using scikit-learn [41].

706
The explained variance of a factor is calculated by subtracting the summed square excluding the factor i. The squared Frobenius norm || · || 2 F is the sum of squares of the 711 elements of a matrix.

712
Software and data availability. 713 The software for the sparse-factor analysis is available from