Molecular characterization of breast and lung tumors by integration of multiple data types with functional sparse-factor analysis

doi:10.1371/journal.pcbi.1006520

Fig 1.

Overview of FuncSFA.

A: Graphical representation of Functional Sparse-Factor Analysis (FuncSFA). The green circles represent the factors, and the red, blue and yellow circles at the bottom represent the observed variables, with the colors representing the data types and each circle representing an individual variable (i.e. the expression of a gene or protein, or the copy number of a gene). The black lines connecting the individual variables to the factors represent the regression coefficients. B: Graphical representation of the mathematical concepts of SFA with X representing the N × n data matrix, Z the N × k obtained factor matrix and B the k × n factor coefficients. C: Graphical representation of the computations of the factor expression coefficients. The coefficients represented by the k × n_m matrix C are obtained by regressing the N × n_m RNA expression matrix, X_m, on the N × k factor matrix Z. D: The gene-set enrichment analysis designed to assign biological processes or pathways to the obtained factors. E: Application of the factors to determine the activity of the factors (or associated biological processes) in a new tumor. (N: number of tumors; n: number of features; k: number of factors; n_m: number of mRNA features; Z: factor matrix; X: data matrix (concatenation of mRNA, copy number and Reverse Phase Protein Array (RPPA) data); B: Sparse factor coefficients; C: Factor regression coefficients; GSEA: Gene-set enrichment analysis).
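The regression in panel C can be sketched as an ordinary least-squares fit of X_m on Z. This is only an illustrative sketch: the function name, the simulated data, and the omission of an intercept (the data are assumed to be centered) are assumptions, not details taken from the paper.

```python
import numpy as np


def factor_expression_coefficients(Z, X_m):
    """Regress X_m (N x n_m) on Z (N x k) by least squares.

    Returns C with shape (k x n_m). No intercept is fitted, which
    assumes (hypothetically) that both matrices are centered.
    """
    C, *_ = np.linalg.lstsq(Z, X_m, rcond=None)
    return C


# Simulated example: N tumors, k factors, n_m mRNA features.
rng = np.random.default_rng(0)
N, k, n_m = 50, 3, 8
Z = rng.normal(size=(N, k))            # factor matrix
C_true = rng.normal(size=(k, n_m))     # coefficients used to simulate X_m
X_m = Z @ C_true + 0.01 * rng.normal(size=(N, n_m))

C = factor_expression_coefficients(Z, X_m)
```

Each row of C then holds one factor's coefficients across all mRNA features, which is the kind of per-factor gene ranking that a gene-set enrichment analysis (panel D) takes as input.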


Fig 2.

The strongest sparse-factor analysis coefficients for the breast cancer data set for each of the three data types and all ten factors.

The height of the bars shows the values of the coefficients. Non-significant coefficients (p > 0.05, significance test of the coefficient in an ordinary least-squares model) are marked with N.S. If a gene is strongly associated with a factor, we show all coefficients of that gene in the model. RNA expression coefficients are shown in blue. Protein expression coefficients are shown in orange. Any modifications of an epitope are noted in a short text description: pX = phosphorylated at residue X; clX = cleaved at residue X. DNA copy number coefficients are shown in red. Numbers refer to the recurrently aberrated loci in S5 Table. Recurrent gains are prefixed with a g, losses with an l. Also see S1 Table.


Fig 3.

Sparse-factor analysis on the TCGA breast cancer data set.

A: Explained variation per data type and factor. B: The top-left panel shows a t-SNE map of the tumors with the different colors showing PAM50 subtypes. The remaining panels show the tumors in the same positions as the PAM50 map, but colored according to the value of the represented factor in each tumor.


Fig 4.

Copy number and factors in the TCGA breast cancer dataset.

Normalized coefficients representing the contribution of DNA copy number aberrations to the factors. Specifically, the coefficients represent the contribution of recurrently gained (left) or lost (right) copy number regions identified by RUBIC to the factors represented in the rows. Recurrently aberrated copy number regions are annotated with chromosomal bands or putative driver genes in the region.


Fig 5.

Additional factors add detail over well-known subtypes of breast cancer.

A: Scatterplot of the EMT factor versus the sum of the RNA expression of COL11A1 and THBS2 (CPM: counts per million). B: Pearson correlation (ρ) between the factors and cell type fractions. Only significant correlations (p < 0.05, |ρ| > 0.2) are shown.


Fig 6.

The strongest SFA coefficients for the lung cancer data set for each of the three data types and all ten factors.

The height of the bars shows the values of the coefficients. Non-significant coefficients (p > 0.05, significance test of the coefficient in an ordinary least-squares model) are marked with N.S. If a gene is shown, we show all coefficients of that gene in the model. RNA expression coefficients are shown in blue. Protein expression coefficients are shown in orange. Any modifications of an epitope are noted in a short text description: pX = phosphorylated at residue X. DNA copy number coefficients are shown in red. Numbers refer to the recurrently aberrated loci in S5 Table. Recurrent gains are prefixed with a g, losses with an l. Also see S2 Table.


Fig 7.

Sparse-factor analysis on the TCGA lung cancer dataset.

A: Explained variance per data type and factor. B: B.1 shows the t-SNE map of all lung tumors with red denoting the Adenocarcinomas and blue the Squamous Cell Carcinomas. With the tumors in the same positions as in B.1, B.2 depicts the subtyping as proposed by Wilkerson and colleagues [21, 22]. The remaining panels show the tumors in the same positions as the first two maps, but colored according to the value of the represented factor in each tumor.


Fig 8.

Copy number and factors in the TCGA lung cancer dataset.

Normalized coefficients representing the contribution of DNA copy number aberrations to the factors. Specifically, the coefficients represent the contribution of recurrently gained (left) or lost (right) copy number regions identified by RUBIC to the factors represented in the rows. Recurrently aberrated copy number regions are annotated with chromosomal bands or putative driver genes in the region.


Fig 9.

Mutations and immune infiltration in lung cancer.

A: Mann-Whitney U statistic of the factor values between tumors with and without a mutation in a gene, divided by the product of the number of tumors in each group. Only significant (p < 0.05) values are shown. B: Pearson correlation (ρ) between the factors and cell type fractions. Only significant correlations (p < 0.05, |ρ| > 0.2) are shown. C: Mutations in genes of the NFE2L2 pathway and STK11 (black: tumor is mutated in this gene).
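The normalized statistic in panel A can be illustrated with a short sketch. Dividing U by the product of the group sizes gives a value between 0 and 1 that can be read as the probability that a randomly chosen mutated tumor has a higher factor value than a randomly chosen non-mutated tumor. The function name, the choice to count U for the mutated group, and the example values are assumptions made for illustration only.

```python
def normalized_mann_whitney_u(mutated, wild_type):
    """Mann-Whitney U for the `mutated` group divided by n1 * n2.

    U counts the pairs in which the mutated value exceeds the
    wild-type value; ties count as half a pair. Counting U for the
    mutated group (rather than the other group) is an assumption.
    """
    if not mutated or not wild_type:
        raise ValueError("both groups must contain at least one value")
    u = 0.0
    for a in mutated:
        for b in wild_type:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u / (len(mutated) * len(wild_type))


# Hypothetical factor values for two groups of tumors.
separated = normalized_mann_whitney_u([3.0, 4.0, 5.0], [0.0, 1.0, 2.0])
overlapping = normalized_mann_whitney_u([1.0, 2.0], [1.0, 2.0])
```

A value near 0.5 indicates no shift between the groups, while values near 0 or 1 indicate that the factor is consistently lower or higher in mutated tumors. The pairwise loop is quadratic in the group sizes; for large cohorts a rank-based implementation such as `scipy.stats.mannwhitneyu` would be the more practical choice.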
