Figure 1.
The ICA Model of Gene Expression
Schematic depiction of the ICA model for gene expression.
(A) Measured gene expression variations are caused by alterations in the activation levels of biological pathways. In the ICA model, the gene expression matrix is decomposed into the product of a “source” matrix S and a “mixing” matrix A, where K is the number of inferred independent components (ICs) to which pathways and regulatory modules map. The columns of S describe the activation levels of genes in the various inferred independent components, while the rows of A give the activation levels of the independent components across tumour samples. The product of S and A can be written as a sum over the K IC submatrices IC-1, IC-2, ..., IC-K.
(B) The IC-k submatrix is obtained by multiplying the k-th column of S, $S_k$, with the k-th row of A, $A_k$. The genes with the largest absolute weights in $S_k$ are selected and tested for enrichment of biological pathways, while the distribution of weights in $A_k$ is tested for its power to discriminate between phenotypes. (Colour codes for heatmaps: red, overexpression; green, underexpression; blue, upregulation; yellow, downregulation.)
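The decomposition described in (A) and (B) can be checked numerically. The following is a minimal sketch, assuming NumPy and small random matrices in place of real expression data. The toy dimensions, the variable names (`n_genes`, `n_samples`, `top_genes`), and the choice of the five largest weights are illustrative assumptions, not values from the study. It only verifies that the product of S and A equals the sum of the K IC submatrices, and shows one way the genes with the largest absolute weights in a column of S might be picked out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples, K = 50, 8, 3   # toy sizes, chosen for illustration only

S = rng.normal(size=(n_genes, K))     # "source" matrix: genes x components
A = rng.normal(size=(K, n_samples))   # "mixing" matrix: components x samples

# IC-k submatrix: outer product of the k-th column of S and the k-th row of A
ic_submatrices = [np.outer(S[:, k], A[k, :]) for k in range(K)]

# The product S A equals the sum of the K IC submatrices
X_model = S @ A
assert np.allclose(X_model, sum(ic_submatrices))

# Genes with the largest absolute weights in component k
# (the cut-off of 5 genes is an arbitrary assumption for this sketch)
k = 0
top_genes = np.argsort(-np.abs(S[:, k]))[:5]
```

In a real analysis S and A would come from an ICA algorithm applied to the measured expression matrix; the random matrices here serve only to illustrate the algebra.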
Table 1.
Breast Cancer Cohorts
Figure 2.
(A) For each cohort and method, we give the pathway enrichment index, PEI, defined as the fraction of biological pathways (536 in total) found enriched in at least one component.
(B) For each cohort and method, we give the fraction of cancer-signalling and oncogenic pathways (14 in total) successfully mapped by the inferred components.
(C) For each cohort and method, we give the fraction of motif-regulatory gene sets (173 in total) captured by the inferred components.
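As a small illustration of how such fractions are computed, the sketch below assumes that each inferred component has already been tested and is represented by the set of gene sets found enriched in it. The function name `enrichment_fraction` and the toy identifiers (`P1`, `P2`, `P7`) are hypothetical. Only the definition itself (gene sets enriched in at least one component, divided by the total number tested) is taken from the captions above.

```python
def enrichment_fraction(enriched_per_component, n_total):
    """Fraction of gene sets enriched in at least one component."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    # Union over components: a gene set counts once, however many
    # components it is enriched in.
    mapped = set().union(*enriched_per_component)
    return len(mapped) / n_total


# Toy input: four components, three distinct pathways mapped in total
components = [{"P1", "P2"}, {"P2"}, set(), {"P7"}]
pei = enrichment_fraction(components, n_total=536)
```

The same function would apply to panels (B) and (C) by passing 14 or 173 as `n_total`.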
Figure 3.
Most Consistent and Frequently Mapped Pathways and Regulatory Motifs
(A) For each method, we compare the number of pathways that were consistently mapped to components across the four major breast cancer studies.
(B) Twenty of the pathways most frequently mapped by ICA. The scores give the average number of ICA components in which the pathway was mapped.
(C) For each method, we give the number of motif-regulatory gene sets consistently mapped to components across the four major breast cancer cohorts.
(D) The 20 transcription factors/regulatory motifs most frequently mapped by ICA. The scores give the average number of ICA components in which the regulatory module of the motif was mapped.
Figure 4.
Heatmaps of Association of Pathways and Regulatory Modules with Breast Cancer Phenotypes
For three phenotypes (ER, Grade, Outcome), we show heatmaps of association between phenotypes and selected pathways (A) and selected regulatory motifs (B), as revealed by the four ICA algorithms across the four major breast cancer cohorts. For phenotypes, we used a p-value threshold of 0.05 to establish whether an ICA component was associated with that phenotype. For pathways and regulatory modules, we used the Benjamini-corrected p-values as before. For each cohort, we then counted the number of ICA algorithms that found a component linking a phenotype with a pathway/regulatory module; this count is colour-coded as 4 (dark red), 3 (red), 2 or 1 (pink), and 0 (white). For Wang's cohort, grade information was unavailable; the corresponding entries are colour-coded grey.
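The counting and colour-coding step can be sketched as follows. This is a minimal illustration under stated assumptions: the input layout (a dictionary from algorithm name to a list of components), the field names `phenotype_p` and `pathway_enriched`, and the strict inequality at the 0.05 threshold are all hypothetical choices. The enrichment flag is assumed to have been derived beforehand from the corrected p-values.

```python
def count_linking_algorithms(results, p_threshold=0.05):
    """Number of algorithms with at least one component that is both
    associated with the phenotype and enriched for the pathway."""
    count = 0
    for components in results.values():
        if any(c["phenotype_p"] < p_threshold and c["pathway_enriched"]
               for c in components):
            count += 1
    return count


def colour_for_count(count):
    """Colour code from the caption; None marks unavailable phenotype data."""
    if count is None:
        return "grey"
    palette = {4: "dark red", 3: "red", 2: "pink", 1: "pink", 0: "white"}
    return palette[count]


# Toy results for one cohort, one phenotype, one pathway (hypothetical values)
results = {
    "algoA": [{"phenotype_p": 0.01, "pathway_enriched": True}],
    "algoB": [{"phenotype_p": 0.20, "pathway_enriched": True}],
    "algoC": [{"phenotype_p": 0.03, "pathway_enriched": False},
              {"phenotype_p": 0.04, "pathway_enriched": True}],
    "algoD": [],
}
n_linking = count_linking_algorithms(results)
cell_colour = colour_for_count(n_linking)
```

Here two of the four toy algorithms yield a linking component, so the cell would be drawn in pink.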
Figure 5.
The Association of Immune Response with Estrogen Receptor Status
(A) For each major breast cancer cohort, we give the heatmap of component expression values for the component enriched for the immune-response pathway characterised in [39]. Thus, the heatmap matrix shown is $S_{gk}A_{ks}$ where k is the component enriched for the immune response pathway, g is any gene found on the array that is also in the pathway and the selected feature set of the component, and s denotes the tumour sample. Samples have been ordered according to a k-means (k = 2) clustering over the set of genes. The ICA algorithm for which this heatmap is shown is the KernelICA algorithm. Blue denotes “upregulation,” yellow “downregulation.” For the samples, black denotes an ER− and grey an ER+ tumour.
(B) For each major breast cancer cohort, we give the heatmap of expression values for the same set of genes as in (A). Thus, the heatmap matrix shown is $X_{gs}$ where $X_{gs}$ denotes the measured expression level of gene g in sample s. As before, samples have been ordered according to a k-means (k = 2) clustering over the represented genes. Red denotes relative overexpression, green underexpression. Magenta denotes the upregulated cluster, cyan the downregulated cluster.
Table 2.
Association of Immune Response with Estrogen Receptor Status
Figure 6.
The Association of Epithelial–Mesenchymal Transition with Histological Grade
(A) For each major breast cancer cohort where grade information was available, we give the heatmap of component expression values for the component enriched for the EMT pathway characterised in [41]. Thus, the heatmap matrix shown is $S_{gk}A_{ks}$ where k is the component enriched for the EMT pathway, g is any gene found on the array that is also in the pathway and the selected feature set of the component, and s denotes the tumour sample. The ICA algorithm for which this heatmap is shown is the KernelICA algorithm. Samples have been ordered according to a k-means (k = 2) clustering over the set of genes. Blue denotes “upregulation,” yellow “downregulation.” For the samples, histological grade is colour-coded as black (high-grade), blue (intermediate grade), and skyblue (low-grade).
(B) For each major breast cancer cohort, we give the heatmap of expression values for the same set of genes as in (A). Thus, the heatmap matrix shown is $X_{gs}$ where $X_{gs}$ denotes the measured expression level of gene g in sample s. As before, samples have been ordered according to a k-means (k = 2) clustering over the represented genes. Red denotes relative overexpression, green underexpression. Magenta denotes the upregulated cluster, cyan the downregulated cluster.
Table 3.
Association of Epithelial–Mesenchymal Transition with Grade
Figure 7.
Inter-Method Comparison of Selected Associations of Pathways and Regulatory Modules with Breast Cancer Phenotypes
The ability of the various methods to capture novel biological associations between pathways/regulatory modules and phenotypes is represented as a binary heatmap across methods and cohorts. (A) Immune response pathway and ER status, (B) EMT pathway and grade, (C) IRF and ER status, (D) Neurofibromin-1 and clinical outcome. Black denotes a statistically significant association between a pathway/regulatory module and the phenotype in question; white denotes no evidence of an association.
Figure 8.
Average association networks shown for ER status (A) and clinical outcome (B). Only edges between phenotypes, pathways, and transcription factors are shown (for the sake of clarity, edges between any two pathways, transcription factors, or phenotypes are not shown). An edge between two nodes was defined if the association between the two nodes was present in at least three out of the four studies, as predicted by the KernelICA algorithm. The diagrams are colour-coded as follows: phenotype (red), pathways (green), and transcription factors/binding motifs (blue).
INFLR, inflammatory response; TM, tyrosine metabolism.
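The edge rule used for these networks (an association kept only if it is present in at least three of the four studies) can be sketched as below. This is a minimal illustration: the function name `consensus_edges`, the representation of an edge as a (phenotype, feature) tuple, and the toy labels `pathway_X`, `motif_Y`, and `motif_Z` are hypothetical and do not correspond to results from the study.

```python
from collections import Counter


def consensus_edges(edges_per_study, min_studies=3):
    """Edges found in at least `min_studies` of the given studies."""
    counts = Counter()
    for edges in edges_per_study:
        counts.update(set(edges))  # each study contributes an edge at most once
    return {edge for edge, n in counts.items() if n >= min_studies}


# Toy edge sets for four studies (hypothetical labels)
studies = [
    {("ER", "pathway_X"), ("ER", "motif_Y")},
    {("ER", "pathway_X")},
    {("ER", "pathway_X"), ("ER", "motif_Y")},
    {("ER", "motif_Z")},
]
network = consensus_edges(studies)
```

In this toy input only the edge between `ER` and `pathway_X` appears in three studies, so it is the only edge retained.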