Figure 1.
(A) Representative module activity functions used by previous methods are compared to the logic functions considered in this study. Logic functions capture a wide range of differential activities that are not captured by any single function. Our method uses logic functions directly in the classification process and extends to classification scenarios with more than two classes. (B) Network-guided search for decision trees associated with network modules. Each decision tree maps to a connected subnetwork. (C) Decision tree and the corresponding logic function represented as a truth table. The decision tree assigns each sample to a class by performing a series of tests, where each test determines whether the expression of a selected gene is higher (>) or lower (<) than a threshold value. The gene is interpreted as being up-regulated if its expression is above the threshold; otherwise, the gene is interpreted as being down-regulated. Each path from root to leaf in the tree defines a single decision rule, which maps to a different row in the truth table. Decision trees are typically not grown to their full extent, so not every gene needs to be tested along each path if a subset of the genes is sufficient to determine the output.
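The correspondence between a decision tree and its truth table described in (C) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the gene names (`G1`, `G2`), thresholds, class labels, and all function names are hypothetical, and an expression value exactly equal to the threshold is assumed to count as down-regulated.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class Leaf:
    label: str  # class assigned to samples that reach this leaf


@dataclass
class Node:
    gene: str         # gene tested at this node
    threshold: float  # expression threshold for the test
    low: "Tree"       # branch taken when expression <= threshold (down-regulated)
    high: "Tree"      # branch taken when expression > threshold (up-regulated)


Tree = Union[Leaf, Node]


def classify(tree: Tree, sample: Dict[str, float]) -> str:
    """Follow one root-to-leaf path determined by the sample's expression values."""
    while isinstance(tree, Node):
        tree = tree.high if sample[tree.gene] > tree.threshold else tree.low
    return tree.label


def decision_rules(tree: Tree, path: Tuple = ()) -> List[Tuple[Dict[str, str], str]]:
    """Enumerate every root-to-leaf path as (gene states, class label)."""
    if isinstance(tree, Leaf):
        return [(dict(path), tree.label)]
    return (decision_rules(tree.low, path + ((tree.gene, "down"),))
            + decision_rules(tree.high, path + ((tree.gene, "up"),)))


def truth_table(tree: Tree, genes: List[str]) -> List[Tuple[str, ...]]:
    """One row per decision rule; '-' marks genes not tested along that path."""
    return [tuple(states.get(g, "-") for g in genes) + (label,)
            for states, label in decision_rules(tree)]


# Hypothetical two-gene module: G2 is only tested when G1 is up-regulated.
example = Node("G1", 0.5,
               low=Leaf("A"),
               high=Node("G2", 1.0, low=Leaf("B"), high=Leaf("A")))

for row in truth_table(example, ["G1", "G2"]):
    print(row)
```

In this toy tree the first rule leaves `G2` untested, which corresponds to a tree that is not grown to its full extent: the state of `G1` alone determines the output along that path.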
Figure 2.
Network decision modules underlying embryonic origin, breast cancer metastasis and mesenchymal transformation of brain tumors.
Expression profiles for each of the three case studies are combined with a network of protein-protein interactions among human transcription factors. Network-guided forests are used to identify the network modules that are most important for correct sample classification (representative modules are shown for each study). Grey edges indicate physical protein-protein interactions; blue edges indicate protein combinations that often co-occur in the same decision trees and are most important for classification (as indicated by the permutation test). Node color indicates protein importance, whereas edge width indicates the importance of a protein combination. Each module is assigned a decision tree that specifies the output of the module based on the activity of its genes (see also Figure S1).
Figure 3.
Network modules capture causal developmental factors and are reproducible.
(A) Consensus network modules underlying tissue origin (modules of size greater than 2 are encircled). Gene pairs that often co-occur in the same decision trees and are most important for classification are shown in blue. Node color indicates protein importance, whereas edge width indicates the importance of a protein combination. (B) Enrichment for developmentally related phenotype categories in the MGI database (FDR is reported above each bar). (C) Enrichment of germ-layer-specific genes identified by NGF based on the Gene Ontology (FDR is reported above each bar). (D) Percentage of genes, interactions, and modules that were reproduced based on an independent dataset. (E) Percentage of reproduced single genes and gene combinations (Fisher's Exact Test P-values are reported). NGF* indicates the result for NGF applied to networks with perturbed expression measurements.
Table 1.
Network modules corresponding to known regulatory complexes in development.
Figure 4.
Classification performance and validation of markers of breast cancer metastasis.
(A) Average area under the ROC curve for NGF, RF, NGF applied to permuted networks (NGF**), and Naïve Bayes, compared to reported scores for representative previous methods (error bars denote standard deviation estimated over 100 runs). (B) General cancer and breast cancer associated genes identified among the 100 top-scoring genes or 100 most abundant genes in the forest created using RF or NGF, using the real network or networks with permuted edges (the average over 100 permutations is shown). (C) Genes ranked by their importance for classification in two independent breast cancer patient cohorts (y vs. x axis). Network-Guided Forest, blue points; regular Random Forest, green points.
Figure 5.
Network functions underlying cancer progression.
(A) The decision trees for mesenchymal transformation are dissected by assigning their gene pairs to one of three functional categories based on the sign of gene expression in predicting the more aggressive phenotype. The percentage of gene pairs assigned to each of the three functional categories is shown as a function of the score threshold used for selecting gene pairs. Accuracy is calculated as the average Laplace score (Text S1) over all trees in the forest. (B) Enrichment for interactions between oncogenes, between tumor suppressors, and between an oncogene and a tumor suppressor among the functional categories identified using NGF. The percentage of such interactions among top-scoring pairs in each functional category is reported along with the Fisher's Exact Test P-value of enrichment.
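The enrichment P-values in (B) come from Fisher's Exact Test. As a rough sketch of how such a value is obtained from a 2x2 contingency table, the function below computes a one-sided (enrichment) P-value from the hypergeometric distribution. Whether the reported values are one-sided or two-sided is not stated here, so the one-sided form is an assumption, and the function name and the counts in the example are hypothetical, not data from the study.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's Exact Test for the 2x2 table [[a, b], [c, d]].

    Illustrative layout (an assumption):
      a = top-scoring pairs in the category that are interactions of interest
      b = top-scoring pairs in the category that are not
      c = remaining pairs that are interactions of interest
      d = remaining pairs that are not
    Returns P(X >= a) under the hypergeometric null with fixed margins.
    """
    row1 = a + b              # number of top-scoring pairs
    col1 = a + c              # number of interactions of interest overall
    total = a + b + c + d
    upper = min(row1, col1)   # largest count compatible with the margins
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(total, row1)


# Made-up counts, for illustration only.
p_value = fisher_exact_greater(3, 1, 1, 3)
print(f"one-sided P = {p_value:.4f}")
```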