Automated Discovery of Functional Generality of Human Gene Expression Programs

doi:10.1371/journal.pcbi.0030148

Figure 1.

GeneProgram Algorithm Steps

The main steps of the algorithm are: data pre-processing (steps 1–2), model inference/learning (step 3), and model posterior distribution summarization (step 4). See the text for details.

Figure 2.

Conceptual Overview of the Data Generation Process for Gene Expression in Human Tissues

The GeneProgram probability model can be thought of as a series of “recipes” for constructing the gene expression of tissues, as depicted in this schematic example for a digestive tract. In the upper right, four expression programs (labeled A–D) are shown, consisting of sets of genes (e.g., GA1 represents gene 1 in program A). Cells (circles) throughout the digestive tract choose genes to be expressed probabilistically from the programs. The biological experimenter then collects mRNA by dissecting out the appropriate tissue sample, homogenizing it, lysing cells, and extracting nucleic acids.
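
The "recipe" view in this caption can be illustrated with a small sampling sketch. It is not the GeneProgram model itself: the program contents, the mixture weights, the tissue name `stomach`, and the function `sample_tissue` are all hypothetical values chosen for illustration. The sketch only shows the two-stage choice described above, in which a tissue first picks a program and then picks a gene from that program.

```python
import random
from collections import Counter

# Hypothetical programs A-D: gene name -> relative expression frequency.
PROGRAMS = {
    "A": {"GA1": 0.5, "GA2": 0.3, "GA3": 0.2},
    "B": {"GB1": 0.6, "GB2": 0.4},
    "C": {"GC1": 0.7, "GC2": 0.2, "GC3": 0.1},
    "D": {"GD1": 0.5, "GD2": 0.5},
}

def sample_tissue(program_weights, n_units, rng):
    """Draw n_units expression events for one tissue.

    Each event first chooses a program according to the tissue's
    mixture weights, then chooses a gene from that program.
    """
    names = list(program_weights)
    weights = [program_weights[name] for name in names]
    counts = Counter()
    for _ in range(n_units):
        prog = rng.choices(names, weights=weights)[0]
        genes = list(PROGRAMS[prog])
        gene_weights = [PROGRAMS[prog][g] for g in genes]
        counts[rng.choices(genes, weights=gene_weights)[0]] += 1
    return counts

rng = random.Random(0)
# Assumed example: a tissue that mostly uses program A and sometimes B.
stomach = sample_tissue({"A": 0.7, "B": 0.3}, n_units=1000, rng=rng)
```

The resulting counts play the role of the pooled mRNA that the experimenter measures after homogenizing the sample: the program labels are discarded and only per-gene totals remain.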

Figure 3.

Overview of the GeneProgram Probability Model, Which Is Based on Hierarchical Dirichlet Process Mixture Models

(A) The model consists of a three-level hierarchy of Dirichlet Processes. Each node describes a weighted mixture of expression programs (each colored bar represents a different program; heights of bars = mixture weights). The distributions at each level are constructed on the basis of the parent mixtures above. Tissue group and root level nodes maintain distributions over usage modifiers (shaded circles above bars, darker circles = more probable), which are variables that alter the manner in which each tissue uses an expression program. In this example, there are two possible values for usage modifier variables (+ or −), corresponding to gene induction or repression; more complex patterns such as temporal dynamics can be captured by using more values. Tissues are at the leaves of the hierarchy, and choose particular values for usage modifier variables. The observed gene expression in each tissue is characterized by a vector of discretized expression magnitudes (first row of small shaded squares below each tissue) and pattern types (second row of squares with +/− designations below each tissue).

(B) Example of a gene expression program, which represents a set of genes that are likely to behave coordinately in particular tissues that use the program. On the left is a simple program containing five genes (colored bars = expression frequencies). A tissue probabilistically chooses a set of genes from a program, and also a setting for its usage modifier variable. Note that usage is consistent across a program for a particular tissue, which facilitates biological interpretation.
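
The way each level's mixture is built from its parent can be sketched with a finite approximation. This is a simplification and an assumption: a true Dirichlet Process has an unbounded number of components, whereas the sketch fixes four programs and draws each child's weights from a Dirichlet distribution centred on the parent's weights. The concentration value, the usage-modifier probabilities, and the helper names `dirichlet` and `child_weights` are hypothetical.

```python
import random

def dirichlet(alphas, rng):
    """Sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def child_weights(parent, concentration, rng):
    """Sample a child mixture centred on the parent's weights.

    A larger concentration keeps the child closer to its parent.
    """
    return dirichlet([concentration * p for p in parent], rng)

rng = random.Random(1)
n_programs = 4  # assumption: fixed truncation instead of an unbounded DP

# Three levels, as in panel (A): root -> tissue group -> tissue.
root = dirichlet([1.0] * n_programs, rng)
group = child_weights(root, 20.0, rng)
tissue = child_weights(group, 20.0, rng)

# A tissue also chooses one usage modifier value per program it uses;
# the probabilities here are illustrative only.
usage = rng.choices(["+", "-"], weights=[0.8, 0.2])[0]
```

Because each level is sampled around its parent, tissues in the same group tend to share programs, which is the property the hierarchy is meant to capture.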

Figure 4.

Synthetic Data Experiments Demonstrated GeneProgram's and Other Algorithms' Abilities to Recover Gene Sets from Noisy Data

(A) Noisy synthetic data, containing 150 genes (rows) and 40 tissues (columns), with four gene sets and four tissue sample populations. Vertical numbers and colored bars designate gene sets; corresponding horizontal elements designate tissue sample populations. Each gene set contained 40 genes with varying mRNA levels; gene sets 3 and 4 overlapped in ten genes. See the text for complete details.

(B) Hierarchical clustering was applied to sort both the rows (genes) and the columns (tissues) of the data. Arrows from (A) indicate how the gene sets were reordered. Note that hierarchical clustering did not separate gene sets 3 and 4 correctly, and split gene set 2 horizontally.

(C) SVD factors did not clearly correspond to the distinct gene sets or tissue populations in the synthetic data; the first three factors are shown.

(D) An NMF implementation, which searches for the optimal number of factors, was applied to the data. In this case, three factors were optimal. The method performed fairly well in recovering gene sets 1 and 2, although there was some overlap between the sets. However, gene sets 3 and 4 were not recovered as separate sets.

(E) A simplified version of GeneProgram using a flat hierarchy (automatic tissue grouping disabled) accurately recovered gene sets 1 and 2, but failed to separate sets 3 and 4.

(F) The full GeneProgram implementation using automatic tissue grouping correctly recovered all four gene sets.
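
The dimensions given in panel (A) are enough to sketch a matrix of the same shape. The gene counts, set sizes, and ten-gene overlap come from the caption; everything else is an assumption made for illustration, including the equal population sizes of ten tissues, the one-to-one pairing of populations with gene sets, the uniform expression levels, and the Gaussian noise. The actual generation procedure is described in the text of the paper and may differ.

```python
import random

N_GENES, N_TISSUES = 150, 40

# Four sets of 40 genes; sets 3 and 4 share genes 110-119 (ten genes),
# so the union covers exactly 150 genes.
GENE_SETS = {
    1: range(0, 40),
    2: range(40, 80),
    3: range(80, 120),
    4: range(110, 150),
}

# Assumption: four equal populations of ten tissues, and population k
# expresses gene set k only.
TISSUE_POPS = {k: range(10 * (k - 1), 10 * k) for k in GENE_SETS}

def make_synthetic(noise_sd, rng):
    """Return a genes x tissues matrix (list of rows) with additive noise."""
    data = [[0.0] * N_TISSUES for _ in range(N_GENES)]
    for k, genes in GENE_SETS.items():
        for g in genes:
            level = rng.uniform(1.0, 5.0)  # varying mRNA level per gene
            for t in TISSUE_POPS[k]:
                data[g][t] += level
    for row in data:
        for t in range(N_TISSUES):
            # Clip at zero so expression values stay non-negative.
            row[t] = max(0.0, row[t] + rng.gauss(0.0, noise_sd))
    return data

data = make_synthetic(noise_sd=0.5, rng=random.Random(2))
```

The overlapping genes are what make the problem hard: a method that assigns each gene to exactly one cluster cannot represent genes 110 to 119 as members of both sets.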

Table 1.

GeneProgram Outperformed Two Popular Biclustering Algorithms, an NMF Implementation and Samba, in Recovering Biologically Interpretable Gene Sets from Two Compendia of Mammalian Gene Expression Experiments

Figure 5.

GeneProgram Outperformed Two Popular Biclustering Methods, an NMF Implementation and Samba, in Terms of Gene Set Consistency between Two Large Compendia of Mammalian Tissue Gene Expression Data

Because the two data compendia used different microarray platforms and sources for tissues, similarities in discovered gene sets between compendia were likely to be biologically relevant. For each algorithm, we used gene sets discovered from one data compendium to compute the significance of the overlap (p-values) with sets produced using the second compendium. We then inverted the analysis and averaged the results to produce the correspondence plots shown. The plots depict log p-values on the horizontal axis and the fraction of gene sets with p-values below a given value on the vertical axis (see the Methods section for details). The larger fraction of gene sets at most p-values suggests that GeneProgram generally produces the most consistent results between the data compendia.
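
The correspondence analysis described here can be sketched as follows. The caption does not name the statistical test, so the hypergeometric tail used below is an assumption, as is scoring each gene set by its best match in the other compendium; the Methods section gives the actual procedure. All function names are hypothetical.

```python
from math import comb

def overlap_pvalue(universe_size, set_a, set_b):
    """P(overlap >= observed) under a hypergeometric null.

    Assumes set_b is a random draw of len(set_b) genes from a universe
    that contains the len(set_a) genes of set_a.
    """
    k = len(set_a & set_b)
    n_a, n_b = len(set_a), len(set_b)
    total = comb(universe_size, n_b)
    tail = sum(
        comb(n_a, i) * comb(universe_size - n_a, n_b - i)
        for i in range(k, min(n_a, n_b) + 1)
    )
    return tail / total

def best_pvalues(sets_one, sets_two, universe_size):
    """For each set in sets_one, the smallest p-value against sets_two."""
    return [
        min(overlap_pvalue(universe_size, a, b) for b in sets_two)
        for a in sets_one
    ]

def fraction_below(pvalues, threshold):
    """Fraction of gene sets whose p-value is at or below the threshold."""
    return sum(p <= threshold for p in pvalues) / len(pvalues)

def correspondence(sets_one, sets_two, universe_size, threshold):
    """Run the analysis in both directions and average the fractions."""
    forward = fraction_below(best_pvalues(sets_one, sets_two, universe_size), threshold)
    reverse = fraction_below(best_pvalues(sets_two, sets_one, universe_size), threshold)
    return (forward + reverse) / 2

# Toy example with a ten-gene universe.
first = [{1, 2, 3}, {7, 8}]
second = [{1, 2, 3}, {4, 5, 6}]
score = correspondence(first, second, universe_size=10, threshold=0.01)
```

Evaluating `correspondence` over a range of thresholds gives one curve per algorithm, with the p-value threshold on the horizontal axis and the averaged fraction on the vertical axis.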

Figure 6.

GeneProgram Discovered Five Tissue Groups (A–E) and 104 Expression Programs (Numbered Columns, Only 76–104 Shown Here; 1–75 Shown in Figures S3–S5) in a Compendium of 62 Short Time-Series Gene Expression Datasets Measuring the Responses of Human Cells to Infectious Agents and Immune-Modulating Molecules

Each entry in the matrix represents the usage of an expression program (darker shading = higher usage), and matrix entries are colored and labeled with temporal patterns. The patterns are early induction (E+, red), middle induction (M+, yellow), late induction (L+, green), early repression (E−, cyan), middle repression (M−, blue), and late repression (L−, light purple). Generality scores are shown at the bottom of the figure; programs are sorted from lowest to highest scores (left to right).
