Clustering gene expression time series data using an infinite Gaussian process mixture model

doi:10.1371/journal.pcbi.1005896

Fig 1.

Clustering performance of state-of-the-art algorithms on simulated time series data.

Box plots show summaries of the empirical distribution of clustering performance for (A) DPGP, (B) hierarchical clustering, (C) fDPGP, (D) k-means clustering, (E) BHC, (F) Mclust, (G) GIMM, and (H) SplineCluster in terms of Adjusted Rand Index (ARI) across twenty instances of each of the 31 data set types (S1 Table). Higher values represent better recovery of the simulated clusters. Vertical dotted lines separate data sets with widely varied cluster size distributions (left) from data sets with widely varied generating hyperparameters (right). Observations that lie more than 1.5× the interquartile range below the first quartile or above the third quartile are shown as outliers.
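
For readers unfamiliar with the ARI, the following is a minimal sketch of the standard pair-counting formula, written with the Python standard library only. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions, not taken from the paper or its software.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat clusterings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        # Fewer than two items: the partitions trivially agree.
        return 1.0

    # Contingency counts: items per (true, predicted) pair and per cluster.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs placed together under each view.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2         # agreement if partitions match
    if max_index == expected:
        # Degenerate case, e.g. both partitions use a single cluster.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)
```

A relabeled but otherwise identical partition scores 1, while agreement no better than chance scores about 0; negative values are possible when agreement is worse than chance.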

Fig 2.

DPGP clusters in H. salinarum H₂O₂-exposed gene expression trajectories.

(A–L) For each cluster, standardized log₂ fold change in expression from pre-exposure levels is shown for each gene, along with the posterior cluster mean ±2 standard deviations. Control strain clusters are on the left and ΔrosR clusters on the right, arranged so that each ΔrosR cluster appears beside its corresponding control cluster. Note that control cluster 5 had no corresponding ΔrosR cluster; its transcripts instead distribute across a variety of ΔrosR clusters, none of which had a majority of cluster 5 transcripts. (M) Heatmap displays the proportion of DPGP samples from the Markov chain in which each gene (on the rows and columns) clusters with every other gene in the control strain. Rows and columns were clustered by Ward’s linkage. The predominant blocks of elevated co-clustering are labeled with the control cluster numbers to which the genes that compose the majority of the block belong. As indicated, cluster 6 is dispersed across multiple blocks, primarily the blocks for clusters 3 and 5. (N) Same as (M), except that values are the co-clustering proportions in the ΔrosR strain instead of the control strain. Rows and columns are ordered as in (M).
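
The co-clustering proportions in panels (M) and (N) can be illustrated with a short sketch. It assumes each retained Markov chain sample is available as a list of cluster labels, one per gene; the function name `coclustering_proportions` and this input layout are hypothetical and do not describe the DPGP software's actual output format.

```python
def coclustering_proportions(samples):
    """Fraction of samples in which each pair of genes shares a cluster.

    samples: list of label lists, one per retained MCMC sample; position i
    in every list refers to the same gene (assumed layout).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    n_genes = len(samples[0])
    counts = [[0] * n_genes for _ in range(n_genes)]
    for labels in samples:
        if len(labels) != n_genes:
            raise ValueError("every sample must label the same genes")
        for i in range(n_genes):
            for j in range(n_genes):
                # Label values may differ between samples; only equality
                # within a single sample matters.
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    n_samples = len(samples)
    return [[c / n_samples for c in row] for row in counts]
```

The result is a symmetric matrix with ones on the diagonal. Because it depends only on whether two genes share a label within a sample, it is unaffected by cluster labels switching between samples.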

Fig 3.

Clustered trajectories of differentially expressed transcripts in A549 cells in response to dex.

For each cluster in (A–M), standardized log₂ fold change in expression from pre-dex exposure levels is shown for each transcript, along with the posterior cluster mean ±2 standard deviations according to the cluster-specific GP.

Fig 4.

Differences in TF binding and histone modification occupancy in A549 cells in control conditions for the four largest DPGP clusters.

(A) Heatmap shows the elastic net logistic regression coefficients for the top twenty predictors (sorted by sum of absolute value across clusters) of cluster membership for the four largest clusters. Predictors were log₁₀ library size-normalized binned counts of ChIP-seq TF binding and histone modification occupancy in control conditions. Distance indicated in row names represents the bin of the predictor (e.g., <1 kb means within 1 kb of the TSS). An additional 23 predictors with smaller but non-zero coefficients are shown in S6 Fig. (B) Kernel density histogram smoothed with a Gaussian kernel and Scott’s bandwidth [63] of the TF binding and histone modification occupancy log₁₀ library size-normalized binned count matrix in control conditions transformed by the first principal component (PC1) for the two largest down-regulated DPGP clusters. (C) Same as (B), but with matrix transformed by PC2 and with the four largest DPGP clusters.
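
As a rough illustration of the smoothing in panels (B) and (C), the sketch below evaluates a one-dimensional Gaussian kernel density estimate. It assumes the form of Scott's rule h = σ·n^(−1/5), with σ the sample standard deviation, which is also the one-dimensional `scott` default of SciPy's `gaussian_kde`; the caption does not state the exact settings used, and the function names here are illustrative.

```python
import math
import statistics


def scott_bandwidth(values):
    """Scott's rule for 1-D data (assumed form): h = sigma * n**(-1/5)."""
    n = len(values)
    return statistics.stdev(values) * n ** (-1 / 5)


def gaussian_kde(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at each point in grid."""
    h = scott_bandwidth(values) if bandwidth is None else bandwidth
    norm = 1.0 / (len(values) * h * math.sqrt(2 * math.pi))
    return [
        # Sum one Gaussian bump per observation, centered on that observation.
        norm * sum(math.exp(-0.5 * ((x - v) / h) ** 2) for v in values)
        for x in grid
    ]
```

Here `values` would be the scores of one cluster's transcripts on a principal component, and `grid` the points at which the curve is drawn.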

Fig 5.

Differences in changes in transcription factor binding in A549 cells in response to glucocorticoid exposure for the four largest DPGP clusters.

(A) Heatmap shows all coefficients (sorted by sum of absolute value across clusters) for predictors with non-zero coefficients as estimated by elastic net logistic regression of cluster membership for the four largest DPGP clusters. Predictors on the y-axis represent log fold-change in normalized binned counts of TF binding from ethanol to dex conditions as assayed by ChIP-seq. Distance indicated in row names reflects the bin of the predictor (e.g., 1 kb = within 1 kb of the TSS). (B–D) Boxplots show the logFC in normalized binned counts across clusters and for the group of non-DE transcripts for (B) CREB1, (C) FOXA1, and (D) USF1.
