Predicting Spatial and Temporal Gene Expression Using an Integrative Model of Transcription Factor Occupancy and Chromatin State
Figure 2
An iterative Bayesian model can accurately predict gene expression.
(a) The learned Bayesian network topology reveals regulatory relationships between transcription factors (TFs) and specific tissues. Each node in the network represents TF occupancy data (TF-f and time-T) or a specific activity class (tissue or time-period). The edges represent the probability of a CRM being active as a function of a particular binding event, with darker blue lines indicating higher probability. Predicted activity in the Meso class depends on Twist (Twi) binding to a CRM at 2–4 hr, while VM activity depends on Biniou (Bin) occupancy at two time-points. Meso = unspecified mesoderm, VM = visceral muscle, SM = somatic muscle. (b) Histogram showing the average enrichment of correct predictions within the top 2% of genes with the highest posterior probability across all 10 activity classes. A 15-fold enrichment is obtained with the iteratively trained model that includes all datasets. This enrichment steadily decreases as one or more datasets are removed: from a 9-fold enrichment when insulator binding and H3K4me3 activity data are omitted (TF+EM), to an ∼6-fold enrichment when TF binding is used with either insulator or H3K4me3 data without the iterative EM procedure, to an ∼3-fold enrichment when TF binding data or histone marks alone are used. (c) Validation of the cross-validated model using in situ hybridization data for 600 genes not included in the training set. The average area under the curve (AUC) across all 10 classes drops from 0.82 (training) to 0.78 (new data).
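The two summary statistics reported in panels (b) and (c) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `fold_enrichment` and `auc`, the choice to measure enrichment against the overall fraction of active genes, the rounding of the top-2% cutoff, and the handling of tied scores are all assumptions made for illustration.

```python
from typing import Sequence


def fold_enrichment(posteriors: Sequence[float],
                    labels: Sequence[int],
                    top_fraction: float = 0.02) -> float:
    """Fold enrichment of true positives among the top-scoring genes.

    Assumption: enrichment is the fraction of active genes (label 1) in the
    top `top_fraction` of genes ranked by posterior probability, divided by
    the fraction of active genes in the whole set.
    """
    n = len(posteriors)
    if n == 0 or len(labels) != n:
        raise ValueError("posteriors and labels must be non-empty and equal length")
    # Number of genes in the top slice (at least one gene).
    k = max(1, int(round(n * top_fraction)))
    # Gene indices sorted by decreasing posterior probability.
    order = sorted(range(n), key=lambda i: posteriors[i], reverse=True)
    base_rate = sum(labels) / n
    if base_rate == 0:
        raise ValueError("no active genes in this class")
    top_rate = sum(labels[i] for i in order[:k]) / k
    return top_rate / base_rate


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    Equals the probability that a randomly chosen active gene scores higher
    than a randomly chosen inactive gene; ties count as one half (assumption).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one active and one inactive gene")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_over_classes(values: Sequence[float]) -> float:
    """Average a per-class statistic, e.g. over the 10 activity classes."""
    return sum(values) / len(values)
```

Under these assumptions, each statistic would be computed once per activity class and then averaged with `mean_over_classes` to give a single figure comparable to the values quoted in the caption.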