Figure 1.
Generating a predictive model of spatio-temporal gene expression.
(a) A typical genomic locus within the Drosophila genome. Depicted tracks represent, from top to bottom: transcription factor (TF) binding (log2 ChIP-chip signal shown for one factor, blue) and computed cis-regulatory modules (CRMs) from 15 developmental conditions (green). A zoomed heat map shows a detailed view of TF binding for one CRM for all 5 TFs and 5 time-points; the intensity of blue represents the degree of ChIP enrichment (log2). Insulator (INS) binding is shown in red (ChIP signal shown for CP190, one of 6 factors, in dark red), histone H3K4me3 for a selected time-point is shown in orange, and gene models from RefSeq are indicated in black (inactive genes) or red (active genes) depending on the level of H3K4me3 signal. The boundaries of insulator occupancy place all CRMs in the vicinity of three genes, twi, CG30194 and l(2)06496, while the enriched H3K4me3 signal at the twi and l(2)06496 promoters indicates that they are the only actively expressed genes at these stages. The activity of only one enhancer is known within this locus (twi-PE). The spatio-temporal expression pattern of the twi gene, characterized by in-situ hybridization, is shown. (b) A schematic representation of the iterative Bayesian modeling approach. The model consists of two major components joined through iteration of the EM algorithm: a Bayesian network (BN) that uses TF occupancy data (ChIP) and TF activity data (from transgenic reporter assays) to model CRM activity (an exemplary network topology resulting from an optimization run is shown in a separate panel), and a probabilistic model that uses insulator occupancy, promoter activity, CRM occupancy and estimates of CRM activity to model spatio-temporal gene expression. A separate panel shows all data used for an exemplary locus containing the Tinman and Bagpipe genes, an interesting case because the two genes are expressed at different times and in different sub-tissues originating from the mesoderm.
In essence, the model estimates the probability of a gene's activity as a function of all data between the two insulator elements (green ChIP signal in the inset panel). An expectation-maximization (EM) step is used to iteratively improve the BN topology, the CRM activity predictions, the maximum CRM-gene distance (dmax) and the gene expression predictions until a local maximum of the likelihood is reached.
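As a rough illustration of how data between two insulator elements and a maximum CRM-gene distance could be combined into a single gene-level probability, the sketch below uses a noisy-OR rule. The noisy-OR combination, the `CRM` class, the function name and all numbers are assumptions made for illustration only; the caption does not specify the functional form used by the model, and no EM iteration is shown.

```python
from dataclasses import dataclass


@dataclass
class CRM:
    position: int    # hypothetical genomic coordinate of the CRM midpoint (bp)
    p_active: float  # hypothetical estimate of CRM activity in one class


def gene_activity_probability(tss, crms, insulators, d_max, p_promoter):
    """Toy noisy-OR estimate of gene activity (an assumed form, not the paper's).

    Only CRMs that lie between the insulators flanking the TSS and within
    d_max of the TSS contribute to the estimate.
    """
    # Nearest insulator on each side of the transcription start site.
    left = max((x for x in insulators if x <= tss), default=float("-inf"))
    right = min((x for x in insulators if x > tss), default=float("inf"))

    p_all_silent = 1.0
    for crm in crms:
        in_domain = left < crm.position < right
        in_range = abs(crm.position - tss) <= d_max
        if in_domain and in_range:
            p_all_silent *= 1.0 - crm.p_active

    # The gene is called active if its promoter is active and at least
    # one eligible CRM is active.
    return p_promoter * (1.0 - p_all_silent)


toy_crms = [
    CRM(position=5_000, p_active=0.5),   # inside domain, within d_max
    CRM(position=12_000, p_active=0.5),  # inside domain, within d_max
    CRM(position=28_000, p_active=0.9),  # inside domain, beyond d_max
    CRM(position=40_000, p_active=0.9),  # outside the insulator boundary
]
p_gene = gene_activity_probability(
    tss=10_000, crms=toy_crms, insulators=[2_000, 30_000],
    d_max=15_000, p_promoter=1.0,
)
print(p_gene)
```

In this toy setting only the first two CRMs contribute, so changing `d_max` or the insulator positions changes which CRMs are assigned to the gene.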
Figure 2.
An iterative Bayesian model can accurately predict gene expression.
(a) The learned Bayesian network topology reveals regulatory relationships between transcription factors (TFs) and specific tissues. Each node in the network represents TF occupancy data (TF-f and time-T) or a specific activity class (tissue or time-period). The edges represent the probability of a CRM being active as a function of a particular binding event, with darker blue lines indicating higher probability. Predicted activity in the Meso class is dependent on Twist (Twi) binding to a CRM at 2–4 hr, while VM activity depends on Biniou (Bin) occupancy at two time-points. Meso = unspecified mesoderm, VM = visceral muscle, SM = somatic muscle. (b) Histogram showing the average enrichment of correct predictions within the top 2% of genes with the highest posterior probability across all 10 activity classes, where a 15-fold enrichment is obtained using the iteratively trained model including all datasets. This enrichment steadily decreases as one or more datasets are removed, going from a 9-fold enrichment when omitting insulator binding and H3K4me3 activity data (TF+EM), to an ∼6-fold enrichment when TF binding is used with either insulator or H3K4me3 data without the iterative EM procedure, to an ∼3-fold enrichment when TF binding data or histone marks alone are used. (c) Validation of the cross-validated model using in-situ hybridization data for 600 genes not included in the training set. The average area under the curve (AUC) across all 10 classes is 0.82 for the training data and 0.78 for the new data.
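The fold enrichment reported in panel (b) can be read as the fraction of correct predictions among the top-ranked genes divided by the fraction of expressed genes overall. The following minimal sketch computes that ratio under this reading; the function name and the toy scores and labels are hypothetical, and the exact procedure used for the figure may differ.

```python
def fold_enrichment(scores, labels, top_fraction=0.02):
    """Fold enrichment of positives among the top-scoring fraction of genes.

    scores: posterior probabilities, one per gene
    labels: 1 if the gene is expressed in the activity class, else 0
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")

    n_top = max(1, round(top_fraction * len(scores)))

    # Rank genes by posterior probability, highest first.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    top_rate = sum(labels[i] for i in order[:n_top]) / n_top
    base_rate = sum(labels) / len(labels)
    if base_rate == 0:
        raise ValueError("no positive genes, enrichment is undefined")
    return top_rate / base_rate


# Toy data: 100 genes with descending scores, 10 of which are expressed.
toy_scores = [1.0 - i / 100 for i in range(100)]
expressed = {0, 1, 5, 9, 20, 33, 47, 60, 75, 90}
toy_labels = [1 if i in expressed else 0 for i in range(100)]

enrichment = fold_enrichment(toy_scores, toy_labels)
print(enrichment)
```

Here the top 2% corresponds to two genes, both expressed, against a background rate of 10%.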
Figure 3.
Validating spatio-temporal expression predictions in the visceral muscle.
(a) Receiver operating characteristic (ROC) curves for the activity class visceral muscle (VM). The area under the curve (AUC) is 0.87 for the full iterative model using all data (TF+ALL) and becomes progressively lower for simpler models that either do not include chromatin data (TF+EM) or do not include the EM step (TF+His, TF+Ins). (b) Enrichment of correct predictions in the top 2% of genes for different models and validation data. Blue bars show the performance of different models using the training data for the visceral muscle activity class (VM). Red bars show the analogous enrichment for the in-situ validated examples as well as for the top 100 predictions of genes expressed in VM, which were manually annotated based on the literature. (c) Embryo images showing double fluorescent in-situ hybridization against the gene with predicted expression (red) and a specific marker for VM (green, biniou), where overlapping gene expression in VM is shown in the merge panel. The white arrow points to the VM. All embryos are oriented with anterior to the left and dorsal up. In-situ data for all 22 genes tested are shown in Fig. S9.
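For readers less familiar with the AUC values quoted in panel (a), the area under a ROC curve equals the probability that a randomly chosen positive gene is scored above a randomly chosen negative gene, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the toy data are hypothetical and are not taken from the study.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy predictions for six genes in one activity class.
toy_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]

auc = roc_auc(toy_scores, toy_labels)
print(auc)
```

The pairwise form is quadratic in the number of genes, which is fine for a small illustration; a rank-based implementation would be preferable for genome-scale data.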