Predicting Spatial and Temporal Gene Expression Using an Integrative Model of Transcription Factor Occupancy and Chromatin State
Figure 1
Generating a predictive model of spatio-temporal gene expression.
(a) A typical genomic locus within the Drosophila genome. Tracks show, from top to bottom: transcription factor (TF) binding (log2 ChIP-chip signal shown for one factor, blue); computed cis-regulatory modules (CRMs) from 15 developmental conditions (green), with a zoomed heat map giving a detailed view of TF binding at one CRM for all 5 TFs and 5 time-points, in which the intensity of blue represents the degree of ChIP enrichment (log2); insulator (INS) binding in red (ChIP signal shown for CP190, one of 6 factors, in dark red); histone H3K4me3 signal for a selected time-point (orange); and RefSeq gene models, drawn in black (inactive genes) or red (active genes) depending on the level of H3K4me3 signal. The boundaries of insulator occupancy place all CRMs in the vicinity of three genes, twi, CG30194 and l(2)06496, while the enriched H3K4me3 signal at the twi and l(2)06496 promoters indicates that these are the only actively expressed genes at these stages. The activity of only one enhancer within this locus is known (twi-PE). The spatio-temporal expression pattern of the twi gene, characterized by in-situ hybridization, is shown.
(b) A schematic representation of the iterative Bayesian modeling approach. The model consists of two major components joined through iteration of the EM algorithm: a Bayesian network that uses TF occupancy data (ChIP) and TF activity data (from transgenic reporter assays) to model CRM activity (an exemplary network topology resulting from one optimization run is shown in a separate panel), and a probabilistic model that uses insulator occupancy, promoter activity, CRM occupancy and estimates of CRM activity to model spatio-temporal gene expression. A separate panel shows all data used for an exemplary locus containing the Tinman and Bagpipe genes. This is an interesting case because the two genes are expressed at different times and in different sub-tissues originating from the mesoderm.
In essence, the model estimates the probability of a gene's activity as a function of all data between the two insulator elements (green ChIP signal in the inset panel). An expectation-maximization (EM) step is used to iteratively improve the Bayesian network (BN) topology, the CRM activity predictions, the maximum CRM-gene distance (dmax) and the gene expression predictions until a local maximum of the likelihood is reached.
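The idea that a gene's activity depends only on data between its flanking insulators, and only on CRMs within dmax, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the coordinates, the CRM activity probabilities, and in particular the noisy-OR rule used to combine CRM activities are all assumptions made for illustration. The EM updates of the BN topology and of dmax are not shown.

```python
from bisect import bisect_right


def insulator_block(pos, insulators):
    """Index of the insulator-bounded interval containing pos.

    `insulators` must be a sorted list of insulator coordinates.
    """
    return bisect_right(insulators, pos)


def candidate_crms(tss, crms, insulators, d_max):
    """CRMs in the same insulator-bounded block as the gene and within d_max.

    `crms` is a list of (position, activity_probability) pairs (hypothetical).
    """
    block = insulator_block(tss, insulators)
    return [
        (pos, p)
        for pos, p in crms
        if insulator_block(pos, insulators) == block and abs(pos - tss) <= d_max
    ]


def gene_activity_probability(tss, crms, insulators, d_max, promoter_active=True):
    """Probability that the gene is active under an assumed noisy-OR rule:
    the gene is active if its promoter is active and at least one
    candidate CRM is active."""
    if not promoter_active:
        return 0.0
    p_all_inactive = 1.0
    for _, p in candidate_crms(tss, crms, insulators, d_max):
        p_all_inactive *= 1.0 - p
    return 1.0 - p_all_inactive


# Toy locus (made-up coordinates and probabilities).
insulators = [1000, 9000]
crms = [(500, 0.9), (3000, 0.5), (6000, 0.2), (8500, 0.8), (9500, 0.7)]
tss = 5000

selected = candidate_crms(tss, crms, insulators, d_max=3000)
prob = gene_activity_probability(tss, crms, insulators, d_max=3000)
print(selected, round(prob, 3))
```

In this toy locus only the CRMs at 3000 and 6000 contribute: the others lie either beyond an insulator or farther than dmax from the gene. In an iterative scheme of the kind described above, dmax and the CRM activity probabilities would be re-estimated at each step rather than fixed.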