Identification of gene specific cis-regulatory elements during differentiation of mouse embryonic stem cells: An integrative approach using high-throughput datasets

doi:10.1371/journal.pcbi.1007337

Table 1.

Datasets and cell types used in this study.

Fig 1.

Gene-specific predictive models.

(A) Schematic representation of the methodology used to develop gene-specific predictive models. 1. DNaseI-seq and H3K27ac data are integrated to quantify the chromatin activity profile (CAP) of candidate cis-regulatory elements (CREs). TF ChIP-seq data are used to generate the transcription factor binding profile (TFBP), which quantifies the community effect of candidate CREs mapped to a specific gene. 2. Gene-wise expression values are obtained as RPKMs to form gene expression profiles (GEPs). 3. CAPs, TFBPs and GEPs are generated for all the regions and genes in the analysis. 4. CAPs and TFBPs are integrated to generate gene-specific CRE networks. Greedy community detection is performed to identify communities of CREs (coCREs) in the networks. A new set of CAPs, comprising the aggregate CAPs of the coCREs and the individual CAPs of singleton CREs, is used to predict the GEP of a specific gene. (B) Histogram showing the distribution of candidate CREs per gene within 100 kb of the transcription start site, over all genes in the study. (C) Change in cross-validated mean squared error (MSE) as a function of increasing λ for a predictive model of Runx1 gene expression. The two vertical dotted lines mark the two cutoffs, λ_min and λ_1se. The total number of CREs with non-zero coefficients (β) at a given λ is shown above the plot.
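The two cutoffs in panel C can be illustrated with a minimal sketch. It assumes the common convention in which λ_min is the penalty with the lowest mean cross-validated MSE and λ_1se is the largest penalty whose mean MSE lies within one standard error of that minimum; the caption does not spell out the definition, and the function name and numbers below are hypothetical.

```python
def select_lambdas(lambdas, cv_mean, cv_se):
    """Return (lambda_min, lambda_1se) from cross-validation summaries.

    Assumed convention (not stated in the caption): lambda_1se is the
    largest lambda whose mean MSE is within one SE of the minimum.
    """
    # lambda_min: penalty with the lowest mean cross-validated MSE.
    i_min = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # Upper bound on acceptable MSE: minimum plus its standard error.
    threshold = cv_mean[i_min] + cv_se[i_min]
    # lambda_1se: strongest penalty that still meets the bound,
    # which usually keeps fewer CREs with non-zero coefficients.
    lambda_1se = max(l for l, m in zip(lambdas, cv_mean) if m <= threshold)
    return lambdas[i_min], lambda_1se


# Hypothetical cross-validation summary, for illustration only.
lambdas = [0.01, 0.05, 0.10, 0.50, 1.00]
cv_mean = [0.40, 0.30, 0.32, 0.45, 0.90]
cv_se = [0.05, 0.04, 0.04, 0.06, 0.10]

lam_min, lam_1se = select_lambdas(lambdas, cv_mean, cv_se)
```

Choosing λ_1se over λ_min trades a small increase in error for a sparser model, which is one reason both cutoffs are typically shown.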

Fig 2.

Predictive model for an example gene, Runx1.

(A) A network representation of the model, in which the gene for which the model is built (here Runx1; red octagon), the chosen CREs (blue hexagons) and the TFs bound to the chosen CREs are represented as nodes. Black arrows indicate regulation of the gene by the CRE/coCRE, and coloured arrows represent the binding of TFs to the CREs in different cell types. The colours corresponding to the cell types are given below the network. The TFBP of a CRE in a specific cell type is represented as a circular histogram; for coCREs, the histogram represents the frequency of occurrence of a specific TF in the regions of that community (here the community comprises 4 regions). The p-value of observing a combinatorial binding profile in that cell type is provided for each TFBP node, and the methodology is given in the Methods section. The abbreviations for the TFs in the circular histogram are: Esrrb (EB), Nanog (NG), Pou5f1 (O4), Sox2 (S2), Cebpb (CB), Elk4 (E4), Gata2 (G2), Lmo2 (L2), Tal1 (T1), Fli1 (F1), Tead4 (T4), Meis1 (M1), Gata1 (GA1), Gfi1 (G1), Gfi1b (GB), Runx1 (R1), Spi1 (P1). Note that not all TFs in the circular histogram have supporting ChIP-seq data in all cell types (Table 1). Where ChIP-seq data are absent for a specific cell type, the bar for that TF in the histogram of that cell type is zero. (B) The gene expression profile (GEP) of Runx1, with cell types along the horizontal axis and FPKM on the vertical axis. (C) The best linear fit between the actual (X) and predicted (Y) GEP for Runx1. The Spearman correlation coefficient is also provided. (D) Tag density profiles, normalised as coverage per million aligned reads, for the 10 cell types. The Runx1 gene structure is shown in blue below the coverage tracks. The predictor CREs used in the lasso model are shown as grey boxes, and the chosen CRE and the coCRE are shown in red and yellow respectively. The super enhancers (SE) identified by Whyte et al.[53] are shown as green bars and the enhancers given by SEA are shown in blue. The experimental enhancers identified by Schütte et al. and Dogan et al. are provided as well. For Runx1 there is no overlap with the Dogan et al. dataset, hence the absence of any bars. Note that the coCRE enhancer is represented as a composite of red boxes of member CREs.
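The Spearman coefficient reported in panel C is the Pearson correlation of the ranks of the two profiles. The following is a minimal pure-Python sketch; the function names and the expression values are hypothetical and are not taken from the study.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical actual and predicted expression values across cell types.
actual = [1.0, 5.0, 20.0, 35.0, 12.0]
predicted = [2.0, 4.0, 18.0, 40.0, 15.0]
rho = spearman(actual, predicted)
```

Because it depends only on rank order, the coefficient rewards a model that orders the cell types correctly even when the predicted magnitudes are off.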

Table 2.

Generation of models.

Fig 3.

Characteristics of chosen regions.

Conservation, chromatin events and TF binding events in chosen CREs (red) compared to all candidate CREs (blue). Left: the TF gene set; middle: the TF cluster set; right: all differentially expressed genes (DE set). (A) The log of the phylogenetic conservation scores (see Methods), (B) the chromatin events (H3K27ac peaks and DHS), and (C) TF binding events. The p-values were obtained using Student's t-test for conservation and the Kolmogorov-Smirnov test for chromatin and binding events. All p-values were less than 0.01 except for the conservation distribution in the TF set (p = 0.07).
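For the chromatin and binding comparisons, the two-sample Kolmogorov-Smirnov test is based on the largest gap between the empirical cumulative distributions of the two groups. The sketch below computes only that statistic, not the p-value; the function name and the event counts are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    d = 0.0
    # The gap can only change at an observed value, so check each one.
    for x in sorted(set(sample_a) | set(sample_b)):
        cdf_a = sum(1 for v in sample_a if v <= x) / len(sample_a)
        cdf_b = sum(1 for v in sample_b if v <= x) / len(sample_b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical counts of binding events per region, for illustration only.
chosen_cres = [3, 4, 5, 6]
candidate_cres = [1, 2, 3, 4]
d_stat = ks_statistic(chosen_cres, candidate_cres)
```

A statistic near 0 means the two distributions are nearly indistinguishable, while a value near 1 means they barely overlap.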

Fig 4.

Validation of predicted enhancers.

(A) Overlap of genomic regions between published sets of enhancers (vertical axis) and the CREs/coCREs chosen for genes in the three gene sets (horizontal axis). The dot plot indicates the significance (-log₁₀(p), with p adjusted for multiple testing) of the pairwise overlaps (red/large = high significance, orange/small = low significance). The absence of a dot signifies p > 0.05. The right panel (Enrich) shows a negative control: overlaps with candidate CREs that were not chosen as predictive by our method but have an H3K27ac enrichment level similar to that of the chosen (co)CREs. (B) Expression of Sptbn1 across stages of haematopoietic and cardiac differentiation, shown as a blue line chart. The inset plot shows the chromatin activity profile (CAP) of a CRE predicted to be associated with this gene. (C) A UCSC browser snapshot of the predicted CRE within the Sptbn1 gene body. The region is shaded in blue, illustrating the dynamics of the active chromatin mark H3K27ac (top) and chromatin accessibility (bottom) across the cell types. The coordinates considered for further validation, chr11:30166167–30166587, are highlighted by a transparent cyan box. (D) A 5-day time course of haematopoietic differentiation, tracking the expression of a YFP reporter gene driven by the predicted CRE. Expression peaks on day 5 (D5), which is equivalent to the haemogenic endothelium (HE). The controls are the ESC line HM1 (black) and HM1 cells targeted with the reporter construct containing the minimal promoter (MP) only (grey).
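The dot sizes in panel A follow from adjusted p-values. The caption does not name the adjustment method, so the sketch below assumes the Benjamini-Hochberg procedure purely for illustration; the function name and raw p-values are hypothetical.

```python
import math


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed method, see lead-in)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four pairwise overlaps.
raw = [0.001, 0.02, 0.03, 0.5]
adj = benjamini_hochberg(raw)

# A dot is drawn only when the adjusted p-value is at most 0.05;
# its score is -log10(p), so larger scores mean higher significance.
scores = [-math.log10(p) if p <= 0.05 else None for p in adj]
```

Overlaps with an adjusted p-value above 0.05 map to `None` here, mirroring the absence of a dot in the plot.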

Table 3.

Overlaps of predicted regulatory elements with experimentally tested regions.

Fig 5.

Cis regulatory networks (CRNs) and joint clustering of expression and regulation.

(A) CRN for the TF cluster set. Genes are represented as nodes, and a directed edge connects two co-expressed genes when the TF encoded by one gene (source) binds a predicted CRE of the other gene (target). Node colour indicates the cell type in which the expression of the gene is highest. The size of the node name is proportional to its degree. (B) The CRN for the TF set (colours as in A). (C) Joint clustering of genes in the TF cluster set. Genes cluster together according to the relatedness of both their gene expression patterns (red-blue heatmap) and the binary pattern of TF binding (green-white heatmap) at their main predicted (co)CRE. Each cluster is distinguished by a colour-coded bar above the GEP heatmap (labelled “Joint cluster ID”). For each cluster, the average of the TF binding profiles is shown as the TF binding propensity, where 0 represents absence of binding and 1 represents binding of that specific TF in all the regions belonging to that cluster (green = TF binding; white = no TF binding).
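The TF binding propensity in panel C is a per-TF average of binary binding profiles within a cluster. A minimal sketch, with a hypothetical function name and made-up profiles:

```python
def binding_propensity(binary_profiles):
    """Average binary TF binding profiles column-wise.

    Each row is one region; each column is one TF (1 = bound, 0 = not).
    The result is the fraction of regions bound by each TF.
    """
    n_regions = len(binary_profiles)
    return [sum(column) / n_regions for column in zip(*binary_profiles)]


# Hypothetical cluster of four regions scored for three TFs.
cluster = [
    [1, 0, 1],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
]
propensity = binding_propensity(cluster)
```

In this example the first TF binds every region (propensity 1), while the other two each bind one region in four (propensity 0.25).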
