The authors have declared that no competing interests exist.

Conceived and designed the experiments: MH KD ZQ JD SS JF BR JSL. Performed the experiments: MH KD JD SS. Analyzed the data: MH KD JD SS. Contributed reagents/materials/analysis tools: MH KD JD SS JF. Wrote the paper: MH KD ZQ JSL.

Knowledge of spatial chromosomal organizations is critical for the study of transcriptional regulation and other nuclear processes in the cell. Recently, chromosome conformation capture (3C) based technologies, such as Hi-C and TCC, have been developed to provide a genome-wide, three-dimensional (3D) view of chromatin organization. Appropriate methods for analyzing these data and fully characterizing the 3D chromosomal structure and its structural variations are still under development. Here we describe a novel Bayesian probabilistic approach, denoted as “Bayesian 3D constructor for Hi-C data” (BACH), to infer the consensus 3D chromosomal structure. In addition, we describe a variant algorithm BACH-MIX to study the structural variations of chromatin in a cell population. Applying BACH and BACH-MIX to a high resolution Hi-C dataset generated from mouse embryonic stem cells, we found that most local genomic regions exhibit homogeneous 3D chromosomal structures. We further constructed a model for the spatial arrangement of chromatin, which reveals structural properties associated with euchromatic and heterochromatic regions in the genome. We observed strong associations between structural properties and several genomic and epigenetic features of the chromosome. Using BACH-MIX, we further found that the structural variations of chromatin are correlated with these genomic and epigenetic features. Our results demonstrate that BACH and BACH-MIX have the potential to provide new insights into the chromosomal architecture of mammalian cells.

Understanding how chromosomes fold provides insights into the complex relationship among chromatin structure, gene activity and the functional state of the cell. Recently, chromosome conformation capture based technologies, such as Hi-C and TCC, have been developed to provide a genome-wide, high resolution and three-dimensional (3D) view of chromatin organization. However, statistical methods for analyzing these data are still under development. Here we propose two Bayesian methods: BACH, to infer the consensus 3D chromosomal structure, and BACH-MIX, to reveal structural variations of chromatin in a cell population. Applying BACH and BACH-MIX to a high resolution Hi-C dataset, we found that most local genomic regions exhibit homogeneous 3D chromosomal structures. Furthermore, spatial properties of 3D chromosomal structures and structural variations of chromatin are associated with several genomic and epigenetic features. Notably, gene rich, accessible and early replicated genomic regions tend to be more elongated and exhibit higher structural variations than gene poor, inaccessible and late replicated genomic regions.

The spatial organization of a genome plays an important role in gene regulation, DNA replication, epigenetic modification and maintenance of genome stability

Much progress has been made in recent years to reconstruct 3D chromosomal structures from the Hi-C data by translating the observed chromatin contact frequency between two genomic loci to the population average spatial distance between them. Bau and colleagues

However, all the existing methods have several limitations. First, as pointed out by Yaffe and Tanay

More importantly, all of the existing methods focus on reconstructing consensus 3D chromosomal structures, but pay little attention to evaluating magnitudes of structural variations of chromatin at different resolution scales. To quantify structural variations of chromatin, the optimization-based methods usually require a large number of parallel runs, which is computationally intensive and not directly interpretable. Similarly, the Gaussian model in MCMC5C is derived from a consensus 3D chromosomal structure, which cannot be used to measure structural variations of chromatin either.

Since chromatin interactions captured by Hi-C experiments come from a cell population instead of a single cell, it is challenging to study structural variations of chromatin from the Hi-C data. When the cell population consists of multiple sub-populations, each of which corresponds to a distinct 3D chromosomal structure, the Hi-C data can only be interpreted as a measurement of the population average effect. The interpretation of Hi-C data from mammalian genomes is further complicated by the fact that the two chromosomes of a homologous pair cannot be distinguished from each other without genotype information. Without fully characterizing structural variations of chromatin in a cell population, the consensus 3D chromosomal structure inferred from the Hi-C data is not directly interpretable and may even be misleading.

Although the global 3D chromosomal structure is indeed quite dynamic in a cell population, the local 3D chromosomal structure could be homogeneous. A recent study

In this work, we test the hypothesis of consensus 3D structure at the topological domain scale via rigorous statistical analysis of Hi-C data. To achieve this goal, we propose two integrated probabilistic approaches called BACH (which is the short name for “

Applying BACH and BACH-MIX to a high resolution Hi-C dataset, we found that BACH, instead of BACH-MIX, is preferred in about half of the topological domains. Of the topological domains in which BACH-MIX fits the data better, most contain one dominant sub-population, whose 3D chromosomal structure can be reconstructed by the BACH algorithm. These results suggest that most topological domains exhibit homogeneous 3D chromosomal structures in a cell population. We also found that geometrical properties of these topological domains, particularly the shape and the structural variations, are associated with several genomic and epigenetic features. Furthermore, we found significantly lower structural variations at domain center regions than at domain boundary regions.

The BACH algorithm takes the chromosomal contact matrix generated by Hi-C or TCC experiments and local genomic features

Compared to other published methods, BACH has the following advantages: (1) It explicitly models and corrects known systematic biases associated with Hi-C data, such as restriction enzyme cutting frequencies, GC content and sequence uniqueness

In the BACH algorithm, we assume that chromosomal regions of interest exhibit a consensus 3D chromosomal structure in a cell population. However, this assumption may not be true, because chromosomal regions may exist in multiple inter-convertible configurations. To test the consensus 3D chromosomal structure assumption and study structural variations of chromatin in a cell population, we propose a variant algorithm called BACH-MIX (

BACH-MIX contains two types of parameters: the parameters to determine the local consensus 3D chromosomal structures of the two adjacent sub-regions, and the parameters to determine the spatial arrangement of the two adjacent sub-regions. In practice, the local 3D chromosomal structures of the two adjacent sub-regions can be estimated by applying BACH twice separately, each to the contact map of one sub-region. The main computation in BACH-MIX is to estimate the parameters corresponding to each spatial arrangement of the two adjacent sub-structures.

A spatial arrangement of the two adjacent sub-structures can be represented by a rotation matrix with three Euler angles

In practice, a majority of the 104 spatial arrangements of the two adjacent sub-regions are negligible, having very low proportions. To avoid over-fitting, we adopt a two-step procedure to achieve sparsity: first, we apply the full BACH-MIX model with 104 spatial arrangements to estimate the proportion of each of them; second, we remove the negligible spatial arrangements, whose proportions are less than 1%, and re-estimate the proportions of the remaining spatial arrangements.
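The pruning step of this two-step procedure can be sketched as follows. This is a minimal illustration, not the BACH-MIX implementation: the function name `prune_arrangements` and the toy proportions are hypothetical, and the renormalized values are meant only as starting values for the second fit, since the actual re-estimation is carried out by refitting the model.

```python
import numpy as np

def prune_arrangements(proportions, threshold=0.01):
    """Drop arrangements whose estimated proportion is below `threshold`.

    Returns the indices of the retained arrangements and their proportions
    renormalized to sum to one. The renormalized values are only starting
    values for a second fit, not the re-estimated proportions themselves.
    """
    p = np.asarray(proportions, dtype=float)
    keep = np.flatnonzero(p >= threshold)
    if keep.size == 0:
        raise ValueError("no arrangement reaches the threshold")
    return keep, p[keep] / p[keep].sum()

# Toy first-step estimates for 104 arrangements (made-up numbers):
# three sizeable components and 101 negligible ones sharing 5% of the mass.
first_step = np.array([0.60, 0.25, 0.10] + [0.05 / 101] * 101)
kept_idx, start_values = prune_arrangements(first_step)
print(kept_idx, start_values)
```

Only the three sizeable arrangements survive the 1% cut in this toy input, so the second fit works with three mixture components instead of 104.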

We applied BACH and BACH-MIX to a dataset recently generated in our lab

We conducted a genome-wide analysis by applying BACH and BACH-MIX to this high-resolution mESC Hi-C dataset. Both BACH and BACH-MIX were applied to the 40 KB resolution Hi-C contact matrices. In the preprocessing procedure, we filtered out 300 topological domains (13.64% of the total 2,200 domains) that are shorter than 400 KB or do not contain any known mouse gene. We also filtered out a subset of 40 KB genomic loci within each topological domain according to restriction enzyme cutting frequencies (number of fragment ends ≤ 5), GC content (≤ 0.3) and sequence uniqueness (mappability score ≤ 0.8) (
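The locus-level filter described above can be sketched as a boolean mask over the 40 KB loci of one domain. The thresholds are those stated in the text; the function names, argument layout and toy values are assumptions made for illustration.

```python
import numpy as np

def usable_loci(fragment_ends, gc_content, mappability,
                min_ends=5, min_gc=0.3, min_map=0.8):
    """Boolean mask of loci that survive all three filters.

    A locus is removed if it has <= min_ends fragment ends, GC content
    <= min_gc, or mappability score <= min_map.
    """
    ends = np.asarray(fragment_ends)
    gc = np.asarray(gc_content, dtype=float)
    mp = np.asarray(mappability, dtype=float)
    return (ends > min_ends) & (gc > min_gc) & (mp > min_map)

def filter_contact_matrix(contacts, mask):
    """Keep only the rows and columns of the retained loci."""
    return np.asarray(contacts)[np.ix_(mask, mask)]

# Toy domain with five loci (made-up feature values).
ends = [12, 5, 20, 9, 15]
gc = [0.45, 0.50, 0.28, 0.41, 0.39]
mapp = [0.95, 0.90, 0.99, 0.80, 0.92]
mask = usable_loci(ends, gc, mapp)

contacts = np.arange(25).reshape(5, 5)
contacts = contacts + contacts.T          # symmetric toy contact matrix
filtered = filter_contact_matrix(contacts, mask)
print(mask, filtered.shape)
```

Loci 2, 3 and 4 of the toy input each fail exactly one criterion, which shows that the three conditions act independently.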

To validate the spatial distances inferred by the BACH algorithm, we compared the spatial distances BACH predicted (referred to as the BACH distances) to the spatial distances measured by FISH

In the previous analysis, we obtained the 3D chromosomal structure predicted by BACH for each topological domain. Next, we divided each topological domain into two sub-regions of equal size and applied BACH-MIX to infer the spatial arrangement of the two sub-regions. We evaluated the goodness of fit of the BACH model and the BACH-MIX model for each of these 1,900 topological domains in terms of AIC, which penalizes the log-likelihood of a model by the number of parameters in the model. A smaller AIC indicates a better model fit. In the HindIII sample, BACH achieved a smaller AIC than BACH-MIX in 875 out of 1,900 (46.05%) topological domains. Of the remaining 1,025 topological domains, where BACH-MIX fits the data better than BACH, 487 have one dominant spatial arrangement of the two sub-regions with proportion greater than 80%. In 482 of these 487 topological domains, the dominant 3D chromosomal structure can be captured by BACH. Therefore, BACH can reconstruct the consensus structure or the dominant structure in 1,357 topological domains (71.42% of 1,900 topological domains). We obtained consistent results in the NcoI sample, where BACH achieved a smaller AIC than BACH-MIX in 1,156 out of 1,900 (60.84%) topological domains. Of the remaining 744 topological domains, where BACH-MIX fits the data better than BACH, 394 have one dominant spatial arrangement of the two sub-regions with proportion greater than 80%. In 393 of these 394 topological domains, the dominant 3D chromosomal structure can be captured by BACH. Therefore, BACH can reconstruct the consensus structure or the dominant structure in 1,549 topological domains (81.53% of 1,900 topological domains).
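The model comparison and the domain tallies above can be reproduced with a few lines. The sketch uses the standard definition AIC = 2k − 2 log L; the log-likelihoods and parameter counts in the example call are made up, and the function names are hypothetical.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 log L (smaller is better)."""
    return 2.0 * n_params - 2.0 * log_likelihood

def preferred_model(loglik_bach, k_bach, loglik_mix, k_mix):
    """Return the name of the model with the strictly smaller AIC."""
    if aic(loglik_bach, k_bach) < aic(loglik_mix, k_mix):
        return "BACH"
    return "BACH-MIX"

# Hypothetical fits for one domain: the mixture model reaches a higher
# log-likelihood but pays for its extra parameters.
choice = preferred_model(loglik_bach=-1520.0, k_bach=80,
                         loglik_mix=-1500.0, k_mix=110)

# Tally of domains whose consensus or dominant structure BACH recovers,
# using the counts reported in the text.
n_domains = 1900
hindiii = 875 + 482      # BACH preferred + dominant structure captured
ncoi = 1156 + 393
print(choice, hindiii, round(100 * hindiii / n_domains, 2),
      ncoi, round(100 * ncoi / n_domains, 2))
```

In the hypothetical example the 20-unit gain in log-likelihood does not offset the 30 additional parameters, so the simpler model is selected.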

In the following analysis, we focus on the 1,199 topological domains (63.11% of 1,900; the overlap between the 1,357 topological domains in the HindIII sample and the 1,549 topological domains in the NcoI sample) in which BACH can reconstruct the consensus 3D chromosomal structure or the 3D chromosomal structure of the dominant sub-population in both the HindIII sample and the NcoI sample. To summarize the structural properties of topological domains, we approximated each 3D chromosomal structure BACH predicted (40 KB resolution) by a cylinder, and computed the ratio between its height and diameter, abbreviated as

To evaluate the relationship between structural properties of chromatin (measured by HD ratio) and its functional forms at the topological domain scale, we collected genomic and epigenetic features for each topological domain, including gene density (UCSC reference genome mm9), gene expression
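The cylinder approximation behind the HD ratio can be sketched as follows. The text does not spell out how the cylinder is fitted, so this sketch makes its own assumptions: the cylinder axis is taken to be the first principal axis of the bead coordinates, the height is the extent of the beads along that axis, and the diameter is twice the largest distance of any bead from the axis. The function name `hd_ratio` is hypothetical.

```python
import numpy as np

def hd_ratio(coords):
    """Height-to-diameter ratio of a cylinder enclosing a 3D structure.

    Assumption: the cylinder axis is the first principal axis of the
    centered coordinates (an illustrative choice, not taken from the paper).
    """
    x = np.asarray(coords, dtype=float)
    centered = x - x.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    axis = vt[0]
    along = centered @ axis                      # position along the axis
    height = along.max() - along.min()
    radial = centered - np.outer(along, axis)    # component off the axis
    diameter = 2.0 * np.linalg.norm(radial, axis=1).max()
    if diameter == 0.0:
        raise ValueError("all points lie on a line; diameter is zero")
    return height / diameter

# Toy structure: the eight corners of a 1 x 1 x 4 box, elongated along z.
box = np.array([[sx * 0.5, sy * 0.5, sz * 2.0]
                for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
ratio = hd_ratio(box)
print(ratio)
```

For the toy box the principal axis is the z axis, the height is 4 and the enclosing diameter is the diagonal of the unit square, so the ratio is 4/√2. Larger values correspond to more elongated structures.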

Two illustrative examples in the HindIII sample: one for a more elongated 1 MB domain (chromosome 18, 33,960,000∼34,960,000) belonging to compartment A, the other for a less elongated 1 MB domain (chromosome 7, 62,040,000∼63,040,000) belonging to compartment B. In

The original Hi-C study

We further studied the structural variations of chromatin in a cell population. We first selected 562 topological domains with size larger than 1 MB, and applied BACH and BACH-MIX to the 1 MB region around the center of each selected domain (the domain center region). Additionally, we used 985 domain boundaries with size shorter than 40 KB as the control group, and applied BACH and BACH-MIX to the 1 MB region around each selected domain boundary (the domain boundary region). We divided each 1 MB genomic region (domain center/boundary region) into two adjacent 500 KB sub-regions, predicted the 3D structure of each sub-region by BACH, and then inferred the spatial arrangements of the two sub-structures. Both BACH and BACH-MIX were applied to the 40 KB resolution Hi-C contact matrices.

Among all the possible spatial arrangements of two adjacent genomic regions, we defined the effective structures as those with their posterior mean proportions greater than 5%, and report the number of effective structures at each locus. A locus with a smaller number of effective structures exhibits lower structural variations than a locus with a larger number of effective structures. In the HindIII sample, the average number of effective structures is 2.20 for the domain center regions, and 2.82 for the domain boundary regions (
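Counting effective structures amounts to thresholding the posterior mean proportions. The sketch below uses the 5% cut-off from the text and also labels a locus as having low or high structural variation according to whether it has one or several effective structures; the function names and the toy proportions are hypothetical.

```python
import numpy as np

def n_effective_structures(posterior_mean_proportions, threshold=0.05):
    """Number of arrangements whose posterior mean proportion exceeds the threshold."""
    p = np.asarray(posterior_mean_proportions, dtype=float)
    return int(np.count_nonzero(p > threshold))

def variation_group(posterior_mean_proportions, threshold=0.05):
    """'low' for at most one effective structure, 'high' for several."""
    n = n_effective_structures(posterior_mean_proportions, threshold)
    return "high" if n > 1 else "low"

# Toy posterior mean proportions for two loci (made-up numbers).
center_like = [0.91, 0.04, 0.03, 0.02]
boundary_like = [0.62, 0.21, 0.08, 0.04, 0.03, 0.02]
print(n_effective_structures(center_like), variation_group(center_like))
print(n_effective_structures(boundary_like), variation_group(boundary_like))
```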

Two illustrative examples in the HindIII sample: one for the domain center region (Chromosome 2, 117,580,000∼118,580,000) with low structural variations, and the other for the domain boundary region (Chromosome 1, 135,540,000∼136,540,000) with high structural variations. In

Next, we evaluated the relationship between structural variations of topological domains and their functional forms. We divided the 562 selected domain center regions into two groups, regions with high structural variations (i.e., containing multiple effective structures, threshold = 5%) and regions with low structural variations (i.e., containing one effective structure, threshold = 5%), and compared the genomic and epigenetic features between these two groups (

Although it is widely accepted that the chromatin structure is highly dynamic, it is unclear whether the cell population contains one dominant chromosomal structure, or multiple distinct chromosomal structures with comparable mixture proportions. To quantify structural variations of the whole chromosome in the cell population, we designed the following two-step procedure. In the first step, we applied BACH to the whole chromosome scale Hi-C contact matrix and obtained a predicted 3D chromosomal structure (the mode of the first BACH posterior distribution, referred to as

If there exists a dominant chromosomal structure (referred to as

In practice, however, we need a reference probability distribution in order to claim that the observed RMSD(

We applied the above two-step procedure to the real Hi-C data to generate 3D chromosomal structure for each mouse chromosome by treating each topological domain as a basic unit.
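The RMSD comparison used in the two-step procedure requires superimposing two structures before measuring their difference. The sketch below is one common way to do this, the Kabsch algorithm for the optimal rigid rotation, and is an assumption about the alignment step rather than the procedure used in BACH; in particular it does not handle mirror images or rescaling. The function name is hypothetical.

```python
import numpy as np

def rmsd_after_alignment(p, q):
    """RMSD between two (n, 3) structures after optimal rigid superposition.

    Uses the Kabsch algorithm: both structures are centered, and p is
    rotated (proper rotation only, no reflection) to best match q.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape or p.ndim != 2 or p.shape[1] != 3:
        raise ValueError("p and q must both have shape (n, 3)")
    pc = p - p.mean(axis=0)
    qc = q - q.mean(axis=0)
    h = pc.T @ qc                          # 3 x 3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so that the result is a proper rotation.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    aligned = pc @ rot.T
    return float(np.sqrt(np.mean(np.sum((aligned - qc) ** 2, axis=1))))

# Toy check: a rotated and shifted copy should give an RMSD near zero,
# while a noisy copy should not.
rng = np.random.default_rng(0)
s1 = rng.normal(size=(20, 3))
angle = np.pi / 6
rz = np.array([[np.cos(angle), -np.sin(angle), 0.0],
               [np.sin(angle),  np.cos(angle), 0.0],
               [0.0, 0.0, 1.0]])
s2_same = s1 @ rz.T + np.array([1.0, -2.0, 0.5])
s2_noisy = s2_same + rng.normal(scale=0.3, size=s1.shape)
print(rmsd_after_alignment(s1, s2_same), rmsd_after_alignment(s1, s2_noisy))
```

Because structures inferred from pairwise distances are only defined up to rotation, translation and reflection, a comparison that should ignore handedness would also need to try the mirror image of one structure and keep the smaller RMSD.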

To further understand why shorter chromosomes appeared variable in our two-step procedure at the whole chromosome level, we also conducted a local-level structural comparison. In detail, we used a sliding window of ten domains to scan along each chromosome. For each local region of a chromosome covered by the sliding window, we evaluated the structural similarity between

A competing method, MCMC5C, has been proposed to generate whole chromosome level 3D models for the human chromosomes

We applied the BACH algorithm to the whole chromosome Hi-C contact matrix, and obtained the predicted 3D chromosomal structures for the 15 long chromosomes (chr 1 to chr 14 and chr X). We first investigated how compartments labeled “A” versus those labeled “B” are distributed spatially in the whole chromosome model. Among all the 1,835 topological domains in chr 1 to chr 14 and chr X, 848 belong to compartment A, 633 belong to compartment B, and the remaining 354

We used the 3D chromosomal structure BACH predicted for chromosome 2 in the HindIII sample as an illustrative example. In

We further studied how genomic and epigenetic features are distributed spatially in the whole chromosome model. Similar to the previous analysis for compartment labels (A or B), we conducted the same analysis for each of the eleven genomic and epigenetic features under consideration (

We have described BACH and BACH-MIX, two Bayesian statistical models for studying 3D chromosomal structures and structural variations of chromatin from Hi-C data. The benefits of using a probabilistic approach are twofold: first, rigorous statistical inference can be carried out to properly remove systematic biases and account for observational noise sources; second, sequencing depth variations can be explicitly modeled by Poisson distributions. Our results demonstrate that BACH is significantly more reproducible and achieves higher consistency with the FISH data than an existing algorithm (MCMC5C). Application of BACH to a recently published Hi-C dataset from mouse ES cells reveals interesting structural properties of mammalian chromosomes. Specifically, we found that the geometric shapes of topological domains are strongly correlated with several genomic and epigenetic features. For example, gene rich, actively transcribed, accessible and early replicated chromatin regions tend to be more elongated than gene poor, lowly transcribed, inaccessible and late replicated chromatin regions. Furthermore, by using the variant algorithm BACH-MIX, we found that structural variations of chromatin are also correlated with several genomic and epigenetic features.
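To make the Poisson modeling mentioned above concrete, the following sketch evaluates a Poisson log-likelihood of a contact matrix given bead coordinates. The log-linear link between the expected count and the spatial distance, the parameter names `beta0` and `beta1`, and the optional per-pair `log_bias` term standing in for the systematic-bias covariates are all assumptions made for illustration; they are not quoted from the model definition.

```python
import numpy as np
from math import lgamma, log

def poisson_loglik(counts, coords, beta0, beta1, log_bias=None):
    """Poisson log-likelihood of pairwise contact counts.

    Assumed form: count_ij ~ Poisson(theta_ij) with
    log(theta_ij) = beta0 + beta1 * log(d_ij) + log_bias_ij,
    where d_ij is the Euclidean distance between loci i and j.
    Only pairs with i < j are used.
    """
    u = np.asarray(counts, dtype=float)
    x = np.asarray(coords, dtype=float)
    i, j = np.triu_indices(len(x), k=1)
    dist = np.linalg.norm(x[i] - x[j], axis=1)
    log_theta = beta0 + beta1 * np.log(dist)
    if log_bias is not None:
        log_theta = log_theta + np.asarray(log_bias, dtype=float)[i, j]
    obs = u[i, j]
    log_factorial = np.array([lgamma(v + 1.0) for v in obs])
    return float(np.sum(obs * log_theta - np.exp(log_theta) - log_factorial))

# Toy example: two loci one unit apart with two observed contacts.
toy_counts = np.array([[0, 2], [2, 0]])
toy_coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
ll = poisson_loglik(toy_counts, toy_coords, beta0=log(3.0), beta1=-1.0)
print(ll)
```

With a negative `beta1`, loci that are closer in space have a higher expected contact count, which is the direction of the relationship the reconstruction relies on.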

There are several issues that we have not addressed in this paper, such as biophysical properties of chromatin fiber

To reconstruct the underlying consensus 3D chromosomal structure, we develop the following probabilistic model, similar to the “beads-on-a-string” model (

Let

Let

Let

Let

We propose the BACH-MIX algorithm to study the spatial arrangement of two adjacent genomic regions. Here we assume that each genomic region exhibits a unique consensus 3D chromosomal structure, but the spatial arrangement of two adjacent genomic regions has a certain level of flexibility and varies according to a probability distribution. More precisely, let

Let

Assuming

Under the default setting of BACH, we draw 100 3D chromosomal structures at each step of sequential importance sampling. We further enrich each 3D chromosomal structure ten times when we implement the rejection control technique. In the Gibbs sampler of BACH and BACH-MIX, we run three parallel chains with 5,000 MCMC iterations in each chain. The first 1,000 samples are dropped as the burn-in stage, and then every 50^{th} sample of the last 4,000 samples is used for the posterior inference. We use the Gelman-Rubin statistic

The computation time of BACH and BACH-MIX depends on the number of MCMC iterations and the number of loci in the genomic region of interest. All MCMC calculations were conducted on computing nodes in the Harvard Linux cluster “Odyssey”, each with dual Xeon E5410 2.3 GHz quad core processors and 32 GB RAM. Under the default setting, BACH takes 81 seconds to predict a 3D chromosomal structure with 25 loci; BACH-MIX takes 8 minutes to predict the proportions of 104 distinct 3D chromosomal structures for two adjacent 13-locus genomic regions. The computation time increases almost quadratically with the number of loci in the genomic region of interest.
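The default sampling schedule (three chains of 5,000 iterations, 1,000 burn-in iterations, every 50th draw retained) and a Gelman-Rubin check can be sketched as follows. The potential scale reduction factor is computed in its standard form for a single scalar parameter; whether it is evaluated on the thinned or the full post-burn-in draws is not stated in the text, and the simulated chains and function names are purely illustrative.

```python
import numpy as np

def thin_chain(samples, burn_in=1000, step=50):
    """Drop the burn-in, then keep every `step`-th draw of the remainder."""
    kept = np.asarray(samples)[burn_in:]
    return kept[step - 1::step]

def gelman_rubin(chains):
    """Potential scale reduction factor for an (m chains, n draws) array."""
    x = np.asarray(chains, dtype=float)
    n = x.shape[1]
    within = x.var(axis=1, ddof=1).mean()          # W: mean within-chain variance
    between = n * x.mean(axis=1).var(ddof=1)       # B: between-chain variance
    pooled = (n - 1) / n * within + between / n
    return float(np.sqrt(pooled / within))

# Three simulated chains of 5,000 draws for one scalar parameter
# (independent normal draws, standing in for real MCMC output).
rng = np.random.default_rng(1)
raw = rng.normal(loc=2.0, scale=1.0, size=(3, 5000))
thinned = np.stack([thin_chain(chain) for chain in raw])
r_hat = gelman_rubin(thinned)
print(thinned.shape, r_hat)
```

Under this schedule each chain contributes 80 retained draws, so posterior summaries are based on 240 draws in total. Values of the statistic close to 1 indicate that the chains agree with one another.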

BACH and BACH-MIX can be freely downloaded at
