Spatial clustering and common regulatory elements correlate with coordinated gene expression

Many cellular responses to surrounding cues require temporally concerted transcriptional regulation of multiple genes. In prokaryotic cells, a single-input-module motif with one transcription factor regulating multiple target genes can generate coordinated gene expression. In eukaryotic cells, transcriptional activity of a gene is affected by not only transcription factors but also the epigenetic modifications and three-dimensional chromosome structure of the gene. To examine how local gene environment and transcription factor regulation are coupled, we performed a combined analysis of time-course RNA-seq data of TGF-β treated MCF10A cells and related epigenomic and Hi-C data. Using Dynamic Regulatory Events Miner (DREM), we clustered differentially expressed genes based on gene expression profiles and associated transcription factors. Genes in each class have similar temporal gene expression patterns and share common transcription factors. Next, we defined a set of linear and radial distribution functions, as used in statistical physics, to measure the distributions of genes within a class both spatially and linearly along the genomic sequence. Remarkably, genes within the same class despite sometimes being separated by tens of million bases (Mb) along genomic sequence show a significantly higher tendency to be spatially close despite sometimes being separated by tens of Mb along the genomic sequence than those belonging to different classes do. Analyses extended to the process of mouse nervous system development arrived at similar conclusions. Future studies will be able to test whether this spatial organization of chromosomes contributes to concerted gene expression.

chromosome structure of the gene.To examine how local gene environment and transcription factor regulation are coupled, we performed a combined analysis of timecourse RNA-seq data of TGF-β treated MCF10A cells and related epigenomic and Hi-C data.Using Dynamic Regulatory Events Miner (DREM), we clustered differentially expressed genes based on gene expression profiles and associated transcription factors.
Genes in each class have similar temporal gene expression patterns and share common transcription factors.Next, we defined a set of linear and radial distribution functions, as used in statistical physics, to measure the distributions of genes within a class both spatially and linearly along the genomic sequence.Remarkably, genes within the same class despite sometimes being separated by tens of million bases (Mb) along genomic sequence show a significantly higher tendency to be spatially close despite sometimes being separated by tens of Mb along the genomic sequence than those belonging to different classes do.Analyses extended to the process of mouse nervous system development arrived at similar conclusions.Future studies will be able to test whether this spatial organization of chromosomes contributes to concerted gene expression.

INTRODUCTION
A cell continuously receives signals from its local environment and accordingly adjusts cellular programs, such as cell proliferation, motility and metabolism [1].Typically, regulation of a cellular process requires changes in the expression of a group of genes in a temporally coordinated manner [2].How such coordinated regulation is achieved is a central question that remains poorly addressed.
A mechanism of such regulation is through specific interaction network structures of transcription factors (TFs).TFs bind to certain DNA sites and regulate transcriptional activities of their targeted genes.A TF can regulate multiple target genes to form a socalled single-input-module (SIM, or fan-out) [3].This SIM network motif appears in a high frequency to coordinate the expression of genes with related functions such as those in bacterial metabolic pathways [4].Gene regulation in eukaryotic cells is more complex since the three-dimensional structure of DNA has a more profound impact on gene transcription than that in prokaryotic cells.For instance, a nucleosome structure with a high packing level limits gene accessibility [5].Furthermore, epigenetic modifications can strongly influence gene transcription [6].It is not fully understood how these different regulation mechanisms collectively control the expression of a group of genes.
To examine how multiple levels of regulation lead to concerted expression of gene groups, we analyzed the temporal gene expression profiles of TGF-β treated human mammary epithelial MCF10A cells in the context of histone modification patterns and chromosome structures derived from Hi-C data.The TGF-β family is crucial for regulating a complex signal transduction network in embryonic and fetal development, and is also involved in multiple physiological and pathological processes such as wound healing and cancer progression [7].Its signaling event starts from membrane embedded TGF-β receptors, which bind active TGF-β molecules from the extracellular environment [8].The TGF-β signal is then transmitted into the cell through a signal transduction network and triggers a cascade of cellular responses.The latter is achieved through temporally coordinated expression changes of groups of genes with related functions such as cell proliferation, metabolism, and motility [9].TGF-β also induces a global reprogramming of cell epigenome [10], which reinforces cellular responses for committed cell phenotype transition.We also analyzed temporal gene expression together with histone modifications and chromosome structures during mouse neural differentiation, another well-defined model for studying cell phenotype transition [11,12].Specifically, we analyzed a recently published dataset that combined Hi-C, RNA-seq, and ChIP-seq studies on the differentiation process from mouse embryonic stem cells (ESCs) to neural progenitor cells (NPCs) then to cortical neurons (CNs) [13].In both the TGF-β response and neural differentiation systems, our analyses reveal that genes co-regulated by a common TF(s) have the tendency to be spatially close, even if they are distant along the linear genome sequence.

RNA extraction and library preparation
Total RNA was isolated from the cell pellets with an RNA extraction kit (Qiagen, Cat No. 74104).All RNA extracts were confirmed with high quality (RQN score = 10.0) using the Fragment Analyzer TM platform (Advanced Analytical Technologies, Inc).Libraries were prepared using the NEBNext Ultra RNA Library Prep Kit for Illumina (NEB, Cat No. E7530L) according to the manufacturer's instructions.Briefly, mRNA was first isolated from total RNA with oligo d(T)25 beads (all volumes were halved except for washing steps, NEB, Cat No. E7490S).Next, purified mRNA was denatured and melted into small fragments, and subjected to random priming and extension for reverse transcription.After that, double-stranded cDNA was end-repaired, dA-tailed, adaptor ligated, and amplified with 12 PCR cycles.Constructed libraries were subjected to purification and quality control; the final quality-ensured libraries were pooled and sequenced on an Illumina HiSeq 4000 instrument for 150 bp paired-end sequencing.

RNA-seq data processing
Paired-end cleaned reads were aligned to the human reference genome hg19 (UCSC) using TopHat (v 2.1.1)with default parameters.The BAM files of mapped reads were used to annotate transcripts and calculate the FPKM values using the Cufflinks, Cuffquant, Cuffnorm suite [14].Differentially expressed (DE) genes were identified between any two time points with the criteria: fold change >2 or < 0.5 and FDR < 0.05.
The FPKM values of genes from the RNA-seq dataset were further cleaned up using custom R scripts.Hierarchical clustering of genes was performed using an R package (pheatmap).Gene expression and TF regulation based Hidden Markov Model (HMM) clustering was performed with the DREM2 software [15].RNA-seq results of ESC, NPC and neuron cells were downloaded from the GEO database under the accession number GSE96107.

Chromosome structure analyses
Hi-C data were downloaded from the GEO database (MCF10A, GEO:GSE66733; mouse nervous system GEO:GSE96107).Chromosome structures were constructed using an R package (igraph).Clustering of bins was obtained with the fast-greedy algorithm [16].
Physical distances between bins were estimated with a Matlab code provided by Lesne et al. [17].This code uses a Shrec3D algorithm, which first relates the Hi-C contact frequency between every two genomic sites with a spatial distance, then approximates the actual distance between the two sites by their shortest-path distance on a contact graph.This algorithm alleviates uncertainty of reconstructing the spatial distance between two distal sites only by their own contact frequency.

Distribution function calculation
Linear distribution function: For a tagged HMM class α gene, we divided the flanking sequences into bins with a size of Δl base pairs, and the i-th pair of bins [ −i − 1 ∆, − ∆] and [  ∆, i + 1 ∆], i = 0, 1, etc. (Fig S1A).In the i-th pair of bins, there are n iα genes belonging to the same HMM class as the tagged gene.For the 0-th pair of bins the counting of the genes should exclude the tagged gene.The linear correlation was calculated as , where i = 0, 1, 2, etc. N α was defined as the total number of genes belonging to class α, and the average <> α was performed over every HMM class α gene as the tagged gene.
As a control, where n i was the total number of genes in the i-th bin and N the total number of human genes, and , where n iD is the total number of DE genes in the i-th bin, and N D is the total number of DE genes.

Spatial distribution function:
The idea of a radial distribution function from statistical mechanics was implemented (Fig S1B) [18].Each chromosome was divided into sequence bins with a size of 250 kb.A tagged gene from HMM class α resides in a bin that we referred to as the tagged bin.The spatial distance in the 3D physical space between the tagged bin and another bin containing a specific HMM class β gene was analyzed using the Shrec3D algorithm [17] to convert the contact frequency between two bins from Hi-C data to a spatial distance.The sphere centered at the tagged bin was divided into shells with a thickness Δr.In our analysis, Δr ≈ 60 nm based on the estimated conversion in [17].
Next, an average spatial correlation function between a class-α-gene-containing bin at the origin and class-β-gene-containing bins within the i-th shell was defined as, where  !" is the number of HMM class β genes within a spherical shell (Δr, ( + 1)Δr ), and this number excludes the tagged class α gene within the 0-th shell in the case β = α; N β is the total number of class β genes and this number excludes the tagged class α gene in the case β = α; V is the volume of the nucleus and the unit was chose so that V = 1, and the average over α is again performed over all genes belonging to class α as the tagged gene.
Similarly, the controls were defined as, where n i is the number of all genes within the i-th shell; N is the total number of human genes; n iD is the number of DE genes within the i-th shell; and N D is the total number of DE genes.Again, the tagged gene was excluded when counting n 0 and n d0.

Changes in gene expression reflect cell phenotype transition in response to TGF-β
We used MCF10A cells, a non-tumorigenic human mammary epithelial cell line, as a previous reports that under TGF-β treatment cells are under growth arrest until they finish EMT [21].
Histone modifications can also affect gene expression [22].To investigate the relationship between histone modification and gene expression, we integrated genomewide H3K4me3 and H3K4ac profiles obtained by Messier et al. [23] with our RNA-seq data.Both H3K4me3 and H3K4ac are histone modification marks that are associated with active or poised genes [24].We used H3K4me3 and H3K4ac profiles of all human genes as a control, and examined the marks in each HC class.The results in Fig 2B show that all HC classes have elevated H3K4me3 and H3K4ac compared to the control, and there is no apparent difference between different classes.Each HC class also has a broad bimodal distribution.That is, genes within an HC class do not share common histone modification patterns.Given that histone modification patterns correlate with local chromosome structures [25], these results suggest that genes from the same HC class have heterogeneous local chromosome environments.
Next, we adopted a different clustering scheme, the Dynamic Regulatory Events Miner (DREM), which clusters genes by combining gene expression time series with additional pre-established transcriptional networks [26].Figure 3A shows the clustering results analyzed with DREM2 based on a Hidden Markov Model (HMM) [15].At each conjunction node, genes are assigned to different branches based on their expression trend and upstream regulators (transcription factors on this node).Genes from an upstream branch can become key regulators at subsequent nodes [15,26].It reveals a hierarchy of gene regulation during the process of TGF-β-induced phenotype change.
With DREM2 the DE genes were clustered prominently into 46 branches with 19 nodes at the conjunction sites and 25 end classes.For clarity, we call the latter HMM classes to distinguish from the HC classes that are based on expression only.
Compared to the HC classes, HMM classes showed finer dynamic patterns and GO enrichment information (Table S1).For example, genes in the first seven HMM classes all had increased expression, but differed in their detailed temporal profiles.Genes in class C1 increased their expression to high levels already on day 2. Genes related to metalloendopeptidase activity were enriched in this class by over 17 fold with respect to the reference genes.Four of the matrix metalloproteinases (mmps), mmp2/7/11/13, are also in this class.These four MMPs are known to degrade components of extracellular matrix proteins such as gelatin, fibronectin, and laminin, and mediate biological activities including migration, mammary epithelial cell apoptosis, and EMT [27].Heparin binding genes were another type of highly enriched genes.These genes, such as periostin (postn), fibronectin (fn1), are also known to be related to matrix or cell membrane formation and thus affect cell migration and adhesion [28].Another class of early activation genes, class C2, was also enriched with genes related to cell matrix and membrane structure.Among them five of the pcdh family members, including pcdh7/a4/b9/b10/b13, are integral membrane proteins that are involved in cell-cell recognition and adhesion [29].In general, genes within each HMM class had narrower distributions and thus higher similarity of Therefore, genes clustered through the DREM2 analysis based on common TFs and similar dynamic profiles tend to have closely related functions.

Genes sharing common regulators have an enhanced tendency to be spatially close
As mentioned above, local chromosomal DNA environment affects gene transcriptional activity.We wondered whether genes sharing similar expression patterns and common regulatory factors, as in an HMM cluster identified by the DREM2 analysis, are also spatially close and share similar local DNA environment.To test this hypothesis, we first examined gene arrangement along the linear genome sequences.We divided the whole human genome into bins with a resolution of 1 Mb, a typical size of a topologically associated domain (TAD).Then we matched all genes to the relevant bins based on their genomic positions.Statistical analysis of all the genes spreading along the chromosomes showed that genes are not evenly distributed along the DNA sequences (Fig 4A).Most bins have less than ten genes, and globally one third of the bins are gene-free.By contrast, ~3% of the bins (a total of less than 100 bins) contained 17% of the overall human genes.
This uneven distribution was slightly more profound for the DE genes under TGF-β treatment: DE genes resided in less than half of the bins and 17% of DE genes were enriched in only 2.5% of the bins.
To further examine the gene distribution along chromosomes, we defined an averaged linear distribution function  !(see Materials and Methods for details).It measured how the chromosomal density of a group of genes of interest changes with respect to the transcription starting site (TSS) of a given gene.For a given gene x belonging to an HMM class α as a tagged gene, we divided the DNA sequences along both sides flanking the TSS of x into bins with a size of 125 kb (Fig S1A, r = 125 kb), and counted the fraction of HMM class α genes in each bin.We then repeated this process by choosing every gene in the HMM class α as the tagged gene, and calculated the average density of HMM class α genes ( !!()) in the i-th bin with respect to the tagged gene.For statistical comparison, we also calculated a similar  !!" () for all human genes and  !!" () for all DE genes with respect to the tagged HMM class α genes as controls.If there were no class-specific gene clustering along the genomic sequence, one would expect that Next, we investigated the spatial arrangement of the DE genes using a set of available Hi-C data from MCF10A cells [30].Following an approach used in statistical mechanics

Discussion
Recent studies on chromosome conformations have revealed the existence of structural units such as promoter-enhancer hubs, topologically associated domains (TADs), and meta-TADs and demonstrated that these structural units play important roles in gene regulation [31][32][33][34].Several studies focusing on specific genomic regions have shown correlation between gene expression and local chromosome structures [35,36].In this work, we provide a genome-wide perspective on the relationship between chromosome structure and gene regulation by integrating the RNA-seq and Hi-C data.We first only used the expression data and grouped genes that share similar temporal expression patterns and are co-regulated by common TFs together.We found that genes within each group display a significantly enhanced tendency to be clustered spatially in the threedimensional chromosome structure, regardless whether these genes are close (< 1 Mb) along the genome sequence or separated by as far as tens of Mb.This observation further suggests that the three-dimensional chromosome structure is part of a multi-layer gene regulation program.
Our analysis reveals two related mechanisms that achieve spatial clustering of genes subject to common regulators.Some genes are located close in chromosome sequence and consequently spatially close.By contrast, some genes that are far apart along chromosome sequence can also become adjacent spatially by forming three-dimensional structures.TFs may actively orchestrate such chromosome structure organization [37,38].
Alternatively, other DNA binding factors such as long non-coding RNAs and transcription initiation complexes can drag associated chromosomal regions together to form enhancer-promoter hub structures.These hub structures may facilitate TF binding and related cooperative regulation such as phase separated molecular assemblies [39].
Functionally, spatial co-localization may contribute to temporally coordinated regulation of a group of genes in eukaryotic cells.This co-localization can be viewed as a further refinement of the SIM network motif first noticed in prokaryotic cells.Spatial colocalization may facilitate simultaneous regulation of local chromosomal environment of these genes, such as DNA methylation and histone modification, and chromosome compaction, all of which affect gene expression activities.Indeed, a recent study on Drosophila embryos shows that a group of genes separated by genomic distance but pulled together by an enhancer element exhibit similar expression fluctuation patterns [40].
Our analysis of the MCF10A data, however, has a number of limitations.In summary, based on an integrated analysis of transcriptome, epigenome, and chromosome 3D structural information we propose a mechanism for concerted regulation of gene groups that can be further evaluated with more systematically measured datasets.
That is, concerted gene regulation can be achieved through a common trans regulator(s) and the spatial co-localization of target genes.This observation further suggests that genes may be spatially organized into functional units, consistent with the hierarchical patterns and long-range interactions revealed by chromosome structure studies [36,41].
The relationship between gene expression and chromosome structure can be better understood by grouping genes into finer HMM classes based on their expression patterns and regulatory elements.

major
in vitro model to study in this work.This cell line has been widely used to study the TGF-β induced epithelial-to-mesenchymal transition (EMT)[1,19] (Fig 1A).Cells were treated with 4 ng/ml TGF-β for 12 hours, 2, 3, 5, 8, 12, and 21 days (Fig 1B).Untreated MCF10A cells showed typical epithelial morphology with tight cell-to-cell adherence.With TGF-β treatment, we observed progressive morphological changes indicating the transformation from epithelial to mesenchymal phenotype.From day 2 to day 5, cells started to show loosened intercellular adherence.After day 5, some cells appeared with expanded cell size and extended long cell axis.With further TGF-β treatment, more cells acquired a spindle-like shape.On day 21, only a small fraction of cells still maintained epithelial morphology and most cells had undergone EMT.Next, we performed RNA-seq studies to uncover changes of gene expression accompanying EMT.At each time point, we harvested cell samples and extracted RNA.The RNA-seq results revealed that about 33% of human genes were differentially expressed upon TGF-β treatment.Principal component analysis (PCA) over these ~ 7000 DE genes showed an expected larger separation between gene expression profiles of samples from different time points than those of replicate samples from the same time point (Fig1C).The global transcriptome change over time reflected in the PCA space was consistent with the gradual morphological change of cells over time and the previous report that TGF-β-induced EMT proceeded through intermediate states[19].Gene classes sharing similar expression patterns and upstream regulators exhibit similar functional characteristicsTo further examine the temporal patterns and functions of the DE genes, we performed hierarchical clustering (HC) analysis.The analysis divided the DE genes into seven HC classes based on similar expression patterns in each (Fig2A)[20].Among the seven HC classes, class I with ~1,700 genes exhibit a monotonically decreasing pattern, and class II of ~2,000 genes exhibit a monotonically increasing pattern.Another two classes III and IV show transient up and transient down dynamics, respectively.The remaining three classes V-VII display wavy dynamic patterns to varying degrees.Gene ontological (GO) analysis (Fig S2) revealed that genes in each class are typically involved in multiple cellular processes.For example, genes in the decreasing class (class I) are related to RNA polymerase I activity and snoRNA binding.These two functions are related to the RNA metabolic process, including ribosomal RNA production, modification, and binding to regulatory factors.The observation that these genes are down-regulated is consistent with histone modification patterns (Fig 3B) than those of the HC classes do (Fig 2B).
within statistical errors (see Materials and Methods for explanation of terms).Instead, the  !values of more than half of HMM classes were not significantly higher than those of DE gene and all human gene controls.The upper left panel of Fig 4B shows HMM class C23 as an example.Only five HMM classes showed statistically significant increases of  !values over controls (although the increases are small) within the first pair of bins (≤ 125 kb), indicating relative accumulations of genes from the same HMM class; one of them (HMM class C24) is shown in Fig 4B upper right panel.

[ 18 ]
Fig 3A).As expected, Fig S4C shows that the extent of spatial clustering of genes within each class is reduced as compared to those shown in Fig 4D.
While we performed RNA-seq analysis of MCF10A cells at a number of time points during TGF-β treatment, the lack of simultaneous time-course Hi-C and epigenomic data prevented us from analyzing how spatial clustering may change dynamically upon the change of gene expression status.In addition, having the RNA-seq, Hi-C and epigenomic datasets obtained from different labs also raises a concern of potential cell line drifting during culture.It is desirable to have an integrated set of parallel RNA-seq, epigenomic and Hi-C measurements from the same batch of cells, similar to how the ESC differentiation was studied by Bonev et al. [13] but at more time points.Together with the gene regulatory network analysis, such datasets would permit finer clustering and identifying gene groups that each contains multiple spatially clustered, co-regulated and functionally related genes, and examining to what extent these units are either cell type specific or conserved among different cell types.

Figure 2 TGF
Figure 2 TGF-β induced gene expression changes show distinct temporal patterns.

Figure 3
Figure 3 Genes clustered based on both expression patterns and key transcription

Figure 4 1 -
Figure 4 Genes with similar expression patterns and controlled by the same up-

Figure 5
Figure 5 Mouse ESC-CN system shows a similar enhanced tendency of physical