Multidimensional scaling methods can reconstruct genomic DNA loops using Hi-C data properties

Abstract

This paper proposes multidimensional scaling (MDS) applied to high-throughput chromosome conformation capture (Hi-C) data on genomic interactions to visualize DNA loops. Currently, the mechanisms underlying the regulation of gene expression are poorly understood, and where and when DNA loops are formed remains undetermined. Previous studies have focused on reproducing the entire three-dimensional structure of chromatin; however, identifying DNA loops using these data is time-consuming and difficult. MDS is an unsupervised method for reconstructing the original coordinates from a distance matrix. Here, MDS was applied to high-throughput chromosome conformation capture (Hi-C) data on genomic interactions to visualize DNA loops. Hi-C data were converted to distances by taking the inverse to reproduce loops via MDS, and the missing values were set to zero. Using the converted data, MDS was applied to the log-transformed genomic coordinate distances and this process successfully reproduced the DNA loops in the given structure. Consequently, the reconstructed DNA loops revealed significantly more DNA-transcription factor interactions involved in DNA loop formation than those obtained from previously applied methods. Furthermore, the reconstructed DNA loops were significantly consistent with chromatin immunoprecipitation followed by sequencing (ChIP-seq) peak positions. In conclusion, the proposed method is an improvement over previous methods for identifying DNA loops.

1 Introduction

Gene expression is regulated by one- and three-dimensional chromatin structures [1], which can vary between cells [2]. DNA loops are formed via interactions between proximal promoters and distal enhancers or insulators [3]. Although the effect of these DNA loops on gene expression is well established, the detailed mechanism underlying gene expression regulation is poorly understood. In particular, diseases caused by genetic mutations, such as cancer, are promoted by irregular chromatin structures [4, 5]. Determining when and where DNA loops form, and developing methods to visualize them, are therefore vital for disease prevention. High-throughput chromosome conformation capture (Hi-C) examines genomic interactions by determining the number of genomic contacts [3]; however, it presents numerous problems. Hi-C data are noisy [6], and most datasets represent the average of an entire cell population because single-cell Hi-C data are uncommon owing to the cost of the method. In addition, the contact frequency between distant coordinates that form DNA loops is typically underestimated relative to the contact frequency between initially close coordinates [7].

These aspects complicate genome structure prediction. Previous studies have reproduced the entire 3D genome structure from noisy Hi-C data using two methods: 1) The first involves converting Hi-C contact frequencies into distance data using an arbitrary function and then applying multidimensional scaling (MDS) or t-distributed stochastic neighbor embedding (t-SNE). The 3D genome structure is then reproduced by considering the Poisson distribution and other factors in the contact frequencies and using the Markov chain Monte Carlo method to ensure consistency. 2) The stoHi-C method reproduces the genomic 3D structure by applying t-SNE to yeast cell Hi-C data [8]. As a result, a genomic structure similar to the yeast cell chromosome map is obtained. In addition, the miniMDS method divides high-resolution Hi-C data into multiple parts and applies MDS to low-resolution data [9]. Then, by integrating these data, a robust chromosome structure is produced. However, these methods have only been applied to reproduce whole-genome 3D structures, and further biological verification has not yet been performed. In previous studies, the 3D structure was only verified by determining the correlation between the input (Hi-C contact frequency or transformed distance data) and the output (3D genome coordinates). Therefore, this study presents a method of reproducing genomic 3D structures solely based on DNA loop structure. Several methods, such as HiCCUPS [10] and cLoops [11], have been proposed to directly call DNA loops instead of reconstructing the genome structures. However, the genome structure must be visualized to investigate the physical interactions of genomes that cannot be understood using raw Hi-C data or loop-only calling methods. In addition, the method proposed here is expected to improve our understanding of the relationship between the dynamic state of complex genomes and genomic functions, such as gene expression regulation.

Previously, missing Hi-C data were replaced with the average Hi-C contact frequency at the same base distance. However, this approach is ineffective for representing the genome structure [7]. Here, I demonstrate that the chromatin structure can be reproduced by setting the missing values to zero. To the best of my knowledge, this is the first study to apply MDS to Hi-C data to reproduce DNA loop-specific genomic structures without determining missing values. DNA loop reproduction is robust and missing values can be disregarded by exploiting the ability of MDS to reproduce positions relative to that of Hi-C data. Compared to the results of previous methods, the findings of this study revealed significantly more transcription factors involved in loop formation. The results are also significantly consistent with the ChIP-seq peaks and biological findings. The proposed method of reproducing DNA loops for Hi-C data, which vary based on the experimental specifications, is expected to provide a basis to elucidate the mechanisms underlying the transcription and organization of 3D chromatin structures.

2 Materials and methods

2.1 Data

Hi-C datasets measure the physical interactions of genomes. In this study, the following seven representative Hi-C datasets were used (Table 1).

Table 1. Information on the seven Hi-C datasets analyzed in this study.

https://doi.org/10.1371/journal.pone.0289651.t001

All data were retrieved from the Gene Expression Omnibus (GEO) database [12]. In this study, all .hic files were loaded using the R package “straw” [13] with vanilla coverage (VC) normalization.

2.1.1 GSE201353.

The GSE201353 dataset includes Hi-C data collected at eight time points with a resolution of 10,000 bp from quiescent macrophages derived from the human THP-1 cell line, sequenced using Illumina NovaSeq 6000 (Illumina Inc., San Diego, CA, USA) (GSM6061759 to GSM6061798) [14]. Reed et al. investigated the interrelationship between 3D chromatin structure and transcription [14]. These data were collected after treatment with LPS/IFNg for 0, 0.5, 1, 1.5, 2, 4, 6, and 24 h.

2.1.2 GSE141067.

The GSE141067 database includes Hi-C data collected at eight time points with a resolution of 50,000 bp from human U2OS osteosarcoma cells [15]. Kang et al. investigated histone modifications and long-range chromosome interactions after mitosis [15]. These time series data were collected during the cell cycle at 0 min (metaphase), 35 min (anaphase/telophase), 60 min (cytokinesis), and 90, 120, 180, 240, and 360 min (G1) (GSM4194449 to GSM4194464) using Illumina NovaSeq 6000.

2.1.3 GSE149103.

The GSE149103 database includes Hi-C data with a resolution of 10,000 bp from three different pancreatic cell types: immortalized cells (GSM4490488), PANC-1 (GSM4490510), and Capan-1 (GSM4490532), which were assessed using HiSeq X Ten [16]. Ren et al. established that chromatin loops were significantly altered and associated with epigenetic changes in metastatic pancreatic cancer cells [16].

2.1.4 GSE160235.

The GSE160235 database includes Hi-C data collected at three time points with a resolution of 10,000 bp using a colorectal adenocarcinoma cell line. The control and RNAPII-degron cells at the post-mitotic phase, G2 phase, and transition to G1 phase were assessed using Illumina NovaSeq 6000 [17]. Zhang et al. concluded that RNA polymerase II is required for chromatin reorganization [17].

2.1.5 GSE167150.

The GSE167150 database includes Hi-C data for breast cancer cell lines with a resolution of 10,000 bp. These data were obtained for six breast cancer subtypes, including triple-negative breast cancer (TNBC) and normal cells, using Illumina NovaSeq 6000 [18]. Kim et al. found that, compared to the other five breast cancer subtypes, TNBC has a more rapid progression, disrupted chromatin structure, and tissue-specific loops [18].

2.1.6 GSE168470.

The GSE168470 database includes Hi-C data for lymphoma with a resolution of 10,000 bp. These data were obtained for lymphoma cancer subtypes, including WSU-DLCL2, DLBCL, and germinal center B-cells, using Illumina HiSeq 2500 and Illumina NextSeq 500 [19]. Sungalee et al. found that H3K27ac dynamics may regulate genomic interactions and maintain oncogene expression [19].

2.1.7 GSE143465.

The GSE143465 database includes Hi-C data for renal cancer with a resolution of 10,000 bp. These data were obtained for renal cancer subtypes, including N-IDR, WT/A9, and N-IDR FS/A9, using Illumina NovaSeq 6000 [20]. Ahn et al. demonstrated that phase-separated NUP98-HOXA9 induces chromatin loops in a proto-oncogene [20].

2.2 Methods

Fig 1 shows the flowchart of analyses performed in this study.

Fig 1. The flowchart of analyses performed in this study.

https://doi.org/10.1371/journal.pone.0289651.g001

2.2.1 Missing values in Hi-C data.

The missing values in the Hi-C datasets were set to zero to differentiate between DNA with and without loops.

2.2.2 Preprocessing to clearly represent DNA loops.

The Hi-C contact frequency between distant coordinates that form DNA loops is underestimated compared to the Hi-C contact frequency between initially close coordinates [7]. Thus, the only important preprocessing step was to reduce the gap between Hi-C contact frequencies by multiplying them by the weight in Eq (1). (1) where i and j are defined as the bin coordinates obtained by dividing the genome coordinates by the resolution and dij represents the Hi-C contact frequency. The Hi-C data with a resolution of 10 kbp had large contact frequencies for coordinates less than 50 kbp apart; therefore, 50 kbp was used as the cutoff. Eq (1) is a function of nucleotide distance and makes the Hi-C contact frequency between distant coordinates that form DNA loops more pronounced (Figs 2 and 3).

Fig 2. Plot of the distance between coordinates versus the natural logarithm of Hi-C contact frequency before multiplying by the weight (chromosome 5, 0–50,000 kbp, series GSM6061774).

https://doi.org/10.1371/journal.pone.0289651.g002

Fig 3. Plot of the distance between coordinates versus the natural logarithm of Hi-C contact frequency after multiplying by the weight (chromosome 5, 0–50,000 kbp, series GSM6061774).

https://doi.org/10.1371/journal.pone.0289651.g003

For Hi-C contact frequencies between coordinates more than 50 kbp apart, the gap in the Hi-C data was filled by multiplying by the natural logarithm of the distance. Adjusting the cutoff value within 50–70 kbp did not significantly change the structure of the genome. However, setting the cutoff to 20 kbp altered the genome structure because a Hi-C contact frequency below 20 kbp will be low because of the natural logarithm. Here, setting the cutoff value to 50 kbp reproduces a genome structure that represents the DNA loop. However, there is no strict biological basis for this choice.
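The weighting step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in the study: Eq (1) is not reproduced in this text, so the function name `weight_contacts`, the use of base pairs as the unit inside the logarithm, and leaving contacts at or below the cutoff unchanged are all assumptions.

```python
import math

def weight_contacts(contacts, resolution=10_000, cutoff=50_000):
    """Multiply contact counts beyond `cutoff` by ln(genomic distance).

    `contacts` maps bin pairs (i, j) to a Hi-C contact frequency.
    Assumption: the distance is measured in base pairs, and pairs at or
    below the cutoff are left unchanged.
    """
    weighted = {}
    for (i, j), freq in contacts.items():
        distance = abs(i - j) * resolution  # genomic distance in bp
        if distance > cutoff:
            weighted[(i, j)] = freq * math.log(distance)
        else:
            weighted[(i, j)] = freq
    return weighted


# Toy example: a short-range contact and a long-range (loop-like) contact.
toy = {(0, 2): 120.0, (0, 30): 4.0}
result = weight_contacts(toy)
```

In this toy input the short-range count is untouched, whereas the long-range count is scaled up, which narrows the gap between the two.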

Various other functions were also considered. For instance, neither multiplying the Hi-C contact frequencies by a constant (Eq (2)) nor using the original Hi-C data reproduced a genomic structure consistent with the biological findings (S1 File). (2)

If the processed Hi-C contact frequency now has a large value, the distance between genomic coordinates is small. Therefore, the reciprocal of the processed contact frequency was taken to treat it as a distance. However, because the missing values are zero, 1 was added to all values, and the inverse of the result was used as the distance between genome coordinates. Moreover, the diagonal component was set to zero because it corresponds to the contact frequency of a coordinate with itself: $D_{ij} = 1/(d'_{ij} + 1)$ for $i \neq j$ and $D_{ij} = 0$ for $i = j$, (3) where $d'_{ij}$ denotes the processed Hi-C contact frequency.
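The conversion from contact frequency to distance described in this paragraph (missing values as zero, add 1, take the reciprocal, zero the diagonal) can be illustrated as follows. The function name `contacts_to_distances` and the dictionary-based input are hypothetical choices made for this sketch.

```python
def contacts_to_distances(weighted, n_bins):
    """Build a symmetric distance matrix from processed contact frequencies.

    Missing pairs count as frequency 0, so their distance is 1 / (0 + 1) = 1,
    the largest possible value; the diagonal is set to 0.
    """
    dist = [[1.0] * n_bins for _ in range(n_bins)]
    for (i, j), freq in weighted.items():
        d = 1.0 / (freq + 1.0)
        dist[i][j] = d
        dist[j][i] = d
    for i in range(n_bins):
        dist[i][i] = 0.0
    return dist


# Bins 0 and 2 have no recorded contact, so they end up at distance 1.
D = contacts_to_distances({(0, 1): 9.0, (1, 2): 3.0}, n_bins=3)
```

Adding 1 before inversion keeps every distance finite and bounded by 1, so unobserved pairs are simply treated as maximally distant.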

The ‘PHi-C’ method reproduces genome structures by modeling the genome (a single chromosome) as a simple ‘linked bead’ macromolecule [21]. PHi-C accurately estimates genome structure in four dimensions but requires dense Hi-C data. Here, a versatile model was desired; hence, the reciprocal of the interaction frequency between two particles in the model (i.e., between two bins or restriction fragments in the genome) was adopted as the distance. The exponent applied to the frequency before taking the inverse is debatable, but it was set to 1, as used in fractal globule models [22].

2.2.3 Multidimensional scaling.

The MDS construction method finds coordinates in the original space from similarity or distance data [23]. Although Euclidean distance is generally used as the measure of proximity, weighted Euclidean distance, Manhattan distance, and Minkowski distance are also commonly used. Hi-C data have a larger contact frequency when the genomic coordinates are closer; however, as MDS deals with distance data, the Hi-C data must be transformed.

After taking the reciprocal of the Hi-C data, the closer the genomic coordinates are to each other, the smaller the value, and the farther apart they are, the larger the value. Furthermore, the diagonal component was assigned a value of zero.

The MDS algorithm is briefly described below. When the number of elements is N, the input is an N × N square matrix D, where $D_{ij}$ is the distance between the i-th and j-th genomic coordinates. Then, $x_n$ was defined as each genomic coordinate to be reproduced as MDS output. The original genome structure X is obtained from the distance matrix D of the genome coordinates as follows. $D^{(2)}$ was defined as the matrix whose components are the squares of the components of D, and $D^{(2)}$ is multiplied from both sides by the N × N centering matrix: $K = -\frac{1}{2}\left(I - \frac{1}{N}\mathbf{1}\mathbf{1}^{\top}\right) D^{(2)} \left(I - \frac{1}{N}\mathbf{1}\mathbf{1}^{\top}\right) = X^{*}X^{*\top}$ (4) where $X^{*}$ is the centered data matrix, K is the N × N inner product matrix obtained from the centered data matrix, I is the identity matrix, and $\mathbf{1}$ is the all-ones vector. Next, K is decomposed by eigenvalue decomposition as follows: $K = V A V^{\top}$ (5) where V is a unitary matrix and A is the diagonal eigenvalue matrix. Then, from Eqs (4) and (5), $X = V A^{1/2}$ (6)

The desired genomic structure X is obtained from Eq (6). Eigenvalues and eigenvectors are arranged in descending order of the eigenvalues. Empirically, the second and third eigenvectors v2 and v3 obtained from the eigenvalue decomposition frequently retain the original structure. This was verified in my previous paper with a simple simulation [7]. Therefore, after eigenvalue decomposition, v2 and v3 of the eigenvector matrix V were chosen as the genomic structure. Here, this structure is defined as a tentative chromosome structure (Fig 4).
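The classical MDS steps above (squaring the distances, double centering, eigenvalue decomposition, and scaling the eigenvectors by the square roots of the eigenvalues) can be sketched without external libraries. This is a minimal illustration rather than the code used in the study: the function names are hypothetical, and power iteration with deflation is swapped in for a full eigensolver, which is adequate only when the leading eigenvalues are positive and well separated.

```python
import math

def double_center(dist):
    """Return K = -1/2 * H * D^(2) * H for a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(row) / n for row in sq]  # equals column means (symmetry)
    total_mean = sum(row_mean) / n
    return [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
             for j in range(n)] for i in range(n)]

def top_eigenpairs(K, k, iters=500):
    """Leading k eigenpairs of symmetric K by power iteration with deflation."""
    n = len(K)
    A = [row[:] for row in K]
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(n)]  # fixed, non-symmetric start
        for _step in range(iters):
            w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(n))
                  for i in range(n))
        pairs.append((lam, v))
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                A[i][j] -= lam * v[i] * v[j]
    return pairs

def classical_mds(dist, dims=2):
    """Coordinates X = V * A^(1/2) restricted to the leading `dims` axes."""
    pairs = top_eigenpairs(double_center(dist), dims)
    return [[math.sqrt(max(lam, 0.0)) * v[i] for lam, v in pairs]
            for i in range(len(dist))]


# Toy check: four corners of a 3 x 1 rectangle.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 1.0), (3.0, 1.0)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords = classical_mds(dist, dims=2)
```

For exact Euclidean input such as this rectangle, the recovered coordinates reproduce the pairwise distances up to rotation and reflection. To mirror the choice of the second and third eigenvectors, one could request `dims=3` and keep the last two columns.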

Fig 4. Plot of the second and third eigenvectors.

Chromosome 5 0–50,000 kbp by series GSM6061774 after MDS. The blue line represents the root of the DNA loop, which is defined in Section 2.2.4.

https://doi.org/10.1371/journal.pone.0289651.g004

2.2.4 Criteria for DNA loop selection.

The criteria for selecting DNA loops from the tentative chromosomes are described below. First, the Euclidean distance between consecutive points in the plane spanned by v2 and v3 was calculated as follows: $E_i = \sqrt{(v_{2,i+1} - v_{2,i})^2 + (v_{3,i+1} - v_{3,i})^2}$ (7) where i is the locus divided by the resolution and $E_i$ is the distance between the genome positions at i and i + 1. The length of the loops depends on the cell type but averaged 170–200 kb in quiescent macrophages [14]. Therefore, the average of ten points (100 kbp) of $E_i$ was calculated and used as the distance. Since this process reduced the total data by 100 kbp, the missing points were represented by averages taken over 90, 80, 70, 60, and 50 kbp from both sides as follows. (8) where i represents the coordinates divided by the resolution and N is the number of elements of i. In addition, j and k are elements used to complement the reduced i by taking the average of $E_i$. This smoothing was performed at 100 kbp, which is neither too long nor too short. Further investigation is needed when dealing with other resolutions. The criteria for obtaining coordinates as DNA loops are as follows. The threshold value was set at two times the average distance between genomic coordinates E. The coordinates above the threshold were considered the regions that form DNA loops.
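The selection procedure in this paragraph (consecutive distances, smoothing over ten bins, and a threshold of twice the mean) can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, and the edge handling is a simplification in which the averaging window shrinks toward five bins near both ends, since Eq (8) is not reproduced in this text.

```python
import math

def step_distances(v2, v3):
    """E_i: Euclidean distance between consecutive points i and i + 1."""
    return [math.hypot(v2[i + 1] - v2[i], v3[i + 1] - v3[i])
            for i in range(len(v2) - 1)]

def moving_average(values, window=10):
    """Average over `window` bins; the window shrinks near both ends."""
    half = window // 2
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out

def loop_regions(smoothed, factor=2.0):
    """Contiguous runs of bins whose value exceeds factor * mean."""
    threshold = factor * sum(smoothed) / len(smoothed)
    regions, start = [], None
    for i, value in enumerate(smoothed):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(smoothed) - 1))
    return threshold, regions


# Toy profile: two bins stand out against a flat background.
threshold, regions = loop_regions([1, 1, 1, 10, 10, 1, 1, 1, 1, 1])
```

Each returned pair gives the first and last bin index of a candidate loop region; multiplying by the Hi-C resolution converts the indices back to genomic coordinates.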

However, as enhancers and promoters are located at the ends of DNA loops, subthreshold coordinates were also evaluated. Therefore, in this study, the following criteria were established (Fig 1):

Formally, let T be the threshold. A coordinate i, where i is the genomic coordinate divided by the Hi-C resolution, whose smoothed distance is greater than T is regarded as part of a DNA loop, whereas a coordinate whose smoothed distance is smaller than T is not. Proceeding outward from both sides of the DNA loop in the directions i + j and i − j (j is any natural number) through coordinates whose values are smaller than T, the roots of the DNA loop are defined on each side. Specifically, the location of the local minimum was set as the end of the DNA loop.

Therefore, the coordinates of the blue points in Fig 5 are considered enhancer and promoter regions.
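One way to read the root criterion is sketched below: starting from the first and last bins of a loop region, walk outward while the smoothed distance keeps decreasing and stop at the nearest local minimum on each side. The function name `loop_roots` and this exact stopping rule are assumptions made for illustration, because part of the formal definition is not recoverable from this text.

```python
def loop_roots(smoothed, start, end):
    """Walk outward from a loop (start..end) to the nearest local minima.

    Returns the bin indices of the left and right roots.
    """
    left = start
    while left > 0 and smoothed[left - 1] <= smoothed[left]:
        left -= 1
    right = end
    while right < len(smoothed) - 1 and smoothed[right + 1] <= smoothed[right]:
        right += 1
    return left, right


# Toy profile: the loop occupies bins 4-5; minima lie at bins 2 and 7.
profile = [3, 2, 1, 2, 9, 9, 4, 2, 3]
roots = loop_roots(profile, start=4, end=5)
```

Under this reading, the two returned bins correspond to the candidate enhancer and promoter positions at the base of the loop.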

Fig 5. Distance plot between coordinates of chromosome 5 0–50,000 kbp by series GSM6061774.

The red line is the threshold, and the blue points are the roots of the DNA loop, which is defined in Section 2.2.4.

https://doi.org/10.1371/journal.pone.0289651.g005

2.2.5 Enrichment analysis.

Genes in the DNA loops selected by the MDS-based method were obtained using the R package “biomaRt” [24]. These were analyzed using the enrichment analysis software “g:Profiler” [25]. Enrichment analyses evaluate the significance of the gene list in the selected region.

2.2.6 Consistency between DNA loops and ChIP-seq or ATAC-seq peaks.

Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is an experimental technique for mapping DNA-binding proteins and histone modifications [26]. Assay for transposase-accessible chromatin with high-throughput sequencing (ATAC-seq) investigates DNA accessibility using hyperactive Tn5 transposase [27]. ATAC-seq helps identify open chromatin regions and assesses transcription factor occupancy. Because many proteins bind at DNA loops to promote or regulate transcription, and because loop regions are spatially open, the selected DNA loops should coincide with the peaks in the ChIP-seq and ATAC-seq data. Therefore, the DNA loop regions selected by the MDS-based method were checked to determine whether they matched the locations of ChIP-seq or ATAC-seq peaks. When a partial match was found between the position of the top 300 peaks and the DNA loop regions selected by the MDS-based method, the loop was counted. Then, Fisher’s exact probability test was performed to determine whether the selected DNA loops matched the peaks significantly more often than randomly selected loops.
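The significance test described above can be illustrated with a small, dependency-free version of Fisher's exact test on a 2 × 2 table. This is a sketch under assumptions: the layout of the table (selected versus random regions, matching versus not matching a peak), the one-sided alternative, and the toy counts are hypothetical, and the function returns the sample odds ratio, whereas R's `fisher.test` reports a conditional maximum likelihood estimate.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, P(X >= a)) under the hypergeometric null,
    where X is the count in the top-left cell with all margins fixed.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    denom = comb(total, col1)
    upper = min(row1, col1)
    p = sum(comb(row1, x) * comb(total - row1, col1 - x)
            for x in range(a, upper + 1)) / denom
    odds = (a * d) / (b * c) if b * c else float("inf")
    return odds, p


# Hypothetical counts: rows = selected / random regions,
# columns = overlapping a peak / not overlapping a peak.
odds_ratio, p_value = fisher_exact_greater(3, 1, 1, 3)
```

An odds ratio well above 1 together with a small p-value would indicate that the selected regions overlap peaks more often than expected by chance.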

ChIP-seq data for H3K27ac, H3K27me3, and CTCF were analyzed. H3K27ac is an active enhancer marker, CTCF is involved in DNA loops, and H3K27me3 is involved in heterochromatin but is also abundant in DNA loops, where it represses gene expression [28].

3 Results

Eqs (1) and (3) were applied to the Hi-C data to obtain a genome structure that differentiated between the DNA with and without loops. Then, DNA loops obtained through the MDS-based method were selected using Eqs (7) and (8) (S1 File). Genes in the DNA loops were obtained using the R package “biomaRt” [24]. The number of DNA loops obtained by the MDS-based method and genes observed in the selected region in GSM6061774 to GSM6061778 (THP-1 90 min biological replicates 1 to 5) are summarized in Table 2.

Table 2. Proportion of DNA loop regions selected by the multidimensional scaling (MDS)-based method and genes in the DNA loops.

https://doi.org/10.1371/journal.pone.0289651.t002

Salzberg et al. mined four databases and estimated that the number of human genes was between 19,901 and 21,306 [29]. Therefore, 21,306 genes were selected for analysis.

Independent of the replicate, the DNA loop regions cover nearly 10% of the genome and contain nearly 22% of the genes. Genes account for only approximately 1.5% of the DNA in the human genome. Therefore, DNA loop regions selected based on the MDS method should be located in gene-dense regions.

3.1 Enrichment analysis

All genes within the DNA loops selected by the MDS-based method were subjected to enrichment analysis using “g:Profiler” [25].

3.1.1 GSE201353 (macrophages).

Transcription factors related to DNA loop formation were enriched for data collected at all eight time points and five biological replicates. The enrichment analysis results for the data collected at 90 min, when DNA loop formation increases rapidly, are shown in Table 3.

Table 3. Enrichment analysis of Hi-C data collected at 90 min by g:Profiler.

https://doi.org/10.1371/journal.pone.0289651.t003

HOXA is a family of homeobox genes. HOXA13, HOXA1, HOXA5, HOXA6, HOXA7, HOXA9, and HOXA10 are involved in prostate cancer, and they form a DNA loop with a locus that induces prostate cancer and regulates gene transcription [30]. CCAAT-enhancer-binding proteins (CEBPs) promote specific gene expression in many cell types, including hepatocytes, hematopoietic cells, and kidney cells, through interactions with promoters. These proteins recruit core activators and open chromatin structures [31]. CEBPA is a member of the CEBPs; it mobilizes core activators, opens chromatin structures, and carries general transcription factors [32]. SATB1 (SATB Homeobox 1) is known as a global chromatin organizer. SATB1 forms DNA loops by linking matrix attachment regions (MARs) to the nuclear matrix at fixed distances [33]. The high mobility group protein HMGIY is involved in transcription and replication and coordinates the nucleosome and chromatin structure [34, 35]. HMGIY can bend straight DNA and form DNA loops in Cos-7 cells [36].

3.1.2 GSE141139 (osteosarcoma).

Genes in regions that exceeded the data threshold were also used in the enrichment analysis. Transcription factors involved in DNA loop formation were enriched in all data collected at eight time points. The enrichment analysis results for data collected at 90 min, when long-range interactions were completed, are listed in Table 4.

Table 4. Enrichment analysis of Hi-C data collected at 90 min by g:Profiler.

https://doi.org/10.1371/journal.pone.0289651.t004

Estrogen receptor alpha (ERα), one of the two main types of estrogen receptors, is a nuclear receptor activated by estrogen that is involved in transcription activation and DNA binding. This receptor has been clinically implicated in breast, ovarian, and other types of cancers. Distal ERαBS interacts with proximal sites to form chromatin loops in human breast adenocarcinoma cells [37]. Retinoic acid receptor alpha (RARα) regulates transcription in a ligand-dependent manner. Related diseases include acute myeloid leukemia and leukemia [38]. Phosphorylated RARα is located in the R1 and R2 regions of the Cyp26A1 promoter and mobilizes RNA polymerase and TFIIH to form a DNA loop [39]. Histone deacetylase 1 (HDAC1) plays a vital role in regulating eukaryotic gene expression, and mobilization of p300 or HDAC1 to NFκB and AP-1 binding sites promotes DNA loop formation [40]. GATA Zinc Finger Domain Containing 2A (GATAD2A) is a transcriptional repressor that enables protein-polymer adaptor activity. It is mainly responsible for chromatin compaction [41]. Sp3 Transcription Factor (SP3) is a transcription factor that regulates transcription by binding to GC and GT box regulatory elements. In human mammary carcinoma cells, SP3 binds to GC1 and GC2 elements of the topoisomerase IIα promoter, forming a DNA loop that can function either as a transcriptional activator or repressor [42]. Kruppel Like Factor 4 (KLF4) regulates transcription via RNA polymerase II. The expression of KLF4 is highly associated with stemness in human osteosarcomas [43]. T-Box Transcription Factor 2 (TBX2) is the only T-box transcription factor that functions as a transcriptional repressor rather than a transcriptional activator. It is implicated in lung, breast, bone, and pancreatic cancers and in melanoma, and it represses transcription in human fetal kidney HEK293 cells by forming DNA loops in concert with HDAC and PBX1 [44]. Nuclear Factor I C (NFIC) is also known as CCAAT-Binding Transcription Factor. When bound to DNA, it recruits the core activator, which opens the chromatin structure to make room for the general transcription factor [32]. cAMP responsive element binding protein 1 (CREB1) cooperates with CTCF proteins to create complex protein-DNA interactions. It causes transcriptional repression and DNA loop formation in leukemic Jurkat T cells [45]. JunD Proto-Oncogene is a member of the JUN family and protects cells from apoptosis. In human hepatocytes, it binds to the CYP2C9 promoter and forms a DNA loop [46]. Estrogen Receptor 1 (ESR1) is associated with breast, endometrial, and other types of cancer. In normal breast epithelial cells in particular, estrogen stimulation induces the formation of DNA loops involving ESR1 at the 16p11.2 gene cluster [47]. Sp1 Transcription Factor (SP1) is involved in many processes, including cell differentiation, apoptosis, and chromatin remodeling. SP1 cooperates with the transcription factor GATA1 at erythroid-specific promoters in erythroid cells to form DNA loops close to distant enhancers [48].

3.1.3 GSE149103 (pancreatic cancer).

Transcription factors related to DNA loop formation were enriched in three data sets. As a representative, the results of the enrichment analysis for Capan-1 cells are shown in Table 5.

Table 5. Enrichment analysis of Capan-1 Hi-C data by g:Profiler.

https://doi.org/10.1371/journal.pone.0289651.t005

LHM1 is a homeobox transcription factor that cooperates with the GATA protein in mice to facilitate the formation of long-range interactions [49]. OCT4 (POU class 5 homeobox 1) plays a vital role in embryonic development and stem cell pluripotency. OCT4 forms a cohesin-dependent enhancer-promoter loop in embryonic cells and trimethylates H3K4 at the SOX-17 locus to activate the SOX-17 promoter [50].

3.1.4 GSE160235 (rectal cancer).

Transcription factors related to DNA loop formation were enriched in all datasets. As a representative, the enrichment analysis results for TOP2A2B_bothcontrol_30 are shown in Table 6.

Table 6. Enrichment analysis of TOP2A2B_bothcontrol_30 Hi-C data by g:Profiler.

https://doi.org/10.1371/journal.pone.0289651.t006

In breast cancer cells, PARP binds to the base lesion region of the MAR and is involved in the chromatin loop [51]. IPF1, also known as PDX1 (Pancreatic and Duodenal Homeobox 1), is an insulin-promoting factor. In pancreatic cells, Pdx1 and BETA2/NeuroD1 form a DNA loop in insulin activity [52]. FOXP3 is a master transcription factor for regulatory T cells (Treg) and cooperates with NFAT to form long-range chromatin interactions in mouse Treg cells [53].

3.1.5 GSE167150 (breast cancer).

Transcription factors involved in DNA loop formation were enriched in all datasets. As a representative, the enrichment analysis results of BT549 are shown in Table 7.

Table 7. Enrichment analysis of BT549 Hi-C data by g:Profiler.

https://doi.org/10.1371/journal.pone.0289651.t007

3.1.6 GSE168470 (lymphoma).

Transcription factors related to DNA loop formation were enriched in all datasets. As a representative, the results of the enrichment analysis for Patient_1_merged_rep12 are shown in Table 8.

Table 8. Enrichment analysis of Patient_1_merged_rep12 by g:Profiler.

https://doi.org/10.1371/journal.pone.0289651.t008

HFH2, also known as FOXD3, is involved in pluripotent cell development. FOXD3 binds to enhancer sites, recruits BRG1, the ATPase of the SWI/SNF chromatin remodeling complex, and induces chromatin looping [54].

3.1.7 GSE143465 (kidney).

Transcription factors related to DNA loop formation were enriched throughout the data. As a representative, the results of the enrichment analysis for HEK_HiC_NUP_IDR_FS_A9_1.1 are shown in Table 9.

Table 9. Enrichment analysis of HEK_HiC_NUP_IDR_FS_A9_1.1 by g:Profiler.

https://doi.org/10.1371/journal.pone.0289651.t009

3.2 Consistency between DNA loops and peaks of ChIP-seq and ATAC-seq

ChIP-seq or ATAC-seq data were analyzed in conjunction with the Hi-C data, and the top 300 peaks were checked to determine whether they matched the enhancer-promoter regions in this study. Fisher’s exact probability test was performed to determine whether the enhancer-promoter regions matched the peaks significantly more often than random selections. A comparison of the enhancer-promoter regions by GSM6061749_LIMA_ChIP_h3k27ac_THP1_WT_LPIF_S_0090_1.1.1_peaks and GSM6061774 (90 min Hi-C data for macrophages) is shown in Table 10.

Table 10. Consistency between DNA loops and ChIP-seq peaks.

https://doi.org/10.1371/journal.pone.0289651.t010

In GSM6061774, 318,380,000 bp were selected as loop regions, of which 2,410,018 bp were partially matched to peaks of the ChIP-seq data. The expected value was determined by multiplying the proportion of the top 300 peak regions (peak regions divided by total genomic regions) by the length of the loop region. The p-value for Fisher’s exact probability test was 2.2 × 10^−16, and the odds ratio was 39.33, indicating significant agreement. Other data were also in significant agreement (S1 File).

4 Discussion

A previous study revealed that the MDS-based method could reproduce DNA loops more prominently than existing methods [7]. Therefore, the results presented here (obtained by substituting zero for missing values) were compared with those from the previous method. The previous method assigns the average contact frequency at each genomic coordinate distance to the missing values. Thus, it could not express the marked difference between coordinates with and without loop formation. Surprisingly, the current method could clearly distinguish between DNA with and without loops. Chromosome distance plots based on the present method and those based on the previous method are shown in Figs 6 and 7.

Fig 6. Distance plot between coordinates of chromosome 5 50,000–100,000 kbp of series GSM6061774 by the MDS-based method.

https://doi.org/10.1371/journal.pone.0289651.g006

Fig 7. Distance plot between coordinates of chromosome 5 50,000–100,000 kbp of series GSM6061774 by the previous MDS-based method.

https://doi.org/10.1371/journal.pone.0289651.g007

The fractions of the whole genome selected as DNA loop regions by the previous method, miniMDS [9], and the current method were 145,900,000 bp / 3,186,000,000 bp = 0.0458, 62,840,000 bp / 3,186,000,000 bp = 0.0197, and 316,150,000 bp / 3,186,000,000 bp = 0.0992, respectively. The numbers of DNA loops were 1,822, 2,235, and 3,860, and the average lengths were 77,650 bp, 28,090 bp, and 82,520 bp, respectively. The results of the enrichment analysis using the previous method are summarized in Table 11.

Table 11. Enrichment analysis of Hi-C data collected at 90 min by the previous method.

https://doi.org/10.1371/journal.pone.0289651.t011

The results for the proposed MDS-based method are shown in Table 3. Moreover, the transcription factors involved in DNA loop formation were not enriched by miniMDS. Together, these results indicate that the number of enriched transcription factors obtained by the present method was higher than that obtained by the previous method.

The current study has certain limitations. First, MDS was used to reproduce the genome structure, which becomes computationally expensive as the resolution of the Hi-C data increases. Second, if the Hi-C data set contains time series data, time-independent genomic structures must be identified beforehand. Third, only bulk Hi-C data were analyzed; applying the method to single cells to observe changes in genome structure remains a challenge for future studies.

The primary goal of this study was to identify a method for reconstructing genomic structures that differentiates DNA with loops from DNA without loops. As indicated in the results, the validity of the proposed method is supported by three main findings: (i) the number of genes in DNA loop regions was large; (ii) transcription factors involved in DNA loop formation were enriched in the enrichment analysis by g:Profiler; and (iii) the positions of the selected DNA loop regions were significantly consistent with the ChIP-seq and ATAC-seq peaks. To the best of my knowledge, no existing study has confirmed the consistency of DNA loops to this extent. Therefore, the consistency of the three findings was confirmed in a data-driven manner, which is useful for reproducing DNA loops.

Acknowledgments

I am grateful to my supervisor, Prof. Y-h. Taguchi, for valuable discussions.

References

  1. Matthews K. DNA looping. Microbiological reviews. 1992;56(1):123–136. pmid:1579106
  2. Nagano T, Lubling Y, Stevens TJ, Schoenfelder S, Yaffe E, Dean W, et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature. 2013;502(7469):59–64. pmid:24067610
  3. Lieberman-Aiden E, Van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326(5950):289–293. pmid:19815776
  4. Kloetgen A, Thandapani P, Ntziachristos P, Ghebrechristos Y, Nomikou S, Lazaris C, et al. Three-dimensional chromatin landscapes in T cell acute lymphoblastic leukemia. Nature genetics. 2020;52(4):388–400. pmid:32203470
  5. Díaz N, Kruse K, Erdmann T, Staiger AM, Ott G, Lenz G, et al. Chromatin conformation analysis of primary patient tissue using a low input Hi-C method. Nature communications. 2018;9(1):1–13. pmid:30498195
  6. Yaffe E, Tanay A. Probabilistic modeling of Hi-C contact maps eliminates systematic biases to characterize global chromosomal architecture. Nature genetics. 2011;43(11):1059–1065. pmid:22001755
  7. Ishibashi R, Taguchi Y. Identification of Enhancers and Promoters in the Genome by Multidimensional Scaling. Genes. 2021;12(11):1671. pmid:34828279
  8. MacKay K, Kusalik A. StoHi-C: Using t-distributed stochastic neighbor embedding (t-SNE) to predict 3D genome structure from Hi-C Data. bioRxiv. 2020.
  9. Rieber L, Mahony S. miniMDS: 3D structural inference from high-resolution Hi-C data. Bioinformatics. 2017;33(14):i261–i266. pmid:28882003
  10. Durand NC, Shamim MS, Machol I, Rao SS, Huntley MH, Lander ES, et al. Juicer provides a one-click system for analyzing loop-resolution Hi-C experiments. Cell systems. 2016;3(1):95–98. pmid:27467249
  11. Cao Y, Chen Z, Chen X, Ai D, Chen G, McDermott J, et al. Accurate loop calling for 3D genomic data with cLoops. Bioinformatics. 2020;36(3):666–675. pmid:31504161
  12. Edgar R, Domrachev M, Lash AE. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic acids research. 2002;30(1):207–210. pmid:11752295
  13. Durand NC, Robinson JT, Shamim MS, Machol I, Mesirov JP, Lander ES, et al. Juicebox provides a visualization system for Hi-C contact maps with unlimited zoom. Cell systems. 2016;3(1):99–101. pmid:27467250
  14. Reed KS, Davis ES, Bond ML, Cabrera A, Thulson E, Quiroga IY, et al. Temporal analysis suggests a reciprocal relationship between 3D chromatin structure and transcription. Cell reports. 2022;41(5):111567. pmid:36323252
  15. Kang H, Shokhirev MN, Xu Z, Chandran S, Dixon JR, Hetzer MW. Dynamic regulation of histone modifications and long-range chromosomal interactions during postmitotic transcriptional reactivation. Genes & development. 2020;34(13-14):913–930. pmid:32499403
  16. Ren B, Yang J, Wang C, Yang G, Wang H, Chen Y, et al. High-resolution Hi-C maps highlight multiscale 3D epigenome reprogramming during pancreatic cancer metastasis. Journal of Hematology & Oncology. 2021;14(1):1–19. pmid:34348759
  17. Zhang S, Übelmesser N, Josipovic N, Forte G, Slotman JA, Chiang M, et al. RNA polymerase II is required for spatial chromatin reorganization following exit from mitosis. Science advances. 2021;7(43):eabg8205. pmid:34678064
  18. Kim T, Han S, Chun Y, Yang H, Min H, Jeon SY, et al. Comparative characterization of 3D chromatin organization in triple-negative breast cancers. Experimental & Molecular Medicine. 2022; p. 1–16. pmid:35513575
  19. Sungalee S, Liu Y, Lambuta RA, Katanayeva N, Donaldson Collier M, Tavernari D, et al. Histone acetylation dynamics modulates chromatin conformation and allele-specific interactions at oncogenic loci. Nature Genetics. 2021;53(5):650–662. pmid:33972799
  20. Ahn JH, Davis ES, Daugird TA, Zhao S, Quiroga IY, Uryu H, et al. Phase separation drives aberrant chromatin looping and cancer development. Nature. 2021;595(7868):591–595. pmid:34163069
  21. Shinkai S, Nakagawa M, Sugawara T, Togashi Y, Ochiai H, Nakato R, et al. PHi-C: deciphering Hi-C data into polymer dynamics. NAR genomics and bioinformatics. 2020;2(2):lqaa020. pmid:33575580
  22. Serra F, Di Stefano M, Spill YG, Cuartero Y, Goodstadt M, Baù D, et al. Restraint-based three-dimensional modeling of genomes and genomic domains. FEBS letters. 2015;589(20):2987–2995. pmid:25980604
  23. Cox MA, Cox TF. Multidimensional scaling. In: Handbook of data visualization. Springer; 2008. p. 315–347.
  24. Durinck S, Spellman PT, Birney E, Huber W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nature protocols. 2009;4(8):1184–1191. pmid:19617889
  25. Raudvere U, Kolberg L, Kuzmin I, Arak T, Adler P, Peterson H, et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic acids research. 2019;47(W1):W191–W198. pmid:31066453
  26. Park PJ. ChIP-seq: advantages and challenges of a maturing technology. Nature reviews genetics. 2009;10(10):669–680. pmid:19736561
  27. Buenrostro JD, Wu B, Chang HY, Greenleaf WJ. ATAC-seq: a method for assaying chromatin accessibility genome-wide. Current protocols in molecular biology. 2015;109(1):21–29. pmid:25559105
  28. Cai Y, Zhang Y, Loh YP, Tng JQ, Lim MC, Cao Z, et al. H3K27me3-rich genomic regions can function as silencers to repress gene expression via chromatin interactions. Nature communications. 2021;12(1):719. pmid:33514712
  29. Salzberg SL. Open questions: How many genes do we have? BMC biology. 2018;16(1):1–3.
  30. Luo Z, Rhie SK, Lay FD, Farnham PJ. A prostate cancer risk element functions as a repressive loop that regulates HOXA13. Cell reports. 2017;21(6):1411–1417. pmid:29117547
  31. Ramji DP, Foka P. CCAAT/enhancer-binding proteins: structure, function and regulation. Biochemical Journal. 2002;365(3):561–575. pmid:12006103
  32. Miller M, Shuman JD, Sebastian T, Dauter Z, Johnson PF. Structural basis for DNA recognition by the basic region leucine zipper transcription factor CCAAT/enhancer-binding protein α. Journal of Biological Chemistry. 2003;278(17):15178–15184. pmid:12578822
  33. Galande S, Purbey PK, Notani D, Kumar PP. The third dimension of gene regulation: organization of dynamic chromatin loopscape by SATB1. Current opinion in genetics & development. 2007;17(5):408–414. pmid:17913490
  34. Catez F, Hock R. Binding and interplay of HMG proteins on chromatin: lessons from live cell imaging. Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms. 2010;1799(1-2):15–27. pmid:20123065
  35. Reeves R. Nuclear functions of the HMG proteins. Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms. 2010;1799(1-2):3–14. pmid:19748605
  36. Vogel B, Löschberger A, Sauer M, Hock R. Cross-linking of DNA through HMGA1 suggests a DNA scaffold. Nucleic acids research. 2011;39(16):7124–7133. pmid:21596776
  37. Fullwood MJ, Liu MH, Pan YF, Liu J, Xu H, Mohamed YB, et al. An oestrogen-receptor-α-bound human chromatin interactome. Nature. 2009;462(7269):58–64. pmid:19890323
  38. De Braekeleer E, Douet-Guilbert N, De Braekeleer M. RARA fusion genes in acute promyelocytic leukemia: a review. Expert review of hematology. 2014;7(3):347–357. pmid:24720386
  39. Bruck N, Vitoux D, Ferry C, Duong V, Bauer A, de Thé H, et al. A coordinated phosphorylation cascade initiated by p38MAPK/MSK1 directs RARα to target promoters. The EMBO journal. 2009;28(1):34–47. pmid:19078967
  40. Chen YJ, Chang LS. NFκB-and AP-1-mediated DNA looping regulates matrix metalloproteinase-9 transcription in TNF-α-treated human leukemia U937 cells. Biochimica et Biophysica Acta (BBA)-Gene Regulatory Mechanisms. 2015;1849(10):1248–1259. pmid:26260845
  41. Torchy MP, Hamiche A, Klaholz BP. Structure and function insights into the NuRD chromatin remodeling complex. Cellular and Molecular Life Sciences. 2015;72(13):2491–2507. pmid:25796366
  42. Williams AO, Isaacs RJ, Stowell KM. Down-regulation of human topoisomerase IIα expression correlates with relative amounts of specificity factors Sp1 and Sp3 bound at proximal and distal promoter regions. BMC molecular biology. 2007;8(1):1–10. pmid:17511886
  43. Qi Xt, Li Yl, Zhang Yq, Xu T, Lu B, Fang L, et al. KLF4 functions as an oncogene in promoting cancer stem cell-like characteristics in osteosarcoma cells. Acta Pharmacologica Sinica. 2019;40(4):546–555.
  44. Lüdtke TH, Wojahn I, Kleppa MJ, Schierstaedt J, Christoffels VM, Künzler P, et al. Combined genomic and proteomic approaches reveal DNA binding sites and interaction partners of TBX2 in the developing lung. Respiratory research. 2021;22(1):1–17. pmid:33731112
  45. Wicks K, Knight J. Transcriptional repression and DNA looping associated with a novel regulatory element in the final exon of the lymphotoxin-β gene. Genes & Immunity. 2011;12(2):126–135. pmid:21248773
  46. Makia NL, Surapureddi S, Monostory K, Prough RA, Goldstein JA. Regulation of human CYP2C9 expression by electrophilic stress involves activator protein 1 activation and DNA looping. Molecular pharmacology. 2014;86(2):125–137. pmid:24830941
  47. Hsu PY, Hsu HK, Singer GA, Yan PS, Rodriguez BA, Liu JC, et al. Estrogen-mediated epigenetic repression of large chromosomal regions through DNA looping. Genome research. 2010;20(6):733–744. pmid:20442245
  48. O’Connor L, Gilmour J, Bonifer C. Focus: Epigenetics: The role of the ubiquitously expressed transcription factor Sp1 in tissue-specific transcriptional regulation and in disease. The Yale journal of biology and medicine. 2016;89(4):513.
  49. Cross AJ, Jeffries CM, Trewhella J, Matthews JM. LIM domain binding proteins 1 and 2 have different oligomeric states. Journal of molecular biology. 2010;399(1):133–144. pmid:20382157
  50. Abboud N, Morris TM, Hiriart E, Yang H, Bezerra H, Gualazzi MG, et al. A cohesin–OCT4 complex mediates Sox enhancers to prime an early embryonic lineage. Nature communications. 2015;6(1):1–14. pmid:25851587
  51. Galande S, Kohwi-Shigematsu T. Caught in the act: binding of Ku and PARP to MARs reveals novel aspects of their functional interaction. Critical Reviews™ in Eukaryotic Gene Expression. 2000;10(1). pmid:10813395
  52. Babu DA, Chakrabarti SK, Garmey JC, Mirmira RG. Pdx1 and BETA2/NeuroD1 participate in a transcriptional complex that mediates short-range DNA looping at the insulin gene. Journal of Biological Chemistry. 2008;283(13):8164–8172. pmid:18252719
  53. Chen Y, Chen C, Zhang Z, Liu CC, Johnson ME, Espinoza CA, et al. DNA binding by FOXP3 domain-swapped dimer suggests mechanisms of long-range chromosomal interactions. Nucleic acids research. 2015;43(2):1268–1282. pmid:25567984
  54. Krishnakumar R, Chen AF, Pantovich MG, Danial M, Parchem RJ, Labosky PA, et al. FOXD3 regulates pluripotent stem cell potential by simultaneously initiating and repressing enhancer activity. Cell stem cell. 2016;18(1):104–117. pmid:26748757