DeepHiC: A generative adversarial network for enhancing Hi-C data resolution

Hi-C is commonly used to study three-dimensional genome organization. However, due to the high sequencing cost and technical constraints, the resolution of most Hi-C datasets is coarse, resulting in a loss of information and biological interpretability. Here we develop DeepHiC, a generative adversarial network, to predict high-resolution Hi-C contact maps from low-coverage sequencing data. We demonstrate that DeepHiC is capable of reproducing high-resolution Hi-C data from as few as 1% downsampled reads. Empowered by adversarial training, our method restores fine-grained details similar to those in high-resolution Hi-C matrices, boosts accuracy in chromatin loop identification and TAD detection, and outperforms state-of-the-art methods in prediction accuracy. Finally, applying DeepHiC to Hi-C data on mouse embryonic development facilitates chromatin loop detection. We also provide a web-based tool (DeepHiC, http://sysomics.com/deephic) that allows researchers to enhance their own Hi-C data with just a few clicks.


Introduction
The high-throughput chromosome conformation capture (Hi-C) technique [1] is a genome-wide technique used to investigate three-dimensional (3D) chromatin conformation inside the nucleus. It has facilitated the identification and characterization of multiple structural elements, such as the A/B compartment [1], topologically associating domains (TADs) [2,3], enhancer-promoter loops [4] and stripes [5] over the past decade. In practice, Hi-C data is conventionally stored as a pairwise read count matrix $M_{n \times n}$, where $M_{ij}$ is the number of observed interactions (read-pair count) between genomic regions $i$ and $j$, and the genome is partitioned into $n$ fixed-size bins (e.g., 25 kb). Bin size (i.e., resolution) is a crucial parameter for Hi-C data analysis, as it directly affects the results of downstream analysis, such as predictions of enhancer-promoter interactions [6-11] or identification of TAD boundaries [6,12-16]. Depending on sequencing depths, the size of commonly used bins ranges from 1 kb to 1 Mb.
Because of the high cost of sequencing, most available Hi-C datasets have relatively low resolution [17], which limits their application in studies of genomic regulatory elements. Constructing high-resolution Hi-C matrices demands sufficient sequencing coverage; otherwise, the contact matrix is extremely sparse and contains excessive stochastic noise. Billions of read-pairs are typically necessary to achieve truly genome-scale coverage at kilobase-pair resolution [18], and the cost of Hi-C experiments generally scales quadratically with the desired level of resolution [19]. Low-resolution data may be sufficient for detecting large-scale genomic patterns such as A/B compartments, but the decrease in resolution may prevent identification of fine-scale genomic elements such as sub-TADs [20,21] and enhancer-promoter interactions, and can even lead to inconsistent results when detecting interactions and TADs in replicated samples [22]. Therefore, a computational model that imputes a higher-resolution Hi-C contact matrix from currently available Hi-C datasets would be both powerful and useful.
Several pioneering works on solving problems related to low-resolution Hi-C data have recently emerged. Li et al. proposed deDoc for detecting megabase-size TAD-like domains in ultra-low resolution Hi-C data [23]. Zhang et al. proposed a deep learning model called HiCPlus to enhance Hi-C matrices from low-resolution Hi-C data [17]. HiCPlus showed that chromatin interactions can be predicted from their neighboring regions, by using the convolutional neural network (CNN) [24].
Carron et al. proposed a computational method called Boost-HiC for boosting read counts of long-range contacts [25]. Liu et al. proposed HiCNN [26], a 54-layer CNN that achieved better performance than HiCPlus. While these results were encouraging, three problems still exist in Hi-C data resolution enhancement algorithms. First, Hi-C data contain numerous high-frequency details ($M_{ij}$ and its nearby values are very large, while values in neighboring regions are small) and sharp edges, which are usually considered to indicate the presence of enhancer-promoter loops, stripes, and TAD boundaries. Models that rely on regression with mean squared error (MSE) loss, which is thought to yield solutions with overly smooth textures [27], are likely to smooth these features. Thus, we seek to develop a model which is capable of predicting data with a sharp or degenerated distribution. Second, the structural patterns and textures of Hi-C data are abundant. The hypothesis space, which is controlled by the number of parameters, should be able to capture richer structures as it grows [28]. Increasing the depth of the network may increase accuracy [29], provided that the model's generalizability is ensured and overfitting is restrained. The final critical problem is the stochastic noise in Hi-C data. An effective model should be able to predict solutions that reside on the manifold of the target data and thus diminish stochastic noise (i.e., it should be capable of denoising) [30,31].
To accurately predict high-resolution Hi-C data from low-coverage sequencing samples in the face of these three problems, we developed a deep learning model which employed the [...] [36]. Also, researchers have started to design task-specific loss functions, using not only MSE loss (i.e., L2 loss) but also other losses such as perceptual loss [37], and have achieved striking advances [38].
In this paper, we propose a GAN-based method, DeepHiC, to enhance the resolution of Hi-C data. Using low-resolution Hi-C matrices (obtained by downsampling original Hi-C reads) as input, we demonstrate that DeepHiC is capable of reproducing high-resolution Hi-C matrices. DeepHiC-enhanced data achieve high correlation and structure similarity index (SSIM) scores compared with original high-resolution Hi-C matrices. Even when using as few as 1% of the original reads, a depth that no previous method has been shown to enhance, DeepHiC is still capable of inferring high-resolution data and achieves higher correlation and SSIM scores than real high-resolution data from the replicated assay. Compared with previous methods, our method is more accurate in predicting high-resolution Hi-C data, even in fine-grained details, and performs better when applied to different cell lines. Enhancements by DeepHiC improve the accuracy of downstream analyses such as identification of chromatin loops and detection of TADs. In this study, we applied DeepHiC to Hi-C data on mouse embryonic development and demonstrated that, compared with the original low-resolution Hi-C data, DeepHiC-enhanced Hi-C data provide more interpretable results for the identification of chromatin loops. We also developed a web-based tool (DeepHiC, http://sysomics.com/deephic) that allows researchers to enhance their own Hi-C data with just a few clicks. In summary, this work introduces an effective model for enhancing Hi-C data resolution and establishes a new framework for prediction of a high-resolution Hi-C matrix from low-resolution data.

Hi-C data sources and processing
The high-resolution (10-kb) Hi-C data used for training and evaluation were obtained from GEO (https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE63525. Prediction and evaluation were carried out on 4 datasets collected for the GM12878, K562 and IMR90 cell lines; replicate data were available for assays performed in the GM12878 cell line. The high-resolution Hi-C contact maps for each dataset were derived from reads with mapping quality > 30.
Corresponding low-resolution data were simulated by randomly downsampling the sequencing reads to different ratios ranging from 1:10 to 1:100 (i.e., 1% of the reads). Downsampled data would typically be processed at lower resolution because of the shallower sequencing depths. In our experiments, low-resolution contact maps were built using the same bin size as used for high-resolution Hi-C to fit the models' requirements. All resolution-enhancing methods compared in our study used this same procedure, as reported for HiCPlus [17], to ensure fair comparisons.
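The downsampling step can be illustrated on a binned contact matrix. The following is a minimal sketch, not the authors' pipeline (which downsamples the sequencing reads themselves): it assumes that keeping each read pair independently with probability 1/ratio (binomial thinning) is an acceptable stand-in for random read downsampling. The function name `downsample_contacts` and the toy matrix are hypothetical.

```python
import numpy as np

def downsample_contacts(matrix, ratio, seed=0):
    """Keep each read pair independently with probability 1/ratio (assumed scheme)."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(matrix, dtype=np.int64)
    # Thin only the upper triangle (diagonal included) so that every
    # read pair is drawn exactly once, then mirror to restore symmetry.
    upper = np.triu(counts)
    thinned = rng.binomial(upper, 1.0 / ratio)
    return thinned + np.triu(thinned, k=1).T

# Toy symmetric 3 x 3 contact matrix (hypothetical counts).
high = np.array([[50, 20, 4],
                 [20, 60, 16],
                 [4, 16, 40]])
low = downsample_contacts(high, ratio=16)
```

The output keeps the bin size of the input, which mirrors the choice described above of building low-resolution maps at the same bin size as the high-resolution ones.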
Hi-C data pertaining to mouse embryonic development were obtained from GEO under accession number GSE82185. Hi-C matrices of 10-kb bin size were created using the HOMER (http://homer.ucsd.edu/homer/) analyzeHiC command with the following parameters: -res 10000 -window 10000.
ChIA-PET data for the CTCF target in the K562 cell line were obtained from ENCODE (https://encodeproject.org) under accession number ENCSR000CAC. ATAC-seq data on mouse early embryonic development was obtained from GEO under accession number GSE66390.
In each Hi-C matrix, outliers are capped at an allowed maximum, with the threshold set to the 99.9th percentile. For example, 255 is about the average of the 99.9th percentiles for 10-kb Hi-C data, so all values greater than 255 are set to 255 for 10-kb Hi-C data. All Hi-C matrices are then rescaled to values ranging from 0 to 1 by min-max normalization [39] to ensure training stability and efficiency.
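The outlier capping and rescaling described above can be sketched as follows. This is an illustrative sketch rather than the released preprocessing code; the helper name `clip_and_rescale` and the toy values are hypothetical, and the fallback to the matrix's own 99.9th percentile when no cutoff is given is an assumption.

```python
import numpy as np

def clip_and_rescale(matrix, cutoff=None):
    """Cap outliers at `cutoff`, then min-max normalize to [0, 1]."""
    m = np.asarray(matrix, dtype=np.float64)
    if cutoff is None:
        # Assumed fallback: use this matrix's own 99.9th percentile.
        cutoff = np.percentile(m, 99.9)
    clipped = np.minimum(m, cutoff)
    lo, hi = clipped.min(), clipped.max()
    if hi == lo:
        # Constant matrix: nothing to rescale.
        return np.zeros_like(clipped)
    return (clipped - lo) / (hi - lo)

# Toy 10-kb matrix with two outliers above 255 (hypothetical values).
raw = np.array([[0.0, 10.0, 300.0],
                [10.0, 255.0, 1000.0],
                [300.0, 1000.0, 128.0]])
scaled = clip_and_rescale(raw, cutoff=255.0)
```

With the fixed cutoff of 255 quoted in the text, every value above 255 maps to 1 and a count of 10 maps to 10/255.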

DeepHiC architecture
In general, DeepHiC is a GAN model that comprises a generative network called the generator and a discriminative network called the discriminator. The generator tries to generate enhanced outputs that approximate real high-resolution data from low-resolution data, while the discriminator tries to tell generated data apart from real high-resolution data and reports the difference to the generator. The contest (hence "adversarial") between generator and discriminator drives the generator to learn a mapping from the conditional input to the data distribution of interest.
As depicted in S1 Fig, the generator net ($G$) is a convolutional residual network (first row), while the discriminator net ($D$) is a convolutional neural network (second row). The $G$ net takes low-resolution matrices ($X$) as input and outputs enhanced matrices ($\hat{Y}$) of identical size. The adversarial component, the $D$ net, takes the enhanced output $\hat{Y}$ and the real high-resolution data ($Y$) as input and outputs 0-1 labels. The green arrowed lines describe how data are processed in DeepHiC. The $G$ net employs two kinds of layers: the convolutional layer (blue block) and the batch normalization (BN) layer [40] (yellow block). Together with the elementwise sum operation (green ball) and the skip-connection operation (green polyline), some of these layers form residual blocks (ResBlocks) [41]. There are five successive ResBlocks in $G$. As for the activation function (pink block), we elected to use the Swish function [42] instead of the Rectified Linear Unit (ReLU) for activating some layers. The Swish function is defined as $f(x) = x \cdot \sigma(\beta x)$, where $\beta = 1$ and $\sigma$ is the sigmoid function. Swish has been shown to work better than ReLU in deep models [43]. Note that the final outputs of $G$ are rescaled so that elements in the output matrices range from 0 to 1. In general, the $G$ net contains about 121,000 parameters. The $D$ network is a convolutional network similar to the VGG network [44]. The number of kernels in a convolutional layer is depicted via block width: the more kernels, the wider the block. The final output of $D$ is a scalar value ranging from 0 to 1, produced by a sigmoid function. More details of the hyperparameters of the network architectures, such as kernel size and filter numbers, are summarized in S1 Table and S2 Table.
To establish the GAN paradigm for training (Fig 1a), we employed both the generator net $G$ and the discriminator net $D$. The $G$ net aims to generate enhanced outputs $\hat{Y}$ that approximate the real high-resolution matrices $Y$, while the $D$ net attempts to distinguish the real $Y$ from the generated $\hat{Y}$.
In the $D$ net, the output value $D(\hat{Y})$ is considered to be the probability that $\hat{Y} = G(X)$ is real data. The divergence between $\hat{Y}$ and $Y$ is minimized, and the probability that $D$ assigns to $\hat{Y}$ being real is driven up, according to a carefully designed loss function. The two networks are trained alternately by the backpropagation algorithm. DeepHiC imputes enhanced contact maps using a 23-layer residual network called the Generator. During training, the enhanced outputs approach real high-resolution matrices through minimization of the mean square error (MSE) loss, perceptual loss (PPL), and total variation (TV) loss; meanwhile, a Discriminator network distinguishes enhanced outputs from the real ones and reports the probabilities of enhanced outputs being real to the Generator through the adversarial (AD) loss. The imputation and discrimination steps form the adversarial training process.
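To make the Swish activation mentioned in this section concrete, here is a small NumPy sketch comparing it with ReLU. It uses the standard Swish definition, x times sigmoid(beta x), with beta = 1; the function names are illustrative and this is not the layer code of the released model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    x = np.asarray(x, dtype=np.float64)
    return x * sigmoid(beta * x)

def relu(x):
    """ReLU activation, shown only for comparison."""
    return np.maximum(np.asarray(x, dtype=np.float64), 0.0)

xs = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
swish_vals = swish(xs)
relu_vals = relu(xs)
```

Unlike ReLU, Swish is smooth and lets small negative values through (for example, `swish(-1)` is slightly below zero), while behaving almost like the identity for large positive inputs.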

Loss functions in DeepHiC
A critical point when designing a deep learning model is the definition of the loss function. Many methods have recently been proposed to stabilize training [45,46] and improve the quality of synthesized images [37] by the GAN model. For DeepHiC, the binary cross entropy loss function for the $D$ network was used to measure the error of its output, as compared with the assigned labels.
Because real and generated high-resolution data are paired in practice, it can be described as:
$$L_D = -\frac{1}{N}\sum_{i=1}^{N}\left[\log D(Y_i) + \log\left(1 - D(\hat{Y}_i)\right)\right]$$
where $i$ is the index for pairs of real and generated data, and $N$ is the number of pairs. Here we used label 1 for real data and label 0 for generated data.
For the generator loss, we used four loss functions, which were added to yield a final objective function. Firstly, we used MSE to measure the pixel-wise error between predicted Hi-C matrices and real high-resolution matrices, defined as:
$$L_{MSE} = \frac{1}{N}\sum_{i=1}^{N}\lVert Y_i - \hat{Y}_i \rVert_2^2$$
which is also called L2 loss. The MSE loss is broadly used for regression problems, although it does not correlate well with the human perception of image quality [47] and overly smooths refined structures in images [27]. We therefore also employed the perceptual loss $L_{PP}$ [37], based on the feature layers of the VGG16 network. We used the total variation (TV) loss $L_{TV}$, derived from the total variation denoising technique, to suppress noise in images [48]. The final generator loss is obtained in combination with the adversarial (AD) loss $L_{AD}$ derived from the $D$ network and is defined as:
$$L_G = L_{MSE} + \lambda_{PP} L_{PP} + \lambda_{TV} L_{TV} + \lambda_{AD} L_{AD}$$
Note that $L_{AD} = \left(\sum_{i=1}^{N}\left(1 - D(\hat{Y}_i)\right)\right)/N$ is computed without logarithmic transformation, which allows for fast and stable training of the $G$ net [45]. The hyperparameters $\lambda_{PP}$, $\lambda_{TV}$, $\lambda_{AD}$ are scale weights that range from 0 to 1.
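The way the four generator loss terms combine can be sketched in NumPy. This is a hedged illustration, not the training code: the perceptual term is passed in as a precomputed number because it needs a pretrained VGG16, the TV term is assumed to be the mean squared difference between neighboring pixels, the adversarial term is assumed to be the mean of one minus the discriminator output, and the default weights assume that the three values reported in the implementation section (0.006, 2 × 10^-8, 0.001) apply to the perceptual, TV, and adversarial terms in that order.

```python
import numpy as np

def mse_loss(pred, target):
    """Pixel-wise mean squared error."""
    return float(np.mean((pred - target) ** 2))

def tv_loss(img):
    """Assumed TV form: mean squared difference between adjacent pixels."""
    dh = img[1:, :] - img[:-1, :]
    dw = img[:, 1:] - img[:, :-1]
    return float(np.mean(dh ** 2) + np.mean(dw ** 2))

def adversarial_loss(d_out):
    """Assumed AD form: mean of (1 - D(generated)), with no logarithm."""
    return float(np.mean(1.0 - d_out))

def generator_loss(pred, target, d_out, perceptual=0.0,
                   w_pp=0.006, w_tv=2e-8, w_ad=0.001):
    """Weighted sum of the four terms; weight-to-term mapping is an assumption."""
    return (mse_loss(pred, target)
            + w_pp * perceptual
            + w_tv * tv_loss(pred)
            + w_ad * adversarial_loss(d_out))

flat = np.full((4, 4), 0.5)
# Perfect, flat prediction that the discriminator fully rejects (output 0).
loss_rejected = generator_loss(flat, flat, d_out=np.array([0.0]))
# Same prediction, fully accepted by the discriminator (output 1).
loss_accepted = generator_loss(flat, flat, d_out=np.array([1.0]))
```

In this toy case only the adversarial term differs between the two calls, which shows how the discriminator's verdict enters the generator objective as a small additive penalty.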

Implementation of DeepHiC and performance evaluation
DeepHiC is implemented in Python with PyTorch 1.0 [49]. After splitting the GM12878 dataset into a training set and a test set, the model was trained on the training set. The final model was trained on chromosomes 1-14 and tested on chromosomes 15-22. We divided the contact matrices and kept only the regions where the genomic distance between two loci is < 2 Mb: the average size of a TAD is < 1 Mb and there are few significant interactions outside TADs, so more distant contacts could be omitted from training. The Adam optimizer [50] is used with a batch size of 64, and all networks are trained from scratch, with a learning rate of 0.0001. For DeepHiC, we train the networks for 200 epochs. In order to yield loss terms on the same scale, the hyperparameters for the generator loss are set as $\lambda_{PP} = 0.006$, $\lambda_{TV} = 2 \times 10^{-8}$, and $\lambda_{AD} = 0.001$ for the perceptual, TV, and adversarial terms, respectively. All evaluations are performed using an NVIDIA 1080ti GPU.
In order to assess the efficiency of DeepHiC during training, we used an improved measure called the structure similarity index (SSIM) [51] to measure the structure similarity between different contact matrices. The SSIM score is calculated by sliding sub-windows between images. The measure for comparison of two identically sized sub-windows, $x$ and $y$ (from two images), is:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
where mean ($\mu$), variance ($\sigma^2$), and covariance ($\sigma_{xy}$) are computed with a Gaussian filter. They measure the differences of luminance, contrast, and structure between two images, respectively. $C_1$, $C_2$ are constants to stabilize the division with a weak denominator. In our experiments, the size of sub-windows and the variance value of the Gaussian kernel are set as 11 and 3, respectively. All compared matrices are rescaled by min-max normalization to the same range to eliminate differences in luminance, so that the contrast and structure differences can be compared.
For prediction, each low-resolution Hi-C matrix was divided into non-overlapping sub-regions, then enhanced sub-regions were predicted from them by the generator network of DeepHiC. Finally, the predicted high-resolution sub-matrices were merged into a chromosome-wise Hi-C matrix, as the final enhanced output.
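The divide, predict, and merge procedure at the end of this section can be sketched as below. This is a simplified illustration under stated assumptions: it tiles the whole matrix (the 2-Mb distance restriction used for training is ignored), pads the edges with zeros, and uses an identity function in place of the trained generator. The names `split_blocks`, `merge_blocks`, and `fake_generator` are hypothetical.

```python
import numpy as np

def split_blocks(matrix, size):
    """Cut a square matrix into non-overlapping size x size blocks (zero-padded)."""
    n = matrix.shape[0]
    pad = (-n) % size
    padded = np.pad(matrix, ((0, pad), (0, pad)))
    blocks = {}
    for i in range(0, n + pad, size):
        for j in range(0, n + pad, size):
            blocks[(i, j)] = padded[i:i + size, j:j + size]
    return blocks

def merge_blocks(blocks, n, size):
    """Reassemble blocks into an n x n chromosome-wise matrix."""
    pad = (-n) % size
    out = np.zeros((n + pad, n + pad))
    for (i, j), block in blocks.items():
        out[i:i + size, j:j + size] = block
    return out[:n, :n]

def fake_generator(block):
    # Stand-in for the trained generator network (identity mapping).
    return block.copy()

low_res = np.arange(25, dtype=np.float64).reshape(5, 5)
pieces = split_blocks(low_res, size=2)
enhanced = {key: fake_generator(block) for key, block in pieces.items()}
merged = merge_blocks(enhanced, n=low_res.shape[0], size=2)
```

Because the blocks do not overlap, merging is a plain copy back into place and no averaging across block borders is needed.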

Identifying chromatin loops and detecting TAD boundaries
Chromatin loops [7] were identified using the commonly used software Fit-Hi-C. We parallelized the software to speed it up and adapted it to our data. The modified code is available at https://github.com/omegahh/pFitHiC. Fit-Hi-C parameters were set as follows: [...] = 10, [...] = 2, [...] = 120, [...] = 2, [...] = 100. Significance was calculated only for intra-chromosome interactions.
TADs were detected using the insulation score algorithm [14] with minor modifications: the width of the window used when calculating the insulation score was set to 5 times the Hi-C matrix resolution to better detect the boundaries of finer domain structures. We computed the delta score using the insulation scores of the 5 nearest loci upstream and the 5 nearest loci downstream. We identified TADs as the genomic regions between the centers of 2 adjacent boundaries; regions containing low-coverage bins were excluded.
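The boundary-calling idea can be sketched on a toy matrix. This is a simplified illustration, not the modified algorithm used in the study: the insulation score is taken as the raw mean contact between the 5 bins upstream and the 5 bins downstream of each bin (no log-normalization), the delta score is assumed to be the downstream mean minus the upstream mean of insulation scores, and boundaries are called where delta crosses zero while rising. All function names are hypothetical.

```python
import numpy as np

def insulation_scores(matrix, window=5):
    """Mean contact between `window` bins upstream and downstream of each bin."""
    n = matrix.shape[0]
    scores = np.full(n, np.nan)
    for i in range(window, n - window):
        scores[i] = matrix[i - window:i, i + 1:i + window + 1].mean()
    return scores

def delta_scores(scores, flank=5):
    """Assumed delta: mean insulation downstream minus mean insulation upstream."""
    n = len(scores)
    delta = np.full(n, np.nan)
    for i in range(flank, n - flank):
        upstream = scores[i - flank:i]
        downstream = scores[i + 1:i + flank + 1]
        if not (np.isnan(upstream).any() or np.isnan(downstream).any()):
            delta[i] = downstream.mean() - upstream.mean()
    return delta

def rising_zero_crossings(delta):
    """Bins where delta goes from negative to non-negative."""
    return [i for i in range(1, len(delta)) if delta[i - 1] < 0 <= delta[i]]

# Toy matrix with two 20-bin domains and no contacts between them.
toy = np.zeros((40, 40))
toy[:20, :20] = 1.0
toy[20:, 20:] = 1.0
scores = insulation_scores(toy)
boundaries = rising_zero_crossings(delta_scores(scores))
```

On this toy input the insulation score dips to zero at the junction of the two domains, and the single rising zero crossing of the delta score marks that junction.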

Measurements for two TAD segmentations
We investigated the consistency of segmentations formed by different TAD boundaries in the genome. Here we calculated the distance between two segmentations and the corresponding overlap. We denote the two segmentations as $S$ and $T$, which are formulated as sets consisting of their split points: $S = \{s_1, \ldots, s_m\}$ and $T = \{t_1, \ldots, t_n\}$, where $m$ and $n$ are the numbers of split points. From these sets we calculated the distance from one split point $s_i \in S$ to the segmentation $T$, and the overlap of an interval from $S$ compared with $T$.
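As an illustration of comparing two segmentations, the sketch below computes, for every split point of one segmentation, its distance to the other segmentation. The exact formulas are not recoverable from the text here, so this assumes the usual point-to-set definition (distance to the nearest split point); the function names and coordinates are hypothetical, and the overlap measure is deliberately left out.

```python
from bisect import bisect_left

def distance_to_segmentation(point, splits):
    """Distance from `point` to the nearest split point in sorted `splits` (assumed definition)."""
    pos = bisect_left(splits, point)
    # Only the neighbors on either side of the insertion position can be nearest.
    candidates = splits[max(pos - 1, 0):pos + 1]
    return min(abs(point - t) for t in candidates)

def segmentation_distances(splits_a, splits_b):
    """Distances from every split point of A to segmentation B."""
    ordered_b = sorted(splits_b)
    return [distance_to_segmentation(s, ordered_b) for s in splits_a]

# Hypothetical TAD boundary positions (in bins) from two Hi-C matrices.
real_boundaries = [0, 110, 240, 400]
enhanced_boundaries = [0, 100, 250, 400]
distances = segmentation_distances(enhanced_boundaries, real_boundaries)
```

A distribution of such distances concentrated near zero would indicate that boundaries from the enhanced matrix sit close to those of the reference matrix.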

Implementation of baseline models
For baseline models, we only performed comparisons on data downsampled to 1/16 of the reads, as this ratio was commonly used in the corresponding studies [17,25,26]. The Python source code for HiCPlus was obtained from https://github.com/zhangyan32/HiCPlus_pytorch, together with the code for data processing and the pretrained model parameter file. We obtained HiCPlus results using the downloaded source code and pretrained model parameter file. The scheme of data downsampling and reconstruction was implemented according to the description in the HiCPlus paper [17].

Parameters training of DeepHiC model
In the current study, we propose a conditional generative adversarial network (cGAN), DeepHiC, for enhancing Hi-C data from low-resolution samples. It contains a generative network $G$ and a discriminative network $D$. The former takes low-resolution data as input and imputes the enhanced output, while the latter is only employed during the training process, as a discriminator that reports the differences between enhanced outputs and real high-resolution Hi-C data to the network $G$; together they form the adversarial training (Fig 1a). Also, in order to alleviate the overly-smooth problem caused by mean square error (MSE) loss, we utilized the perceptual loss to capture structure features in Hi-C contact maps and the total variation (TV) loss for suppressing artifacts [52]. The detailed architecture of DeepHiC is depicted in S1 Fig. [...] We also trained the generator net as a regression model without the adversarial part, but SSIM scores in the test set fluctuated substantially (S5 Fig). These results suggest that the GAN-based framework efficiently restrains over-fitting and is necessary for prediction.
In the prediction step, we divided the large Hi-C matrix into small squares as model inputs, because the division step is required to generate abundant samples for model training.

DeepHiC reproduces high-resolution Hi-C from as few as 1% downsampled reads
We used the high-resolution Hi-C data in the GM12878, K562 and IMR90 cell lines from Rao's Hi-C (access code GSE63525) in our experiments. Datasets pertaining to different cell types are denoted as GM12878, GM12878R, K562, and IMR90 for convenience (GM12878R represents the replicated assay in the GM12878 cell line). First, we constructed high-resolution (10-kb) contact matrices using all the reads from the raw data. Then we downsampled the reads to different ratios (ranging from 1:10 to 1:100) of the original reads to simulate low-resolution Hi-C data, and constructed contact matrices at the same bin size. We thus obtained paired high-resolution and low-resolution Hi-C data. The experimental high-resolution data were regarded as ground truth in the following analysis, while the low-resolution data were enhanced by DeepHiC after the model had been trained. Quantitatively, DeepHiC-enhanced data achieve higher correlations than the experimental replicate (i.e., GM12878R), even when they are predicted from 1% downsampled data (Fig 1d). DeepHiC also achieves a higher SSIM score than the experimental replicate (S6 Fig). These results indicate that the DeepHiC model is capable of reproducing high-resolution Hi-C data with high similarity even using 1% downsampled reads. Because the high-resolution data we used are at 10-kb resolution, this implies that our method could enhance 1-Mb resolution Hi-C data to 10-kb resolution with high quality. To our knowledge, no previously available imputation algorithm enhances Hi-C data from such a shallow sequencing depth.

Enhancements of low-resolution data
We then trained DeepHiC using 1/16 downsampled data for a fair comparison with other baseline methods such as HiCPlus, Boost-HiC, and HiCNN (Methods). SSIM scores converged at 0.9 in the test set (S6 Fig). We first investigated the enhancements afforded by DeepHiC by visualizing data in the form of heatmaps (S1 Note).

DeepHiC outperformed other methods in terms of genome-wide similarity
We quantitatively investigated genome-wide performance for all four datasets. We calculated SSIM scores for downsampled, HiCPlus-enhanced, and DeepHiC-enhanced data, as compared with real high-resolution data, for all non-overlapping 1 Mb × 1 Mb (100 × 100 bins) sub-regions along the diagonal across the entire genome (S11 Fig). Fig 3a shows [...] DeepHiC can be used to enhance the Hi-C matrix for other cell types.
We omitted comparison with Boost-HiC considering that it aims to enhance long-range contacts. Evaluation of Boost-HiC is plotted in S12 Fig and S13 Fig.

Chromatin loops in high-resolution Hi-C were accurately recovered from DeepHiC-enhanced matrices
After demonstrating that DeepHiC can restore high-resolution Hi-C from low-resolution data, we investigated whether these enhanced high-resolution matrices could facilitate the identification of significant chromatin interactions, which are usually considered to be chromatin loops. For this purpose, we used the Fit-Hi-C software to obtain significant intra-chromosomal interactions. We applied Fit-Hi-C to the Hi-C data presented above, in all four datasets, using the same parameters (Methods).
Statistical confidence values (i.e., q-values) for all loci-pairs were acquired by Fit-Hi-C. We kept the [...] were more similar to the real high-resolution data than any others for the entire dataset. We also compared the overlap of identified interactions with real high-resolution data at each genomic distance, as shown in Fig 4c. The Jaccard index of identified interactions between DeepHiC-enhanced data and real high-resolution data was higher at each genomic distance. In addition to using the aforementioned threshold for q-values, we tried more thresholds by scanning various false discovery rates (FDR), ranging from 0.001 to 0.05, with a step size of 0.001. We evaluated the overlap of identified interactions according to FDR scanning and found that DeepHiC outperformed the others (Fig 4d). These results suggest that DeepHiC-enhanced Hi-C data are more accurate in predicting chromatin loops and yield fewer artifacts.
Fig 4. (c) We evaluated the overlap of significant loci-pairs with real Hi-C data at each distance, using the preset cut-off.
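The Jaccard index used to compare sets of significant interactions can be written in a few lines. The sketch below is illustrative only: loci pairs are represented as hypothetical (bin_i, bin_j) tuples, and the per-distance breakdown simply groups pairs by the difference of their bin indices.

```python
def jaccard_index(pairs_a, pairs_b):
    """Size of the intersection divided by size of the union of two pair sets."""
    a, b = set(pairs_a), set(pairs_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

def jaccard_by_distance(pairs_a, pairs_b):
    """Jaccard index computed separately for each genomic distance (in bins)."""
    distances = sorted({abs(j - i) for i, j in set(pairs_a) | set(pairs_b)})
    result = {}
    for d in distances:
        at_d_a = [p for p in pairs_a if abs(p[1] - p[0]) == d]
        at_d_b = [p for p in pairs_b if abs(p[1] - p[0]) == d]
        result[d] = jaccard_index(at_d_a, at_d_b)
    return result

# Hypothetical significant loci pairs from real and enhanced matrices.
real_pairs = [(1, 3), (2, 4), (5, 9), (10, 14)]
enhanced_pairs = [(1, 3), (2, 4), (5, 9), (11, 15)]
overall = jaccard_index(real_pairs, enhanced_pairs)
per_distance = jaccard_by_distance(real_pairs, enhanced_pairs)
```

Here three of the five distinct pairs are shared, so the overall index is 0.6, while the per-distance view shows that the disagreement is confined to pairs four bins apart.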
Next, we compared the chromatin loops identified in these Hi-C matrices with the chromatin loops identified by CTCF chromatin interaction analysis by paired-end tagging sequencing (ChIA-PET) in the K562 cell line, for which data are available from the ENCODE project. ROC analysis was performed as described for HiCPlus: we used the CTCF-mediated chromatin loops identified by ChIA-PET as true positives. As for negatives, we randomly selected the same number of loci pairs that were not predicted to be interacting pairs by ChIA-PET (10 repeats). We then plotted the ROC (receiver operating characteristic) curve and calculated the area under the ROC curve (AUC) for each. As shown in Fig 4e, CTCF interacting pairs and non-interacting pairs were well separated by the DeepHiC-enhanced matrix (average AUC = 0.843). We also observed that the AUC score for the DeepHiC-enhanced matrix was significantly higher than both the AUC derived from the HiCPlus/HiCNN-enhanced matrix (p-value = 0, paired t-test) and the AUC derived from the downsampled matrix (p-value = 0, paired t-test).
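The AUC in this kind of analysis can be computed directly from two score lists, without plotting the curve. The following is a minimal sketch using the rank-based definition of AUC (the probability that a positive pair scores higher than a negative pair, with ties counted as one half); the assumption that the score is the contact value of each loci pair in a given matrix, and the numbers themselves, are illustrative.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical contact values at ChIA-PET loops (positives) and
# at randomly chosen non-interacting pairs (negatives).
loop_scores = [0.9, 0.8, 0.4]
background_scores = [0.7, 0.3, 0.4]
auc = roc_auc(loop_scores, background_scores)
```

Repeating this with each of the randomly drawn negative sets gives one AUC per repeat, which is the quantity that a paired test across methods would compare.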

DeepHiC is more precise in detecting TAD boundaries
The detection of TADs is not very sensitive to resolution decline: using algorithms for detecting TADs, we obtained roughly the same results with Hi-C data at various downsampling ratios [23]. However, we found that some refined TAD structures were shifted, or even wrongly detected, in low-resolution data. Therefore, we went on to assess the performance of DeepHiC in recovering TADs, especially fine-scale TADs. We calculated the $\Delta$ score of insulation scores across the entire genome for all four datasets (Methods). The zero-points within monotonically rising intervals are considered to be TAD boundaries. Fig 5a illustrates [...] rank-sum test), those of HiCNN-enhanced data (p-value = 0.035, Wilcoxon rank-sum test) and those of downsampled data (p-value = $1.3 \times 10^{-193}$, Wilcoxon rank-sum test). We also investigated the distribution of the overlap of segmentations vs. experimental high-resolution data (Fig 5c). The results showed that our model had a high proportion of high overlap values (p-value < $1 \times 10^{-20}$ for downsampled/Boost-HiC-enhanced/HiCPlus-enhanced data, < 0.001 for HiCNN-enhanced data, Mann-Whitney U-test), which indicates that more TADs are precisely matched with those in real Hi-C data. Similar comparison results for other cell types are illustrated in S18 Fig.

Application of DeepHiC improves identification of chromatin loops in mouse early embryonic developmental stages
DeepHiC can be used to enhance the resolution of existing time-resolved Hi-C data obtained through early embryonic growth. These data are prone to low resolution due to the limited cell population. Therefore, algorithms for detecting significant interactions, when applied to these data, may produce results with a relatively high false positive rate. We demonstrate that DeepHiC can be applied to Hi-C data of mouse early embryonic development to enable identification of significant chromatin interactions with a considerably lower false positive rate. We applied Fit-Hi-C to both original low-resolution Hi-C contact matrices and DeepHiC-enhanced contact matrices (Fig 6a) and kept pairs of loci with q-values lower than a preset cut-off (0.5 percentile) as significant interactions (predicted loops). Chromatin loops regulate spatial enhancer-promoter contacts and are relevant to domain formation [4,53], and anchors of chromatin loops co-localize with open chromatin regions including insulators, enhancers, and promoters. We evaluated the accuracy of Fit-Hi-C significant interactions according to the fraction of all significant interactions that connect promoter regions, as well as by the fraction connecting two accessible chromatin regions marked by ATAC-seq peaks. As shown in Fig 6b, significant interactions identified using DeepHiC-enhanced Hi-C data are more likely to anchor at gene promoters than those identified using original Hi-C data. They are also more likely to co-localize with open chromatin regions at both of their anchoring loci than loops predicted with original Hi-C data (Fig 6c). We mainly focused on the 8-cell stage and beyond because Hi-C data from earlier stages only demonstrate weak TADs and depleted distal chromatin interactions [54]. To generate control datasets, we randomly repositioned all predicted significant interactions for original Hi-C data, while maintaining the distance between anchors of each loop, using the "shuffle" command in Bedtools [55].
We repeated this process 20 times to generate 20 random significant interaction datasets. We found that the fraction of predicted significant interactions that connected accessible loci was significantly higher for DeepHiC-enhanced Hi-C data, compared with random control data. Using an example at chromosome 5, we showed that significant interactions predicted using original Hi-C data were highly scattered and frequently located outside of TADs (Fig 6d). This is inconsistent with the known characteristics of chromatin loops, as they are mostly located within TADs and are frequently observed as strong apexes of TADs and sub-TADs [4,56]. Fig 6c shows that significant interactions predicted using DeepHiC-enhanced Hi-C data are predominantly located within TADs, and at the apexes of TADs, where they co-localize with open chromatin regions. Therefore, DeepHiC is a powerful tool for studying chromatin structure during mammalian early embryonic development.
Fig 6. Error bar: standard deviation. Significance: ***: p-value < $1 \times 10^{-20}$, one-sample t-test. (d) A representative Hi-C contact matrix, with significant interactions as depicted for the 8-cell stage. Left panel: Original Hi-C contact matrix and predicted significant interactions (bold pixels inside red circles). Right panel: DeepHiC-enhanced contact matrix and predicted significant interactions (blue pixels).

Discussion
Hi-C is commonly used to map 3D chromatin organization across the genome. Since its introduction in 2009, this method has been updated many times in order to improve its accuracy and resolution.
However, owing to the high cost of sequencing, most available Hi-C datasets have relatively low resolution (40-kb to 1-Mb). The low-resolution representation of Hi-C data limits its application in studies of genomic regulatory networks or disease mechanism, which require robust, high-resolution 3D genomic data.
In this study, we proposed a deep learning method, DeepHiC, for predicting experimentally realistic high-resolution data from low-resolution samples. Our approach can produce estimates of experimental high-resolution Hi-C data with high similarity, using 1% of the sequencing reads. DeepHiC is built on state-of-the-art techniques from the deep learning discipline, including the GAN framework, residual learning, and perceptual loss. With the GAN framework and the carefully designed network architecture and loss functions in DeepHiC, it becomes possible to predict high-resolution Hi-C with a high structural similarity of 0.9 to real high-resolution Hi-C. This approach may be used to accurately predict chromatin interactions, even in fine detail. Because of the large number of parameters (~121,000) in the network, DeepHiC may be used to approximate the real data, and to make predictions in other cell or tissue types. More importantly, enhancements afforded by DeepHiC favor the identification of significant chromatin interactions and TADs in Hi-C data. Finally, we also applied DeepHiC to Hi-C data pertaining to mouse early embryonic developmental stages, for which only low-coverage sequencing data were available, and enhancements afforded by DeepHiC improved the accuracy of identification of chromatin loops for these data.
DeepHiC provides a GAN-based framework with which to enhance Hi-C data, and potentially other omics data. The GAN framework has been a state-of-the-art technique in the deep learning field in recent years. The idea of adversarial training helps the deep model to capture learnable patterns efficiently and stably. DeepHiC is trained with real high-resolution data as ground truth and is therefore a supervised learning paradigm. The quality of the ground truth determines the upper bound of the model's performance. Here we used the most deeply sequenced data, from GM12878, as the training set. It would be possible to retrain or fine-tune the model if more accurate Hi-C data were available, potentially reaching restriction-fragment resolution. DeepHiC could be used not only to enhance existing low-resolution Hi-C data but also to reduce the experimental cost of sequencing in future Hi-C assays. Once a single real high-resolution dataset is obtained, researchers can produce experimentally realistic high-resolution Hi-C data at a low price. We also developed a web-based tool (DeepHiC, http://sysomics.com/deephic) that allows researchers to enhance their own Hi-C data with just a few clicks. The enhancement procedure finishes in 3-5 minutes using a single CPU (for example, enhancing human chromosome 1 takes 4.7 minutes using a Xeon CPU E5-2682 v4 @ 2.5GHz). It is faster when using a GPU (22 s for an Nvidia 1080ti).
In conclusion, DeepHiC introduced the GAN framework for enhancing the resolution of Hi-C interaction matrices. By utilizing the GAN framework and other techniques such as residual learning, DeepHiC can generate high-resolution Hi-C data using a low fraction of the original number of sequencing reads. DeepHiC can easily be used in a number of Hi-C data analysis pipelines, and prediction can be executed within minutes on the human genome.

Data availability
Python code for the proposed DeepHiC method and data processing pipeline, as well as for training and prediction, is available at https://github.com/omegahh/DeepHiC. A user-friendly web server is available at http://sysomics.com/deephic/.