MaxHiC: A robust background correction model to identify biologically relevant chromatin interactions in Hi-C and capture Hi-C experiments

Hi-C is a genome-wide chromosome conformation capture technology that detects interactions between pairs of genomic regions and exploits higher order chromatin structures. Conceptually Hi-C data counts interaction frequencies between every position in the genome and every other position. Biologically functional interactions are expected to occur more frequently than transient background and artefactual interactions. To identify biologically relevant interactions, several background models that take biases such as distance, GC content and mappability into account have been proposed. Here we introduce MaxHiC, a background correction tool that deals with these complex biases and robustly identifies statistically significant interactions in both Hi-C and capture Hi-C experiments. MaxHiC uses a negative binomial distribution model and a maximum likelihood technique to correct biases in both Hi-C and capture Hi-C libraries. We systematically benchmark MaxHiC against major Hi-C background correction tools including Hi-C significant interaction callers (SIC) and Hi-C loop callers using published Hi-C, capture Hi-C, and Micro-C datasets. Our results demonstrate that 1) Interacting regions identified by MaxHiC have significantly greater levels of overlap with known regulatory features (e.g. active chromatin histone marks, CTCF binding sites, DNase sensitivity) and also disease-associated genome-wide association SNPs than those identified by currently existing models, 2) the pairs of interacting regions are more likely to be linked by eQTL pairs and 3) more likely to link known regulatory features including known functional enhancer-promoter pairs validated by CRISPRi than any of the existing methods. We also demonstrate that interactions between different genomic region types have distinct distance distributions only revealed by MaxHiC. MaxHiC is publicly available as a python package for the analysis of Hi-C, capture Hi-C and Micro-C data.

Author summary MaxHiC is a robust machine learning based tool for identifying significant interacting regions from both Hi-C and capture Hi-C data. All the current existing models are designed for either Hi-C or capture Hi-C data, however we developed MaxHiC to be applicable for both Hi-C and capture Hi-C libraries (two different models have been used for Hi-C and capture Hi-C data). MaxHiC is also able to analyse very deep Hi-C libraries (e.g., Micro-C) without any computational issues. MaxHiC significantly outperforms current existing Hi-C significant interaction callers and even Hi-C loop callers in terms of enrichment of interactions between known regulatory regions as well as biologically relevant interactions. This is a PLOS Computational Biology Methods paper.

Background
Hi-C is a genome-wide chromosome conformation capture technology used to identify interactions between pairs of genomic regions. Hi-C libraries consist of a pair of reads which can be mapped to the genome for identifying pairs of interacting regions. The data is typically summarized as a contact matrix, in which each row and column correspond to a (fixed-bin) fragment of the genome and each cell shows the number of reads supporting an interaction between the two bins. Hi-C data has multiple biases due both to the multiple steps used in the protocol and the proclivity of closely located genomic regions that are more likely to interact. Not all interactions are biologically relevant and thus the important challenge in the analysis of Hi-C data is to distinguish between background/artefactual interactions and those interactions that are more likely to be functional/regulatory.
There are two main approaches used to identify significant interactions in Hi-C data. The first, as used in Hi-C-DC [1], attempts to quantify known sources of biases (fragment length, GC content and mappability), and use them as variables in a background expectation function. The second approach, as used in Fit-Hi-C [2], CHiCAGO [3] and GOTHiC [4], assumes that all Hi-C biases are reflected in the read count data of interactions. Therefore, the observed or normalised read counts are used to assign a P-value to interactions. In order to account for unknown sources of bias a separate bias parameter is assumed for each bin. The tools using the first approach are limited to analysis of Hi-C libraries generated with restriction enzymes, whereas those using the second approach can also work with DNase Hi-C experiments in which the fragment sizes are unknown.
Rao et al. [5] also proposed HiCCUPS, which is not a background correction model, but can be used to identify chromatin loops. The term chromatin loop refers to the interaction between promoters and enhancers, genes or architectural loops, or polycomb-mediated regions [6]. Loop anchors usually happen at domain boundaries and bind CCCTC-binding factor (CTCF) sites. Chromatin loops can be detected as enriched Hi-C interacting regions compared to their neighbourhood regions. HiCCUPS identifies clusters of Hi-C contact matrixes and reports the centroid of the cluster, in which the frequency of the contact matrix is significantly higher than the local background. Using a local background and reporting only the centroid of a cluster results in identification of a small number of significant Hi-C interactions for a small range of distances~50kb to~5000kb. Peakachu [7], and Mustache [8] two other loop callers use machine learning techniques to identify chromatin loops. Peakachu builds loop-classifying models by using Random Forest classification techniques to predict chromatin loops from Hi-C data. Mustache uses a new computer vision technique, scale-space theory, to predict blob-shaped objects in Hi-C and Micro-C contact data.
We introduce MaxHiC (Maximum Likelihood estimation for Hi-C), a negative binomial model that uses a maximum likelihood technique to correct the complex combination of known and unknown biases in both Hi-C and capture Hi-C libraries. Benchmarking against current leading background correction models (GOTHiC, CHiCAGO, Fit-Hi-C, Fit-Hi-C2 [9], HiC-DC+ and loop callers (HiCCUPS, Peakachu [7], and Mustache [8].) shows that MaxHiC identifies interactions that contain substantially more evidence of biological relevance. Interacting regions identified by MaxHiC are significantly (Pvalue < 0.05) enriched for H3K27ac, H3K4me1, CTCF binding sites, DNase hypersensitive sites, and GWAS SNPs. They are also more likely to be annotated as promoters, enhancers, and insulator elements. More importantly the interacting pairs are substantially enriched for interactions between regulatory regions and non-coding polymorphisms linked by eQTL to target genes.

A negative binomial regression background model for Hi-C data
Here we present MaxHiC, a novel background correction model for identifying statistically significant interacting regions in both general and capture Hi-C libraries. In MaxHiC, the genome is divided into fixed size non-overlapping bins, and the read counts supporting interactions between bins are tested for significance. The read-count of interactions are assumed to follow a negative binomial distribution which is widely used to model over dispersed count data [10,11]. The mean parameter of the negative binomial distribution for each interaction is calculated as a function of the bias factors of its two ends and their genomic distance.
It is known that the contact frequency between pairs of regions linked by Hi-C reads decays as a function of distance along the chromosome [12]. A large fraction of Hi-C reads represents background ligation products of nearby genomic regions. In MaxHiC, distance is modelled by a function that decreases at increasing genomic distances to reach a small but constant nonzero value to account for background ligations. For trans-interactions, we use the same constant value as observed for distant cis interactions.
Bias factors of bins are calculated as a function of their total read-count and the mappability of their surrounding bins. All of the parameters of the model are learned by maximizing the logarithm of likelihood of the observed interactions using the ADAM algorithm [13]. The model is trained in multiple iterations; in each iteration, putative interactions with the best Pvalues are identified and set aside. These significant interactions are ignored in the next phase of training to prevent the trained background model from becoming biased [2]. The same approach is also used in MaxHiC to analyze Capture-Hi-C data. The only difference is that separate sets of parameters are learned for modelling target-target, target-non-target and nontarget-non-target interactions to account for the fact that interactions involving one or two of the targeted regions will be enriched to different levels compared to non-targeted regions. A schematic of the MaxHiC model for analysis of general Hi-C is shown in S1 Fig and a detailed explanation of both models is provided in the methods section.

MaxHiC vs other leading Hi-C tools
In order to evaluate the performance of MaxHiC, we performed a comprehensive benchmarking between MaxHiC and Hi-C significant interactions callers (SIC) including GOTHiC, Fit-Hi-C, Fit-Hi-C2, HiCDC+ [14], and CHiCAGO (for capture Hi-C only). We also compared the performance of MaxHiC with Hi-C loop callers including HiCCUPS [5], Peakachu [7], and Mustache [8]. We used several published Hi-C and capture Hi-C datasets to perform the analyses (see more details about the samples in S1 Table).
We first used MaxHiC to identify significant interactions (at 1kb, 5kb and 10kb bin sizes) in two publicly available Hi-C data from Rao et al. [5] (on the GM12878 EBV transformed Blymphoblastoid cell line and human mammary epithelial cells (HMEC)) and one set of capture Hi-C data from Mifsud et al. [15] (also on GM12878). For each library we also used the abovementioned background correction models and loop callers with their default parameters to identify significantly interacting regions. Tables 1 and S2 summarise the numbers of significantly interacting pairs found using each method. Notably the number of interactions called as significant by each method varied greatly (from 10,487 using HiCCUPS, to 8,136,100 using GOTHiC on GM12878).
As the p-values from all models may not be comparable, we ranked interactions identified by each tool, based on their reported p-value and then took the top 100K and 20K interactions reported by each tool. S2 and S3 Figs show Venn diagrams of the top 100K and 20K interacting pairs identified by each method at 10kb bin sizes.
The sets of interactions called as significant or insignificant by each method (Fig 1A for sample GM12878) had distinct distance distributions. At 10kb bin size, GOTHiC and HiC-CUPS had the smallest median distances between interacting regions of 170kb, while that for MaxHiC, Fit-Hi-C, Fit-Hi-C2 and HiCDC+ were 270kb, 7,590kb, 7,426kb and 480kb respectively. The low median length observed for GOTHiC is reflected in the observation that it calls almost all putative interactions below 100kb as significant.
Comparing the interactions called as significant by each method revealed interactions identified by MaxHiC had substantially more read support than those identified by all SIC models. Importantly, the number of significant interactions identified by MaxHiC is almost two times more than HiCDC+, however, the average read count of significant interactions in MaxHiC is much higher than HiCDC+ (Tables 1 and S2). Both the average number of reads and minimum number of reads supporting a significant interaction were higher for those called by MaxHiC than other models (Tables 1 and S2).
We also compared the average read count of MaxHiC's interactions with the loop caller methods. As Tables 1 and S2 show the average read count of MaxHiC's interactions is smaller than loop callers which was expected as loop callers only report the average signals from selected regions of interest (the centroid of the clusters, CTCF loops, H3K27ac loops, etc), in which the reported frequency for the selected regions is either the total number of interactions in the local cluster or the maximum frequency of all interaction in the local cluster. We then compared ranked interactions identified by MaxHiC based on their reported P-value and compared the average read count of top 20k MaxHiC's interaction with loop callers at 10kb bin size. Interestingly the average read count of significant interactions for MaxHiC (68.6) was higher than the loop callers (HiCCUPS: 58.6, Peakachu: 46.4, Mustache: 62.7).
By examining the average read count of significant and insignificant interactions called by each SIC method (Fig 1B for GM12878), it is revealed that at all distances, significant interactions called by MaxHiC and HiCDC+ had substantially more reads than those called as insignificant. GOTHiC only demonstrated this trend at distances above 100kb. At all distances, paired regions called as significantly interacting by MaxHiC had higher average read count support than the other five methods. This was also reflected in the minimum number and average number of reads required to call an interaction as significant (S4 Fig). We observed a similar pattern when considering interactions calculated using different fragment sizes (S5 Fig shows the same plots using 1kb, 5kb, 10kb, and 50kb bin sizes).

Evidence of regulatory potential for interacting regions identified by MaxHiC
We next investigated whether interacting regions identified by each method as significant were enriched for genomic and epigenetic annotations indicative of regulatory regions. Comparison of the interacting regions with histone mark profiles generated for GM12878 and HMEC by the ENCODE consortium [16][17][18] revealed the regions identified by MaxHiC were significantly (P-value < 0.05) enriched for H3K9ac, H3K27ac, H3K79me2, H3K4me3, H3K4me1, and H4K20me1 marks (Fig 2). They were also enriched for DNAse hypersensitive sites and CTCF binding but not for the repressive mark H3K27me3. Importantly there was very little if any enrichment of these features in regions called as significant by GOTHiC, Fit-Hi-C, and Fit-Hi-C2. There were also enrichments for HiCDC+'s regions, however, these We also used MaxHiC to investigate the enrichment of other important histone markers, such as H3K4me1, H3K4me2, H3K4me3, H3K27me3, H3K9me3 in the significant interactions. As S6 Fig shows all histone marks used in the analysis, were significantly enriched (Pvalue < 0.05) for the significant interactions identified in GM12878, except H3K27me3, H3K9me3, and H4K20me1.
To ensure that MaxHiC's performance is not affected by the number of statistically significant chromatin interactions used in the analyses, we ranked interactions identified by each tool, based on the reported p-value, and then took the top 10K, 20K, 50K, 100K interactions reported by each tool. We then repeated the enrichment analysis. As For the GM12878 analysis we also observed significant enrichment (P-value < 0.05) of transcription factor binding sites from ENCODE ChIP-seq data (S8 Fig and S3 Table). Again,  Table) were downloaded from Maurano et al. [29]. Note: HiCCUPS calls interactions with FDR < 0.05 as significant. For all other methods we used a threshold of P-value < 0.001 to identify significant interactions. https://doi.org/10.1371/journal.pcbi.1010241.g002

PLOS COMPUTATIONAL BIOLOGY
MaxHiC had much higher enrichment than SIC models and the loop callers Peakachu, and Mustache while it had a similar level of enrichment with HiCCUPS for GM12878 specific transcription factor binding sites. Together this suggests that both MaxHiC and HiCCUPS identify interactions between regions with regulatory potential however MaxHiC identifies substantially more (34-fold).
Next, the interacting regions identified by MaxHiC were significantly enriched (Pvalue < 0.05) for ENCODE annotated enhancers, promoters, insulators and transcription states [19,20] (Figs 2, and S7 for top 10k, 20k, 50k, 100kinteractions analyses). They were also significantly depleted (P-value < 0.05) of regions annotated as heterochromatin. Again, HiC-CUPS showed similar levels of enrichment while Fit-Hi-C and Fit-Hi-C2, HiCDC+ showed some lower level of enrichment for some of the chromatin states.
Besides the regulatory regions predicted by ENCODE, significant interactions identified by MaxHiC also contained a much higher fraction of promoters and enhancers identified by FANTOM5 as active in GM12878 and HMEC (in comparison to all FANTOM5 promoters and enhancers) (Fig 2) indicating that MaxHiC not only enriches for promoters and enhancers but those active in the cell type from which the Hi-C data was generated.
Importantly, interacting regions identified by MaxHiC and HiCCUPS on the GM12878 dataset were significantly (P-value < 0.05) more likely to overlap autoimmune SNPs identified in GWAS while those identified in the HMEC were significantly (P-value < 0.05) more likely to overlap breast trait associated SNPs (Fig 2, GWAS traits considered are provided in S4 Table). No such enrichment was observed for SNPs associated with unrelated traits such as kidney, lung, and liver traits.
Lastly, we compared interactions found by MaxHiC to those from the next top SIC method Fit-HiC2 (based on top 10k, 20k, 50k, and 100k interactions analyses) and the top loop caller method, HiCCUPS. In the comparison to Fit-Hi-C2, the interactions found by MaxHiC alone were significantly enriched (P-value < 0.05) for interactions between regulatory regions whereas those found by Fit-Hi-C2 alone showed no enrichment and were near the background expectation of 1 (

MaxHiC enriches for interactions between regulatory regions
We next examined the fractions of interacting pairs identified by each method involving regulatory elements at both ends. We used GM12878 related ChromHMM predicted chromatin segments from ENCODE. We specifically focused on promoter-enhancer, promoter-promoter, enhancer-enhancer and insulator-insulator pairs. In the GM12878 dataset approximately 12% of the raw unfiltered pairs fell into at least one of the above categories (Fig 3A). Strikingly, the significant interactions identified by MaxHiC were almost 4 times (47%) more likely to fall into one of these four categories (Fig 3A). Notably~17% of the pairs corresponded to promoter-enhancer interactions,~15% enhancer-enhancer,~8% promoter-promoter and 7% to insulator-insulator interactions. Similar levels of enrichment were observed for HiC-CUPS and Peakachu but as noted above, HiCCUPS identified almost 34-fold fewer significant loops. HiCDC+'s interactions were also significantly enriched for promoter-enhancer, promoter-promoter, enhancer-enhancer and insulator-insulator pairs. In contrast, the other methods had distributions similar to the unfiltered pairs with Fit-Hi-C, Fit-Hi-C2 and GOTHiC showing fractions slightly above that of the unfiltered pairs. Extending this to any interaction involving an enhancer, promoter or insulator increased the fraction of annotated pairs to 70%. Similar fractions were observed with the HMEC data.
We also repeated this analysis for the top 100K and 20K interactions identified by each model. Interestingly, for the top 100K interactions, MaxHiC had higher enrichment than all SIC models and loop callers in both GM1278 and HMEC cell lines. This enrichment was much higher in the top 20K interactions analysis (S11 Fig). In the top 20K interactions on HMEC cell line, 99%) of MaxHiC's interactions were annotated as enhancers, promoters or insulators. We next used expression quantitative trait loci (eQTL) pairs, identified in EBV-transformed lymphocytes as part of the Genotype-Tissue Expression (GTEx) Project [21] to examine enrichment of eQTL SNPs in regions that interact with promoters. Approximately 2.1% of the interactions identified by MaxHiC in the GM12878 cell line overlapped at least one eQTL pair. This corresponded to a 15.7-fold enrichment (Fig 3B). In contrast GOTHiC, Fit-Hi-C', Fit-Hi-C2, and HiCDC+'s interactions were lower (8.3, 7.9, 9, and 9.6-fold enriched respectively). Interestingly, the enrichment of eQTL pairs in MaxHiC's interaction was slightly higher than HiCCUPS and much higher than Peakachu and Mustache. Matching analyses on HMEC confirmed similar enrichments for regulatory features and eQTLs from mammary tissue (Fig 3B).

Genomic spacing of interacting regulatory elements
Using the putative regulatory region annotations from the analysis above, we next examined the spacing preferences for interactions between different regulatory features. We first examined preferred spacing for H3K27ac-H3K27ac and CTCF-CTCF structural loops. For MaxHiC, Fit-Hi-C, and Fit-Hi-C2 we observed the spacing between CTCF pairs to be larger (130kb) than H3K27ac-H3K27ac pairs (80kb) (Fig 4A). In both cases there were long tails reaching beyond 1Mb. Performing the same analysis on annotated regulatory regions revealed promoter-promoter, enhancer-promoter, and enhancer-enhancer pairs had almost identical spacing preferences (with median spacings of 70kb, 70kb and 90kb respectively), while insulator-insulator pairs were considerably further apart with a median spacing of 170kb (Fig 4B). We did not see this trend for HiCDC+ and loop callers, however we note HiCCUPS did not identify significant interactions below 50kb and above 5Mb.

MaxHiC contact maps and aggregate peak analyses
Comparing contact maps of significant interactions found for MaxHiC and HiCCUPS models (S12 Fig) shows at first glance they are very similar, however in a zoomed in region we can see additional interactions called by MaxHiC that are missed by HiCCUPS. Examining this in more detail revealed as approximately 8-fold more bases are identified as significantly interacting by MaxHiC than HiCCUPS (S13 Fig). However, to further show that these interactions are not random we provide an aggregate peak analysis (APA) comparing signals surrounding significant peaks called by MaxHiC and HiCCUPS. Fig 5 shows the aggregate background interaction frequency of 5 bins upstream and downstream of the significantly interacting bins. APA plots are shown for the top 5k, and 10k. As the figure shows, MaxHiC and HiCCUPS have almost the same APA patterns and confirmed that the significant interactions identified by MaxHiC are not random and that even when all 352k significant interactions identified by MaxHiC are considered there is strong focal enrichment in the APA. To demonstrate application of MaxHiC to a very large dataset we also analysed a Micro-C dataset [22] consisting of 480M raw valid interactions. The contact map for this dataset is shown in S14A Fig. Reassuringly matching aggregate peak analysis plots confirm there is strong focal enrichment for these significant interactions S14B Fig.

Application of MaxHiC to identify differentially interacting regions
To further demonstrate the use of MaxHiC in comparative Hi-C experiments we applied it to Hi-C datasets from cells depleted of the cohesion release factor WAPL (Haarhuis et al. [23]) and the polycomb protein RING1B cells (Boyle et al. [24]); and their matched wildtype controls. In the case of WAPL-depleted HAP1 cells MaxHiC identified almost 3 times more significant interactions than in matched wildtype cells (Fig 6A and 6B). Furthermore the distance distribution of interacting genomic regions found in the ΔWAPL cells (median = 6.4Mb) was significantly longer in comparison to wildtype (median = 3.1Mb) (Fig 6C); this reiterates the finding of Haarhuis et al. [23].
In the case of the RING1B depleted cells, MaxHiC was able to re-identify differentially interacting regions identified in the original paper (S15A and S15B Fig replicate Fig 5A from Boyle et al. [24]). We also observed that wildtype cells have a slightly larger median distance between interactions (325kb for WT) than the knockout cells (258kb for RING1B-/-) (S15C Fig). Lastly we examined enrichment for repressive H3K27me3 and RING1B sites (using ChIP-seq data from wildtype cells [25]) in interacting regions found in wild type cells and

MaxHiC identifies more validated pairs of element-gene pairs tested by CRISPRi
Lastly, as significant interactions identified by MaxHiC were enriched for enhancer-promoter interactions we sought to test whether these interactions are enriched for functional regulatory potential. To do this we asked whether interactions identified by MaxHiC and alternative tools run on a K562 Hi-C dataset from Rao et al. [5] could connect a set of 134 functional regulatory element-target gene pairs (involving 103 CRISPRi targeted regions) identified in K562 cells [26].

Discussion
Enhancers have important roles in controlling gene expression. Linking distal regulatory regions such as enhancers to the genes that they regulate is critical to the interpretation of regulatory variants identified in whole genome data. Importantly, genomic variants in enhancers and regulatory elements have been linked to human complex diseases [27,28]. Therefore, pairing enhancer regions to their regulatory targets is crucial for diagnosis of genetic diseases. Compared to simplistic assignment of enhancers to their nearest promoters, computational methods have improved the prediction of enhancer-promoter pairs [29][30][31][32]. However, physical interaction data is still far more convincing in validating the pairs. Hi-C data offers one approach to identify these links, however Hi-C sequencing libraries are extremely noisy because of self-ligations, background ligation artifacts and other complex sources of biases such as GC content and mappability of the sequences. Given this, accurate identification of

PLOS COMPUTATIONAL BIOLOGY
significantly interacting regions is dramatically affected by the tool used. Here, we have presented MaxHiC, a new tool for identifying significantly interacting regions from Hi-C data (and capture Hi-C data see S17 and S18 Figs and S5 Table). Through a dramatically improved discrimination of biologically relevant interactions from background, we showed that MaxHiC enriches for interacting regions with substantially more evidence of regulatory potential (Figs  1 and 2) than those identified by existing Hi-C analysis tools [2][3][4]9]. This is a critical advance as enhancer-promoter interaction maps are needed to interpret the impact of non-coding variants identified in intergenic regions. Importantly CRISPRi data were able to confirm that MaxHiC enriches for functional enhancer-promoter relationships.
From our analyses, only MaxHiC and HiCDC+ (and to a much lesser extent Fit-Hi-C2) enrich for interactions between regulatory regions, but MaxHiC regions, had much higher enrichment than HiCDC+. Specifically, interactions identified by MaxHiC were enriched for regulatory features, disease-associated genome-wide association SNPs, eQTL pairs and experimentally validated enhancer-promoter interactions. Although interacting regions identified by loop callers had similar levels of enrichment MaxHiC found 34-fold more significantly interacting regions. Our top'k' (k:10k, 20k, 50k, and 100k) interactions analyses showed that MaxHiC significantly outperformed not only SIC models, but also loop callers. Additionally, loop callers such as HiCCUPS appears to have limitations on the distances at which interactions can be found and the bin sizes possible. They also need very deep Hi-C resolution and substantial computational resources. From this observation we conclude that MaxHiC is a substantially better tool than existing Hi-C analysis packages.
We also implemented a capture version of MaxHiC to analyse capture Hi-C experiments; in the capture Hi-C technology, target regions (also called baits) are sequenced with higher depth than the others, resulting in three different types of interactions target-target, targetother and other-other with different properties. Capture Hi-C baits, but not other ends, have a further source of bias associated with uneven capture efficiency, thus for capture Hi-C experiments we model each class separately. As S17 and S18 Figs show the capture version of MaxHiC also identifies target-interacting regions that are enriched for regulatory features and disease-associated GWAS SNPs. The levels of enrichment are much higher than the regions identified by GOTHiC, Fit-Hi-C, and Fit-Hi-C2, and HiCDC+. The average read count of significant interacting regions identified by MaxHiC were much higher than other models. Importantly, MaxHiC's interacting regions had higher level of enrichment than the interacting regions identified by CHiCAGO, a specific model designed for the analysis of capture Hi-C data.
In conclusion, MaxHiC is a new open source tool for identifying significant interacting regions from Hi-C and capture Hi-C data. We have demonstrated that it significantly outperforms existing tools in terms of enrichment of interactions between known regulatory regions. We will continue to develop MaxHiC related tools for identifying topologically associating domains (TADs) and for identifying interactions that are significantly different between conditions. As single-cell Hi-C data becomes more broadly available we will also adapt MaxHiC to work with these datasets. Lastly, we believe that with minor modifications the same principles used here to analyse Hi-C data may be more broadly extended to identify significant interactions from RNA-chromatin interaction datasets [33,34].

Online methods
Tool availability. The MaxHiC source code and python package, a sample dataset, and instructions on how to run MaxHiC are provided at https://github.com/bcb-sut/MaxHiC. For more details about each parameter, please visit the GitHub page. A tutorial on using the MaxHiC package is also provided in S1 Information (instructions for MaxHiC). MaxHiC is memory and speed optimized and will work with both single core and multicore machines. It can work with very large Hi-C data sets (as bam file or contact maps generated by other tools) to identify significant interactions at a broad range of resolutions (1kb-500kb).

Publicly available Hi-C data and data access
Samples GM12878 B-lymphoblastoid cell line, Human Mammary Epithelial cell line, and k562 cell line Hi-C data from Rao et al. [5], capture Hi-C sample GM12878 B-Lymphoblastoid from Mifsud et al. [15] and mouse cortex Hi-C sample from Dixon et al. [35] were used to benchmark MaxHiC against other models. All samples are publically available. More details about the samples are provided in S1 Table. Mapping, filtering and Interaction calling MaxHiC takes as input either bam files or contact map files generated by packages such as HiCUP [36] and HiC-Pro [37]. For the analyses presented here, FASTQ paired-end reads were aligned to the human hg19 genome and filtered through the HiC-Pro pipeline (statistics are presented in S1 Table). To remove duplicate reads, filter for valid interactions, and generate Hi-C interaction matrices (5kb and 10kb bin-sizes), we set the following parameters: MIN_-FRAG_SIZE = 230; MAX_FRAG_SIZE = 1100; MIN_INSERT_SIZE = 120; and MAX_IN-SERT_SIZE = 990. All experimental artifacts, such as circularized reads and re-ligations, singletons have been filtered out using HiC-Pro and MHiC [38]. We also removed all selfinteractions from the downstream analyses. In-house scripts were then used for statistical evaluation of Hi-C libraries (see S1 Table).

The MaxHiC algorithm
In the following, we explain MaxHiC algorithm and mathematics behind the model for the analysis of both Hi-C and capture Hi-C experiments.

A negative binomial regression background model for Hi-C data
In MaxHiC, a negative binomial distribution has been used to model the read count of interactions based on the bias factors of the two ends and their genomic distance. The negative binomial distribution is the analytic form of a Poisson distribution with its mean parameter following a gamma distribution to account for over-dispersion in data. As shown in Eq 1, the observed read count of interactions, x ij , is modelled using their expected read count, μ ij , and a shared dispersion factor, r, between all interactions.
MaxHiC considers 'neighborhooding' (similar to local background) when identifying significant interactions. The parameters of the model are learned using a maximum likelihood approach. Gradient descent has been used to find the values of the parameters maximizing the logarithm of the likelihood function. In order to have higher accuracy, by having a higher difference between the read count of meaningful and meaningless interactions, and also to have higher speed, the proposed model uses the interactions between DNA bins, not DNA fragments. DNA is divided into pieces and all reads recorded for each fragment are assigned to the bin containing their middle points. All of the recorded interactions between pairs of bins with at least one read are the input samples used for training the model. After the model has been trained, the P-value of each interaction can be calculated based on the value of reliability, 1-CDF, for its observed read count.

Calculating the expected value for the read count
The expected read count for the interaction between bins i and j is calculated according to Eq 2. f(d ij ) indicates the expected value of read count to be recorded in the experiment for any two loci with distance d ij , v i and v j are bias factors of the two interacting bins. As mentioned above, bins have different characteristics and different probabilities of showing up in the experiment that affects the expected value of their interactions. It's assumed that the effects of two bins are independent from each other, so the total bias factor has been calculated as a product of the bias factors of the two ends.

Calculating bias factors for bins
In order to be applicable to all types of Hi-C protocols, accounting for the unknown sources of bias and also the relation between the known sources, the second approach, mentioned in the introduction, has been adopted in this study. A uniform bias factor has been assumed for each bin, which becomes more valid as the resolution of the analysis increases and the length of bins become shorter. We expect the bins included in more interactions with higher reads to have higher tendency to show up in the experiment. In addition, the total number of interactions that a bin has been included in, would have no effect on its bias. To deal with these problems, the bias factor for bins have been calculated using Eq 3 in which v base , b and b m are the parameters of the function and s i is the total number of reads observed for bin i. Here it is assumed that the total number of reads observed for each bin is a measure of its ability to show up in the final results of the test, and bias factor is assumed to be a linear function of s i in loglog space. v base is also considered as the base possible value for the bias factor. This prevents the bins with very few recorded reads to affect the parameters in the training process and it also prevents their related interactions with very few reads from showing up as significant because of their abnormally low bias. Many tools do this by putting a threshold and filtering bins with less bias, but in MaxHiC it is done automatically without any need for adjusting the threshold. As illustrated in Eq 3, v i is proportional to s i raised to a power b. The power can be learned and if it becomes less than one it can express a saturation phenomena. As duplicated read pairs are discarded in post-processing, having more read count for two interacting bins as the read count of their interaction increases, becomes lower than their initial potential.

Calculating expected read count as a function of distance
The expected read count as a function of distance for cis interactions is illustrated in Eq 4. The function is the maximum of two values. The first one, f v (d ij ), is a degree 3 polynomial function of genomic distance in log-log space which specifies the expected read count of interactions based on the linear genomic distance between the two interacting loci. The reason for using an exponential function is the fact that Brownian motion and diffusion can be modelled using an exponential function. This assumption is also consistent with the power law function observed for the read count as a function of distance which has been introduced by Lieberman et al. for the first time [39]. The degree of 3 has been used so the function can have varying curvature in different distances. In this function, the logarithm of distance has been used instead of the distance itself, as DNA is not a straight string and has multiple folding with different scales that causes proximity to have a closer relation with the logarithm of the length than the length itself. A similar concept has been used in Fit-Hi-C [2], CHiCAGO [3] and HiC-DC [1] methods. The biggest difference is the fact that all of these methods consider a strictly decreasing function for the expected read count which approaches zero as the genomic distance increases and causes all interactions between far genomic loci to be considered significant while they may have only one read. This cannot be completely logical as farther than some specific distance, random background collisions occur due to three dimensional closeness not the genomic closeness. This three dimensional closeness can be due to closeness to the fixed three dimensional structures such as loops or random dynamic crossing of DNA string while two different parts of it are having contact. The expected read count for these kinds of random interactions has been modeled using a constant parameter, f c . Therefore, in close genomic distances f vðd ij Þ is dominating while the value of f vðd ij Þ becomes very small, the value of f c dominates the expected read count. For trans-interactions, similar to the cis interactions for far loci, a constant parameter has been used to denote the expected value of read count for any random background trans-interactions.
f vðd ij Þ ¼ e ða 3 :lnðdÞ 3 þa 2 :lnðdÞ 2 þa 1 :lnðdÞþa 0 Þ Maximum function is not a smooth function to use in optimization. Therefore, instead of maximum, the soft-maximum function has been used. This soft-maximum resembles the maximum function. When an argument, for example α, is larger than the other, it's exponential becomes far more larger and so it's related factor, g t a , approaches 1 while the other factor, g t b , approaches 0. Therefore, the final value will be approximately equal to α itself.
e k�a e k�a þ e k�b ; g t b ¼ e k�b e k�a þ e k�b

Training the model in multiple rounds
As mentioned above, the method models read count of interactions occurring based on Brownian motion, not the meaningful chromatin contacts, which are called real interactions from now on. But real interactions are also among the data and can affect the parameters of the model and pull them up. For example, in distances where there are more read interactions, the expectation may become higher than other distances. To deal with this bias, the model is trained in multiple rounds, which can be specified by user (the default value for iterations is 4). In each round, first the parameters of the model are learned using a stochastic gradient descent approach. The training is done in iterations until all the parameters of the model have changed by less than 0.0001 compared to the previous iteration. Training also will stop when reaching 1000 iterations to handle oscillating situations around the optimum values. This number is high enough to give the model enough time to be trained well and has been chosen by the developers based on running the model on multiple Hi-C datasets (25 samples with 2-4 different resolutions). Next, a P-value is calculated for each interaction using the trained model and the real interactions are identified based on a fixed user-defined P-value as significance limit. The read count of these real interactions is substituted with their expected read count for calculating the total number of reads for each bin, as if they were not read what would their read count be, to remove the bias of having real interactions from bias factors of the bins. They are also completely ignored in the training of the model's parameters. This is another main difference between MaxHiC and other models.

An automatic way of removing the effect of noisy data
Typically, a high percentage of interacting fragments in Hi-C have only 1 read between; for example, the typical percentage can be above 85% for a resolution of 5Kb. End bins of many of these interactions have very low coverage which results in a very small expected value for the interaction. This can pull the value of parameters down for the benefit of themselves in the model. In order to cope with such noisy data, the expectation for read count is calculated based on Eq 6 instead of Eq 2. In this way, the maximum of expected value and a base expected read count, base e , is used as the final expected value. This will prevent the mentioned noisy interactions to affect greatly the parameters of the model.

A negative binomial regression background model for capture Hi-C data
Capture-Hi-C is a version of Hi-C in which some regions in DNA, which are called Baits or Target Regions, are sequenced with higher depth than the others. This will result in three different types of interactions in Capture-Hi-C with different properties, Target-Target, Target-Other and Other-Other. The background model developed for Capture-Hi-C follows the same statistical model as the one developed for General-Hi-C, but in order to account for the differences between these types of interactions, three different sets of parameters are used, each for one interaction type. Each set of parameters is learned separately from the others based on its own interactions, in the process of learning the parameters of the model. In calculating the bias factors for bins, the sum of read counts cannot be used directly as different interactions correspond to different types. To remove this barrier, all the read counts of interactions are converted to their equivalent read-counts in Target-Target interaction type: In Eq 7, t 1 and t 2 stand for the original interaction type and the one we want to convert the read-count to, which is assumed to be Target-Target, respectively. The expected value for each interaction type is easily calculated by using the set of parameters related to that interaction type and the calculated properties of the interaction, sum of reads for bins and their genomic distance, in the expected read-count formula regardless of the original type of the interaction.

Justification for the estimation of expected value of read count for trans chromatin interactions
In MaxHiC, distance is modelled by a function that decreases at increasing genomic distances to reach a small but constant non-zero value to account for random ligations. For trans-interactions, we use the same constant value as observed for distant cis interactions. Here two things are considered in the function modelling distance, more exactly the function modelling expected random ligations based on distance. It is a strictly decreasing function and it converges to a constant nonzero value when distance gets high. Which means it converges to a constant value when distance gets larger than a limit. In trans interactions, distance is not close so the same constant is considered in calculating the expected value.
To show that the long distance cis have similar read counts as trans, we counted average read count of long distance (� 100Mb) significant cis interactions (P-value < 0.001) against average read count of all significant trans interactions (P-value < 0.001). As a results, long distance significant cis interactions had a very similar average read count (21.7) as significant trans interactions (22.1). Note: the average read count of all significant cis interactions was 29.4. We also plotted distribution of number of significant long distance cis interactions against significant trans interactions. As the plot shows both groups have almost the distribution S19 Fig. Assessment of feature enrichment (ENCODE regulatory features, FANTOM5 promoters/enhancers) Annotation datasets. Tissue specificity is an important aspect of many genetic diseases [40]; We therefore downloaded tissue-specific ChIP-Seq data in the form of processed peak calls for histone modifications (e.g. H3K27ac, H3K4me1, etc), CTCF binding sites, DNase hypersensitive sites, and ChromHMM predicted promoters and enhancers from the ENCODE project [19]. We used FANTOM5 active promoters and transcribed enhancers from the FAN-TOM5 consortium [41,42]. FANTOM5 tissue-specific enhancers were obtained from [42].
Region-based analysis. To compute the enrichment of interactions overlapping an epigenetic feature, we first calculated the fraction of significant interactions with at least one end overlapping a feature. We then calculated fraction of all interactions overlapping with the feature as the genome-wide expectation. The enrichment value was then calculated by dividing the fraction of significant interactions overlapping a feature by the genome-wide expectation.

Enrichment value ¼ fraction of significant interactions overlapping a feature fraction of all interactions overlapping a feature ð8Þ
To calculate the enrichment P-value, we used Fisher's Exact test in the following manner: (i) number of significant interactions overlap with the feature; (ii) number of significant interactions that do not overlap with the feature; (iii) number of all interactions overlap with the feature; (iv) number of all interactions that do not overlap with the feature. The same approach is used to calculate the P-values for other enrichment analyses in this study.
Pair-based analysis. To compute the enrichment of interactions that both sides of an interaction overlap with an epigenetic feature, we repeated the above analyses in which both sides of interaction overlap with the feature.
Capture Hi-C analysis. In the capture Hi-C analysis, to ensure our results were not affected by the strong peak signal over target regions, only target-interacting regions ("bait-toany interactions") are considered and signals that mapped as bait-bait interactions and those signals that mapped outside of the target-interacting regions are excluded from the analysis. To calculate the fraction of significant target-interacting regions overlapping an epigenetic feature in the capture Hi-C analysis, we first calculated the fraction of significantly target-interacting regions overlapping a feature. We then randomly selected fragments that had no interaction with the target regions (we selected exactly the same number of fragments as the number of significant target-interacting fragments). The enrichment value then calculated by dividing the fraction of significant interacting regions overlapping a feature on the average fraction of 1000 permutations. We performed the above analysis for each Hi-C background model, separately.

Enrichment value in capture HiÀ C data
¼ fraction of significant interactions overlapping a feature average fraction of random interacting regions overlapping a feature in 1000 permutations ð9Þ

GWAS data analysis
GWAS SNPs for autoimmune disease, neurological/behavioural traits and kidney/liver/lung disorders were downloaded from Maurano et al. [29]. To compute the enrichment of interactions overlapping GWAS SNPs, we first calculated the fraction of significant interactions with at least one end containing GWAS SNPs. We then calculated fraction of all interactions overlapping GWAS SNPs as the genome-wide expectation. The enrichment value was then calculated by dividing the fraction of significant interactions overlapping GWAS SNPs by the genome-wide expectation. In the analysis of capture Hi-C library, we only considered overlapping of GWAS SNPs with target-interacting regions in both the significant target-interacting regions and the permutated target-interacting regions.

Overlapping Hi-C interactions with eQTL pairs
We used the set of GTEx v7 eQTLs identified as significant in EBV-transformed lymphocytes from the Genotype-Tissue Expression (GTEx) Project [21]. To calculate the fraction of eQTL pairs overlapping with significant interactions, we considered those Hi-C interactions where one side of the interaction overlapped an eQTL SNP and the other side overlapped the promoter of the eQTL target gene identified by GTEx. The promoter regions are defined as 5kb± of starting point of a gene (TSS). To calculate enrichment, we divided these by the genomewide expectation.

CRISPRi validation
To validate P-E pairs identified in Hi-C data, an orthogonal dataset of validated enhancer-promoter pairs was taken from Nasser et al. publication [26]. CRISPR perturbations in K562 cells identified 103 non-promoter elements involved in 134 functional element-gene relationships defined as causing significant changes in gene expression in K562 cells. We then counted a) number of non-promoter elements that overlapped with significant fragments identified by each model; b) number of element-gene pairs that overlapped with significant pairs identified by each model.

Parameters used for comparative Hi-C tools
All tools were run with their default parameters. For SIC models, we used P-value 0.001 to identify significant interactions. For loop callers, we used FDR 0.05 to identify significant loops.
Supporting information S1 Information. In this file, we provided an instruction to run MaxHiC. We also developed a separate version of MaxHiC to handle capture Hi-C libraries. In the capture Hi-C library, we are particularly interested to identify significant interactions that at least one end of interaction is baited (i.e., a capture region). Enrichment of GM12878 specific histone marks, putative regulatory regions and GWAS signal in regions identified by CHiCAGO, GOTHiC, Fit-Hi-C, Fit-Hi-C2, HiCDC+ and capture version of MaxHiC, on the capture Hi-C library GM12878 from Mifsud et al. [15]. This showed that significant target-interacting pairs identified by MaxHiC have much higher enrichment for regulatory pairs and autoimmune related GWAS SNPs.