Can random walking on a Hi-C contact matrix lead to data quality improvement? An assessment

Yongqi Liu; Shili Lin

doi:10.1371/journal.pone.0327100

Abstract

Hi-C and single cell Hi-C (scHi-C) data are now routinely generated for studying an array of biological questions of interest, including whole genome chromatin organization to gain a better understanding of the chromosome three-dimensional hierarchical structure: compartments, Topologically Associated Domains (TADs), and long-range interactions. Due to concerns about data quality, especially for scHi-C because of its sparsity, data quality improvement is seen as a necessary step before performing analyses to answer biological questions. Methods have been developed accordingly, among them a set of “random walk”-based methods, including random walk with a limited number of steps (RWS) and random walk with restart (RWR). Nevertheless, there is little justification for the use of such methods, nor quantification of their performance. Taking correct identification of TADs as the end point, in this paper we describe the characteristics of random-walk-based approaches and carry out an empirical investigation of TAD identification before and after random walks. Because practical guidelines for choosing the tuning parameters necessary for performing random walks are lacking, it is difficult to know how many steps of random walk for RWS, or how small a restart probability for RWR, one should choose to achieve good performance. Even in the unrealistic scenario in which one has the hindsight to use the optimal parameter values, little improvement in downstream TAD analyses from first performing a random walk was observed. This conclusion was based on extensive analytical analyses, a simulation study, and real data applications. Therefore, the current study provides a cautionary note to researchers who may consider using random-walk-based approaches prior to downstream analyses.

Citation: Liu Y, Lin S (2025) Can random walking on a Hi-C contact matrix lead to data quality improvement? An assessment. PLoS One 20(9): e0327100. https://doi.org/10.1371/journal.pone.0327100

Editor: Ravi Prakash, University of Michigan, UNITED STATES OF AMERICA

Received: June 10, 2025; Accepted: September 8, 2025; Published: September 23, 2025

Copyright: © 2025 Liu, Lin. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The software RW and the data analyzed in this manuscript are available at https://github.com/osu-stat-gen/RW. The human embryonic stem cells bulk Hi-C data can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35156. The GM and PBMC single cell Hi-C data can be downloaded from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE117874.

Funding: This work was supported by the National Institutes of Health [R01GM114142 to SL]. There was no additional external funding received for this study.

Competing interests: No authors have competing interests.

1 Introduction

Chromosome conformation capture (3C) and its derivatives are methods for quantifying interactions among genomic loci. In particular, the High-throughput Chromosome Conformation Capture (Hi-C) technology [1] and other subsequent improvements [2–3] permit genome-wide studies of the chromatin three-dimensional (3D) structure. Numerous methods have been developed in the past decade to process Hi-C data (bulk or single-cell) with various goals. One prominent example is for finding Topologically Associated Domains (TADs), which are contiguous regions of the genome associated with chromatin 3D structures and functions [4]. Loci within the same TAD interact with much higher frequency than those between different TADs, and the domain boundaries are often demarcated by CTCF insulator proteins [5–6].

With the development of technology, single-cell Hi-C data (scHi-C) are available for more in-depth studies of chromatins [7–11]. Hundreds or even thousands of cells can now be processed in a single run, which can then be used to study cell-to-cell variability in chromatin structures for cells within the same cell type. For example, Nagano et al. (2017) [10] utilized thousands of scHi-C data to study the chromosome cell-cycle dynamics and place the cells along the cell-cycle.

Hi-C data are usually organized as an $N \times N$ contact matrix, where $N$ is the number of equal-sized bins of the genome (also referred to as loci). Each element of the contact matrix is the number of paired-end reads signifying interactions between the corresponding pair of loci. Sparsity (i.e., with many zeros in the contact matrix) is a major concern, especially for scHi-C data [12–13], where the sequencing depth is often less than one-tenth of bulk data [11]. For better performance of downstream analysis, such as revealing the underlying domain structures, data quality improvement is essential. To address this issue, a number of methods have been developed, which all aim to smooth the data in some way and impute the zeros by borrowing information from neighbors. Such methods include HiCRep [14], HiCPlus [15], GenomeDISCO [16], scHiCluster [12], SCL [17], DeepHiC [18], ScHiC-Rep [19], SnapHiC [20], SnapHiC2 [21], scHiCStackL [22], Higashi [23], HiCImpute [24], and scHiCPTR [25]. Some of these methods were proposed for bulk data, while others were designed specifically for single cell data, where further information may be borrowed from similar single cells and even bulk data. Among them, a prominent class is random-walk based [12,16,19–22,25]. S1 Table provides a list of methods/software packages that aim to address the issue of sparsity and data quality improvement, where the first half contains methods that are (at least partially) random-walk based.

A standard random walk method with a finite number (say $s$) of steps (RWS) propagates the information of all interactions related to locus $i$ and all interactions related to locus $j$ through the transition matrix constructed from the contact count matrix. The $(i,j)$ element of an RWS-smoothed contact matrix is the corresponding element in the $s$-step transition probability matrix (TPM), for a suitable $s$. It is suggested in GenomeDISCO that performing three steps of RWS ($s = 3$) gives the best result of smoothing with the purpose of quantifying reproducibility of data [16]. As for scHiCluster and SnapHiC, part of their data quality improvement methods is to conduct a random walk with restart (RWR) algorithm, which not only utilizes the global information as in RWS but also preserves some local information through repeated restart [12,20].

The idea of using RWR for Hi-C data smoothing was borrowed from the usage of RWR in image processing, which appears to have emerged from the field of Computer Science [26]. In Pan et al. [26], the authors defined the affinity between two nodes in a network (analogous to the interaction intensity of two genomic loci in Hi-C data) to be the “steady-state” probability of RWR, which is the usual stationary distribution in a random walk modified by a restart component. In a standard random walk, the stationary distribution can be found by iteratively taking the product of the one-step transition matrix; similarly, the steady-state probability of RWR is found by taking many iterations until convergence. Therefore, unlike the use of RWS proposed in [16] for a finite number of steps, RWR generally runs the algorithm until convergence.

For methods relying on RWS, although random walking for three steps was suggested for one particular goal [16], whether this is appropriate for other purposes with different datasets has not been investigated. On the other hand, there does not seem to be any concrete recommendation for the restart probability in RWR, to the best of our knowledge. SnapHiC uses restart probability of 0.05 without justification [20], while scHiCluster claims in the appendix that this method is robust to the choice of restart probability, where the values of 0, 0.1, 0.2, …, 0.9 were tested [12]. Despite the prevalence of random-walk-based methods in Hi-C research, the rationale behind RWS and RWR (and the corresponding obligatory tuning parameter selection) for data quality improvement remains unclear, and little has been done to evaluate their impact on downstream analyses.

To apply a random-walk method, a Hi-C data matrix needs to be turned into a TPM. For Hi-C data manipulation, the first step is usually bias removal. This normalization can be done by vanilla coverage (VC), which divides a contact count by the sum of the corresponding row and column [1], square root vanilla coverage (sqrtVC), which divides a contact count by the square root of the product of corresponding row sum and column sum [2], or iterative correction and eigenvector decomposition (ICE), which iteratively corrects for the experimental biases and then performs eigenvalue decomposition [27]. Each element in the bias-removed matrix is then divided by the row sum, so that each row sums to one, the defining characteristic of a TPM. However, such a TPM is no longer symmetrical, violating a basic property of a Hi-C derived data matrix.
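
The loss of symmetry under row normalization can be checked on a toy example. The sketch below uses a small, made-up symmetric count matrix (the values are purely illustrative and not taken from any dataset in this study) and NumPy; it is not the authors' code.

```python
import numpy as np

# Toy symmetric contact matrix (hypothetical counts, for illustration only).
counts = np.array([[10.0, 4.0, 1.0],
                   [4.0, 8.0, 2.0],
                   [1.0, 2.0, 6.0]])

# Divide each row by its row sum so that every row sums to one.
row_tpm = counts / counts.sum(axis=1, keepdims=True)

rows_sum_to_one = np.allclose(row_tpm.sum(axis=1), 1.0)
is_symmetric = np.allclose(row_tpm, row_tpm.T)
print("rows sum to one:", rows_sum_to_one)
print("still symmetric:", is_symmetric)
```

Because the row sums differ, the (1,2) and (2,1) entries are divided by different numbers, so the row-normalized matrix is a valid TPM but is not symmetric.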

In this study, we aim to investigate whether random-walk-based methods are good choices for improving Hi-C data quality, bulk or single cells. We first study the theoretical basis of RWS and RWR for Hi-C data quality improvement and carry out an empirical study to visualize the data before and after random walking on the Hi-C matrix. Then, taking TAD detection as an example downstream analysis, we investigate the impact of RWS- and RWR-smoothed data matrices on performing such a task. In particular, we consider three TAD detection algorithms from a large collection of such methods, namely CaTCH [28], HiCseg [29], and TopDom [30], which were top performers among twenty-two TAD finders in a review article [31]. As a by-product, we also evaluate the relative performance of these three methods to assess the claims of robustness and consistency drawn in the review paper.

2 Materials and methods

We first discuss the theory of RWS and RWR in the context of Hi-C contact matrices with TAD structures; that is, greater contacts among loci within a TAD (diagonal blocks in the contact matrix) but much weaker contacts between TADs. The properties of RWS can be easily found in textbooks on stochastic processes (e.g., [32]), as RWS is simply a special case of a Markov chain. On the other hand, the concept of RWR is relatively new and primarily used in computer science, and its property is not widely studied, to the best of our knowledge. Therefore, in this paper, we will study the stationary distribution of RWR within our context. Since a Hi-C matrix is symmetric, we implement the Knight-Ruiz (KR) algorithm [33] for normalization to obtain a doubly stochastic TPM that preserves such a symmetric property.

2.1 The theory of RWS as a special case of Markov chains

Suppose $P$ is an $N \times N$ TPM. Then the $(i,j)$ element of the transition matrix represents the probability of an RWS starting at locus $i$ reaching locus $j$ in one step. The $s$-step TPM, $P^s$, is obtained by multiplying $s$ one-step TPMs together. It is easily seen that $P^s$ is also symmetrical if we start with a symmetrical TPM $P$; therefore, it is doubly stochastic. Suppose the limiting matrix exists; then it can be obtained by letting $s \to \infty$ (or setting $s$ to be sufficiently large in practice), and the limiting matrix will have identical rows (or nearly identical rows for a finite but large enough $s$); each row is referred to as the limiting distribution. Further, if the TPM corresponds to an irreducible Markov chain (i.e., every locus can be reached from every other locus), then the limiting matrix will in fact have all identical entries. In the context of our problem, where our goal is to improve data quality (preserving TADs and enhancing their detection, for example), we would not want to take a long random walk to reach the limiting matrix, since TADs would disappear. Therefore, the essential question is the Goldilocks number of steps: not too few but not too many. We will explore whether a rule of thumb provided in the literature, three steps (i.e., $s = 3$), works in various settings.

We provide a concrete example where is a symmetric doubly stochastic matrix (can be thought of as a KR-normalized matrix from a Hi-C contact matrix — see Section 2.3) with TAD structures:

(1)

where we hypothesize TADs each of size is an all-1 square submatrix of dimension ; is an all-1 column vector of length , and is a constant. Thus, the total number of loci is . Since is doubly stochastic, the limiting matrix is when . This simple example demonstrates that, if too many steps are taken using RWS, the smoothed matrix will simply merge all domains together and the original block-diagonal structure is ruined. The first row of the heatmaps in S1 Fig show , for for an ideal matrix as in (1): for , respectively, and We can see that, after 10 steps, all domains vanished, illustrating the above theoretical analysis.
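
The flattening effect of a long random walk can be illustrated numerically. The sketch below does not reproduce the matrix in (1) or its parameter values; it builds a hypothetical symmetric doubly stochastic matrix with three equal-sized diagonal blocks (the function `block_tpm` and the values 3, 4, and 0.01 are assumptions made for illustration) and tracks how the gap between a within-block entry and a between-block entry shrinks as the number of steps grows.

```python
import numpy as np

def block_tpm(n_blocks, block_size, between):
    """Symmetric doubly stochastic matrix with equal-sized diagonal blocks.

    Every entry gets a background weight `between`; entries inside a
    diagonal block get an extra weight chosen so that each row sums to one.
    """
    n = n_blocks * block_size
    extra = (1.0 - between * n) / block_size
    blocks = np.kron(np.eye(n_blocks), np.ones((block_size, block_size)))
    return extra * blocks + between * np.ones((n, n))

# Hypothetical example: 3 domains of 4 loci each (12 loci in total).
P = block_tpm(n_blocks=3, block_size=4, between=0.01)

# Within-block minus between-block entry of the s-step matrix P^s.
contrast = {}
for s in (1, 3, 10, 200):
    Ps = np.linalg.matrix_power(P, s)
    contrast[s] = Ps[0, 1] - Ps[0, -1]
    print(s, round(contrast[s], 6))

# After many steps every entry approaches 1/N (here 1/12).
limit = np.linalg.matrix_power(P, 200)
```

In this toy setting the contrast decays geometrically with the number of steps, which is the numerical counterpart of the domains fading from the heatmaps. How fast it decays depends on the chosen background weight, so the step counts here should not be read as matching S1 Fig.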

2.2 The limiting matrix of RWR for Hi-C matrix with TAD structures

Suppose we start with the same TPM $P$ as in RWS. Then, for RWR, at each locus $i$, there is an additional fixed probability $r$ that the Markov chain can restart; that is, remaining at locus $i$. The $s$-step TPM is $Q_s = (1-r)\,Q_{s-1}P + rI$, where we set $Q_0 = I$, the identity matrix with the same dimension as the TPM $P$. We can easily see that $Q_s$ is symmetrical if $P$ is symmetrical. Further, when $r = 0$, $Q_s = P^s$, reducing to RWS with $s$ steps. Suppose the limiting matrix $Q$ exists; then it must satisfy $Q = (1-r)\,QP + rI$, which, after some simplification, can be expressed as

$$Q = r\,\bigl(I - (1-r)P\bigr)^{-1} \qquad (2)$$

provided that the inverse exists.
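
As a numerical check of the relationship between the iterative scheme and its limit, the sketch below assumes the commonly used RWR recursion (each step multiplies the previous matrix by the TPM with weight one minus the restart probability, then adds the restart probability times the identity) and compares the converged iterate with the matrix-inverse form. The function names, the 4 by 4 matrix `P`, and the restart probability 0.1 are hypothetical choices for illustration, not values taken from this study.

```python
import numpy as np

def rwr_iterate(P, restart, tol=1e-12, max_iter=10_000):
    """Iterate Q <- (1 - restart) * Q @ P + restart * I, starting from Q = I."""
    identity = np.eye(P.shape[0])
    Q = identity.copy()
    for _ in range(max_iter):
        Q_next = (1.0 - restart) * Q @ P + restart * identity
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
    return Q

def rwr_closed_form(P, restart):
    """Limit of the recursion above: restart * (I - (1 - restart) * P)^(-1)."""
    identity = np.eye(P.shape[0])
    return restart * np.linalg.inv(identity - (1.0 - restart) * P)

# Hypothetical doubly stochastic TPM with two domains of two loci each.
P = np.array([[0.45, 0.45, 0.05, 0.05],
              [0.45, 0.45, 0.05, 0.05],
              [0.05, 0.05, 0.45, 0.45],
              [0.05, 0.05, 0.45, 0.45]])

Q_iter = rwr_iterate(P, restart=0.1)
Q_exact = rwr_closed_form(P, restart=0.1)
print(np.round(Q_exact, 4))
```

In this toy case the diagonal entries of the limiting matrix exceed the other entries of the same block by exactly the restart probability, because the zero-step term contributes only to the diagonal.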

Considering the same example matrix as defined in (1), we obtain the limiting matrix for RWR:

(3)

where and . First, we note that, when , is the same as the limiting distribution of a standard random walk when . In the other extreme, when , it can be seen that , an identity matrix, thus also losing the domain structure. For a proper , the magnitude of , are all comparable, and roughly in the order of . On the other hand, although is still written in a block-diagonal form, the elements in the main diagonal, , (denoting ), is dominated by ( is typically in the order of one tenth whereas is typically at least an order of magnitude lower for Hi-C data). As such, the most prominent feature of the limiting matrix is the main diagonal only, not the blocks. As gets larger, the diagonal feature becomes more obvious, not surprisingly given the property when . The second row of the heatmaps in S1 Fig shows for values of . We can see that, regardless of the values, the most prominent feature of the limiting matrix is in fact the main diagonal, substantiating our theoretical analysis for RWR.

2.3 Implementation of KR normalization

As a preprocessing step, the Knight-Ruiz algorithm [33] is utilized for normalizing an input Hi-C data matrix before applying RWS and RWR. A typical Hi-C data matrix is symmetrical, with every element in the matrix being a non-negative integer. Therefore, the KR algorithm is particularly suited for transforming such a matrix into a doubly stochastic TPM that preserves these characteristics. Let $C$ denote an $N \times N$ Hi-C count data matrix, where all the entries in the matrix are non-negative integers. The KR algorithm finds an $N$-dimensional diagonal matrix $D = \mathrm{diag}(d)$, where $d$ is a vector of positive numbers, such that the matrix $DCD$ is doubly stochastic. We note in passing that this matrix balancing problem can be written as solving a system of equations involving $d$, and an inner-outer iteration scheme was proposed [33] for solving the equations. The matrix $DCD$ is symmetric with non-negative entries, and elements in each row (and each column) sum to one. In real data analysis, sometimes there are all-0 rows (and columns) in a contact matrix. In this case, we replace the diagonal elements in those rows with 1 before performing KR normalization. This manipulation, instead of cutting out all-zero rows (and columns), avoids creating fictitious TADs.
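
The goal of the balancing step (a symmetric matrix whose rows and columns all sum to one) can be illustrated with a much simpler procedure than the Knight-Ruiz algorithm. The sketch below is not the KR inner-outer Newton scheme of [33]; it swaps in a plain damped fixed-point iteration that targets the same kind of diagonal scaling, and it includes the all-zero-row handling described above. The function name `balance_symmetric` and the toy counts are assumptions made for illustration.

```python
import numpy as np

def balance_symmetric(counts, tol=1e-10, max_iter=1000):
    """Scale a symmetric non-negative matrix A to D A D with unit row sums.

    Uses the damped update d <- sqrt(d / (A d)), a simple stand-in for the
    Knight-Ruiz algorithm (which solves the same problem far more efficiently).
    """
    A = np.array(counts, dtype=float)
    # All-zero rows: put a 1 on the diagonal instead of deleting the row.
    empty = np.flatnonzero(A.sum(axis=1) == 0)
    A[empty, empty] = 1.0
    d = np.ones(A.shape[0])
    for _ in range(max_iter):
        d = np.sqrt(d / (A @ d))
        if np.max(np.abs(d * (A @ d) - 1.0)) < tol:
            break
    return d[:, None] * A * d[None, :]

# Toy symmetric count matrix whose last locus has no contacts at all.
counts = [[5, 2, 0, 0],
          [2, 4, 1, 0],
          [0, 1, 6, 0],
          [0, 0, 0, 0]]
P = balance_symmetric(counts)
print(np.round(P, 4))
```

Because the same scaling vector is applied to rows and columns, the balanced matrix stays symmetric, in contrast to plain row normalization. The empty locus ends up as an isolated state with a single diagonal entry of one.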

3 Results

3.1 Simulation studies

We present three simulation studies to illustrate the behavior of the RWS- and RWR-smoothed matrices and their impact on downstream TAD detections. These studies range from an ideal Hi-C data matrix with contrived TAD structures, to a realistic simulation setting, to simply subsampling from a real Hi-C data matrix. In addition to heatmap visualization, we also use the Adjusted Rand Index (ARI) [34], a number between and , as an objective measure of the congruence between the “true” TADs and the detected TADs, with a higher number denoting a greater congruence. This is done by assuming loci within each TAD in the “ground truth” structure belong to a cluster. The TADs identified from a data matrix are sorted into clusters the same way and compared to the “true clusters” for computing the ARI.
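
The conversion from TADs to cluster labels and the ARI computation can be sketched as follows. This is a minimal standard-library implementation of the usual pair-counting formula for the ARI, not the code used in this study; the helper `labels_from_sizes` and the domain sizes in the example are hypothetical.

```python
from collections import Counter
from math import comb

def labels_from_sizes(sizes):
    """Label each locus by the index of the (contiguous) domain containing it."""
    return [k for k, size in enumerate(sizes) for _ in range(size)]

def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two labelings of the same loci."""
    n = len(a)
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:  # degenerate case: nothing to adjust
        return 1.0
    return (pairs_joint - expected) / (max_index - expected)

# Hypothetical ground truth: three domains of four loci each.
truth = labels_from_sizes([4, 4, 4])
# A caller that reports every two consecutive loci as a domain.
pairs = labels_from_sizes([2] * 6)
# A caller that lumps all loci into a single group.
one_group = [0] * 12

ari_perfect = adjusted_rand_index(truth, truth)
ari_pairs = adjusted_rand_index(truth, pairs)
ari_one_group = adjusted_rand_index(truth, one_group)
print(ari_perfect, ari_pairs, ari_one_group)
```

Lumping all loci into one group gives an ARI of zero under this formula, which matches how such a call is scored in the results below.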

3.1.1 Study 1: An idealized Hi-C data matrix with TADs.

Our first study is based on the ideal TPM in (1) with the same parameter values as in S1 Fig except that we set We simulated noise at each position of the matrix by drawing it randomly from the distribution and added it to , leading to the observed matrix . We then KR normalized it to obtain the doubly stochastic matrix . We applied RWS to for 2, 3, 4, 5, and 10 steps, with the resulting smoothed matrices denoted as RWS2s, RWS3s, RWS4s, RWS5s, and RWS10s, respectively. For RWR, we considered several restart probabilities, 0.05, 0.1, 0.2, and 0.5, with the resulting smoothed matrices denoted as RWR0.05, RWR0.1, RWR0.2, and RWR0.5, respectively. A representative matrix from each of the two random-walk smoothing methods, together with the data matrices, are shown in the first column of Fig 1; all the smoothed matrices are in S2 and S3 Figs.

Fig 1. Heatmap visualization of the data matrices, along with the detected domain boundaries and ARI values (bottom left corner).

The five rows give various kinds of data matrices: the perfect block diagonal data, the observed data with noise, the KR-normalized data matrix, the RWS-smoothed matrix with 3 steps (RWS3s), and the RWR-smoothed matrix with restart probability 0.1 (RWR0.1). The four columns show the data matrix (1st column), and the data matrix superimposed with the domain boundaries from CaTCH (2nd column, green boxes), HiCseg (3rd column, blue boxes), and TopDom (4th column, purple boxes). The color scheme for all the heatmaps ranges from 0 (white) to 0.05 (red), with those values that are greater than 0.05 censored at 0.05.

https://doi.org/10.1371/journal.pone.0327100.g001

We observed that the domain boundaries become a bit blurry but are still visible with the KR-normalized matrix. For RWS, other than the one with just two steps (RWS2s), the domains are difficult to discern or completely gone; the resulting smoothed matrix is simply uniform across all rows and columns. For RWR, across the entire range of restart probabilities, the most prominent feature is the diagonal line, which has much higher interaction intensities compared to its surrounding entries, although some of the domain boundaries are still visible. These observations imply that applying RWS and RWR achieved the opposite of the desired effect.

We then applied the three TAD detection algorithms, CaTCH, HiCseg, and TopDom, to ascertain whether there was an improvement in their performance after the data matrix was smoothed (Fig 1, S2 and S3 Figs; parameter settings for running these algorithms are given in S2 Table). For all data matrices considered, smoothed or not (and even for the underlying idealized matrix without noise), CaTCH simply called every two consecutive loci a domain, leading to a total of 100 domains (tiny green boxes) and a corresponding ARI of 0.024. On the other hand, HiCseg was able to detect all domains except the first, that is, the last four (blue boxes), for the ideal, observed, and KR-normalized matrices. Even though the domain structures are not visually apparent, HiCseg still managed to identify the last four domains for RWS2s and RWS3s. However, it started to falter when a larger number of steps was taken, as also characterized by the decreasing ARI values. For RWR, regardless of the restart probabilities, HiCseg erroneously called every two consecutive loci a domain just like CaTCH, not surprisingly given the prominent diagonal feature of the RWR-smoothed matrices. TopDom correctly identified all the domains for the ideal, observed noisy, KR-normalized, and RWS2s matrices (purple boxes). However, as the number of steps increased, the performance degraded quickly. On the other hand, TopDom was able to identify the domains correctly for all the restart probabilities investigated except when it was very small (in which case it broke the largest domain into two smaller ones).

These results, from a contrived dataset with known underlying ground truth, show that neither RWS nor RWR achieved the desired effects: they in fact destroy the domain structures, consistent with our theoretical results. Compared across the three domain detection algorithms, TopDom is seen to be the most successful, although not perfect. Specifically, CaTCH failed completely even with the ideal data without any noise. The performance of HiCseg and that of TopDom appear to be complementary in some sense: while HiCseg has a bit more success with some RWS-smoothed data, TopDom achieves much better results with the RWR-smoothed data.

3.1.2 Study 2: Biophysical-law-based simulated Hi-C data.

It is well-known that chromatin interaction frequencies are inversely related to genomic distance, referred to as the biophysical law, or the power-decay law [1,35]. In Hi-C data analysis, the logarithm of an interaction frequency between two loci is therefore typically taken to be linearly related to the genomic distance between these two loci [36]. The simulated data in this study observe this biophysical law in the background count matrix as we described in the following. First, a matrix of size was created following the power-decay law: the element represents the expected contact count between loci : , 1 , where and are constants, with being a negative number to observe the power-decay law and a positive number to control the sequencing depth [37]. For , that is, the counts on the diagonal, we set , with set to be at least as large as . In our simulation, we set , and . Then, the background matrix was generated by sampling from a negative binomial (NegBin) distribution: and , producing over-dispersed data when . In our study, we set equal to 1.2. This background matrix can be interpreted as the expected count matrix generated according to the biophysical law plus noise to induce features such as overdispersion. We then added a domain structure by creating a count matrix : we first randomly divided the loci into blocks, with each block serving as a domain and denoting the size as . Then for each element in each diagonal block submatrix of size , we assigned it a domain effect from a Poisson distribution: , where was generated from . In our simulation, we set . Note that is block, not position, dependent; therefore, the matrix is symmetric. All elements in the rest of the matrix (i.e., not in any of the diagonal block submatrices) are set to be 0. Combining information from the background and the domain matrices, we have . 
Finally, we account for Hi-C matrix sparsity by randomly setting of the within-domain elements and of those outside of any domain in to be zero, leading to our observed contact matrix . In our simulation, we let and .

The heatmaps for one simulated count matrix, , and its KR-normalized counterpart , are provided in S4 and S5 Figs, respectively, where the true domains (red boxes) are shown by superimposing them on the data matrices in S5 Fig. Compared to the counterparts for the first study, the TADs are less obvious with the raw data—thus, a harder problem—although the domains are still visually apparent even without drawing the true domain boundaries (S4 Fig). After applying RWS with two steps, the domains are still somewhat visible, but almost completely invisible with a larger number of steps (≥3). For RWR with a small restart probability, the most prominent feature of the matrices is the diagonal with much higher contact frequencies than the rest.

We once again applied the three domain detection algorithms to discern whether RWS or RWR smoothing can lead to better TAD identification (S4 and S5 Figs). As with the first example, CaTCH failed to correctly identify any domains. Even with the raw count data matrix, where the domains are clearly visible, CaTCH did not identify any TAD (i.e., all loci are lumped into one group), resulting in an ARI of 0. For the KR-normalized data matrix and all the RWS- and RWR-smoothed matrices, CaTCH simply called every consecutive pair of loci a TAD, as in the first example, and thus failed to detect any of the 20 underlying TADs; this is reflected in a very low ARI of 0.026. For HiCseg, with the observed count data E, most of the underlying TADs were correctly identified, leading to a high ARI of 0.95. However, with the KR-normalized matrix, it only achieved an ARI of 0.58, since many of the larger true TADs were broken into smaller ones. For the RWS-smoothed matrices, some of the TADs were recovered; however, the performance degraded with more steps. While the ARI for the RWS2s matrix is 0.94, the 3-step result, as recommended in GenomeDISCO [16], only achieved an ARI of 0.56. Performance further degraded with even more steps. For the RWR-smoothed matrices, regardless of the restart probability, HiCseg was unable to recover any meaningful (underlying) TADs, leading to ARIs all close to 0, like CaTCH. TopDom for the observed data led to an ARI of 0.68, but improved to 0.81 after KR normalization, with both recovering many of the true TADs. The ARIs for RWS2s and RWS3s were impressive, at 0.87 and 0.99, respectively. However, with more steps of the random walk, the results degraded as with HiCseg, and TopDom was unable to identify meaningful domains with five or more steps. For RWR-smoothed matrices, the performance of TopDom is remarkably stable across the different restart probabilities, with results almost identical to those for the KR-normalized matrix without any smoothing.

We carried out the above simulation 100 times to gauge variability across multiple datasets (Fig 2). We see that CaTCH has no ability to detect the underlying TADs, consistent with all results that we have seen thus far. For HiCseg, the original count matrix in fact has the best results. For the RWS-smoothed matrices, the performance was the best for a two-step random walk (which was generally better than the KR-normalized matrix without smoothing) but degraded as the number of steps increased. For the RWR-smoothed matrices, HiCseg’s performance was lacking regardless of the restart probability. For TopDom, we see that the original count data, the KR-normalized matrix, the RWS-smoothed matrices with two or three steps, and all RWR-smoothed matrices regardless of the restart probabilities, all led to similar performance, with the results for two or three steps edging out the rest.

Fig 2. Violin plots of the ARI values, based on 100 replications, from the results of three TAD detection algorithms.

Eleven kinds of data matrices were considered: count data, KR-normalized data, five RWS-smoothed matrices, and four RWR-smoothed matrices. Results are shown for CaTCH (row 1), HiCseg (row 2), and TopDom (row 3).

https://doi.org/10.1371/journal.pone.0327100.g002

These results indicate that, overall, RWR-smoothed matrices can lead to slightly better performance for the identification of TAD boundaries if an appropriate algorithm is selected. For RWS-smoothed matrices, there is a potential for better detection performance, although the result is highly dependent on the selection of the number of random walk steps as well as the TAD detection algorithms. Finally, we observed that, although there is variability, the result seen for a single dataset appears to carry over from replicate to replicate.

3.1.3 Study 3: Subsampling from a bulk Hi-C dataset.

Although the first two studies provide a good avenue for evaluating the RWS- and RWR-smoothed matrices and TAD detection algorithms since the underlying ground truth is known, they are nonetheless not based on real data, even though the data in the second study were simulated observing the biophysical law. Therefore, in this study, we subsampled from a real bulk dataset to obtain contact matrices that are more in line with real single cell data to evaluate the random-walk algorithms and TAD detection methods for single cells, where the sequencing depth is usually only up to one-tenth of that for bulk data. Specifically, we considered the long arm of chromosome 16 of the K562 bulk A dataset with 200 kb resolution (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2109887). We subsampled 10% of the bulk data by using a multinomial distribution, where the probability of obtaining a contact at each location of the matrix (in the upper triangle) is proportional to its original observed count. The counts for the lower triangle are set by mirroring the upper triangle to obtain symmetry. Since RWS and RWR were originally used to address the problem of sparsity in single cells, we are particularly interested in assessing the performance in this subsampling setting by treating each subsample as a “single cell.” Although there is no ground truth in terms of TADs, one may compare the performance of TAD detection between the bulk data (where the domains are visually observable) and the subsampled data to see whether there is any improvement in congruence with the bulk data for RWS- and RWR-smoothed matrices.
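
The subsampling step described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `subsample_contacts`, the toy bulk matrix, and the random seed are hypothetical, and the diagonal is treated as part of the upper triangle.

```python
import numpy as np

def subsample_contacts(counts, fraction, rng):
    """Draw a multinomial subsample of the upper triangle and mirror it."""
    counts = np.asarray(counts)
    upper_idx = np.triu_indices(counts.shape[0])  # includes the diagonal
    upper = counts[upper_idx].astype(float)
    n_keep = int(round(fraction * upper.sum()))
    # Cell probabilities are proportional to the observed bulk counts.
    draws = rng.multinomial(n_keep, upper / upper.sum())
    sub = np.zeros_like(counts, dtype=int)
    sub[upper_idx] = draws
    # Copy the strict upper triangle into the lower triangle for symmetry.
    return sub + np.triu(sub, k=1).T

# Toy symmetric "bulk" matrix (hypothetical counts).
bulk = np.array([[50, 20, 0],
                 [20, 40, 10],
                 [0, 10, 30]])
sub = subsample_contacts(bulk, fraction=0.1, rng=np.random.default_rng(0))
print(sub)
```

Cells with zero bulk counts have zero sampling probability, so the subsample can only be sparser than the bulk matrix in terms of which cells are observed. Note that multinomial draws are made with replacement, so in principle a cell can receive more contacts than its original count; this is unlikely at a 10% sampling fraction.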

Using the count data directly for one sample, we can see that there are some consistencies in domain structures between the bulk and the subsampled data detected by all three algorithms, with ARI being 0.70, 0.65, and 0.54, for CaTCH, HiCseg, and TopDom, respectively (S6 Fig). We then KR-normalized both the bulk and the subsampled data. It is interesting to see that the KR-normalized subsampled data now have less resemblance to the bulk data in terms of domain structures, although we see that the TADs detected by TopDom for the KR-normalized bulk data and the raw bulk data before normalization are very similar, which is not true for the other two domain detection algorithms (S6 Fig). We then further obtained RWS- and RWR-smoothed data based on the KR-normalized subsample matrix. We see that some domain structures become visible with the RWS-smoothed matrices, especially with a larger number of steps (Fig 3(a) and S7 Fig). RWR with a small restart probability (0.05, 0.1, or 0.2) also seemed to reveal some domain structures; however, the diagonal feature started to take over with a larger restart probability (S8 Fig). Since the underlying ground truth is unknown, one cannot tell for sure whether the visually apparent domain structures are true TADs, although we note that the observed domain structure from the bulk data and the structures revealed by the smoothed data with an appropriate number of steps or an appropriate restart probability are similar for the large domains. Further, it is interesting to see that, although the performance of TopDom lagged behind CaTCH and HiCseg for the subsampled data before smoothing, it detected TADs for all RWS- and RWR-smoothed matrices, and all led to a higher ARI with the bulk data. HiCseg called many meaningless small domains for all the data matrices (including the bulk data) except for RWS10s. Despite the excellent performance for the subsampled data as we noted above, CaTCH failed for all the other forms of data, namely all the normalized and smoothed matrices, calling every two consecutive bins a domain.

Fig 3. Results based on the subsampling procedure to simulate single cell data.

(a) Data matrices (column 1) and detected domain boundaries (columns 2-4 for CaTCH, HiCseg, and TopDom, respectively); row 1 shows the original bulk count matrix, while rows 2-4 show, respectively, the KR-normalized, RWS3s, and RWR0.1 results from one subsample. Note that the heatmap values are censored (capped) at 150 and 0.05. (b) Violin plots of the ARI values for the TopDom results on 11 kinds of data matrices over 100 replicated subsamples.

https://doi.org/10.1371/journal.pone.0327100.g003

To gauge variability, we performed the subsampling 100 times and plotted the distributions of the ARIs between the subsampled matrices (before and after various levels of smoothing) and the bulk data. The results for TopDom applied to the RWS- and RWR-smoothed data are fairly consistent across a wide range of numbers of steps and restart probabilities, and there is an appreciable improvement over the results for the subsampled count or KR-normalized subsampled data matrices (Fig 3(b)). However, for CaTCH, the results from the count matrix are better than those from their KR-normalized and smoothed counterparts; for HiCseg, RWS with an appropriate number of steps (3–5) achieved results similar to those from the subsampled count data (S6(b) Fig).
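
As an illustration of the comparison metric used above, the following minimal Python sketch computes an adjusted Rand index (ARI) between two sets of domain calls. It assumes, for illustration only, that each TAD caller's output has been reduced to a list of 0-based domain start positions and that every bin belongs to exactly one domain; the helper names `boundaries_to_labels` and `adjusted_rand_index` are hypothetical and are not taken from the software used in this study.

```python
import numpy as np
from math import comb

def boundaries_to_labels(boundaries, n_bins):
    """Turn domain start positions (0-based, hypothetical input format)
    into one integer domain label per bin."""
    labels = np.zeros(n_bins, dtype=int)
    for start in boundaries:
        if 0 < start < n_bins:
            labels[start:] += 1  # every bin at or after a boundary moves to the next domain
    return labels

def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label vectors of equal length."""
    a = np.asarray(a)
    b = np.asarray(b)
    _, ai = np.unique(a, return_inverse=True)
    _, bi = np.unique(b, return_inverse=True)
    # Contingency table: bins shared by domain i of `a` and domain j of `b`.
    table = np.zeros((ai.max() + 1, bi.max() + 1), dtype=int)
    np.add.at(table, (ai, bi), 1)
    sum_cells = sum(comb(int(x), 2) for x in table.ravel())
    sum_rows = sum(comb(int(x), 2) for x in table.sum(axis=1))
    sum_cols = sum(comb(int(x), 2) for x in table.sum(axis=0))
    total = comb(len(a), 2)
    expected = sum_rows * sum_cols / total
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:  # degenerate case: both partitions are trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Toy usage: a reference partition and a call with slightly shifted boundaries.
reference = boundaries_to_labels([50, 80, 100, 190], 200)
shifted = boundaries_to_labels([48, 82, 100, 190], 200)
ari_shifted = adjusted_rand_index(reference, shifted)
```

Under this convention, identical calls give an ARI of 1, while a call that places all bins in a single domain gives an ARI of 0 against any non-trivial reference.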

3.2 Analysis of two experimental Hi-C datasets

We now turn to experimental data to further substantiate our findings from the three simulation studies. We evaluated whether the RWS- and RWR-smoothed matrices from real experimental data can lead to improvements in TAD detection using the same three algorithms that we investigated in the simulation studies. To be comprehensive, we considered two publicly available Hi-C datasets, one bulk and one single-cell, which together provide a range of data sparsity.

3.2.1 Human embryonic stem cells bulk Hi-C data.

We considered a bulk Hi-C dataset downloaded from the public domain (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE35156). This human embryonic stem cell (hESC) Hi-C dataset was originally generated for defining and studying TADs [4]. We used the 40 kb resolution processed data [30] and focused on the 857 loci in chromosome 22. Although the chromosome contains a total of 1243 loci, the contact counts among the first 386 (the short arm and the centromere) are all zeros; therefore, these loci were not included in our analysis. The original data matrix exhibits visible domains, but this information becomes blurrier as the number of steps increases in RWS (S9 Fig). With RWR at different restart probabilities, contact information other than the main diagonal also gets diluted (S10 Fig). Interestingly, CaTCH was able to detect the domains that are clearly visible in the unprocessed data, but failed to provide any meaningful domain information for the rest of the data matrices. HiCseg also detected many domains, some at a finer scale than those called by CaTCH. HiCseg continued to detect domains for the RWS-smoothed matrices, but at an increasingly coarse scale as the number of steps increased. As we have seen with the simulated and subsampled data, HiCseg failed to detect anything meaningful with any of the RWR-smoothed matrices. For TopDom, the domain boundaries detected from the raw data are remarkably similar to those from CaTCH. For the RWS-smoothed matrices, TopDom’s behavior tracked that of HiCseg: as the number of steps increases, the detected domains become coarser and coarser, and eventually become meaningless with RWS10s. However, unlike the other two TAD detection algorithms, TopDom detected some likely meaningful domains with the RWR-smoothed data matrices, although the inherent diagonal feature of the smoothed matrices led to many small domains.

Overall, we observe some consistency among the domain structures inferred by the same TAD calling algorithm and among those inferred from the same type of smoothing method. Further, the domain structures inferred by HiCseg based on the RWS-smoothed matrices, apart from RWS10s, are similar to those detected by TopDom, regardless of whether the latter were inferred from RWS- or RWR-smoothed matrices (Fig 4). However, the domain structures detected by CaTCH (or the lack thereof) do not share much commonality with those from HiCseg or TopDom.

Fig 4. Correlation plot of the ARI values between the identified domain boundaries on the bulk count, KR-normalized, and RWS-/RWR-smoothed matrices of the hESC data.

We considered 11 kinds of matrices (“bulk”, KR-normalized, five RWS-smoothed, and four RWR-smoothed matrices) and three TAD detection methods (CaTCH, HiCseg, and TopDom).

https://doi.org/10.1371/journal.pone.0327100.g004

3.2.2 GM and PBMC single cell Hi-C data.

In our second real data analysis, we considered a single-cell Hi-C dataset (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE117874) consisting of 14 lymphoblastoid (GM) and 18 peripheral blood mononuclear cell (PBMC) samples. We used chromosome 22 contact counts at 40 kb resolution, leading to a total of 860 loci after excluding those whose interaction counts are all 0 (short arm and centromere). To evaluate performance, we created a composite contact matrix for each of the two cell types by combining the counts from all single cells of that type, assuming intra-type similarity. The domains detected from the composite contact matrices are treated as the “gold standard” for comparison with those from each of the single cells of the same type. For the GM cells, CaTCH’s analysis of the original single-cell count matrices led to reasonable congruence with the gold standard, with most cells achieving an ARI of more than 0.6. However, none of the smoothed matrices produced sensible results, with ARI being close to 0 (Fig 5(a)). For HiCseg, the best congruence is also achieved with the original single-cell count matrix, although RWS with an appropriate number of steps (RWS5s or RWS10s) led to similarly reasonable performance. For TopDom, there is a large discrepancy between the “gold standard” and the domains obtained from the original single-cell count matrix, which might be due to the detection of some large domains in single cells (S11 Fig). In contrast, the RWS- and RWR-smoothed matrices all led to reasonably good results. Overall, these results are consistent with what we observed in our simulation studies. For the PBMC cells, the results are qualitatively the same as those for the GM cells (Fig 5(b) and S11 Fig).
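
The construction of the composite “gold standard” matrix can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes each single cell is available as a dense symmetric NumPy array over the same set of bins, and the function names `composite_matrix` and `drop_empty_loci` are hypothetical rather than part of any software used here.

```python
import numpy as np

def composite_matrix(cell_matrices):
    """Sum per-cell contact matrices (same shape, same bins) into one composite."""
    return np.sum(np.stack(cell_matrices, axis=0), axis=0)

def drop_empty_loci(matrix):
    """Remove loci whose interaction counts are all zero.

    Returns the trimmed matrix and the indices of the retained loci."""
    keep = matrix.sum(axis=1) > 0
    return matrix[np.ix_(keep, keep)], np.flatnonzero(keep)

# Toy usage with two tiny "cells"; locus 0 has no contacts in either cell.
cell_a = np.zeros((4, 4), dtype=int)
cell_a[1, 2] = cell_a[2, 1] = 1
cell_b = np.zeros((4, 4), dtype=int)
cell_b[2, 3] = cell_b[3, 2] = 2
cell_b[1, 1] = 1
composite = composite_matrix([cell_a, cell_b])
trimmed, kept = drop_empty_loci(composite)
```

Domains called on `trimmed` would then play the role of the reference against which each single cell's calls are scored.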

Fig 5. Violin plots for the ARI values of the results from CaTCH (row 1), HiCseg (row 2), and TopDom (row 3).

(a) 14 GM cells; (b) 18 PBMC cells. Each ARI value was computed by comparing the domain boundaries detected from the composite count matrix (combining all cells of the same type) with those detected from each of the single cells.

https://doi.org/10.1371/journal.pone.0327100.g005

4 Discussion and conclusion

In this paper, we studied whether random-walk-based methods, RWS and RWR, are appropriate tools for improving single-cell Hi-C data quality. These methods are often used as part of a pipeline for Hi-C data analysis, especially for single-cell Hi-C data, yet little justification is available in the literature. Our extensive investigations, through theoretical analyses, simulation studies, and real data applications, all converge on the following findings: RWS-smoothed data matrices will lead to a “monolith” (all entries have the same contact probability) if too many steps are taken; RWR-smoothed data, regardless of the restart probability, will exhibit a prominent diagonal feature. Therefore, with both types of random-walk-based smoothing, the smoothed matrix may lose the underlying TAD structures rather than enhance them. In almost all the simulation and real data studies, with a few exceptions, using the original count matrix led to performance comparable to, if not better than, that of the best smoothed data for the task of detecting topologically associated domains.
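
Both limiting behaviors can be reproduced on a toy example. The Python sketch below is only illustrative and rests on several assumptions: the transition matrix is a hypothetical block matrix with equal-size domains (so that it is symmetric and doubly stochastic without further balancing), RWS with t steps is taken to be the t-th matrix power, and RWR is taken in its commonly used stationary form, restart × (I − (1 − restart) P)⁻¹, which may differ in detail from the definition used in this paper. The function names and the values of `eps`, block size, and number of blocks are made up for the example.

```python
import numpy as np

def block_transition_matrix(n_blocks=3, block_size=10, eps=0.2):
    """Toy symmetric, doubly stochastic matrix with equal-size domains.

    Within-domain weight is 1 and outside-domain weight is eps (hypothetical values).
    Equal block sizes give identical row sums, so one division normalizes the matrix."""
    labels = np.repeat(np.arange(n_blocks), block_size)
    weights = np.where(labels[:, None] == labels[None, :], 1.0, eps)
    return weights / weights[0].sum()

def rws(P, steps):
    """Random walk with a fixed number of steps: the steps-th power of P."""
    return np.linalg.matrix_power(P, steps)

def rwr(P, restart):
    """Stationary random walk with restart (assumed closed form)."""
    n = P.shape[0]
    return restart * np.linalg.inv(np.eye(n) - (1.0 - restart) * P)

P = block_transition_matrix()
n = P.shape[0]

# Many RWS steps flatten the matrix toward the "monolith" value 1/n.
spread_after_50_steps = np.ptp(rws(P, 50))

# In this toy setting, the RWR diagonal exceeds the within-domain
# off-diagonal entries by exactly the restart probability.
for restart in (0.05, 0.1, 0.2, 0.5):
    Q = rwr(P, restart)
    diagonal_excess = Q[0, 0] - Q[0, 1]
```

In this idealized setting the domain contrast that a TAD caller relies on shrinks geometrically with the number of RWS steps, whereas for RWR the gap between the diagonal and the rest of the matrix grows with the restart probability.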

As a by-product, our study also provides an evaluation of three TAD detection algorithms that were deemed top performers in a previous study. Among the three, CaTCH was seen to be the least competitive, failing to uncover the underlying TADs most of the time, even under the ideal TAD structure setting, regardless of whether the original count or the normalized/smoothed matrices were used. HiCseg often performs well with the original count data, and using RWS-smoothed data with an appropriate number of steps can sometimes lead to improved TAD detection capability. However, HiCseg typically failed to detect TADs with RWR-smoothed matrices. TopDom was seen to have the best overall performance. With smoothed data, it can often produce results that are at least comparable to, and sometimes better than, those from the original count matrix, provided that an appropriate number of steps for RWS or an appropriate restart probability for RWR is used.

Although we have considered real bulk and single-cell data that represent a range of sparsity, we further carried out a study to directly gauge the influence of sparsity on random-walk-based smoothing methods. Specifically, in the subsampling study, instead of fixing the sampling rate at 10%, we considered a range from 5% to 50% in increments of 5%, leading to a median percentage of observed zeros ranging from 97.2% (with 5% subsampling) to 88.5% (with 50% subsampling) over 100 replications. For the original data and the 10% subsamples, the zero percentage was 79.2% and 95.5%, respectively. For CaTCH and HiCseg, the results are remarkably stable; on the other hand, the performance of TopDom without smoothing is much more sensitive to data sparsity, whereas smoothing lessens the effect of sparsity to some extent (S12 Fig).
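
The relationship between subsampling rate and sparsity can be explored with a short Python sketch. It is a sketch under stated assumptions only: subsampling is implemented here as binomial thinning of each contact count, which may differ from the procedure of Simulation Study 3, the zero percentage is computed over all matrix entries, and the toy “bulk” matrix is simulated from a Poisson model with an arbitrary distance decay. The names `subsample_contacts` and `zero_fraction` are hypothetical.

```python
import numpy as np

def subsample_contacts(counts, rate, rng):
    """Binomially thin a symmetric integer contact matrix.

    Each observed contact is kept independently with probability `rate`
    (an assumed subsampling scheme, for illustration)."""
    upper = np.triu(counts)                    # count each locus pair once
    thinned = rng.binomial(upper, rate)
    return thinned + np.triu(thinned, k=1).T   # mirror to restore symmetry

def zero_fraction(matrix):
    """Proportion of zero entries in the matrix."""
    return float(np.mean(matrix == 0))

rng = np.random.default_rng(0)
n = 100
dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
upper = np.triu(rng.poisson(5.0 / (1.0 + dist)))   # toy distance-decay counts
bulk = upper + np.triu(upper, k=1).T

for rate in (0.05, 0.10, 0.50):
    sub = subsample_contacts(bulk, rate, rng)
    print(f"rate={rate:.2f}  zero fraction={zero_fraction(sub):.3f}")
```

Because thinning can only remove contacts, the zero fraction of any subsample is at least that of the matrix it was drawn from.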

In conclusion, the results from our extensive study do not support the use of random walk as a data improvement technique for TAD detection. It should be noted that all downstream analyses carried out in this study were limited to TAD detection; therefore, whether the same conclusion holds for other tasks, such as compartment annotation, loop identification, or cell subtype discovery, remains unanswered.

If a reader still chooses to use a random-walk-based method, then an important issue that deserves further discussion is the need for a doubly stochastic matrix before applying RWS and RWR. Although a transition probability matrix is sufficient for a random walk, one without symmetry destroys an inherent feature of a contact matrix. For several commonly used normalization methods, including ICE, symmetry is not guaranteed after normalization. In contrast, KR normalization is capable of turning a symmetric matrix into a doubly stochastic matrix, in which not only does each row sum to 1, but each column sums to 1 as well.
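
The distinction between a mere transition matrix and a symmetric doubly stochastic one can be illustrated with a small Python sketch. Note that the balancing routine below is a simple damped fixed-point iteration for symmetric scaling, swapped in for brevity; it is not the Knight-Ruiz algorithm [33]. It assumes a symmetric matrix with strictly positive entries (in particular, no all-zero rows), and the function names and the toy matrix are hypothetical.

```python
import numpy as np

def row_normalize(A):
    """Divide each row by its sum: a valid transition matrix, but generally not symmetric."""
    return A / A.sum(axis=1, keepdims=True)

def symmetric_balance(A, n_iter=1000, tol=1e-12):
    """Find x > 0 such that diag(x) @ A @ diag(x) has unit row and column sums.

    Damped fixed-point iteration x <- sqrt(x / (A @ x)); a stand-in for
    KR normalization, assuming A is symmetric with positive entries."""
    x = np.ones(A.shape[0])
    for _ in range(n_iter):
        x_new = np.sqrt(x / (A @ x))
        converged = np.max(np.abs(x_new - x)) < tol
        x = x_new
        if converged:
            break
    return A * np.outer(x, x)

# Toy symmetric "contact" matrix with unequal row sums.
A = np.array([[4.0, 2.0, 1.0],
              [2.0, 3.0, 2.0],
              [1.0, 2.0, 5.0]])

T = row_normalize(A)       # rows sum to 1, but T[0, 2] != T[2, 0]
B = symmetric_balance(A)   # symmetric, rows and columns sum to 1
```

Row normalization alone scales entry (i, j) and entry (j, i) by different row sums, which is how symmetry is lost; two-sided scaling by the same vector preserves it.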

Supporting information

S1 Table. Summary of Hi-C data improvement methods.

https://doi.org/10.1371/journal.pone.0327100.s001

(DOCX)

S2 Table. Summary of three TAD detection algorithms.

https://doi.org/10.1371/journal.pone.0327100.s002

(DOCX)

S1 Fig.

Heatmaps demonstrating the behavior of random walk methods. Heatmap visualization of a normalized contact matrix given in the form of Equation (first plot in the first row), RWS-smoothed matrices with 2, 3, and 10 steps (second to fourth plots in the first row), and RWR-smoothed matrices with restart probability 0.05, 0.1, 0.2, and 0.5 (second row). The total number of bins is 200 and the number of TADs is 5, with domain sizes 50, 30, 20, 90, and 10, respectively. The magnitude of outside-domain interaction frequency . The color scheme ranges from 0 (white) to 0.0525 (red, the maximum value in the normalized contact matrix), with smoothed values greater than 0.0525 capped at 0.0525 (e.g., all the main diagonals in the RWR-smoothed matrices).

https://doi.org/10.1371/journal.pone.0327100.s003

(DOCX)

S2 Fig.

RWS-smoothed data and TAD detection results for an idealized dataset. Heatmap visualization of the RWS-smoothed matrices (with 2, 3, 4, 5, and 10 steps) in Simulation Study 1, along with the detected domain boundaries and ARI values (bottom left corner). The total number of bins is 200. The number of TADs is 5, with sizes 50, 30, 20, 90, and 10, respectively. Same layout as in Fig 1. The color scheme for all the heatmaps ranges from 0 (white) to 0.05 (red).

https://doi.org/10.1371/journal.pone.0327100.s004

(DOCX)

S3 Fig.

RWR-smoothed data and TAD detection results for an idealized dataset. Heatmap visualization of the RWR-smoothed matrices (with restart probabilities 0.05, 0.1, 0.2, and 0.5) in Simulation Study 1, along with the detected domain boundaries and ARI values (bottom left corner). The total number of bins is 200. The number of TADs is 5, with sizes 50, 30, 20, 90, and 10, respectively. Same layout as in Fig 1. The color scheme for all the heatmaps ranges from 0 (white) to 0.05 (red), with values greater than 0.05 capped at 0.05.

https://doi.org/10.1371/journal.pone.0327100.s005

(DOCX)

S4 Fig.

Observed count matrix along with RWS-smoothed data and TAD detection results for a simulated dataset based on biophysical law. Heatmap visualization of the simulated count matrix (first row) and the RWS-smoothed matrices (with 2, 3, 4, 5, and 10 steps for the second to the sixth rows, respectively) in one realization of the simulation procedure described in Simulation Study 2, with the same layout as in Fig 1. The color scheme for the simulated count matrix heatmap ranges from 0 (white) to 30 (red), where 30 is the 99.5th percentile of the simulated counts. The color scheme for the heatmaps of the RWS-smoothed matrices ranges from 0 (white) to 0.007 (red).

https://doi.org/10.1371/journal.pone.0327100.s006

(DOCX)

S5 Fig.

KR-normalized observed count matrix along with RWR-smoothed data and TAD detection results for a simulated dataset based on biophysical law. Heatmap visualization of the KR-normalized matrix (first row) and the RWR-smoothed matrices (with restart probabilities 0.05, 0.1, 0.2, and 0.5 for the second to the fifth rows, respectively) in one realization of the simulation procedure described in Simulation Study 2, with the same layout as in Fig 1. The color scheme for all the heatmaps ranges from 0 (white) to 0.007 (red), with values greater than 0.007 capped at 0.007.

https://doi.org/10.1371/journal.pone.0327100.s007

(DOCX)

S6 Fig.

Data and TAD detection results for the subsampled data based on a K562 bulk dataset. (a) Heatmaps of the bulk count matrix (first row), one realization of the subsampled matrix (second row), the KR-normalized bulk matrix (third row), and the KR-normalized subsampled matrix (fourth row), with the detected TAD boundaries. The ARI value of each detected boundary on the subsampled/KR-normalized matrix (compared to the one detected on the bulk matrix) is listed at the bottom left corner of the heatmap. The color scheme for the bulk matrix heatmap ranges from 0 (white) to 10 (red), with values greater than 10 capped at 10. The color scheme for the subsampled matrix heatmap ranges from 0 (white) to 2 (red), with values greater than 2 capped at 2. The color scheme for the heatmaps of the KR-normalized matrices ranges from 0 (white) to 0.05 (red), with values greater than 0.05 capped at 0.05. (b) Violin plots for the ARI values of CaTCH (first row) and HiCseg (second row) domain boundaries with 100 realizations of the subsampling procedure described in Simulation Study 3.

https://doi.org/10.1371/journal.pone.0327100.s008

(DOCX)

S7 Fig.

RWS-smoothed data and TAD detection results for a subsample based on a K562 bulk dataset. Heatmaps and identified boundaries on the RWS-smoothed matrices in one realization of the subsampling procedure described in Section 3.1.3, with the same layout as in Fig 3(a). The color scheme for all the heatmaps ranges from 0 (white) to 0.05 (red), with those values that are greater than 0.05 capped at 0.05.

https://doi.org/10.1371/journal.pone.0327100.s009

(DOCX)

S8 Fig.

RWR-smoothed data and TAD detection results for a subsample based on a K562 bulk dataset. Heatmaps and identified boundaries on the RWR-smoothed matrices in one realization of the subsampling procedure described in Section 3.1.3, with the same layout as in Fig 3(a). The color scheme for all the heatmaps ranges from 0 (white) to 0.05 (red), with those values that are greater than 0.05 capped at 0.05.

https://doi.org/10.1371/journal.pone.0327100.s010

(DOCX)

S9 Fig.

The heatmaps and identified boundaries on the bulk count matrix and RWS-smoothed matrices of hESC data. The ARI value of each detected boundary on the RWS-smoothed matrix (compared to the one detected on the count matrix) is listed at the bottom left corner of the heatmap. The color scheme for the bulk matrix heatmap ranges from 0 (white) to 150 (red), with those values that are greater than 150 capped at 150. The color scheme for all the other heatmaps ranges from 0 (white) to 0.05 (red), with those values that are greater than 0.05 capped at 0.05.

https://doi.org/10.1371/journal.pone.0327100.s011

(DOCX)

S10 Fig.

Data and results on a bulk hESC dataset. This figure provides the heatmaps and identified boundaries on the KR-normalized and RWR-smoothed bulk matrices of the hESC data. The ARI value of each detected boundary on the KR-normalized/RWR-smoothed matrix (compared to the one detected on the count matrix) is listed at the bottom left corner of the heatmap. The color scheme for all the heatmaps ranges from 0 (white) to 0.05 (red), with values greater than 0.05 capped at 0.05.

https://doi.org/10.1371/journal.pone.0327100.s012

(DOCX)

S11 Fig.

Data and results on a single-cell dataset. Heatmap visualization of the composite GM count matrix (first row, obtained by summing all 14 GM cells), the count matrix of one GM cell (second row, GSM3314359), the composite PBMC count matrix (third row, obtained by summing all 18 PBMC cells), and the count matrix of one PBMC cell (fourth row, GSM3314376). The ARI value was calculated between the composite and single-cell count data within the same TAD detection algorithm and is listed at the bottom left corner of the heatmap of each single-cell count matrix.

https://doi.org/10.1371/journal.pone.0327100.s013

(DOCX)

S12 Fig.

Influence of sparsity on random walk smoothing methods and TAD detection results. The subsampling procedure described in Simulation Study 3 (Section 3.1.3) was repeated 100 times for various subsampling percentages, ranging from 5% to 50% with an increment of 5%, and the median ARI is plotted: CaTCH (top row), HiCseg (middle row), and TopDom (bottom row).

https://doi.org/10.1371/journal.pone.0327100.s014

(DOCX)

Acknowledgments

The authors would like to thank the two anonymous reviewers for their constructive comments and are grateful to Mr. Taeyeon Kim for testing our RW software.

References

1. Lieberman-Aiden E, Van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326(5950):289–93.
2. Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159(7):1665–80.
3. Lafontaine DL, Yang L, Dekker J, Gibcus JH. Hi-C 3.0: Improved Protocol for Genome-Wide Chromosome Conformation Capture. Curr Protoc. 2021;1(7):e198. pmid:34286910
4. Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485(7398):376–80. pmid:22495300
5. Ong C-T, Corces VG. CTCF: an architectural protein bridging genome topology and function. Nat Rev Genet. 2014;15(4):234–46. pmid:24614316
6. Wutz G, Várnai C, Nagasaka K, Cisneros DA, Stocsits RR, Tang W, et al. Topologically associating domains and chromatin loops depend on cohesin and are regulated by CTCF, WAPL, and PDS5 proteins. The EMBO Journal. 2017;36(24):3573–99.
7. Nagano T, Lubling Y, Stevens TJ, Schoenfelder S, Yaffe E, Dean W, et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature. 2013;502(7469):59–64. pmid:24067610
8. Ramani V, Deng X, Qiu R, Gunderson KL, Steemers FJ, Disteche CM, et al. Massively multiplex single-cell Hi-C. Nat Methods. 2017;14(3):263–6. pmid:28135255
9. Stevens TJ, Lando D, Basu S, Atkinson LP, Cao Y, Lee SF. 3D structures of individual mammalian genomes studied by single-cell Hi-C. Nature. 2017;544(7648):59–64.
10. Nagano T, Lubling Y, Várnai C, Dudley C, Leung W, Baran Y, et al. Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature. 2017;547(7661):61–7. pmid:28682332
11. Flyamer IM, Gassler J, Imakaev M, Brandão HB, Ulianov SV, Abdennur N, et al. Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition. Nature. 2017;544(7648):110–4. pmid:28355183
12. Zhou J, Ma J, Chen Y, Cheng C, Bao B, Peng J. Robust single-cell Hi-C clustering by convolution-and random-walk–based imputation. Proceedings of the National Academy of Sciences. 2019;116(28):14011–8.
13. Li X, Feng F, Pu H, Leung WY, Liu J. scHiCTools: A computational toolbox for analyzing single-cell Hi-C data. PLoS Comput Biol. 2021;17(5):e1008978. pmid:34003823
14. Yang T, Zhang F, Yardımcı GG, Song F, Hardison RC, Noble WS, et al. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 2017;27(11):1939–49. pmid:28855260
15. Zhang Y, An L, Xu J, Zhang B, Zheng WJ, Hu M, et al. Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus. Nat Commun. 2018;9(1):750. pmid:29467363
16. Ursu O, Boley N, Taranova M, Wang YR, Yardimci GG, Stafford Noble W. GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics. 2018;34(16):2701–7.
17. Zhu H, Wang Z. SCL: a lattice-based approach to infer 3D chromosome structures from single-cell Hi-C data. Bioinformatics. 2019;35(20):3981–8. pmid:30865261
18. Hong H, Jiang S, Li H, Du G, Sun Y, Tao H, et al. DeepHiC: A generative adversarial network for enhancing Hi-C data resolution. PLoS Comput Biol. 2020;16(2):e1007287. pmid:32084131
19. Zhen C, Wang Y, Han L, Li J, Peng J, Wang T. A novel framework for single-cell hi-c clustering based on graph-convolution-based imputation and two-phase-based feature extraction. bioRxiv. 2021;:2021–04.
20. Yu M, Abnousi A, Zhang Y, Li G, Lee L, Chen Z, et al. SnapHiC: a computational pipeline to identify chromatin loops from single-cell Hi-C data. Nat Methods. 2021;18(9):1056–9. pmid:34446921
21. Li X, Lee L, Abnousi A, Yu M, Liu W, Huang L. SnapHiC2: A computationally efficient loop caller for single cell Hi-C data. Computational and Structural Biotechnology Journal. 2022;20:2778–83.
22. Wu H, Wu Y, Jiang Y, Zhou B, Zhou H, Chen Z, et al. scHiCStackL: a stacking ensemble learning-based method for single-cell Hi-C classification using cell embedding. Brief Bioinform. 2022;23(1):bbab396. pmid:34553746
23. Zhang R, Zhou T, Ma J. Multiscale and integrative single-cell Hi-C analysis with Higashi. Nat Biotechnol. 2022;40(2):254–61. pmid:34635838
24. Xie Q, Han C, Jin V, Lin S. HiCImpute: A Bayesian hierarchical model for identifying structural zeros and enhancing single cell Hi-C data. PLoS Comput Biol. 2022;18(6):e1010129. pmid:35696429
25. Lyu H, Liu E, Wu Z, Li Y, Liu Y, Yin X. scHiCPTR: unsupervised pseudotime inference through dual graph refinement for single-cell Hi-C data. Bioinformatics. 2022;38(23):5151–9.
26. Pan J-Y, Yang H-J, Faloutsos C, Duygulu P. Automatic multimedia cross-modal correlation discovery. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004. 653–8. doi: https://doi.org/10.1145/1014052.1014135
27. Imakaev M, Fudenberg G, McCord RP, Naumova N, Goloborodko A, Lajoie BR, et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat Methods. 2012;9(10):999–1003. pmid:22941365
28. Zhan Y, Mariani L, Barozzi I, Schulz EG, Blüthgen N, Stadler M, et al. Reciprocal insulation analysis of Hi-C data shows that TADs represent a functionally but not structurally privileged scale in the hierarchical folding of chromosomes. Genome Res. 2017;27(3):479–90. pmid:28057745
29. Lévy-Leduc C, Delattre M, Mary-Huard T, Robin S. Two-dimensional segmentation for analyzing Hi-C data. Bioinformatics. 2014;30(17):i386–92.
30. Shin H, Shi Y, Dai C, Tjong H, Gong K, Alber F, et al. TopDom: an efficient and deterministic method for identifying topological domains in genomes. Nucleic Acids Res. 2016;44(7):e70. pmid:26704975
31. Zufferey M, Tavernari D, Oricchio E, Ciriello G. Comparison of computational methods for the identification of topologically associating domains. Genome Biol. 2018;19(1):217. pmid:30526631
32. Karlin S. A first course in stochastic processes. Academic Press. 2014.
33. Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis. 2012;33(3):1029–47.
34. Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218.
35. Dekker J, Marti-Renom MA, Mirny LA. Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat Rev Genet. 2013;14(6):390–403. pmid:23657480
36. Lajoie BR, Dekker J, Kaplan N. The Hitchhiker’s guide to Hi-C analysis: practical guidelines. Methods. 2015;72:65–75.
37. Park J, Lin S. A random effect model for reconstruction of spatial chromatin structure. Biometrics. 2017;73(1):52–62. pmid:27214023

[ref1] 1. Lieberman-Aiden E, Van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science. 2009;326(5950):289–93.
View Article
Google Scholar

[2] View Article

[3] Google Scholar

[ref2] 2. Rao SS, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159(7):1665–80.
View Article
Google Scholar

[5] View Article

[6] Google Scholar

[ref3] 3. Lafontaine DL, Yang L, Dekker J, Gibcus JH. Hi-C 3.0: Improved Protocol for Genome-Wide Chromosome Conformation Capture. Curr Protoc. 2021;1(7):e198. pmid:34286910
View Article
PubMed/NCBI
Google Scholar

[8] View Article

[9] PubMed/NCBI

[10] Google Scholar

[ref4] 4. Dixon JR, Selvaraj S, Yue F, Kim A, Li Y, Shen Y, et al. Topological domains in mammalian genomes identified by analysis of chromatin interactions. Nature. 2012;485(7398):376–80. pmid:22495300
View Article
PubMed/NCBI
Google Scholar

[12] View Article

[13] PubMed/NCBI

[14] Google Scholar

[ref5] 5. Ong C-T, Corces VG. CTCF: an architectural protein bridging genome topology and function. Nat Rev Genet. 2014;15(4):234–46. pmid:24614316
View Article
PubMed/NCBI
Google Scholar

[16] View Article

[17] PubMed/NCBI

[18] Google Scholar

[ref6] 6. Wutz G, Várnai C, Nagasaka K, Cisneros DA, Stocsits RR, Tang W, et al. Topologically associating domains and chromatin loops depend on cohesin and are regulated by CTCF, WAPL, and PDS5 proteins. The EMBO journal. 2017 Dec 15;36(24):3573–99.
View Article
Google Scholar

[20] View Article

[21] Google Scholar

[ref7] 7. Nagano T, Lubling Y, Stevens TJ, Schoenfelder S, Yaffe E, Dean W, et al. Single-cell Hi-C reveals cell-to-cell variability in chromosome structure. Nature. 2013;502(7469):59–64. pmid:24067610
View Article
PubMed/NCBI
Google Scholar

[23] View Article

[24] PubMed/NCBI

[25] Google Scholar

[ref8] 8. Ramani V, Deng X, Qiu R, Gunderson KL, Steemers FJ, Disteche CM, et al. Massively multiplex single-cell Hi-C. Nat Methods. 2017;14(3):263–6. pmid:28135255
View Article
PubMed/NCBI
Google Scholar

[27] View Article

[28] PubMed/NCBI

[29] Google Scholar

[ref9] 9. Stevens TJ, Lando D, Basu S, Atkinson LP, Cao Y, Lee SF. 3D structures of individual mammalian genomes studied by single-cell Hi-C. Nature. 2017;544(7648):59–64.
View Article
Google Scholar

[31] View Article

[32] Google Scholar

[ref10] 10. Nagano T, Lubling Y, Várnai C, Dudley C, Leung W, Baran Y, et al. Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature. 2017;547(7661):61–7. pmid:28682332
View Article
PubMed/NCBI
Google Scholar

[34] View Article

[35] PubMed/NCBI

[36] Google Scholar

[ref11] 11. Flyamer IM, Gassler J, Imakaev M, Brandão HB, Ulianov SV, Abdennur N, et al. Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition. Nature. 2017;544(7648):110–4. pmid:28355183
View Article
PubMed/NCBI
Google Scholar

[38] View Article

[39] PubMed/NCBI

[40] Google Scholar

[ref12] 12. Zhou J, Ma J, Chen Y, Cheng C, Bao B, Peng J. Robust single-cell Hi-C clustering by convolution-and random-walk–based imputation. Proceedings of the National Academy of Sciences. 2019;116(28):14011–8.
View Article
Google Scholar

[42] View Article

[43] Google Scholar

[ref13] 13. Li X, Feng F, Pu H, Leung WY, Liu J. scHiCTools: A computational toolbox for analyzing single-cell Hi-C data. PLoS Comput Biol. 2021;17(5):e1008978. pmid:34003823
View Article
PubMed/NCBI
Google Scholar

[45] View Article

[46] PubMed/NCBI

[47] Google Scholar

[ref14] 14. Yang T, Zhang F, Yardımcı GG, Song F, Hardison RC, Noble WS, et al. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Res. 2017;27(11):1939–49. pmid:28855260
View Article
PubMed/NCBI
Google Scholar

[49] View Article

[50] PubMed/NCBI

[51] Google Scholar

[ref15] 15. Zhang Y, An L, Xu J, Zhang B, Zheng WJ, Hu M, et al. Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus. Nat Commun. 2018;9(1):750. pmid:29467363
View Article
PubMed/NCBI
Google Scholar

[53] View Article

[54] PubMed/NCBI

[55] Google Scholar

[ref16] 16. Ursu O, Boley N, Taranova M, Wang YR, Yardimci GG, Stafford Noble W. GenomeDISCO: a concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics. 2018;34(16):2701–7.
View Article
Google Scholar

[57] View Article

[58] Google Scholar

[ref17] 17. Zhu H, Wang Z. SCL: a lattice-based approach to infer 3D chromosome structures from single-cell Hi-C data. Bioinformatics. 2019;35(20):3981–8. pmid:30865261
View Article
PubMed/NCBI
Google Scholar

[60] View Article

[61] PubMed/NCBI

[62] Google Scholar

[ref18] 18. Hong H, Jiang S, Li H, Du G, Sun Y, Tao H, et al. DeepHiC: A generative adversarial network for enhancing Hi-C data resolution. PLoS Comput Biol. 2020;16(2):e1007287. pmid:32084131

[ref19] 19. Zhen C, Wang Y, Han L, Li J, Peng J, Wang T. A novel framework for single-cell Hi-C clustering based on graph-convolution-based imputation and two-phase-based feature extraction. bioRxiv. 2021;:2021–04.

[ref20] 20. Yu M, Abnousi A, Zhang Y, Li G, Lee L, Chen Z, et al. SnapHiC: a computational pipeline to identify chromatin loops from single-cell Hi-C data. Nat Methods. 2021;18(9):1056–9. pmid:34446921

[ref21] 21. Li X, Lee L, Abnousi A, Yu M, Liu W, Huang L. SnapHiC2: A computationally efficient loop caller for single cell Hi-C data. Computational and Structural Biotechnology Journal. 2022;20:2778–83.

[ref22] 22. Wu H, Wu Y, Jiang Y, Zhou B, Zhou H, Chen Z, et al. scHiCStackL: a stacking ensemble learning-based method for single-cell Hi-C classification using cell embedding. Brief Bioinform. 2022;23(1):bbab396. pmid:34553746

[ref23] 23. Zhang R, Zhou T, Ma J. Multiscale and integrative single-cell Hi-C analysis with Higashi. Nat Biotechnol. 2022;40(2):254–61. pmid:34635838

[ref24] 24. Xie Q, Han C, Jin V, Lin S. HiCImpute: A Bayesian hierarchical model for identifying structural zeros and enhancing single cell Hi-C data. PLoS Comput Biol. 2022;18(6):e1010129. pmid:35696429

[ref25] 25. Lyu H, Liu E, Wu Z, Li Y, Liu Y, Yin X. scHiCPTR: unsupervised pseudotime inference through dual graph refinement for single-cell Hi-C data. Bioinformatics. 2022;38(23):5151–9.

[ref26] 26. Pan J-Y, Yang H-J, Faloutsos C, Duygulu P. Automatic multimedia cross-modal correlation discovery. In: Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, 2004. 653–8. doi: https://doi.org/10.1145/1014052.1014135

[ref27] 27. Imakaev M, Fudenberg G, McCord RP, Naumova N, Goloborodko A, Lajoie BR, et al. Iterative correction of Hi-C data reveals hallmarks of chromosome organization. Nat Methods. 2012;9(10):999–1003. pmid:22941365

[ref28] 28. Zhan Y, Mariani L, Barozzi I, Schulz EG, Blüthgen N, Stadler M, et al. Reciprocal insulation analysis of Hi-C data shows that TADs represent a functionally but not structurally privileged scale in the hierarchical folding of chromosomes. Genome Res. 2017;27(3):479–90. pmid:28057745

[ref29] 29. Lévy-Leduc C, Delattre M, Mary-Huard T, Robin S. Two-dimensional segmentation for analyzing Hi-C data. Bioinformatics. 2014;30(17):i386–92.

[ref30] 30. Shin H, Shi Y, Dai C, Tjong H, Gong K, Alber F, et al. TopDom: an efficient and deterministic method for identifying topological domains in genomes. Nucleic Acids Res. 2016;44(7):e70. pmid:26704975

[ref31] 31. Zufferey M, Tavernari D, Oricchio E, Ciriello G. Comparison of computational methods for the identification of topologically associating domains. Genome Biol. 2018;19(1):217. pmid:30526631

[ref32] 32. Karlin S. A first course in stochastic processes. Academic Press. 2014.

[ref33] 33. Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis. 2012;33(3):1029–47.

[ref34] 34. Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218.

[ref35] 35. Dekker J, Marti-Renom MA, Mirny LA. Exploring the three-dimensional organization of genomes: interpreting chromatin interaction data. Nat Rev Genet. 2013;14(6):390–403. pmid:23657480

[ref36] 36. Lajoie BR, Dekker J, Kaplan N. The Hitchhiker’s guide to Hi-C analysis: practical guidelines. Methods. 2015;72:65–75.

[ref37] 37. Park J, Lin S. A random effect model for reconstruction of spatial chromatin structure. Biometrics. 2017;73(1):52–62. pmid:27214023
