Sequence-based multiscale modeling for high-throughput chromosome conformation capture (Hi-C) data analysis

In this paper, we introduce sequence-based multiscale modeling for biomolecular data analysis. We employ spectral clustering method in our modeling and reveal the difference between sequence-based global scale clustering and local scale clustering. Essentially, two types of distances, i.e., Euclidean (or spatial) distance and genomic (or sequential) distance, can be used in data clustering. Clusters from sequence-based global scale models optimize spatial distances, meaning spatially adjacent loci are more likely to be assigned into the same cluster. Sequence-based local scale models, on the other hand, result in clusters that optimize genomic distances. That is to say, in these models, sequentially adjoining loci tend to be cluster together. We propose two sequence-based multiscale models (SeqMMs) for the study of chromosome hierarchical structures, including genomic compartments and topological associated domains (TADs). We find that genomic compartments are determined only by global scale information in the Hi-C data. The removal of all the local interactions within a band region as large as 10 Mb in genomic distance has almost no significant influence on the final compartment results. Further, in TAD analysis, we find that when the sequential scale is small, a tiny variation of diagonal band region in a contact map will result in a great change in the predicted TAD boundaries. When the scale value is larger than a threshold value, the TAD boundaries become very consistent. This threshold value is highly related to TAD sizes. By the comparison of our results with those previously obtained using a spectral clustering model, we find that our method is more robust and reliable. Finally, we demonstrate that almost all TAD boundaries from both clustering methods are local minimum of a TAD summation function.


Introduction
The chromosome, the physical realization of genetic information, is one of the most complex and important cellular entities [1][2][3][4][5][6][7]. Over the past few decades, the significance of its threedimensional architecture for supporting essential biological functions, such as DNA that clusters from sequence-based global scale models optimize Euclidean distance relations, and these models can be used in genomic compartment analysis. In contrast, clusters from local scale models optimize genomic distance relations, and these models can be used in TAD analysis. Essentially, our SeqMMs provide a way to explore the hierarchical structures of chromosomes.
Mathematically, genomic compartments are defined from principal component analysis [17], they are global structural features. The loci in the same genomic compartment are spatially close to each other. But their sequential distances can be very large. Based on global scale clustering, we design Type-1 SeqMM and use it for genomic compartment analysis. In contrast, TADs are local structural features. The loci in the same TAD are not only spatially close to each other, but also sequentially adjacent to each other. Their sequential distances are usually within a certain genomic distance. Based on local scale clustering, we introduce Type-2 SeqMM and use it in TAD analysis.

Methods
As a discrete representation of geometries, manifolds, high-dimensional structures, abstract relations and complicated subjects, point cloud data (PCD) are widely used in computer science, engineering, scientific computing and data science. Particularly, PCD and PCD based classification or clustering methods [37], including K-means, hierarchical clustering, spectral clustering, modularity, graph centrality, network approaches, etc, have been constantly used in biomolecular data analysis. However, as demonstrated in Fig 1, biomolecular structure data are essentially different from the general PCD, as they incorporate a unique sequential information. The simulated structure corresponds to chromosome 22 from Human ES Cell line and is generated by using software shRec3D [33].
To have an intuitive understanding of the sequential information in PCD analysis, we consider a DNA structure with PDB ID 1ZEW. Using atomic coordinates, a weight matrix is constructed. The weight values are defined by using the rigidity function [38], where r ij is the Euclidean distance between i-th and j-th atoms, N is the total number of atoms and η is the scale parameter that controls the influence range of each atom. In this case, we choose η = 8 Å. The weight matrix is illustrated in Fig 2(c 2 ). Two more matrices are constructed by dividing the weight matrix into a diagonal band region as in Fig 2(a 2 ) and the remaining off-diagonal regions as in Fig 2(b 2 ). Based on these three matrices, we can decompose the DNA structure into two parts using the spectral clustering method [37,39]. Results are illustrated in Fig 2(a 1 ), 2(b 1 ) and 2(c 1 ). It can be seen that, if we only consider relations between sequentially adjacent atoms, which are represented in the diagonal region, the DNA structure will be clustered into two complementary helix chains. However, if we use the whole matrix or only off-diagonal regions, the DNA structure will be divided in the middle region with two chains in each cluster. Generally speaking, Fig 2 demonstrates two types of sequence-based models, i.e., sequence-based local models and sequence-based global models. It can be seen that their properties in structure decomposition differ greatly. In the first type, atoms with shorter sequential distances are more likely to be grouped into the same cluster. In the second one, spatially close atoms, i.e., atoms with large weight values, are more likely to be assigned to the same cluster. Mathematically, the sequence-based local model optimizes sequential distances, while the global model optimizes spatial distances or Euclidian distances. All PCD based classification and clustering methods belong to the second type. Therefore, the direct application of these methods in biomolecular data analysis may have some limitations.
In Hi-C data analysis, sequential information is usually highly relevant. Fig 3 demonstrates a potential problem for global scale clustering in TAD analysis. In this figure, genomic loci are represented by red pentagons. It can be seen that the spatial distance between the two loci in any red circle is much shorter than the one in green circles, while sequential distances are exactly the opposite. If we use the traditional PCD based clustering methods, two loci in the same red circle will always have priority to be clustered into the same TAD. Obviously, this will cause serious interpretation problems, as the sequential distance between the two loci can be much larger than the size of a TAD.

Sequence-based multiscale modeling
It should be noticed that two distance definitions, i.e., Euclidean distance and sequential distance, are greatly different and matter a lot in multiscale modeling. The Euclidean distance is just the three dimensional distance between two elements. In Hi-C data, Euclidean distance between genomic loci is inversely related to their contact frequencies [35]. In contrast, the sequential distance is defined between two elements on chains or polymers. If sequential numbers are assigned to elements, a sequential distance is just the difference between these two integers and it is always an integer. In Hi-C data, sequential distance between two loci is their genomic distance.
Even though graph and network based multiscale models are widely used in biomolecular structure and function analysis [40][41][42][43][44][45][46][47][48][49], the measurement defined in these models are in terms of the Euclidean distance. To be more specific, when we discuss atomic scale, residue scale, second structure scale, tertiary structure scale, etc, we are analyzing structural elements based on their sizes measured in Euclidean distances.
In this section, the sequence-based multiscale modeling is proposed for biomolecular data analysis, particularly for Hi-C data analysis. In our multiscale models, a scale parameter N b is defined not from the Euclidean distance but from the sequential distance. The parameter N b can be viewed as a cut-off sequential distance. In Hi-C matrices, the parameter N b specifies the size of the diagonal band region. Further, two sequence-based multiscale models are proposed for analyzing chromosome genomic compartments and TADs. These two models, denoted as Type-1 SeqMM and Type-2 SeqMM, are derived from the perspective of local scale clustering and global scale clustering, respectively.
Type-1 SeqMM. In Type-1 SeqMM, we remove sequentially short-range interactions by changing the value of scale parameter N b . More specifically, for a Hi-C matrix, a diagonal band A potential problem for global scale clustering in TAD analysis. Each locus is represented by a red pentagon. Global scale clustering considers only spatial relations, thus groups loci in each red dash circle as a cluster. Biologically, loci in each green circle are more favorable to be clustered into the same TAD, as their sequential distances are much shorter. The missing of sequential information in global scale clustering will cause problems in the TAD analysis. https://doi.org/10.1371/journal.pone.0191899.g003 Sequence-based multiscale modeling for Hi-C data analysis region with size N b is systematically removed from the model, resulting a new Hi-C matrix as following, Here M ij can be the original or normalized contact frequencies. It can seen that our Type-1 SeqMM is defined by taking away the local interactions from the model and is designed for global scale clustering. An example can be found in Fig 6(a) to 6(c). We suggest that it can be used in chromosome genomic compartment analysis. Type-2 SeqMM. In Type-1 SeqMM, when short-range interactions are systematically removed from the biomolecular data, long-range interactions are preserved. Type-2 SeqMM is designed in the exact opposite way, The scale parameter N b controls the size of the diagonal band region. Mathematically, our SeqMM matrix in Eq (3) is a weighted Laplacian matrix, which plays an important role in graph representation and spectral clustering [37,39,50]. The second smallest eigenvalue and its associated eigenvector from the Laplacian matrix, are known as the Fiedler value (or algebraic connectivity) and the Fiedler vector, respectively. The Fiedler value is an important measurement of the general topological connectivity of a graph. The Fiedler vector gives an optimized classification of a graph into two separated domains [39,50]. In our Type-2 SeqMM, the local interaction region can be systematically enlarged to model the different scales of interactions.
Type-2 SeqMM is proposed for chromosome TAD analysis. After Hi-C data preprocessing, a weighted Laplacian matrix can be generated by using a suitable scale value N b . The TAD number in the data is estimated based on size and resolution of the Hi-C matrix. We assume the size of TAD to be around 2Mb, and TAD number N c can be roughly calculated by dividing the total length over 2Mb. The basic procedure is presented in Algorithm 1. It should be noticed that the final number of TADs is usually larger than N c . The Code is available at S1 File.

Algorithm 1 Type-2 SeqMM for TAD analysis
Pre-processing: Remove all rows and columns, that summed together equal to zero (or smaller than a pre-defined range); Transform the Hi-C contact frequencies to spatial distances (default function f(x) = log (1 + x)); Step 1: Choose a scale parameter N b to construct a weighted Laplacian matrix as in Eq 3; Step 2: Calculate the first N c eigenvectors. Here N c is the estimated number of TADs; Step 3: Employ the K-means algorithm on the N c eigenvectors to identify N c clusters; Step 4: Subdivide each cluster into several TADs until the loci in each TAD are sequentially contiguous.

Genomic compartments
The genomic compartment is defined from the principal component analysis of Hi-C data. Mathematically, the principal component captures the global shape of a structure. In Chen's spectral method [36], it shows that the genomic compartment results from the first principal component (FPC) are identical to the predictions made from the lowest-frequency eigenvector of weighted Laplacian matrices. More interesting, as proved in the elastic network model and normal mode analysis, these lowest-frequency eigenvectors are uniquely determined by the global geometric information of structures [51][52][53][54].
Since the FPC describes the global properties of a structure, we use the Type-1 SeqMM for our genomic compartment analysis. We consider the GM06990 chromosome 14 data with resolution 100Kb. This is a classic example used for genomic compartment analysis [17]. Before the principal component analysis (PCA), the chromosome 14 Hi-C matrix is processed. We remove all columns and rows with all zero values and normalize the matrix using the Toeplitz matrix [17]. After that, we construct a new matrix by removing the diagonal band region with N b = 60 from the normalized Hi-C matrix, and calculate its FPC. Further, we compare this new FPC with the original one. Results are shown in Fig 4. The blue line represents the FPC from the original Hi-C matrix and red line represents the FPC from the off-diagonal matrix. It can be seen that they are almost identical to each other. Actually, the Pearson correlation coefficient (PCC) between the two FPCs is as high as 0.991, meaning that the removal of the diagonal band region have almost no influence to the FPC.
To have a more quantitative understanding of the FPC and Hi-C diagonal regions, we continuously change the value of the scale parameter N b to generate a series of Hi-C matrices at different scales. Then we systematically calculate their FPCs and measure the similarity between these FPCs with the original one by PCCs. Results are shown in Fig 5. It can be seen that PCCs changes very slowly when scale parameter is smaller than 100, which is 10 Mb in genomic distance. State differently, we can get almost the same genomic compartment even when we remove all the Hi-C data within the 10 Mb band region. It should be noticed that almost all the largest Hi-C value, i.e., contact frequencies, are located within this 10 Mb band. These values, however, are irrelevant to the chromosome genomic compartment! We further test our SeqMM on other GM06990 chromosomes. A very consistent pattern can be observed. Results of chromosomes 1, 5, 9 and 13 are illustrated in Fig 6. It can be seen that the shape decrease of PCCs is usually found at around 100 locus (10 Mb in genomic distance). This indicates a transition between local scales to global scales. Further studies are needed to explain its biological implications.

Topological associated domain
Another very important finding in Hi-C data analysis is the topological associated domain. TADs are megabase-sized local chromatin interaction domains. They have loop structures and are highly stable and conserved across various cell types and species. TAD boundaries are found to be enriched with the protein CTCF, housekeeping genes, transfer RNAs and short interspersed element (SINE) retrotransposons [18,23,24,26]. These components play important roles in establishing and supporting TADs and other architectural structures of the chromosome. Due to the structural and functional importance of TADs, various algorithms have been proposed for the identification of TADs as stated in the introduction part. However, the sequential information is not considered in any of these models.
In this section, Type-2 SeqMM is used to study chromosome TADs. In our Type-2 SeqMM, the clustering is done by using K-means method on eigenvectors from spectral graph model. The basic procedure of the algorithm is illustrated in Algorithm 1. To explore the relation between the band size and TAD boundaries from the clustering, we consider a 100Kb resolution Hi-C matrix for chromosome 22 from IMR90 cell line. We systematically change the band size N b from 20, 80, 140 to 200. The corresponding TAD boundaries are illustrated in Fig  7. It can be seen that the TAD regions evaluated from different scales are not exactly the same and have some variations. Particularly when the band size N b change from 20 to 80, the calculated TAD regions are quite different. Further, when the band size is larger than 80, although the TAD boundaries are still not the same, they share more and more common values.
To have a more quantitative understanding of this, we systematically change the scale parameter N b from 2 to 351 (the size of the normalized Hi-C matrix) and calculate the TAD boundaries. Results are shown in Fig 8. We can find that when the value of scale parameter N b is small, a tiny change of its value can result in huge variations of the predicted TAD boundaries. However, when the scale parameter is larger than a certain value, the fluctuations in the    Sequence-based multiscale modeling for Hi-C data analysis predicted TAD boundaries are greatly reduced. The threshold value is roughly about 20, which is 2 Mb in genomic distance.
We further apply the spectral approach used in Chen's method [36] on the multiscale Laplacian matrices in Eq (3). Results are shown in Fig 9. It can be seen that the variation of the predicted TAD boundaries by his method is much larger than that of our Type-2 SeqMM. More interestingly, the amplitude of variation below the threshold (2 Mb) is much larger than the one after the threshold, which is the same as in our model. Biologically, the threshold value should be highly related to the TAD properties. This is because when the band sizes N b of our multiscale matrices are smaller than the size of TADs, local interactions within TADs are removed from our models, resulting in a much larger variation in predicted TAD boundaries. However, when the band size is larger than 2Mb, all TAD-related local interactions will be considered, thus a much consistent TAD boundaries should be expected. Stated differently, since TADs are mainly determined by local interactions within the 2Mb band region, the calculated TAD boundaries should always be the same for multiscale matrices with N b larger than 2Mb. In this sense, our Type-2 SeqMM is much more robust and reliable than Chen's method [36] as a much smaller variation is observed in our model when N b is larger than 2Mb. Mathematically, in Chen's spectral method, the global scale clustering is iteratively used to subdivide contact matrix or matrix region into two subregions until the algebraic connectivities within the submatrices are all smaller than certain threshold. Therefore, this method optimizes only spatial distances between different loci. Sequence-based multiscale modeling for Hi-C data analysis Further, even with the difference between the two models, both methods capture the local minimum of a TAD summation function. We consider the 100Kb resolution Hi-C matrix for chromosome 22 from IMR90 cell line. The band size N b is chosen as 60, which is amount to 6 Mb in genomic sequence. We summarize the contact matric values along the direction that is perpendicular to the matrix diagonal. Results are shown as the black lines in Fig 10(a) and 10(b). The TAD boundaries from Chen's method and our Type-2 SeqMM are illustrated by blue and red lines. It can be seen that nearly all of these lines are located at the local minima of the summation function. More interestingly, the two methods share many common TAD boundaries. This indicates that the situation illustrated in Fig 3 does not widely exist. This can also be confirmed from the behavior of off-diagonal values. Usually, the off-diagonal values decrease very quickly outside the TAD regions, meaning the distance between a locus from a TAD and a locus outside this TAD is usually very large.

Conclusion
In this paper, we discuss a sequence-based multiscale clustering model for biomolecular data analysis. Biomolecules and their complexes are hierarchical structures made from one or several polymer chains. With the sequential information embedded in these polymer chains, biomolecular data are fundamentally different from the general point cloud data. Traditional clustering methods derived from point cloud data, fall short when sequential information  Sequence-based multiscale modeling for Hi-C data analysis matters. To overcome this problem, we propose a sequence-based multiscale model for biomolecular structure analysis. We generate a series of structural matrices by gradually and systematically removing the short-range or long-range interactions. These new matrices focus on different sequential scales and their clustering has different biological interpretations. Two SeqMMs have been applied to Hi-C data analysis. We find that the genomic compartments only relate to the global scale information. The removal of a diagonal band region as large as 10 Mb has very little influence to the finally compartment results. Further, we study TADs with our local scale models. We find that when sequence scale is small, a tiny variation of its value will result in great changes in TAD boundaries. However, when the scale value is larger than a threshold value, the TAD boundaries become very consistent. This threshold value is highly related to the sizes of TADs. Interestingly, our method is much more robust than a previous spectral clustering method in the TAD analysis.