Advertisement
  • Loading metrics

scHiCTools: A computational toolbox for analyzing single-cell Hi-C data

  • Xinjun Li,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation Department of Statistics, University of Michigan, Ann Arbor, Michigan, United States of America

  • Fan Feng,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Visualization, Writing – original draft

    Affiliation Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America

  • Hongxi Pu,

    Roles Formal analysis, Investigation

    Affiliation College of Literature Science, and the Arts, University of Michigan, Ann Arbor, Michigan, United States of America

  • Wai Yan Leung,

    Roles Investigation, Validation

    Affiliation Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America

  • Jie Liu

    Roles Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Writing – original draft, Writing – review & editing

    drjieliu@umich.edu

    Affiliation Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America

scHiCTools: A computational toolbox for analyzing single-cell Hi-C data

  • Xinjun Li, 
  • Fan Feng, 
  • Hongxi Pu, 
  • Wai Yan Leung, 
  • Jie Liu
PLOS
x

Abstract

Single-cell Hi-C (scHi-C) sequencing technologies allow us to investigate three-dimensional chromatin organization at the single-cell level. However, we still need computational tools to deal with the sparsity of the contact maps from single cells and embed single cells in a lower-dimensional Euclidean space. This embedding helps us understand relationships between the cells in different dimensions, such as cell-cycle dynamics and cell differentiation. We present an open-source computational toolbox, scHiCTools, for analyzing single-cell Hi-C data comprehensively and efficiently. The toolbox provides two methods for screening single cells, three common methods for smoothing scHi-C data, three efficient methods for calculating the pairwise similarity of cells, three methods for embedding single cells, three methods for clustering cells, and a build-in function to visualize the cells embedding in a two-dimensional or three-dimensional plot. scHiCTools, written in Python3, is compatible with different platforms, including Linux, macOS, and Windows.

Author summary

Single-cell Hi-C contact maps describe the numbers of interactions among genomic loci across the entire genome, and provide researchers 3D chromatin organization in each cell. There are growing demands for an easy and fast way to analyze and visualize single-cell Hi-C data, and analyzing single-cell Hi-C data exposes several inherent data analysis challenges. To move beyond existing computational tools and methods to analyze and visualize single-cell Hi-C data, we present a software package, scHiCTools, which is implemented in Python. The software package provides researchers a collection of methods to investigate the cell-to-cell similarity based on their 3D chromatin organization, cluster cells into groups accordingly, and visualize cells in two-dimensional or three-dimensional scatter plots. In this paper, we provide an overview of scHiCTools’ structure and capabilities. We then apply scHiCTools to several single-cell Hi-C datasets to benchmark the performance of the methods provided in our toolbox, and present some plots generated using the software package.

This is a PLOS Computational Biology Software paper.

Introduction

Recent single-cell Hi-C sequencing (scHi-C) technologies profile three-dimensional (3D) chromatin contact maps in individual cells, allowing us to characterize chromatin organization dynamics and cell-to-cell heterogeneity [13]. However, the interpretation of scHi-C data exposes several inherent data analysis challenges [4]. First, unlike RNA-seq data and ATAC-seq data which are vectors of m-dimensional measures, Hi-C data are essentially symmetric matrices of m × m-dimensional pairwise measures, where the number of genomic loci m is usually more than tens of thousands, depending on the resolution of the contact maps. Second, scHi-C analysis suffers from high-dimensionality, the sparsity of the contact maps, and sequencing noise. Typically in a scHi-C experiment, up to a few thousands of single cells are profiled, whereas the number of contacts in each cell ranges from a few thousands to hundreds of thousands. Third, single cells in one experiment usually reside in a low-dimensional manifold, such as a circular cell cycle structure or a bifurcation differentiation structure. Thus, proper embedding of scHi-C data in a low-dimensional Euclidean space is vital in scHi-C data analysis.

In a previous exploratory study [4], different similarity methods [59] have been applied to scHi-C data from n single cells, and coupled with multidimensional scaling (MDS) to project the n single cells into a low-dimensional Euclidean space. Among these methods, HiCRep [5] yields reasonable similarity measures and satisfactory embedding of the single cells, but its O(n2) computational complexity makes it impractical when the number of cells is large. In addition, this proof-of-concept study [4] did not provide any software implementation to embed scHi-C data, let along upstream analysis such as screening single cells and smoothing contact maps, and downstream analysis such as clustering and visualization.

In this work, we implemented a versatile scHiCTools which includes many common approaches in the entire workflow of analyzing single-cell Hi-C data. In particular, we implemented three similarity measures, including a faster version of HiCRep, a new “InnerProduct” approach, and another efficient Hi-C similarity measure named Selfish [10]. Among the three methods implemented, InnerProduct provides the most efficient and satisfactory similarity measure. Benchmarking experiments demonstrate that the new InnerProduct approach runs thousands of times faster than the original HiCRep, and produces comparably accurate projection. To deal with the sparsity in scHi-C data, different smoothing approaches are implemented, including linear convolution, random walk, and network enhancing [11]. Among the three approaches, linear convolution appears to be most effective for smoothing contact maps in our experiments. In addition to the computational components, our toolbox supports different input file formats, diagnostic summary plots, and flexible projection plots. Our open-source toolbox, scHiCTools, as the first toolbox of such kind, can be useful for analyzing scHi-C data.

Design and implementation

Overview

Our scHiCTools implements commonly used approaches to analyze single-cell Hi-C data. The key component of the toolbox is a number of dimension reduction approaches which takes a number of single cells’ contact maps as input, and embeds the cells in a low-dimensional Euclidean space. The toolbox also provides a number of built-in auxiliary functions for flexible and interactive visualization. The entire workflow of scHiCTools, illustrated in Fig 1, includes five steps: (1) reading single-cell data in .txt, .hic, or .cool format, generating diagnostic summary plots, and screening cells by their contact number and contact distance profile, (2) smoothing scHi-C contact maps using linear convolution, random walk, or network enhancing, (3) calculating pairwise similarity between cells using fastHiCRep, InnerProduct, or Selfish, (4) embedding or clustering the cells in a low-dimensional space using dimension reduction methods, and (5) visualizing the two-dimensional or three-dimensional embedding in a scatter plot. Except for the two pairwise similarity calculation methods, fastHiCRep and InnerProduct, other methods are implemented as originally stated.

thumbnail
Fig 1. The workflow of scHiCTools.

The workflow of scHiCTools includes five steps: (1) reading input single-cell data in .txt, .hic, or .cool format, generating the summary plots of the cells, and screening cells based on their contact number and contact distance profile, (2) smoothing the scHi-C contact maps using linear convolution, random walk, or network enhancing, (3) calculating the pairwise similarity between cells using fastHiCRep, InnerProduct, or Selfish, (4) embedding or clustering the cells in a low-dimensional Euclidean space using dimension reduction methods, and (5) visualizing the two-dimensional or three-dimensional embedding in a scatter plot.

https://doi.org/10.1371/journal.pcbi.1008978.g001

Loading data and screening cells

Users can load scHi-C data in different file formats, including .hic files, .cool files, and sparse contact matrices in text files. When users choose to load sparse matrices in text files, they are able to customize each column in the text files, and specify additional information including reference genome and the resolution of the contact maps. In addition, scHiCTools supports parallel file loading on a multi-core processor. After loading the data files, scHiCTools allows users to plot two summary plots to examine the quality of the loaded single-cell Hi-C data, namely a histogram of contact numbers, and a scatter plot of cells with the proportion of short-range contacts (<2 Mb) versus the proportion of the contacts at the mitotic band (2 ∼ 12 Mb) (Fig 2). A low-quality contact map is characterized by a low number of contacts and a relatively high proportion of short-range contacts. Users can further remove the low-quality cells by thresholding the number of contacts and the proportion of short-range contacts in the cells.

thumbnail
Fig 2. Summary plots for examining the quality of input scHi-C data.

(a) A histogram of contact numbers in the individual cells. (b) A scatter plot showing the percentage of short-range contacts (<2 Mb) versus the percentage of contacts at the mitotic band (2 ∼ 12 Mb) in individual cells.

https://doi.org/10.1371/journal.pcbi.1008978.g002

Smoothing contact maps

Smoothing single cell contact maps is necessary when they are sparse, especially in a comparative analysis. Our toolbox scHiCTools implements three smoothing approaches.

Linear convolution is essentially a two-dimensional convolution filter with equal weights in every position, which can be viewed as smoothing over neighboring bins in Hi-C contact maps. For example, original HiCRep [5] uses a parameter h to describe a (2h + 1) × (2h + 1) convolution filter, i.e., h = 1 indicating a 3 × 3 kernel with each element .

Random walk [12] is another approach to smooth chromatin contact maps. Unlike linear convolution which takes information from neighbors on the contact map, random walk captures the signals from a global setting as follows. Let W be the input m × m contact matrix. Random walk updates W by W(t) = W(t−1)B in the t-th iteration, where W(0) = W, and . The matrix B is the input matrix divided by its row sum (i.e., every row of B sum up to 1). We can write B as B = D−1W, where . After t steps of update, W(t) = WBt is the output matrix after smoothing.

Network enhancing [11] is a special type of random walk which enhances network signals by increasing gaps between leading eigenvalues of a doubly stochastic matrix (DSM, the sum of each row and the sum of each column are both 1). To get a DSM, Knight-Ruiz (KR) normalization [13] is applied to the original contact matrix W, i.e., finding a vector a, such that W′ = aWa is a DSM. Network enhancing makes the partition of contact maps more prominent, and enhances boundaries of topologically associated domains (TADs) in chromatin contact maps [11].

Calculating pairwise similarity

Hi-C contact maps are m × m-dimensional measures, where m is the number of genomic loci. By calculating pairwise similarity among the cells, we avoid dealing with the high-dimensional measure in the following embedding and clustering steps. Our toolbox scHiCTools includes the following three approaches for calculating pairwise similarity among the single cells.

InnerProduct calculates the pairwise similarity matrix among cells in two steps. The first step is “scaling”, which directly sets the first s z-normalized strata of one cell’s contact map as a feature vector for the cell. Because we only need to apply this step to the individual cells sequentially, this scaling step has an O(n) time complexity. Denote vi = [vi,1, vi,2, ⋯, vi,mi] to be the i-th stratum of the chromosome, i from 1 to s. vi,k = Wk,k+i is the k-th element of vi, where W is the contact matrix of a chromosome. Subsequently, z-normalization is applied to each vi to get a zero-mean and unit-variance vector . By concatenating all strata, the feature vector for each contact map is . The second step is “multiplication”, which calculates an inner product of the n feature vectors from the n cells to obtain the n × n similarity matrix. The second step has an O(n2) time complexity, but this step can be implemented efficiently with matrix multiplication in NumPy. Namely, we directly calculate the inner product of the two feature vectors Vx and Vy of map x and map y as their similarity, (1)

Run time of InnerProduct is also linear with respect to the length of V, which depends on the product of m and s, where m is the size of the contact map, and s is the number of strata from the diagonal considered in the calculation. Taking the average of rxy over all chromosomes keeps the matrix positive definite, and gives us an overall kernel matrix of all the cells in the input. In practice, although taking the median may make the matrix no longer positive definite, it makes the similarity more stable and robust to noisy measurements.

fastHiCRep is a faster implementation of the original HiCRep approach [5]. Original HiCRep [5] calculates s stratum-adjusted correlation coefficients (SCCs) of the s strata near the diagonal of two contact maps, namely (2) where vi is the i-th stratum of the chromosome, mi is the length of vi, and ri is the Pearson’s correlation coefficient of and .

We have , then (3)

Note that the denominator term of ri can be cancelled out with in the numerator of SCC in Eq (2). We can write SCC between contact maps x and y as (4) where and .

From above, SCC can be calculated efficiently because for each cell, cell x for example, vecx and varx only need to be calculated once. The vectors required in both the numerator and the denominator are generated sequentially for individual cells with an O(n) time complexity. Note that both the numerator and the denominator are inner products. Calculating the inner products has an O(n2) time complexity, but this step can be implemented efficiently with matrix multiplication in NumPy. However, there is one subtle difference between original HiCRep and our implemented fastHiCRep. In original HiCRep, if one chromatin contact in both cells x and y is zero, then this contact position will not be used in the calculation of SCC. Because HiCRep need to remove zeros and this operation is specific to the two cells to be compared, we need to calculate ri, , and in Eq (2) times. Empirically, the SCC calculated by fastHiCRep and HiCRep did not differ too much. We randomly select a cell from Nagano dataset (cell 205 of Early-S stage) and compare the SCC scores using fastHiCRep versus HiCRep of other 1710 cells from Nagano dataset. The correlation coefficient of the SCC scores calculated by fastHiCRep and SCC scores calculated by HiCRep is 0.991.

The third similarity measure Selfish [10] was recently proposed for bulk Hi-C comparative analysis. It first uses a sliding window to obtain a number of rectangular regions along the diagonal of the contact map, and then counts overall contact numbers in each region. Then it generates a one-hot “fingerprint matrix” for each contact map. Finally, Gaussian kernels over the fingerprint matrices are calculated as similarities among the cells.

Embedding and clustering

Although single cell measures are high-dimensional, the single cells usually reside in a low-dimensional manifold such as a circular cell cycle structure and a bifurcation differentiation structure. By embedding the cells in a lower-dimensional Euclidean space, we can easily explore the heterogeneity and structures among the single cells. scHiCTools includes three different dimension reduction methods that use pairwise similarity matrices among the cells to embed them in a low-dimensional Euclidean space. The three dimension reduction methods are as follows.

MDS (Multidimensional scaling) takes in a pairwise distance matrix evaluated in the original space, and embeds the data points in a lower-dimensional space which preserves the pairwise distance matrix. In our package, we use the classical MDS which finds the p dimensional embedding that minimizes the loss function: loss = ‖XXGF, where , Dij is the distance between i-th and j-th cells, and 1 denotes a column vector of all ones.

t-SNE [14] embeds high-dimensional data in a low-dimensional space with an emphasis on preserving local neighborhood. t-SNE assumes that data points x1, ⋯, xn in the high-dimensional space follow a Gaussian distribution, and the embedded points y1, ⋯, yn follow a Student’s t-distribution. In the higher-dimensional space, the conditional probability of xi picking xj as its neighbor is . In our implementation, instead of computing the norm between two points, we directly use distances between two cells to calculate the similarity. The similarity between xi and xj is defined as the probability of picking xi and xj as neighbors, which is . In the embedding space, since y1, ⋯, ynt1, the probability of picking yi and yj as neighbors is . The Kullback–Leibler divergence of the distribution of data points from the distribution of embedded points is . The embedding of the cells are optimized by minimizing the KL-divergence above.

PHATE (Potential of Heat-diffusion for Affinity-based Trajectory Embedding) [15] is a dimension reduction approach which preserves both local and global similarity. PHATE first calculates a local affinity matrix based on k-nearest neighbor distance Kk, which captures the local structure of the data points. PHATE normalizes Kk using a Gaussian kernel to obtain a new matrix P, and then diffuses P by t steps to get a new matrix Pt which preserves the global structure of data points. Based on matrix Pt, a potential representation of data can be calculated as Ut = −log(Pt). Finally, PHATE applies non-metric MDS to Ut and generates low-dimensional embedding of the cells.

Clustering methods are desirable in the situation that the single cells come from discrete clusters rather than a continuous manifold. The following optional clustering methods are implemented in our toolbox.

k-means assigns an observation to the cluster with the nearest cluster centroid, which is the mean of all observations belonging to the cluster. We use k-means++ [16] to initialize the centroids of clusters. Iterations of k-means algorithm renew the clusters by finding the points closest to the centroid in the last iteration, and update the centroid of each cluster by taking the mean of points in each cluster. Since k-means needs coordinates of cells in a Euclidean space to find the centroid of each cluster, we use MDS to embed the cells into a l-dimensional space first, and then perform k-means accordingly.

Spectral clustering [17] takes in a distance matrix of data points and divides n observations into k clusters. Spectral clustering directly takes in the distance matrix and constructs a similarity graph with Gaussian similarity function. Based on the similarity graph, spectral clustering calculates the Laplacian of the similarity graph, and projects the points into a k-dimensional space based on the first k eigenvectors of the graph Laplacian matrix. Finally, it uses k-means to divide the data points into k group.

scHiCluster [18] is a clustering method designed explicitly for single-cell Hi-C data. It uses convolution and random walk for smoothing and imputation, then converts the contact matrices to binary matrices. scHiCluster conducts principal component analysis for embedding, and then k-means for clustering.

Visualization of embedding

scHiCTools supports two-dimensional and three-dimensional plotting of the cell embedding. Fig 3 shows the scatter plots of the two-dimensional embedding of the cells in a cell cycle study [2]. In the situation that the cells reside on a three-dimensional manifold, a three-dimensional scatter plot of the cells can be visualized (S2 File). As a convenient option, an interactive scatter plot can be visualized, in which the cell label is displayed when the user’s mouse hovers (see S3 File). In order to use this feature, ploty is required on user’s device.

thumbnail
Fig 3. Two-dimensional scatter plots of the embedding from the three methods that calculate the similarity between contact matrices, including InnerProduct, fastHiCRep and Selfish.

(Dataset: Nagano et al., 2017). (a) Two-dimensional projection from InnerProduct and MDS shows a clear circular pattern along the four stages of cell cycle. (b) Two-dimensional projection using fastHiCRep and MDS does not show clear separation between the four stages of cell cycle. (c) Two-dimensional projection using Selfish and MDS does not show clear separation between the four stages of cell cycle. (d) Evaluating the three embedding methods in a cell-cycle phasing task by ACROC. The ACROC values from InnerProduct, fastHiCRep and Selfish are 0.904, 0.858 and 0.642, respectively.

https://doi.org/10.1371/journal.pcbi.1008978.g003

Results

In this section, we apply the toolbox on a number of scHi-C datasets [2, 19], and benchmark the performance of different methods implemented. In addition to plotting the two-dimensional embedding of cells and examining whether the embedding is sensible, we benchmark the projection performance on a scHi-C dataset [2] with the average area under the curve of a circular ROC calculation (ACROC) proposed in a recent work [4]. ACROC measure ranges between 0 to 1. An embedding representing a better circular pattern produces a larger ACROC value. We record the run time of these methods to compare their efficiency. We evaluate the clustering performance with two criteria: normalized mutual information (NMI) [20] and adjusted rand index (ARI) [21] on another scHi-C dataset [19]. We have the following observations.

InnerProduct is effective for calculating the pairwise similarity among single-cell Hi-C contact maps

When InnerProduct is coupled with MDS, it produces an accurate projection of single cells along four stages of cell cycle (Fig 3a), achieving an average area under a circular ROC calculation curve (ACROC) of 0.904, which is comparable to original HiCRep reported in the recent work [4]. Two-dimensional scatter plots from fastHiCRep and Selfish show a circular pattern along the four stages of cell cycle, but the separation between the stages is not as clear as that from InnerProduct (Fig 3b and 3c). The ACROC measure from fastHiCRep is 0.858 and the ACROC measure from Selfish is 0.642, which are both lower than that from InnerProduct (Fig 3d).

PHATE and t-SNE produce satisfactory projections

Since MDS recovers global pairwise distance in its projection, it is unclear whether methods that preserve local pairwise distance (i.e., PHATE and t-SNE) can produce a better projection. Fig 4a and 4b show that PHATE and t-SNE are suitable embedding methods that project the smoothed contact maps into a lower-dimensional space while preserving a cell-cycle pattern. The ACROC measure of PHATE is 0.920, which is better than t-SNE (0.901) and MDS (0.904). Therefore, PHATE and t-SNE can be used alternatively, especially when faraway neighbors’ similarity cannot be properly evaluated.

thumbnail
Fig 4. Two-dimensional embedding using different dimension reduction methods.

(Dataset: Nagano et al., 2017). (a) Two-dimensional projection from InnerProduct/PHATE shows a circular pattern similiar to MDS projection (ACROC: 0.920). (b) Two-dimensional projection from InnerProduct/t-SNE shows a circular pattern (ACROC: 0.901).

https://doi.org/10.1371/journal.pcbi.1008978.g004

All three methods to calculate pairwise similarity are computationally efficient

The run time of InnerProduct, fastHiCRep, Selfish, and original HiCRep was measured on an Intel Xeon W-2175 CPU with a frequency of 2.50GHz (Table 1). Each experiment was replicated ten times, and the average run time was reported (Run time from different replicate experiments showed little variance. See S5 File for raw data). For embedding 1,000 cells randomly selected from data [2], three methods in our package, including InnerProduct, fastHiCRep, and Selfish, finished within minutes. In contrast, original HiCRep, also implemented in Python, took around five hours. For InnerProduct, we further measured the time spent in the first scaling step and in the second multiplication step. It was observed that the run time from the “scaling” step was linear in terms of cell number n, whereas the run time from the “multiplication” step was quadratic in terms of cell number n.

thumbnail
Table 1. Average run time (in seconds) of different methods as the number of cells vary.

(Run time is averaged from 10 replicate experiments, performed on an Intel Xeon W-2175 CPU with a frequency of 2.50GHz).

https://doi.org/10.1371/journal.pcbi.1008978.t001

Linear convolution smoothing and random walk improve projection at high dropout rates

To examine the three smoothing approaches implemented in our toolbox, we sparsified the scHi-C dataset [2] with two methods. The first sparsification method is to remove 40% ∼ 99.9% of the contacts randomly from all genomic positions (i.e., contact number ranging from ∼200,000 to ∼500 in each cell). The second sparsification method is to simulate dropout events in sequencing data which discard contacts from 5% ∼ 60% of the genomic loci. With the sparsified datasets, we examined the quality of single cell embedding when different smoothing approaches were used. It was observed that none of the three smoothing methods, including linear convolution, random walk, and network enhancing, improved the embedding performance when the single cell contact maps were down-sampled by the first sparsification method (Fig 5a). The ACROC decreased from around 0.9 to around 0.6 as down-sampling rate changed from 1 to 0.05. Under the second sparsification method (dropout), linear convolution and random walk produced better single cell embedding, compared with the situation when no smoothing was used (Fig 5b). At a high dropout rate (0.7), linear convolution kept the average ACROC above 0.9, whereas other methods’ average ACROC’s dropped to below 0.9. Therefore, we recommend users use linear convolution to smooth scHi-C contact maps if they suspect dropout events exist moderately in their scHi-C data.

thumbnail
Fig 5. Average ACROC measures from InnerProduct without any smoothing, InnerProduct with random walk smoothing, InnerProduct with linear convolution, and InnerProduct with network enhancing.

(a) When the dataset was sparsified with the first sparsification method, ACROC measures from the four approaches decreased when down-sampling rate increased. (b) When the dataset was sparsified with the second sparsification method, ACROC measures from the four approaches decreased when down-sampling rate increased, but at a high dropout rate (0.7), linear convolution’s ACROC remained high, whereas other methods’ ACROC dropped to below 0.9.

https://doi.org/10.1371/journal.pcbi.1008978.g005

scHiCluster produces better clustering results than InnerProduct coupled with spectral clustering or with k-means, but less efficient

Table 2 shows the performance of different clustering methods implemented in scHiCTools on the dataset of Collombet et al., 2020 [19]. We use all 750 mouse embryo cells at five differentiation stages in the study of Collombet et al., including the 1-cell, 2-cell, 4-cell, 8-cell, and 64-cell stages. We separate the cells into five clusters and evaluate the clusters using two evaluation measures, including normalized mutual information (NMI) [20], and adjusted rand index (ARI) [21]. A better clustering produces larger values of NMI and ARI. For random clustering, the values of NMI and ARI are close to 0. When the cluster agrees with the true label, NMI and ARI reach the upper bound 1. scHiCluster shows better NMI and ARI values than InnerProduct with k-means and InnerProduct with spectral clustering. InnerProduct with k-means and InnerProduct with spectral clustering are more efficient than scHiCluster. The run time of scHiCluster, InnerProduct with k-means, Selfish, and InnerProduct with spectral clustering was measured on an Intel Xeon W-2175 CPU with a frequency of 2.50GHz. Each experiment was replicated ten times, and the average run time was reported (Run time from different replicate experiments showed little variance. See S5 File for more details).

thumbnail
Table 2. The normalized mutual information (NMI), adjusted rand index (ARI), and run time of three clustering approaches.

(Data: 750 embryo cells at five differentiation stages, including 1-cell, 2-cell, 4-cell, 8-cell and 64-cell stages, Collombet et al., 2020).

https://doi.org/10.1371/journal.pcbi.1008978.t002

Availability and future directions

Our scHiCTools is implemented in Python. The source code is available and maintained at Github: https://github.com/liu-bioinfo-lab/scHiCTools. This package is also available on PyPI python package manager. The current code runs under Python 3.7 or newer versions. Other dependency includes numpy, scipy, matplotlib, pandas, simplejson, six, and h5py. For the interactive scatter plot function, you need to have plotly installed. In the future, we will keep updating the toolbox with new scHi-C analysis algorithms, including new embedding methods such as UMAP and new clustering methods such as hierarchical clustering.

Supporting information

S1 File. Plots from different similarity measures and embedding methods applied to Nagano single-cell dataset.

This zip file includes the plots of other combination of similarity measures (InnerProduct, fastHiCRep and Selfish) and embedding methods (MDS, t-SNE and PHATE).

https://doi.org/10.1371/journal.pcbi.1008978.s001

(ZIP)

S2 File. Three-dimensional scatter plots.

This zip file includes the 3D scatter plots of different embedding methods (MDS, t-SNE and PHATE) applied to Nagano single-cell dataset.

https://doi.org/10.1371/journal.pcbi.1008978.s002

(ZIP)

S3 File. Interactive plots.

This file includes examples of interactive 2D and 3D scatter plots of cells from Nagano et al.

https://doi.org/10.1371/journal.pcbi.1008978.s003

(ZIP)

S4 File. scHiCTools source code, documentation and test dataset.

This zip file is a clone of scHiCTools public Git repository. To install from this file rather than from PyPI, please follow installation instructions in the readme file.

https://doi.org/10.1371/journal.pcbi.1008978.s004

(ZIP)

S5 File. Run time details.

This PDF file includes the run time of similarity calculation methods and clustering methods.

https://doi.org/10.1371/journal.pcbi.1008978.s005

(PDF)

S6 File. Details of applying our toolbox to Flyamer et al. [1], Collombet et al. [19] and Ramani et al. [3] datasets.

This PDF file includes the number of cells and the contacts numbers of Flyamer et al., Collombet et al. and Ramani et al. datasets, and their scatter plots.

https://doi.org/10.1371/journal.pcbi.1008978.s006

(PDF)

References

  1. 1. Flyamer IM, Gassler J, Imakaev M, Brandão HB, Ulianov SV, Abdennur N, et al. Single-nucleus Hi-C reveals unique chromatin reorganization at oocyte-to-zygote transition. Nature. 2017;544(7648):110. pmid:28355183
  2. 2. Nagano T, Lubling Y, Várnai C, Dudley C, Leung W, Baran Y, et al. Cell-cycle dynamics of chromosomal organization at single-cell resolution. Nature. 2017;547(7661):61. pmid:28682332
  3. 3. Ramani V, Deng X, Qiu R, Gunderson KL, Steemers FJ, Disteche CM, et al. Massively multiplex single-cell Hi-C. Nature Methods. 2017;14(3):263–266. pmid:28135255
  4. 4. Liu J, Lin D, Yardımcı GG, Noble WS. Unsupervised embedding of single-cell Hi-C data. Bioinformatics. 2018;34(13):i96–i104. pmid:29950005
  5. 5. Yang T, Zhang F, Ymcı GG, Song F, Hardison RC, Noble WS, et al. HiCRep: assessing the reproducibility of Hi-C data using a stratum-adjusted correlation coefficient. Genome Research. 2017;27(11):1939–1949. pmid:28855260
  6. 6. Ursu O, Boley N, Taranova M, Wang YR, Yardimci GG, Stafford Noble W, et al. GenomeDISCO: A concordance score for chromosome conformation capture experiments using random walks on contact map graphs. Bioinformatics. 2018;34(16):2701–2707. pmid:29554289
  7. 7. Yan KK, Ymcı GG, Noble WS, Gerstein M. HiC-Spector: A matrix library for spectral analysis and reproducibility of Hi-C contact maps. Bioinformatics. 2017;33(14):2199–2201. pmid:28369339
  8. 8. Sauria MEG, Taylor J. QuASAR: Quality Assessment of Spatial Arrangement Reproducibility in Hi-C Data. bioRxiv. 2017;.
  9. 9. Yardımcı GG, Ozadam H, Sauria ME, Ursu O, Yan KK, Yang T, et al. Measuring the reproducibility and quality of Hi-C data. Genome biology. 2019;20(1):57. pmid:30890172
  10. 10. Ardakany AR, Ay F, Lonardi S. Selfish: discovery of differential chromatin interactions via a self-similarity measure. Bioinformatics. 2019;35(14):i145–i153. pmid:31510653
  11. 11. Wang B, Pourshafeie A, Zitnik M, Zhu J, Bustamante CD, Batzoglou S, et al. Network enhancement as a general method to denoise weighted biological networks. Nature communications. 2018;9(1):3108. pmid:30082777
  12. 12. Cowen L, Ideker T, Raphael BJ, Sharan R. Network propagation: a universal amplifier of genetic associations. Nature Reviews Genetics. 2017;18(9):551–562. pmid:28607512
  13. 13. Knight PA, Ruiz D. A fast algorithm for matrix balancing. IMA Journal of Numerical Analysis. 2013;33(3):1029–1047.
  14. 14. Maaten Lvd, Hinton G. Visualizing Data using t-SNE. Journal of Machine Learning Research. 2008;9(Nov):2579–2605.
  15. 15. Moon KR, van Dijk D, Wang Z, Gigante S, Burkhardt DB, Chen WS, et al. Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology. 2019;37(12):1482–1492. pmid:31796933
  16. 16. Arthur D, Vassilvitskii S. k-means++: The Advantages of Careful Seeding. Stanford InfoLab; 2006. 2006–13.
  17. 17. von Luxburg U. A tutorial on spectral clustering. Statistics and Computing. 2007;17(4):395–416.
  18. 18. Zhou J, Ma J, Chen Y, Cheng C, Bao B, Peng J, et al. Robust single-cell Hi-C clustering by convolution- and random-walk–based imputation. Proceedings of the National Academy of Sciences. 2019;116(28):14011–14018. pmid:31235599
  19. 19. Collombet S, Ranisavljevic N, Nagano T, Varnai C, Shisode T, Leung W, et al. Parental-to-embryo switch of chromosome organization in early embryogenesis. Nature. 2020;580(7801):142–146. pmid:32238933
  20. 20. Strehl A, Ghosh J. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research. 2003;3(null):583–617.
  21. 21. Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2(1):193–218.