Dr.seq2: A quality control and analysis pipeline for parallel single cell transcriptome and epigenome data

An increasing number of single cell transcriptome and epigenome technologies, including single cell ATAC-seq (scATAC-seq), have recently been developed as powerful tools to analyze the features of many individual cells simultaneously. However, existing methods and software were each designed for a single data type, and only for single cell transcriptome data. A systematic approach for epigenome data and multiple types of transcriptome data is needed to control data quality and to perform cell-to-cell heterogeneity analysis on these ultra-high-dimensional transcriptome and epigenome datasets. Here we developed Dr.seq2, a Quality Control (QC) and analysis pipeline for multiple types of single cell transcriptome and epigenome data, including scATAC-seq and Drop-ChIP data. Application of this pipeline provides four groups of QC measurements and different analyses, including cell heterogeneity analysis. Dr.seq2 produced reliable results on published single cell transcriptome and epigenome datasets. Overall, Dr.seq2 is a systematic and comprehensive QC and analysis pipeline designed for parallel single cell transcriptome and epigenome data. Dr.seq2 is freely available at: http://www.tongji.edu.cn/~zhanglab/drseq2/ and https://github.com/ChengchenZhao/DrSeq2.


Data description
This section mainly describes the input file and the mapping and analysis parameters.

Bulk-cell level QC
In the bulk-cell level QC step, we measured the performance of the total scATAC reads. In this step we did not separate the reads by cell and instead treated the sample as a bulk ATAC-seq sample.

Reads alignment summary
The following table shows the number of reads after each filtering strategy and the number of mapped reads among the final selected reads. It measures the general sequencing quality. Low mappability indicates poor sequence quality (see "Reads level QC") or poor library quality (caused by contaminants). In summary, if the percentage of "total mapped reads" is less than 5%, users may consider reconstructing the library (redoing the experiment), but should first make sure that the adapters have been trimmed and that the reads were mapped to the corresponding species (genome version). Mappable reads are counted after Q30 filtering if the Q30 filter function was turned on.
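The 5% guideline above can be sketched as a simple check. This is a minimal illustration, not part of Dr.seq2: the function names and the read counts are hypothetical, and only the 5% threshold comes from the text.

```python
def mapped_fraction(total_reads, mapped_reads):
    """Return the fraction of reads that mapped (0.0 when there are no reads)."""
    if total_reads <= 0:
        return 0.0
    return mapped_reads / total_reads


def mapping_qc(total_reads, mapped_reads, min_fraction=0.05):
    """Flag a sample whose mapped fraction falls below the 5% guideline."""
    frac = mapped_fraction(total_reads, mapped_reads)
    return {"mapped_fraction": frac, "passes": frac >= min_fraction}


# Hypothetical read counts, for illustration only.
report = mapping_qc(total_reads=1_000_000, mapped_reads=32_000)
print(report)
```

A sample that fails this check is not necessarily a failed experiment; as noted above, untrimmed adapters or a mismatched genome version should be ruled out first.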

Chromosomal Distribution of Open Regions
The blue bars represent the percentages of the whole tiled or mappable regions in the chromosomes (genome background) and the red bars show the percentages of the whole open regions. These percentages are also marked right next to the bars. P-values for the significance of the relative enrichment of open regions with respect to the genome background are shown in parentheses next to the percentages of the red bars.
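One way to arrive at such percentages and p-values is sketched below. The report does not state which statistical test is used, so the one-sided binomial test here is an assumption, as is the use of raw chromosome length as the background (the report refers to tiled or mappable regions). The chromosome names, sizes, and peak counts are toy values.

```python
import math


def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log(1.0 - p)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def chromosome_enrichment(chrom_sizes, peak_counts):
    """Per chromosome: background %, open-region %, and enrichment p-value."""
    genome = sum(chrom_sizes.values())
    n_peaks = sum(peak_counts.values())
    rows = {}
    for chrom, size in chrom_sizes.items():
        background = size / genome  # assumed background: share of genome length
        observed = peak_counts.get(chrom, 0)
        rows[chrom] = {
            "background_pct": 100.0 * background,
            "peak_pct": 100.0 * observed / n_peaks if n_peaks else 0.0,
            "p_value": binom_sf(observed, n_peaks, background),
        }
    return rows


# Toy chromosomes and peak counts, for illustration only.
chrom_sizes = {"chrA": 600, "chrB": 400}
peak_counts = {"chrA": 30, "chrB": 70}
result = chromosome_enrichment(chrom_sizes, peak_counts)
for chrom, row in result.items():
    print(chrom, row)
```

In this toy input, chrB holds 40% of the background but 70% of the open regions, so its p-value is small, while chrA is depleted and its one-sided enrichment p-value is close to 1.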

Peaks over Chromosomes
The bar plot shows open regions distributed over the genome along with their scores or peak heights. The line graph in the top left corner illustrates the distribution of peak heights (or scores). The red bars in the main plot represent open regions in the input BED file. The x-axis of the main plot represents the actual chromosome sizes.

Figure: Open Regions (Peaks) over Chromosomes (x-axis: chromosome size; inset: distribution of peak heights).

Distribution of fragment numbers (excluding fragments in mitochondria)
The fragment length distribution of ATAC-seq data is periodic, which reflects factor occupancy and nucleosome positions.
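A fragment length distribution of this kind could be tabulated as follows. This is a minimal sketch: the fragment tuples, the bin size, and the set of mitochondrial chromosome names are assumptions made for illustration, not Dr.seq2 settings.

```python
from collections import Counter

# Assumed naming conventions for the mitochondrial chromosome.
MITO_NAMES = {"chrM", "MT", "chrMT"}


def fragment_length_histogram(fragments, bin_size=50):
    """Count non-mitochondrial fragments per length bin (keyed by bin start)."""
    hist = Counter()
    for chrom, start, end in fragments:
        if chrom in MITO_NAMES:
            continue  # mitochondrial fragments are excluded
        length = end - start
        if length <= 0:
            continue  # skip malformed records
        hist[(length // bin_size) * bin_size] += 1
    return dict(sorted(hist.items()))


# Toy fragments as (chromosome, start, end), for illustration only.
fragments = [
    ("chr1", 100, 180),
    ("chr1", 500, 700),
    ("chr2", 50, 240),
    ("chrM", 10, 90),
    ("chr2", 1000, 1390),
]
hist = fragment_length_histogram(fragments)
print(hist)
```

On real data, plotting these counts against fragment length is what makes the nucleosome-related periodicity visible.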

Distribution of peak numbers per cell
To measure whether a cell is informative for downstream analysis, the number of peaks in each cell is calculated. Cells with a small number of peaks carry limited information.
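Counting peaks per cell can be sketched as below. The (cell, peak) pair format, the function names, and the `min_peaks` cutoff are hypothetical; the report does not specify a threshold.

```python
from collections import defaultdict


def peaks_per_cell(cell_peak_pairs):
    """Count the distinct peaks observed in each cell."""
    seen = defaultdict(set)
    for cell, peak in cell_peak_pairs:
        seen[cell].add(peak)
    return {cell: len(peaks) for cell, peaks in seen.items()}


def low_information_cells(counts, min_peaks):
    """Return cells with fewer than min_peaks peaks (hypothetical cutoff)."""
    return sorted(cell for cell, n in counts.items() if n < min_peaks)


# Toy (cell barcode, peak id) pairs, for illustration only.
pairs = [
    ("cellA", "peak1"), ("cellA", "peak2"), ("cellA", "peak2"),
    ("cellB", "peak1"),
    ("cellC", "peak1"), ("cellC", "peak2"), ("cellC", "peak3"),
]
counts = peaks_per_cell(pairs)
flagged = low_information_cells(counts, min_peaks=2)
print(counts, flagged)
```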

Cell clustering
We performed hierarchical clustering (h-cluster) based on macs14 peaks to measure how well the sample can be separated into different cell subtypes.
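The idea of hierarchical clustering on per-cell peak sets can be illustrated with a small pure-Python sketch. The distance measure (Jaccard) and the linkage (average) are assumptions chosen for the example; the report does not state which ones Dr.seq2 uses, and the cells and peak names are toy data.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of peaks (0.0 if both are empty)."""
    union = len(a | b)
    return 1.0 - len(a & b) / union if union else 0.0


def average_linkage(c1, c2, cells):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(jaccard_distance(cells[i], cells[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def hierarchical_clusters(cells, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[name] for name in sorted(cells)]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_linkage(clusters[i], clusters[j], cells)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Toy cells, each described by the set of peaks it covers.
cells = {
    "cell1": {"p1", "p2", "p3"},
    "cell2": {"p1", "p2"},
    "cell3": {"p7", "p8", "p9"},
    "cell4": {"p8", "p9"},
}
clusters = hierarchical_clusters(cells, n_clusters=2)
print(clusters)
```

This naive loop recomputes all pairwise linkages at every merge, which is fine for a handful of cells but would be replaced by a library implementation for real datasets.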

Silhouette of clustering
The Silhouette method is used to interpret and validate the consistency within the clusters defined in previous steps. A poor Silhouette score (e.g. average si < 0.2) indicates that the experiment (if not properly done) may not separate the subpopulations of cells well. If most of your clusters have poor Silhouette scores, it may indicate poor quality of your experiment.
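For reference, the silhouette value of a point is s(i) = (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster. The sketch below computes it on toy one-dimensional data; the points, labels, and distance function are assumptions for illustration, and only the 0.2 guideline comes from the text.

```python
def silhouette_values(points, labels, distance):
    """Silhouette value s(i) = (b - a) / max(a, b) for every point."""
    if len(set(labels)) < 2:
        raise ValueError("silhouette needs at least two clusters")
    n = len(points)
    values = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            values.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(distance(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(distance(points[i], points[j]) for j in range(n) if labels[j] == other)
            / sum(1 for j in range(n) if labels[j] == other)
            for other in set(labels)
            if other != labels[i]
        )
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values


# Toy 1-D points in two well-separated clusters, for illustration only.
points = [0.0, 1.0, 10.0, 11.0]
labels = [0, 0, 1, 1]
values = silhouette_values(points, labels, lambda x, y: abs(x - y))
average = sum(values) / len(values)
is_poor = average < 0.2  # guideline from the text above
print(values, average, is_poor)
```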

Clustering heatmap
Cell clustering tree and peak regions in each cell. The upper panel represents the hierarchical clustering results based on each single cell. The second panel, with different colors, represents the cell clustering decision. The bottom two panels (heatmap and color bar) represent the "combined peaks" occupancy of each single cell.

Ideogram
Cluster-specific regions are shown on each chromosome.

Output list
All output files are described in the following table.