Dr.seq2: A quality control and analysis pipeline for parallel single cell transcriptome and epigenome data

doi:10.1371/journal.pone.0180583

Table 1.

Metadata and accession IDs for the scATAC-seq data used in simulation for pipeline tolerance evaluation.

Table 2.

Comparison of functions between Dr.seq2 and other software developed for single cell transcriptome data.

Fig 1.

Flowchart illustrating the Dr.seq2 pipeline with default parameters.

The workflow of the Dr.seq2 pipeline includes QC and analysis components for parallel single cell transcriptome and epigenome data. The QC component covers read-level, bulk-cell-level, individual-cell-level and cell-clustering-level QC.

Fig 2.

Dimensional reduction results for different single cell transcriptome data types.

(A-I) Cell clustering results using dimensional reduction methods (PCA, t-SNE and SIMLR) on different types of single cell transcriptome data (Drop-seq, 10x Genomics and MARS-seq).

Fig 3.

Bulk-cell level QC for scATAC-seq datasets.

A) Distribution of peak regions across chromosomes. The blue bars represent the percentages of the whole tiled or mappable regions in the chromosomes (genome background) and the red bars show the percentages of the open regions. These percentages are also marked next to the bars. P-values for the significance of the relative enrichment of open regions with respect to the genome background are shown in parentheses next to the percentages of the red bars. B) Open region distribution over the genome along with their scores or peak heights. The line graph in the top left corner illustrates the distribution of peak scores. The x-axis of the main plot represents the actual chromosome sizes. C) Average signal profiles on different genomic features. The panels in the first row display the average enrichment signals around the TSS and TTS of genes, respectively. The bottom panel represents the average signal on the meta-gene of 5 kb. D) The red line shows the distribution of fragment lengths.

Fig 4.

Cell-clustering level QC and single-cell level QC for scATAC-seq data.

A) The upper panel shows cell-clustering results for combined scATAC-seq samples generated from 3 different cell types. The bottom panel shows the corresponding cell type label of each cell, marked by different colors (red stands for H1 cells, yellow stands for GM12878 cells and blue stands for K562 cells). The clustering step of Dr.seq2 clearly separated the scATAC-seq samples from the three cell types into different groups that were consistent with the cell type labels. B) Distribution of peak number for each single cell. C) Cell clustering tree and peak regions in each cell. The upper panel represents the hierarchical clustering results based on each single cell. The second panel, with different colors, represents the cluster assignment of each cell. The bottom two panels (heatmap and color bar) represent the “combined peaks” occupancy of each single cell. D) The bar plot shows the silhouette score of each cluster. The silhouette method is used to interpret and validate the consistency within the clusters defined in previous steps. E) Cluster-specific regions in each chromosome. Specific regions for different cell clusters are marked by different colors and ordered according to genomic loci.
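
To make the per-cluster silhouette score in panel D concrete, here is a minimal sketch of the standard silhouette calculation in plain Python. It is not taken from the Dr.seq2 source: the function name `silhouette_by_cluster`, the use of Euclidean distance, and the toy points are all assumptions made for illustration, and the pipeline may compute distances on peak occupancy differently.

```python
from collections import defaultdict
from math import dist


def silhouette_by_cluster(points, labels):
    """Return the mean silhouette score of each cluster.

    Hypothetical helper for illustration; needs at least two clusters.
    Distances are Euclidean, which is an assumption, not the paper's choice.
    """
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette scores need at least two clusters")

    def mean_distance(i, members):
        # Average distance from point i to the given members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(dist(points[i], points[j]) for j in others) / len(others)

    scores = defaultdict(list)
    for i, label in enumerate(labels):
        if len(clusters[label]) == 1:
            scores[label].append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_distance(i, clusters[label])  # cohesion within own cluster
        b = min(  # separation from the nearest other cluster
            mean_distance(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores[label].append((b - a) / denom if denom > 0 else 0.0)

    return {label: sum(vals) / len(vals) for label, vals in scores.items()}


# Toy data: two tight, well-separated groups (made-up coordinates).
toy_points = [(0, 0), (0, 1), (10, 10), (10, 11)]
toy_labels = ["A", "A", "B", "B"]
print(silhouette_by_cluster(toy_points, toy_labels))
```

Scores close to 1 indicate cells that sit well inside their own cluster, while scores near 0 or below suggest that a cluster overlaps with its neighbor.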

Fig 5.

Cell clustering stability on simulated scATAC-seq data.

A) Clustering stability of Dr.seq2 on simulated data with different numbers of reads per cell. The lambda index (y-axis) is plotted as a function of the number of reads per cell (x-axis). Error bars represent 95% confidence intervals calculated from 20 simulations. B) Clustering stability of Dr.seq2 on simulated data with different cell proportion depths. The lambda index (y-axis) is plotted as a function of the target cell number (x-axis). Error bars represent 95% confidence intervals calculated from 20 simulations.
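
The error bars described above can be illustrated with a short sketch that turns repeated simulation results into a 95% confidence interval for the mean. The caption does not state how the intervals were computed, so this is only one plausible approach: it assumes a normal approximation, the function name `mean_confidence_interval` is hypothetical, and the index values are invented placeholders rather than results from the paper.

```python
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, level=0.95):
    """Return (low, high) bounds of a confidence interval for the mean.

    Hypothetical helper using a normal approximation (an assumption).
    With only 20 simulations, a Student's t critical value (about 2.09
    for 19 degrees of freedom) would give a slightly wider interval.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to estimate spread")
    center = mean(values)
    sem = stdev(values) / n ** 0.5  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return center - z * sem, center + z * sem


# Invented index values for 20 simulated runs at one read depth.
simulated_index = [0.80, 0.90] * 10
low, high = mean_confidence_interval(simulated_index)
print(f"mean = {mean(simulated_index):.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

Repeating this calculation at each read depth or target cell number would give one error bar per x-axis position.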

Table 3.

Running time of each QC and analysis step for scATAC datasets.
