clusterExperiment and RSEC: A Bioconductor package and framework for clustering of single-cell and other large gene expression datasets

doi:10.1371/journal.pcbi.1006378

Fig 1.

Main steps of RSEC workflow.

(a) shows a diagram of the steps of the workflow, while (b)-(d) demonstrate these steps on the olfactory epithelium dataset. (b) The clusterMany step produces many clusterings from the different combinations of algorithms and tuning parameters. These clusterings are displayed using the plotClusters function. Each column of the plot corresponds to a sample and each row to a clustering from the clusterMany step. The samples in each row are color-coded by their cluster assignment in that clustering; samples that are not assigned to a cluster are left white. The colors across different clusterings (rows) are assigned so that clusters containing similar samples have similar colors across clusterings. The consensus clustering obtained from the makeConsensus step is also shown below the individual clusterings. (c) The makeConsensus step finds a consensus clustering across the clusterMany clusterings based on the co-occurrence of samples in these clusterings. The heatmap of the matrix of co-occurrence proportions is plotted using the plotCoClustering function. The resulting cluster assignments from makeConsensus are color-coded above the matrix, as are the assignments from the next step, mergeClusters. (d) The makeDendrogram step creates a hierarchy between the consensus clusters, and similar clusters in sister nodes are then merged with mergeClusters. Plotted here with the function plotDendrogram is the hierarchy of the clusters from makeDendrogram, with merged nodes indicated with dashed lines. The makeConsensus clusters and resulting mergeClusters clusters are indicated as color-coded blocks below the dendrogram, sized according to the number of samples in each cluster. The estimated proportions of differentially expressed (DE) genes at each node are shown in S1 Fig.
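To make the co-occurrence matrix in panel (c) concrete, the following is a minimal Python sketch, not the clusterExperiment implementation. It assumes that unassigned samples are coded as -1 and that the proportion for a pair of samples is taken over only those clusterings in which both samples are assigned; the function name `co_occurrence` and the toy labels are hypothetical.

```python
def co_occurrence(clusterings, unassigned=-1):
    """Proportion of clusterings in which each pair of samples shares a cluster.

    clusterings: list of label lists, one list per clustering, all of length
    n_samples. Labels are only compared within a clustering, so label names
    need not match across clusterings.
    Assumption: a clustering is skipped for a pair if either sample is
    unassigned in it.
    """
    n = len(clusterings[0])
    prop = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            together = 0
            both_assigned = 0
            for labels in clusterings:
                if labels[i] == unassigned or labels[j] == unassigned:
                    continue
                both_assigned += 1
                if labels[i] == labels[j]:
                    together += 1
            prop[i][j] = together / both_assigned if both_assigned else 0.0
    return prop


# Three toy clusterings of five samples; -1 marks an unassigned sample.
clusterings = [
    [1, 1, 2, 2, -1],
    [1, 1, 1, 2, 2],
    [3, 3, 4, 4, 4],
]
prop = co_occurrence(clusterings)
```

In this toy input, samples 0 and 1 are clustered together in every clustering, so their proportion is 1, while samples 0 and 2 share a cluster in only one of three clusterings. A consensus step could then group samples whose pairwise proportions are high.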


Fig 2.

Biomarker detection, demonstrated on the olfactory epithelium dataset.

Heatmap from the plotHeatmap function showing genes found differentially expressed (DE) between clusters by the function getBestFeatures, using both the global F-statistic (a) and hierarchical contrasts (b) options. Each of the contrasts in (b) corresponds to nodes in the dendrogram color-coded as in Fig 1d; we retained only the top 50 DE genes per node. Genes found DE in multiple contrasts may be plotted multiple times. For comparison purposes, in (a), we retained the top 256 DE genes according to a global F-statistic, where 256 is the number of unique genes in the hierarchical contrasts shown in (b).


Table 1.

Parameters varied when applying RSEC in the analysis of the olfactory epithelium and hypothalamus datasets.

See section clusterMany in S1 Text for a complete list of arguments that can be varied in the RSEC workflow.


Fig 3.

Comparison of methods and tuning parameter choices using clusterMany and plotClusters, demonstrated on the olfactory epithelium dataset.

The figure provides examples of using clusterExperiment to compare clustering methods and tuning parameter choices, with the function clusterMany implementing the clustering procedures and the function plotClusters visualizing the results. (a) shows the clustering results after running PAM with different choices of K, the number of clusters. (b) shows the clustering results for different between-sample distance measures. ‘Euclidean’ refers to the standard Euclidean distance; ‘Pearson Corr.’ and ‘Spearman’s Rho’ refer to a correlation-based distance, d(i, j) = (1 − ρ(i, j))/2, where ρ(i, j) is either the standard Pearson correlation coefficient or the robust Spearman rank correlation coefficient between samples i and j, respectively. (c) shows the clustering results for different choices of clustering algorithms. Each method is shown with the “best” choice of K, as determined by the maximum average silhouette width; “NN” refers to a user-defined, nearest-neighbor clustering (see Section Data used in the Manuscript in S1 Text). Also shown is the result of applying the consensus and merging steps of the RSEC workflow to this set of clusterings. The clusterings in (a) and (c) were run with the top 50 PCA dimensions as input. The clusterings in (b) compare different between-sample distance measures and were therefore run directly on the gene expression measures after filtering to the 1,000 most variable genes, as determined by the median absolute deviation (MAD), a robust measure of variability.
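The correlation-based distance and the MAD filter described for panel (b) can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the package's code: each sample is a list of expression values over the same genes, the MAD is left unscaled, and all function names (`pearson`, `spearman`, `corr_distance`, `mad`, `top_mad_genes`) are hypothetical helpers.

```python
from math import sqrt
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(x):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(x)), key=lambda k: x[k])
    r = [0.0] * len(x)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and x[order[end + 1]] == x[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            r[order[k]] = avg
        pos = end + 1
    return r


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def corr_distance(x, y, rho=pearson):
    """Distance d = (1 - rho) / 2, which lies in [0, 1]."""
    return 0.5 * (1.0 - rho(x, y))


def mad(x):
    """Unscaled median absolute deviation from the median."""
    m = median(x)
    return median(abs(v - m) for v in x)


def top_mad_genes(expr_rows, n):
    """Indices of the n rows (genes) with the largest MAD across samples."""
    scores = [mad(row) for row in expr_rows]
    order = sorted(range(len(expr_rows)), key=lambda g: scores[g], reverse=True)
    return order[:n]


# Toy samples measured over four genes.
sample_i = [1.0, 2.0, 3.0, 4.0]
sample_j = [2.0, 4.0, 6.0, 8.0]    # perfectly correlated with sample_i
sample_k = [4.0, 3.0, 2.0, 1.0]    # perfectly anti-correlated with sample_i
sample_m = [1.0, 8.0, 27.0, 64.0]  # monotone but nonlinear in sample_i

d_ij = corr_distance(sample_i, sample_j)
d_ik = corr_distance(sample_i, sample_k)
d_im = corr_distance(sample_i, sample_m, rho=spearman)

# Toy gene-by-sample matrix: one row per gene.
genes = [[5, 5, 5, 5], [1, 2, 3, 10], [0, 10, 0, 10]]
keep = top_mad_genes(genes, 2)
```

With this definition, perfectly correlated samples are at distance 0 and perfectly anti-correlated samples at distance 1. The Spearman variant depends only on ranks, so the monotone but nonlinear `sample_m` is also at distance 0 from `sample_i`.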


Table 2.

Computational costs: For each of the above runs, we give the total number of hours, total CPU time, and maximum memory usage to run the RSEC workflow when parallelized across 15 cores on an AMD Opteron(TM) Processor 6272 node with 270GB of RAM.

The olfactory epithelium analysis consisted of 432 clusterings, while the hypothalamus analysis consisted of only 14.
