PanoView: An iterative clustering method for single-cell RNA sequencing data

Single-cell RNA-sequencing (scRNA-seq) provides new opportunities to gain a mechanistic understanding of many biological processes. Current approaches for single-cell clustering are often sensitive to the input parameters and have difficulty dealing with cell types with different densities. Here, we present Panoramic View (PanoView), an iterative method integrated with a novel density-based clustering algorithm, Ordering Local Maximum by Convex hull (OLMC), that uses a heuristic approach to estimate the required parameters based on the input data structures. In each iteration, PanoView identifies the most confident cell clusters and repeats the clustering with the remaining cells in a new PCA space. Without adjusting any parameter in PanoView, we demonstrated that PanoView was able to detect major and rare cell types simultaneously and outperformed other existing methods in both simulated datasets and published single-cell RNA-sequencing datasets. Finally, we conducted scRNA-seq analysis of embryonic mouse hypothalamus, and PanoView was able to reveal known cell types and several rare cell subpopulations.

Author summary
One of the important tasks in analyzing single-cell transcriptomics data is to classify cell subpopulations. Most computational methods require users to input parameters, and sometimes the proper parameters are not intuitive to users. Hence, a robust but easy-to-use method is of great interest. We propose the PanoView algorithm, which uses an iterative approach to search for cell clusters in an evolving three-dimensional PCA space. The goal is to identify the cell cluster with the most confidence in each iteration and repeat the clustering algorithm with the remaining cells in a new PCA space. To cluster cells in a given PCA space, we also developed OLMC clustering to deal with clusters with varying densities.
We examined the performance of PanoView in comparison to other existing methods using ten published single-cell datasets and simulated datasets as the ground truth. The results showed that PanoView is an easy-to-use and reliable tool and can be applied to diverse types of single-cell RNA-sequencing datasets.


Introduction
Single-cell RNA-sequencing (scRNA-seq) has attracted great attention in recent years. Unlike traditional bulk RNA-seq analysis, scRNA-seq provides access to cell-to-cell variability at the single-cell level. This allows individual cell types and subtypes to be defined within a population containing multiple types of cells, and also makes it possible to follow how individual cell types change over time or after exposure to various perturbations (1-4).

Classifying single cells based on their expression profile similarity is the basis for scRNA-seq analysis. A variety of clustering approaches have been developed and applied to scRNA-seq analysis, such as hierarchical clustering (5-7), K-means clustering (8-11), SNN-Cliq (12), pcaReduce (13), SC3 (14), Seurat (3,15), SCANPY (16), RCA (17), and dropClust (18). There are also algorithms, such as RaceID/RaceID2 (4,19) and GiniClust (20), that were developed specifically to identify rare cell types. Nevertheless, one challenge is that clustering results are often highly sensitive to input parameters, and sometimes the required parameters are not intuitive to users (S1 Table). For example, DBSCAN (21) is a clustering algorithm that requires two parameters to classify clusters based on the densities of subpopulations, and it has been applied in some scRNA-seq studies (3,22,23). However, it is difficult for users to pick proper required parameters without the aid of other

To address these issues, we have developed Panoramic View (PanoView), which utilizes an iterative approach that searches for cell types in an evolving principal component analysis (PCA) space. The strategy is to identify the cell cluster with the most confidence in each iteration and repeat the clustering algorithm with the remaining cells in a new PCA space (Fig 1A). We define the most confident cluster as the "mature" subpopulation that has the lowest variance in the current PCA space.
To cluster cells in a given PCA space, we have developed a novel density-based algorithm, namely Ordering Local Maximum by Convex hull (OLMC) (Fig 1B-D), that uses a heuristic approach to estimate the required parameters based on the input data structures (see Methods).

To evaluate the performance of PanoView, we first tested 1,200 simulated datasets with varying configuration parameters (e.g. numbers of clusters and standard deviation of the members within clusters). The performance of the clustering was evaluated using the Adjusted Rand Index (ARI), which measures the similarity between the cell membership produced by a chosen method and the ground truth (24).

We compared the performance of PanoView with 9 existing methods, including pcaReduce (13), SC3 (14), Seurat (15), SCANPY (16), RCA (17), K-means without prior dimensional reduction, PCA followed by DBSCAN, PCA followed by K-means, and TSNE followed by K-means. The results showed that PanoView and SCANPY outperformed other benchmarking methods in all datasets tested using default parameters. Although we input the correct number of clusters for K-means and pcaReduce, their performance decreased in the datasets with a large number of clusters (K-means, TSNE+Km, PCA+Km, pcaReduce in Fig 2A). For DBSCAN, we tuned the required parameters until they reached optimal performance in datasets with n=3 and 4 (PCA+DB in Fig 2A). However, its performance dropped significantly when n > 10. We also observed a similar outcome in Seurat, whose performance dropped dramatically for n > 17. It is worth noting that these methods could achieve much better performance if we tuned the parameters for each dataset. In this study, we only

Results of published scRNA-seq datasets
We applied PanoView to 11 published scRNA-seq datasets, ranging in size from 90 cells to 20,921 cells (S2 Table).
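For reference, the ARI used in these comparisons can be computed from the contingency counts of two labelings. The following is a minimal pure-Python sketch, assuming the standard pair-counting (Hubert-Arabie) form of the index; the function and variable names are illustrative and are not taken from the PanoView code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same cells (pair-counting form)."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:  # fewer than two cells: nothing to compare
        return 1.0

    # Number of cells for each (true label, predicted label) combination.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Pairs of cells placed together in both, in the truth, and in the prediction.
    together_both = sum(comb(c, 2) for c in pair_counts.values())
    together_true = sum(comb(c, 2) for c in true_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_true * together_pred / total_pairs
    maximum = (together_true + together_pred) / 2
    if maximum == expected:  # degenerate case, e.g. both labelings are one cluster
        return 1.0
    return (together_both - expected) / (maximum - expected)
```

The index depends only on which cells are grouped together, so renaming cluster labels does not change it. A value of 1 means the two labelings agree exactly, and values near 0 indicate agreement no better than chance.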
We used the reported clustering results as the ground truth for the calculation of ARI, assuming that the authors optimized their analysis correctly with their expertise in the research topics. Based on the overall performance of eight tested methods, we divided them
of ARI values for all methods is provided in S3 Table.

Computational cost
We also examined the computational cost of PanoView in the real scRNA-seq datasets. As expected, data analysis takes longer when datasets contain more cells (Fig. 3). We also compared the computational cost with that of other methods that generated reasonable clustering results. PanoView is not the fastest algorithm: SCANPY, Seurat and RCA are faster. However, SC3 and pcaReduce are slower than PanoView, and they failed to generate clustering results for the largest dataset.

The 500 clustering results are provided in Table S4.

Results of detection of rare cell types
To evaluate the ability to identify rare cell types, we first applied PanoView to 260 simulated datasets and benchmarked it against Seurat, GiniClust, RaceID2, and SCANPY. GiniClust and RaceID2 are two single-cell methods that were specifically designed for detecting rare cell types. We used recovery rate and false positive rate to evaluate the performance of detecting rare cell types (table in Fig 5). PanoView had the best performance that it correctly recovered the rare cell

In addition to simulated datasets, we also used the Patel dataset to examine the performance of detecting rare cells (Fig. S4). The GiniClust study reported that it successfully detected one rare cell type in this dataset (20), which consists of 9 cells in glioblastoma tumors. These cells were also identified in the original study, where they showed high expression of oligodendrocyte genes (6).
In our result (Fig S4), PanoView identified a cluster (cluster #2) that includes 7 cells, which correspond to the rare cells in the original study. SCANPY reported a cluster with 9 cells, among which 8 were the rare cells. SC3 identified a cluster with 10 cells, among which 8 were the rare cells. Seurat assigned the 9 rare cells to a major cluster, which has 88 cells in total. A similar outcome was observed for RCA and pcaReduce: both algorithms merged the rare cells into a major cluster.

performance once we fine-tune their parameters with the input from experienced experts. We believe that PanoView can offer reliable performance with moderate computational cost and can be applied to diverse types of scRNA-seq datasets. The clustering of single cells will automatically identify cell specificity. After the identification of cell types, we are also able to determine the marker genes that show specific expression in each cell type (e.g. Fig. 6B). We believe that the cell atlas and the corresponding marker genes will be a valuable resource to study various biological

where the global maximum density is and use convex hull to locate the next local maximum.

To illustrate OLMC, a toy model consisting of 500 random points is provided (Fig 1B-D). In
where the highest density is. The first convex hull (cyan in Fig 1D) is constructed from the points within the first bar (Fig 1C) of the distance histogram. After removing the points in the cyan convex hull, the next point with the highest density is where the number 23 is marked, and the second convex hull is constructed from the points in the first bar (in green) of the second histogram, which is calculated from the distribution of distances to the point marked 23, a local density maximum. Following the same procedure, the next local maximum (the point marked 22, in yellow) is located and the third convex hull is built. In the end, OLMC identifies the locations of three local maxima and assigns the rest of the points to the nearest local maximum.
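The peel-and-assign procedure in the toy example can be sketched in a few lines of pure Python. This is a simplified illustration under stated assumptions and not the PanoView implementation: the density of a point is assumed to be the number of remaining points within one bin width, the bin width is a fixed hypothetical parameter (`bin_width`) rather than the heuristically optimized value, and the stopping rule (`min_points`) is also an assumption, since it is not specified in the text above.

```python
import math


def convex_hull(points):
    """Andrew's monotone chain; returns 2D hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def olmc_sketch(points, bin_width=1.5, min_points=5):
    """Peel off density maxima one at a time, then assign every point to the nearest one."""
    remaining = list(points)
    peaks, hulls = [], []
    while len(remaining) >= min_points:
        # Assumed density estimate: neighbours within one bin width, among remaining points.
        density = [sum(math.dist(p, q) < bin_width for q in remaining) for p in remaining]
        best = max(range(len(remaining)), key=density.__getitem__)
        if density[best] < min_points:  # assumed stopping rule: peak has too little support
            break
        peak = remaining[best]
        # First bar of the distance histogram: points closer than one bin width to the peak.
        first_bar = [q for q in remaining if math.dist(peak, q) < bin_width]
        peaks.append(peak)
        hulls.append(convex_hull(first_bar))
        # Remove the points covered by this hull before searching for the next maximum.
        remaining = [q for q in remaining if math.dist(peak, q) >= bin_width]
    if not peaks:
        raise ValueError("no density peak found; try a larger bin_width or smaller min_points")
    # Final step: each point is labeled with the index of its nearest local maximum.
    labels = [min(range(len(peaks)), key=lambda i: math.dist(p, peaks[i])) for p in points]
    return peaks, hulls, labels
```

Points are expected as `(x, y)` tuples. With Euclidean distances, every point lying inside the convex hull of the first-bar points is itself within one bin width of the peak, so removing the first-bar points is equivalent to removing the points in the hull; the hull vertices are returned mainly for plotting, as in Fig 1D.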

In PanoView, the goal is to find as many clusters as possible during the iterations. Therefore, we adopted a heuristic approach to optimize the bin size that controls the histogram of distances to local maxima for constructing convex hulls. We generated a simulated dataset of 500 2D points
is the mean of a population of variances.

If there is a Gini smaller than the threshold of 0.05, PanoView will keep the cluster with the minimum variance (i.e. the "mature" cluster) and put the rest of the cells into the next iteration.

= 0.5, 1, 2). For each n, we generated 20 random configurations (i.e. datasets). In total, we generated 1,200 different random datasets.

For evaluating the ability to identify rare cell types, we followed the same procedure to generate simulated datasets. The number of clusters ranged from 3 to 15, and the standard deviation of each cluster was 1. In each dataset, we randomly picked one cluster and removed 90% of the cells from that cluster. This cluster was defined as the rare cell subpopulation. In other words, the size of the rare cluster is about 0.6% to 3% of the total population. We also varied the

scRNA-seq to classify dendritic and monocyte populations from human blood (37) (GSE94820). Zeisel used scRNA-seq to study the transcriptome of mouse somatosensory cortex S1 and

space. We also used scikit-learn's default setup for executing Kmeans (n_clusters = k, init = 'random') and TSNE (n_components = 2, random_state = 1, init = 'random', n_iter = 1000).
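The construction of the rare subpopulation in the simulated datasets (randomly pick one cluster and remove 90% of its cells) can be sketched as follows. This is an illustrative sketch only: the function name, the `keep_fraction` and `seed` parameters, and the choice to return indices of retained cells are assumptions and are not taken from the authors' simulation code.

```python
import random


def make_rare_cluster(labels, keep_fraction=0.1, seed=0):
    """Pick one cluster at random and keep only `keep_fraction` of its cells.

    Returns the indices of the cells that are retained and the label of the
    cluster that became the rare subpopulation.
    """
    rng = random.Random(seed)
    rare_label = rng.choice(sorted(set(labels)))
    members = [i for i, lab in enumerate(labels) if lab == rare_label]
    # Removing 90% of the cells corresponds to keeping 10% (at least one cell).
    n_keep = max(1, round(len(members) * keep_fraction))
    kept = set(rng.sample(members, n_keep))
    keep_idx = [i for i, lab in enumerate(labels) if lab != rare_label or i in kept]
    return keep_idx, rare_label
```

For example, with five clusters of 100 cells each, the retained dataset has 410 cells, 10 of which belong to the rare cluster. The resulting index list can be used to subset both the expression matrix and the ground-truth labels.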

For benchmarking RaceID2 in our simulated datasets, we used the default setup from the manual and did not pass the step of findoutliers. Therefore, we used @cluster$kpart as the final clustering result. For benchmarking GiniClust in our simulated datasets, we used the default parameters from the manual except for Gini.pvalue_cutoff. We adjusted it from 0.0001 to 0.005 because the default value of 0.0001 did not produce usable clustering results.

Evaluation of performance in detecting rare cell types
We used recovery rate and false positive rate to evaluate the performance of clustering methods in detecting rare cell types. Each simulated dataset contains one rare cell cluster and n (n=2 to 14) major cell clusters. If the rare cell cluster was perfectly detected with the correct number of cells within the cluster, we considered that the algorithm recovered the rare cell type. On the other hand, if cells from a major cluster were grouped into multiple clusters and at least one of the sub-clusters had a size less than 10% of the major cluster, we considered that the algorithm generated a false positive rare cell type.

The complete user manual is provided at the GitHub repository.
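The recovery and false-positive criteria described in this section can be expressed as a small check on one dataset. The sketch below is one possible reading of those two rules and not the authors' evaluation script: the function name and the `small_fraction` parameter are hypothetical, and a "sub-cluster" is interpreted here as the portion of a major cluster's cells that falls into a single predicted cluster.

```python
from collections import Counter, defaultdict


def evaluate_rare_detection(true_labels, pred_labels, rare_label, small_fraction=0.10):
    """Return (recovered, false_positive) for one dataset."""
    # Cells grouped by predicted cluster.
    members = defaultdict(set)
    for i, pred in enumerate(pred_labels):
        members[pred].add(i)

    # Recovered: some predicted cluster contains exactly the rare cells, no more, no fewer.
    rare_cells = {i for i, t in enumerate(true_labels) if t == rare_label}
    recovered = any(cells == rare_cells for cells in members.values())

    # False positive: a major cluster is split and one piece is below 10% of its size.
    false_positive = False
    for major in set(true_labels) - {rare_label}:
        idx = [i for i, t in enumerate(true_labels) if t == major]
        pieces = Counter(pred_labels[i] for i in idx)
        if len(pieces) > 1 and min(pieces.values()) < small_fraction * len(idx):
            false_positive = True
    return recovered, false_positive
```

Under this reading, the recovery rate and false positive rate of a method are the fractions of simulated datasets for which the first and second returned values are true, respectively.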