Scedar: A scalable Python package for single-cell RNA-seq exploratory data analysis.

In single-cell RNA-seq (scRNA-seq) experiments, the number of individual cells has increased exponentially, and the sequencing depth of each cell has decreased significantly. As a result, analyzing scRNA-seq data requires extensive considerations of program efficiency and method selection. In order to reduce the complexity of scRNA-seq data analysis, we present scedar, a scalable Python package for scRNA-seq exploratory data analysis. The package provides a convenient and reliable interface for performing visualization, imputation of gene dropouts, detection of rare transcriptomic profiles, and clustering on large-scale scRNA-seq datasets. The analytical methods are efficient, and they also do not assume that the data follow certain statistical distributions. The package is extensible and modular, which would facilitate the further development of functionalities for future requirements with the open-source development community. The scedar package is distributed under the terms of the MIT license at https://pypi.org/project/scedar.


Introduction
Cost-effective large-scale transcriptomic profiling of individual cells is enabled by the development of microfluidic, nanodroplet, and massively parallel sequencing technologies. Using these technologies, single-cell RNA-seq (scRNA-seq) experiments usually generate transcriptomic profiles of thousands to millions of individual cells [1]. Therefore, scRNA-seq has become more commonly used to either study specific biological questions or comprehensively profile certain tissues or organisms [2][3][4].
Analyses of scRNA-seq datasets therefore require efficient computational programs and sophisticated statistical methods. The programs should be able to manage memory efficiently, exploit multiple cores of the processing units, and handle errors and exceptions gracefully. The statistical methods must be able to function against high dimensionality, low signal-to-noise ratio, and different characteristics of data generated from different technologies and protocols  [5][6][7]. Such requirements can become a barrier between experimental design and biological interpretations of the results. In order to be scalable, methods have been designed to minimize the usage of hardware resources, so that a large-scale scRNA-seq dataset can be analyzed using a desktop computer, such as Seurat v3.0 [8] and Scanpy [9]. Seurat is an R package providing visualization and robust statistical methods to explore and interpret the heterogeneity of the dataset. Scanpy is a Python package providing efficient reimplementations of pre-existing statistical methods and analytical workflows, which can be used to perform general exploratory data analysis and special inferences.
Here, we attempt to achieve the scalability of scRNA-seq data analysis through an alternative approach, which is to enable the user to exploit powerful analytical methods using modern high-performance computing architectures [10][11][12], such as servers and clusters with large amount of memory, multiple central and graphical processing unit cores, and solid-state drives (SSD) with higher read and write speed than traditional hard disk drive (HDD). Such computational resources have been made easily accessible by cloud computing services like Amazon Web Services, Google Cloud, and Microsoft Azure. Applying this approach, we designed the analytical methods to be able to run in parallel processes with minimized memory overhead.
Additionally, we also aim to develop robust exploratory data analysis (EDA) methods, so that they could be applied to various datasets generated by different experimental designs, technologies, and protocols. Single-cell RNA-seq datasets have distinct statistical characteristics if generated from different scRNA-seq technologies and platforms [6,7,13], such as SMAR-Ter [14], Drop-seq [15] and 10x Genomics [16]. In order to avoid assumptions on the statistical distributions of the datasets, we incorporated efficient implementations of machine learning methods into the data exploration process. Through extensive exploration, the statistical properties of the data could be observed and used to guide the selection of appropriate methods for biological interpretation [17].
Therefore, we developed a scalable and reliable Python package, single-cell exploratory data analysis for RNA-seq (scedar), to facilitate the exploration of large-scale scRNA-seq datasets. Scedar provides analytical routines for visualization, gene dropout imputation, rare transcriptomic profile detection, clustering, and identification of cluster separating genes. The visualization methods are integrated with the efficient scRNA-seq data structures to provide intuitive, convenient, and flexible plotting interfaces. We implemented methods to impute gene dropouts (Algorithm A in S1 Text) and detect rare transcriptomic profiles (Algorithm B in S1 Text) based on the k-nearest neighbor (KNN) algorithm. The detected rare transcriptomic profiles could be compared with their nearest neighbors in detail to identify rare cell states or types. For clustering analysis, we provide a novel cell clustering algorithm: Minimum Description Length (MDL) Iteratively Regularized Agglomerative Clustering (MIRAC), shown in Algorithm 1. To identify genes that distinguish cell clusters we provide a method utilizing XGBoost, a sparsity-aware gradient boosted tree system [18].

Scedar package design
We designed scedar in an object-oriented manner for quickly exploring large-scale scRNA-seq transcription level matrices on a remote server utilizing parallel computing techniques, in order to provide a robust and extensible platform for EDA, rather than surpassing the analytical performance of any of � 275 existing scRNA-seq data analysis methods [17]. The application programming interface (API) is designed to be intuitive to users familiar with the R programming language. The standard analysis workflow is to explore the dataset, cluster cells, and identify cluster separating genes (Fig 1), which is implemented as four main modules: data structure, KNN, clustering and visualization.
The core data structure stores each transcription level matrix as a standalone sparse matrix or full array instance, and it can easily be extended to support customized analytical procedures. The built-in extensions include common procedures like pairwise distance computation, Principal Component Analysis (PCA), t-SNE [19], UMAP [20], and k-nearest neighbor graph [21]. We optimized time and memory efficiency with the following design patterns: parallel computing, lazy loading, caching and copy-on-write.
The KNN and clustering modules utilize the data structure and parallel computing to efficiently perform analytical procedures, and the results are stored in the data structure for further reference.
The visualization module contains plotting methods optimized for large datasets, especially for plotting heatmaps. For example, it takes less than two minutes to generate a heatmap image from a 50,000 x 20,000 matrix containing random standard normal entries.
Preprocessing is implemented as selection and transformation routines of the core data structure, which is not a focus of scedar. The package is designed to identify the necessity of certain preprocessing procedures by extensively exploring the original state of the data. Although preprocessing, such as batch effect correction and normalization, could facilitate the detection of biological variances between cells, applying preprocessing methods correctly requires careful validation of their assumptions, otherwise they may introduce unwanted bias or variability [22].

Minimum description length iteratively regulated agglomerative clustering
Minimum description length (MDL) iteratively regulated agglomerative clustering (MIRAC) extends hierarchical agglomerative clustering (HAC) [23] in a divide and conquer manner for scRNA-seq data. Input with raw or dimensionality reduced scRNA-seq data, MIRAC starts with one round of bottom-up HAC to build a tree with optimal linear leaf ordering [24], and the tree is then divided into small sub-clusters, which are further merged iteratively into clusters. Because each individual cluster becomes more homogenous with higher number of clusters, the iterative merging process is regularized with the MDL principle [25]. The asymptotic time complexity of the MIRAC algorithm is O(n 4 +mn 2 ) where n is the number of samples, and m is the number of features. The space complexity is O(n 2 +mn). Relevant mathematical theories and notations of MDL are briefly described in the following section. The pseudo-code of MIRAC is shown in Fig 2. Comparing to HAC, MIRAC is not designed to be faster but rather to improve the cluster robustness. The asymptotic time complexity of MIRAC is the same as HAC with optimal leaf ordering. However, the use of MDL rather than deterministic similarity metrics, improves the noise tolerance by estimating similarity with probabilistic models to give more weight on signal and less weight on noise, assuming that the signal is stronger than noise.
Rationale. The rationale behind MIRAC is to reduce the number of cell partitions that need to be evaluated and find a good one among them.
The number of all possible partitions of a set with n elements is the Bell number, which could be computed with the following recurrence where n�1 and B 0 = 1 [26]. It is computationally intractable to compute the code lengths of B n partitions, so we reduced the number of cell partitions to evaluate with the following steps: 1. We only evaluate the partitions of n cells that consecutively divide the optimal HAC tree T leaf ordering. The HAC tree with optimal leaf ordering maximizes the sum of similarity measurements between adjacent items in the leaf ordering [24], which could be computed with an algorithm developed by Bar-Joseph et al. that runs in time O(n 4 ). The number of partitions that consecutively divide an ordering is the same as the number of compositions of n, which is C n = 2 n−1 . Because this number still grows exponentially with regard to n, it is necessary to further reduce the set of partitions for evaluation.
3. Within step 2 partitions, we only evaluate the ones that could be generated by merging adjacent clusters of a specific partition P s ¼ fP s 1 ; P s 2 ; . . . ; P s k g, where P s is generated by recursively dividing the root HAC tree T until the partitioned subtree has �t−1 leaves. Thus, dn/ (t−1)e�k�n. Let M(P s ) denote the number of step 2 partitions that could be generated by The minimum description length iteratively regulated agglomerative clustering (MIRAC) algorithm. MIRAC extends hierarchical agglomerative clustering (HAC) in a divide and conquer manner for scRNA-seq data. Input with raw or dimensionality reduced scRNA-seq data, MIRAC starts with building an HAC tree (Line 1-3), and the tree is then divided into small sub-clusters (Line 4-5), which are further merged iteratively into clusters (Line 9-37). The rationales and detailed procedures are described in the Methods section. merging adjacent clusters of P s . The upper bound of M(P s ) is the same as the number of step 2 partitions, which could be reached when k = n. The lower bound of M(P s ) is not straightforward, since the merged partitions should have all clusters with �t cells.
In order to find a good cell partition P g in the subset of all possibles, we iteratively inspect each cluster of P s and merge it with either its left or right adjacent cluster according to the similarity determined by the two-stage MDL scheme described in the following sections. The procedure has the following steps: 1. Merge P s clusters in the beginning of the optimal ordering until the merged set contains �t cells (line 9 Algorithm 1).
2. Similarly, merge P s clusters at the end of the optimal ordering until the merged set contains �t cells (line 11 Algorithm 1).
3. For every cluster P s i in the middle, determine the similarity between P s i and the cluster on the left and right, and merge P s i with the more similar cluster (line 20-26 Algorithm 1). Although the cluster on the left P s iÀ 1 always has � n cluster min cells, the cluster on the right P s iþ1 may have a minimum of one cell, so that the determined similarity between P s i and P s iþ1 is sensitive to noise. In order to improve the robustness of similarity comparison, instead of determining the similarity between P s i and P s iþ1 , we determine the similarity between P s i and where jP s rmm j � n cluster min ; jP s rmm j À P s iþm < n cluster min , and rmm is the shorthand for right minimax. 4. Once jP s i j � n cluster min , determine the similarity between P s iÀ 1 and P s i and merge them if their similarity is above a certain threshold, otherwise split them (line 32-36 Algorithm 1). The code length of merged X listðP s The code length of divided X listðP s If L d À L m � À r mdl min absðL m Þ, merge P s iÀ 1 and P s i , otherwise split them. This conditional statement generalizes the situations where L m is either non-negative or negative. 5. Continue inspecting the next cluster in P s . 6. After inspecting all middle clusters, P g is the final updated P s .
We use a standard HAC tree to perform MIRAC when the number of samples is larger than ten thousand, which usually results in decreased but still acceptable performances (Fig 3). Although the optimal leaf ordering of the HAC tree is an important assumption, its computation takes too long when the number of samples is large. For example, it takes more than a week to compute the optimal leaf ordering of a dataset with about 68,000 samples. However, the time complexity of computing a good HAC linear ordering could be significantly reduced by implementing other ordering techniques [28].
Analysis of complexity. The time complexity of MIRAC is (n 4 +mn 2 ), where n,m�1. The time complexity of computing the HAC tree with optimal leaf ordering is O(n 4 ) [24]. The time complexity of finding P g is O(mn 2 ), which is briefly analyzed as following.
The time complexity is O(nm) for computing the code length L(X n×m ,P). In computing L (X n×m ,P) we compute the code lengths of each cluster and their labels. Since it takes O(n) time to compute the code length of n observations with a single feature, computing the code length of all clusters and features individually takes O(nm) time, and computing the code length of n labels take O(n) time.
Similarly, the time complexity is O(n 1 m+n 2 m) for computing the code length of X n 1 �m encoded by a model fitted with X n 2 �m . Fitting a model on X n 2 �m takes O(n 2 m) amount of time.
Using the model to encode X n 1 �m takes O(n 1 m) amount of time.
The time complexity of searching for Pg is O(mn 2 ). In the worst-case scenario, |P s | = n, and the inspecting cluster is always merged with the left-hand side cluster in step 4. However, the impact of n cluster min and step 3 on the time complexity is not straightforward, so we divide the step 3 code length computation into three parts and analyze the upper bound of them individually. The divided three parts are the left cluster, inspecting cluster, and right minimax cluster. Because we keep increasing the size of the left cluster, the upper bound of computing the left cluster code length throughout the execution is O(mn 2 ). The maximum size of the inspecting cluster is n cluster min À 1, so the upper bound of computing the inspecting cluster code length is also O(mn 2 ), when every middle singleton is evaluated n cluster min À 1 times before merging with the left. The maximum size of the right minimax cluster is 2n cluster min À 1, so the upper bound of evaluating the right minimax cluster is also O(mn 2 ), when every increment of the left cluster takes n cluster min times of evaluating the right minimax cluster. Summarizing these three parts, the overall upper bound is therefore O(mn 2 ).
The space complexity of MIRAC is O(nm+n 2 ), which could be decomposed into the following three parts. The space complexity of storing the n×m data matrix X is O(mn). The space complexity of storing the HAC tree with optimal leaf ordering is O(n). The space complexity of storing the pairwise distance matrix is O(n 2 ).
Extension with community detection. We extended MIRAC with community detection to improve scalability. We apply MIRAC on a relatively large number of detected single-cell communities to identify final clusters. We used the Leiden algorithm for community detection on KNN graphs [29]. We also provide a KNN graph construction method that supports approximate nearest neighbor (ANN) search using a Hierarchical Navigable Small World (HNSW) graph [30].

Cluster separating genes identification
We use XGBoost [18], a scalable and sparsity-aware boosted tree system, to identify genes that are able to separate a specific set of clusters. This method is designed for data exploration after applying any one of a number of various statistical approaches that have been developed to identify differentially expressed genes [31][32][33][34][35], in order to quickly identify genes or sets of genes that are able to separate an arbitrary set of clusters for further inspection.
Rather than providing a meticulous p-value for each gene among the compared clusters, we rank the genes by their individual importance on separating the clusters under comparison. The importance of a gene in the trained classifier is the number of times it has been used as an inner node of the decision trees. We use cross validation to train an XGBoost classifier on the compared clusters, and the classifier is essentially a bag of decision trees [18]. In order to alleviate the influences of stochasticity on interpretation, we use bootstrap with feature shuffling to better estimate the importance of genes in separating the compared clusters. The obtained list of important genes could be further explored by inspecting transcription level fold changes and decision tree structures.
Comparing to NSForest [36], a method based on random forest [37,38] to identify a parsimonious set of cluster separating genes from scRNA-seq data, our method identifies all possible cluster separating genes using gradient boosting. Practically, NSForest version 1.3 is distributed on GitHub as a Python script without encapsulation or testing (checked on Oct 11, 2018), but our method is distributed through the Python Package Index, comprehensively tested, and easy to use through the user-friendly API. With regard to scalability, our method uses a scalable implementation of gradient boosting algorithm [18], whereas NSForest uses the implementation of random forest in scikit-learn [38].

K-nearest neighbor methods
A k-nearest neighbor analytical strategy exploits the similarity between data points for classification, regression, and imputation [39,40]. In k-nearest neighbor methods, samples are considered as points in a space with each dimension representing a measured property, which is often referred to as a feature, of the samples. The similarity between samples can be evaluated with various distance metrics, which is extensively reviewed by Bellet et al. [41]. Generally, a distance metric is a function that takes two samples and output a numeric value to represent the distance between the two samples in their feature space. For example, the Euclidean distance metric is the following function, dðp; qÞ ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi where the p and q are two samples, and q i and p i are the data values on their ith dimension. The k-nearest neighbors of a sample are the k number of other samples that have the smallest distances from the sample. The k-nearest neighbors of a sample are informative, due to their similarity, to determine the category of the sample in classification, the relevant continuous property in regression, and the missing values in imputation.
We developed two methods based on the KNN algorithm to facilitate the exploration of scRNA-seq datasets. For a relatively large number of cells profiled in each scRNA-seq experiment, we assume that each one of the non-rare cells is similar to at least k other cells in their transcriptomic profiles. With this assumption, we impute gene dropouts and detect rare transcriptomic profiles. In the scedar implementation, we also support approximate nearest neighbor (ANN) search using Hierarchical Navigable Small World (HNSW) graph [30], which greatly improves the scalability of KNN graph construction. Impute gene dropouts. In an scRNA-seq experiment, if a truly expressed gene is not detected in a cell, the gene is considered a "dropout", and such events are called gene dropouts [32]. Gene dropouts may be caused by biological and technical reasons [32,42,43]. The rationale behind the possible causes of biological dropouts are related to phenomena such as transcriptional bursting [44,45] and RNA degradation. With regard to technical dropouts, the main concerns are the relatively small number of RNA transcripts of a gene, amplification efficiency, and batch effects [13].
We exploit the transcriptomic profiles of the k-nearest neighbors of a cell to impute the gene dropouts in the cell (Algorithm A in S1 Text). The algorithm could take multiple iterations, so that the dropped-out genes that are expressed in all k-nearest neighbors could be imputed at first, and the ones that are expressed in most but not all of k-nearest neighbors could be imputed in the following iterations.
Detecting rare transcriptomic profiles. We mark transcriptomic profiles as rare if they are distinct from their k-nearest neighbors, according to the pairwise similarity between cells (Algorithm B in S1 Text). The algorithm could take multiple iterations, so that the most distinct transcriptomic profiles could be marked at first and less distinct ones in the following iterations.
This method is provided mainly to facilitate detailed inspection of rare transcriptomic profiles rather than removing outliers from the data. Because rare transcriptomic profiles may have various biological and technical causes, samples and features in a dataset should only be removed after extensive exploratory data analysis and rigorous reasoning with domain specific knowledge. Closely comparing rare transcriptomic profiles with their nearest neighbors may also yield insights into their biological differences, which may further facilitate the identification of rare cell types and states.
We performed all benchmark analyses on a high-performance computing cluster, of which the computing resources are strictly managed by Univa Grid Engine. The cluster nodes have CPUs of Intel Xeon E5-2680 v3 or Intel Xeon E7-8880 v3. The memory sizes are either 128GB, 256GB, or 1TB. When scheduling analytical jobs for benchmarking, we make sure that the number of cores and allocated memory are enough for the program.
Clustering. The clustering accuracy and stability of MIRAC were benchmarked together with several other clustering methods on experimental scRNA-seq datasets listed in Table 1.
The following clustering methods are directly applied on the original data without preprocessing.
• Single-cell consensus clustering (SC3) version 1.7.7 [46]. SC3 is selected for comparison because it was extensively compared with other methods on different experimental datasets.
• Community clustering on KNN graph constructed using the Euclidean distances of 100 principal components. We used the Leiden algorithm for community detection [29].
Although MIRAC could be directly applied on the expression matrix, dimensionality reduction is able to improve the performance of similarity and density-based clustering methods when the number of features is high [48]. The mathematical influences of the high number of features are briefly described in the previous sections.
Although t-SNE projections are stochastic and influenced by the perplexity parameter, t-SNE is proved to be able to recover well-separated clusters [49] and t-SNE has been extensively used as dimensionality reduction method for scRNA-seq data [3,15]. Thus, we chose to use t-SNE for demonstration in this report, but we note that scedar also supports the use of PCA and UMAP for MIRAC clustering, which can be applied just as easily as the t-SNE method. When benchmarking for accuracy, we cluster the cells in each experimental dataset using the compared clustering methods with a grid of parameters. We use the maximum similarities between the clustering results and the cell types from the publications in order to compare the accuracy of different clustering methods. Although taking the maximum increases the chance of overfitting, it resembles the procedure of clustering analysis in practice.
We also recorded the runtime of clustering methods on different datasets. In Fig 3A, SC3 was performed with 20 cores, and community clustering is performed with 60 cores. Other clustering methods were performed with a single core. When running MIRAC, we did not require the hierarchical tree to be optimal. The community-detection-extended MIRAC clustering method was performed with 25 CPU cores, and other methods were performed with 60 CPU cores (S8 Fig, panel C).
We also benchmarked the influences of t-SNE random states on the clustering results (S1 Fig). The t-SNE embeddings generated with different random states may be distinct from each other, although current mathematical results have shown that clustering using t-SNE embeddings is able to recover well separated clusters [49]. We characterized the stability to random states of clustering methods by running them on each experimental dataset with the same parameters but ten different random states. The similarity of clustering results between different random states are used to compare the stability of different clustering methods. The stability of community clustering is not characterized, because it is based on PCA.
Cluster similarity metrics. We use two cluster similarity metrics, cluster consistent ratio (CCR) and adjusted Rand index (ARI) [50] for measuring the accuracy and stability of clustering methods respectively. When we have a coarse reference partition P r and a finer clustering partition P c , the CCR is computed as the ratio of pairs within each cluster of P c that are also in the same cluster of P r , with the number of clusters kept the same across compared methods. The ARI is computed with the Python package scikit-learn [38] using the mathematical formula given by Hubert and Arabie [50].
The reference partitions P r of real datasets are obtained from their original publications (Table A in S1 Text). The clusters in Deng et al. dataset [51] are experimentally determined by manual single cell isolation. The clusters in Pollen et al. [14] are experimentally determined by the cell line or tissue type of the isolated single cells. The clusters in Kolodziejczyk et al. [52] are determined by the embryonic stem cell culturing conditions and batches. The clusters in Zeisel et al. [53] are determined by the BackSPIN clustering method [53] and further inspected with domain specific knowledge.
We choose CCR to measure clustering accuracy rather than ARI, because ARI greatly penalizes the split of a large cluster in P r into multiple smaller ones in P c . This behavior of ARI could prevent the evaluation of transcriptomic variabilities within a large group of cells in clustering analysis, which is an important goal of scRNA-seq experiments that cannot be easily achieved by bulk RNA-seq experiments. In addition, we also provide a method in scedar to easily merge multiple clusters together, in case the users found the sub-types of a cell type are very similar to each other.
However, we used ARI to measure clustering stability rather than CCR, because the differences between P r and P c are completely caused by different random states, hence splitting a cluster in P r should be penalized.
MDL is not used as a cluster similarity metric, even though MDL is used in MIRAC to guide the process of finding good partitions. MDL is only used to guide and regularize the merging process of local adjacent sub-clusters by evaluating their similarity, but it is not used as an objective to be optimized globally. Using MDL as a benchmarking metric would introduce a bias as among the various methods, MIRAC is the only method using MDL. Moreover, interpreting relative global MDL differences is not straightforward as it is not only affected by cluster labels, but also transcription levels.
Detection of rare transcriptomic profiles. We visualize real datasets before and after removing rare transcriptomic profiles in t-SNE projection and pairwise distance heatmaps, without quantitative evaluations like receiver operating characteristic (ROC) curve, since rare transcriptomic profiles are not well defined with a large number of genes [48]. Approaches to identify rare data points in low-dimensional spaces (�3) do not scale well to higher dimensions due to the exponentially decreased density of data points in space and increased instability of distances, which is elaborated in the previous section about mathematical theories on high-dimensional data analysis. KNN rare transcriptomic profile detection was performed with 25 CPU cores, as shown in S8 Fig panel B.
It is important to note that rare transcriptomic profiles are detected to facilitate detailed inspection rather than the removal of them from the data. The visualizations are only used to illustrate the capability of the KNN method for detecting rare transcriptomic profiles.
Gene dropout imputation. We simulate gene dropouts using Splatter [13] to obtain a dropout rate around 5%. Then, we benchmark the performance of imputing gene dropout as two parts: detection and smoothing. On simulated data, because the true gene dropouts are known, we use a ROC curve and mean squared errors (MSEs) to characterize the performance of gene dropout detection and smoothing respectively. In Fig 4A, S7 Fig panel A, and S8 Fig  panel A, all imputation methods were executed with a single core. On real data, we visualize the cells in 2D t-SNE space before and after imputation. The compared methods are KNN gene dropout imputation (KNNGDI) version 0.1.5, SAVER version 1.0.0 [54], MAGIC version 1.1.0 [55], and scImpute version 0.0.6 [56].
Method speed-up by parallel processing. We characterized the analysis speed-up by parallel processing on different datasets (S9 Fig). KNN dropout imputation is not evaluated on larger datasets, because the speed limiting step has not yet been parallelized. In addition to parallelizing the computation in the implemented methods, we also provide an easy-to-use utility function (scedar.utils.parmap) to parallelly run analytical methods with different parameters, which would significantly facilitate exploratory parameter search.

Basic workflow of scedar
We illustrate the basic workflow of using scedar for scRNA-seq exploratory data analysis with the dataset published by Zeisel et al. [53] (Fig 1). The dataset contains the RNA unique molecule identifier (UMI) counts of 19,972 genes in 3005 cells from the mouse brain. We selected this dataset for demonstration because it could be clearly visualized in small graphs, although our package is capable of analyzing much larger datasets with over 2 million cells (S2 Fig, S3  Fig, S4 Fig and S5 Fig).
In Fig 1, each box represents a data analysis step, and they are consecutively executed according to the arrow. The purpose and runtime are listed in the upper ribbon. The code that is essential to the step is listed in the box, and their results are also shown.
The preparation step imports required packages and loads the data. The class SampleDis-tanceMatrix is one of the core data structures in the package that is used to store the read counts and pairwise distances of an scRNA-seq dataset. Because the pairwise distance computation is delayed until necessary, i.e. lazily loaded, this step only takes 12 seconds. We use cosine distance rather than correlation distance to greatly speed up the computation, since we implemented the computation procedure of pairwise cosine distances completely with NumPy linear algebra operations with OpenBLAS backend [57,58].

PLOS COMPUTATIONAL BIOLOGY
The t-Distributed Stochastic Neighbor Embedding (t-SNE) scatter plot and KNN graph are used to explore the dataset. The cell type labels published by Zeisel et al. [53] are truncated to fit in the space. We also provide methods to visualize arbitrary statistics, e.g. number of expressed genes, of individual cells as color gradient. The layouts of cells in t-SNE and KNN graph are similar to each other. Although KNN graph is faster than t-SNE, the runtime for t-SNE could be reduced by optimizing its parameters, e.g. lowering the number of iterations.
The MIRAC step clusters the cells and visualizes them with t-SNE scatter plot and pairwise distance matrix heatmap. The heatmap generation procedure in scedar is optimized for largescale datasets, which is able to generate a heatmap with tens of thousands of columns and rows in a few minutes. Users could also generate heatmaps for the read count matrix to directly inspect the sparsity of datasets (all panels B in S2 Fig, S3 Fig, S4 Fig and S5 Fig).
The last step identifies cluster separating genes with XGBoost [18]. Users could choose an arbitrary set of clusters to compare, and the genes are ranked by their importance in separating the clusters. Then, the read counts of a gene across clustered labels could easily be visualized by t-SNE scatter plot.

Performance of MIRAC clustering
We benchmarked several clustering methods on the datasets listed in Table 1 (Fig 3). Each dataset is clustered multiple times with each clustering method to obtain different numbers of clusters (Table A in S1 Text).
The t-SNE based clustering methods are faster than SC3 (Fig 3A). Also, the t-SNE based clustering methods have similar runtimes (Fig 3A), since the time limiting step is the computation of t-SNE projection, when optimal hierarchical clustering ordering is not required in MIRAC. When the optimal ordering is required, the time limiting step is the computation of optimal ordering (Fig 1). With regard to practical usage, we recommend the users to explore different parameters of MIRAC without optimal ordering. Once appropriate parameters have been identified, users can perform MIRAC with optimal ordering. However, for datasets with more than 10,000 cells, we recommend that users perform MIRAC without optimal ordering (S2 Fig panel C and S3 Fig panel C). For datasets with more than 100,000 cells, we recommend that users perform community-detection-extended MIRAC without optimal ordering (S8C Fig), which takes for instance 2885.5 ± 176.0 (mean ± standard deviation) seconds on the mouse organogenesis cell atlas (MOCA) dataset with 2,058,652 single cells using 25 CPU cores (S8C Fig) [59]. We also performed clustering with the community clustering methods implemented in scedar, Seurat [8] and Scanpy [9] on relatively large scRNA-seq datasets with more than 10,000 cells (S8C Fig). The runtime of community clustering methods Seurat and Scanpy is comparable to community clustering and community-detection-extended MIRAC (S8C Fig), except that Scanpy and Seurat were not able to cluster the MOCA dataset that contain 2,058,652 single cells on a server with 1TB memory due to a mandatory conversion of the sparse read count matrix into dense matrix.
The cluster consistent ratios (CCRs) of t-SNE based clustering methods are comparable to SC3 (Fig 3B). The PCA based community clustering results showed relatively lower CCRs than other clustering methods (Fig 3B), which might due to the non-optimal representation of similarities between high-dimensional single cell read counts in the linear PCA space, comparing to t-SNE embeddings that are optimized to capture the similarities between high-dimensional single cell read counts. The representative MIRAC clustering results of Zheng et al. [16] and Macosko et al. [15] datasets are visualized with t-SNE scatter plots and pairwise distance matrix heatmaps (S2 Fig panel C and S3 Fig panel C). For smaller datasets, the representative MIRAC results are shown (S6 Fig). Although t-SNE projections obtained with different random states are distinct from each other, the consistency of clustering results is comparable to SC3 (S1 Fig).

Performance of imputing gene dropouts
We benchmarked several gene dropout imputation methods on the simulated 10x Genomics (Fig 4) and Drop-seq (S7 Fig) datasets. K-nearest neighbor gene dropout imputation (KNNGDI) is faster than other compared methods on relatively small datasets (Fig 4A) and is capable of handling scRNA-seq datasets with over 10,000 cells (S8 Fig panel A). The runtime of KNNDGI on 2,058,652 cells is 71.56 ± 1.71 (mean ± standard deviation) hours.
The performance of KNNGDI on detecting gene dropouts is comparable to SAVER and better than scImpute and MAGIC. The ROC curve of scImpute sharply turns around TPR = 0.25 (Fig 4B and S7 Fig panel B) because its threshold parameters, determining whether a zero entry is a dropout or not, are not sensitive enough to achieve any higher TPRs. Although the AUCs of KNNGDI and SAVER are higher than scImpute and MAGIC, they all have comparable performances when FPRs are lower than 0.05 (Fig 4B), except that MAGIC has worse performance on the simulated Drop-seq datasets that have higher sparsity than the Zeisel et al. [53] dataset (S7 Fig panel B).
The MSEs of KNNGDI on correcting gene dropouts are comparable to SAVER and smaller than scImpute and MAGIC (S7 Fig panel C and S7 Fig panel D). However, these methods all have MSEs many times higher than the MSEs of the observed counts to the true counts, which implies that these methods all introduced many times more reads than the true dropout reads. Despite different levels of MSEs, the t-SNE embeddings of the imputed read counts are consistent with the t-SNE embeddings of original read counts (Fig 4C, and all panels E in S2 Fig, S3  Fig, S4 Fig and S5 Fig).

Performance of detecting rare transcriptomic profiles
We detected rare transcriptomic profiles in datasets listed in Table 1 with the KNN method ( Fig 5, and all panels F in S2 Fig, S3 Fig, S4 Fig, S5 Fig and S6 Fig) Although the detection method has many limitations, rare transcriptomic profiles tend to be points on the t-SNE scatter plots either far away from the majority of their same types or along the edges of a group of agglomerated points. On the pairwise distance matrix heatmap, rare transcriptomic profiles tend to be small chunks that are distinct from their neighbors, and the heatmap becomes smoother after removing the rare transcriptomic profiles. The detected rare transcriptomic profiles could be further inspected as potential rare cell types and states by comparing with their nearest neighbors. The KNN rare transcriptomic profile detection method is also capable of handling scRNA-seq datasets with over 10,000 cells (S8 Fig panel B), with a runtime of 22.14 ± 0.87 (mean ± standard deviation) minutes on 2,058,652 cells.

Identification of cluster separating genes
We used scedar to identify the genes distinguishing the MIRAC cluster 1, 15, and 22 of the Zeisel et al. [53] dataset (Fig 6 and Table B in S1 Text). In the original publication, the upper to lower MIRAC clusters labeled 22, 1, and 15 in the t-SNE scatter plot are assigned to microglia, endothelial cells, and astrocytes respectively (Fig 4C). We choose these three clusters to inspect one of the discrepancies between cell types and MIRAC clustering results, where the small isolated upper part of the MIRAC cluster 15 is assigned to microglia instead of endothelial cells in the original publication.
The smaller upper isolated part of MIRAC cluster 15 might be a distinct cell sub-type of microglia. Although it expresses a microglia marker gene C1qb (Fig 6B) [60], it does not express Mrc1 or Apoe in the same pattern as the MIRAC cluster 22 (Fig 6B and 6C). According to the transcription levels of the cluster separating genes (Fig 6C), the smaller upper isolated part of MIRAC cluster 15, which is located at the top of the cluster 15 rows, has some genes expressed in the same pattern as the microglia, but it also has some other genes expressed distinctly from the microglia.

Discussion
Comprehensive profiling of the transcriptomes of individual cells within an organism or tissue by scRNA-seq is able to facilitate the systematic study of physiological or pathological states [3,61,62]. Previous scRNA-seq experiments identified novel cell types or states [61], obtained insights into the regulation of differentiation process [3,59,63,64], and inferred molecular mechanisms of tissue functions [53,65,66].
The biological results of scRNA-seq experiments are obtained from extensive data analyses, which could take more time than doing the experiments. As the size of scRNA-seq datasets increases rapidly [1,67], computational methods also need to be scalable by exploiting hardware and software techniques for big data analytics, in order to complete an analysis within a reasonable amount of time.
Scedar is able to facilitate gene dropout imputation, rare transcriptomic profile detection, clustering, and identification of cluster separating genes for scRNA-seq data by exploiting scalable system design patterns and high-performance computing architectures. We parallelized time-consuming computational procedures by multi-processing without excessive copies of shared data in memory. We also decompose the whole analytical procedure into multiple steps, so that certain steps could be specifically optimized without repeating others. In addition, intermediate results, such as pairwise distances and t-SNE projections, are lazy loaded and cached in a unified data structure to speed up analytical routines, prevent repeated computations, and alleviate the burden on users to keep track of all intermediate results.
Comparing to other computational tools that were developed or updated for large scale scRNA-seq data analysis, like PAGODA2 [68], Seurat v2.0 [8], and Scanpy [9], scedar distinguishes itself with an additional research focus on developing new analytical methods, including those based on machine learning. In scedar, we adapted KNN, a typical machine learning PLOS COMPUTATIONAL BIOLOGY algorithm, to impute gene dropouts and detect rare transcriptomic profiles. MDL principle, an important concept in computational learning, is applied to cluster single cells [25]. A scalable and sparsity-aware gradient boosted tree system XGBoost [18], which implements a typical machine learning algorithm, is used to identify genes that are able to distinguish different clusters. In addition, these machine learning methods are able to exploit modern high-performance computing architecture, which improves the scalability of the package.
In scedar, we developed a clustering algorithm, MIRAC, for scRNA-seq data. MIRAC clusters observations in three steps: 1) build a tree by hierarchical clustering, 2) divide the tree into sub-trees, and 3) merge similar sub-trees into individual clusters. This clustering strategy adapts the basic ideas of BIRCH [69] and BackSPIN [53]. Instead of building a balanced tree structure, like the clustering feature tree in BIRCH for further partitioning, MIRAC divides the tree structure built by hierarchical clustering optionally in a balanced manner (S10 Fig and  S1 Text section 1.4), which simplifies the clustering procedure. In BackSPIN, a sorted pairwise correlation matrix is recursively bi-partitioned into clusters according to criteria based on the normalized sums of correlation coefficients. In contrast, MIRAC iteratively merges the relatively small sub-clusters according to criteria based on the MDL principle, which increases the robustness for determining whether two groups of observations should be put in the same cluster or not. Especially, when the number of observations is large within a group, finding an optimal bipartition is not straightforward, since there may be multiple distinct sub-groups. Although the performance of MIRAC under certain metrics is comparable to other clustering algorithms, it is able to provide distinct clusters that are sensitive to local structures, which could be used as alternative perspectives to interpret the source of heterogeneity within the dataset.
There are still many possible improvements on scedar. To improve the scalability of MIRAC, we could provide more efficient methods to obtain the optimal leaf ordering using linear embedding techniques [28]. To improve the scalability of the backend data structure, we could extend it with distributed analytic systems such as Apache Spark [10]. To visualize the differences between different clusters, we could plot the fold changes of gene transcription levels on the pathway maps in the KEGG (Kyoto Encyclopedia of Genes and Genomes) database [70].
Scedar is at its core developed as an open-source, community-extensible package for exploratory data analysis, with strong testing and code development standards (Table C in S1 Text). Scedar will continue to be developed in an open-source environment that can offer an analytical platform for developing novel algorithms, support state-of-the-art software quality control standards, and provide easy-to-use programming interfaces for exploratory data research.

Code availability
The scedar package is distributed through Pypi https://pypi.org/project/scedar The code and other resources are available in Github at https://github.com/logstar/scedar Supporting information S1 Text. Supplementary text. Algorithm A, Algorithm B and Table A, Table B, and Table C. Includes discussion of scedar package development, minimum description length method, two-stage coding scheme for clustered scRNA-seq, mathematical theories on high-dimensional data analysis including distances between points in high dimensional space and the Johnson-Lindenstrauss lemma. Discussion of the skewed root division of a hierarchical agglomerative clustering tree. Pollen et al. [28], Deng et al. [29], and Kolodziejczyk et al. [30] respectively. Within each subfigure, the plots are 1) t-SNE scatter plot with cell type labels, 2) t-SNE scatter plot with MIRAC cluster labels, 3) pairwise cosine distance heatmap with left strip as MIRAC labels and upper strip as cell type labels, 4) t-SNE scatter plot with common or rare rare transcriptomic profile labels, 5) pairwise cosine distance heatmap with left strip as MIRAC labels and upper strip as common or rare transcriptomic profile labels, 6) pairwise cosine distance heatmap with rare transcriptomic profiles removed.