Column subset selection for single-cell RNA-Seq clustering

The first step in the analysis of single-cell RNA sequencing (scRNA-Seq) is dimensionality reduction, which reduces noise and simplifies data visualization. However, techniques such as principal components analysis (PCA) fail to preserve non-negativity and sparsity structures present in the original matrices, and the coordinates of projected cells are not easily interpretable. Commonly used thresholding methods avoid those pitfalls, but ignore collinearity and covariance in the original matrix. We show that a deterministic column subset selection (DCSS) method possesses many of the favorable properties of PCA and common thresholding methods, while avoiding pitfalls from both. We derive new spectral bounds for DCSS. We apply DCSS to two measures of gene expression from two scRNA-Seq experiments with different clustering workflows, and compare to three thresholding methods. In each case study, the clusters based on the small subset of the complete gene expression profile selected by DCSS are similar to clusters produced from the full set. The resulting clusters are informative for cell type.


Introduction
Advances in RNA sequencing technology have recently made it possible to measure the genomewide expression profile of single cells (Tang et al., 2009). This promising technology is not without computational and analytical challenges, some of which include quality control, quantification, normalization, technical variability, and other confounding factors such as batch effects (Stegle, Teichmann and Marioni, 2015;Wagner, Regev and Yosef, 2016). More general challenges stem from the high dimensionality of the expression profiles: for example, selecting informative features from within the expression profiles.
One use for single-cell RNA sequencing (scRNA-Seq) data is the characterization of heterogeneity of expression within a population of cells and the discovery of new cell types through clustering of expression profiles (Zeisel et al., 2015). This note explores the following question: is it possible reduce the number of features in the expression profile without a large effect on the error rate for clustering and classification? This question is inspired by the quality control and technical variability challenges of scRNA-Seq. Common techniques for quality control and technical variability reduction include simple thresholding schemes and principal components analysis (PCA). Both of these techniques reduce the number of features in the data matrix.
One commonly used technique to reduce the number of features in the data matrix involves selecting columns from the original data matrix A, to form a column submatrix C, by thresholding the individual columns based on a score. Frequently used scores are on measures of abundance (Lun, McCarthy and Marioni, 2016), empirical variance (Kwon, Fan and Kharchenko, 2017), abundance and empirical variance (McCarthy et al.), and index of dispersion (empirical variance/mean) (Satija et al., 2015;Trapnell et al., 2014). Read count thresholds are intended to reduce low-abundance genes (Bourgon, Gentleman and Huber, 2010) or genes with high dropout rates (Brennecke et al., 2013), as these genes are not considered informative. Variance thresholding methods assume that the most variable genes are responsible for the important differences between cells (McCarthy et al.). Index of dispersion thresholding has a natural interpretation in terms of formal hypothesis testing, when the null model for gene abundance is the Poisson distribution (Cox and Lewis, 1966). We call these methods simple thresholding methods, because the score for each column i depends only on column i. Furthermore, within each column i, covariance between the rows (cells) of that column is not taken into account. By selecting columns and not linear combinations of columns from A, the elements of C will maintain the properties of non-negativity, sparsity, and interpretability, an advantage over PCA, but there are no guarantees that C will have similar properties to the original data matrix A.
Replacing the original data matrix of scRNA-Seq expression profiles with a rank-k PCA truncation of the profiles is another commonly used technique to reduce the number of features and the technical variability (Wagner, Regev and Yosef, 2016). To understand the PCA truncation, we must establish some matrix notation that we will use throughout this note. We orient the original data matrix A so that the n rows are cells and d columns are features, where n < d. For PCA, singular value decomposition (SVD) is performed on the column-mean centered matrix A = A − 1µ T , where 1 is an n × 1 column vector and µ = 1 n A T 1 is a d × 1 column vector of column-means. The sum of the spectrum of eigenvalues ofÃÃ T is proportional to the total empirical variance of A. The rank-k PCA truncation of A, which we callT, is the rank-k SVD truncation ofÃ. SVD is reviewed in Sec. 6.1, and the formula forT is provided there. As a consequence of the SVD, the spectrum of the square of the rank-k PCA truncationT is identical to the spectrum of the square of the mean-centered data matrixÃ up to rank k; PCA gives a rank-k approximation to the mean-centered dataÃ that preserves the maximum empirical variance of A. PCA is performed to reduce technical variability under the assumption that the technical variation is primarily captured by the non-leading eigenvalues and eigenvectors ofÃÃ T .
The drawback of replacing the original data matrix with the rank-k PCA truncation of the data that it fails to preserve non-negativity and sparsity structures present in the original data matrix, and the coordinates of projected cells are not interpretable in terms of single features.
The goal of column subset selection (CSS) is to extract from a matrix A a column submatrix C that conserves favorable properties, such as conditions on the spectrum of the column submatrix C (Tropp, 2009). Like the simple thresholding methods, CSS maintains the properties of nonnegativity, sparsity, and interpretability, and like PCA, CSS conserves favorable matrix properties.
Similar to the simple thresholding methods discussed above, each column has a score, however in CSS algorithms, the score for each column i also depends on all of the other columns. We will consider rank-k subspace leverage scores in this note. Leverage scores have been considered for regression diagnostics and outlier detection in statistics (Velleman and Welsch, 1981;Chatterjee and Hadi, 1986) and were brought to prominence more recently in the context of randomized matrix algorithms (Drineas, Mahoney and Muthukrishnan, 2006). The rank-k subspace leverage score τ i (A k ) for the i th column of A is, where the i th column of A is an (n × 1)-vector denoted by a i , M + denotes Moore-Penrose pseudoinverse of M, and A k is the rank-k SVD approximation to A, defined in Sec. 6.1. The leverage score τ i (A k ) can also be written as the solution to the following optimization problem (Cohen et al., 2015),  (2014) show that for datasets with power-law decay in τ i (A k ), DCSS will select a least-squares approximation for A, CC † A, requiring fewer columns with the same accuracy than random sampling methods. One of the contributions of this note is a new bound for the spectrum of the square of C selected by DCSS projected onto the rank-k subspace that best approximates A (Eqn. 2.9). This bound means that, once both C and A are projected onto the rank-k subspace that best approximates A, CC T is "close" to AA T . Another consequence is that the Frobenius norm of C is bounded (Eqn. 2.10). The Frobenius norm is a measure of the "size" of a matrix, so this bound provides confidence that the DCSS column matrix C is also similar in "size" to A and A k . In the event that DCSS is performed on a mean-centered matrixÃ, the Frobenius norm provides a measure of empirical variance. We also show a similar bound holds for random sampling (Eqn. 2.11), and under the assumption of power-law decay, DCSS requires fewer columns for the same error than random sampling.
In addition to the spectral bound, we present two case studies on two different scRNA-Seq experimental and analysis workflows to illustrate empirically the effect of thresholding features with DCSS compared to read count, variance, and index of dispersion on clustering and classification.
To the best of our knowledge, this is the first time DCSS has been applied to scRNA-Seq data.
The first case study is the genome-wide expression profiles of 3, 005 cells from the mouse cortex and hippocampus (Zeisel et al., 2015) and the clustering workflow of Ntranos et al. (2016). The second case is the genome-wide expression profiles of 4, 423 cells from mouse bone marrow (Paul et al., 2015) and the trajectory workflow of Setty et al. (2016). In both case studies, DCSS reduces the low abundance genes and maintains many of the most variable and over-dispersed genes. This shows that DCSS shares the best features of the simple thresholding methods and, like PCA, comes with additional bounds on the spectrum. This supports our conclusion that DCSS can be used instead of the simple thresholding methods for quality control and to reduce technical variability, in addition to selecting informative features. In both case studies, only a small fraction of the features are necessary to obtain clusters reflecting cell types, consistent with results in (Kwon, Fan and Kharchenko, 2017). We show that the error rate between the clustering assignments computed with the complete expression profile and the reduced expression profile is small.

Methods
The aim of this note is to explore the effect of thresholding features, measurements of gene expression, with DCSS. We compare DCSS to simple thresholding methods and also to the complete data. These thresholding methods are the first step in the pre-processing workflow. In this section, we include the DCSS algorithm for completeness, and we describe the new bounds for DCSS.

2.1
The DCSS algorithm (Papailiopoulos, Kyrillidis and Boutsidis, 2014) Algorithm 1. The DCSS algorithm selects for the submatrix C all columns i with a rank-k subspace leverage score τ i (A k ) above a threshold θ, determined by the error tolerance and the rank, k. The algorithm is as follows.
1. Choose the rank, k, and the error tolerance, .
3. Sort the columns by τ i (A k ), from largest to smallest. The sorted column indices are π i .
4. Define an empty set Θ = {}. Starting with the largest sorted column index π 0 , add the corresponding column index i to the set Θ, in decreasing order, until,

3)
and then stop. Note that k = d i=1 τ i (A k ). It will be useful to define˜ = i ∈Θ τ i (A k ). Eqn.
2.3 can equivalently be written as >˜ .
6. The leverage score τ i (A k ) of the last column i included in Θ defines the leverage score threshold θ.

Introduce a rectangular selection matrix
Theorem 3 of Papailiopoulos, Kyrillidis and Boutsidis (2014) states that when the rank-k subspace leverage scores exhibit a power-law decay in the sorted column index π i , the number of sample columns selected by DCSS is, (2.5) Papailiopoulos, Kyrillidis and Boutsidis (2014) demonstrate the power-law decay behavior of many real-world datasets; we show that this behavior is a reasonable assumption for the scRNA-Seq applications in Sec. 3.
For a statistical interpretation of DCSS, consider the data a i , i = 1, . . . , d to be identically and independently distributed (i.i.d.) according to the degenerate multivariate distribution Rao (1973) pg. 527-528 for a discussion of the degenerate multivariate distribution. Then the total likelihood of the data matrix A is, where |σ j | are the k largest singular values of A k . In contrast, the total likelihood of the DCSS matrix C is, This shows that the DCSS matrix C preserves the total likelihood of the data up to a factor of exp 1 2˜ < exp 1 2 and a normalization constant, under the assumption that the data is i.i.d.
. Any other selection set Θ of the same number of columns (|Θ | = |Θ|) will have equal or greater error ( ). This interpretation illustrates that DCSS accounts for covariance A k A T k between rows (cells). In contrast, the Poisson null model for the index of dispersion assumes independence between rows (cells) for each column (gene).
The DCSS method has two parameters, k, which jointly determine the number of columns |Θ| in the DCSS column submatrix C. The parameter k determines the rank of interest of the SVD approximation to A. The tuning parameter is a measure of the error tolerance in the "size" of C compared to A k . The selection of these parameters is a model selection problem, and in concert with a loss function, one could select these parameters using one's preferred model selection method (e.g. cross-validation). The aim of this note, to compare clustering performed with the complete data matrix and a column submatrix, does not have a well-defined loss function, and so we use the heuristic "elbow" method for selecting k (Jolliffe, 2002), and we choose to be 0.1 or 0.05 in our applications.
As a toy example to illustrate how DCSS differs from the simple thresholding methods, consider the following toy data matrix, A = 40 20 10 20 10 15 . (2.8) If the goal is to select a column submatrix with two columns, it is easy to check that simple thresholding by mean, variance, and index of dispersion all select the first and second columns.
However, the resulting column submatrix is only rank 1, because the first and second columns are linearly dependent. In contrast, DCSS with (k = 2, > 0.2) will select the first and third columns, and the resulting DCSS column submatrix will be rank 2. Unlike the first three methods, DCSS takes into account the collinearity between columns in the selection procedure. If the DCSS error tolerance for this toy example is less than 0.2, DCSS will select all three columns.
We also mention two asides: first, in applications where the number of cells is far greater than the number of gene features (n > d), the method can instead be applied to A T instead of A to filter cells instead gene features; second, the method can be modified to select columns for any rank-k subspace defined by k singular vectors of A, and not just the leading-k subspace (e.g. drop component 1 but include component 2). This could be useful when some of the leading singular vectors are highly correlated with batch or other confounding effects.

New bounds for DCSS
We derive a new spectral approximation bound (Bound 2.9) for the square of the submatrix C selected with DCSS and projected onto the rank-k subspace that best approximates A.
Theorem 2.1 Let A ∈ R n×d be a matrix of at least rank k and τ i (A k ) be defined as in Eqn. 1.1.
Construct C following the DCSS algorithm described in Sec. 2.1. Then C satisfies, The symbol denotes the Loewner partial ordering which is reviewed in Sec 6.1. Conceptually, the Loewner ordering is the generalization of the ordering of real numbers (e.g. 1 < 1.5) to Hermitian matrices. This bound means that after projection onto the rank-k subspace that best approximates A, CC T is "close" to AA T on that subspace. Statements of Loewner ordering are quite powerful; important consequences include inequalities for the eigenvalues and Euclidean distances. Some of the consequences of the Loewner ordering are reviewed in Sec 6.1. Bound 2.9 and the fact that CC T AA T implies a bound on the Frobenius norm of C, a measure of the "size" of a matrix, In the event that A is mean-centered, this means that the total empirical variance of C is bounded from below by (1 − ) the variance in A k and bounded from above by the total variance of A.
The proof of Bound 2.9 and Bound 2.10 is included in Sec. 6.2.
One simple consequence of Bound 2.9 is the following bound, Bound 2.11 also holds for C selected by random sampling methods with t columns (see Sec. 6.3 for the theorem and proof). Thus, DCSS selects fewer columns with the same accuracy in Bound 2.11 for power-law decay in the rank-k subspace leverage scores when, In this expression, m is the number of columns with zero rank-k subspace leverage score, γ is the minimum non-zero leverage score, and δ is the probability that Bound 2.11 fails to hold under random sampling.

Results
We present two case studies where we compare DCSS to the simple thresholding methods of variance, count, and index of dispersion. We analyze the overlap in the selected columns. We also illustrate the effect of DCSS compared to the complete data for single-cell clustering.
3.1 Zeisel et al. (2015) As a concrete illustration of the DCSS method, we focus on the genome-wide expression profiles of 3005 cells from the mouse somatosensory cortex and hippocampal CA1 region (Zeisel et al., 2015) and the clustering workflow of Ntranos et al. (2016). The main contribution of Ntranos et al. (2016) is to perform clustering directly on the partition of reads into equivalence classes (ECs) rather than on a full quantification of reads into gene expression. ECs are a partition of reads into distinct classes, such that every read in a class maps to exactly the same set of transcripts (Nicolae et al., 2011). This method is computationally scalable, comparable across scRNA-Seq experiments, and can be more accurate than clustering performed on a full quantification of reads into gene expression profiles (Ntranos et al., 2016).
The Ntranos et al. (2016) data matrix A is 3, 005 cells × 246, 981 EC counts. By the elbow method, we choose k = 5 for the DCSS workflow (Fig. 1a). We select an error tolerance of = 0.1.
The rank-5 subspace leverage scores and the power-law fit for the top-scored 10, 000 ECs are shown in Fig. 1b. The column submatrix C has only 862 ECs, or approximately 0.3% of the total ECs. These ECs contain 42.3% of the reads. These 862 ECs map to 2, 748 transcripts and to 1, 642 genes. Table 1  We compare DCSS to the three simple thresholding methods with the same number of columns in Fig. 1c and Fig. 1d. These figures show the similarities and differences in columns selected by the four thresholding methods. The simple thresholding methods have sharp boundaries in Fig. 1c, while the DCSS boundary is not linearly separable. The DCSS boundary approximately interpolates between the count and variance boundaries, and is most distinct from the index of dispersion boundary. Fig. 1d summarizes the overlap between selected columns in a Venn diagram.
These figures illustrate that the DCSS method selects columns that are highly variable, have large counts, and frequently are over-dispersed; as such, the DCSS method is prescribed for quality control and to control technical variability.
The Ntranos et al. (2016) workflow for the Zeisel et al. (2015) dataset is to perform spectral clustering on pairwise Jensen-Shannon (JS) distances derived from the partition of reads into ECs.
The spectral clustering clustering algorithm used is standard; the algorithm is to perform k-means clustering on the k-dimensional SVD projection of the normalized Laplacian of the symmetric similarity matrix S. The similarity matrix used for spectral clustering is S(p, q) = 1 − D JS (p, q), where D JS (p, q) is the JS distance between two probability mass functions p, q ∈ R d . JS distances are well-suited to high-dimensional data, and provide more accurate clustering than L 2 distances on scRNA-Seq data (Ntranos et al., 2016). For the Zeisel et al. (2015) data, the probability mass function for each cell is the vector of EC counts, normalized to sum to one. For the four thresholded workflows (DCSS, count, variance, and index of dispersion), the probability mass function for each cell is the subset vector of EC counts, normalized to one.
We evaluate the average spectral clustering classification error between the complete data and thresholded workflows, regarding the complete data workflow as the ground-truth. Since spectral clustering requires a random initialization for k-means, the average is over T = 10 random initializations. Fig. 2 shows the average spectral clustering classification error rate for both two and nine spectral clusters for the workflow with the matrix A and the workflow with the column submatrix C for various k, . The different cells were curated into 47 subtypes by Zeisel et al.
(2015), but we evaluate our method on courser-grained classifications because we have higher confidence in the spectral clustering ground-truth. Two spectral clusters identify neurons and non-neurons, while nine spectral clusters only loosely correspond to the nine major cell types. We also include the error for the three simple thresholding methods with the same number of columns as the DCSS method. We find that 0.3% of the total ECs give an error rate of 1.7% compared to the complete data for two clusters for k = 5, = 0.1 DCSS; only a small fraction of the gene expression profiles currently produced in scRNA-Seq experiments may be necessary to obtain the clusters reflecting cell types.

Paul et al. (2015)
As a second application of the DCSS method, we focus on the genome-wide mRNA expression profiles of 4, 423 cells from mouse bone marrow myeloid progenitors (Paul et al., 2015), We choose k = 14 for the DCSS workflow by the elbow method (Fig. 3a). We choose k = 14 rather than an elbow at a smaller k because the diffusion component workflow is sensitive to more components. We select an error tolerance of = 0.05. The rank-14 subspace leverage scores and the power-law fit for the top-scored 5, 000 genes are shown in Fig. 3b. The column submatrix C has 4, 693 genes, or approximately 31.4% of the total genes. These genes contain 90.4% of the UMI counts.
We compare DCSS thresholding with k = 14, = 0.05 to the three simple thresholding methods with the same number of columns in Fig. 3c and Fig. 3d. which we perform with keyword string matching. We find that for the k = 14, = 0.05 DCSS, 31.4% of the total genes give an error rate of 3.3% for three branch assignments compared to the complete data; this supports our conclusion that only a small fraction of the gene expression profile from scRNA-Seq experiments may be necessary to obtain meaningful cell-type classifications.

Discussion
scRNA-Seq experiments allow researchers to probe the cell-specific heterogeniety in gene expression.
Quality control and technical variability are significant challenges for scRNA-Seq experiments, and additionally the whole-genome expression profile is high-dimensional. In this note, we explore three existing simple thresholding schemes-count, variance, and index of dispersion-and propose a novel application of a thresholding scheme -DCSS-to select informative features and control quality and technical variability. We prove a bound on the "closeness" of the DCSS data submatrix to the complete data matrix (Eqn. 2.9), enlarging upon the existing set of error guarantees for DCSS (Papailiopoulos, Kyrillidis and Boutsidis, 2014), and illustrating the advantage of DCSS over the three simple thresholding schemes. Other advantages of DCSS include sensitivity to collinearity of features and covariance of cells. Since scRNA-Seq experiments are frequently used to cluster and classify cells, we choose the error rate for clustering and classification compared to the complete data as the evaluation metric for these thresholding schemes.
We present two case studies, the first on mouse cortex and hippocampus scRNA-Seq (Zeisel et al., 2015;Ntranos et al., 2016), and the second on mouse bone marrow scRNA-Seq (Paul et al., 2015;Setty et al., 2016). For the mouse cortex, the data matrix is cells × ECs, and only an incredibly small fraction of the ECs are necessary to obtain neuron and non-neuron cell clusters.
For the mouse bone marrow, the data matrix is cells × genes, and only a small fraction of the genes are necessary to obtain HSPC, myeloid progenitor, and erythroid progenitor branch assignments.
For both case studies, DCSS performs similarly to the simple thresholding schemes, in that it reduces the low abundance genes, maintains the most variable and over-dispersed genes. This supports our recommendation to use DCSS to control quality and technical variability. In both case studies, the error rate between the clustering computed with the complete expression profile and the reduced expression profile is small, suggesting that the clustering algorithms rely on a small subset of informative features.

Software
The Python-package containing code to perform the methods described in the article can be Defining U k as the first k columns of U and analogously for V, and Σ k the square diagonal matrix with the first k entries of Σ, then A k = U k Σ k V † k is the rank-k SVD approximation to A, and T = AV k = U k Σ k is a rank-k SVD truncation of A. Furthermore, we refer to matrix with only the last n − k columns of U, V and last n − k entries in Σ as U \k , V \k , and Σ \k .
The Moore-Penrose pseudo inverse of a rank r matrix A is given by The Frobenius norm ||A|| F of a matrix A is given by ||A|| F = tr (AA † ). The spectral norm ||A|| 2 of a matrix A is given by the largest singular value of A.
The Eckart-Young-Mirsky theorem (Eckart and Young, 1936) states that, for A = UΣV † the SVD of A, and B any complex matrix with compatible dimension to A and rank k, The minimizer A k is unique if and only if σ k+1 = σ k , where σ i are the respective non-increasingly ordered singular values in Σ.
A square complex matrix F is Hermitian if F = F † . Symmetric positive semi-definite (S.P.S.D) matrices are Hermitian matrices. The set of n × n Hermitian matrices is a real linear space. As such, it is possible to define a partial ordering (also called a Loewner partial ordering, denoted by ) on the real linear space. One matrix is "greater" than another if their difference lies in the closed convex cone of S.P.S.D. matrices. Let F, G be Hermitian and the same size, and x a complex vector of compatible dimension. Then, (6.14) A few simple consequences of the Loewner partial ordering are as follows. If F is Hermitian and S.P.S.D., then 0 F, where 0 is the zero matrix.
If F is Hermitian with smallest and largest eigenvalues λ min (F), λ max (F), respectively, then, Let F 1 , G 1 , F 2 , G 2 be Hermitian and the same size. Then if F 1 G 1 and F 2 G 2 , then As a simple consequence of Eqn. 6.14, consider the real matrices FF T and GG T , and the vector x which has a one in row i and a minus one in row j, and zeros elsewhere. The Euclidean distance between rows i, j with respect to G is d i,j (G): Thus, if FF T GG T , by Eqn. 6.14 with appropriate vectors, Furthermore, let F be Hermitian and dimension n, U k be a semi-orthogonal rectangular matrix The upper bound (Bound 2.9) in Theorem 2.1 follows from the fact that 0 I − SS T and the conjugation rule (Eqn. 6.16), This upper bound is true for any column selection of A. A second application of the conjugation rule gives the upper bound in Bound 2.9.
For the lower bound (Bound 2.9), consider the quantity providing the lower bound of Bound 2.9.
For Bound 2.10, the lower bound of Bound 2.9 implies, by Eqn. 6.18 and the cyclic property of the trace. Similarly, Eqn. 6.23 implies tr CC T tr AA T .
Since U k is semi-orthogonal (U T k U k = I), by Eqn. 6.21, every ordered eigenvalue of U T k CC T U k is smaller than its counterpart ordered eigenvalue of CC T . Since the trace is the sum of eigenvalues, this implies Bound 2.10, Note that if A is full rank and k = rank(A) = n, then Bound 2.9 becomes, (1 − )AA T CC T AA T . (6.27) 6.3 Proof of Bound 2.11 for random sampling.
The following theorem pertains to a new spectral bound for the square C selected by rankk subspace leverage scores and the random sampling procedure from Drineas, Mahoney and Muthukrishnan (2006).
Theorem 6.1 Let A ∈ R n×d be a matrix of at least rank k and τ i (A k ) be defined as in Eqn.
1.1. Construct C by sampling t columns of A, reweighted to 1 √ tpi a i , with probability p i = where 1() is the indicator function and γ is a small, positive, It follows immediately that ||V|| 2 = k+mγ t and tr V = k(k+mγ) t .
Then, for k+mγ t + k+mγ 3t , (6.36) Solving for t as a function of , δ, and γ gives, Bound 6.28 also holds for C selected by the DCSS algorithm, as a consequence of Bound 2.9.
Thus DCSS selects fewer columns with the same accuracy for power-law decay for Bound 6.28 when |Θ| < t.  (c) Count-Variance plot for each column of A. The color for each column represents whether the column is selected or not by k = 5, = 0.1 DCSS. The plot also shows the thresholds for count, variance, and index of dispersion with same number of selected columns as DCSS.
(d) Venn diagram comparing the overlap between selected columns between k = 5, = 0.1 DCSS, count, variance, and index of dispersion thresholding.