SCMarker: Ab initio marker selection for single cell transcriptome profiling

Single-cell RNA-sequencing data generated by a variety of technologies, such as Drop-seq and SMART-seq, can reveal simultaneously the mRNA transcript levels of thousands of genes in thousands of cells. It is often important to identify informative genes or cell-type-discriminative markers to reduce dimensionality and achieve informative cell typing results. We present an ab initio method that performs unsupervised marker selection by identifying genes that have subpopulation-discriminative expression levels and are co- or mutually-exclusively expressed with other genes. Consistent improvements in cell-type classification and biologically meaningful marker selection are achieved by applying SCMarker on various datasets in multiple tissue types, followed by a variety of clustering algorithms. The source code of SCMarker is publicly available at https://github.com/KChen-lab/SCMarker.


Introduction
Current single-cell RNA-sequencing (scRNA-seq) data generated by a variety of technologies such as Drop-seq and SMART-seq, can reveal simultaneously the mRNA transcript levels of thousands of genes in thousands of cells [1][2][3]. However, the increased dimensionality makes it challenging to delineate cell types, due to complex and often undefined associations between individual genes and cell-types [4,5]. It is well accepted that genes are not equally informative in delineating cell types [6,7]. Certain genes are only expressed in certain cell types, but not others [8]. Moreover, the expression levels of certain genes cannot be robustly measured (e.g., zero inflated), due to technological bias [9][10][11]. Thus, it has become a common practice to retain only highly expressed or highly variable genes for cell population analysis [12][13][14][15]. Several scRNA-seq data clustering packages (S1 Table) perform marker selection through dimensionality reduction techniques such as principal component analysis and tSNE [16], which are equivalent to identifying the set of highly variable genes. Unfortunately, the biological implications and the technical optimality of these gene selection strategies retain unclear, despite their wide use in cell-type clustering.
Here, we propose an ab initio method, named SCMarker, which applies information-theoretic principles to determine the optimal gene subsets for cell-type identification, without referencing to any known transcriptomic profiles or cell ontologies. The central idea of our method is to select genes that are individually discriminative across underlying cell types, based on a mixture distribution model, and are co-or mutually exclusively expressed with some other genes, due to cell-type specific functional constraints. Although the techniques of applying a mixture distribution model for a set of continuous data points have been widely used in clustering analysis of gene expressions, it is unclear whether this approach can benefit this problem context [17,18]. In particular, because single-cell gene expression measurements have vast dimensions (>20,000 genes), are highly noisy (e.g., zero-inflated, drop-off errors), and are generated by technologies of varied properties [19,20]. For example, SMART-seq is aimed at sequencing the entire RNA transcript, while Drop-seq only the 3' end using unique molecular indices (UMI) to track individual transcript [2,21,22]. Part of our investigation here is to examine whether the previously applied data analytical techniques can be reapplied in the single-cell data-type that have different properties and population structures. Our main goal is to identify not only cell-types, but also biologically meaningful cell-type markers from scRNAseq data at accuracies higher than results derived using canonical gene selection strategies.

Discriminativeness of gene expressions for subpopulation clustering
By definition, cell-type-discriminative markers should have distinctive expression levels across cell subpopulations. Therefore, in a dataset with mixed cell subpopulations, the expression level of a marker should follow a bi-or multi-modal, instead of a unimodal distribution (Fig  1A and 1B) [23][24][25]. Following this assumption, we quantify the degree of modality based on the probability density distribution (f) of each gene expression using a Gaussian kernel function, instead of a mixture model which requires knowing the number of mixture components: where g j is the expression level of gene g in cell j, L is the number of cells and h>0 is a smoothing parameter called bandwidth, which can be leveraged to alleviate biases introduced by uneven sequencing depths.
is a standard Gaussian density function. The optimal value for h can be calculate from where the standard deviation s ¼ ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi , g i is the expression level of gene g in cell i and � g is the average expression level of g across all the cells [26]. We estimated h from the datasets used in this study [27][28][29][30] and obtained a mean h = 0.3 on the datasets generated by the SMART-seq platform and 0.05 on the datasets generated by the Drop-seq platform. We set these as the default values for subsequent analyses. For each gene, we count the number (T) of peaks in the estimated probability density functionf h ðgÞ. A peak is found at the density value c, if there exists a 2 times h long interval I centred at c such thatf h ðgÞ �f h ðcÞ for all g in I. A gene expression level follows a multi-modal distribution, if it has multiple (T�2) local maximum probability density values. Only genes with multimodal probability density distributions are considered as markers.

Co-or mutually exclusively expressing gene pairs (CMEGPs)
Cell-type-discriminative markers are often co-or mutually exclusively expressed, due to modularized regulatory interactions specific to cell types [31]. Consequently, identifying these CMEGPs, will help identify markers which delineate cell type. Because scRNA-seq data are often sparse with limited sequencing depth [32], binarization of the counts would help mitigate technical artifacts and improve robustness over different sequencing platforms (e.g., whole transcript vs 3' sequencing protocols). To identify CMEGPs, we only consider genes with multimodal distribution and discretize a gene-cell expression matrix with N genes and L cells into an N×L binary matrix X2{0,1} N×L , with x ij = 1 designating an expressed gene i in cell j, if the expression level is above the average and x ij = 0 otherwise (Fig 1C).
x iL ) is a binary string. We can calculate a co-occurrence matrix (S) that measures the pair-wise co-occurrence between all the gene pairs, S can also be represented as a directed graph (Fig 1D), in which a node denotes a gene and an edge from gene A to gene B represents that B co-occurs with gene A in at least n cells. Among the connected nodes, the k genes that co-occurred with A in k largest sets of cells are termed the k-nearest co-occurrence neighbours (KNCONs).
In addition, we calculate a mutually exclusive matrix (M) that measures the pair-wise mutual exclusivity between all the gene pairs through Eq (4), where M represents a directed k-nearest mutually exclusive neighbour (KNMEN) graph ( Fig  1D). Similar to KNCON, but in opposite ways, the KNMENs of a gene A are the k genes that occur mutually exclusively with A in k largest sets of cells. The detailed algorithm for constructing these graphs is shown in S1 Fig. Under these definitions, an CMEGP is identified as two genes bi-directionally connected in the KNCEN or the KNMEN graph. We selected as markers the genes that belong to at least one CMEGP, because these genes are more likely associated with cell-type specific functions than those that do not have any CMEGP (likely due to random, non-function-related fluctuation). This concept has been previously examined in the RNA microarray data analysis, but has not been successfully applied in the context of single-cell RNA-seq data analysis, due to vastly different properties between the technologies [17,18].

Comparison with other marker selection strategies
We applied SCMarker to the scRNA-seq data obtained from 1) 19 melanoma patients, which include 4,645 cells; and 2) 18 head and neck cancer patients, which include 5,902 cells sequenced by the SMART-seq2 platform [27,28]. In the original studies, each cell in the sets was labelled as a malignant or non-malignant cell through copy number analysis. The expression levels of known marker genes were used to further classify the non-malignant cells, such as T cells, B/plasma cells, macrophages, dendritic cells, mast cells, endothelial cells, fibroblasts, and myocytes. We found that most known marker genes (96%) demonstrated bi/multi-modal distributions across cells (S2 Fig). Overall, around 6% of genes with bi/multi-modal distributions are identified as marker genes, among which half are the known marker genes.
We assessed SCMarker results with those obtained under two canonical strategies: selecting genes with A) the highest average expression levels and B) the highest variance across cells. In our experiments, the highest variable genes were determined using Seurat [12]. We used five clustering methods: k-means, Clara, hierarchical clustering, DBSCAN and Seurat to cluster single cells based on the selected markers [33,34]. Same numbers of clusters are specified for DBSCAN, k-means, Clara, and hierarchical clustering. The adjusted rand index (ARI), which measures the similarity of two sets of clustering results, was used to quantify the consistency between the clustering results and the known cell labels [35]. Compared to marker sets A and B, selected by the canonical strategies, the marker set selected by SCMarker (equal numbers of markers) resulted in a higher ARI with fairly evident margins (Fig 2). The conclusion appeared to be robust over a range of k and n parameters and were unaffected by using different clustering methods (S3 Fig).These experiments indicated that setting k between 100 and 300 resulted in the most accurate cell type identification results irrespectively to n (S4 Fig). Hence, we select k = 300 and n = 30 as the default parameters for applying SCMarker.
We obtained 902 markers from the melanoma data and more distinguishable cell types using SCMarker than using the canonical strategies (Fig 3A to 3C, S1 Data). Better performance of SCMarker was also obtained in analysing the head and neck cancer data (S5 Fig). Moreover, the genes selected by SCMarker had substantially higher degrees of overlap with the known cell-type markers reported in the original publications than the sets returned by other approaches (the same number of 902 top scoring genes were selected for fair comparison), including the "FindMarker" approach in Seurat (Fig 3D to 3G). Notably, SCMarker selected significantly more immune cell surface markers specific to T cytotoxic, T helper, B lymphocyte, and macrophage cells that are likely present in the tumour microenvironment, as indicated by gene set enrichment analysis (S6 Fig) [36].

Application of SCMarker to 3' UMI count data
To avoid introducing biases due to potential overfitting and assess the utility of our approaches on other platforms, we further assessed the utility of SCMarker in analysing the 3' UMI count scRNA-seq data generated by the droplet platforms. We determined the optimal values of k and n under various sample sizes via resampling the melanoma data (S7 Fig). When the sample size was 500, the optimal values were k = 100 and n = 20 (or 30). When the sample size was 2,500 or 25,000, the optimal values were k = 300, n = 30, the same as what we obtained from the original dataset (4,645 cells). When sample size increased to 50,000, the optimal values increased to k = 400, n = 100. Overall, although the sample size did affect the optimal choices of k and n, their influences were relatively modest.

Fig 2. Comparison of 3 marker selection methods for cell-type identification over a range of k (from 50 to 1000 at a step-size of 50) and n (from 5 to 100 at a stepsize of 5).
Accuracy of cell-type identification (in terms of adjusted rand index) are compared across 3 marker sets selected respectively by SCMarker, the highest expressed and the highest variable gene approaches, using two scRNA-seq datasets from (A) melanoma and (B) head-and-neck cancer samples by 5 clustering algorithms: k-means, Clara, hierarchical clustering (hc), DBSCAN, and Seurat. We first analysed a set of 5,602 cells from the cerebellar hemisphere of normal brain tissues generated by Drop-seq under k = 300 and n = 30 [29]. SCMarker selected 699 genes as markers, which differentially expressed across cell subpopulations under the default parameters (S1 Data). Alternatively, the default mode of Seurat led to the selection of 6,111 highest variable genes (HVGs). For comparison, we selected 699 highest expressed genes (HEGs). Although SCMarker selected less markers than Seurat, the clustering result showed a clearer separation than that based on the Seurat HVGs and on the HEGs (Fig 4A to 4C). In particular, SCMarker successfully delineated Purkinje neurons into purk1 (cluster4, Fig 4A) and purk2 (cluster7, Fig 4A) and recapitulated the differential levels of SORCS3 between two clusters (Fig 4D), which are consistent with the results in the original paper. In contrast, although the Purkinje neurons were clustered into four groups by Seurat (Fig 4B), purk1 and purk2 were not well separated (Fig 4B), and the expression levels of SORCS3 showed mosaic patterns across the 4 groups (cluster4, 6, 11 and 12, Fig 4E). As additional controls, we performed clustering using the top 500 and 1000 Seurat HVGs. That did not result in any improvement (S8 Fig).We then analysed the scRNA-seq data of 52,698 cells from 5 lung tumours generated by the 10X Chromium platform (10X Genomics) under k = 400 and n = 100 [30]. SCMarker identified 950 markers under the parameters (S1 Data), while Seurat identified 1,832 HVGs. We also selected 950 HEGs for comparison. SCMarker led to 23 clearly distinguishable clusters, while the Seurat HVGs led to 12 and the HEGs led to 19 (Fig 5A to 5C). The 10 highest expressed markers per cluster derived by SCMarker showed a high degree of cluster-specificity in the heatmap (Fig  5D). Among the selected markers were the 17 known markers reported by the original study ( Table 1). SCMarker also discovered multiple putative subtypes for some cell types, such as the T, B and myeloid cells (Table 1, Fig 5A). For example, cluster 4 and 23 are the B cells expressing known surface marker CD79A (Fig 5A), yet cells in cluster 4 are evidently different from cells in cluster 23, due to differential IGHG1 and BANK1 expression levels (Fig 5D). For comparison, we selected the 10 highest expressed genes from the clusters determined by the Seurat HVGs and by the HEGs, respectively (Fig 5E and 5F). They appeared non-specifically distributed across clusters (Fig 5E and 5F). These genes also contained fewer known markers ( Table 1). For example, cluster 3 determined by the Seurat HVGs contained markers (CLDN5, CAV1 and IFITM3) from 3 cell-types (endothelia, alveolar and B cell, respectively). Most clusters expressed IFITM3, except for clusters 1 and 6 (Fig 5E). Only T cell and fibroblast markers appeared to be cluster-specific. As additional controls, we also performed analysis using fewer (i.e., 500 and 1000) Seurat HVGs. That resulted in worse results with fewer known markers and marker-specific clusters (S9 Fig). Overall, SCMarker demonstrated higher sensitivity and specificity for cell type and celltype specific marker identification than the alternative approaches. Moreover, markers selected by SCMarker were more significant among genes which were identified by Seurat to define clusters (S10

Discussion
In this manuscript, we reported a new bioinformatics tool, SCMarker, which performs ab initio cell-type discriminative marker selection from scRNA-seq data. SCMarker operates based on two new information-theoretic metrics: 1) bi/multi-modal distribution of subpopulationdiscriminative gene expression in mixed cell populations and 2) co-or mutually-exclusively expressing gene pairs, which quantifies populational structural properties intrinsic to singlecell RNA-seq data. We found that SCMarker can consistently significantly boost cell-type identification accuracy in datasets from a variety of tissues such as cancer and brain, generated by both SMART-seq and Drop-seq platforms. Because SCMarker does not depend on any prior knowledge, we anticipate that it will prove most useful in discovery settings for analysing cell populations of a high degree of plasticity and heterogeneity [37]. SCMarker can potentially be expanded to analyse other types of single-cell data, including mass cytometry and single cell ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) data [38,39]. It can be easily incorporated as a module into current scRNA-seq data analysis workflows to preprocess the cell-gene count/expression matrix before performing further downstream analysis.
Supporting information S1  Heatmaps of the average expression levels of the 10 highest expressed genes per cluster identified respectively by SCMarker (D), the highest variable genes (E) and the highest expressed genes (F). Cell types in (A) to (C) are labelled based on the known cell-type specific markers (Table 1), which are highlighted in red boxes in (D) to (F).