RgCop-A regularized copula based method for gene selection in single-cell RNA-seq data

Gene selection in unannotated large single cell RNA sequencing (scRNA-seq) data is important and crucial step in the preliminary step of downstream analysis. The existing approaches are primarily based on high variation (highly variable genes) or significant high expression (highly expressed genes) failed to provide stable and predictive feature set due to technical noise present in the data. Here, we propose RgCop, a novel regularized copula based method for gene selection from large single cell RNA-seq data. RgCop utilizes copula correlation (Ccor), a robust equitable dependence measure that captures multivariate dependency among a set of genes in single cell expression data. We formulate an objective function by adding l1 regularization term with Ccor to penalizes the redundant co-efficient of features/genes, resulting non-redundant effective features/genes set. Results show a significant improvement in the clustering/classification performance of real life scRNA-seq data over the other state-of-the-art. RgCop performs extremely well in capturing dependence among the features of noisy data due to the scale invariant property of copula, thereby improving the stability of the method. Moreover, the differentially expressed (DE) genes identified from the clusters of scRNA-seq data are found to provide an accurate annotation of cells. Finally, the features/genes obtained from RgCop is able to annotate the unknown cells with high accuracy.

4. In the Relevancy test, genes need to calculate Ccor score with class labels. What does "class label" mean? Is RgCop a supervised method to select genes? In my opinion, Kendall tau is used to measure ordinal association. Are class labels ordered?
Answer: RgCop is a supervised method for gene selection as it selects the initial feature by computing correlation between features and class labels. Here the class labels are the cell types of the original single cell datasets. For defining Kendall tau we used copula, we don't need to order the labels. Kendall's tau (τ ) is the measure of concordance between two variables; defined as the probability of concordance minus the probability of discordance. The concordance function (Q) is the difference of the probabilities between concordance and discordance between two vectors (X1, Y 1) and (X2, Y 2) of continuous random variables with different joint distribution H1 and H2 and common margins F and G. It can be proved that the function Q is depend on the distribution of (X1, Y 1) and (X2, Y 2) only through their copulas. According to Nelson et. al. 2007, there is a relation between Copula and Kendall tau that can be expressed as: (1) where E c denotes the expectation with respect to the copula distribution function C.
5. In the case study of PBMC68k, the ARIs are quite low across all methods. This may be caused by the low gene numbers.
Answer: As suggested, we have now selected the top 1000 genes from PBMC68k. The updated ARI is given in figure 5, panel-E of the revised manuscript.
6. The authors split data in training and test set to prove RgCop's predicting ability. This result is based on analysis within the same dataset. However, scRNA-seq data may be affected by batch effect. Can RgCop be trained in dataset and predicted accurately in another dataset? Is RgCop influenced by batch effect when it is utilized in a merged data? In addition, it seems RgCop can only predict cells from same tissues. Of note, the proposed method is a wrapper-based step wise feature selection method, so it is obvious that it takes more time than any filter-based feature selection approach (e.g. CV2 Index, Gini Clust). However from the table-5 it can be noticed that the execution time is almost comparable with the other state-of-the-arts.
2 Answer of Reviewer 2

General Comments
Informative gene selection is a critical step for scRNA-seq data clustering analysis. In this manuscript, Lall et. al present a new tool called RgCop for informative gene selection from scRNA-seq data. RgCop utilizes copula correlation (Ccor), a robust equitable dependence measure that captures multivariate dependency among a set of genes. The concept of this method is novel. However, some limitations and errors are found during my reading. My concerns are detailed below.

Major comments
1. In the abstract and Fig 1, the description of RgCop give the reader the impression that RgCop can start from unlabeled (unannotated) scRNA-seq dataset to select informative genes. However, in the later part, it seems class labels or in other words, cell annotation is required for RgCop(line 283).What is the point to select genes for a labeled(annotated) dataset? For example, selecting a small number of genes for targeted single-cell RNA-seq for a larger cohort or for probe based spatial transcriptomics. The authors need to clarify what is their goal, otherwise the description is quite confusing.
Ans: RgCop is a supervised gene selection technique and it needs class labels/ cell annotation. However, we can use RgCop for classifying new (unannotated) cell samples. To avoid confusion regarding the applicability of RgCop we have now performed two analyses. First, we show how the selected genes using RgCop are important for discriminating the unknown cell samples. In the second analysis we show how RgCop would be effective on data of several batches. The selected genes from one batch can be applied to cluster the data of another batch of unknown type. Please see the subsection 'Classifying test samples using the selected feature' (page no. 08) for the first analysis and subsection 'RgCop is robust for data with different batches' (page no. 11) for the second analysis.
2. If selecting minimum marker gene combination is the only goal, I suggest the authors to compare the performance of RgCop with recently polished method called NS-Forest version 2.0(Aevermann, Brian D et.al, Genome Research, 2021). If the authors want to apply RgCop to select genes for clustering unannotated scRNA-seq dataset, then please provide the code and modify the text correspondingly.
Answer: Our goal is to select genes for clustering/discriminating the cells of unknown type. Later we find out cluster specific DE genes which can be treated as markers to annotate the cell clusters.
As advised by the reviewer we have now modified the text of the subsection 'marker analysis' and included two analyses (see subsections 'Classifying test samples using the selected feature' and 'RgCop is robust for data with different batches') to avoid any confusion about the applicability of the proposed method. The code of the analysis are provided in the github page.
3. Nowadays, it's normal that the cell number of a single scRNA-seq experiment ranges from 10 4 to 10 6 . So, scRNA-seq data with a sample size of 68k cannot be regarded as "ultra" large data. It is important to evaluate the performance of RgCop with a dataset with a sample size larger than 500k, or the number of labels larger than 50.

Minor comments
4. In the proof of correctness, the authors claim that the gene set G s obtained from RgCop is optimal. However, there is a error. In line 320, is still valid. And G s = G s in this situation. Thus, you cannot claim gene set G s obtained from RgCop is optimal. Correct me if I am wrong.
Answer: We claim k > (i + 1) and for this by the definition of the objective function we can get In line 387 (in the revised manuscript) we explained why k cannot be equal to i, and so f (g k ) = f (g i ).
If it is then G s will contains redundant genes. 8. Please check if legends are consistent with the text. For example, in Fig.3 legend, it says "Each box represents ten ARI scores of clustering results for selected 10 sets of features ranging from 10 to 100", while in text, it says "We vary the number of selected features in the range from 20 to 100 and compute the ARI scores for each method". Which one is correct? From 10 to 100 or from 20 to 100?
Answer: As advised we have now corrected the text and the legends.
9. In line 184, it says "Four widely used classifiers" but only three classifiers are listed. It is confusing that the column names of the table2 are ARI and three classifiers. Please make the table2 easy to understand.
Answer: It is now corrected. Please see tble-4 of the revised version.
10. Some contents in text and supplementary are the same. For example, the same pseudo-code of RgCop is presented in both text and supplementary. Please remove the redundancy.
Answer: As advised we have now removed the redundancy.
11. In line 241, it says Yan dataset consists of a transcriptome of 124 individual cells. But table3 shows Yan dataset consists of 90 cells. Which one is correct?
Answer: The raw data consists of 124 individual cells. We downloaded the processed data from https://hemberg-lab.github.io/scRNA.seq.datasets/human/edev/, which consists of 90 cells. This processed data is also utilized in the SC3 clustering work by Kiselev, et al. SC3: consensus clustering of single-cell RNA-seq data. Nat Methods 14, 483-486 (2017). It is now corrected in the main text.
12. In line 247, the GEO accession number is written as "no GSM1832359". Please correct it and other GEO accession numbers.
Answer:It is now corrected 13. A spelling error happened in line 111. Word "doenstream" should be "downstream".
Answer:It is now corrected 3 Answer of Reviewer 3

General comments
In this study, the authors developed RgCop, a gene selection method based on regularized copula correlation for scRNA-seq data analysis. RgCop utilizes copula correlation to find the most informative genes from scRNA-seq datasets. In order to filter redundant genes, RgCop employs an l1 regularization term to punish the redundant coefficients of genes. The authors proved that RgCop is superior to other state-of-the-art methods in clustering cells, capturing the dependence between noisy data features, and annotating unknown cells. The main opinions and concerns we raised are listed below:

Major comments
1. The author has provided demo data and RgCop software in Github, which allows users to use the demo data to test RgCop. But the author does not seem to mention how to use RgCop to analyze other scRNA-seq data. In order for the algorithm to be widely used, RgCop needs to allow users to enter their data and adjust parameters.
Answer: As suggested we have now reorganized our Github page with proper documentation. We have also created an R package that can be easily installed using devtools. Please see our Github page for reference.
2. In lines 121 and 122, the authors mentioned "The features are selected by several supervised feature selection algorithm and the classification accuracy are compared with RgCop." What algorithm was applied to select features?
Answer: We compared the performance of RgCop with four widely used supervised feature selection techniques:CMIM, JMIM, DISR , MRMR. It is now updated in the revised version.
3. Based on the Wilcoxon rank-sum test p-value of each cluster, the first five differentially expressed genes are selected as marker genes. These genes will be used to annotate the corresponding clusters. The annotation of each cluster can be directly affected by the marker gene. Therefore, how to select marker genes in scRNA-seq data analysis is a very difficult problem. Why did the authors decide to use the first five genes as marker genes?
Answer: The DE genes are ranked based on their p-value and first five are selected as cluster specific markers. We follow the standard approach of Seurat and scanpy which are well-known pipeline for scRNA-seq analysis. One can use any number of DE genes based on the p-value scores.
4. In line 145, the authors selected 100 characteristic genes to cluster the cells. They then used the Adjusted Rand Index (ARI) to evaluate the clustering performance between the predicted cluster and the known group. Do the top 100 features make RgCop the best performance? Is this number of features consistent in the analysis of different scRNA-seq data? Figure 2A shows that the variation of the ARI value in the Muraro dataset is greater than the Yan and Pollen datasets during the process of changing the number of features from 10 to 100. The Muraro data set contains 2126 instances, which is more than the Yan and Pollen datasets. Does this mean that the size of the feature will significantly affect the performance of RgCop in the datasets with a large number of cells?
Answer: As suggested by the reviewer (and also by other reviewers), we evaluate the performance of RgCop and other competing methods by selecting 1000 genes. We have now updated the clustering results of the competing methods ( Figure 2) by varying the number of selected features in the range from 500 to 1000 and compute the ARI scores for each method. There is no optimal number of selected features for RgCop (and for other competing methods as well). The number of selected features is user choice. However, with more number of selected features (here more than 500), ARI performance gets better.
To know the performance of RgCop in data with ultra-large feature size we used mouse single-cell RNA-seq hippocampus dataset from [Saunders, Arpiar, et al. "Molecular diversity and specializations among the cells of the adult mouse brain."Cell 174.4 (2018): 1015-1030.] consisting of 690k cells of Adult Mouse Brain. The results shows high ARI score (0.78) which proves the efficacy of RgCop for the dataset with an ultra-large number of cells. Please see subsection 'A case study on ultra large scRNA-seq data' (page no. 11) of the revised version of the manuscript.
5. In the section "Classification performance on real dataset using supervised method". Four classification methods were mentioned, but only three were presented. What method was used to measure the accuracy? What is the conclusion of this section?
Answer: We have updated the section by mentioning three classifiers: Neural Network (NNET), Support Vector Machine (SVM), and Gradient Boosting Machine (GBM) to measure the accuracy. . The section represents comparison of RgCop with the benchmark supervised feature selection techniques. The selected features are validated by using three classification techniques. The takeout of this section is that Rgcop outperforms the benchmark for selecting genes for the scrn-seq data. We have now mentioned in the subsection 'classification performance on real dataset using supervised method', of the revised version of the manuscript.
6. In the abstract, the authors claim that "the differentially expressed (DE) genes identified from the clusters of scRNA-seq data are found to provide an accurate annotation of cells". However, in the "Marker gene selection" section, the authors only described the differentially expressed genes of each dataset. Based on the result, how to conclude that RgCop can provide an accurate annotation of cells?
Answer: Please note that we follow the standard pipeline/procedure of Seurat/Scanpy for marker gene analysis. The procedure find out the cluster specific DE genes which will be treated as biological markers of the cells within the cluster. To know the annotation of cells using markers (cluster specific DE genes) we have now performed one experiment. We matched the DE genes with experimentally verified cell markers downloaded from CellMarker [Zhang et al., CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 2019 Jan 8;47(D1):D721-D728.] and annotate the cell clusters with specific type. As we know the labels of cells, we can verify the annotations with the cell labels and observe good annotations (high accuracy). We have now discussed this in the last paragraph of the subsection 'Marker gene selection'

Minor Comments
In the bottom panel of figure 1, the black text with a dark blue background is hard to read.
Answer: The figure-1 is now updated.