A copula based topology preserving graph convolution network for clustering of single-cell RNA-seq data

Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. Several issues in single-cell sequencing affect the homogeneous grouping (clustering) of cells, such as the small amount of starting RNA, limited per-cell sequenced reads, cell-to-cell variability due to cell cycle, cellular morphology, and variable reagent concentrations. Moreover, single-cell data are susceptible to technical noise, which affects the quality of the genes (or features) selected/extracted prior to clustering. Here we introduce sc-CGconv (copula based graph convolution network for single-cell clustering), a stepwise, robust, unsupervised feature extraction and clustering approach that formulates and aggregates cell–cell relationships using copula correlation (Ccor), followed by a graph convolution network based clustering approach. sc-CGconv formulates a cell-cell graph using Ccor, which is then learned by a graph convolution network, a graph-based artificial intelligence model. The learned representation (low dimensional embedding) is used for cell clustering. sc-CGconv has the following advantages: (a) it works with substantially smaller sample sizes to identify homogeneous clusters; (b) it can model the expression co-variability of a large number of genes, thereby outperforming state-of-the-art gene selection/extraction methods for clustering; (c) it preserves the cell-to-cell variability within the selected gene set by constructing a cell-cell graph through the copula correlation measure; (d) it provides a topology-preserving embedding of cells in low dimensional space.

9. Line 96: "A robust equitable correlation measure Ccor". The manuscript would greatly benefit if the authors provided a more extensive overview of the copula correlation concept. How does this method compare to other correlation metrics? A brief literature survey about the usage of this concept is also warranted to drive home the usage of this correlation measure. Which other fields use copula-based correlation and to what extent are they helpful?
Answer: We have now provided a brief explanation and overview of the usage and advantages of the copula correlation measure over other correlation measures in the revised version of the manuscript. Please see the subsection 'Copula based correlation measure (Ccor)' (page no. 11) of the revised version of the manuscript.
10. Line 111: "The result of this step is a feature matrices (Fi).." Grammatical error -"… is a feature matrix (Fi).."
Answer: Corrected in Revised Version.
11. Line 112: Which network features are inferred here? Unclear.
Answer: The node features are inferred here. The connectivity among the nodes is inferred as a low dimensional embedding using GCN.
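As an illustration of how a graph convolution layer combines node features and connectivity into a low dimensional embedding, the following minimal sketch applies one layer of the commonly used symmetric-normalization propagation rule, H = ReLU(D^-1/2 (A + I) D^-1/2 X W). The function names, the toy graph, and the random weights are assumptions made for illustration only; the actual sc-CGconv architecture is described in the manuscript and may differ.

```python
import numpy as np


def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a symmetric adjacency matrix A."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    deg = a_hat.sum(axis=1)                 # node degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(adj_norm, features, weights):
    """One graph convolution layer with ReLU activation."""
    return np.maximum(adj_norm @ features @ weights, 0.0)


# Toy example (hypothetical data): 3 cells on a path graph, 4 features each,
# embedded into 2 dimensions with random, untrained weights.
rng = np.random.default_rng(0)
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
features = rng.normal(size=(3, 4))
weights = rng.normal(size=(4, 2))
embedding = gcn_layer(normalize_adjacency(adj), features, weights)
```

Each row of `embedding` mixes a node's own features with those of its neighbours, which is why the result reflects both node features and graph connectivity.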
12. Line 120: Consider providing an explanation behind the use of Adam optimizer and ReLu as activation functions. A comment on the learning rate for the optimization procedure would be helpful.
Answer: As advised, we have now included the rationale for using the Adam optimizer and ReLU as the activation function. We used Adam because it adapts the learning rate during the optimization process. Please see the section "Training of graph convolution network on cell-cell graph" (page no. 4).
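To make the notion of an adaptive learning procedure concrete, the sketch below writes out the standard Adam update rule and the ReLU activation in NumPy. The hyperparameter defaults shown (learning rate 0.001, beta1 0.9, beta2 0.999) are the commonly used defaults and are assumptions here, not the values used for sc-CGconv; the function names are likewise illustrative.

```python
import numpy as np


def relu(x):
    """ReLU activation: element-wise max(x, 0)."""
    return np.maximum(x, 0.0)


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count.

    m and v are running averages of the gradient and squared gradient.
    The step for each parameter is scaled by its own gradient history,
    which is what makes the effective learning rate adaptive.
    """
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)          # bias correction for m
    v_hat = v / (1.0 - beta2 ** t)          # bias correction for v
    new_param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return new_param, m, v
```

On the first step the bias-corrected ratio reduces to approximately the sign of the gradient, so every parameter moves by about `lr` regardless of the gradient's scale.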
13. Line 122: Manuscript could benefit by the addition of a figure for the ROC curve, in addition to the ROC statistics in table1.
Answer: As advised, we have now included the ROC curve (validation ROC) corresponding to Table 1 as Supplementary Figure 1 of the supplementary text.
14. Table1: Table legend states -"First two columns of the table shows total number of nodes and number of edges.." whereas in the table itself the number of edges is written first, followed by the number of nodes.
18. Line 132: The correct usage of the phrase is "state of the art" and not "arts". Please rectify. Also in Table 2 (Table caption).
Answer: Corrected in Revised Version.
19. Figure 2: 'KL div' as a short form for KL divergence should be mentioned in the legend.
Answer: Corrected in Revised Version.
20. Line 157: What is the mathematical interpretation of the Clayton Copula? How does the value of theta matter? What is its physical implication towards the performance of scCGconv?
Answer: The Clayton copula is an asymmetric Archimedean copula that is well suited to capturing the dependence structure of high-dimensional datasets. Here \theta parameterizes the generator function of the copula, and the degree of dependence is adjusted by varying \theta. For the Clayton copula, \theta \in [-1, \infty) \setminus \{0\}. We have now briefly defined the Clayton copula in the subsection Clayton copula (page no. 11) of the revised version of the manuscript.
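For reference, the standard bivariate form of the Clayton copula can be written out in a few lines. The sketch below uses the textbook definition C(u, v) = max(u^-\theta + v^-\theta - 1, 0)^(-1/\theta) with generator \varphi(t) = (t^-\theta - 1)/\theta; the function names are illustrative assumptions and this is not the code used in sc-CGconv. It also shows the known relation between \theta and Kendall's tau, tau = \theta / (\theta + 2), so that a small positive value such as \theta = 0.001 corresponds to very weak dependence, close to the independence copula u * v.

```python
import numpy as np


def clayton_generator(t, theta):
    """Archimedean generator of the Clayton copula: (t^-theta - 1) / theta."""
    return (np.power(t, -theta) - 1.0) / theta


def clayton_copula(u, v, theta):
    """Bivariate Clayton copula C(u, v) for theta in [-1, inf), theta != 0."""
    if theta < -1 or theta == 0:
        raise ValueError("theta must lie in [-1, inf) and be non-zero")
    base = np.power(u, -theta) + np.power(v, -theta) - 1.0
    return np.power(np.maximum(base, 0.0), -1.0 / theta)


def kendall_tau(theta):
    """Kendall's tau implied by the Clayton parameter theta."""
    return theta / (theta + 2.0)


# Small theta: nearly independent, so C(u, v) is close to u * v.
near_independent = clayton_copula(0.3, 0.7, 0.001)
# Larger theta: stronger (lower-tail) dependence.
dependent = clayton_copula(0.3, 0.7, 5.0)
```

Comparing `near_independent` with `dependent` shows how increasing \theta pulls C(u, v) away from the product u * v and towards min(u, v).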
Here we select \theta as 0.001, as it gives better performance in our experiments.
Table 3: What do the authors mean by pubmed id of the markers? The numbers in parentheses do not refer to the NCBI gene id or Ensembl Gene Stable ID. Please provide an explanation of what nomenclature of genes is used here. Also why do all 3 CD8 T cell markers have the same ID i.e. 28622514 although they correspond to 3 different genes? Are these IDs specific to the Cell Marker database used by the authors?
Answer: The PubMed ID is a unique identifier assigned to each published article. Here it points to the published evidence that supports the marker for the particular cell type. A shared PubMed ID means that the same publication provides the evidence that all three genes are markers of CD8 T cells.
Here we have used official gene IDs to represent the gene names.
28. Line 189: What are the markers identified by scCGconv for other 2 datasets (Baron and Klein)? It would be beneficial to have a list of these markers as a Supplementary Table.
Answer: As advised, it is now added as Supplementary Table 1.

Line 209: What is meant by Human Klein and Pollen? Please explain.
Answer: These represent the datasets used here. We have now updated this line to avoid any confusion.
Answer: Feature selection within the proposed method is driven by the LSH based sampling method. The genes selected from the buckets of the LSH sampling stage are non-redundant and relevant over all the cells. This is ensured by selecting the nearest neighbour within the k-nn graph generated from the samples within one bucket. Moreover, the hash function of the LSH step ensures a structure-aware sampling of the features (genes) over all samples (cells). Thus the genes picked by the proposed method are not only relevant and non-redundant, but also preserve the properties of the whole dataset.
33. The authors don't do any intermediate dimensionality reduction (typically one would do feature selection, run PCA, and use the PCs for further computation rather than genes). The construction of the cell kNN graph is done directly in the selected feature space. This could be justified since their gene selection is supposed to remove redundant information, so running PCA might not add much. But this is another place where a formal comparison of the two approaches would be worthwhile to see. PCA and their gene selection method are both essentially trying to remove redundant information; so if one compares their approach with one where there is no feature selection and only PCA (and using PCs for further steps), what does the comparison look like? One example comparison: are the genes that contribute heavily to top PCs in PCA the same (or similar) as the genes retained by their feature selection method?
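One way the example comparison in the last sentence could be set up is sketched below: rank genes by their variance-weighted squared loadings on the top principal components and measure the overlap with the genes retained by feature selection using the Jaccard index. The function names, the loading-based ranking, and the synthetic expression matrix are assumptions for illustration; they are not part of sc-CGconv or of any analysis reported by the authors.

```python
import numpy as np


def top_pca_genes(expr, n_pcs, n_genes):
    """Indices of the genes contributing most to the top principal components.

    expr is a cells x genes matrix. Each gene is scored by the sum over the
    top n_pcs components of (singular value)^2 * (loading)^2.
    """
    centered = expr - expr.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = ((s[:n_pcs, None] ** 2) * (vt[:n_pcs] ** 2)).sum(axis=0)
    order = np.argsort(scores)[::-1]        # highest score first
    return {int(i) for i in order[:n_genes]}


def jaccard(a, b):
    """Jaccard index between two sets of gene indices."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Synthetic example (hypothetical data): genes 0 and 1 carry most variance.
rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 6))
expr[:, 0] *= 10.0
expr[:, 1] *= 8.0
pca_genes = top_pca_genes(expr, n_pcs=2, n_genes=2)
selected_genes = {0, 1}                     # stand-in for a feature-selection result
overlap = jaccard(pca_genes, selected_genes)
```

A Jaccard index near 1 would indicate that the two approaches retain largely the same genes, while a value near 0 would indicate that they capture different information.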