Cell-type annotation with accurate unseen cell-type identification using multiple references

Recent advances in single-cell RNA sequencing (scRNA-seq) techniques have stimulated efforts to identify and characterize the cellular composition of complex tissues. With the advent of various sequencing techniques, automated cell-type annotation using a well-annotated scRNA-seq reference has become popular. However, such annotation relies on the diversity of cell types in the reference, which may not capture all the cell types present in the query data of interest. Query data generally contain unseen cell types because most data atlases are generated for different purposes and with different techniques. Identifying previously unseen cell types is essential for improving annotation accuracy and uncovering novel biological discoveries. To address this challenge, we propose mtANN (multiple-reference-based scRNA-seq data annotation), a new method that automatically annotates query data while accurately identifying unseen cell types with the aid of multiple references. Key innovations of mtANN include the integration of deep learning and ensemble learning to improve prediction accuracy, and the introduction of a new metric that considers three complementary aspects to distinguish between unseen cell types and shared cell types. Additionally, we provide a data-driven method to adaptively select a threshold for identifying previously unseen cell types. We demonstrate the advantages of mtANN over state-of-the-art methods for unseen cell-type identification and cell-type annotation on two benchmark dataset collections, as well as its predictive power on a collection of COVID-19 datasets. The source code and tutorial are available at https://github.com/Zhangxf-ccnu/mtANN.

The workflow of mtANN Algorithm
• Chi-squared test. This method selects genes based on the proportion of cells in which they are expressed. Here, a gene is considered to be expressed in a cell if its expression level is greater than 1. For each cell type and each gene, the null hypothesis is that whether the gene is expressed in a cell is independent of whether the cell belongs to this cell type. The test is performed with the chisq.test function in R. The significant genes are extracted as differentially proportioned genes for the cell type.
• Bimodality index. This method identifies genes with bimodal expression patterns. For each cell type, the bimodality index (BI) [2,3] of each gene is calculated from $\mu_1$, $\sigma_1$, $\mu_2$, $\sigma_2$, and $\pi$, where $\mu_1$ and $\sigma_1$ are the mean and the standard deviation of the expression in the cells from the cell type, $\mu_2$ and $\sigma_2$ are the mean and the standard deviation of the expression in the remaining cells, and $\pi$ is the proportion of cells from the cell type. The genes with the largest BI values are extracted as high BI genes for the cell type.
• Gini index. This method has advantages in identifying rare cell-type-specific genes [4]. For each gene, the Gini index is defined as twice the area between the Lorenz curve and the diagonal. This procedure is implemented by the calGini function of the giniclust3 package in Python, which also includes a normalizing procedure to remove bias for low-expressed genes [5]. The genes with the largest Gini index values are extracted as high Gini genes.
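As an illustration of two of the gene selection criteria described above, the following is a minimal sketch in plain Python, not the mtANN implementation. The function names are hypothetical. The chi-squared test assumes a 2x2 table (expressed or not, in the cell type or not) with Yates' continuity correction, which is the default behaviour of chisq.test for 2x2 tables. The Gini function computes only the raw index and omits the normalization step that giniclust3 applies.

```python
import math


def chi2_pvalue_2x2(a, b, c, d):
    """P-value of the chi-squared independence test on [[a, b], [c, d]].

    Uses Yates' continuity correction; with 1 degree of freedom the
    survival function of the statistic is erfc(sqrt(stat / 2)).
    """
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    if 0 in rows or 0 in cols:
        return 1.0  # assumption: a degenerate table gives no evidence
    cells = ((a, rows[0], cols[0]), (b, rows[0], cols[1]),
             (c, rows[1], cols[0]), (d, rows[1], cols[1]))
    stat = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n
        diff = abs(observed - expected)
        stat += (diff - min(0.5, diff)) ** 2 / expected
    return math.erfc(math.sqrt(stat / 2.0))


def gene_chi2_pvalue(expr, in_type, cutoff=1.0):
    """Test whether 'expressed' (expression > cutoff) depends on cell type."""
    a = sum(1 for x, t in zip(expr, in_type) if x > cutoff and t)
    b = sum(1 for x, t in zip(expr, in_type) if x > cutoff and not t)
    c = sum(1 for x, t in zip(expr, in_type) if x <= cutoff and t)
    d = sum(1 for x, t in zip(expr, in_type) if x <= cutoff and not t)
    return chi2_pvalue_2x2(a, b, c, d)


def gini_index(expr):
    """Raw Gini index: twice the area between Lorenz curve and diagonal."""
    x = sorted(float(v) for v in expr)
    n, total = len(x), sum(x)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * v for rank, v in enumerate(x, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n
```

A gene expressed only in the cells of one type yields a small p-value, and a gene expressed in a single cell yields a Gini index close to 1, whereas uniform expression yields 0.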
The parameters of the resulting embedding component and decoder component are then used as the initial parameters for training the classification model.

Module III: Query dataset annotation
With each base classification model {{ , } =1 } where ̂= , and calculate where ̂̃̃=̃. Thus, the final annotation of cell is if equation (3) is greater than equation (4) or vice versa.

Module IV: Metrics for unseen cell identification
With all the base classification models, we obtain a series of prediction probabilities. For cell $i$, we integrate all the prediction probabilities into a matrix.
For the inter-model measurement $m^{(2)}$, we sum up the prediction probabilities by column and divide each sum by the number of reference subsets that contain the corresponding cell type; the result is denoted as $P^{(2)}$. For all the cells, $P^{(2)}$ is transformed into a probability matrix $\tilde{P}^{(2)}$ by dividing each value by the row sum. Finally, the inter-model measurement $m^{(2)}$ is computed from $\tilde{P}^{(2)}$.
For the inter-prediction measurement $m^{(3)}$, we obtain the base prediction labels of cell $i$ according to the prediction probabilities of all the base classification models, and $P^{(3)}$ represents the integrated result for this measure. We also transform $P^{(3)}$ into a probability matrix $\tilde{P}^{(3)}$ by dividing each value by the row sum.
Finally, the inter-prediction measurement $m^{(3)}$ is computed from $\tilde{P}^{(3)}$.
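The aggregation steps for the inter-model and inter-prediction measurements can be sketched as follows for a single query cell. This is an illustrative sketch under stated assumptions, not the mtANN code: the function names are hypothetical, each base model is assumed to return a probability vector over the union of all cell types (with 0 for types absent from its reference), the integrated result of the inter-prediction measure is assumed to be a vote count per cell type, and Shannon entropy is assumed as the final uncertainty function applied to the normalized row.

```python
import math


def row_normalize(row):
    """Divide each value by the row sum (rows summing to 0 are returned as is)."""
    total = sum(row)
    return [v / total for v in row] if total > 0 else list(row)


def entropy(p):
    """Shannon entropy (assumed uncertainty function for this sketch)."""
    return -sum(v * math.log(v) for v in p if v > 0)


def inter_model_score(probs, n_models_with_type):
    """Sum probabilities by column, divide by the number of base models
    whose reference contains each cell type, row-normalize, then score."""
    k = len(n_models_with_type)
    summed = [sum(p[j] for p in probs) for j in range(k)]
    averaged = [s / c for s, c in zip(summed, n_models_with_type)]
    return entropy(row_normalize(averaged))


def inter_prediction_score(probs):
    """Count the label predicted by each base model (assumed integration),
    row-normalize the vote counts, then score."""
    k = len(probs[0])
    votes = [0] * k
    for p in probs:
        votes[max(range(k), key=p.__getitem__)] += 1
    return entropy(row_normalize(votes))


# Two base models, three cell types; the third type is in one reference only.
confident = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0]]
ambiguous = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0]]
counts = [2, 2, 1]
print(inter_model_score(confident, counts), inter_model_score(ambiguous, counts))
print(inter_prediction_score(confident), inter_prediction_score(ambiguous))
```

Under these assumptions, a cell on which the base models disagree receives a higher score on both measurements than a cell on which they agree.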

Methods for benchmark
We compare mtANN with seven cell-type annotation methods: scmap-clust [7], scmap-cell [7], Seurat v3 [6], ItClust [8], scGCN (enrichment) [9], scGCN (entropy) [9], and scANVI [10]. scmap-clust, scmap-cell, Seurat v3, and scANVI can be applied to cell-type annotation based on multiple reference datasets. For the other three methods, which are designed to annotate cell types based on a single reference dataset, we directly combine the multiple well-annotated datasets into one reference so that all methods use the same datasets. As ItClust and scGCN are not designed for multiple reference datasets, we also use ComBat [11] to correct batch effects between the reference datasets before combining them, and compare the annotation results obtained with the corrected reference data and with the directly combined reference data. The following is a brief introduction to how each of these methods identifies unseen cells.
• Seurat v3. We first integrate multiple reference datasets using the function IntegrateData in Seurat under the default parameters, and then perform cell-type annotation with the function TransferData. In addition, Seurat v3 identifies "unassigned" cells according to the predicted probability. The smaller the probability, the more likely the corresponding cell is of an unseen cell type.
The default threshold for unseen cell type identification provided by Seurat v3 is the 20th percentile of the probabilities.
• scmap. It projects query cells onto cell types or individual cells, and these two types of methods are denoted as scmap-clust and scmap-cell, respectively. We perform cell-type annotation based on multi-reference datasets according to the vignette provided by scmap. In addition, scmap-clust and scmap-cell identify "unassigned" cells according to the integrative predicted probability.
The smaller the probability, the more likely the corresponding cell is unseen.
For identifying unseen cell types, the default thresholds provided by scmap-clust and scmap-cell are 0.7 and 0.5, respectively.
• ItClust. It identifies unseen cell types using a confidence score defined by calculating the similarity between clusters and cell types in the reference dataset. The smaller the similarity, the more likely the corresponding cluster is of an unseen cell type.
• scGCN. It proposes two metrics to identify unseen cell types, which are entropy and enrichment. Identifying unseen cell types with different metrics is regarded as two methods: scGCN (enrichment) and scGCN (entropy). The smaller the enrichment, the more likely the corresponding cell is of an unseen cell type. Conversely, the greater the entropy, the more likely the corresponding cell is of an unseen cell type.
• scANVI. The unseen cells are identified based on the prediction probability.
The smaller the probability, the more likely the corresponding cell is of an unseen cell type.
For all the methods, the data preprocessing procedures follow the original manuscripts, and all parameters are set to their default values.
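Most of the methods above share the same rejection rule: a cell whose confidence score falls below a threshold is labeled "unassigned". The sketch below illustrates this rule with a fixed threshold and with a percentile-based threshold. It is not taken from any of the listed packages; the function names are hypothetical, and the nearest-rank percentile is one convention chosen here for simplicity.

```python
import math


def assign_with_rejection(labels, scores, threshold):
    """Keep a predicted label only if its score reaches the threshold."""
    return [label if score >= threshold else "unassigned"
            for label, score in zip(labels, scores)]


def percentile_threshold(scores, q):
    """Nearest-rank q-th percentile of the scores (0 < q <= 100)."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


labels = ["T cell", "B cell", "NK cell", "T cell", "B cell"]
scores = [0.95, 0.40, 0.80, 0.60, 0.99]

# Fixed threshold, e.g. the 0.7 default reported for scmap-clust.
print(assign_with_rejection(labels, scores, 0.7))

# Data-dependent threshold at the 20th percentile of the scores.
print(percentile_threshold(scores, 20))
```

For metrics where larger values indicate an unseen cell type, such as the entropy used by scGCN (entropy), the comparison would be reversed.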

Performance assessment
We use the area under the precision-recall curve (AUPRC) to evaluate the performance of the unseen cell-type identification metrics of different methods. The x-axis and y-axis of the precision-recall curve are recall and precision, which are calculated as
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
where $TP$ is the number of cells identified as "unassigned" that belong to a truly unseen cell type, $FP$ is the number of cells identified as "unassigned" that do not belong to a truly unseen cell type, and $FN$ is the number of cells of a truly unseen cell type that are not identified as "unassigned".
At a fixed threshold, we use the F1 score to compare the accuracy of each method in identifying unseen cell types in the query dataset. The F1 score is calculated as
$$F1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
To compare the performance of mtANN with other methods in annotating the entire query dataset, we use accuracy, which is defined as the proportion of correct predictions in the query dataset:
$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} I(y_i = \hat{y}_i),$$
where $N$ is the number of query cells, $I(\cdot)$ is the indicator function, $y_i$ is the true cell type of the $i$-th cell (a cell belonging to a truly unseen cell type is labeled as "unassigned"), and $\hat{y}_i$ is the predicted cell type of the $i$-th cell (a cell predicted to be of an unseen cell type is labeled as "unassigned").
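The fixed-threshold metrics can be computed directly from the true and predicted labels. The following is a small sketch with a hypothetical function name; it assumes that both label lists use "unassigned" for unseen cell types, and it returns 0 for a metric whose denominator is zero, which is a convention chosen here rather than one stated in the text.

```python
def unseen_detection_metrics(true_labels, pred_labels, unseen="unassigned"):
    """Precision, recall and F1 for unseen-cell detection, plus overall accuracy."""
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, pred_labels):
        if pred == unseen and truth == unseen:
            tp += 1  # correctly flagged as unseen
        elif pred == unseen:
            fp += 1  # flagged as unseen but of a shared cell type
        elif truth == unseen:
            fn += 1  # unseen cell that was assigned a reference label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    accuracy = correct / len(true_labels)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


truth = ["T", "B", "unassigned", "unassigned", "T", "unassigned"]
pred = ["T", "unassigned", "unassigned", "B", "T", "unassigned"]
print(unseen_detection_metrics(truth, pred))
```

In this toy example there are two true positives, one false positive, and one false negative, so precision, recall, and F1 all equal 2/3, and four of the six cells are annotated correctly.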
To investigate whether combining the three complementary measurements of uncertainty provides superior performance compared to using a single evaluation metric, we run mtANN with four different settings: using only one of the three metrics $m^{(1)}$, $m^{(2)}$, and $m^{(3)}$, and using the combination of the three metrics ($m$) for determining unseen cell types. To compare the performance of each unseen cell-type evaluation metric more clearly, we calculate an Accuracy Ratio (AR) index, which is defined as the ratio of the number of tests in which one setting outperforms another setting to the number of tests in which it performs worse. Let $a_t$ represent the AUPRC or accuracy of one setting in test $t$, and let $b_t$ denote the AUPRC or accuracy of the other setting in the same test.
The AR index is calculated as
$$\mathrm{AR} = \frac{\sum_t I(a_t > b_t)}{\sum_t I(a_t < b_t)}.$$
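Following the verbal definition of the AR index, a minimal sketch of its computation is given below. The function name is hypothetical. Tied tests are counted in neither the numerator nor the denominator, and the value returned when one setting never performs worse (infinity) is an assumption of this sketch.

```python
import math


def accuracy_ratio(scores_a, scores_b):
    """Number of tests where setting A beats B, divided by the number
    of tests where A is worse than B. Ties are ignored."""
    wins = sum(a > b for a, b in zip(scores_a, scores_b))
    losses = sum(a < b for a, b in zip(scores_a, scores_b))
    if losses == 0:
        return math.inf  # assumption: undefined ratio reported as infinity
    return wins / losses


# AUPRC of two settings over four tests: two wins, one loss, one tie.
combined = [0.90, 0.80, 0.70, 0.60]
single = [0.80, 0.85, 0.70, 0.50]
print(accuracy_ratio(combined, single))
```

An AR value greater than 1 indicates that the first setting outperforms the second in more tests than it underperforms.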