Iterative point set registration for aligning scRNA-seq data

Several studies profile similar single cell RNA-Seq (scRNA-Seq) data using different technologies and platforms. A number of alignment methods have been developed to enable the integration and comparison of scRNA-Seq data from such studies. While each performs well on some of the datasets, to date no method was able to both perform the alignment using the original expression space and generalize to new data. To enable such analysis we developed Single Cell Iterative Point set Registration (SCIPR) which extends methods that were successfully applied to align image data to scRNA-Seq. We discuss the required changes needed, the resulting optimization function, and algorithms for learning a transformation function for aligning data. We tested SCIPR on several scRNA-Seq datasets. As we show it successfully aligns data from several different cell types, improving upon prior methods proposed for this task. In addition, we show the parameters learned by SCIPR can be used to align data not used in the training and to identify key cell type-specific genes.

: Comparison of features and properties of various scRNA-seq alignment methods. The "Corrects input?" column refers to whether the method actually aligns (transforms) the input data batches in order to integrate them. The "Maintains semantics?" column refers to whether the output of the method retains the gene semantics given as input. The "Generalizable?" column refers to whether the method learns a model which can be applied to new data. The "Transfers labels?" column refers to whether the method also explicitly aims to apply the cell type labels of one data batch onto another, unlabeled batch. * ScAlign is theoretically able to be applied on new data, as it learns a neural network embedding model, but the ability to save and load the function in different sessions to apply it on new data was not available in software at the time of testing.
to identify key genes related to the cell types being analyzed. . 93 We used three datasets to test and compare two versions of SCIPR to prior alignment methods (Methods).

94
These comparisons were performed by testing the methods on several "alignment tasks".  Figure 1: Summary of steps in iterative point set registration for scRNA-seq data. Each cell in an scRNA-seq dataset can be viewed as a point in high dimensional space. 1) We start with two unaligned batches (sources, blue and targets, orange). 2) A matching algorithm (e.g. picking the closest corresponding point, or using mutual nearest neighbors) is used to pair source cells from A with a corresponding target cell in B. The number of source and / or target cells matched can vary for different matching strategies. 3) Based on the selected pairs, a global transformation function is learned so that source cells in A become closer to their paired cell in B. 4) The learned transformation is next applied to all points in A. 5) This process (steps 2-4) is repeated, iteratively aligning set A onto B until the mean distance between the assigned pairs of cells no longer improves. 6) The final global transformation function is the composition of the functions learned in each iteration at step 3.
Figure 2: Quantitative scoring of alignment methods on benchmark datasets. Each row of subplots are tasks from the same dataset, where each column uses a different source batch (all are aligned to the same largest reference batch). The scores are iLISI (green, batch integration score), and cLISI (orange, cell type mixing score). In each subplot the methods are ordered from top to bottom in order of largest difference (median iLISI -median cLISI) of scores. The center of each box is the median, and whiskers represent 1.5 times the IQR past the low and high quartiles. at the same dataset, on the CELseq2→10x task the other methods such as ScAlign (iLISI: 1.00, cLISI: 1.00) 114 or SeuratV3 (iLISI: 1.51, cLISI: 1.00) are also able to avoid cell type mixing, but are not able to mix the 115 batches as much as SCIPR ( Figure 2). Full alignment quantitative scores for these tasks and all others in the 116 paper are listed in Table S5. These quantitative metrics are also corroborated by a qualitative assessment 117 ( Figure 4). There we can see that both SCIPR-gdy and SCIPR-mnn (top two rows) mix the batches well 118 (1st and 3rd columns) compared to methods like MNN and ScAlign while successfully keeping cell types 119 separate (2nd and 4th columns). 120 Figure 4: Embedding (t-SNE) visualization from alignment tasks on the CellBench dataset using various alignment methods. Each row is a different alignment method (the bottom row is with no alignment). The columns are in two groups based on alignment task: the left two columns pertain to aligning the CELseq2 batch onto the 10x batch, the right two columns are for aligning the Dropseq batch onto the 10x batch. The first and third columns are colored by batch, and the second and fourth columns are colored by cell type. In each alignment task (rows), a different cell type is completely held-out from the source set. The model is then fitted to align the source and the target, and the model is then used to transform the full source set, including the held-out cell type which the model did not see in the source set used for fitting. The first column shows just the held-out cell type colored by batch, after applying the fitted SCIPR-gdy model to align it. The second column shows all of the data after applying the fitted model, colored by cell type. One of the advantages of SCIPR compared to most previous methods is the fact that it learns a general 135 transformation function that can be applied to additional data when it becomes available (Table 1). Such  Figure S8). Figure 5 displays the aligned results for the cell type not used in the learning. As the figure   147 shows, for CD4+ and Cytotoxic T cells SCIPR-gdy is able to mix the two batches even though it had never 148 seen these in fitting.

SCIPR identifies biologically relevant genes 150
The above results demonstrate SCIPR's ability to integrate batches quantitatively and qualitatively. Since genes, we next asked whether the learned weights provide information on the importance of specific genes 153 for the set of cells being studied. To evaluate such an approach we compared ranking genes based on their 154 SCIPR coefficients to a baseline that ranks them based on differential expression (DE) (Appendix S6, S7).

155
Next we performed gene set enrichment analysis using the Gene Ontology to identify the significant functions 156 associated with top genes and test their relevance (Appendix S8). The PBMC dataset, which is the largest, 157 was also the one with the most number of significant categories identified ( are "Defense response", "Regulation of immune response", and "Humoral immune response" (all with adj. 162 p-value 9.743e-9, Table S3). These categories are very relevant for blood cells given their immune system 163 function. On the other hand, the top three categories recovered by top DE genes are much more generic 164 and include "Ribonucleoprotein complex biogenisis", "Ribosome assembly", and "Cellular amide metabolic 165 process" (Table S2).  and we used the five largest cell types and the set of 2629 highly variable genes (Table S1b) (Table S1c). See Appendix for complete details.

238
• ICP assumes that a rigid transform relates the two sets. This may have been appropriate for 3D rigid 239 objects, but not for the complicated, high-dimensional single cell transcriptome data.   and finds the element in each row with the smallest distance to match to that point in A.

262
In contrast, for scRNA-seq alignment we would like to require the following:

263
• Not too many points in A are matched with the same point in B (avoid collapsing many points in one 264 dataset onto a single point).

265
• Not all points must be assigned (since the two dataset may not fully overlap in terms of cell types).

266
One approach for addressing the first requirement is using a bipartite matching algorithm [32] instead 267 of picking the closest point. In such an algorithm a global optimal matching is found such that each point

279
Instead, in Algorithm S1 we propose an efficient greedy algorithm for partial assignment between A and 280 B. The algorithm sorts all of the edges between members of the two sets based on distance. Next, we 281 proceed along the ordered edges starting from the smallest distance. If an edge includes a point in set B 282 that has already been selected β times by previously chosen edges, we discard it and continues down the list.

283
Our algorithm has parameters to adjust how many times we allow each point in B to be matched to (β), 284 and how many of the points in A must be matched (α). In our experiments, we set β = 2 and α = 0.5. The solution to the partial bipartite matching problem, we find that this works well in our related scRNA-seq 288 cell pair assignment problem for alignment. 289 We also experiment with a matching procedure which follows the foundational work of using mutual So far we focused on matching points given their distance. In the next stage, we will fit a transformation 295 function to align the matched points. As discussed above, the family of rigid transforms is not well suited 296 to compute such alignments for scRNA-seq data ( Figure S3). Instead, we propose to use the family of affine  To learn this function, we minimize an objective function that aims to move the assigned points closer to 300 each other, via such an affine transformation. Given a pair assignment S from the previous step (section 4.5)

301
(which may be the result of the classic "closest" strategy from ICP, our greedy algorithm, or MNN matching), 302 learning the transformation function (Figure 1 panel 3 where A, B ∈ R d , and d is the number of genes This is a least-squares objective function. If the system is overdetermined, this could be solved exactly.   is the goal of the alignment. The second is cell type coherence. A method that randomly mixes the two 322 datasets would score high on the first measure and low on the second while a method that clusters each of 323 the datasets very well but cannot match them will score high on the second and not on the first.

324
To track both dataset mixing and biological signal preservation, we follow [24] and use the local inverse 325 Simpson's Index (LISI). LISI measures the amount of diversity within a small neighborhood around each 326 point in a dataset, with respect to a particular label. The lowest value of LISI is 1 (no diversity). As in [24], 327 we define integration LISI (iLISI) as the score computed when using the batch label for each datapoint, and 328 cell-type LISI (cLISI) as the score when using the cell-type label. iLISI measures the effective number of 329 datasets within the neighborhood (so the higher the better). cLISI measures the effective number cell types 330 within the neighborhood (so the lower the better). With these two metrics in hand, we can keep track of 331 not only the ability of our algorithms to align one dataset onto another, but also their ability to preserve 332 original signal. In our figures which report the iLISI and cLISI scores, we rank the methods based on the 333 difference of medians iLISI − cLISI score to capture the ability of the emthods to maximize and minimize 334 these two quantities respectively. The authors declare no competing interests.

Supporting information captions
• S4 Appendix. Computing the final affine transformation at the end of SCIPR.

452
• S7 Appendix. Selecting top genes from SCIPR models for enrichment analysis.