Partial label learning for automated classification of single-cell transcriptomic profiles

Single-cell RNA sequencing (scRNASeq) data plays a major role in advancing our understanding of developmental biology. An important current question is how to classify the transcriptomic profiles obtained from scRNASeq experiments into the various cell types and how to identify the lineage relationships of individual cells. Because of the fast accumulation of datasets and the high dimensionality of the data, it has become challenging to explore and annotate single-cell transcriptomic profiles by hand, and automated classification methods are needed. Classical approaches rely on supervised training datasets. However, because data annotated at single-cell resolution are difficult to obtain, we propose instead to take advantage of partial annotations. The partial label learning framework assumes that, for each data point, we can obtain a set of candidate labels containing the correct one, a much weaker requirement than a fully supervised training dataset. We study state-of-the-art multi-class classification methods, such as SVM, kNN, prototype-based, logistic regression and ensemble methods, and extend them where needed to the partial label learning framework. Moreover, we study the effect of incorporating the structure of the label set into the methods, focusing in particular on the hierarchical structure of the labels commonly observed in developmental processes. We show, on simulated and real datasets, that these extensions enable learning from partially labeled data and yield predictions with high accuracy, particularly with a nonlinear prototype-based method. We demonstrate that our methods trained with partially annotated data reach the same performance as their fully supervised counterparts. Finally, we study the level of uncertainty present in the partially annotated data, and derive prescriptive results on the effect of this uncertainty on the accuracy of the partial label learning methods.
Overall, our findings show how hierarchical and non-hierarchical partial label learning strategies can help solve the problem of automated classification of single-cell transcriptomic profiles. Interestingly, these methods rely on a much less stringent type of annotated dataset than fully supervised learning methods.


Description of the methods compared in the supplementary material
We provide additional experimental results that investigate, first, the respective merits of the two algorithms introduced in the paper, namely Iterative Refinement Learning (IRL, Algorithm 1) and Iterative Full Retraining (IFR, Algorithm 2); and secondly, the impact of the SVM implementation in the nonlinear case.
In Table S1 below, we provide additional details for all the methods implemented in the main manuscript or in this supplementary material.

Note that all the results reported in this supplementary material have been obtained with the same experimental protocol as the results reported in the main manuscript (see Section Datasets and experimental settings, subsection Experiments). In particular, results are averaged over 5 train/test splits, after a 5-fold cross-validation grid search.
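The protocol above can be sketched as follows. This is a minimal illustration with a synthetic dataset and a logistic regression baseline standing in for the actual methods and scRNASeq data; the variable names and the hyperparameter grid are ours, not the paper's.

```python
# Sketch of the evaluation protocol: performances are averaged over 5
# independent train/test splits, with hyperparameters selected by a
# 5-fold cross-validation grid search on each training split only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a real annotated dataset
X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=10, random_state=0)

scores = []
for split in range(5):  # 5 independent train/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, random_state=split)
    # 5-fold CV grid search restricted to the training split
    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          param_grid={"C": [0.1, 1.0, 10.0]}, cv=5)
    search.fit(X_tr, y_tr)
    scores.append(search.score(X_te, y_te))

mean_acc = float(np.mean(scores))  # the averaged accuracy that is reported
```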

Comparison of the IRL and IFR algorithms
We now compare the two learning algorithms described in detail in the Methods section of the main manuscript, namely the Iterative Refinement Learning algorithm (IRL, Algorithm 1) and the Iterative Full Retraining algorithm (IFR, Algorithm 2). Recall that the difference lies in the optimisation scheme. Given the best candidates in the partial labeling at a step t, IRL performs a single-step gradient re-estimation and then re-estimates the best candidates in the partial labeling, whereas IFR performs a full training from scratch with the currently assigned candidates and only performs the next assignment of best candidates after convergence.
Partially labeled setting. To make a fair comparison between the two algorithms, we ran the same methods (logistic regression and linear SVM, both trained with gradient descent optimization) optimized with the two different algorithms. The results reported in Fig S1 show that the IRL algorithm significantly outperforms the IFR algorithm for SVM implementations, whatever the dataset and the experimental setting, while the opposite most often holds for LR models. We do not have a clear explanation for this behavior, given that linear SVM and logistic regression are usually considered very similar classifiers. We hypothesize that tuning the hyperparameters of the logistic regression model may be more sensitive and would have required a larger grid search, but we limited the grid search to a similar budget for every method to obtain a fairer comparison. This motivated us to systematically report the SVM results obtained with the IRL algorithm in the experimental study of the main manuscript.
Finally, we observe that for ensemble models, the more expressive the models are, the lower their accuracy on partially labeled data. In particular, for both RF and XGBM, we noticed improved accuracy on partially labeled data when reducing the capacity of the model (e.g. lowering the maximal depth of the trees). This is consistent with the idea that an overly expressive classifier, when learned with the IFR algorithm, can fit the first (possibly random) labeling of the partially labeled data and hence suffer from strong overtraining.
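The capacity effect can be illustrated in isolation. The sketch below uses purely random labels on toy data (not the paper's partial-label setup) to show that an unconstrained random forest can memorise an arbitrary initial labeling, whereas capping the tree depth prevents this memorisation.

```python
# Capacity vs. memorisation: full-depth trees fit even random labels
# perfectly on the training set; depth-limited trees cannot.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y_random = rng.integers(0, 3, size=300)        # purely random 3-class labels

deep = RandomForestClassifier(random_state=0).fit(X, y_random)
shallow = RandomForestClassifier(max_depth=3, random_state=0).fit(X, y_random)

deep_train_acc = deep.score(X, y_random)       # near-perfect memorisation
shallow_train_acc = shallow.score(X, y_random) # far from memorising the noise
```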

SVM implementation and kernel approximation
As mentioned in S1 Text, we implemented nonlinear SVM in several ways to deal with the particular settings encountered in our experimental study.
In particular, some of the datasets that we consider are quite large (e.g. all our artificial datasets include hundreds of thousands of training samples), preventing the use of the standard implementation of kernel SVM, named here k-SVC, whose optimization does not scale with the number of training samples. Instead, we relied on an approximation of the Radial Basis Function (RBF) kernel through the Fourier transform [4]. This choice enables gradient descent optimization and scales very well to large datasets; we call this method k-SVM.
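The approximation referenced above is, to our understanding, the random Fourier features construction of [4]: an explicit feature map z(x) such that z(x)^T z(y) ≈ exp(-gamma ||x - y||^2), so that a linear model trained by gradient descent on z(X) approximates an RBF-kernel SVM while scaling linearly in the number of samples. The sketch below (illustrative names, pure numpy) checks the approximation against the exact kernel matrix.

```python
# Random Fourier features approximating the RBF kernel:
# z(x) = sqrt(2/D) * cos(W^T x + b), with W ~ N(0, 2*gamma*I) and
# b ~ Uniform[0, 2*pi], so that E[z(x)^T z(y)] = exp(-gamma*||x-y||^2).
import numpy as np

def random_fourier_features(X, gamma=1.0, n_components=500, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, n_components))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_components)
    return np.sqrt(2.0 / n_components) * np.cos(X @ W + b)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
Z = random_fourier_features(X, gamma=0.1, n_components=2000)

# Compare the approximate Gram matrix Z Z^T with the exact RBF kernel
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_exact = np.exp(-0.1 * sq_dists)
K_approx = Z @ Z.T
max_err = np.abs(K_exact - K_approx).max()
```

Any linear classifier fitted on `Z` (e.g. by stochastic gradient descent) then plays the role of the k-SVM method described above, with accuracy controlled by `n_components`.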

Supervised results
We first provide in Fig S2 a comparison of all the methods under investigation in a supervised setting. Performances (accuracy) are reported for the three real datasets and two sizes of the training set. One may see that linear SVM and nonlinear SVM behave similarly in all these supervised classification tasks, and that the two nonlinear variants (k-SVM and k-SVC) exhibit the same level of performance as the linear variant (SVM). More generally, these models are roughly on par with the well-performing baselines, whatever the dataset and the setting, with some exceptions where the true kernel SVM (k-SVC) may significantly outperform k-SVM. Yet we chose to systematically report the performances of k-SVM in the manuscript, since only this implementation scales to all our datasets. Finally, note that the linear SVM (SVM) is most often on par with its nonlinear counterparts.

Fig S1. Comparison of the IRL and IFR algorithms in the partially labeled scenarios. A-B-C) Accuracies (in percentage) on the 3 real-world datasets, where each heatmap corresponds to the accuracy on X_s^test, the precision on X_pl^test, and the precision on X_pl^test when prior information is also available at test time. Each row corresponds to a partial label setting, described in order by o: overlap between supervised and partial labels, k: size of the partial label sets, p: proportion of fully supervised training data. Each column corresponds to the performance of a method, where SVM stands for SVM with the IRL algorithm (as in the main manuscript), and SVM-IFR for SVM with the IFR algorithm. The red squares indicate the best performing method per row, i.e. per experimental setting, using a t-test with significance criterion p-value < 0.1. We also computed paired t-tests and highlight in magenta the results where the model with the IRL algorithm performs significantly better than with the IFR algorithm, and in blue the opposite situation.

Fig S2. Comparison of the classification performance of the main baseline methods on the three real-world datasets in the fully supervised setting. Performances (accuracy in %) are reported for two sizes of training datasets (the full training dataset, p = 1.0, or half of it, p = 0.5).

Fig S3. Comparison of SVM implementations in the partially labeled setting. A-B-C) Accuracies (in percentage) on the 3 real-world datasets, where each heatmap corresponds to the precision on X_s^test, the precision on X_pl^test, and the precision on X_pl^test when prior information is also available at test time. Each row corresponds to a partial label setting, described in order by o: overlap between supervised and partial labels, k: size of the partial label sets, p: proportion of fully supervised training data. Each column corresponds to the performance of a method, where SVM stands for linear SVM and k-SVM for SVM with the kernel approximation; both of these implementations rely on the IRL algorithm (as in the main manuscript). Finally, k-SVC stands for the exact kernel SVM, whose computation relies on the IFR algorithm. The red squares indicate the best performing method per row, i.e. per experimental setting, using a t-test with significance criterion p-value < 0.1.

Table S1. Details of the methods' acronyms, learning algorithms and implementations. IRL stands for the Iterative Refinement Learning algorithm, IFR for the Iterative Full Retraining algorithm; N.A. means not applicable.