A multitask clustering approach for single-cell RNA-seq analysis in Recessive Dystrophic Epidermolysis Bullosa

Single-cell RNA sequencing (scRNA-seq) has been widely applied to discover new cell types by detecting sub-populations in a heterogeneous group of cells. Since scRNA-seq experiments have lower read coverage/tag counts and introduce more technical biases compared to bulk RNA-seq experiments, the limited number of sampled cells combined with the experimental biases and other dataset specific variations presents a challenge to cross-dataset analysis and discovery of relevant biological variations across multiple cell populations. In this paper, we introduce a method of variance-driven multitask clustering of single-cell RNA-seq data (scVDMC) that utilizes multiple single-cell populations from biological replicates or different samples. scVDMC clusters single cells in multiple scRNA-seq experiments of similar cell types and markers but varying expression patterns such that the scRNA-seq data are better integrated than typical pooled analyses which only increase the sample size. By controlling the variance among the cell clusters within each dataset and across all the datasets, scVDMC detects cell sub-populations in each individual experiment with shared cell-type markers but varying cluster centers among all the experiments. Applied to two real scRNA-seq datasets with several replicates and one large-scale droplet-based dataset on three patient samples, scVDMC more accurately detected cell populations and known cell markers than pooled clustering and other recently proposed scRNA-seq clustering methods. In the case study applied to in-house Recessive Dystrophic Epidermolysis Bullosa (RDEB) scRNA-seq data, scVDMC revealed several new cell types and unknown markers validated by flow cytometry. MATLAB/Octave code available at https://github.com/kuanglab/scVDMC.


Introduction
In recent years, single-cell RNA sequencing (scRNA-seq) has emerged as the dominant method for quantifying transcriptome-wide mRNA expression in individual cells. While traditional bulk RNA-seq ignores the differences between individual cells and treats the population of cells as homogeneous, scRNA-seq identifies sub-populations of single cells and can be useful for characterizing sub-population structure, mechanisms of transcription regulation, and understanding disease progression [1] and immunology [2]. A typical scRNA-seq protocol consists of several steps: isolation of single cells and RNA, reverse transcription, amplification, library generation, and sequencing. In addition to the noise and bias that exist in bulk RNAseq experiments, issues unique to scRNA-seq include those from biological sources, such as cell-cycle stage or cell size, as well as from technical/systematic sources, such as capture inefficiency, material degradation, sample contamination, amplification biases, GC content, and sequencing depth. These experimental biases and limitations cause uneven coverage of the entire transcriptome and result in an abundance of zero-coverage regions [3,4].
Typically, the cost of scRNA-Seq is much higher than bulk RNA-Seq per sample, and thus, scRNA-Seq of a large patient cohort is still prohibitively costly. When a large number of single-cells from multiple samples are sequenced, more complex batch effects might be introduced. Finally, some poorly sampled cell populations might only contain very few cells for the analysis. To address all these challenges, proper integration of multiple scRNA-Seq datasets generated from different experiments is important. When multiple single-cell populations from biological replicates or related samples such as a patient cohort are analyzed to discover the common and sample-specific cell types, technical biases and irrelevant biological variance among independent samples cannot be easily identified and removed from the signal before clustering the single cells. For example, when the scRNA-seq profiles from multiple patients are pooled together for clustering, the clusters will highly overlap with the division of the single cells by the sample origins rather than similar types such as pathogenic cells vs normal cells.
In this paper, we introduce a multitask learning method with embedded feature selection to simultaneously capture the differentially expressed genes among cell clusters and across all cell populations to achieve better single-cell clustering. The key advantage of multitask clustering is the use of multiple single-cell populations to leverage the sample size limitation in each individual dataset while allowing dataset-specific variations among the same cell types across the datasets. To illustrate the objective, Fig 1 shows a simulation example of scRNA-seq data of 100 single cells from three cell populations (n = 33, 33 and 34) with 1000 expressed genes. Among the 1000 genes, gene A and gene B are the hidden markers that are differentially Populations. 3D plots of the three single-cell populations separated by the marker genes A, B and non-marker gene X. The simulation data are generated from the ground truth data with rotation and scaling to represent technical biases and biological variation with 998 random genes in addition to gene A and gene B (1000 genes in total). Additional noise is also introduced. Three different clustering strategies are shown below in (C), (D) and (E). (C) Pooled Clustering. The 2D plot with the true marker genes A and B on pooled data that simply combines 3 single-cell populations together for clustering is shown. Even with the correct marker selection, cells from different types are still mixed because of the rotation, scaling and noise. (D) Separated Clustering. The 2D plot on each individual cell population is shown. With the limited single-cell sample size and skewed cell-type distribution, incorrect marker genes may be selected, shown as genes P, Q and R. (E) Multitask Clustering and Embedded Feature Selection. The proposed multitask clustering can identify both the true marker genes and correctly cluster the individual cells into their respective types in each population. The clustering of each dataset is reinforced from the results in the other two datasets shown as the connected clusters across the three experiments.
https://doi.org/10.1371/journal.pcbi.1006053.g001 scVDMC: Variance-driven multitask clustering of single-cell RNA-seq data expressed across the four cell types (indicated by four different colors). In the ideal scenario, there is no technical bias and the marker genes are known as shown in the ground truth in Fig  1(A). Fig 1(B) shows the single-cell datasets after biological variation, technical biases, and noise are introduced. The data distributions are very different across the three cell populations after the rotation, re-scaling and addition of noise. It is also challenging to identify the true marker genes with a limited number of samples in each population. Simply pooling the singlecell data from the three populations together will confuse the clustering, even with the correct marker genes identified (Fig 1(C)). Conversely, separated clustering on each single-cell population suffers more from the biological variation as the number of single cells is not sufficient in each individual analysis to identify the true maker genes (Fig 1(D)). As shown in Fig 1(E), variance-driven multitask clustering of single-cell RNA-seq data (scVDMC) utilizes expression patterns of different single-cell populations with shared cell-type markers and corresponding similar clusters for better integration.

Materials and methods
In this section, we first introduce the model and the algorithm of variance-driven multitask clustering of single cells (scVDMC) and then discuss the parameter selection for scVDMC and related work in scRNA-seq clustering. We also describe the methods for the generation of the in-house Recessive Dystrophic Epidermolysis Bullosa (RDEB) scRNA-seq dataset and the flow cytometry experiments.

A multitask clustering and feature selection model
where w and α > 0 are hyper-parameters to balance the three error terms: the reconstruction error, the cluster center separation in each cell population, and the variance of the cluster centers across the different single-cell populations. l 2 Z þ is the predefined number of features to be selected. jjD B ðX ðdÞ À U ðdÞ V ðdÞ T Þjj 2 F in Eq (1) denotes the reconstruction error of the classic kmeans clustering as matrix factorization with D B selecting marker genes by B, i.e. the reconstruction error is only measured on the marker genes by ignoring the irrelevant (non-selected) genes. The second term B T Var(U (d) ) is introduced to maximize the separation of the cluster centers, where Var(U (d) ) is defined as a vector in which each element is the variance of the The samples in each domain are in four clusters separated by the vertical bars. Each dataset is clustered by factorization of the data matrix by the selected genes (with indicator 1 in B) common to the three domains. Two types of variance are controlled, 1) the variance among the cluster centers in the same domain are maximized for better cluster separation shown as a shadowed row; and 2) the variance among the shadowed cluster centers across the domains are minimized to match the similar clusters across the domains. vector U ðdÞ i;: 2 R kÂ1 [5]. The third term Var(Y (i,j) ) denotes the variance of the vector Y (i,j) , which is introduced to require similar gene expression centers across different single-cell populations. Note that the reconstruction error encourages selection of low expression genes since the errors are usually smaller on smaller values while the second variance term encourages selection of high expression genes since the variances tend to be larger on larger values. Together as the sum over all the domains, the cost function provides a balanced error on the compactness and separation of the clusters of the cell types tuned by feature selection across all the domains. The unique but similar cluster centers in each domain preserves the unique expression patterns while the features are selected as common marker genes for different cell types. For the three hyper-parameters in Eq (1), λ (the number of marker genes) is typically a small number based on prior knowledge of the cell types, and the selection of balancing weight w and α is discussed later in this section. The full scVDMC algorithm is shown in Algorithm 1. The goal is to minimize the cost function in Eq (1) to obtain the optimal U (d) , V (d) and B. We employ an alternating update strategy to solve the optimization problem. First, we fix the feature selection B, all the cluster centers U (i) , i = 1, 2, . . ., D and all other V (i) , i 6 ¼ d, to obtain a certain V (d) .

Alternating updating algorithm
where I k denotes the identity matrix of size k and 1 k is a length k column vector of all ones. Similarly, we have As shown in S1 Appendix, the analytical solution of Eq (3) when B i = 1 is Thus, the total time complexity of each iteration in Algorithm 1 will be O((m + λ) × n × k), which is comparable to k-means when λ <<m.

Parameter selection
There are four hyper-parameters to tune for the scVDMC algorithm, α and w: weights of the two variance terms, k: the number of clusters and λ: the number of marker genes. Below we describe our strategies for tuning α, w and k assuming that λ can be approximately informed by prior knowledge of the cell types.
Tuning α: The role of α is to weight the cost term on the cross-domain variance of the cluster centers. The larger the α the more similar the cluster centers are across the domains. Ideally, α should be relatively small to allow smaller reconstruction error but yet meet the consistency requirement across the domains. The strategy is to start with a small α and measure the total difference between the cluster centers of the corresponding cluster across the domains, and then increase α to reduce the difference until the total difference does not change much. This selection can also be achieved by visualization of the cluster centers with Principle Component Analysis (PCA) or other dimension reduction methods. After clustering, we can project the data in each domain into the first two PCs. The distance between the cluster centers of the same cluster in each domain can be compared for choosing an appropriate α. Several examples are shown later in the experiments.
Deriving the upper bound of w: Eq (3) is a sum of a few quadratic terms of variable U ðdÞ i;: . The global minimum of U ðdÞ i;: can be solved in closed-form if the Hessian below is positive semi-definite, In the following, we show that an upper bound on w will guarantee that H is positive semidefinite. By Gershgorin circle theorem (For any eigenvalue . This is equivalent to stating that H is diagonally dominant and only has non-negative diagonal entries. H can be rewritten as follows, where c i is the i th diagonal entry of matrix V ðdÞ T V ðdÞ , i.e., the size of cluster i. Then we have and thus, where c min is the minimum of c i , 8i = 1, . . ., k. Since c min ! 1 (no empty cluster), we obtain a loose upper bound of w ¼ k 2 4ðkÀ 1Þ þ ak 3 ðdÀ 1Þ 2d 2 ðkÀ 1Þ . In all the experiments, we set w to be smaller than the upper bound for feasible implementation.
Determining the number of clusters k: The number of clusters k is selected by an "elbow" plot of the within-clusters sum of squares T s computed as follows: T s represents the amount of variance to minimize for better clustering. Larger k will lead to smaller T s . By plotting T s under different options of k, we can select the best k at the so-called "elbow" of the curve. S6 and S7 Figs show the "elbow" plot on two datasets in the experiments. In addition, when an empty cluster is created, the calculation of cluster center variance will be invalid. To address the possible issue, we use a simple splitting procedure to handle empty clusters. Specifically, if there is an empty cluster in V (d) (i.e. the whole column is 0) we randomly split the largest cluster into two clusters. This procedure is repeated until there are exactly k clusters. This strategy is similar to commonly used k-mean rerun when a cluster center is collapsed on a single data point or no data point.

scRNA-seq of RDEB cohort
To identify sub-populations producing homing signals that could attract bone marrowderived cells to injured skin, we captured single dermal fibroblasts from six patients with severe generalized RDEB and their HLA-matched healthy siblings using the Fluidigm C1 system. The demographics information of the patients and donors are shown in S1 Table. Cell culture: Dermal fibroblasts from patients with severe generalized RDEB and their human leukocyte antigen (HLA) matched healthy siblings were obtained from skin biopsies and cultured in DMEM high glucose (Thermo Fisher Scientific) containing 10% fetal bovine serum (MilliporeSigma), 1% Pen/Strep (Thermo Fisher Scientific), 1% L-glutamine (Thermo Fisher Scientific), and 1% MEM NEAA (Thermo Fisher Scientific). For sub-culture, the medium was removed and cells were washed with 1X PBS (Thermo Fisher Scientific) and detached using Trypsin/EDTA (Thermo Fisher Scientific). Experiments were performed with fibroblasts at passages 4-9.
Single-cell capture and RNA-seq: Fibroblasts were collected by trypsinization and resuspended in 5 μL of fibroblast medium for loading into the capture chip. The medium-(10-17 μm diameter) and large-size (17-25 μm diameter) chips were used to capture cells with the C1 system (Fluidigm). Cells were loaded at a concentration of 2.5 x 10 5 per μL and stained with the Live/Dead Viability/Cytotoxicity kit (Thermo Fisher Scientific). Cells were imaged with phase-contrast and fluorescence microscopy to assess cell number and viability at each capture point. Capture sites with single, live cells were selected while capture sites with multiple, no, or an unclear number of cells were excluded from further analysis. Images for each single-cell used in this study are available upon request. In total, 295 patient cells and 248 sibling cells were selected. On the device, cDNA was created from the selected cells using the SMARTer Ultra Low RNA kit designed for the C1 system (Clontech). mRNA libraries were constructed using the Nextera XT kit (Illumina) according to the manufacturer's protocol. The libraries were sequenced on an Illumina MiSeqv3 with 75bp paired-end reads to a depth of 19-22 million reads per lane. Target sequencing depth for each library was 200K reads. The RNA-seq data have been deposited in NCBI's Gene Expression Omnibus and are accessible through GEO series accession number GSE108849 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi? acc=GSE108849).
Processing of RNA-seq data: Paired-end 75bp reads were mapped to the UCSC human transcriptome (hg19) using Bowtie2 (version 2.2.4) and Tophat (version 2.0.9). Gene expression levels were calculated using Cuffquant (Cufflinks version 2.2.1 with parameters -u -max-bundle-frags 10000000) and Cuffnorm (Cufflinks version 2.2.1). FPKM values as estimated by Cufflinks were added a value of 1 (to avoid zeros) and log2 transformed. We removed nine single-cell samples with low read counts (< 50K) and sub-sampled two single-cell samples sequenced as population controls with high read counts (> 1.5M) (random sub-sampling, 10% of total reads). 11 single-cell samples were excluded as outliers. We excluded lowly expressed genes (average log2 (FPKM) < 1.5) from further analysis. The remaining 543 single-cell samples met the requirement of expressing at least 2,000 of these remaining 5,196 genes. For each individual, the number of single-cells used in the analysis and the average number of reads for those single-cells is summarized in Table 1. The total number of the reads and the number of aligned reads in each cell are also shown in S3 Fig. Flow cytometry: Fibroblasts were collected by trypsinization (as above) and resuspended in fibroblast medium. A BD Cytofix/Cytoperm™ kit (BD Biosciences) was used to prepare the cells for intracellular staining. Cells were fixed for 15 min with 150 μl Fixation/Permeabilization solution before being resuspsended in 300 μl 1X BD Perm/Wash Buffer and incubated at 4˚C for 20 min. Primary antibodies (S2 Table) were diluted in 100 μl 1X BD Perm/Wash Buffer and cells were resuspended in this for 20-30 min at 4˚C, followed by one wash with 500 μl 1X BD Perm/Wash Buffer. Secondary antibodies (S3 Table) were diluted in 300 μl 1X BD Perm/ Wash Buffer and cells were resuspended in this for 20-30 min at 4˚C, followed by one wash with 500 μl 1X BD Perm/Wash Buffer, and resuspension in 300 μl 1X BD Perm/Wash Buffer. Flow cytometry experiments were carried out on a BD LSRII system equipped with FACsDiva 8.0 software (BD Biosciences) and analyzed using FlowJo (Tree Star Inc.).

Related work
Most existing methods focus only on sub-population clustering and differential gene expression detection among the learned cell clusters with one (pooled) cell population. Some of these methods were directly adopted from traditional bulk RNA-seq analysis and/or classical dimension reduction algorithms such as Principal Component Analysis [6][7][8], hierarchical clustering [9], t-SNE [10][11][12], Independent Component Analysis [13] and Multi-dimensional Scaling [14]. Other methods focus on special properties of scRNA-seq data, such as high variance and uneven expressions. For example, SNN-Cliq [15] uses a ranking measurement to get reliable results on high dimensional data; [16] proposed a special dimension reduction method to handle the large amount of zeros in scRNA-seq; [17] proposed a Latent Dirichlet Allocation model with latent gene groups to measure cell-to-cell distance; CellTree method [17] clusters single cells by a detected tree structure outlining the hierarchical relationship between single-cell samples to introduce biological prior knowledge; Seurat [18] was proposed to infer cellular localization by integrating single-cell RNA-seq data with in situ RNA patterns; and more recently a consensus clustering approach SC3 [19] was proposed to improve the robustness of clustering through combining multiple clustering solutions by consensus.
Mixed multiple batch strategies [9,20] have been proposed to reduce the technical variance, which does not directly improve clustering. To the best of our knowledge, multitask clustering with an embedded feature selection has not been previously applied to scRNA-seq data analysis.

Results
In the experiments, scVDMC was applied to two small scRNA-seq datasets: mouse embryonic stem cell (mESC) data [21] and mouse embryonic lung epithelial cell (Lung) data [22], and one large-scale droplet-based scRNA-seq peripheral blood mononuclear cells (PBMC) data [23]. We also applied scVDMC to our in-house Recessive Dystrophic Epidermolysis Bullosa (RDEB) data to detect RDEB relevant cell types and marker genes. The statistics of the four datasets are shown in Table 2.
To apply the SNN-Cliq method [15], we used the provided MATLAB code to transform the data into the SNN graph, then used the Python code to produce the clustering result by ranking measurement. There are three hyper-parameters: k (size of the nearest neighbor list), r (parameter for quasi-clique finding, range (0,1]), and m (parameter for cluster merging range (0,1]). We tested multiple combinations of the three hyper-parameters using k = 3, 5, 7, r = 0.1, 0.2, . . ., 0.9 and m = 0.1, 0.2, . . ., 0.9. We also required the program to annotate all the data instead of leaving singletons unlabeled (−n). Since SNN-Cliq identifies the number of clusters automatically, we only reported the results with the correct number of clusters. In all experiments with SNN-Cliq, we further removed genes with low expression and log-transformed the data, as recommended in [15].
To apply the CellTree method [17], we used the provided R package to first fit a Latent Dirichlet Allocation (LDA) model with the default method (joint MAP estimation) to choose the number of topics followed by learning a pair-wise distance for all cells. Then we ran hierarchical clustering with four different methods for computing cluster distance ('ward', 'complete', 'single', 'average') and selected the best clustering results.
To apply Seurat [18], Seurat v2.0 R package was downloaded from SATIJA LAB. The scRNA-seq data were converted into the required format (gene index | cell index | gene expression) as the input. The parameter "Resolution" tunes the granularity of the downstream clustering, with increased values resulting a larger number of clusters. We tested a range [0.5,1.5] to get the exact number of clusters for comparison with other methods. The reported result of Seurat is computed with the resolution parameter that gives the exact number of clusters and the lowest error.
To apply the SC3 [19] we downloaded SC3 v1.7.2 R package from Bioconductor. All parameters in SC3 are set to default. In the experiments with more than 5000 instances for clustering, the SVM mode will be trigged to run a second stage supervised learning to improve the scalability. To further test separated cluster, pooled clustering and SC3 combined with feature selection, we chose the genes with larger variance as the marker genes. Since the other three baselines use a different strategy for clustering and do not provide marker-gene selection, we only focused on the clustering result for these three baselines. The true cluster labels are obtained as the validated clusters with high confidence in the mESC data [21] and Lung data [22], and the known PBMC populations from donor A sorted with FACS analysis [23].

Experiment on mouse embryonic stem cell data
We downloaded the single-cell expression data for 250 mESCs [21] from the European Bioinformatics Institute's (EBI) ESpresso database. These 250 mESCs cultured in serum conditions were captured using the Fluidigm C1 on three different days from three different passages (biological replicates, n = 81, 90, and 79). After removing genes expressed uniformly within a single replicate, 12 Within a reasonable range of λ, such as from 20 to 200, scVDMC shows significant improvement compared with the baseline methods. When λ is too small, such as 10 genes selected, there are not enough markers to capture the difference among the cell types such that the error is larger. When λ is too big, scVDMC will consider almost all the genes and the variance selection will not play a role. As such, scVDMC will eventually degrade into separated k-means and the error will also increase. As shown in S1(A) Fig, it is worth noting that the results are less sensitive to the choice of the parameter w, for which the upper bound for w is 9 8 in this case. It is also interesting that the CellTree method performed better than both pooled and separated k-means, while SNN-Cliq and SC3 performed better than separated k-means but worse than pooled k-means. Under various tuning of the parameters, Seurat still performed poorly on this dataset. Both separated kmeans and pooled k-means performed much worse with the feature selection by variance, indicating that simple feature selection strategies will not identify correct markers in this dataset. Running scVDMC with α = 1 performed the best when 20 marker genes are selected but the overall performance is very similarly as running with α = 0, indicating that the control of the cross-domain variance could play a role in improving the results. However, since the cluster centers are already not very different when running with α = 0, the improvement will only be marginal. Fig 3(B) shows the detailed clustering errors by scVDMC, pooled k-means and separated k-means. Compared with the pooled k-means and separated k-means, scVDMC captures relatively high variance in the leading principle components and achieves improved clustering in every domain (fewer mixed-color dots). In S2(A) Fig, we also show the convergence of scVDMC by the number of iterations.
Analysis of the mESC transcriptome data using scVDMC yielded comparable results on marker gene selection in the original paper [21] as well as pooled and separated k-means. Both analyses were able to detect and highly rank the known markers of differentiation Krt8, Krt18, Anxa1, Anxa3, and Acta1. Further, scVDMC detected several additional genes that pooled kmeans, separated k-means and the original paper did not. These included Cav1, which is required for normal lung development [24] and Dsp, variants of which are associated with idiopathic pulmonary fibrosis [25].

Experiment on lung epithelial single-cell data
We downloaded the single-cell expression data for 80 embryonic mouse lung epithelial cells [22]. These 80 single-cell samples were taken from three different mice (biological replicates, n = 20, 34, and 23) and contained five cell types: ciliated, Clara, AT1, and AT2 cells, as well as a bi-potential progenitor (BP). Since only one replicate contained ciliated cells, we removed these from the analysis, leaving 77 single-cell samples. After removing genes expressed uniformly within a single replicate, 7,357 genes remained. To tune α for scVDMC, we examined the positions of the cluster centers across the domains and show the visualization by PCA in S4(C) & S4(D) Fig. α = 1 is chosen as the optimal parameter to achieve similar relative positioning of cluster centers in all the three domains. For the SNN-Cliq method, we further removed genes with log-transformed average expression less than 2.
With the limited number of single-cell samples in this dataset, scVDMC still much improved clustering over the baselines in the range of λ 2 [30,80] shown in Fig 3(C). In Fig  3(D), PCA plots of the top 50 genes show a trend similar to the ESC dataset, where scVDMC's scVDMC: Variance-driven multitask clustering of single-cell RNA-seq data top genes capture more variance and show less clustering error. Both SNN-Cliq and CellTree performed better than pooled k-means and separated k-means, with SNN-Cliq leading Cell-Tree by a very small margin. Similarly, Seurat also performed poorly while SC3 performed well on the dataset with only 5 mistakes. It is also interesting to observe that running scVDMC with α = 1 performed significantly better than running with α = 0, indicating that the control of the cross-domain variance played an important role in improving the results. Since the cluster centers are very different when running with α = 0, the improvement is significant. Another interesting observation is that the clustering performance is more sensitive to the number of marker genes to select by scVDMC. In particular, selection of 20-80 genes with scVDMC (α = 1) will give the optimal clustering results while selection of more than 90 genes will give much higher error. This is due to the small clusters in this dataset (e.g. purple cluster in domain 2 and yellow cluster in domain 1), which could be sensitive to the number of selected genes in low-read-coverage samples. Thus, the error will be more sensitive to the gene selection in this small dataset. On this dataset, both separated k-means and pooled k-means performed better with the feature selection by variance but never achieved zero clustering error as scVDMC does. As shown in S1(B) and S2(B) Figs, scVDMC behaved similarly by the choices of the w parameters and the convergence.
Analysis of the mouse lung epithelial transcriptome data using scVDMC yielded comparable results in the original paper [22] as well as pooled and separated k-means. Both analyses were able to detect and highly rank the known marker genes of the different cell types: Clara (Scgb1a1), AT1 (Pdpn, Ager), and AT2 (Sftpc, Sftpb). Further, scVDMC detected several additional genes that pooled k-means, separated k-means and the original paper did not. These included two components of the Notch signaling pathway (Notch1 and Nrarp) previously shown to be critical for the development of lung alveolar spaces, with AT2 cells being major sites of Notch activation [26].

Experiment on peripheral blood mononuclear cells data
We downloaded the peripheral blood mononuclear cells (PBMC) data generated by [23] from the 10xGenomics website. In the original data, there are 10 bead-enriched subpopulations of PBMC from a fresh donor (Donor A) with 93802 cells in total. In addition, there are also PBMC from two other frozen donors (Donor B and C) with 7783 and 9519 cells, respectively. A massive droplet-based method was applied to count the mRNAs in the tens of thousands of cells in parallel. To better evaluate the multitask learning setting, we sampled from each of the 10 subpopulations of Donor A in proportion to the sizes of the populations to obtain four subsets of cells from Donor A with 200, 500, 1000 and 10000 cells by sampling. We repeated the sampling procedure five times to generate the mean and variance of Adjusted Rand index (ARI) [19]. We kept all the cells in Donor B and C. We removed the genes expressed in less than 3 cells which results in 17647 genes remained.
To determine the number of clusters in the PBMC data, we examined the "elbow" plot in all the three cell populations shown in S6 Fig. The plots show consistent patterns in the three cell populations that the "elbow" is observed to start around k = 10 verifying that there are indeed around 10 cell types in the data. To tune α for scVDMC, we examined the positions of the cluster centers across the domains and show the visualization by PCA in S4(E) & S4(F) Fig.  α = 5 is chosen since the relative positioning of cluster centers are also relatively similar in the three domains. The baseline methods k-means and SC3 are tested on the pooled data (mixture of Donor A, B and C) and separated data (Donor A only). For SC3, the hybrid approach (consensus clustering + SVM) with its default parameters is applied on the pooled data due to the scalability issue [19]. Clustering performance is measured using Adjusted Rand index (ARI) [19] by comparing the predicted labels with the true labels from sampling the ten subpopulations of PBMC in Donor A. Fig 4 shows the clustering results. Compared with pooled k-means and SC3, scVDMC shows a consistently higher ARI with different choices of λs. scVDMC also shows a significant improvement compared with separated k-means and SC3 when there are 200, 500 and 1000 cells from Donor A. The improvement by scVDMC becomes only marginal when there are 10000 cells sampled from Donor A. The observation is common since larger dataset often benefit less from multitask learning, i.e. as the sample size in donor A increases, less additional information carried in the data of donor B and C can inform a better clustering of donor A data. On this dataset, we also observed that the clustering performance of scVDMC does not rely on the parameter α. This is likely because the agreement among the 10 clusters in the three domains is already high when α = 0 as shown in S4(E) Fig. Therefore enforcing stronger agreement by increasing α will not lead to big improvement as shown in S4(F) Fig. Overall, scVDMC performed well on the large-scale data showing the advantage of applying multitask learning. SC3 did not over-perform separated k-means indicating the consensus clustering is less effective on this dataset.

Case study of RDEB scRNA-seq data
Recessive Dystrophic Epidermolysis Bullosa (RDEB) is an inherited blistering disorder caused by loss-of-function mutations in the COL7A1 gene that codes for type VII collagen (C7) [27]. C7 forms the anchoring fibrils that attach the epidermis to the dermis [28]. When C7 is missing, the skin becomes extremely fragile, eroding at the slightest touch. From birth, patients scVDMC: Variance-driven multitask clustering of single-cell RNA-seq data with this disease must undergo intensive bandaging and daily wound care. They are also susceptible to a highly aggressive form of squamous cell carcinoma [29][30][31][32]. It has been shown that allogeneic hematopoeitic cell transplant (HCT) can partially rescue the RDEB phenotype. Cells from the bone marrow home to the skin and deposit C7 at the dermal-epidermal junction, greatly improving skin integrity in a subset of patients [33]. However, the molecular mechanism by which this occurs remains unknown.
To determine the number of clusters in the RDEB data, we examined the "elbow" plot in all the six cell populations shown in S7 Fig. The plots show consistent patterns in all six cell populations that the "elbow" starts from k = 4, which was chosen as the number of clusters for clustering in all the experiments on the RDEB data. The convergence of scVDMC on RDEB data is shown in S2(D) Fig. Applying scVDMC to the RDEB single-cell dataset revealed quite different cell population structures for the six patient-sibling pairs. As shown in Fig 5, the agreement among the cluster centers across the six populations varies under different choices of α. When α = 0, no agreement among the cluster centers are required. The arrangement of the four cluster centers are very different in the six populations (Fig 5(A)). With larger values of α, the arrangement of the cluster centers becomes more similar. When α = 20, the structure of the four cluster centers is almost identical for the six populations (Fig 5(C)). The visualization in Fig 5 clearly illustrates the effect of imposing variance constraint on the cluster centers across the populations to account for the population specificity and commonality. For comparison, we also applied SC3 scVDMC identified several marker genes previously known to be involved in RDEB (Fig 6). These included CXCL12/SDF1, the ligand for CXCR4, which directs cells of the bone marrow to damaged tissue including skin [34] and HMGB1, which has shown to be positively correlated with RDEB severity [35] and also mediates recruitment of bone marrow-derived cells to injured tissue [36]. Note that we empirically removed confounding cell cycle genes from the top 100 predicted markers and repeated scVDMC until there were no selected cell cycle genes.
We also identified several genes as markers not previously associated with RDEB. These included COL11A1, a minor fibrillar collagen shown to mark activated cancer-associated fibroblasts (CAFs) that is not typically expressed in fibroblasts associated with inflammation and fibrosis [37]. scVDMC also revealed GREM1, a BMP antagonist associated with renal and pancreatic fibrosis [38,39] and MFAP5, which promotes attachment of cells to micro-fibrils of the extracellular matrix and interacts with TGBβ growth factors [40]. We performed flow cytometry on the same RDEB patient and matched sibling fibroblasts to validate the expression levels of these genes at the single-cell level and found the results similar to our RNA expression data shown in Fig 7. To further investigate the expressions of the these markers among the cells in the six populations, we plot the distribution of the cells with highly expressed markers in the six pairs in Fig 8. In the plots, the expression patterns of GREM1 and MFAP5 are very consistent among the cells in all the six pairs with more enrichment in RDEB cells (GREM1) or WT cells (MFAP5). The expression pattern of COL11A1 is consistent in five of the pairs with enrichment in WT cells except RDEB-WT pair 3. Since the markers are selected to capture cell types rather than RDEB vs WT, there might be some discrepancy in the expression patters in each individual cell populations depending on the proportion of the cell types. As top hits, these genes potentially mark sub-populations of stromal cells that contribute to the transformation of the overlying epithelium and the development of squamous cell carcinoma in RDEB patients.

Discussion
In this research, we demonstrated multitask learning is useful in analysis across multiple single-cell populations. It is also possible to apply other multitask learning or transfer learning methods [41] for the clustering tasks. scVDMC is a multitask clustering method specifically designed for scRNA-seq data for selection of a smaller set of cell-type markers and allows large variability in gene expression across the cell populations. Other methods are often built using different assumptions of the data that might not be applicable to the characteristics of scRNAseq populations [42][43][44].
The amount of variation across multiple scRNA-Seq datasets depends on the nature of the datasets for the integrative analysis. For example, while we expect little variances among technical replicates and slightly more variances among biological replicates such that the variances do not play a major role in the pooled analysis, much larger variances might exist among samples of different tissue types or samples from different patients as those in the RDEB data. The key hypothesis of scVDMC is the existence of a common set of a small number of marker genes in every dataset that can partition each dataset into the same clusters. While the hypothesis is quite independent of the amount of variation across the datasets, scVDMC formulation accounts for the variation by tuning the parameter α to weight the variances. In theory, scVDMC is applicable to the general integration of scRNA-Seq datasets if the variances calculated among the cluster centers across the datasets well represent the underlying variations. However, in real applications, it is difficult to assess if the variations are captured by the computation of the variances. Thus, more careful practice of parameter tuning and validation of the results are necessary after the application of scVDMC.
There are limitations in the scVDMC method. In multitask clustering, assuming a global k as the number of clusters in each cell population dataset does not always hold true as for some rare cell types, the corresponding cells may only be present in some populations. scVDMC might incorrectly split a cluster of one cell type because no empty cluster is allowed. One possible improvement is to model each domain with an individual k (d) with a more adaptive strategy for choosing k (d) . In this case, the overall balance between within-cluster distance and the variance will need to be more carefully weighted. In addition, cell-cycle-associated genes could be a large source of confounders. Unless the stages of cell cycle are the biological signal under study, cell cycle-related variation could obscure biological signals of interest. It is possible to model the confounders directly in the scVDMC method with more complex modeling. Alternatively, we could pre-process the scRNA-seq data to remove the cell cycle signals. For example, a Gaussian processes-based latent-variable model [45] was used to account for confounding variations due to the cell cycle in scRNA-seq data sets and then linear regression was applied to remove them. In this approach, a clearly defined cell cycle gene set is necessary to  avoid removing true signals unexpectedly. Combined with the pre-precessing, scVDMC might achieve further improvement in clustering multiple cell populations.
For a better interpretation of scRNA-seq data, CellTree [17] based on Latent Dirichlet allocation also provides soft cluster assignment as opposed to the hard one-cluster assignment and more recently, a new method [46] was introduced for visualizing the cluster membership of single cells by the soft cluster assignment known as "grades of membership". It is also possible to extend scVDMC method to perform soft cluster assignment by relaxing V to contain positive real numbers rather than binary 0/1 in Eq 2. The relaxation will require solving many least-squares problems and increase the computational time complexity. We plan to investigate better solutions of scVDMC in the future for soft cluster assignment and handling cell-cycle-associated gene signatures.