Multi-view clustering by CPS-merge analysis with application to multimodal single-cell data

Multi-view data can be generated from diverse sources, by different technologies, and in multiple modalities. In various fields, integrating information from multi-view data has pushed the frontier of discovery. In this paper, we develop a new approach for multi-view clustering, which overcomes the limitations of existing methods such as the need to pool data across views, restrictions on the clustering algorithms allowed within each view, and the disregard of complementary information between views. Our new method, called CPS-merge analysis, merges clusters formed by the Cartesian product of single-view cluster labels, guided by the principle of maximizing clustering stability as evaluated by CPS analysis. In addition, we introduce measures to quantify the contribution of each view to the formation of any cluster. CPS-merge analysis can be easily incorporated into an existing clustering pipeline because it only requires single-view cluster labels instead of the original data. We can thus readily apply advanced single-view clustering algorithms. Importantly, our approach accounts for both consensus and complementary effects between different views, whereas existing ensemble methods focus on finding a consensus for multiple clustering results, implying that results from different views are variations of one clustering structure. Through experiments on single-cell datasets, we demonstrate that our approach frequently outperforms other state-of-the-art methods.


Introduction
Multi-view data are becoming increasingly prevalent in real-world applications. For example, a data entry of a subject can contain image, audio, and text data. Single-cell genomics is a prominent biomedical area where multi-view data often arise. In the literature on single-cell data analysis, the term "multimodal" is usually used instead of "multi-view"; in this paper, we use them interchangeably. To apply the methods developed here, data in different views can be of different modalities. For instance, RNA expression levels of cells can reveal much of the cellular heterogeneity, and many advanced techniques and tools have been developed to analyze such data, e.g., matrix factorization [1] for revealing low-dimensional structure and CIDER [2] for clustering. However, other data modalities, e.g., DNA and protein expression, are necessary to fully understand cellular mechanisms [3,4]. RNA expression is often inadequate to separate immune cells that are molecularly similar but functionally distinct, and many subpopulations of T cells, indistinguishable by scRNA-seq data, are identifiable in other modalities [5,6]. These examples present the following case: each view contains useful information about an instance, and the information in different views is complementary to some degree. A well-designed learning algorithm that leverages all the views can greatly improve performance. In particular, analytical tools for multimodal single-cell data have helped reconstruct gene regulatory networks, a significant leap forward for revealing the inner workings of biological systems [7,8]. In biomedical multi-view learning, several related but different tasks have been pursued, e.g., multi-view classification [9], multi-view clustering [10], multi-view deconvolution [11], and multi-view data integration [12]. Here, we focus on unsupervised multi-view clustering, used to reveal the underlying cellular structure that can assist downstream analysis.
The authors of [13] proposed to divide multi-view clustering methods for genomics data into three types: early, intermediate, and late integration. The early integration type contains methods that concatenate variables across all the views. Many drawbacks, e.g., the sharp increase in dimension, the neglect of special statistical properties of particular views, have been noted for such methods [14]. These issues, especially severe for multimodal single-cell data, are mitigated to some extent by intermediate integration methods. According to a few highly regarded surveys on multi-view learning [15][16][17][18], intermediate integration methods include well-known multi-view clustering algorithms that belong to four schools: co-training, multiple kernel clustering, multi-view subspace clustering, and multi-view graph clustering. These methods combine data from multiple views into one set using weights, transformations, or simplification based on similarity or dimension reduction. In contrast, the late integration methods, which mostly belong to ensemble clustering methods, generate aggregated clusters based on clustering results obtained in every single view. Example methods of late integration ensemble clustering include [19][20][21][22]. Specifically, in [21,22], dissimilarity measures between clusters in different results are computed based on the cluster membership of samples in each result. Dendrogram clustering is then applied to yield an integrated clustering result.
Ensemble clustering methods are popular for treating multi-view data, for example, the fast multi-view clustering via ensembles (FastMICE) method [23]. However, not all ensemble methods strictly adhere to the early, intermediate, and late integration taxonomy. FastMICE [23] is of particular interest because it employs a hybrid of early and late integration. This method of mixed early-late integration aims to identify a consensus among view-group clustering results, with each view group containing a random number of views, such as a single view or multiple views. The clusters in multi-view groups may be established using early integration.
Late integration methods overcome some disadvantages of early or intermediate integration [19][20][21][22]. Notably, it is straightforward for such methods to incorporate advanced clustering methods developed for single-view data since they operate on single-view cluster labels instead of the original data. Consequently, late integration methods are easier to adopt when data in multiple views cannot be pooled, for instance, due to privacy concerns. On the other hand, the advantages of late integration come at a cost: since such methods only examine the single-view cluster labels for integration, relevant information that is in the original data but not retained in the cluster labels cannot be leveraged.
There are two primary principles for multi-view clustering, namely, the consensus principle [24] and the complementary principle [15]. The consensus principle assumes a shared clustering structure across all views, so clustering results from different views are considered variations of a single clustering result. Methods developed under this principle seek a plausible "average" of the clustering results. In contrast, the complementary principle emphasizes that clusters may only emerge when data from all views are analyzed together, as is often observed with single-cell data. Early integration methods naturally follow this principle since original data from all views are combined. However, incorporating the complementary principle into late integration methods is challenging because they only have access to cluster labels from different views. In fact, existing late integration clustering methods ignore the complementary effect between views.
Most methods of the intermediate integration type also ignore the complementary principle. For instance, multi-view clustering algorithms by co-training [25,26] make the underlying assumptions of sufficiency and compatibility: (a) each view is sufficient for clustering on its own, and (b) the target functions of the views predict the same labels for co-occurring features with high probability, among others. Under the sufficiency assumption, co-training methods aim at maximizing agreement between two views (the consensus principle). In addition, the compatibility assumption restricts the clustering algorithms allowed; specifically, similar algorithms are used in different views. Not only co-training methods but also other methods of the intermediate integration type, e.g., multiple kernel clustering [27][28][29], multi-view subspace clustering [30][31][32][33], and multi-view graph clustering [34][35][36], are by construction not ready to leverage state-of-the-art algorithms for clustering single-view data. Furthermore, multiple kernel clustering methods do not scale well with the sample size due to the quadratic complexity (in sample size) of computing the kernel matrix. Multi-view subspace clustering methods assume implicitly that a shared latent subspace across the views determines the clusters (the spirit of the consensus principle). Multi-view graph clustering methods, which aim at finding a fusion graph from multiple views, are vulnerable to noisy datasets because they ignore inconsistency between views [37]. A number of algorithms have been designed specifically for multimodal single-cell data, e.g., weighted-nearest neighbor (WNN) analysis [5], totalVI [38], and multi-omics factor analysis v2 (MOFA+) [39]. Based on the comparison in [5], WNN is the state-of-the-art multimodal single-cell clustering algorithm, but it is a multi-view graph clustering approach with the disadvantages discussed previously.
Last but not least, the intermediate integration methods must centralize the multiple views at the data level and hence are not applicable to distributed data.
In this paper, we aim at developing a late integration method that accounts for both the consensus and complementary principles. Although late integration has many benefits for single-cell data, existing methods overlook the importance of the complementary principle. We illustrate the significance of this principle through three simulated scenarios, with detailed findings provided in Fig A, Fig B, and Fig C in S1 Appendix. Our results demonstrate that the complementary effect between views plays an essential role for identifying meaningful clusters.
Our novel algorithm called Covering Point Set Merge (CPS-merge) analysis contributes to the paradigm of late integration by combining the two principles, namely, the consensus and complementary principles. In CPS-merge analysis, we create Cartesian product clusters based on single-view clusters. Many product clusters may arise due to randomness and may not represent meaningful subgroups. To address this issue, we have developed a computationally efficient approach to merge product clusters by considering the uncertainty level of each cluster. Furthermore, we propose a new measure to quantify the contribution of each view to the identification of any final cluster. This measure is valuable for understanding cell heterogeneity in single-cell studies.

Overview of analysis pipeline
The pipeline of CPS-merge analysis is shown in Fig 1. CPS-merge analysis generates an aggregated multi-view clustering result by the following modules.
• Module 1: Data are perturbed by random noise in each view. A collection of clustering results (aka, partitions) is obtained from the perturbed data using a view-specific and user-specified clustering algorithm. The same algorithm is applied to the original data to yield a clustering result which we call the reference partition. Then clusters in different clustering results are aligned with the reference partition via optimal transport, a step that removes inconsistency in the cluster labels used across results.
• Module 2: We form new clusters by the Cartesian product of the clusters from two or more views, that is, each ordered pair (or k-tuples in general) of cluster labels from the two views defines one cluster.
• Module 3: To obtain a final clustering result, we merge unstable clusters progressively to maximize tightness given a specified number of final clusters. The tightness measure, defined in [40], quantifies clustering stability; for a comprehensive review of clustering stability, see [41]. If the number of Cartesian product clusters at the start of merging is large (for example, more than 100), we conduct a first-stage merging by bipartite clustering. Otherwise, we directly begin the second-stage merging using Covering Point Set (CPS) analysis, available via the R package OTclust [40].
The output of CPS-merge analysis contains an integrated clustering result and quantities that measure the contribution of each view to the final clusters. The Cartesian product clusters from multiple views capture the complementary effects between the views. On the other hand, these clusters are subject to merging based on cross-view correspondence between clusters. We establish this correspondence by CPS analysis under the consensus principle. The cross-view correspondence exists as a mapping between clusters or between the so-called super-clusters, the former by optimal transport and the latter by bipartite clustering.
The computational complexity of CPS-merge analysis depends on all the modules. In Module 1, the complexity of generating perturbed data is linear in the sample size. The complexity of the single-view clustering algorithms used can vary; however, many clustering algorithms have linear complexity in the sample size. In Modules 2 and 3, CPS-merge analysis applies optimal transport or bipartite clustering to the single-view clusters instead of the original data points. Hence the complexity is quadratic in the number of clusters, which is usually much smaller than the sample size. In summary, if the single-view clustering algorithms have linear complexity in the sample size, the complexity of CPS-merge analysis will also be linear unless the number of clusters is of the same order as the sample size.
Although there are usually two views in current multimodal single-cell datasets, our method extends straightforwardly to more than two views. In such a case, our method can be applied either directly to the Cartesian product clusters across all the views or progressively to two views at a time, e.g., aggregating the first two views into one and then taking the third view as the other view, and so forth. Without loss of generality, we assume there are two views in our discussion. Next, we elaborate on each module of CPS-merge analysis.

Module 1: Generate coherent cluster labels within each view
Because our algorithm aims at maximizing the overall tightness of a clustering result, we need to generate random variations of a clustering result within each view to evaluate tightness. The tightness measure is defined for both individual clusters and the entire partition to quantify the level of uncertainty. We first explain in the steps listed below how to obtain random variations of a clustering result. Details for the definition and computation of the tightness of a cluster are provided at the end of this Section.
1. Apply clustering to the original single-view data and call the result reference partition.
2. Perturb the original single-view data by adding random noise to each point. The noise is sampled from a Gaussian distribution with mean zero and variances adjusted with the data. We usually set the variance to be 10% of the average within-cluster variance. Repeat the perturbation step to obtain multiple perturbed versions of the whole dataset.
3. Obtain a collection of partitions by applying a user-chosen clustering algorithm to every perturbed dataset.
4. Align clusters in those partitions with the clusters in the reference partition.
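As an illustration, steps 1-3 above can be sketched as follows. Here k-means (via SciPy's `kmeans2`) is only a stand-in for the user-chosen clustering algorithm, and the function name and defaults are ours, not the paper's implementation:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def perturbed_partitions(X, k, m=20, noise_frac=0.10, seed=0):
    """Steps 1-3: reference partition plus m partitions of noise-perturbed data.

    The clustering call (k-means here) is a placeholder for any user-chosen
    single-view algorithm; noise_frac=0.10 mirrors the 10%-of-within-cluster
    -variance rule of thumb in the text.
    """
    rng = np.random.default_rng(seed)
    _, ref = kmeans2(X, k, minit='++', seed=seed)      # step 1: reference partition
    # average within-cluster variance, pooled over clusters and dimensions
    within = np.mean([X[ref == c].var(axis=0).mean() for c in range(k)])
    sigma = np.sqrt(noise_frac * within)
    parts = []
    for p in range(m):                                  # steps 2-3: perturb, recluster
        Xp = X + rng.normal(0.0, sigma, size=X.shape)
        _, labels = kmeans2(Xp, k, minit='++', seed=seed + 1 + p)
        parts.append(labels)
    return ref, parts
```

The returned partitions still use arbitrary labels; step 4 (alignment) is what makes them comparable.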
Note that our algorithm works with any baseline clustering algorithm chosen for a single view. Thus users can easily incorporate state-of-the-art clustering algorithms. Because of the unsupervised nature of clustering, Step (4), which aligns clusters with the reference partition as a benchmark, is necessary since cluster labels used in different partitions are not automatically consistent. For instance, even if two partitions are identical, the cluster labels may be permuted. In practice, precise correspondence between clusters in two partitions rarely exists. We use a cluster alignment algorithm based on optimal transport [42]. Next, we explain how clustering results are aligned within each view. More details can be found in [42].
Suppose there are two partitions denoted by $\mathcal{P}^{(p)}$, $p = 1, 2$, each containing $k_p$ clusters $C^{(p)}_1, \ldots, C^{(p)}_{k_p}$. The alignment between clusters is captured by a so-called cluster aligning matrix $W = (w_{i,j})$: the entry $w_{i,j}$ is a coupling/matching weight between $C^{(1)}_i$ and $C^{(2)}_j$, a higher value indicating a stronger match. For example, if $\mathcal{P}^{(2)}$ contains the same clusters as $\mathcal{P}^{(1)}$ but with permuted labels, $W$ will encode the permutation by having $w_{i,j} > 0$ if the $i$th cluster in $\mathcal{P}^{(1)}$ is the $j$th cluster in $\mathcal{P}^{(2)}$, and $w_{i,j'} = 0$ for $j' \neq j$. In general, $W$ can specify partial matching between clusters in order to handle more complicated situations, e.g., $k_1 \neq k_2$, one cluster splitting into multiple clusters, etc. $W$ is solved by optimal transport (OT) with the objective of minimizing the weighted sum of distances between every pair of clusters across the two partitions. OT ensures that if the two clustering results are permuted versions of each other, the permutation will be identified through the solution for $W$.

Suppose each cluster $C^{(p)}_i$ is assigned a significance weight $q^{(p)}_i$, the proportion of data points in $C^{(p)}_i$, and let $d(\cdot, \cdot)$ be the distance between two clusters. We solve $W$ by OT:

$$\min_{W}\; \sum_{i=1}^{k_1} \sum_{j=1}^{k_2} w_{i,j}\, d\big(C^{(1)}_i, C^{(2)}_j\big), \quad \text{s.t.}\;\; \sum_{j=1}^{k_2} w_{i,j} = q^{(1)}_i,\;\; \sum_{i=1}^{k_1} w_{i,j} = q^{(2)}_j,\;\; w_{i,j} \geq 0.$$

The Jaccard distance is adopted as the distance between clusters, i.e.,

$$d\big(C^{(1)}_i, C^{(2)}_j\big) = 1 - \frac{\big|C^{(1)}_i \cap C^{(2)}_j\big|}{\big|C^{(1)}_i \cup C^{(2)}_j\big|},$$

where $|\cdot|$ denotes the cardinality of a set, $\cap$ the intersection of sets, and $\cup$ the union of sets. The first two constraints on the $w_{i,j}$'s ensure that the total influence of any cluster is determined by its proportion. The objective is to minimize the weighted sum of the matching costs between clusters. The minimized objective function $D(\mathcal{P}^{(1)}, \mathcal{P}^{(2)})$ is defined as the distance between the two partitions, often called the Wasserstein distance. Consider $\mathcal{P}^{(2)}$ as the reference partition. After obtaining $W$, we normalize its $i$th row and define $\gamma_{i,j} = w_{i,j} / q^{(1)}_i$ ($q^{(1)}_i$ is the proportion of data points in $C^{(1)}_i$), which indicates the proportion of cluster $C^{(1)}_i$ mapped to cluster $C^{(2)}_j$. Denote this cluster mapping matrix by $\Gamma^{(1)} = (\gamma^{(1)}_{i,j})_{i=1,\ldots,k_1;\, j=1,\ldots,k_2}$.

For the general case of aligning more than two partitions, suppose we have a reference partition $\mathcal{P}^{(r)}$ that contains $\kappa$ clusters $C^{(r)}_1, \ldots, C^{(r)}_{\kappa}$. Let the proportion of points in $C^{(r)}_j$ be $q^{(r)}_j$, $j = 1, \ldots, \kappa$. Similarly, suppose we have $m$ other partitions to align with the reference, each partition $\mathcal{P}^{(p)}$ containing $k_p$ clusters $C^{(p)}_1, \ldots, C^{(p)}_{k_p}$, with the proportion of points in cluster $C^{(p)}_i$ being $q^{(p)}_i$, $i = 1, \ldots, k_p$, $p = 1, \ldots, m$. We align each $\mathcal{P}^{(p)}$ with $\mathcal{P}^{(r)}$. Let the cluster aligning matrix from $\mathcal{P}^{(p)}$ to $\mathcal{P}^{(r)}$ be $W^{(p)} = (w^{(p)}_{i,j})_{i=1,\ldots,k_p;\, j=1,\ldots,\kappa}$ and the cluster mapping matrix be $\Gamma^{(p)} = (\gamma^{(p)}_{i,j})_{i=1,\ldots,k_p;\, j=1,\ldots,\kappa}$. Denote the cluster-posterior matrix of partition $\mathcal{P}^{(p)}$ by $\Pi^{(p)} = (\pi^{(p)}_{h,i})_{h=1,\ldots,n;\, i=1,\ldots,k_p}$, where $\pi^{(p)}_{h,i}$ is the posterior probability of the $h$th data point belonging to cluster $C^{(p)}_i$.

We then define the aligned cluster-posterior matrix, based on $\Pi^{(p)}$ but "translated" to the cluster labels of the reference $\mathcal{P}^{(r)}$, by

$$\Pi^{(p \to r)} = (\pi^{(p \to r)}_{h,j})_{h=1,\ldots,n;\, j=1,\ldots,\kappa} = \Pi^{(p)}\, \Gamma^{(p)}.$$

Each row of the cluster-posterior matrix $\Pi^{(p \to r)}$ is regarded as the posterior probabilities of the corresponding point belonging to each cluster in the reference clustering result. We then use the maximum a posteriori (MAP) rule to assign the aligned cluster labels (i.e., the cluster labels used in the reference partition) to the sample points in partition $\mathcal{P}^{(p)}$, $p = 1, \ldots, m$. In this way, we obtain a collection of clustering results with consistent cluster labels under each view. Next, three key definitions are provided below.
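A minimal sketch of the alignment step for two hard partitions given as label vectors: the OT problem with Jaccard costs and cluster-proportion marginals is solved here with SciPy's linear-programming solver. The function name is ours, and the authors' OTclust implementation may differ in details:

```python
import numpy as np
from scipy.optimize import linprog

def align_partitions(labels1, labels2):
    """Solve the OT cluster alignment between two partitions.

    Returns the cluster aligning matrix W (coupling with cluster-proportion
    marginals and Jaccard transport costs) and the row-normalized cluster
    mapping matrix Gamma.
    """
    c1, c2 = np.unique(labels1), np.unique(labels2)
    k1, k2 = len(c1), len(c2)
    q1 = np.array([(labels1 == a).mean() for a in c1])   # cluster proportions
    q2 = np.array([(labels2 == b).mean() for b in c2])
    # Jaccard distances between clusters across the two partitions
    D = np.zeros((k1, k2))
    for i, a in enumerate(c1):
        A = labels1 == a
        for j, b in enumerate(c2):
            B = labels2 == b
            D[i, j] = 1.0 - (A & B).sum() / (A | B).sum()
    # linear program: min <D, W> s.t. row sums = q1, column sums = q2, W >= 0
    A_eq = np.zeros((k1 + k2, k1 * k2))
    for i in range(k1):
        A_eq[i, i * k2:(i + 1) * k2] = 1.0               # row-sum constraints
    for j in range(k2):
        A_eq[k1 + j, j::k2] = 1.0                        # column-sum constraints
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([q1, q2]),
                  bounds=(0, None), method='highs')
    W = res.x.reshape(k1, k2)
    Gamma = W / q1[:, None]      # proportion of cluster i mapped to cluster j
    return W, Gamma
```

For two partitions that are label permutations of each other, the solution recovers the permutation, as the text asserts.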
Covering Point Set (CPS). Suppose we have already obtained the cluster mapping matrix $\Gamma^{(p)}$. Similarly, we normalize $W^{(p)}$ column-wise to obtain $\tilde{\Gamma}^{(p)} = (\tilde{\gamma}^{(p)}_{i,j})$, where $\tilde{\gamma}^{(p)}_{i,j} = w^{(p)}_{i,j} / q^{(r)}_j$ and $q^{(r)}_j$ is the proportion of data points in $C^{(r)}_j$. Based on $\Gamma^{(p)}$ and $\tilde{\Gamma}^{(p)}$, four types of topological relationships between clusters are defined: "match", "split", "merge", and "lack of correspondence". For example, the "match" relationship holds between $C^{(p)}_i$ and $C^{(r)}_j$ when $\gamma^{(p)}_{i,j} \geq \zeta$ and $\tilde{\gamma}^{(p)}_{i,j} \geq \zeta$, where $\zeta$ is a relaxation threshold set between 0.5 and 1. If the "match" relationship holds between $C^{(p)}_i$ and $C^{(r)}_j$, they are considered to be the same cluster but possibly labeled differently in $\mathcal{P}^{(p)}$ and $\mathcal{P}^{(r)}$. Detailed definitions of all the topological relationships can be found in [42].
Suppose for the $k$th cluster in the reference clustering result, there is a collection of matched clusters $S_i$, $i = 1, \ldots, m$, each a subset of the whole dataset $\{x_1, \ldots, x_n\}$. Then the covering point set (CPS) $S_\alpha$ of cluster $k$ at a coverage level $\alpha$ is defined as the smallest set such that at least $100(1-\alpha)\%$ of the $S_i$'s are subsets of $S_\alpha$, that is, the solution of the optimization problem

$$\min_{S} |S|, \quad \text{s.t.}\; \sum_{i=1}^{m} \mathbb{1}(S_i \subseteq S) \geq m(1-\alpha)$$

(we use the Least Impact First Targeted-removal algorithm developed in [42]). In summary, the CPS, a counterpart of the confidence interval of a numerical estimate, is a set of possible points for one cluster at a certain level of coverage.
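The CPS optimization can be illustrated with a simple greedy heuristic: grow a union of matched sets, always adding the set that enlarges the union least, until enough sets are covered. The paper uses the exact Least Impact First Targeted-removal algorithm of [42], so the routine below is only an approximation, and the names are ours. We also include the Jaccard distance between two point sets, which is the ingredient used later for CAP separability:

```python
from math import ceil

def covering_point_set(matched_sets, alpha):
    """Greedy sketch of a CPS: repeatedly add the matched set that grows the
    current union the least, until at least ceil(m*(1-alpha)) of the sets are
    subsets of the union.  (Approximation of the LIFT-removal algorithm.)
    """
    sets = [set(s) for s in matched_sets]
    m = len(sets)
    need = ceil(m * (1 - alpha))
    cps, covered = set(), 0
    remaining = list(sets)
    while covered < need:
        best = min(remaining, key=lambda s: len(s - cps))  # least growth
        cps |= best
        remaining.remove(best)
        covered = sum(s <= cps for s in sets)              # subset count
    return cps

def jaccard_set_distance(set_a, set_b):
    """Jaccard distance between two point sets, in [0, 1]."""
    return 1.0 - len(set_a & set_b) / len(set_a | set_b)
```

Because the union of all sets trivially covers every set, the loop always terminates.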
Tightness. Suppose there are $m$ other partitions in total, and the proportion of partitions that have a cluster "matched" with the $k$th cluster in the reference partition is $p_k$ (e.g., some partitions can have "lack of correspondence" or other relationships for reference cluster $k$). For those partitions that contain a cluster matched to cluster $k$, let the matched clusters be the sets $S_1, \ldots, S_{m_k}$, $m_k = p_k m$. The tightness of cluster $k$ is defined as

$$R_t(k) = p_k \cdot \frac{\big|\bigcap_{i=1}^{m_k} S_i\big|}{\big|\bigcup_{i=1}^{m_k} S_i\big|}.$$

Also, the overall tightness of the whole partition, denoted by $\bar{R}_t$, is defined as the average of the tightness values of the individual clusters. A larger value of tightness indicates more stable clustering.
Cluster Alignment and Points based (CAP) separability. We first compute the CPS of each cluster in the reference partition, denoted by $S_\alpha(C^{(r)}_j)$. Large overlap between the CPSs of different clusters indicates poor separation between them. The Cluster Alignment and Points based (CAP) separability between two clusters $C^{(r)}_j$ and $C^{(r)}_{j'}$ is defined as

$$d\big(S_\alpha(C^{(r)}_j),\, S_\alpha(C^{(r)}_{j'})\big),$$

where $d(\cdot, \cdot)$ can be any distance between two sets of points. We use the Jaccard distance, which lies in [0, 1].

To set up notation for the two views, denote the reference partitions in the two views by $A$ and $B$, containing $\kappa_A$ and $\kappa_B$ clusters respectively, and the partitions obtained from the perturbed data by $A^{(p)}$ and $B^{(p)}$, $p = 1, \ldots, m$, with cluster labels $a^{(p)}_h \in \{1, \ldots, \kappa_A\}$ and $b^{(p)}_h \in \{1, \ldots, \kappa_B\}$, $h = 1, \ldots, n$. After cluster alignment with the reference partition $A$ (or $B$), described in the previous subsection, the cluster labels in any $A^{(p)}$ (or $B^{(p)}$) are consistent with those used in $A$ (or $B$). In the subsequent discussion, we assume this is always the case.

Module 2: Form Cartesian product clusters
Consider a random pair of clustering results in the two views: $A^{(p_a)}$ and $B^{(p_b)}$. For every point $h$, the pair of cluster labels $(a^{(p_a)}_h, b^{(p_b)}_h)$ determines the Cartesian product cluster to which the point belongs. The total number of Cartesian product clusters is $\kappa_A \times \kappa_B$. We denote the Cartesian product clustering result of these two clustering results by $(A^{(p_a)}, B^{(p_b)})$. We simply call $(A^{(p_a)}, B^{(p_b)})$ a product partition and $(a^{(p_a)}_h, b^{(p_b)}_h)$ a product label. Let $\mathcal{A} = \{A^{(p)}, p = 1, \ldots, m\}$, $\mathcal{B} = \{B^{(p)}, p = 1, \ldots, m\}$, and let the Cartesian product of the two sets be $\mathcal{A} \times \mathcal{B} = \{(A^{(p_a)}, B^{(p_b)}),\, p_a, p_b = 1, \ldots, m\}$. To reduce computation, our algorithm uses a subset of $\mathcal{A} \times \mathcal{B}$ to carry out the analysis: $\mathcal{C} = \{(A^{(p)}, B^{(p)}), p = 1, \ldots, m\}$. Since clustering results in different views are obtained independently, $\mathcal{C}$ is formed essentially by randomly pairing up the partitions across the two views and keeping $m$ pairs.
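Forming product clusters from two aligned label vectors is simple in practice; a sketch (the function name is ours):

```python
import numpy as np

def product_labels(a_labels, b_labels):
    """Cartesian product clustering: each observed (a, b) label pair is one
    product cluster.  Returns integer product-cluster ids per point and the
    list of (a, b) pairs actually observed (empty product clusters never
    appear, mirroring the remark in the text)."""
    pairs = np.stack([a_labels, b_labels], axis=1)
    uniq, inv = np.unique(pairs, axis=0, return_inverse=True)
    return inv.reshape(-1), [tuple(p) for p in uniq]
```

Note that at most min(n, κ_A × κ_B) product clusters can be non-empty.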

Module 3: Integration across multiple views
Optimization objective of multi-view clustering. If we assume the clustering results in the two views are fully complementary, the product clusters induced by $(A, B)$ (the product partition of the reference clustering results in the two views) can be taken as the final clusters, an example of which is shown in Fig 2c. In practice, however, the views are usually not fully complementary. Moreover, the number of clusters in the product partition is often too large (roughly exponential in the number of views). Due to randomness in the data and nuances in the clustering algorithms, an observed product cluster, e.g., all the points with product label (1, 2), may not truly exist. We thus propose to merge the product clusters such that the overall tightness of the final clusters is maximized. The perturbed versions of $(A, B)$, specifically $(A^{(p)}, B^{(p)})$, provide the basis for computing the tightness of product clusters. Let the product clusters generated by $(A, B)$ be denoted by labels $(\xi_A, \xi_B)$, $\xi_A \in \{1, \ldots, \kappa_A\}$, $\xi_B \in \{1, \ldots, \kappa_B\}$. Suppose the desired number of clusters is $\kappa_1$. How the product clusters are merged into $\kappa_1$ clusters is given by a many-to-one mapping from a product label to a label in the set $\{1, 2, \ldots, \kappa_1\}$. Denote the mapping from the product clusters to the final clusters by $f$, where $f(\xi_A, \xi_B) \in \{1, \ldots, \kappa_1\}$. Denote the tightness of the $k$th cluster by $R_t(k)$, $k = 1, \ldots, \kappa_1$. The $k$th cluster is formed by the points in all the product clusters $(\xi_A, \xi_B)$ such that $f(\xi_A, \xi_B) = k$. Then the optimization objective is

$$\max_{f}\; \bar{R}_t = \frac{1}{\kappa_1} \sum_{k=1}^{\kappa_1} R_t(k). \quad (2)$$

The above optimization problem is intrinsically combinatorial. We thus propose a greedy algorithm that exploits a two-stage merging procedure. The first stage is optional and aims at improving computational efficiency. If the number of clusters in the product partition is small to begin with (e.g., fewer than 100), we can skip the first-stage merging and thus bipartite clustering.

First-stage: Generating and aligning super-clusters across views.
In the first-stage merging, we use bipartite clustering to generate the so-called super-clusters. Correspondence between clusters in different views can occur at a structural level higher than the original clusters. For instance, a cluster may split into several clusters in another view, or vice versa, multiple clusters may merge into one. Bipartite clustering aims at finding groups of clusters (aka, super-clusters) for which cross-view correspondence is sharp. With details to be explained shortly, super-clusters help decrease the number of Cartesian product clusters that proceed to the second-stage merging, thus improving computational efficiency. For large product clusters containing high proportions of data points, we determine how they aggregate mostly in the second-stage merging, while smaller product clusters are more likely to be combined based on super-clusters. Note that we do not replace the original clusters by super-clusters. The restrictive usage of super-clusters reflects a careful balance between applying the consensus and complementary principles.
We build a bipartite graph [43,44] for the clusters in the reference partitions $A$ and $B$ under the two views respectively. Let the nodes in set $U$ correspond one-to-one with the clusters in $A$ and the nodes in $V$ correspond with those in $B$. Edges exist only between a node in $U$ and a node in $V$. Recall that there are $\kappa_A$ clusters $\phi_1, \ldots, \phi_{\kappa_A}$ in the first view and $\kappa_B$ clusters $\psi_1, \ldots, \psi_{\kappa_B}$ in the second view. A cluster aligning matrix $W$ (a $\kappa_A \times \kappa_B$ matrix) is computed to indicate the extent of matching between any $\phi_i$, $i = 1, \ldots, \kappa_A$, and $\psi_j$, $j = 1, \ldots, \kappa_B$. We compute $W$ using OT in the same way as described in Section "Module 1: Generate coherent cluster labels within each view". For each $(A^{(p)}, B^{(p)}) \in \mathcal{C}$, let the cluster aligning matrix between $A^{(p)}$ and $B^{(p)}$ be $W^{(p)}$, $p = 1, \ldots, m$. We define the average of the $W^{(p)}$'s, $\bar{W} = \sum_{p=1}^{m} W^{(p)} / m$, as the matching weight matrix. Writing $\bar{W} = (\bar{w}_{i,j})_{i=1,\ldots,\kappa_A;\, j=1,\ldots,\kappa_B}$, we take $\bar{w}_{i,j}$ as the edge weight between nodes $\phi_i$ and $\psi_j$, a larger $\bar{w}_{i,j}$ indicating a stronger connection between $\phi_i$ and $\psi_j$. Then we use the Leiden algorithm [45,46] for bipartite clustering. Each cluster generated in this way is what we call a super-cluster, generally containing multiple clusters in both views.
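A dependency-free sketch of the first-stage grouping: the paper applies the Leiden algorithm to the weighted bipartite graph, whereas for illustration we substitute a plainly simpler stand-in, connected components after pruning weak edges. All names and the threshold are ours:

```python
def super_clusters(W_bar, threshold=0.05):
    """Group the kappa_A + kappa_B reference clusters into super-clusters.

    W_bar is the (kappa_A x kappa_B) matching weight matrix; an edge between
    phi_i and psi_j is kept when its weight exceeds `threshold`, and each
    connected component of the pruned bipartite graph becomes one
    super-cluster (a crude substitute for Leiden community detection).
    Uses union-find over the bipartite nodes.
    """
    kA, kB = len(W_bar), len(W_bar[0])
    parent = list(range(kA + kB))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for i in range(kA):
        for j in range(kB):
            if W_bar[i][j] > threshold:     # keep only strong matches
                ri, rj = find(i), find(kA + j)
                if ri != rj:
                    parent[ri] = rj
    roots = [find(x) for x in range(kA + kB)]
    ids = {r: s for s, r in enumerate(dict.fromkeys(roots))}
    labels = [ids[r] for r in roots]
    return labels[:kA], labels[kA:]         # super-cluster id per phi_i, psi_j
```

With this grouping, a product cluster (phi_i, psi_j) is "matched" when the two labels fall in the same super-cluster and "unmatched" otherwise, matching the MP/UP distinction below.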
Next, we illustrate the first-stage merging with the example shown in Fig 2. Suppose 4 super-clusters $S_1, \ldots, S_4$ are formed, each containing multiple clusters in every view. For instance, suppose $\phi_1$, $\phi_2$, $\psi_1$, and $\psi_2$ belong to $S_1$ (two from each view). Product clusters formed by two cluster labels belonging to the same super-cluster (those shown in the colored diagonal blocks in Fig 2b) are kept for further analysis in the second-stage merging, e.g., $(\phi_1, \psi_1)$, $(\phi_1, \psi_2)$. A product cluster lies in an off-diagonal white block if its two cluster labels belong to different super-clusters, e.g., $(\phi_1, \psi_3)$, $(\phi_2, \psi_3)$. We call $(S_i, S_j)$, $i \neq j$, an unmatched product (UP) super-cluster, and $(S_i, S_i)$ a matched product (MP) super-cluster.
In a nutshell, we will analyze the product clusters belonging to a UP super-cluster at the granularity of the super-cluster, but those belonging to an MP super-cluster at the granularity of the original product clusters. Specifically, we merge all the product clusters in any UP super-cluster; they become a single cluster in the second stage. Because of the nature of bipartite clustering, small product clusters tend to be located in UP super-clusters. For example, in Fig 2a, only the following clusters proceed to the second stage: all $(S_i, S_j)$, $i \neq j$, and all $(\phi_m, \psi_n)$ with $\phi_m$ and $\psi_n$ belonging to the same $S_i$. Since the product clusters $(\phi_i, \psi_j)$ capture the interaction effects between the two views, in our approach, the interaction effect between clusters in the same super-cluster is examined at a more refined granularity.
In practice, it is possible that some product clusters are empty. Obviously, empty clusters will not feature in later analysis. Furthermore, we often observe clusters that hardly arise, which we call "rare clusters". In particular, suppose there are m clustering results. If a product cluster label is not taken by sample points at least m times across the m results (less than one time per result on average), we say it is "rare". Points assigned with a rare cluster label are relabeled by a majority vote. For any such point, we find its most frequent cluster label among the m results and assign this label to this point in all the results.
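The rare-cluster rule above can be sketched directly; the function name and the handling of the corner case where a point's majority label is itself rare are our choices:

```python
import numpy as np
from collections import Counter

def relabel_rare(product_label_mat):
    """product_label_mat: (m, n) array; row p holds the product-cluster labels
    of all n points in the p-th product partition.  A label taken fewer than
    m times in total (less than once per result on average) is 'rare'.  Every
    point carrying a rare label is reassigned its own most frequent label
    across the m results, in all results (if that majority label is itself
    rare, we still use it, a simplification of our own).
    """
    L = product_label_mat.copy()
    m, n = L.shape
    counts = Counter(L.ravel().tolist())
    rare = {lab for lab, c in counts.items() if c < m}
    for h in range(n):
        col = L[:, h].tolist()
        if any(lab in rare for lab in col):
            vote = Counter(col).most_common(1)[0][0]  # majority label of point h
            L[:, h] = vote                            # assigned in all m results
    return L
```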
Second-stage: Separability-based merging to maximize tightness. Recall that the reference partitions in the two views are $A$ and $B$, containing $\kappa_A$ and $\kappa_B$ clusters respectively. In addition, the two views have multiple aligned clustering results, randomly paired up to produce $m$ Cartesian product clustering results: $\mathcal{C} = \{(A^{(p)}, B^{(p)}), p = 1, \ldots, m\}$. For the convenience of the following discussion, suppose the first-stage merging has generated $\kappa_0$ clusters assigned with labels $1, \ldots, \kappa_0$. Let the mapping from $(\xi_A, \xi_B)$, $\xi_A \in \{1, \ldots, \kappa_A\}$, $\xi_B \in \{1, \ldots, \kappa_B\}$, to those $\kappa_0$ labels be $g_0(\xi_A, \xi_B) \in \{1, \ldots, \kappa_0\}$. Note that if the first-stage merging is skipped, $g_0$ is simply the one-to-one identity mapping (otherwise, a many-to-one mapping). Denote the clustering result induced by $g_0$ on $(A, B)$ by $C$, which contains clusters $C_1, \ldots, C_{\kappa_0}$. We call $C$ the combined reference partition. To generate the combined partition based on any $(A^{(p)}, B^{(p)})$, $p = 1, \ldots, m$, we apply OT to align the Cartesian product clusters of $(A^{(p)}, B^{(p)})$ with those of $C$, in the same way as described in Section "Module 1: Generate coherent cluster labels within each view". Denote the $p$th combined partition obtained from $(A^{(p)}, B^{(p)})$ by $C^{(p)}$, and let $\tilde{\mathcal{C}} = \{C^{(p)}, p = 1, \ldots, m\}$.
To solve the optimization problem Eq (2), clusters in $C$ are merged based on a quantity called Cluster Alignment and Points based (CAP) separability [42]. A higher separability corresponds to a lower similarity. These tools are collectively called CPS (Covering Point Set) analysis. To conduct CPS analysis, we only need cluster membership information for a collection of clustering results. In our case, the reference partition is $C$ and the collection of clustering results obtained from the perturbed datasets is $\tilde{\mathcal{C}}$. Since it is computationally infeasible to examine all possible ways of merging the clusters into $\kappa_1$ final clusters, we propose to recursively merge clusters, two at a time, in the same manner as creating a dendrogram.
CPS merging. We use the pair-wise separability between clusters, provided by CPS analysis, as the cluster distance. We also compute the tightness of every cluster, a higher value of tightness indicating higher stability (or lower uncertainty). Suppose C_i is the most unstable cluster. C_i usually yields low separability from many other clusters, but the lowest pair-wise separability does not necessarily arise between C_i and some other cluster. To increase the overall tightness, we first merge C_i with the cluster closest to it, that is, the one with the lowest separability from C_i. After every merge, the per-cluster tightness and pair-wise separability measures are updated, and the merging continues recursively, producing a dendrogram. In our experiments, we stop the process when the required number of clusters is reached (by which point the average tightness usually exceeds 0.8). The computation required is more intensive than the usual way of generating dendrograms using a linkage scheme: because we cannot update the separability or tightness recursively based on a linkage function, these quantities are computed from scratch after every merge. We have therefore designed an accelerated version of the merging process, which is presented below.
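A minimal sketch of this merging loop is given below. Here `separability(a, b, clusters)` and `tightness(c, clusters)` are hypothetical stand-in callables for the CPS-analysis quantities; in the real method they are recomputed from the perturbed partitions after every merge, since no linkage-style recursive update is available.

```python
def cps_merge(clusters, separability, tightness, k_target):
    """Recursively merge clusters, two at a time, until k_target remain.

    clusters: dict mapping cluster label -> set of member point indices.
    separability/tightness: stand-ins for the CPS-analysis quantities,
    re-evaluated on the current partition at every step.
    Returns the merged partition and the merge history (a dendrogram)."""
    clusters = {k: set(v) for k, v in clusters.items()}
    history = []
    while len(clusters) > k_target:
        labels = list(clusters)
        # the least stable (lowest-tightness) cluster drives the merge
        worst = min(labels, key=lambda c: tightness(c, clusters))
        # merge it with its closest cluster, i.e. lowest separability
        partner = min((c for c in labels if c != worst),
                      key=lambda c: separability(worst, c, clusters))
        clusters[partner] |= clusters.pop(worst)
        history.append((worst, partner))
    return clusters, history
```

With toy callables (e.g., tightness as cluster size), the loop reproduces the described behavior: the least tight cluster is absorbed into its nearest neighbor at each step.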
Accelerated CPS merging. At each step of merging, we set a threshold for the tightness of clusters. Any cluster with tightness below the threshold is merged with its closest cluster. This rule essentially allows multiple merges to occur in one round without updating separability or tightness. After all such clusters are processed, we update the tightness and separability measures. If some of the updated tightness values still fall below the same threshold, we repeat the procedure, sometimes going through several rounds under a fixed threshold. If the tightness of every cluster is above the current threshold, merging can continue by gradually increasing the threshold. In practice, the thresholds are usually set as 0.35, 0.5, 0.65, and 0.8. We can apply other stopping criteria, for instance, each cluster reaching a minimum size, or a certain number of clusters having been reached (excluding singletons or tiny clusters). In most cases, the result converges at threshold 0.8 or meets another stopping criterion, e.g., reaching a required total number of clusters. If computational efficiency is a concern, this accelerated merging scheme is a close substitute for the first scheme. Users also have the option to combine the two schemes, for instance, applying the second scheme first to reduce the number of clusters to a certain level and then switching to the first scheme.
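The accelerated scheme can be sketched analogously. Again, `separability` and `tightness` are hypothetical stand-ins for the CPS quantities; the point of the structure is that all below-threshold clusters are merged in one round before the measures are refreshed.

```python
def accelerated_cps_merge(clusters, separability, tightness,
                          thresholds=(0.35, 0.5, 0.65, 0.8)):
    """Threshold-schedule variant of CPS merging (illustrative sketch).

    In each round, every cluster whose tightness falls below the current
    threshold is merged with its closest cluster; separability and
    tightness are refreshed only between rounds. When all clusters clear
    a threshold, the next (higher) threshold is tried."""
    clusters = {k: set(v) for k, v in clusters.items()}
    for t in thresholds:
        while len(clusters) > 1:
            low = [c for c in clusters if tightness(c, clusters) < t]
            if not low:
                break  # every cluster clears this threshold; raise it
            for c in low:
                if c not in clusters or len(clusters) == 1:
                    continue  # c may already have been absorbed
                partner = min((o for o in clusters if o != c),
                              key=lambda o: separability(c, o, clusters))
                clusters[partner] |= clusters.pop(c)
    return clusters
```

The early `break` is what lets the result "converge" at a threshold: once every cluster is tighter than the current level, no further merges happen until the threshold is raised.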
After the second-stage merging, we obtain a many-to-one mapping of cluster labels from {1, ..., κ_0} to {1, ..., κ_1}, κ_1 ⩽ κ_0, which is denoted by g_1. Applying g_1 to C, the combined reference partition, we obtain the final clustering result, denoted by F, which contains κ_1 clusters F_1, ..., F_{κ_1}. In summary, the composite mapping f = g_1 ∘ g_0 is a solution to the optimization problem Eq (2). Next, we present an approach to quantify the contribution of each view to the formation of any cluster. Understanding the role of each view in the generation of clusters is helpful in single-cell studies.

Evaluating cluster-wise contribution of each view
Clusters in the final result F usually do not correspond well with clusters in any single view, e.g., A or B. We propose two methods to assess the contribution of each view to the existence of any cluster in F . The two methods are suitable for different scenarios.
In the first scenario, we assume that the final clusters F_k, k = 1, ..., κ_1, have been approximately captured in a single view, which generally varies with the cluster. We treat F as the reference partition, and treat the raw partitions {Ã^(1), ..., Ã^(m)} from the first view (or {B̃^(1), ..., B̃^(m)} from the second), which have not been aligned with the single-view reference partition, as perturbed clustering results of F. We can then carry out CPS analysis and compute the tightness of each cluster F_k. Let R_t(k, l) be the tightness for F_k computed from the results in the lth view, l = 1, 2; the extension to more than two views is straightforward. We then define the contribution of the lth view to cluster F_k by

z_{k,l} = R_t(k, l) / Σ_{l'} R_t(k, l').

We call z_{k,l} the tightness-based contribution. The rationale for this definition is that if F_k is stable in one view but not in another, the former view plays the dominant role in the rise of F_k. By construction, the cluster-wise contribution of each view lies between 0 and 1, a higher score indicating a higher contribution. The definition of contribution presented above is based on the notion that the degree of uncertainty reflects the level of significance or contribution. The same concept has been utilized by existing methods that use a local weighting strategy, as seen in [47,48]. However, these methods presuppose that different partitions are independently generated. In our scenario, since different partitions within a single view are obtained by the same method on slightly perturbed data, the independence assumption is not appropriate, making those methods unsuitable for direct application.
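Assuming the tightness-based contribution z_{k,l} is the per-view tightness normalized across views (the form consistent with its stated range of [0, 1] and with the matching-weight-based contribution η_{k,l}), it reduces to a row normalization of the tightness matrix. The sketch below adopts that assumption, and assigns the uninformative value 1/L to clusters whose total tightness is zero across all L views.

```python
import numpy as np

def tightness_contribution(R):
    """z[k, l] = R_t(k, l) / sum over l' of R_t(k, l').

    R: (num_clusters x num_views) array of per-view tightness values.
    Rows with zero total tightness get 1/num_views (0.5 for two views),
    mirroring the convention used for eta in the text."""
    R = np.asarray(R, dtype=float)
    totals = R.sum(axis=1, keepdims=True)
    z = np.full_like(R, 1.0 / R.shape[1])   # fallback for all-zero rows
    np.divide(R, totals, out=z, where=totals > 0)
    return z
```

For instance, a cluster with tightness 0.8 in view 1 and 0.2 in view 2 receives contributions (0.8, 0.2).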
We note, however, that z_{k,l} is not a good choice for quantifying the contribution of each view when the values R_t(k, l) across all views are low. In such a case, no cluster in any single view corresponds reasonably well with F_k; for instance, cluster F_k may be identified due to interaction effects between views. It is thus questionable to compare the contribution of views based on stability or tightness measures, and we instead use a different measure, described below.
CPS analysis applied to the reference partition F and the partitions from the lth view, e.g., {Ã^(1), ..., Ã^(m)} from the first view, provides the cluster aligning matrices W_l^(p), p = 1, ..., m, l = 1, 2. W_l^(p) is a matrix of size κ_{p,l} × κ_1, where κ_{p,l} is the number of clusters in the pth partition from the lth view. We then calculate the cluster aligning vector V_l^(p) = 1_{κ_{p,l}}^T W_l^(p) / κ_{p,l} and its average over the m partitions, v_{k,l} = (1/m) Σ_{p=1}^m V_l^(p)(k), which is the average matching weight of cluster F_k under the lth view. Similar to z_{k,l}, we define the contribution of the lth view to F_k by

η_{k,l} = v_{k,l} / Σ_{l'} v_{k,l'}.

PLOS COMPUTATIONAL BIOLOGY
If Σ_{l'} v_{k,l'} = 0, let η_{k,l} = 0.5. We call η_{k,l} the matching-weight-based contribution. The rationale for this definition is that if F_k receives a larger weight in one view than in another, we assume that the former view is more important for the existence of F_k. In our experiments, we use η_{k,l} instead of z_{k,l} if more than 40% of the clusters in the reference partition have tightness 0.
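Under the definitions above, η_{k,l} can be sketched as follows. The aligning matrices W_l^(p) would come from CPS analysis; here they are simply passed in as arrays, and the sketch assumes two views so that the zero-total fallback of 0.5 applies.

```python
import numpy as np

def matching_weight_contribution(W_by_view):
    """eta[k, l]: matching-weight-based contribution of view l to F_k.

    W_by_view[l]: list of m aligning matrices W_l^(p), each of shape
    (kappa_{p,l} x kappa_1), from CPS analysis of view l against F.
    V_l^(p) = 1^T W_l^(p) / kappa_{p,l} is the column-average matching
    weight; v[k, l] averages it over the m partitions, and eta
    normalizes v across views (0.5 where the row total is zero)."""
    num_views = len(W_by_view)
    kappa1 = np.asarray(W_by_view[0][0]).shape[1]
    v = np.zeros((kappa1, num_views))
    for l, mats in enumerate(W_by_view):
        for W in mats:
            W = np.asarray(W, dtype=float)
            v[:, l] += W.sum(axis=0) / W.shape[0]   # V_l^(p)
        v[:, l] /= len(mats)                         # average over p
    totals = v.sum(axis=1, keepdims=True)
    eta = np.full_like(v, 0.5)                       # zero-total fallback
    np.divide(v, totals, out=eta, where=totals > 0)
    return eta
```

A cluster receiving average matching weight 0.4 in view 1 and 0.2 in view 2 would thus get contributions (2/3, 1/3).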

Results
In this section, we present the experimental results of CPS-merge analysis and its accelerated version (referred to as A-CPS-merge) on three multimodal scRNA-seq datasets. Table 1 summarizes basic information about the three datasets. Details on how the datasets are pre-processed are provided in their respective sub-sections. We also conducted extensive simulation studies with results provided in Table A and Table B in S1 Appendix. We compare results with the following popular methods for multi-view clustering.

Co-training clustering (Co-train):
The EM algorithm for mixture of categorical data is used (implemented in R package mvc [49]).

Multiple kernel clustering (MKC):
This method is based on localized multiple kernel k-means (implemented in R package klic [50]).

Ensemble clustering:
This method is based on a hybrid bipartite graph formulation [52] (implemented in GitHub repository ClusterEnsembles [53]). We input the entire collection of clustering results used in CPS-merge analysis to the ensemble algorithm so that the comparison with CPS-merge is fair. We also performed ensemble clustering using the default input of the software and obtained similar results.

Weighted-nearest neighbor (WNN):
WNN is developed for multimodal single-cell clustering [5] (implemented in R package Seurat [56]). Briefly, WNN generates weights for every modality based on within-modality and cross-modality predictions of each cell and uses them to create a k-nearest neighbor (KNN) graph, based on which clustering is performed.

Concatenation cluster analysis (CCA):
We concatenated features from all the views. The function FindClusters in the R package Seurat is then applied to cluster the concatenated feature vectors.

Clustering accuracy is measured by the adjusted Rand index (ARI), normalized mutual information (NMI), and F-measure. NMI quantifies the amount of information shared by two clustering results. F-measure is the harmonic mean of precision and recall, assuming the ground truth is provided. All three metrics lie in [0, 1], with 1 indicating identical clusterings. We use UMAP [60] to visualize the clustering result in each view. The hyperparameters in our method include α (the CPS-analysis coverage level), δ² (the variance of the Gaussian noise used to generate perturbed data), κ_1 (the number of final clusters), and m (the number of perturbed clustering results). The effects of α and δ² have been studied in [40]. We suggest setting α = 0.1 and δ² to 10% of the average within-cluster variance of the original data. We have also conducted a sensitivity analysis to study the performance under different values of κ_1 and m. Results show that the performance of CPS-merge is stable when κ_1 deviates from the true value, and there is a general trend of better performance at larger m until m reaches a certain level. Detailed results with discussion are provided in Table C in S1 Appendix.
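For illustration, ARI and NMI are available in scikit-learn as `adjusted_rand_score` and `normalized_mutual_info_score`; the F-measure can be computed by pair counting as sketched below. Note this pair-counting form is one common variant, and the paper may use a different (e.g., cluster-matching) definition.

```python
import numpy as np

def pairwise_f_measure(truth, pred):
    """F-measure via pair counting: precision and recall over unordered
    pairs of points that are placed in the same cluster."""
    truth, pred = np.asarray(truth), np.asarray(pred)
    iu = np.triu_indices(len(truth), k=1)            # all unordered pairs
    same_t = (truth[:, None] == truth[None, :])[iu]  # together in truth
    same_p = (pred[:, None] == pred[None, :])[iu]    # together in pred
    tp = np.sum(same_t & same_p)
    fp = np.sum(~same_t & same_p)
    fn = np.sum(same_t & ~same_p)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

Like ARI and NMI, this score is invariant to permutations of cluster labels, since only co-membership of pairs matters.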

Multimodal single-cell data
We now analyze three gold-standard multimodal single-cell datasets to demonstrate the competitive performance of CPS-merge analysis and the quantification of the cluster-wise contribution of each view. In Table 2, the performance of CPS-merge on the three datasets is compared with six other algorithms listed previously. For the competing methods, default parameter settings in the algorithms are used. The algorithm MKC becomes computationally infeasible for HBMC and PBMC2 because of the quadratic (in sample size) complexity of computing the kernel matrix. We thus cannot report its performance. For single-view clustering (View 1, View 2 in Table 2), data are pre-processed to reduce dimension before being clustered. Details on the dimension reduction methods used in each view will be provided shortly when we discuss each dataset separately. The single-view data are then clustered using the FindClusters function in the R package Seurat.

CITE-seq Human Bone Marrow Cells (HBMC).
This dataset [61] is generated by the CITE-seq technology [62]. CITE-seq can simultaneously quantify RNA and surface-protein abundance at the single-cell level by sequencing antibody-derived tags (ADTs). Thus, each individual cell is measured in two views: RNA and protein (ADT). Moreover, each view individually is inadequate to identify all the cell types [61]. The data consist of 30,672 human bone marrow cells (HBMC) of 27 different cell types.
Table 2. Clustering results on three multi-view datasets (HBMC, PBMC1 and PBMC2) obtained by 8 methods (first 8 columns). Performance is measured by ARI, NMI and F-measure. Columns View 1 and View 2 are single-view clustering results on each dataset, where View 1 refers to RNA data and View 2 refers to ADT (protein) data in HBMC and PBMC2, and ATAC data in PBMC1. The highest ARI, NMI and F-measure achieved for each dataset are in bold.

For this example, we use Seurat, one of the most popular single-cell clustering R packages, to analyze both views. Specifically, in each view, we follow the default Seurat clustering procedure. We first log-normalize the RNA data and perform the centered log-ratio transformation on the ADT data. We then perform dimension reduction using PCA, keeping the first 50 components for RNA and the first 24 components for ADT; for both, the number of components follows the default setting in Seurat. Lastly, we perform cluster analysis using the Seurat default functions FindNeighbors and FindClusters. The argument resolution in FindClusters controls the number of clusters obtained. In multimodal single-cell data analysis, we either use the default resolution or adjust it slightly so that the number of clusters in a single view is similar to the ground-truth number of clusters. The single-view clustering results are visualized in Fig 3b and 3e. Comparing with the true cell-type labels, shown in Fig 3a and 3d, we see that neither view can precisely identify all the clusters. Results of CPS-merge analysis are shown in Fig 3c and 3f. ARI is 0.823 for CPS-merge and 0.819 for A-CPS-merge. Compared with the two single-view results, the most obvious improvement is the identification of CD14 Mono cells. As mentioned above, MKC failed to run due to the large sample size. MSC, Ensemble clustering, and DeepCC yield poor accuracy.
Co-train and WNN achieve relatively high ARI, but lower than that of CPS-merge analysis. CCA performs slightly better than single-view clustering, but not as accurately as CPS-merge analysis in terms of ARI.

To evaluate the cluster-wise contribution of each view, we find that more than 40% of the clusters in the final result have tightness 0, suggesting that the matching-weight-based contribution is more appropriate here. As studied in [5,6,63], RNA is more informative for recognizing the progenitor populations (GMP, HSC, LMPP, Prog_B1, Prog_B2, Prog_DC,
Prog_MK, Prog_RBC), while protein is more informative for distinguishing T cells (CD4 Memory, CD4 Naive, CD8 Effector_1, CD8 Effector_2, CD8 Memory_1, CD8 Memory_2, CD8 Naive, gdT, MAIT, Treg). By our analysis, the contribution of RNA to clusters corresponding to progenitor populations is on average 0.653, and the contribution of protein to clusters corresponding to T cells is on average 0.688. Therefore, our measures of the cluster-wise contribution of each view indicate that the RNA view plays a dominant role in separating progenitor populations while the protein view is more important for separating T cells. These findings are consistent with existing domain insights.
10x Multiome Human Peripheral Blood Mononuclear Cells (PBMC1). This dataset is generated by the 10x Genomics Multiome ATAC + RNA kit [39] (https://support.10xgenomics.com/single-cell-multiome-atac-gex/datasets), and contains 10,032 peripheral blood mononuclear cells (PBMC) of 14 different cell types (Fig 4). Each cell has measurements in two views: RNA and ATAC (Assay for Transposase-Accessible Chromatin). As described in [64], ATAC-seq has much lower coverage and a worse signal-to-noise ratio than RNA-seq. Therefore, RNA provides most of the information for the clusters to be revealed, while ATAC can be used as an ancillary view. Motivated by this prior information, we use RNA as the dominant view and do not perturb the RNA data.
In each view, we follow the standard clustering procedure, which is slightly different between RNA and ATAC. For RNA data, we perform clustering as we have done with the HBMC data. For ATAC, we pre-process the data using the R package Signac [65], which runs term frequency-inverse document frequency (TF-IDF) normalization on the data and carries out dimension reduction by singular value decomposition (SVD); following Seurat, we keep the 2nd to 50th components, as the first component is typically correlated with sequencing depth. Next, we use the default FindNeighbors and FindClusters functions in Seurat to cluster the data.
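A simplified numpy sketch of this TF-IDF + SVD (latent semantic indexing) preprocessing is shown below. It is a stand-in for illustration only: Signac's exact normalization and scaling differ in details.

```python
import numpy as np

def lsi_embed(counts, n_components=49):
    """TF-IDF normalization followed by SVD on a cells x peaks matrix.

    The first singular component is dropped, since it typically tracks
    sequencing depth; components 2..(n_components+1) are kept."""
    counts = np.asarray(counts, dtype=float)
    # term frequency: peak counts normalized by per-cell totals
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)
    # inverse document frequency: rarer peaks get larger weights
    idf = counts.shape[0] / np.maximum((counts > 0).sum(axis=0), 1)
    X = np.log1p(tf * idf * 1e4)          # log-scaled TF-IDF
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    emb = U * S                           # cell embeddings
    return emb[:, 1:n_components + 1]     # drop the depth component
```

The returned embedding would then be fed to a graph-based clustering step, analogous to FindNeighbors/FindClusters in Seurat.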
The single-view clustering results are visualized in Fig 4b and 4e. The ARI of the single-view clustering result is 0.829 for the RNA view and 0.668 for ATAC. CPS-merge and A-CPS-merge yield the same ARI of 0.829. This result suggests that ATAC does not contribute extra information about the clusters in this dataset. As shown in Table 2, Co-train, MKC, MSC, Ensemble clustering, and DeepCC all perform worse than clustering in any single view. Moreover, the ARIs obtained by WNN and CCA are 0.764 and 0.744 respectively, worse than the result obtained in the RNA view. This comparison indicates that ATAC is not very useful for clarifying the true clusters.
CITE-seq Human Peripheral Blood Mononuclear Cells (PBMC2). The last multimodal single-cell dataset is also from peripheral blood mononuclear cells (PBMC). It is generated by CITE-seq and provided in [5]. It consists of 161,754 cells of 31 different cell types. As in the HBMC dataset, we have two views: RNA and protein (ADT), but here both views contribute substantially to the identification of clusters. This dataset has already been pre-processed, so we simply apply Seurat to perform clustering in each view. The single-view clustering results are shown in Fig 5b and 5e. CPS-merge analysis yields an ARI of 0.823 (A-CPS-merge achieves ARI 0.764). Again, MKC failed to run because of the large sample size, and the other methods do not yield substantially better results than those obtained in any single view. For evaluating the cluster-wise contribution of each view, we again find that more than 40% of the clusters in the final result have tightness 0, so the matching-weight-based contribution is more suitable. The contribution of the protein view to clusters corresponding to CD8+ and CD4+ T cells is on average 0.622, consistent with the fact that these two cell types are usually mixed in transcriptome data but separated clearly in protein data.

Discussion
In this paper, we have introduced CPS-merge analysis, a new method for multi-view data clustering that is guided by both the consensus and complementary principles. As a late integration method, CPS-merge only requires cluster labels obtained from single views, making it compatible with advanced clustering algorithms designed for single-view data. We have also proposed novel measures to quantify the contribution of each view to the formation of any cluster. These measures have been validated using real datasets and domain knowledge.
However, CPS-merge has limitations in two scenarios where additional information is required for accurate results. The first scenario is when the final partition is highly unstable (e.g., the average cluster tightness falls below 0.65). While such cases can be easily identified, caution is necessary when interpreting the results. The second scenario is when stable but incorrect clustering results are generated in certain single views. As our algorithm uses cluster stability to perform merging, it cannot address this issue. One potential remedy is to identify which views are ancillary a priori, allowing the algorithm to adjust accordingly.
As suggested by one reviewer, exploring online learning for multi-view clustering is a promising direction for future research. Since CPS-merge analysis only uses cluster memberships but not the original data, it can be employed in an incremental learning mode as long as the clustering algorithms used in individual views allow online learning. Numerous clustering algorithms can be easily adapted to online learning, for instance, by representing previous data using per-cluster statistics, e.g., mean vectors and covariance matrices. Based on these stored representations, new data batches can be clustered or assigned to new clusters without accessing past data. Neural networks can also assist with online clustering. For instance, deep autoencoders can encode the original data in lower dimensions, which are typically easier to cluster, particularly under an online learning paradigm. Additionally, neural networks are frequently trained in batch mode, making them naturally suited for online learning. One challenge to consider for biomedical data, such as single-cell data, is that different data batches often contain batch effects that must be eliminated. Current methods for removing batch effects typically require processing all data in one view together, preventing effective online learning. Although interesting, how to overcome this issue in online learning is beyond the scope of our method here.
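The per-cluster statistics idea can be sketched with a Welford-style running mean and covariance, which supports point-by-point or batch-by-batch updates without storing past data. This is a generic illustration of the principle, not part of CPS-merge itself.

```python
import numpy as np

class OnlineCluster:
    """Running per-cluster statistics (count, mean, covariance) that
    allow new data to be incorporated without revisiting past points."""
    def __init__(self, dim):
        self.n = 0
        self.mean = np.zeros(dim)
        self.M2 = np.zeros((dim, dim))   # sum of outer-product deviations

    def update(self, x):
        # Welford-style update: numerically stable incremental mean
        # and scatter matrix with one new point
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.M2 += np.outer(delta, x - self.mean)

    def covariance(self):
        # unbiased sample covariance from the accumulated scatter
        return self.M2 / (self.n - 1) if self.n > 1 else np.zeros_like(self.M2)
```

With such summaries per cluster, a new batch can be assigned (e.g., by Mahalanobis distance to each stored mean/covariance) and the statistics updated in place, which is what makes the incremental mode feasible.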
Supporting information

S1 Appendix. This file contains the description of the simulation study and sensitivity analysis. (PDF)