Abstract
Incomplete multi-view clustering (IMVC) is an unsupervised technique for clustering multi-view data when some view information is absent. However, most existing IMVC methods usually suffer from several significant challenges: (1) Inaccurate imputation or padding of missing data degrades clustering performance; (2) The ability to extract view features may decrease due to low-quality views, especially those that are inaccurately imputed. To overcome these challenges, in this paper, we introduce a novel IMVC framework, called soft label collaborative view consistency enhancement (SLC_CE). Firstly, we leverage the encoders of Transformers to construct a soft-label view information interaction module, which fully utilizes soft-labels to enhance view feature embeddings. Secondly, we employ soft labels to collaboratively impute missing features, addressing the incomplete multi-view data problem. Finally, we implement a consistency enhancement strategy across multi-level view features and soft labels to ensure high-quality feature extraction and imputation. Extensive experiments on several benchmark datasets demonstrate that the proposed SLC_CE method outperforms other state-of-the-art methods in real IMVC tasks.
Citation: Zhang J, Tang J (2025) Soft label collaborative view consistency enhancement with application to incomplete multi-view clustering. PLoS One 20(7): e0326852. https://doi.org/10.1371/journal.pone.0326852
Editor: Zhe Liu, Xinyu University, CHINA
Received: November 5, 2024; Accepted: June 5, 2025; Published: July 1, 2025
Copyright: © 2025 Zhang, Tang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: (1) The Aloi-100 dataset underlying the results presented in the study is available from Geusebroek at https://github.com/youweiliang/Multi-view_Graph_Learning/blob/master/data/ALOI_100.mat. (2) The Scene15 dataset underlying the results presented in the study is available from Oliva, Torralba, Fei-Fei Li, Perona, and Lazebnik at https://github.com/QinghaiZheng1992/Code-for-UGLTL/blob/master/dataset/scene15.mat. (3) The MNISTUSPS dataset underlying the results presented in the study is available from U.S. Postal Service at https://github.com/YangSkywalker/L1-MvDA-VC/blob/main/Data/MNIST-USPS.mat. (4) The NoisyMNIST dataset underlying the results presented in the study is available from Louisiana State University at https://github.com/fariba87/noisyMNIST/tree/main/noisyMNIST/noisyMNIST.
Funding: This work was supported by the Changzhou Science and Technology Program (CE20215029). The funders provided support for the decision to publish and the preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
1 Introduction
Multi-view clustering (MVC) is a well-known unsupervised learning technology that divides instances into clusters by utilizing their feature representations. These views can be derived from different sensors, domains, or feature extractors, providing a more comprehensive perspective of each instance [1–4]. The MVC technology [5–9] is fundamentally based on the assumption that all view data are fully available. However, in many real-world situations, multi-view data is frequently incomplete due to sensor malfunctions or missing information during collection. This poses significant challenges for directly applying MVC techniques to incomplete multi-view data.
To address this challenge, many incomplete multi-view clustering (IMVC) methods have been developed in recent years. Existing IMVC techniques [10–13] can be grouped into three main categories: matrix factorization-based IMVC, kernel learning-based IMVC, and graph learning-based IMVC. IMVC approaches based on matrix factorization [10,13–15] focus on decomposing the multi-view data matrix to recover missing views and uncover shared representations. Wang et al. [16] fully explored spectral perturbation theory and then applied a tailored matrix completion approach to handle the similarity matrices of incomplete multi-view data. Rai et al. [15] adopted the non-negative matrix factorization (NMF) method to exploit the intrinsic geometric structure of the data distribution in each view. Kernel learning-based IMVC methods [11,17] cope with missing data by constructing a kernel matrix and then applying imputation techniques to estimate the missing values. For example, Liu et al. [17] integrated the imputation of incomplete kernel matrices with multiple kernel alignment for clustering in a unified framework. Graph-based methods [11,12,18] construct similarity graphs to represent relationships between data instances, leveraging the geometric structure of the graph to propagate information and handle missing data. Zhao et al. [12] employed unrestricted anchors to reconstruct relationships in high missing-rate data and integrated graph convolutional networks (GCNs) to obtain graph embeddings for clustering incomplete multi-view data. However, these aforementioned methods rely heavily on the quality of the initial multi-view data and thus cannot fully capture the complex relationships between views.
Benefiting from the powerful feature representation capabilities of deep neural networks (DNNs), several deep IMVC methods [19–23] have been developed to deal with incomplete multi-view data. Autoencoder-based methods [24,25] use DNNs to learn feature representations and reconstruct missing views. Choudhury et al. [24] first imputed missing inputs using the k-nearest neighbor rule, and then preserved the structure of the input data in the latent space by incorporating Sammon’s stress as a regularizer in the objective function of the autoencoder. GAN-based deep IMVC methods [26–28] generate missing data through adversarial learning. Zhou et al. [26] employed adversarial learning and attention mechanisms to align latent feature distributions and quantify the importance of the modalities, respectively. With the development of contrastive learning, it has been integrated into deep IMVC frameworks to learn consistent representations across views [29,30]. In [29], consistency learning is performed by maximizing mutual information between different views through contrastive learning, while missing views are recovered by minimizing conditional entropy through dual prediction. Despite the impressive progress of these methods, they still face issues with inaccurate imputation and low-quality feature extraction.
To mitigate these limitations, we introduce a novel IMVC framework, called soft label collaborative view consistency enhancement (SLC_CE). As illustrated in Fig 1, the proposed SLC_CE method is designed to leverage the synergy between multiple views and soft labels, enabling accurate recovery of missing views. The proposed method designs an information interaction module by using soft-label information to enhance view feature embedding. In addition, to address incomplete multi-view data, we employ generated soft labels to recover missing view features using the k-nearest neighbor approach. Finally, to ensure the quality of view feature extraction and missing data recovery, we adopt a consistency enhancement strategy to constrain soft labels and multi-level view features. Extensive experimental results show the effectiveness of the proposed method in IMVC tasks.
The contributions of this work can be summarized as follows:
- We propose an information interaction module, which enriches view feature embeddings by utilizing soft labels. This effectively promotes interaction between views, thereby learning more robust feature representations. Meanwhile, our method uses soft-label information to collaboratively impute missing features across views, ensuring that the imputation process is guided by learned feature complementarity and consistency.
- We adopt a consistency enhancement strategy to constrain soft labels and multi-level view features. This helps maintain the quality of feature extraction and imputation and thus reduces the negative impact of low-confidence soft labels.
- Extensive experimental results on four incomplete multi-view datasets demonstrate the effectiveness and robustness of our proposed SLC_CE method compared to other state-of-the-art methods in complex IMVC tasks.
2 Related work
In this section, we briefly review related work on contrastive learning-based MVC, Transformer-based MVC, and IMVC methods.
2.1 Contrastive learning-based MVC
Contrastive learning is a well-established and effective unsupervised representation learning method, known for its ability to generalize effectively across different types of data representations [31–33]. Inspired by contrastive learning, contrastive multi-view learning has been proposed in the past few years [23,29,34]. For example, Tian et al. [35] applied contrastive learning to maximize mutual information between representations of different views, facilitating the learning of shared information across these views. Contrastive learning aims to increase the similarity between positive pairs of representations while minimizing the similarity between negative pairs, which closely aligns with clustering objectives. The method in [36] used contrastive learning to align multi-view representations obtained from view-specific encoders, and then fused these aligned representations for single-view clustering. Moreover, Xu et al. [37] introduced an approach where multi-view representations are initially aligned using a parameter-shared network, and then contrastive learning is applied to ensure consistency between multi-view features and semantic labels. These contrastive multi-view learning methods highlight the flexibility of contrastive learning techniques in multi-view clustering models, providing a promising approach to improving both representation learning and clustering outcomes in multi-view scenarios.
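The cross-view contrastive objective described above can be made concrete with a short sketch. The following NumPy implementation of a standard InfoNCE-style loss between two views is illustrative only (the function name, shapes, and temperature value are our assumptions, not taken from any cited method):

```python
import numpy as np

def info_nce_loss(z1, z2, tau=0.5):
    """InfoNCE loss between two views' embeddings of shape (N, d).

    Each sample's embedding in view 1 is pulled toward the same sample's
    embedding in view 2 (positive pair on the diagonal) and pushed away
    from all other samples in view 2 (negatives)."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                                  # (N, N) cosine similarities
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))                     # -log softmax of positives

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 16))
# nearly identical views -> positives dominate -> low loss
low = info_nce_loss(z, z + 0.01 * rng.standard_normal((8, 16)))
# unrelated views -> loss close to log(N)
high = info_nce_loss(z, rng.standard_normal((8, 16)))
assert low < high
```

Minimizing this quantity drives the two views of the same sample together, which is exactly the "positive pair" alignment that the multi-view methods above exploit.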
Although contrastive learning has achieved notable progress in IMVC tasks, it still encounters several challenges, particularly those arising from feature distribution discrepancies and view misalignment. Due to the differences in the distribution of multi-view data, existing contrastive learning methods cannot effectively capture and align the shared information between different views. Additionally, these methods often emphasize learning single-view features while neglecting global consistency and precise alignment between views. This oversight may result in suboptimal performance when handling complex multi-view data.
2.2 Transformer-based MVC
Attention was first introduced in sequence-to-sequence tasks to help models focus on the most informative parts of the input representations. The Transformer architecture [38] relies entirely on attention mechanisms, capturing global dependencies between input and output sequences. The Vision Transformer [39] extends the Transformer architecture to image classification by treating non-overlapping image patches of moderate size as input sequences, similar to the use of word tokens in translation tasks. Hierarchical Transformers [40,41] then introduced a novel technique using shifted image patch windows and variational patch segmentation strategies. They shift windows over non-overlapping patches to capture information from each patch combination, while variational patch segmentation (also known as patch merging) ensures that the learning model incorporates local regions into the broader image context.
Recently, the Transformer has been applied to real IMVC tasks [22,42,43]. Its attention mechanisms establish associations across positions to capture global contextual features. Transformer-based IMVC methods can learn relationships between different views through attention mechanisms, thereby enhancing clustering performance. The attention mechanisms dynamically learn key features and interactions within each view, and multi-head attention further strengthens the modeling of relationships between different views, leading to more accurate clustering results. Therefore, we introduce the Transformer to enhance feature representation capabilities in this work.
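The building block behind all of these models is the scaled dot-product attention of Vaswani et al. [38]. A minimal NumPy sketch (the toy shapes and the single-head, self-attention setup are our simplifications):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Returns the attended output and the attention weight matrix."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise compatibility
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(1)
x = rng.standard_normal((5, 8))                      # a toy "view token" sequence
out, w = scaled_dot_product_attention(x, x, x)       # self-attention
assert out.shape == (5, 8)
assert np.allclose(w.sum(axis=1), 1.0)               # each row is a distribution
```

Because every position attends to every other position, the output at each token mixes information from the whole sequence, which is what lets Transformer-based IMVC methods relate features across views.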
2.3 Incomplete multi-view clustering (IMVC)
Incomplete multi-view clustering (IMVC) focuses on improving clustering performance in scenarios where multi-view data are incomplete. One widely used approach is to extract a shared subspace from incomplete data using matrix factorization. A seminal method, called partial multi-view clustering (PVC) [44], directly computes a common latent representation for complete instances while deriving view-specific latent representations for incomplete samples through matrix decomposition. Following this line, several matrix decomposition-based IMVC methods have been developed in recent years. For example, Rai et al. [15] proposed a graph-regularized non-negative matrix factorization method based on PVC. Hu et al. [45] proposed a doubly aligned incomplete multi-view clustering (DAIMC) method, which employs weighted semi-non-negative matrix factorization with l2,1-regularized regression to extract a shared representation. An alternative strategy in IMVC involves inferring missing samples. Wen et al. [46] developed a unified embedding alignment framework (UEAF) that addresses missing data by using an error matrix and reverse graph regularization to both complete the data and identify common structures. Later, Wen et al. [47] explored high-order correlations across multiple views using tensor constraints, thereby learning similarity across multi-view graphs while recovering missing instances. A subspace clustering method has also been proposed to jointly perform data imputation and self-representation learning [48]. Inspired by generative adversarial networks (GANs) [49], Wang et al. [20] introduced a generative partial multi-view clustering approach that leverages GAN models to fill in missing data. More recently, [23] proposed an IMVC framework combining consistency learning with data recovery, and Lin et al. [29] presented a more generalized approach to learning representations from incomplete multi-view data.
Although these IMVC methods have demonstrated impressive performance, they often entail high computational costs and risk compromising data fidelity. The inherent complexity of feature extraction, alignment, and missing data inference across multiple views further hinders their scalability to large-scale datasets. Additionally, handling incomplete data can introduce noise or lead to the loss of important information, reducing data fidelity and impacting clustering performance. Therefore, preserving data integrity while improving computational efficiency poses a substantial challenge in IMVC applications.
3 Method
In this section, we introduce the proposed SLC_CE method for implementing IMVC tasks in detail.
3.1 Notations
Formally, let $\{X^v \in \mathbb{R}^{N \times d_v}\}_{v=1}^{V}$ represent the multi-view data, where $N$ is the number of samples and $d_v$ is the feature dimensionality of the $v$-th view. Here, $X^v$ denotes the $v$-th view, and 'NaN' represents missing instances. The parameter $K$ is the cluster number.
3.2 Overall framework
Fig 1 illustrates the overall framework of the proposed SLC_CE method. First, the proposed model employs an information interaction Transformer to enable interactive learning between soft labels and view information, aiming to fully utilize soft-label information when extracting the features of multi-view data. To cope with incomplete data, we adopt soft-label information in collaboration with the multi-view data, using the k-nearest neighbor algorithm to generate the missing view features. Finally, to ensure the quality of view feature extraction and missing data recovery, we employ a consistency enhancement strategy to ensure the accuracy of the generated soft labels and multi-level view features.
3.3 Information interaction transformer
As shown in Fig 1, we first learn the embedding of the multi-view data $\{X^v\}_{v=1}^{V}$. The features of different views are embedded into a common feature space. For a given sample $x_i^v$ from $X^v$, the embedding vector $e_i^v \in \mathbb{R}^{d_e}$ can be expressed as $e_i^v = f_e^v(x_i^v)$, where $d_e$ represents the dimension of the embedding features. We then stack the embedding vectors to obtain the original multi-view embedding sequence $E = \{E^v\}_{v=1}^{V}$, which is further used as the input vector of the Transformer. Note that for incomplete multi-view data, we adopt the soft-label co-interpolation method (as detailed in Sect 3.4) to generate the embeddings of the missing views, ensuring that $E$ is complete in all views. At the same time, the extracted view feature embedding $E^v$ is fed into the Transformer to enhance the view feature embedding. Therefore, we have

$$Z^v = \mathrm{Transformer}_{fv}\big(f_e^v(X^v)\big),$$

where $f_e^v(\cdot)$ is a fully connected network, $\mathrm{Transformer}_{fv}$ is the first layer of the Transformer, $X^v$ is the incomplete multi-view data, and $Z^v$ is the view feature embedding after $\mathrm{Transformer}_{fv}$. Here, an adaptive fusion layer is introduced to fuse the information from multiple views into a shared view feature $Z^c$. The fusion process can be formulated as follows:

$$Z^c = \sum_{v=1}^{V} \frac{\exp(\gamma w^v)}{\sum_{u=1}^{V} \exp(\gamma w^u)}\, Z^v,$$

where $w^v$ represents the learnable weight and $\gamma$ is the adjustment factor. By interacting with the shared view feature $Z^c$ to explore the correlations between the soft labels and the view embeddings, $\mathrm{Transformer}_{fl}$ attempts to obtain complementary information from the soft labels. This process results in the enhanced soft label $Q$ and feature embedding $\hat{Z}^c$ as follows:

$$\big[Q, \hat{Z}^c\big] = \mathrm{Transformer}_{fl}\big([\,Q_0 \,\|\, Z^c\,]\big),$$

where $\|$ is the concatenation operation and $Q$ denotes the enhanced cluster soft labels. Subsequently, the output features of $\mathrm{Transformer}_{fl}$ are propagated into the second layer.

The second layer is designed to extract high-level shared features, which is achieved by promoting the interaction and fusion between the soft-label information and the view features extracted from the first layer. Therefore, it obtains a more discriminative representation of the multi-view data. This layer incorporates two Transformer blocks, denoted as $\mathrm{Transformer}_{sv}$ and $\mathrm{Transformer}_{sl}$. $\mathrm{Transformer}_{sv}$ is used to enhance information across views and extract a high-level multi-view embedding $H^v$, the enhanced representation of the views obtained by interacting with the shared soft-label feature $S^c$ and analyzing view correlations. Thus, we have

$$H^v = \mathrm{Transformer}_{sv}\big([\,Z^v \,\|\, g_l(S^c)\,]\big),$$

where $g_l(\cdot)$ is a linear layer serving as a projection function designed to map vectors from the soft-label feature space to the view feature space. Correspondingly, $\mathrm{Transformer}_{sl}$ is employed to complement information across soft labels and extract high-level soft-label vectors $S$ by leveraging the shared feature $\hat{Z}^c$ and discerning soft-label correlations, as follows:

$$S = \mathrm{Transformer}_{sl}\big([\,S^c \,\|\, g_v(\hat{Z}^c)\,]\big),$$

where $g_v(\cdot)$ is a linear layer serving as a projection function to map vectors from the view feature space to the soft-label feature space. Through the propagation of the vectors $Z^v$, $S^c$, and $\hat{Z}^c$ among the Transformer blocks, we facilitate the sharing of information between the view and soft-label feature spaces, thereby extracting more refined and effective features of views and soft labels.
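The adaptive fusion layer described above, which combines per-view features using learnable weights and an adjustment factor, can be sketched in a few lines. This NumPy illustration assumes a softmax-normalized weighting; the parameterization and shapes are our assumptions, not the paper's exact formulation:

```python
import numpy as np

def adaptive_fusion(view_feats, w, gamma=1.0):
    """Fuse per-view features (V, N, d) into one shared feature (N, d).

    w:     learnable per-view weights of shape (V,)
    gamma: adjustment factor scaling the weight logits before softmax."""
    alpha = np.exp(gamma * w)
    alpha /= alpha.sum()                          # normalized view weights
    return np.tensordot(alpha, view_feats, axes=1)

rng = np.random.default_rng(2)
feats = rng.standard_normal((3, 10, 16))          # V=3 views, N=10 samples, d=16
w = np.zeros(3)                                   # equal weights at initialization
fused = adaptive_fusion(feats, w)
assert fused.shape == (10, 16)
assert np.allclose(fused, feats.mean(axis=0))     # equal weights = plain average
```

With equal weights the fusion reduces to averaging the views; during training the weights would be learned so that more informative views contribute more to the shared feature.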
3.4 Soft-label collaborative imputation
It is well known that when some samples of multi-view data are missing, we cannot effectively learn the embedded features. Most existing methods try to use the available views to complete the missing views and thus improve feature extraction performance when samples are missing. However, most of these methods rely solely on the k-nearest neighbor algorithm for completion. Therefore, in this work, we make full use of the soft-label information in cooperation with the k-nearest neighbor method, using the clustered soft-label vector $Q$ to help generate the missing views. Specifically, for a sample $i$, let $o_i$ represent the index set of its existing views and $u_i$ represent the index set of its missing views. To use the original multi-view embedding $E$ to supplement the missing features of sample $i$, we first find the k-nearest neighbors in the projected soft-label feature space. The neighbor set $D_i$ can be constructed as follows:

$$D_i = \mathrm{TopK}_{j \neq i}\big(-\,d(q_i, q_j)\big),$$

where $\mathrm{TopK}(\cdot)$ is a function designed to identify the indices of the top $K$ soft labels based on the smallest distance between embedding vectors and soft-label vectors. Then, we employ a statistical method to describe the distribution of the missing views. We assume that the missing views $\{e_j^{u} \mid j \in D_i\}$ satisfy a multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$, whose mean vector and covariance matrix are estimated as follows:

$$\mu = \frac{1}{|D_i|}\sum_{j \in D_i} e_j^{u}, \qquad \Sigma = \frac{1}{|D_i|}\sum_{j \in D_i}\big(e_j^{u} - \mu\big)\big(e_j^{u} - \mu\big)^{\top}.$$

For the missing views, we sample from this distribution several times and substitute the missing views with the sampled results. Consequently, we can obtain the complete embeddings for the incomplete multi-view data. By reconstructing the missing multi-view data, our proposed method further enhances its performance in incomplete-information clustering.
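The imputation scheme above can be sketched as follows: neighbors are selected by distance in the soft-label space, a Gaussian is fitted to their embeddings of the missing view, and the missing embedding is sampled from it. This is a simplified single-sample NumPy illustration that uses a diagonal Gaussian instead of a full covariance matrix; all names and shapes are our assumptions:

```python
import numpy as np

def impute_missing_view(soft_labels, view_embed, observed_idx, target, k=3, seed=0):
    """Impute one sample's missing view embedding.

    soft_labels:  (N, K) soft cluster assignments of all samples
    view_embed:   (N, d) embeddings of the view that `target` is missing
    observed_idx: indices of samples for which this view is observed
    target:       index of the sample whose view is missing"""
    # 1) k-nearest neighbors of `target` in the soft-label space
    dists = np.linalg.norm(soft_labels[observed_idx] - soft_labels[target], axis=1)
    nbrs = np.asarray(observed_idx)[np.argsort(dists)[:k]]
    # 2) fit a (diagonal) Gaussian to the neighbors' embeddings
    mu = view_embed[nbrs].mean(axis=0)
    sigma = view_embed[nbrs].std(axis=0)
    # 3) draw the imputed embedding from that distribution
    rng = np.random.default_rng(seed)
    return rng.normal(mu, sigma)

rng = np.random.default_rng(3)
labels = rng.random((6, 4))                        # toy soft labels, N=6, K=4
embed = rng.standard_normal((6, 8))                # toy view embeddings, d=8
filled = impute_missing_view(labels, embed, observed_idx=[0, 1, 2, 3, 4], target=5)
assert filled.shape == (8,)
```

The point of the soft-label guidance is step 1: neighbors are chosen by cluster similarity rather than raw feature distance, so the fitted distribution reflects samples that likely belong to the same cluster.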
3.5 Soft-label and view consistency enhancement
Using the aforementioned soft-label view information interaction Transformer, we extract two multi-view embeddings $Z^v$ and $H^v$ from different layers, respectively. To enable our encoder to effectively extract the features, it is crucial to enhance the discriminative ability of these embeddings. Specifically, according to the consistency between multiple views, the embedded features of the same sample from different views should be aligned. In addition, we can fully utilize the consistent features of multi-view data to improve the discriminative ability of $Z^v$ and $H^v$. Taking these factors into consideration, we introduce the embedding enhancement of multi-level view features. To learn more effective embeddings $Z^v$ and $H^v$, we use contrastive learning to align the embeddings of the same sample from different views. Therefore, we employ the following loss function in the proposed model:

$$\mathcal{L}_{v} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{m=1}^{V}\sum_{n \neq m}\log\frac{\exp\big(\mathrm{sim}(h_i^m, h_i^n)/\tau_v\big)}{\sum_{j=1}^{N}\sum_{v=1}^{V}\mathbb{1}_{[(j,v)\neq(i,m)]}\exp\big(\mathrm{sim}(h_i^m, h_j^v)/\tau_v\big)},$$

where $m$ and $n$ refer to the indices of the $m$-th and $n$-th views, respectively, $\mathrm{sim}(\cdot,\cdot)$ represents the cosine similarity, and $\tau_v$ is the temperature parameter.

As previously mentioned, we utilize clustering soft labels to assist in completing the missing data. This means that the quality of the recovered data depends largely on the accuracy of the soft labels. Here, we adopt contrastive learning to optimize the soft clustering process. For the $m$-th view, $Q^m(:,j)$ has $(Vk-1)$ pairs, of which $(V-1)$ pairs are positive and the remaining $V(k-1)$ pairs are negative. Thereby, the contrastive loss can be defined as follows:

$$\ell(m,n) = -\frac{1}{k}\sum_{j=1}^{k}\log\frac{\exp\big(d(Q^m(:,j), Q^n(:,j))/\tau_l\big)}{\sum_{r=1}^{k}\sum_{u \in \{m,n\}}\exp\big(d(Q^m(:,j), Q^u(:,r))/\tau_l\big)-\exp(1/\tau_l)}.$$

Similarly, our refined soft-label feature consistency enhancement is optimized as follows:

$$\mathcal{L}_{l} = \frac{1}{2}\sum_{m=1}^{V}\sum_{n \neq m}\ell(m,n),$$

where $d(\cdot,\cdot)$ represents the cosine distance used to measure the similarity between two labels, and $\tau_l$ is the temperature parameter. Moreover, we use the cross entropy as a regularization term to avoid all samples being assigned to a single cluster. Thus, the label consistency learning is formulated as follows:

$$\mathcal{L}_{r} = \sum_{m=1}^{V}\sum_{j=1}^{k} p_j^m \log p_j^m, \qquad \text{where } p_j^m = \frac{1}{N}\sum_{i=1}^{N} Q^m(i,j).$$

After fine-tuning the labels through contrastive learning, the similarity between positive pairs is increased, resulting in latent features with a more distinct clustering structure.

Therefore, the full loss function of the proposed method is given as follows:

$$\mathcal{L} = \mathcal{L}_{v} + \alpha\,\mathcal{L}_{l} + \beta\,\mathcal{L}_{r}.$$
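The cross-entropy regularization term, which prevents all samples from collapsing into a single cluster, can be sketched directly. This NumPy illustration is ours (function name and shapes assumed): it computes the negative entropy of the marginal cluster distribution, which is high when one cluster absorbs everything and low when clusters are balanced:

```python
import numpy as np

def cluster_entropy_regularizer(Q, eps=1e-12):
    """Negative entropy of the average cluster assignment.

    Q: (N, K) soft labels. Minimizing sum_j p_j log p_j pushes the
    marginal cluster distribution p toward uniform, so a degenerate
    one-cluster solution is penalized."""
    p = Q.mean(axis=0)                         # marginal cluster frequencies
    return np.sum(p * np.log(p + eps))

balanced = np.full((100, 4), 0.25)             # perfectly uniform assignments
collapsed = np.zeros((100, 4))
collapsed[:, 0] = 1.0                          # every sample in cluster 0
# collapse yields a higher (worse) value than balance
assert cluster_entropy_regularizer(collapsed) > cluster_entropy_regularizer(balanced)
```

For the balanced case the value is $\log(1/K) \approx -1.386$ with $K=4$, while the collapsed case gives roughly 0, so the regularizer clearly favors spread-out assignments.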
In this paper, the optimization of the objective function shown in Eq 18 is an end-to-end learning process. The total training process of the proposed model is summarized in Algorithm 1.
Algorithm 1. The proposed SLC_CE algorithm.
4 Experimental results and analysis
4.1 Datasets and metrics
We conducted experiments on four benchmark multi-view datasets: Aloi-100, Scene15, MNISTUSPS, and NoisyMNIST, as summarized in Table 1. To evaluate the robustness of our proposed method, we assessed the clustering performance of the proposed method under different missing rates, specifically [0.1, 0.3, 0.5, 0.7], across all datasets. The clustering performance was measured using three widely used clustering metrics: accuracy (ACC), normalized mutual information (NMI), and adjusted Rand index (ARI). Generally speaking, higher values for these indicators correspond to better clustering performance.
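Of these metrics, clustering accuracy (ACC) requires first matching predicted cluster ids to ground-truth class ids, since cluster labels are arbitrary. The sketch below uses brute-force search over label permutations, which is only practical for a small number of clusters (in practice the Hungarian algorithm, e.g. SciPy's `linear_sum_assignment`, is used); names and the toy data are ours:

```python
import numpy as np
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """Best accuracy over all one-to-one mappings from cluster ids to classes.

    Assumes labels are 0..K-1; brute-force over K! permutations."""
    labels = np.unique(y_true)
    best = 0.0
    for perm in permutations(labels):
        mapped = np.array([perm[c] for c in y_pred])   # relabel clusters
        best = max(best, float(np.mean(mapped == y_true)))
    return best

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([1, 1, 2, 2, 0, 0])                  # same partition, relabeled
assert clustering_accuracy(y_true, y_pred) == 1.0
```

The same partition under a different labeling scores 1.0, which is exactly the invariance ACC is meant to have.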
4.2 Comparison methods
In this experiment, we evaluated the proposed SLC_CE method against the following state-of-the-art IMVC techniques: COMPLETER [23] addresses missing views by minimizing the conditional entropy between different views through dual prediction. DCP [29] develops a unified framework to learn consistent representations across views and recover missing views in incomplete multi-view representation learning. CBG [50] proposes a flexible and efficient incomplete large-scale multi-view clustering method based on a bipartite graph framework to address high complexity and expensive time consumption. CPSPAN [51] employs pair-observed data alignment to guide the construction of instance-to-instance correspondences across views. PIMVC [52] proposes a novel graph-regularized projective consensus representation learning model for IMVC. APADC [53] introduces an imputation-free deep IMVC method that incorporates distribution alignment in feature learning. DIVIDE [54] utilizes random walks to identify data pairs on a global scale, rather than locally, effectively reducing false negatives in contrastive learning. SCSL [55] proposes a sample-level cross-view similarity learning method for IMVC. DVIMC [56] introduces a variational autoencoder-based method to address the missing data problem in IMVC. VITAL [57] learns both common and specific information by modeling each sample as a Gaussian distribution, using variational inference for contrastive learning across views.
4.3 Implementation details
We employed a multi-layer perceptron (MLP) with fully connected (Fc) layers as the encoder to extract the features. For each view, the encoder structure was set as follows: Input–Fc500–Fc2000–Fc2000–Fc10. The temperature parameter was fixed at 1 for all experiments. We used the Adam optimizer with a learning rate of 1.0e-4. Due to differences in the distributions of the datasets, the hyperparameters were adjusted accordingly. For the Aloi-100 dataset, we used a batch size of 512, trained for 200 epochs, and set α to 0.1 and β to 1. For the Scene15 dataset, we used a batch size of 256, trained for 200 epochs, and set α to 0.01 and β to 1. For the MNIST-USPS dataset, we used a batch size of 512, trained for 200 epochs, and set α to 0.1 and β to 1. For the NoisyMNIST dataset, we used a batch size of 1024, trained for 200 epochs, and set α to 0.01 and β to 1. All experiments were carried out on an Ubuntu system with an NVIDIA GeForce RTX 3090 GPU (24.0 GB memory).
4.4 Experimental results
To evaluate the performance of our proposed SLC_CE method in IMVC tasks, we compared it with several state-of-the-art methods. Table 2 presents the clustering results of our SLC_CE method and the baseline models on four incomplete datasets. The best results are highlighted in bold, and the second-best results are underlined. From the experimental results, we can draw the following observations:
- 1) It can be observed that our method outperforms other competitors, such as CBG, PIMVC, and SCSL. Traditional IMVC methods often rely on shallow learning models to process multi-view data, which limits their ability to capture nonlinear relationships and higher-order features. Most existing methods attempt to fill in missing views by leveraging available views, primarily using the k-nearest neighbor (KNN) algorithm to complete the missing data and improve feature extraction. However, these methods struggle to fully capture the complex structural information inherent in multi-view data. In contrast, our method combines soft-label information with KNN for data completion and employs the clustered soft-label vector Q to recover the missing views, allowing our approach to handle complex real-world scenarios more effectively. The information interaction module leverages soft labels to enhance the feature embeddings across views, improving inter-view interactions and learning more robust feature representations. These designs ultimately lead to superior clustering performance, demonstrating the effectiveness of our soft-label imputation strategy.
- 2) Unlike other state-of-the-art deep IMVC approaches such as CPSPAN, DCP, and APADC, which predict missing views but do not fully leverage label information, our approach uses soft labels to fill in missing features across views more effectively, guided by the learned feature relationships and consistency. This strategy significantly boosts the model's performance and enhances its capability to handle missing data.
- 3) We can observe from the results that our approach surpasses IMVC methods such as DIVIDE and COMPLETER. While these methods also employ contrastive learning strategies to enhance view consistency, our approach leverages a multi-level contrastive learning strategy to enforce consistency between soft labels and multi-level view features. This strategy not only preserves the quality of feature extraction and imputation, but also mitigates the negative effects of low-confidence soft labels, resulting in more robust performance.
4.5 Ablation study
In this subsection, we evaluated the contribution of each component in our method under the same experimental setting. Specifically, we constructed three variants of the proposed method: (A) excluding the soft-label and view consistency enhancement part (w/o SV_CE); (B) removing the soft-label view interaction Transformer and replacing it with a multi-layer perceptron (MLP) (w/o SV_IT); (C) eliminating the soft-label collaborative part in the missing-value recovery process (w/o SLC). Table 3 shows the ablation results of our proposed method on four different datasets. It can be seen that removing any component from our method, or replacing a proposed module with an alternative one, significantly degrades the clustering performance. This shows that each component of our proposed method plays a vital role in IMVC tasks. Specifically, the SV_CE component performs consistency feature alignment at both the view feature and soft clustering levels through a contrastive learning strategy, learning feature consistency more effectively; this helps reduce the negative impact of low-confidence soft labels and maintain the quality of feature extraction and imputation. The SV_IT component plays a key role during feature extraction: we flexibly employ the attention mechanism to interactively learn view features and use soft clustering to maximize the utilization of soft labels, thereby enriching the view feature embeddings. This effectively promotes interaction between views and thus learns more powerful feature representations. The SLC component incorporates soft-label information to guide the recovery of missing values, ensuring that the model accurately restores missing samples.
4.6 Convergence analysis
In this subsection, we conducted a convergence analysis on four benchmark datasets. Fig 2 illustrates the convergence of the proposed SLC_CE method on different multi-view datasets, each with a missing rate of 0.7. It can be seen that the loss decreases quickly in the first 50 epochs, then continues to decline gradually with minor fluctuations before eventually stabilizing. These convergence results demonstrate the reliability and effectiveness of the proposed method in tackling the incomplete multi-view clustering (IMVC) problem, showing consistent performance even under challenging conditions.
4.7 Parameter analysis
In this subsection, we conducted experiments on four datasets to evaluate the parameter sensitivity of the proposed method, with the missing rate set to 0.7. The proposed model includes two trade-off coefficients, $\alpha$ and $\beta$, in Eq 18, with values ranging from $10^{-3}$ to 10. Fig 3 shows the experimental results of our proposed method on four incomplete multi-view datasets. The results indicate that our method maintains stable clustering performance across a wide range of parameter values, demonstrating its insensitivity to these parameters in different real applications.
4.8 Visualization
To intuitively assess the effectiveness of the proposed SLC_CE model, we employed the t-SNE algorithm to visualize the distribution of the latent features learned by the model with a missing rate of 0.7. As illustrated in Fig 4, the generated clusters are distinctly separated with clear boundaries, demonstrating that our method effectively captures meaningful features from the multi-view data. The clarity of these clustering results further confirms the robustness and effectiveness of the proposed method in handling complex clustering tasks.
4.9 Complexity analysis
In this subsection, we evaluate the computational efficiency of our method by measuring the number of parameters, running time, and floating-point operations (FLOPs), and compare it with several state-of-the-art deep incomplete multi-view clustering approaches. The results in Table 4 show that our method outperforms the other IMVC methods on all three measures. Combined with its clustering accuracy, this indicates that the proposed model maintains competitive computational efficiency, improving its overall effectiveness and scalability.
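As a rough illustration of how quantities like those in Table 4 can be obtained, the sketch below counts parameters and multiply-add FLOPs for a plain fully connected encoder and times one forward pass. The layer sizes are illustrative assumptions; actual measurements would use the full model and a profiler.

```python
import time
import numpy as np

def mlp_cost(layer_sizes):
    """Parameter and per-sample FLOP count for a fully connected encoder.
    A d_in x d_out layer has d_in*d_out weights plus d_out biases, and
    costs roughly 2*d_in*d_out FLOPs per sample (one multiply + one add)."""
    pairs = list(zip(layer_sizes, layer_sizes[1:]))
    params = sum(i * o + o for i, o in pairs)
    flops = sum(2 * i * o for i, o in pairs)
    return params, flops

def time_forward(layer_sizes, batch=256, seed=0):
    """Wall-clock time of one batched forward pass with random weights."""
    rng = np.random.default_rng(seed)
    Ws = [rng.normal(size=(i, o)) for i, o in zip(layer_sizes, layer_sizes[1:])]
    x = rng.normal(size=(batch, layer_sizes[0]))
    start = time.perf_counter()
    for W in Ws:
        x = np.maximum(x @ W, 0.0)     # linear layer followed by ReLU
    return time.perf_counter() - start
```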
5 Conclusion
In this paper, we introduce a soft label collaborative view consistency enhancement (SLC_CE) method for IMVC. Our approach leverages a soft-label view information interaction Transformer to fully exploit soft-label information for enhancing view feature embeddings. To handle the challenge of incomplete multi-view data, we employ the k-nearest neighbor method, guided by soft-label information, to recover missing view features across views. Additionally, we incorporate a consistency enhancement strategy to ensure accurate view feature extraction and missing data recovery by constraining soft labels and multi-level view features. Extensive experimental results have demonstrated that our SLC_CE method outperforms other state-of-the-art methods in clustering tasks involving incomplete multi-view data.
Although the proposed method achieves satisfactory clustering performance, it has several limitations. Specifically, it employs traditional autoencoders as the backbone network, which limits its feature extraction capability. In future work, we will therefore incorporate more powerful feature extraction models, such as multimodal vision-language models, to enhance multi-view feature representations. In addition, the semi-paired problem in multi-view data is common in many applications, and adapting the proposed method to handle it remains a significant challenge.