
SMILE: Semi-supervised multi-view classification based on dynamical fusion

  • Hui Yang,

    Roles Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing

    Affiliation School of Cyberspace Security, Hunan College of Information, Changsha, Hunan, China

  • Linyan Kang,

    Roles Data curation, Formal analysis, Validation, Visualization, Writing – review & editing

    Affiliation School of Science, Hangzhou Dianzi University, Hangzhou, Zhejiang, China

  • Xun Che

    Roles Conceptualization, Funding acquisition, Supervision, Writing – review & editing

    * chexun@njust.edu.cn

    Affiliation School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, China

Abstract

Semi-supervised multi-view classification plays a crucial role in understanding and utilizing existing multi-view data, especially in domains like medical diagnosis and autonomous driving. However, conventional semi-supervised multi-view classification methods often merely fuse features from multiple views without significantly improving classification performance. To address this issue, we propose a dynamic fusion approach for Semi-supervised MultI-view cLassification (SMILE). This approach leverages a high-level semantic mapping module to extract discriminative features from each view, reducing redundant features. Furthermore, it introduces a dynamic fusion module that dynamically assesses the quality of different views of different samples, diminishing the negative impact of low-quality views. We compare our method with six competitive methods on four datasets; our method exhibits distinct advantages on the classification task, with significant improvements across various evaluation metrics. Visualization experiments demonstrate that our approach learns classification-friendly representations.

1 Introduction

With the development of multimedia technology, most real-life data exists in the form of multiple views/modalities. For example, in autonomous driving, different sensors, such as ultrasonic radar, cameras, and millimeter-wave radar, perceive the surrounding environment, and the data collected by each sensor is regarded as a view [1–5]. A video consists of audio, images, and text, with each medium acting as a view [6]. Fully understanding and utilizing such multi-view data can better mine the data and drive innovation and progress. However, with continuously increasing manual annotation costs and massive amounts of data, there is an urgent need for a more effective way to process multi-view data [7–10]. Therefore, this work focuses on semi-supervised multi-view classification.

The main challenge in semi-supervised multi-view classification is how to fully utilize a small amount of labeled data and a large amount of unlabeled data to obtain a complete multi-view representation (one containing both cross-view shared information and complementary information). Existing semi-supervised multi-view classification methods can be mainly divided into traditional methods and deep learning-based methods [11, 12]. Traditional methods mostly rely on a shared representation obtained after multi-view fusion. Due to the limited ability of shallow methods to extract high-level semantic information from data, their classification performance depends heavily on the original data. With the rapid development of deep learning, deep semi-supervised multi-view classification methods exploit the powerful representation ability of deep models to learn multi-view fusion representations that benefit classification, thus overcoming the shortcomings of traditional methods; as a result, deep learning-based semi-supervised multi-view classification has attracted tremendous attention in the community [13, 14].

Existing deep learning-based semi-supervised multi-view classification methods usually utilize autoencoders or convolutional neural networks to extract features from multiple views, maximize the shared representation to obtain fusion representations, and then apply semi-supervised strategies on the fused representations to generate supervised information for unlabeled data [15, 16]. Although these methods have made some progress, they simply concatenate the features of multiple views or obtain fusion features through a neural network, and cannot effectively estimate the contributions of different views. Multi-view data comes from different sources and contains not only shared information but also a large amount of view-specific information; the informativeness of each view differs from sample to sample. Therefore, dynamically fusing the views of each sample improves the quality of the fusion representations and, in turn, the classification performance of the model.

To address this issue, we propose a novel method called Semi-supervised MultI-view cLassification based on dynamic fusion (SMILE). The method introduces a view-specific autoencoder (AE) for each view to extract low-level features. Since these low-level features are heavily tied to the reconstruction task and are not ideal for classification, we incorporate a high-level semantic mapping head for each view to transform them into high-level features better suited for classification. Additionally, a view confidence module evaluates the informativeness of each view for different samples, allowing for the dynamic fusion of views. This mitigates the negative impact of low-quality views, ensuring the model remains robust to variations in view quality. The main contributions of this work are summarized as follows:

  • This work proposes a novel semi-supervised multi-view classification algorithm that is robust to low-quality views and significantly improves classification performance.
  • The introduced view confidence module dynamically evaluates the informativeness of each view of each sample, refines discriminative features, and dynamically fuses the features of multiple views.
  • Visualization experiments show that the proposed method learns fusion representations with a clear classification structure; ablation experiments confirm the importance of the dynamic fusion module and the high-level feature mapping head; and classification comparisons with six competitive methods on four benchmark datasets demonstrate the effectiveness and superiority of the proposed method.

2 Related work

Most real-life data exists in the form of multi-view/multimodal data. For example, a video contains three modalities: audio, image, and text; data collected by different sensors in autonomous driving constitute multiple modalities; and multiple views in medical diagnosis are formed by different medical images (such as X-ray, MRI, and CT scans) of the same lesion [17–21]. Multi-view data contains richer information than single-view data, and fully exploiting it can better mine the knowledge behind the data, thereby promoting social development and progress [22, 23]. However, with the explosive growth of data, existing datasets commonly consist of numerous unlabeled samples and only a small amount of labeled data. Due to the high cost of manual labeling, how to fully utilize the small amount of labeled data together with the numerous unlabeled data has become a hot and difficult research topic [24–27]. Facing the dual challenges of multiple views and few labels, a more effective method is urgently needed, so semi-supervised multi-view learning has emerged and become a research focus.

Existing semi-supervised multi-view learning methods fall roughly into two groups: traditional methods and deep learning-based methods. Traditional methods primarily fall into the following categories: (1) co-training techniques [19, 28, 29]; (2) graph-based strategies [30–32]; and (3) regression-based approaches [33–35]. Co-training, initially developed for dual-view data, starts by training a classifier on labeled data. It then assigns labels to the unlabeled data in each view. Subsequently, the most confidently predicted samples from one classifier are added to the training set of the other classifier, and this cycle is repeated [36]. Graph-based methods use unlabeled and labeled data as vertices of a consensus graph and propagate label information along the edges. For example, the methods in [31, 32] first construct a graph for each view, then learn view weights to obtain a consensus graph, and finally use label propagation to predict labels for unlabeled data. In line with this, Nie et al. [31] propose a parameter-free method to simultaneously learn the consensus graph matrix, the common label indicator matrix, and the view weights. Regression-based techniques derive projection matrices for each view, thereby exploring view-specific complementary information; they use the label matrix as a cross-view regression target to investigate consistency among views [33]. While these methods demonstrate significant advancements, they are not without limitations. Co-training approaches tend to overlook the inherent diversity of multiple views, treating them as equivalent. This oversight not only fails to rectify the inaccuracies produced by low-quality views but also amplifies these errors within the model, ultimately degrading performance. Furthermore, these methods are constrained in their ability to perform representation learning and cannot effectively explore high-level semantic information within the data. In contrast, deep learning-based approaches, which excel at representation learning, have garnered significant attention [6, 37, 38].

Pseudo-labeling methods demonstrate superior performance compared to many other techniques in deep semi-supervised learning due to their lightweight and effective design. This advantage may be attributed to the substantial number of parameters that must be tuned in deep neural networks; more parameters require a greater volume of labeled data, and pseudo-labeling methods provide labeled data directly, which is particularly advantageous for deep learning models [25, 27]. For example, Wang et al. [1] proposed generating pseudo-labels on the fused representation of multiple views as supervised information to guide the learning of single-view representations. With more supervised information, the representation learning of individual views improves, which in turn facilitates the generation of better fused representations. It follows that the quality of pseudo-labeling depends on the quality of the input representations. However, these methods typically employ straightforward fusion techniques to integrate features from multiple views, neglecting the variability in informativeness across samples. This oversight diminishes the distinguishability of the fused representations. Consequently, the representations generated by existing methods fail to mitigate the adverse effects of low-quality views, yielding limited discriminative features and ultimately constraining improvements in classification performance.

Unlike these approaches, this study dynamically evaluates the informativeness of different views for different samples and then dynamically fuses the features of multiple views to generate classification-friendly representations.

3 Our method

3.1 Preliminaries

In this section, the proposed semi-supervised multi-view classification algorithm based on dynamic fusion is presented. For ease of exposition, we first describe the notation. Given a multi-view dataset with V views, the labeled data is defined as {({x_i^v}_{v=1}^V, y_i)}_{i=1}^L, where V denotes the number of views, y_i is the label of the i-th sample, and L is the number of labeled samples. The unlabeled data is defined as {{x_j^v}_{v=1}^V}_{j=1}^U, where U is the number of unlabeled samples. x_i^v ∈ R^{d_v}, i.e., the feature dimension of the v-th view is d_v. The purpose of semi-supervised multi-view classification is to predict correct labels for the unlabeled samples using a small amount of labeled data and a large amount of unlabeled data. The framework of the proposed method is shown in Fig 1.

Fig 1. Illustration of our proposed SMILE.

The pipeline is as follows: (a) input multi-view data; (b) obtain view-specific representations through view-specific autoencoders; (c) the high-level semantic projection module projects view-specific representations into high-level features; (d) dynamically fuse the view-specific high-level features.

https://doi.org/10.1371/journal.pone.0320831.g001

3.2 Multi-view data reconstruction

The raw multi-view data contains a large number of redundant features, so representative features are first learned from the raw data. The autoencoder is a model that maps raw data into a feature space and is widely used because of its simplicity and effectiveness [39, 40]. Therefore, in this work, a view-specific autoencoder is designed for each view to extract its features. Specifically, for the v-th view of sample i, a nonlinear encoder function E^v is introduced to map the view into a view-specific representation:

z_i^v = E^v(x_i^v),    (1)

where z_i^v is the obtained low-level feature, which the decoder reconstructs to obtain the reconstructed view. Specifically, the decoder is denoted by D^v, and the reconstructed view is given by:

x̂_i^v = D^v(z_i^v).    (2)

This work utilizes a reconstruction loss to optimize this process, which is defined as:

L_rec = Σ_{v=1}^V Σ_i ||x_i^v − x̂_i^v||_2^2.    (3)
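As a minimal sketch of Eqs 1–3, the following NumPy code builds one autoencoder per view and sums the reconstruction errors over views and samples. The single-layer weights and the tanh nonlinearity are illustrative stand-ins for the paper's MLP-based view-specific networks, not the authors' actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_autoencoder(d_in, d_hidden):
    """One encoder/decoder weight pair for a single view (illustrative)."""
    return {
        "W_enc": rng.normal(0, 0.1, (d_in, d_hidden)),
        "W_dec": rng.normal(0, 0.1, (d_hidden, d_in)),
    }

def encode(ae, x):
    # Eq. (1): z = E(x); tanh as a stand-in nonlinearity
    return np.tanh(x @ ae["W_enc"])

def decode(ae, z):
    # Eq. (2): x_hat = D(z)
    return z @ ae["W_dec"]

def reconstruction_loss(views, aes):
    # Eq. (3): squared reconstruction error summed over all views
    loss = 0.0
    for x, ae in zip(views, aes):
        x_hat = decode(ae, encode(ae, x))
        loss += np.sum((x - x_hat) ** 2)
    return loss

# Toy multi-view batch: 2 views of 8 samples, with dimensions 16 and 24
views = [rng.normal(size=(8, 16)), rng.normal(size=(8, 24))]
aes = [init_autoencoder(16, 4), init_autoencoder(24, 4)]
print(reconstruction_loss(views, aes))
```

In practice each `E^v`/`D^v` would be a trained network; here the weights are random and only the loss computation itself is exercised.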

3.3 Dynamic multi-view fusion

The low-level features extracted by the autoencoder from a single view mainly capture information relevant to reconstruction, so performing classification directly in the low-level feature space faces some challenges. To obtain more class-specific features and thereby improve classification performance, we add an additional fully connected layer on top of the low-level features to map them into a high-level representation space: h_i^v = F^v(z_i^v), where h_i^v is the obtained high-level feature and F^v is the high-level semantic projection module of the v-th view.

In multi-view data, the informativeness of each view varies from sample to sample [41, 42]; therefore, understanding this variation is the key to multi-view classification, as it determines the model's ability to adapt to changes in modality quality. Inspired by the literature [43], this work introduces the True-Class-Probability (TCP) [44] to quantify the classification confidence of different views, which is closely related to the amount of classification information a view carries. When a view's classification confidence is low, the current classification is unreliable, which correspondingly means the view is less informative. To obtain the classification confidence of the views, for each view v, this work designs a classifier as a probabilistic model that transforms the observed samples into a predictive distribution p^v ∈ R^C, where C is the number of classes. The classifier can be trained in the maximum likelihood estimation framework by minimizing the KL (Kullback–Leibler) divergence between the predicted distribution and the true distribution:

L_cls^v = −Σ_{c=1}^C y_c log p_c^v,    (4)

where Eq 4 is also known as the cross-entropy function. The maximum class probability (MCP), max_c p_c^v, can be regarded as the classifier's confidence in the current prediction. Although this is effective for classification, it tends to make the model overconfident (assigning high confidence scores even to erroneous predictions). Therefore, to obtain more reliable classification confidence, TCP is used in this work. Unlike MCP, which utilizes the maximum softmax output as a measure of confidence, TCP uses the softmax output probability corresponding to the true label as the confidence. Specifically, for each view v, given the corresponding prediction distribution p^v and label y, the TCP can be formalized as:

TCP^v = ⟨y, p^v⟩ = p_y^v,    (5)

where ⟨·,·⟩ denotes the inner product. When the model's prediction is correct, the output of TCP agrees with that of MCP: both equal the maximal softmax output and reflect classification confidence well. When the prediction is wrong, however, TCP reflects the classification quality better, because it approaches a lower value and thereby exposes the model's tendency to make incorrect predictions. Although TCP gives more reliable confidence, it cannot be used directly at test time or on unlabeled samples because it requires the true label. Hence, a confidence regression network C^v is introduced for each view v to estimate the TCP. Because TCP ∈ [0, 1], a sigmoid activation function is added to the last layer of the network, and the confidence regression network is trained with the loss:

L_conf^v = Σ_i (c_i^v − TCP_i^v)^2,    (6)

where c_i^v is the output of the confidence regression network for the v-th view. The TCP can then be approximated with a view-specific classifier and a confidence regression network. Thus, the fusion representation of multiple views is given by:

h_i = ⊕_{v=1}^V (c_i^v · h_i^v),    (7)

where ⊕ denotes the concatenation operator.
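To make the dynamic fusion concrete, the following NumPy sketch computes the TCP of Eq 5 and the confidence-weighted concatenation of Eq 7. The two-view setup and the use of the TCP values themselves as fusion weights (rather than the learned outputs of the confidence regression network) are simplifying assumptions for illustration only:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def tcp(probs, labels):
    # Eq. (5): probability assigned to the true class, <y, p>
    return probs[np.arange(len(labels)), labels]

def dynamic_fusion(high_feats, confidences):
    # Eq. (7): scale each view's features by its confidence, then concatenate
    weighted = [c[:, None] * h for h, c in zip(high_feats, confidences)]
    return np.concatenate(weighted, axis=1)

rng = np.random.default_rng(0)
n, C = 6, 3
labels = rng.integers(0, C, size=n)
# Per-view class probabilities and high-level features (two views)
probs = [softmax(rng.normal(size=(n, C))) for _ in range(2)]
high_feats = [rng.normal(size=(n, 8)), rng.normal(size=(n, 8))]
# At training time the confidence network regresses these TCP targets;
# here we use the TCP values directly as fusion weights.
confidences = [tcp(p, labels) for p in probs]
fused = dynamic_fusion(high_feats, confidences)
print(fused.shape)  # (6, 16)
```

A view whose classifier puts little probability mass on the true class gets a weight near zero, so its features contribute little to the fused representation.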

3.4 Objective function

In this work, an additional classifier is trained on the fusion representation with the cross-entropy loss to obtain the final classification result p; therefore, the supervised loss is formulated as follows:

L_sup = −Σ_{i=1}^L Σ_{c=1}^C y_{i,c} log p_{i,c}.    (8)

For unlabeled samples, this work employs a threshold-based method to evaluate the reliability of predicted pseudo-labels and selects trustworthy pseudo-labels for training. The unsupervised loss can be defined as the cross-entropy loss between the pseudo-labels and the model predictions:

L_unsup = Σ_{j=1}^U 1(max_c p_{j,c} ≥ τ) · H(ŷ_j, p_j),    (9)

where pseudo-labels whose maximum class prediction probability equals or exceeds the threshold τ are deemed credible, τ is a predefined hyperparameter, ŷ_j is the pseudo-label, and H(·,·) denotes the cross-entropy loss. Thus, the objective function of the proposed method can be defined as:

L = L_rec + Σ_{v=1}^V (L_cls^v + L_conf^v) + L_sup + λ L_unsup,    (10)

where λ is the balance factor between the losses; it is set to 1 in the experiments in this work.
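The threshold gate of Eq 9 can be sketched as follows; this is an illustrative NumPy version of the selection step, with hypothetical helper names, assuming the unlabeled predictions are rows of softmax probabilities:

```python
import numpy as np

def select_pseudo_labels(probs, tau=0.95):
    """Keep only predictions whose maximum class probability reaches the
    threshold tau; return a boolean mask and the argmax pseudo-labels."""
    conf = probs.max(axis=1)
    mask = conf >= tau
    pseudo = probs.argmax(axis=1)
    return mask, pseudo

def unsupervised_loss(probs, tau=0.95, eps=1e-12):
    # Cross-entropy between confident pseudo-labels and the predictions,
    # averaged over the retained (confident) samples only
    mask, pseudo = select_pseudo_labels(probs, tau)
    if not mask.any():
        return 0.0
    picked = probs[mask, pseudo[mask]]
    return float(-np.mean(np.log(picked + eps)))

probs = np.array([
    [0.97, 0.02, 0.01],   # confident -> kept
    [0.50, 0.30, 0.20],   # below tau -> discarded
    [0.05, 0.01, 0.94],   # below tau = 0.95 -> discarded
])
mask, pseudo = select_pseudo_labels(probs, tau=0.95)
print(mask.tolist(), pseudo.tolist())
```

With τ = 0.95 (the value used in the experiments), only the first sample passes the gate, matching the paper's intent of training on trustworthy pseudo-labels only.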

4 Experimental setup

4.1 Datasets

  • Handwritten (https://archive.ics.uci.edu/ml/datasets/Multiple+Features) is a handwritten digits dataset comprising 2,000 samples categorized into 10 classes, with six different views.
  • Scene15 [45] is a dataset that consists of images categorized into 15 indoor and outdoor scene categories. In this work, we employ GIST, PHOG, and LBP features, utilizing these three views with a total of 4,485 samples to construct the dataset.
  • Out-Scene [46] dataset is specifically designed for scene classification tasks and comprises 2,688 outdoor scene images, which are categorized into eight classes. Each sample within the dataset includes four views.
  • GRAZ02 (http://www.emt.tugraz.at/pinz/data/GRAZ_02) is a widely utilized benchmark for object categorization and recognition tasks, consisting of 1,474 samples across four classes, with each sample encompassing six views.

4.2 Comparison methods

This study conducts comparative experiments to assess the proposed method against six state-of-the-art semi-supervised multi-view classification methods.

  • AMGL [31] is developed for multiview clustering and semi-supervised tasks, enabling the automatic learning of optimal graph weights without the need for additional parameters. This approach incorporates heterogeneous features to align with actual data distributions and ensures the attainment of a globally optimal solution.
  • MLAN [32] concurrently executes clustering or semi-supervised classification alongside local structure learning. The model autonomously determines optimal weights for each view without the need for explicit weight specifications or penalty parameters. Furthermore, it is capable of producing reliable graphs even in the presence of noisy data.
  • MVAR [33] employs regression-based loss functions built on the ℓ2,1 matrix norm, integrating them in a linear fashion. It features an efficient and convergent algorithm designed for minimizing the non-smooth ℓ2,1-norm, rendering it appropriate for large-scale datasets. Furthermore, MVAR automatically adjusts weights to accommodate low-quality views and streamlines the prediction process for new data.
  • JCD [30] learns both a common label matrix and view-specific classifiers. It proposes a novel probabilistic square hinge loss to handle uncertain sample contributions and uses power mean to weight losses from different views.
  • LACK [47] presents a label-driven auto-weighted approach that assesses the significance of views through labeling rather than through data representation. This methodology enables LACK to acquire labels with enhanced accuracy in view weights by decomposing the overarching problem into three smaller, more manageable sub-problems that can be solved efficiently.
  • IMvGCN [48] integrates Graph Convolutional Networks (GCN) with multi-view learning to enhance interpretability and performance. It combines reconstruction error and Laplacian embedding to address multi-view learning from both feature and topology perspectives.
  • SMILE-L is a variant of our proposed method in which the model is trained with the labeled data only.

4.2.1 Experimental details.

In the experiments, the view-specific feature extraction network is implemented as a 3-layer multilayer perceptron (MLP). We use the Adam optimizer with weight decay to adjust the learnable parameters, setting the learning rate to 1e–3. The balancing parameter λ is selected from a candidate set, and the threshold τ is fixed at 0.95. The proposed framework is implemented on the PyTorch platform. The experiments were conducted on a computer equipped with an Intel i9-13900HX CPU, an Nvidia GeForce RTX 4060 GPU, and 32 GB of RAM. To evaluate performance, we use classification accuracy (ACC), macro F1-score (F1), and area under the curve (AUC); higher values indicate better performance.
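For reference, the ACC and macro F1 metrics follow their standard definitions; the NumPy sketch below computes both from scratch (in practice one would typically use scikit-learn's `accuracy_score` and `f1_score(average='macro')`):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # ACC: fraction of samples predicted correctly
    return float(np.mean(y_true == y_pred))

def macro_f1(y_true, y_pred, n_classes):
    # Macro F1: per-class F1 averaged with equal class weight
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred, 3))
```

Macro averaging weights all classes equally, which matters on imbalanced datasets where plain accuracy can be dominated by the majority class.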

4.3 Experimental results and analysis

4.3.1 Classification results and display.

In this experiment, to assess the classification performance of our method, we compare it with six competitive semi-supervised multi-view classification methods on four benchmark datasets. The proportion of labeled samples is set to 5%, 10%, and 15%. The ACC, F1, and AUC scores are reported in Tables 1–3, respectively. For ease of observation, the best results are shown in bold. From the experimental results, the following points can be observed:

  1. The ACC and F1 results indicate that while traditional and deep methods each have strengths on different datasets, the proposed method consistently delivers superior performance across nearly all datasets. For instance, on the Scene15 dataset with only 5% labeled samples, our method surpasses the second-best approach by 5.07% in ACC and 5.61% in F1. This performance is attributed to the dynamic fusion strategy in the high-level semantic space, which minimizes conflicts between reconstruction-oriented and category-specific features and reduces the adverse effects of low-quality views.
  2. The AUC results in Table 3 show that our method achieves the best performance on all datasets, indicating good robustness and discriminative ability; the learned representations have a clear classification structure.
  3. SMILE-L demonstrates superior classification performance compared to the six comparison methods across almost all datasets, highlighting its ability to effectively extract task-related features while reducing redundant ones. Furthermore, our full method, SMILE, outperforms SMILE-L, underscoring the necessity and effectiveness of training with unlabeled data. This improvement stems from the valuable information in the unlabeled data, which allows the model to gain a more comprehensive understanding of the data distribution, thereby enhancing downstream performance.

Table 1. Accuracy results (%) compared among methods. The LACK algorithm cannot run on the GRAZ02 dataset, so its results are replaced with "—". The best results are highlighted in bold.

https://doi.org/10.1371/journal.pone.0320831.t001

Table 2. F1 results (%) compared among methods. The LACK algorithm cannot run on the GRAZ02 dataset, so its results are replaced with "—". The best results are highlighted in bold.

https://doi.org/10.1371/journal.pone.0320831.t002

Table 3. AUC results (%) compared among methods. The LACK algorithm cannot run on the GRAZ02 dataset, so its results are replaced with "—". The best results are highlighted in bold.

https://doi.org/10.1371/journal.pone.0320831.t003

4.3.2 Visualization analysis.

To demonstrate the learned fusion representations more intuitively, this work visualizes the original dataset features and the fusion representations learned by our method using the t-SNE [49] and UMAP dimensionality reduction methods. t-SNE focuses more on local similarity, while UMAP focuses more on maintaining global structure. The t-SNE visualizations of the original views and the fused representations are shown in Figs 2 and 3, respectively. The UMAP visualizations of the original views and the fused representations are shown in Figs 4 and 5, respectively.

Fig 2. The t-SNE visualization of the original data.

https://doi.org/10.1371/journal.pone.0320831.g002

Fig 3. The t-SNE visualization of the fused features extracted in this work.

https://doi.org/10.1371/journal.pone.0320831.g003

Fig 4. The UMAP visualization of the original data.

https://doi.org/10.1371/journal.pone.0320831.g004

Fig 5. The UMAP visualization of the fused features extracted in this work.

https://doi.org/10.1371/journal.pone.0320831.g005

Observation of the visualization results reveals that the raw data are distributed in a complex manner, with confusion between classes, making it difficult to clearly distinguish between different classes, especially on the two datasets Scene15 and GRAZ02. This suggests that the original data may have more overlapping and mixing in the high-dimensional space. The learned fusion representation, however, possesses obvious inter-class separation and intra-class compactness, indicating that our method is able to better distinguish different classes in the low-dimensional representation of the data, and has achieved significant improvement in data representation.

4.3.3 Ablation study.

This experiment further analyzes the importance of the high-level semantic mapping module and the dynamic fusion module to explore their roles in model performance. The results of the ablation experiments are reported in Tables 4 and 5 to clearly demonstrate the impact of these two modules on classification performance. The following conclusions can be drawn from the results: both modules play an active role in improving the classification performance of the model. The high-level semantic mapping module facilitates the extraction of classification features and avoids the influence of redundant features, while the dynamic fusion module avoids the influence of low-quality views and handles the correlation and weight assignment among the views of different samples. The experimental results demonstrate the importance and effectiveness of both modules.

Table 4. Ablation study of “w/” or “w/o” high-level semantic mapping module.

https://doi.org/10.1371/journal.pone.0320831.t004

Table 5. Ablation study of “w/” or “w/o” dynamic fusion module.

https://doi.org/10.1371/journal.pone.0320831.t005

Discussion

The proposed method achieves dynamic fusion in multi-view settings under semi-supervised scenarios, which helps mitigate performance degradation caused by low-quality views. However, we acknowledge a key limitation of this work: the reliance on pseudo-label quality and the approximation accuracy of the true class probability (TCP). While pseudo-labeling is employed to address the scarcity of labeled data, the accuracy of these pseudo labels is inherently difficult to evaluate in the absence of ground truth. In extreme cases of low-quality views, the method may produce more inaccurate predictions, which could further compromise the TCP approximation and, in turn, hinder the extraction of view-specific features. Addressing the challenge of evaluating and improving pseudo-label quality remains an open and complex problem, and it represents a key direction for our future research efforts.

Conclusion

With advancements in multimedia technology, the prevalence of multi-view data has increased, offering richer information for analysis and understanding while also presenting several challenges. Issues such as limited labeling information, effective multi-view fusion, and discriminative feature extraction need to be addressed. This work introduces a semi-supervised multi-view classification method based on dynamic fusion, which excels in extracting discriminative features and dynamically fusing multiple views from various samples. The high-level semantic mapping module reduces the impact of redundant features and retains important classification-related features, while the dynamic fusion module assigns weights to different views for each sample, minimizing the effects of noisy and low-quality views and exploring view associations effectively. Quantitative experiments validate the algorithm’s effectiveness and superiority, and visualization experiments demonstrate that the learned fusion features have a strong classification structure. Ablation studies highlight the importance and effectiveness of each module. Future work will focus on improving semi-supervised multi-view classification, enhancing pseudo-labeling accuracy, discovering more discriminative features, and achieving better fusion of multiple views for optimal classification representations.

Acknowledgments

We thank all reviewers for their valuable suggestions on this paper.

References

  1. Wang X, Fu L, Zhang Y, Wang Y, Li Z. MMatch: semi-supervised discriminative representation learning for multi-view classification. IEEE Trans Circuits Syst Video Technol. 2022;1–1.
  2. Wang X, Wang Y, Wang Y, Huang A, Liu J. Trusted semi-supervised multi-view classification with contrastive learning. IEEE Trans Multimedia. 2024;26:8268–78.
  3. Tian Y, Sun S, Tang J. Multi-view teacher–student network. Neural Netw. 2022;146:69–84. pmid:34839092
  4. Xu J, Ren Y, Tang H, Yang Z, Pan L, Yang Y, et al. Self-supervised discriminative feature learning for deep clustering. IEEE Trans Knowl Data Eng. 2022.
  5. Mao Y, Zhang J, Qi H, Wang L. DNN-MVL: DNN-multi-view-learning-based recover block missing data in a dam safety monitoring system. Sensors (Basel). 2019;19(13):2895. pmid:31261982
  6. Zhou H, Gong M, Wang S, Gao Y, Zhao Z. Smgcl: Semi-supervised multi-view graph contrastive learning. Knowledge-Based Syst. 2023;260:110120. https://doi.org/10.1016/j.knosys.2022.110120
  7. Chao G, Sun S. Semi-supervised multi-view maximum entropy discrimination with expectation Laplacian regularization. Inf Fusion. 2019;45:296–306.
  8. Jiang B, Zhang C, Zhong Y, Liu Y, Zhang Y, Wu X, et al. Adaptive collaborative fusion for multi-view semi-supervised classification. Inf Fusion. 2023;96:37–50.
  9. Xu J, Zheng H, Wang J, Li D, Fang X. Recognition of EEG signal motor imagery intention based on deep multi-view feature learning. Sensors. 2020;20(12):3496. https://doi.org/10.3390/s20123496 pmid:32575798
  10. Alsulami N, Althobaiti H, Alafif T. MV-MFF: multi-view multi-feature fusion model for pneumonia classification. Diagnostics. 2024;14(14):1566. https://doi.org/10.3390/diagnostics14141566 pmid:39061703
  11. Zhang B, Qiang Q, Wang F, Nie F. Fast multi-view semi-supervised learning with learned graph. IEEE Trans Knowl Data Eng. 2020;34(1):286–299.
  12. Li S, Li WT, Wang W. Co-gcn for multi-view semi-supervised learning. In: AAAI Conference on Artificial Intelligence; 2020.
  13. Wang Xl, Zhu Zf, Song Y, Fu Hj. GRNet: graph-based remodeling network for multi-view semi-supervised classification. Pattern Recognit Lett. 2021;151:95–102.
  14. Wang X, Wang Y, Ke G, Wang Y, Hong X. Knowledge distillation-driven semi-supervised multi-view classification. Inf Fusion. 2024;103:102098.
  15. Noroozi V, Bahaadini S, Zheng L, Xie S, Shao W, Philip SY. Semi-supervised deep representation learning for multi-view problems. In: 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA. 2018;56–64.
  16. Jia X, Jing XY, Zhu X, Chen S, Du B, Cai Z, et al. Semi-supervised multi-view deep discriminant representation learning. IEEE Trans Pattern Anal Mach Intell. 2020;43(7):2496–509. https://doi.ieeecomputersociety.org/10.1109/TPAMI.2020.2973634
  17. Yu J, Li J, Yu Z, Huang Q. Multimodal transformer with multi-view visual representation for image captioning. IEEE Trans Circuits Syst Video Technol. 2019;30(12):4467–4480.
  18. Cao X, Zhang C, Fu H, Liu S, Zhang H. Diversity-induced multi-view subspace clustering. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA. 2015;586–94.
  19. Cheng Y, Zhao X, Cai R, Li Z, Huang K, Rui Y, et al. Semi-supervised multimodal deep learning for RGB-D object recognition. In: International Joint Conference on Artificial Intelligence. 2016;3345–51.
  20. 20. Liu Y, Liu Y, Duan Y. MVG-Net: LiDAR point cloud semantic segmentation network integrating multi-view images. Remote Sensing. 2024;16(15):2821. 16152821
  21. 21. Wu J, Yao Y, Zhang G, Li X, Peng B. Difficult airway assessment based on multi-view metric learning. Bioengineering. 2024;11(7):703. pmid:39061785
  22. 22. Liu C, Wen J, Liu Y, Huang C, Wu Z, Luo X, et al. Masked two-channel decoupling framework for incomplete multi-view weak multi-label learning. In: NIPS ’23: Proceedings of the 37th International Conference on Neural Information Processing Systems. 2023; 32387–400, Article No: 1406.
  23. 23. Liu C, Xu G, Wen J, Liu Y, Huang C, Xu Y. Partial multi-view multi-label classification via semantic invariance learning and prototype modeling. In: Proceedings of the 41st International Conference on Machine Learning, PMLR 235:32253–67, 2024.
  24. 24. Li J, Xiong C, Hoi SC. Comatch: Semi-supervised learning with contrastive graph regularization. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada. 2021;9455–64.
  25. 25. Wang Y, Chen H, Heng Q, Hou W, Savvides M, Shinozaki T, et al. FreeMatch: self-adaptive thresholding for semi-supervised learning. arXiv. preprint. arXiv.2205.07246.
  26. 26. Zhang B, Wang Y, Hou W, Wu H, Wang J, Okumura M, et al. Flexmatch: boosting semi-supervised learning with curriculum pseudo labeling. In: Advances in Neural Information Processing Systems. 2021;34:18408–18419.
  27. 27. Sohn K, Berthelot D, Carlini N, Zhang Z, Zhang H, Raffel CA, et al. FixMatch: simplifying semi-supervised learning with consistency and confidence. In: Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2020(33);596–608.
  28. 28. Ma F, Meng D, Dong X, Yang Y. Self-paced multi-view co-training. J Mach Learn Res. 2020;21;1–38.
  29. 29. Liu LY, Huang P, Min F. Safe multi-view co-training for semi-supervised regression. In: 2022 IEEE 9th International Conference on Data Science and Advanced Analytics (DSAA). IEEE; 2022;1–10.
  30. 30. Zhuge W, Hou C, Peng S, Yi D. Joint consensus and diversity for multi-view semi-supervised classification. Mach Learn. 2020;109(3):445–65.
  31. 31. Nie F, Li J, Li X. Parameter-free auto-weighted multiple graph learning: a framework for multiview clustering and semi-supervised classification. In: International Joint Conference on Artificial Intelligence; 2016.
  32. 32. Nie F, Cai G, Li J, Li X. Auto-weighted multi-view learning for image clustering and semi-supervised classification. IEEE Trans Image Process. 2018;27(3):1501–11.
  33. 33. Tao H, Hou C, Nie F, Zhu J, Yi D. Scalable multi-view semi-supervised classification via adaptive regression. IEEE Trans Image Process. 2017;26(9):4283–96.
  34. 34. Huang A, Wang Z, Zheng Y, Zhao T, Lin CW. Embedding regularizer learning for multi-view semi-supervised classification. IEEE Trans Image Process. 2021;30:6997–7011.
  35. 35. Huang H, Liang N, Yan W, Yang Z, Sun W. Partially shared semi-supervised deep matrix factorization with multi-view data. arXiv:201200993 [cs]. 2020;
  36. 36. Kumar A, Daumé H. A co-training approach for multi-view spectral clustering. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11); 2011;393–400.
  37. 37. Guo W, Wang Z, Du W. Robust semi-supervised multi-view graph learning with sharable and individual structure. Pattern Recognit. 2023;140:109565.
  38. 38. Yu Y, Zhou G, Huang H, Xie S, Zhao Q. A semi-supervised label-driven auto-weighted strategy for multi-view data classification. Knowl-Based Syst. 2022;255:109694.
  39. 39. Kang M, Lee K, Lee YH, Suh C. Autoencoder-based graph construction for semi-supervised learning. In: Vedaldi A, Bischof H, Brox T, Frahm JM (Editors). Computer Vision – ECCV 2020. Springer International Publishing; 2020;12369:500–17.
  40. 40. Zhang C, Liu Y, Fu H. Ae2-Nets: autoencoder in autoencoder networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; 2019;2577–85.
  41. 41. Rideaux R, Storrs KR, Maiello G, Welchman AE. How multisensory neurons solve causal inference. Proc Natl Acad Sci U S A. 2021;118(32):e2106235118. pmid:34349023
  42. 42. Hou H, Zheng Q, Zhao Y, Pouget A, Gu Y. Neural correlates of optimal multisensory decision making under time-varying reliabilities with an invariant linear probabilistic population code. Neuron. 2019;104(5):1010–21.
  43. 43. Han Z, Yang F, Huang J, Zhang C, Yao J. Multimodal dynamics: dynamical fusion for trustworthy multimodal classification. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2022;20675–85.
  44. 44. Corbière C, Thome N, Bar-Hen A, Cord M, Pérez P. Addressing failure prediction by learning model confidence. In: Advances in Neural Information Processing Systems. 2019;32.
  45. 45. Cheng G, Han J, Lu X. Remote sensing image scene classification: Benchmark and state of the art. Proc IEEE. 2017;105(10):1865–1883.
  46. 46. Hu Z, Nie F, Wang R, Li X. Multi-view spectral clustering via integrating nonnegative embedding and spectral embedding. Inf Fusion. 2020;55:251–259.
  47. 47. Yu Y, Zhou G, Huang H, Xie S, Zhao Q. A semi-supervised label-driven auto-weighted strategy for multi-view data classification. Knowl-Based Syst. 2022;255:109694.
  48. 48. Wu Z, Lin X, Lin Z, Chen Z, Bai Y, Wang S. Interpretable graph convolutional network for multi-view semi-supervised learning. IEEE Trans Multimedia. 2023;1–14.
  49. 49. Van der Maaten L, Hinton G. Visualizing data using T-SNE. J Mach Learn Res. 2008;9(11).