Abstract
Gallbladder cancer, a common yet often underdiagnosed malignancy, is typically characterized by late detection and a poor prognosis. The rise of deep learning has introduced new methods for its early identification through B-ultrasound imaging, but challenges of inefficient data labeling and feature extraction remain. This paper introduces a novel classification algorithm, ASGBC, intended to tackle these challenges in diagnosing gallbladder cancer from B-ultrasound images. Firstly, we combine active learning with self-supervised learning to decrease the reliance on labeled data. Secondly, we introduce the MsHop module, which captures the fine textures and patterns in ultrasound images through the integration of multi-scale and high-order information, thereby improving diagnostic accuracy. Additionally, we develop a dual-branch loss function that leverages data correlation and clustering features to enhance feature extraction and model stability. Experiments on a gallbladder ultrasound dataset confirm the effectiveness of our algorithm, which achieves an accuracy of 0.884, a specificity of 0.932, and a sensitivity of 0.912, outperforming existing methods. The results also exhibit lower variance, indicating improved model stability. Furthermore, the findings demonstrate that with active learning, results comparable to those from the full dataset can be achieved with only 35% of the data, reducing annotation costs and increasing model learning efficiency. Future research will concentrate on refining the algorithm for wider clinical use and on identifying additional features that may further improve diagnostic accuracy.
Citation: Li J, Zhou Y-Q (2025) Enhanced gallbladder cancer detection via active and self-supervised learning integration: Innovating B-ultrasound image analysis. PLoS One 20(9): e0330781. https://doi.org/10.1371/journal.pone.0330781
Editor: Li Yang, Sichuan University, CHINA
Received: March 14, 2025; Accepted: August 5, 2025; Published: September 16, 2025
Copyright: © 2025 Li, Zhou. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All GBCU files are available from the GBCU database (https://gbc-iitd.github.io/data/gbcu).
Funding: This work was supported by the Scientific Research Start-Up Project of CUIT (KYTZ202120 to JL), Sichuan Science and Technology Program (2023ZYD00011 to YQ Z), Key Project of the Open Fund of the Sichuan National Applied Mathematics Center (2024-KDJJ-02-001 to YQ Z), and Key Project of the Open Fund of the Sichuan National Applied Mathematics Center (2025-KFJJ-02-001 to YQ Z).
Competing interests: The authors have declared that no competing interests exist.
Introduction
Gallbladder cancer (GBC) is a highly malignant tumor with a very poor prognosis, posing a significant threat to global public health. Although bile duct malignancies are relatively rare, GBC is the most common and aggressive among them, particularly affecting women [1]. GBC ranks 22nd among all common cancers, with a higher incidence in women (20th) than in men (23rd) [2]. It is the sixth leading cause of cancer-related deaths among various malignancies. According to the GLOBOCAN 2020 report, there were 115,949 new cases and 84,695 deaths worldwide in 2020, across all ages and genders. Asia has the highest incidence and mortality rates, accounting for 71.9% and 75.0% of the global figures, respectively [3]. Early detection is therefore crucial for effective treatment and for improving patient survival rates.
Given the anatomical location of GBC and its asymptomatic nature or symptoms that mimic other diseases, early detection is challenging. It is often discovered incidentally after gallbladder removal surgery for other indications [4]. A study conducted in 2021 showed that less than 15% of GBC cases diagnosed in the United States between 2013 and 2017 were at a localized stage. The majority were diagnosed at later stages, with 38.7% at the regional stage and 44.4% at the distant stage [5].
Ultrasound is the primary diagnostic tool for gallbladder diseases due to its safety, cost-effectiveness, and ease of use. It is commonly used for the initial assessment of suspected gallbladder diseases. In resource-limited countries, it is often the only imaging examination available to patients with abdominal diseases. However, ultrasound images can be compromised by noise artifacts, such as speckle noise, which degrades image quality—a problem not commonly found in CT, MRI, PET, and SPECT imaging modalities. Detecting malignant gallbladders is more challenging due to the lack of clear boundaries or morphological features compared to normal and benign gallbladder areas. Therefore, the accuracy of ultrasound diagnosis largely depends on the experience of the sonographer and the diagnosing physician [6]. While certain characteristics aid in identifying GBC on ultrasound images, differentiating it from other abnormalities in the early stages remains difficult [7]. Thus, analyzing ultrasound images to establish and understand the characteristics of malignant gallbladder tumors is essential for enhancing the recognition and differentiation of GBC, ultimately preventing under-treatment and over-treatment.
The evolution of deep learning has opened new avenues for medical diagnostics using ultrasound images. Many researchers have applied deep learning techniques to analyze ultrasound images across different organs, thereby assisting in medical diagnosis. This encompasses a range of conditions, including breast tumors [8,9], prostate nodules [10,11], thyroid nodules [12,13], ocular diseases [14,15], pulmonary conditions [16,17], and fetal imaging [18,19]. However, compared with CT and MRI, studies based on B-ultrasound images are far fewer, and those concerning GBC are rarer still.
Self-supervised learning (SSL) and active learning (AL) hold significant potential in medical image analysis, enhancing a model's learning capability and diagnostic accuracy through intelligent data selection and exploitation of the data's inherent structure [20,21]. SSL trains models by predicting transformations or attributes of the data without external labeling, making full use of unlabeled image data [22–24]. AL allows models to identify and request labels for the most uncertain samples, optimizing the learning process under limited expert resources [25,26]. These methods can strengthen a model's ability to recognize subtle features in medical images and improve generalization to different lesion types, which is crucial for increasing diagnostic efficiency and accuracy. However, the majority of existing studies have relied on supervised learning methodologies, necessitating fully labeled datasets.
At present, there is no research on applying AL to B-ultrasound images. A smaller body of work has explored SSL algorithms for pre-training models to extract features from B-mode ultrasound images. Nguyen et al. [27] assessed the efficacy of the BYOL algorithm [28] for classifying breast ultrasound images on a public expert-annotated breast dataset. Mishra et al. [29] pre-trained an encoder-decoder architecture on deterministic edge-detection and segmentation pretext tasks that require no manual labels. Experiments on two public datasets demonstrated that SSL enhances performance, particularly when labeled training data are scarce. Zhao and Yang [30] utilized the public TN-SCUI2020 dataset to pre-train classifiers for distinguishing between benign and malignant thyroid nodules. Jiao et al. [31] and Chen et al. [32] applied SSL to obstetric ultrasound image analysis tasks. Liu et al. [33] pre-trained an encoder-decoder model for the downstream task of classifying gastrointestinal stromal tumors from endoscopic ultrasound images.
Using deep learning models to analyze ultrasound images thus poses major challenges. Firstly, unlike MRI or CT, ultrasound images tend to have lower quality and are susceptible to noise and sensor artifacts; existing feature extractors, which are primarily designed for natural images, are therefore prone to learning from spurious textures and failing to capture the true characteristics of GBC. Moreover, most existing classification algorithms for GBC rely on fully supervised learning, requiring all data to be labeled. The algorithms proposed by Basu et al. [34,35] are based on unsupervised and self-supervised learning, but they require B-ultrasound video information, leading to high training resource demands. The integration of AL and SSL offers a promising yet unexplored path. To reduce training costs while enhancing the accuracy of diagnosing GBC from B-ultrasound images, this paper introduces a classification algorithm for GBC that combines AL and SSL, called ASGBC. The specific contributions are as follows:
- Integration of AL and SSL: We implement AL prior to SSL to preselect data with high information value. This proactive selection reduces the training time for the classification model, decreases the demand for computational resources, and lowers deployment costs.
- Design of the MsHop Module: By extracting multi-scale and high-order information from images simultaneously, it comprehensively encodes tumor characteristics, ensuring a detailed and accurate feature representation.
- Design of a Dual-Branch Loss Function: It considers both data correlation and clustering features, making feature extraction more refined and robust, thereby improving the model’s predictive accuracy and stability.
Related works
Deep learning applications for gallbladder cancer diagnosis. With the development of deep learning technology, its application in medical image analysis has provided new perspectives for improving the accuracy and efficiency of diagnosing GBC. Recent studies have introduced several innovative methods in this domain. Lian et al. [36] presented an automatic segmentation method for gallbladder and gallstone regions in ultrasound images, integrating an improved Otsu algorithm, anisotropic diffusion, global morphology filtering, a parameter-adaptive pulse-coupled neural network (PA-PCNN), and locally weighted regression smoothing (LOESS) for enhanced accuracy and efficiency. Jeong et al. [37] developed a deep learning-based decision support system (DL-DSS) that significantly improved the performance of gallbladder polyp diagnosis on ultrasound through transfer learning, demonstrating that the diagnostic performance assisted by DL-DSS was superior to that of individual radiologists. Kim et al. [38] enhanced the classification accuracy of gallbladder polyps less than 20 millimeters by using an ensemble convolutional neural network model, showing the potential of deep learning to improve clinical diagnostic specificity. Basu et al. [39] developed GBCNet, a CNN-based model that excels in GBC detection from ultrasound images. It overcomes challenges of low image quality and spurious textures through a novel ROI extraction method, a multi-scale second-order pooling architecture, and a curriculum inspired by human visual acuity, outperforming both state-of-the-art models and expert radiologists. Basu et al. [34] also introduced an innovative unsupervised contrastive learning (UCL) framework that uses hard negatives from temporally distant frames within the same ultrasound video, along with a hardness-sensitive negative mining curriculum, to enhance image representation learning for gallbladder malignancy detection, achieving higher accuracy than state-of-the-art techniques. 
Shuvo and Chowdhury [40] proposed a method for GBC classification using an ensemble of well-known convolutional neural network models, significantly enhancing classification accuracy.
Active learning. Existing AL approaches can be divided into two main groups: distribution-based and uncertainty-based methods. AL has emerged as a pivotal paradigm for enhancing the efficiency of model training by selectively labeling informative data points. Sener and Savarese [41] redefined AL as core-set selection, focusing on choosing a diverse subset of data. Pinsler et al. [42] offered a Bayesian batch AL method that approximates the posterior for model parameters, enabling scalable AL. Sinha et al. [43] introduced Variational Adversarial Active Learning (VAAL), leveraging a VAE and adversarial network for representation learning. Xie et al. [44] proposed Energy-based Active Domain Adaptation (EADA), utilizing energy-based models to reduce domain gaps. Cabannes et al. [45] presented Positive Active Learning (PAL), a framework that integrates SSL with AL by querying semantic relationships. These works collectively advance the field by addressing challenges in labeling efficiency, representation learning, and scalability in various learning scenarios.
Self-supervised learning. SSL has made significant strides in recent years, with various innovative approaches proposed to learn meaningful representations without labeled data. The introduction of frameworks like SimCLR [46], MoCo [47], and BYOL [28] has revolutionized the field by simplifying the learning process and enhancing the quality of learned representations. These methods focus on maximizing the similarity between augmented versions of the same image, effectively learning to discriminate between different instances. Additionally, SwAV [48] introduced an online clustering approach that contrasts cluster assignments, further improving the efficiency and scalability of SSL. Barlow Twins [49] presented a redundancy reduction principle to ensure informative yet invariant representations. Innovative works by He et al. [50], Chen et al. [51], and Bao et al. [52] significantly advanced the state-of-the-art by introducing masked autoencoders (MAE) and their scalable variants, demonstrating the efficacy of reconstructing masked image patches for learning meaningful representations. The seminal paper by Huang et al. [53] on Vision Transformers has further catalyzed research in this domain, showing that self-supervised methods can be effectively scaled up with the right architectural choices. The recent breakthrough by Mishra et al. [54] presented a simple yet efficient contrastive masked autoencoder, highlighting the complementary nature of contrastive learning and MAE. Furthermore, the work of Oquab et al. [55] on DINOv2 underscores the potential of SSL to produce versatile visual features that are competitive with weakly-supervised models across diverse tasks. Collectively, these works have pushed the boundaries of unsupervised visual representation learning, achieving competitive results with supervised counterparts and opening new avenues for research in computer vision.
In summarizing the related work on diagnosing GBC on B-ultrasound images using deep learning, the majority of explorations have been based on deep learning schemes that rely solely on labeled data. While some methods have indeed harnessed the information stored in unlabeled data through SSL [34,35], they all necessitate the use of video-level data rather than individual B-ultrasound images. Furthermore, no methods have addressed the reduction of labeling costs and computational resources through the use of AL. In this paper, we combine AL with SSL, effectively saving on training-related costs and enhancing the model’s accuracy and robustness by improving the feature extraction network and loss function within SSL.
Method
Overview of the framework
Our proposed approach ASGBC combines AL and SSL to minimize training expenses and the labor of labeling, while enhancing the model’s precision and stability through the integration of multi-scale, high-order information, as well as features that account for image correlation and clustering. The flowchart of our algorithm is illustrated as shown in Fig 1.
Fig 1. The proposed method, ASGBC, integrates AL in the first phase and SSL in the second phase. This integration helps to reduce the required training resources and labeling effort to less than 35% of the dataset.
For details on the specific algorithmic process, please refer to Algorithm 1. Initially, AL is employed to train the data selection network, which selects the most informative samples XS from the pool of unlabeled data X (details can be found in section Data selection). Following this, a feature extractor that incorporates multi-scale modules and high-order pooling modules is trained on the selected data XS in conjunction with SSL techniques (for more information, please refer to section Get extractor). The learned features can then be applied to fine-tune various downstream tasks, including but not limited to classification, localization, and detection. In this paper, the downstream task is the classification of GBC in B-ultrasound images. A classifier with a single linear layer is appended to the feature extraction backbone, with the feature extractor's parameters frozen to retain the generalized features learned. The classifier is then fine-tuned using a modest amount of labeled B-ultrasound data to obtain the final classification model.
Algorithm 1. ASGBC.
1: Input: Unlabeled dataset X
2: Output: Trained GBC classification model
3: Phase 1: Data Selection (Active Learning based on VAAL)
4: Randomly select 5% of the data from X to initialize XS, and set XU = X \ XS
5: for epoch = 1 to epochs do
6:   Sample xL from XS and xU from XU
7:   Compute L_VAE^trd and L_VAE^adv using Eqs (1) and (2), respectively (Eq (1): transductive VAE loss; Eq (2): adversarial VAE loss)
8:   Compute L_VAE using Eq (4) (combined VAE loss integrating transductive and adversarial components)
9:   Update the VAE by descending its stochastic gradient
10:  Compute L_D using Eq (3) (discriminator loss for active learning)
11:  Update D by descending its stochastic gradient
12:  Train and update the task module T
13: end for
14: Select the samples XS to which the discriminator D assigns the lowest probability of being labeled (i.e., the samples with the highest uncertainty in the unlabeled pool)
15: XS ← XS ∪ {selected samples}
16: XU ← XU \ {selected samples}
17: Phase 2: Feature Extractor Training (Self-Supervised Learning)
18: for epoch = 1 to epochs do
19:   for batch = 1 to batches do
20:     Sample a batch of data xS from XS
21:     Generate two different augmentations X1 = T1(xS) and X2 = T2(xS), respectively
22:     Extract the features Y1 and Y2 using the feature extractor fθ
23:     Compute the embeddings Z1 and Z2 using the expander hφ
24:     Compute the codes Q1 and Q2 using Eq (17)
25:     Compute Lcon using Eq (11) (correlation-branch loss for feature invariance)
26:     Compute Lclu using Eq (15) (clustering-branch loss for feature discriminability)
27:     Compute Ltotal using Eq (5) (combined loss integrating the two branches)
28:     Minimize Ltotal to update the parameters θ and φ
29:   end for
30: end for
31: Phase 3: Supervised Fine-tuning
32: Annotate the selected data XS for classification
33: Freeze the weights of the feature extractor fθ
34: Fine-tune the linear classification layer with supervised training on XS to obtain the GBC classification model
35: return Trained GBC classification model
Data selection
This section explains how we utilize AL to select the most informative data. Although our method is compatible with any AL technique, we draw inspiration from Variational Adversarial Active Learning (VAAL) [43]. We employ a β-Variational Autoencoder (β-VAE) [56] and an adversarial network [57] to implicitly learn the sampling mechanism. The β-VAE is responsible for learning the latent space representation of the data, while the adversarial network discerns between labeled and unlabeled data. A minimax game is played between the β-VAE and the adversarial network, where the β-VAE aims to deceive the adversarial network into believing that all data points originate from the labeled data pool. Concurrently, the adversarial network strives to differentiate between them within the latent space. For specific algorithm steps, refer to Algorithm 1 (Phase 1). This phase does not require any labels; instead, it is entirely based on the intrinsic characteristics of the data.
The transductive representation-learning objective of the VAE is:

\[
\mathcal{L}_{VAE}^{trd} = \mathbb{E}\big[\log p_{\theta}(x_L \mid z_L)\big] - \beta\, \mathrm{D}_{KL}\big(q_{\phi}(z_L \mid x_L) \,\|\, p(z)\big) + \mathbb{E}\big[\log p_{\theta}(x_U \mid z_U)\big] - \beta\, \mathrm{D}_{KL}\big(q_{\phi}(z_U \mid x_U) \,\|\, p(z)\big) \tag{1}
\]

The adversarial representation-learning objective of the VAE is:

\[
\mathcal{L}_{VAE}^{adv} = -\mathbb{E}\big[\log D\big(q_{\phi}(z_L \mid x_L)\big)\big] - \mathbb{E}\big[\log D\big(q_{\phi}(z_U \mid x_U)\big)\big] \tag{2}
\]

The training objective of the adversarial network is:

\[
\mathcal{L}_{D} = -\mathbb{E}\big[\log D\big(q_{\phi}(z_L \mid x_L)\big)\big] - \mathbb{E}\big[\log\big(1 - D\big(q_{\phi}(z_U \mid x_U)\big)\big)\big] \tag{3}
\]

The complete VAE objective of VAAL is:

\[
\mathcal{L}_{VAE} = \lambda_1 \mathcal{L}_{VAE}^{trd} + \lambda_2 \mathcal{L}_{VAE}^{adv} \tag{4}
\]

where \(\mathbb{E}[\cdot]\) denotes mathematical expectation; \(q_{\phi}\) and \(p_{\theta}\) are the encoder and decoder, parameterized by \(\phi\) and \(\theta\), respectively; \(p(z)\) is the chosen prior distribution, usually the unit Gaussian; \(\beta\) is the Lagrange multiplier of the optimization problem; \(\mathrm{D}_{KL}\) denotes the Kullback-Leibler divergence; \(D\) is the discriminator of the adversarial network; and \(\lambda_1\) and \(\lambda_2\) are hyperparameters that determine the contribution of each component to learning an effective variational adversarial representation.
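After the minimax game converges, samples are chosen by how confidently the discriminator judges them to be unlabeled: the samples the VAE failed to make look "labeled" are the most informative ones to annotate. The following is a minimal illustrative sketch of this selection rule (our own illustration with hypothetical scores, not the authors' code):

```python
import numpy as np

def select_informative(disc_prob_labeled, budget):
    """Rank unlabeled samples by the discriminator's probability that they
    come from the labeled pool; the lowest-scoring samples are the most
    informative ones, so they are selected for annotation."""
    order = np.argsort(disc_prob_labeled)  # ascending: least "labeled-like" first
    return order[:budget]

# toy example: discriminator scores for 8 unlabeled samples, budget of 3
probs = np.array([0.91, 0.12, 0.55, 0.08, 0.77, 0.30, 0.95, 0.41])
picked = select_informative(probs, 3)
```

In the full pipeline this selection is repeated until the target fraction of the pool (35% in our experiments) has been gathered.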
Get extractor
In this section, we will describe how to employ an enhanced SSL algorithm to train a feature extractor for gallbladder B-ultrasound images. The framework of this SSL algorithm is illustrated in Fig 1 (Phase 2). Similar to traditional contrastive learning methods, it features two branches for data augmentation, with distinctions in the following two aspects.
1. The feature extractor is specially designed, incorporating a multi-scale high-order feature extraction module (MsHop) into the ResNet backbone. This allows the integration of multi-scale and high-order information from images, leading to more accurate extraction of GBC features.
2. A unique dual-branch loss function is designed, integrating the correlation and clustering features of the feature maps. It optimizes the model from multiple perspectives, enhancing the feature extraction capability while also strengthening the model’s stability.
Algorithm 1 (Phase 2) provides a detailed step-by-step guide for training a feature extractor using the improved self-supervised algorithm. Given a batch of images xS, two different batches of views X1 and X2 are generated by transformations T1 and T2, respectively, and then encoded into representations Y1 and Y2 using a feature extractor \(f_{\theta}\). These representations are processed by two different branches, which handle feature encoding differently, resulting in two losses Lcon and Lclu through intra-batch decorrelation and swapped prediction, respectively. The final loss is a combination of these two losses:

\[
\mathcal{L}_{total} = \mathcal{L}_{con} + \alpha \mathcal{L}_{clu} \tag{5}
\]

where α is a hyperparameter.
We will now proceed to detail the MsHop module and the dual-branch loss function.
The core concept of multi-scale design is to establish hierarchical residual connections within a single residual block, representing multi-scale features and increasing the receptive field of each network layer. Higher-order pooling aims to use all three feature dimensions to learn a robust second-order covariance representation, thereby enhancing the accuracy of diagnosing GBC in ultrasound images. The MsHop module integrates these two approaches, as shown in Fig 2. This module takes into account both multi-scale information and the height, width, and channel dimensions to enhance the learned second-order statistical information.
Firstly, we introduce the MsHop module; its computational process is depicted in Fig 2.

Assume the input feature map \(F\) has size \(C \times H \times W\). It is divided into four subsets \(F_1, F_2, F_3, F_4\) along the channel (depth) direction, each subset \(F_i\) having size \((C/4) \times H \times W\). Each subset is processed by a \(3 \times 3\) convolution \(K_i\), resulting in output \(S_i\). Except for \(F_1\), each \(F_i\) is first added to the output \(S_{i-1}\) of the previous filter bank and then fed into the next filter bank \(K_i\). This process can be expressed as:

\[
S_i = \begin{cases} K_i(F_i), & i = 1 \\ K_i(F_i + S_{i-1}), & 1 < i \le 4 \end{cases}
\]

where \(K_i\) represents the \(i\)-th group of \(3 \times 3\) convolutions. Finally, all \(S_i\) are merged by concatenation and fused by a \(1 \times 1\) convolution \(K_{1 \times 1}\), yielding the middle feature map \(F_{mid}\):

\[
F_{mid} = K_{1 \times 1}\big(\mathrm{concat}(S_1, S_2, S_3, S_4)\big)
\]

Assume \(F_{mid}\) has size \(C \times H \times W\). Three \(1 \times 1\) convolution layers reduce the feature dimension to \(C' \times H \times W\), \(C \times H' \times W\), and \(C \times H \times W'\), respectively, where \(C' < C\), \(H' < H\), and \(W' < W\). Then, the covariance matrices of the reduced features are computed for each dimension \(\Sigma_C\), \(\Sigma_H\), and \(\Sigma_W\):

\[
\Sigma_d = \bar{X}_d \bar{X}_d^{\top}, \quad d \in \{C, H, W\}
\]

where \(\bar{X}_d\) denotes the mean-centered reduced feature matrix unfolded along dimension \(d\). Three statistical weight vectors \(w_C\), \(w_H\), and \(w_W\) are generated from the corresponding covariance matrices using row-wise convolution, and each weight vector is multiplied with the middle feature map \(F_{mid}\). The scaled feature maps along the three dimensions are fused to generate the output feature map \(F_{out}\):

\[
F_{out} = \mathrm{Fuse}\big(w_C \odot F_{mid},\; w_H \odot F_{mid},\; w_W \odot F_{mid}\big)
\]
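The channel branch of this second-order scaling can be sketched as follows. This is an illustrative NumPy reimplementation of covariance-based channel attention only: the row-wise convolution is approximated here by a row-wise mean, and the multi-scale convolutions and the H/W branches are omitted.

```python
import numpy as np

def channel_covariance_attention(F):
    """Second-order channel attention in the spirit of MsHop's channel branch:
    compute the channel covariance over spatial positions, collapse each row
    of the covariance matrix to a scalar statistic (approximating the
    row-wise convolution with a row-wise mean), and rescale the channels of F
    with sigmoid-normalized weights. F has shape (C, H, W)."""
    C, H, W = F.shape
    X = F.reshape(C, H * W)                      # unfold spatial dimensions
    Xc = X - X.mean(axis=1, keepdims=True)       # center each channel
    cov = (Xc @ Xc.T) / (H * W - 1)              # C x C channel covariance
    w = 1.0 / (1.0 + np.exp(-cov.mean(axis=1)))  # row statistic -> sigmoid weight
    return F * w[:, None, None]                  # scale each channel

rng = np.random.default_rng(0)
F = rng.standard_normal((8, 4, 4))
out = channel_covariance_attention(F)
```

The H and W branches would follow the same pattern after unfolding the feature map along the height and width dimensions, respectively.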
Then, we introduce the dual-branch loss function (see Fig 1). The first branch sends Y1 and Y2 to an expander \(h_{\phi}\), resulting in embeddings Z1 and Z2 (the expander consists of three fully connected layers). Positive samples are different augmentations of the same image, and negative samples are different images in the same batch. The loss function is calculated using the following regularization terms:

\[
\mathcal{L}_{con} = \lambda\, s(Z_1, Z_2) + \mu \big[ v(Z_1) + v(Z_2) \big] + \gamma \big[ c(Z_1) + c(Z_2) \big] \tag{11}
\]

where \(\lambda\), \(\mu\), and \(\gamma\) are hyperparameters; \(s(Z_1, Z_2) = \frac{1}{n} \sum_{i} \| z_{1,i} - z_{2,i} \|_2^2\) is the invariance criterion, where \(n\) is the number of images in the batch; \(c(Z) = \frac{1}{d} \sum_{i \neq j} [C(Z)]_{i,j}^2\) is the covariance regularization term, where \(C(Z)\) is the covariance matrix of \(Z\); \(v(Z) = \frac{1}{d} \sum_{j=1}^{d} \max\big(0, 1 - S(z^{j}, \varepsilon)\big)\) is the variance regularization term, where \(d\) is the dimensionality of the embedding and \(S\) is the regularized standard deviation, defined as \(S(x, \varepsilon) = \sqrt{\mathrm{Var}(x) + \varepsilon}\); and \(\varepsilon\) is a small scalar to prevent numerical instability.
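The three regularization terms can be sketched as follows. This is a VICReg-style reimplementation for illustration only; the default hyperparameter values mirror those used in our experiments, and the function and variable names are our own.

```python
import numpy as np

def dual_branch_con_loss(Z1, Z2, lam=25.0, mu=25.0, gamma=1.0, eps=1e-4):
    """Correlation-branch loss on two batches of embeddings (n x d):
    invariance (paired distance) + variance (hinge on per-dim std)
    + covariance (off-diagonal decorrelation), as in Eq (11)."""
    n, d = Z1.shape
    # invariance: mean squared distance between paired embeddings
    s = np.mean(np.sum((Z1 - Z2) ** 2, axis=1))

    def variance_term(Z):
        std = np.sqrt(Z.var(axis=0) + eps)          # regularized std per dimension
        return np.mean(np.maximum(0.0, 1.0 - std))  # hinge at 1

    def covariance_term(Z):
        Zc = Z - Z.mean(axis=0)
        C = (Zc.T @ Zc) / (n - 1)                   # d x d covariance matrix
        off = C - np.diag(np.diag(C))               # off-diagonal entries only
        return np.sum(off ** 2) / d

    return (lam * s
            + mu * (variance_term(Z1) + variance_term(Z2))
            + gamma * (covariance_term(Z1) + covariance_term(Z2)))
```

Intuitively, the invariance term pulls the two views together, while the variance and covariance terms prevent the trivial collapse that would otherwise minimize it.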
The second branch adopts the idea of online clustering. Features Y1 and Y2 are assigned to prototype vectors C to obtain "codes" Q1 and Q2. The clustering loss is calculated by "swapping" the prediction problem:

\[
\mathcal{L}_{clu} = \ell(Y_1, Q_2) + \ell(Y_2, Q_1) \tag{15}
\]

where \(\ell(Y, Q)\) is defined as:

\[
\ell(Y, Q) = - \sum_{k} q^{(k)} \log p^{(k)}, \qquad p^{(k)} = \frac{\exp\!\big(\tfrac{1}{\tau} Y^{\top} c_k\big)}{\sum_{k'} \exp\!\big(\tfrac{1}{\tau} Y^{\top} c_{k'}\big)}
\]

with \(p^{(k)}\) indicating the matching probability of prototype \(c_k\) and feature \(Y\), and \(\tau\) a temperature parameter. The codes are obtained by solving an optimization problem that maximizes the similarity between features and prototypes while keeping the codes smooth:

\[
\max_{Q \in \mathcal{Q}} \; \mathrm{Tr}\big(Q^{\top} C^{\top} Y\big) + \varepsilon H(Q) \tag{17}
\]

where \(H(Q)\) is the entropy function of the codes \(Q\), and \(\varepsilon\) controls the smoothness of the mapping.
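In practice, this entropy-smoothed assignment problem is typically solved with a few Sinkhorn-Knopp iterations, as in SwAV. The following is a minimal NumPy sketch of that computation; the iteration count and ε values here are illustrative, not tuned.

```python
import numpy as np

def sinkhorn_codes(scores, eps=0.05, n_iters=3):
    """Compute soft assignment codes Q from a feature-prototype similarity
    matrix of shape (B, K) via Sinkhorn-Knopp iterations, alternately
    normalizing prototype mass (rows) and per-sample mass (columns).
    A smaller eps yields sharper (less smooth) codes."""
    Q = np.exp(scores / eps).T             # K x B
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)  # rows: equal mass per prototype
        Q /= K
        Q /= Q.sum(axis=0, keepdims=True)  # columns: unit mass per sample
        Q /= B
    return (Q * B).T                       # B x K, each row sums to 1

rng = np.random.default_rng(1)
scores = rng.standard_normal((6, 3))       # 6 samples, 3 prototypes
Q = sinkhorn_codes(scores)
```

The resulting rows of Q are valid probability distributions over prototypes, which is what the swapped-prediction loss in Eq (15) consumes.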
Experiments
In this section, we first compare our method with several recently proposed CNN-based SSL methods on the GBCU dataset. Next, we perform an ablation study of the proposed model. Finally, we evaluate the impact of the AL module. All experiments were run under the PyTorch framework on a computer equipped with a GTX-3060 GPU and an i7-12700 @ 2.10 GHz CPU.
Dataset. The GBCU dataset, introduced by Basu et al. [39] (https://gbc-iitd.github.io/data/gbcu, e-mail: soumen.basu@cse.iitd.ac.in), is a comprehensive collection of 1,255 ultrasound images. It includes 432 normal, 558 benign, and 265 malignant gallbladder cases, all derived from 218 patients; 71, 100, and 47 of these patients belong to the normal, benign, and malignant categories, respectively. This meticulously annotated dataset provides both image-level labels and bounding-box annotations for malignant regions, which is pivotal for advancing gallbladder cancer (GBC) detection research. We report cross-validation results over ten runs on the entire dataset, which were used in the key experiments to evaluate model generalization. To ensure generalization to unseen patients, all images from any particular patient appear exclusively in either the training or the validation set during cross-validation.
Experimental setting. We use the weights of ResNet50 pre-trained on the ImageNet-1k dataset as the initial values for part of the feature extractor, while the remaining parameters are initialized from a normal distribution. The coefficient β is set to 1 in Eq (1); λ1 and λ2 are set to 1 in Eq (4); α is set to 0.1 in Eq (5); λ and μ are set to 25 and γ to 1 in Eq (11); and ε is set to 0.0001 in Eq (17). We use the Adam optimizer to minimize the total loss and train the entire framework for 800 iterations. The batch size is 64, the initial learning rate is 0.003, the momentum is 0.9, weight decay is applied, and the learning rate adjustment follows a cosine schedule.
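For reference, the cosine learning-rate schedule above can be written in closed form. The sketch below assumes decay to zero over the full training run, a common default; the exact warmup or floor used in any given framework may differ.

```python
import math

def cosine_lr(step, total_steps, lr_init=0.003, lr_min=0.0):
    """Cosine learning-rate schedule: decays from lr_init to lr_min
    over total_steps, following half a cosine period."""
    t = step / total_steps
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * t))

# learning rate at the start, midpoint, and end of 800 iterations
lrs = [cosine_lr(s, 800) for s in (0, 400, 800)]
```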
Evaluation metric. During the experimental phase, we assessed the performance of the proposed model through a comprehensive set of key evaluation metrics: accuracy, Macro-F1 score, and the sensitivity, specificity, and F1 score for the malignant class. Accuracy is the cornerstone metric, giving the proportion of correct predictions across all samples. The Macro-F1 score, the unweighted average of the per-class F1 scores, is an important metric for multi-class models: it accounts for the precision and recall of every class and thus reflects the model's overall diagnostic capability. We paid particular attention to the malignant category, aiming to reduce both missed diagnoses and misdiagnoses. Consequently, we also monitored the sensitivity, specificity, and F1 score for the malignant class to rigorously evaluate the model's diagnostic performance for this category. Together, these metrics provide a thorough and reliable assessment of the algorithm's ability to classify GBC.
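For clarity, the per-class metrics reduce to simple confusion-matrix ratios. The sketch below treats the malignant class as "positive"; the toy counts are hypothetical and not taken from our experiments.

```python
def binary_metrics(tp, fp, tn, fn):
    """Per-class metrics from confusion-matrix counts:
    sensitivity = recall on positives, specificity = recall on negatives,
    F1 = harmonic mean of precision and sensitivity."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, f1

# hypothetical counts: 45 detected malignant cases, 5 missed,
# 90 correct non-malignant calls, 10 false alarms
acc, sens, spec, f1 = binary_metrics(tp=45, fp=10, tn=90, fn=5)
```

The Macro-F1 is then the unweighted mean of the per-class F1 scores computed this way for each of the three categories in turn.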
Comparison experiment
In this section, we compare the performance of our method with several recent algorithms, including SwAV [48], Barlow Twins [49], VicReg [58], SimCLR [46], and the recently introduced GBCNet [39], which is specifically designed for gallbladder cancer detection. The comparison focuses on key diagnostic metrics, including accuracy, specificity, sensitivity, F1-score, and Macro-F1, as summarized in Table 1 and Fig 3.
Fig 3. Each group represents an evaluation metric, with different colors signifying different algorithms. The height of a bar indicates the magnitude of the metric, and the line segment on top of each bar represents the standard deviation of the metric. The taller the bar, the better the performance; the shorter the line segment, the more stable the model.
Our method, ASGBC, achieves the highest performance in most metrics. Specifically, it attains an accuracy of 0.884 ± 0.038 (95% CI: 0.857–0.911), outperforming all other methods with statistically significant differences (all p<0.05).
In terms of sensitivity, ASGBC reaches 0.912 ± 0.091 (95% CI: 0.847–0.977), significantly higher than SwAV, Barlow Twins, VicReg, and SimCLR (all p<0.01), indicating a superior ability to detect malignant cases. Although GBCNet achieves a slightly higher sensitivity 0.923 ± 0.071, the difference is not statistically significant (p>0.05). Notably, ASGBC achieves comparable performance using only 35% of the labeled data, highlighting its efficiency and practical value in clinical settings where annotated data are scarce.
In terms of specificity, ASGBC achieves 0.932 ± 0.047 (95% CI: 0.898–0.966), the second highest, though not statistically different from the other methods (p>0.05). This suggests that our method maintains a strong ability to correctly identify non-malignant cases, reducing the risk of misdiagnosis.
The F1-score and Macro-F1, which reflect the balance between precision and recall, further demonstrate the robustness of our method. ASGBC achieves an F1-score of 0.844 ± 0.101 (95% CI: 0.772–0.916) and a Macro-F1 of 0.865 ± 0.069 (95% CI: 0.816–0.914), both of which are the highest among all compared methods. The improvements over SwAV, Barlow Twins, VicReg, and SimCLR are statistically significant (all p<0.05). Again, while GBCNet performs slightly worse in these metrics, the difference is not statistically significant (p>0.05).
Moreover, the low standard deviations observed in ASGBC’s metrics (e.g., accuracy SD = 0.038, Macro-F1 SD = 0.069) indicate that our model provides stable and reliable predictions across different test folds. In contrast, methods like Barlow Twins and SimCLR exhibit higher variability, which may limit their clinical applicability.
These results suggest that our method not only achieves superior diagnostic accuracy but also maintains robustness and generalizability. The integration of multi-scale feature extraction and dual-branch loss optimization contributes to its strong performance. Furthermore, the reduced dependency on labeled data makes ASGBC a promising tool for real-world clinical deployment, especially in resource-limited settings.
Clinical relevance and human-AI collaboration
In addition to comparing our ASGBC model with other computational methods, it is essential to evaluate its performance in the context of real-world clinical applications, particularly in relation to human expert diagnostic accuracy. To this end, we reference the human baseline performance reported by Basu et al. [39], where two experienced radiologists independently evaluated the same dataset used in our experiments. As shown in Table 2, Radiologist A achieved an accuracy of 0.816, specificity of 0.873, and sensitivity of 0.707, while Radiologist B achieved an accuracy of 0.784, specificity of 0.911, and sensitivity of 0.732. In comparison, our ASGBC model significantly outperforms both radiologists across all three metrics, achieving an accuracy of 0.884, specificity of 0.932, and sensitivity of 0.912.
To further assess the diagnostic consistency between our model and human experts, we conducted a Kappa consistency analysis. Due to budget constraints, we were unable to organize a new blind test involving multiple radiologists. Instead, we referenced the blind test data from the original dataset publication, which included diagnostic results from two radiologists. Since individual diagnostic results for each image were not available, we employed a mathematical derivation to estimate the Kappa coefficient. The detailed derivation is provided in the Appendix.
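For reference, once a 2x2 agreement table between two raters is available, Cohen's kappa follows from the observed and chance agreement rates; the counts in the example below are hypothetical and unrelated to the estimates in Table 3.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a 2x2 agreement table [[a, b], [c, d]], where
    rows are rater 1's labels and columns are rater 2's labels."""
    a, b = confusion[0]
    c, d = confusion[1]
    n = a + b + c + d
    p_o = (a + d) / n                      # observed agreement
    p_yes = ((a + b) / n) * ((a + c) / n)  # chance agreement on "positive"
    p_no = ((c + d) / n) * ((b + d) / n)   # chance agreement on "negative"
    p_e = p_yes + p_no
    return (p_o - p_e) / (1 - p_e)

# Hypothetical model-vs-radiologist agreement table (not the study's data)
print(round(cohen_kappa([[40, 5], [10, 45]]), 3))  # 0.7
```

The Appendix derivation estimates bounds on these cell counts from the published aggregate metrics, which is why Table 3 reports kappa as a range rather than a point value.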
As shown in Table 3, the Kappa coefficient ranges from 0.434 to 0.766 for Radiologist A and from 0.502 to 0.837 for Radiologist B, indicating that ASGBC aligns more closely with Radiologist B than with Radiologist A. Importantly, the worst-case Kappa values for both radiologists exceed the minimum clinical threshold of 0.40, so the consistency is acceptable in both cases, while the best-case values exceed the diagnostic independence standard (>0.75), implying strong and reliable agreement between the radiologists and ASGBC. Overall, these findings highlight the robustness of ASGBC in achieving high consistency with human expert assessments.

These results demonstrate that our model not only excels in computational benchmarks but also holds significant potential for enhancing clinical diagnostic accuracy. The high sensitivity of ASGBC (0.912) is particularly noteworthy, as it indicates a strong capability to correctly identify malignant cases, thereby reducing the risk of missed diagnoses, a critical factor in early-stage gallbladder cancer detection.
To further bridge the gap between AI performance and clinical utility, we envision the development of a real-time lesion prompting system that integrates the diagnostic strengths of ASGBC with the expertise of human clinicians. In practice, such a system would assist radiologists by providing rapid diagnostic suggestions and highlighting suspicious regions during ultrasound examinations. This collaborative approach would not only improve diagnostic speed and accuracy, especially in complex or early-stage cases, but also reduce the workload and cognitive burden on medical professionals.
Importantly, this human-AI partnership does not seek to replace clinicians but rather to augment their capabilities. By combining the computational efficiency and consistency of AI with the contextual judgment and experience of human experts, we aim to create a synergistic diagnostic workflow that maximizes the strengths of both. We believe that ASGBC can serve as a powerful clinical assistant, contributing to more accurate, efficient, and accessible medical diagnostics and ultimately improving patient outcomes.
Noise robustness analysis
To evaluate the robustness of the proposed ASGBC model under noise interference, we conducted experiments by adding Gaussian noise with a kernel sigma of 5 to the B-ultrasound images. The performance metrics under different noise levels are presented in Table 4.
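Concretely, such a perturbation can be injected as in the sketch below; mapping the percentage settings to a noise standard deviation of noise_level * 255, and omitting any kernel smoothing of the noise field, are simplifying assumptions of this illustration, not the paper's exact procedure.

```python
import random

def add_gaussian_noise(pixels, noise_level, seed=0):
    """Add zero-mean Gaussian noise to a grayscale image (list of rows of
    0-255 ints). The noise standard deviation is noise_level * 255, i.e. a
    fraction of the 8-bit dynamic range; results are clipped to [0, 255]."""
    rng = random.Random(seed)
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, noise_level * 255))))
         for p in row]
        for row in pixels
    ]

# Hypothetical 4x4 mid-gray patch standing in for an ultrasound crop
img = [[128] * 4 for _ in range(4)]
out = add_gaussian_noise(img, noise_level=0.05)  # the 5% noise setting
print(len(out), len(out[0]))
```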
The results demonstrate that the ASGBC model maintains relatively stable performance under low noise levels. Specifically, with 1% Gaussian noise, the accuracy decreases by only 1.02%, specificity by 0.97%, sensitivity by 0.99%, F1-score by 0.95%, and Macro-F1 by 1.04%. As the noise level increases to 5%, the performance degradation becomes more pronounced, with accuracy decreasing by 2.49%, specificity by 2.47%, sensitivity by 2.63%, F1-score by 2.49%, and Macro-F1 by 2.54%. At 10% noise level, the model still retains reasonable performance, with accuracy at 0.859, specificity at 0.905, sensitivity at 0.885, F1-score at 0.820, and Macro-F1 at 0.840.
These findings indicate that the ASGBC model exhibits good robustness against Gaussian noise, which is crucial for clinical applications where ultrasound images may be affected by various noise artifacts. The model’s ability to maintain performance under noise interference highlights its potential for real-world deployment in clinical settings.
Ablation study
We use the VicReg [58] model, the best performer in the comparative experiment, as the baseline. Its feature extractor is a ResNet backbone, and its loss function includes only Lcon. The results presented in Table 5 and Fig 4 provide insights into the impact of the proposed enhancements on the baseline model for GBC classification using B-ultrasound images.
Baseline: The baseline model, serving as a reference, achieved an accuracy of 0.826 ± 0.045, a specificity of 0.923 ± 0.059, a sensitivity of 0.750 ± 0.102, an F1 score of 0.743 ± 0.141, and a macro-F1 score of 0.746 ± 0.118. These figures establish a benchmark for evaluating the subsequent modifications.
Independent contribution of dual-branch loss: The integration of the proposed dual-branch loss function (Loss) into the baseline model has led to a comprehensive enhancement in model performance, with accuracy increasing by 1.7% (from 0.826 to 0.840) and the standard deviation reduced by 13.3% (from 0.045 to 0.039). There is a slight improvement in specificity, reaching 0.926 with a standard deviation of 0.055. Sensitivity is enhanced by 7.7% (from 0.750 to 0.808) with a standard deviation reduced by 7.8% (from 0.102 to 0.094). The F1 score and macro-F1 score have also seen improvements of 5.1% (from 0.743 to 0.781) and 6.4% (from 0.746 to 0.794), respectively. This indicates that the dual-branch loss function contributes to the model’s predictive performance while maintaining its ability to correctly identify negative samples.
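To make the dual-branch idea concrete, the toy sketch below combines an invariance term over two augmented views (standing in for the correlation branch) with a nearest-centroid clustering term. The weighting alpha and the use of plain squared distances are illustrative assumptions, not the paper's exact Lcon formulation.

```python
def mse(a, b):
    """Mean squared distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def invariance_loss(view1, view2):
    """Correlation branch (schematic): pull embeddings of two augmented
    views of the same image toward each other."""
    return sum(mse(a, b) for a, b in zip(view1, view2)) / len(view1)

def cluster_loss(embeddings, centroids):
    """Clustering branch (schematic): pull each embedding toward its
    nearest centroid, encouraging compact clusters."""
    return sum(min(mse(e, c) for c in centroids)
               for e in embeddings) / len(embeddings)

def dual_branch_loss(view1, view2, centroids, alpha=0.5):
    # alpha is a hypothetical balancing weight, not the paper's value
    return (alpha * invariance_loss(view1, view2)
            + (1 - alpha) * cluster_loss(view1, centroids))

v1 = [[1.0, 0.0], [0.0, 1.0]]        # embeddings of view 1 (toy)
v2 = [[0.8, 0.2], [0.2, 0.8]]        # embeddings of view 2 (toy)
cents = [[1.0, 0.0], [0.0, 1.0]]     # current cluster centroids (toy)
print(round(dual_branch_loss(v1, v2, cents), 3))  # 0.02
```

Minimizing the first term enforces view consistency while the second stabilizes the learned feature space around cluster structure, which is consistent with the variance reduction observed above.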
Independent contribution of MsHop: After introducing the multi-scale high-order pooling module (MsHop) into the baseline model, the five evaluation metrics have been improved to 0.862, 0.928, 0.885, 0.826, and 0.854 respectively, with lower standard deviations compared to the baseline model, showing a more pronounced improvement than the loss function. Specifically, accuracy increased by 4.4% (from 0.826 to 0.862), specificity by 0.5% (from 0.923 to 0.928), sensitivity by 18.0% (from 0.750 to 0.885), F1 score by 11.2% (from 0.743 to 0.826), and macro-F1 score by 14.5% (from 0.746 to 0.854). This suggests that the multi-scale high-order pooling module effectively captures features at different scales, which is important for classification tasks.
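The intuition behind combining multi-scale and high-order information can be sketched as follows: compute second-order (Gram) channel statistics at more than one spatial scale and concatenate them. This is a schematic stand-in, not the actual MsHop implementation.

```python
def second_order_pool(features):
    """High-order (second-order) pooling: average outer product of the
    feature vectors, capturing pairwise channel interactions rather than
    just per-channel means."""
    d = len(features[0])
    gram = [[0.0] * d for _ in range(d)]
    for f in features:
        for i in range(d):
            for j in range(d):
                gram[i][j] += f[i] * f[j]
    n = len(features)
    return [[v / n for v in row] for row in gram]

def downsample(features, factor=2):
    """Crude multi-scale step: average consecutive feature vectors."""
    out = []
    for k in range(0, len(features) - factor + 1, factor):
        group = features[k:k + factor]
        out.append([sum(f[i] for f in group) / factor
                    for i in range(len(group[0]))])
    return out

def mshop_descriptor(features):
    """Schematic MsHop: concatenate flattened second-order statistics
    computed at two scales (the real module's design differs in detail)."""
    desc = []
    for feats in (features, downsample(features)):
        for row in second_order_pool(feats):
            desc.extend(row)
    return desc

# Toy 2-channel features at 4 spatial positions (illustrative only)
feats = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(len(mshop_descriptor(feats)))  # 2x2 Gram at two scales -> 8 values
```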
Combined effect: When both the dual-branch loss function and the multi-scale high-order pooling module are combined with the baseline model, the five evaluation metrics reach 0.884, 0.932, 0.912, 0.844, and 0.877, representing an increase of 7.0%, 1.0%, 21.6%, 13.6%, and 17.6% respectively compared to the baseline model. The standard deviations have decreased by 15.6%, 20.3%, 10.8%, 28.4%, and 18.6%. It can be further observed that the loss module contributes more to the reduction of variance, effectively enhancing the model’s stability, while the multi-scale high-order pooling module contributes more to performance improvement. The combination of these techniques demonstrates a synergistic effect, enhancing the overall classification performance of the GBC model using B-ultrasound images.
Effect of active learning
The experimental results presented in Fig 5 offer insights into the benefits of using a subset of data for training deep learning models. With only 35% of the total data, the time required for one training epoch is significantly reduced to 20 seconds, compared to 36 seconds when using the full dataset. This reduction in training time is crucial for rapid prototyping and iterative model refinement. Additionally, the storage requirements are substantially lower, with only 19.1 megabytes needed for 35% of the data, compared to 54.7 megabytes for the complete set. This decrease in storage demand reduces pressure on memory and storage, facilitates more efficient data management, and can lower costs associated with data storage and processing. These advantages underscore the potential for a more sustainable and cost-effective approach to deep learning model development, especially in scenarios where computational resources are limited or rapid model deployment is desired.
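The 44% training-time and 65% storage savings quoted in the conclusions follow directly from these figures, as a quick check confirms:

```python
def pct_reduction(full, subset):
    """Relative savings (%) from using the active-learning subset."""
    return round(100 * (full - subset) / full, 1)

print(pct_reduction(36, 20))      # epoch time: 36 s -> 20 s
print(pct_reduction(54.7, 19.1))  # storage: 54.7 MB -> 19.1 MB
```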
The experimental analysis shown in Fig 6 reveals a stark contrast between random data selection and AL strategies. Initially, at 15% data selection, the algorithm with AL achieves a modest accuracy of 0.5479, comparable to random selection. However, as the percentage of data increases, AL demonstrates a superior ability to enhance model accuracy, reaching 0.8023 with just 25% of the data. This trend continues, with accuracy peaking at 0.8831 when 35% of the data is utilized, a level that random selection attains (0.8839) only when the full dataset is used. These results underscore the efficacy of AL in optimizing model performance with a fraction of the data, highlighting its potential for efficient and targeted data acquisition in machine learning processes.
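Curves like those in Fig 6 typically come from an acquisition loop that labels the samples the current model is least certain about. The sketch below shows generic entropy-based uncertainty sampling; it is an illustration of that loop, not the specific AL criterion used in ASGBC.

```python
import math

def entropy(probs):
    """Predictive entropy: higher means the model is less certain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_labeling(unlabeled, predict, budget):
    """Uncertainty sampling (schematic): send the `budget` samples the
    current model is least certain about to the annotator."""
    scored = sorted(unlabeled, key=lambda x: entropy(predict(x)),
                    reverse=True)
    return scored[:budget]

# Toy model: class probabilities keyed by sample id (illustrative only)
preds = {"a": [0.99, 0.01], "b": [0.55, 0.45], "c": [0.70, 0.30]}
picked = select_for_labeling(["a", "b", "c"], lambda x: preds[x], budget=2)
print(picked)  # ['b', 'c']: the two most uncertain samples
```

Repeating this select-label-retrain cycle concentrates annotation effort on informative samples, which is why accuracy rises much faster than under random selection.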
The dataset used in this study was collected from PGIMER, a tertiary care referral hospital located in Chandigarh, northern India. All ultrasound images were acquired by radiologists using the Logiq S8 system. While this dataset provides a valuable foundation for model development, we acknowledge that its single-center origin may limit the generalizability of the model. Variations in ultrasound equipment brands and models across different hospitals could potentially affect model performance when applied in new clinical settings. For instance, differences in image resolution, contrast, and noise levels among devices may lead to performance degradation in environments that differ from the training data.
In addition, patient demographics and disease distributions may vary across institutions, including factors such as age, gender composition, and prevalence of specific conditions. These discrepancies could further influence the model’s generalization ability.
To address these limitations, we have outlined a comprehensive plan for future external validation. With sufficient funding, we aim to collaborate with multiple medical centers to conduct multi-institutional validation studies. This will allow us to evaluate the model’s performance across a broader range of ultrasound devices, including those manufactured by Siemens, Philips, and other vendors. By testing the model in diverse clinical environments, we can better assess its stability and adaptability across different equipment and patient populations.
Furthermore, future work will focus on refining and optimizing the model to enhance its robustness and adaptability under varying conditions. We believe that through multi-center external validation, we can more thoroughly evaluate the clinical utility of the model and provide a more solid foundation for its real-world application.
Conclusions
This study introduces ASGBC, an innovative approach to GBC classification that integrates AL with SSL. The proposed method effectively leverages the advantages of both strategies, reducing annotation and training resource requirements. It achieves accuracy comparable to using the full dataset with just 35% of the data, reducing training time by 44% and memory requirements by 65%. This is particularly valuable in scenarios with limited computational resources or a need for rapid model deployment. The feature extraction module, tailored to the characteristics of B-ultrasound imaging, integrates multi-scale and high-order information retrieval capabilities. The dual-branch loss function considers both data relevance and clustering features, enhancing accuracy and model stability. The classification accuracy, specificity, sensitivity, F1 score, and macro-F1 score reached 0.884, 0.932, 0.912, 0.844, and 0.877, respectively. Compared to the baseline model, these metrics improved by 7.0%, 1.0%, 21.6%, 13.6%, and 17.6%, respectively, while the standard deviations decreased by 15.6%, 20.3%, 10.8%, 28.4%, and 18.6%, indicating greater model stability. These enhancements suggest that the proposed algorithm is suitable for clinical applications where precise and timely diagnosis is crucial. Overall, our research contributes to the advancement of medical imaging analysis by providing a practical solution for GBC detection. Future work will continue to refine the model, exploring additional enhancements and broader applications in medical diagnostics. We also plan to explore lightweight variants of the MsHop module, such as channel pruning, efficient convolution designs, or dynamic routing mechanisms, to reduce computational overhead while preserving performance, and to optimize the overall architecture to balance accuracy and efficiency more effectively.
Supporting information
S1 Appendix. Comprehensive derivation of confusion matrices and Kappa coefficients.
https://doi.org/10.1371/journal.pone.0330781.s001
(PDF)
References
- 1. Roa JC, García P, Kapoor VK, Maithel SK, Javle M, Koshiol J. Gallbladder cancer. Nat Rev Dis Primers. 2022;8(1):69. pmid:36302789
- 2. International Agency for Research on Cancer. Gallbladder Fact Sheet. 2022. https://gco.iarc.who.int/media/globocan/factsheets/cancers/12-gallbladder-fact-sheet.pdf
- 3. Sung H, Ferlay J, Siegel RL, Laversanne M, Soerjomataram I, Jemal A, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):209–49. pmid:33538338
- 4. Abou-Alfa GK, Jarnagin W, El Dika I, D’Angelica M, Lowery M, Brown K. Liver and bile duct cancer. Abeloff’s clinical oncology. Elsevier; 2020. p. 1314–41.
- 5. Ellington TD, Momin B, Wilson RJ, Henley SJ, Wu M, Ryerson AB. Incidence and mortality of cancers of the biliary tract, gallbladder, and liver by sex, age, race/ethnicity, and stage at diagnosis: United States, 2013 to 2017. Cancer Epidemiol Biomarkers Prev. 2021;30(9):1607–14. pmid:34244156
- 6. Yu MH, Kim YJ, Park HS, Jung SI. Benign gallbladder diseases: Imaging techniques and tips for differentiating with malignant gallbladder diseases. World J Gastroenterol. 2020;26(22):2967–86. pmid:32587442
- 7. Yuan H-X, Cao J-Y, Kong W-T, Xia H-S, Wang X, Wang W-P. Contrast-enhanced ultrasound in diagnosis of gallbladder adenoma. Hepatobiliary Pancreat Dis Int. 2015;14(2):201–7. pmid:25865694
- 8. Feng H, Yang B, Wang J, Liu M, Yin L, Zheng W. Identifying malignant breast ultrasound images using ViT-patch. Applied Sciences. 2023;13(6):3489.
- 9. He Q, Yang Q, Xie M. HCTNet: a hybrid CNN-transformer network for breast ultrasound image segmentation. Comput Biol Med. 2023;155:106629. pmid:36787669
- 10. Chen X, Liu X, Wu Y, Wang Z, Wang SH. Research related to the diagnosis of prostate cancer based on machine learning medical images: a review. Int J Med Inform. 2024;181:105279. pmid:37977054
- 11. Jiang H, Imran M, Muralidharan P, Patel A, Pensa J, Liang M, et al. MicroSegNet: a deep learning approach for prostate segmentation on micro-ultrasound images. Comput Med Imaging Graph. 2024;112:102326. pmid:38211358
- 12. Gong H, Chen J, Chen G, Li H, Li G, Chen F. Thyroid region prior guided attention for ultrasound segmentation of thyroid nodules. Comput Biol Med. 2023;155:106389. pmid:36812810
- 13. Zhou J, Tian H, Wang W. Fully automated thyroid ultrasound screening utilizing multi-modality image and anatomical prior. Biomedical Signal Processing and Control. 2024;87:105430.
- 14. Li Z, Yang J, Wang X, Zhou S. Establishment and evaluation of intelligent diagnostic model for ophthalmic ultrasound images based on deep learning. Ultrasound Med Biol. 2023;49(8):1760–7. pmid:37137742
- 15. Feng L, Zhang Y, Wei W, Qiu H, Shi M. Applying deep learning to recognize the properties of vitreous opacity in ophthalmic ultrasound images. Eye (Lond). 2024;38(2):380–5. pmid:37596401
- 16. Lucassen RT, Jafari MH, Duggan NM, Jowkar N, Mehrtash A, Fischetti C, et al. Deep learning for detection and localization of B-lines in lung ultrasound. IEEE J Biomed Health Inform. 2023;27(9):4352–61. pmid:37276107
- 17. Custode LL, Mento F, Tursi F, Smargiassi A, Inchingolo R, Perrone T, et al. Multi-objective automatic analysis of lung ultrasound data from COVID-19 patients by means of deep learning and decision trees. Appl Soft Comput. 2023;133:109926. pmid:36532127
- 18. Fiorentino MC, Villani FP, Di Cosmo M, Frontoni E, Moccia S. A review on deep-learning algorithms for fetal ultrasound-image analysis. Med Image Anal. 2023;83:102629. pmid:36308861
- 19. Ramirez Zegarra R, Ghi T. Use of artificial intelligence and deep learning in fetal ultrasound imaging. Ultrasound Obstet Gynecol. 2023;62(2):185–94. pmid:36436205
- 20. Ren P, Xiao Y, Chang X, Huang PY, Li Z, Gupta BB. A survey of deep active learning. ACM Computing Surveys. 2021;54(9):1–40.
- 21. Dong Z, Huang X, Yuan G, Zhu H, Xiong H. Butterfly-core community search over labeled graphs. Proceedings of the VLDB Endowment. 2021;14(11):2006–18.
- 22. Shurrab S, Duwairi R. Self-supervised learning methods and applications in medical imaging analysis: a survey. PeerJ Comput Sci. 2022;8:e1045. pmid:36091989
- 23. Hang J, Dong Z, Zhao H, Song X, Wang P, Zhu H. Outside in: Market-aware heterogeneous graph neural network for employee turnover prediction. In: Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 2022. p. 353–62.
- 24. Ye Y, Dong Z, Zhu H, Xu T, Song X, Yu R, et al. MANE: organizational network embedding with multiplex attentive neural networks. IEEE Transactions on Knowledge and Data Engineering. 2022;35(4):4047–61.
- 25. Budd S, Robinson EC, Kainz B. A survey on active learning and human-in-the-loop deep learning for medical image analysis. Med Image Anal. 2021;71:102062. pmid:33901992
- 26. Shen D, Qin C, Wang C, Dong Z, Zhu H, Xiong H. Topic modeling revisited: a document graph-based neural network perspective. Advances in Neural Information Processing Systems. 2021;34:14681–93.
- 27. Nguyen NQ, Le TS. A semi-supervised learning method to remedy the lack of labeled data. In: 2021 15th International Conference on Advanced Computing and Applications (ACOMP). 2021. p. 78–84.
- 28. Grill JB, Strub F, Altché F, Tallec C, Richemond P, Buchatskaya E. Bootstrap your own latent: a new approach to self-supervised learning. Advances in Neural Information Processing Systems. 2020;33:21271–84.
- 29. Mishra AK, Roy P, Bandyopadhyay S, Das SK. CR-SSL: a closely related self-supervised learning based approach for improving breast ultrasound tumor segmentation. International Journal of Imaging Systems and Technology. 2022;32(4):1209–20.
- 30. Zhao Z, Yang G. Unsupervised contrastive learning of radiomics and deep features for label-efficient tumor classification. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part II. 2021. p. 252–61.
- 31. Jiao J, Droste R, Drukker L, Papageorghiou AT, Noble JA. Self-supervised representation learning for ultrasound video. Proc IEEE Int Symp Biomed Imaging. 2020;2020:1847–50. pmid:32489519
- 32. Qi H, Collins S, Noble JA. Knowledge-guided pretext learning for utero-placental interface detection. In: Medical Image Computing and Computer Assisted Intervention–MICCAI 2020: 23rd International Conference, Lima, Peru, October 4–8, 2020, Proceedings, Part I. 2020. p. 582–93.
- 33. Liu C, Qiao M, Jiang F, Guo Y, Jin Z, Wang Y. TN-USMA Net: Triple normalization-based gastrointestinal stromal tumors classification on multicenter EUS images with ultrasound-specific pretraining and meta attention. Med Phys. 2021;48(11):7199–214. pmid:34412155
- 34. Basu S, Singla S, Gupta M, Rana P, Gupta P, Arora C. Unsupervised contrastive learning of image representations from ultrasound videos with hard negative mining. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. 2022. p. 423–33.
- 35. Basu S, Gupta M, Madan C, Gupta P, Arora C. FocusMAE: gallbladder cancer detection from ultrasound videos with focused masked autoencoders. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024. p. 11715–25.
- 36. Lian J, Ma Y, Ma Y, Shi B, Liu J, Yang Z, et al. Automatic gallbladder and gallstone regions segmentation in ultrasound image. Int J Comput Assist Radiol Surg. 2017;12(4):553–68. pmid:28063077
- 37. Jeong Y, Kim JH, Chae H-D, Park S-J, Bae JS, Joo I, et al. Deep learning-based decision support system for the diagnosis of neoplastic gallbladder polyps on ultrasonography: Preliminary results. Sci Rep. 2020;10(1):7700. pmid:32382062
- 38. Kim T, Choi YH, Choi JH, Lee SH, Lee S, Lee IS. Gallbladder polyp classification in ultrasound images using an ensemble convolutional neural network model. J Clin Med. 2021;10(16):3585. pmid:34441881
- 39. Basu S, Gupta M, Rana P, Gupta P, Arora C. Surpassing the human accuracy: detecting gallbladder cancer from USG images with curriculum learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 20886–96.
- 40. Shuvo SB, Chowdhury MZ. Classification of gallbladder cancer using average ensemble learning. In: 2024 6th International Conference on Electrical Engineering and Information & Communication Technology (ICEEICT). 2024. p. 1450–5.
- 41. Sener O, Savarese S. Active learning for convolutional neural networks: a core-set approach. arXiv preprint 2017. https://arxiv.org/abs/1708.00489
- 42. Pinsler R, Gordon J, Nalisnick E, Hernández-Lobato JM. Bayesian batch active learning as sparse subset approximation. Advances in Neural Information Processing Systems. 2019;32.
- 43. Sinha S, Ebrahimi S, Darrell T. Variational adversarial active learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019. p. 5972–81.
- 44. Xie B, Yuan L, Li S, Liu CH, Cheng X, Wang G. Active learning for domain adaptation: an energy-based approach. In: Proceedings of the AAAI Conference on Artificial Intelligence. 2022. p. 8708–16.
- 45. Cabannes V, Bottou L, Lecun Y, Balestriero R. Active self-supervised learning: a few low-cost relationships are all you need. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023. p. 16274–83.
- 46. Chen T, Kornblith S, Norouzi M, Hinton G. A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning. 2020. p. 1597–607.
- 47. He K, Fan H, Wu Y, Xie S, Girshick R. Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020. p. 9729–38.
- 48. Caron M, Misra I, Mairal J, Goyal P, Bojanowski P, Joulin A. Unsupervised learning of visual features by contrasting cluster assignments. Advances in Neural Information Processing Systems. 2020;33:9912–24.
- 49. Zbontar J, Jing L, Misra I, LeCun Y, Deny S. Barlow twins: self-supervised learning via redundancy reduction. In: International Conference on Machine Learning. 2021. p. 12310–20.
- 50. He K, Chen X, Xie S, Li Y, Dollar P, Girshick R. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022. p. 16000–9.
- 51. Chen K, Liu Z, Hong L, Xu H, Li Z, Yeung DY. Mixed autoencoder for self-supervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023. p. 22742–51.
- 52. Bao H, Dong L, Piao S, Wei F. Beit: bert pre-training of image transformers. arXiv preprint 2021. https://arxiv.org/abs/2106.08254
- 53. Huang Z, Jin X, Lu C, Hou Q, Cheng M-M, Fu D, et al. Contrastive masked autoencoders are stronger vision learners. IEEE Trans Pattern Anal Mach Intell. 2024;46(4):2506–17. pmid:38015699
- 54. Mishra S, Robinson J, Chang H, Jacobs D, Sarna A, Maschinot A, et al. A simple, efficient and scalable contrastive masked autoencoder for learning visual representations. arXiv preprint 2022. https://arxiv.org/abs/2210.16870
- 55. Oquab M, Darcet T, Moutakanni T, Vo H, Szafraniec M, Khalidov V. Dinov2: learning robust visual features without supervision. arXiv preprint 2023. https://arxiv.org/abs/2304.07193
- 56. Higgins I, Matthey L, Pal A, Burgess C, Glorot X, Botvinick M. beta-VAE: learning basic visual concepts with a constrained variational framework. 2017. https://openreview.net/forum?id=Sy2fzU9gl
- 57. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S. Generative adversarial networks. Communications of the ACM. 2020;63(11):139–44.
- 58. Bardes A, Ponce J, LeCun Y. Vicreg: variance-invariance-covariance regularization for self-supervised learning. arXiv preprint. 2021. https://arxiv.org/abs/2105.04906