Abstract
How can we perform unsupervised domain adaptation when transferring a black-box source model to a target domain? Black-box Unsupervised Domain Adaptation focuses on transferring the labels derived from a pre-trained black-box source model to an unlabeled target domain. The problem setting is motivated by privacy concerns associated with accessing and utilizing source data or source model parameters. Recent studies typically train the target model by mimicking the labels derived from the black-box source model, which often contain noise due to domain gaps between the source and the target. Directly exploiting such noisy labels or disregarding them may lead to a decrease in the model’s performance. We propose Threshold-Based Exploitation of Noisy Predictions (TEN), a method to accurately learn the target model with noisy labels in Black-box Unsupervised Domain Adaptation. To ensure the preservation of information from the black-box source model, we employ a threshold-based approach to distinguish between clean labels and noisy labels, thereby allowing the transfer of high-confidence knowledge from both labels. We utilize a flexible thresholding approach to adjust the threshold for each class, thereby obtaining an adequate amount of clean data for hard-to-learn classes. We also exploit knowledge distillation for clean data and negative learning for noisy labels to extract high-confidence information. Extensive experiments show that TEN outperforms baselines with an accuracy improvement of up to 9.49%.
Citation: Xu H, Lee J, Kang U (2025) Threshold-based exploitation of noisy label in black-box unsupervised domain adaptation. PLoS One 20(5): e0321987. https://doi.org/10.1371/journal.pone.0321987
Editor: Lei Chu, University of Southern California, UNITED STATES OF AMERICA
Received: September 22, 2024; Accepted: March 13, 2025; Published: May 12, 2025
Copyright: © 2025 Xu et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data can be downloaded from the following URLs: https://faculty.cc.gatech.edu/~judy/domainadapt/, https://www.hemanthdv.org/officeHomeDataset.html, https://www.imageclef.org/2014/adaptation, https://gitlab.com/tringwald/adaptiope, https://ai.bu.edu/visda-2017/. The authors do not own these datasets and had no special access privileges that others would not have.
Funding: This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) [No. 2022-0-00641, XVoice: Multi-Modal Voice Meta Learning], [No. RS-2020-II200894, Flexible and Efficient Model Compression Method for Various Applications and Environments], [No. RS-2021-II211343, Artificial Intelligence Graduate School Program (Seoul National University)], and [No. RS-2021-II212068, Artificial Intelligence Innovation Hub (Artificial Intelligence Institute, Seoul National University)]. The Institute of Engineering Research and the ICT at Seoul National University provided research facilities for this work.
Competing interests: The authors have declared that no competing interests exist.
Introduction
How can we transfer the knowledge from a black-box source model to a target task? Unsupervised domain adaptation (UDA) has emerged as a crucial research topic in the field of machine learning and computer vision. The goal of UDA is to adapt a model trained on a source domain with labeled data to a target domain with only unlabeled data, where the target domain has similar but different statistical characteristics to the source domain. The ability to perform UDA is essential in many real-world applications, such as image classification, object recognition, and natural language processing, where the target domain may not have labeled data for training.
Unsupervised domain adaptation [1, 2] has been shown to have limitations, one of which involves the necessity to access either the source data or a white-box source model. Nonetheless, sharing the source data might not be suitable due to privacy concerns, particularly in sensitive domains such as medical records or financial data. Additionally, even transferring a pre-trained white-box source model to a target domain raises security concerns, as the source data could potentially be reconstructed using techniques like generative adversarial training [3].
Recent studies focus on a new problem setting known as Black-box Unsupervised Domain Adaptation (Black-box UDA), where the source domain provides only a black-box model without revealing its model parameters. In this scenario, the knowledge that can be transferred to the target domain is limited to the outputs produced by the black-box source model. However, the outputs contain noise due to the intrinsic dissimilarities between the source and target domains, which makes the domain adaptation process more challenging. DINE [4] adopts knowledge distillation and pseudo-labeling strategies, instructing the target model to distill the labels produced by the black-box source model. IterLNL [5] utilizes a noisy labeling technique to select clean instances from the target data, thereby training the model solely on clean data. Both algorithms may result in decreased performance, as DINE is susceptible to learning from mislabeled data, while IterLNL experiences information loss owing to its exclusion of noisy data during training.
In this paper, we propose Threshold-Based Exploitation of Noisy Predictions (TEN), a precise method for Black-box UDA that distills reliable high-confidence knowledge from the source labels. Owing to the presence of noise in the outputs from the source model, we partition the target data into distinct clean and noisy subsets, and apply distinct strategies to distill the high-confidence knowledge. Pseudo-labels associated with the clean subset align closely with the actual ground truths, whereas those linked to the noisy subset frequently deviate from them. In the process of data partitioning, a flexible threshold is determined for each class to ensure that hard-to-learn classes possess an adequate number of untainted instances. We harness knowledge distillation on the clean subset to emulate the source model's labels. Conversely, on the noisy subset, we exploit negative learning to discern which classes the instances do not pertain to. In addition, we employ consistency regularization coupled with entropy regularization techniques to learn the structural features of the target domain. Extensive experiments show that TEN surpasses baseline methods, with an accuracy increase of up to 9.49%.
Our contributions are summarized as follows:
- Algorithm. We propose TEN, a precise method for distilling reliable high-confidence knowledge from the outputs of a black-box source model, even in the presence of noise.
- Accuracy. Extensive experiments conducted on real-world datasets demonstrate that TEN outperforms baselines with up to 9.49% higher accuracy for single-source UDA, and 4.81% higher accuracy for multi-source UDA.
- Ablation Study. We show that the performance of TEN exhibits an upward trend when more noisy labels are used for training.
Table 1 provides the definitions of symbols used in this paper.
Related works
Unsupervised domain adaptation
Unsupervised model adaptation, also known as source-free UDA, has garnered increasing attention due to its ability to operate without accessing the source domain, making it suitable for more practical scenarios. Early research [6] provides a theoretical analysis of transfer learning, which motivated deep domain adaptation without source data. Zhu et al. [7] enhance domain adaptation by leveraging high-order graphs and low-rank tensors. Zhu et al. [8] propose a multiview latent space framework for UDA with selective pseudo-labeling. These methods are limited to the standard UDA setting, which does not fundamentally address privacy concerns.
In this paper, we tackle an even more challenging problem by leveraging only the predictions from a black-box model trained in the source domain for model adaptation. Few works have been conducted in this field. Zhang et al. [5] focus on selecting clean instances from noisy data, training the model only on these clean samples. However, this approach risks information loss by excluding noisy data, potentially limiting the model's generalization ability. Liang et al. [4] employ a knowledge distillation and pseudo-labeling strategy, where the target model distills labels from a black-box source model. Zhang et al. [9] use a bi-directional memorization mechanism to identify useful features and progressively correct noisy pseudo labels, improving generalization across visual recognition tasks. However, it is prone to learning from mislabeled data, which can reduce the model's performance. In contrast, our TEN approach divides the target dataset into "clean" and "noisy" subsets, and distills high-confidence predictions from both, allowing the model to leverage information from both clean and noisy data. This strategy mitigates the risks of information loss and mislabeled data, offering a more robust learning process.
Semi-supervised learning with noisy labels
Noise can be easily accumulated during training when incorrect predictions are used in semi-supervised or unsupervised learning [10]. Such noise can cause the model to overfit to the noisy feature space, making it challenging to adapt to new domains [11]. In UDA, pseudo-labeling [12,13] and knowledge distillation [14,15] are effective techniques, but their performances can be degraded by noise. In particular, for transfer learning tasks involving distant domains, the pseudo labels for the target domain can be extremely noisy, resulting in a deterioration of subsequent training. Our proposed method in this work addresses the issue using (1) a flexible threshold technique which distills more instances for hard-to-learn classes, and (2) pseudo-labeling with negative learning which distills the information from noise.
Proposed method
Given a black-box source model $f_s$ and unlabeled target data $\mathcal{D}_t = \{x_i^t\}_{i=1}^{n_t}$, our objective is to train a target model $f_t$ that performs well on the target data without accessing any source data or source model parameters. The target data consist of $n_t$ instances distributed across $C$ categories, where $x_i^t \in \mathcal{X}_t$ and $\mathcal{X}_t$ represents the target input space. The source model is pre-trained using labeled source data $\mathcal{D}_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, which contains $n_s$ labeled instances also in $C$ categories, where $x_i^s \in \mathcal{X}_s$ and $y_i^s \in \mathcal{Y}_s$. $\mathcal{X}_s$ and $\mathcal{Y}_s$ represent the source input and label spaces, respectively. We assume that the source label space $\mathcal{Y}_s$ and the target label space $\mathcal{Y}_t$ are identical, but the source and target input data have different distributions, i.e., $p_s(x) \neq p_t(x)$. Distinctively diverging from Unsupervised Domain Adaptation, which mandates access to either the source data or its model parameters, Black-box Unsupervised Domain Adaptation (Black-box UDA) trains the target model with access to neither. Black-box UDA depends solely on the soft labels generated by the source model for target instances, denoted as $\tilde{y}_i = f_s(x_i^t)$.
Overview
The challenge of Black-box UDA resides in distilling the knowledge from the outputs of the black-box source model. Due to the dissimilarities between the source and target domains, the outputs generated by the black-box source model contain noise. Such noise can yield erroneous results, further exacerbating the challenge of accurately labeling the target data. Consequently, it is imperative to train the target model effectively by utilizing soft labels even in the presence of such noise.
The following detailed challenges need to be addressed for the goal.
- C1. How can we effectively divide the target data into clean and noisy subsets? Utilizing a fixed high threshold for data separation may lead to extreme cases where no training data are selected for hard-to-learn classes.
- C2. How can we distill meaningful information from noisy labels? When the gap between the source and target domains is substantial, the amount of noisy labels increases, and a failure to effectively learn from them can significantly impede the target model’s performance.
- C3. How can we learn the structural information about the target data? Insufficient exploration of hidden representations leads to diminished performance of the target model owing to the disregard of the target domain’s structure.
We address the aforementioned challenges with the following main ideas:
- I1. Flexible Threshold. We design a flexible threshold for each class, thereby facilitating the allocation of a larger amount of data to those classes that are difficult to learn.
- I2. Negative Learning. We distill the information that reflects the absence of certain classes from the noisy labels.
- I3. Structural Regularization. We exploit entropy regularization and consistency regularization so that the target model learns intrinsic data structure about the target data.
We propose TEN, an accurate method for Black-box UDA. The overview of the proposed TEN is depicted in Fig. 1. The entire procedure comprises two distinct phases: division and training. In the division phase, given a predefined high threshold, we count the number of instances of each class whose confidences surpass the threshold, and subsequently adjust the threshold for each class based on these counts. The target data are divided into clean and noisy subsets in accordance with the adjusted thresholds. Throughout the training phase, soft labels of the target data are generated by leveraging the black-box source model. The target model mimics the soft labels of clean and noisy data via knowledge distillation and negative learning, respectively. In order to facilitate the acquisition of the structural information of the target, we exploit entropy regularization and consistency regularization. These ideas cohesively establish a comprehensive strategy for enhancing the performance of the target model by exploiting the strengths of the black-box source model and structural information in the target data.
During the division phase, we use a flexible threshold for each class by taking into account the number of instances whose confidences exceed a predefined threshold. In the training phase, soft labels of the target data are generated by leveraging the black-box source model. The target model is designed to mimic the soft labels by utilizing knowledge distillation and negative learning. Additionally, entropy regularization and consistency regularization are exploited to learn the structural information of the target.
Flexible threshold
How can we select clean labels from the target data so that reliable knowledge can be learned during knowledge distillation? The clean subset comprises instances whose soft labels generated by the black-box source model closely align with the ground truths of the target task. Conversely, the noisy subset primarily consists of instances whose soft labels tend to be inaccurate. Our goal is to train the target model to mimic only the clean labels of the black-box source model, since the noisy labels may mislead the target model.
A naive technique involves a predefined high threshold to split instances into (1) clean instances whose confidences surpass the threshold, and (2) noisy instances whose confidences fall below the threshold. Noisy instances are prone to wrong predictions due to the gap between the source and target domains. Thus, we consider only the soft labels of clean instances as teacher labels for knowledge distillation. Nonetheless, distinct classes have different properties for training; utilizing an identical threshold for every class to select clean instances may result in an undesirable scenario where hard-to-learn classes cannot identify appropriate instances for training the model, ultimately leading to inadequate performance.
We propose a flexible threshold that sets a lower threshold for classes that are difficult to learn. When the threshold is high, the number of predictions that belong to a certain class and exceed the threshold represents the learning difficulty of the class [16]. We count the number of instances whose confidence exceeds the predefined threshold and belong to class $c$:

$$N_c = \sum_{i=1}^{n_t} \mathbb{1}\big(\max(\tilde{y}_i) \geq \tau_p\big)\,\mathbb{1}\big(\arg\max(\tilde{y}_i) = c\big), \quad (1)$$

where $\tilde{y}_i$ is the soft label of the $i$-th target instance generated by the black-box source model, i.e., $\tilde{y}_i = f_s(x_i^t)$, $\tau_p$ represents a predefined positive threshold, and $n_t$ represents the number of target instances. We scale the threshold for each class:

$$\tau_c = \beta_c \cdot \tau_p, \qquad \beta_c = \frac{N_c}{\max_{c'} N_{c'}}, \quad (2)$$

where $\tau_c$ is the flexible confidence threshold for class $c$, and $\beta_c$ represents the scale factor for class $c$. The flexible threshold is computed by multiplying the predefined threshold by the scale factor, and enables the selection of a greater quantity of clean labels for training. We train the target model by mimicking the labels of the source model on the selected clean instances whose confidence exceeds the adjusted threshold:

$$\mathcal{L}_{kd} = -\frac{1}{|\mathcal{D}_{clean}|} \sum_{x_i^t \in \mathcal{D}_{clean}} \sum_{c=1}^{C} \tilde{y}_{i,c} \log p_{i,c}, \quad (3)$$

where $p_i$ represents the soft label of the $i$-th target instance obtained from the target model, i.e., $p_i = f_t(x_i^t)$. $\tilde{y}_{i,c}$ and $p_{i,c}$ represent the probabilities of the $i$-th instance for class $c$, generated by the source and target models, respectively.
IterLNL [5] also suggests a noise-rate technique to adjust class-wise thresholds, but it requires many hyperparameters to which the target performance is sensitive. We automatically adjust the threshold for each class according to the outputs of the black-box source model, which does not need any extra validation set or hyperparameters.
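The division phase can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the authors' code: `flexible_thresholds` and `split_clean_noisy` are hypothetical names, and the scale factor is assumed to normalize each class's count of confident predictions by the largest count (FlexMatch-style scaling), so hard-to-learn classes receive a lower threshold.

```python
def flexible_thresholds(soft_labels, tau_p):
    """Compute per-class flexible thresholds from black-box soft labels.

    soft_labels: list of probability vectors (one per target instance).
    tau_p: predefined positive threshold.
    Classes with fewer confident predictions (hard classes) get a
    proportionally lower threshold -- an assumed FlexMatch-style rule.
    """
    num_classes = len(soft_labels[0])
    counts = [0] * num_classes
    for p in soft_labels:
        conf = max(p)
        if conf >= tau_p:                 # confident prediction for its argmax class
            counts[p.index(conf)] += 1
    top = max(max(counts), 1)             # guard against division by zero
    return [tau_p * counts[c] / top for c in range(num_classes)]

def split_clean_noisy(soft_labels, thresholds):
    """Divide instance indices into clean/noisy subsets via class thresholds."""
    clean, noisy = [], []
    for i, p in enumerate(soft_labels):
        c = p.index(max(p))
        (clean if max(p) >= thresholds[c] else noisy).append(i)
    return clean, noisy
```

With a toy batch of five soft labels and `tau_p = 0.7`, class 0 collects two confident predictions and class 1 only one, so class 1's threshold is halved and its lower-confidence instances still qualify as clean.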
Negative learning
The strategy of training the target model using only clean labels results in a reduction of available training data and information loss. Although the confidence of noisy labels may be too low for knowledge distillation, it does not imply that noisy labels lack learnable information. Indeed, the model’s confidence in data is reflected not only in the presence of certain classes, but also in their absence [17,18]. We select a subset of the noisy labels whose confidence is low enough, and employ negative learning on them.
Given a predefined negative threshold $\tau_n$, we apply negative cross-entropy to the selected labels of noisy instances whose probability falls beneath the negative confidence threshold:

$$\mathcal{L}_{nce} = -\frac{1}{|\mathcal{D}_{noisy}|} \sum_{x_i^t \in \mathcal{D}_{noisy}} \frac{1}{|S_i|} \sum_{c \in S_i} \log(1 - p_{i,c}), \qquad S_i = \{c \mid \tilde{y}_{i,c} < \tau_n\}, \quad (4)$$

where $|S_i|$ represents the number of selected labels for the $i$-th instance.
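A minimal sketch of the negative-learning term for a single noisy instance, assuming the complementary-label form $-\log(1 - p_c)$ averaged over the selected classes; the function name and the averaging convention are illustrative, not taken from the paper's code.

```python
import math

def negative_learning_loss(src_probs, tgt_probs, tau_n):
    """Negative cross-entropy for one noisy instance (illustrative sketch).

    src_probs: soft label from the black-box source model.
    tgt_probs: current prediction of the target model.
    Classes whose source probability falls below tau_n are treated as
    complementary labels: the target model is pushed to assign them low
    probability via -log(1 - p).
    """
    selected = [c for c, p in enumerate(src_probs) if p < tau_n]
    if not selected:                      # nothing confidently absent
        return 0.0
    return -sum(math.log(1.0 - tgt_probs[c]) for c in selected) / len(selected)
```

The loss shrinks as the target model drives its probability on the selected (confidently absent) classes toward zero, which is exactly the "absence" signal the section describes.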
Structural regularization
Knowledge distillation and negative learning are designed to learn reliable high-confidence knowledge from soft labels. However, they may not capture the intrinsic data structure of the target data [4]. To address the issue, we exploit consistency regularization and entropy regularization strategies.
We assume that the target model yields similar labels when provided with diverse augmentations of the same target input [19]. The target model is trained on strongly augmented target instances, which are supervised by the pseudo-labels derived from weakly augmented target instances, as follows:

$$\mathcal{L}_{cr} = -\frac{1}{n_t} \sum_{i=1}^{n_t} \mathbb{1}\big(\max(q_i^w) \geq \tau\big) \sum_{c=1}^{C} \hat{q}_{i,c}^w \log q_{i,c}^s, \quad (5)$$

where $\hat{q}_i^w$ denotes the hard pseudo-label of the $i$-th instance under weak augmentations. $q_{i,c}^w$ and $q_{i,c}^s$ denote the probabilities of the $i$-th instance for class $c$ under weak and strong augmentations, respectively. $\tau$ represents a threshold to ensure that the target model is trained on high-confidence instances. We employ a flip-and-shift augmentation for the weak augmentation, while the strong augmentation is achieved by RandAugment [20].

Entropy regularization operates under the premise that the probability distribution of a well-trained model's outputs resembles a one-hot vector across classes. The target model is trained under the supervision of the hard labels derived from the weakly augmented target instances:

$$\mathcal{L}_{er} = -\frac{1}{n_t} \sum_{i=1}^{n_t} \sum_{c=1}^{C} \hat{q}_{i,c}^w \log q_{i,c}^w. \quad (6)$$
Summarizing all the losses, the overall objective is formulated as:

$$\mathcal{L}_{total} = \mathcal{L}_{kd} + \lambda_{nce} \mathcal{L}_{nce} + \lambda_{cr} \mathcal{L}_{cr} + \lambda_{er} \mathcal{L}_{er}, \quad (7)$$

where $\lambda_{nce}$, $\lambda_{cr}$, and $\lambda_{er}$ are balancing hyperparameters. We present the complete algorithm for TEN in Algorithm 1.
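The two structural-regularization terms can be sketched per instance in plain Python. This is a hedged illustration of the standard FixMatch-style formulation the section invokes; the function names and the per-instance (rather than batched) form are our simplifications.

```python
import math

def consistency_loss(weak_probs, strong_probs, tau):
    """Consistency term for one instance (illustrative sketch).

    The weak-augmentation prediction supplies a hard pseudo-label that
    supervises the strong-augmentation prediction, but only when the
    weak confidence clears the threshold tau.
    """
    conf = max(weak_probs)
    if conf < tau:                         # low-confidence: instance is masked out
        return 0.0
    c = weak_probs.index(conf)             # hard pseudo-label from weak view
    return -math.log(strong_probs[c])

def entropy_loss(weak_probs):
    """Entropy-style self-training term (illustrative sketch).

    The weak prediction is sharpened toward its own hard label,
    pushing the output distribution toward a one-hot vector.
    """
    c = weak_probs.index(max(weak_probs))
    return -math.log(weak_probs[c])
```

Averaging these per-instance terms over a batch and combining them with the distillation and negative-learning losses via the balancing hyperparameters yields the overall objective.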
Experiments
We present experimental results to answer the following questions about TEN:
- Q1. Classification accuracy. Does TEN show better accuracies than baselines on benchmarks?
- Q2. Effect of flexible threshold. Does flexible threshold improve the target performance?
- Q3. Effect of negative learning. Does TEN exhibit desirable performance when a large amount of noisy data are used for training?
- Q4. Ablation study. Do our ideas, such as knowledge distillation on clean subset, negative learning, and structural information, improve the target performance?
- Q5. Hyperparameter sensitivity. Is accuracy sensitive to the positive and negative thresholds?
Algorithm 1 Threshold-Based Exploitation of Noisy Predictions (TEN)
Input: black-box source model $f_s$, unlabeled target data $\mathcal{D}_t$, randomly initialized target model $f_t$, and predefined thresholds $\tau_p$ and $\tau_n$
Output: well-trained target model $f_t$
1: for each class $c$ do
2: Count the number of high-confidence instances $N_c$ for knowledge distillation using Eq. (1)
3: end for
4: Calculate the flexible positive threshold $\tau_c$ for each class using Eq. (2)
5: Divide the target data into clean and noisy subsets in accordance with $\tau_c$ and $\tau_n$
6: for each epoch do
7: for each batch do
8: Compute knowledge distillation loss Lkd for clean subset using Eq. (3)
9: Compute negative learning loss Lnce for noisy subset using Eq. (4)
10: Compute consistency regularization loss Lcr and entropy regularization loss Ler for all data using Eq. (5) and Eq. (6), respectively
11: Compute the overall loss Ltotal using Eq. (7), and update parameters of the target model ft
12: end for
13: end for
Experimental setup
We present datasets, models, baselines, and hyperparameters for our experiments.
Datasets.
We use 5 image classification datasets summarized in Table 2. Office-31 comprises 3 domains, Amazon (A), DSLR (D), and Webcam (W), whereas Office-Home [21] encompasses 4 domains, Art (A), Clipart (C), Product (P), and Real-World (R). The Image-CLEF dataset is composed of 4 domains, including Caltech-256 (C), ILSVRC 2012 (I), PASCAL VOC 2012 (P), and Bing (B). Adaptiope [22] contains 3 domains, Product (P), Real life (R), and Synthetic (S), while VisDA [23] comprises Synthetic (S) and Real (R) domains. The imbalance ratio in a multi-class dataset indicates the ratio of the number of instances in the least prevalent (minority) class to that in the most prevalent (majority) class.
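The imbalance ratio defined above is a one-liner; this small helper (an illustrative name, not from the paper) makes the definition concrete.

```python
def imbalance_ratio(class_counts):
    """Minority-class count divided by majority-class count.

    class_counts: list of per-class instance counts.
    Returns 1.0 for a perfectly balanced dataset; smaller values
    indicate stronger class imbalance.
    """
    return min(class_counts) / max(class_counts)
```

For example, a dataset with per-class counts [100, 50, 25] has an imbalance ratio of 0.25.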
Models.
We use ResNet50 as the source backbone and follow DINE [4] to train the source classifier. We add a fully-connected layer at the end of the backbone feature extractor and train the source model $f_s$ with the label smoothing technique [24]. The loss of the source model is defined as $\mathcal{L}_s = -\mathbb{E}_{(x^s, y^s) \in \mathcal{D}_s} \sum_{k=1}^{K} q_k^s \log \sigma_k(f_s(x^s))$, where $q^s = (1-\alpha)\,\mathbf{1}_{[y^s]} + \alpha/K$ represents the smoothed label vector. The smoothing parameter $\alpha$ is set empirically to a value of 0.1, and $\mathbf{1}_{[j]}$ represents a one-hot encoding of $K$ dimensions where only the $j$-th value is 1. For the target model, we also use ResNet50 as the backbone, and follow [2, 4] to replace the original classifier with a refined architecture that consists of a bottleneck layer with 256 units and a task-specific classifier. We place a batch-normalization layer after the fully-connected layer within the bottleneck layer, and a weight normalization [25] layer in the task-specific classifier.
Baselines.
We compare our proposed TEN with two competitors: DINE [4] and IterLNL [5]. DINE trains the target model by distilling the soft labels derived from the black-box source model, which includes noisy data, while IterLNL focuses on training exclusively with clean data, which are selected based on the noise rate. Additionally, we train the target model using only the target predictions from the black-box source model, and the method is denoted as “No adapt.”
Hyperparameters.
We conduct experiments five times and report the average accuracies. For all experiments, we use PyTorch on a GeForce RTX 3080. Following DINE [4], models initialized from the pre-trained ImageNet model have their learning rate set to 1e-3, while those learned from scratch use a learning rate of 1e-2. Furthermore, we adopt a learning rate scheduler, momentum (0.9), weight decay (1e-3), bottleneck size (256), and batch size (64). The values for the thresholds $\tau_p$ and $\tau_n$ are searched within the sets {0.5, 0.6, 0.7, 0.8} and {0.0001, 0.0005, 0.001, 0.005, 0.01}, respectively. The hyperparameters are optimized using Optuna.
Classification accuracy (Q1).
We report the target accuracies on the five datasets in Tables 3–7. TEN achieves the highest average accuracy in most cases and surpasses the second-best method by up to 7.59%, 4.08%, 3.67%, 9.49%, and 3.62% for Office-31, Office-Home, Image-CLEF, Adaptiope, and VisDA, respectively. Despite the significant disparity between the synthetic domain (Synthetic) and real-world (Product or Real life) domains in Adaptiope, TEN exhibits considerable improvement, which shows that TEN is an effective method for domain adaptation.
We also evaluate the performance of Black-box UDA with multiple source models. The soft labels of the source models are aggregated through averaging to establish the initialized soft label of the source models. We conduct experiments on four multi-source datasets, and use ResNet101 as backbone. As shown in Table 9, TEN demonstrates competitive performance across all these datasets, surpassing competitors by up to 4.81%.
Effect of flexible threshold (Q2)
To demonstrate the effectiveness of the flexible threshold, we compare the target performance against various division strategies as presented in Table 8. "No div." skips the division phase, and leverages all soft labels including noise for knowledge distillation. "TEN (fixed)" chooses clean labels based on a predefined threshold to ensure that the labels possess high confidence. Notably, our proposed TEN surpasses the baselines, demonstrating that the flexible threshold is beneficial by fully harnessing high-confidence clean labels. We analyze the reason by plotting the instances (clean subset) utilized for knowledge distillation in Fig. 2. "Pos" and "Neg" represent instances that are labeled correctly and inaccurately, respectively. With a lower fixed threshold as in Fig. 2 (a), all instances including noise are incorporated into training, which potentially undermines the target performance. In contrast, a higher fixed threshold ensures the selection of high-confidence instances but might also curtail the overall number of "Pos" training instances as shown in Fig. 2 (b). Compared to the baselines, TEN proposes a flexible threshold to diminish the number of "Neg" instances in the selection process while simultaneously ensuring the inclusion of a sufficient quantity of "Pos" instances as shown in Fig. 2 (c). This closely aligns with the ideal scenario where all "Pos" instances are employed for distillation, while precluding the inclusion of all "Neg" instances.
The stacked bars indicate the number of instances used for knowledge distillation in each case. “Pos” and “Neg” represent instances that are labeled correctly and inaccurately, respectively. (a) When the fixed threshold is low, all instances, including noise, are used for training, which may diminish the target performance. (b) Conversely, when the fixed threshold is high, high-confidence instances are selected, but with the disadvantage of reducing a significant number of training instances. (c) TEN mitigates the occurrence of “Neg” instances in the selection process, while concurrently ensuring the inclusion of an adequate number of “Pos” instances. It closely approximates the optimal scenario where all positive instances are utilized for distillation, while excluding all negative instances.
Effect of negative learning (Q3)
We conduct experiments on the Office-31 dataset with varying quantities of noisy instances for training. In Table 10, the percentage in the first column indicates the proportion of noisy data used for training relative to the total amount of noisy data available. We have two observations. First, TEN achieves the highest accuracy with an improvement of up to 7.59% in all cases. Second, the relative performance of TEN compared to competitors increases as more noisy data are used for training. The reason is that TEN effectively extracts high-confidence information from the noisy data, mitigating information loss.
Ablation study (Q4)
We examine the contribution of various components of TEN in Table 11. The proposed flexible threshold, negative learning, entropy regularization, and consistency regularization loss consistently enhance the accuracy of TEN with up to 21.28%, 1.90%, 3.93%, and 4.77%, respectively.
Hyperparameter sensitivity (Q5)
Fig 3 evaluates the sensitivity of the two hyperparameters $\tau_p$ (positive threshold) and $\tau_n$ (negative threshold) on accuracy using the Office-31 dataset. The horizontal axis at the bottom represents $\tau_p$ ranging from 0.50 to 0.80, while the top horizontal axis corresponds to $\tau_n$, presented on a logarithmic scale from $10^{-4}$ to $10^{-2}$. The vertical axis indicates accuracy. For $\tau_p$, the accuracy is optimal at 0.6, and it drops when $\tau_p$ is too high because a higher $\tau_p$ reduces the number of clean samples used to adjust the threshold, leading to an inaccurate evaluation of the flexible thresholds. For $\tau_n$, the accuracy is optimal at 0.005 and remains stable across different values with only minor variations. Overall, $\tau_p$ has a significant impact on accuracy, while $\tau_n$ demonstrates robustness with minimal effect on accuracy.
Conclusion
We propose TEN, an accurate method for Black-box Unsupervised Domain Adaptation. TEN partitions the target data into clean and noisy subsets. The pseudo labels of the clean subset correspond closely to the ground truths of the target task, while those of the noisy subset are often inaccurate. The high-confidence predictions of the clean subset reflect the presence of certain classes, while the confidently low probabilities of the noisy subset reflect their absence. Considering this, we exploit knowledge distillation on clean labels and negative learning on noisy labels to learn their respective high-confidence predictions. Experimental results demonstrate that TEN outperforms the baseline methods by up to 9.49% higher accuracy for single-source UDA, and 4.81% higher accuracy for multi-source UDA. The performance of our approach exhibits an upward trend as an increasing amount of noisy data is utilized for training.
There are several possible future research directions. Our primary contribution in this work is the significant boost in accuracy. However, achieving computational efficiency is another important aspect. Also, adapting our method for non-image data by considering their unique characteristics would be interesting. Finally, addressing the case when we have few unlabeled target data is a promising direction.
References
- 1.
Kundu JN, Venkat N, V RM, Babu RV. Universal source-free domain adaptation. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13–19, 2020. Computer Vision Foundation/ IEEE; 2020. pp. 4543–52.
- 2.
Liang J, Hu D, Feng J. Do we really need to access the source data? Source hypothesis transfer for unsupervised domain adaptation. In: Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event. vol. 119 of Proceedings of Machine Learning Research. PMLR; 2020. pp. 6028–39.
- 3. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative adversarial networks. Commun ACM. 2020;63(11):139–44.
- 4.
Liang J, Hu D, Feng J, He R. DINE: domain adaptation from single and multiple black-box predictors. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18–24, 2022. IEEE; 2022. pp. 7993–8003.
- 5.
Zhang H, Zhang Y, Jia K, Zhang L. Unsupervised domain adaptation of black-box source models. In: 32nd British Machine Vision Conference 2021, BMVC 2021, Online, November 22–25, 2021. BMVA Press; 2021. p. 147.
- 6.
Kuzborskij I, Orabona F. Stability and hypothesis transfer learning. In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, June 16–21, 2013. vol. 28 of JMLR Workshop and Conference Proceedings. JMLR.org; 2013. pp. 942–50.
- 7. Zhu C, Zhang L, Luo W, Jiang G, Wang Q. Tensorial multiview low-rank high-order graph learning for context-enhanced domain adaptation. Neural Netw. 2025;181:106859. pmid:39509810
- 8. Zhu C, Wang Q, Xie Y, Xu S. Multiview latent space learning with progressively fine-tuned deep features for unsupervised domain adaptation. Inf Sci. 2024;662:120223.
- 9.
Zhang J, Huang J, Jiang X, Lu S. Black-box unsupervised domain adaptation with bi-directional Atkinson-Shiffrin memory. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1–6, 2023.
- 10.
Tarvainen A, Valpola H. Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4–9, 2017, Long Beach, CA, USA; 2017. pp. 1195–204.
- 11.
Arazo E, Ortego D, Albert P, O’Connor NE, McGuinness K. Pseudo-labeling and confirmation bias in deep semi-supervised learning. In: 2020 International Joint Conference on Neural Networks, IJCNN 2020, Glasgow, United Kingdom, July 19–24, 2020. IEEE; 2020. p. 1–8.
- 12.
Morerio P, Volpi R, Ragonesi R, Murino V. Generative pseudo-label refinement for unsupervised domain adaptation. In: IEEE Winter Conference on Applications of Computer Vision, WACV 2020, Snowmass Village, CO, USA, March 1–5, 2020. IEEE; 2020. pp. 3119–28.
- 13.
Saito K, Ushiku Y, Harada T. Asymmetric tri-training for unsupervised domain adaptation. In: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017. vol. 70 of Proceedings of Machine Learning Research. PMLR; 2017. pp. 2988–97.
- 14.
Kundu JN, Lakkakula N, Radhakrishnan VB. UM-Adapt: unsupervised multi-task adaptation using adversarial cross-task distillation. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27–November 2, 2019. IEEE; 2019. pp. 1436–45.
- 15.
Zhou B, Kalra N, Krähenbühl P. Domain adaptation through task distillation. In: Computer Vision—ECCV 2020—16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVI. vol. 12371 of Lecture Notes in Computer Science. Springer; 2020. pp. 664–80.
- 16.
Zhang B, Wang Y, Hou W, Wu H, Wang J, Okumura M, et al. FlexMatch: boosting semi-supervised learning with curriculum pseudo labeling. In: Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6–14, 2021, virtual; 2021. pp. 18408–19.
- 17.
Kim Y, Yim J, Yun J, Kim J. NLNL: negative learning for noisy labels. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27–November 2, 2019. IEEE; 2019. pp. 101–10.
- 18.
Rizve MN, Duarte K, Rawat YS, Shah M. In defense of pseudo-labeling: an uncertainty-aware pseudo-label selection framework for semi-supervised learning. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021. OpenReview.net; 2021.
- 19.
Sohn K, Berthelot D, Carlini N, Zhang Z, Zhang H, Raffel C, et al. FixMatch: simplifying semi-supervised learning with consistency and confidence. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual; 2020.
- 20.
Cubuk ED, Zoph B, Shlens J, Le Q. RandAugment: practical automated data augmentation with a reduced search space. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual; 2020.
- 21.
Venkateswara H, Eusebio J, Chakraborty S, Panchanathan S. Deep hashing network for unsupervised domain adaptation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017.
- 22.
Ringwald T, Stiefelhagen R. Adaptiope: a modern benchmark for unsupervised domain adaptation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV); 2021.
- 23.
Peng X, Usman B, Kaushik N, Hoffman J, Wang D, Saenko K. VisDA: the visual domain adaptation challenge; 2017.
- 24.
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27–30, 2016. IEEE Computer Society; 2016. pp. 2818–26.
- 25.
Salimans T, Kingma DP. Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain; 2016. p. 901.