MULTI-EPL: Accurate multi-source domain adaptation

Given multiple source datasets with labels, how can we train a target model with no labeled data? Multi-source domain adaptation (MSDA) aims to train a model using multiple source datasets different from a target dataset in the absence of target data labels. MSDA is a crucial problem applicable to many practical cases where labels for the target data are unavailable due to privacy issues. Existing MSDA frameworks are limited since they align data without considering the labels of the features of each domain. They also do not fully utilize the unlabeled target data and rely on limited feature extraction with a single extractor. In this paper, we propose MULTI-EPL, a novel method for MSDA. MULTI-EPL exploits label-wise moment matching to align the conditional distributions of the features for the labels, uses pseudolabels for the unavailable target labels, and introduces an ensemble of multiple feature extractors for accurate domain adaptation. Extensive experiments show that MULTI-EPL achieves state-of-the-art performance for MSDA tasks in both image domains and text domains, improving the accuracy by up to 13.20%.


Introduction
Given multiple source datasets with labels, how can we train a target model with no labeled data? Large training data are essential for training deep neural networks. Collecting abundant data is, unfortunately, an obstacle in practice; even if enough data are obtained, manually labeling those data is prohibitively expensive. Using other available or much cheaper datasets would be a solution for these limitations; however, indiscriminate usage of other datasets often brings severe generalization error due to the presence of dataset shifts [1]. Unsupervised domain adaptation (UDA) tackles these problems where no labeled data from the target domain are available, but labeled data from other source domains are provided. Finding domain-invariant features has been the focus of UDA since it allows knowledge transfer from the labeled source dataset to the unlabeled target dataset.
There have been many efforts to transfer the knowledge from a single source domain to a target one. Most recent frameworks minimize the distance between two domains by deep neural networks and distance-based techniques such as discrepancy regularizers [2-4], adversarial networks [5, 6], and generative networks [7-9]. While the above-mentioned approaches consider a single source, we address multi-source domain adaptation (MSDA), which is more practical in real-world applications as well as more challenging. MSDA can bring significant performance enhancement by virtue of access to multiple datasets, as long as the multiple domain shift problems are resolved. Previous works have extensively presented both theoretical analysis [10-15] and models [14, 16-20] for MSDA. MDAN [14], DCTN [16], and MDDA [18] build adversarial networks for each source domain to generate features domain-invariant enough to confound domain classifiers.
However, these approaches do not encompass the interactions among source domains, accounting only for shifts between the source and target domains. M³SDA [17] adopts a moment matching strategy but makes the unrealistic assumption that matching the marginal probability p(x) would guarantee the alignment of the conditional probability p(x|y). Most of these methods also do not fully exploit the knowledge of the target domain because its labels are inaccessible. Furthermore, these methods require individual deep neural networks for each source domain as described in Fig 1, which introduces great redundancy and significantly increases the overall model complexity. LtC-MSDA [19] configures prototypes of the features from each domain and learns the interaction between multiple domains by deploying a GCN. However, summarizing each domain into only one prototype cannot fully represent the feature distributions of the domain and therefore deteriorates the performance.
In this paper, we propose MULTI-EPL (Multi-source domain adaptation with Ensemble of feature extractors, Pseudolabels, and Label-wise moment matching), a novel MSDA framework that mitigates two limitations of these methods: they do not explicitly consider the conditional probability p(x|y), and their models have great redundancy. MULTI-EPL is illustrated in Fig 2. MULTI-EPL aligns the conditional probability p(x|y) by utilizing label-wise moment matching. We employ pseudolabels for the inaccessible target labels to maximize the usage of the target data. Moreover, we generate an ensemble of features from multiple feature extractors to capture rich information about labels. Extensive experiments show the superiority of MULTI-EPL (see Fig 3).
Our contributions are summarized as follows:
• Method. We propose MULTI-EPL, a novel approach for MSDA that effectively and efficiently obtains domain-invariant features from multiple domains by matching conditional probability p(x|y), utilizing pseudolabels for inaccessible target labels to fully exploit target data, handling all the source domains with one single neural network, and using an ensemble of multiple feature extractors for further enhancement. It allows domain-invariant features to be extracted, capturing the intrinsic differences of labels.
• Experiments. We conduct extensive experiments on image and text datasets. We show that 1) MULTI-EPL provides the state-of-the-art performance, and 2) each of our main ideas significantly contributes to the superior performance.
In the rest of this paper, we first introduce the related works and describe our proposed method. Then, we experimentally evaluate the performance of MULTI-EPL and its competitors. The code for MULTI-EPL can be found at https://github.com/snudatalab/MultiEPL. Frequently used symbols are summarized in Table 1.

Single-source domain adaptation
Given a labeled source dataset and an unlabeled target dataset, single-source domain adaptation aims to train a model that performs well on the target domain. The challenge of single-source domain adaptation is to reduce the discrepancy between the two domains and to obtain appropriate domain-invariant features. Various discrepancy measures such as Maximum Mean Discrepancy (MMD) [2-4, 21, 22] and KL divergence [23] have been used as regularizers. Inspired by the insight that domain-invariant features should exclude clues about their domain, constructing adversarial networks against domain classifiers has shown superior performance. [7] and [9] deploy GANs to transform data across the source and target domains,

PLOS ONE
while [5] and [6] leverage the adversarial networks to extract common features of the two domains. Unlike these works, we focus on multiple source domains.

Multi-source domain adaptation
Single-source domain adaptation should not be naively employed for multiple source domains due to domain shifts. Many previous works have tackled Multi-source Domain Adaptation (MSDA) problems theoretically. [11] establishes a distribution-weighted combining rule, under which the weighted combination of source hypotheses is a good approximation for the target hypothesis.
Table 1 (fragment). $l_D$: labeling function of the domain $D$. $n_T$: number of instances in the target dataset.

The rule is further extended to a stochastic case with joint distribution over the input and the output space in [13]. [12] proposes a general theory of how to sift appropriate samples out of multi-source data using expected loss. Efforts to find transferable knowledge from multiple sources from the causal viewpoint are made in [24]. There have been salient studies on the learning bounds for MSDA. [10] finds the generalization bounds based on HΔH-divergence, which are further tightened by [14]. Frameworks for MSDA have been presented as well. [14] proposes learning algorithms based on the generalization bounds for MSDA. DCTN [16] resolves domain and category shifts between source and target domains via adversarial networks. TMDA [25] aligns multiple domains utilizing clustering and adversarial training. M³SDA [17] associates all the domains into a common distribution by matching the moments of the feature distributions of multiple domains. [26] attempts to find the common latent space of source and target domains, focusing on visual sentiment classification tasks. MDDA [18] employs Wasserstein distance to figure out which data from which source domains are closely related to the target data. In LtC-MSDA [19], the interactions among multiple domains are learned by constructing a knowledge graph. However, most of these methods do not consider multimode structures [27], in which differently labeled data follow distinct distributions even if they are drawn from the same domain. Also, the domain-invariant features in these methods contain the label information for only one label classifier, which leads these methods to miss a large amount of label information. Unlike these methods, our framework fully considers the multimode structures, handles the data distributions in a label-wise manner, and minimizes the label information loss by considering multiple label classifiers.

Moment matching
The moment matching strategy has been used to minimize the discrepancy between source and target domains in domain adaptation. The MMD regularizer [2-4, 21, 22] can be interpreted as first-order moment matching, while [28] addresses second-order moment matching of source and target distributions. [29] investigates the effect of higher-order moment matching. M³SDA [17] demonstrates that moment matching also yields remarkable performance with multiple sources. While previous works have focused on matching the moments of marginal distributions for single-source adaptation, we handle conditional distributions in multi-source scenarios.

Methods
In this section, we describe our proposed method, MULTI-EPL. We first formulate the problem definition and describe our main ideas. Then, we elaborate on how to match label-wise moments with pseudolabels and extend the approach by adding the concept of ensemble learning. Fig 2 shows the overview of MULTI-EPL.

Problem definition
Given a set of labeled datasets from $N$ source domains $S_1, \ldots, S_N$ and an unlabeled dataset from a target domain $T$, we aim to construct a model that minimizes the test error on $T$. We formulate source domain $S_i$ as a tuple of the data distribution $\mu_{S_i}$ on data space $\mathcal{X}$ and the labeling function $l_{S_i}$: $S_i = (\mu_{S_i}, l_{S_i})$. The corresponding source dataset is denoted as $X_{S_i}$, where $n_{S_i}$ is the number of instances in $X_{S_i}$. Likewise, the target domain and the target dataset are denoted as $T = (\mu_T, l_T)$ and $X_T$, respectively, where $n_T$ is the number of instances in $X_T$. We narrow our focus down to homogeneous settings in classification tasks: all domains share the same data space $\mathcal{X}$ and label set $C$.

Overview
We propose MULTI-EPL based on the following observations: 1) existing methods focus on aligning the marginal distributions p(x), not the conditional ones p(x|y), 2) knowledge of the target data is not fully employed as no target label is given, 3) existing methods that require separate neural networks for each source domain have considerable inefficiency in model size, and 4) there is a large amount of loss in label information since domain-invariant features are extracted for only one label classifier. Designing a method to solve these limitations entails the following challenges:
1. Matching conditional distributions. How can we align the conditional distribution, p(x|y), of multiple domains, not the marginal one, p(x)?
2. Exploitation of the target data. How can we fully exploit the knowledge of the target data despite the absence of the target labels?
3. Maximization of the model efficiency. How can we maximize the model efficiency and performance?
We propose the following main ideas to address the challenges:
1. Label-wise moment matching. We match the label-wise moments of the domain-invariant features so that the features with the same labels have similar distributions regardless of their original domains. This improves not only adaptation but also classification performance compared to the previous methods, which align features without considering labels and therefore cannot clearly separate differently labeled instances.
2. Pseudolabels. We use pseudolabels as alternatives to the target labels. While the existing MSDA methods have made only limited use of target data, this allows the intrinsic properties related to the label prediction of each target instance to be better reflected.
3. Ensemble of feature representations. We integrate multiple neural networks, each of which handles one source domain, into one neural network. For further improvement, we propose a variant of ensemble learning to concatenate features from multiple feature extractors. This enhances model performance without an extreme increase in model size, whereas the existing methods have significantly increased model size for better performance.

Label-wise moment matching with pseudolabels
We describe how MULTI-EPL matches conditional distributions p(x|y) of the features from multiple distinct domains. In MULTI-EPL, a feature extractor $f_e$ and a label classifier $f_{lc}$ lead the features to be domain-invariant and label-informative at the same time. The feature extractor $f_e$ extracts features from data, and the label classifier $f_{lc}$ receives the features and predicts the labels for the data. We train $f_e$ and $f_{lc}$ according to the losses for label-wise moment matching and label classification, which make the features domain-invariant and label-informative, respectively.
Label-wise moment matching. To achieve the alignment of domain-invariant features, we define a label-wise moment matching loss $L_{lmm,K}$ (Eq 1), where $K$ is a hyperparameter indicating the maximum order of moments considered by the loss, and $n_{D,c}$ is the number of data labeled as $c$ in $X_D$. To manage the absence of the ground truths for the target data, we introduce pseudolabels, which are determined by the outputs of the model currently being trained, to assign the label $c$ to the target data. In other words, we compute $f_{lc}(f_e(x_T))$ using $f_{lc}$ and $f_e$ trained up to the previous iteration step to give the pseudolabels to the target data $x_T$. The L2 norm term in Eq 1 measures how much the $k$-th order moments of the features labeled as $c$ differ between the source domain $S_i$ and the target domain $T$. The sum of the term over every possible $c$, $i$, and $k$ gives the discrepancy of the feature distributions between the source domains and the target domain. By minimizing $L_{lmm,K}$, the feature extractor $f_e$ aligns data from multiple domains by bringing consistency to the distributions of the features with the same labels. The data with distinct labels are aligned independently, taking account of the multimode structures in which differently labeled data follow different distributions.
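The label-wise moment matching described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the loss has the form $\sum_{k=1}^{K}\sum_{i=1}^{N}\sum_{c \in C} \lVert m_k(S_i, c) - m_k(T, c) \rVert_2$, where $m_k(D, c)$ is the mean of the element-wise $k$-th powers of the features labeled $c$ in domain $D$. Any additional normalization in Eq 1 is unknown here. The names `labelwise_moments` and `lmm_loss` are hypothetical, and skipping labels that are absent from one side of a pair is also an assumption.

```python
from collections import defaultdict
from math import sqrt


def labelwise_moments(features, labels, order):
    """Return {label: mean of element-wise `order`-th powers of its features}."""
    sums, counts = {}, defaultdict(int)
    for x, c in zip(features, labels):
        powered = [v ** order for v in x]
        if c in sums:
            sums[c] = [s + p for s, p in zip(sums[c], powered)]
        else:
            sums[c] = powered
        counts[c] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def lmm_loss(source_sets, target_feats, target_pseudolabels, max_order):
    """Sum of L2 distances between label-wise moments of each source and the target.

    source_sets: list of (features, labels) pairs, one per source domain.
    Labels missing from either side of a pair are skipped (an assumption).
    """
    loss = 0.0
    for k in range(1, max_order + 1):
        tgt = labelwise_moments(target_feats, target_pseudolabels, k)
        for feats, labels in source_sets:
            src = labelwise_moments(feats, labels, k)
            for c in src.keys() & tgt.keys():
                loss += sqrt(sum((a - b) ** 2 for a, b in zip(src[c], tgt[c])))
    return loss


# Toy 1-D features: both domains have the same overall mean (1.0).
source = ([[0.0], [2.0]], [0, 1])
target_feats = [[2.0], [0.0]]
mismatched = lmm_loss([source], target_feats, [0, 1], max_order=1)
aligned = lmm_loss([source], target_feats, [1, 0], max_order=1)
```

In this toy case the marginal first moments of the two domains are identical, so a marginal criterion would report no discrepancy, while the label-wise loss is 4.0 for the mismatched labeling and 0.0 when the class-conditional means coincide.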
Label classification. The label classifier $f_{lc}$ gets the features projected by $f_e$ as inputs and makes the label predictions. The label classification loss $L_{lc}$ (Eq 2) is built on $L_{ce}$, the softmax cross-entropy loss. Minimizing $L_{lc}$ separates the features with different labels so that each of them becomes label-distinguishable.
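A small sketch of the softmax cross-entropy and of pseudolabel assignment may make the two roles of the classifier outputs concrete. It is illustrative only: taking the pseudolabel as the argmax of the classifier output and averaging the loss over a batch are assumptions, and all function names are hypothetical.

```python
from math import exp, log


def softmax(logits):
    """Numerically stable softmax over a list of class scores."""
    m = max(logits)
    exps = [exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Softmax cross-entropy of one prediction against an integer label."""
    return -log(softmax(logits)[label])


def pseudolabel(logits):
    """Assumed rule: the pseudolabel is the index of the largest output."""
    return max(range(len(logits)), key=lambda c: logits[c])


def label_classification_loss(batch_logits, batch_labels):
    """Mean cross-entropy over a labeled batch (averaging is an assumption)."""
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)
```

Source instances would use their ground-truth labels in `label_classification_loss`, while `pseudolabel` stands in for the missing target labels when the label-wise moments are computed.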

Ensemble of feature representations
In this section, we introduce ensemble learning for further enhancement. Features extracted with the method described in the previous section contain the label information for a single label classifier. However, each label classifier leverages only limited label characteristics, and thus the conventional scheme of adopting only one pair of feature extractor and label classifier captures only a small part of the label information. Our idea is to leverage an ensemble of multiple pairs of feature extractors and label classifiers in order to make the features more label-informative. We train two pairs of feature extractor and label classifier in parallel following the label-wise moment matching approach explained in the previous section. We denote the two (feature extractor, label classifier) pairs as $(f_{e,1}, f_{lc,1})$ and $(f_{e,2}, f_{lc,2})$, and the resultant features from each feature extractor as feat_1 and feat_2, respectively. After obtaining two different feature mappings, we concatenate the two into one vector feat_final = concat(feat_1, feat_2). The final label classifier $f_{lc,final}$ takes the concatenated feature as input and predicts the label of the feature.
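The concatenation step can be sketched as follows. The two toy extractors are hypothetical stand-ins for trained networks and are used only to show how the input dimension of the final classifier grows with the number of extractors.

```python
def ensemble_features(x, extractors):
    """Concatenate the feature vectors that each extractor produces for x."""
    feat_final = []
    for f_e in extractors:
        feat_final.extend(f_e(x))
    return feat_final


# Hypothetical stand-ins for the trained extractors f_e,1 and f_e,2.
def toy_extractor_1(x):
    return [sum(x), max(x)]


def toy_extractor_2(x):
    return [min(x)]


feat_final = ensemble_features([1.0, 3.0], [toy_extractor_1, toy_extractor_2])
```

Here `feat_final` has three entries, two from the first extractor followed by one from the second, so the final classifier would need an input layer of the summed feature width.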

MULTI-EPL: Accurate multi-source domain adaptation
Our final model MULTI-EPL consists of two pairs of feature extractor and label classifier, $(f_{e,1}, f_{lc,1})$ and $(f_{e,2}, f_{lc,2})$, and one final label classifier, $f_{lc,final}$. We train the model in an iterative manner where each iteration is composed of two steps. We first train the entire model except for the final label classifier with the loss $L$:
$$L = \sum_{n=1}^{2} \left( L_{lc,n} + \alpha L_{lmm,K,n} \right), \qquad (3)$$
where $L_{lc,n}$ is the label classification loss of the classifier $f_{lc,n}$, $L_{lmm,K,n}$ is the label-wise moment matching loss of the feature extractor $f_{e,n}$, $\alpha$ is a hyperparameter that balances the two loss terms, and $K$ is the hyperparameter for the maximum order of moments in $L_{lmm,K,n}$. Then, the final label classifier is trained with respect to the label classification loss $L_{lc,final}$ using the concatenated features from the multiple feature extractors. We repeat these two steps until the predetermined number of epochs is reached.
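The combined loss and the two-step schedule can be outlined as below. This is a schematic sketch: `update_pairs` and `update_final` are hypothetical callbacks standing in for the actual gradient updates, which would require an automatic differentiation library and are not shown.

```python
def pair_loss(lc_losses, lmm_losses, alpha):
    """Sum over pairs n of (L_lc,n + alpha * L_lmm,K,n), as in Eq 3."""
    if len(lc_losses) != len(lmm_losses):
        raise ValueError("need one classification and one moment loss per pair")
    return sum(lc + alpha * lmm for lc, lmm in zip(lc_losses, lmm_losses))


def train(num_epochs, update_pairs, update_final):
    """Alternate the two steps; both callbacks are hypothetical placeholders."""
    history = []
    for epoch in range(num_epochs):
        # Step 1: update every (f_e,n, f_lc,n) pair with the loss of Eq 3.
        history.append(("pairs", epoch, update_pairs()))
        # Step 2: update only f_lc,final on the concatenated features.
        history.append(("final", epoch, update_final()))
    return history


example_loss = pair_loss([0.5, 0.7], [10.0, 20.0], alpha=0.01)
history = train(2, lambda: 1.0, lambda: 2.0)
```

With these placeholder values the moment matching terms contribute 0.1 and 0.2, illustrating how a small $\alpha$ keeps the classification loss dominant.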

Experimental results
We conduct experiments to answer the following questions.

Experimental settings
Datasets. We use three collections of datasets, Digits-Five, Office-Caltech10 [30], and Amazon Reviews [31], listed in Table 2. Digits-Five consists of five datasets for digit recognition: MNIST [32], MNIST-M [33], SVHN [34], SynthDigits [33], and USPS [35]. We set one of them as a target domain and the rest as source domains. Following the conventions in prior works [16, 17], we randomly sample 25000 instances from the source training set and 9000 instances from the target training set to train the model, except for USPS, for which the whole training set is used. The entire test set is used to evaluate the performance. Office-Caltech10 is for image classification with the 10 categories that the Office31 dataset and the Caltech dataset have in common. It involves four different domains: Amazon, Caltech, DSLR, and Webcam. We double the number of instances by data augmentation and exploit all the original instances and augmented instances as training and test sets, respectively. Amazon Reviews contains customers' reviews on 4 product categories: Books, DVDs, Electronics, and Kitchen appliances. The instances are encoded into 5000-dimensional vectors and are labeled as being either positive or negative depending on their sentiments. We set each of the four categories as a target and the rest as sources. For all the domains, 2000 instances are sampled for training, and the rest of the data are used for the test.
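The leave-one-domain-out protocol used for each collection can be enumerated with a few lines of Python. The helper name `leave_one_out` is hypothetical; the domain names are those listed above.

```python
DIGITS_FIVE = ["MNIST", "MNIST-M", "SVHN", "SynthDigits", "USPS"]


def leave_one_out(domains):
    """Yield (target, sources) with each domain taken as the target once."""
    for target in domains:
        yield target, [d for d in domains if d != target]


scenarios = list(leave_one_out(DIGITS_FIVE))
```

For Digits-Five this gives five scenarios, each with one unlabeled target domain and four labeled source domains.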

Competitors. We use five MSDA algorithms with state-of-the-art performance as baselines: DCTN [16], M³SDA, M³SDA-β [17], MDDA [18], and LtC-MSDA [19]. All the frameworks share the same architecture for the feature extractor and the label classifier for consistency. For Digits-Five, we use convolutional neural networks based on LeNet5 [32]. For Office-Caltech10, ResNet50 [36] pretrained on ImageNet is used as the backbone architecture. For Amazon Reviews, the feature extractor is composed of three fully-connected layers with 1000, 500, and 100 output units, respectively, and a single fully-connected layer with 100 input units and 2 output units is adopted for the label classifier. With Digits-Five, LeNet5 [32] and ResNet14 [36] without any adaptation are additionally investigated in two different manners: Source Combined and Single Best. In Source Combined, multiple source datasets are simply combined and fed into a model. In Single Best, we train the model with each source dataset independently and report the result of the best performing one. Likewise, ResNet50 and an MLP consisting of 4 fully-connected layers with 1000, 500, 100, and 2 units are investigated without adaptation for Office-Caltech10 and Amazon Reviews, respectively.
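As a rough check on the size of the Amazon Reviews network described above, the number of weights in a stack of fully-connected layers can be counted directly. The sketch assumes each layer has a bias vector and no other trainable parameters, which the text does not state; `dense_params` is a hypothetical helper.

```python
def dense_params(layer_sizes, bias=True):
    """Count the weights (and optional biases) of a fully-connected stack."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + (n_out if bias else 0)
    return total


# 5000-dimensional input, then layers with 1000, 500, and 100 output units.
extractor_params = dense_params([5000, 1000, 500, 100])
# Label classifier: 100 input units, 2 output units.
classifier_params = dense_params([100, 2])
```

Under these assumptions almost all of the parameters sit in the first layer of the feature extractor, so adding a second extractor roughly doubles the model size while the classifiers stay negligible.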
Training details. We train our models for Digits-Five with Adam optimizer [37] with β 1 = 0.9, β 2 = 0.999, and the learning rate of 0.0004 for 100 epochs. All images are scaled to 32 × 32 and the mini-batch size is set to 128. We set the hyperparameters α = 0.01, and K = 1.
For the experiments with Office-Caltech10, all the modules comprising our model are trained with the SGD-momentum optimizer with a weight decay of 0.001 and a momentum factor of 0.9. The learning rates for the feature extractors and the label classifiers are 0.0001 and 0.001, respectively. We scale all the images to 224 × 224 and set the mini-batch size to 48. All the other hyperparameters are kept the same as in the experiments with Digits-Five. For Amazon Reviews, we train the models for 50 epochs using the Adam optimizer with β 1 = 0.9, β 2 = 0.999, and a learning rate of 0.0001. We set α = 0.1, K = 2, and the mini-batch size to 100.

Performance evaluation
We evaluate the performance of MULTI-EPL against the competitors. We repeat experiments for each setting five times and report the mean and the standard deviation. The results are summarized in Tables 3-5. In the tables, SC and SB indicate Source Combined and Single Best, respectively. Note that MULTI-EPL provides the best accuracy in all the datasets, showing its superiority in both image datasets (Digits-Five and Office-Caltech10) and text datasets

(Amazon Reviews). The enhancement is especially remarkable when MNIST-M is the target domain in Digits-Five, improving the accuracy by 13.20% compared to the state-of-the-art methods. It is also notable that MULTI-EPL consistently achieves successful adaptation of multiple domains, while other state-of-the-art methods sometimes fail to adapt and even deteriorate the performance. The failure appears to be attributable to negative transfer [38], but we leave this issue for future work. We also summarize the results in Fig 4 using a CD (critical difference) diagram [39]. The diagram covers every source-target scenario and the five adaptation methods DCTN, M³SDA, M³SDA-β, LtC-MSDA, and MULTI-EPL. It demonstrates that MULTI-EPL gives significant performance enhancement compared to the existing methods.
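Reporting the mean and the standard deviation over five repeated runs can be done with the Python standard library, as sketched below. The accuracies are made-up placeholder values, not results from this paper, and the use of the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev

# Placeholder accuracies (%) for five runs; NOT values from the paper.
run_accuracies = [90.0, 91.0, 92.0, 93.0, 94.0]

mean_acc = mean(run_accuracies)
# stdev() is the sample standard deviation (divides by n - 1).
std_acc = stdev(run_accuracies)

summary = f"{mean_acc:.2f} ± {std_acc:.2f}"
```

Switching to `statistics.pstdev` would give the population standard deviation instead, which is slightly smaller for the same five values.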

Ablation study
We perform an ablation study on Digits-Five to identify what exactly enhances the performance of MULTI-EPL. We compare MULTI-EPL with three of its variants: MULTI-0, MULTI-PL, and MULTI-PL-DED. MULTI-0 aligns moments regardless of the labels of the data. MULTI-PL trains the model without ensemble learning. MULTI-PL-DED consists of four feature generators and four label classifiers, each of which is dedicated to each source domain.
The results are shown in Table 6. By comparing MULTI-0 with MULTI-PL, we observe that considering labels in moment matching plays a significant role in extracting domain-invariant features. The remarkable performance gap between MULTI-PL and MULTI-EPL verifies the effectiveness of ensemble learning. The overall accuracy of MULTI-PL-DED is much lower than that of MULTI-PL or MULTI-EPL; it demonstrates that the existing methods that assign individual networks for each source domain deteriorate not only the performance but also the model efficiency.

Effects of ensemble
We evaluate the performance on Digits-Five while varying the number n of pairs of feature extractor and label classifier. The results are summarized in Table 6. While an ensemble of two pairs gives much better performance than the model with a single pair, using more than two pairs does not bring remarkable improvement, except for the case of SVHN being the target dataset. We presume that overfitting due to the excessive number of parameters has hindered further improvement. We leave the task of finding proper regularization methods for the ensembles as future work.
Table 6. Experiments with MULTI-EPL and its variants.

Parameter efficiency
We compare the number of parameters and the performance of MULTI-EPL with those of other state-of-the-art methods to demonstrate that MULTI-EPL uses its model capacity efficiently. Fig 5 illustrates the number of model parameters and the average accuracy of each method evaluated on the Digits-Five dataset. MULTI-PL is the variant of MULTI-EPL that does not use the ensemble technique. Comparing MULTI-PL with LtC-MSDA shows the advantage of the proposed method at comparable model complexity. Moreover, the significant performance gain that ensemble learning brings to MULTI-EPL demonstrates that MULTI-EPL benefits greatly from the additional model parameters, whereas MDDA shows little performance improvement even though it requires many more model parameters.

Visualization
We visualize the features from distinct adaptation methods using t-SNE [40] to verify the effect of label-wise moment matching. Fig 6 shows the feature distributions when no Note that MULTI-EPL clearly separates features with different labels, while the others do not; this explains the outstanding performance of MULTI-EPL.

Conclusion
We propose MULTI-EPL, a novel framework for the multi-source domain adaptation problem. MULTI-EPL overcomes three problems in existing methods: not directly addressing the conditional distributions of data p(x|y), not fully exploiting the knowledge of target data, and having redundancy in model networks. MULTI-EPL aligns data from multiple source domains and the target domain considering the data labels, and exploits pseudolabels for unlabeled target data. MULTI-EPL further enhances the performance by generating an ensemble of multiple feature extractors. Our framework exhibits superior performance on both image and text classification tasks. The ablation study shows that considering labels in moment matching and adding ensemble learning both bring remarkable performance enhancement. Future work includes extending our approach to other tasks such as regression, which may require modification of the pseudolabeling method.