Kernel Manifold Alignment

We introduce a kernel method for manifold alignment (KEMA) and domain adaptation that can align an arbitrary number of domains of different dimensionality without needing corresponding pairs, just a few labeled examples in all domains. KEMA has interesting properties: 1) it reduces to SSMA when using a linear kernel, 2) it goes beyond data rotations, so it can align manifolds of very different structures and complexities, performing a sort of manifold unfolding plus alignment, 3) it can extract as many features as available points, and 4) it is robust to strong nonlinear deformations. KEMA performs at least comparably to other domain adaptation methods in high-dimensional problems. We illustrate the method's capabilities in a series of toy examples.


Introduction
Domain adaptation is a field of high interest in pattern analysis and machine learning. Classification algorithms developed with data from one domain cannot be directly used in another, related domain, and hence adaptation of either the classifier or the data representation becomes necessary [1]. For example, there is strong evidence that a significant degradation in the performance of state-of-the-art image classifiers is due to test domain shifts, such as changing image sensors and noise conditions [2], pose changes [3], consumer vs. commercial video [4], and, more generally, datasets biased by changing acquisition procedures [5].
Adapting (modifying) the classifier for any new incoming situation requires either computationally demanding retraining, passive-aggressive strategies, online filtering, or sample-relevance estimation and weighting. These approaches are algorithm-dependent, often resort to heuristic parameters, and require good estimates of sample relevance and information content. The ever-evolving classifier is also very hard to analyze. Alternatively, one may adapt the domain representations to a single latent space and then apply a single classifier in that semantically meaningful feature space. In this paper, we focus on the latter pathway. Adapting the representation space has been referred to in the literature as feature representation transfer [6] or feature transformation learning [7].

Related works
The literature on feature representation transfer can be divided into three families of adaptation problems, depending on the availability of labels in the different domains. They are briefly reviewed hereafter, and their main properties are summarized in Table 1. Many of these methods require specifying a small number of cross-domain sample correspondences. This requirement was addressed in [23] by relaxing the constraint of paired correspondences into the constraint of having the same class labels in all domains. The semi-supervised manifold alignment (SSMA) method proposed in [23] projects data from different domains to a latent space where samples belonging to the same class become closer, those of different classes are pushed far apart, and the geometry of each domain is preserved. The method performs well in general and can deal with multiple domains of different dimensionality. However, SSMA cannot cope with strong nonlinear deformations and high-dimensional data problems.

Contributions
This paper introduces a generalization of SSMA through kernelization for manifold alignment and domain adaptation. The proposed Kernel Manifold Alignment (KEMA) has several appealing properties:
1. KEMA generalizes other manifold alignment methods. Being a kernel method, KEMA reduces to SSMA [23] when using a linear kernel, which allows dealing with high-dimensional data efficiently in the dual form (Q-mode analysis): KEMA can therefore cope with input spaces of very large dimension, e.g. Fisher vectors or deep features. KEMA also generalizes other manifold alignment methods, e.g. [20], when used with a linear kernel and with sample correspondences instead of the class similarity matrices (see page 5).
2. KEMA goes beyond data rotations and can align manifolds of very different structure, performing a flexible discriminative alignment that preserves the manifold structure.
3. KEMA defines a domain-specific metric when using different kernel functions in the different domains. Contrary to SSMA, KEMA can use a different kernel in each domain, allowing the best descriptor for each data source at hand: when aligning text and images, for instance, one could use (more appropriate) string or histogram kernels in the very same alignment procedure, or the same kernel function with different hyperparameters in each domain.
4. Like SSMA, KEMA can align data spaces of different dimensionality. This is an advantage with respect to other feature representation transfer approaches that require either sample correspondences [9,12,15,20] or strict equivalence of the feature spaces across domains [2,10,25].
5. KEMA is robust to strong (nonlinear) deformations of the manifolds to be aligned, as the kernel compensates for problems in graph estimation and numerical problems. As noted above, the use of different metrics stemming from different kernels reinforces the flexibility of the approach.
6. Mapping data between domains (and hence data synthesis) can be performed in closed form, allowing the quality of the alignment to be measured in physical units. Kernelization typically makes the new method not analytically invertible, and one commonly resorts to approximate methods for estimating pre-images [28-30]. For KEMA, this is not straightforward (see page 8). As an alternative, we propose a chain of transforms of different types as a simple yet efficient way of performing the inversion accurately and in closed form.
The reported theoretical advantages translate into clear practical benefits when working with high-dimensional problems and strong distortions in the manifold structures, as illustrated on a large set of synthetic and real applications in the experimental section.

Materials and Methods
In this section, we first recall the linear SSMA algorithm and then derive our proposed KEMA. We discuss its theoretical properties, the stability bounds and propose a reduced rank algorithm, as well as a closed-form inversion strategy.

Semi-supervised manifold alignment
Semi-supervised learning consists in developing inference models that collectively incorporate labeled and unlabeled data in the model definition. In semi-supervised learning (SSL) [31], the algorithm is provided with some labeled information in addition to the unlabeled information, thus allowing it to encode some knowledge about the geometry and the shape of the dataset. There is an overwhelming number of SSL methods in the literature, yet the vast majority of algorithms try to encode the relations between labeled and unlabeled data through the definition of an undirected graph, and more precisely through the graph Laplacian matrix L.
To define L, let us first define a graph G(V, E) with a set of n nodes, V, connected by a set of edges, E. The edge connecting nodes i and j has an associated weight [31]. In this framework, the nodes are the samples and the edges represent the similarity among samples in the dataset. A proper definition of the graph is key to accurately introducing the data structure into the model.
To understand how matrix L is constructed, two mathematical tools have to be introduced [31,32]: First, the adjacency matrix W, which contains the neighborhood relations between samples. It has non-zero entries only between neighboring samples, which are generally found by k-nearest neighbors or an ε-ball distance. Then, the degree matrix D, which is a diagonal matrix of size n × n containing the number of connections to each node (its degree). The Laplacian matrix L is then defined as L = D − W. Intuitively, L measures the variation (i.e. the norm of derivatives, hence the name Laplacian operator) of the decision function along the graph built upon all (labeled and unlabeled) samples [31].
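The construction above can be sketched in a few lines of Python (a minimal illustration with binary edge weights and a k-nearest-neighbor graph; the function name is ours, not part of the KEMA toolbox):

```python
import numpy as np

def knn_graph_laplacian(X, k=3):
    """Build a symmetric k-NN adjacency matrix W (binary weights) and
    return the unnormalized graph Laplacian L = D - W."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.zeros((n, n))
    for i in range(n):
        # indices of the k nearest neighbors (excluding the sample itself)
        nn = np.argsort(sq[i])[1:k + 1]
        W[i, nn] = 1.0
    W = np.maximum(W, W.T)          # symmetrize the graph
    D = np.diag(W.sum(axis=1))      # degree matrix
    return D - W                    # graph Laplacian

rng = np.random.default_rng(0)
L = knn_graph_laplacian(rng.standard_normal((20, 2)), k=3)
```

By construction L is symmetric, positive semi-definite, and has zero row sums.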
When it comes to manifold alignment, an interesting semi-supervised approach was presented in [23]. Let us consider D domains X_i representing similar classification problems. The corresponding data matrices, X_i ∈ R^{d_i×n_i}, i = 1, …, D, contain n_i examples (labeled, l_i, and unlabeled, u_i, with n_i = l_i + u_i) of dimension d_i, and n = Σ_{i=1}^D n_i. The SSMA method [23] maps all the data to a latent space F such that samples belonging to the same class become closer, those of different classes are pushed far apart, and the geometry of the data manifolds is preserved. Therefore, three entities have to be considered, leading to three n × n matrices: 1) a similarity matrix W_s with entries W_s^{ij} = 1 if x_i and x_j belong to the same class, and 0 otherwise (including unlabeled samples); 2) a dissimilarity matrix W_d with entries W_d^{ij} = 1 if x_i and x_j belong to different classes, and 0 otherwise (including unlabeled samples); and 3) a similarity matrix W that represents the topology of each domain, e.g. a radial basis function (RBF) kernel or a k-nearest-neighbors graph computed for each domain separately and joined in a block-diagonal matrix. Since we are not interested in preserving geometrical similarity between the domains (only their inner geometry), all the elements of the off-diagonal blocks of W are zeros. On the contrary, W_s and W_d are defined between the domains and therefore act as registration anchor points in the feature space. An illustrative example of how SSMA works is given in Fig 1. The three entities lead to three graph Laplacians: L_s, L_d, and L, respectively.
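As an illustration, the two label-driven graphs W_s and W_d can be built directly from the stacked label vector (a sketch under our own conventions: we mark unlabeled samples with -1; the function name is ours):

```python
import numpy as np

def class_graphs(y):
    """Build the n x n similarity (W_s) and dissimilarity (W_d) matrices
    from the labels of all domains stacked into one vector; unlabeled
    samples (label -1) get no entries in either matrix."""
    y = np.asarray(y)
    lab = (y != -1)
    both = lab[:, None] & lab[None, :]           # both samples labeled
    Ws = ((y[:, None] == y[None, :]) & both).astype(float)
    np.fill_diagonal(Ws, 0.0)
    Wd = ((y[:, None] != y[None, :]) & both).astype(float)
    return Ws, Wd

# two domains stacked: labels 0/1, with one unlabeled sample
Ws, Wd = class_graphs([0, 1, -1, 0, 1])
```

Since labels are shared across domains, these matrices have non-zero off-diagonal blocks and act as the cross-domain anchors described above.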
Then the SSMA embedding minimizes a joint cost function, essentially given by the eigenvectors corresponding to the smallest non-zero eigenvalues of the generalized eigenvalue problem

Z(L + μL_s)Z⊤ V = λ Z L_d Z⊤ V,

where Z is a block-diagonal matrix containing the data matrices X_i, Z = diag(X_1, …, X_D), and V contains in its columns the eigenvectors, organized in rows for the particular domain, V = [v_1, v_2, …, v_D]⊤; see details in [21,33]. The method allows extracting a maximum of N_f = Σ_{i=1}^D d_i features, which project the data to the common latent domain as P_F(X_i) = v_i⊤ X_i. Advantageously, SSMA can easily project data between domains j and i: first mapping the data in X_j to the latent domain F, and from there inverting back to the target domain X_i as

X_{j→i} = (v_i⊤)† v_j⊤ X_j,

where † denotes the pseudo-inverse of the eigenvectors of the target domain. Therefore, the method can be used for domain adaptation but also for data synthesis. This property was pointed out in [23] and experimentally studied for image analysis in [34].
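A minimal numerical sketch of the SSMA eigenproblem above (assuming the generalized form Z(L + μL_s)Z⊤ V = λ Z L_d Z⊤ V; the small ridge term for numerical conditioning is our own choice, and the Python rendering is illustrative rather than the paper's MATLAB toolbox):

```python
import numpy as np
from scipy.linalg import eigh

def ssma_projections(Z, L, Ls, Ld, mu=1.0, reg=1e-6):
    """Solve Z(L + mu*Ls)Z^T v = lambda * Z Ld Z^T v.  Z is the
    (sum_i d_i) x n block-diagonal data matrix; returns eigenvalues
    in ascending order and the corresponding eigenvectors."""
    A = Z @ (L + mu * Ls) @ Z.T
    B = Z @ Ld @ Z.T
    d = A.shape[0]
    # a small ridge keeps both matrices well-conditioned
    return eigh(A + reg * np.eye(d), B + reg * np.eye(d))
```

In practice one keeps only the eigenvectors associated with the smallest non-zero eigenvalues, up to N_f = Σ_i d_i features.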

Kernel manifold alignment
When using linear algorithms, a well-established theory and efficient methods are often available. Kernel methods exploit this fact by embedding the data set S defined over the input or attribute space X (S ⊆ X) into a higher (possibly infinite) dimensional Hilbert space H, or feature space, and then building a linear algorithm therein, resulting in an algorithm that is nonlinear with respect to the input data space. The mapping function is denoted ϕ: X → H.
Though linear algorithms benefit from this mapping because of the higher dimensionality of the feature space, the computational load would dramatically increase, since we would have to compute sample coordinates in that high-dimensional space. This computation is avoided through the kernel trick: if an algorithm can be expressed in terms of dot products in the input space, its (nonlinear) kernel version only needs the dot products among mapped samples. Kernel methods compute the similarity between training samples S = {x_i}_{i=1}^n using pairwise inner products between mapped samples, and thus the so-called kernel matrix K, with entries K_ij = ⟨ϕ(x_i), ϕ(x_j)⟩ = k(x_i, x_j), contains all the necessary information to perform many classical linear algorithms in the feature space.
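For instance, the kernel trick for the popular RBF kernel amounts to the following sketch (any positive-definite kernel could be plugged in instead; the function name is ours):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    """K_ij = exp(-||x_i - y_j||^2 / (2 sigma^2)): inner products between
    samples mapped to an infinite-dimensional feature space, computed
    without ever forming the mapped coordinates."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))
K = rbf_kernel(X, X, sigma=2.0)
```

The resulting Gram matrix is symmetric positive semi-definite with unit diagonal.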
Kernelization of SSMA. Kernelization of SSMA is apparently straightforward: one should map the data to a Hilbert feature space and then replace all instances of dot products with kernel functions. However, note that in the original formulation of SSMA there are D data sources that need to be mapped to a common feature space first. For doing this, we need to define D different feature mappings, possibly to different Hilbert spaces, and then ensure that the mapped data live in the same subspace in order to perform linear operations therein with all mapped data sources. This can actually be done by resorting to a property from functional analysis [35]: the direct sum of Hilbert spaces.
Theorem 1 (Direct sum of Hilbert spaces [35]): Given two Hilbert spaces, H_1 and H_2, the set of pairs {x, y} with x ∈ H_1 and y ∈ H_2 is a Hilbert space H with inner product ⟨{x_1, y_1}, {x_2, y_2}⟩ = ⟨x_1, x_2⟩_{H_1} + ⟨y_1, y_2⟩_{H_2}. This is called the direct sum of the spaces, H = H_1 ⊕ H_2.
Applying the theorem, the SSMA eigenproblem can be posed in feature space as

Φ(L + μL_s)Φ⊤ U = λ Φ L_d Φ⊤ U,

where Φ is a block-diagonal matrix containing the data matrices Φ_i = [ϕ_i(x_1), …, ϕ_i(x_{n_i})]⊤ and U contains the eigenvectors, organized in rows for the particular domain, defined in Hilbert space. Note that the eigenvectors u_i are of possibly infinite dimension and cannot be explicitly computed. Instead, we resort to the D corresponding Riesz representation theorems [36], so that each eigenvector can be expressed as a linear combination of mapped samples [37], u_i = Φ_i α_i, or in matrix notation U = ΦΛ. This leads to the problem

Φ(L + μL_s)Φ⊤ΦΛ = λ Φ L_d Φ⊤ΦΛ.

Now, pre-multiplying both sides by Φ⊤ and replacing the dot products with the corresponding kernel matrices, we obtain

K(L + μL_s)KΛ = λ K L_d KΛ,

where K is a block-diagonal matrix containing the kernel matrices K_i. The eigenproblem is now of size n × n instead of d × d, and we can extract a maximum of N_f = n features. When a linear kernel is used for all the domains, K_i = X_i⊤X_i, and KEMA reduces to SSMA solved in the dual. This dual formulation is advantageous when dealing with very high-dimensional datasets, d_i ≫ n_i, for which the SSMA problem is not well-conditioned. Operating in Q-mode endows the method with numerical stability and computational efficiency in current high-dimensional problems, e.g. when using Fisher vectors or deep features for data representation. Problems with many more dimensions than points are recurrent nowadays, for example in bioinformatics, chemometrics, and image and video processing. In this sense, even KEMA with a linear kernel is a valid solution for these problems, as it has all the advantages of CCA-like methods but can also deal with unpaired data.
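The resulting dual problem can be sketched as follows (same structure as the primal SSMA problem, but on n × n kernel matrices; the block-diagonal kernel K and the ridge term are our illustrative choices):

```python
import numpy as np
from scipy.linalg import eigh

def kema_projections(K, L, Ls, Ld, mu=1.0, reg=1e-6):
    """Solve K(L + mu*Ls)K alpha = lambda * K Ld K alpha.  K is the
    n x n block-diagonal kernel matrix (one block per domain), so the
    problem size is n x n regardless of the input dimensions d_i."""
    A = K @ (L + mu * Ls) @ K
    B = K @ Ld @ K
    n = K.shape[0]
    return eigh(A + reg * np.eye(n), B + reg * np.eye(n))
```

With a linear kernel per block, K_i = X_i⊤X_i, this is the dual (Q-mode) form of SSMA; with nonlinear kernels it is KEMA proper.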
Projection to the latent space requires first mapping the data X_i to its corresponding Hilbert space H_i, leading to the mapped data Φ_i, and then applying the projection vector u_i defined therein:

P_F(X_i) = u_i⊤Φ_i = α_i⊤Φ_i⊤Φ_i = α_i⊤K_i.

Therefore, projection to the kernel latent space is possible through the use of dedicated reproducing kernel functions. In order to map data from domain X_j to domain X_i with KEMA, we would need to estimate D − 1 inverse mappings from the latent space to the corresponding target domain X_i. Such transformations are highly desirable in order to measure the accuracy of the alignment/adaptation in meaningful physical units. In general, however, using kernel functions hampers the invertibility of the transformation. One can show that if an exact pre-image exists, and if the kernel can be written as k(x, x′) = ψ_k(x⊤x′) with an invertible function ψ_k(·), then the pre-image can be computed analytically under mild assumptions. However, exact pre-images seldom exist, and one resorts to approximate methods such as those in [28-30]. In the case of KEMA, inversion from the latent space to the target domain X_i is even harder, which hampers the use of standard pre-imaging techniques. Standard pre-image methods in kernel machines [28-30] typically assume a particular kernel method (e.g. kPCA) endowed with a particular kernel function (often the polynomial or the squared exponential). If other kernel functions are used, the formulation has to be derived again. Remember that our KEMA feature vectors in the latent space were obtained using a complex (and supervised) function that considers labeled and unlabeled samples from all available domains through the composition of kernel functions and graph Laplacians.
One could derive the equations for pre-imaging under our eigenproblem setting, K′ := K_s^{-1}K_d with K_s := K(L + μL_s)K and K_d := KL_dK, but this would be very complicated, data-dependent, and sensitive because of the several hyperparameters involved. Another alternative could be a sort of multidimensional regression (from the latent space to X_i), in the vein of the kernel dependency estimation (KDE) method reviewed in [29], but the approach would be complicated (there is no guarantee that a kernel reproducing the inverse mapping implicit in K′ exists), computationally demanding (many hyperparameters appear), and would not deliver a closed-form solution.
Here we propose a simple alternative solution to the mapping inversion: use a linear kernel for the latent-to-target transformation, K_i = X_i⊤X_i, while K_j for j ≠ i can take any desired form. Following this intuition, the projection of data X_j to the target domain i becomes

X_{j→i} = (α_i⊤X_i⊤)† α_j⊤K_j,

where for the target domain we used u_i = Φ_i α_i = X_i α_i. Note that the solution is not unique, since D different inverse solutions can be obtained depending on the selected target domain. Using different transforms to perform model inversion was also recently studied in [38]: here, instead of using an alternate scheme, we perform direct inversion by chaining different transforms, leading to an efficient closed-form solution. Such a simple idea yields impressive results in practice (see the experimental section, page 14).
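A sketch of this chained mapping (domain j → latent with any kernel, latent → domain i with the linear kernel, inverted via a pseudo-inverse as in the equation above; variable names are ours):

```python
import numpy as np

def map_j_to_i(kj_x, alpha_j, X_i, alpha_i):
    """Map a domain-j sample to domain i.  kj_x holds the kernel
    evaluations k_j(x_1, x), ..., k_j(x_{n_j}, x); alpha_j and alpha_i
    are the dual eigenvector blocks; X_i is the d_i x n_i data matrix
    of the (linear-kernel) target domain."""
    z = alpha_j.T @ kj_x            # latent coordinates of the sample
    M = alpha_i.T @ X_i.T           # latent = M @ x for domain i
    return np.linalg.pinv(M) @ z    # closed-form pre-image in domain i
```

As a sanity check, when the "source" domain is domain i itself with a linear kernel, the chain recovers the original sample exactly (provided M has full column rank).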

Computational efficiency and stability of KEMA
One of the main shortcomings of KEMA is related to the computational cost, since two n × n kernel matrices are involved, where n = Σ_{i=1}^D n_i. KEMA's complexity scales quadratically with n in memory and cubically in computation time. Projection of new data also requires the evaluation of n kernel functions per example, becoming computationally expensive for large n. To alleviate this problem, we propose two alternatives to speed up KEMA: a reduced-rank approximation (REKEMA) and a randomized features approximation (rKEMA). We compare both approaches in CPU time, and for rKEMA we study the convergence bound in ℓ2-norm based on matrix Bernstein inequalities. Finally, we study the stability of the obtained solution when solving a (regularized) generalized eigenproblem using a finite number of samples, based on Rademacher principles.

Reduced rank approximation
The reduced-rank Kernel Manifold Alignment (REKEMA) formulation imposes reduced-rank solutions for the projection vectors, U = Φ_r Λ, where Φ_r is a subset of the training data containing r samples (r ≪ n) and Λ is the new argument of the maximization problem. Plugging U into Eq (5) and replacing the dot products with the corresponding kernels yields

K_{rn}(L + μL_s)K_{rn}⊤ Λ = λ K_{rn} L_d K_{rn}⊤ Λ,

where K_{rn} is a block-diagonal matrix containing the kernel matrices K_i comparing the reduced set of r representative vectors with all n training data points. REKEMA brings clear benefits in obtaining the projection vectors (the eigenproblem becomes of size r × r instead of n × n, so the computational cost becomes O(r³), r ≪ n), in compactness of the solution (now N_f = r ≪ n features), and in storage requirements (O(r²)). We want to highlight that this is not simple subsampling, because the model considers correlations between all training data and the reduced subset through K_{rn}. The selection of the r points can be done in different ways and with different degrees of sophistication: close to centroids provided by a pre-clustering stage, extremes of the convex hull, sampling to minimize the reconstruction error or preserve information, forming compact bases in feature space, etc. While such strategies are crucial in low-to-moderate sample-size regimes, random selection offers an easy way to select the r points and is the most widely used strategy. Fig 2A shows the evolution of the computational cost as a function of the number of (randomly selected) samples r in a toy example of aligning two spirals (cf. experiment #1 in the experiments section).

Random features approximation
A recent alternative to reduced-rank approximations exploits Bochner's classical theorem from harmonic analysis, recently introduced into the field of kernel methods [39]. Bochner's theorem states that a continuous shift-invariant kernel k(x, y) = k(x − y) on R^d is positive definite if and only if k is the Fourier transform of a non-negative measure. If a shift-invariant kernel k is properly scaled, its Fourier transform p(w) is a proper probability distribution. This property is used to approximate kernel functions and matrices with linear projections on m random features:

k(x, y) ≈ (1/m) Σ_{i=1}^m 2 cos(w_i⊤x + b_i) cos(w_i⊤y + b_i),

where p(w) is the inverse Fourier transform of k, w_i ∼ p(w), and b_i ∼ U(0, 2π) [39]. Therefore, we can randomly sample parameters w_i ∈ R^d from the data-independent distribution p(w) and construct an m-dimensional randomized feature map z(·): X → Z for data X ∈ R^{n×d}, with mapped data Z ∈ R^{n×m}. For a collection of n data points, {x_i}_{i=1}^n, the kernel matrix K ∈ R^{n×n} can then be approximated with the explicitly mapped data: K ≈ ZZ⊤. The Gaussian kernel k(x, y) = exp(−‖x − y‖²/(2σ²)) can be approximated using w_i ∼ N(0, I/σ²). For the case of KEMA, we have to sample twice, hence obtaining two sets of vectors and associated matrices Z_s and Z_d, to approximate the similarity and dissimilarity kernel matrices, K_s := K(L + μL_s)K ≈ Z_s Z_s⊤ and K_d := KL_dK ≈ Z_d Z_d⊤. The associated cost of the random features approximation reduces to O(nm²), see Fig 2B. It is also important to note that solving the generalized eigenvalue problem in KEMA feature extraction with random features converges in ℓ2-norm error as O(m^{−1/2}) and logarithmically in the number of samples when using an appropriate random parameter sampling distribution [40] (see the Appendix).
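A sketch of the random features approximation for the Gaussian kernel (the √(2/m)·cos feature map of [39]; the sampling of W and b follows the recipe in the text, while the function name is ours):

```python
import numpy as np

def random_fourier_features(X, m, sigma, seed=0):
    """z(x) = sqrt(2/m) * cos(W x + b), with rows of W ~ N(0, I/sigma^2)
    and b ~ U(0, 2*pi), so that Z Z^T approximates the Gaussian kernel
    K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = rng.standard_normal((m, d)) / sigma
    b = rng.uniform(0.0, 2.0 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(X @ W.T + b)
```

Since the error decays as O(m^{−1/2}), a few thousand features already give a close match to the exact Gram matrix on moderate datasets.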

Stability of KEMA
The use of KEMA in practice raises, however, the important question of the amount of data needed to provide an accurate empirical estimate, and how the quality of the solution differs depending on the datasets. Such results have been previously derived for KPCA [41] and KPLS [42] and here we extend them to our generalized eigenproblem setting. We focus on the concentration of sums of eigenvalues of the generalized KEMA eigenproblem solved using a finite number of samples, where new points are projected into the m-dimensional space spanned by the m eigenvectors corresponding to the largest m eigenvalues.
Following the notation in [41], we refer to the projection onto a subspace U of the eigenvectors of our eigenproblem as P_U(ϕ(x)). We denote the projection onto the orthogonal complement of U by P_{U⊥}(ϕ(x)). The norm of the orthogonal projection is also referred to as the residual, since it corresponds to the distance between a point and its projection.
Theorem 2 (Th. 1 and 2 in [41]) Let us define K_s := K(L + μL_s)K and K_d := KL_dK. If we perform KEMA in the feature space defined by K* := K_s^{-1}K_d, then with probability greater than 1 − δ over n random samples S, for all 1 ≤ r ≤ n, if we project data on the space Û_r, the expected squared residual is bounded above and below in terms of the process and empirical eigenvalues, λ_i and λ̂_i respectively, where the support of the distribution is in a ball of radius R in the feature space.
Theorem 3 (Regularized KEMA) The previous theorem holds only when the inverse K_s^{-1} exists. Otherwise, we typically resort to matrix conditioning via regularization. Among the many possibilities for problem conditioning, the standard direct Tikhonov-Arnoldi approach solves the generalized eigenproblem on a shifted and inverted matrix, which damps the eigenvalues. We now aim to bound a well-conditioned, regularized version of the matrix. It is easy to show that its estimated eigenvalues θ̂_i are related to the unregularized ones as λ̂_j = θ̂_j/(1 − γθ̂_j). Therefore, with probability greater than 1 − δ over n random samples S, for all 1 ≤ r ≤ n, if we project data on the space Û_r, the expected squared residual is bounded analogously in terms of θ_i and θ̂_i, the process and empirical eigenvalues, where the support of the distribution is in a ball of radius R in the feature space.
In either case, the lower bound confirms that a good representation of the data can be achieved by using the first r eigenvectors if the empirical eigenvalues decrease quickly before √(ℓ/n) becomes large, while the upper bound suggests that a good approximation is achievable for values of r where √(r/n) is small. These results can be used as a benchmark to test different approaches or to select among candidate kernels. Also note that, depending on how far K* (or K′) is from diagonal, i.e. how large the manifold mis-alignments are, the KEMA bounds may be tighter than those of KPCA. With an appropriate estimation of the manifold structures via the graph Laplacians and tuning of the kernel parameters, the performance of KEMA will be at least as good as that of KPCA. Note that when intense regularization is needed, the trace of the squared K′ can be upper bounded by 1/(nγ²), and then the expected squared residuals are mainly governed by n and γ.

Results and discussion
We analyze the behavior of KEMA in a series of artificial datasets of controlled level of distortion and mis-alignment, and on real domain adaptation problems of visual object recognition from multi-source commercial databases and recognition of multi-subject facial expressions.

Toy examples with controlled distortions and manifold mis-alignments
Setup. The first set of experiments considers a series of toy examples composed of two domains with data matrices X 1 and X 2 , which are spirals with three classes (see the first two columns of Fig 3). Then, a series of deformations is applied to the second domain: scaling, rotation, inversion of the order of the classes, a change in the shape of the domain (spiral or line), or a change in the data dimensionality (see Table 2). These experiments are designed to study the flexibility of KEMA in handling alignment problems of increasing complexity and between data of different dimensionality (Ex. #2). The last experiment (#6) considers the same setting as Exp. #1, but adds 50 features of Gaussian noise to the two informative features.
For each experiment, 60 labeled samples per class were drawn in each domain, as well as 1000 randomly selected unlabeled samples. Classification performance was assessed on 1000 held-out samples from each domain. The toy classification results can be reproduced using the MATLAB toolbox available at https://github.com/dtuia/KEMA.git. The σ bandwidth parameter of the RBF kernel was set in each domain to half the median distance between all samples in that domain, thus enforcing a domain-specific metric.
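This bandwidth rule is simple enough to state as code (an illustrative sketch; `median_sigma` is our name for it, not a toolbox function):

```python
import numpy as np

def median_sigma(X):
    """Half the median pairwise Euclidean distance within a domain,
    used as the per-domain RBF bandwidth sigma."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    # keep only the strictly upper-triangular (distinct) pairs
    return 0.5 * np.median(dist[np.triu_indices_from(dist, k=1)])
```

Because it is computed per domain, each domain gets its own metric, as described above.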
Latent space and domain adaptation. Fig 3 illustrates the projections obtained by KEMA when using a linear and an RBF kernel (the lengthscale was set as the average distance between labeled samples). Looking at the alignment results, we observe that the linear KEMA lin aligns the domains effectively only in experiments #1 and #4, which are basically scalings and rotations of the data. However, it fails on experiments #2, #3 and #5, where the manifolds have undergone stronger deformations. The use of a nonlinear kernel (KEMA RBF ) allows a much more flexible solution, performing a discriminative transform plus alignment in all experiments. In Experiment #6, even though the two discriminative dimensions (out of 52) are the same as in Exp. #1, only KEMA RBF can align the data effectively, since KEMA lin is strongly affected by the noise and returns a non-discriminative alignment for the eigenvectors corresponding to the smallest eigenvalues.
Classification performances. Fig 4 reports the classification errors obtained by a linear discriminant analysis (LDA, Fig 4A) and the nearest neighbor classifier (1-NN, Fig 4B). For each classifier, classification errors are reported for the samples from the source domain (left inset) and the target domain (right inset). LDA is used to show the ability of projecting the domains into a joint discriminative latent space, where even the simplest linear classifier can be successful. 1-NN is used to show the increase in performance that can be obtained by using a nonlinear, yet simple, classifier on top of the projected data.
When using a linear model (LDA), a large improvement of KEMA RBF over KEMA lin (and thus over SSMA) is observed. In experiment #1, even if the alignment is correct (Fig 3), the linear classifier trained on the projections of KEMA lin cannot resolve the classification of the two domains, while the KEMA RBF solution provides a latent space where both domains can be classified correctly. Experiment #2 shows a different picture: the baseline error (green line in Fig 4) is much smaller in the source domain, since the dataset in 3D is linearly separable. Even if the classification of this first domain (red square in Fig 3) is correct for all methods, classification after SSMA/KEMA lin projection of the second domain (blue x in Fig 3) is poor, since their projection in the latent space does not "unfold" the blue spiral. KEMA RBF provides the best result. For experiment #3, the same trend as in experiment #2 is observed. Experiments #4 and #5 deal with reversed classes (the brown class is the top one in the source domain and the bottom one in the target domain). In both experiments, we observe a very accurate baseline (both domains are linearly separable in their own input spaces), but only KEMA RBF provides the correct match in a low-dimensional latent space (2 dimensions), including a discriminative V-shaped projection leading to nearly 0% errors on average; KEMA lin requires 5 dimensions to achieve a correct manifold alignment and a classification as accurate as the baseline (which still includes misclassifications in the linear classifier). The misclassifications can be explained by the projected space (3rd and 4th columns in Fig 3), where classes are aligned at best, but no real matching of the two data clouds is performed.
The last experiment (#6) deals with noisy data, where only two out of the 52 dimensions are discriminative: KEMA RBF finds the two first eigenvectors that align the data accurately (classification errors close to 0% in both domains), while KEMA lin shows a much noisier alignment that, due to the rigidity of a linear transform, leads to about 20% misclassification in both domains.
When using the nonlinear 1-NN, both KEMA RBF and KEMA lin perform similarly. KEMA RBF still leads to correct classification with close to zero errors in all cases, thus confirming that the latent space projects samples of the same class close together. KEMA lin leads to correct classification in almost all cases, since the 1-NN can cope with multimodal class distributions and nonlinear patterns in the latent space. KEMA lin still fails in Exp. #3, where the projection of the source domain (red circle in Fig 3) stretches over the target domain, and in Exp. #6, where the latent space is not discriminative and harms the performance of the 1-NN.
Alignment with REKEMA. We now consider the proposed reduced-rank approximation of KEMA, using the data of experiment #1 above. Fig 5 illustrates the solutions of standard SSMA (or KEMA lin ) and of REKEMA for different rates of training samples (we used l i = 100 and u i = 50 per class for both domains); it also gives the classification accuracies of an SVM (with both a linear and an RBF kernel) in the projected latent space. Samples were randomly chosen, and the sigma parameter for the RBF kernel in KEMA RBF was fixed to the average distance between all used labeled samples. We observe that SSMA successfully aligns the two domains, but nonlinear classification is still needed to achieve good results. REKEMA, on the contrary, essentially performs two operations simultaneously: it aligns the manifolds and increases class separability. Excessive sparsification leads to poor results, while virtually no difference between the full and the reduced-rank solutions is observed for small rates.
Invertibility of the projections. Fig 6 shows the results of inverting SSMA and KEMA (using Eq (9)) on the previous toy examples (we excluded Exp. #6 to avoid synthesizing data with 50 noisy dimensions). We use a linear kernel for the inversion part (latent-to-source) and an RBF kernel for the direct part (target-to-latent). All results are shown in the source domain space. All other settings (number of labeled and unlabeled samples, μ, graphs) are kept as in the experiments shown in Fig 3. The reconstruction error, averaged over 10 runs, is also reported: KEMA RBF→lin is capable of inverting the projections and is always as accurate as SSMA in the simplest cases (#1, #4).
For the cases with higher levels of deformation, KEMA is either as accurate as SSMA (#3, where the inversion is basically a projection on a line) or significantly better: in experiment #2, where the two domains are strongly deformed, and experiment #5, where we deal with both scaling and inverted classes, only KEMA RBF→lin achieves a satisfactory inversion, as it "unfolds" the target domain and then only needs a rotation to match the distribution in the source domain.
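With a linear kernel on the inversion side, mapping latent coordinates back to source features amounts to a regularized linear regression. The paper's Eq (9) is not reproduced in this excerpt, so the following is only an illustrative sketch under that assumption, with fabricated data standing in for learned KEMA projections.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in: latent coordinates and source-domain features related by a
# near-linear map (in KEMA these would come from the learned projections).
Z = rng.normal(size=(200, 3))                            # latent coordinates
A_true = rng.normal(size=(3, 5))
X_src = Z @ A_true + 0.01 * rng.normal(size=(200, 5))    # source-domain features

# Ridge regression latent -> source, analogous to using a linear kernel
# for the latent-to-source direction.
lam = 1e-3
W = np.linalg.solve(Z.T @ Z + lam * np.eye(3), Z.T @ X_src)

X_rec = Z @ W
err = np.linalg.norm(X_rec - X_src) / np.linalg.norm(X_src)
print(f"relative reconstruction error: {err:.4f}")  # small when the map is near-linear
```

When the latent-to-source relation is close to linear, as in the simplest toy cases, this inversion is accurate; strongly deformed domains are precisely where the nonlinear direct mapping of KEMA RBF→lin pays off.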

Visual object recognition in multi-modal datasets
We here evaluate KEMA on visual object recognition tasks using the Office-Caltech dataset introduced in [2]. We consider the four domains Webcam (W), Caltech (C), Amazon (A) and DSLR (D), and selected the 10 classes common to the four datasets following [13]. By doing so, the domains contain 295 (Webcam), 1123 (Caltech), 958 (Amazon) and 157 (DSLR) images, respectively. The features were extracted in two ways: SURF bag-of-words descriptors and deep DeCAF features (both considered below).

Experimental setup. We compare our proposed KEMA with the following unsupervised and semi-supervised domain adaptation methods: GFK [13], OT-lab [15] and JDA [26]. We used the same experimental setting as [13] in order to compare with these unsupervised domain adaptation methods. For all methods, we used 20 labeled samples per class in the source domain for the C, A and W domains, and 8 samples per class for the D domain. After alignment, an ordinary 1-NN classifier was trained with the labeled samples. The same labeled samples in the source domain were used to define the PLS eigenvectors for GFK and OT-lab. For all the methods using labeled samples in the target domain (including KEMA), we used 3 labeled samples in the target domain to define the projections.
We used a kernel well suited to this problem in KEMA: the (fast) histogram intersection kernel [44]. Using a χ2 kernel resulted in similar performance. We used u = 300 unlabeled samples to compute the graph Laplacians, for which a k-NN graph with k = 21 was used.
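For reference, the histogram intersection kernel has the simple closed form K(h, h') = Σ_k min(h_k, h'_k) for nonnegative histogram features; a minimal NumPy sketch:

```python
import numpy as np

def histogram_intersection_kernel(A, B):
    """K[i, j] = sum_k min(A[i, k], B[j, k]) for nonnegative histogram rows."""
    # Broadcasting: (n, 1, d) vs (1, m, d) -> (n, m, d), then sum over bins.
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=-1)

# Tiny example with 3-bin L1-normalized histograms.
A = np.array([[0.2, 0.5, 0.3],
              [0.6, 0.1, 0.3]])
K = histogram_intersection_kernel(A, A)
print(K)
# Diagonal entries equal 1 for L1-normalized histograms: min(h, h) sums to sum(h) = 1.
```

The broadcast builds an (n, m, d) intermediate array, which is fine for a sketch; large bag-of-words matrices would call for a blocked implementation.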
Numerical results. The projections obtained by KEMA in the visual object recognition experiments remain discriminative, as shown in Fig 7, where projections onto the first three dimensions of the latent space are reported for the A→W (top) and C→A (bottom) problems using the SURF features. The numerical results obtained in all eight problems are reported in Table 3: KEMA outperforms the unsupervised GFK and, in most cases, improves upon the results obtained by the semi-supervised methods using labels in the source domain only. KEMA provides the most accurate results in 5 out of the 8 settings. KEMA is thus as accurate as the state of the art, but with the advantage of naturally handling domains of different dimensionality. The results obtained when using the deep DeCAF features are reported in Table 4: a strong improvement in performance is observed for all methods. This general increase was expected, since the deep features in DeCAF are naturally suited for domain adaptation (they are extracted with fine tuning on this specific dataset); nonetheless, even if the boost in performance is visible for all the methods (including the case without adaptation), KEMA improves performance even further and leads to the best average results. Looking at the single experiments, KEMA most often performs on a tie with OT-lab [15]. Summing up, KEMA leads to results as accurate as the state of the art, but is much more versatile, since it handles unpaired data, works with datasets of different dimensionality, and has a significantly smaller computational load (see also Table 1 for a taxonomical comparison of the properties of the different methods).

Recognition of facial expressions in multi-subject databases
This experiment deals with the task of recognizing facial expressions. We used the dataset in [45], where 185 photos of three subjects depicting three facial expressions (happy, neutral and shocked) are available. The features are included in the MATLAB package on https://github.com/dtuia/KEMA.git. Alternatively, they can be downloaded from their original repository at http://www.cc.gatech.edu/lsong/code.html. Each image is 217 × 308 pixels and we take each pixel as one dimension for classification (a 66836-dimensional problem). Each pair {subject, expression} has around 20 repetitions.

Experimental setup. The different subjects represent the domains, and we align them with respect to the three expression classes. We used only three labeled examples per class and subject; we held out 70% of the data for testing and used the remaining 30% (55 samples) for the extraction of the labeled samples. The examples not selected as labeled points are used as unlabeled data. The three domains are aligned simultaneously into a common latent space, and then all classifications are run therein for all subjects. Below, we report the results obtained with an LDA classifier trained in that common latent space. We consider three experimental settings:

• Single resolution: all images are considered at their maximal resolution in all three domains. Each domain is therefore a 66836-dimensional dataset. SSMA could not handle these data, since it would involve a 200508-dimensional eigendecomposition.
• Multiresolution, factor 2: the resolution of one of the domains (Subject #1) is downgraded by a factor of two, to 154 × 109 pixels, leading to a 16786-dimensional domain. The alignment problem in the primal would then be 16786 + (2 × 66836) = 150458-dimensional. With this experiment, we aim to show the capability of KEMA to handle data of different dimensionality.
• Multiresolution, factor 4: the resolution of one of the domains (Subject #1) is downgraded by a factor of four, to 62 × 44 pixels, leading to a 2728-dimensional domain. The alignment problem in the primal would then be 2728 + (2 × 66836) = 136400-dimensional.
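The dimensional bookkeeping of the three settings above can be verified directly:

```python
# Per-domain dimensionalities from the image sizes used in the three settings.
full = 217 * 308      # full-resolution images, one dimension per pixel
half = 154 * 109      # Subject #1 downgraded by a factor of two
quarter = 62 * 44     # Subject #1 downgraded by a factor of four

print(full)               # 66836-dimensional domain
print(3 * full)           # 200508: single-resolution primal eigenproblem
print(half + 2 * full)    # 150458: factor-2 multiresolution primal problem
print(quarter + 2 * full) # 136400: factor-4 multiresolution primal problem
```

These primal sizes are exactly what makes SSMA impractical here, while KEMA's dual formulation depends only on the number of samples.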
Numerical results. Average results over ten realizations are given in Fig 8: since it works directly in the dual, KEMA can effectively cast the three-domain problem into a low dimensional space. In the single resolution case (Fig 8B) all domains are classified with less than 5% error. This shows an additional advantage of KEMA with respect to SSMA in high dimensional spaces: SSMA would have required solving a 200508-dimensional eigenproblem, while KEMA solves only a 55-dimensional problem. Subject #1 seems to be the most difficult to align with the two others, a difficulty that is also reflected in the higher classification errors. Indeed, subject #1 shows little variation in facial traits from one expression to the other compared to the other subjects (see Fig 3 in [45]).
In the multiresolution cases, similar error rates are observed for subjects #2 and #3, even though the images of subject #1 were of coarser resolution. The reduced resolution of the images of subject #1 made expression recognition harder, but error rates lower than 20% are still achieved using KEMA. Looking at the projections (second and third rows of Fig 8), those of the multiresolution experiment with a factor-2 reduction ((B) panel) are very similar to those of the single resolution experiment ((A) panel).

Conclusions
We introduced a kernel method for semi-supervised manifold alignment. We want to stress that this particular kernelization goes beyond the standard academic exercise, as the method addresses many open problems in the domain adaptation and manifold learning literature. The so-called KEMA can align an arbitrary number of domains of different dimensionality without needing corresponding pairs, just a few labeled examples in all domains. We also showed that KEMA generalizes SSMA when using a linear kernel, which allows us to deal with high-dimensional data efficiently in the dual form. Working in the dual can be computationally costly because of the construction of the graph Laplacians and the size of the involved kernel matrices. The Laplacians can be computed just once and offline, while regarding the size of the kernels, we introduced a reduced-rank version that works with a fraction of the samples while maintaining the accuracy of the representation. Advantageously, KEMA can align manifolds of very different structures and dimensionality, performing a discriminative transform along with the alignment. We have also provided a simple yet effective way to map data between domains, as an alternative to standard pre-imaging techniques in the kernel methods literature. This is an important feature that enables synthesis applications and, more remarkably, allows studying and characterizing the distortion of the manifolds in physically meaningful units. To the authors' knowledge, this is the first method to address all these issues at once. All these features were illustrated through toy examples of increasing complexity (including data of different dimensionality, noise, warps and strong nonlinearities) and real problems in visual object recognition and facial expression recognition, thus showing the versatility of the method and its interest for numerous application domains.
It does not escape our attention that KEMA may become a standard multivariate method for data preprocessing in general applications where multisensor, multimodal, sensory data is acquired.

Appendix: Error bounds for rKEMA
We now derive bounds on the approximation of kernel matrices from products of their approximations through random projection matrices. First we recall the Hermitian Matrix Bernstein theorem, which is then used to derive the bound on rKEMA.
Theorem 4 (Matrix Bernstein, [46]) Let Z_1, . . ., Z_m be independent n × n random matrices. Assume that E[Z_i] = 0 and that the norm of each matrix is bounded, ‖Z_i‖ ≤ R. Define the variance parameter σ² := max{‖Σ_i E[Z_i Z_i^⊤]‖, ‖Σ_i E[Z_i^⊤ Z_i]‖}. Then, for all t ≥ 0,

P{‖Σ_i Z_i‖ ≥ t} ≤ 2n exp(−t² / (3σ² + 2Rt)),  and  E‖Σ_i Z_i‖ ≤ √(3σ² log(n)) + R log(n). (17)

Theorem 5 Given two kernel matrices K_d and K_s, we aim to solve the eigensystem of K_d^{−1} K_s. Let us define the corresponding kernel approximations K̃_d, K̃_s using m_d, m_s random features as in Eq (12), respectively, and m := min(m_d, m_s). Then, by Theorem 4, the ℓ2 approximation error can be bounded as

E‖K̃_d^{−1} K̃_s − K_d^{−1} K_s‖ ≤ √(3σ² log(n)) + R log(n),  with σ² ≤ B²‖K_s‖²/m,

which, since B and ‖K_s‖ are upper-bounded by n, decays as O(n² √(log(n)/m)).

Proof 1 For the sake of simplicity, let us rename D̃ = K̃_d^{−1} and D = K_d^{−1}. We follow a derivation similar to [47] for randomized nonlinear CCA. The total error matrix can be decomposed as a sum of individual error terms, E = Σ_{i=1}^{m_s} E_i (their explicit expression follows [47]), where B is a bound on the norm of the randomized feature map, ‖z‖² ≤ B. The variance is defined as σ² := max{‖Σ_{i=1}^{m_s} E[Z_i Z_i^⊤]‖, ‖Σ_{i=1}^{m_s} E[Z_i^⊤ Z_i]‖}. Expanding the individual terms in the (first) summand, taking the norm of the expectation, and using Jensen's inequality, we obtain ‖E[Z_i^⊤ Z_i]‖ ≤ B²‖K_s‖²/m², and the same for ‖E[Z_i Z_i^⊤]‖; therefore the worst-case estimate of the variance is σ² ≤ B²‖K_s‖²/m_s. The bound is readily obtained by appealing to the matrix Bernstein inequality (Theorem 4) and using the fact that random features and kernel evaluations are upper-bounded by 1, and thus both B and ‖K‖ are upper-bounded by n.
Theorem 6 Equivalently, we can define the corresponding bound for a Tikhonov-regularized problem (K_d + γI)^{−1} K_s, with approximation (K̃_d + γI)^{−1} K̃_s, where γ > 0 is a regularization parameter.

Proof 2 The demonstration is straightforward, following the same rationale and derivations as Theorem 5 and simply bounding ‖(K_d + γI)^{−1}‖₂ ≤ 1/γ. Interestingly, the bound is exactly the same as that of randomized nonlinear CCA in [47] for the case of paired examples in the domains and no graph Laplacian terms.

Fig 9 shows the absolute error committed when approximating the corresponding kernels of rKEMA with random features, along with the derived theoretical bound. We analyze the error as a function of m (for the sake of simplicity we used m_d = m_s = m) and of the number of samples n. The curves are the result of 300 realizations. The reported results match the previous bound: we observe the expected O(m^{−1/2}) decay as a function of m (linear in log-log scale), and an n log(n) growth with the number of training examples.
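The O(m^{−1/2}) decay can be reproduced empirically. The sketch below uses random Fourier features for an RBF kernel (a standard choice of random features; the specific feature map of Eq (12) is not reproduced in this excerpt) and measures the spectral-norm error of the kernel approximation as m grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(X, gamma=0.5):
    """Exact RBF Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def random_fourier_features(X, m, gamma=0.5):
    """Rahimi-Recht features: z(x) = sqrt(2/m) cos(Wx + b), W ~ N(0, 2*gamma*I)."""
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, m))
    b = rng.uniform(0, 2 * np.pi, size=m)
    return np.sqrt(2.0 / m) * np.cos(X @ W + b)

n, d = 100, 5
X = rng.normal(size=(n, d))
K = rbf_kernel(X)

errors = []
for m in (10, 100, 1000, 10000):
    Z = random_fourier_features(X, m)
    errors.append(np.linalg.norm(Z @ Z.T - K, 2))  # spectral-norm error
print(errors)  # error shrinks roughly like O(m**-0.5) as m grows
```

Plotting these errors against m on a log-log scale gives the straight line of slope −1/2 referred to in the discussion of Fig 9.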