Multiple Cayley-Klein metric learning

As a specific kind of non-Euclidean metric defined in projective space, the Cayley-Klein metric has recently been introduced in metric learning to deal with the complex data distributions in computer vision tasks. In this paper, we extend the original Cayley-Klein metric to the multiple Cayley-Klein metric, which is defined as a linear combination of several Cayley-Klein metrics. Since the Cayley-Klein metric is non-linear, such a combination can model the data space better and thus lead to improved performance. We show how to learn a multiple Cayley-Klein metric by iteratively optimizing the individual Cayley-Klein metrics and their combination coefficients, with the objective of separating inter-class instances and gathering intra-class instances. Our experiments on several benchmarks are quite encouraging.


Introduction
An effective distance metric is of great importance for many computer vision and pattern recognition applications such as clustering [1], retrieval [2,3] and classification [4,5]. Research has shown that the widely used Euclidean metric mainly performs well under an isotropic assumption on the data space. Therefore, its performance is usually limited, since it cannot reasonably reflect the underlying relationships between input instances [6-9]. To take the correlation among different data dimensions into consideration, using the Mahalanobis metric is a popular solution.
Because designing a specific Mahalanobis metric for a specific task is difficult, learning a Mahalanobis-like distance metric from labeled data has attracted growing attention in recent years [10,11]. The underlying idea of Mahalanobis metric learning is to define an application-dependent metric that captures the characteristics of the data. It aims to learn a positive semi-definite (PSD) matrix $M$ that defines a specific Mahalanobis metric, i.e., $d^2(x, y) = (x - y)^T M (x - y)$. Different learning objectives have been proposed in the literature, for example, to maximize the distances between dissimilar samples while constraining the distances between similar samples [12], or to maximize the margin between similar pairs and dissimilar pairs [11].
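As a small illustration of the Mahalanobis form $d^2(x, y) = (x - y)^T M (x - y)$, the following sketch evaluates it in plain Python. The function name `mahalanobis_sq` and the toy matrix and points are hypothetical and chosen only for illustration; they are not taken from any of the cited methods.

```python
def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for list inputs."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    # Matrix-vector product M @ diff, computed row by row.
    M_diff = [sum(m_ij * d_j for m_ij, d_j in zip(row, diff)) for row in M]
    return sum(d_i * md_i for d_i, md_i in zip(diff, M_diff))


# Hypothetical 2-D example: M = I recovers the squared Euclidean distance,
# while a non-identity PSD matrix re-weights and correlates the dimensions.
identity = [[1.0, 0.0], [0.0, 1.0]]
M = [[2.0, 0.5], [0.5, 1.0]]
x, y = [1.0, 2.0], [0.0, 0.0]

d_euclid = mahalanobis_sq(x, y, identity)
d_mahal = mahalanobis_sq(x, y, M)
```

With $M$ set to the identity the value coincides with the squared Euclidean distance, which is why metric learning in this family reduces to choosing $M$.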
Although Mahalanobis metric learning has been successfully applied in many applications, the Mahalanobis metric is essentially a linear metric. However, it is widely believed that the high-dimensional data space encountered in computer vision applications is essentially non-linear. Therefore, researchers have resorted to more complicated non-linear metrics to pursue higher performance. These attempts include local metric learning [11,13-16], kernel metric learning [17] and the most recently proposed Cayley-Klein metric learning [18].
This paper follows the work on Cayley-Klein metric learning in [18] and proposes a multiple Cayley-Klein metric learning method. It learns several Cayley-Klein metrics and their linear combination weights to form a powerful non-linear metric. Each of the combined Cayley-Klein metrics focuses on a part of the data space and can be considered a locally optimized metric on a part of the training data. To achieve this goal, we first partition the training data into clusters according to their label information. Each cluster is assigned a local Cayley-Klein metric, which is learned only on the training data from the related cluster. Once these Cayley-Klein metrics have been learned, their combination weights are optimized by maximizing the distances between inter-class instances while restricting the distances between intra-class instances to be smaller than an upper bound. Combining these local metrics leads to a more powerful global metric for the whole data space. The local Cayley-Klein metrics and their weights are optimized iteratively towards a distance metric with high classification performance.

Related work
In this section, we will first review some related work under the topic of metric learning. Then, we move to a brief introduction to the Cayley-Klein geometries as a basis of our method.

Metric learning
When the general Euclidean distance cannot fulfill the requirements of a computer vision application, it is straightforward to exploit the label information and the intrinsic structure of the training data to learn a specific but more powerful distance metric for the given task.
Most works in the literature have focused on Mahalanobis metric learning. An early work on Mahalanobis metric learning is MMC, proposed by Xing et al. [12]. It aims to learn a positive semi-definite metric matrix by maximizing the distances between instances from different classes while restricting the distances between instances from the same class to be smaller than a fixed upper bound. Based on this objective, they formulated the metric learning problem as a convex optimization problem, which is solved by semidefinite programming. A similar objective has been used as constraints by Davis et al. [10]. Subject to these constraints, Davis et al. proposed Information-Theoretic Metric Learning (ITML), which minimizes the differential relative entropy. Instead of restricting the intra-class distances below an upper bound, Globerson and Roweis [19] proposed to make them zero. Guillaumin et al. [20] proposed a discriminative linear logistic regression for Mahalanobis metric learning. Other well-known works include LMNN [11], which learns a Mahalanobis distance metric such that the k-nearest neighbors always lie in the same class while instances from different classes are separated by a large margin. By replacing the hinge loss in LMNN with the exponential loss, Shen et al. [21] proposed BoostMetric. They further proposed FrobMetric by adding a general Frobenius norm as a regularization term to the objective function [22]. More recently, Lu et al. [23] proposed a neighborhood repulsed metric learning method for kinship verification. Their target is to learn a distance metric such that intra-class samples are pulled as close as possible and inter-class samples lying in a neighborhood are repulsed and pushed as far away as possible. Wang et al. [24] proposed Shrinkage Expansion Adaptive Metric Learning (SEAML). Their method adaptively adjusts the bound constraints used in previous works [10,12] by shrinking the distances between samples of similar pairs and expanding the distances between samples of dissimilar pairs. Law et al. [25] proposed the Fantope regularization and applied it to Mahalanobis metric learning.
Beyond Mahalanobis metric learning, many researchers have also devoted considerable effort to non-Mahalanobis metric learning due to its potential in dealing with more complex intra- and inter-class variations. The kernel trick is the most straightforward technique to deal with non-linearity, so it is natural to use kernel methods in metric learning, such as [17,26]. Non-Euclidean spaces such as Riemannian space and projective space have also been explored for metric learning. These methods include Riemannian and manifold metric learning [27,28] and Cayley-Klein metric learning [18]. In [27], Cheng proposed Riemannian similarity learning by tackling the metric learning problem in a Riemannian optimization framework. In [18], Bi et al. showed that the Cayley-Klein metric can be incorporated into the metric learning frameworks of MMC [12] and LMNN [11] to obtain a better distance metric. Besides, Li et al. [29] proposed a margin-based method to learn a second-order discriminant function as the distance metric for verification problems. Some researchers have embedded metric learning into the framework of deep neural networks [30,31].
Since our method learns several Cayley-Klein metrics locally and combines them into a powerful global distance metric, it is most closely related to local metric learning [11,13,15,32] and to some mixed/compositional metric learning methods [16,33]. MM-LMNN [11] is an extension of LMNN which learns a small number of metrics (typically one per class) in an effort to alleviate overfitting. Noh et al. [32] pointed out that finite sampling using the class conditional probability distribution leads to a theoretical bias of the nearest neighbor classifier. They therefore proposed Generative Local Metric Learning (GLML), which uses local metrics to limit this theoretical bias. In [13], Wang et al. introduced PLML, a local metric learning method based on a finite number of linear metrics. They used the k-means algorithm to define anchor points as the means of clusters and optimized a combination of metric bases learned from these clusters. Reduced-Rank Local Metric Learning (R2LML), proposed in [15], learns k Mahalanobis-like local metrics that are then conically combined. Additionally, a nuclear norm regularizer is adopted to obtain low-rank weight matrices for calculating metrics, which controls the rank of the involved linear mappings through a sparsity-inducing matrix norm. Recently, Semerci and Alpaydin [16] proposed the Mixture of LMNN (MoLMNN) method, which learns a mixture of local Mahalanobis distances to better discriminate the data. It needs a gating function to softly partition the input space into several regions. In [33], SCML-local aims to learn a sparse combination of locally discriminative metrics. This algorithm does not need to perform projections onto the PSD cone, and thus has a computational advantage for high-dimensional problems.
Different from these methods, the proposed multiple Cayley-Klein metric learning linearly combines several local Cayley-Klein metrics, while most previous methods combine Mahalanobis metrics. Due to the intrinsic non-linearity of the Cayley-Klein metric, combining such metrics is more effective than combining linear metrics like Mahalanobis metrics. Thus, our method has the potential to perform better than previous methods. Moreover, in contrast to the sophisticated methods used in previous works to partition the input data space into several clusters for local metric learning, we use a simpler and more straightforward method that directly utilizes the label information supplied with the training data.

Cayley-Klein geometries
Cayley-Klein geometries are branches of non-Euclidean geometry, a long-standing topic in geometry that can be traced back to the 19th century. Among the many mathematicians who studied this topic were A. Cayley and F. Klein. In 1859, A. Cayley discovered that Euclidean geometry can be considered a special case of projective geometry, which led to his famous statement "descriptive geometry (his term for projective geometry) is all geometry" [34]. Ten years later, F. Klein [35,36] followed A. Cayley's ideas and showed that projective geometry can provide a framework for the development of hyperbolic and elliptic geometries as well. His research mainly focused on the real Euclidean, hyperbolic and elliptic geometries, since he believed that only these geometries can describe the physical universe [37]. Based on their research, it is acknowledged that the Euclidean, the hyperbolic and the elliptic geometries are independent and self-subsistent geometries. Their research also led to working models for these different geometries. Owing to their distinguished work on this topic, both the hyperbolic and elliptic geometries are called Cayley-Klein geometries. They occupy a significant position in the foundations of geometry because of their distinguished position as geometries of constant curvature.
Nowadays, the term "non-Euclidean geometry" is frequently used to refer to hyperbolic geometry only [38] or to hyperbolic and elliptic geometries together [39]. The reason for calling them "non-Euclidean" is perhaps that no other non-Euclidean geometry had been discovered earlier, and that they both violate the parallel postulate of Euclidean geometry. In Euclidean geometry, for each tangent to a circle there is a unique second parallel tangent. That is to say, there is a unique line through a fixed point parallel to a given line (not through the fixed point). In elliptic geometry, by contrast, there are no parallels at all. As great circles are taken to be lines in elliptic geometry, two different lines in one plane always intersect. In hyperbolic geometry, through one point not on the given line, there are infinitely many parallels to this line.

Cayley-Klein metric
According to [34,35], the Cayley-Klein metric is defined over an invertible symmetric matrix $G$ in projective space. Mathematically, the Cayley-Klein distance between two data points is determined by $G$ and by $k$, a parameter related to the space curvature [18]. Evidently, there is a one-to-one correspondence between the symmetric matrix $G \in \mathbb{R}^{(n+1)\times(n+1)}$ and the Cayley-Klein metric, i.e., a specific $G$ defines a specific kind of Cayley-Klein metric.
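To make the role of $G$ and $k$ concrete, the sketch below evaluates one common closed form of the elliptic Cayley-Klein distance. It is an assumption rather than a reproduction of this paper's equation: it presumes a positive definite $G$, the bilinear form $\sigma(x, y) = (x^T, 1)\, G\, (y^T, 1)^T$, and the distance $k \arccos\big(\sigma(x, y) / \sqrt{\sigma(x, x)\,\sigma(y, y)}\big)$. The function names `bilinear` and `ck_elliptic` are hypothetical.

```python
import math


def bilinear(x, y, G):
    """sigma(x, y) = (x^T, 1) G (y^T, 1)^T in homogeneous coordinates."""
    xh = list(x) + [1.0]
    yh = list(y) + [1.0]
    G_yh = [sum(g * v for g, v in zip(row, yh)) for row in G]
    return sum(u * w for u, w in zip(xh, G_yh))


def ck_elliptic(x, y, G, k=1.0):
    """Assumed elliptic Cayley-Klein distance for a positive definite G."""
    s_xy = bilinear(x, y, G)
    s_xx = bilinear(x, x, G)
    s_yy = bilinear(y, y, G)
    ratio = s_xy / math.sqrt(s_xx * s_yy)
    # Clamp to [-1, 1] to guard against floating-point round-off.
    ratio = max(-1.0, min(1.0, ratio))
    return k * math.acos(ratio)


# 1-D data, so G is 2x2; the identity is used purely as a toy choice.
G = [[1.0, 0.0], [0.0, 1.0]]
d_same = ck_elliptic([0.0], [0.0], G)
d_unit = ck_elliptic([0.0], [1.0], G)
```

Under these assumptions the distance is an angle between homogeneous vectors scaled by $k$, so it is bounded, unlike a Mahalanobis distance.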

Multiple Cayley-Klein metrics
In many computer vision tasks, data points from the same class are expected to lie near each other in the feature space, while data points from different classes are far from each other. On the one hand, a distance metric learned for one class may not perform well when applied to another class. On the other hand, a single distance metric learned on data from all classes is usually unable to model the multiclass decision boundaries due to the complexity of the high-dimensional data space. For these reasons, we propose the multiple Cayley-Klein metric. It combines multiple Cayley-Klein metrics that are trained on different parts of the training set. Since the Cayley-Klein metric is non-linear, combining several such metrics can enlarge its non-linearity and thus lead to better performance.
The definition of the multiple Cayley-Klein metric is simple. Essentially, it linearly combines $N$ different Cayley-Klein metrics, so it fulfills the metric axioms as well. Note that $d_{CK}(x_i, x_j; G_c)$ is a Cayley-Klein metric learned on the c-th data cluster. When label information is available in the training data, we cluster the training data by their labels. In other words, $d_{CK}(x_i, x_j; G_c)$ is learned to maximize the performance related to the c-th class, for example, by making the distance between any two instances in the c-th class small and the distance between an instance in the c-th class and an instance from another class large. In this case, $N$ is set equal to the number of classes. If label information is unavailable, the training data can be partitioned into $N$ clusters by any unsupervised clustering method, such as k-means. In this paper, we focus only on the supervised case, as the purpose of metric learning is to improve the metric's performance by using labeled training data.
Fig 1 illustrates the basic idea of the proposed multiple Cayley-Klein metric learning method with a toy example. There are two classes of data in (a), denoted by squares and circles, and three classes of data in (b), denoted by squares, circles and triangles respectively. In situation (a), using one non-linear metric achieves the same goal as using two linear metrics in data classification. In situation (b), a single non-linear metric is not enough: at least two non-linear metrics, or even more linear ones, are needed to separate the data. Therefore, multiple Cayley-Klein metrics actually correspond to a series of Riemannian metrics with several different (but fixed) curvatures, which we expect to model more complex data distributions.
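The linear combination described above can be sketched as a weighted sum of base distances. In this sketch the non-negativity check on the weights is an assumption (it is what keeps a weighted sum of metrics a metric), and the two base distances are simple stand-ins for $d_{CK}(\cdot, \cdot; G_c)$; all names are hypothetical.

```python
def combined_distance(x, y, metrics, alphas):
    """Weighted sum of base distances: sum_c alpha_c * d_c(x, y)."""
    if len(metrics) != len(alphas):
        raise ValueError("need exactly one weight per base metric")
    if any(a < 0 for a in alphas):
        # Assumption: non-negative weights, so the sum stays a metric.
        raise ValueError("weights are assumed to be non-negative")
    return sum(a * d(x, y) for a, d in zip(alphas, metrics))


# Stand-in base metrics on scalars; in the setting of the text each would be
# a Cayley-Klein distance attached to one class.
def abs_distance(x, y):
    return abs(x - y)


def capped_distance(x, y):
    return min(1.0, abs(x - y))


metrics = [abs_distance, capped_distance]
alphas = [0.5, 0.5]
d_far = combined_distance(0.0, 3.0, metrics, alphas)
d_zero = combined_distance(1.0, 1.0, metrics, alphas)
```

Passing the base metrics as callables keeps the combination independent of how each local metric is parameterized.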
In the following, we will describe the formulation of multiple Cayley-Klein metric learning, and then elaborate how to optimize the objective function.

Metric learning
Suppose we have a training set of $N$ classes. According to the label information, we organize it into $N$ sets of similar pairs $S = \{S_c;\ c = 1, 2, \cdots, N\}$ and $N$ sets of dissimilar pairs $D = \{D_c;\ c = 1, 2, \cdots, N\}$. $S_c$ consists of pairs of samples from the c-th class, while $D_c$ contains pairs of dissimilar samples, one from the c-th class and the other from the j-th class, $j \neq c$. Following the learning criteria widely used in the metric learning community, we formulate our objective as follows: learn a multiple Cayley-Klein metric such that the distances of dissimilar pairs are as large as possible, while the distances of similar pairs are restricted to be smaller than 1. Directly optimizing the above problem is difficult. Here, we propose to optimize $\alpha_c$ and $G_c$ alternately.
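The construction of the per-class pair sets can be sketched from the labels alone. The sketch below stores pairs as index tuples; the function name `build_pair_sets` and the toy labels are hypothetical. Because a dissimilar pair is seen from both of its classes, the sketch also builds a duplicate-free union of the dissimilar sets.

```python
from itertools import combinations


def build_pair_sets(labels):
    """Return similar pairs S[c], dissimilar pairs D[c], and their deduplicated union."""
    classes = sorted(set(labels))
    index = {c: [i for i, lab in enumerate(labels) if lab == c] for c in classes}
    # S[c]: all unordered pairs of samples inside class c.
    S = {c: list(combinations(index[c], 2)) for c in classes}
    # D[c]: one sample from class c paired with one sample from any other class.
    D = {
        c: [(i, j) for i in index[c] for j, lab in enumerate(labels) if lab != c]
        for c in classes
    }
    # A pair with members in classes a and b appears in both D[a] and D[b];
    # keep it once, with sorted indices.
    D_unique = sorted({tuple(sorted(p)) for pairs in D.values() for p in pairs})
    return S, D, D_unique


labels = ["a", "a", "b", "c"]  # hypothetical toy labels
S, D, D_unique = build_pair_sets(labels)
```

On the toy labels the concatenated dissimilar sets hold ten pairs, of which five are distinct, which shows why duplicates matter when all classes are considered jointly.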
Optimize α. Given $N$ Cayley-Klein matrices $G_c$, the problem of solving for $\alpha_c$ is formulated as a linear programming problem, which is easy to solve. Note that concatenating all sets of dissimilar pairs $D_c$, $c = 1, 2, \cdots, N$, produces duplicated pairs; $D'$ is the set of dissimilar pairs after removing duplicated pairs from $D$.
Optimize $G_c$. Once the weights are fixed, the problem in (4) can be separated into $N$ sub-problems, which are solved one by one. Since the matrix $G_c$ in the objective of the c-th sub-problem is symmetric, it is convenient to optimize over $L_c$ after the Cholesky decomposition $G_c = L_c^T L_c$ with $L_c \in \mathbb{R}^{(n+1)\times(n+1)}$. In this way, the sub-problem can be solved by the gradient ascent algorithm. At each iteration, we take a gradient ascent step on the objective function with respect to $L_c$. By applying the Cholesky decomposition to $G_c$, constraint (b) is satisfied. Then we only need to approximate the updated $L_c$ so that it fulfills constraint (a). Specifically, given an updated $L_c$, its approximation $L'$ that meets constraint (a) can be obtained by solving a minimization problem. For simplicity, we denote $C_{x_i x_j} = (x_i^T, 1)^T (x_j^T, 1)$. Denoting the matrix $L_c$ at the t-th iteration by $L_t$, the gradient of the objective function at the t-th iteration is computed with respect to $L_t$.
Initialization. To start the alternating optimization procedure described above, we have to initialize $\alpha_c$ and $G_c$ in a reasonable way. Bi et al. [18] proposed a specific method to construct a Cayley-Klein matrix from a given dataset, called the generalized Mahalanobis matrix. They showed experimentally that initialization with the generalized Mahalanobis matrix performs better than initialization with an identity matrix or a random matrix. Therefore, we also use the generalized Mahalanobis matrix to initialize $G_c$. Since $G_c$ is a local metric mainly focused on the c-th class, we use the mean $m^{(c)}$ and inverse covariance $S^{(c)}$ computed from samples of the c-th class to build the matrix that initializes $G_c$.

Experiments
In this section, we evaluate the proposed method on image classification tasks with three different public datasets. For comparison, we also test CK-MMC and MMC, as they share the same learning target as our method and differ only in the definition of the distance metric. Moreover, LMNN and CK-LMNN are evaluated due to their good performance. Additionally, MM-LMNN and SCML-local are tested as they are typical local metric learning methods.

Results on the UCI datasets
Datasets: In this experiment, we use 9 different datasets from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets.html, which are widely used in evaluating metric learning methods. These datasets include: Wine, Ionosphere, Vowel, Balance, Pima, Vehicle, Segmentation, Waveform and Letter. The characteristics of each UCI dataset such as the number of data points, feature dimensions, and the number of classes are summarized in Table 1.
Set up: For each dataset, we randomly divide it into training/validation/test sets. The numbers of samples in the training/validation/test subsets are shown in Table 1, and the proportion of these three subsets is nearly 60%/20%/20%. All features are first normalized over the training data to have zero mean and unit variance. Features of the validation and test data are normalized using the mean and variance of the training data. The parameters of all methods are set according to the authors' recommendations. LMNN, MM-LMNN and CK-LMNN use 3 target neighbors and all impostors, while these are set to 3 and 10 in SCML-local. The k-nearest neighbor (kNN) classifier is used for classification, and we set k = 3 for all the datasets. We repeat this procedure 10 times and report the average accuracies for these datasets.
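The evaluation protocol above (statistics fitted on the training split only, then a 3-NN vote) can be sketched as follows. This is a minimal illustration, not the code used in the experiments: the helper names and the toy data are hypothetical, the population standard deviation is an assumed choice, and the Euclidean distance stands in for a learned metric.

```python
import math
from collections import Counter


def fit_standardizer(train):
    """Per-feature mean and standard deviation, computed on training rows only."""
    n = len(train)
    columns = list(zip(*train))
    means = [sum(col) / n for col in columns]
    stds = [
        math.sqrt(sum((v - m) ** 2 for v in col) / n) or 1.0  # constant feature
        for col, m in zip(columns, means)
    ]
    return means, stds


def standardize(rows, means, stds):
    """Apply the training statistics to any split."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def euclidean(x, y):
    # Stand-in for a learned distance metric.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(query, train, train_labels, dist, k=3):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(range(len(train)), key=lambda i: dist(query, train[i]))[:k]
    return Counter(train_labels[i] for i in nearest).most_common(1)[0][0]


train = [[0.0, 10.0], [1.0, 12.0], [2.0, 14.0],
         [10.0, 30.0], [11.0, 32.0], [12.0, 34.0]]
train_labels = [0, 0, 0, 1, 1, 1]
test = [[1.5, 13.0]]

means, stds = fit_standardizer(train)
train_std = standardize(train, means, stds)
test_std = standardize(test, means, stds)  # reuses the training statistics
pred = knn_predict(test_std[0], train_std, train_labels, euclidean, k=3)
```

Reusing the training mean and variance for the validation and test splits avoids leaking information from held-out data into the normalization.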
Results: Table 2 shows the classification accuracies of the seven evaluated methods. Consistent with previous work, the performance is improved by replacing the traditional Mahalanobis metric with the Cayley-Klein metric. This can be seen from the comparisons "CK-MMC vs. MMC" and "CK-LMNN vs. LMNN". Among all the evaluated methods, the proposed MCKML performs best on 6 out of 9 datasets. For two datasets (Balance and Letter), it performs second best and closely follows the best result (SCML-local). Note that CK-LMNN, MM-LMNN and SCML-local use a learning target based on triplets of samples, which is more powerful than the learning target based on pairs of samples used in MCKML. When considering the same learning target, MCKML consistently improves over MMC and CK-MMC on all datasets. By incorporating MCKML into the learning paradigm of LMNN, its performance is expected to improve further. We leave this as future work.
For a more accurate comparison, we perform a paired t-test with significance level 0.05 to evaluate statistically which result is better. The comparison results with CK-MMC, CK-LMNN and two local metric learning methods (MM-LMNN and SCML-local) are summarized in Table 3. We use "*" to indicate that the classification results of the two methods are not significantly different at the given confidence level, and "<" to indicate that the mean classification accuracy of the latter method is statistically higher than that of the former one. From the paired t-test results, we can conclude with a 95% confidence level that the proposed MCKML generally outperforms CK-MMC and is comparable with or even better than CK-LMNN, MM-LMNN and SCML-local on all datasets except the Balance dataset.
Visualization of the learned metric: In order to provide a better understanding of why the proposed MCKML works well and to further show the benefit of enlarging the non-linear property, we added a graphical illustration using t-SNE [40].
Datasets: PubFig [41] is a challenging real-world face database collected from the internet. It contains 200 people and a total of 58,797 images of them. The images in this database are taken in completely uncontrolled situations with non-cooperative subjects, leading to large variations in pose, lighting, expression, scene, camera, imaging conditions and parameters, etc. Similar to [18,25], our experiment uses a subset of PubFig containing 772 images from 8 identities: Alex Rodriguez (Alex), Clive Owen (Clive), Hugh Laurie (Hugh), Jared Leto (Jared), Miley Cyrus (Miley), Scarlett Johansson (Scarlett), Viggo Mortensen (Viggo) and Zac Efron (Zac). We use 11-dimensional relative attributes [42] to represent each image in the dataset. The relative attributes are computed from a concatenation of the 512-dimensional GIST descriptor [43] and a 45-dimensional LAB color histogram. We use the publicly available code of [42] to compute the relative attributes.
Set up: For all the evaluated methods, we randomly select 30 images per class for training, 30 images per class for validation, and use the remaining images for testing. In the test stage, we use a 3-NN classifier based on the learned distance metric. We repeat this procedure 10 times and report the average classification accuracies.
Results: The results are listed in Table 4. We obtain observations similar to those on the UCI datasets: MCKML outperforms MMC and CK-MMC in all cases, while it is slightly inferior to CK-LMNN in some categories (the reason has been explained in the last subsection).
..., mountain (m) and forest (F). We use the 6-dimensional relative attributes generated from 512-dimensional GIST descriptors to represent the images.
Set up: As in the experiment on the PubFig dataset, we randomly select 30 images per class for metric learning, 30 images per class for validation, and use the remaining images to test the performance of the learned metric. A 3-NN classifier is used for classification. We repeat this procedure 10 times and report the average classification accuracies.

Results: The classification results on the OSR dataset are listed in Table 5. Finally, the results in Tables 2, 4 and 5 are rather consistent, although these datasets are fundamentally different from each other. Among all the tested methods, the proposed MCKML either achieves the best average classification accuracy or is only slightly inferior to CK-LMNN, which uses a more powerful learning objective based on triplets. When using the same objective based on pairs of samples, our method outperforms previous methods on all tested categories.
Table 6 shows the running times on OSR and PubFig for different methods, averaged over 10 runs. Generally speaking, using the Cayley-Klein metric requires a little more time in testing, as more operations are involved in computing the Cayley-Klein metric according to its definition. For training, compared with MMC and CK-MMC, which both need one loop of gradient ascent to find the optimal solution, MCKML needs two loops, which is time-consuming. One is the outer loop that alternately optimizes α and the Cayley-Klein matrices

Conclusion
This paper follows a very recent work on Cayley-Klein metric learning, which is the first paper to introduce the long-established Cayley-Klein geometries into computer vision. We show in this paper that the Cayley-Klein metric can benefit from learning multiple local Cayley-Klein metrics, each of which focuses on only a part of the data space. To this end, we propose the multiple Cayley-Klein metric learning method, which alternately optimizes the local Cayley-Klein metrics and their global combination weights. Although the metric learning target is identical to that of some previous works, i.e., to maximize the inter-class distances and restrict the intra-class distances to be less than an upper bound, our method achieves better performance on three widely used datasets, as shown in the experiments. These results demonstrate the superiority of multiple Cayley-Klein metric learning over Cayley-Klein metric learning, as well as over traditional Mahalanobis metric learning and state-of-the-art local metric learning.