Spectral clustering with distinction and consensus learning on multiple views data

Since multi-view data are available in many real-world clustering problems, multi-view clustering has received considerable attention in recent years. Most existing multi-view clustering methods learn a consensus clustering result but do not make full use of the distinct knowledge in each view, so they cannot well guarantee the complementarity across different views. In this paper, we propose Distinction based Consensus Spectral Clustering (DCSC), which not only learns a consensus clustering result, but also explicitly captures the distinct variance of each view. It is precisely by capturing the distinct variance of each view that DCSC can learn a cleaner consensus clustering result. To optimize the introduced optimization problem effectively, we develop a block coordinate descent algorithm that is theoretically guaranteed to converge. Experimental results on real-world data sets demonstrate the effectiveness of our method.


Introduction
Many real-world data sets are represented in multiple views. For example, images on the web may have two views: visual information and textual tags; multi-lingual data sets have multiple representations in different languages. Different views can often provide complementary information, which is very helpful for improving learning performance. Multi-view clustering aims to cluster such multi-view data by exploiting the information from all views.
Over the past years, many multi-view clustering methods have been proposed. Roughly speaking, depending on the goal of clustering, they can be categorized into two closely related but different families. The first family aims to learn a common or consensus clustering result from multiple views. These methods [1][2][3][4][5][6] usually extend single-view clustering methods such as spectral clustering or nonnegative matrix factorization (NMF) to deal with multi-view data. For example, Cai et al. extended k-means to multi-view data, leading to a multi-view k-means clustering method [1]; Liu et al. extended NMF to multi-view clustering [2]; Kumar et al. presented a co-training approach for multi-view spectral clustering by bootstrapping the clusterings of different views [3]. Kumar et al. also proposed two co-regularization based approaches for multi-view spectral clustering by enforcing the clustering hypotheses on different views to agree with each other [4]. Xia et al. proposed a robust multi-view spectral clustering method by building a Markov chain based on a low-rank and sparse decomposition [5]. Nie et al. presented a parameter-free multi-view spectral clustering method, which learns an optimal weight for each view automatically without introducing an additional parameter, and extended it to semi-supervised classification [6,7]. Zhang et al. learned a uniform projection to map the multiple views to a consensus embedding space [8]. Tang et al. extended NMF to unsupervised multi-view feature selection by making use of the consensus information of all views [9]. All these methods learn the consensus clustering from the views while ignoring the distinct information that exists in only one view and not in the others. Therefore, these methods cannot guarantee the complementarity across different views [10].
The second family, instead of learning a consensus clustering result, learns a distinct clustering result for each view. These methods [10][11][12] make use of complementary information to improve the clustering performance on each view. For example, Günnemann et al. presented a multi-view clustering method based on subspace learning, which provides multiple generalizations of the data by modeling individual mixture models, each representing a distinct view [11]; Cao et al. also presented a subspace-learning-based multi-view clustering method which learns a better clustering result on each view [10]. Different from the first family, these methods learn clustering results for each view instead of a consensus clustering result.
In this paper, we focus on the first family, i.e., learning a consensus clustering result. It is well known that multi-view learning follows the complementary principle, which states that each view of the data may contain some knowledge that the other views do not have [13,14]; theoretical and experimental results [11,15,16] have already demonstrated this. However, most existing consensus clustering methods do not make full use of this distinguishing information, and thus cannot well guarantee the complementarity across different views [10]. To address this issue, we propose Distinction based Consensus Spectral Clustering (DCSC), which borrows the main idea of the second family, that the distinguishing information may be helpful, to learn a better consensus result.
Since spectral clustering is widely used in both single-view and multi-view settings [4,5,17], we adopt it as the basic clustering model of our method. The essential step of spectral clustering is to learn a spectral embedding. In our method, the underlying spectral embedding of each view consists of two parts: the consensus embedding shared by all views and the sparse variance of each view. To keep the variances distinct, we apply the Hilbert Schmidt Independence Criterion (HSIC) to measure and control the diversity across views. Therefore, by explicitly capturing the distinct variance of each view, we learn a cleaner consensus embedding of all views. Note that Liu et al. [18] also considered consistency and complementarity, i.e., they decomposed the latent factor into two parts: a common part and a specific part. However, they did not impose any constraint (like the HSIC in our method) on the specific parts to control the diversity, so the learned specific parts may end up being similar in their method.
We develop a block coordinate descent algorithm for effectively learning the consensus embedding and the distinct variances, and it is theoretically guaranteed to converge. Experiments on benchmark data sets show that our method outperforms closely related algorithms, which indicates the importance of using distinct information in multi-view clustering.
To sum up, the main contributions of this paper are as follows: we propose a new multi-view spectral clustering method that uses HSIC to explicitly capture the distinct information of all views and can obtain a cleaner and more accurate consensus result; and we provide a block coordinate descent algorithm to solve it effectively, with experimental results demonstrating that our algorithm outperforms other state-of-the-art methods.
The remainder of this paper is organized as follows: Section 2 introduces some preliminaries. Section 3 presents our Distinction based Consensus Spectral Clustering method. Section 4 shows the experimental results. Section 5 concludes this paper.

Spectral clustering
Spectral clustering [19] is a widely used clustering method. Given a data set containing data points {x_1, . . ., x_n}, it first defines a similarity matrix S ∈ R^{n×n}, where S_ij ≥ 0 denotes the similarity between x_i and x_j. It then constructs the normalized Laplacian matrix L = I − D^{−1/2} S D^{−1/2}, where I is the identity matrix and D ∈ R^{n×n} is a diagonal matrix with (i, i)-th element d_ii = Σ_{j=1}^n S_ij. Spectral clustering aims to learn a spectral embedding Y ∈ R^{n×c} (c is the dimension of the embedding space and is often set to the number of clusters) by optimizing the following objective function:

min_Y tr(Y^T L Y),  s.t. Y^T Y = I.   (1)

After obtaining the spectral embedding Y, it applies k-means or spectral rotation [20] to discretize Y into the final clustering result.
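As a concrete reference, the pipeline above (normalized Laplacian, eigenvector embedding) can be sketched in a few lines of NumPy; the minimizer of Eq (1) under the orthogonality constraint is given by the eigenvectors of L with the c smallest eigenvalues, and the toy block-diagonal similarity matrix here is purely illustrative:

```python
import numpy as np

def spectral_embedding(S, c):
    """Spectral embedding of a similarity matrix S (cf. Eq (1)).

    Builds the normalized Laplacian L = I - D^{-1/2} S D^{-1/2} and returns
    the eigenvectors belonging to the c smallest eigenvalues, which minimize
    tr(Y^T L Y) subject to Y^T Y = I.
    """
    d = S.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))  # guard against isolated points
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)  # eigh returns ascending eigenvalues
    return eigvecs[:, :c]

# Two obvious clusters: a block-diagonal similarity matrix.
S = np.array([[1., 1., 0., 0.],
              [1., 1., 0., 0.],
              [0., 0., 1., 1.],
              [0., 0., 1., 1.]])
Y = spectral_embedding(S, 2)
# Points in the same cluster receive (near-)identical embedding rows,
# so k-means on the rows of Y trivially recovers the two clusters.
assert np.allclose(Y[0], Y[1]) and np.allclose(Y[2], Y[3])
```

In practice the embedding rows are then clustered with k-means or discretized by spectral rotation, as described above.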

Hilbert schmidt independence criterion
Let k_1 : X × X → R and k_2 : Y × Y → R be two positive-definite reproducing kernels corresponding to RKHSs (Reproducing Kernel Hilbert Spaces) [21] H_{k_1} and H_{k_2} respectively, with inner products k_1(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ and k_2(y_i, y_j) = ⟨ψ(y_i), ψ(y_j)⟩, where φ : X → H_{k_1} and ψ : Y → H_{k_2} are the feature maps and ⟨·, ·⟩ denotes the inner product. The cross-covariance operator is defined as

C_xy = E[(φ(x) − μ_x) ⊗ (ψ(y) − μ_y)],   (2)

where ⊗ is the outer product, μ_x = E(φ(x)), μ_y = E(ψ(y)), and E(·) denotes the expectation. The Hilbert Schmidt Independence Criterion (HSIC) is then defined as follows.

Definition 1. [22] Given two separable RKHSs and a joint distribution p_xy, the HSIC is defined as the squared Hilbert-Schmidt norm of the associated cross-covariance operator C_xy:

HSIC(p_xy) = ||C_xy||²_HS,   (3)

where ||·||_HS denotes the Hilbert-Schmidt norm. From this definition, we can see that HSIC measures the dependence of two variables, i.e., the smaller the HSIC is, the more independent the two variables are.
Since the joint distribution p_xy is usually unknown, the empirical version of HSIC is often used in practice. Let Z^(1) and Z^(2) be two data sets containing {z^(1)_1, . . ., z^(1)_n} and {z^(2)_1, . . ., z^(2)_n} as their data respectively. The empirical version of HSIC is defined as

HSIC(Z^(1), Z^(2)) = (n − 1)^{−2} tr(K^(1) H K^(2) H),   (4)

where K^(1) and K^(2) are the Gram matrices with (i, j)-th elements K^(1)_ij = k_1(z^(1)_i, z^(1)_j) and K^(2)_ij = k_2(z^(2)_i, z^(2)_j), and H = I − (1/n)11^T is a centering matrix, where 1 is an all-ones vector. More details on HSIC can be found in [22].
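The empirical formula above translates directly into code. The sketch below uses linear kernels (the choice also adopted later in this paper) and random data purely for illustration; a deterministic function of a sample should score much higher than an independent sample:

```python
import numpy as np

def empirical_hsic(Z1, Z2):
    """Empirical HSIC with linear kernels: (n-1)^{-2} tr(K1 H K2 H)."""
    n = Z1.shape[0]
    K1, K2 = Z1 @ Z1.T, Z2 @ Z2.T          # linear Gram matrices
    H = np.eye(n) - np.ones((n, n)) / n    # centering matrix H = I - (1/n) 11^T
    return np.trace(K1 @ H @ K2 @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
a = rng.standard_normal((50, 3))
b = rng.standard_normal((50, 3))           # drawn independently of a
# 2*a is a deterministic function of a, hence strongly dependent on it;
# the independent sample b should give a much smaller HSIC value.
dependent = empirical_hsic(a, 2 * a)
independent = empirical_hsic(a, b)
```

With linear kernels, tr(K^(1) H K^(2) H) equals the squared Frobenius norm of the empirical cross-covariance between the two data sets, which is why larger values indicate stronger dependence.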

Distinction based consensus spectral clustering
In this section, we present the framework of DCSC, and then introduce how to solve the introduced optimization problem.

Formulation
Given a multi-view data set {X^(1), . . ., X^(m)} containing n instances, where m is the number of views, we can construct m Laplacian matrices L^(1), . . ., L^(m) as in [19]. For each view we could then learn a spectral embedding by solving Eq (1). However, in multi-view data, each view contains both common information and distinguishing knowledge. To capture both, we decompose the spectral embedding of the i-th view into two parts: the consensus embedding Y ∈ R^{n×c} and the distinct variance V^(i) ∈ R^{n×c}. Then the objective function of Eq (1) can be rewritten as

min_{Y, V^(i)} Σ_{i=1}^m tr((Y + V^(i))^T L^(i) (Y + V^(i))),  s.t. Y^T Y = I,   (5)

where the orthogonal constraint Y^T Y = I avoids the trivial solution.
Since V^(1), . . ., V^(m) denote the distinct variances of the views, they should be far apart from each other. Here we use the aforementioned HSIC to measure the difference between V^(i) and V^(j). As we wish V^(i) and V^(j) to be far apart, we should minimize HSIC(V^(i), V^(j)), so we add the term Σ_{i=1}^m Σ_{j=i+1}^m HSIC(V^(i), V^(j)) to the objective function. Then, we obtain the following formulation:

min_{Y, V^(i)} Σ_{i=1}^m tr((Y + V^(i))^T L^(i) (Y + V^(i))) + λ_1 Σ_{i=1}^m Σ_{j=i+1}^m HSIC(V^(i), V^(j)),  s.t. Y^T Y = I,   (6)

where λ_1 is a balancing parameter that controls the diversity. Moreover, although each view may contain some complementary or distinct information, since we focus on the first family, which aims to learn a consensus clustering result, the consensus embedding is the main part and is what we really want. So we wish each view to contain only a small quantity of distinct information, which means the variance V^(i) should be sparse. To ensure this, we impose an ℓ1-norm on each V^(i) and obtain:

min_{Y, V^(i)} Σ_{i=1}^m tr((Y + V^(i))^T L^(i) (Y + V^(i))) + λ_1 Σ_{i=1}^m Σ_{j=i+1}^m HSIC(V^(i), V^(j)) + λ_2 Σ_{i=1}^m ||V^(i)||_1,  s.t. Y^T Y = I,   (7)

where λ_2 is another balancing parameter that controls the sparsity. Fig 1(a) illustrates the ideal embedding we aim to learn. The orthogonal consensus embedding Y should contain most of the information, i.e., the distinct variance V^(i) contains only few non-zero elements. In addition, the variances V^(i) contain the distinct information and thus should be as different from each other as possible, as shown in Fig 1(a). Fig 1(b) shows the result Y = 1/3∑_i Y^(i) without any constraints on V^(i). It is easy to verify that ∑_{i,j} HSIC(V^(i), V^(j)) in Fig 1(a) is much smaller than that in Fig 1(b), which means the V^(i) in Fig 1(a) behave more like distinct parts. Therefore, the consensus Y obtained by subtracting V^(i) from Y^(i) is cleaner in Fig 1(a).
It is worth noting that, [5] decomposes transition probability matrix of each view into a consensus transition probability matrix and a sparse noise matrix, which is similar to our method. However, the two methods are totally different. Firstly, the motivations are different. Their approach mainly considers robustness, while in our method, we try to discover the distinguishing information in each view. Secondly, in their method, they only impose sparsity on the noise matrices, while do not control the diversity, so it is not necessarily that the noise matrices are far apart from each other. In our method, we explicitly control the diversity by minimizing the HSIC of each pair of views.
In Eq (7), we treat the difference of each pair of views equally, because in the term Σ_{i=1}^m Σ_{j=i+1}^m HSIC(V^(i), V^(j)) the weight of every pair HSIC(V^(i), V^(j)) is 1. In practice, however, if two views are more different, we wish the variance matrices of these two views to be farther apart. Therefore, we replace the term Σ_{i=1}^m Σ_{j=i+1}^m HSIC(V^(i), V^(j)) with Σ_{i=1}^m Σ_{j=i+1}^m ω_ij HSIC(V^(i), V^(j)), where the pre-defined weight ω_ij is the prior information capturing the diversity between V^(i) and V^(j), allowing the diversities to be controlled more precisely. Intuitively, if the i-th view is more different from the j-th view, we should impose a larger weight ω_ij to keep HSIC(V^(i), V^(j)) as small as possible. There are many ways to set ω_ij. In this paper, we first use the similarity matrices S^(i) and S^(j) to compute an average similarity score β_ij ∈ [0, 1] of the i-th and the j-th view. In more detail, we compute β_ij = ⟨S^(i), S^(j)⟩/n², i.e., β_ij is the normalized inner product of S^(i) and S^(j), which represents the similarity of the i-th and the j-th view. Then, using a technique similar to [23], we set ω_ij to a function of β_ij that is monotonically decreasing, i.e., a smaller β_ij leads to a larger ω_ij. This satisfies the desired property of the weight: if two views are more different, then the weight is larger.
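The weight computation can be sketched as follows. The β_ij formula is taken from the text; the specific decreasing map exp(−β) is only an illustrative stand-in, since the exact monotonically decreasing function (following [23]) is not reproduced here:

```python
import numpy as np

def pairwise_view_weights(S_list):
    """Prior diversity weights w_ij from per-view similarity matrices.

    beta_ij = <S_i, S_j> / n^2 is the normalized inner product of the two
    similarity matrices (as in the text); the paper maps beta_ij through a
    monotonically decreasing function following [23] -- exp(-beta) is used
    here purely as an illustrative choice of such a function.
    """
    n = S_list[0].shape[0]
    m = len(S_list)
    beta = np.array([[np.sum(S_list[i] * S_list[j]) / n ** 2
                      for j in range(m)] for i in range(m)])
    return np.exp(-beta)  # smaller beta (more different views) -> larger weight

S1 = np.eye(4)            # view 1 similarity
S2 = np.eye(4)            # identical to view 1
S3 = np.ones((4, 4)) / 4  # very different from view 1
w = pairwise_view_weights([S1, S2, S3])
# The more different pair (views 1 and 3) gets the larger diversity weight.
assert w[0, 2] > w[0, 1]
```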
Here, for simplicity, we use the linear kernel, i.e., K^(i) = V^(i) V^(i)T. Substituting Eq (4) with the linear kernel into our objective function, we obtain the final formulation of our method:

min_{Y, V^(i)} Σ_{i=1}^m tr((Y + V^(i))^T L^(i) (Y + V^(i))) + λ_1 Σ_{i=1}^m Σ_{j=i+1}^m ω_ij tr(V^(i) V^(i)T H V^(j) V^(j)T H) + λ_2 Σ_{i=1}^m ||V^(i)||_1,  s.t. Y^T Y = I.   (8)

Note that for notational convenience, we absorb the scaling factor (n − 1)^{−2} of HSIC into the parameter λ_1. By explicitly capturing the variances V^(1), . . ., V^(m), Eq (8) can learn a cleaner consensus spectral embedding Y.

Optimization
Eq (8) involves m + 1 variables (Y, V^(1), . . ., V^(m)), so we present a block coordinate descent scheme to optimize it. In particular, we optimize the objective w.r.t. one variable while fixing the others, and repeat this procedure until convergence.
Optimize V^(i) by fixing other variables. When Y, V^(1), . . ., V^(i − 1), V^(i + 1), . . ., V^(m) are fixed, Eq (8) can be rewritten as

min_V tr(V^T C V) + 2 tr(E^T V) + λ_2 ||V||_1,   (9)

where C and E are defined as

C = L^(i) + λ_1 Σ_{j≠i} ω_ij H V^(j) V^(j)T H,  E = L^(i) Y,

and, for notational convenience, we use V to replace V^(i). Let F(V) = tr(V^T C V) + 2 tr(E^T V) denote the smooth part of Eq (9). It is easy to verify that the gradient of F, denoted ∇F(V) = 2CV + 2E, is Lipschitz continuous with some constant Γ [24], i.e.,

||∇F(V_1) − ∇F(V_2)||_F ≤ Γ ||V_1 − V_2||_F.

So we can optimize this subproblem with the Accelerated Proximal Gradient Descent (APGD) method [25]. More specifically, instead of solving Eq (9) directly, we linearize F(V) at V^k (the result of V in the k-th iteration) and add a proximal term:

V^{k+1} = argmin_V F(V^k) + ⟨∇F(V^k), V − V^k⟩ + (μ/2)||V − V^k||²_F + λ_2 ||V||_1.

Then we update V^{k+1} by solving

min_V (μ/2)||V − Q||²_F + λ_2 ||V||_1,  where Q = V^k − (1/μ)∇F(V^k).   (12)

Eq (12) can be easily solved by a thresholding algorithm and has the closed form solution Ṽ:

Ṽ_ij = sign(Q_ij) max(|Q_ij| − λ_2/μ, 0),

where Ṽ_ij and Q_ij are the (i, j)-th elements of Ṽ and Q respectively, and sign(·) is the sign function, i.e., sign(x) = −1 if x is negative, sign(x) = 1 if it is positive, and sign(x) = 0 otherwise. Setting V^{k+1} = Ṽ gives the plain Proximal Gradient Descent method. According to [25], we can combine this proximal step with an extrapolation (momentum) step to update V^{k+1} and obtain a faster convergence rate.
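The closed-form soft-thresholding solution of Eq (12) is simple to implement; the small matrix Q and threshold below are purely illustrative:

```python
import numpy as np

def soft_threshold(Q, t):
    """Closed-form solution of min_V (1/2)||V - Q||_F^2 + t ||V||_1,
    i.e. the proximal operator of the l1-norm (Eq (12) with t = lambda_2 / mu):
    elementwise V_ij = sign(Q_ij) * max(|Q_ij| - t, 0)."""
    return np.sign(Q) * np.maximum(np.abs(Q) - t, 0.0)

Q = np.array([[ 1.5, -0.3],
              [-2.0,  0.1]])
# Entries with |Q_ij| <= 0.5 are zeroed, the rest shrink toward zero,
# which is exactly what makes the variance matrices V^(i) sparse.
V = soft_threshold(Q, 0.5)
```

Applied at each APGD iteration with Q = V^k − (1/μ)∇F(V^k), this step both descends on the smooth part and enforces sparsity on V.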

Algorithm 1 summarizes this APGD procedure for updating V^(i).
Here ρ is a constant rate used to update μ. We need to check whether μ is appropriate, because μ should satisfy μ > Γ while in most cases Γ is unknown. This check can be done with the method in [25]; to save space, we omit the details here.
We show in the next theorem that this algorithm converges at rate O(1/k²) to the global optimum of this subproblem.

Theorem 1. [24] Let V^k be the sequence generated by Algorithm 1. Then for any k ≥ 1,

G(V^k) − G(V*) ≤ 2Γ ||V^0 − V*||²_F / (k + 1)²,

where G is the objective function defined in Eq (9) and V* is its optimal solution.

Optimize Y by fixing other variables. When V^(1), . . ., V^(m) are fixed, Eq (8) can be rewritten as

min_Y tr(Y^T A Y) + 2 tr(B^T Y),  s.t. Y^T Y = I,   (17)

where A and B are defined as

A = Σ_{i=1}^m L^(i),  B = Σ_{i=1}^m L^(i) V^(i).

To handle the constraint, we obtain the Lagrangian function of Eq (17) by introducing the Lagrangian multiplier Λ:

L(Y, Λ) = tr(Y^T A Y) + 2 tr(B^T Y) − tr(Λ^T (Y^T Y − I)).

Setting the partial derivative with respect to Y to zero, we get

∂L/∂Y = 2(AY + B) − 2YΛ = 0.   (19)

Multiplying both sides of Eq (19) by Y^T and using the fact that Y^T Y = I, we get Λ = Y^T(AY + B). Since Y^T Y is symmetric, the Lagrangian multiplier Λ corresponding to Y^T Y = I is also symmetric, so we can rewrite it as Λ = (AY + B)^T Y. Substituting this into Eq (19), we have

AY + B − Y(AY + B)^T Y = 0.

We denote 2(AYY^T + BY^T − Y(AY + B)^T) as W, and have the following lemma, which gives the first-order condition of Eq (17):

Lemma 1. ∂L/∂Y = 0 if and only if W = 0, so W = 0 is the first-order optimality condition of Eq (17).
Proof. On one hand, according to the definition of W, we have ∂L/∂Y = WY, so if W = 0, then ∂L/∂Y = 0.
On the other hand, if ∂L/∂Y = 0, that is, (AYY^T + BY^T − Y(AY + B)^T)Y = 0, let M = AY + B; then, using Y^T Y = I, we have M = YM^TY. Taking the transposition of both sides of this equality gives M^T = Y^TMY^T. Then

MY^T = (YM^TY)Y^T = YM^TYY^T  and  YM^T = Y(Y^TMY^T) = YY^T(YM^TY)Y^T = YM^TYY^T,

so MY^T − YM^T = 0. Note that W = 2(MY^T − YM^T), so W = 0.
In summary, W = 0 is the first-order optimality condition of our subproblem. Accordingly, a natural way to update Y would be gradient descent, Y^{k+1} ← Y^k − τWY^k, where τ is the step size and Y^k is the result of Y at the k-th iteration. However, since we have the orthogonal constraint on Y, we cannot use gradient descent directly, as it may violate the constraint. To overcome this problem, inspired by [26][27][28], we use a constraint preserving descent method. In more detail, we compute the initial Y, denoted Y^1, from the standard spectral clustering objective

min_Y tr(Y^T A Y),  s.t. Y^T Y = I.

Then we use the following constraint preserving descent formula to compute the new iterate:

Y^{k+1} = Y^k − (τ/2) W (Y^k + Y^{k+1}).   (24)

Note that, according to the definition of W, we can easily verify that W^T = −W, which means that W is a skew-symmetric matrix. For a skew-symmetric W, the following theorem gives a closed form solution of Eq (24) that satisfies the orthogonal constraint and updates Y in a descent direction; moreover, due to Lemma 1, the iteration converges to a stationary point.

Theorem 2. 1) Given any skew-symmetric matrix W ∈ R^{n×n} and Y^k ∈ R^{n×c} satisfying Y^{kT}Y^k = I, the closed form solution of Eq (24) is

Y^{k+1} = (I + (τ/2)W)^{−1}(I − (τ/2)W)Y^k,   (25)

and it satisfies Y^{k+1T}Y^{k+1} = I.
2) The derivative of the objective along the update satisfies

∂J(Y^{k+1})/∂τ |_{τ=0} = −(1/2)||W||²_F ≤ 0,

where J(·) is the objective function in Eq (17), which means that updating Y is in a descent direction.
3) This update formula converges to a stationary point of the subproblem.
Proof. 1) Moving all Y^{k+1} terms in Eq (24) to the left side, we get (I + (τ/2)W)Y^{k+1} = (I − (τ/2)W)Y^k. Multiplying both sides by the inverse (I + (τ/2)W)^{−1}, we obtain the closed form solution of Y^{k+1}:

Y^{k+1} = (I + (τ/2)W)^{−1}(I − (τ/2)W)Y^k.

To verify the orthogonality, note that

(I − (τ/2)W)^T = I + (τ/2)W  and  (I + (τ/2)W)^{−T} = (I − (τ/2)W)^{−1},

where the equalities follow from W^T = −W, since W is skew-symmetric. Furthermore, I + (τ/2)W and I − (τ/2)W (and their inverses) commute with each other, so

Y^{k+1T}Y^{k+1} = Y^{kT}(I + (τ/2)W)(I − (τ/2)W)^{−1}(I + (τ/2)W)^{−1}(I − (τ/2)W)Y^k = Y^{kT}Y^k = I.

2) According to the chain rule, we have

∂J(Y^{k+1})/∂τ = tr((∂J(Y^{k+1})/∂Y^{k+1})^T · ∂Y^{k+1}/∂τ).

When τ = 0, Y^{k+1} = Y^k, ∂J(Y^{k+1})/∂Y^{k+1}|_{τ=0} = 2(AY^k + B), and ∂Y^{k+1}/∂τ|_{τ=0} = −WY^k, so

∂J(Y^{k+1})/∂τ |_{τ=0} = −2 tr((AY^k + B)^T W Y^k) = −(1/2)||W||²_F ≤ 0.

This means that if Y moves a small step Δτ > 0 in the update direction, the objective function J changes by approximately −(1/2)||W||²_F Δτ ≤ 0, so J decreases. Thus the update direction is a descent direction.
3) Since the objective function decreases monotonically and is lower bounded over the feasible set (the orthogonality constraint makes the feasible set compact), the iteration converges. At convergence, ||W||²_F = 0 at τ = 0, i.e., W = 0, which satisfies the first-order optimality condition of our subproblem according to Lemma 1. So the algorithm converges to a stationary point of this subproblem.
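The key property of the update in Theorem 2, exact orthogonality preservation for any skew-symmetric W, is easy to verify numerically; the data below is synthetic and for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, c, tau = 6, 2, 0.5

# A random skew-symmetric W (W^T = -W) and an orthonormal Y (Y^T Y = I).
A = rng.standard_normal((n, n))
W = A - A.T
Y, _ = np.linalg.qr(rng.standard_normal((n, c)))
assert np.allclose(W.T, -W)

# Closed-form update of Eq (25): Y_new = (I + tau/2 W)^{-1} (I - tau/2 W) Y.
# np.linalg.solve avoids forming the inverse explicitly.
I = np.eye(n)
Y_new = np.linalg.solve(I + 0.5 * tau * W, (I - 0.5 * tau * W) @ Y)

# Orthogonality is preserved exactly (up to floating-point error),
# so the iterate never leaves the feasible set Y^T Y = I.
assert np.allclose(Y_new.T @ Y_new, np.eye(c), atol=1e-10)
```

This holds for any step size τ, which is what allows the curvilinear search over τ mentioned below without ever violating the constraint.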
Note that we choose the iteration step size τ by a curvilinear search method as was done in [28], which guarantees convergence. We therefore compute Y^{k+1} iteratively with the update formula Eq (25) until the descent process converges. Clearly, the computationally heaviest step in this algorithm is computing the matrix inverse (I + (τ/2)W)^{−1}, which is O(n³). Fortunately, we can find a fast way to calculate it. By the definition of W, we can rewrite it as W = UG, where U = [2M, −2Y] ∈ R^{n×2c}, G = [Y, M]^T ∈ R^{2c×n} and M = AY + B. Then, according to [29], we have

(I + (τ/2)UG)^{−1} = I − (τ/2)U(I + (τ/2)GU)^{−1}G.

Since I + (τ/2)GU is a 2c × 2c matrix, where c is the dimension of the embedding space and often c ≪ n, we can efficiently compute the original inverse by matrix multiplication (O(n²c)) and the inverse of a much smaller matrix (O(c³)).
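As a sanity check on this speed-up, the low-rank inverse identity from [29] (the Woodbury identity) can be compared against the direct O(n³) inverse on synthetic data; the factorization of W below mirrors the one used in the text:

```python
import numpy as np

rng = np.random.default_rng(2)
n, c, tau = 200, 3, 0.4

# W = 2(M Y^T - Y M^T) is skew-symmetric and has rank at most 2c:
# W = U G with U = [2M, -2Y] in R^{n x 2c} and G = [Y, M]^T in R^{2c x n}.
M = rng.standard_normal((n, c))
Y, _ = np.linalg.qr(rng.standard_normal((n, c)))
U = np.hstack([2 * M, -2 * Y])
G = np.hstack([Y, M]).T
I = np.eye(n)
a = 0.5 * tau

# Direct O(n^3) inverse of (I + tau/2 W) ...
inv_direct = np.linalg.inv(I + a * (U @ G))
# ... versus the identity (I + a U G)^{-1} = I - a U (I + a G U)^{-1} G,
# which only needs O(n^2 c) products and a (2c x 2c) inverse.
inv_fast = I - a * U @ np.linalg.inv(np.eye(2 * c) + a * (G @ U)) @ G

assert np.allclose(inv_direct, inv_fast, atol=1e-8)
```

Note that I + (τ/2)W is always invertible here because W is skew-symmetric, so its eigenvalues are purely imaginary.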
After getting Y, V (1) , . . .,V (m) , we use spectral rotation [20] to discretize Y to get the final clustering result. Algorithm 2 summarizes the whole algorithm.

Convergence analysis and time complexity
According to Theorem 1 and Theorem 2, whether updating Y or V^(i), the objective function decreases monotonically. Moreover, the objective function is lower bounded. Thus Algorithm 2 converges. In fact, this algorithm converges very fast (within a few tens of iterations in practice).
Since we decrease the time complexity of the matrix inverse from O(n³) to O(n²c + c³), the computationally heaviest step becomes matrix multiplication. Among all matrix multiplications in our method, the highest time complexity is O(n²c), generated by multiplying an n × n matrix with an n × c matrix. So the time complexity is quadratic in the number of instances.

Experiments
In this section, we evaluate the effectiveness of DCSC by comparing it with several state-of-the-art multi-view clustering methods on benchmark data sets.

Data sets
We use 8 data sets in total to evaluate the effectiveness of our method, including the WebKb data set [30], which contains webpages collected from four universities (Cornell, Texas, Washington and Wisconsin) and is available at http://membres-liglab.imag.fr/grimal/data.html; the UCI handwritten digit data set [5], which can be found at http://archive.ics.uci.edu/ml/datasets/Multiple+Features; the Advertisements data set [31], published at http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements; the Corel image data set [32], which can be found at http://www.cs.virginia.edu/~xj3a/research/CBIR/Download.htm; and the Flower17 data set [33], available at http://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html. Statistics of these data sets are summarized in Table 1. Since the Flower17 data set only contains 7 distance matrices constructed from its 7 views, we do not show the dimension of each view in Table 1.

Compared methods
To demonstrate the effectiveness of our method, we compare DCSC with the following algorithms: • FeaConcat, which first concatenates features in all views and then applies spectral clustering on it.
• RMKMC [1], which is a robust k-means based multi-view clustering method.
• Co-reg SC [4], which is a co-regularized multi-view spectral clustering method.
• RMSC [5], which is a robust multi-view spectral clustering with sparse low-rank decomposition.
• AMGL [6], which is a parameter-free multi-view spectral clustering method, i.e., it learns an optimal weight for each view automatically without introducing an additional parameter.
• SwMC [34], which is a self-weighted multi-view clustering method with multiple graphs.
• DCSC-ω. To show the effect of the prior weight ω in our method, we remove the ω (or equivalently, we set all ω ij to 1) and obtain DCSC-ω.

Experiment setup
The number of clusters is set to the true number of classes for all data sets and all methods.
Since the results of most compared algorithms depend on initialization, we independently repeat each experiment 10 times and report the average results and t-test results. In our method, we tune λ_1 in [10^{−5}, 10^{5}] and λ_2 in [10^{−4}, 10^{4}] by grid search. Note that λ_1 absorbs the scaling factor (n − 1)^{−2}, which depends on the size of the data set; in our experiments, the size ranges from 100 to 5000, which is relatively narrow, so we tune λ_1 in the wider range [10^{−5}, 10^{5}]. Of course, other parameter tuning strategies are possible, for example, letting λ̂ = λ_1 · (n − 1)², so that λ̂ is independent of n, and tuning λ̂ in a predefined range. For the other compared methods, we tune the parameters as suggested in their papers. Three clustering evaluation metrics are adopted to measure the clustering performance: ACC, NMI and Purity.

Tables 2-4 show the ACC, NMI and Purity results of all methods on all data sets, respectively. Bold font indicates that the difference is statistically significant (the p-value of the t-test is smaller than 0.05). Note that since the Flower17 data set only contains distance matrices instead of the original features, we construct Laplacian matrices from the distance matrices and only compare our method with spectral clustering based methods on it. In Tables 2-4, on most data sets, the spectral based methods (Co-reg SC, RMSC, AMGL, SwMC, RAMC and ours) perform much better than spectral clustering on concatenated features, which demonstrates the effectiveness of multi-view clustering. Moreover, our method outperforms the other compared methods on most data sets, which means that taking the divergence of each view into consideration is indeed helpful for multi-view clustering.
Especially on the Corel data set, which is the most difficult for clustering (it has the most views (7) and the most classes (34)), our method achieves 23%, 12%, and 18% improvements over the second best method on ACC, NMI and Purity, respectively. Since our method explicitly captures the distinct variances of all views by minimizing the dependency among them, the remainder is a cleaner consensus spectral embedding, leading to a better clustering result. Note that DCSC also obtains better results than DCSC-ω, which means that considering the similarity between views improves the performance of our method.

Convergence study
We show the convergence of the algorithm on the UCI Digit, Advertisements, Corel and Flower17 data sets in Fig 2; the results on the other data sets are similar. Fig 2 shows that our method converges within a small number of iterations, which empirically confirms our claims in the previous section.

Parameter study
We explore the effect of the parameters on clustering performance. There are two parameters in our method: λ_1 and λ_2. We show the ACC, NMI and Purity on the UCI Digit and Corel data sets; the results are similar on other data sets. Fig 3 shows the results, from which we can see that the performance of our method is stable across a wide range of parameter values.

Conclusion
In this paper, we proposed a novel multi-view spectral clustering method, DCSC. We explicitly captured the distinguishing or complementary information in each view and took advantage of this distinct information to learn a better consensus clustering result. To characterize the distinct information effectively, we used HSIC to control the diversity across views. Since the introduced optimization problem contains several variables, we presented a block coordinate descent algorithm to solve it and proved its convergence. Finally, experiments on benchmark data sets showed that the proposed method outperforms state-of-the-art multi-view clustering methods.
Since scalability is a serious problem in spectral clustering, in the future we will study the scalability issue of multi-view spectral clustering and apply it to large-scale data sets.