## Abstract

Since multi-view data are available in many real-world clustering problems, multi-view clustering has received considerable attention in recent years. Most existing multi-view clustering methods learn a consensus clustering result but do not make full use of the distinct knowledge in each view, so they cannot guarantee the complementarity across different views. In this paper, we propose Distinction based Consensus Spectral Clustering (DCSC), which not only learns a consensus clustering result but also explicitly captures the distinct variance of each view. By using the distinct variance of each view, DCSC can learn a clearer consensus clustering result. To solve the resulting optimization problem effectively, we develop a block coordinate descent algorithm which is theoretically guaranteed to converge. Experimental results on real-world data sets demonstrate the effectiveness of our method.

**Citation:** Zhou P, Ye F, Du L (2018) Spectral clustering with distinction and consensus learning on multiple views data. PLoS ONE 13(12): e0208494. https://doi.org/10.1371/journal.pone.0208494

**Editor:** Ivan Olier, Liverpool John Moores University, UNITED KINGDOM

**Received:** April 17, 2018; **Accepted:** November 19, 2018; **Published:** December 6, 2018

**Copyright:** © 2018 Zhou et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

**Data Availability:** All relevant data are within the paper and its Supporting Information files.

**Funding:** This work is supported in part by the Key Natural Science Project of Anhui Provincial Education Department KJ2018A0010, and National Natural Science Foundation of China grant 61502289.

**Competing interests:** The authors have declared that no competing interests exist.

## Introduction

Many real-world data sets are represented in multiple views. For example, images on the web may have two views: visual information and textual tags; multi-lingual data sets have multiple representations in different languages. Different views can often provide complementary information which is very helpful to improve the performance of learning. Multi-view clustering aims to do clustering on such multi-view data by using the information from all views.

Over the past years, many multi-view clustering methods have been proposed. Roughly speaking, depending on the goal of clustering, they can be categorized into two closely related but different families. The first family aims to learn a common or consensus clustering result from multiple views. These methods [1–6] usually extend single view clustering methods such as spectral clustering or nonnegative matrix factorization (NMF) to deal with multi-view data. For example, Cai et al. extended k-means for multi-view data, leading to a multi-view k-means clustering method [1]; Liu et al. extended the NMF method to multi-view clustering [2]; Kumar et al. presented a co-training approach for multi-view spectral clustering by bootstrapping the clusterings of different views [3]. Kumar et al. also proposed two co-regularization based approaches for multi-view spectral clustering by enforcing the clustering hypotheses on different views to agree with each other [4]. Xia et al. proposed a robust multi-view spectral clustering method by building a Markov chain based on a low rank and sparse decomposition method [5]. Nie et al. presented a parameter-free multi-view spectral clustering method, which learns an optimal weight for each view automatically without introducing an additive parameter, and extended it to semi-supervised classification [6, 7]. Zhang et al. learned a uniform projection to map the multiple views to a consensus embedding space [8]. Tang et al. extended NMF to unsupervised multi-view feature selection by making use of the consensus information of all views [9]. All these methods learn the consensus clustering from each view, while ignoring the distinct information that exists only in one view and not in the others. Therefore, these methods cannot guarantee the complementarity across different views [10].

The second family, instead of learning a consensus clustering result, learns a distinct clustering result on each view. These methods [10–12] make use of complementary information to improve the performance of clustering on each view. For example, Günnemann et al. presented a multi-view clustering method based on subspace learning, which provides multiple generalizations of the data by modeling individual mixture models, each representing a distinct view [11]; Cao et al. also presented a subspace learning based multi-view clustering method which learns a better clustering result on each view [10]. Different from the first family, these methods learn the clustering results on each view instead of a consensus clustering result.

In this paper, we focus on the first family, which is to learn a consensus clustering result. It is well known that multi-view learning follows the complementary principle, i.e., each view of the data may contain some knowledge that other views do not have [13, 14], and some theoretical and experimental results [11, 15, 16] have already demonstrated it. However, most existing consensus clustering methods do not make full use of the distinguishing information, and thus cannot guarantee the complementarity across different views [10]. To address this issue, we propose Distinction based Consensus Spectral Clustering (DCSC), which borrows the main idea of the second family, namely that the distinguishing information may be helpful, to learn a better consensus result.

Since spectral clustering is a widely used clustering method in both single view and multiple view settings [4, 5, 17], we adopt it as the basic clustering model for our method. The essential step of spectral clustering is to learn a spectral embedding. In our method, the underlying spectral embedding of each view consists of two parts: one is the consensus embedding of all views and the other is the sparse variance of each view. In order to distinguish between variances, we apply the Hilbert-Schmidt Independence Criterion (HSIC) to measure and control the diversities of all views. Therefore, we learn a cleaner consensus embedding of all views by explicitly capturing the distinct variance of each view. Note that Liu et al. [18] also considered consistency and complementarity, i.e., they decomposed the latent factor into two parts: the common part and the specific part. However, they did not impose any constraints (like HSIC in our method) on the specific parts to control the diversity, so the specific parts learned by their method may be similar to each other.

We develop a block coordinate descent algorithm for effectively learning the consensus embedding and distinguishing variances, which is theoretically guaranteed to converge. The experiments on benchmark data sets show that our method outperforms the closely related algorithms, which indicates the importance of using distinct information in multi-view clustering.

To sum up, we highlight the main contribution of this paper here: we propose a new multi-view spectral clustering method, which uses the HSIC to explicitly capture the distinction information of all views and can obtain a clearer and more accurate consensus result; and then, we provide a block coordinate descent algorithm to solve it effectively and the experimental results demonstrate that our algorithm outperforms other state-of-the-art methods.

The remainder of this paper is organized as follows: Section 2 introduces some preliminaries. Section 3 presents our Distinction based Consensus Spectral Clustering method. Section 4 shows the experimental results. Section 5 concludes this paper.

## Preliminaries

### Spectral clustering

Spectral clustering [19] is a widely used clustering method. Given a data set which contains *n* data points {*x*_{1}, …, *x*_{n}}, it first defines a similarity matrix **S** ∈ ℝ^{n × n}, where *S*_{ij} ≥ 0 denotes the similarity of *x*_{i} and *x*_{j}. Then it constructs a Laplacian matrix **L** by **L** = **I** − **D**^{−1/2} **S** **D**^{−1/2}, where **I** is an identity matrix and **D** is a diagonal matrix with the (*i*, *i*)-th element *D*_{ii} = ∑_{j} *S*_{ij}. Spectral clustering aims to learn a spectral embedding **Y** ∈ ℝ^{n × c} (*c* is the dimension of the embedding space and is often set to the number of clusters) by optimizing the following objective function

$$\min_{\mathbf{Y}} \; \mathrm{tr}\left(\mathbf{Y}^{T} \mathbf{L} \mathbf{Y}\right) \quad \text{s.t.} \quad \mathbf{Y}^{T} \mathbf{Y} = \mathbf{I}. \tag{1}$$

After obtaining the spectral embedding **Y**, it applies k-means or spectral rotation [20] to discretize **Y** and obtain the final clustering result.
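The embedding step can be illustrated with a short NumPy sketch. It assumes the symmetric normalized Laplacian **L** = **I** − **D**^{−1/2} **S** **D**^{−1/2} and takes the eigenvectors of the *c* smallest eigenvalues as **Y**; the function name `spectral_embedding` and the toy similarity matrix are hypothetical and not part of the original method description.

```python
import numpy as np

def spectral_embedding(S, c):
    """Return an n x c spectral embedding for a symmetric similarity matrix S.

    Assumes the normalized Laplacian L = I - D^{-1/2} S D^{-1/2}.
    """
    S = np.asarray(S, dtype=float)
    degrees = S.sum(axis=1)
    # Guard against isolated points whose degree is zero.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degrees, 1e-12))
    L = np.eye(len(S)) - d_inv_sqrt[:, None] * S * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order, so the first c eigenvectors
    # minimize tr(Y^T L Y) subject to Y^T Y = I.
    _, eigenvectors = np.linalg.eigh(L)
    return eigenvectors[:, :c]

# Toy similarity matrix with two obvious groups (illustrative data only).
S_toy = np.array([[1.0, 0.9, 0.0, 0.0],
                  [0.9, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.8],
                  [0.0, 0.0, 0.8, 1.0]])
Y_toy = spectral_embedding(S_toy, c=2)
```

The rows of `Y_toy` would then be discretized, for example by k-means, to obtain cluster labels.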

### Hilbert-Schmidt independence criterion

Let *k*_{1} and *k*_{2} be two positive-definite reproducing kernels that correspond to RKHSs (Reproducing Kernel Hilbert Spaces) [21] ℱ and 𝒢 respectively, with inner products *k*_{1}(*x*_{i}, *x*_{j}) = 〈*ϕ*(*x*_{i}), *ϕ*(*x*_{j})〉 and *k*_{2}(*y*_{i}, *y*_{j}) = 〈*ψ*(*y*_{i}), *ψ*(*y*_{j})〉, where *ϕ*: 𝒳 → ℱ and *ψ*: 𝒴 → 𝒢 are two maps and 〈⋅, ⋅〉 denotes the inner product. Then the cross-covariance operator is defined as:

$$\mathbf{C}_{xy} = E_{xy}\left[ (\phi(x) - \mu_{x}) \otimes (\psi(y) - \mu_{y}) \right], \tag{2}$$

where ⊗ is the outer product, *μ*_{x} = *E*(*ϕ*(*x*)) and *μ*_{y} = *E*(*ψ*(*y*)), and *E*(⋅) denotes the expectation. Then the Hilbert-Schmidt Independence Criterion (HSIC) is defined as:

**Definition 1**. [22] *Given two separable RKHSs* ℱ *and* 𝒢 *and a joint distribution p*_{xy}, *we define the HSIC as the squared Hilbert-Schmidt norm of the associated cross-covariance operator* **C**_{xy}:

$$HSIC(p_{xy}, \mathcal{F}, \mathcal{G}) = \left\| \mathbf{C}_{xy} \right\|_{HS}^{2}, \tag{3}$$

*where* ‖⋅‖_{HS} *denotes the Hilbert-Schmidt norm of a matrix*.

From the definition, we can see that HSIC can be used to measure the independence of two variables, i.e., the smaller the HSIC is, the more independent the two variables are.

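Since the joint distribution *p*_{xy} is unknown in most cases, the empirical version of HSIC is often used. Let **Z**^{(1)} and **Z**^{(2)} be two data sets which contain {*z*_{1}^{(1)}, …, *z*_{n}^{(1)}} and {*z*_{1}^{(2)}, …, *z*_{n}^{(2)}} as their data respectively. Then the empirical version of HSIC is defined as:

$$HSIC(\mathbf{Z}^{(1)}, \mathbf{Z}^{(2)}) = (n - 1)^{-2} \, \mathrm{tr}\left( \mathbf{K}^{(1)} \mathbf{H} \mathbf{K}^{(2)} \mathbf{H} \right), \tag{4}$$

where **K**^{(1)} and **K**^{(2)} are the Gram matrices with the (*i*, *j*)-th element *K*^{(1)}_{ij} = *k*_{1}(*z*_{i}^{(1)}, *z*_{j}^{(1)}) and *K*^{(2)}_{ij} = *k*_{2}(*z*_{i}^{(2)}, *z*_{j}^{(2)}), and **H** is a centering matrix defined by **H** = **I** − (1/*n*)**1****1**^{T}, where **1** is an all-ones vector. More details of HSIC can be found in [22].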
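As a rough illustration, the empirical HSIC with linear kernels (the kernel choice used later in the formulation) can be computed in a few lines of NumPy. The function name `empirical_hsic` is hypothetical, and the sketch assumes each data set is stored as an *n* × *d* array with one sample per row.

```python
import numpy as np

def empirical_hsic(Z1, Z2):
    """Empirical HSIC of two data sets with n rows each, using linear kernels."""
    n = Z1.shape[0]
    K1 = Z1 @ Z1.T                       # Gram matrix of the first data set
    K2 = Z2 @ Z2.T                       # Gram matrix of the second data set
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix I - (1/n) 1 1^T
    return np.trace(K1 @ H @ K2 @ H) / (n - 1) ** 2
```

The value is non-negative and symmetric in its two arguments, and it vanishes when one of the data sets is constant across samples, since centering removes it entirely.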

## Distinction based consensus spectral clustering

In this section, we present the framework of DCSC, and then introduce how to solve the introduced optimization problem.

### Formulation

Given a multi-view data set {**X**^{(1)}, …, **X**^{(m)}} which contains *n* instances, where *m* is the number of views, we can construct *m* Laplacian matrices **L**^{(1)}, …, **L**^{(m)} as in [19]. Then for each view we could learn the spectral embedding by solving Eq (1). However, in multi-view data, each view contains both common information and distinguishing knowledge. To capture them, we decompose the spectral embedding of the *i*-th view into two parts: the consensus embedding **Y** ∈ ℝ^{n × c} and the distinct variance **V**^{(i)} ∈ ℝ^{n × c}. Then the objective function of Eq (1) can be rewritten as

$$\min_{\mathbf{Y}, \mathbf{V}^{(1)}, \ldots, \mathbf{V}^{(m)}} \; \sum_{i=1}^{m} \mathrm{tr}\left( (\mathbf{Y} + \mathbf{V}^{(i)})^{T} \mathbf{L}^{(i)} (\mathbf{Y} + \mathbf{V}^{(i)}) \right) \quad \text{s.t.} \quad \mathbf{Y}^{T} \mathbf{Y} = \mathbf{I}, \tag{5}$$

where the orthogonal constraint **Y**^{T} **Y** = **I** is to avoid the trivial solution.

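Since **V**^{(1)}, …, **V**^{(m)} denote the distinct variance of each view, they should be far apart from each other. Here we use the aforementioned HSIC to measure the difference between **V**^{(i)} and **V**^{(j)}. As we wish **V**^{(i)} and **V**^{(j)} to be far apart, we should minimize *HSIC*(**V**^{(i)}, **V**^{(j)}), so we add the term λ_{1} ∑_{i ≠ j} *HSIC*(**V**^{(i)}, **V**^{(j)}) to the objective function. Then, we obtain the following formulation:

$$\min_{\mathbf{Y}, \mathbf{V}^{(1)}, \ldots, \mathbf{V}^{(m)}} \; \sum_{i=1}^{m} \mathrm{tr}\left( (\mathbf{Y} + \mathbf{V}^{(i)})^{T} \mathbf{L}^{(i)} (\mathbf{Y} + \mathbf{V}^{(i)}) \right) + \lambda_{1} \sum_{i \neq j} HSIC(\mathbf{V}^{(i)}, \mathbf{V}^{(j)}) \quad \text{s.t.} \quad \mathbf{Y}^{T} \mathbf{Y} = \mathbf{I}, \tag{6}$$

where λ_{1} is a balancing parameter to control the diversity.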

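Moreover, although each view may contain some complementary or distinct information, we focus on the first family, which aims to learn a consensus clustering result, so the consensus embedding is the main part and also what we really want. We therefore wish each view to contain only **a small quantity of** distinct information, which means the variance **V**^{(i)} should be sparse. To ensure this, we impose the *ℓ*_{1}-norm on each **V**^{(i)} and obtain:

$$\min_{\mathbf{Y}, \mathbf{V}^{(1)}, \ldots, \mathbf{V}^{(m)}} \; \sum_{i=1}^{m} \mathrm{tr}\left( (\mathbf{Y} + \mathbf{V}^{(i)})^{T} \mathbf{L}^{(i)} (\mathbf{Y} + \mathbf{V}^{(i)}) \right) + \lambda_{1} \sum_{i \neq j} HSIC(\mathbf{V}^{(i)}, \mathbf{V}^{(j)}) + \lambda_{2} \sum_{i=1}^{m} \left\| \mathbf{V}^{(i)} \right\|_{1} \quad \text{s.t.} \quad \mathbf{Y}^{T} \mathbf{Y} = \mathbf{I}, \tag{7}$$

where λ_{2} is another balancing parameter to control the sparsity.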

Fig 1 shows a toy example, where **Y**^{(i)} is the embedding of the *i*-th view and contains two parts: the consensus part **Y** and the distinct variance **V**^{(i)}, i.e., **Y**^{(i)} = **Y** + **V**^{(i)}. Fig 1(a) illustrates the ideal embedding we aim to learn. The orthogonal consensus embedding **Y** should contain most of the information, i.e., the distinct variance **V**^{(i)} contains only a few non-zero elements. In addition, the variances **V**^{(i)} contain the distinct information, and thus should be as different as possible from each other, as shown in Fig 1(a). Fig 1(b) shows the result when **Y** = 1/3∑_{i} **Y**^{(i)} without any constraints on **V**^{(i)}. It is easy to verify that ∑_{i, j} *HSIC*(**V**^{(i)}, **V**^{(j)}) in Fig 1(a) is much smaller than that in Fig 1(b), which means that the **V**^{(i)} in Fig 1(a) behave more like distinct parts. Therefore, the consensus **Y** obtained by subtracting **V**^{(i)} from **Y**^{(i)} is cleaner in Fig 1(a).

**Fig 1.** (a) Consensus embedding and distinct variances satisfy our constraints. (b) Consensus embedding obtained by averaging all views without any constraints on the distinct variances.

It is worth noting that [5] decomposes the transition probability matrix of each view into a consensus transition probability matrix and a sparse noise matrix, which looks similar to our method. However, the two methods differ in two respects. Firstly, the motivations are different. Their approach mainly considers robustness, while our method tries to discover the distinguishing information in each view. Secondly, their method only imposes sparsity on the noise matrices and does not control the diversity, so the noise matrices are not necessarily far apart from each other. In our method, we explicitly control the diversity by minimizing the HSIC of each pair of views.

In Eq (7), we treat the difference of each pair of views equally, because in the term ∑_{i ≠ j} *HSIC*(**V**^{(i)}, **V**^{(j)}), the weights of all pairs *HSIC*(**V**^{(i)}, **V**^{(j)}) are 1. However, in practice, if two views are more different, we wish the variance matrices of these two views to be farther apart. Therefore, we replace this term with ∑_{i ≠ j} *ω*_{ij} *HSIC*(**V**^{(i)}, **V**^{(j)}), where the pre-defined weight *ω*_{ij} is prior information that captures the diversity between **V**^{(i)} and **V**^{(j)} and controls the diversities more precisely. Intuitively, if the *i*-th view is more different from the *j*-th view, we need to impose a larger weight *ω*_{ij} to keep *HSIC*(**V**^{(i)}, **V**^{(j)}) as small as possible. There are many ways to set *ω*_{ij}. In this paper, we first use the similarity matrices **S**^{(i)} and **S**^{(j)} to compute an average similarity score *β*_{ij} ∈ [0, 1] of the *i*-th view and the *j*-th view. In more detail, we compute *β*_{ij} = 〈**S**^{(i)}, **S**^{(j)}〉/*n*^{2}, i.e., *β*_{ij} is the normalized inner product of **S**^{(i)} and **S**^{(j)}, which represents the similarity of the *i*-th and the *j*-th view. Then we use a technique similar to that in [23] to get *ω*_{ij} = *f*(*β*_{ij}). Since *f*(⋅) is monotonically decreasing, a smaller *β*_{ij} leads to a larger *ω*_{ij}, which satisfies the desired property of the weight: the more different the two views are, the larger the weight is.

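Here, for simplicity, we use the linear kernel, i.e., **K**^{(i)} = **V**^{(i)} **V**^{(i)T}. Taking Eq (4) into our objective function, we get the final formulation of our method:

$$\min_{\mathbf{Y}, \mathbf{V}^{(1)}, \ldots, \mathbf{V}^{(m)}} \; \sum_{i=1}^{m} \mathrm{tr}\left( (\mathbf{Y} + \mathbf{V}^{(i)})^{T} \mathbf{L}^{(i)} (\mathbf{Y} + \mathbf{V}^{(i)}) \right) + \lambda_{1} \sum_{i \neq j} \omega_{ij} \, \mathrm{tr}\left( \mathbf{V}^{(i)} \mathbf{V}^{(i)T} \mathbf{H} \mathbf{V}^{(j)} \mathbf{V}^{(j)T} \mathbf{H} \right) + \lambda_{2} \sum_{i=1}^{m} \left\| \mathbf{V}^{(i)} \right\|_{1} \quad \text{s.t.} \quad \mathbf{Y}^{T} \mathbf{Y} = \mathbf{I}. \tag{8}$$

Note that for notational convenience, we absorb the scaling factor (*n* − 1)^{−2} of HSIC into the parameter λ_{1}. By explicitly capturing the variances **V**^{(1)}, …, **V**^{(m)}, Eq (8) can learn a clearer consensus spectral embedding **Y**.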
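To make the three ingredients of the final formulation concrete, the following sketch evaluates the objective for given variables. It is an illustration under stated assumptions rather than the authors' code: the helper name `dcsc_objective` is hypothetical, the diversity term is summed over ordered pairs *i* ≠ *j*, the weights are passed in as an *m* × *m* array `omega`, and the orthogonality constraint on **Y** is not checked.

```python
import numpy as np

def dcsc_objective(Y, Vs, Ls, omega, lam1, lam2):
    """Evaluate spectral fit + weighted HSIC diversity + l1 sparsity for m views."""
    n, m = Y.shape[0], len(Vs)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    # Spectral term: sum_i tr((Y + V_i)^T L_i (Y + V_i)).
    fit = sum(np.trace((Y + V).T @ L @ (Y + V)) for V, L in zip(Vs, Ls))
    # Linear-kernel Gram matrices K_i = V_i V_i^T.
    Ks = [V @ V.T for V in Vs]
    # Diversity term over ordered pairs i != j (an assumed index set);
    # the (n - 1)^{-2} factor is taken to be absorbed into lam1.
    diversity = sum(
        omega[i, j] * np.trace(Ks[i] @ H @ Ks[j] @ H)
        for i in range(m) for j in range(m) if i != j
    )
    # Sparsity term: entrywise l1-norm of every V_i.
    sparsity = sum(np.abs(V).sum() for V in Vs)
    return fit + lam1 * diversity + lam2 * sparsity
```

Such a function is mainly useful for monitoring: evaluating it after every block update gives the convergence curve of the alternating scheme.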

### Optimization

Eq (8) involves *m* + 1 variables (**Y**, **V**^{(1)}, …, **V**^{(m)}), thus we present a block coordinate descent scheme to optimize it. In particular, we optimize the objective w.r.t one variable while fixing the others. This procedure repeats until convergence.

#### Optimize V^{(i)} by fixing other variables.

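When **Y**, **V**^{(1)}, …,**V**^{(i − 1)},**V**^{(i + 1)}, …,**V**^{(m)} are fixed, Eq (8) can be rewritten (dropping the terms that do not depend on **V**^{(i)} and using *ω*_{ij} = *ω*_{ji}) as

$$\min_{\mathbf{V}^{(i)}} \; \mathrm{tr}\left( \mathbf{V}^{(i)T} \mathbf{C} \mathbf{V}^{(i)} \right) + 2 \, \mathrm{tr}\left( \mathbf{V}^{(i)T} \mathbf{E} \right) + \lambda_{2} \left\| \mathbf{V}^{(i)} \right\|_{1}, \tag{9}$$

where **C** and **E** are defined as

$$\mathbf{C} = \mathbf{L}^{(i)} + 2 \lambda_{1} \sum_{j \neq i} \omega_{ij} \mathbf{H} \mathbf{V}^{(j)} \mathbf{V}^{(j)T} \mathbf{H}, \qquad \mathbf{E} = \mathbf{L}^{(i)} \mathbf{Y}.$$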

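For notational convenience, we use **V** to replace **V**^{(i)}. Let

$$g(\mathbf{V}) = \mathrm{tr}\left( \mathbf{V}^{T} \mathbf{C} \mathbf{V} \right) + 2 \, \mathrm{tr}\left( \mathbf{V}^{T} \mathbf{E} \right),$$

it is easy to verify that the gradient of *g*, denoted as ∇*g*, is Lipschitz continuous with some constant Γ [24], i.e.,

$$\left\| \nabla g(\mathbf{V}_{1}) - \nabla g(\mathbf{V}_{2}) \right\|_{F} \leq \Gamma \left\| \mathbf{V}_{1} - \mathbf{V}_{2} \right\|_{F}. \tag{10}$$

So we can optimize this subproblem with the Accelerated Proximal Gradient Descent (APGD) method [25]. More specifically, instead of optimizing **V** by solving Eq (9) directly, we linearize *g* at **V**^{k} (the result of **V** in the *k*-th iteration) and add a proximal term:

$$g(\mathbf{V}) \approx g(\mathbf{V}^{k}) + \left\langle \nabla g(\mathbf{V}^{k}), \mathbf{V} - \mathbf{V}^{k} \right\rangle + \frac{\mu}{2} \left\| \mathbf{V} - \mathbf{V}^{k} \right\|_{F}^{2},$$

where *μ* > Γ.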

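Then we update **V**^{k + 1} by solving

$$\mathbf{V}^{k+1} = \arg\min_{\mathbf{V}} \; \left\langle \nabla g(\mathbf{V}^{k}), \mathbf{V} - \mathbf{V}^{k} \right\rangle + \frac{\mu}{2} \left\| \mathbf{V} - \mathbf{V}^{k} \right\|_{F}^{2} + \lambda_{2} \left\| \mathbf{V} \right\|_{1}, \tag{11}$$

where ∇*g*(**V**^{k}) denotes the gradient of the smooth part of Eq (9) at **V**^{k}. Let **Q** = **V**^{k} − ∇*g*(**V**^{k})/*μ* and *θ* = λ_{2}/*μ*, we can obtain **V**^{k + 1} by solving

$$\min_{\mathbf{V}} \; \frac{1}{2} \left\| \mathbf{V} - \mathbf{Q} \right\|_{F}^{2} + \theta \left\| \mathbf{V} \right\|_{1}. \tag{12}$$

Eq (12) can be easily solved by the soft-thresholding algorithm and has a closed form solution **V̂**:

$$\hat{V}_{ij} = sign(Q_{ij}) \max\left( |Q_{ij}| - \theta, 0 \right), \tag{13}$$

where *V̂*_{ij} and *Q*_{ij} are the (*i*, *j*)-th elements of the matrices **V̂** and **Q** respectively, and *sign*(⋅) is the sign function, i.e., *sign*(*x*) = −1 if *x* is negative, *sign*(*x*) = 1 if it is positive, and *sign*(*x*) = 0 otherwise. Then we can set **V**^{k + 1} = **V̂**. Up to this point, this is the Proximal Gradient Descent method. According to [25], we can update **V**^{k + 1} as follows to get a faster convergence rate:

$$\alpha^{k+1} = \frac{1 + \sqrt{1 + 4 (\alpha^{k})^{2}}}{2}, \tag{14}$$

$$\mathbf{V}^{k+1} = \hat{\mathbf{V}}^{k} + \frac{\alpha^{k} - 1}{\alpha^{k+1}} \left( \hat{\mathbf{V}}^{k} - \hat{\mathbf{V}}^{k-1} \right), \tag{15}$$

where *α*^{k} is an auxiliary variable and **V̂**^{k} denotes the solution of Eq (13) in the *k*-th iteration. Algorithm 1 provides the Accelerated Proximal Gradient Descent algorithm.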

**Algorithm 1** APGD for solving Eq (9)

**Input**: **C**, **E**, and the initial constant *μ*_{0}, **V**^{1}.

**Output**: **V**.

1: Initialize *μ* = *μ*_{0}, *α*^{1} = 1, *ρ* = 1.02, and *k* = 1.

2: **while** not converge **do**

3: Calculate **V̂**^{k} by Eq (13).

4: **while** *μ* is not appropriate **do**

5: Set *μ* = *μρ*.

6: Calculate **V̂**^{k} by Eq (13).

7: **end while**

8: Set *α*^{k+1} and **V**^{k+1} by Eqs (14) and (15).

9: Set *k* = *k* + 1.

10: **end while**

Here *ρ* is a constant rate to update *μ*. We need to check whether *μ* is appropriate because *μ* should satisfy *μ* > Γ, while in most cases, Γ is unknown. We can check it with the method in [25] and for saving space, we omit it here.
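The accelerated scheme can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the subproblem has the form tr(**V**^{T}**C****V**) + 2 tr(**V**^{T}**E**) + λ_{2}‖**V**‖_{1} with a symmetric positive semidefinite **C**, and it replaces the search for an appropriate *μ* with the fixed choice *μ* = 2λ_{max}(**C**), the Lipschitz constant of the gradient of the smooth part. The names `soft_threshold` and `apgd` are hypothetical.

```python
import numpy as np

def soft_threshold(Q, theta):
    """Entrywise solution of min_V 0.5 * ||V - Q||_F^2 + theta * ||V||_1."""
    return np.sign(Q) * np.maximum(np.abs(Q) - theta, 0.0)

def apgd(C, E, lam2, n_iter=200):
    """Accelerated proximal gradient for tr(V^T C V) + 2 tr(V^T E) + lam2 * ||V||_1.

    Assumes C is symmetric positive semidefinite. A fixed mu is used here
    instead of the backtracking loop of Algorithm 1.
    """
    # The gradient of the smooth part is 2 (C V + E); its Lipschitz constant
    # is twice the largest eigenvalue of C.
    mu = 2.0 * np.linalg.eigvalsh(C)[-1] + 1e-12
    V_hat_prev = np.zeros_like(E, dtype=float)
    V = V_hat_prev.copy()   # extrapolated point at which the gradient is taken
    alpha = 1.0
    for _ in range(n_iter):
        grad = 2.0 * (C @ V + E)
        V_hat = soft_threshold(V - grad / mu, lam2 / mu)
        alpha_next = (1.0 + np.sqrt(1.0 + 4.0 * alpha ** 2)) / 2.0
        # Momentum step combining the two most recent thresholded solutions.
        V = V_hat + (alpha - 1.0) / alpha_next * (V_hat - V_hat_prev)
        V_hat_prev, alpha = V_hat, alpha_next
    return V_hat_prev
```

With a very large `lam2`, the thresholding step zeroes every entry and the routine returns the all-zero matrix, which matches the role of λ_{2} as the sparsity control.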

We show in the next theorem that this algorithm converges to the global optimum of this subproblem.

**Theorem 1**. [24] *Let* **V**^{k} *be the sequence generated by Algorithm 1. Then for any k* ≥ 1,
(16) *where* *is the objective function defined in* Eq (9) *and* *is the global optima of* Eq (9).

*Proof*. See the proof of Theorem 4.4 in [24].

#### Optimize Y by fixing V^{(i)}.

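When **V**^{(1)}, …,**V**^{(m)} are fixed, we rewrite Eq (8) (dropping the terms that do not depend on **Y**) as follows

$$\min_{\mathbf{Y}} \; \mathrm{tr}\left( \mathbf{Y}^{T} \mathbf{A} \mathbf{Y} \right) + 2 \, \mathrm{tr}\left( \mathbf{Y}^{T} \mathbf{B} \right) \quad \text{s.t.} \quad \mathbf{Y}^{T} \mathbf{Y} = \mathbf{I}, \tag{17}$$

where **A** and **B** are defined as

$$\mathbf{A} = \sum_{i=1}^{m} \mathbf{L}^{(i)}, \qquad \mathbf{B} = \sum_{i=1}^{m} \mathbf{L}^{(i)} \mathbf{V}^{(i)}.$$

To handle the constraints, we obtain the Lagrangian function of Eq (17) by introducing the Lagrangian multiplier **Λ**,

$$\mathcal{L}(\mathbf{Y}, \mathbf{\Lambda}) = \mathrm{tr}\left( \mathbf{Y}^{T} \mathbf{A} \mathbf{Y} \right) + 2 \, \mathrm{tr}\left( \mathbf{Y}^{T} \mathbf{B} \right) - \mathrm{tr}\left( \mathbf{\Lambda} (\mathbf{Y}^{T} \mathbf{Y} - \mathbf{I}) \right). \tag{18}$$

Setting the partial derivative with respect to **Y** to zero, we get

$$\frac{\partial \mathcal{L}}{\partial \mathbf{Y}} = 2 (\mathbf{A} \mathbf{Y} + \mathbf{B}) - 2 \mathbf{Y} \mathbf{\Lambda} = \mathbf{0}. \tag{19}$$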

Multiplying both sides of Eq (19) by **Y**^{T} and using the fact that **Y**^{T} **Y** = **I**, we can get **Λ** = **Y**^{T}(**A****Y** + **B**). Since **Y**^{T} **Y** is symmetric, the Lagrangian multiplier **Λ** corresponding to **Y**^{T} **Y** = **I** is also symmetric, so we can rewrite it as **Λ** = (**A****Y** + **B**)^{T} **Y**. Taking it into Eq (19), we have

$$2 (\mathbf{A} \mathbf{Y} + \mathbf{B}) - 2 \mathbf{Y} (\mathbf{A} \mathbf{Y} + \mathbf{B})^{T} \mathbf{Y} = \mathbf{0}. \tag{20}$$

We denote 2(**A****Y****Y**^{T} + **B****Y**^{T} − **Y**(**A****Y** + **B**)^{T}) as **W**, and have the following Lemma which shows the first order condition of Eq (17):

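**Lemma 1**. *Let* ℒ *be the Lagrangian function of* Eq (17) *with* **Λ** = (**A****Y** + **B**)^{T} **Y**. *Then* ∂ℒ/∂**Y** = **0** *if and only if* **W** = **0**, *so* **W** = **0** *is the first-order optimality condition of* Eq (17).

*Proof*. On one hand, according to the definition of **W** and **Y**^{T} **Y** = **I**, we have ∂ℒ/∂**Y** = **W****Y**, so if **W** = **0**, then ∂ℒ/∂**Y** = **0**.

On the other hand, if ∂ℒ/∂**Y** = **0**, that is to say, (**A****Y****Y**^{T} + **B****Y**^{T} − **Y**(**A****Y** + **B**)^{T})**Y** = **0**. Let **M** = **A****Y** + **B**, then we have **M** = **Y****M**^{T} **Y**. Furthermore,

$$\mathbf{Y}^{T} \mathbf{M} = \mathbf{Y}^{T} \mathbf{Y} \mathbf{M}^{T} \mathbf{Y} = \mathbf{M}^{T} \mathbf{Y}, \qquad \text{so} \qquad \mathbf{M} = \mathbf{Y} \mathbf{M}^{T} \mathbf{Y} = \mathbf{Y} \mathbf{Y}^{T} \mathbf{M}. \tag{21}$$

Taking the transposition of both sides of the equality, we have **M**^{T} = **M**^{T} **Y****Y**^{T}. Then we obtain

$$\mathbf{M} \mathbf{Y}^{T} = \mathbf{Y} \mathbf{M}^{T} \mathbf{Y} \mathbf{Y}^{T} = \mathbf{Y} \mathbf{M}^{T}, \tag{22}$$

which is equal to **M****Y**^{T} − **Y****M**^{T} = **0**. Note that **W** = 2(**M****Y**^{T} − **Y****M**^{T}), so **W** = **0**.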

In summary, **W** = **0** is the first-order optimality condition of our subproblem.

According to this, a natural way to update **Y** is the gradient descent method **Y**^{k+1} ← **Y**^{k} − *τ* **W****Y**^{k}, where *τ* is the step size and **Y**^{k} is the result of **Y** at the *k*-th iteration. However, since we have the orthogonal constraint on **Y**, we cannot use gradient descent directly, because a gradient step may violate the constraint. To overcome this problem, we use a constraint preserving descent method inspired by [26–28]. In more detail, we compute the initial **Y**, denoted as **Y**^{1}, by the following standard spectral clustering objective:

$$\mathbf{Y}^{1} = \arg\min_{\mathbf{Y}^{T} \mathbf{Y} = \mathbf{I}} \; \sum_{i=1}^{m} \mathrm{tr}\left( \mathbf{Y}^{T} \mathbf{L}^{(i)} \mathbf{Y} \right). \tag{23}$$

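Then, we use the following constraint preserving descent formula to compute the new iterate of **Y**:

$$\mathbf{Y}^{k+1} = \mathbf{Y}^{k} - \frac{\tau}{2} \mathbf{W} \left( \mathbf{Y}^{k+1} + \mathbf{Y}^{k} \right). \tag{24}$$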

Note that according to the definition of **W**, we can easily verify that **W**^{T} = −**W**, which means that **W** is a skew-symmetric matrix. For a skew-symmetric matrix **W**, we have the following theorem which gives a closed form solution of Eq (24) that satisfies the orthogonal constraint and updates **Y** in a descent direction. Moreover, due to Lemma 1, it converges to a stationary point.

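**Theorem 2**. *1) Given any skew-symmetric matrix* **W** ∈ ℝ^{n × n} *and* **Y**^{k} ∈ ℝ^{n × c} *which satisfies* **Y**^{kT} **Y**^{k} = **I**, *the closed form solution of matrix* **Y**^{k+1} *defined by* Eq (24) *is*

$$\mathbf{Y}^{k+1} = \left( \mathbf{I} + \frac{\tau}{2} \mathbf{W} \right)^{-1} \left( \mathbf{I} - \frac{\tau}{2} \mathbf{W} \right) \mathbf{Y}^{k}, \tag{25}$$

*and it satisfies that* **Y**^{k+1T} **Y**^{k+1} = **I**.

*2) Set* **W** = 2(**A****Y**^{k} **Y**^{kT} + **B****Y**^{kT} − **Y**^{k}(**A****Y**^{k} + **B**)^{T}), *then*

$$\left. \frac{d \mathcal{J}(\mathbf{Y}^{k+1})}{d \tau} \right|_{\tau = 0} = -\frac{1}{2} \left\| \mathbf{W} \right\|_{F}^{2} \leq 0, \tag{26}$$

*where* 𝒥 *is the objective function in* Eq (17), *and it means that updating* **Y** *is in a descent direction*.

*3) This update formula converges to a stationary point of the subproblem*.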

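*Proof*. 1) Moving all **Y**^{k+1} to the left side, we get (**I** + (*τ*/2)**W**)**Y**^{k+1} = (**I** − (*τ*/2)**W**)**Y**^{k}. Multiplying both sides by the inverse of (**I** + (*τ*/2)**W**), we obtain the closed form solution of **Y**^{k+1}:

$$\mathbf{Y}^{k+1} = \left( \mathbf{I} + \frac{\tau}{2} \mathbf{W} \right)^{-1} \left( \mathbf{I} - \frac{\tau}{2} \mathbf{W} \right) \mathbf{Y}^{k}.$$

Then we calculate **Y**^{k+1T} **Y**^{k+1}

$$\begin{aligned} \mathbf{Y}^{k+1T} \mathbf{Y}^{k+1} &= \mathbf{Y}^{kT} \left( \mathbf{I} - \frac{\tau}{2} \mathbf{W} \right)^{T} \left( \mathbf{I} + \frac{\tau}{2} \mathbf{W} \right)^{-T} \left( \mathbf{I} + \frac{\tau}{2} \mathbf{W} \right)^{-1} \left( \mathbf{I} - \frac{\tau}{2} \mathbf{W} \right) \mathbf{Y}^{k} \\ &= \mathbf{Y}^{kT} \left( \mathbf{I} + \frac{\tau}{2} \mathbf{W} \right) \left( \mathbf{I} - \frac{\tau}{2} \mathbf{W} \right)^{-1} \left( \mathbf{I} + \frac{\tau}{2} \mathbf{W} \right)^{-1} \left( \mathbf{I} - \frac{\tau}{2} \mathbf{W} \right) \mathbf{Y}^{k}, \end{aligned} \tag{27}$$

where the second equality follows from **W**^{T} = −**W** when **W** is a skew-symmetric matrix. Since (**I** + (*τ*/2)**W**) and (**I** − (*τ*/2)**W**) commute, their inverses also commute:

$$\left( \mathbf{I} - \frac{\tau}{2} \mathbf{W} \right)^{-1} \left( \mathbf{I} + \frac{\tau}{2} \mathbf{W} \right)^{-1} = \left( \mathbf{I} + \frac{\tau}{2} \mathbf{W} \right)^{-1} \left( \mathbf{I} - \frac{\tau}{2} \mathbf{W} \right)^{-1}. \tag{28}$$

Taking Eq (28) into Eq (27),

$$\mathbf{Y}^{k+1T} \mathbf{Y}^{k+1} = \mathbf{Y}^{kT} \left( \mathbf{I} + \frac{\tau}{2} \mathbf{W} \right) \left( \mathbf{I} + \frac{\tau}{2} \mathbf{W} \right)^{-1} \left( \mathbf{I} - \frac{\tau}{2} \mathbf{W} \right)^{-1} \left( \mathbf{I} - \frac{\tau}{2} \mathbf{W} \right) \mathbf{Y}^{k} = \mathbf{Y}^{kT} \mathbf{Y}^{k} = \mathbf{I}. \tag{29}$$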

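2) Let 𝒥 denote the objective function in Eq (17). According to the chain rule, we have

$$\frac{d \mathcal{J}(\mathbf{Y}^{k+1})}{d \tau} = \mathrm{tr}\left( \left( \frac{\partial \mathcal{J}}{\partial \mathbf{Y}^{k+1}} \right)^{T} \frac{\partial \mathbf{Y}^{k+1}}{\partial \tau} \right). \tag{30}$$

When *τ* = 0, **Y**^{k+1} = **Y**^{k}, and ∂𝒥/∂**Y**^{k+1} = 2(**A****Y**^{k} + **B**), ∂**Y**^{k+1}/∂*τ* = −**W****Y**^{k}, so

$$\left. \frac{d \mathcal{J}(\mathbf{Y}^{k+1})}{d \tau} \right|_{\tau = 0} = -2 \, \mathrm{tr}\left( (\mathbf{A} \mathbf{Y}^{k} + \mathbf{B})^{T} \mathbf{W} \mathbf{Y}^{k} \right) = -\frac{1}{2} \left\| \mathbf{W} \right\|_{F}^{2}.$$

So the derivative of 𝒥 with respect to *τ* at *τ* = 0 is non-positive, which means that if **Y** moves a small step Δ*τ* > 0 in the update direction, the objective function changes by Δ𝒥 ≈ −(Δ*τ*/2)‖**W**‖_{F}^{2} ≤ 0, so the objective function will decrease. Thus the update direction is a descent direction.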

3) Since the objective function decreases monotonically and is lower bounded by 0, the iteration converges. At convergence the update no longer changes **Y**, i.e., **W** = **0**, which satisfies the first-order optimality condition of our subproblem according to Lemma 1. So the algorithm converges to a stationary point of this subproblem.

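Note that we choose the iteration step size *τ* by a curvilinear search method as was done in [28], which can guarantee the convergence. Therefore, we compute **Y**^{k+1} iteratively using the update formula Eq (25) until the descent process converges. Clearly, the computationally heaviest step in this algorithm is to compute the matrix inverse (**I** + (*τ*/2)**W**)^{−1}, which is *O*(*n*^{3}). Fortunately, we can find a fast way to calculate it. By the definition of **W**, we rewrite (*τ*/2)**W** = **U****G**, where **U** = *τ*[**A****Y**^{k} + **B**, −**Y**^{k}] and **G** = [**Y**^{k}, **A****Y**^{k} + **B**]^{T}, then according to [29] we have

$$\left( \mathbf{I} + \frac{\tau}{2} \mathbf{W} \right)^{-1} = \left( \mathbf{I} + \mathbf{U} \mathbf{G} \right)^{-1} = \mathbf{I} - \mathbf{U} \left( \mathbf{I} + \mathbf{G} \mathbf{U} \right)^{-1} \mathbf{G}. \tag{31}$$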

Since **I** + **G****U** is a 2*c* × 2*c* matrix, where *c* is the dimension of the embedding space and often has *c* ≪ *n*, we can efficiently compute the original inverse by matrix multiplication (*O*(*n*^{2} *c*)) and the inverse of a much smaller matrix (*O*(*c*^{3})).
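A single constraint preserving step with the low-rank inverse can be sketched as below. The sketch assumes the subproblem objective tr(**Y**^{T}**A****Y**) + 2 tr(**Y**^{T}**B**) with a symmetric **A**, uses the factorization (*τ*/2)**W** = **U****G** with **U** = *τ*[**A****Y** + **B**, −**Y**], and takes a fixed step size instead of the curvilinear search of [28]. The function name `cayley_step` is hypothetical.

```python
import numpy as np

def cayley_step(Y, A, B, tau):
    """One orthogonality-preserving update of Y using a 2c x 2c linear solve.

    Assumes Y^T Y = I and A symmetric. With M = A Y + B, the skew-symmetric
    matrix W = 2 (M Y^T - Y M^T) satisfies (tau / 2) W = U G.
    """
    M = A @ Y + B
    U = tau * np.hstack([M, -Y])   # n x 2c
    G = np.vstack([Y.T, M.T])      # 2c x n
    small = np.eye(U.shape[1]) + G @ U
    # Right-hand side (I - U G) Y, formed without building an n x n matrix.
    R = Y - U @ (G @ Y)
    # Woodbury identity: (I + U G)^{-1} R = R - U (I + G U)^{-1} G R.
    return R - U @ np.linalg.solve(small, G @ R)
```

The only linear system solved has size 2*c* × 2*c*, so the cost of a step is dominated by the product **A****Y**, in line with the *O*(*n*^{2}*c*) complexity stated above.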

After getting **Y**, **V**^{(1)}, …,**V**^{(m)}, we use spectral rotation [20] to discretize **Y** to get the final clustering result. Algorithm 2 summarizes the whole algorithm.

**Algorithm 2** Algorithm of DCSC

**Input**: Multi-view data **X**^{(1)}, …,**X**^{(m)}, λ_{1}, λ_{2}.

**Output**: Clustering result **R**.

1: Construct Laplacian matrix **L**^{(i)} for each view.

2: **while** not converge **do**

3: Compute **V**^{(i)}(*i* = 1, 2, …, *m*) by Algorithm 1.

4: Compute **Y** by Constraint Preserving Descent method.

5: **end while**

6: **R** = *discrete*(**Y**).

### Convergence analysis and time complexity

According to Theorem 1 and Theorem 2, the objective function decreases monotonically whether **Y** or **V**^{(i)} is updated. Moreover, the objective function is lower bounded by 0. Thus Algorithm 2 converges. In practice, the algorithm converges quickly, typically within a few tens of iterations.

Since we decrease the time complexity of the matrix inverse from *O*(*n*^{3}) to *O*(*n*^{2}*c* + *c*^{3}), the computationally heaviest step is matrix multiplication. Among all matrix multiplications in our method, the highest time complexity is *O*(*n*^{2}*c*), which arises from multiplying an *n* × *n* matrix by an *n* × *c* matrix. So the time complexity is quadratic in the number of instances.

## Experiments

In this section, we evaluate the effectiveness of DCSC by comparing it with several state-of-the-art multi-view clustering methods on benchmark data sets.

### Data sets

We use 8 data sets in total to evaluate the effectiveness of our method, including the WebKb data set [30], which contains webpages collected from four universities (*Cornell*, *Texas*, *Washington* and *Wisconsin*) and is available at http://membres-liglab.imag.fr/grimal/data.html; the UCI handwritten digit data set [5], which can be found at http://archive.ics.uci.edu/ml/datasets/Multiple+Features; the Advertisements data set [31], which is published at http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements; the Corel image data set [32], which can be found at http://www.cs.virginia.edu/~xj3a/research/CBIR/Download.htm; and the Flower17 data set [33], which is available at http://www.robots.ox.ac.uk/~vgg/data/flowers/17/index.html. Statistics of these data sets are summarized in Table 1. Since the Flower17 data set only contains 7 distance matrices constructed from the 7 views, we do not show the dimension of each view in Table 1.

### Compared methods

To demonstrate the effectiveness of our method, we compare DCSC with the following algorithms:

- **FeaConcat**, which first concatenates the features in all views and then applies spectral clustering on the result.
- **RMKMC** [1], which is a robust k-means based multi-view clustering method.
- **MultiNMF** [2], which is a nonnegative matrix factorization based multi-view clustering method.
- **Co-reg SC** [4], which is a co-regularized multi-view spectral clustering method.
- **RMSC** [5], which is a robust multi-view spectral clustering method with sparse low-rank decomposition.
- **AMGL** [6], which is a parameter-free multi-view spectral clustering method, i.e., it learns an optimal weight for each view automatically without introducing an additive parameter.
- **SwMC** [34], which is a self-weighted multi-view clustering method with multiple graphs.
- **RAMC** [35], which is a robust auto-weighted multi-view clustering method.
- **DCSC-*ω***. To show the effect of the prior weight *ω* in our method, we remove *ω* (or equivalently, we set all *ω*_{ij} to 1) and obtain DCSC-*ω*.

### Experiment setup

The number of clusters is set to the true number of classes for all data sets and all methods. Since the results of most compared algorithms depend on the initialization, we independently repeat the experiments 10 times and report the average results and *t*-test results. In our method, we tune λ_{1} in [10^{−5}, 10^{5}] and λ_{2} in [10^{−4}, 10^{4}] by grid search. Note that λ_{1} absorbs the scaling factor (*n* − 1)^{−2}, which depends on the size of the data set. In our experiments, the size ranges between 100 and 5000, which is relatively narrow, so we tune λ_{1} in the wider range [10^{−5}, 10^{5}]. Other parameter tuning strategies are also possible, for example, letting λ_{1} = λ̃_{1}(*n* − 1)^{−2}, so that λ̃_{1} is independent of *n*, and tuning λ̃_{1} in a predefined range. For the other compared methods, we tune the parameters as suggested in their papers. Three clustering evaluation metrics are adopted to measure the clustering performance: clustering Accuracy (ACC), Normalized Mutual Information (NMI) and clustering Purity.
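Of the three metrics, Purity is the simplest to state in code. The sketch below uses the common definition, in which each cluster is credited with its most frequent true class; the function name `purity` is hypothetical and the label sequences are assumed to have equal length.

```python
from collections import Counter

def purity(y_true, y_pred):
    """Fraction of points that belong to the majority true class of their cluster."""
    members = {}
    for true_label, cluster in zip(y_true, y_pred):
        members.setdefault(cluster, []).append(true_label)
    # For each cluster, count the points carrying its most common true label.
    majority_total = sum(Counter(labels).most_common(1)[0][1]
                         for labels in members.values())
    return majority_total / len(y_true)
```

Purity does not depend on how the cluster indices are named, so a clustering that matches the classes up to a relabeling scores 1.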

### Experimental results

Tables 2–4 show the ACC, NMI and Purity results of all methods on all data sets, respectively. Bold font indicates that the difference is statistically significant (the *p*-value of *t*-test is smaller than 0.05). Note that since Flower17 data set only contains distance matrices instead of the original features, we construct Laplacian matrices from distance matrices and only compare our method with spectral clustering based methods.

In Tables 2–4, on most data sets, the performance of the multi-view spectral clustering methods (Co-reg SC, RMSC, AMGL, SwMC, RAMC and ours) is much better than that of spectral clustering on concatenated features, which demonstrates the effectiveness of multi-view clustering. Moreover, our method outperforms the other compared methods on most data sets, which means that taking the divergence of each view into consideration is indeed helpful to multi-view clustering. In particular, on the Corel data set, which is the most difficult for clustering (it has the most views (7 views) and the most classes (34 classes)), our method achieves 23%, 12%, and 18% improvements over the second best method on ACC, NMI and Purity, respectively. In our method, since we explicitly capture the distinct variances of all views by minimizing the dependency among them, the remainder is a clearer consensus spectral embedding, leading to a better clustering result. Note that DCSC also obtains better results than DCSC-*ω*, which means that considering the similarity between each pair of views can improve the performance of our method.

We show the algorithm convergence on UCI Digit, Advertisements, Corel and Flower17 data set in Fig 2, and the results on other data sets are similar. The example result in Fig 2 shows that our method converges within a small number of iterations, which empirically demonstrates our claims in the previous section.

**Fig 2.** (a) UCI Digit data set. (b) Corel data set. (c) Advertisements data set. (d) Flower17 data set.

### Parameter study

We explore the effect of the parameters on clustering performance. There are two parameters in our method: λ_{1} and λ_{2}. We show the ACC, NMI and Purity on UCI Digit and Corel data sets and the results are similar on other data sets. Fig 3 shows the results, from which we can see that the performance of our method is stable across a wide range of the parameters.

**Fig 3.** (a) ACC on UCI Digit data set. (b) NMI on UCI Digit data set. (c) Purity on UCI Digit data set. (d) ACC on Corel data set. (e) NMI on Corel data set. (f) Purity on Corel data set.

## Conclusion

In this paper, we proposed a novel multi-view spectral clustering method DCSC. We explicitly captured the distinguishing or complementary information in each view and took advantage of the distinct information to learn a better consensus clustering result. To characterize the distinct information effectively, we use HSIC to control the diversity of each view. Since the introduced optimization problem contains several variables, we presented a block coordinate descent algorithm to solve it and proved its convergence. Finally, experiments on benchmark data sets show that the proposed method outperforms the state-of-the-art multi-view clustering methods.

Since the scalability is a serious problem in spectral clustering, in the future, we will study scalability issue with multi-view spectral clustering and apply it to large-scale data sets.

## Acknowledgments

We would like to thank the anonymous reviewers for their helpful comments and suggestions.

## References

- 1. Cai X, Nie F, Huang H. Multi-view k-means clustering on big data. In: Proceedings of the Twenty-Third IJCAI. AAAI Press; 2013. p. 2598–2604.
- 2. Liu J, Wang C, Gao J, Han J. Multi-view clustering via joint nonnegative matrix factorization. In: Proc. of SDM. vol. 13. SIAM; 2013. p. 252–260.
- 3. Kumar A, Daumé H. A co-training approach for multi-view spectral clustering. In: ICML; 2011. p. 393–400.
- 4. Kumar A, Rai P, Daumé H. Co-regularized multi-view spectral clustering. In: Advances in NIPS; 2011. p. 1413–1421.
- 5. Xia R, Pan Y, Du L, Yin J. Robust multi-view spectral clustering via low-rank and sparse decomposition. In: AAAI; 2014. p. 2149–2155.
- 6. Nie F, Li J, Li X. Parameter-Free Auto-Weighted Multiple Graph Learning: A Framework for Multiview Clustering and Semi-Supervised Classification. International Joint Conferences on Artificial Intelligence; 2016.
- 7. Nie F, Cai G, Li J, Li X. Auto-Weighted Multi-View Learning for Image Clustering and Semi-Supervised Classification. IEEE Transactions on Image Processing. 2018;27(3):1501–1511.
- 8. Zhang Z, Zhai Z, Li L. Uniform Projection for Multi-View Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2017;39(8):1675–1689.
- 9. Tang C, Chen J, Liu X, Li M, Wang P, Wang M, et al. Consensus learning guided multi-view unsupervised feature selection. Knowledge-Based Systems. 2018;160:49–60.
- 10. Cao X, Zhang C, Fu H, Liu S, Zhang H. Diversity-Induced Multi-View Subspace Clustering; 2015.
- 11. Günnemann S, Färber I, Seidl T. Multi-view clustering using mixture models in subspace projections. In: The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’12, Beijing, China, August 12-16, 2012; 2012. p. 132–140.
- 12. Niu D, Dy JG, Jordan MI. Multiple Non-Redundant Spectral Clustering Views. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel; 2010. p. 831–838.
- 13. Xu C, Tao D, Xu C. A survey on multi-view learning. arXiv preprint arXiv:1304.5634. 2013.
- 14. Zhao J, Xie X, Xu X, Sun S. Multi-view learning overview: Recent progress and new challenges. Information Fusion. 2017;38:43–54.
- 15. Blum A, Mitchell T. Combining labeled and unlabeled data with co-training. In: Proceedings of the eleventh annual conference on Computational learning theory. ACM; 1998. p. 92–100.
- 16. Wang W, Zhou ZH. Analyzing co-training style algorithms. In: Machine Learning: ECML 2007. Springer; 2007. p. 454–465.
- 17. Ng AY, Jordan MI, Weiss Y. On Spectral Clustering: Analysis and an algorithm. In: Dietterich TG, Becker S, Ghahramani Z, editors. Advances in Neural Information Processing Systems 14. MIT Press; 2002. p. 849–856.
- 18. Liu J, Jiang Y, Li Z, Zhou ZH, Lu H. Partially shared latent factor learning with multiview data. IEEE Trans Neural Netw Learning Syst. 2015;26(6):1233–1246.
- 19. Shi J, Malik J. Normalized cuts and image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on. 2000;22(8):888–905.
- 20. Yu SX, Shi J. Multiclass spectral clustering. In: Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on. IEEE; 2003. p. 313–319.
- 21. Schölkopf B, Smola A. Learning with kernels. MIT Press, Cambridge; 2002.
- 22. Gretton A, Bousquet O, Smola A, Schölkopf B. Measuring statistical dependence with Hilbert-Schmidt norms. In: Proceedings Algorithmic Learning Theory. Springer-Verlag; 2005. p. 63–77.
- 23. Tang J, Hu X, Gao H, Liu H. Exploiting local and global social context for recommendation. In: Proceedings of the Twenty-Third IJCAI. AAAI Press; 2013. p. 2712–2718.
- 24. Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences. 2009;2(1):183–202.
- 25. Ji S, Ye J. An accelerated gradient method for trace norm minimization. In: Proceedings of the 26th annual international conference on machine learning. ACM; 2009. p. 457–464.
- 26. Goldfarb D, Wen Z, Yin W. A curvilinear search method for p-harmonic flows on spheres. SIAM Journal on Imaging Sciences. 2009;2(1):84–109.
- 27. Vese LA, Osher SJ. Numerical methods for p-harmonic flows and applications to image processing. SIAM Journal on Numerical Analysis. 2002;40(6):2085–2104.
- 28. Wen Z, Yin W. A feasible method for optimization with orthogonality constraints. Mathematical Programming. 2013;142(1-2):397–434.
- 29. Petersen KB, Pedersen MS. The matrix cookbook. Technical University of Denmark. 2008;7:15.
- 30. Li SY, Yuan J, Zhou ZH. Partial multi-view clustering. In: AAAI; 2014.
- 31. Kushmerick N. Learning to remove internet advertisements. In: Proceedings of the third annual conference on Autonomous Agents. ACM; 1999. p. 175–181.
- 32. French JC, Watson JV, Jin X, Martin W. Integrating multiple multi-channel CBIR systems. In: Proc. Inter. Workshop on Multimedia Information Systems (MIS). Citeseer; 2003.
- 33. Nilsback ME, Zisserman A. Automated Flower Classification over a Large Number of Classes. In: Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing; 2008.
- 34. Nie F, Li J, Li X. Self-weighted multiview clustering with multiple graphs. In: Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence; 2017. p. 2564–2570.
- 35. Ren P, Xiao Y, Xu P, Guo J, Chen X, Wang X, et al. Robust Auto-Weighted Multi-View Clustering. In: IJCAI; 2018. p. 2644–2650.