A Bayesian Alternative to Mutual Information for the Hierarchical Clustering of Dependent Random Variables

The use of mutual information as a similarity measure in agglomerative hierarchical clustering (AHC) raises an important issue: some correction needs to be applied for the dimensionality of variables. In this work, we formulate the decision of merging dependent multivariate normal variables in an AHC procedure as a Bayesian model comparison. We found that the Bayesian formulation naturally shrinks the empirical covariance matrix towards a matrix set a priori (e.g., the identity), provides an automated stopping rule, and corrects for dimensionality using a term that scales up the measure as a function of the dimensionality of the variables. Also, the resulting log Bayes factor is asymptotically proportional to the plug-in estimate of mutual information, with an additive correction for dimensionality in agreement with the Bayesian information criterion. We investigated the behavior of these Bayesian alternatives to mutual information (in exact and asymptotic forms) on simulated and real data. An encouraging result was first obtained on simulations: the hierarchical clustering based on the log Bayes factor outperformed off-the-shelf clustering techniques as well as raw and normalized mutual information in terms of classification accuracy. On a toy example, we found that the Bayesian approaches led to results that were similar to those of mutual information clustering techniques, with the advantage of an automated thresholding. On real functional magnetic resonance imaging (fMRI) datasets measuring brain activity, the Bayesian measures identified clusters consistent with the established outcome of standard procedures. On this application, normalized mutual information had a highly atypical behavior, in the sense that it systematically favored very large clusters. These initial experiments suggest that the proposed Bayesian alternatives to mutual information are a useful new tool for hierarchical clustering.


Introduction
Cluster analysis aims at uncovering natural groups of objects in a multivariate dataset (see Jain (2010) for a review). Among the vast variety of methods used in cluster analysis, agglomerative hierarchical clustering (AHC) is a generic procedure that sequentially merges the pair of clusters that is most similar according to an arbitrary function called the similarity measure, thereby generating a nested set of partitions, also called a hierarchy (Duda et al., 2000). The choice of the similarity measure indirectly defines the shape of the clusters and thus plays a critical role in the clustering process. While this choice is guided by the features of the problem at hand, it is also often restricted to a limited number of commonly used measures, such as the Euclidean distance or the Pearson correlation coefficient (D'haeseleer, 2005). In the present work, we focus on the clustering of random variables based on their mutual information, which has recently gained popularity in cluster analysis, notably in the field of genomics (see, e.g., Butte and Kohane, 2000; Zhou et al., 2004; Dawy et al., 2006; Priness et al., 2007) and in functional magnetic resonance imaging (fMRI) data analysis (Stausberg et al., 2009; Benjaminsson et al., 2010; Kolchinsky et al., 2014). Mutual information is a general measure of statistical dependency derived from information theory (Shannon, 1948; Kullback, 1968; Cover and Thomas, 1991). A key feature of mutual information is its ability to capture nonlinear interactions for any type of random variables (Steuer et al., 2002); also of interest, it applies indifferently to univariate or multivariate variables and can thus be used with clusters of arbitrary size. Yet, mutual information is an extensive measure that increases with variable dimensionality. In addition, Î, the finite-sample estimator of mutual information, suffers from a dimensionality-dependent bias (see Appendix A). Several authors have proposed to correct mutual information for dimensionality (Li et al., 2001; Kraskov et al., 2005; Kraskov and Grassberger, 2009) by using a "normalized" version of mutual information, Î_norm(X_i, X_j) = Î(X_i, X_j) / (D_i + D_j), where D_i and D_j are the dimensions of X_i and X_j, respectively. In the clustering literature, normalized mutual information is routinely used. However, the impact of such a correction procedure has not been extensively evaluated so far.
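The dimensionality-dependent bias can be checked numerically. The following sketch (our own illustration; all function and variable names are ours, not from the original study) estimates the plug-in Gaussian mutual information between two independent blocks of variables and shows that its average value grows with the block dimensions, even though the true mutual information is zero in both cases:

```python
import numpy as np

rng = np.random.default_rng(0)

def mi_plugin(X, d_i):
    """Plug-in Gaussian mutual information between the first d_i columns
    of X and the remaining columns, from the sample covariance matrix."""
    C = np.cov(X, rowvar=False)
    ld = lambda M: np.linalg.slogdet(M)[1]
    return 0.5 * (ld(C[:d_i, :d_i]) + ld(C[d_i:, d_i:]) - ld(C))

# All variables are independent: the true mutual information is 0 in both
# cases, yet the estimate grows with the dimensions of the two blocks.
N = 100
bias_1x1 = np.mean([mi_plugin(rng.standard_normal((N, 2)), 1) for _ in range(50)])
bias_5x5 = np.mean([mi_plugin(rng.standard_normal((N, 10)), 5) for _ in range(50)])
```

The average estimate for two 5-dimensional blocks is roughly D_i D_j / (2N), an order of magnitude larger than in the univariate case, which is the bias discussed in Appendix A.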
In the present paper, we consider Bayesian model-based clustering (Scott and Symons, 1971; Binder, 1981; Heller and Ghahramani, 2005; Jain, 2010) as an alternative to mutual information for the hierarchical clustering of dependent multivariate normal variables. Specifically, we derive a similarity measure by comparing two models: M_I, where X_i and X_j are independent (i.e., the covariance between any element of X_i and any element of X_j is equal to zero), against M_D, where the covariance between X_i and X_j can be set to any admissible value. The proposed similarity measure is a log Bayes factor, i.e., the log of the ratio of the marginal model likelihoods,

s(X_i, X_j) = ln [ p(S_{i∪j} | M_D) / p(S_{i∪j} | M_I) ],

where S is the sample sum-of-square matrix and S_{i∪j} its restriction to (X_i, X_j). With appropriate priors on the parameters of the model, we show that s(X_i, X_j) can be expressed in closed form in terms of constants a_i, a_j, a_{i∪j}, and b defined below and of a regularized version of the empirical sum-of-square matrix. As will be shown below, the Bayesian formulation naturally (1) allows for clustering even when the sample covariance matrix is ill-defined; (2) provides an automated stopping rule, namely to stop when s(X_i, X_j) ≤ 0 for every pair of remaining clusters; (3) corrects for dimensionality using a term that scales up the measure as a function of the dimensionality of the variables; and (4) provides a local and global measure of similarity, in that it can be used to decide which pair of variables to cluster at each step (local level) as well as to compare different levels of the resulting hierarchy (global level). The correction for dimensionality is similar to a Bayesian information criterion (BIC) (Schwarz, 1978), since, asymptotically (i.e., when the number of samples N → ∞), we obtain

s(X_i, X_j) = N Î(X_i, X_j) − (D_i D_j / 2) ln N + O(1),

where O(1) stands for a function of N that is bounded for N large enough. In this sense, the present paper makes an explicit connection between Bayesian model comparison for the clustering of dependent random variables and mutual information.
We evaluated an AHC procedure based on this approach with synthetic datasets. This experiment aimed to evaluate how it behaved, under both its exact and asymptotic forms, compared to other approaches, including normalized mutual information. We finally tested the new measures on two real datasets: a toy dataset and functional magnetic resonance imaging (fMRI) data.

Analysis
In the following, we detail a Bayesian solution to the problem of clustering detailed above. In Section 2.1, we introduce the model together with the Bayesian framework and a general expression for the similarity measure. We then derive a closed-form expression for the quantities of interest, namely the marginal model likelihoods under the assumption of dependence (Section 2.2) and independence (Section 2.3). Sections 2.4 and 2.5 provide exact and asymptotic expressions, respectively, for the similarity measure. Finally, in Section 2.6, we examine how the same framework can be conveniently used to compare nested partitions, that is, different levels of a hierarchy.

Bayesian model comparison
Let X be a D-dimensional multivariate normal variable with known mean and unknown covariance matrix Σ. Define X_i and X_j as two disjoint subvectors of X (of size D_i and D_j, respectively), and X_{i∪j} as their union (of size D_{i∪j} = D_i + D_j). Assume that we have to decide whether we should cluster X_i and X_j based on their level of dependence. To this end, consider two competing models M_I and M_D. In M_I, X_i and X_j are independent and the distribution of X_{i∪j} can be decomposed as the product of the marginal distributions of X_i and X_j. Under such a condition, Σ_{i∪j}, the restriction of Σ to X_{i∪j}, is block diagonal with blocks Σ_i and Σ_j, the restrictions of Σ to X_i and X_j, respectively. In M_D, by contrast, X_i and X_j are dependent. Given a dataset of N independent and identically distributed (i.i.d.) realizations of X and S the corresponding sample sum-of-square matrix, we propose to quantify the similarity between X_i and X_j as the log ratio of the marginal model likelihoods,

s(X_i, X_j) = ln [ p(S_{i∪j} | M_D) / p(S_{i∪j} | M_I) ].    (1)

Each marginal model likelihood can be expressed as an integral over the model parameters, as described below.

Marginal model likelihood under the hypothesis of dependence
Let us first calculate P(S_{i∪j}|M_D), the marginal model likelihood under the hypothesis of dependence. Expressing this quantity as a function of the model parameters yields

p(S_{i∪j} | M_D) = ∫ p(S_{i∪j} | M_D, Σ_{i∪j}) p(Σ_{i∪j} | M_D) dΣ_{i∪j},    (2)

where the likelihood p(S_{i∪j} | M_D, Σ_{i∪j}) is given by Equation (3), with Z(d, n) denoting the inverse of its normalization constant. As to the prior distribution, it is here set as a conjugate prior, namely an inverse-Wishart distribution with ν_{i∪j} degrees of freedom and inverse scale matrix Λ_{i∪j} (Gelman et al., 2004, §3.6), given in Equation (4). Bringing Equations (3) and (4) together into Equation (2) yields the marginal model likelihood. The integrand is proportional to an inverse-Wishart distribution with N + ν_{i∪j} degrees of freedom and scale matrix Λ_{i∪j} + S_{i∪j}, leading to the closed form of Equation (5).

Marginal model likelihood under the hypothesis of independence
We can now calculate P(S_{i∪j}|M_I), the marginal model likelihood under the hypothesis of independence. If M_I holds, then Σ_{i∪j} is block diagonal with two blocks Σ_i and Σ_j, the submatrix restrictions of Σ_{i∪j} to X_i and X_j, respectively. Introduction of the model parameters therefore yields the integral form of Equation (6) for the marginal likelihood. To calculate this integral, we again need the likelihood p(S_{i∪j} | M_I, Σ_i, Σ_j) and the prior distribution p(Σ_i, Σ_j | M_I) of the two blocks of the covariance matrix under M_I. The likelihood is the same as for M_D and has the form of Equation (3). Furthermore, since Σ_{i∪j} is here block diagonal, the likelihood factorizes into a term involving S_i and a term involving S_j, where S_i and S_j are the restrictions of S to X_i and X_j, respectively; it can therefore be further expanded as in Equation (7). As to the prior distribution, assuming no prior dependence between Σ_i and Σ_j, we can set p(Σ_i, Σ_j | M_I) = p(Σ_i | M_I) p(Σ_j | M_I). For the sake of consistency, p(Σ_i | M_I) and p(Σ_j | M_I) are set equal to p(Σ_i | M_D) and p(Σ_j | M_D), respectively, which are in turn obtained by marginalization of p(Σ_{i∪j} | M_D) as given by Equation (4). For k ∈ {i, j}, p(Σ_k | M_I) can be found to have an inverse-Wishart distribution with ν_k = ν_{i∪j} − D_{i∪j} + D_k degrees of freedom and inverse scale matrix Λ_k, the restriction of Λ_{i∪j} to X_k (Press, 2005, §5.2). Incorporating Equations (7), (8), and (9) into Equation (6) yields a product of integrals, each integrand being proportional to an inverse-Wishart distribution with N + ν_k degrees of freedom and scale matrix Λ_k + S_k, leading to Equation (10).

Log Bayes factor of dependence versus independence
Let us now express the Bayesian similarity measure by incorporating Equations (5) and (10) into Equation (1). This yields a closed-form expression for s(X_i, X_j) in which each term ∆φ_k quantifies, up to a constant that cancels out in s(X_i, X_j), the amount by which the data support a model of multivariate normal distributions with unrestricted covariance matrix for X_k.
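Under these conjugate priors, both marginal likelihoods, and hence s(X_i, X_j), can be evaluated in a few lines of code. The following sketch is our own illustration, assuming zero-mean data; it relies on the standard closed form of the normal–inverse-Wishart marginal likelihood rather than the paper's exact notation, and all function names are ours:

```python
import numpy as np
from scipy.special import multigammaln

def log_marginal(S, N, nu, Lam):
    """ln p(S): marginal likelihood of N zero-mean normal samples with
    sum-of-square matrix S, under an inverse-Wishart(nu, Lam) prior on
    the covariance matrix (standard conjugate closed form)."""
    D = S.shape[0]
    return (-0.5 * N * D * np.log(np.pi)
            + multigammaln(0.5 * (N + nu), D) - multigammaln(0.5 * nu, D)
            + 0.5 * nu * np.linalg.slogdet(Lam)[1]
            - 0.5 * (N + nu) * np.linalg.slogdet(Lam + S)[1])

def log_bayes_factor(S, idx_i, idx_j, N, nu, Lam):
    """s(X_i, X_j): log Bayes factor of dependence versus independence."""
    idx = np.concatenate([idx_i, idx_j])
    s_dep = log_marginal(S[np.ix_(idx, idx)], N, nu, Lam[np.ix_(idx, idx)])
    s_ind = 0.0
    for k in (idx_i, idx_j):
        nu_k = nu - len(idx) + len(k)  # marginalized degrees of freedom
        s_ind += log_marginal(S[np.ix_(k, k)], N, nu_k, Lam[np.ix_(k, k)])
    return s_dep - s_ind
```

On data where the two blocks are strongly dependent this measure is large and positive, while for independent blocks it is typically negative, which is the basis of the stopping rule described below.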

Asymptotic form of the log Bayes factor
We can now provide an asymptotic expression for s(X_i, X_j). Assume that N → ∞. Starting from the expression of φ in Equation (15), and defining Σ̂_k as the standard sample covariance matrix, i.e., S_k = N Σ̂_k, the first term of the right-hand side can be expanded using the fact that |aA| = a^dim(A) |A| for any positive number a and matrix A. In the expression of φ in Equation (16), each term in the sum can be approximated using Stirling's approximation (Abramowitz and Stegun, 1972, p. 257).
Summing this expression over d = 1, …, D_k yields the approximation of Equation (17) for φ; the Taylor expansion of ∆φ_k, as defined in Equation (14), is also given by Equation (17). Finally, the approximation for s(X_i, X_j) of Equation (1) is given by

s(X_i, X_j) = N Î(X_i, X_j) − (D_i D_j / 2) ln N + O(1),    (18)

where we used the fact that D_{i∪j} = D_i + D_j, and where

Î(X_i, X_j) = (1/2) ln [ |Σ̂_i| |Σ̂_j| / |Σ̂_{i∪j}| ]

is the plug-in estimator of mutual information for a multivariate normal distribution. Alternatively, N Î(X_i, X_j) can be seen as the minimum discrimination information for the independence of X_i and X_j (Kullback, 1968, Chap. 12, §3.6).
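The asymptotic form lends itself to a direct implementation. The sketch below (our own illustration; function names are ours) computes the plug-in Gaussian mutual information from a covariance matrix and applies the BIC-like penalty:

```python
import numpy as np

def gaussian_mi(Sigma, idx_i, idx_j):
    """Plug-in mutual information for a multivariate normal variable,
    computed from (an estimate of) its covariance matrix."""
    ld = lambda ix: np.linalg.slogdet(Sigma[np.ix_(ix, ix)])[1]
    return 0.5 * (ld(idx_i) + ld(idx_j) - ld(list(idx_i) + list(idx_j)))

def bic_similarity(Sigma_hat, idx_i, idx_j, N):
    """BIC-like asymptotic log Bayes factor:
    N * I_hat - (D_i * D_j / 2) * ln(N)."""
    penalty = 0.5 * len(idx_i) * len(idx_j) * np.log(N)
    return N * gaussian_mi(Sigma_hat, idx_i, idx_j) - penalty
```

For two univariate variables with correlation 0.9 and N = 100, the mutual information term dominates the penalty, whereas for uncorrelated variables the penalty makes the measure negative.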

Comparing distinct levels of the hierarchy
The last development aims at providing a way to compare nested partitions, i.e., different levels of the hierarchy. Once the hierarchical clustering has been performed, each step is associated with a partition of X. Assume that, at level l, the partition reads {X_1, …, X_K} and that, at step l + 1, X_i and X_j are clustered. Denote by C_l the assumption that the partition at level l is the correct partition of X. The global improvement brought by the clustering of X_i and X_j between steps l and l + 1 can be quantified by the log ratio of the marginal model likelihoods, ln [ p(S|C_{l+1}) / p(S|C_l) ], where both quantities can be computed in a manner similar to what was done for the similarity measure. For instance, if C_l is true, then Σ is block diagonal with K blocks Σ_k, the submatrix restrictions of Σ to the X_k. Introducing the model parameters then yields the integral form of Equation (19). In a way similar to what was done previously, the likelihood p(S|C_l, Σ_1, …, Σ_K) can be expanded as in Equation (20). Turning to the prior distribution p(Σ_1, …, Σ_K|C_l) and assuming no prior dependence between the Σ_k's, we can set p(Σ_1, …, Σ_K|C_l) = ∏_k p(Σ_k|C_l). Each p(Σ_k|C_l) can be obtained by marginalization of p(Σ|C_l), which is here taken as an inverse-Wishart distribution with ν degrees of freedom and inverse scale matrix the D-by-D diagonal matrix Λ. Note that such a prior on Σ is compatible with the prior used earlier for Σ_{i∪j} if one sets ν_{i∪j} = ν − D + D_{i∪j} and if Λ_{i∪j} is the restriction of Λ to X_{i∪j} (Press, 2005, §5.2). Incorporating Equations (20), (21), and (22) into Equation (19) and integrating leads to Equation (23). The same calculation can be performed for p(S|C_{l+1}). The result is the same as in Equation (23), except that the product is composed of K − 1 terms. Of these terms, K − 2 correspond to clusters that are unchanged from C_l to C_{l+1} and, as a consequence, are identical to those of Equation (23). The (K − 1)th corresponds to the cluster obtained by merging X_i and X_j. As a consequence, the log Bayes factor reduces to a ratio of terms involving only X_i, X_j, and X_{i∪j}. But this quantity is nothing else than s(X_i, X_j). In other words, we proved that

ln [ p(S|C_{l+1}) / p(S|C_l) ] = s(X_i, X_j),    (24)

i.e., s(X_i, X_j), the local measure of similarity between X_i and X_j, can be used to compute the global measure of relative probability between two successive levels of the hierarchy.
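The equality between the local similarity and the comparison of hierarchy levels can be checked numerically. The sketch below is our own illustration; it reuses the standard normal–inverse-Wishart marginal likelihood with ν_k = ν − D + D_k degrees of freedom per cluster, and then verifies that the gain in log marginal likelihood between two nested partitions coincides with the pairwise measure:

```python
import numpy as np
from scipy.special import multigammaln

def log_marginal(S, N, nu, Lam):
    """Marginal likelihood of N zero-mean normal samples with sum-of-square
    matrix S under an inverse-Wishart(nu, Lam) prior (standard closed form)."""
    D = S.shape[0]
    return (-0.5 * N * D * np.log(np.pi)
            + multigammaln(0.5 * (N + nu), D) - multigammaln(0.5 * nu, D)
            + 0.5 * nu * np.linalg.slogdet(Lam)[1]
            - 0.5 * (N + nu) * np.linalg.slogdet(Lam + S)[1])

def log_partition_marginal(S, N, nu, Lam, partition):
    """ln p(S | C_l): sum over independent blocks, one inverse-Wishart
    prior per cluster with nu_k = nu - D + D_k degrees of freedom."""
    D = S.shape[0]
    total = 0.0
    for k in partition:
        k = list(k)
        total += log_marginal(S[np.ix_(k, k)], N, nu - D + len(k),
                              Lam[np.ix_(k, k)])
    return total
```

Because the unchanged clusters contribute the same terms to both levels, the difference of the two global quantities telescopes down to the local log Bayes factor of the merged pair.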

Setting the hyperparameters
Hyperparameter selection is often a thorny issue in Bayesian analysis. We here considered two approaches. The first approach (coined BayesCov) is to set the number of degrees of freedom to the lowest value that still corresponds to a well-defined distribution, that is, ν = D, and to use the diagonal scale matrix that optimizes the marginal model likelihood of Equation (23) before any clustering (that is, with K = D clusters and D_k = 1 for all k), as derived in Appendix B. An alternative approach (coined BayesCorr) is to work with the sample correlation matrix instead of the sample covariance matrix. One can then set the number of degrees of freedom to ν = D + 1 and the scale matrix to the identity matrix. The corresponding prior distribution yields uniform marginal distributions for the correlation coefficients (Barnard et al., 2000). Note that clustering with the asymptotic form of Equation (18) (coined Bic) does not involve hyperparameters.

Automatic stopping rule
An advantage of the Bayesian clustering scheme proposed here and of its BIC approximation is that they naturally come with an automatic stopping rule. By definition of s in Equation (1), the fact that s(X_i0, X_j0) > 0 for the pair selected for clustering means that the marginal model likelihood for M_D is larger than that for M_I. As such, X_i0 and X_j0 are more likely to belong to the same cluster than not, and it therefore makes sense to cluster them. By contrast, if s(X_i0, X_j0) < 0 for the same pair, the marginal model likelihood for M_D is smaller than that for M_I, and X_i0 and X_j0 are more likely to belong to different clusters. If, at a given step of the clustering, the pair with the highest similarity measure has a negative similarity measure, then all pairs do, meaning that all pairs of clusters tested more probably belong to different clusters. It therefore makes sense to stop the clustering procedure at that point. An automatic stopping rule can thus simply be implemented in the clustering algorithm: stop the clustering if the pair (X_i0, X_j0) selected for clustering at a given step has s(X_i0, X_j0) < 0. Note that, according to Equation (24), the resulting clustering is the one in the hierarchy that has the largest marginal likelihood. We will refer to BayesCovAuto, BayesCorrAuto, and BicAuto when applying the clustering schemes with this automated stopping rule.
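The resulting procedure reduces to a plain greedy loop. The following sketch (our own illustration; the similarity function is passed as an argument, so any of the measures above can be plugged in) implements AHC with the automatic stopping rule:

```python
import numpy as np

def bayes_ahc(similarity, D):
    """Agglomerative hierarchical clustering with the automatic stopping
    rule: repeatedly merge the most similar pair of clusters, and stop as
    soon as the best pair has a negative similarity."""
    clusters = [[d] for d in range(D)]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = similarity(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < 0:  # every remaining pair favors independence: stop
            break
        a, b = pair
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters
```

With a BIC-like similarity computed from a covariance matrix whose first two variables are strongly correlated and whose third is independent, the loop merges the first two variables and then stops, returning the two-cluster partition.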

Validation on synthetic data
To assess the behavior of the method expounded here, we examined how it fared on synthetic data. We used the two variants of the Bayes factor mentioned above (BayesCov and BayesCorr), Bic, as well as the same methods with the automatic stopping rule (BayesCovAuto, BayesCorrAuto, and BicAuto). As a means of comparison, we also used the following methods:
• A random hierarchical clustering scheme, where variables were clustered uniformly at random at each step. This category contains only one algorithm: Random, which was implemented for the purpose of the present study.
• Hierarchical clustering schemes with similarity measures given by either the Pearson correlation or the absolute Pearson correlation, and a merging rule based on single, average, or complete linkage, or on Ward's criterion. This category contains 8 algorithms: Single, Average, Complete, Ward, SingleAbs, AverageAbs, CompleteAbs, WardAbs. We used the implementations of these methods proposed in NIAK.
• Hierarchical clustering with a similarity measure given by mutual information, with and without normalization. This category contains 2 algorithms: Infomut and InfomutNorm. These methods were implemented for the purpose of the present study.
• An approach based on the graphical lasso, which estimates a sparse precision matrix under a penalty controlled by a regularization parameter λ. Since the graphical lasso is not invariant under transformation of a covariance matrix into a correlation matrix, we used either the covariance matrix or the correlation matrix as input. Note that this approach automatically determines the number of clusters. Also, for λ = 0 (unconstrained case), the algorithm cannot run on small samples. This category contains 34 algorithms: 16 algorithms gLassoCov-x and 16 algorithms gLassoCorr-x, where x is the value of λ, and 2 algorithms gLassoCovOpt and gLassoCorrOpt. For this category of algorithms, we used a freely available package already used in Smith et al. (2011).
• An approach based on spectral clustering (von Luxburg, 2006) of either the original value or the absolute value of either the correlation or the partial correlation matrix. This approach requires the number of clusters as input. Since this clustering includes a k-means step, which is stochastic by nature, we considered 2 variants: one with a single run of k-means, and one with 10 repetitions of k-means followed by selection of the best clustering in terms of inertia. The similarity measures were defined so that their range would be the same (namely [0, 1]) when using the signed or the absolute value of correlation: 0.5(1 + r) and |r|, respectively. This category contains 8 algorithms: 2 algorithms SpecCorr-x, 2 algorithms SpecCorrAbs-x, 2 algorithms SpecCorrpar-x, and 2 algorithms SpecCorrparAbs-x, where x is the number of times that k-means is performed. The spectral clustering algorithms of this category were coded for the purpose of the present study, while we used the implementation of the k-means algorithm proposed in NIAK.
All in all, 59 variants were tested.

Data description
In order to assess the performance of the Bayesian approach, we performed the following set of simulations. For each value of D in {6, 10, 20, 40}, we considered partitions with increasing numbers of clusters C. For a given value of C, we performed 500 simulations. For each simulation, the D variables were randomly partitioned into C clusters, all partitions having equal probability of occurrence (Nijenhuis and Wilf, 1978, chap. 12; Wilf, 1999). For a given partition {X_1, …, X_K} of X, we generated data according to Equation (25), where all f_k's were taken either as multivariate normal distributions (parameters: mean µ_k and covariance matrix Σ_k) or as multivariate Student-t distributions (parameters: degrees of freedom ν, location parameter µ_k, and scale matrix Σ_k). In both cases, the µ_k's were set to 0 and the Σ_k's were first sampled according to a Wishart distribution with D_k + 1 degrees of freedom and scale matrix the identity matrix, and then rescaled to a correlation matrix. This sampling scheme on Σ_k generates correlation matrices with uniform marginal distributions for all correlation coefficients (Barnard et al., 2000). For the multivariate Student-t distributions, ν was set in {1, 3, 5}. Equation (25) was used to generate synthetic datasets of length N varying from 10 to 300 in increments of 40. Each dataset was summarized by its sample correlation matrix, and hierarchical clustering was performed using the methods mentioned above.
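The sampling scheme for the Σ_k's can be sketched as follows (our own illustration, using scipy's Wishart sampler; the function name is ours):

```python
import numpy as np
from scipy.stats import wishart

def random_correlation(D, rng):
    """Sample a covariance matrix from a Wishart distribution with D + 1
    degrees of freedom and identity scale matrix, then rescale it to a
    correlation matrix; the off-diagonal entries then have uniform
    marginal distributions on [-1, 1] (Barnard et al., 2000)."""
    Sigma = wishart.rvs(df=D + 1, scale=np.eye(D), random_state=rng)
    d = np.sqrt(np.diag(Sigma))
    return Sigma / np.outer(d, d)
```

The returned matrix is a valid correlation matrix (unit diagonal, symmetric, positive semidefinite), suitable as the Σ_k parameter of either the normal or the Student-t generator.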
To assess the efficiency of the various methods, we thresholded each clustering at the true number of clusters, except for BayesCovAuto, BayesCorrAuto, BicAuto, and gLasso, for which we used the clustering obtained by the method. We then quantified the quality of the resulting partition using the proportion of correct classifications as well as the adjusted Rand index, which computes the fraction of variable pairs that are correctly clustered, corrected for chance (Rand, 1971; Hubert and Arabie, 1985). Results corresponding to a given dimension D and a given method were then pooled across numbers of clusters C, lengths N, and distributions (multivariate normal and Student-t). We ranked the methods from best to worst based on these global results using the following indices (in this order): median adjusted Rand index, 25% percentile of the adjusted Rand index, 5% percentile of the adjusted Rand index, smallest value of the adjusted Rand index, and proportion of correct classifications. Note that some algorithms (Infomut, InfomutNorm, and SpecCorrpar) require the sample covariance matrix to be positive definite. As a consequence, these algorithms could not run on small samples. We therefore restricted our evaluation to cases where all algorithms were operational.
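For reference, the adjusted Rand index can be computed from the contingency table of the two partitions. The following sketch is our own implementation of the Hubert–Arabie formula:

```python
import numpy as np

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index (Hubert and Arabie, 1985) between two partitions
    given as label vectors: pair-counting agreement corrected for chance."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    _, ia = np.unique(a, return_inverse=True)
    _, ib = np.unique(b, return_inverse=True)
    # contingency table of the two partitions
    n = np.zeros((ia.max() + 1, ib.max() + 1))
    for x, y in zip(ia, ib):
        n[x, y] += 1
    comb2 = lambda m: m * (m - 1) / 2.0
    sum_ij = comb2(n).sum()             # pairs together in both partitions
    sum_a = comb2(n.sum(axis=1)).sum()  # pairs together in partition a
    sum_b = comb2(n.sum(axis=0)).sum()  # pairs together in partition b
    expected = sum_a * sum_b / comb2(len(a))
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)
```

The index equals 1 for identical partitions (up to relabeling) and has expected value 0 for independent random partitions, which is what makes it suitable for pooling results across different numbers of clusters.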
All simulations were implemented using the Pipeline System for Octave and Matlab (PSOM) (Bellec et al., 2012) under Matlab 7.2 (The MathWorks, Inc.) and run on a 24-core server.

Results
The results corresponding to the 20 best methods are summarized in Figures 1 and 2. Globally, and as expected, all methods were adversely affected by an increase in the number of variables. In all cases, the variants proposed in the present paper performed very well compared to other methods. BayesCov and BayesCorr were always ranked as the two best algorithms, and Bic was never outperformed by a previously published method. The methods with automatic thresholding of the hierarchy performed surprisingly well, considering that they were compared against methods with an oracle. In particular, they clearly outperformed all variants of gLasso, the only other method that was able to automatically determine the number of clusters. Of note, all variants of gLasso proved too slow to be applied to our simulation data for D ∈ {20, 40}.

Toy example
This data set was first used in Roverato (1999) and later re-analyzed in Marrelec and Benali (2006) in the context of conditional independence graphs. It originates from a study investigating early diagnosis of HIV infection in children from HIV-positive mothers. The variables are related to various measures on blood and its components: X_1 and X_2, immunoglobulin G and A, respectively; X_4, the platelet count; X_3 and X_5, lymphocyte B and T4, respectively; and X_6, the T4/T8 lymphocyte ratio. The sample summary statistics are given in Table 1. Roverato (1999) found that the correlations between X_4 and any other variable had strong probability mass around zero and hypothesized that the model was overparametrized. Based on the simulation study, we performed clustering of the data with the following methods: BayesCov(Auto), BayesCorr(Auto), Bic(Auto), Infomut, InfomutNorm, SingleAbs, AverageAbs, CompleteAbs, WardAbs, SpecCorrAbs, and SpecCorrparAbs. For spectral clustering, we used either 1 or 10 repetitions of the k-means step; since the results obtained for 1 run of k-means were highly variable for 3, 4, and 5 clusters, we discarded these results. The resulting clusterings are given in Figure 3 and Table 2. All hierarchical clusterings started by clustering X_3 and X_5 (lymphocytes). This was confirmed by SpecCorrAbs-10 when required to provide 5 clusters. All hierarchical clustering methods then clustered X_1 and X_2 (immunoglobulins). This result was in agreement with both SpecCorrAbs-10 and SpecCorrparAbs-10 when required to provide 4 clusters. After the second step, we observed four behaviors for the AHC algorithms and classified them accordingly: (G1) BayesCov, BayesCorr, Infomut, and InfomutNorm; (G2) SingleAbs and AverageAbs; (G3) Bic; (G4) CompleteAbs and WardAbs.
While not a hierarchical clustering, SpecCorrAbs-10 provided successive clusterings that were in agreement with the methods in (G2). Algorithms in (G1) and (G3) clustered X_6 with {X_3, X_5}, creating a cluster of variables related to lymphocytes. Algorithms in (G1) and (G2) (and SpecCorrAbs-10) agreed that a partitioning of the variables into two clusters should lead to {X_1, X_2, X_3, X_5, X_6} on the one hand and {X_4} on the other hand. This clustering was also found by SpecCorrpar-10 when constrained to extract two clusters from the data. It was also considered the best clustering by BayesCov and BayesCorr. For Bic, the optimal partitioning was composed of three clusters, namely {X_3, X_5, X_6}, {X_1, X_2}, and {X_4}, which is in agreement with what the methods in (G1) would yield for three clusters; furthermore, it still kept X_4 separated from the other variables.
In Figure 4, we represented the evolution of the log_10 Bayes factor during hierarchical clustering for BayesCov, BayesCorr, and Bic. Note that, while the clustering steps are identical for BayesCov and BayesCorr, the log Bayes factors are similar but not identical. Likewise, while the first two clustering steps of Bic are identical to those of BayesCov and BayesCorr, one can see that, unlike BayesCov and BayesCorr, Bic considered the merging of {X_3, X_5} with X_6 almost as likely as that of X_1 with X_2. Also, while the successive clusterings of X_3 with X_5 and then X_6, as well as that of X_1 with X_2, strongly increased the Bayes factor for BayesCov and BayesCorr, the improvement brought by the clustering of {X_3, X_5, X_6} with {X_1, X_2} was less important for these methods.
All in all, this analysis led us to the following conclusions: it is very likely that variables X_1 and X_2 belong to the same cluster of dependent variables, and similarly for variables X_3 and X_5; there is very strong evidence that X_4 is independent of the other variables; finally, we suspect that X_3, X_5, and X_6 could belong to the same cluster of variables.

fMRI data
Cluster analysis is a popular tool to study the organization of brain networks in resting-state fMRI (Yeo et al., 2011; Kelly et al., 2012), by identifying clusters of brain regions with highly correlated spontaneous activity. We applied the 13 methods that were found to have good performance on simulations (see Figure 5) to a collection of resting-state time series. The time series had 205 time samples and were recorded for 82 brain regions in 19 young healthy subjects. See Appendix C for details on data collection and preparation.
We first aimed at establishing which clustering algorithms yielded similar results on these datasets. We more specifically investigated a 7-cluster solution, as this level of decomposition has been examined several times in the literature (Bellec et al., 2010; Power et al., 2011; Yeo et al., 2011). Each clustering algorithm was applied to the time series of each subject independently. For a given pair of methods, the similarity between the cluster solutions generated by both methods on the same subject was evaluated with the Rand index. The Rand indices were averaged across subjects, hence resulting in a method-by-method matrix capturing the (average) similarity of clustering outputs for each pair of methods (see Figure 5). An AHC with Ward's criterion was applied to this matrix in order to identify clusters of methods with similar cluster outputs. We visually identified five clusters of methods that had a high (> 0.7) average within-cluster Rand index. The largest cluster included classical AHCs such as CompleteAbs, AverageAbs, and WardAbs, as well as the Bic and BayesCov methods proposed here. It should be noted that this class of algorithms generated solutions for this problem that were very close to one another (average within-cluster Rand index > 0.8). The BayesCorr method was also close to that large group of methods, but not quite as much as the aforementioned methods (average Rand index of about 0.7), and was thus singled out as a separate cluster. The spectral methods were split into two clusters, depending on whether they were based on correlation (SpecCorrAbs-1 and SpecCorrAbs-10) or partial correlation (SpecCorrparAbs-1 and SpecCorrparAbs-10). Finally, the two variants of mutual information (Infomut and InfomutNorm) generated solutions that were highly similar to those of SingleAbs. It was reassuring that the variants of the Bayes methods proposed here performed similarly to algorithms known to produce physiologically plausible solutions, such as Ward (Thirion et al., 2014; Orban et al., in press). While we found some analogy between BayesCorr, BayesCov, Infomut, and InfomutNorm, it was intriguing that the variants of mutual information tested seemed to generate markedly different classes of solutions from the Bayes methods. We decided to examine the cluster solutions of these methods in more detail. As a reference, we examined the cluster solutions generated by WardAbs, in addition to two variants of Bayes clustering that yielded slightly different solutions (BayesCov and BayesCorr), and normalized mutual information, InfomutNorm. To represent the typical behavior of each method across subjects, we generated a "group" consensus clustering summarizing the stable features of the ensemble of individual cluster solutions. This consensus clustering was generated by the evidence accumulation algorithm (Fred and Jain, 2005) outlined below. First, each partition of each subject was represented as a binary 81-by-81 adjacency matrix A = (A_ij), where, for each pair (i, j) of brain regions, A_ij = 1 if areas i and j were in the same cluster, and A_ij = 0 otherwise. The adjacency matrices were then averaged across subjects and selected methods, yielding an 81-by-81 stability matrix C = (C_ij) in which each element C_ij codes for the frequency at which brain areas i and j fell in the same cluster. Finally, this stability matrix was used as a similarity matrix in an AHC with Ward's criterion to generate one consensus partition. Brain regions were grouped into the same consensus cluster if they had a high probability of falling into the same cluster on average across subjects and methods, hence the name consensus clustering.
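The evidence accumulation scheme outlined above can be sketched as follows. This is our own illustration: for brevity, Ward's criterion is applied to 1 − stability treated as a distance, a simplification of the original pipeline, and all function names are ours:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def coassignment(labels):
    """Binary co-clustering matrix of one partition: entry (i, j) is 1 if
    items i and j belong to the same cluster, and 0 otherwise."""
    lab = np.asarray(labels)
    return (lab[:, None] == lab[None, :]).astype(float)

def consensus_partition(partitions, n_clusters):
    """Evidence-accumulation consensus (Fred and Jain, 2005): average the
    co-clustering matrices into a stability matrix, then cluster
    1 - stability with an AHC using Ward's criterion."""
    C = np.mean([coassignment(p) for p in partitions], axis=0)
    dist = 1.0 - C
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method='ward')
    return fcluster(Z, t=n_clusters, criterion='maxclust'), C
```

Note that the consensus is invariant to how each input partition labels its clusters, since only co-assignment of pairs enters the stability matrix.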
Figure 6 represents the stability matrices and consensus clusters for the four methods of interest. As expected from our first experiment on the similarity of cluster outputs, the WardAbs and BayesCov methods were associated with highly similar stability matrices and almost identical consensus clusters. Many areas of high consensus could be identified (with values close to 0 or 1), illustrating the very good agreement of the cluster solutions across subjects. The outline of the consensus clusters as well as a volumetric representation of the brain parcellation are presented in Figure 6b. Some of these clusters closely matched those reported in previous studies: cluster 7 can be recognized as the visual network, cluster 2 as the sensorimotor network, and clusters 6 and 3 as the anterior and posterior parts of the default-mode network, respectively (Salvador et al., 2005; van den Heuvel et al., 2008; Bellec et al., 2010). By contrast with WardAbs and BayesCov, InfomutNorm tended to generate very large clusters, which was apparent both on the stability matrix and on the consensus clusters. The BayesCorr method was intermediate between BayesCov and InfomutNorm in terms of cluster size. These decompositions into very large clusters do not fit current views on the organization of resting-state networks.
Overall, our analysis of real fMRI data led to the following conclusions. Three main groups of methods emerged: spectral methods, mutual information (with SingleAbs), and finally all the other methods. Application of this last group of methods, which included the Bayes variants proposed here, resulted in a plausible decomposition into resting-state networks. In the absence of ground truth, it is not possible to further comment on the relevance of the differences in cluster solutions identified by the three groups of methods. We nonetheless noted that the methods based on mutual information led to large clusters that were difficult to interpret. Our interpretation is that the strategies implemented in Infomut and InfomutNorm did not behave well for these datasets.
As a final computational note, the time required by the different methods to cluster data is summarized in Table 3. Although the differences in execution time may reflect the quality of the implementation, the methods proposed here were the slowest of the hierarchical methods, but were still faster than spectral clustering.

Contributions
Summary. We proposed novel similarity measures well suited for agglomerative hierarchical clustering. These measures rely on a Bayesian model comparison for multivariate normal random variables. On synthetic data with a known (ground truth) partition, hierarchical clustering based on the Bayesian measures was found to outperform several standard clustering procedures in terms of adjusted Rand index and classification accuracy. On the toy example, the Bayesian approaches led to results similar to those of mutual information clustering techniques, with the advantage of an automated thresholding. On the real fMRI data, the Bayesian measures led to results consistent with standard clustering methods, in contrast to methods based on mutual information, which exhibited a highly atypical behavior.

Bayesian normalization of mutual information.
A key feature of the Bayesian approach is its ability to take the dimension of the clusters into account. Dimensionality is an important issue in two respects (see Appendix A for an illustration). First, mutual information is an extensive measure that depends on the dimension of the variables. This has motivated the introduction of a normalization factor in the application of mutual information to hierarchical clustering (Kraskov et al., 2005; Kraskov and Grassberger, 2009). A second issue is the existence of a bias in the estimation of mutual information. This bias mechanically increases with the dimensionality of the variables. Because of the two issues described above, hierarchical clustering based on mutual information will tend to cluster unrelated but large variables rather than correlated variables of lower dimensions. As demonstrated on real fMRI data, the heuristic proposed by Kraskov et al. (2005) does not provide a general solution to the issue of dimensionality. Furthermore, it removes some interesting features of mutual information, such as the additivity of the clustering measure. By contrast, the Bayesian approach takes the dimensionality into account in a principled way, providing a quantitative version of Occam's factor (Jaynes, 2003, Chap. 20). The Bayesian normalization is additive rather than multiplicative, thus preserving the additive properties of mutual information.
From similarity measure to log Bayes factor. We defined the similarity measure s(X_i, X_j) between any two pairs of variables X_i and X_j as a log Bayes factor. At each step, the pair (X_i0, X_j0) that had the largest similarity measure was merged. Taking into account the unique features of s as the log Bayes factor defined in Equation (1) allowed us to have access to a global measure of fit as defined in Equation (24) as well as provide an automated stopping rule that behaved very well on simulated data. Going from a similarity measure to a log Bayes factor has other advantages that could take the clustering proposed here even further (see below).
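The merge loop and the automated stopping rule can be sketched as follows. The exact similarity of Equation (1) involves inverse-Wishart marginal likelihoods; as a stand-in, this sketch uses only the asymptotic BIC-like form N·Î − (D_i D_j / 2) ln N suggested by the asymptotic analysis, so `bic_similarity` and `ahc_bayes` are illustrative names, not the paper's implementation.

```python
import numpy as np

def bic_similarity(S, idx_i, idx_j, N):
    """BIC-style stand-in for the log Bayes factor of merging two clusters.

    S: full sample covariance matrix; idx_i, idx_j: index lists; N: samples.
    Asymptotic form: N * I_hat - (D_i * D_j / 2) * ln N, where I_hat is the
    plug-in Gaussian mutual information between the two clusters.
    """
    ii, jj = np.ix_(idx_i, idx_i), np.ix_(idx_j, idx_j)
    uu = np.ix_(idx_i + idx_j, idx_i + idx_j)
    mi = 0.5 * (np.log(np.linalg.det(S[ii])) + np.log(np.linalg.det(S[jj]))
                - np.log(np.linalg.det(S[uu])))
    return N * mi - 0.5 * len(idx_i) * len(idx_j) * np.log(N)

def ahc_bayes(S, N):
    """Greedy AHC: merge the pair with the largest similarity; stop when
    the best log Bayes factor is negative (automated stopping rule)."""
    clusters = [[d] for d in range(S.shape[0])]
    while len(clusters) > 1:
        best, pair = -np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = bic_similarity(S, clusters[a], clusters[b], N)
                if s > best:
                    best, pair = s, (a, b)
        if best < 0:   # no merge is favored over keeping clusters independent
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

On a block-diagonal covariance matrix with two correlated pairs, this loop merges each pair and then stops automatically, since merging independent blocks yields a negative penalized similarity.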
Practical value of the Bayes/Bic clustering in fMRI. The new alternatives to mutual information introduced in this paper (i.e., Bayes and Bic) proved useful for the analysis of resting-state fMRI. The benefits were particularly clear when compared to InfoMut, which tended to create large, inhomogeneous clusters. By contrast, both Bayes and Bic had a behavior similar to standard hierarchical clustering based on Pearson's linear correlation. The possible benefits of Bayes and Bic over those canonical methods are still substantial. First, mutual information provides a multivariate measure of interaction that is well adapted to hierarchical brain decomposition (Tononi et al., 1994; Marrelec et al., 2008) and that has a clear interpretation in information theory (Watanabe, 1960; Joe, 1989; Studený and Vejnarová, 1998). For these reasons, the mutual information for Gaussian variables is more appealing than a simple average of pairwise correlation coefficients across clusters. Because mutual information is measured between clusters, it is natural to build the clusters themselves based on this metric.
A second benefit of the proposed approach is that Bic proved to be the most stable of all tested methods in the range of 5-15 clusters on real fMRI datasets. The increase in stability over Ward's was modest, but significant. This advantage may become even more substantial if the clustering is performed in higher dimension, i.e., with smaller areas than the AAL brain parcellation or even at the voxel level.

Modeling choices
Choice of priors. Any Bayesian analysis requires the introduction of prior distributions. In the present study, we needed the prior distribution for the covariance matrix associated with any clustering of X. Our choices were guided by one assumption, one rule of consistency, and one rule of simplicity. First, our assumption was not to assume a priori any sort of dependence between covariance matrices associated with different clusters. This allowed for the decomposition of any prior as a product of independent priors; see Equations (8) and (21). The rule of consistency was to consider that the prior for a given covariance matrix should not be contradictory at different levels of the hierarchy. This is why we set the prior distribution for the global covariance matrix Σ as an inverse-Wishart distribution and then derived the prior for any covariance matrix Σ_k as the prior distribution for Σ integrated over all parameters that do not appear in Σ_k; using such an approach, the resulting prior turned out to be an inverse-Wishart distribution as well. Last, the rule of simplicity is the one that dictated the use of inverse-Wishart distributions as prior distributions for the covariance matrices. Such a family of priors had the twofold advantage of being closed under marginalization and allowing for closed-form expressions of the quantities of interest. An inverse-Wishart distribution is characterized by two parameters: the degrees of freedom ν and the inverse scale matrix Λ. If Σ is a D-by-D matrix, we must have ν > D − 1 for the distribution to be normalized. Also, Λ must be positive definite. From there, we had two strategies.
The first one was to set the degrees of freedom to the lowest value that still corresponded to a well-defined distribution (ν = D) and a diagonal scale matrix that optimized the marginal model likelihood. An alternative approach was to work with the sample correlation matrix, set ν = D + 1, and set Λ equal to the D-by-D identity matrix I, since this choice corresponds to a distribution that is associated with uniform marginal distributions of the correlation coefficients (Barnard et al., 2000). While we believe that our assumption and the rule of consistency are sensible choices, we must admit that we are not quite as content with the choice of inverse-Wishart distributions for priors. The major issue with such a family of priors is that they simultaneously constrain the structure of correlation and the variances. By contrast, it seems intuitive that clustering should depend on the correlation structure only, not on the variances. As such, the prior on the variances should ideally be set independently from the correlation structure. Priors that separate variance and correlation have already been proposed (Barnard et al., 2000). Unfortunately, the use of such priors would make it impossible to provide a closed form for the marginal model likelihood. While numeric schemes could be implemented to circumvent this issue, it would render the proposed procedure much more complex and computationally burdensome. By contrast, the algorithm detailed here is rather straightforward. Also, the influence of the prior vanishes when the sample size increases. From a practical perspective, the three methods proposed here (two, BayesCov and BayesCorr, with different priors, and one, Bic, not influenced by the prior) exhibited similar behaviors and still outperformed other existing methods in the simulation study. We take this as evidence of the robustness of the method to the choice of prior.
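The property behind the second strategy can be checked numerically. This sketch, assuming SciPy's `invwishart` parameterization, samples covariance matrices from the prior with ν = D + 1 and Λ = I and inspects the induced marginal distribution of one correlation coefficient, which should be close to uniform on [−1, 1] (Barnard et al., 2000), i.e., mean ≈ 0 and variance ≈ 1/3.

```python
import numpy as np
from scipy.stats import invwishart

D = 4
# Sample covariance matrices from the prior IW(nu = D + 1, Lambda = I).
samples = invwishart.rvs(df=D + 1, scale=np.eye(D), size=5000, random_state=0)
# Marginal distribution of the correlation coefficient between the
# first two coordinates.
corr01 = samples[:, 0, 1] / np.sqrt(samples[:, 0, 0] * samples[:, 1, 1])
# A uniform distribution on [-1, 1] has mean 0 and variance 1/3.
```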
Covariance vs. precision matrix modeling. The presence of clusters of variables that are mutually independent is equivalent to having a covariance matrix Σ that is block diagonal, which is itself equivalent to having a precision (or concentration) matrix Υ = Σ^−1 that is also block diagonal.
One could therefore solve the problem using precision matrices instead of covariance matrices. The corresponding calculations can be found in Appendix D. The main difference between the two approaches stems from the fact that a submatrix of the inverse covariance matrix is not equal to the inverse of the corresponding submatrix of the covariance matrix, that is, Ω_k = (Λ^−1)_k ≠ (Λ_k)^−1, unless Λ is block diagonal; also, Wishart and inverse-Wishart distributions do not marginalize the same way. These differences are to be related to the fact that a submatrix of a covariance matrix is better estimated than the whole covariance matrix, while the same does not hold for a precision matrix. From there, we could expect the covariance-based approach to perform better than the precision-based approach, and the difference to increase with increasing D. This was confirmed on our synthetic data, where the precision-based approach behaved as well as BayesCov and BayesCorr for D = 6 but had worse results than these two approaches for D ∈ {10, 20, 40}. Besides, the automated stopping rule was much more efficient with BayesCov and BayesCorr than it was with the precision-based approach. As a final note, basing the method on concentration matrices yielded a slower algorithm, arguably because of the matrix inversions that are required.
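The submatrix/inverse mismatch mentioned above is easy to verify numerically; a small sketch with a generic positive definite matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Lam = A @ A.T + 4.0 * np.eye(4)   # a generic positive definite matrix
k = [0, 1]                        # indices of a candidate cluster

# Submatrix of the inverse vs. inverse of the submatrix: different in general.
sub_of_inv = np.linalg.inv(Lam)[np.ix_(k, k)]
inv_of_sub = np.linalg.inv(Lam[np.ix_(k, k)])
mismatch = not np.allclose(sub_of_inv, inv_of_sub)

# For a block-diagonal matrix, the two operations agree.
B = np.zeros((4, 4))
B[:2, :2] = Lam[:2, :2]
B[2:, 2:] = Lam[2:, 2:]
agree = np.allclose(np.linalg.inv(B)[np.ix_(k, k)], inv_of_sub)
```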
Sample covariance matrix vs. full dataset. Intuitively, the structure of dependence of a multivariate normal distribution is contained in its covariance matrix. None of the algorithms that we used here needs the full dataset, but only the sample covariance (or correlation) matrix. This is why we started with a likelihood model that only considers the covariance matrix [see Equations (3) and (7)]. Rigorously, this model is only valid for N ≥ D; for N < D, one should resort to a model of the full data as being multivariate normal with a mean µ and covariance matrix Σ. Nonetheless, we kept the 'intuitive' approach as it has fewer hyperparameters, is easier to deal with, and leads to formulas that are simpler to interpret. Also, the resulting similarity measure [Equation (11)] is not restricted to N ≥ D, but is well defined for N < D as well. From a practical perspective, the 'intuitive' algorithm only requires the sample covariance matrix as input, while a full model would also require the sample mean. Finally, this simpler model exhibited good behavior on our synthetic data, even for small datasets.

Directions for future work
Computational costs. Regarding the computational cost, measures derived from mutual information or a Bayesian approach are more demanding than standard methods such as Avg or Ward. The derivation of the similarity matrix and the search for the most similar pairs of clusters are the two critical operations that can be optimized to decrease computation time. It is always possible to speed up these two steps by taking advantage of the fact that the similarity matrices of two successive iterations of the algorithm have many elements in common, as all but one element of a partition at a given iteration are identical to the elements of the partition of the previous iteration. Critically, in the case of the Avg and the Ward methods, it is in addition possible to derive the similarity matrix at every iteration based only on the similarity matrix at initialization, through successive updates using the Lance-Jambu-Williams recursive formula (Batagelj, 1988). By contrast, other measures, including BayesCov, BayesCorr, Bic, InfoMut, and InfomutNorm, have to be re-evaluated independently at every step, which means that the determinant of a covariance matrix of increasing size has to be computed. Finding an update formula analogous to Lance-Jambu-Williams for clustering methods based on variants of the mutual information would substantially accelerate these algorithms.
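For reference, the recursive update has the general form d(k, i∪j) = a_i·d(k, i) + a_j·d(k, j) + b·d(i, j) + g·|d(k, i) − d(k, j)|, with coefficients depending on the linkage. The sketch below instantiates it for average linkage (a_i = n_i/(n_i + n_j), b = g = 0), which is what makes Avg updatable without revisiting the raw data; the function name is illustrative.

```python
def lance_williams_average(d_ki, d_kj, d_ij, n_i, n_j):
    """Lance-Williams update instantiated for average linkage:
    d(k, i u j) = a_i * d(k, i) + a_j * d(k, j) + b * d(i, j)
                  + g * |d(k, i) - d(k, j)|,
    with a_i = n_i / (n_i + n_j), a_j = n_j / (n_i + n_j), b = g = 0.
    d_ij is kept in the signature for the general form but is unused here.
    """
    return (n_i * d_ki + n_j * d_kj) / (n_i + n_j)
```

The update can be checked against a brute-force average-linkage computation on points of the real line.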
From deterministic to stochastic clustering. Another extension would be to replace the deterministic rule of selecting the pair with the largest similarity measure for merging by a probabilistic rule, where the probability of clustering a given pair is given by the posterior probability of the resulting clustering.
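Since each similarity is a log Bayes factor, such a rule amounts to exponentiating and normalizing the similarities, then sampling a pair instead of taking the argmax. A minimal sketch (the function name and dictionary interface are illustrative):

```python
import numpy as np

def sample_merge(log_bayes_factors, rng):
    """Probabilistic merge rule: exponentiate the log Bayes factors and
    normalize, so that each candidate pair is drawn with probability
    proportional to the posterior probability of the clustering its
    merge would produce."""
    pairs = list(log_bayes_factors)
    lbf = np.array([log_bayes_factors[p] for p in pairs])
    w = np.exp(lbf - lbf.max())   # shift for numerical stability
    idx = rng.choice(len(pairs), p=w / w.sum())
    return pairs[idx]
```

When one pair dominates, the rule reduces to the deterministic argmax; when several merges are nearly equally supported, it explores them with their posterior weights.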
Group analysis. A last extension that could easily be implemented in the present framework is the generalization of the method to account for E different entities (e.g., subjects in fMRI) sharing the same structure but with potentially different covariance matrices for that structure. If each entity e is characterized by a variable X[e] and corresponding sample sum-of-squares matrix S[e], one can perform AHC on each X[e] using S[e] and the corresponding similarity measure s[e]. However, with a straightforward modification of the present method, one could also perform a global AHC of all E covariance matrices considered simultaneously. Assuming that the covariance matrices of the different entities are independent given the common underlying structure, the resulting similarity measure is the sum of all individual similarity measures s[e].
Generalization to other types of distribution. Altogether, the Bayesian framework that we used provides a principled way to generalize our approach to distributions other than multivariate normal. Such a generalization would potentially account for a wide variety of situations, such as nonlinear dependencies or discrete distributions. This would widen the scope of possible applications of the technique, e.g., to genetics (Butte and Kohane, 2000; Zhou et al., 2004; Dawy et al., 2006; Priness et al., 2007). The core of the problem for such generalizations is the possibility to express the marginal posterior likelihood. For multivariate discrete variables, we expect it to be feasible. For other distributions, obtaining a closed form might be out of reach. Nonetheless, the marginal posterior likelihood could be approximated using various criteria, such as the AIC (Akaike, 1974) or variants thereof, namely AICc (Burnham and Anderson, 2004; McQuarrie and Tsai, 1998, § 2.3.1) or AICu (McQuarrie and Tsai, 1998, § 2.4.1), or the BIC (Schwarz, 1978), which naturally appeared in the present derivation.
Application to truly hierarchical data. In the present manuscript, we used a hierarchical algorithm as a way to extract an underlying structure of dependence from data. Our assumption was that there was one such structure. Such an approach provided a simple and efficient clustering algorithm with an interesting connection to mutual information. However, the method as presented here is not able to deal with data that are truly organized hierarchically. Extending it to deal with such data would broaden the scope of the algorithm. One way to do so would be to use Dirichlet process mixtures (Heller and Ghahramani, 2005; Heller, 2007; Savage et al., 2009; Cooke et al., 2011; Darkins et al., 2013; Sirinukunwattana et al., 2013), together with a model of dependent variables.

Conclusion
In this paper, we proposed a procedure based on Bayesian model comparison to decide whether or not to merge Gaussian multivariate variables in an agglomerative hierarchical clustering procedure. The resulting similarity measure was found to be closely related to the standard mutual information, with some additional corrections for the dimensionality of the datasets. These new Bayesian alternatives to mutual information turned out to be beneficial to hierarchical clustering on Gaussian simulations and real datasets alike. Because of the simplicity of its implementation, its good practical performance, and the potential generalizations to other types of random variables, we believe that the approach presented here is a useful new tool in the context of hierarchical clustering.

A Illustration of the behavior of estimated mutual information
In the case of a D-dimensional variable X with a multivariate normal distribution, the mutual information between two subvectors X_i and X_j is given by (Marrelec et al., 2008)

    I(X_i, X_j) = (1/2) ln [ |Σ_i| |Σ_j| / |Σ_{i∪j}| ],    (26)

where Σ_k, k ∈ {i, j, i ∪ j}, is the covariance matrix of X_k and | · | is the usual determinant function.
Estimation bias. In this case, mutual information is estimated using its plug-in estimator Î, that is, Equation (26) where the model covariance matrices have been replaced by their estimators (i.e., the corresponding sample covariance matrices). For large N, this estimator suffers from a systematic bias, as it was shown that (Marrelec and Benali, 2011)

    E[Î(X_i, X_j)] − I(X_i, X_j) ≈ D_i D_j / (2N).

This bias is additive and is a (quadratic) function of the problem dimensionality. This is consistent with the fact that, under independence, 2N Î(X_i, X_j) asymptotically follows a chi-square distribution with D_i D_j degrees of freedom (Kullback, 1968, Chap. 12, § 3.6; Press, 2005, § 11.3.2).
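The bias is easy to reproduce by simulation: for independent Gaussian blocks the true mutual information is zero, so the average plug-in estimate directly measures the bias, which should be close to D_i D_j / (2N). A sketch with assumed dimensions and sample size:

```python
import numpy as np

def plugin_mi(X, D_i):
    """Plug-in Gaussian mutual information between X[:, :D_i] and the rest."""
    S = np.cov(X, rowvar=False)
    return 0.5 * (np.linalg.slogdet(S[:D_i, :D_i])[1]
                  + np.linalg.slogdet(S[D_i:, D_i:])[1]
                  - np.linalg.slogdet(S)[1])

rng = np.random.default_rng(0)
D_i, D_j, N, reps = 3, 3, 200, 200
# Independent blocks: the true mutual information is 0, so the mean
# plug-in estimate is the bias itself, expected near D_i * D_j / (2 * N).
bias = np.mean([plugin_mi(rng.standard_normal((N, D_i + D_j)), D_i)
                for _ in range(reps)])
```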
Extensivity of the measure. Besides the above problem of estimation, mutual information itself is an extensive quantity, i.e., it mechanically increases with an increase in the dimensionality of the problem. To better demonstrate this effect, consider a model of homogeneous matrix for Σ. More specifically, let A_D(ρ) be a D-by-D homogeneous matrix with parameter ρ, i.e., a matrix with 1s on the diagonal and all off-diagonal elements equal to ρ. A_D(ρ) has two eigenvalues: 1 + (D − 1)ρ with multiplicity 1 (associated with the vector composed only of 1s) and 1 − ρ with multiplicity D − 1 (associated with the subspace of vectors with a zero mean). This covariance matrix is therefore positive definite for 0 ≤ ρ < 1 and its determinant is given by [1 + (D − 1)ρ](1 − ρ)^(D−1). Assuming that Σ = A_D(ρ), mutual information can be expressed as

    I(X_i, X_j) = (1/2) ln { [1 + (D_i − 1)ρ] [1 + (D_j − 1)ρ] / ( [1 + (D_{i∪j} − 1)ρ] (1 − ρ) ) },

where D_k, k ∈ {i, j, i ∪ j}, is the dimension of X_k. This expression can lead to paradoxical results: the mutual information is about 0.31 for D_i = D_j = 5 and ρ = 0.3, while it is about 0.33 for D_i = D_j = 7 and ρ = 0.25. More generally, I(X_i, X_j) is an increasing function of D_i and D_j (as can be seen by differentiation). This means that, for the same value of ρ, mutual information will be larger for larger values of D_i and D_j. It also means that mutual information will favor the merging of two variables with smaller marginal correlations if the dimensionality of the variables is large enough, thus systematically favoring larger clusters. To give a feeling of the amplitude of this behavior, set D′_k = D_k − 1 and assume that D′_k ρ ≪ 1. Mutual information can then be approximated by

    I(X_i, X_j) ≈ ρ² (D′_i D′_j + D′_i + D′_j + 1) / 2 = D_i D_j ρ² / 2.
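The numerical example above can be checked directly. This sketch implements the closed-form expression for Σ = A_D(ρ) (`gaussian_mi_homogeneous` is an illustrative name) and cross-checks it against an explicit determinant computation:

```python
import numpy as np

def gaussian_mi_homogeneous(D_i, D_j, rho):
    """Gaussian mutual information when the joint covariance is A_D(rho),
    using |A_D(rho)| = [1 + (D - 1) * rho] * (1 - rho)**(D - 1)."""
    def logdet(D):
        return np.log(1 + (D - 1) * rho) + (D - 1) * np.log(1 - rho)
    return 0.5 * (logdet(D_i) + logdet(D_j) - logdet(D_i + D_j))

def homogeneous(D, rho):
    """The D-by-D matrix with 1s on the diagonal and rho elsewhere."""
    return (1 - rho) * np.eye(D) + rho * np.ones((D, D))

# Cross-check the closed form against direct determinants.
S = homogeneous(10, 0.3)
direct = 0.5 * (np.linalg.slogdet(S[:5, :5])[1]
                + np.linalg.slogdet(S[5:, 5:])[1]
                - np.linalg.slogdet(S)[1])
# The paradox: a weaker correlation in higher dimension yields a larger
# mutual information (compare (7, 7, 0.25) with (5, 5, 0.3)).
```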

B Hyperparameter optimization
Given a clustering {X_1, . . . , X_K} of X, optimizing the marginal likelihood with respect to a diagonal Λ leads to the optimization of an objective in which k_d denotes the cluster containing X_d. Differentiation with respect to Λ_dd leads to an equation whose solution is obtained by noticing that it is equivalent to a fixed-point form. Optimizing at the first level (i.e., with D clusters) yields the corresponding solution.

C Real fMRI data: datasets and preprocessing
We used the 'Atlanta' resting-state fMRI database (Liu et al., 2009). This resource was made publicly available as part of the 1000-connectome project (Biswal et al., 2010). The Atlanta sample includes 28 subjects (age ranging from 22 to 57 years, 15 women) with one structural MRI and one fMRI run each (205 volumes, TR = 2 s), acquired on a 3 T scanner. The datasets were preprocessed using the neuroimaging analysis kit (NIAK), version 0.6.5c (Bellec et al., 2012). The parameters of a rigid-body motion were first estimated at each time frame of the fMRI dataset (no correction of inter-slice difference in acquisition time was applied). The median volume of the fMRI time series was coregistered with a T1 individual scan using Minctracc (Collins et al., 1994), which was itself transformed to the Montreal Neurological Institute (MNI) non-linear template (Fonov et al., 2011) using the CIVET pipeline (Zijdenbos et al., 2002). The rigid-body transform, fMRI-to-T1 transform and T1-to-stereotaxic transform were all combined, and the functional volumes were resampled in the MNI space at a 3 mm isotropic resolution. The "scrubbing" method of Power et al. (2012) was used to remove the volumes with excessive motion (frame displacement greater than 0.5). The following nuisance parameters were regressed out from the time series at each voxel: slow time drifts (basis of discrete cosines with a 0.01 Hz high-pass cut-off), average signals in conservative masks of the white matter and the lateral ventricles, as well as the first principal components (95% energy) of the six rigid-body motion parameters and their squares (Giove et al., 2009). The fMRI volumes were then spatially smoothed with a 6 mm isotropic Gaussian blurring kernel. Because some of the measures considered in this paper are poorly conditioned when the number of spatial locations is larger than the number of time points (BIC and InfoMut), the fMRI time series were spatially averaged on each of the areas of the AAL brain template (Tzourio-Mazoyer et al., 2002). To further reduce the spatial dimension, only the 81 cortical AAL areas were included in the analysis (excluding the cerebellum, the basal ganglia and the thalamus).
The clustering methods were applied to these regional time series. Note that 8 subjects were excluded because there were not enough time points left after scrubbing (a minimum number of 190 volumes was deemed acceptable), and one additional subject had to be excluded because the quality of the T1-fMRI coregistration was substandard (by visual inspection). A total of 19 subjects was thus actually used in this analysis.
The prior for Υ_k that derives from that of Υ_{i∪j} is a Wishart distribution with ν_{i∪j} degrees of freedom and scale matrix Ω_k (Press, 2005, § 5.1.2). This leads to a marginal likelihood in which each integrand is proportional to a Wishart distribution with N + ν_{i∪j} degrees of freedom and scale matrix (S_k + Ω_k^−1)^−1.

Figure 3: Toy example. Result of clustering. Algorithms in the top row clustered X_4 at the last step, while it was clustered at the step before last for algorithms in the bottom row. Algorithms in the left column clustered X_6 with {X_3, X_5}, while X_6 was clustered with {X_1, X_2} for the algorithms in the right column. Parts in grey correspond to clustering steps that would not be performed by BayesCovAuto or BayesCorrAuto in (G1), or Bic in (G3).

Figure 4: Toy example. Result of clustering for BayesCorr, BayesCov and Bic. The values on the y axis correspond to the log_10 Bayes factor in favor of the global clustering obtained at each step compared to a model where all variables are independent (step 0 of hierarchical clustering).

Figure 5: Real resting-state fMRI data: between-method similarity. Panel a: Rand indices between individual partitions generated with different methods, averaged across all subjects and scales (number of clusters). Panel b: Hierarchical clustering using the matrix shown in Panel a as a similarity measure and Ward's criterion.

Figure 6: Real data: consensus clustering. A consensus clustering was generated based on the average adjacency matrices across all subjects (Panel a). The (weighted) adjacency matrix associated with the consensus clustering is represented along with a volumetric brain parcellation (Panel b). The weights in the adjacency matrix were added to establish a visual correspondence with the volumetric representation. Note that the brain regions have been ordered based on the hierarchical clustering generated with WardAbs.

Table 1: Toy example. Summary statistics for the HIV data: sample vari-

Table 2: Toy example. Result of spectral clustering with increasing number of clusters.

Table 3: Real resting-state fMRI data: computational cost. Time required by each method to cluster one dataset. For non-hierarchical methods, we summed the times used to perform clustering at each scale.