Approximate Joint Diagonalization and Geometric Mean of Symmetric Positive Definite Matrices

We explore the connection between two problems that have arisen independently in the signal processing and related fields: the estimation of the geometric mean of a set of symmetric positive definite (SPD) matrices and their approximate joint diagonalization (AJD). Today there is a considerable interest in estimating the geometric mean of a SPD matrix set in the manifold of SPD matrices endowed with the Fisher information metric. The resulting mean has several important invariance properties and has proven very useful in diverse engineering applications such as biomedical and image data processing. While for two SPD matrices the mean has an algebraic closed form solution, for a set of more than two SPD matrices it can only be estimated by iterative algorithms. However, none of the existing iterative algorithms feature at the same time fast convergence, low computational complexity per iteration and guarantee of convergence. For this reason, recently other definitions of geometric mean based on symmetric divergence measures, such as the Bhattacharyya divergence, have been considered. The resulting means, although possibly useful in practice, do not satisfy all desirable invariance properties. In this paper we consider geometric means of covariance matrices estimated on high-dimensional time-series, assuming that the data is generated according to an instantaneous mixing model, which is very common in signal processing. We show that in these circumstances we can approximate the Fisher information geometric mean by employing an efficient AJD algorithm. Our approximation is in general much closer to the Fisher information geometric mean as compared to its competitors and verifies many invariance properties. Furthermore, convergence is guaranteed, the computational complexity is low and the convergence rate is quadratic. The accuracy of this new geometric mean approximation is demonstrated by means of simulations.


Introduction
The study of distance measures between symmetric positive definite (SPD) matrices and the definition of the center of mass for a number of them has recently grown very fast, driven by practical problems in radar data processing, image processing, computer vision, shape analysis, medical imaging (especially diffusion MRI and Brain-Computer Interface), sensor networks, elasticity, mechanics, numerical analysis and machine learning (e.g., [1][2][3][4][5][6][7][8][9][10]). Interestingly, in this endeavor disparate perspectives from matrix analysis, operator theory, differential geometry, probability and numerical analysis have yielded converging results. In fact, we arrive at the same formalism and we end up with the same Riemannian metric from a pure differential geometric point of view [3,5,[11][12][13][14][15][16], or from an information geometric point of view, assuming the multivariate Normal distribution of the data and adopting the Fisher Information metric [17,18], dating back to the seminal works of Rao [19] and Amari [20].
The set of SPD matrices with a given dimension forms a smooth manifold, which we name the SPD manifold. As we will discuss below, there are various ways of defining natural geometries on the SPD manifold. If one is given a set of SPD matrices it may be useful to define a center of mass or geometric mean for the set taking into account the specific geometry of the manifold. One expects such a mean to be a more appropriate representative of the set as compared to other means that are not specific to the SPD manifold, such as the usual arithmetic and harmonic ones. The geometric mean is defined based on a chosen metric (distance) and several metrics have been proposed. These include the aforementioned Fisher information metric, the log-Euclidean metric [5,21,22] and the Bhattacharyya divergence [23,24], also named S-divergence [25,26], which turns out to be a specific instance of the α-divergence [27]. The geometric mean based on the Fisher information metric, hereafter referred to as the FI (Fisher Information) mean, satisfies a number of desirable invariance properties, including congruence invariance, self-duality, joint homogeneity and the determinant identity. This is not always the case for geometric means based on other metrics, thus the FI mean is a fundamental object. Whereas for two SPD matrices the geometric mean has a straightforward definition, this is not the case for a set composed of more than two matrices [2]. For its estimation, regardless of the definition, one has to resort to either geometrical (constructive) procedures [28][29][30] or iterative optimization algorithms (see [30] for a comparison). In this work we do not consider constructive procedures because, in general, they do not satisfy all desirable properties of a geometric mean or, if they do, their computational cost becomes prohibitive as the number of matrices in the set increases [30]. Among iterative optimization algorithms for estimating the FI mean the most widely used is a simple gradient descent algorithm in the SPD manifold [10]. This algorithm features a moderate computational complexity per iteration and linear convergence rate. However, convergence itself is not guaranteed without choosing a small enough step-size. Specifically, if the algorithm convergences it does so to the only critical point, which is the global minimum, but it may not converge. Due to the convexity of the costfunction, in order to guarantee convergence the step-size must be adjusted as a function of the radius of a ball containing the data points [31] or an Armijo step-size search must be carried out at each iteration [30], thus the overall computational cost to ensure convergence is of concern. Recent attempts to apply conjugate gradient optimization and second order methods based on the exact or an approximate Hessian (such as trust-region and BFGS) improve the convergence rate, as expected. In [32] the authors show that for 3 x 3 matrices (e.g., diffusion tensors) a Newton algorithm based on explicit Hessian computations outperforms the gradient descent algorithm unless the radius of the ball is small. However, this advantage is completely nullified by the increased complexity per iteration as the dimension of the matrices in the set increases; extensive simulations performed by [30] have shown that with matrix dimension as little as ten the gradient descent approach is overall advantageous over all second order alternatives. Moreover, for the second order methods, convergence conditions are more restrictive. The recently proposed majorization-minimization algorithm of [33] guarantees convergence with linear convergence rate but high complexity per iteration, burdening its usefulness in practice for matrices of large dimension and/or large matrix sets. In light of this situation, the search for an efficient algorithm for estimating the geometric mean in large data set problems is currently a very active field.
In this article we introduce a new approximation to the FI mean springing from the study of the relation between the geometric mean of a set of SPD matrices and its approximate joint diagonalization [34][35][36]. We show that the invariance properties of this approximation derive from the invariance properties of the AJD algorithm employed. For instance, we obtain an approximation satisfying congruence invariance, joint homogeneity and the determinant identity, that is, all important properties of the FI geometric mean except self-duality, using the AJD algorithm developed in [36]. Using this AJD algorithm convergence is guaranteed, the computational complexity per iteration is low and the convergence rate is quadratic when the signalto-noise ratio is favorable. As such, it offers an interesting alternative to existing iterative algorithms. Moreover, an on-line implementation is straightforward, allowing a fast on-line estimation of the geometric mean of an incoming stream of SPD matrices. We mention here that the approximate joint diagonalization (AJD) problem has originated in a completely different context and in the previous literature is completely unrelated to the problem of estimating the geometric mean. In fact, a solution to the AJD problem has arisen in statistics almost 30 years ago to solve the common principal component problem [37]. In the literature on signal processing it has appeared about 20 years ago [34] as an instrument to solve a very wide family of blind source separation problems, including the well-known independent component analysis [38]. Nonetheless, the connection between the geometric mean of a SPD matrix set and their AJD has remained unexplored so far.
In the following sections we introduce the notation and nomenclature. Then we review some concepts from Riemannian geometry and relevant metrics used to define a geometric mean. Then we establish the connection between the geometric mean of a SPD matrix set and their approximate joint diagonalizer. We introduce our approximation and we study its properties. In the result section we illustrate the accuracy of our approximation by means of simulations. Finally, we briefly discuss an on-line implementation and we conclude.

Notation and Nomenclature
In the following we will indicate matrices by upper case italic characters (A), vectors, integer indices, random variables by lower case italic characters (a) and constants by upper case Roman characters (A). A set of objects will be enclosed in curly brackets such as n ∊ {1, . . ., N}. Whenever possible, the same symbol will be used for indices and for the upper bound for summation and products, thus ∑ n a n will always stand short for X N n¼1 a n . We will denote by tr(Á), |Á|, (Á) T , and kÁk F the trace of a matrix, its determinant, its transpose and its Frobenius norm, respectively. The operator diag(Á) returns the diagonal part of its matrix argument. The identity matrix is denoted by I and the matrix of zeros by 0. S will denote a symmetric matrix, C a symmetric positive definite matrix (SDP) and D, Δ will be reserved for diagonal matrices. A set of K SPD matrices will be indicated by {C 1 , . . ., C K } or shortly as {C k }. An asymmetric divergence from SPD matrix C 2 to SPD matrix C 1 will be denoted as δ (C 1 C 2 ), whereas a symmetric distance or divergence between two SPD matrices will be denoted δ (C 1 $C 2 ). The lambda symbol, as in λ n (A) will be reserved for the n th eigenvalue of matrix A. For the sake of brevity, notations of the kind ln 2 λ n (A) shall be read (lnλ n (A)) 2 . We will make extensive use of symmetric functions of eigenvalues for SDP matrices. For a symmetric matrix S and SPD matrix C these functions have general form Uf(W)U T , where U is the orthogonal matrix holding in its columns the eigenvectors of S or C and W is the diagonal matrix holding the corresponding eigenvalues, to which the function applies element-wise. In particular, we will employ the following functions: inverse C −1 = UW −1 U T , symmetric square root C ½ = UW ½ U T , symmetric square root inverse C −½ = UW −½ U T , logarithm ln(C) = Uln(W)U T and exponential exp(S) = Uexp(W)U T .

Data Model
In many engineering applications we are confronted with multivariate observations obeying a linear instantaneous mixture generative model. For example, in electroencephalography (EEG) we observe N time-series of scalp electric potentials, typically sampled a few hundreds of times per second. Let us denote by x(t)2< N the multivariate vector holding the observed data, in our example, the electric potentials recorded at the N scalp electrodes at time t. In EEG the sources of interest are equivalent electric dipoles formed by assemblies of locally synchronized pyramidal cells in the cortex. The current induced by these dipoles diffuses to the scalp sensors. Because of well-grounded physical and physiological reasons this process can be reasonably approximated as a linear instantaneous mixture [39], yielding generative model for the observed data where s(t)2< P , P N, holds the time series of the P cerebral sources to be estimated, (t) 2< N is a noise (model error) term, assumed uncorrelated to s(t), and A2< NxP is the full column rank time-invariant mixing matrix determining the contribution of each source to each sensor, depending on the position and orientation of the electric dipoles. In the following we will assume for simplicity of exposition that P = N, i.e., we consider the estimation of as many sources as available sensors. Model (1) is quite general and appears in a very wide variety of engineering applications including speech, images, sensor array, geophysical and biomedical data processing. The well-known blind source separation (BSS) family of techniques, including independent component analysis [38], attempts to solve Equation (1) for s(t), without assuming any knowledge on the mixing matrix A (that is why it is named "blind"). The goal of BSS is to estimate the demixing matrix B yielding source estimates y(t) such as where we have ignored the noise term in (1) and where A is the unknown true mixing matrix. An important family of BSS problems are solved by estimating a number of matrices holding second-order statistics (SOS) of multivariate observed measurements x(t), e.g., in the case of EEG data, Fourier cospectral matrices estimated at different frequencies, covariance matrices estimated under different experimental conditions or covariance matrices estimated in different temporal windows, etc. [39]. According to model (1) such matrices, which are SPD, have theoretical form where matrices D k 2< PxP here represent SOS matrices of the unknown P source processes and k is the index for the K number of SOS matrices that are estimated. Under the assumption that the sources are uncorrelated (i.e., the matrices D k are diagonal), it can then be shown that the approximate joint diagonalization (AJD) of a set of K matrices generated under theoretical model (2) is indeed an estimation of the demixing matrix B [40][41][42]. As per BSS theory, the sources can be estimated only up to an order and scaling indeterminacy, that is, we can at best find a matrix B for which BA (A is the unknown true mixing process) approximates a matrix PΔ, where P is a permutation matrix and Δ an invertible diagonal matrix. This means that we can identify the waveform of the source processes, but their order and scaling (including sign switching) is arbitrary.

Approximate Joint Diagonalization of SPD Matrices
The joint diagonalizer (JD) of two SPD matrices C 1 and C 2 is a matrix B satisfying ( where D 1 and D 2 are diagonal matrices. The JD is not unique, since if B is a JD of C 1 and C 2 so is PΔB, for any invertible diagonal matrix Δ and any permutation matrix P. The JD is obtained by the well-known generalized eigenvalue-eigenvector decomposition, as the matrix holding in the rows the eigenvectors of or the matrix holding in the rows the eigenvectors of right-multiplied by C À 1= 2 1 , where indices 1 and 2 can be permuted in the above expressions. The JD matrix B is orthogonal iff C 1 and C 2 commute in multiplication ( [43], p. 160-165), which is not the case in general. Let B = A −1 and C 1 , C 2 be appropriate SOS matrices estimated from data x(t), then conjugating by A both sides of (3) we obtain the second order statistics of the generative model (2) (for the case K = 2) such as ( In general, for a set of K>2 SPD matrices {C 1 ,. . .,C K } there does not exist a matrix diagonalizing all of them. However, one may seek a matrix diagonalizing the set as much as possible, that is, we seek a matrix B such that all products BC k B T , with k2{1,. . .,K}, are as diagonal as possible, according to some criterion. Such a problem is known as approximate joint diagonalization (AJD) and has been studied extensively in the signal processing community (e.g., [34], [36], [44]). As per the JD, the AJD matrix B is not unique, since if B is the AJD of set {C 1 ,. . ., C K } so is PΔB, for any invertible diagonal matrix Δ and any permutation matrix P. As for the geometric mean, there is no closed-form expression for the AJD in the case K>2, so we proceed by specifying an AJD criterion and iteratively optimizing it. One such criterion specific for SPD matrices has been proposed by Pham [35,36]; a consequence of the Hadamard inequality is that any SPD matrix C verifies |C| |diag(C)|, with equality iff C is diagonal. Also, according to the Kullback-Leibler divergence δ K (C D), the closest diagonal matrix D to C is D = diag(C), for only in this case the divergence is zero [41]. The criterion proposed by Pham is then the sum of the Kullback-Leibler divergences of the input matrices to their diagonal form. Therefore we write a JD cost function as, where the ϛ k are optional non-negative real numbers weighting the diagonalization effort with respect to the input matrices C k . Besides being specific to SPD matrices, criterion (7) is interesting because it possesses an important invariance property, as stated in the following: Proposition 1. Criterion (7) is invariant under positive rescaling of the input matrices C k , thus the AJD matrix B of {a 1 C 1 ,. . .,a K C K } is the same as the AJD matrix of {C 1 ,. . .,C K } for any positive set {a 1 ,. . .,a K }.
The proof is trivial using well-known properties of the determinant and of the logarithm and will be omitted here. For AJD solutions according to a given criterion, we make use of the following: Definition 1. The AJD of set {C 1 ,. . .,C K } according to some AJD criterion is well defined (or is essentially unique) if any two global minimizers B 1 and B 2 of the criterion are in relation B 1 = PΔB 2 , where P is a permutation matrix and Δ is an invertible diagonal matrix.
For details on the essential uniqueness of AJD see [40,45,46]. This is a generic property in the sense that in the presence of noise the AJD is essentially unique (with very high probability). Now, consider again data model (1) and the theoretical generative model of SOS statistics of the observed data as per (2), where A is the (unknown) mixing matrix and the D k , with k2 {1,. . .,K}, are the (unknown) diagonal matrices holding K SOS statistics of the source processes. Working with AJD we are interested in the situation where the noise term in (1) is small enough so that the AJD solution is well defined as per definition 1. We will then make use of a general property of such well-defined AJD solutions: where invertible diagonal matrix Δ and permutation matrix P are the usual AJD scaling and permutation indeterminacy, respectively.
Proof. Saying that B is an essentially unique AJD of set {FC 1 F T ,. . .,FC K F T } according to some AJD criterion implies that the set {BFC 1 F T B T ,. . ., BFC K F T B T } is a global minimizer of the AJD criterion employed. Thus, matrix BF is, out of possible different scaling and permutation as per definition 1, a global minimizer of the same AJD criterion for the set {C 1 ,. . .,C K }.
Finally, we will need the following: Definition 2. Let B, with inverse A, be an essentially unique AJD of set {C 1 ,. . .,C K }; an AJD criterion is said to verify the self-duality invariance if PΔA T is a well defined AJD of set {C 1 -1 ,. . .,C K -1 } satisfying the same criterion, with invertible diagonal matrix Δ and permutation matrix P the usual AJD scaling and permutation indeterminacy, respectively.

The Riemannian Manifold of SPD Matrices
Geometrically, the Euclidean space of SPD matrices of dimension N x N can be considered as a ½N(N+1)-dimensional hyper cone (Fig 1). The usual vector (Euclidian) space of general square matrices is endowed with the metric hS 1 ; S 2 i ¼ trðS T 1 S 2 Þ and associated (Frobenius) norm kSk F . We will replace the convex pointed cone in the vector space of Fig 1 with a regular manifold of non-positive curvature without boundaries, developing instead infinitely in all of its ½N(N+1) dimensions. In differential geometry, a N-dimensional smooth manifold is a topological space that is locally similar to the Euclidean space and has a globally defined differential structure. A smooth Riemannian manifold (or Riemannian space) M is a real smooth manifold equipped with an inner product on the tangent space T O M defined at each point O that varies smoothly from point to point. The tangent space T O M at point O is the Euclidean vector space containing the tangent vectors to all curves on M passing through O (Fig 2). In the SPD manifold endowed with the Fisher information metric for any two vectors V 1 and V 2 in the tangent space the inner product through point O is given by [2] The Geodesic. The Fisher information metric allows us to measure the length of curves in M and find the shortest curve between two points O and F on M. This is named the geodesic and is given by [2]  This topology is easily visualized in case of 2x2 matrices; any 2x2 covariance matrix can be seen as a point in 3D Euclidean space, with two coordinates given by the two variances (diagonal elements) and the third coordinate given by the covariance (either one of the off-diagonal element). By construction a covariance matrix must stay within the cone boundaries. As soon as the point touches the boundary of the cone, the matrix is no more positive definite.
where β is the arc length parameter. When β = 0 we are at O, when β = 1 we are at F and when β = 1/2 we are at the geometric mean of the two points (Fig 2). The Exponential and Logarithmic Maps. The exponential and logarithmic maps are shown graphically in Fig 2. The function that maps a vector V2T O M to the point F2M following the geodesic starting at O, is named the exponential map and denoted by F = Emap O (V). It is defined as The inverse operation is the function mapping the geodesic relying O to F back to the tangent vector V2T O M. It is named the logarithmic map and denoted V = Lmap O (F). It is defined as The Metric (Distance). Given two points C 1 and C 2 on the manifold M, their Riemannian distance based on the Fisher information metric is the length of the geodesic (8) connecting them. It is given by [3,10,16] where Λ is the diagonal matrix holding the eigenvalues λ 1 ,. . .,λ N of either matrix (4) or (5). This distance has a remarkable number of properties, some of which are reported in Table 1 [14,15,25]. For more inequalities see [2,25]. Associated to the chosen metric is also the Riemannian norm, defined as the Riemannian distance from an SPD matrix C to the identity, that is, the Euclidean distance of its logarithm to the zero point: The Riemannian norm is zero only for the identity matrix, while the Frobenius norm is zero only for the null matrix. Either an eigenvalue smaller or greater than 1 increases the norm and the norm goes to infinity as any eigenvalues go to either infinity or zero. Importantly, because of the square of the log, an eigenvalueλ increases the norm as much as an eigenvalue 1/ λ does (see Fig 3), from which the invariance under inversion.
The Geometric Mean of SPD Matrices: general considerations. Given a set of SPD matrices {C 1 ,. . .,C K }, in analogy with the arithmetic mean of random variables, a straightforward definition of the matrix arithmetic mean is . and a straightforward definition of the harmonic mean is On the other hand a straightforward definition of the geometric mean is far from obvious because the matrices in the set in general do not commute in multiplication. Researchers have postulated a number of desirable properties a mean should possess. Ten such properties are Some inequalities of the Riemannian Fisher information distance   known in the literature as the ALM properties, from the seminal paper in [11]. The concept of Fréchet means and the ensuing variational approach are very useful in this context: in a univariate context the arithmetic mean minimizes the sum of the squared Euclidean distances to K given values while the geometric mean minimizes the sum of the squared hyperbolic distances to K given (positive) values. In analogy, we define the (least-squares) Riemannian geometric mean G{C 1 ,. . .,C k } of K SPD matrices C k such as the matrix satisfying [13,15]. argmin where δ 2 (Á$Á) is an appropriate squared distance. In words, the Riemannian geometric mean is the matrix minimizing the sum of the squared distances of all elements of the set to itself. Using the Fisher information (FI) distance (11) such mean, which we name the FI mean and denote as G R fC 1 ; Á Á Á ; C K g or, shortly, G R fC k g, features all ALM properties. The FI mean is the unique SPD geometric mean satisfying non-linear matrix equation [15] X or, equivalently, where, in line with the notation used in this article, G s fÁg stands short for ðGfÁgÞ s , for any power s. We have listed a number of properties of the FI mean along with some related inequalities in Table 2 (see also [2,25,47,48]). Note that in the literature this least-squares geometric mean is sometimes referred to as the barycenter, the Riemannian center of mass, the Fréchet mean, the Cartan mean or the Karcher mean (although these definitions in general are not equivalent and attention must be paid to the specific context, see e.g., [31]). The Geometric Mean and the Joint Diagonalization of Two SPD Matrices. Given two points C 1 and C 2 on the manifold M, the Geometric Mean of them, indicated in the literature by C 1 #C 2 , has several equivalent closed-form expressions, such as [2,12,25,28,47] And À Table 2. Some important properties of the Fisher information metric geometric mean.
Properties of the geometric Mean (25) Invariance by Reordering: G R fC 1 ; Á Á Á ; C K g is the same for any order of matrices in the set 30) if all matrices C k pair-wise commute then In the above the indices 1 and 2 can be switched to obtain as many more expressions. The geometric mean of two SPD matrices is indeed the midpoint of the geodesic in (8) and turns out to be the unique solution of a quadratic Ricatti equation [2,48], yielding Our investigation has started with the following: Proposition 3. The FI geometric mean of two SPD matrices can be expressed uniquely in terms of their joint diagonalizer; let B be the JD (3) of matrices {C 1 , C 2 } such that BC 1 B T = D 1 and BC 2 B T = D 2 and let A = B −1 , then the geometric mean is for any JD solution, regardless the permutation and scaling ambiguity.
Remark. A consequence of the scaling indeterminacy of the JD and (19) is that we can always chose B such that in which case A is a square root ( [49], p. 205-211) of the geometric mean, i.e., The Geometric Mean of a SPD Matrix Set. Given a set {C 1 ,. . .,C K } = {C k } of K>2 SPD matrices the point G R {C k } satisfying (23)- (24) has no closed form solution. The research of algorithms for estimating the FI geometric mean or a reasonable approximation for the case K>2 is currently a very active field [3, 24, 28-30, 48, 50]. In this work we consider two iterative algorithms for estimating the FI mean. The first is the aforementioned gradient descent algorithm [10]. In order to estimateG R {C k }, we initialize M with the arithmetic mean or any other smart guess. Then we apply the iterations given by until convergence. Here υ is the step size, typically fixed at 1.0. Notice that this algorithm iteratively maps the points in the tangent space through the current estimation of the mean (10), computes the arithmetic mean in the tangent space (where the arithmetic mean makes sense) and maps back the updated mean estimation on the manifold (9), until convergence (Fig 4). Note also that the global minimum attained by gradient descent (22) satisfies upon which the iterate does not move anymore. We have already reported that these iterations have linear convergence rate, but convergence itself is not guaranteed. In order to minimize the occurrence of divergence and also to speed-up convergence, while avoiding computationally expensive searches of the optimal step size, we use in the present work the following version with heuristical decrease of the step-size over iterations:

GM-GD Algorithm
Initialize M Set ε equal to a suitable machine floating point precision number (e.,g., 10 -9 for double precision), υ = 1, τ equal to the highest real number of the machine. Repeat The second algorithm we consider is the aforementioned majorization-minimization algorithm recently proposed in [33]. Since convergence is guaranteed, it is used here as a benchmark for accuracy:

GM-MM Algorithm
Initialize M Repeat For k from 1 to K do C k ln C À1=2 Until Convergence Alternative Metrics and Related Geometric Means. Recently it has been proposed to use the least-squares (Fréchet) geometric mean (14) based on the log-Euclidean distance [21,22,51] and the Bhattacharyya divergence [23,24], also named S-divergence [25,26], which turns out to be a specific instance of the α-divergence setting α = 0 [27]. These studies suggest that these alternative definitions of geometric mean give results similar to the FI mean in practical problems, while their computational complexity is lower.
The Log-Euclidean distance is which is the straightforward generalization to matrices of the scalar hyperbolic distance and, from property (48), equals the FI distance if C 1 and C 2 commute in multiplication. The interest of this distance is that the geometric (Fréchet) mean (14) in this case has closed-form solution given by This is a direct generalization of the geometric mean of positive scalars and, again, from property (48), it equals the FI mean if all matrices in the set pair-wise commute. The computation of (26) requires fewer flops than a single iteration of the gradient descent algorithm (22), so this constitutes by far the most efficient geometric mean estimation we know. The log-Euclidean mean (26) possesses the following properties: invariance by reordering, self-duality, joint homogeneity and determinant identity [21][22]. It does not possess the congruence invariance, however it is invariant by rotations (congruence by an orthogonal matrix). Also, important for the ensuing derivations, whereas the determinant of the log-Euclidean mean is the same as the determinant of the FI mean (since both verify the determinant identity), the trace of the log-Euclidean mean is larger than the trace of the FI mean, unless all matrices in the set pair-wise commute, in which case as we have seen, the two means coincide, hence they have the same trace and determinant [21]. Because of this trace-increasing property, the log-Euclidean mean is in general farther away from the FI mean as compared to competitors such as constructive geometric means [28]. Moreover, in general the log-Euclidean mean differs from the FI mean even for the case K = 2.
In [24] the author has shown that the Bhattacharyya divergence (also named log-det divergence and S-divergence) between two SPD matrices C 1 ,C 2 2< N·N behaves similarly to the FI distance if the matrices are close to each other. It is symmetric as a distance, however, it does not verify the triangle inequality. This divergence is where the arithmetic mean AfC 1 ; C 2 g is defined in (21), the geometric mean GfC 1 ; C 2 g ¼ C 1 C 2 in(32)- (33) and the λ n are, again, the N eigenvalues of either matrix (4) or (5). Successively, in [26] the author has shown that the square-root of the Bhattacharyya divergence is a distance metric satisfying the triangle inequality. Both the Bhattacharyya divergence and distance are invariant under inversion and under congruence transformation [24,26]. The geometric mean (14) of a set of matrices based on divergence (27) is the solution to nonlinear matrix equation [26] 2=K X In order to estimate G B fC 1 ; Á Á Á ; C K g we initialize M and apply iterations [24] M K which global minimum has been shown to satisfy (28) [26]. This mean possesses the following invariance properties: invariance by reordering, congruence invariance, self-duality and joint homogeneity. However, it does not satisfy the determinant identity. As a consequence, in general both the determinant and the trace of the Bhattacharyya mean differ from those of the FI mean. However, the Bhattacharyya mean of K = 2 matrices coincides with the FI mean [27].
The Geometric Mean of a SPD Matrix Set by AJD. Let H be an invertible matrix of which the inverse is G. Using the congruence invariance of the geometric mean (50) and conjugating both sides by G we obtain Our idea is to reach an approximation of G R fC 1 ; Á Á Á ; C K g by approximating G R fHC 1 H T ; Á Á Á ; HC K H T g in the expression above. Particularly, if the matrices HC k H T are nearly diagonal, then they nearly commute in multiplication, hence we can employ property (54) to approximate the geometric mean G R fC 1 ; Á Á Á ; C K g by expression where G L fBC 1 B T ; Á Á Á ; BC K B T g is the log-Euclidean mean introduced in (41), B is a well-defined AJD matrix of the set {C 1 ,. . .,C K } (definition 1) chosen so as to minimize criterion (7), and A is its inverse. Because of the scaling (Δ) and permutation (P) ambiguities of AJD solutions, (30) actually defines an infinite family of admissible means. Before we start to study the specific instance of the family we are interested in (the closest to the FI mean in general), let us observe that if all products BC k B T are exactly diagonal then the family of means defined by (30) collapses on a single point in the manifold, which is indeed the FI geometric mean, for any AJD matrix with form PΔB. This is the case when the data are generated according to model (1) and the noise term therein is null, in which case the coincidence of the two means is a consequence of property (54) or, regardless the data model, if K = 2, in which case (30) reduces to (19). The same is true also whenever in which case the matrix exponential of (30) is the identity matrix and the family (30) collapses to the unique point AA T just as for the case K = 2 in (21). Interestingly, the family of means (30) includes the log-Euclidean mean, in that setting A = B −1 = I we obtain (26), but also the FI mean, in the sense that setting A = M 1/2 and B = A −1 = M −1/2 we obtain the global minimum of the FI mean (23). This is the point on the manifold we sought to approximate by approximating condition (31). First of all, we show that the AJD permutation ambiguity is not of concern here. We have the following Proposition 4. The family of means given by (30) is invariant with respect to the AJD permutation indeterminacy P, for any invertible AJD solution B with inverse A and any invertible diagonal scaling matrix Δ.
Proof. Taking into account the P and Δ indeterminacies, since for any permutation matrix P T = P -1 and for any diagonal matrix Δ = Δ T , the family of solutions (30) has full form If f (S) is a function of the eigenvalues of S (see section "Notation and Nomenclature") and P is an arbitrary permutation matrix, then P -1 f (PSP -1 ) P = f (S), thus, the family of solutions (30) actually takes form for any invertible B = A -1 and any invertible diagonal Δ. Then, note that the family of means defined by (32) is SPD, which is a consequence of the fact that the exponential function of a symmetric matrix is always SPD and that both B and Δ are full rank. It also verifies the invariance under reordering (49), which is inherited from the invariance under reordering of AJD algorithms and of the log-Euclidean mean. Furthermore, we have the following Proposition 5. The family of means defined by (32) verifies the determinant identity(53), for any invertible matrix B = A -1 and any invertible diagonal Δ.
Proof. We need to prove that for any invertible matrix B = A -1 and any invertible scaling matrix Δ Using (32) we write the left-hand side of the above equation such as jjD À1 jjA T j . Since the log-Euclidean mean possesses the determinant identity, this is Developing the products of determinants and since jAj ¼ jA À1 j À1 ¼ jBj À1 , we obtain the desired result such as Proposition 6. If B is the AJD solution of (7), with inverse A, the family of means defined by (32) verifies the joint homogeneity property (52), for any invertible diagonal Δ.
Proof. We need to prove that The result follows immediately from the invariance under rescaling of criterion (7) (Proposition 1), as So far we have considered the properties of the whole family of means with general form, (32) for which we have solved the permutation AJD indeterminacy (Proposition 4). We now seek the member of the family better approximating the FI geometric mean given a matrix PΔB that performs as the AJD of the set {C 1 ,. . .,C K }. This involves choosing a specific scaling Δ. As we have seen, if in (32) exp 1=K then (21) is true also for K>2, in which case the FI geometric mean is given by AA T , where A is the inverse of ΔB in (32) and (30) is a stationary point of (22). Such condition is well approximated if the left-hand side of (33) is nearly diagonal and all its diagonal elements are equal, that is, in our case, by scaling the rows of B such that The uniqueness of this solution, regardless the choice of α, is demonstrated by the following Proposition 7. Let B = A -1 be an invertible AJD of the set {C 1 ,. . .,C K }. Scaling B so as to satisfy (34), the mean (32) is unique and invariant with respect to α.
Proof. Once matrix B satisfies (34), let us consider a further fixed scaling such us Δ = ωI, with ω>0. Substituting this fixed scaling matrix in the family of means (32) we have but this is showing that the resulting mean does not change. Thus, given an AJD matrix B, we obtain our sought unique member of the family as where A is the inverse of the matrix B scaled so as to satisfy (34). Without loss of generality, hereafter we will choose α = 1. With this choice the AJD matrix B is an approximate "whitening" matrix for the sought mean, since we have In order to satisfy (34), notice that any change in Δ changes the log-Euclidean mean therein since the log-Euclidean mean is not invariant under congruent transformations, thus there is no closed-form solution to match condition (34) given an arbitray AJD matrix solution. However, we easily obtain the desired scaling of B by means of an iterative procedure (see below). We name the resulting approximation to the geometric mean (35) the ALE mean, where ALE is the acronym of AJD-based log-Euclidean Mean. The algorithm is as follows:
Set equal to a suitable machine floating point precision number Repeat Note that instead of the average FI distance 1 N d R ðI $ DÞ = any suitable distance of Δ from the identity matrix can be used as well in order to terminate the Repeat-Until loop here above. Note also that if instead of (36) we perform iterations (which is the same as (36) without the "diag" operator) the algorithm converges to an inverse square root of the FI geometric mean, i.e., upon convergence (and, in this case, if the algorithm converges) In fact, (37) is the gradient descent equivalent to (22) converging to the inverse square root of the geometric mean instead of on the geometric mean itself. Equivalently, one may iterate converging to the square root of the FI geometric mean (as before, A here is a square root of the FI mean). In the case of iterations (37) or (39), however, there is no point to perform a previous AJD (B in (37) and A in (39) can be initialized, for instance, by the inverse square root and the square root of the arithmetic mean, respectively) and we encounter the same convergence dependency on the step-size υ as for the gradient descent (22). Our approximation based on (36) instead surely converges without any step-size to be searched, as convergence of the AJD algorithm is ensured without using a step-size [36] and iterations (36) imply only a scaling of B and also surely converge. Besides providing a unique solution, our ALE mean satisfies a very important property: whereas the log-Euclidean mean is not invariant under congruent transformation, the ALE mean is. This is demonstrated by the following: Proposition 8. The ALE mean (35) satisfies the invariance under congruent transformation.
Proof. We need to show that for any invertible matrix F Let B 1 a well defined AJD of set {C 1 ,. . .,C K } with inverse A 1 and B 2 a well-defined AJD of set {FC 1 F T ,. . ., FC K F T } with inverse A 2 , both satisfying condition (35) for their respective set. The expression above is then Because of Proposition 2, if matrix B 1 approximately diagonalizes the set {C 1 ,. . .,C K } so does matrix B 2 F according to the same criterion and if they both satisfy (35) for the set {C 1 ,. . .,C K } they are equal out of a permutation indeterminacy that we can ignore because of proposition 4. As a consequence A 1 = (B 2 F) -1 = F -1 A 2 and thus A 2 = FA 1 . Making the substitutions we obtain Proposition 9. The ALE mean verifies the self-duality property (51) if the AJD solution B, with inverse A, verifies the self-duality property of Definition 2.
Proof. Self-duality of the ALE mean is verified if Using definition 2 we have and computing the inverse of the right-hand side Using ln(S -1 ) = -ln(S) and then (exp(-S)) -1 = exp(S) we have Note that the AJD matrix satisfying criterion (7) verifies the self-duality of definition 2 only if K = 2 or if all products B T C k B are exactly diagonal, otherwise it verifies it only approximately. Therefore, in general our ALE mean estimation (35) verifies self-duality (51) only approximately. Interestingly, the ALE mean would verify the self-duality property as well if the AJD cost function makes use of a diagonality criterion based on the Riemannian distance (11) instead of the Kullback-Leibler divergence as in (7). However, the search of such an AJD algorithm has proven elusive so far. In conclusion, we have shown that the ALE mean verifies several important properties satisfied by the FI Riemannian mean (invariance under reordering and under congruence transformation, joint homogeneity and determinant equality). The selfduality property is satisfied approximately in general and exactly in some special cases. Next we will study the performance of the ALE mean by means of simulations.

Results
As mentioned in the introduction, the estimation of the FI geometric mean of a set of SPD matrices has proven useful in a number of diverse engineering applications [1][2][3][4][5][6][7][8][9][10]. For instance, working with brain-computer interfaces based on electroencephalography (EEG) we have found it very useful for defining prototypical brain states and classifying single trials according to the distance of the trials to different prototypes. Since EEG data is generated under the instantaneous linear mixing model (1) [39], in this section we explore the possible advantage of estimating the geometric mean using the ALE estimation. In order to perform simulations we generate sets of SPD matrices according to model (1) (see (2), but also see the generation of simulated matrices in [28,33]): a set of K matrices is generated as where A True 2< N·N (true mixing matrix) has entries randomly drawn from a standard Gaussian distribution, D k is a diagonal matrix with entries randomly drawn from a squared standard Gaussian distribution and bounded below by 10 -4 and S k is a noise matrix obtained as (1/N)QQ T where Q2< N·N is a matrix with entries randomly drawn from a Gaussian distribution with mean zero and standard deviation σ (noise level).  (Pham's AJD algorithm is the first part of the ALE algorithm) for K = 100 matrices of dimension N = 10, with three different noise levels (σ = 0.01, 0.1, 1). These three noise levels correspond approximately to typical low, medium and high noise situations according to AJD standards, that is to say, the diagonalization that can be obtained on a set generated with σ = 1 would be considered very bad for practical AJD purposes and the AJD matrix B in this case cannot be assumed well-defined. We can appreciate the typical superlinear convergence rate of Pham's AJD algorithm in low and medium noise situations. For high noise situations the convergence rate becomes slower. The opposite happens for the GD, MM and Bha algorithm, which convergence rate becomes faster as the noise level increases. For considering the overall computational efficacy of these algorithms we should also consider the complexity per iteration; it is the lowest for Pham's AJD algorithm, followed in the order by the Bha, GD and MM algorithms. In fact, the log-Euclidean mean (26) necessitates the computation of K eigenvalueeigenvector decompositions only once. On the other hand, Bha (29) involves K+1 Cholesky decompositions at each iteration, GD (22) involves K eigenvalue-eigenvector decompositions at each iteration and MM (24) involves 2K+2 eigenvalue-eigenvector decompositions at each iteration. Iterations (36) in the second part of the ALE algorithm converge reliably in a few iterations requiring K eigenvalue-eigenvector decompositions per iteration. Overall then, the ALE mean based on Pham's AJD is advantageous when Pham's algorithm convergence rate is quadratic. This happens when the noise in data model (1) is small enough. Fig 6 shows the typical relation observed between the trace and the determinant of the FI geometric mean estimated by algorithm GD and MM (they converge to the same point), LE, Bha and ALE. As we can see the log-Euclidean mean has always the same determinant as the FI mean, but larger trace. The Bhattacharyya mean has always both different trace and different determinant. The ALE mean has determinant and trace only very slightly different, regardless the noise level. As the noise increases the bias of the LE and Bha mean tends to decrease. Fig 7 shows the FI distance (11) between the LE, Bha, ALE geometric mean and the FI mean estimated by the GD and MM algorithms (they converge to the same point, which is used here as a benchmark) for K = 100 matrices of dimension N = 10, with three different noise level (σ = 0.01, 0.1, 1). The distance is plotted against the condition number of the true mixing matrix in our generative model (40). The ALE mean is consistently closer to the FI mean in all cases and, surprisingly, in absolute terms it is pretty close even in the high-noise case (σ = 1). Also, as suggested by Fig 6, the estimation of LE and Bha approaches the FI mean as the noise increase. On the other hand, the conditioning number appears to play a role only for the low and moderate noise cases (σ = 0.01, 0.1).

Conclusions and Discussion
In this paper we explored the relationship between the approximate joint diagonalization of a SPD matrix set and its geometric mean. After appropriate scaling, the inverse of the joint diagonalizer of two SPD matrices is a square root of their geometric mean. For the general case of a  (40) for three noise levels (σ = 0.01, 0.1, 1). Each algorithm was stopped when its own stopping criterion became smaller than -100dB. SPD matrix set comprised of more than two matrices, we have studied a family of geometric means that includes the geometric mean according to the Fisher information metric (the FI mean) and the log-Euclidean mean. Then we have introduced a specific instance of this family, which is computationally attractive and does not require a search for the optimal step size. We have showed that it approximates the FI geometric mean much better than the log-Euclidean mean. Indeed, this mean, named the ALE mean, can be conceived as an improved version of the log-Euclidean mean, in that i) it satisfies the congruence invariance and ii) similar to the log-Euclidean mean it has the same determinant as the FI mean, but has much smaller trace, thus its trace is much closer to the trace of the FI mean. The ALE mean can be computed by running an AJD algorithm followed by a specific scaling obtained by a simple iterative procedure. The AJD algorithm developed by Pham [36] is particularly adapted for this purpose, as its convergence is guaranteed and features nice invariance properties, which translate into a number of invariance properties for the ALE mean. For this algorithm the convergence rate is quadratic when the set can be nearly diagonalized, that is, when the data is generated according to model (1) and the noise term therein is small enough. In such a situation our ALE mean is clearly advantageous over its competitors both in terms of stability (guarantee of convergence) and computational complexity. Also, for matrices of big dimension Pham's AJD algorithm can be easily parallelized, since it proceeds in a Jacobi-like fashion updating pair-wise the vectors of the matrix [36]. An on-line implementation is also straightforward and very efficient [41]: one creates a circular buffer of matrices and once an incoming matrix enters the buffer one or two iterations of the algorithm suffice to update the AJD solution and one or two iterations (36) suffice to update the ALE mean altogether. By applying appropriate weighting to the matrices Distance of the LE (diamonds), Bha (triangles) and ALE (disks) geometric mean to the FI geometric mean estimated by the GD and MM algorithm (they converge to the same point), for three noise levels (σ = 0.01, 0.1, 1) and variable condition numbers of the true mixing matrix in (40). The simulations were repeated 100 times, with N = 10 and K = 100. Notice the different scales on the y-axes. Estimating the Geometric Mean by Approximate Joint Diagonalization in the buffer and/or regulating the buffer size one decides how fast the ALE mean should adapt to the incoming data. Our simulations have confirmed our analysis and have shown that the ALE mean can be employed even if the noise term in data generation model is high, i.e., even if the matrices are very far apart from each other in terms of their Riemannian distance; however in this case the computational advantage of the ALE mean vanishes. We conclude that the ALE mean is the convenient choice when the set can be diagonalized pretty well; otherwise the gradient descent algorithm is computationally advantageous, even if searching for the optimal step-size at each iteration.