Laplacian mixture modeling for network analysis and unsupervised learning on graphs

Laplacian mixture models identify overlapping regions of influence in unlabeled graph and network data in a scalable and computationally efficient way, yielding useful low-dimensional representations. By combining Laplacian eigenspace and finite mixture modeling methods, they provide probabilistic or fuzzy dimensionality reductions or domain decompositions for a variety of input data types, including mixture distributions, feature vectors, and graphs or networks. Provable optimal recovery using the algorithm is analytically shown for a nontrivial class of cluster graphs. Heuristic approximations for scalable high-performance implementations are described and empirically tested. Connections to PageRank and community detection in network analysis demonstrate the wide applicability of this approach. The origins of fuzzy spectral methods, beginning with generalized heat or diffusion equations in physics, are reviewed and summarized. Comparisons to other dimensionality reduction and clustering methods for challenging unsupervised machine learning problems are also discussed.


Introduction
Many scientific and engineering problems arise where multiple distinct processes, sources, or latent factors combine to generate multimodal superpositions of overlapping non-Gaussian distributions. Estimating or inferring the components of a mixture is called the mixture problem. Mixture problems can be solved using mixture models, also known as latent class models [Everitt, 1996].
Mixture models have numerous applications in data science and statistics-related fields. They are useful tools for solving signal processing and unsupervised learning problems such as source separation and cluster analysis [McLachlan and Peel, 2000]. Although countably or uncountably infinite numbers of mixture components are also possible, only finite mixture models are described here.
In 1894 Karl Pearson stated that "the analytical difficulties, even for the case n = 2 are so considerable, that it may be questioned whether the general theory could ever be applied in practice to any numerical case" [Pearson, 1894]. To this day, most algorithms in use cannot directly predict the number of components present from the mixture, or are parametric approaches that restrict components to fixed functional forms, which are often unrealistic assumptions, especially in high dimensional feature spaces [McAuliffe et al., 2006].
Statistical physics addresses some of these issues in terms of the theory of macrostates [Shalloway, 1996]. Macrostate theory has not yet seen much use in scientific and engineering disciplines outside of physics, perhaps due to the lack of communication between disciplines. In order to encourage more widespread use, an accessible formulation and a simpler, more efficient implementation are presented here.
The rapid recent growth in the popularity of advanced statistical methods is in part a response to rapidly increasing dataset sizes, e.g. those from large-scale high-energy physics experiments. Conversely, ideas from physics can sometimes be applied to problems in statistics and machine learning. The macrostates of statistical physics from Shalloway [1996] also apply to nonphysical problems in statistics and information theory, providing an important connection between physics and statistics similar to the principle of maximum entropy published by Jaynes [1957].

Mixture Models
Methods for separating the components of nonnegative mixtures of the form

f(x) = Σ_{k=1}^{m} a_k f_k(x),  x ∈ Ω ⊆ ℝ^n,    (1)

are called mixture models, where m ∈ ℕ is the number of components. For all practical problems, it is safe to assume f(x) is normalized as a probability distribution without loss of generality.
The f_k(x) are known as the mixture components or component probability distributions of each independent process or signal source, and are also assumed to be normalized without loss of generality. The a_k ∈ [0, 1] are the weights or mixing proportions summing to one and forming a discrete probability distribution P(k) = a_k, which is known as the class prior distribution in the context of probabilistic or Bayes classifiers [Bishop, 2006].
Finite mixture models can be used to define probabilistic classifiers and vice versa. From exact knowledge of {f, f_1, ..., f_m, a_1, ..., a_m}, the posterior conditional distribution of an optimal Bayes classifier for any observed y ∈ Ω can be expressed as

P(k|y) = P(y|k)P(k)/P(y),    (2)

forming a partition of unity over the space of observations or data. The component distributions f_k(x) can be understood as the class conditional distributions P(x|k) and f(x) as the evidence P(x) in a Bayes classifier context.
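As a minimal numerical sketch of this correspondence (NumPy assumed; the two Gaussian components and their parameters are hypothetical illustrations, not taken from the paper), the posterior of the Bayes classifier defined by a finite mixture follows directly from the component densities and weights:

```python
import numpy as np

def posterior(x, component_pdfs, weights):
    """Posterior P(k|x) of the Bayes classifier defined by a finite mixture.

    component_pdfs: list of callables f_k; weights: mixing proportions a_k.
    """
    likelihoods = np.array([a * f(x) for f, a in zip(component_pdfs, weights)])
    evidence = likelihoods.sum()       # f(x) = sum_k a_k f_k(x)
    return likelihoods / evidence      # P(k|x) = a_k f_k(x) / f(x)

# Two illustrative Gaussian components (hypothetical parameters).
def gauss(mu, sigma):
    return lambda x: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = posterior(0.0, [gauss(-1.0, 1.0), gauss(1.0, 1.0)], [0.5, 0.5])
# By symmetry at x = 0 the two posteriors are equal and sum to one.
```

The returned vector is a point on the probability simplex, i.e. one evaluation of the partition of unity described above.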
Switching between the f and the P notations helps to build connections between concepts and algorithms from separate fields such as physics and statistics, at some risk of confusion. Supervised classification algorithms that directly optimize the posterior P(k|y) are referred to as discriminative classifiers and sometimes outperform generative methods that fit the f_k and a_k directly [Bishop, 2006].
Just as probabilistic/Bayes classifiers form partitions of unity from finite mixture models, so can finite mixture models be defined from partitions of unity. In other words, partitions of unity are sets of functions

w_k(x) ≥ 0,  Σ_{k=1}^{m} w_k(x) = 1,    (4)

which determine mixture components f_k(x) and weights a_k from mixture distributions f according to

a_k = ∫_Ω w_k(x) f(x) dx,  f_k(x) = w_k(x) f(x) / a_k,    (6)

explicitly showing the formal equivalence of the generative and discriminative descriptions of finite mixture models. The discriminative form (4) uses the additional partition of unity constraints to combine the information in each {f_k(x), a_k} pair of components and weights into a single discriminative posterior distribution or partition of unity w_k(x).
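The discriminative-to-generative direction can be sketched numerically (NumPy assumed; the two-component mixture and the logistic window functions are illustrative choices, not from the paper): given a partition of unity {w_k} and the mixture f, the weights a_k = ∫ w_k f dx and components f_k = w_k f / a_k are recovered by quadrature on a grid:

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 2001)
dx = x[1] - x[0]

# Hypothetical two-component Gaussian mixture f(x).
f = 0.4 * np.exp(-0.5 * (x + 2) ** 2) / np.sqrt(2 * np.pi) \
  + 0.6 * np.exp(-0.5 * (x - 2) ** 2) / np.sqrt(2 * np.pi)

# Smooth two-member partition of unity: w1 + w2 = 1 everywhere.
w1 = 1.0 / (1.0 + np.exp(2.0 * x))   # logistic window favoring x < 0
w2 = 1.0 - w1

# Weights a_k = integral of w_k f; components f_k = w_k f / a_k.
a1 = np.sum(w1 * f) * dx
a2 = np.sum(w2 * f) * dx
f1 = w1 * f / a1
f2 = w2 * f / a2
```

By construction a_1 f_1 + a_2 f_2 reproduces f exactly, and each f_k integrates to one.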

Smoluchowski Equations
In statistical physics, mixture models are used for linearly separating the stationary or Boltzmann distributions in systems with nonconvex potential energy landscapes where minima on multiple size scales occur, e.g. overdamped drift-diffusions.
The distribution dynamics of overdamped drift-diffusions are described by partial differential equations (PDEs) known as Smoluchowski equations [Risken, 1996].In physics, the potential energy function V (x) can often be viewed as a fixed parameter that is exactly known or defined in advance.
The potential energy function V(x) determines the deterministic drift forces acting on ensemble members, i.e. sample paths or realizations of this stochastic process. The drift forces bias the temporal evolution of initial distributions P(x, 0) to flow towards regions of lower energy as t increases, compared to free diffusion (Brownian motion). Technically there is a different Smoluchowski equation for each distinct choice of V(x). Hence the plural form should be used generally, although this is often overlooked in the literature.
Smoluchowski equations have the form

∂P(x, t)/∂t = ∇ · D [∇ + β∇V(x)] P(x, t)    (7)

and belong to a class of reversible continuous-time, continuous-state Markov processes used to describe multiscale and multidimensional physical systems that can exhibit metastability [Risken, 1996]. Understanding the connection between Smoluchowski equations of metastable systems and finite mixture models requires the theory of macrostates of classical stochastic systems.

Macrostates
Transitions between components of a mixture occur on relatively slow timescales in metastable systems, making them appear as the distinct discrete states of a finite-state continuous time Markov process when measured over appropriate timescales.
The rigorous definition of a macrostate was originally created for separating mixture distributions based on slow and fast time scales in the relaxation dynamics of nonequilibrium distributions to stationary states [Shalloway, 1996].
This paper describes how macrostate theory can provide a framework for nonphysical mixture modeling by associating stochastic processes to nonphysical mixture distributions via Smoluchowski equations.
The elliptic differential operator

L_0 = ∇ · D [∇ + β∇V(x)]    (8)

has e^{−βV(x)} as an eigenfunction with eigenvalue zero; this eigenfunction is also called the stationary state or unnormalized Boltzmann distribution.
It is easy to evaluate L_0 e^{−βV(x)} directly to verify that it equals zero, satisfying the eigenvalue equation.
L_0 is a normal operator and therefore a similarity transform S^{−1} L_0 S to a self-adjoint form L exists [Risken, 1996]. S has the simple form of a multiplication operator with kernel e^{−βV(x)/2}, giving

L = S^{−1} L_0 S,  S = e^{−βV(x)/2}.    (9)

Without loss of generality, D can be reduced to a constant for a large class of potentials [Risken, 1996] and does not have significance for the nonphysical problems described here. Therefore it can be set equal to one and safely ignored, allowing the model structure to be defined by the potential energy V(x) alone.
If Ω is compact or V(x) → ∞ as |x| → ∞ for unbounded domains, then the stationary state is normalizable and L_0 has a discrete one-sided spectrum with one zero eigenvalue for every ergodic region. Ergodicity corresponds to the concept of irreducible Markov processes in probability theory. For reducible/nonergodic processes, the multiplicity of the zero eigenvalue gives the number of connected components of the process.
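The multiplicity statement can be checked on a small example (NumPy assumed; the four-state chain is a hypothetical illustration): a reducible rate matrix with two non-communicating blocks has exactly two zero eigenvalues, one per connected component, and an otherwise negative spectrum:

```python
import numpy as np

# Rate matrix of a reducible four-state chain: states {0,1} and {2,3}
# communicate internally but not with each other (two blocks).
Q = np.array([[-1.0,  1.0,  0.0,  0.0],
              [ 1.0, -1.0,  0.0,  0.0],
              [ 0.0,  0.0, -2.0,  2.0],
              [ 0.0,  0.0,  2.0, -2.0]])

eigvals = np.linalg.eigvalsh(Q)             # Q is symmetric here
num_zero = int(np.sum(np.abs(eigvals) < 1e-10))
# num_zero equals the number of connected components (two).
```

The one-sided spectrum shows up as all eigenvalues being nonpositive, with the zero eigenspace spanned by the per-component stationary distributions.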
The normalized Boltzmann distribution is denoted

p_B(x; β) = e^{−βV(x)} / ∫_Ω e^{−βV(x')} dx'

and plays the role of the mixture distribution f(x) in (1), as described in more detail in Section 2.
The stationary state of L from equation (9) is denoted

ψ_0(x; β) ∝ e^{−βV(x)/2},  so that  p_B(x; β) = ψ_0^2(x; β),

and is used along with other eigenfunctions of L in the separation/unmixing of the Boltzmann distribution into m macrostates as described below.
According to macrostate theory, a system with m metastable states will have m − 1 independent transition modes between them [Shalloway, 1996, page 9989, column 2, paragraph 3]. Adapting the notation slightly from Shalloway [1996], the eigenfunctions of L are denoted {ψ_i}_{i=0}^{∞} and the eigenvalues are denoted {λ_i}_{i=0}^{∞}. Assuming that m metastable states exist, the first m terms of the spectral expansion of L encode these m states, and the spectrum, when ordered by magnitude, will contain a gap separating the first m eigenvalues from the rest of the ordered spectrum [Shalloway, 1996, page 9989, column 2, paragraph 6].
Replacing R by x in the original description, the macrostate component distributions are denoted Φ α (x), α = 1, . . ., m.For nonphysical mixture modeling applications, β can be considered as a tunable parameter, set to one by default, which can be varied to improve the accuracy of the model.The issue of choosing the best value of β for statistical mixture modeling problems is discussed in Section 2.4.
Macrostates of physical systems are derived from the macrostate expansion from equation (24) of Shalloway [1996, page 9990], restated here using the above substitution as

p_B(x; β) = Σ_{α=1}^{m} ψ_0(x; β) Φ_α(x; β).    (13)

The formal equivalence of (1) and (13) suggests that the ψ_0(x; β)Φ_α(x; β) are analogous to the f_k(x) in (1), as covered in the next section. Before stating this connection more precisely, a few more useful results from the theory of macrostates are restated from Shalloway [1996, Section III].
Using the same substitutions as above, the Φ_α(x; β) are defined by equation (26) of Shalloway [1996] as

Φ_α(x; β) = Σ_{i=0}^{m−1} M_{αi} ψ_i(x; β),

which can be written in matrix notation as Φ = Mψ, where Φ = (Φ_1, ..., Φ_m)^T and ψ = (ψ_0, ..., ψ_{m−1})^T are column vector valued functions and M is an invertible m-by-m matrix of expansion coefficients with zero-based column indices. The vector-valued function ψ is determined from (9) via the stationary state ψ_0(x; β), allowing the mixture components Φ_α(x; β) to be estimated directly from ψ_0(x; β) by formulating an optimization problem over M, the square m-by-m matrix of macrostate expansion coefficients.

Minimum Uncertainty Condition
Minimizing the expected overlap of the macrostate boundaries serves as an objective for determining the optimal matrix M_opt(β), yielding the least total overlap of the macrostate boundaries allowable by the subspace spanned by the selected eigenfunctions. The quantity 0 < Υ(M; β) < 1, as defined by equation (36) of Shalloway [1996], represents the expected overlap over all macrostates between a window function w_α(x; M) and the other states α′ ≠ α, where the w_α (evaluated at β = 1) are defined by Shalloway [1996, equation (27)] as

w_α(x; M) = Φ_α(x; β) / ψ_0(x; β) = Σ_{i=0}^{m−1} M_{αi} ψ_i(x; β) / ψ_0(x; β),    (15)

which suggests the central connection between macrostates and probabilistic classifiers (2) derived from finite mixture models. Section 2.1 describes the relationship between (15) and (2) in more detail.
The minimum uncertainty condition, minimization of Υ(M; β) over M subject to the partition of unity constraints

Σ_{α=1}^{m} w_α(x; M) = 1,  w_α(x; M) ≥ 0,    (17)

selects maximally crisp (non-overlapping) or minimally fuzzy (overlapping) decision boundaries among classifiers formed by the span of the selected eigenvectors. This objective function attempts to minimize the expected overlap or model error between the component distributions of the Boltzmann distribution p_B(x; β). Note that, although the w_α are defined in terms of ψ_0(x; β), all of the expectations in the definitions of the w_α, p_α, and Υ are taken over p_B(x; β) = ψ_0^2(x; β) and thus apply to this distribution.
Minimization of Υ occurs over the eigenspace spanned by the selected eigenvectors {ψ_i}_{i=0}^{m−1}, which serve as an orthogonal basis. The truncated spectral expansion of (9) provided by {ψ_i}_{i=0}^{m−1} is a form of spectral regularization which prevents overfitting for appropriately chosen values of m [Engl et al., 1996].
Eigenfunction smoothness confers smoothness to the resulting mixture components, which is a necessary property for transitions between macrostates to occur. This makes macrostates well-suited for probabilistic nonphysical mixture modeling problems where some amount of smoothness or overlap between components is desired or expected.

Macrostate mixture models
Macrostates of metastable physical systems are finite mixture models for physical systems. This paper addresses the question of whether or not they are also useful for solving nonphysical mixture modeling and unsupervised learning problems. The Kolmogorov extension theorem guarantees the existence of stochastic processes having arbitrary mixture distributions as their stationary states [Øksendal, 1996].
Imagine a dynamic process that generates the observed mixture f(x) as its equilibrium Boltzmann distribution p_B(x; 1). What metastable states or dynamic separability would occur for such a mixture distribution generating process that has f(x) as its stationary distribution? Macrostate theory rigorously answers this question and provides a precise framework for addressing some of the current challenges in mixture modeling, such as determining the optimal number of components. Viewing energy landscapes as functions over arbitrary data spaces instead of as functions over physical configuration spaces gives information-theoretic meaning and statistical interpretations to Smoluchowski dynamics.

Definition
Let the negative logarithm of a nonnegative mixture distribution f(x) given by (1) act as a pseudo potential energy, so that

V(x) ≡ −log f(x)    (18)

and therefore

p_B(x; β) ∝ e^{−βV(x)} = f^β(x).    (19)

This leads to a form of (15) derived from nonphysical or statistical mixture distributions f(x) after substituting k for α and setting

ŵ_k(x; β, M) ≡ w_α(x; M).    (20)

Combining this with

f̂_k(x; β) ∝ ŵ_k(x; β, M_opt) f(x)    (21)

provides the definition of a macrostate mixture model via (6):

f(x) = Σ_{k=1}^{m} â_k f̂_k(x; β)    (22)

for any 0 < β < ∞ and M_opt(β) satisfying (16).
Because multiplying f(x) by the ŵ_k(x; β, M_opt) in (21) applies an envelope to f(x) reminiscent of the windowing operations used in signal processing, they were called window functions in the original formulation. The formal equivalence of window functions from Shalloway [1996] to discriminative finite mixture models (6) is the central idea in the definition of macrostate mixture models. Satisfying the partition of unity constraints (17) allows the â_k and f̂_k to be computed from the ŵ_k via (6), working backwards from the definition (15).
Note that the relationship of p_B(x; β) to ψ_0(x; β) via the symmetrization of L_0 to L in (9) has no consequence on the functions ŵ_k(x; β, M_opt) and only serves to symmetrize the spectral decomposition. Numerically, the symmetrization also stabilizes and reduces the computational complexity of calculating small magnitude eigenvectors and eigenvalues of matrix approximations of L, as illustrated in Section 4.
Macrostate theory can also be applied to data clustering and graph clustering/partitioning problems where the values of f(x) are not known, as described in Section 3.1. For these types of unsupervised learning problems, the discriminative form (20) is the output and (6) is never invoked. Here the term macrostate model refers both to generative macrostate mixture models of the form (22), where the values of f(x) are available, and to discriminative macrostate mixture models of the form (15) involving vector and graph valued data, where the input data are samples from some distribution f(x) whose values are unknown.

Multiscale Spectral Gap
Macrostate models use a spectral criterion to select the appropriate number of components. The separability of the spectrum for each choice of m > 1 can be measured using the corresponding eigenvalue gap ratio

r_m(β) = λ_m(β) / λ_{m−1}(β),    (23)

where the eigenvalues are ordered by increasing magnitude. Larger gap values correspond to better separability for a given choice of m.
Using a ratio of eigenvalues means that gaps are measured relative to the fastest rate of the representative equilibration modes, allowing the detection of multiscale structure. In the context of macrostate models, the term spectral gap refers to the ratio defined by (23).
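The selection rule can be sketched as follows (NumPy assumed; the spectrum below is synthetic, chosen to have three slow modes): compute the ratios of adjacent magnitude-ordered eigenvalues and pick the m with the largest relative gap:

```python
import numpy as np

def multiscale_gaps(eigenvalues, max_m=None):
    """Ratios r_m = lam_m / lam_(m-1) of magnitude-sorted eigenvalues, m > 1."""
    lam = np.sort(np.abs(np.asarray(eigenvalues, dtype=float)))
    max_m = max_m or len(lam) - 1
    return {m: lam[m] / lam[m - 1] for m in range(2, max_m + 1)}

# Synthetic spectrum: three slow modes (0, 0.01, 0.02) then fast ones.
lam = [0.0, 0.01, 0.02, 5.0, 6.0, 7.0]
gaps = multiscale_gaps(lam)
best_m = max(gaps, key=gaps.get)   # largest relative gap selects m = 3
```

Because the measure is a ratio, the same rule detects gaps at any absolute scale of the spectrum, which is the multiscale property described above.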
Spectra without sufficiently large gaps indicate lack of separability of the mixture components, perhaps because only a single component exists or because of a strong background "noise" component that reduces detectability of other components.
Values of m having larger gaps will have less expected error in the corresponding optimized mixture model.
Large problems require iterative techniques for computing the extremal eigenvectors.In this case the process can be stopped after the first sufficiently large gap appears or some maximum acceptable computational time is exhausted, whichever comes sooner.
Once the spectral decomposition of L in (9) is obtained and the appropriate value(s) of m chosen, the optimization over all feasible M must be performed subject to the partition of unity constraints (17). Although this is an NP-hard global optimization problem in general, it can be solved with reasonable computational expense for moderate values of m, as illustrated in Section 4.

Global Optimization of Uncertainty
In terms of probabilistic classifiers, the uncertainty Υ(M; β) represents the mean squared error of the probabilistic classifier derived from the mixture model specified by each value of M. By minimizing Υ(M; β), macrostate models provide spectrally regularized minimum mean squared error classifiers.
For any m of interest, the column vector valued function ŵ = (ŵ_1, ..., ŵ_m)^T is numerically optimized to compute the macrostate model. The optimal ŵ is determined via global optimization of Υ(M; β) over the set of coordinate transformation matrices M satisfying the partition of unity constraints (17).
In matrix notation, any feasible ŵ can be written in terms of M and the column vector of basis functions ψ = (ψ_0, ..., ψ_{m−1})^T as

ŵ(x) = M ψ(x) / ψ_0(x).

Now the objective function Υ(M; β) can be expressed in terms of ŵ and f(x) ≡ p_B(x; β) as a concave quadratic function

Υ(M; β) = 1 − tr(M^T M),

where tr(M^T M) can also be expressed as ‖M‖_F^2, the squared Frobenius norm of the basis transformation matrix M. These results allow the linearly constrained global optimization problem for M_opt to be succinctly stated as

minimize Υ(M; β) subject to M ψ(x) ≥ 0 for all x ∈ Ω,    (25)

for some fixed temperature scaling parameter β > 0.
The linear inequality constraints Mψ(x) ≥ 0 encode all of the problem-dependent data from the input f(x), since the basis functions ψ_i are eigenfunctions of the operator L determined by V(x) = −log f(x). These inequality constraints are difficult to satisfy for all x over domains containing infinitely many points. In the next section, some suggestions for handling these constraints under various discretizations are provided along with stable and scalable numerical implementations.

Temperature Scaling
This section describes how macrostate models depend on the inverse temperature parameter β. Since higher values of β correspond to lower intensity of random forces for the corresponding generating process relative to the deterministic drift forces, it could also be called a drift sensitivity parameter.
According to this interpretation, higher drift sensitivities occur when the magnitudes of random forces are low relative to the magnitude of the deterministic drift forces, i.e. negative gradients of the energy surface. Therefore increasing β may resolve components more effectively than using β = 1 in some cases, a hypothesis explored in Section 4. Although the equilibrium distribution of the stochastic process does not equal f when β ≠ 1, it relates to f(x) via the negative log density V(x) ≡ −log f(x) from (18).
Combining equations (19) and (18) means that p_B(x; 1) ∝ f(x) for β = 1. If β ≠ 1 then p_B ∝ f^β, and the normalization constant can be safely ignored since it does not affect the eigenfunctions.
Scaling V(x) by β applies a power law transformation f^β(x) to the corresponding unnormalized distribution. This affects the separation between relatively higher and lower probability density values, deforming the shape of the original mixture distribution.
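The deformation can be illustrated numerically (NumPy assumed; the bimodal density is a hypothetical example): raising a density to a power β > 1 increases the peak-to-valley ratio between the modes, while β < 1 decreases it:

```python
import numpy as np

x = np.linspace(-4.0, 4.0, 801)
# Bimodal density with a shallow valley at x = 0 (illustrative example).
f = np.exp(-0.5 * (x - 1.5) ** 2) + np.exp(-0.5 * (x + 1.5) ** 2)
f /= f.sum() * (x[1] - x[0])        # normalize on the grid

def peak_to_valley(beta):
    g = f ** beta                   # unnormalized p_B at inverse temperature beta
    return g.max() / g[np.abs(x).argmin()]   # peak height over central valley

# beta > 1 deepens the valley between modes; beta < 1 flattens it.
sharp, mid, flat = peak_to_valley(2.0), peak_to_valley(1.0), peak_to_valley(0.5)
```

Since the normalization constant cancels in the ratio, the separability change depends only on β, matching the discussion above.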
Physically, the β parameter sets the inverse thermal energy of the underlying Brownian dynamics process yielding f β (x) as the unnormalized equilibrium distribution or stationary state.Both the number and separability of the components therefore depend on the choice of β, the inverse temperature parameter.
In physical systems, larger values of β will reduce the ability of the sample paths of the underlying stochastic process to transition between macrostates relative to smaller values, reducing overlap between regions of high probability.The effect of β < 1 on the shape of the distribution is to increase the chance that a sample path will transition between macrostates, resulting in a flatter shape and more overlapping of components.
Although discontinuities in the number and structure of the mixture components can occur for sufficiently large deviations from β = 1, for smaller excursions continuity can be expected. If a specific number of clusters m is expected from some external knowledge, it is possible to use this prior information to adjust β until a sufficiently large eigenvalue gap r_m(β) is found for this cluster number.
Computing the M_opt(β) for multiple values of β is often prohibitively expensive. Fortunately, its effects can be understood via the resulting spectral gaps r_m(β), which are significantly easier to compute. This economy of resources allows examining several values of β at reasonable computational expense, as explained with examples in the next section.

Previous Formulation
The macrostate data clustering algorithm is an alternative formulation of macrostate models developed previously in Korenblum and Shalloway [2003] and White and Shalloway [2009]. However, the connections to finite mixture models of the form (1), Bayes classifier posterior distributions of the form (2), and partition of unity or probabilistic clusterings/partitionings of the form (15) were not made there. Also, a different objective function and a different global optimization solver were used.
In the previous formulation [Korenblum and Shalloway, 2003], the objective function was a logarithm of the geometric mean uncertainty, which led to a non-quadratic optimization problem. Here the original objective function from Shalloway [1996] is used, providing a more standard concave quadratic programming (QP) problem that can be more easily solved.
Linearly constrained concave QPs can be solved using established heuristics such as modified Frank-Wolfe type procedures [Pardalos and Guisewite, 1993].Reducing the software implementation to a sequence of LP solver calls simplifies the ease of use and practical applicability of macrostates and encourages more active future developments.This method is easily parallelizable since the linear programming problems associated with each of the starting points are standard independent problems, making scalability feasible on distributed architectures such as GPUs.
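A minimal sketch of such a Frank-Wolfe type procedure built from LP solver calls (SciPy's `linprog` assumed; the toy box-constrained problem stands in for the actual macrostate constraints, which it does not reproduce): for a concave objective the linearization is an upper bound, so any LP vertex that improves the linearized objective strictly decreases the true objective:

```python
import numpy as np
from scipy.optimize import linprog

def frank_wolfe_concave(grad, x0, A_ub, b_ub, bounds, max_iter=50):
    """Heuristic Frank-Wolfe descent for a concave objective over a polytope.

    Each step solves the LP  min_s grad(x)^T s  over the feasible set, then
    jumps to the LP vertex; iteration stops when no descent vertex exists.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        res = linprog(grad(x), A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        s = res.x
        if grad(x) @ (s - x) >= -1e-12:   # no descent direction: local optimum
            break
        x = s
    return x

# Toy problem: minimize the concave f(x) = -||x||^2 over the box [0, 1]^2.
grad = lambda x: -2.0 * np.asarray(x)
x_opt = frank_wolfe_concave(grad, [0.2, 0.3], A_ub=None, b_ub=None,
                            bounds=[(0.0, 1.0), (0.0, 1.0)])
```

Each iteration is a single independent LP, which is what makes multistart parallelization across starting points straightforward.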
In Korenblum and Shalloway [2003] and White and Shalloway [2009], only unbounded inverse quadratic similarity measures and soft Gaussian thresholds that do not directly control sparsity were tested. Here other choices of similarity/distance measures are tested, and the use of hard thresholding is examined to directly control sparsity.
By starting from the continuous mixture function representation (1), the range of input datatypes is expanded to cases where the values of the mixture function f(x) are available as inputs, in addition to the pairwise distance/similarity/Laplacian matrix input datatypes considered previously. The previous formulation defined in Korenblum and Shalloway [2003] did not mention applications to function value unmixing via (22). It also did not mention the probabilistic interpretation of the cluster assignments (20) provided by (6) and (2).
Reformulating macrostate data clustering using the theory of finite mixture models for linear separation and unsupervised learning problems provides a more general perspective on their role.Detailed comparisons between the two formulations are left for future work in order to focus on the applications of the reformulation derived here.

Applications
Discretizing the mixture model formulation (1) is necessary for most practical applications. There are a number of numerical methods for approximating the values of the first m eigenfunctions {ψ_i}_{i=0}^{m−1} that are necessary for solving (25).

Discrete Approximation
For sufficiently low dimensional functions (e.g. n ≤ 4) evaluated on evenly-spaced grids, the discrete approximation of (7) can be used via a nearest-neighbor Laplacian approximation to construct a sparse approximation of L in (9), as described in Banushkina et al. [2005]. The discrete approximation approach is useful for applications where the mixture function f(x) can be evaluated on a grid, such as density estimates generated by histograms or kernel density estimation. This was the method used for the numerical example described in Section 4.1.
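A sketch of this construction in one dimension (NumPy assumed; the mixture density is illustrative, and the detailed-balance rate choice is one of several valid discretizations): with V = −log f on an evenly spaced grid, nearest-neighbor rates chosen to satisfy detailed balance yield a reversible CTMC whose stationary distribution is proportional to e^{−βV}, and whose symmetrized form is a symmetric matrix:

```python
import numpy as np

# Illustrative 1-D mixture density and its pseudo potential V = -log f.
x = np.linspace(-3.0, 3.0, 61)
f = np.exp(-0.5 * (x - 1.0) ** 2) + np.exp(-0.5 * (x + 1.0) ** 2)
V = -np.log(f)
beta, n = 1.0, len(x)

# Nearest-neighbor rates k(i->j) = exp(-beta (V_j - V_i) / 2) satisfy
# detailed balance with stationary weights pi_i ~ exp(-beta V_i).
Q = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            Q[i, j] = np.exp(-0.5 * beta * (V[j] - V[i]))
    Q[i, i] = -Q[i].sum()            # rows of a rate matrix sum to zero

pi = np.exp(-beta * V)
residual = np.abs(pi @ Q).max()      # pi is stationary: pi Q = 0

# Similarity transform with diag(sqrt(pi)) symmetrizes the generator.
S = np.diag(np.sqrt(pi))
S_inv = np.diag(1.0 / np.sqrt(pi))
L_sym = S @ Q @ S_inv
```

The symmetric form is the discrete analogue of the symmetrization of L_0 to L, and is what makes the small-magnitude eigensolves stable in practice.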
Discrete approximations can also be applied to nonnegative signals such as spectral density estimates and 2- and 3-dimensional images sampled on evenly-spaced nodes, after preprocessing to remove random noise. Since discrete approximations of Smoluchowski equations are microscopically reversible continuous-time Markov chains (CTMCs), macrostate models can also be constructed by embedding input data into a Markov chain.
Just as in the continuous case (8), discrete transition rate matrices of time-reversible processes are similar to symmetric matrices. Consequently their eigenvalues and eigenvectors are real, and eigenvectors corresponding to distinct eigenvalues are orthogonal.

Markov Chain and Graph Embeddings
When data items are samples {x_i}_{i=1}^{N} from the mixture distribution f(x), the values of f(x) are not directly available. For these problems the function values f(x) are never computed, and the final outputs are the probabilistic cluster assignments (20) evaluated at the item locations. Laplacian matrices provide the connection between pairwise similarity or dissimilarity matrices and CTMCs, allowing macrostate models to be used for systems where accurate estimates of the mixture distribution f(x) are not available.
Pairwise interitem similarity or distance measures describe the regions of data space that represent closely related locations. Laplacian matrices are derived from similarity and distance matrices and can be interpreted as generalized stencils for discretized Laplace operators, identical to those used in the finite-difference based discrete approximation methods of Section 3.1 [Smola and Kondor, 2003].
Laplace operators are free diffusion limits of Smoluchowski operators L_0 where the pseudo potential energy function V(x) is a constant function. Graph/CTMC embeddings of data items using pairwise similarity/distance-measure based Laplacian matrices can therefore be considered as special cases of discretized Smoluchowski equations (7) with constant potentials and reflecting Neumann boundary conditions.
Negative Laplacian matrices are equivalent to transition rate matrices. Their eigensystems are widely used by spectral clustering methods [Azran and Ghahramani, 2006]. Eigenvectors of Laplacian matrices can be inserted into (25) in place of discrete approximations of PDEs of the form (7) to define macrostate models for clustering data items and graph nodes.
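The graph-embedding pipeline up to the eigenvector stage can be sketched as follows (NumPy/SciPy assumed; the two-blob data are synthetic, and the Gaussian similarity bandwidth is an arbitrary choice): pairwise similarities define a Laplacian whose small-magnitude eigenvectors already separate well-separated clusters:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Synthetic items: two well-separated blobs in the plane.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3.0, 0.5, (30, 2)),
               rng.normal(3.0, 0.5, (30, 2))])

# Gaussian similarity matrix and the corresponding graph Laplacian.
W = np.exp(-cdist(X, X, "sqeuclidean") / 2.0)
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W

# Small-magnitude eigenvectors span the subspace used by the model;
# here the second eigenvector separates the two blobs by sign.
vals, vecs = np.linalg.eigh(L)
labels = vecs[:, 1] > 0
```

In a macrostate model these eigenvectors would then enter the optimization (25); the sign split shown here is only the simplest possible readout of the embedding.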
Most other spectral clustering methods use the eigenvectors with small-magnitude eigenvalues as a basis for projecting the data onto, and then k-means or some other clustering method is applied to the projected item coordinates [Ng et al., 2002]. The optimized linear combinations of the eigenvectors used by macrostate models define nonparametric, nonhierarchical, probabilistic clusterings in a more integrated manner using fewer steps and parameters. Of course this involves a tradeoff, i.e. the additional expense of the global optimization problem (25).
Other spectral clustering methods use the eigenvectors for representation purposes and select the number of output clusters independently at a later stage. The literature on spectral clustering defines spectral gaps using the difference of adjacent eigenvalues λ_m − λ_{m−1}, an absolute measure that does not account for the scale of the representation. Multiscale spectral gaps (23) are more appropriate for macrostate modeling since the number of clusters or components m generated is identical to the dimensionality of the representative eigenspace span{ψ_i}_{i=0}^{m−1} by definition.

Forming Laplacian matrices can be prohibitively expensive in terms of memory resources for sufficiently large problems. Applying additional constraints to ensure sufficient sparsity to accommodate available storage space may be necessary. This can be achieved by applying a threshold and setting entries with sufficiently low similarity (or sufficiently high dissimilarity) to zero.
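Hard thresholding for sparsity can be sketched as follows (SciPy assumed; the similarity matrix and threshold value are illustrative):

```python
import numpy as np
from scipy import sparse

def sparsify(W, threshold):
    """Zero out similarities below `threshold` and store the result sparsely."""
    W = np.where(W >= threshold, W, 0.0)
    return sparse.csr_matrix(W)

# Illustrative 3-item similarity matrix; the weak 0.01 link is pruned.
W = np.array([[0.0,  0.9,  0.01],
              [0.9,  0.0,  0.4 ],
              [0.01, 0.4,  0.0 ]])
W_sparse = sparsify(W, 0.1)   # keeps 4 of the 6 off-diagonal entries
```

Unlike soft Gaussian thresholds, the cutoff directly controls the number of stored entries, which is the point made above about hard thresholding.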

Graph Partitioning
Eigensystems of Laplacian matrices are also used for spectral partitioning of graphs in order to identify strongly intraconnected subgraphs or communities [Fortunato, 2010]. For graph clustering or partitioning problems, adjacency matrices are the primary inputs, rather than coming from similarity/distance measure evaluations over item pairs [Schaeffer, 2007]. In terms of adjacency matrices, the identification of separable subgraphs corresponds to an optimal symmetric permutation of rows and columns to achieve approximate block-diagonal form.
Unweighted adjacency matrices are a type of binary similarity matrix that encode connections between nodes as a 1 or a 0. Weighted symmetric graphs can also be encoded using weighted adjacency matrices, which play the same role as similarity matrices for data items. Directed graph (digraph) Laplacian matrices are also defined [Chung, 2004]. These also define valid macrostate models provided that the corresponding CTMCs are reversible.
Network graphs of links between world wide web pages or citations between publications are examples of currently relevant practical graph partitioning problems. The Google PageRank score can be interpreted in terms of the first and second eigenvectors of the graph Laplacian matrix of the graph of outgoing links on websites [Smola and Kondor, 2003].

Numerical Examples
In this section, several numerical examples are presented to illustrate the practical application of macrostate models using both synthetic and real-world problems. The results show high levels of performance across a range of problem types without encountering problems such as numerical instability or sensitivity to small variations of tunable parameters. All run-times were less than 20 minutes using Matlab on a 2.80 GHz Intel Core i7-2640M CPU running the 64-bit Windows 7 Home Premium operating system with 8 GB of RAM.

Gaussian/Laplace/Hyperbolic Secant Mixture
For this example, a three-component mixture f(x) = f_1(x) + f_2(x) + f_3(x) was constructed, consisting of one component made from a mixture of three Gaussians, a second made from a mixture of three unnormalized Laplace distributions, and a third made from a mixture of three hyperbolic secant functions. Since these components do not share a common parametrization, using any one class of distribution to compute the unmixing will result in errors. This simple yet nontrivial example provides a test of the accuracy of the nonparametric macrostate mixture modeling approach. Table 1 lists the corresponding mean squared errors.
These computed values are shown as a series of lines colored by m, with β on the horizontal axis, in Figure 3. Notice that the largest gap occurs at m = 3 for all β ∈ [0, 3], indicating stability of the algorithm in selecting the correct number of clusters over a wide range of the scaling parameter.
The downward slope of r_4(β) versus β also indicates that the m = 3 solution is optimal: it implies that the higher-magnitude transition rates for eigenstates not used in forming the m = 3 solution become increasingly similar as β increases. Because the components of this mixture overlap significantly, the eigenvalue gap r_3(β) for m = 3 at β = 1 is small, as shown in Figure 3. This leads to the unwanted overlap between the optimally separated components visible in the top row of Figure 2.
Since the true solution was known for this test problem, the relative error vs. β was computed as shown in Figure 4.
For β = 2.6 the computed eigenvalue gap was r_3(2.6) = 93.09. The relative error was below 0.94 for β > 2.1, where the eigenvalue gap was r_3(2.1) = 35.82. This indicates that setting a sufficiently large eigenvalue gap cutoff and increasing β until this cutoff is reached would yield near-optimal accuracy over a relatively wide range of thresholds.
Figure 3 indicates that the optimal value β = 2.6 occurs where r_4(β) reaches the same order of magnitude as the gaps r_j(β) for j > 4. This observation suggests that perhaps an optimal β can be found by ensuring that the higher-order gaps are sufficiently small.
If this conjecture holds true, then the optimal value of β could be found by placing an upper bound on r_{j>m} and increasing β until this criterion is met, up to some maximum cluster/mixture component number m of interest. For the Gaussian/Laplace/hyperbolic secant test problem, the optimal value of β occurs at a corresponding cutoff of r_4(β) ≈ 1.2, although more tests are needed to establish this as a useful estimate in general.
Figure 5 shows the optimal macrostate model for this test problem. As suggested by the lower error shown in Figure 4 compared to the β = 1 solution shown in Figure 2, the top and middle panels of the β = 2.6 solution are visually improved, with the unwanted mixing across components no longer visible. Table 2 shows the relative errors for the probabilistic/unthresholded and hard-thresholded optimal β = 2.6 solutions. By contrast to the β = 1 solution, hard thresholding does not remove any soft model error, increasing the error value compared to the original probabilistic/unthresholded version.

Gene expression data
Macrostate modeling was tested on real-world vector data using a subset of the yeast expression data from DeRisi et al. [1997], which is also available in Matlab as described online by Mathworks [2014]. 614 genes were selected by following the instructions in Mathworks [2014] to remove genes with low temporal variation. The Gaussian kernel similarity measure was chosen based on the Euclidean distances of the discretized fold-change vector time series data, with a scale factor of 1/4 of the standard deviation averaged over all time points (0.4215).
Because gene expression datasets can contain dozens of clusters having very few (e.g. 2) members, outlier filtering was applied to remove an additional 292 items. Outliers were iteratively selected as items having mean similarity less than 0.2 times the mean similarity over all items, until all such points were removed.
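The iterative filtering rule described above can be sketched as follows; the 0.2 fraction is the threshold from the text, while the exact stopping rule (re-evaluating the cutoff over the surviving items until no item falls below it) is an assumption:

```python
import numpy as np

def filter_outliers(S, frac=0.2):
    """Iteratively drop items whose mean similarity falls below
    `frac` times the overall mean similarity of the surviving
    items (illustrative sketch of the filtering step)."""
    keep = np.arange(S.shape[0])
    while True:
        Sk = S[np.ix_(keep, keep)]
        mean_per_item = Sk.mean(axis=1)
        cutoff = frac * Sk.mean()
        mask = mean_per_item >= cutoff
        if mask.all():
            return keep
        keep = keep[mask]

# three mutually similar items plus one clear outlier
S = np.full((4, 4), 0.9)
S[3, :] = S[:, 3] = 0.01
kept = filter_outliers(S)
```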
The remaining 322 items were clustered using the eigenvectors of the (unnormalized) Laplacian matrix derived from the Gaussian kernel pairwise similarity matrix. Spectral gaps were computed for m = 2, …, 20 as plotted in Figure 6, indicating that m = 9 was the largest m having a gap greater than a cutoff of 1.5.
Figure 6: Spectral gap vs. m using the Gaussian kernel similarity measure with scale factor 0.4215, showing the peak at m = 9 above a cutoff value of 1.5.
Figure 7 shows the m = 9 cluster solution after performing outlier removal.
By comparison, Figure 8 shows the solution from the Matlab implementation of k-means for the same outlier-filtered dataset, which exhibits over-splitting of items with low-amplitude variations.
Silhouette scores provide a means of quantitatively assessing data cluster validity [?]. This visual observation is corroborated by the quantitative evidence of the mean silhouette data clustering validation scores shown in Table 3. Macrostate modeling yields almost twice the average silhouette score of the standard k-means algorithm, indicating substantially improved performance on test data from biological gene expression experiments. Although no effort was made to assign biological meaning or interpretation to the clusters identified, the quantitative validation of the macrostate modeling approach using silhouette scores demonstrates the practical value of this algorithm.
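A from-scratch sketch of the mean silhouette score used for validation here, following the standard definition s = (b − a)/max(a, b), where a is a point's mean distance to its own cluster and b its smallest mean distance to any other cluster (this is a generic implementation, not the paper's code):

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette score over all points (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # full pairwise Euclidean distance matrix
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():                   # singleton cluster: score 0
            scores.append(0.0)
            continue
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# two well-separated clusters give a score close to 1
X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])
score = mean_silhouette(X, np.array([0, 0, 1, 1]))
```

Negative per-point scores, as seen in the k-means comparison below, indicate points closer on average to another cluster than to their own.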

C. elegans neural network graph
The biological neural network graph for the nematode C. elegans, a well-studied animal model organism in biology, was obtained from Watts and Strogatz [1998] and used as a real-world graph-partitioning example. Nodes of the original directed graph without both input and output connections were removed in order to create an irreducible and reversible CTMC using the normalized graph Laplacian of the symmetrized adjacency matrix A_sym ≡ A + A^T.
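The preprocessing described above can be sketched as follows; the repeated pruning pass is an assumption on our part, since removing one node can strip another node's only connection:

```python
import numpy as np

def symmetrize_connected(A):
    """Drop nodes lacking either incoming or outgoing edges until
    none remain, then symmetrize: A_sym = A + A^T (illustrative
    sketch of the preprocessing step)."""
    A = np.asarray(A, dtype=float)
    while True:
        keep = (A.sum(axis=1) > 0) & (A.sum(axis=0) > 0)
        if keep.all():
            return A + A.T
        A = A[np.ix_(keep, keep)]

# node 2 has neither inputs nor outputs and is pruned away
A = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 0.]])
A_sym = symmetrize_connected(A)
```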
Although symmetrization of digraphs is not essential for macrostate modeling, it can be useful when overall connectivity is of interest and also simplifies the numerical implementation. The symmetrized Laplacian matrix is therefore sufficient for the purpose of demonstrating the potential usefulness of macrostate modeling on graph partitioning problems.
Adapting the temperature scaling methodology from Section 4.1, the off-diagonal elements of the Laplacian matrix can be interpreted as transition rates and scaled by an arbitrary exponent β to tune the accuracy of the resulting clustering. Figure 12 shows the sparsity pattern of the adjacency matrix for the original and the partitioned graph, after reordering the nodes according to the cluster/partition assignments. Visually, there is a significant increase in the block-diagonal structure of the adjacency matrix corresponding to each cluster, indicating a plausible partitioning.
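A sketch of the β scaling described above, assuming it acts elementwise on the off-diagonal rates with the diagonal reset to preserve zero row sums (the exact scaling in Section 4.1 may differ in detail):

```python
import numpy as np

def scale_rates(Q, beta):
    """Raise off-diagonal transition rates to the power beta and
    restore the zero row-sum property of the resulting rate matrix
    (illustrative sketch of inverse-temperature scaling)."""
    Q = np.asarray(Q, dtype=float)
    Qb = Q.copy()
    off = ~np.eye(Q.shape[0], dtype=bool)
    Qb[off] = np.abs(Q[off]) ** beta       # elementwise beta exponent
    np.fill_diagonal(Qb, 0.0)
    np.fill_diagonal(Qb, -Qb.sum(axis=1))  # rows must sum to zero
    return Qb

# two-state rate matrix; beta = 2 squares the off-diagonal rates
Q = np.array([[-1., 1.],
              [2., -2.]])
Q2 = scale_rates(Q, 2.0)
```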
Notice that, as expected for small-world networks obeying power laws as described in Watts and Strogatz [1998], the clusters vary widely in size and occur at all scales. These results support the conclusion that macrostate models are useful for multiscale network analysis applications, although more work needs to be done to complete the interpretation of the results obtained here.

Discussion
Macrostate models are versatile and powerful tools for challenging problems such as linear separation/unmixing, cluster analysis, and graph partitioning. Their applicability to a variety of currently active research areas reveals deep similarities in the structures of these seemingly distinct and independently derived problems.

Interpretation
Originally, macrostates were defined for separating physical systems where the equilibrium distribution is generated by a single process and the mixture components attempt to approximate the system's metastable dynamics [Shalloway, 1996].
For such physical problems, the mixture components (macrostates) do not represent independent sources or processes but are expansions of a single process into subprocesses whose intra-state transitions equilibrate much faster than their inter-state transitions. In nonphysical statistical finite mixture modeling and unsupervised learning, the assumption is that the observations actually come from distinct independent processes that happen to be superimposed. Although this may seem a small philosophical difference, it is worth noting because it can obscure the mathematical equivalence of these two problems.
Understanding the geometric/mathematical equivalence of finite mixture modeling and physical macrostates of dynamic systems bridges the gap between spectral clustering and stochastic/Markov process theory. This connection allows the effectiveness of Laplacian matrices in spectral clustering to be understood in terms of the metastable dynamics of stochastic processes.

Geometry
It may seem surprising that a physical method for modeling dynamic systems can be adapted to nonphysical linear separation/unmixing problems. The reverse logic of considering generating stochastic processes for arbitrary mixture distributions makes sense because the PDE (7) defining the dynamics of the physical process is entirely determined by the geometry of the potential energy function V(x).
The geometric equivalence of separating a single dynamic process into weakly coupled subprocesses and separating mixtures of processes into independent components is the key idea behind macrostate models. This motivates the use of generating stochastic processes for mixture distributions f(x), defined by the negative log densities V(x) in (18), to provide useful information about the geometry or structure of f(x).
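This construction can be summarized with the standard overdamped-diffusion sketch below; the notation is generic, and the precise forms of (7) and (18) in the paper may differ:

```latex
% Potential from the mixture density (cf. (18)):
V(x) = -\log f(x)
% Overdamped (Smoluchowski) diffusion at inverse temperature \beta
% (cf. the PDE (7)):
dx_t = -\nabla V(x_t)\,dt + \sqrt{2\beta^{-1}}\,dW_t
% Stationary density of this process:
p_\infty(x) \propto e^{-\beta V(x)} = f(x)^{\beta}
```

At β = 1 the process equilibrates to f itself, so the geometry of f alone determines the dynamics and hence the macrostate decomposition.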

Parameter tuning
The role of the inverse temperature parameter β is analogous to the scaling parameter used in many reproducing kernel functions, such as Gaussian or other radial basis function kernels. Using β > 1 was shown to improve unmixing accuracy for distributions with significant overlaps by enforcing sharper boundaries between components. This is analogous to lowering the temperature of a weakly metastable physical process in order to slow the transitions between the metastable states. In the context of similarity measures defined by kernel functions K(x, y; ·), β can often be related to the scaling factor used, which is often determined by cross-validation.
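The kernel analogy can be made concrete with a generic Gaussian kernel identity (a standard fact, not a formula from the paper): raising the similarity matrix elementwise to the power β is equivalent to shrinking the kernel scale σ to σ/√β:

```python
import numpy as np

# Random points and their squared pairwise distances
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

sigma, beta = 1.5, 2.6
K = np.exp(-sq / (2 * sigma ** 2))                       # scale sigma
K_sharp = np.exp(-sq / (2 * (sigma / np.sqrt(beta)) ** 2))  # scale sigma/sqrt(beta)

# elementwise exponentiation by beta == sharper kernel scale
assert np.allclose(K ** beta, K_sharp)
```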

Combined representation and inference
Macrostate models provide a single algorithm for probabilistic clustering directly from eigenspaces of transition-rate/Laplacian matrices. Most other existing spectral clustering methods use a two-stage approach, with a representation step using Laplacian eigenspaces followed by a separate inference step.
Instead of only using the selected eigenvectors of Laplacian matrices for data representation, macrostate models also use them to define the final values of the cluster assignment probabilities via a global optimization problem, providing a unified probabilistic theory for unsupervised learning from Laplacian eigenspaces. Unifying representation and inference creates a simpler and more coherent procedure while avoiding unnecessary parameters and algorithm selection tasks.
In exchange for the added cost or complexity of (25), all of the information used for defining the model comes from the eigenspace of a Laplacian or transition-rate matrix, enabling inference without any additional model selection and execution steps as required by most other spectral clustering algorithms. Although macrostate modeling involves solving an NP-hard linearly-constrained concave quadratic global optimization problem, the modified Frank-Wolfe heuristic of Pardalos and Guisewite [1993, page 94] provides an effective approximate solver for the example problems tested.
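A generic Frank-Wolfe sketch over the probability simplex, shown only as a stand-in for the modified heuristic of Pardalos and Guisewite [1993], whose details differ. For a concave objective, exact line search along the Frank-Wolfe segment always lands on an endpoint, so each step either jumps to a vertex or stops:

```python
import numpy as np

def frank_wolfe_simplex(grad, w0, steps=100):
    """Frank-Wolfe iteration minimizing a concave objective over
    the probability simplex (illustrative sketch)."""
    w = np.asarray(w0, dtype=float)
    for _ in range(steps):
        g = grad(w)
        v = np.zeros_like(w)
        v[np.argmin(g)] = 1.0              # best simplex vertex
        if g @ (v - w) >= -1e-12:          # no descent direction left
            break
        w = v                              # concave: endpoint is optimal
    return w

# toy concave objective g(w) = -||w||^2, minimized at a simplex vertex;
# the perturbed start avoids the symmetric stationary point
grad = lambda w: -2.0 * w
w = frank_wolfe_simplex(grad, np.array([0.4, 0.3, 0.2, 0.1]))
```

The iterate lands on the vertex nearest the starting point, illustrating why such heuristics are local: different starts can reach different vertices of the feasible polytope.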

Cluster number prediction
Many other mixture modeling, clustering, and partitioning algorithms in use today are parametric and cannot easily predict the correct number of mixture components. By contrast, macrostate models are nonparametric and can accurately predict this number directly from efficiently computable eigenspectra.
Using a multiscale spectral gap involving a ratio rather than a difference of adjacent eigenvalues further differentiates macrostate mixture models from existing spectral clustering and partitioning methods. It would be interesting to obtain a probabilistic interpretation of the set of eigenvalue gaps {r_m}_{m=2}^∞, i.e. an estimate of the probability P(m) that a particular choice of m is appropriate, an issue that is left open for future work.

Future work
The topological structure of macrostate splitting as β increases from a sufficiently small nonzero value towards ∞ has been useful for solving challenging global optimization problems [Pardalos et al., 1994]. Such homotopy-related aspects of macrostate theory are not explored here for the sake of brevity. Recursive applications for creating hierarchical levels of partitions, e.g. as used by multigrid methods, are another interesting area left for future work. User-friendly and open-source software implementations are also necessary to promote the more widespread use of macrostate models.

Figure 1
Figure 1 shows an image of f(x) evaluated at 200 Cartesian gridpoints {x_i : x_i ∈ [−10, 10]^2}_{i=1}^{200} in two dimensions, with colors indicating the value of f(x) at each point.

Figure 1 :
Figure 1: Gaussian/Laplace/hyperbolic-secant mixture. The panels of Figure 2 show the corresponding unthresholded (top row) and hard-thresholded (middle row) macrostate mixture model solutions using the default β = 1 scaling parameter setting. The bottom row shows the original mixture components prior to forming their superposition, for comparison (ground truth).

Figure 3 :
Figure 3: Semilog plot of eigenvalue gap r_m(β) vs. β, colored by m for m = 2, …, 11. The m = 3 case in blue quickly goes above the vertical axis limit but continues increasing on the logarithmically-scaled vertical axis with a near-constant slope, as shown by the inset plot.

Figure 4 :
Figure 4: Relative error of the m = 3 macrostate mixture model for the Gaussian/Laplace/hyperbolic secant mixture test problem vs. β.

Figure 12 :
Figure 12: Original (left) and partitioned (right) C. elegans neural network graph sparsity patterns, where dots indicate edges between nodes. Colors indicate cluster/partition assignment using the m = 5 cluster solution, where Υ(M_opt; β) = 0.5893 with β = 2. Diagonal blocks suggested by the colors indicate strong relative connectivity of member nodes within each cluster. Cluster membership numbers are 12, 35, 114, 51, and 47 (from top to bottom).

Table 2 :
Relative error of the optimally-scaled β = 2.6 macrostate mixture model for the Gaussian/Laplace/hyperbolic secant mixture test problem.

Table 3
Figures 9 and 10 respectively show the corresponding silhouette plots for the macrostate and k-means algorithms. Notice that the k-means solution contains clusters with negative silhouette scores, a clear sign of poor performance.