Graph diffusion distance: Properties and efficient computation

We define a new family of similarity and distance measures on graphs, and explore their theoretical properties in comparison to conventional distance metrics. These measures are defined by the solution(s) to an optimization problem which attempts to find a map minimizing the discrepancy between two graph Laplacian exponential matrices, under norm-preserving and sparsity constraints. Variants of the distance metric are introduced to consider such optimized maps under sparsity constraints as well as fixed time-scaling between the two Laplacians. The objective function of this optimization is multimodal and has discontinuous slope, and is hence difficult for univariate optimizers to solve. We demonstrate a novel procedure for efficiently calculating these optima for two of our distance measure variants. We present numerical experiments demonstrating that (a) upper bounds of our distance metrics can be used to distinguish between lineages of related graphs; (b) our procedure is faster at finding the required optima, by as much as a factor of 10³; and (c) the upper bounds satisfy the triangle inequality exactly under some assumptions and approximately under others. We also derive an upper bound for the distance between two graph products, in terms of the distance between the two pairs of factors. Additionally, we present several possible applications, including the construction of infinite "graph limits" by means of Cauchy sequences of graphs related to one another by our distance measure.


Introduction
Structure comparison, as well as structure summarization, is a ubiquitous problem, appearing across multiple scientific disciplines. In particular, many scientific problems (e.g. inference of molecular properties from structure, pattern matching in data point clouds and scientific images) may be reduced to the problem of inexact graph matching: given two graphs, compute a measure of similarity that gainfully captures structural correspondence between the two. Similarly, many algorithms for addressing multiple scales of dynamical behavior rely on methods for automatically coarsening some model architecture.
In this work we present a graph distance metric, based on the Laplacian exponential kernel of a graph. This measure generalizes the work of Hammond et al. [1] on graph diffusion distance for graphs of equal size; crucially, our distance measure allows for graphs of unequal size. We formulate the distance measure as the solution to an optimization problem dependent on a comparison of the two graph Laplacians. This problem is a nested optimization problem with the innermost layer consisting of multivariate optimization subject to matrix constraints (e.g. orthogonality). To compute this dissimilarity score efficiently, we develop an algorithm which calculates upper bounds on the distance, and we demonstrate its lower computational cost. This algorithm yields a prolongation/restriction operator, P, which produces an optimally coarsened version of the Laplacian matrix of a graph. Prolongation/restriction operators produced via the method in this paper have previously been applied to accelerate the training of machine learning algorithms in [2].

Prior work
Quantitative measures of similarity or dissimilarity between graphs have been studied for decades, owing to their relevance for problems in pattern recognition including structure-based recognition of extended and compound objects in computer vision, prediction of chemical similarity based on shared molecular structure, and many other domains. Related problems arise in quantitative modeling, for example in meshed discretizations of partial differential equations and, more recently, in trainable statistical models of data that feature graph-like models of connectivity such as Bayes Networks, Markov Random Fields, and artificial neural networks. A core problem is to define and compute how "similar" two graphs are in a way that is invariant to a permutation of the vertices of either graph, so that the answer does not depend on an arbitrary numbering of the vertices. On the other hand, unlike an arbitrary numbering, problem-derived semantic labels on graph vertices may express real aspects of a problem domain and may be fair game for detecting graph similarity. The most difficult case occurs when such labels are absent, for example in an unstructured mesh, as we shall assume. Here we detail several measures of graph dissimilarity, chosen for historical significance and similarity to our measure; we mention just a few prior works to give an overview of the development of graph distance measures over time, paying special attention to those which share theoretical or algorithmic characteristics with the measure we introduce. Our mathematical distinctions concern the following properties:
• Does the distance measure require an inner optimization loop? If so, is it mainly a discrete or continuous optimization formulation?
• Does the distance measure calculation naturally yield some kind of explicit map from real-valued functions on vertices of one graph to functions on vertices of the other? (A map from vertices to vertices would be a special case.) If we use the term "graph signal" to mean a function $f : V(G_1) \to S$ which identifies each vertex of a graph $G_1$ with some state $s \in S$, then a map-explicit graph distance is one whose calculation yields a second function $g : V(G_2) \to V(G_1)$, with the composite function $f \circ g : V(G_2) \to S$.
• Is the distance metric definable on the spectrum of the graph alone, without regard to other data from the same graph? The "spectrum" of a graph is a graph invariant calculated as the eigenvalues of a matrix related to the adjacency matrix of the graph. Depending on context, the spectrum can refer to eigenvalues of the adjacency matrix, graph Laplacian, or normalized graph Laplacian of a graph. We will usually take the underlying matrix to be the graph Laplacian, defined in detail in Section 1.3. Alternatively, does it take into account more detailed "structural" aspects of the graph? This categorization (structural vs. spectral) is similar to that introduced in [3].
We label each of the graph distance variants discussed here according to the above taxonomy. For example, the two prior works by Eshera et al. and Hammond et al. (discussed in Sections 1.1.4 and 1.1.5) would be labelled as (structural, explicit, disc-opt) and (spectral, implicit, non-opt), respectively. Our distance measure would be labelled (spectral, explicit, cont-opt).
1.1.1 Quadratic matching of points and graphs (structural, explicit, cont-opt). Other work focuses on the construction of a point-to-point correspondence between the vertices of two graphs. Gold et al. [4] define the dissimilarity between two unlabelled weighted graphs (with adjacency matrices $A^{(1)}$ and $A^{(2)}$ and $n_1$ and $n_2$ vertices, respectively) as the solution to an optimization problem over real-valued matching matrices $M = [m_{ij}]$, subject to row and column constraints, of the form

$$d(G_1, G_2) = \min_{M} \left\| A^{(1)} - M A^{(2)} M^T \right\|_F^2,$$

where $\| \cdot \|_F^2$ is the squared Frobenius norm. This problem is similar in structure to the optimization considered in Section 4, a key difference being that Gold et al. consider optimization over real-valued matchings between graph vertices, whereas we consider 0-1 valued matchings between the eigenvalues of the graph Laplacians. In [5, 6] the authors present computational methods for computing this optimum, and demonstrate applications of this distance measure to various machine learning tasks such as 2D and 3D point matching, as well as graph clustering. Gold et al. also introduce the softassign, a method for performing combinatorial optimization with both row and column constraints, similar to those we consider.

1.1.2 Cut-distance of graphs (structural, implicit, disc-opt). Lovász [7] defines the cut-distance of a pair of graphs as follows. Let the □-norm of an $n \times n$ matrix $B$ be given by

$$\| B \|_{\square} = \frac{1}{n^2} \max_{S, T \subseteq \{1, \ldots, n\}} \left| \sum_{i \in S,\, j \in T} B_{ij} \right|.$$

Given two labelled graphs $G_1$, $G_2$, on the same set of vertices, and their adjacency matrices $A_1$ and $A_2$, the cut-distance is then given by $d_{\mathrm{cut}}(G_1, G_2) = \| A_1 - A_2 \|_{\square}$ (for more details, see [7]). Computing this distance requires combinatorial optimization (over all vertex subsets of $G_1$, $G_2$), but this optimization does not result in an explicit map between $G_1$ and $G_2$.

1.1.3 Wasserstein earth mover distance (spectral, implicit, disc-opt). One common metric between graph spectra is the Wasserstein Earth Mover Distance (EMD). Most generally, this distance measures the cost of transforming one probability density function into another by moving mass under the curve. If we consider the eigenvalues of a (possibly weighted) graph as point masses, then the EMD measures the distance between the two spectra as the solution to a transport problem (transporting one set of points to the other, subject to constraints, e.g. a limit on total distance travelled or a limit on the number of 'agents' moving points). The EMD has been used in the past in various graph clustering and pattern recognition contexts; see [8].
In the above categorization, this is an optimization-based spectral distance measure, but an implicit one, since it does not produce a map from vertices of $G_1$ to those of $G_2$ (informally, this is because the EMD is not translating one set of eigenvalues into the other, but instead transforming their respective histograms). Recent work applying the EMD to graph classification includes [9, 10]. Some similar recent works [11, 12] have used optimal transport theory to compare graphs. In this framework, signals on each graph are smoothed and considered as draws from probability distribution(s) over the set of all graph signals. An optimal transport algorithm is used to find the optimal mapping between the two probability distributions, thereby comparing the two underlying graphs.
1.1.4 Graph-edit distance (structural, explicit, disc-opt). The graph edit distance measures the total cost of converting one graph into another with a sequence of local edit moves, with each type of move (vertex deletion or addition, edge deletion or addition, edge division or contraction) incurring a specified cost. Costs are chosen to suit the graph analysis problem at hand; determining a cost assignment which makes the edit distance most instructive for a certain set of graphs is an active area of research. The distance measure is then the sum of these costs over an optimal sequence of edits, which must be found using some optimization algorithm, e.g. a shortest-path algorithm (the best choice of algorithm may vary, depending on how the costs are chosen). The sequence of edits may or may not (depending on the exact set of allowable edit moves) be adaptable into an explicit map between vertex-sets. Classic pattern recognition literature includes [13–16].
1.1.5 Diffusion distance due to Hammond et al. [1] (spectral, implicit, non-opt). We discuss this recent distance metric more thoroughly below. This distance measures the difference between two graphs as the maximum discrepancy between probability distributions which represent single-particle diffusion beginning from each of the nodes of $G_1$ and $G_2$. This distance is computed by comparing the eigenvalues of the heat kernels of the two graphs. The optimization involved in calculating this distance is a simple unimodal optimization over a single scalar, $t$, representing the passage of time for the diffusion process on the two graphs; hence we do not count this among the "optimization based" methods we consider.
1.1.6 Novel diffusion-derived measures. In this work, we introduce a family of related graph distance measures which compare two graphs in terms of similarity of a set of probability distributions describing single-particle diffusion on each graph. For two graphs $G_1$ and $G_2$ with respective Laplacians $L(G_1)$ and $L(G_2)$, the matrices $e^{tL(G_1)}$ and $e^{tL(G_2)}$ are called the Laplacian exponential kernels of $G_1$ and $G_2$ ($t$ is a scalar representing the passage of time). The column vectors of these matrices describe the probability distribution of a single-particle diffusion process starting from each vertex, after $t$ time has passed. The norm of the difference of these two kernels thus describes how different these two graphs are, from the perspective of single-particle diffusion, at time $t$. Since these distributions are identical at very early and very late times $t$ (we formalize this notion in Section 2.1), a natural way to define a graph distance is to take the supremum over all $t$. When the two graphs are the same size, so are the two kernels, which may therefore be directly compared with a matrix norm. This is the case considered by Hammond et al. [1]. However, to compare two graphs of different sizes, we need a mapping between the column vectors of $e^{tL(G_1)}$ and those of $e^{tL(G_2)}$.
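As a concrete illustration, the kernel is straightforward to compute numerically. The following minimal sketch (our own, not code from the paper) builds $e^{tL}$ for a small graph under the sign convention $L = A - D$ defined in Section 1.3, and checks that each column is a probability distribution:

```python
# Minimal sketch: build the Laplacian exponential kernel e^{tL} for a small
# graph and check that each column is a probability distribution over
# vertices (single-particle diffusion started at that column's vertex).
import numpy as np
import networkx as nx
from scipy.linalg import expm

G = nx.path_graph(5)
A = nx.to_numpy_array(G)
L = A - np.diag(A.sum(axis=0))       # sign convention L(G) = A(G) - D(G)

t = 0.7                              # an arbitrary diffusion time
K = expm(t * L)                      # Laplacian exponential kernel e^{tL}

assert np.allclose(K.sum(axis=0), 1.0)   # columns sum to 1
assert np.all(K >= -1e-12)               # entries are (numerically) nonnegative
```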
Optimization over a suitably constrained prolongation/restriction operator between the graph Laplacians of the two graphs is a permutation-invariant way to compare the behavior of a diffusion process on each. The prolongation map $P$ thus calculated may then be used to map signals (by which we mean values associated with vertices or edges of a graph) on $G_1$ to the space of signals on $G_2$ (and vice versa). In [2] we implicitly consider the weights of an artificial neural network model to be graph signals, and use these operators to train a hierarchy of linked neural network models. However, in that work we do not address efficient calculation of this distance or the associated operators, a major focus of this paper.
We also, in Sections 3.2 and 3.3, consider a time conversion factor between diffusion on graphs of unequal size, and consider the effect of limiting this optimization to sparse maps between the two graphs (again, our case reduces to that of Hammond et al. when the graphs in question are the same size, dense P and R matrices are allowed, and our time-scaling parameter is set to 1).
In this work, we present an algorithm for computing the type of nested optimization given in our definition of distance (Eqs 8 and 9). The innermost loop of our distance measure optimization consists of a Linear Assignment Problem (LAP, defined below) where the entries of the cost matrix have a nonlinear dependence on some external variable. Our algorithm greatly reduces both the count and size of calls to the external LAP solver. We use this algorithm to compute an upper bound on our distance measure, but it could also be useful in other similar nested optimization contexts: specifically, nested optimization where the inner loop consists of a linear assignment problem whose costs depend quadratically on the parameter in the outermost loop.

Background
The ideal for a quantitative measure of similarity or distance on some set $S$ is usually taken to be a distance metric $d : S \times S \to \mathbb{R}$ satisfying, for all $x, y, z \in S$:
1. $d(x, y) \geq 0$ (non-negativity);
2. $d(x, y) = 0$ if and only if $x = y$ (identity);
3. $d(x, y) = d(y, x)$ (symmetry);
4. $d(x, z) \leq d(x, y) + d(y, z)$ (triangle inequality).
Then $(S, d)$ is a metric space. Euclidean distance on $\mathbb{R}^d$ and geodesic distance on manifolds satisfy these axioms. They can be used to define algorithms that generalize from $\mathbb{R}^d$ to other spaces. A variety of weakenings of these axioms are required in many applications, obtained by dropping some axioms and/or weakening others. For example, if $S$ is a set of nonempty sets of a metric space $S'$, one can define the "Hausdorff distance" on $S$, which is an extended pseudometric that obeys the triangle inequality but not the identity axiom and that can take values including $+\infty$. As another example, any measure of distance on graphs which is purely spectral (in the taxonomy of Section 1.1) cannot distinguish between graphs which have identical spectra. We discuss this in more detail in Section 2.3.
Additional properties of distance metrics that generalize Euclidean distance may pertain to metric spaces related by Cartesian product, for example, by summing the squares of the distance metrics on the factor spaces. We will consider an analog of this property in Section 3.4.

Graph Laplacian: For a graph $G$ with adjacency matrix $A(G)$ and diagonal degree matrix $D(G)$, we take the graph Laplacian to be $L(G) = A(G) - D(G)$. $L(G)$ is sometimes instead defined as $D(G) - A(G)$; we take this sign convention for $L(G)$ because it agrees with the standard continuum Laplacian operator, Δ, of a multivariate function.

Linear Assignment Problem (LAP): We take the usual definition of the Linear Assignment Problem (see [17, 18]): we have two lists of items $S$ and $R$ (sometimes referred to as "workers" and "jobs"), and a cost function $c : S \times R \to \mathbb{R}$ which maps pairs of elements from $S$ and $R$ to an associated cost value. This can be written as a linear program for real-valued $x_{ij}$ as follows:

$$\min_{x} \sum_{i=1}^{|S|} \sum_{j=1}^{|R|} c(s_i, r_j)\, x_{ij} \quad \text{subject to} \quad \sum_{j} x_{ij} = 1, \quad \sum_{i} x_{ij} = 1, \quad x_{ij} \geq 0. \qquad (5)$$

Generally, "Linear Assignment Problem" refers to the square version of the problem where $|S| = |R| = n$, and the objective is to allocate the $n$ jobs to $n$ workers such that each worker has exactly one job and vice versa. The case where there are more workers than jobs, or vice versa, is referred to as a Rectangular LAP or RLAP. In practice, the conceptually simplest method for solving an RLAP is to convert it to a LAP by augmenting the cost matrix with several columns (rows) of zeros. In this case, solving the RLAP is equivalent to solving a LAP of size max(n, m). Other computational shortcuts exist; see [19] for details. Since the code we use to solve RLAPs takes the augmented cost matrix approach, we do not consider other methods in this paper.
Matching: we refer to a 0-1 matrix $M$ which is the solution of a particular LAP as a "matching". We may refer to the "pairs" or "points" of a matching, by which we mean the pairs of indices $(i, j)$ with $M_{ij} = 1$. We may also say in this case that $M$ "assigns" $i$ to $j$.
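To make this machinery concrete, the following sketch (our own illustration; the paper's implementation uses the lapsolver package rather than scipy) solves an RLAP by the zero-padding approach described above and extracts the resulting matching matrix M:

```python
# Hedged sketch of the RLAP-by-padding approach: augment the cost matrix
# with zero-cost rows/columns, solve the square LAP, then keep only the
# assignments that land inside the original rectangle.
import numpy as np
from scipy.optimize import linear_sum_assignment

def solve_rlap_by_padding(C):
    """Solve a rectangular LAP with cost matrix C (n x m, n != m allowed)."""
    n, m = C.shape
    k = max(n, m)
    C_aug = np.zeros((k, k))
    C_aug[:n, :m] = C
    rows, cols = linear_sum_assignment(C_aug)
    # Pairs (i, j) inside the original matrix are the "matching": M[i, j] = 1.
    pairs = [(i, j) for i, j in zip(rows, cols) if i < n and j < m]
    M = np.zeros((n, m), dtype=int)
    for i, j in pairs:
        M[i, j] = 1
    return M, sum(C[i, j] for i, j in pairs)

C = np.random.rand(3, 5)            # more "jobs" than "workers"
M, cost = solve_rlap_by_padding(C)
assert M.sum() == 3                 # each worker gets exactly one job
```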
Hierarchical graph sequences: A Hierarchical Graph Sequence (HGS) is a sequence of graphs, indexed by $l \in \mathbb{N} = 0, 1, 2, 3, \ldots$, satisfying the following:
• $G_0$ is the graph with one vertex and one self-loop; and
• successive members of the lineage grow roughly exponentially; that is, there exists some base $b$ such that the growth rate as a function of level number $l$ is $O(b^{l(1+\epsilon)})$ for all $\epsilon > 0$.

Graded graph:
A graded graph is a graph along with a vertex labelling, where vertices are labelled with non-negative integers such that $\Delta l$, the difference in label over any edge, is in $\{-1, 0, 1\}$. We will refer to the $\Delta l = 0$ edges as "within-level" and the $\Delta l = \pm 1$ edges as "between-level".
Graph lineages: A graph lineage is a graded graph with two extra conditions:
• The vertices and edges with $\Delta l = 0$ form an HGS; and
• the vertices and edges with $\Delta l = \pm 1$ form an HGS of bipartite graphs.
More plainly, a graph lineage is an exponentially growing sequence of graphs along with ancestry relationships between nodes. We will also use the term graph lineage to refer to the HGS in the first part of the definition. Some intuitive examples of graph lineages in this sense are the following:
• Path graphs or cycle graphs of size $b^n$ for any integer $b$.
• More generally, grid graphs of any dimension $d$ and side length $b^n$, yielding a lineage which grows with size $b^{dn}$ (with periodic or nonperiodic boundary conditions).
• For any probability distribution p(x, y) whose support is points in the unit square, we can construct a graph by discretizing the map of p as a function of x and y, and interpreting the resulting matrix as the adjacency matrix of a graph. For a specific probability distribution p, the graphs derived this way with discretizations of exponentially increasing bin count form a graph lineage.
• The triangulated mesh is a common object in computer graphics [20–22], representing a discretization of a 2-manifold embedded in $\mathbb{R}^3$; successive refinements of such a mesh form a graph lineage.

Box product of graphs: The box product $G \square H$ of two graphs $G$ and $H$ has vertex set $V(G \square H) = V(G) \times V(H)$ and an adjacency relationship defined by: $(g_1, h_1) \sim (g_2, h_2)$ in $G \square H$ if and only if $g_1 \sim g_2$ in $G$ and $h_1 = h_2$, or $g_1 = g_2$ and $h_1 \sim h_2$ in $H$. Note that the adjacency matrix of this relationship may be represented by the following identity:

$$A(G \square H) = A(G) \otimes I_{|H|} + I_{|G|} \otimes A(H),$$

where $\otimes$ is the Kronecker product of matrices (see [23], Section 11.4).
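The Kronecker identity above is easy to verify numerically; the sketch below (our own, assuming that lexicographic node ordering matches the Kronecker convention for these generators) checks it against networkx's built-in product:

```python
# Verify A(G box H) = A(G) (x) I + I (x) A(H) against networkx's
# cartesian_product, which implements the box product.
import numpy as np
import networkx as nx

G, H = nx.path_graph(3), nx.cycle_graph(4)
AG, AH = nx.to_numpy_array(G), nx.to_numpy_array(H)

A_box = np.kron(AG, np.eye(len(H))) + np.kron(np.eye(len(G)), AH)

P = nx.cartesian_product(G, H)          # networkx's name for the box product
nodes = sorted(P.nodes())               # lexicographic order matches kron index g*|H| + h
A_nx = nx.to_numpy_array(P, nodelist=nodes)
assert np.allclose(A_box, A_nx)
```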

Diffusion distance definition
We generalize the diffusion distance defined by Hammond et al. [1]. This distortion measure between two graphs $G_1$ and $G_2$, of the same size, was defined as

$$D_{\mathrm{H}}^2(G_1, G_2) = \sup_{t \geq 0} \left\| e^{tL_1} - e^{tL_2} \right\|_F^2,$$

where $L_i$ represents the graph Laplacian of $G_i$. This may be interpreted as measuring the maximum divergence, as a function of $t$, between diffusion processes starting from each vertex of each graph, as measured by the squared Euclidean distance between the column vectors of $e^{tL_i}$. Each column $v_j$ of $e^{tL_i}$ (which is called the Laplacian exponential kernel) describes a probability distribution of visits (by a random walk of duration $t$, with node transition probabilities given by the columns of $e^{L}$) to the vertices of $G_i$, starting at vertex $j$. This distance metric then measures the difference between the two graphs by comparing these probability distributions; the motivation for taking the supremum over all $t$ is that the value of the objective function at the maximum is the most these two distributions can diverge. See Fig 3 for an example of a distance calculation, with a characteristic peak. For further intuition about why the peak is the most natural place to take as the distance, rather than some other arbitrary time, note that at very early times and very late times, the probability distribution of vertex visits is agnostic to graph structure: at early times no diffusion has had a chance to take place, while at very late times the distribution of vertex visits converges to the stationary state for each connected component of the graph. Hence we are most interested in a regime of $t$-values in between these extremes, where differences in $G_1$ and $G_2$ are apparent in their differing probability distributions.
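For equal-size graphs, this measure can be computed directly; the following sketch (our own illustration, not the authors' code; the function name and the upper bound on $t$ are our choices) evaluates it with a bounded scalar maximization over $t$:

```python
# Hammond-style diffusion distance for two graphs of equal size:
# maximize ||e^{tL1} - e^{tL2}||_F^2 over t with a bounded scalar optimizer
# (in practice this inner problem has a single characteristic peak).
import numpy as np
import networkx as nx
from scipy.linalg import expm
from scipy.optimize import minimize_scalar

def hammond_distance_sq(G1, G2, t_max=10.0):
    A1, A2 = nx.to_numpy_array(G1), nx.to_numpy_array(G2)
    L1 = A1 - np.diag(A1.sum(axis=0))    # sign convention L = A - D
    L2 = A2 - np.diag(A2.sum(axis=0))
    obj = lambda t: -np.linalg.norm(expm(t * L1) - expm(t * L2), 'fro') ** 2
    res = minimize_scalar(obj, bounds=(0.0, t_max), method='bounded')
    return -res.fun, res.x               # (sup_t value, argmax t)

d2, t_star = hammond_distance_sq(nx.path_graph(8), nx.cycle_graph(8))
print(d2, t_star)
```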
Our contribution generalizes this measure to allow for graphs of differing size. We add two variables to this optimization: a prolongation operator, $P$ (represented as a rectangular matrix), and a time-scaling factor, $\alpha$. The dissimilarity between two graphs $G_1$ and $G_2$ (with Laplacians $L_i = L(G_i)$) is then defined as

$$D^2(G_1, G_2) = \sup_{t > 0} \inf_{\alpha > 0} \inf_{P \mid \mathcal{C}(P)} \left\| P e^{\frac{t}{\alpha} L_1} - e^{\alpha t L_2} P \right\|_F^2, \qquad (8)$$

where $\mathcal{C}(P)$ represents some set of constraints on the matrix $P$. For the remainder of this work we use $D(G_1, G_2)$ to refer to the distance and $D^2(G_1, G_2)$ to refer to the squared distance; this notation is chosen to simplify the exposition of some proofs. It will be convenient for later calculations to introduce and assume the concept of transitive constraints, by which we mean that for any constraint $\mathcal{C}$, satisfaction of $\mathcal{C}$ by $P_1$ and $P_2$ implies satisfaction of $\mathcal{C}$ by their product $P_1 P_2$ (when such a product is defined). Some (non-exclusive) examples of transitive constraints include orthogonality, particular forms of sparsity, and their conjunctions.
The simplest transitive constraint we will consider is that $P$ should be orthogonal. Intuitively, an orthogonal $P$ represents a norm-preserving map between nodes of $G_1$ and nodes of $G_2$, so we are measuring how well diffusion on $G_1$ approximates diffusion on $G_2$, as projected by $P$. Note that since in general $P$ is a rectangular matrix, it is not necessarily true that $P P^T = I$. We assume that $|G_1| = n_1 \leq n_2 = |G_2|$; if not, the order of the operands is switched, so that $P$ is always at least as tall as it is wide. We also briefly consider a sparsity constraint in Section 3.3 below. Since sparsity is more difficult to treat numerically, our default constraint will be orthogonality alone. Other constraints could include bandedness and other structural constraints (see Section 6). We also note that because $L$ is finite-dimensional, the exponential map is continuous, and therefore we can swap the order of optimization over $t$ and $\alpha$. The optimization procedure outlined in this paper optimizes these variables in the order presented above (namely: an outermost loop of maximization over $t$, a middle loop of minimization over $\alpha$ given $t$, and an innermost loop of minimization over $P$ given $t$ and $\alpha$).
The other additional parameter, $\alpha$, controls dilation between the passage of time in the two graphs, to account for different scales. Again, the intuition is that we are interested in the difference between structural properties of the graph (from the point of view of single-particle diffusion) independent of the absolute number of nodes in the graph. As an example, diffusion on an $n \times n$ grid is a reasonably accurate approximation of more rapid diffusion on a $2n \times 2n$ grid, especially when $n$ is very large. In our discussion of variants of this dissimilarity score, we will use the notation $D^2(G_1, G_2 \mid x = c)$ to mean restrictions of any of our distortion measure equations where variable $x$ is held to a constant value $c$; in cases where it is clear from context which variable is held to a fixed value $c$, we will write $D^2(G_1, G_2 \mid c)$.
At very early times the second- and higher-order terms of the Taylor series expansion of the matrix exponential vanish, so that $e^{tL} \approx I + tL$. This motivates the early-time or "linear" version of this distance, $\tilde{D}$:

$$\tilde{D}^2(G_1, G_2) = \inf_{\alpha > 0} \inf_{P \mid \mathcal{C}(P)} \left\| \frac{1}{\alpha} P L_1 - \alpha L_2 P \right\|_F^2 \qquad (9)$$

(note that the identity matrices cancel). The outermost optimization (maximization over $t$) is removed for this version of the distance, as $t$ can be factored out of the objective. For the exponential version of the dissimilarity score, we note briefly that the supremum over $t$ of our objective function must exist, since for any $G_1$, $G_2$ the infimum over all $P$ and $\alpha$ is bounded above by any particular choice of values for these variables. This upper bound, a continuous function of $t$, must have a supremum (possibly 0) at some $t^* \in [0, \infty)$; that supremum must be finite, and therefore so must the objective function.
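For concreteness, the following minimal sketch (ours; it assumes the reconstructed form of Eq (9) above, and the trivial choice of $P$ is purely illustrative) evaluates the linear objective for fixed $P$ and $\alpha$:

```python
# Evaluate the small-t ("linear") objective for a fixed orthogonal P and a
# given alpha; P here is a truncated identity, the simplest P^T P = I map.
import numpy as np
import networkx as nx

def lap(G):
    A = nx.to_numpy_array(G)
    return A - np.diag(A.sum(axis=0))   # L = A - D, as above

def linear_objective(L1, L2, P, alpha):
    """|| (1/alpha) P L1 - alpha L2 P ||_F^2 for a tall n2 x n1 matrix P."""
    return np.linalg.norm(P @ L1 / alpha - alpha * L2 @ P, 'fro') ** 2

L1, L2 = lap(nx.path_graph(4)), lap(nx.path_graph(8))
P = np.eye(8, 4)                        # a trivial orthogonal prolongation
print(linear_objective(L1, L2, P, alpha=2.0))
```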

Directedness of distance and constraints
We note that this distance measure, as defined so far, is directed: the operands $G_1$ and $G_2$ serve differing roles in the objective function. This additionally makes the constraint predicate $\mathcal{C}(P)$ ambiguous: when we state that $\mathcal{C}$ represents orthogonality, it is not clear whether we are referring to $P^T P = I$ or $P P^T = I$ (only one of which can be true for a non-square matrix $P$). To remove this ambiguity, we will, for the computations in the rest of this manuscript, define the distance metric to be symmetric: the distance between $G_1$ and $G_2$ with $|G_1| \leq |G_2|$ is always $D(G_1, G_2)$. $P$ is then always at least as tall as it is wide, so of the two choices of orthogonality constraint we select $P^T P = I$.

Variants of distance measure
Thus far we have avoided referring to this graph dissimilarity function as a "distance metric". As we shall see later, full optimization of Eqs (8) and (9) over $\alpha$ and $P$ is too loose, in the sense that the distances $D(G_1, G_2)$, $D(G_2, G_3)$, and $D(G_1, G_3)$ do not necessarily satisfy the triangle inequality. The same is true for $\tilde{D}$. See Section 5.3.1 for numerical experiments suggesting a particular parameter regime where the triangle inequality is satisfied. We thus define several restricted/augmented versions of both $D$ and $\tilde{D}$ which are guaranteed to satisfy the triangle inequality. These different versions are summarized in Table 1. They variously satisfy some of the conditions necessary for generalized versions of distance metrics, including symmetry ($d(x, y) = d(y, x)$ for all $x$, $y$) and relaxed forms of the triangle inequality.
Additionally, we note here that a distance measure on graphs using Laplacian spectra can at best be a pseudometric, since isospectral, non-isomorphic graphs are well-known to exist [24, 25]. Characterizing the conditions under which two graphs are isospectral but not isomorphic is an open problem in spectral graph theory. However, previous computational work has led to the conjecture that "almost all" graphs are uniquely defined by their spectra [26–28], in the sense that the probability of two graphs of size $n$ being isospectral but not isomorphic goes to 0 as $n \to \infty$. Furthermore, our numerical experiments have indicated that the violation of the triangle inequality is bounded, in the sense that $D(G_1, G_3) \leq 2.1\,\big(D(G_1, G_2) + D(G_2, G_3)\big)$ in all cases we examined. This means that even in the least restricted case our similarity measure may be a 2.1-infrapseudometric on graphs (and, since such isospectral, non-isomorphic graphs are relatively rare, it behaves like a 2.1-inframetric). As we will see in Section 3, in some more restricted cases we can prove triangle inequalities, making our measure a pseudometric. In Section 4.1, we discuss an algorithm for computing the optima in Eqs (8) and (9). First, we discuss some theoretical properties of this dissimilarity measure.

Table 1. Summary of this paper's investigation of different forms of our graph dissimilarity measure. In this work, we systematically explore properties of this measure given sparsity parameter $s = 0$, and various regimes of $t$ (fixed at some early time, or maximized over all $t$) and $\alpha$ (fixed at $\alpha = 1$, fixed at a constant power $r$ of the ratio of graph sizes, or minimized over all $\alpha$). We leave exploration of nonzero values of the sparsity parameter to future work. Variants not explicitly called out are not considered. In the case where $\alpha$ and $t$ are both optimized and $s > 0$, it is unclear which of the metric conditions GDD satisfies, hence the corresponding classification is left blank.

Triangle inequality for α = 1

Lemma 1. For any matrices $M$ and $P$, with $P$ satisfying $P^T P = I$ or $P P^T = I$,
$$\| P M \|_F \leq \| M \|_F \quad \text{and} \quad \| M P \|_F \leq \| M \|_F.$$

Proof. Suppose without loss of generality that $P^T P = I$. Then:
$$\| P M \|_F^2 = \operatorname{tr}\left( M^T P^T P M \right) = \operatorname{tr}\left( M^T M \right) = \| M \|_F^2.$$
If $P^T P = I$, then letting $\Pi = P P^T$, $\Pi$ is a projection operator satisfying $\Pi^T = \Pi = \Pi^2$. Then,
$$\| M P \|_F^2 = \operatorname{tr}\left( P^T M^T M P \right) = \operatorname{tr}\left( M^T M\, \Pi \right) \leq \operatorname{tr}\left( M^T M \right) = \| M \|_F^2.$$

Theorem 2. $\tilde{D}^2$ satisfies the triangle inequality for α = 1.
Proof. Let $P_{21}$ and $P_{32}$ be optimal prolongations realizing $\tilde{D}(G_1, G_2 \mid 1)$ and $\tilde{D}(G_2, G_3 \mid 1)$ respectively, where we write $\mathcal{C}(P_{32} P_{21})$ to signify that the product $P_{32} P_{21}$ satisfies the original transitive constraints on $P$, e.g. orthogonality and/or sparsity. Since the constraint predicate $\mathcal{C}(P)$ satisfies Eq (25), the product is feasible, so we may write
$$\tilde{D}(G_1, G_3 \mid 1) \leq \left\| P_{32} P_{21} L_1 - L_3 P_{32} P_{21} \right\|_F \leq \left\| P_{32} \left( P_{21} L_1 - L_2 P_{21} \right) \right\|_F + \left\| \left( P_{32} L_2 - L_3 P_{32} \right) P_{21} \right\|_F.$$
By Lemma 1, $\tilde{D}(G_1, G_3 \mid 1) \leq \tilde{D}(G_1, G_2 \mid 1) + \tilde{D}(G_2, G_3 \mid 1)$. We note that in this proof we use $L_1$, $L_2$, and $L_3$ (making this the small-$t$ or linear version of the objective function), but the same argument holds when all three are replaced with $e^{tL_i}$, so we also have Corollary 3. $D$ satisfies the triangle inequality for α = 1.
Proof. By the same calculation as in Theorem 2, with all $L_i$ replaced by $e^{t_c L_i}$, we have
$$D(G_1, G_3 \mid t_c) \leq D(G_1, G_2 \mid t_c) + D(G_2, G_3 \mid t_c)$$
for any constant $t_c$. Then, letting $t^* = \arg\sup_t D(G_1, G_3 \mid t)$, we have:
$$D(G_1, G_3) = D(G_1, G_3 \mid t^*) \leq D(G_1, G_2 \mid t^*) + D(G_2, G_3 \mid t^*) \leq D(G_1, G_2) + D(G_2, G_3).$$
Note that in the proofs of Theorem 2, Theorem 4, and Corollary 3, we assume that the constraint predicate $\mathcal{C}(P)$ includes at least orthogonality (so that we may apply Lemma 1). However, this constraint predicate could be more strict, e.g. include both orthogonality and sparsity. Hence these statements also apply to the $s > 0$ cases in Table 1, which we do not otherwise consider in this work: in our numerical experiments we (for reasons of computational simplicity) only require our optimization over $P$ to be orthogonally constrained.

Time-Scaled Graph Diffusion Distance
For any graphs $G_1$ and $G_2$, we define the Time-Scaled Graph Diffusion Distance (TSGDD) as a scaled version of the linear distance, with $\alpha$ fixed. Namely, for a constant $r \in \mathbb{R}$, let
$$\tilde{D}_r^2(G_1, G_2) = (n_1 n_2)^{-2r}\, \tilde{D}^2\!\left( G_1, G_2 \,\middle|\, \alpha = \left( \tfrac{n_2}{n_1} \right)^{r} \right).$$
The intuition for this version of the distance measure is that we are constraining the time dilation, $\alpha$, between $G_1$ and $G_2$ to be a power of the ratio of the two graph sizes. The factor $(n_1 n_2)^{-2r}$ is needed to ensure this version of the distance satisfies the triangle inequality, as seen in Theorem 4.

Theorem 4. The TSGDD, as defined above, satisfies the triangle inequality.

Proof. As above, let $G_1$, $G_2$, $G_3$ be three graphs with $n_i = |G_i|$ and $n_1 \leq n_2 \leq n_3$, and let $L_i$ be the Laplacian of $G_i$. Let $\mathcal{C}(P)$ represent a transitive constraint predicate, also as described previously. Decomposing $P = P_{32} P_{21}$ as in Theorem 2 and applying Lemma 1 to each term, the scaling factors combine (using $n_2 \leq n_3$) to give, for any fixed $r \in \mathbb{R}$,
$$\tilde{D}_r(G_1, G_3) \leq \tilde{D}_r(G_1, G_2) + \tilde{D}_r(G_2, G_3),$$
under the assumption, as in Eq (25), that $\mathcal{C}(P_{32}) \wedge \mathcal{C}(P_{21}) \Rightarrow \mathcal{C}(P_{32} P_{21})$.

Sparse-diffusion distance
We introduce the notation $\mathcal{C}(P)$ for a constraint predicate that must be satisfied by prolongation matrix $P$, which is transitive in the sense that:
$$\mathcal{C}(P_{32}) \wedge \mathcal{C}(P_{21}) \Rightarrow \mathcal{C}(P_{32} P_{21}). \qquad (25)$$
The simplest example is $\mathcal{C}(P) = \mathcal{C}_{\mathrm{orthog}}(P) \equiv (P^T P = I)$. However, sparsity can be introduced in transitive form by bounding the fanout of $P$ by a power of the size ratio,
$$\mathcal{C}_{\mathrm{sparse}}(P) \equiv \left( \max_{i,j} \operatorname{degree}_{i,j}(P) \leq \left( \tfrac{n_2}{n_1} \right)^{s} \right),$$
and transitivity holds since $n_2$ cancels out from the numerator and denominator of the product of the fanout bounds. Here, $\operatorname{degree}_{i,j}(M)$ is the total number of nonzero entries in row $i$ or column $j$ of $M$. This transitive sparsity constraint depends on a power-law parameter $s \geq 0$. When $s = 0$, there is no sparsity constraint.
Another form of sparsity constraint specifies a pattern of matrix entries which are allowed to be nonzero. Two simple examples (which are also transitive) are matrices which are constrained to be upper triangular, and matrices which are constrained to be of the form $A \otimes B$ where $A$ and $B$ are themselves both constrained to be sparse. More complicated are $n_1 \times n_2$ matrices which are constrained to be banded for some specified pattern of bands: more specifically, that there is a reordering of the rows and columns such that the number of diagonal bands (of width 1, slope $n_1/n_2$) with nonzero entries is less than $(n_1 n_2)^{q}$ for some $0 \leq q < 1$. For example, linear interpolation matrices between $d$-dimensional grids, with non-overlapping source regions, follow this constraint. As a final note on sparsity, we observe that any of the optimizations detailed in this work could also be performed with a sparsity term in the objective (for example, the $\ell_1$-norm of the matrix $P$, calculated as $\sum_i \sum_j |p_{ij}|$, is one possibility, as are terms which penalize $t$ or $\alpha$ far from 1), rather than explicit sparsity constraints. A potential method of performing this optimization would be to start by optimizing the non-sparse version of the objective function (as detailed in Section 4.1) and then slowly increase the strength of the regularization term.

Upper bounds for graph products
We now consider the case where we want to compute the distance between two graph box products, i.e. $D(G_1, G_2)$ where $G_1 = G_1^{(1)} \square G_1^{(2)}$ and $G_2 = G_2^{(1)} \square G_2^{(2)}$, and where the factor distances $D(G_1^{(1)}, G_2^{(1)})$ and $D(G_1^{(2)}, G_2^{(2)})$, with optimal prolongations $P^{(1)}$ and $P^{(2)}$, are known for some $t_c$, $\alpha_c$. Previous work [2] included a proof of a similar inequality for the small-$t$ ("linear") case of our objective function.

Theorem 5. Let $G_1$ and $G_2$ be graph box products as described above, and for a graph $G$ let $L(G)$ be its Laplacian. For fixed $t = t_c$, $\alpha = \alpha_c$, and $P^{(i)}$ as given above, for any $\lambda \in [0, 1]$, $D^2(G_1, G_2)$ is bounded above by the convex combination, with weights $\lambda$ and $1 - \lambda$, of the two single-factor bounds constructed below, where all distances are evaluated at $t = t_c$, $\alpha = \alpha_c$ (we omit those terms for simplicity of notation).

Proof. For graph box products $G_i = G_i^{(1)} \square G_i^{(2)}$, we have
$$L(G_i) = L(G_i^{(1)}) \otimes I + I \otimes L(G_i^{(2)}), \quad \text{and therefore} \quad e^{t L(G_i)} = e^{t L(G_i^{(1)})} \otimes e^{t L(G_i^{(2)})}.$$
We will abbreviate $E_i^{(j)} = e^{t_c L(G_i^{(j)})}$ (with the appropriate time scaling by $\alpha_c$) and take $P = P^{(1)} \otimes P^{(2)}$. Adding and subtracting the cross term $\left( P^{(1)} E_1^{(1)} \right) \otimes \left( E_2^{(2)} P^{(2)} \right)$ in the expansion of $P e^{t L(G_1)} - e^{t L(G_2)} P$ splits it into two Kronecker products, each containing the residual of exactly one factor; by Lemma 1, this yields one upper bound in terms of the factor distances. If we instead use $\left( E_2^{(1)} P^{(1)} \right) \otimes \left( P^{(2)} E_1^{(2)} \right)$ as the cross term in Eq (33), we obtain a second such bound. A linear combination of these two bounds gives us the desired bound. This has the additional consequence that the distance between members of a lineage of box products is controlled by the distances between their respective factors. An example of such a graph sequence is the sequence of two-dimensional square grids, which are each the box product of two identical one-dimensional grids, i.e. path graphs: $\mathrm{Sq}_n = \mathrm{Pa}_n \square \mathrm{Pa}_n$.

Spectral lower bound
In Theorem 7 we will derive and make use of an upper bound on the graph distance $\tilde{D}(G_1, G_2)$. This upper bound is calculated by constraining the variable $P$ to be not only orthogonal, but also of the form $P = U_2 M U_1^T$, where $M$ is the solution (i.e. "matching", in the terminology of that section) to a Linear Assignment Problem with costs given by a function of the eigenvalues of $L(G_1)$ and $L(G_2)$. In this section we derive a similar lower bound on the distance.
Let $G_1$ and $G_2$ be undirected graphs with Laplacians $L_1 = L(G_1)$ and $L_2 = L(G_2)$, and let $\alpha > 0$ be constant. By Eq (52), we have
$$\tilde{D}^2(G_1, G_2 \mid \alpha) = \inf_{P \mid \mathcal{C}(P)} \left\| \frac{1}{\alpha} \tilde{P} \Lambda_1 - \alpha \Lambda_2 \tilde{P} \right\|_F^2,$$
where $\Lambda_1$ and $\Lambda_2$ are diagonal matrices of the eigenvalues of $L_1$ and $L_2$ respectively, and where we used the explicit map $\tilde{P} = U_2^T P U_1$ as a change of basis. The upper bound on $\tilde{D}$ of Theorem 7 is achieved by constraining $P$ to be not only orthogonal, but related to a constrained matching problem between the two lists of eigenvalues: we converted the constraints on $P$ into equivalent constraints on $\tilde{P}$, and imposed additional constraints so that the resulting optimization (a linear assignment problem) is an upper bound. See the proof of Theorem 7 for the details of this derivation. We show in this section that a less constrained assignment problem is a lower bound on $\tilde{D}^2$. We do this by computing the same mapping $\tilde{P} = U_2^T P U_1$ and then dropping some of the constraints on $\tilde{P}$ (which is equivalent to dropping constraints on $P$, yielding a lower bound). The constraint $P^T P = I$ is the conjunction of $n_1^2$ constraints on the column vectors of $P$: if $p_i$ is the $i$th column of $P$, then $P^T P = I$ is equivalent to:
$$p_i \cdot p_i = 1 \quad \text{for all } i, \qquad (44)$$
$$p_i \cdot p_j = 0 \quad \text{for all } i \neq j. \qquad (45)$$
If we discard the constraints in Eq (45), we are left with the constraint that every column of $P$ must have unit norm.
Construct the "spectral lower bound matching" matrix P (SLB) as follows: For any α, this matrix is the solution to a matching problem (less constrained than the original optimization over all P) where each l ð1Þ j is assigned to the closest l ð2Þ i , allowing collisions. It clearly satisfies the constraints in Eq (44), but may violate those in Eq (45). Thus, we havẽ Various algorithms exist to rapidly find the member of a set of points which is closest to some reference point (for example, KD-Trees [29]). For any α, the spectral lower bound can be calculated by an outer loop over alpha and an inner loop which applies one of these methods. We do not consider joint optimization of the lower bound over P and α in this work.

Regularized distance
We can add a regularization term to the graph diffusion distance, as follows: define $D_{\mathrm{reg}}$ by augmenting the objective function of Eq (8) with terms which penalize the deviation of the time-scaled Laplacians from their unscaled counterparts. We can show analytically that this distance satisfies the triangle inequality:

Theorem 6. $D_{\mathrm{reg}}$ satisfies the triangle inequality.

Proof. For graphs $G_1$, $G_2$, $G_3$ and Laplacians $L_1$, $L_2$, $L_3$, for any fixed $t \geq 0$, we have, by the same argument as in Theorem 2:
$$D_{\mathrm{reg}}(G_1, G_3 \mid t) \leq D_{\mathrm{reg}}(G_1, G_2 \mid t) + D_{\mathrm{reg}}(G_2, G_3 \mid t).$$
Since this is true for any fixed $t$, let $t^* = \arg\sup_t D_{\mathrm{reg}}(G_1, G_3 \mid t)$; then
$$D_{\mathrm{reg}}(G_1, G_3) = D_{\mathrm{reg}}(G_1, G_3 \mid t^*) \leq D_{\mathrm{reg}}(G_1, G_2) + D_{\mathrm{reg}}(G_2, G_3).$$
We can construct a similar regularized version of the linear objective function, $\tilde{D}_{\mathrm{reg}}$. The term "regularized" here refers to the fact that the additional terms included in $D_{\mathrm{reg}}$ and $\tilde{D}_{\mathrm{reg}}$ penalize $\alpha$ distorting the respective Laplacians far from their original values. In practice, many of the theoretical guarantees provided earlier in this manuscript may not apply to optimization of the augmented objective function. Hence, a major area of future work will be modification of our optimization procedure to compute this form of distance.

Theory summary
Triangle inequalities are proven for some members of the proposed family of graph distortion or "distance" measures, including infinitesimal and finite diffusion time, a power law for sparsity, and/or a power law for the time scaling factor between coarse and fine scales. However, the case of an optimal (not power law) time conversion factor α needs to be investigated by numerical experiment, and that requires new algorithms, introduced in Section 4. We also show that in the case of distances between graph box products, optimization over P for the product graphs is bounded above by a monotonic function of the optimum over the component graphs. Table 1 summarizes the variants of our distance metric.

Numerical methods for optimal time conversion, α
Optimizing the α parameter for conversion between coarse and fine time scales in the proposed family of graph distance measures, in addition to optimizing the prolongation matrix P under transitive constraints CðPÞ, is a nontrivial numerical problem that in our experience seems to require new methods. We develop such methods here and apply them to investigate the resulting graph "distance" measure in the next section.

Algorithm development
In this section, we describe the algorithm used to calculate upper bounds on graph distances as the joint optima (over $P$, $t$, and $\alpha$) of the distance Eqs (8) and (9), under orthogonality constraints only, i.e. the case $\mathcal{C}(P) = \{ P \mid P^T P = I \}$. At the core of both algorithms is a subroutine to solve the Linear Assignment Problem (LAP; see Eq (5)) repeatedly, in order to find the subpermutation matrix which is optimal at a particular value of $\alpha$. Namely, we are interested in calculating $\tilde{D}^2$ as
$$\inf_{\alpha > 0} \inf_{P \mid \mathcal{C}(P)} \left\| \frac{1}{\alpha} P L_1 - \alpha L_2 P \right\|_F^2,$$
which, for orthogonality or any other compact constraint on $P$, is equal to
$$\inf_{\alpha > 0} \min_{P \mid \mathcal{C}(P)} \left\| \frac{1}{\alpha} P L_1 - \alpha L_2 P \right\|_F^2.$$
However, we have found that the unique structure of this optimization problem admits a specialized procedure which is faster and more accurate than nested univariate optimization of $\alpha$ and $t$ (where each innermost function evaluation consists of a full optimization over $P$ at some $t$, $\alpha$). We first briefly describe the algorithm used to find the optimal $P$ and $\alpha$ for $\tilde{D}^2$. The formal description of the algorithm is given by Algorithm 1. In both cases, we reduce the computational complexity of the optimization over $P$ by imposing the additional constraint that $P$ must be a subpermutation matrix when rotated into the spectral basis (we define subpermutations in the proof of Theorem 7). This constraint is compatible with the orthogonality constraint (all subpermutation matrices are orthogonal, but not vice versa). The tradeoff of this reduction of computational complexity is that we can only guarantee that our optima are upper bounds of the optima over all orthogonal $P$. However, in practice, this bound seems to be tight: we have yet to find an example where orthogonally-constrained optimization was able to improve in objective function value over optimization constrained to subpermutation matrices. Therefore, we shall for the remainder of this paper refer to the optima calculated as distance values, when strictly they are distance upper bounds. We also note here that a distance lower bound is also possible to calculate by relaxing the constraints in $\mathcal{C}(P)$ (for instance, by replacing the optimization over all $P$ with a less constrained matching problem; see Section 3.5).
4.1.1 Optimization of $\tilde{D}^2$. Algorithm 1: Abbreviated pseudocode for the algorithm described in Section 4.1.1, for computing $\inf_{P, \alpha} \tilde{D}^2$.

1: procedure $\tilde{D}$(L1, L2, α_low, α_high)
2: Compute λ(1), λ(2) as the eigenvalues of L1 and L2.
3: Initialize the list of optimal matchings by solving the linear assignment problem (with costs as in Eq (49)) at α_low and α_high.
4: Until the current list of matchings is not expanded in the following step, or the entire interval [α_low, α_high] is marked as explored:
5: Attempt to expand the list of optimal matchings by solving a linear assignment problem at the α where the cost curves of two matchings (currently in the list) intersect. If no better assignment exists, then mark the interval covered by those matchings as explored, as guaranteed by Theorem 9.
6: Return the lowest-cost M and its optimal α.
7: end procedure

Joint optimization of $\tilde{D}^2$ over $\alpha$ and $P$ is a nested optimization problem (see [30] and [31] for a description of nested optimization), with potential combinatorial optimization over $P$ dependent on each choice of $\alpha$. Furthermore, the function $f(\alpha) = \inf_{P \mid \mathcal{C}(P)} \tilde{D}^2(G_1, G_2 \mid \alpha)$ is multimodal and continuous, but with in general discontinuous derivative (see Fig 4). Univariate optimization procedures such as Golden Section Search result in many loops of some procedure to optimize over $P$, which in our restricted case must each time compute a full solution to a LAP with $n_2 \times n_1$ weights. In our experience, this means that these univariate methods have a tendency to get stuck in local optima. We reduce the total number of calls to the LAP solver, as well as the size of the LAPs solved, by taking advantage of several unique properties of the optimization as a function of $\alpha$. When the optimal $P^{(1)}$ and $P^{(2)}$ are known for $\alpha_1$ and $\alpha_2$, then for any $\alpha_c$ such that $\min(\alpha_1, \alpha_2) \leq \alpha_c \leq \max(\alpha_1, \alpha_2)$, the optimal $P^{(c)}$ at $\alpha_c$ must agree with $P^{(1)}$ and $P^{(2)}$ on any assignments where those two agree (Theorem 9). Thus, the optimization over $P$ at $\alpha_c$ is already partially solved given the solutions at $\alpha_1$ and $\alpha_2$, and so we need only re-compute the remaining (smaller) subproblem on the set of assignments where $P^{(1)}$ and $P^{(2)}$ disagree. This has two consequences for our search over $\alpha$: first, the size of the LAP problems which must be solved at each step decreases over time (as we find $P$-optima for a denser and denser set of $\alpha$). Secondly, these theoretical guarantees mean that we can mark intervals of $\alpha$-values as explored (meaning we have provably found the $P$ which are optimal over the interval) and thus do not have to perform the relatively expensive optimization over $P$ for any $\alpha$ in that interval.
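The sketch below (a simplified re-implementation of the idea, not the authors' released code) illustrates this interval-exploration strategy, under the spectral-basis cost form reconstructed in Theorem 7 and Lemma 8: a matching $m$ has cost curve $f_m(\alpha) = A/\alpha^2 - 2B_m + C_m\alpha^2$ with $A$ shared by all matchings, so two curves cross at most once for $\alpha > 0$, and an interval whose crossing produces no new matching can be marked as explored.

```python
# Simplified sketch of Algorithm 1's interval exploration over alpha,
# assuming the reconstructed cost c = ((1/a)*lam1_j - a*lam2_i)^2 and
# len(lam1) <= len(lam2). All function names are our own.
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_matching_at(lam1, lam2, alpha):
    """Full LAP solve at a single alpha; m[j] is the index matched to lam1[j]."""
    cost = (lam1[:, None] / alpha - alpha * lam2[None, :]) ** 2  # n1 x n2
    rows, cols = linear_sum_assignment(cost)                     # rectangular OK
    m = np.empty(len(lam1), dtype=int)
    m[rows] = cols
    return tuple(m)

def curve_coeffs(lam1, lam2, m):
    """Coefficients (A, B, C) of f_m(alpha) = A/alpha^2 - 2B + C*alpha^2."""
    matched = lam2[list(m)]
    return np.sum(lam1 ** 2), np.sum(lam1 * matched), np.sum(matched ** 2)

def frontier(lam1, lam2, a_lo, a_hi, tol=1e-8):
    """Set of matchings optimal somewhere in [a_lo, a_hi], found recursively."""
    m_lo = best_matching_at(lam1, lam2, a_lo)
    m_hi = best_matching_at(lam1, lam2, a_hi)
    if m_lo == m_hi or a_hi - a_lo < tol:
        return {m_lo, m_hi}
    _, B1, C1 = curve_coeffs(lam1, lam2, m_lo)
    _, B2, C2 = curve_coeffs(lam1, lam2, m_hi)
    # Crossing of the two cost curves: C1*a^2 - 2*B1 = C2*a^2 - 2*B2.
    a_sq = 2.0 * (B1 - B2) / (C1 - C2) if not np.isclose(C1, C2) else -1.0
    a_c = np.sqrt(a_sq) if a_sq > 0 else 0.5 * (a_lo + a_hi)
    a_c = float(np.clip(a_c, a_lo + 1e-12, a_hi - 1e-12))
    m_c = best_matching_at(lam1, lam2, a_c)
    if m_c in (m_lo, m_hi):
        return {m_lo, m_hi}   # no better matching at the crossing: explored
    return (frontier(lam1, lam2, a_lo, a_c, tol)
            | frontier(lam1, lam2, a_c, a_hi, tol))

def linear_distance_sq(lam1, lam2, a_lo=0.05, a_hi=10.0):
    best = np.inf
    for m in frontier(lam1, lam2, a_lo, a_hi):
        A, B, C = curve_coeffs(lam1, lam2, m)
        # df/da = -2A/a^3 + 2C*a = 0  =>  optimal a = (A/C)^(1/4)
        a = float(np.clip((A / C) ** 0.25, a_lo, a_hi)) if C > 0 else a_hi
        best = min(best, A / a ** 2 - 2 * B + C * a ** 2)
    return best

lam1 = np.array([0.0, -0.6, -2.0])
lam2 = np.array([0.0, -0.3, -1.0, -3.2, -3.9])
print(linear_distance_sq(lam1, lam2))
```

Here best_matching_at always solves the full LAP for simplicity; the actual algorithm additionally shrinks each LAP to the assignments on which the bracketing matchings disagree.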

4.1.2 Optimization of $D^2$. Algorithm 2: Abbreviated pseudocode for the algorithm described in Section 4.1.2, for computing $\sup_t \inf_{P, \alpha} D^2$.
1: procedure D(L1, L2, α_low, α_high, step size ε)
2: Compute λ(1), λ(2) as the eigenvalues of L1 and L2.
3: Solve the linear version of the problem using Algorithm 1, obtaining α*, M*. According to the argument presented in the definition of linear distance (Eq 9), this solution holds for very small t. Keep the entire frontier of matchings found during the execution of Algorithm 1.
4: Until the objective value d(t) decreases:
5: Increment t by the step size ε.
6: Use the linear algorithm with e^{tL1} and e^{tL2} as the input matrices, initializing the list of matchings with those found at the previous t.
7: Set d(t) to the objective function value at α*, M*, where α*, M* are the optima from the previous step.
8: Return max_t d(t).
9: end procedure

Many of the theoretical guarantees underlying our algorithm for computing $\tilde{D}^2$ no longer hold for the exponential version of the distance. We adapt our linear-version procedure into an algorithm for computing this version, with the caveat that the lack of these guarantees means that our upper bound on the exponential version may be looser than that on the linear version. It is still clearly an upper bound, since the $\alpha$ and $P$ found by this procedure satisfy the given constraints $\alpha > 0$ and $P^T P = I$. In particular, we have observed cases where the exponential-distance analog of Theorem 9 would not hold, meaning we cannot rule out $\alpha$-intervals as we can in the linear version. Thus, this upper bound may be looser than that computed for the linear objective function.
For the exponential version of the algorithm, we first compute the list of optimal $P$ for the linear version, assuming (since $e^{tL} \approx I + tL$ for very small $t$) that this is also the list of optimal $P$ for the exponential version of the objective function at some low $t$. We proceed to increment $t$ with some step size $\Delta t$, in the manner of a continuation method [32]. At each new $t$ value, we search for new optimal $P$ along the currently known frontier of optima as a function of $\alpha$. When a new $P$ is found as the intersection of two known $P_i$, $P_{i+1}$, it is inserted into the list, which is kept in order of increasing $\alpha$. For each $P$ in this frontier, we find the optimal $\alpha$, keeping $P$ and $t$ constant. Assuming $\inf_P \inf_\alpha D^2(G_1, G_2 \mid t_c)$ is unimodal as a function of $t_c$, we increase $t_c$ until $\inf_P \inf_\alpha D^2(G_1, G_2 \mid t_c) \geq \inf_P \inf_\alpha D^2(G_1, G_2 \mid t_c + \Delta t)$, storing all $P$ matrices found as optima at each $t_c$ value. $P$ which were on the lower convex hull at some prior value of $t$ but not the current value are retained, as they may regain optimality for some $\alpha$-range at a future value of $t$ (we have observed this in practice). For this list $P_1, P_2, \ldots, P_m$, we then compute the supremum over $t$ of the minimum, over the list, of the objective function value. Since the exponential map is continuous, and we are incrementing $t$ by very small steps, we also propose the further computational shortcut of storing the list of optimal $\alpha$ at time $t$ to use as starting points for the optimization at $t + \Delta t$. In practice, this made little difference in the runtime of our optimization procedure.

Algorithm correctness proof
Theorem 7. For any two graphs $G_1$ and $G_2$ with Laplacians $L(G_1)$ and $L(G_2)$, for fixed $\alpha$, the optimization over $P$ given in the innermost loop of Eq (9) is upper bounded by a Linear Assignment Problem as defined in Eq (5). This LAP is given by taking $R$ to be the eigenvalues $\lambda^{(1)}_j$ of $L(G_1)$ and $S$ to be the eigenvalues $\lambda^{(2)}_i$ of $L(G_2)$, with the cost of a pair (equivalently, one entry of the cost matrix $C$) given by
$$c\left( \lambda^{(1)}_j, \lambda^{(2)}_i \right) = \left( \frac{1}{\alpha} \lambda^{(1)}_j - \alpha \lambda^{(2)}_i \right)^2. \qquad (49)$$

Proof. $L(G_1)$ and $L(G_2)$ are both real symmetric matrices, so they may be diagonalized as
$$L(G_i) = U_i \Lambda_i U_i^T, \qquad (50)$$
where the $U_i$ are rotation matrices, and the $\Lambda_i$ are diagonal matrices with the eigenvalues $\lambda^{(i)}_1, \lambda^{(i)}_2, \ldots, \lambda^{(i)}_{n_i}$ along the diagonal. Because the Frobenius norm is invariant under rotation, we have:
$$\left\| \frac{1}{\alpha} P L_1 - \alpha L_2 P \right\|_F^2 = \left\| \frac{1}{\alpha} \tilde{P} \Lambda_1 - \alpha \Lambda_2 \tilde{P} \right\|_F^2, \qquad (51)$$
where $\tilde{P} = U_2^T P U_1$. Because the $U_i$ are orthogonal, the transformation $\tilde{P} = U_2^T P U_1$ preserves orthogonality, so
$$\inf_{P \mid P^T P = I} \left\| \frac{1}{\alpha} P L_1 - \alpha L_2 P \right\|_F^2 = \inf_{\tilde{P} \mid \tilde{P}^T \tilde{P} = I} \left\| \frac{1}{\alpha} \tilde{P} \Lambda_1 - \alpha \Lambda_2 \tilde{P} \right\|_F^2.$$
For any given $\alpha$,
$$\inf_{\tilde{P} \mid \tilde{P}^T \tilde{P} = I} \left\| \frac{1}{\alpha} \tilde{P} \Lambda_1 - \alpha \Lambda_2 \tilde{P} \right\|_F^2 \leq \inf_{\tilde{P} \mid \mathrm{subperm}(\tilde{P})} \left\| \frac{1}{\alpha} \tilde{P} \Lambda_1 - \alpha \Lambda_2 \tilde{P} \right\|_F^2,$$
where $\mathrm{subperm}(\tilde{P})$ could be any other condition more strict than the constraint $\tilde{P}^T \tilde{P} = I$. Here we take this stricter constraint to be the condition that $\tilde{P}$ is a subpermutation matrix: an orthogonal matrix (i.e. $\tilde{P}^T \tilde{P} = I$) for which $\tilde{P} \in \{0, 1\}^{n_2 \times n_1}$. Equivalently, a subpermutation matrix is a $\{0, 1\}$-valued matrix $[\tilde{p}_{ji}]$ such that for each $i \in \{1, \ldots, n_1\}$, exactly one $j \in \{1, \ldots, n_2\}$ takes the value 1 rather than 0 (so $\sum_{j=1}^{n_2} \tilde{p}_{ji} = 1$), and for each $j \in \{1, \ldots, n_2\}$, either zero or one $i \in \{1, \ldots, n_1\}$ takes the value 1 rather than 0 (so $\sum_{i=1}^{n_1} \tilde{p}_{ji} \leq 1$). Furthermore, this optimization is exactly a linear assignment problem of eigenvalues of $L(G_1)$ to eigenvalues of $L(G_2)$, with the cost of a pair $\left( \lambda^{(1)}_j, \lambda^{(2)}_i \right)$ given by Eq (49). Note also that the same argument applies to the innermost two optimizations of the calculation of $D^2$ (the exponential version of the diffusion distance), as well as $D_r^2$. In the $D^2$ case the entries of the cost matrix are instead given by
$$c\left( \lambda^{(1)}_j, \lambda^{(2)}_i \right) = \left( e^{\frac{t}{\alpha} \lambda^{(1)}_j} - e^{\alpha t \lambda^{(2)}_i} \right)^2.$$
If we instead loosen the constraints on $P$, we can calculate a lower bound on the distance. See Section 3.5 for lower bound details.
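An end-to-end sketch of this upper bound at fixed $\alpha$ (ours; it assumes the reconstructed cost of Eq (49) and uses scipy in place of the paper's lapsolver):

```python
# Theorem 7 upper bound at fixed alpha: diagonalize both Laplacians, then
# solve a single rectangular LAP between the two spectra.
import numpy as np
import networkx as nx
from scipy.linalg import eigh
from scipy.optimize import linear_sum_assignment

def dtilde_sq_upper_bound(G1, G2, alpha=1.0):
    def lam(G):
        A = nx.to_numpy_array(G)
        return eigh(A - np.diag(A.sum(axis=0)), eigvals_only=True)
    l1, l2 = lam(G1), lam(G2)            # assumes |G1| <= |G2|
    cost = (l1[:, None] / alpha - alpha * l2[None, :]) ** 2
    rows, cols = linear_sum_assignment(cost)   # rectangular LAP
    return cost[rows, cols].sum()

print(dtilde_sq_upper_bound(nx.path_graph(6), nx.cycle_graph(9)))
```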
Recall that our definition of a 'matching' in Section 1.3 was a $P$ matrix representing a particular solution to the linear assignment problem with costs given as in Eq (49). For given $G_1$, $G_2$, and some matching $M$, let
$$f_M(\alpha) = \left\| \frac{1}{\alpha} \left( U_2 M U_1^T \right) L_1 - \alpha L_2 \left( U_2 M U_1^T \right) \right\|_F^2, \qquad (53)$$
where $U_1$, $U_2$ diagonalize $L_1$ and $L_2$ as in Eq (50).

Lemma 8. For two unique matchings $M_1$ and $M_2$ (for the same $G_1$, $G_2$), the equation $f_{M_1}(\alpha) - f_{M_2}(\alpha) = 0$ has at most one real positive solution in $\alpha$. This follows from the fact that when $P$ and $t$ are fixed, the objective function is a rational function in $\alpha$ (see Eq (51)), with a numerator quadratic in $\alpha^2$ and an asymptote at $\alpha = 0$.
Proof. By Eq (51), we have
$$f_{M_k}(\alpha) = \frac{1}{\alpha^2} \sum_{(i,j) \in M_k} \left( \lambda^{(1)}_j \right)^2 - 2 \sum_{(i,j) \in M_k} \lambda^{(1)}_j \lambda^{(2)}_i + \alpha^2 \sum_{(i,j) \in M_k} \left( \lambda^{(2)}_i \right)^2.$$
Abbreviating the sums, we have
$$f_{M_1}(\alpha) - f_{M_2}(\alpha) = \frac{A_1 - A_2}{\alpha^2} - 2 (B_1 - B_2) + (C_1 - C_2)\, \alpha^2,$$
and, noting that $A_1 = A_2$ (since every matching assigns all $n_1$ eigenvalues of $L_1$), setting this difference to zero yields
$$\alpha = \pm \sqrt{ \frac{2 (B_1 - B_2)}{C_1 - C_2} }.$$
Since $A_1$, $A_2$, $C_1$, $C_2$ are all nonnegative reals, at most one of these roots is positive. We will say that a matching $M$ "assigns" $j$ to $i$ if and only if $M_{ij} = 1$.

Theorem 9. If two matchings $M_1$ and $M_3$ which yield optimal upper bounds for the linear distance $\tilde{D}^2$ (at $\alpha_1 \leq \alpha$ and $\alpha_3 \geq \alpha$ respectively) agree on a set of assignments, then the optimal $M$ at $\alpha$ must also agree with that set of assignments.
Proof. We need the following lemmas:

Lemma 10. If an optimal matching assigns $i$ to $m(i)$ (so that eigenvalue $\lambda^{(1)}_i$ of $G_1$ is paired with $\lambda^{(2)}_{m(i)}$ of $G_2$ in the sum of costs in Eq (49)), then the sequence $m(1), m(2), \ldots, m(n_1)$ is monotonic increasing.
Proof. This follows from the fact that the two sequences of eigenvalues are monotonic nondecreasing, so if there is a 'crossing' ($i_1 < i_2$ but $m(i_2) < m(i_1)$), then the new matching obtained by uncrossing those two pairs (performing a 2-opt step as defined in [33]) has strictly lesser objective function value. Hence an optimal matching cannot contain any such crossings.
Lemma 11. For all positive real $\alpha^* \geq \epsilon > 0$, let $M_1$ be an optimal matching at $\alpha^* - \epsilon$ and $M_2$ be optimal at $\alpha^* + \epsilon$. For $1 \leq i \leq n_1$, let $s_1(i)$ and $s_2(i)$ be the indices of $\lambda^{(2)}$ paired with $i$ in $M_1$ and $M_2$, respectively. Then for all $i$, $s_1(i) \leq s_2(i)$.
Proof. Call a maximal set of consecutive indices on which $s_1$ and $s_2$ disagree a run. Any pair of matchings as defined above consists of a sequence of runs, where we allow a run to be trivial, i.e. a single index.
Next, we show that within a run, we must have $s_1(i) < s_2(i)$ for all $i$. Let $S = \{l, l+1, \ldots, l+k\}$ be a run. By optimality of $M_1$ at $\alpha^* - \epsilon$ and of $M_2$ at $\alpha^* + \epsilon$, neither matching can have lower cost than the other's assignments on $S$ at the other's value of $\alpha$; writing out these two optimality conditions, simplifying each, and summing the results yields an inequality on the matched eigenvalues of $\lambda^{(2)}$. However, since the $\lambda^{(2)}_j$ are monotonic nondecreasing, this means we cannot also have $s_1(i) > s_2(i)$ for all $i \in S$, since that would imply the reverse inequality. Therefore, in a run of arbitrary length, all indices must move 'forward' (meaning that $s_1(i) < s_2(i)$ for all $i$ in the run), and so (since any pair of matchings optimal at such $\alpha$ define a set of runs) we must have $s_1(i) \leq s_2(i)$. This completes the proof of the lemma.
Thus, for three matchings $M_1$, $M_2$, $M_3$ which are optimal at a sequence of $\alpha_1 \leq \alpha_2 \leq \alpha_3$, we must have $s_1(i) \leq s_2(i) \leq s_3(i)$ for all $i$. In particular, if $s_1(i) = s_3(i)$, we must also have $s_1(i) = s_2(i) = s_3(i)$. Finally, we prove the interval-exploration guarantee used in step 5 of Algorithm 1. This follows directly from the fact that $f_{M_1}(\alpha)$ and $f_{M_2}(\alpha)$ (as defined in Eq (53)) can only meet at one real positive value of $\alpha$ (Lemma 8). Say that the cost curves for $M_1$ (known to be optimal at $\alpha = \alpha_1$) and $M_3$ (optimal at $\alpha = \alpha_3$) meet at $\alpha = \alpha_2$, and furthermore assume that $\alpha_1 \leq \alpha_2 \leq \alpha_3$. If some other matching $M_2$ improves over (meaning, has lesser objective function value as a function of $\alpha$ than) $M_1$ or $M_3$ anywhere in the interval $[\alpha_1, \alpha_3]$, it must improve over both at $\alpha = \alpha_2$, since it may intersect each of these cost curves at most once on this interval. If $M_1$ and $M_3$ are both optimal at their intersection point (meaning no such distinct $M_2$ exists), then we know that no other matching improves on either of them over the interval $[\alpha_1, \alpha_3]$, and we may therefore mark that interval as explored during the outermost loop (optimization over $\alpha$) of Algorithm 1.
Together, the preceding properties verify that our algorithm will indeed find the joint optimum over all $\alpha$ and $P$ (for fixed $t = c$, for $\tilde{D}$, subject to subpermutation constraints on $P$): they allow us to find the entire set of subpermutation matrices $P$ which appear on the lower convex hull of distance as a function of $\alpha$.

Implementation details
We implement Algorithms 1 and 2 in the programming language Python (version 3.6.1) [34]. Numerical arrays were stored using the numpy package [35]. Our inner LAP solver was the package lapsolver [36]. Univariate optimization over $t$ and $\alpha$ was performed with the 'bounded' method of the scipy.optimize package [37], with bounds set at [0, 10.0] for each variable and an input tolerance of $10^{-12}$. Laplacians were computed with the laplacian method from the package networkX [38], and their eigenvalues were computed with scipy.linalg.eigh.
Because of numerical precision issues arising during eigenvalue computation, it can be difficult to determine when two matchings agree, using eigenvalue comparison. In practice we ignore this issue and assume that two matchings are only identical if they associate the same indices of the two lists of eigenvalues. This means we may be accumulating multiple equivalent representations of the same matching (up to multiplicity of eigenvalues) during our sweeps through t and α. We leave mitigating this inefficiency for future work.
Code for computing diffusion distance, both with our algorithm and with naive univariate optimization, may be found in the S1 Data associated with this paper, as well as a maintained GitHub repository [39].

Graph lineages
In this subsection we introduce several graph lineages for which we will compute various intra- and inter-lineage distances. Three of these are well-known lineages of graphs, and the fourth is defined in terms of a product of complete graphs:
Path Graphs (Pa_n): 1D grid graphs of length $n$, with aperiodic boundary conditions.
Cycle Graphs (Cy_n): 1D grid graphs of length $n$, with periodic boundary conditions.
Square Grid Graphs (Sq_n): 2D grid graphs of dimensions $n \times n$, with aperiodic boundary conditions; $\mathrm{Sq}_n = \mathrm{Pa}_n \square \mathrm{Pa}_n$.
"Multi-Barbell" Graphs (Ba_n): constructed as $\mathrm{Cy}_n \square K_n$, where $K_n$ is the complete graph on $n$ vertices.
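These four lineages are straightforward to construct with networkx (our own sketch; the function names mirror the notation above):

```python
# Construct the four graph lineages used in the experiments.
import networkx as nx

def Pa(n): return nx.path_graph(n)                     # aperiodic 1D grid
def Cy(n): return nx.cycle_graph(n)                    # periodic 1D grid
def Sq(n): return nx.cartesian_product(Pa(n), Pa(n))   # Sq_n = Pa_n box Pa_n
def Ba(n): return nx.cartesian_product(Cy(n), nx.complete_graph(n))  # Cy_n box K_n

for n in (3, 6, 12):
    print(n, len(Sq(n)), len(Ba(n)))   # sizes grow as n^2 within each 2D lineage
```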
These families are all illustrated in Fig 5. Additionally, some example distances between elements of these graph lineages are illustrated in Fig 6. In these tables we see that, in general, intra-lineage distances are small and inter-lineage distances are large.

Numerical optimization methods
We briefly discuss here the other numerical methods we have used to calculate $\tilde{D}^2$ and $D^2$.
Nelder–Mead in Mathematica. For very small graph pairs ($n_1 \times n_2 \leq 100$) we are able to find optimal $P$, $\alpha$, $t$ using constrained optimization in Mathematica 11.3 [40] using NMinimize, which uses Nelder–Mead as its backend by default. The size limitation made this approach unusable for any real experiments.
Orthogonally Constrained Optimization. We also tried a variety of codes specialized for numeric optimization subject to orthogonality constraints. These included (1) the python package PyManopt [41], a code designed for manifold-constrained optimization; (2) gradient descent in Tensorflow using the penalty function $g(P) = c \| P^T P - I \|_F$ (with $c \ll 1$ a small positive constant weight) to maintain orthogonality; as well as (3) an implementation of the Cayley reparametrization method from [42] (written by the authors of that same paper). In our experience, these codes were slower, with poorer scaling with problem size, than combinatorial optimization over subpermutation matrices, and did not produce improved results on our optimization problem.
Black-Box Optimization Over α. We compare in more detail two methods of joint optimization over $\alpha$ and $P$ when $P$ is constrained to be a subpermutation matrix in the diagonal basis for $L(G_1)$ and $L(G_2)$. Specifically, we compare our approach given in Algorithm 1 to univariate optimization over $\alpha$, where each function evaluation consists of full optimization over $P$. Fig 7 shows the results of this experiment. We randomly sample pairs of graphs as follows:
1. $n_1$ is drawn uniformly from [5, 120].
We compute the linear version of distance for each pair. Because our algorithm finds all of the local minima as a function of $\alpha$, we compute the cost of the golden section approach as the summed cost of multiple golden section searches in $\alpha$: one GS search starting from the initial bracket $[0.618\alpha^*, 1.618\alpha^*]$ for each local minimum $\alpha^*$ found by our algorithm. We see that our algorithm is always faster by at least a factor of 10, and occasionally faster by as much as a factor of 10³. This can be attributed to the fact that the golden section search is unaware of the structure of the linear assignment problem: it must solve a full $n_2 \times n_1$ linear assignment problem for each value of $\alpha$ it explores. In contrast, our algorithm is able to use information from prior calls to the LAP solver, and therefore solves a series of LAP problems whose sizes are monotonically nonincreasing.
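A sketch of this baseline (ours; it reuses dtilde_sq_upper_bound from the sketch following Theorem 7, and scipy's 'bounded' Brent search stands in for the golden-section search described here):

```python
# Baseline: black-box univariate search over alpha, where every function
# evaluation pays for a full LAP solve from scratch.
import networkx as nx
from scipy.optimize import minimize_scalar

def baseline_linear_distance(G1, G2, a_lo=0.05, a_hi=10.0):
    res = minimize_scalar(lambda a: dtilde_sq_upper_bound(G1, G2, alpha=a),
                          bounds=(a_lo, a_hi), method='bounded')
    return res.fun, res.x

print(baseline_linear_distance(nx.path_graph(6), nx.cycle_graph(9)))
```

Each evaluation here re-solves an $n_2 \times n_1$ assignment problem from scratch, which is exactly the repeated work that Algorithm 1's interval bookkeeping avoids.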

Triangle inequality violation of $D$ (exponential distance) and $\tilde{D}$ (linear distance).
As stated in Section 2.3, our full graph dissimilarity measure does not necessarily obey the triangle inequality. In this section we systematically explore the conditions under which it is violated. For approximately $3 \times 10^4$ triplets of graphs, we compute a discrepancy score, $\mathrm{Disc}(G_1, G_2, G_3)$, which measures the degree to which the triplet violates the triangle inequality (i.e. falls outside of the unit interval [0, 1]). It is clear that, especially for the linear definition of the distance, the triangle inequality is not always satisfied. However, we also observe that (for graphs of these sizes) the discrepancy score is bounded: no triple violates the triangle inequality by more than a factor of approximately 1.8. This is shown by the histogram of discrepancies in Fig 8. Additionally, the triangle inequality is satisfied in 28184 (95.2%) of cases. We see similar but even stronger results when we run the same experiment with $D^2$ instead of $\tilde{D}^2$; these may also be seen in Fig 8. We calculated the discrepancy score analogously, but with $D$ substituted for $\tilde{D}$. We see similarly that the degree of violation is bounded: in this case, no triple violated the triangle inequality by a factor of more than 5, and the triangle inequality was satisfied in 99.8% of the triples. In both of these cases, the triangle inequality violations may be a result of our optimization procedure finding local minima/maxima for one or more of the three distances computed. We also repeat the above procedure for the same triplets of graphs, but with distances computed out of order of increasing vertex size: calculating $\mathrm{Disc}(G_2, G_1, G_3)$ and $\mathrm{Disc}(G_3, G_2, G_1)$. All of these results are plotted in Fig 8.

Fig 8. Histograms of triangle inequality violation. These plots show the distribution of $\mathrm{Disc}(G_1, G_2, G_3)$, as defined in the text, for the cases (a) top: the linear or small-time version of distance and (b) bottom: the exponential or arbitrary-time version of distance. We see that for the sizes of graph we consider, the largest violation of the triangle inequality is bounded, suggesting that our distance measure may be an infra-ρ-pseudometric for some value of ρ ≈ 1.8 (linear version) or ρ ≈ 5.0 (exponential version). See Table 1 for a summary of the distance metric variants introduced in this paper. We also plot the same histogram for out-of-order (by vertex size) graph sequences: $\mathrm{Disc}(G_2, G_1, G_3)$ and $\mathrm{Disc}(G_3, G_2, G_1)$. Each plot has a line at x = 1, the maximum discrepancy score for which the underlying distances satisfy the triangle inequality. https://doi.org/10.1371/journal.pone.0249624.g008
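A sketch of the triplet experiment follows. The recovered text does not spell out the score, so we assume the natural ratio form, which exceeds 1 exactly when the triple violates the triangle inequality.

```python
# Assumed (hypothetical) discrepancy score:
#   Disc(G1, G2, G3) = d(G1, G3) / (d(G1, G2) + d(G2, G3))
def disc(d, g1, g2, g3):
    """d: a pairwise graph distance function.
    Returns > 1 iff the triple (g1, g2, g3) violates the triangle inequality."""
    return d(g1, g3) / (d(g1, g2) + d(g2, g3))
```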

Intra- and inter-lineage distances.
We compute pairwise distances for sequences of graphs in the graph lineages displayed in Fig 5. For each pair of graph families (Square Grids, Paths, Cycles, and Multi-Barbells), we compute the distance from the $i$-th member of one lineage to the $(i+1)$-st member of the other, and take the average of the resulting distances from $i = 1$ to $i = 12$. These distances are listed in Table 2. As expected, average distances within a lineage are smaller than the distances from one lineage to another.
We note here that the idea of computing intra- and inter-lineage distances is similar to recent work [43] computing distances between graph ensembles: certain classes of similarly-generated random graphs. Graph diffusion distance has been previously shown (in [43]) to capture key structural information about graphs; for example, GDD is known to be sensitive to certain critical transitions in ensembles of random graphs as the random parameters are varied. This is also true for our time-dilated version of GDD. More formally: let $G_p$ and $G'_p$ represent random graphs on $n$ vertices, drawn from the Erdős–Rényi distribution with edge probability $p$. Then $D(G_p, G'_p)$ has a local maximum at $p = \frac{1}{n}$, representing the transition between disconnected and connected graphs. This is true for our distance as well as the original version due to Hammond.
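This transition is straightforward to probe numerically. The sketch below uses the equal-size diffusion distance of Hammond et al. [1], $\sup_t \|e^{-tL_1} - e^{-tL_2}\|_F$; sign and normalization conventions here are ours and may differ from the paper's.

```python
# Sketch: the GDD between independent Erdos-Renyi graphs peaks near p = 1/n.
import numpy as np
import networkx as nx
from scipy.linalg import expm
from scipy.optimize import minimize_scalar

def hammond_gdd(G1, G2):
    """Equal-size diffusion distance: max over t of ||exp(-tL1) - exp(-tL2)||_F."""
    L1 = nx.laplacian_matrix(G1).toarray().astype(float)
    L2 = nx.laplacian_matrix(G2).toarray().astype(float)
    # Maximize over t by minimizing the negated Frobenius norm.
    res = minimize_scalar(
        lambda t: -np.linalg.norm(expm(-t * L1) - expm(-t * L2), "fro"),
        bounds=(1e-3, 10.0), method="bounded")
    return -res.fun

n = 50
ps = np.linspace(0.2 / n, 3.0 / n, 15)
curve = [np.mean([hammond_gdd(nx.gnp_random_graph(n, p),
                              nx.gnp_random_graph(n, p)) for _ in range(5)])
         for p in ps]
# Expect a local maximum of `curve` near p = 1/n.
```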

Graph limits.
Here, we provide preliminary evidence that graph distance measures of this type may be used in the definition of a graph limit: a graph-like object which is the limit of an infinite sequence of graphs. This idea has been previously explored, most famously by Lovász [7], whose definition of a graph limit (called a graphon) is as follows. Recall the definition of the graph cut-distance $D_{\mathrm{cut}}(G, H)$ from Eq 3, namely: the cut distance is the maximum discrepancy in sizes of edge-cuts, taken over all possible subsets of vertices, between two graphs on the same vertex-set. A graphon is then an equivalence class of Cauchy sequences of graphs, under the equivalence relation that two sequences $G_1, G_2, \ldots$ and $H_1, H_2, \ldots$ are equivalent if $D_{\mathrm{cut}}(G_i, H_i)$ approaches 0 as $i \to \infty$. Here we call a sequence of graphs "Cauchy" if for any ε > 0 there is some $N$ such that for all $n, m \geq N$, $D_{\mathrm{cut}}(G_n, G_m) < ε$.
We propose a similar definition of graph limits, but with our diffusion distance substituted as the distance measure used in the definition of a Cauchy sequence of graphs. Hammond et al. argue in [1] that their variant of diffusion distance may be a more descriptive distance measure than cut-distance. More specifically, they show that on some classes of graphs, some edge deletions 'matter' much more than others: removal of a single edge changes the diffusive properties of the graph significantly. However, the graph-cut distance between the new and old graphs is the same regardless of which edge has been removed, while the diffusion distance captures this nuance. For graph limits, however, our generalization to unequal-sized graphs via $P$ is of course essential. Furthermore, previous work [44] on sparse graph limits has shown that in the framework of Lovász all sequences of sparse graphs converge (in the infinite-size limit) to the zero graphon. Graph convergence results specific to sparse graphs include the Benjamini-Schramm framework [45], in which graph sequences are compared using the distributional limits of subgraph frequencies. Both of these graph comparison methods have the characteristic that the "limit object" of a sequence of graphs is rigorously defined. In this section we attempt to show empirically that such a limit object of graph sequences under GDD may exist, and therefore merits further investigation. We examine several sequences of graphs of increasing size for the required Cauchy behavior (in terms of our distance measure) to justify this variant definition of a "graph limit". For each of the graph sequences defined in Section 5.1, we examine the distance between successive members of the sequence, plotting $D^2(G_n, H_{n+1})$ for each choice of $G$ and $H$. These sequences of distances are plotted in Fig 9. In this figure we see that, generally, distance diverges between different graph lineages and converges for successive members of the same lineage as n → ∞. We note the exceptions to this trend:
1. The distances between $n$-paths and $(n+1)$-cycles appear to be converging; this is intuitive, as we would expect the difference between the two spectra, caused by distortion from the ends of the path graph, to decrease in effect as n → ∞.
2. We also show analytically, under similar assumptions, that the distance between successive path graphs shrinks to zero (Theorem 14).
We do not show that all similarly-constructed graph sequences display this Cauchy-like behavior. We hope to address this deeper question, as well as a more formal exploration of the limit object, with one or more modified versions of the objective function (see Section 3.6).

Limit of path graph distances.
In this section, we demonstrate analytically that the sequence of path graphs of increasing size is Cauchy in the sense described in the previous section. In the following theorem (Theorem 14), we assume that the optimal value of t approaches some value $\bar{t}$ as n → ∞. We have not proven this to be the case, but we have observed this behavior for both square grids and path graphs (see Fig 10 for an example of this behavior). Lemmas 13 and 14 show a related result for path graphs, which holds for any finite t (and therefore also for $\bar{t}$). First, for any particular $(n+1) \times n$ subpermutation matrix $S$, note that

$$D^2(\mathrm{Pa}_n, \mathrm{Pa}_{n+1} \mid t) = \inf_{\alpha > 0}\, \inf_{P \mid C(P)} D^2(\mathrm{Pa}_n, \mathrm{Pa}_{n+1} \mid t, P, \alpha) \;\leq\; D^2\!\left(\mathrm{Pa}_n, \mathrm{Pa}_{n+1} \mid t, \alpha = 1, U_{n+1}^T S U_n\right).$$

Here, $U_n$ and $U_{n+1}$ are the matrices which diagonalize $L(\mathrm{Pa}_n)$ and $L(\mathrm{Pa}_{n+1})$ respectively (note also that a diagonalizer of a matrix $L$ also diagonalizes $e^L$). If at each $n$ we select $S$ to be the subpermutation $S = \begin{bmatrix} I \\ 0 \end{bmatrix}$, then (using the same argument as in Theorem 7) the objective value of this upper bound tends to zero as n → ∞.

Fig 9. Cauchy-like behavior of graph distance as a function of sequence index, n. The distance between successive square grids and all other graph sequences appears to diverge (the same behavior is seen for k-barbells). Notably, the distance between $\mathrm{Grid}_{n \times n}$ and $\mathrm{Grid}_{(n+1) \times (n+1)}$ does not appear to converge until much higher values of n (n > 100) than the other convergent series. This may be because the distances calculated are upper bounds, which may converge more slowly than the 'true' optima.
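The bound above is straightforward to evaluate numerically. The sketch below uses NumPy's convention that eigenvectors are returned as columns, so the candidate map is written $U_{n+1} S U_n^T$; the diffusion-kernel sign and normalization conventions are ours and may differ from the paper's.

```python
# Sketch: the alpha = 1, S = [I; 0] upper bound on D(Pa_n, Pa_{n+1} | t).
import numpy as np
from scipy.linalg import expm

def path_laplacian(n):
    L = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    L[0, 0] = L[-1, -1] = 1.0       # aperiodic (free) boundary conditions
    return L

def upper_bound(n, t=1.0):
    L1, L2 = path_laplacian(n), path_laplacian(n + 1)
    _, U1 = np.linalg.eigh(L1)      # eigenvector columns, ascending eigenvalues
    _, U2 = np.linalg.eigh(L2)
    S = np.eye(n + 1, n)            # the subpermutation [I; 0]
    P = U2 @ S @ U1.T               # candidate prolongation at alpha = 1
    return np.linalg.norm(P @ expm(-t * L1) - expm(-t * L2) @ P, "fro")

# upper_bound(n) shrinks toward zero as n grows, consistent with Theorem 14.
```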
Since each pass of this procedure (with different random seeds) varied in Step 1, each pass produced a different graph. We generated 20 graphs for each mesh and compared the graphs using GDD. The results of this experiment can be seen in Fig 13, which shows the first three principal components of the GDD distance matrix for the dataset of graphs produced as described above. Each point represents one graph in the dataset and is colored according to the mesh used to generate it. Most notably, all of the clusters are tight and do not overlap. Nearby clusters represent structurally similar objects: for example, the cluster of graphs from the tube mesh is very close to the cluster derived from the tube with an equatorial fin. This synthetic-dataset example demonstrates that graph diffusion distance is able to compare structural information about point clouds and meshes.

Fig 13. Embedding of pairwise distances between mesh discretizations. We see that GDD clusters each category of mesh tightly, and furthermore that clusters are nearby when they represent structurally similar meshes, and distant otherwise. Axes represent the three principal components of the distance matrix and are thus unitless. https://doi.org/10.1371/journal.pone.0249624.g013
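The embedding step itself can be sketched as classical multidimensional scaling: PCA of the double-centered squared-distance matrix. This is our reconstruction of a standard plotting pipeline, not necessarily the paper's exact code.

```python
# Sketch: classical MDS of a pairwise GDD matrix, keeping three components.
import numpy as np

def classical_mds(D, k=3):
    """D: (m, m) symmetric matrix of pairwise distances; returns (m, k) coords."""
    m = D.shape[0]
    J = np.eye(m) - np.ones((m, m)) / m      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]            # top-k components
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```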

Applications and future work
We briefly discuss possible applications of both our distance metric and our procedure for calculating the relevant minima.

Algebraic multigrid
The need for prolongation/restriction operators arises naturally in the Algebraic MultiGrid (AMG) context, in which a hierarchy of progressively coarser meshes is constructed, with the goal of speeding convergence of a model with local update ("smoothing") rules. A model with modes of behavior at wavelengths much larger than the neighborhood of one update will take many update steps to converge. Thus, the goal in AMG is to iteratively construct a series of coarsened meshes, so that update steps at the coarser scales can address coarser modes of behavior. A fine-scale model state is translated into a coarse-scale state via a "restriction" operator; after a coarse-scale smoothing step, the new coarse state is translated back to the fine scale by "prolonging" it. Our procedure for calculating P could be incorporated as a preprocessing step in the case where the series of meshes is known in advance; otherwise, the P from the previous round of coarsening could be used as the initial condition for a modified version of our solver. In either case, the matrix P is a natural choice of prolongation/restriction operator for this type of coarsening scheme, since it optimally transforms the Laplacian of one graph into another.
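As a sketch of how P would slot into such a scheme, the two-grid correction below forms the coarse operator Galerkin-style ($A_c = P^T A P$). This is a generic illustration under our own conventions, not a scheme prescribed by the paper.

```python
# Sketch: one two-grid correction cycle using P as prolongation/restriction.
import numpy as np

def jacobi(A, x, b, sweeps=3, omega=0.7):
    """Weighted-Jacobi smoothing sweeps."""
    Dinv = 1.0 / np.diag(A)
    for _ in range(sweeps):
        x = x + omega * Dinv * (b - A @ x)
    return x

def two_grid_step(A, P, x, b):
    x = jacobi(A, x, b)                      # pre-smooth on the fine grid
    A_c = P.T @ A @ P                        # Galerkin coarse-grid operator
    r_c = P.T @ (b - A @ x)                  # restrict the residual
    # lstsq guards against a singular coarse operator (e.g. Laplacian nullspace).
    e_c = np.linalg.lstsq(A_c, r_c, rcond=None)[0]
    x = x + P @ e_c                          # prolong the coarse correction
    return jacobi(A, x, b)                   # post-smooth
```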

Graph limits
In this work we briefly introduce a new definition of graph limits based on the diffusion distance, which raises several natural questions: What does the "limit" of a sequence of graphs under diffusion distance look like? Are there pairs of sequences that converge to the same such object, as in the example of path graphs and cycle graphs? Can we separate graph sequences into equivalence classes based on which of these they converge to? We hope to address these questions in future work.

Graph convolutional networks
Graph convolutional networks (GCNs) are a variant of the convolutional neural networks (CNNs) widely used in machine vision. In the same way that CNNs learn a set of trained image filters and apply them across multiple spatial locations in an image, GCNs learn a set of filters which are applied to local neighborhoods of a graph. One implementation of GCNs due to Kipf and Welling [46] uses a Chebyshev polynomial of the Laplacian matrix as an approximation of the graph Fourier transform, demonstrating results comparable to the full transform while requiring far fewer multiplication operations. However, the construction of pooling operators for GCNs is still an area of open research. Since our $P$ is a restriction operator that preserves information about the Laplacian, it is natural to use it as a pooling operator in this type of model. In [47] we make use of the optimization procedure described in this manuscript to find optimal (in the sense of graph diffusion distance) coarsenings of a graph representing a protein nanotube. Since our procedure yields a $P$ matrix which maps between the coarse and fine graphs, we are then able to define a machine learning model which learns to reproduce the protein's energetic dynamics at multiple spatial scales. As a result of this multiscale construction, the model is more computationally efficient by an order of magnitude.
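A pooling layer of this kind reduces, at its core, to multiplication by P. The sketch below assumes the (hypothetical) convention that P has shape (n_fine, n_coarse), matching the prolongation direction used in the AMG sketch above.

```python
# Sketch: P as a pooling/unpooling operator for node features X.
import numpy as np

def pool(X, P):
    """Restrict fine-scale node features (n_fine, d) to the coarse graph."""
    return P.T @ X            # -> (n_coarse, d)

def unpool(X_coarse, P):
    """Prolong coarse features (n_coarse, d) back to the fine graph."""
    return P @ X_coarse       # -> (n_fine, d)
```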

Graph clustering
We can also use the diffusion distance and its variants to compare graphs or neighborhoods of graphs for structural similarity, independent of graph size. Section 5.3.5 showcases an example of doing this with a small synthetic database of 3D point clouds derived from mesh files of several well-known objects. This is similar to the approach of [4] for comparing point clouds in 2D and 3D, in the sense that both approaches optimize an objective function based on a matching between elements of the two graphs. This type of similarity measure may then be used to convert a dataset of graphs to a distance-to-cluster-centers representation, or with any of the typical methods used in machine learning for converting sets of pairwise distances into fixed-length feature vectors (k-medoids, kernel methods, multidimensional scaling, etc.). In this setting, our distance measure has an additional benefit: since computing it yields an explicit projection operator between the nodes of the graphs, we may use the set of $P$ matrices we compute to project signals (e.g. labels on the vertices of each graph in the dataset) into a common space.
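For instance, a distance-to-cluster-centers representation can be built directly from the precomputed GDD matrix; below is a minimal Voronoi-iteration k-medoids sketch (all parameter choices are ours).

```python
# Sketch: k-medoids on a precomputed distance matrix, yielding
# distance-to-medoid feature vectors for downstream learning.
import numpy as np

def k_medoids(D, k, iters=20, seed=0):
    """D: (m, m) pairwise GDD matrix; returns medoid indices and features."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)   # assign to nearest medoid
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size > 0:                    # re-center each cluster
                within = D[np.ix_(members, members)].sum(axis=0)
                medoids[j] = members[np.argmin(within)]
    return medoids, D[:, medoids]                   # features: (m, k)
```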

Conclusion
In this work, we present a novel generalization of graph diffusion distance which allows for comparison of graphs of inequal size. We consider several variants of this distance measure to account for sparse maps between the two graphs, and for maps which are optimal given a fixed time-dilation factor α. We prove several important theoretical properties of distances in this family of measures, including triangle inequalities in some cases and Cauchy-like behavior of some graph sequences. We present a new procedure for optimizing the objective function defined by our distance measure, prove the correctness of this procedure, and demonstrate its efficiency in comparison to univariate search over the dilation parameter α. Numerical experiments suggest that this dissimilarity score satisfies the triangle inequality up to some constant ρ ≈ 2.1. We demonstrate that this measure of graph distance may be used to compare graph lineages (families of exponentially-growing graphs with shared structure), and additionally that certain lineages display Cauchy-sequence-like behavior as the graph size approaches infinity. We suggest several possible applications of our distance measure to scientific problems in the contexts of pattern matching and machine learning.