Correction: Beyond convexity—Contraction and global convergence of gradient descent

[This corrects the article DOI: 10.1371/journal.pone.0236661.].


Introduction
This paper considers the analysis of continuous-time gradient-based optimization through the lens of nonlinear contraction theory. It is motivated, in part, by recent observations in machine learning that arise in the application of gradient descent (or its stochastic counterpart) for the training of over-parameterized networks [1]. Modern networks often possess many more parameters than training examples and can fit the labels perfectly, resulting in submanifold valleys of the parameter space with equal cost [2][3][4]. Moreover, recent results suggest that highly-redundant networks experience few to no local optima that are not global optima [5][6][7]. These observations may be surprising in light of the fact that the loss landscapes for these problems are rarely convex. Although convex problems admit provable globally optimal solutions, other broader classes of functions share this same property. For example, invex functions [8] guarantee that any local optimum is a global optimum, although the utility of invexity conditions remains a point of contention [9]. Functions satisfying the Polyak-Lojasiewicz (PL) inequality [1,3,10,11] give rise to exponentially convergent gradient descent to a provably optimal solution. While the PL condition is, in general, difficult to verify without an a-priori known globally optimal solution, the existence of zero-loss solutions in over-parameterized learning [3,6,7] makes it tractable in important special cases. Geodesic convexity [12,13] generalizes convexity to a Riemannian setting, with applicability to optimization on manifolds [14], as well as to conventional Euclidean settings where \mathbb{R}^n is endowed with a manifold structure through the definition of a metric.
Here, we consider another class of conditions for the convergence of gradient and natural gradient descent to a globally optimal point. We do so by adopting the perspective of nonlinear contraction theory and analyzing gradient descent in continuous time.
Contraction theory [15] allows the stability of nonlinear non-autonomous systems to be characterized through linear time-varying dynamics describing the propagation of infinitesimally small displacements along the systems' flow. The existence of a Riemannian metric that contracts these virtual displacements (i.e., elements in the tangent space) is necessary and sufficient for exponential convergence of any pair of trajectories. Contraction naturally yields methods for constructing stable systems of systems, including synchronization phenomena [16] and consensus [17,18] as well as other key building blocks that allow the construction of large contracting systems out of simpler elements [19]. These properties provide opportunities to construct larger optimization structures from simpler elements (e.g., in distributed or competitive optimization settings).
The contribution of this paper is to apply these contraction tools to the analysis of gradient and natural gradient optimization. We consider optimization problems posed over \mathbb{R}^n wherein no explicit manifold structure necessarily exists a-priori. Instead, we analyze these problems after endowing them with additional structure (a Riemannian metric), study their convergence, and consider the use of contraction tools to build larger optimization structures out of smaller ones. Analysis proceeds in continuous time. While this approach is limited, in part, by the fact that computational optimization algorithms require a discrete implementation, a continuous perspective has yielded insight on important phenomena such as in the analysis [20], discrete implementations [21], and extensions [22,23] of Nesterov's accelerated gradient descent method [24]. It has also enabled analysis of primal-dual algorithms [25], where an absolute time reference is obtained by introducing additional fast dynamics or delays using a singular perturbation framework. Recent results [26] provide principled tools to derive discrete-time implementations that preserve specific continuous-time convergence rates.
The paper is organized as follows. Section 2 provides our main results, detailing the applicability of contraction theory to analyze gradient descent in continuous time. We show that convex functions represent the special case of contraction in the identity metric. The flexibility afforded by state-dependent contraction metrics, however, enables significant extra freedom for guaranteeing that all local optima are globally optimal. We then consider the extensions of these results to natural gradient descent, where geodesic convexity of a function corresponds to contraction of its natural gradient system in the natural metric. In both cases, results highlight the topology of the set of optimizers in the case of semi-contraction, which would have most direct applicability to over-parameterized networks. New results also show how semi-contraction may be combined with specific additional information to reach broad conclusions about a dynamical system. Section 3 details extensions of these results to the case of primal-dual type dynamics that appear in mixed convex/concave saddle systems, and shows how a broad class of natural adaptive control laws can be interpreted as a primal-dual system. Section 4 discusses the special case of g-convex functions and associated combination properties for interfacing with other models. Section 5 provides an outlook on potential future advances that may stem from these connections.

Contraction analysis of gradient systems
We first recall basic definitions and facts on convex optimization and show how a contraction analysis of gradient-based optimization considerably generalizes the class of functions that admit a unique global optimum. Following this presentation, results are generalized to the case of geodesically-convex optimization, which is particularly suited to analysis via contraction tools. Throughout this analysis, given a differentiable function h : \mathbb{R}^n \to \mathbb{R}^m, we denote the Jacobian of h(x) by \partial h / \partial x \in \mathbb{R}^{m \times n}. In the special case of a scalar-valued function f : \mathbb{R}^n \to \mathbb{R}, we denote the gradient of f(x) by \nabla f(x) = (\partial f / \partial x)^\top \in \mathbb{R}^n and its Hessian by \nabla^2 f(x). Unless otherwise stated, we assume all functions are sufficiently smooth such that derivatives of the necessary order exist and are continuous. Before we embark on this discussion, let us note that, of course, as illustrated, e.g., in [20] and in the following example, continuous-time analysis tools in general may be used to conceptually illuminate the mechanisms involved in discrete-time algorithms. As this paper will show, contraction tools give particularly simple insights into important classes of optimization problems, such as, e.g., geodesically-convex optimization.

Example 1. The Polyak-Lojasiewicz (PL) inequality, \frac{1}{2}\|\nabla f(x)\|^2 \ge \mu \left( f(x) - f^* \right) with \mu > 0, is one of the most general sufficient conditions for discrete-time gradient descent to exhibit linear convergence rates without strong convexity of the cost [10,11]. By comparison, the results pursued via contraction analysis in this paper will ensure exponential convergence of any pair of trajectories for gradient descent, and likewise will ensure convergence of those solutions to a global optimum.
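To make the PL discussion concrete, the following sketch (an illustrative example of ours, not taken from the paper) uses the convex but not strongly convex cost f(x) = 0.5(x_1 + x_2 - 1)^2, whose Hessian is singular and whose minimizers form a line; it satisfies the PL inequality with \mu = 1, and discrete gradient descent on it converges linearly:

```python
import numpy as np

# Illustrative sketch (assumed example, not from the paper):
# f(x) = 0.5*(x1 + x2 - 1)^2 is convex but NOT strongly convex
# (its Hessian [[1,1],[1,1]] is singular), yet it satisfies the PL
# inequality 0.5*||grad f||^2 >= mu*(f - f*) with f* = 0, mu = 1.
def f(x):
    return 0.5 * (x[0] + x[1] - 1.0) ** 2

def grad_f(x):
    g = x[0] + x[1] - 1.0          # shared residual
    return np.array([g, g])

x = np.array([3.0, -1.5])
eta = 0.25                          # step size < 1/L, with L = 2 here
losses = [f(x)]
for _ in range(50):
    x = x - eta * grad_f(x)
    losses.append(f(x))

# Each step scales the residual by (1 - 2*eta) = 0.5, so the loss
# contracts geometrically by (1 - 2*eta)**2 = 0.25 per iteration.
ratios = [losses[k + 1] / losses[k] for k in range(5)]
```

The iterates converge to the nearest point of the zero-loss valley {x : x_1 + x_2 = 1}, illustrating linear convergence without a unique minimizer.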

Relationships between convexity and contraction
Definition 1 (Strong Convexity). A twice differentiable function f : \mathbb{R}^n \to \mathbb{R} is α-strongly convex with α > 0 if its Hessian matrix \nabla^2 f(x) satisfies the matrix inequality \nabla^2 f(x) \succeq \alpha I for all x. As its name suggests, a function that is strongly convex is convex in the usual sense, while the converse is not always true. From a dynamic systems perspective, strong convexity provides exponential convergence of gradient flows:

Proposition 1 (Exponential Convergence of Gradient Systems for Strongly Convex Functions). If a twice differentiable function f : \mathbb{R}^n \to \mathbb{R} is α-strongly convex, then its gradient system

\dot{x} = -\nabla f(x) \quad (1)

converges to the unique global minimum of f exponentially with rate α. Toward proving this proposition, we will consider stability analysis through the application of nonlinear contraction theory.
Definition 2 (Contraction Metric [15]). A system \dot{x} = h(x, t) is said to be contracting at rate α > 0 with respect to a symmetric positive definite metric M : \mathbb{R}^n \to \mathbb{R}^{n \times n} if, for all t \in \mathbb{R} and all x,

\dot{M}(x) + M(x)\, a(x, t) + a(x, t)^\top M(x) \preceq -2 \alpha M(x) \quad (2)

where a(x, t) = \partial h / \partial x is the system Jacobian and \dot{M} = \sum_i (\partial M / \partial x_i)\, h_i(x, t). The system is said to be semi-contracting with respect to M when (2) holds with α = 0.
Given an α-contracting system and an arbitrary pair of initial conditions x_1(0) and x_2(0), the solutions x_1(t) and x_2(t) converge to one another exponentially,

d_M(x_1(t), x_2(t)) \le e^{-\alpha t} \, d_M(x_1(0), x_2(0)) \quad (3)

where d_M(\cdot, \cdot) denotes the geodesic distance on the Riemannian manifold \mathcal{M} = (\mathbb{R}^n, M). This property can be shown by considering the evolution of differential displacements \delta x, which describe the evolution of nearby trajectories and coincide with the notion of virtual displacements in Lagrangian mechanics. More precisely, letting x(t; x_0, t_0) denote the solution of \dot{x} = h(x, t) from initial condition x(t_0) = x_0, differential displacements evolve according to \dot{\delta x} = a(x, t)\, \delta x. Eq (3) follows from the evolution of the squared length of these differential displacements [15], which verifies

\frac{d}{dt} \left( \delta x^\top M(x) \delta x \right) \le -2 \alpha \, \delta x^\top M(x) \delta x

Furthermore, if a system is α-contracting in a metric M that satisfies M(x) \succeq \beta I uniformly for some constant β > 0, then any two solutions verify \|x_1(t) - x_2(t)\| \le \beta^{-1/2} e^{-\alpha t} d_M(x_1(0), x_2(0)).

Example 2. Consider an α-strongly convex function f and its associated gradient descent system (1). Since f is strongly convex, it has a unique global minimum x^*, which is an equilibrium point of (1). It can be verified that the gradient descent dynamics of f are contracting in the identity metric M = I with rate α. Since geodesic distances are just Euclidean distances in this metric, (3) immediately implies that

\forall t \ge 0, \quad \|x(t) - x^*\| \le e^{-\alpha t} \|x(0) - x^*\|

thus proving Proposition 1.
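A minimal numerical companion to Example 2 (the specific quadratic below is our assumption, chosen for clarity): for f(x) = 0.5 x^T A x with A positive definite, the gradient flow is contracting in the identity metric with rate α = λ_min(A), so any two solutions approach each other at least that fast.

```python
import numpy as np

# Sketch (assumed example): f(x) = 0.5*x^T A x with A = diag(2, 5) is
# alpha-strongly convex with alpha = 2, so the gradient flow x' = -A x
# is contracting in the identity metric with rate 2, and any two
# solutions satisfy ||x1(t)-x2(t)|| <= exp(-2 t)*||x1(0)-x2(0)||.
A = np.diag([2.0, 5.0])
dt = 1e-3
x1 = np.array([1.0, -2.0])
x2 = np.array([-3.0, 0.5])
d0 = np.linalg.norm(x1 - x2)

t = 0.0
for _ in range(2000):               # explicit Euler to t = 2
    x1 = x1 - dt * A @ x1
    x2 = x2 - dt * A @ x2
    t += dt

d = np.linalg.norm(x1 - x2)
bound = np.exp(-2.0 * t) * d0       # contraction envelope with rate 2
```

The observed distance stays below the exponential envelope predicted by (3) with α = λ_min(A) = 2.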
From this example, it is clear that strongly convex functions are a special case of ones whose gradient systems are contracting. The following proposition shows that one does not lose the convergence properties to a global optimum on this more general class of functions.
Proposition 2 (Exponential Convergence of Contracting Gradient Systems). Consider again gradient descent as in Eq (1). The system converges exponentially to a unique global minimum if it is contracting in some metric.
Proof. Because (1) is autonomous and contracting, it converges exponentially to a unique equilibrium x^* [15]. Furthermore, this equilibrium must be a global minimum since f can only decrease along trajectories, with \dot{f} = -\nabla f(x)^\top M(x)^{-1} \nabla f(x) < 0 for x \ne x^*.
The above result, which emphasizes contraction rather than convexity as a sufficient condition to converge to a global minimum, can be extended to the semi-contracting case as follows.
Proposition 3 (Asymptotic Convergence of Semi-Contracting Gradient Systems). Consider a twice differentiable function f : \mathbb{R}^n \to \mathbb{R}, a symmetric positive definite metric M : \mathbb{R}^n \to \mathbb{R}^{n \times n}, and the associated gradient system

\dot{x} = -M(x)^{-1} \nabla f(x) \quad (5)

Assume that dynamics (5) is semi-contracting in some metric, and furthermore that one trajectory of the system is known to be bounded. Then, (a) f has at least one stationary point, (b) any local minimum of f is a global minimum, (c) all global minima of f are path-connected, and (d) all trajectories asymptotically converge to a global minimum of f.

Proof. (a) By assumption, there exists some initial condition x_0 such that x(t; x_0) remains bounded. This, in turn, implies that the ω-limit set ω[x(t; x_0)] is non-empty, compact, forward invariant, and that d(x(t; x_0), ω[x(t; x_0)]) → 0 as t → +∞. Let x^* denote an element of ω[x(t; x_0)]. Since (5) is a gradient system, Theorem 15.0.3 of [28] guarantees that x^* must be an equilibrium point of (5). This proves that f has at least one stationary point.
Suppose now, toward a contradiction, that the ω-limit set contains two distinct points x^*_1 and x^*_2, and let B_1 and B_2 denote geodesic balls centered at these points with radii small enough that B_1 and B_2 are disjoint. Further, since the system is semi-contracting, these geodesic balls are forward invariant. Yet, since x^*_1 is a limit point, x(t; x_0) arrives within B_1 at some point, and never leaves. Likewise, since x^*_2 is a limit point, x(t; x_0) arrives within B_2 at some point, and never leaves. Thus, we have a contradiction, and the limit set must consist of a single point.
(b) and (c): Consider now two equilibrium points of (5), x^*_1 and x^*_2, and a smooth path γ(s) such that γ(0) = x^*_1 and γ(1) = x^*_2. Since the gradient dynamics are semi-contracting, for each s the solution x(t; γ(s)) remains bounded. Thus, by the same reasoning as above, each x(t; γ(s)) converges to some equilibrium x^*(s) as t → +∞. Since \nabla f(x^*(s)) = 0 for each s, and x^*(s) smoothly connects x^*_1 and x^*_2, the value f(x^*(s)) is constant in s. That is, all solutions converge to the same value for f.
(d): That all solutions of (5) asymptotically converge to a global minimum of f follows from the fact that f decreases along all solutions, and all solutions converge to the same value for f.

Remark 1. In the case that a contraction metric needs to be found numerically, note that the conditions (2) for certifying contraction or semi-contraction are convex criteria. Thus, in many instances, the process of finding a metric numerically to verify contraction may be accomplished via convex optimization approaches, such as those based on sums-of-squares programming [29].
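Proposition 3 can be illustrated with a toy over-parameterized least-squares problem (an assumed example of ours): with one data point and two parameters, the global minima form a line, the gradient flow is semi-contracting in the identity metric, and different initial conditions reach different points of the same zero-loss valley.

```python
import numpy as np

# Hedged illustration of Proposition 3 (names and problem assumed here):
# the over-parameterized least-squares cost f(w) = 0.5*(a^T w - b)^2 has
# a whole line of global minima {w : a^T w = b}. Its gradient flow
# w' = -a*(a^T w - b) has Jacobian -a a^T <= 0, hence it is
# semi-contracting in the identity metric.
a = np.array([1.0, 2.0])
b = 3.0

def flow(w0, steps=20000, dt=1e-3):
    w = w0.copy()
    for _ in range(steps):
        w = w - dt * a * (a @ w - b)
    return w

wA = flow(np.array([5.0, -5.0]))
wB = flow(np.array([-2.0, 4.0]))

resA = abs(a @ wA - b)              # both trajectories reach zero loss...
resB = abs(a @ wB - b)
gap = np.linalg.norm(wA - wB)       # ...but at different global minimizers
```

Consistent with parts (c) and (d), both solutions converge to the path-connected set of global minima, while the distance between them does not shrink to zero.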

Relationship between geodesic convexity and contraction
Geodesic convexity [12] generalizes conventional notions of convexity to the case where the domain of a function is equipped with a Riemannian metric. A special case occurs in geometric programming (GP) [30]. In GP, a non-convex problem over positive variables \{x_i\}_{i=1}^N can be transformed into a convex problem by the change of variables y_i = \log(x_i). Alternately, GP can be formulated over the positive reals viewed as a Riemannian manifold by measuring differential length elements ds in a relative sense,

ds_i = \frac{dx_i}{x_i} \quad (6)

Geodesically-convex optimization generalizes this transformation strategy to a broader class of problems [13]. However, beyond special cases (see, e.g., [31]), generative procedures remain lacking to formulate g-convex optimization problems or recognize g-convexity.
To introduce g-convexity more formally, consider a function f : \mathbb{R}^n \to \mathbb{R} and a positive definite metric M : \mathbb{R}^n \to \mathbb{R}^{n \times n}. We note that geodesic convexity of f is not an intrinsic property of the function itself, but rather is a property of f defined on the Riemannian manifold (\mathbb{R}^n, M).
Definition 3 (g-Strong Convexity [32]). A twice differentiable function f : \mathbb{R}^n \to \mathbb{R} is said to be geodesically α-strongly convex (with α > 0) in a symmetric positive definite metric M if its Riemannian Hessian matrix H(x) satisfies:

H(x) \succeq \alpha M(x) \quad (7)

The elements of the Riemannian Hessian are given as [32]

H_{ij} = \partial_{ij} f - \Gamma^k_{ij} \, \partial_k f \quad (8)

where \partial_{ij} f = \frac{\partial^2 f}{\partial x_i \partial x_j} provide the elements of the conventional (Euclidean) Hessian and \Gamma^k_{ij} denotes the Christoffel symbols of the second kind,

\Gamma^k_{ij} = \tfrac{1}{2} M^{k\ell} \left( \partial_i M_{j\ell} + \partial_j M_{i\ell} - \partial_\ell M_{ij} \right)

with M^{k\ell} the components of M^{-1}. The Riemannian Hessian generalizes the notion of the Hessian from a Euclidean context and captures the curvature of f along geodesics. Likewise, the natural gradient generalizes the notion of a Euclidean gradient to the Riemannian context in the following sense.
Definition 4 (Natural Gradient [33]). Consider \mathbb{R}^n equipped with a Riemannian metric M. The natural gradient of a differentiable function f : \mathbb{R}^n \to \mathbb{R} is the direction of steepest ascent on the manifold and is given in coordinates by M(x)^{-1} \nabla f(x).

Remark 2. When M(x) is the Hessian of some twice differentiable strictly convex scalar function ψ(x), natural gradient descent coincides with the continuous-time limit of mirror descent [34, Sec. 2.3] with potential ψ(x).
Remark 3. From a differential geometric viewpoint, the first covariant derivative of f is a covector field given in coordinates by rf(x), while the natural gradient is a vector field given in coordinates by M(x) −1 rf(x) [33].

In a Euclidean context, where M(x) is identity, this distinction between covariant (covector) and contravariant (vector) representations of the gradient is immaterial. Similarly, the Riemannian Hessian H represents in coordinates the second covariant derivative of f.
When M is the identity metric, geodesic α-strong convexity naturally coincides with the definition of α-strong convexity in Definition 1. The natural gradient can be used to directly mirror Proposition 1 within the Riemannian context.

Theorem 1 (Equivalence between g-Strong Convexity and Contraction of Natural Gradient). Consider a twice differentiable function f : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}, a symmetric positive definite metric M : \mathbb{R}^n \to \mathbb{R}^{n \times n}, and the natural gradient system [33]

\dot{x} = h(x, t) = -M(x)^{-1} \nabla_x f(x, t) \quad (9)

Then, f is α-strongly g-convex in the metric M for each t if and only if (9) is contracting with rate α in the metric M. More specifically, the Riemannian Hessian verifies

H = -\tfrac{1}{2} \left( \dot{M} + M a + a^\top M \right) \quad (10)

where a = \partial h / \partial x. Appendix 1 provides a self-contained proof using conventional tensor analysis methods [35], whose relationship with contraction conditions has been noted previously [36,37]. The same relationships drive coordinate-free versions of the result in [38].
Remark 4. Theorem 1 can also be viewed as a special case of contraction analysis for complex Hamilton-Jacobi dynamics [37]. A reorganization of (9) as M(x)\dot{x} = -\nabla_x f(x, t) may be recognized as the generalized momentum being the negative covariant gradient within a Hamiltonian mechanics context.

Remark 5. While Theorem 1 applies to α-strong convexity, the link between the Riemannian Hessian and the contraction condition (2) also provides immediate equivalence between g-convexity of a function and semi-contraction of its natural gradient dynamics.

Remark 6. Eq (10) provides an alternate way to compute the geodesic Hessian H, and, as expected, leaves it invariant when the metric M is scaled by a strictly positive constant. Because of the structure of the natural gradient dynamics, scaling M is akin to scaling time and implies inversely scaling the contraction rate α, consistently with (7).
By contrast, note that given a fixed dynamics H, the contraction metric analyzing it can always be arbitrarily scaled while leaving the contraction rate unchanged.
Similar to Section 2.1, where convexity corresponded to contraction of gradient descent in the identity metric, we likewise see that Thm. 1 imposes g-convexity via a particular choice of contraction metric for the natural gradient dynamics. Mirroring Prop. 2, removing this restriction on the contraction metric leads to significant additional flexibility for guaranteeing convergence to a globally optimal point.

Proposition 4 (Exponential Convergence of Contracting Natural Gradient Systems). Consider again natural gradient descent as in Eq (9). The system converges exponentially to a unique global minimum if it is contracting in some metric.
Proof. The proof follows immediately from the same logic as the proof of Proposition 2.

Remark 7. Note that contraction also provides robustness. Consider perturbed dynamics \dot{x} = -M(x)^{-1}\nabla f(x) + d(t) with \sqrt{d(t)^\top M(x)\, d(t)} < R uniformly. If the dynamics are contracting with rate λ, then all trajectories contract to a geodesic ball of radius R/λ [15]. This observation implies favorable properties for algorithms where an exact gradient may be difficult or intractable to compute, with approximation methods used in their place.

Theorem 2 (Semi-Contraction for Natural Gradient). Consider a twice differentiable function f : \mathbb{R}^n \to \mathbb{R}, a symmetric positive definite metric M : \mathbb{R}^n \to \mathbb{R}^{n \times n}, and the associated natural gradient system

\dot{x} = -M(x)^{-1} \nabla f(x) \quad (11)

Assume that dynamics (11) is semi-contracting in some metric, and furthermore that one trajectory of the system is known to be bounded. Then conclusions (a)-(d) of Proposition 3 hold. Indeed, Theorem 15.0.3 of [28], which guarantees that any ω-limit point of gradient descent (5) is an equilibrium point, generalizes immediately to the case of natural gradient descent (11).
Remark 8. The topology of global optimizers satisfying this semi-contraction condition is the same as those observed when training over-parameterized networks [2,3]. However, empirical loss functions in these networks often also experience multiple saddle points [39,40]. The attractor sets associated with strict saddles have measure zero [41,42] under discrete gradient descent with sufficiently small stepsize (i.e., with adequately close approximation to the continuous time case), while the dimensionality of the attractor sets can be further reduced via smoothed versions of the gradient [43].
While the presence of strict saddles precludes the ability of a gradient system to be globally semi-contracting, any of the results given here can be generalized to forward invariant contraction or semi-contraction regions [15]. In principle, saddles could then be treated by excluding their measure zero attractor sets from suitably chosen contraction or semi-contraction regions.
The topology of equilibria in semi-contracting gradient systems immediately implies the following result. Corollary 1. Consider an autonomous, semi-contracting natural gradient system. If the linearization at some equilibrium point is strictly stable, then all system trajectories tend to this global minimizer.
More generally, if some equilibrium is locally asymptotically stable, all trajectories tend to this global minimizer.
Proof. We prove the second part; the first then follows directly from Lyapunov's linearization method. Existence of an equilibrium implies existence of a bounded trajectory. Furthermore, by definition, there exists a ball around the equilibrium point x^* such that all trajectories initiated in that ball tend to x^*. If there were another equilibrium, the path connecting it to x^* would intersect that ball, which is a contradiction since the path is itself composed of equilibria via Thm. 2.
Remark 9. Strict stability of a natural gradient system at an equilibrium point can of course be established simply by ensuring that all eigenvalues of its Jacobian at this point are strictly in the left-half complex plane. This condition is equivalent to requiring that the Hessian of the objective function is positive definite at x � .
Indeed, given the natural gradient dynamics (11), the Jacobian of the dynamics at an equilibrium x^* is -M(x^*)^{-1} \nabla^2 f(x^*). Applying a similarity transformation with the symmetric square root of M(x^*) yields

-M(x^*)^{-1/2} \, \nabla^2 f(x^*) \, M(x^*)^{-1/2}

All eigenvalues of the symmetric matrix above are real, and they are all strictly negative if and only if the Hessian \nabla^2 f(x^*) is positive definite. Note that this condition is equivalent to the geodesic Hessian at x^* being positive definite in any metric, as the Euclidean Hessian is numerically equal to the geodesic Hessian in any metric in this case, due to all terms multiplying the Christoffel symbols in (8) being zero.

Corollary 2. Consider an autonomous semi-contracting natural gradient system, and assume that the system has more than one equilibrium. Then, at any equilibrium, both the Jacobian matrix of the dynamics and the Hessian of the objective have at least one zero eigenvalue.
Proof. Consider an equilibrium x � , and an equilibrium path connecting it to some other equilibrium. The unit tangent vector at x � along this path is an eigenvector of the Jacobian with eigenvalue zero. Given the algebraic relation between the Jacobian and the objective Hessian pointed out in Remark 9, this shows in turn that the objective Hessian has a zero eigenvalue.

Examples
Let us illustrate Theorem 1 using the classical nonconvex Rosenbrock function:

f(x) = (1 - x_1)^2 + 100 (x_2 - x_1^2)^2 \quad (12)

This function has a unique global optimum at x^* = [1, 1]^\top, which is located along a long, shallow, parabolic-shaped valley.

Example 3. Consider the Rosenbrock function (12) and the metric [32]

M(x) = \begin{bmatrix} 1 + 400 x_1^2 & -200 x_1 \\ -200 x_1 & 100 \end{bmatrix}
It can be verified algebraically that H(x) = 2 M(x), which shows that natural gradient descent is contracting with rate α = 2. This implies that the natural gradient dynamics satisfy

d_M(x(t), x^*) \le e^{-2t} \, d_M(x(0), x^*)

The Rosenbrock metric M(x) can be viewed as following from a differential change of variables \delta z = \Theta(x)\, \delta x with M(x) = \Theta(x)^\top \Theta(x). This differential change of variables is integrable, so that g-convexity of the Rosenbrock can be shown using the explicit nonlinear coordinate change z_1 = 10 x_1^2 - 10 x_2 and z_2 = 1 - x_1.

Example 4. Consider the explicit change of variables z = \nabla \psi(x) for a strictly convex potential ψ, which can be written in differential form as \delta z = H_\psi \, \delta x with H_\psi = \nabla^2 \psi(x). The mirror descent dynamics

\dot{x} = -H_\psi(x)^{-1} \nabla f(x) \quad (13)

can be viewed in the mirror space as \dot{z} = -\nabla f(x), and therefore

\dot{\delta z} = -\nabla^2 f(x) \, H_\psi(x)^{-1} \delta z \quad (14)

In the particular case when f is α-strongly convex and the potential function is chosen as ψ(x) = f(x), Eq (13) simply corresponds to Newton's method, and (14) verifies that Newton's method is contracting with rate 1 in the squared Hessian metric M(x) = H_f^2(x) [44]. Note that the well-known result that the transformation z = \nabla \psi(x) is one-to-one (given the strict convexity of ψ) can also be shown by constructing, for a given z, the system

\dot{y} = z - \nabla \psi(y)

which is autonomous and contracting in the identity metric and thus must reach a unique equilibrium point.
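The rate α = 2 in Example 3 can be checked numerically. The sketch below treats the metric as an assumption reconstructed from the coordinate change z_1 = 10x_1^2 - 10x_2 and z_2 = 1 - x_1 (with M = Θ^T Θ and Θ = ∂z/∂x), under which the Rosenbrock cost becomes f = z_1^2 + z_2^2 and natural gradient descent reduces to ż = -2z:

```python
import numpy as np

# Numerical check of the contraction rate alpha = 2 (metric/coordinates
# reconstructed here; treat Theta and z as stated assumptions).
def rosenbrock_grad(x):
    x1, x2 = x
    return np.array([-2 * (1 - x1) - 400 * x1 * (x2 - x1**2),
                     200 * (x2 - x1**2)])

def theta(x):
    # Jacobian of z1 = 10*x1^2 - 10*x2, z2 = 1 - x1
    return np.array([[20 * x[0], -10.0],
                     [-1.0, 0.0]])

def z_of(x):
    return np.array([10 * x[0]**2 - 10 * x[1], 1 - x[0]])

x = np.array([-1.5, 2.0])
dt = 1e-4
z0 = np.linalg.norm(z_of(x))
for _ in range(10000):                      # integrate to t = 1
    M = theta(x).T @ theta(x)               # M(x) = Theta^T Theta
    x = x - dt * np.linalg.solve(M, rosenbrock_grad(x))

decay = np.linalg.norm(z_of(x)) / z0        # expected close to exp(-2)
```

After one unit of time, ||z|| has shrunk by roughly e^{-2}, matching the claimed contraction rate.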
The following proposition provides further insight into the case when the contraction metric is related to an explicit change of variables more generally.
Proposition 5 (Relationship between gradient and natural gradient under a diffeomorphic change of variables). Consider a diffeomorphic change of variables z = g(x) and the associated metric M(x) = \Theta(x)^\top \Theta(x) with \Theta(x) = \partial g / \partial x. Then natural gradient descent on f(x) in the metric M(x) is equivalent to ordinary gradient descent on \tilde{f}(z) = f(g^{-1}(z)) in the z coordinates.

Proof. In the z coordinates we have \nabla_x f(x) = \Theta(x)^\top \nabla_z \tilde{f}(z), so that \dot{z} = \Theta(x)\dot{x} = -\Theta(x) M(x)^{-1} \Theta(x)^\top \nabla_z \tilde{f}(z) = -\nabla_z \tilde{f}(z).

Proposition 6. Consider a metric M(x) and suppose there exists a diffeomorphic change of variables z = g(x) such that M(x) = \Theta(x)^\top \Theta(x) with \Theta(x) = \partial g / \partial x. Then, the associated Riemannian curvature tensor with components R_{ik\ell m} must be identically zero.
Proof. Note that since \delta z = \Theta(x)\, \delta x, it follows that \delta z^\top \delta z = \delta x^\top M(x)\, \delta x, and thus the Riemannian metric tensor expressed in the z coordinates is the identity. Since the components of the Riemannian metric tensor are constant in these transformed coordinates, it follows that the components of the Riemannian curvature tensor are identically zero [32]. Transformation laws for tensors ensure that the components of the curvature tensor remain zero under arbitrary coordinate change, thus R_{ik\ell m} = 0.
The general freedom to consider differential changes of coordinates δ z = Θ(x)δ x where Θ is non-integrable provides additional flexibility and generality to both contraction analysis and g-convexity, as illustrated by the following examples.

Example 5. Consider the non-convex function
which has a global minimum at x = 0. Contours of the function are shown in Fig 1. Gradient descent \dot{x} = -\nabla f(x) can be shown to be contracting at rate λ = 2 in a suitable state-dependent metric. The curvature tensor for this metric has some non-zero components; from Proposition 6, this shows that the metric cannot be derived from an explicit change of coordinates.

Example 6. Consider the function f(z) and natural gradient descent with a given natural metric \Theta(z)^\top \Theta(z). This natural gradient dynamics can be verified to be semi-contracting in the metric \Theta(z)^\top \Theta(z).

Similar to Example 5, this metric has non-zero Riemannian curvature, and thus cannot be derived from a change of coordinates. Fig 2 shows the contours of f and two solutions of natural gradient descent. Fig 3 shows that the geodesic distance between these two solutions is nonincreasing. Since the system is only semi-contracting, the distance between solutions does not tend toward zero. It can be verified that f is a sum of squares and thus f(z) \ge 0, and that f(z) = 0 when z_2 = z_1^3 - z_1. Both initial conditions asymptotically lead to this path-connected set of global optima.
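Since the printed cost in Example 6 is only partially recoverable here, the following sketch uses an assumed sum-of-squares surrogate, f(z) = (z_2 - z_1^3 + z_1)^2, whose zero set is exactly the curve z_2 = z_1^3 - z_1 stated in the text; no semi-contraction claim is made for this surrogate, it merely illustrates convergence to a path-connected set of global optima:

```python
import numpy as np

# Illustration with an ASSUMED surrogate cost (the article's exact f is
# not recovered here): f(z) = (z2 - z1^3 + z1)^2 vanishes exactly on the
# path-connected curve z2 = z1^3 - z1.
def residual(z):
    return z[1] - z[0]**3 + z[0]

def grad_f(z):
    r = residual(z)
    return 2 * r * np.array([1 - 3 * z[0]**2, 1.0])

def descend(z0, steps=40000, dt=1e-3):
    z = z0.copy()
    for _ in range(steps):
        z = z - dt * grad_f(z)
    return z

# Two different initial conditions reach two different zero-loss points.
zA = descend(np.array([1.5, 2.0]))
zB = descend(np.array([-1.0, -2.0]))
```

Both trajectories land on the zero-loss curve, but at well-separated points, mirroring the behavior described for the semi-contracting case.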
Example 7. Geodesically-convex optimization can also be used to carry out manifold-constrained optimization in an unconstrained fashion via recasting problems over a Riemannian manifold directly [14,31]. Taking an intrinsic view of the manifold, coordinate-free results are available [38]; however, for the purposes of computation, we assume a global coordinate chart here. Consider for instance optimization over the set S^n_+ of n × n positive definite matrices, and specifically the problem of finding the Karcher mean of m matrices a_i \in S^n_+, which minimizes the sum of LogDet divergences \sum_{i=1}^m d(a_i \| X) [13] on S^n_+ in the metric that measures symmetric differential displacements as

ds^2 = \mathrm{tr}\!\left( (X^{-1} \delta X)^2 \right)

Naturally, the requirement that δX be symmetric makes it an element of the tangent space to the manifold of symmetric positive definite matrices. This metric generalizes the GP case (6), and coincides with the second-order terms in the Taylor series of the log barrier −log det(X) [45].
The LogDet divergence d(a \| X) is convex in its first argument, and can be shown to be geodesically convex in the second. We illustrate the connection with contraction to show this property. Note that

\nabla_X \, d(a \| X) = X^{-1} - X^{-1} a X^{-1}

so that the natural gradient descent dynamics are simply

\dot{X} = -X \left( X^{-1} - X^{-1} a X^{-1} \right) X = a - X

where the differential displacement δX must be symmetric. Considering the rate of change in length of these differential displacements,

\frac{d}{dt} \, \mathrm{tr}\!\left( (X^{-1} \delta X)^2 \right) = -2 \, \mathrm{tr}\!\left( X^{-1} a \, (X^{-1} \delta X)^2 \right)

and defining the differential change of variables \delta Z = X^{-1/2} \, \delta X \, X^{-1/2}, one has \mathrm{tr}(\delta Z^2) = \mathrm{tr}((X^{-1} \delta X)^2) and

\frac{d}{dt} \, \mathrm{tr}(\delta Z^2) = -2 \, \mathrm{tr}\!\left( \delta Z \, X^{-1/2} a X^{-1/2} \, \delta Z \right) < 0

for all δZ ≠ 0. Hence, considering only the second argument of the LogDet divergence, its Riemannian Hessian is positive definite, thus proving g-convexity via Thm. 1.
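A small numerical companion to Example 7, simplified (by assumption) to a single target matrix a: under the reconstruction above, the natural gradient flow for d(a‖X) reduces to the linear matrix flow X' = a − X, which preserves positive definiteness and converges to X = a.

```python
import numpy as np

# Sketch of Example 7 for a SINGLE target matrix (a simplification we
# assume here): grad_X d(a||X) = X^{-1} - X^{-1} a X^{-1}, and the
# natural gradient flow in ds^2 = tr((X^{-1} dX)^2) reduces to
# X' = -X (X^{-1} - X^{-1} a X^{-1}) X = a - X.
rng = np.random.default_rng(0)
B = rng.standard_normal((3, 3))
a = B @ B.T + 3 * np.eye(3)        # a target in S^n_+
X = np.eye(3)

dt = 1e-2
for _ in range(2000):              # explicit Euler on X' = a - X, to t = 20
    X = X + dt * (a - X)           # each step is a convex combination,
                                   # so X stays positive definite

err = np.linalg.norm(X - a)
eigs_ok = bool(np.all(np.linalg.eigvalsh(X) > 0))
```

The iterates remain in S^n_+ throughout and converge to the unique minimizer X = a of d(a‖X).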

Non-autonomous systems and virtual systems
In our optimization context, the fact that contraction analysis is directly applicable to non-autonomous systems can be exploited in a variety of ways. As we shall detail later, a key aspect is that it allows feedback combinations or hierarchies of contracting modules to be exploited to address more elaborate optimization problems or architectures. Also, it makes the construction of virtual systems [46] possible to potentially extend results beyond natural-gradient descent.

Remark 10. The natural gradient M^{-1}(x) \nabla_x f(x, t) represents the direction of steepest ascent on the manifold at any given time. With this in mind, Remark 7 on robustness enables convergence analysis for natural gradient descent within time-varying optimization contexts [25]. Let x^*(t) denote the optimum of a time-varying α-strongly g-convex function. If \sqrt{\dot{x}^*(t)^\top M(x)\, \dot{x}^*(t)} < R, then the natural gradient will track x^*(t) with accuracy R/α after exponential transient.

Remark 11. Consider a contracting natural gradient system of the form (9). In the autonomous case, the equations governing the differential displacement have a similar structure to the time evolution of h(x) itself. More generally, the use of virtual systems [46][47][48] allows guaranteed exponential convergence to a unique minimum to be extended to classes of dynamics which are not pure natural gradient. For instance, it is common in optimization to adjust the learning rate as the descent progresses. Consider a natural gradient descent with the function f(x) α-strongly g-convex in metric M(x), and define the new system

\dot{x} = -p(x, t) \, M(x)^{-1} \nabla f(x) \quad (21)

where the smooth scalar function p(x, t) modulates the learning rate [33] and is uniformly positive definite,

\exists \, p_{\min} > 0, \; \forall t \ge 0, \; \forall x, \quad p(x, t) \ge p_{\min}

Let us show that this system tends exponentially to the minimum x^* of f(x).
Consider the auxiliary, virtual system

\dot{y} = -p(x(t), t) \, M(y)^{-1} \nabla f(y) \quad (22)

For this system, p(x(t), t) is an external, uniformly positive definite function of time, and thus

\nabla_y \left( p(x, t) f(y) \right) = p(x, t) \, \nabla f(y)

so that the contraction of (9) with rate α implies the contraction of (22) with rate α p_{\min}. Since both x(t) and x^* are particular solutions of (22), this implies in turn that x(t) tends to x^* with rate α p_{\min}. Note that since we only assumed that p(x, t) is uniformly positive definite, in general the actual system (21) is not contracting with respect to the metric M(x).

Remark 12. The learning rate may also be selected to improve the numerical properties of the algorithm in a discrete-time implementation. For example, p(x, t) could vary as the inverse of the condition number of \nabla^2 f(x) to improve numeric conditioning without impact on stability guarantees.
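The virtual-system argument around (21)-(22) can be sketched numerically. The example below (our assumption: f(x) = 0.5‖x‖², M = I, and a particular modulating rate p(x, t) ≥ p_min = 0.6) shows convergence within the guaranteed envelope exp(-α p_min t) even though p varies along the trajectory:

```python
import numpy as np

# Hedged companion to (21): gradient flow of the 1-strongly convex
# f(x) = 0.5*||x||^2 (so grad f(x) = x, M = I) with an ASSUMED
# state/time-dependent rate p(x, t) >= p_min = 0.6.
def p(x, t):
    return 0.6 + 0.4 * np.sin(3 * t) ** 2 + 0.2 / (1 + x @ x)

x = np.array([2.0, -1.0])
dt = 1e-3
t = 0.0
norms = [np.linalg.norm(x)]
for _ in range(5000):                  # integrate to t = 5
    x = x - dt * p(x, t) * x           # Eq (21) with M = I, grad f = x
    t += dt
    norms.append(np.linalg.norm(x))

# Guaranteed envelope from the virtual system: rate alpha*p_min = 0.6
envelope = norms[0] * np.exp(-0.6 * t)
```

The trajectory decays at least as fast as the α p_min envelope predicted by the contraction of the virtual system (22).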

Contraction +
Corollary 1 above points to a more general class of results where contraction or semi-contraction properties are combined with other information, such as a stable local linearization or a decreasing cost, to provide global results.

Contraction is attractive.
As we now show, Corollary 1 extends more generally to autonomous semi-contracting systems. An instance of this result in the case of an identity metric was derived in [49]. Proposition 7. Consider an autonomous system, semi-contracting in a bounded metric M(x). If a system equilibrium is locally asymptotically stable, then it is globally asymptotically stable. In particular, if the system linearization at some equilibrium point is strictly stable, then all system trajectories tend to this equilibrium. Proof. The result is a particular case of Theorem 3, to be discussed next.

Theorem 3. Consider a non-autonomous system, semi-contracting in a bounded metric M(x).
Assume that a specific trajectory x � (t) is locally attractive. Then all trajectories tend asymptotically to x � (t).
In particular, if contraction holds (possibly in a different bounded metric) along a specific trajectory x � (t), and within a tube of constant size around it, then all trajectories tend asymptotically to x � (t).
Proof. The first part generalizes the equilibrium argument from [49] to arbitrary trajectories and arbitrary metrics. Assume that x*(t) is locally attractive, by which we mean that there exists some ε > 0 such that, for any initial time t₀ and initial condition $x_0 \in B_I(x^*(t_0), \epsilon)$, one has $x(t_0 + T; x_0, t_0) \to x^*(t_0 + T)$ as $T \to +\infty$. Without loss of generality, we assume t₀ = 0.
Consider some generic initial condition x₀ with $d_I(x_0, x^*(0)) > \epsilon$. We will argue that there is always a finite time window over which the geodesic distance from x(t; x₀) to x*(t) decreases by a fixed finite increment.
Consider a geodesic connecting x₀ and x*(0), and denote by $\bar{x}$ the unique point on this geodesic that is a geodesic distance $\beta_0 \epsilon$ away from x*(0). Due to the uniform positive definiteness of M, this condition implies that $\bar{x} \in B_I(x^*(0), \epsilon)$.
Because of the local attractivity of x*(t), there exists a time t₁ > 0 by which the trajectory from $\bar{x}$ has approached x*(t₁). This implies that, so long as $d_I(x_0, x^*(0)) > \epsilon$, the trajectory from x₀ will eventually decrease its geodesic distance to x*(t) by a fixed finite increment. Since this process can be repeated, there must exist some time T such that $x(T; x_0) \in B_I(x^*(T), \epsilon)$.
To complete the second part of the proof, we show that if contraction (2) holds within a tube of constant size around the trajectory x*(t), for some bounded metric
$$(\beta_0^\star)^2 I \preceq M^\star(x) \preceq (\beta_1^\star)^2 I$$
and some rate α* > 0, then that trajectory is locally attractive. By condition (2) holding within a tube, we mean that there exists some ε > 0 such that (2) holds for all x within geodesic distance ε of x*(t), at all times. Remark 13. The condition that contraction holds within a tube of fixed size is included to avoid pathological cases where the region of contraction shrinks to zero as t → +∞. For example, the system $\dot{x} = -x + t x^3$ is contracting with rate 1 at the origin for all time, yet the origin is not locally asymptotically stable: its basin of attraction shrinks to zero as the initial time grows. Remark 14. Intuitively, the result can be understood by analogy with a shrinking rope. Consider a path of initial conditions connecting x*(0) to any x₀. At t = 0, only a portion of this path of states is within the basin of attraction of x*(t). Viewing this path as a rope, the semi-contraction property ensures that no part of the rope can increase in length as it flows forward through the dynamics. Yet, due to local attractivity at one end of the rope, a portion of it is guaranteed to have shrinking length, pulling the rest of the rope toward the region of attraction.
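The counterexample in Remark 13 can be checked numerically. The sketch below (illustrative parameters) integrates $\dot{x} = -x + t x^3$ from two initial times: a perturbation of fixed size placed at t₀ = 0 decays, while the same-size perturbation placed at t₀ = 10 escapes in finite time, showing that the basin of attraction shrinks with the initial time.

```python
import numpy as np

# x' = -x + t*x^3 is contracting at the origin (Jacobian -1 + 3*t*x^2 = -1
# at x = 0) for all t, yet a fixed-size perturbation applied at a late
# initial time t0 escapes: the region of contraction shrinks as t grows.

def simulate(x0, t0, T=20.0, dt=1e-4):
    """Euler-integrate from (t0, x0); return final |x|, or inf on escape."""
    x, t = x0, t0
    for _ in range(int(T / dt)):
        x = x + dt * (-x + t * x**3)
        t += dt
        if abs(x) > 1e6:        # finite-time escape detected
            return np.inf
    return abs(x)

print(simulate(0.5, t0=0.0))    # decays toward the origin
print(simulate(0.5, t0=10.0))   # same-size perturbation escapes
```

The exact solution confirms this: with $u = 1/x^2$ one finds $u(t) = C e^{2t} + t + \tfrac12$, so trajectories blow up in finite time whenever C < 0, which occurs for arbitrarily small x₀ once t₀ is large enough.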
Remark 15. Numerical tools for determining contraction metrics [29] are based on the fact that contraction conditions (2) are convex in the metric for a fixed contraction rate. In practice, these methods often involve an outer search procedure for the contraction rate (e.g., via a binary search). In this sense, the use of semi-contraction is desirable as it does not require this additional search.
Remark 16. These results have analogs in the context of controller design using control contraction metrics (CCMs) [48,50]. In this setting, one can impose a semi-contracting closed-loop metric everywhere, except in a tube along a desired trajectory where a strict contraction condition would be required, possibly in a different metric. Since the existence of an exponential (resp. semi) CCM implies that the closed-loop plant can be rendered contracting (resp. semi-contracting), Theorem 3 would then imply asymptotic stabilizability of the desired trajectory.
This extension likewise has analogs for manifold convergence results [48] and convergence to a limit cycle by transverse contraction [51], both of which are special cases of CCM results applied to suitably constructed virtual control systems [48]. In either case, a semi-contracting CCM everywhere can be combined with a contracting CCM condition on the manifold (or limit cycle) and within a neighborhood of it to assert asymptotic stability of the manifold (or limit cycle). In the limit cycle case for autonomous systems, the contracting CCM condition need only be enforced on the limit cycle itself, as its satisfaction within some neighborhood is then guaranteed by compactness. Likewise, for convergence to a compact manifold (e.g., an eggshell) in an autonomous system, the contracting CCM condition need only be considered on the manifold itself.

Contraction as minimization.
Similarly, Proposition 2 may be viewed as a particular instance of the following results, which use contraction properties to minimize a cost or Lyapunov-like function.
Proposition 8 (Exponential Cost Minimization). Consider an autonomous contracting system (23), and a scalar cost function V(x) such that $\dot{V}(x) \le 0$ for all x. Then all trajectories tend exponentially to a global minimum of V.
Proof. Because the system is contracting and autonomous, it tends exponentially to a unique equilibrium x* [15]. Consider now an arbitrary x, and the system trajectory initialized at x. Since the cost V can only decrease along this trajectory, $V(x^*) \le V(x)$ for all x.

Proposition 9. Consider an autonomous semi-contracting system (23) in a bounded metric M(x), and a scalar cost function V(x) such that $\dot{V}(x) \le 0$ for all x. Assume that one system equilibrium x* is locally attractive (e.g., that the linearization at x* is strictly stable). Then this equilibrium is unique, it is a global minimum of V, and all trajectories converge to it asymptotically.
Proof. Applying Proposition 7 shows that all trajectories asymptotically tend to x*, which also implies that the equilibrium is unique. By the same reasoning as in Proposition 8, since V can only decrease along trajectories, x* must be a global minimum of V.
Remark 17. These results extend readily to the case where a system is semi-contracting within some forward invariant region, as opposed to globally. These generalizations may have applicability e.g., to the continuous-time limit of trained neural networks [52][53][54], wherein semi-contraction regions represent basins of attraction that are free of saddles. Metrics may become singular as they approach the boundary of these open sets [15], allowing the semi-contraction region to cover the entire basin.
Proposition 9 can be stated more generally as follows.

Theorem 4 (Asymptotic Cost Minimization). Consider an autonomous semi-contracting system in a bounded metric M(x), and a scalar cost function V(x) such that $\dot{V}(x) \le 0$ for all x. Assume that one trajectory is known to be bounded. Let I be a forward invariant set where $\dot{V} = 0$, and assume that the contraction condition (2) holds on I for some (possibly different) metric. Then I is path connected, all system trajectories converge to a unique equilibrium $x^* \in I$, and V is globally minimized at x*.
Proof. Let us first show that I is path connected, by contradiction. Assume I is not path connected; then it can be decomposed into two disjoint subsets, I₁ and I₂. Because I is invariant and the subsets are disjoint, each subset must be invariant. Strict contraction on I₁ and I₂ then implies that each subset contains at least one locally stable equilibrium point (note that each of the subsets may itself be disconnected and thus may contain more than one stable equilibrium point). The existence of two equilibrium points contradicts Proposition 9, and thus I is path connected.
Next, on the connected invariant set I, contraction implies that the geodesic distance between any two points shrinks exponentially. By the same reasoning as in Proposition 8, this in turn implies convergence to a global minimum of V.
Remark 18. Note that for a system where a scalar cost V satisfies $\dot{V} \le 0$, radial unboundedness of V is a sufficient condition for all trajectories to be bounded, ensuring the existence of a bounded trajectory as required in Theorem 4.
Remark 19. In the case of mechanical systems, V may often be chosen as the total energy of the system, so that Proposition 8 implies exponential convergence of the total energy, and, in turn, that potential energy is exponentially minimized. Similarly, Theorem 4 implies that potential energy is asymptotically minimized.
Remark 20. Contraction criteria can also be expressed in non-Euclidean norms and their associated matrix measures ( [15], section 3.7.ii). The results above extend immediately to these representations.
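Remark 20 can be made concrete with a small numerical check. For the illustrative Jacobian below (chosen for this sketch), the infinity-norm matrix measure $\mu_\infty(A) = \max_i \big(a_{ii} + \sum_{j \ne i} |a_{ij}|\big)$ is negative, certifying contraction in the ∞-norm, even though the Euclidean measure $\mu_2(A) = \lambda_{\max}\big((A + A^\top)/2\big)$ is positive, so the identity-metric Euclidean test is inconclusive.

```python
import numpy as np

# Matrix measures in two norms for a fixed (illustrative) Jacobian A.

def mu_inf(A):
    # infinity-norm measure: max over rows of (diagonal + off-diagonal abs)
    return max(A[i, i] + sum(abs(A[i, j]) for j in range(A.shape[1]) if j != i)
               for i in range(A.shape[0]))

def mu_2(A):
    # Euclidean measure: largest eigenvalue of the symmetric part
    return np.linalg.eigvalsh((A + A.T) / 2).max()

A = np.array([[-1.0,  0.0],
              [ 9.0, -10.0]])
print(mu_inf(A))   # -1.0: contracting in the infinity norm
print(mu_2(A))     # positive: the Euclidean measure fails here
```

This illustrates why the choice of norm (or metric) matters: a system may be contracting in one norm's measure while no constant Euclidean metric certifies it.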

Primal-dual optimization
Primal-dual algorithms are widely used in optimization to determine saddle points, and they also appear naturally in constrained optimization [45], where Lagrange multipliers play the role of dual variables. When a function is strictly convex in a subset of its variables and strictly concave in the remaining ones, gradient descent/ascent dynamics converge to a unique saddle equilibrium [55,56]. Within the context of constrained optimization, these dynamics are known as the primal-dual dynamics. Such dynamics play an important role in machine learning, for instance in adversarial training [57], in information-theoretic analyses of deep networks [58], in reinforcement learning [59] and actor-critic methods, and in support vector machine representations [60]. More generally, they are central to a large class of practical min-max problems, such as problems in physics involving free energy or, e.g., nonlinear electrical networks modeled in terms of Brayton-Moser mixed potentials [25,61,62].
Consider a scalar function L(x, λ, t), possibly time-dependent, and metrics $M_x(x)$ and $M_\lambda(\lambda)$. Consider the natural primal-dual dynamics, which we define as
$$\dot{x} = -M_x(x)^{-1}\nabla_x L(x,\lambda,t), \qquad \dot{\lambda} = M_\lambda(\lambda)^{-1}\nabla_\lambda L(x,\lambda,t). \qquad (25)$$
In contrast to Remark 8, wherein spurious saddle equilibrium points presented an obstacle to global contraction, here the target equilibrium points of these dynamics are, by construction, the saddle points of the function L. Using the metrics $M_x(x)$ and $M_\lambda(\lambda)$ extends the standard case [25], where they would be replaced by constant, symmetric positive definite matrices. The practical relevance of this extension is illustrated by the following example in the case of natural adaptive control.
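Before turning to that example, the primal-dual flow (25) with identity metrics can be sketched on an elementary constrained problem. The Lagrangian below is an illustrative choice: minimize $\tfrac12 x^2$ subject to x = 1, i.e., $L(x,\lambda) = \tfrac12 x^2 + \lambda(x-1)$, whose saddle point is (x, λ) = (1, −1).

```python
import numpy as np

# Primal-dual flow (25) with identity metrics on an illustrative Lagrangian:
#   L(x, lam) = 0.5*x**2 + lam*(x - 1)
#   x'   = -grad_x L      = -(x + lam)
#   lam' = +grad_lam L    =  x - 1
# The unique saddle point is (x, lam) = (1, -1).

x, lam = 3.0, 2.0
dt = 1e-3
for _ in range(30000):
    dx = -(x + lam)
    dlam = x - 1.0
    x, lam = x + dt * dx, lam + dt * dlam

print(round(x, 3), round(lam, 3))   # approaches the saddle (1, -1)
```

The linearization has eigenvalues with real part −1/2, so the flow spirals into the saddle point rather than descending monotonically, which is the typical behavior of primal-dual dynamics.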

Primal dual dynamics in natural adaptive control
This section illustrates the presence of natural primal-dual dynamics embedded in the application of natural adaptive control laws. Consider a system of the form (26), where $\tilde{x} = x - x_d$ denotes the tracking error. With the associated sliding variable, we define a reference $x_r^{(n-1)}$ for the (n − 1)-th derivative of the state.
Choosing the control law accordingly, and inspired by the elegant modification of the Slotine and Li adaptive robot controller introduced by Lee et al. [64,65], we consider a Lyapunov-like function based on the Bregman divergence [65,66].
Note that if f is a second-order function $\tfrac{1}{2}a^\top P a$, the Bregman divergence is simply $\tfrac{1}{2}\tilde{a}^\top P \tilde{a}$, with $H^{-1}$ equal to the constant matrix $P^{-1}$, similar to the standard adaptive algorithm [63].
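The quadratic case is a one-line expansion of the standard definition $D_f(a,\hat a) = f(a) - f(\hat a) - \nabla f(\hat a)^\top (a - \hat a)$:

```latex
D_f(a,\hat a)
  = \tfrac12 a^\top P a - \tfrac12 \hat a^\top P \hat a
    - \hat a^\top P (a - \hat a)
  = \tfrac12 a^\top P a - \hat a^\top P a + \tfrac12 \hat a^\top P \hat a
  = \tfrac12 (a - \hat a)^\top P (a - \hat a)
  = \tfrac12\, \tilde a^\top P \tilde a .
```

In this case the divergence reduces to the familiar weighted quadratic parameter-error term of classical adaptive control.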
A quick calculation shows that the derivative of the Bregman divergence is simply $\dot{\tilde{a}}^\top H \tilde{a}$, so that the adaptation law yields the dynamics (30) and (31). Considering a virtual system with W and Q as externally provided functions of time, the dynamics (30) and (31) are equivalent to a natural primal-dual flow in the decoupled metrics $M_s = J$ and $M_{\hat{a}} = H(\hat{a})$. Overall, this construction enables the results on natural adaptive robot control [64,65] to be extended to the broader class (26). Remark 21. Note that a similar construction could be applied to provide natural adaptation within recent applications of nonlinear adaptive control [67, Thm. 2] based on control contraction metrics [48,50].

Natural primal dual
Continuous-time convex primal-dual optimization is analyzed from a nonlinear contraction perspective in [25], building on an earlier result of [19]. As we now show, Theorem 1 yields a natural extension to geodesic primal-dual optimization, where convexity in the primal and dual variables is replaced by g-convexity, thus broadening the above results to state-dependent metrics.
Theorem 5. Consider a scalar function L(x, λ, t), with L g-strongly convex in x and g-strongly concave in λ, in metrics $M_x(x)$ and $M_\lambda(\lambda)$ respectively. Then the geodesic primal-dual dynamics (25) are globally contracting, in the metric $\mathrm{BlkDiag}(M_x(x), M_\lambda(\lambda))$. Proof. Letting $z = [x^\top, \lambda^\top]^\top$ and $\dot{z} = f(z,t)$ denote the overall system dynamics, the system's Jacobian can be written
$$A(x,\lambda,t) = \frac{\partial f}{\partial z} = \begin{pmatrix} -\dfrac{\partial}{\partial x}\big(M_x^{-1}\nabla_x L\big) & -M_x^{-1}\nabla_{x\lambda} L \\[4pt] M_\lambda^{-1}\nabla_{\lambda x} L & \dfrac{\partial}{\partial \lambda}\big(M_\lambda^{-1}\nabla_\lambda L\big) \end{pmatrix}$$
so that, using Theorem 1, contraction in this metric follows. Proposition 10. Consider the primal-dual dynamics (25) for a scalar cost function L(x, λ), with L g-strongly convex in x and g-concave (not necessarily strongly so) in λ, in metrics $M_x(x)$ and $M_\lambda(\lambda)$ respectively. Suppose also that one solution of (25) is known to be bounded. Then, for any initial condition, the geodesic primal-dual dynamics (25) converge to an equilibrium (x*, λ*). Moreover, x* is independent of initial conditions.
Proof. The proof is given as a corollary to Theorem 6 in the next section. Remark 22. The above proposition highlights that contraction of the PD dynamics (e.g., as developed in [25]) is not necessary to guarantee convergence to a unique primal solution. Note, however, that the above results only guarantee asymptotic convergence toward the unique primal equilibrium, as opposed to the exponential convergence obtained when contraction can be shown for the PD dynamics as a whole [25].
This observation is reminiscent of results in adaptive control wherein the error dynamics of a certainty-equivalent controller may be asymptotically stable despite the fact that an associated adaptation law may not converge to the actual unknown parameters [63,68], with adaptation occurring on a "need-to-know" basis in that sense. Conceptually, this principle can apply to more general contexts involving concurrent control and learning, when effective control is the main goal (e.g., in reinforcement learning).
functions satisfy the skew-symmetry conditions for each j > i, then the suitable generalizations of (35) result in a coupled system that is contracting with rate min(α₁, . . ., αₙ) in the metric $\mathrm{BlkDiag}(k_1 M_1, \ldots, k_n M_n)$. The overall system converges to a unique Nash-like equilibrium at which each player's natural gradient vanishes, with a similar relation for each other player. Likewise, the result can be extended to the case where the natural gradient dynamics of individual players are only semi-contracting. Theorem 6. Consider the two-player case (35), wherein (a) f₁ is α₁-strongly g-convex with α₁ > 0 in a uniformly positive definite metric M₁(x₁) for each x₂; (b) the Riemannian Hessian H₂(x₁, x₂) of f₂(x₁, x₂) in x₂ is only positive semi-definite for each x₁, in a uniformly positive definite metric M₂(x₂); and (c) the skew-symmetry property (34) holds. Assume that one trajectory of (35) is known to be bounded. Then every trajectory of (35) converges to a Nash equilibrium $(x_1^*, x_2^*)$. Moreover, $x_1^*$ does not depend on initial conditions (i.e., every Nash equilibrium has the same strategy for player 1).
Proof. It can be shown that the virtual displacements evolve such that, by Barbalat's lemma, $\delta x_1 \to 0$. By the same argument as used after (20), it follows that $\nabla_{x_1} f_1(x_1(t), x_2(t)) \to 0$ as $t \to \infty$. So, for each initial condition, x₁(t) must converge to some equilibrium $x_1^*$ of the x₁ dynamics. Furthermore, since any $\delta x_1 \to 0$, $x_1^*$ must be unique and independent of initial conditions. Let us now turn to the behavior of the x₂ dynamics. Given an arbitrary initial condition $(x_{1,0}, x_{2,0})$ for (35), let L⁺ denote its ω-limit set. Any point $(x_1, x_2) \in L^+$ must satisfy $x_1 = x_1^*$. Since the dynamics are autonomous, L⁺ is composed of trajectories of the system
$$\dot{x}_2 = -M_2(x_2)^{-1}\nabla_{x_2} f_2(x_1^*, x_2). \qquad (37)$$
Moreover, L⁺ must be closed and bounded. Since (37) is a natural gradient system of a g-convex function, Theorem 2 ensures that any trajectory of (37) must converge to an equilibrium point that is a global minimizer of $f_2(x_1^*, x_2)$. Considering any initial condition of (37) that begins in L⁺, we denote by $(x_1^*, x_2^*) \in L^+$ the resulting equilibrium point. However, since (35) is semi-contracting, any geodesic ball around $(x_1^*, x_2^*)$ is forward invariant for (35), which implies that $L^+ = \{(x_1^*, x_2^*)\}$. Thus, $x_2(t) \to x_2^*$ as $t \to \infty$. Note again that, while $x_2^*$ depends on initial conditions, $x_1^*$ does not.
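The structure of Theorem 6 can be illustrated numerically. In the sketch below (identity metrics, costs chosen for illustration), player 1's cost $f_1(x_1) = \tfrac12 x_1^2$ is strongly convex, while player 2's cost $f_2(x_1, x_2) = \tfrac12 (w^\top x_2 - x_1)^2$ has only a positive semi-definite Hessian in x₂, so there is a whole line of minimizers; since f₁ does not depend on x₂ here, the skew-symmetry condition holds trivially.

```python
import numpy as np

# Two-player gradient flow in the spirit of Theorem 6 (identity metrics):
#   x1' = -grad f1(x1)           with f1 = 0.5*x1**2  (strongly convex)
#   x2' = -grad_{x2} f2(x1, x2)  with f2 = 0.5*(w @ x2 - x1)**2
# The x2-Hessian w w^T is only semi-definite: a line of Nash equilibria.

w = np.array([1.0, 1.0]) / np.sqrt(2.0)

def flow(x1, x2, T=40.0, dt=1e-3):
    for _ in range(int(T / dt)):
        g1 = x1                       # gradient of f1
        g2 = w * (w @ x2 - x1)        # gradient of f2 in x2
        x1, x2 = x1 - dt * g1, x2 - dt * g2
    return x1, x2

# same x1 start, two different initializations for player 2:
x1a, x2a = flow(3.0, np.array([2.0, 0.0]))
x1b, x2b = flow(3.0, np.array([0.0, -4.0]))
print(round(x1a, 6), round(x1b, 6))        # x1* = 0 in both runs
print(np.round(x2a, 3), np.round(x2b, 3))  # x2* differs with initialization
```

As the theorem predicts, $x_1^*$ is the same for every initialization, while $x_2^*$ lands at different points of the Nash set $\{x_2 : w^\top x_2 = 0\}$ depending on the initial condition.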

Corollary 3.
Consider any two equilibrium points $(x_1^*, x_{21}^*)$ and $(x_1^*, x_{22}^*)$ of a system that satisfies the conditions of Theorem 6. Then the geodesic between these points consists of Nash equilibrium points, all of which have the same cost.
Further, if one equilibrium of (37) is locally asymptotically stable, then it is necessarily globally attractive, and thus all trajectories of (37) converge to this unique equilibrium ðx � 1 ; x � 2 Þ regardless of initial conditions.
Proof. Proof of the first part follows immediately from applying Corollary 3.1 of [12] to the function $f_2(x_1^*, x_2)$. Proof of the second part follows immediately from the application of Corollary 1 herein. Corollary 4. Proposition 10 is true. Proof. Consider Thm. 6 with $f_1 = L(x, \lambda)$ and $f_2 = -L(x, \lambda)$.

Hierarchical natural gradient
Consider a function f₁(x₁) that is α₁-strongly g-convex in a metric M₁(x₁), and a function f₂(x₁, x₂) that is α₂-strongly g-convex in a metric M₂(x₂) for each given x₁. Then the hierarchical natural gradient dynamics
$$\dot{x}_1 = -M_1(x_1)^{-1}\nabla f_1(x_1), \qquad \dot{x}_2 = -M_2(x_2)^{-1}\nabla_{x_2} f_2(x_1, x_2)$$
are contracting with rate min(α₁, α₂) in the metric M(x₁, x₂) = BlkDiag(M₁(x₁), M₂(x₂)), under the mild assumption that the coupling Jacobian is bounded [15]. Since the overall system is both contracting and autonomous, it tends to a unique equilibrium [15] at rate min(α₁, α₂), and thus to the unique solution of
$$\nabla f_1(x_1) = 0, \qquad \nabla_{x_2} f_2(x_1, x_2) = 0.$$
By recursion, this structure can be chained an arbitrary number of times, or applied to any cascade or directed acyclic graph of natural gradient dynamics. Such hierarchical optimization may play a role, for instance, in backpropagation of natural gradients in machine learning, with all descents occurring concurrently rather than in sequence. Remark 23. In large-scale optimization settings, such as those common in machine learning, natural gradient with a fully dense metric can become intractable. In specific cases, such as natural gradient descent based on Fisher information [33], computationally effective approximations have been derived [69,70]. In addition, combining simple (e.g., diagonal) metrics through hierarchical structures offers an opportunity to recover significant complexity at broad scale; see, e.g., the hierarchical combination of scalar metrics to learn hierarchical representations of symbolic data in [71,72]. Such simpler metrics are also well motivated in the context of positive or monotone systems [73,74]. In the special case of a Hessian metric M(x) = ∇²ψ(x) derived from a potential ψ(x), note that continuous mirror descent (see also Proposition 5 and Example 4) provides an alternate method to compute the continuous natural gradient.
These methods can avoid the need to invert the metric in cases where an explicit inverse exists for the change of variables z = ∇ψ(x), or when (15) can be run at a fast time scale to invert the gradient map through dynamics.
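The mirror-descent route can be sketched with a concrete Hessian metric. Below, ψ is the negative entropy $\psi(x) = \sum_i x_i \log x_i$ (an illustrative choice, as is the quadratic cost), so the mirror map $z = \nabla\psi(x) = 1 + \log x$ has the explicit inverse $x = e^{z-1}$; integrating $\dot{z} = -\nabla f(x(z))$ then realizes $\dot{x} = -\nabla^2\psi(x)^{-1}\nabla f(x)$ without ever inverting the metric.

```python
import numpy as np

# Continuous mirror descent as natural gradient for the Hessian metric
# M(x) = hess(psi)(x) = diag(1/x), psi(x) = sum_i x_i*log(x_i):
#   z' = -grad f(x),  x = exp(z - 1)  (explicit inverse of grad psi)
# is equivalent to  x' = -diag(x) * grad f(x)  on the positive orthant.

c = np.array([0.5, 2.0])          # minimizer of the illustrative cost

def grad_f(x):
    return x - c                  # f(x) = 0.5*||x - c||^2

x0 = np.array([3.0, 0.1])
z = 1.0 + np.log(x0)              # mirror variable z = grad(psi)(x0)
dt = 1e-3
for _ in range(40000):
    x = np.exp(z - 1.0)           # map back to primal space (no inversion)
    z = z - dt * grad_f(x)

print(np.round(np.exp(z - 1.0), 3))   # approaches the minimizer c
```

Note that only the forward map ∇ψ and its closed-form inverse appear in the loop; the metric diag(1/x) is never formed or inverted, which is the computational point of the remark.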

Conclusions
Overall, this paper has demonstrated that nonlinear contraction analysis provides a general perspective for analyzing and certifying the global convergence properties of gradient-based optimization algorithms. The common case of strong convexity corresponds to the special case of contracting gradient descent in the identity metric, while our analysis admits global convergence results in the significantly broader case of state-dependent metrics. This result has clear links to geodesically convex optimization, wherein natural gradient descent converges to a unique equilibrium if it is contracting in any metric, broadening the special case of g-convexity, which corresponds to contraction in the natural metric. Our analysis of semi-contraction of gradient systems, and the resulting smoothly connected sets of global optima, may shed additional light on applications in learning with over-parameterized networks [2], where the set of optimizers is recognized to take the form of a low-dimensional manifold. Results on natural primal-dual dynamics and on convergence to Nash equilibria showcase the broad reach of these fundamental results, which may serve as the basis for larger-scale distributed optimization algorithms in future work. A framework we call Contraction + shows how contraction or semi-contraction properties can be combined with specific but coarse information on a system, such as the local stability of a particular equilibrium or the weak decrease of a cost or Lyapunov-like function, to conclude global convergence or minimization.
A natural next step for the application of contraction in optimization is to design geodesic quorum sensing [16,75] algorithms for synchronization [76], as well as other consensus mechanisms considering time-delays [17,18], which may serve as the basis for distributed and large-scale optimization techniques on Riemannian manifolds. Other future applications will consider stochastic gradient descent in the Riemannian setting [77] with quorum sensing extensions (as, e.g., in [78,79]). Such advances could have direct applications, e.g., in the context of machine learning, among others.