On the choice of metric in gradient-based theories of brain function

This is a PLOS Computational Biology Education paper. The idea that the brain functions so as to minimize certain costs pervades theoretical neuroscience. Because a cost function by itself does not predict how the brain finds its minima, additional assumptions about the optimization method need to be made to predict the dynamics of physiological quantities. In this context, steepest descent (also called gradient descent) is often suggested as an algorithmic principle of optimization potentially implemented by the brain. In practice, researchers often consider the vector of partial derivatives as the gradient. However, the definition of the gradient and the notion of a steepest direction depend on the choice of a metric. Because the choice of the metric involves a large number of degrees of freedom, the predictive power of models that are based on gradient descent must be called into question, unless there are strong constraints on the choice of the metric. Here, we provide a didactic review of the mathematics of gradient descent, illustrate common pitfalls of using gradient descent as a principle of brain function with examples from the literature, and propose ways forward to constrain the metric.

Optimization methods to train neural network models are often taken from machine learning, a field that has had intense interactions with theoretical and computational neuroscience [20]. A successful method in machine learning, despite its simplicity, has been the method of (stochastic) steepest descent or gradient descent [21].
Gradient descent and steepest descent are the same, since the negative gradient points in the direction of steepest descent (see Equation 7). Often the direction of gradient descent is visualised as a vector orthogonal to the contour lines of the cost function. The notion of orthogonality, however, assumes a Riemannian metric (also known as inner product or scalar product in vector spaces). The Riemannian metric also enters an alternative, but equivalent, definition of the direction of steepest descent: the direction of steepest descent produces the greatest absolute decrease of the cost function for a step of a fixed (and small) size, where the step size is determined by the choice of the Riemannian metric. Thus, a cost function by itself does not predict the trajectories that lead to its minima through steepest descent; a cost function combined with a metric, however, does (see Figure 1). Why do we normally not think of the metric as an important and essential quantity? The physical space that surrounds us, at the scales that we encounter in everyday life, is Euclidean. Thus, a mountaineer who would like to determine the direction of steepest ascent of the terrain refers to Euclidean geometry. In this case, the steepest direction is unambiguous because the way to measure distances is intrinsic to the space and not merely an artifact of using a particular set of coordinates. On a map that faithfully represents Euclidean geometry, i.e. preserves angles and lengths up to some scaling factor, the mountaineer may find the steepest direction by drawing a curve that runs perpendicular to the contour lines (see Figure 2A, red route). But if a wicked hotelier gave the mountaineer a map that does not faithfully represent Euclidean geometry, another route would be chosen when planning the route as perpendicular to the contour lines (see Figure 2B, blue route). We will refer to this as the "wicked-map problem" in the following.
What may look obvious in the context of hiking maps can be confusing in contexts in which it is less clear how to draw a sensible map, i.e. how to choose a natural parametrization of an observed phenomenon. We will discuss how naive gradient ascent or descent, as taught in textbooks (e.g. [4,21]), is susceptible to the "wicked-map problem". While it is simple to display the same path in different maps by following standard transformation rules, the choice of an appropriate metric remains a challenge. In other words, how should one know a priori which metric is most appropriate to predict a route with gradient ascent dynamics? We will illustrate the problems around gradient ascent and descent with three examples from the theoretical neuroscience literature and discuss ways forward to constrain the choice of metric.

2. The gradient is not equal to the vector of partial derivatives
Given a cost function C(x) that depends on variables x = (x_1, ..., x_N), where the variables x_i could be synaptic weights or other plastic physiological quantities, naive gradient descent dynamics is sometimes written as [4,21]

x_i ← x_i − η ∂C/∂x_i , (1)

or in continuous time

dx_i/dt = −η̃ ∂C/∂x_i , (2)
where η and η̃ are parameters called learning rates. As we will illustrate in the course of this section, this has two consequences:
• The "wicked-map problem": the dynamics in Equation 1 and Equation 2 depend on the choice of the coordinate system.
• The "unit problem": if x i has different physical units than x j , the global learning rate η should be replaced by individual learning rates η i that account for the different physical units.
In Section 3, we will explain the geometric origin of these problems and how they can be solved.
The "wicked-map problem" often occurs in combination with the "unit problem", but it is present even for dimensionless parameters.The parameters or coordinates that are used in a given problem are mostly arbitrary; they are simply labels attached to different pointswhile the points themselves (for example, the position of the mountaineer) have properties independent of the parameters chosen to represent them.For example, it is common to scale the variables or display a figure in logarithmic units, or simply display them in a different aspect ratio (transformations like the shearing transformation in Figure 2).We expect the predictions of a theory to be independent of the choice of parametrizations.Hence, if we think of the optimization as a biophysical process that effectively minimizes a cost function, then this biophysical process should not depend on our choice of the coordinate system.However, as we will show below, a rule such as Equation 2 that equates the time derivative of a coordinate with the partial derivative of a cost function (times a constant) is not preserved under changes of parametrization (see Figure 2A,B).
In order to address the "unit problem" we can normalize each variable, dividing by its mean or maximum so as to make it unitless. However, this merely replaces the choice of an arbitrary learning rate η_i for each component by the choice of an arbitrary normalizing constant for each variable.

Artificial examples
To illustrate the "wicked-map problem", let us first consider the minimization of a (dimensionless) quadratic cost C(x) = (x − 1)², where x > 0 is a single dimensionless parameter. The derivative of C is given by C′(x) = 2x − 2.

Naive gradient descent minimization according to Equation 2 yields dx/dt = −η C′(x) = −2η(x − 1), which is solved by x(t) = 1 + e^(−2ηt) for the initial condition x(0) = 2.
Since x is larger than zero and dimensionless, one may choose an alternative parametrization x̃ = √x. The cost function in the new parametrization reads C̃(x̃) = (x̃² − 1)², and its derivative is given by C̃′(x̃) = 4x̃(x̃² − 1).
In this parametrization, it may be argued that a reasonable optimization runs along the trajectory dx̃/dt = −η C̃′(x̃) = −4ηx̃(x̃² − 1), which is solved by x̃(t)² = (1 − ½ e^(−8ηt))^(−1) for the initial condition x̃(0) = √2. After transforming this solution back into the original coordinate system with parameter x, we see that the original dynamics x(t) = 1 + e^(−2ηt) and the new dynamics x(t) = x̃(t)² = (1 − ½ e^(−8ηt))^(−1) are very different. This is expected, because the (1-dimensional) vector field −C′(x) = 2 − 2x that generates the first trajectory should pick up a Jacobian factor ∂x̃/∂x = 1/(2√x) under the change of parametrization from x to x̃, giving (1 − x̃²)/x̃, which is different from the vector field −C̃′(x̃) = −4x̃(x̃² − 1) that generates the second trajectory. This first, one-dimensional example shows that the naive gradient descent dynamics of Equation 2 does not transform consistently under a change of coordinate system.
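The mismatch between the two parametrizations can also be checked numerically. The sketch below, with an illustrative learning rate η = 0.1 (not prescribed by the text), integrates both naive gradient flows with the Euler method and compares them against the closed-form solutions, where x̃(t)² = (1 − ½ e^(−8ηt))^(−1) is the solution reconstructed above.

```python
import numpy as np

# Euler-integrate naive gradient descent on C(x) = (x - 1)^2 in the
# original coordinate x and on Ct(xt) = (xt^2 - 1)^2 in the
# reparametrization xt = sqrt(x); eta, dt and the horizon are
# illustrative choices.
eta, dt, steps = 0.1, 1e-4, 200_000

x = 2.0                 # original coordinate, x(0) = 2
xt = np.sqrt(2.0)       # new coordinate, xt(0) = sqrt(2)
for _ in range(steps):
    x += dt * (-eta * 2 * (x - 1))            # dx/dt  = -eta * C'(x)
    xt += dt * (-eta * 4 * xt * (xt**2 - 1))  # dxt/dt = -eta * Ct'(xt)

t = steps * dt
x_closed = 1 + np.exp(-2 * eta * t)                # x(t) = 1 + e^(-2 eta t)
xt2_closed = 1 / (1 - 0.5 * np.exp(-8 * eta * t))  # xt(t)^2, back-transformed

print(x, x_closed)        # numerical vs closed-form, original coordinate
print(xt**2, xt2_closed)  # numerical vs closed-form, new coordinate
print(abs(x - xt**2))     # the two routes to the minimum at x = 1 differ
```

Both trajectories converge to the same minimum x = 1, but at any finite time they disagree, which is the one-dimensional version of the "wicked-map problem".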
Besides this parametrization, other equivalent ways to parametrize the normal distribution are mean µ and variance s = σ², or mean µ and precision τ = 1/σ². Thus the function C is expressed in the other parametrizations as C̃(µ, s) = C(µ, √s) and C̄(µ, τ) = C(µ, 1/√τ). When we apply the same recipe as before to the new parametrizations, we obtain the dynamics dµ/dt = −∂C̃/∂µ and ds/dt = −∂C̃/∂s, and analogously for C̄. Despite the different looks of the flow fields resulting from the three different parametrizations, all of them can be seen to describe dynamics that minimize the cost function (Figure 3). However, this example illustrates an important geometrical property that we will come back to later: the differential of a function f, i.e. the collection of its partial derivatives, does not transform like a proper vector.

Gradient descent in neuroscience
In this section we present three examples from published works, where it is postulated that the dynamics of a quantity relevant in neuroscience follows gradient descent on some cost function.
In 2007, a learning rule for intrinsic neuronal plasticity was proposed to adjust two parameters a, b of a neuronal transfer function g_ab(x) = (1 + exp(−(ax + b)))^(−1) [18]. The rule was derived by taking the derivatives of the Kullback-Leibler (KL) divergence D_KL(f_y || f_exp) between the output distribution f_y, resulting from a given input distribution over x and the above transfer function, and an exponential distribution f_exp with decay parameter µ > 0. The flow field in Figure 1A of [18] (here Figure 4A) is obtained with the Euclidean metric. If x is a current or a voltage, one would encounter the "unit problem", since a and b would have different physical units; one may therefore assume that x is normalized such that x, a and b are dimensionless.
The "wicked-map problem" appears, since it is unclear whether the Euclidean distance in the (a, b)-plane is the most natural way to measure distances between the output distributions f y that are parametrized by a and b.In fact, in 2013 a different dynamics has been predicted for the same cost function, but under the assumption of the Fisher information metric 1 [22] which can be considered a more natural choice to measure distances between dis- 1 See Section 4 for more details on the Fisher metric.
tributions than the Euclidean metric (see Figure 4B).
Similarly, it has been argued that the quantal amplitude q and the release probability P_rel in a binomial release model of a synapse evolve according to a gradient descent on the KL divergence from an arbitrarily narrow Gaussian distribution with fixed mean ϕ to the Gaussian approximation of the binomial release model [19]. To avoid the "unit problem", the quantal amplitude q was appropriately normalised. Since q and P_rel parametrize probability distributions, one may also argue for this study that the Fisher information metric (Figure 4D) is a more natural choice, a priori, than the Euclidean metric (Figure 4C), but the corresponding flow fields are just two examples of the infinitely many possible flow fields that would
be consistent with gradient descent on the same cost function. Alternatively, one could, for example, consider metrics that depend on metabolic costs; it may be more costly to move a synapse from release probability P_rel = 0.9 to release probability P_rel = 1.0 than from P_rel = 0.5 to P_rel = 0.6. If there is no further principle to constrain the choice of metric (see e.g. Section 4), data itself may guide the choice of metric. Surprisingly, the available and appropriately normalized experimental data is consistent with the Euclidean metric in P_rel−q space [19], but there is probably not sufficient data to discard a metric based on metabolic cost.
Gradient descent has been popular as an approach to postulate synaptic plasticity rules [7-9, 11-17]. As an example, minimizing by gradient descent the KL divergence from a target distribution of spike trains to a model distribution of spike trains [15] is claimed to lead to a plasticity rule with a constant learning rate η. This choice of a constant learning rate is equivalent to choosing the Euclidean metric on the weight space. But there is no reason to assume that the learning rate should be constant or the same for each parameter (synaptic weight): one could just as well choose individual learning rates η_ij(w_ij). This generalization still corresponds to the choice of a diagonal Riemannian metric. But, while it is often assumed that the change of a synapse depends only on pre- and postsynaptic quantities (but see [15]), there could be some cross-talk between neighbouring synapses, which could be captured by non-diagonal Riemannian metrics.
This example shows that gradient descent does not lead to unique learning rules. Rather, each postulate of a gradient descent rule should be seen as a family of possibilities: there is a different learning rule for each choice of the Riemannian metric.
3. What is the gradient, then? How to do steepest descent in a generic parameter space
In the preceding section, we have shown that the partial derivatives with respect to the parameters do not transform correctly under changes of parametrization (i.e. not as we would expect for the components of a vector or flow field). In order to work with generic spaces which may carry different parametrizations, it is useful to apply methods from differential geometry.
A Riemannian metric on an N-dimensional manifold (an intrinsic property of the space) gives rise to an inner product (possibly position-dependent) on R^N for each choice of parametrization. The matrix representation of the inner product depends on the choice of parametrization. However, the dependence is such that the result of an evaluation of the inner product is independent of the choice of parametrization. When described in this language, the geometry of the trajectories in the space is therefore independent of parameter choices.
We refer to the Appendix for a detailed treatment of the gradient in Riemannian geometry. In the following, we simply give the definition of the gradient in terms of the inner product.
For a function f : R^N → R and an inner product ⟨•, •⟩ : R^N × R^N → R, a common implicit definition (e.g. [23]) of the gradient is

⟨(∇f)(x), u⟩ = (D_u f)(x) for all non-zero vectors u ≠ 0, (3)

i.e. the gradient (∇f)(x) is the vector that is uniquely defined by the property that its product with any vector u is equal to the derivative of f in direction u. With the Euclidean inner product ⟨v, w⟩_E = Σ_{i=1}^N v_i w_i it is a simple exercise to see that the components of the gradient are the partial derivatives. However, with any other inner product ⟨v, w⟩_{G(x)} = Σ_{i,j=1}^N v_i G_ij(x) w_j, characterized by the position-dependent symmetric, positive definite matrix G(x), the gradient is given by

(∇f)(x) = G^(−1)(x) (∂f/∂x_1, ..., ∂f/∂x_N)^T,

i.e. the matrix product of the inverse of G(x) with the vector of partial derivatives. Note that the inverse G^(−1)(x) is also a symmetric, positive definite matrix. The inverse of G(x) automatically carries the correct physical units and the correct transformation behaviour under reparametrizations, i.e. the components of the matrix transform as

(G^(−1))′_kl = Σ_{i,j} (∂x′_k/∂x_i) (∂x′_l/∂x_j) (G^(−1))_ij

under a reparametrization from x to x′, such that the dynamics is invariant under a change of parametrization. Following standard nomenclature, we call the gradient induced by the Riemannian metric G the Riemannian gradient.
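Computing a Riemannian gradient is a single linear solve. The sketch below uses an illustrative quadratic cost and a shear-induced metric (echoing the wicked map of Figure 2); both are hypothetical choices for demonstration, not taken from the text.

```python
import numpy as np

# Riemannian gradient: G^{-1} times the vector of partial derivatives.
def partials(x):
    # partial derivatives of the illustrative cost C(x) = x1^2 + x2^2
    return np.array([2 * x[0], 2 * x[1]])

x = np.array([1.0, 2.0])

G_euclid = np.eye(2)
A = np.array([[1.0, 0.5],
              [0.0, 1.0]])       # a shear, as in the "wicked map"
G_shear = A.T @ A                # metric induced by the sheared coordinates

grad_euclid = np.linalg.solve(G_euclid, partials(x))  # = partial derivatives
grad_shear = np.linalg.solve(G_shear, partials(x))    # different direction

print(grad_euclid)
print(grad_shear)
print(grad_shear @ partials(x))  # positive: still an ascent direction
```

Both vectors are ascent directions of the same cost, but they point in different directions; only the choice of metric distinguishes them.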
The gradient is used in optimization procedures because it points in the direction of steepest ascent. To see this we define the direction of steepest ascent as the direction u in which the change of the function f is maximal. Using the definition of the gradient in Equation 3 and determining the maximum we find

u* = (∇f)(x) / ||(∇f)(x)||, (7)

where ||v|| = √⟨v, v⟩ denotes the norm induced by the metric ⟨•, •⟩.

4. On choosing a metric
Given an arbitrary vector field one may ask whether it is possible to represent it as a steepest descent on some cost function with respect to some metric.When the metric is already known there is a systematic way to check whether the vector field can be written as a gradient, and to construct a suitable cost function (see Appendix C).If the metric is unknown, one may have to construct a metric which is tailored to the dynamical system (see also Appendix C).
Instead of constructing a custom-made metric for the dynamical system, it may be more desirable (from the perspective of finding the most parsimonious description) to choose a metric a priori and then check whether a given dynamical system has the form of a gradient descent with respect to that metric.Such an a priori choice could be guided e.g. by biophysical principles and therefore becomes an integral part of the theory.For example, a metric could reflect the equivalence of metabolic cost that is incurred in changing individual parameters.Another example is Weber's law, which implies that parameter changes of the same relative size are equivalent.This would suggest a constant (but not necessarily Euclidean) metric on a logarithmic scale.A third example is the homogeneity across an ensemble: if there are N neurons of the same type and functional relevance, we may want to constrain the metrics to those that treat all neurons identically when changing quantities such as neuronal firing thresholds or synaptic weights.
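The Weber's-law constraint mentioned above can be made explicit: a metric G = diag(1/x_i²) on positive parameters weighs equal relative changes equally, and gradient flow in this metric coincides (to first order in the step size) with Euclidean gradient descent on the logarithms of the parameters. The cost below is an illustrative placeholder.

```python
import numpy as np

# Illustrative cost C(x) = sum_i (x_i - 1)^2 on positive parameters.
def partials(x):
    return 2 * (x - 1.0)

x = np.array([0.5, 2.0])
dt = 1e-3

# Riemannian gradient step in x with Weber-law metric G = diag(1/x^2):
dx_dt = -(x**2) * partials(x)        # G^{-1} = diag(x^2)
x_metric = x + dt * dx_dt

# Euclidean gradient step in y = log(x):
y = np.log(x)
dC_dy = x * partials(x)              # chain rule: dC/dy = x * dC/dx
x_log = np.exp(y - dt * dC_dy)

print(x_metric, x_log)               # agree to first order in dt
```

A constant (Euclidean) metric on the logarithmic scale is thus the same constraint as the relative-size equivalence stated by Weber's law.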
Even if it does not fully determine the metric, a principle which constrains the class of metrics is very useful when trying to fit the metric to the data (i.e. for a given cost function). Without any constraints, the specification of a Riemannian metric for an n-dimensional parameter space requires the specification of n(n+1)/2 smooth functions, i.e. the components of the matrix G in some coordinate system; these components can be constant or position-dependent.
If the parameter space describes a smooth family of probability distributions, the Fisher information matrix provides a canonical Riemannian metric on this manifold. The special status of the Fisher-Rao metric in statistics is due to the fact that it is the only metric (up to scaling factors) that has a natural behavior under sufficient statistics (see e.g. [24], Theorem 2.6, going back to Chentsov, 1972). The Riemannian gradient with respect to the Fisher-Rao metric is often called the natural gradient, and has been applied in machine learning [25-33] and neuroscience [22]. Another metric on probability distributions that has recently gained a lot of attention is the optimal transport or Wasserstein metric [34-36]. However, despite the nice mathematical properties of such metrics and their usefulness for machine learning applications, it is not clear why natural selection would favor them. Therefore, the special mathematical status of those metrics does not automatically carry over to biology or, more specifically, neuroscience.
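As a minimal sketch of the natural gradient, consider a single Gaussian N(µ, σ²). Its Fisher information matrix in (µ, σ) coordinates is the standard diag(1/σ², 2/σ²); the quadratic cost toward a target (µ*, σ*) below is an illustrative assumption.

```python
import numpy as np

# Natural-gradient step for a Gaussian N(mu, sigma^2).
mu, sigma = 0.0, 3.0
mu_t, sigma_t = 1.0, 1.0   # hypothetical target parameters
partials = np.array([2 * (mu - mu_t), 2 * (sigma - sigma_t)])

F = np.diag([1 / sigma**2, 2 / sigma**2])   # Fisher metric at (mu, sigma)

grad_euclid = partials                       # Euclidean gradient
grad_natural = np.linalg.solve(F, partials)  # F^{-1} times partials

print(grad_euclid)   # direction under the Euclidean metric
print(grad_natural)  # same cost, rescaled by the local Fisher metric
```

Because the Fisher metric shrinks as σ grows, the natural gradient takes much larger steps in regions where the distributions are hard to distinguish, which is one way the flow fields in Figure 4B,D differ from their Euclidean counterparts.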

Conclusions
Steepest descent or gradient descent depends on a choice of ruler (or Riemannian metric) for the parameter space of interest. The Euclidean metric is rarely a natural choice (see "wicked-map problem" in Section 2), especially (but not only) for spaces of parameters that carry different physical units (see "unit problem" in Section 2). In practice, the "unit problem" can be treated with a suitable normalization of the measured quantities [18,19]. The "wicked-map problem", however, remains, and it may be a matter of serendipity to select the parametrization in which naive gradient descent is consistent with experimental data. Also when steepest descent is invoked to postulate the dynamics of firing rates or synaptic weights [7-9, 11-17], one should not ignore the possibility of non-Euclidean metrics. The additional free hyperparameters associated with the choice of Riemannian metric can significantly alter the predictions of the model when those predictions concern the trajectories along which optimization occurs (as opposed to just the targets of the optimization). Unless there is an obvious way to fix those hyperparameters, be it through previously collected data or some principles, the model's predictive power is lowered, since many flow fields can be written as a gradient descent in some metric. Whether a gradient descent model with many degrees of freedom or a phenomenological model without reference to the computational principle of gradient descent is preferable in this case may be a matter of subjective preference.
On the positive side, the additional degrees of freedom that accompany the choice of metric imply that a larger class of dynamics can be seen as being optimal in the sense of following a flow field consistent with gradient descent under some metric that is yet to be determined. It will be interesting to uncover the metrics that are chosen by biology, and to uncover the biophysical principles that underlie these choices. For it is the metric together with the cost function that fully specifies a gradient descent dynamics.

Appendix A. Calculations for the artificial examples

Example 2 (Figure 3): The cost function C(µ, σ) and its partial derivatives ∂C/∂µ and ∂C/∂σ are given in the main text. The other partial derivatives can be computed either by using the chain rule, e.g. ∂C̃/∂s = (∂σ/∂s)(∂C/∂σ) = 1/(2√s) ∂C/∂σ,
or by expressing the function in terms of the new coordinates, e.g. C̃(µ, s) = C(µ, √s), and then calculating the derivative directly: ∂C̃/∂s = ∂/∂s C(µ, √s) = 1/(2√s) (∂C/∂σ)|_{σ=√s}. To define the corresponding vector fields, we follow the convention of denoting the tangent vector in the direction of a parameter θ by ∂_θ.
The vector fields V, Ṽ, and V̄ are defined by V = −(∂C/∂µ) ∂_µ − (∂C/∂σ) ∂_σ, Ṽ = −(∂C̃/∂µ) ∂_µ − (∂C̃/∂s) ∂_s, and V̄ = −(∂C̄/∂µ) ∂_µ − (∂C̄/∂τ) ∂_τ. Using ∂_σ = 2σ ∂_s and ∂C/∂σ = 2σ ∂C̃/∂s, V can be expressed in (µ, s) coordinates as V = −(∂C̃/∂µ) ∂_µ − 4s (∂C̃/∂s) ∂_s, and Ṽ in (µ, σ) coordinates as Ṽ = −(∂C/∂µ) ∂_µ − (1/(4σ²)) (∂C/∂σ) ∂_σ. This shows that V and Ṽ are different. We leave the corresponding calculation for the (µ, τ) parametrization as an exercise.

Appendix B. Steepest descent on manifolds
Here, we give a short introduction to calculus on manifolds and the differential geometry background of this paper. For more details, the reader is referred to the excellent books by Michael Spivak [37] and by Jeffrey M. Lee [38].
Let p be a point in the manifold and v be a vector from the tangent space at p. Suppose that we want to define the directional derivative of f at point p in the direction v. We can then draw a curve γ that runs through p and has a tangent vector equal to v at that point. For convenience, let γ(0) = p. We then define the differential df_p of f at p as

df_p(v) = d/dt f(γ(t)) |_{t=0},

i.e. as a map from the tangent space to the real numbers. It can be shown that this map is linear and well-defined (i.e. it does not depend on the particular choice of γ). In a parametrization p = Φ(x) = Φ(x^1, ..., x^n) it reads

df_p(v) = Σ_{i=1}^n ∂_i (f ∘ Φ)(x) v^i,

where ∂_i denotes the partial derivative with respect to x^i and v^i is the i-th component of v when expressed in the coordinate basis. Here, we introduced upper indices for tangent vectors and lower indices for so-called cotangent vectors.
As a linear map from the tangent space to the real numbers, df_p belongs to the cotangent space at p, the dual space of the tangent space. In local coordinates, the cotangent vector df_p is expressed as

df_p = Σ_{i=1}^n ∂_i (f ∘ Φ)(x) dx^i.

Recall from linear algebra that the dual space V* of a real finite-dimensional vector space V is isomorphic to V, but not in a canonical way, i.e. there is no preferred way to associate tangent vectors and cotangent vectors one-to-one. Given a basis of V, one can choose a dual basis of V* and this gives rise to an isomorphism, whose representation in this basis is the unit matrix. The same concept holds for the tangent and cotangent spaces of a smooth manifold. A choice of coordinates x^i, i = 1, ..., n, gives rise to a basis ∂_i, i = 1, ..., n, of the tangent space and a corresponding dual basis dx^i, i = 1, ..., n. Therefore the identification of tangent and cotangent vectors depends on the choice of coordinates. It is for this reason that tangent and cotangent vectors have to be regarded as different objects.
The different geometrical nature of tangent and cotangent vectors is the fundamental reason why a rule such as in Equation 1 or Equation 2 is problematic: on one side of the equation, we have a tangent vector (the velocity vector of the curve along which we want to move), while on the other side we have the differential of the cost function, a cotangent vector.They cannot be equal; they can at most have the same components in some coordinates, but this property is lost when changing to a different set of coordinates.Such a rule therefore does not make sense without invoking a preferential choice of coordinates.

Generalization of inner products: Riemannian metrics
In order to obtain a way to transform cotangent vectors into tangent vectors or vice versa, and thereby identify them with each other, one needs to define additional structure on the manifold. This structure comes in the shape of what is called a Riemannian metric, which is a map from bivectors (i.e. pairs of tangent vectors) to the real numbers. More specifically, at each point p it specifies a quadratic form or an inner product g_p on the tangent space at that point. In order to qualify for the term Riemannian, this quadratic form should in addition be positive definite. Lastly, the metric is usually expected to vary smoothly as a function of the position in the manifold, which means that when it is evaluated on smooth vector fields, the resulting real-valued function is smooth. Given a Riemannian metric g and a point p, a tangent vector v is assigned a cotangent vector v♭ in the following way:

v♭(w) = g_p(v, w) for all tangent vectors w,

or in local coordinates

(v♭)_j = Σ_i g_ij(p) v^i.

Since g_p is a bilinear form, we see that both v♭ itself (as a map from the tangent space to the reals) and the assignment of v♭ to v are linear maps, and we can also see that the assignment is injective: if it were otherwise, we would have g_p(v_1 − v_2, w) = 0 for some non-zero v_1 − v_2 and all tangent vectors w, in particular for w = v_1 − v_2, which contradicts the positive definiteness of g_p. Since the tangent space and the cotangent space have the same dimension, the assignment is also surjective, and we can therefore define an inverse that assigns a tangent vector ω♯ to any cotangent vector ω. An inverse metric g_p^(−1) may then be defined as

g_p^(−1)(ω, ν) = g_p(ω♯, ν♯).

In local coordinates, we may write

Σ_j g^ij(p) g_jk(p) = δ^i_k,

where δ^i_k are the components of the unit matrix (i.e. δ^i_k = 1 if i = k and zero otherwise). Using this inverse metric, ω♯ may be written as

ω♯ = Σ_{i,j} g^ij(p) ω_j ∂_i.

Indeed, as a linear map from the cotangent space to the reals, the right-hand side may be canonically identified with a tangent vector. The isomorphisms ♭ : v ↦ v♭ and ♯ : ω ↦ ω♯ are known as musical isomorphisms, and in terms of local coordinates, they are used to lower and raise indices.
In analogy to the case in R^N, we can now define the gradient on smooth manifolds by

g_p((∇f)(p), v) = df_p(v) for all tangent vectors v at point p,

and, using the definition of the inverse metric, we find (∇f)(p) = (df_p)♯, i.e. in local coordinates (∇f)^i = Σ_j g^ij ∂_j f.

The gradient on a Riemannian manifold

By being given the structure of the Riemannian metric, we obtain a notion of lengths of and angles between tangent vectors, as with any other inner product space. Thus, given a point p, we can ask in which direction the steepest ascent of the function f is. The answer is given by

v* = argmax_v df_p(v),

where the maximum is taken over all unit-length tangent vectors, and the directional derivative is properly expressed via the action of the differential of f on the tangent vector. This constrained optimization has the cost function

L(v, λ) = df_p(v) − λ (g_p(v, v) − 1),

where λ is a Lagrange multiplier. In order to solve this optimization problem, we have to compute the differential of L with respect to v and set it to zero. Because of the linearity of df_p and the symmetry and bilinearity of g_p, we have

dL_v(v′) = df_p(v′) − 2λ g_p(v, v′).

The critical tangent vector v is therefore characterized by the vanishing of the term that is linear in v′, i.e. by df_p(v′) = 2λ g_p(v, v′), to be satisfied by all tangent vectors v′. As we developed above, this equation has a unique solution, given by v = (1/(2λ)) (df_p)♯; normalizing to unit length yields v* = (∇f)(p)/||(∇f)(p)||.

Appendix C. Representing a dynamical system as a gradient flow

In some cases we may start with a given dynamical system in the form of a vector field V on some manifold M. The question arises whether we can find a function f and a metric g such that the dynamical system takes the form of a (negative) gradient flow, i.e. V = −∇_g f. For this question to make sense, we fix an asymptotically stable set S, with domain of attraction A.
If g is given, e.g. from the considerations of the previous section, but we do not know f, we can compute the one-form V♭ that is dual to V with respect to g, and check whether it is closed. If V♭ is indeed closed and the domain of attraction A is contractible (this is always true if S consists of a single point), this implies the existence of a function f, unique up to an additive constant, such that V♭ = −df, and hence V = −∇_g f, on A. A suitable potential function f may be found by picking a reference point p_0 ∈ S and integrating V♭ along any curve that joins p_0 and p. Note that if we change to a different metric g′, the corresponding V♭ might no longer be closed and hence such a potential may cease to exist.
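This recipe can be sketched numerically for the Euclidean metric, where the dual one-form has the same components as the vector field. The linear field V below is an illustrative choice with an attractor at the origin; we check closedness by comparing mixed partial derivatives (finite differences) and recover the potential by integrating along the straight line from the attractor.

```python
import numpy as np

# Illustrative vector field with attractor at 0; dual one-form (Euclidean
# metric) has the same components as V.
def V(p):
    x, y = p
    return np.array([-2 * x, -4 * y])

# Closedness check: d(V_flat) = 0 iff the mixed partials agree.
h = 1e-5
p = np.array([0.7, -0.3])
dV1_dy = (V(p + np.array([0, h]))[0] - V(p - np.array([0, h]))[0]) / (2 * h)
dV2_dx = (V(p + np.array([h, 0]))[1] - V(p - np.array([h, 0]))[1]) / (2 * h)
closed = abs(dV1_dy - dV2_dx) < 1e-8

# Potential by line integration from p0 = 0: f(p) = -int_0^1 V(t p) . p dt
ts = np.linspace(0.0, 1.0, 10_001)
vals = np.array([V(t * p) @ p for t in ts])
f_p = -((np.diff(ts) * (vals[1:] + vals[:-1]) / 2).sum())  # trapezoid rule

print(closed)  # the dual one-form is closed for this V
print(f_p)     # recovered potential, here f(x, y) = x^2 + 2 y^2 at p
```

For this field the recovered potential satisfies V = −∇f with f(x, y) = x² + 2y², as can be verified by differentiation.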
If neither f nor g are given, necessary and sufficient conditions for their existence on a compact manifold were given by [39,40] in terms of transversality conditions on the vector field: it needs to be transversal to the zero section at each fixed point, transversal to the boundary, and the stable and unstable manifolds of each fixed point have to meet transversally.On a non-compact manifold it is not known whether we can find a global metric, but we can always use the construction above on a compact subset.Alternatively one may find a smooth Lyapunov function (this is always possible; see Theorem 3.2 in [41]) and use the method in the next paragraph to construct a suitable metric.
Suppose that f is given, and g is sought. This case is discussed in [42]. A necessary condition for the existence of g is that f is a smooth local Lyapunov function for V, i.e. f > 0 and V f = df(V) < 0 on A\S, and f = 0 on S. But this may not be sufficient: a simple counterexample is the dynamical system dx/dt = V(x) = −x with f(x) = x⁴ on R. This dynamical system has a global attractor at x = 0, and f is a global Lyapunov function, since we have f(0) = 0 as well as f(x) > 0 and df(V) = 4x³ · (−x) = −4x⁴ < 0 for all x ≠ 0. But if we want to write dx/dt = −df(x)/g(x), we obtain g(x) = −df(x)/V(x) = 4x², which is not a Riemannian metric on R (not positive definite at x = 0).
If we are happy to exclude the set S, we can always find a Riemannian metric defined on A\S such that V is the negative gradient of a given smooth local Lyapunov function f: in the one-dimensional example above, we just have to divide the negative differential of f by V. In higher dimensions, we may consider the level sets f^(−1)(q) for q ∈ (0, a) = f(A\S), which are submanifolds of dimension n − 1. We may then choose a Riemannian metric on each level set such that it depends smoothly on q, and extend this to a Riemannian metric on M by declaring V to be orthogonal to the level sets and to have a squared Riemannian length equal to |df(V)|. The conditions for being able to extend this to a metric on A are discussed in [42].

Figure 1: The main message of this text. A: A cost function and a metric together determine a unique flow field and update rule, given by gradient descent on the cost function in that metric. B: For a given cost function there are infinitely many different flow lines and update rules (one for each choice of the metric) that lead to the minima of the cost function by gradient descent.

Figure 2: The "wicked-map problem". A: An ambitious mountaineer may follow the gradient in the Euclidean metric to reach the mountain top (red route from square to triangle). Since the map is plotted in Cartesian coordinates, the route stands perpendicular to the contour lines. B: If the ambitious mountaineer does not realise that a map given by a wicked hotelier is sheared, the blue route would be chosen, as it is now the one that stands perpendicular to the contour lines in the sheared map. The blue route corresponds to gradient ascent in another metric. Of course, each route on the normal map could be transformed to the sheared map and vice versa, but what looks like naive (Euclidean) gradient ascent in one map may look different in another map.
dµ/dt = −∂C̃/∂µ and ds/dt = −∂C̃/∂s, and similar expressions for C̄. The corresponding flow fields in Figure 3B and C differ from the one obtained with the initial parametrization (Figure 3A) and from each other. This can also be seen by applying the chain rule to the two sides of ds/dt = −∂C̃/∂s and comparing the result to dσ/dt = −∂C/∂σ, the dynamics in the original parametrization. On the left-hand side we get ds/dt = (∂s/∂σ)(dσ/dt), i.e. a prefactor ∂s/∂σ. On the right-hand side we get −∂C̃/∂s = −(∂σ/∂s)(∂C/∂σ), i.e. a prefactor ∂σ/∂s. If the dynamics in the new parametrization were the same as the one in the initial parametrization, the two prefactors would have to be the same.

Figure 4: Gradient descent flow fields in neuroscience. A: Flow of intrinsic plasticity parameters a and b with Euclidean metric (see Figure 1A in [18]) and B: with Fisher information metric. C: Flow of quantal amplitude q and release probability P_rel in a binomial release model of a synapse with Euclidean metric (see Figure 1D in [19]) and D: with Fisher information metric. Under other choices of the metric (see Section 4) the flow fields would again look different.

Let us express the fields V and Ṽ in the other parametrizations that are displayed in Figure 3A,B. The basis vectors ∂_σ and ∂_s are related by ∂_σ = (∂s/∂σ) ∂_s = 2σ ∂_s.