Hierarchical Models in the Brain

This paper describes a general model that subsumes many parametric models for continuous data. The model comprises hidden layers of state-space or dynamic causal models, arranged so that the output of one provides input to another. The ensuing hierarchy furnishes a model for many types of data, of arbitrary complexity. Special cases range from the general linear model for static data to generalised convolution models, with system noise, for nonlinear time-series analysis. Crucially, all of these models can be inverted using exactly the same scheme, namely, dynamic expectation maximization. This means that a single model and optimisation scheme can be used to invert a wide range of models. We present the model and a brief review of its inversion to disclose the relationships among, apparently, diverse generative models of empirical data. We then show that this inversion can be formulated as a simple neural network and may provide a useful metaphor for inference and learning in the brain.


Introduction
This paper describes hierarchical dynamic models (HDMs) and reviews a generic variational scheme for their inversion. We then show that the brain has evolved the necessary anatomical and physiological equipment to implement this inversion, given sensory data. These models are general in the sense that they subsume simpler variants, such as those used in independent component analysis, through to generalised nonlinear convolution models. The generality of HDMs renders the inversion scheme a useful framework that covers procedures ranging from variance component estimation, in classical linear observation models, to blind deconvolution, using exactly the same formalism and operational equations. Critically, the nature of the inversion lends itself to a relatively simple neural network implementation that shares many formal similarities with real cortical hierarchies in the brain.
Recently, we introduced a variational scheme for model inversion (i.e., inference on models and their parameters given data) that considers hidden states in generalised coordinates of motion. This enabled us to derive estimation procedures that go beyond conventional approaches to time-series analysis, like Kalman or particle filtering. We have described two versions; variational filtering [1] and dynamic expectation maximisation (DEM; [2]) that use free and fixed-form approximations to the posterior or conditional density respectively. In these papers, we used hierarchical dynamic models to illustrate how the schemes worked in practice. In this paper, we focus on the model per se and the relationships among its special cases. We will use DEM to show how their inversion relates to conventional treatments of these special cases.
A key aspect of DEM is that it was developed with neuronal implementation in mind. This constraint can be viewed as formulating a neuronally inspired estimation and inference framework or conversely, as providing heuristics that may inform our understanding of neuronal processing. The basic ideas have already been described, in the context of static models, in a series of papers [3][4][5] that entertain the notion that the brain may use empirical Bayes for inference about its sensory input, given the hierarchical organisation of cortical systems. In this paper, we generalise this idea to cover hierarchical dynamical systems and consider how neural networks could be configured to invert HDMs and deconvolve sensory causes from sensory input. This paper comprises five sections. In the first, we introduce hierarchical dynamic models. These cover many observation or generative models encountered in the estimation and inference literature. An important aspect of these models is their formulation in generalised coordinates of motion; this lends them a hierarchal form in both structure and dynamics. These hierarchies induce empirical priors that provide structural and dynamic constraints, which can be exploited during inversion. In the second and third sections, we consider model inversion in general terms and then specifically, using dynamic expectation maximisation (DEM). This reprises the material in Friston et al. [2] with a special focus on HDMs. DEM is effectively a variational or ensemble learning scheme that optimises the conditional density on model states (Dstep), parameters (E-step) and hyperparameters (M-step). It can also be regarded as a generalisation of expectation maximisation (EM), which entails the introduction of a deconvolution or D-step to estimate time-dependent states. In the fourth section, we review a series of HDMs that correspond to established models used for estimation, system identification and learning. Their inversion is illustrated with worked-examples using DEM. In the final section, we revisit the DEM steps and show how they can be formulated as a simple gradient ascent using neural networks and consider how evoked brain responses might be understood in terms of inference under hierarchical dynamic models of sensory input.

Notation
To simplify notation we will use f x := f x (x) = h x f = hf/hx to denote the partial derivative of the function, f, with respect to the variable x. We also use ẋ = h t x for temporal derivatives. Furthermore, we will be dealing with variables in generalised coordinates of motion, which will be denoted by a tilde; x := [x,x9,x0,…] T = [x [0] ,x [1] ,x [2] ,…] T , where x [i] denotes ith order motion. A point in generalised coordinates can be regarded as encoding the instantaneous trajectory of a variable, in the sense it prescribes its location, velocity, acceleration etc.

Hierarchical Dynamic Models
In this section, we cover hierarchal models for dynamic systems. We start with the basic model and how generalised motion furnishes empirical priors on the dynamics of the model's hidden states. We then consider hierarchical forms and see how these induce empirical priors in a structural sense. We will try to relate these perspectives to established treatments of empirical priors in static and state-space models.
Hierarchical dynamic causal models. Dynamic causal models are probabilistic generative models p(y,W) based on statespace models. As such, they entail the likelihood, p(y|W) of getting some data, y, given some parameters W = {x,v,h,l} and priors on those parameters, p(W). We will see that the parameters subsume different quantities, some of which change with time and some which do not. These models are causal in a control-theory sense because they are state-space models, formulated in continuous time.
State-pace models in generalised coordinates. A dynamic input-state-output model can be written as The continuous nonlinear functions f and g of the states are parameterised by h. The states v(t) can be deterministic, stochastic, or both. They are variously referred to as inputs, sources or causes.
The states x(t) meditate the influence of the input on the output and endow the system with memory. They are often referred to as hidden states because they are seldom observed directly. We assume the stochastic terms (i.e., observation noise) z(t) are analytic, such that the covariance of z = [z,z9,z0,…] T is well defined; similarly for the system or state noise, w(t), which represents random fluctuations on the motion of the hidden states. Under local linearity assumptions (i.e., ignoring high-order derivatives of the generative model functions), the generalised output or response ỹ = [y,y9,y0,…] T obtains from recursive differentiation with respect to time using the chain rule y~g x,v ð Þzz Note that the derivatives are evaluated at each point in time and the linear approximation is local to the current state. The first (observer) equation show that the generalised states u = [ṽ,x,] T are needed to generate a generalised response that encodes a path or trajectory. The second (state) equations enforce a coupling between neighbouring orders of motion of the hidden states and confer memory on the system. At this point, readers familiar with standard state-space models may be wondering where all the extra equations in Equation 2 come from and, in particular, what the generalised motions; w9, w0, … represent. These terms always exist but are ignored in standard treatments based on the theory of Markovian processes [6]. This is because standard Markovian (c.f., Wiener) processes have generalised motion that has infinite variance and are infinitely 'jagged' or rough. This means w9, w0, … and x0, x-, … have no precision (inverse variance) and can be ignored with impunity. It is important to realise that this approximation is not appropriate for real or actual fluctuations, as noted at the inception of the standard theory; ''a certain care must be taken in replacing an actual process by Markov process, since Markov processes have many special features, and, in particular, differ from the processes encountered in radio engineering by their lack of smoothness… any random process actually encountered in radio engineering is analytic, and all its derivative are finite with probability one'' ( [6], pp 122-124). So why have standard statespace models, and their attending inversion schemes like Kalman filtering, dominated the literature over the past half-century? Partly because it is convenient to ignore generalised motion and partly because they furnish reasonable approximations to fluctuations over time-scales that exceed the correlation time of the random processes: ''Thus the results obtained by applying the techniques of Markov process theory are valuable only to the extent to which they characterise just these 'large-scale' fluctuations'' ( [6], p 123). However, standard models fail at short timescales. This is especially relevant in this paper because the brain has to model continuous sensory signals on a fast time-scale.
Having said this, it is possible to convert the generalised statespace model in Equation 2 into a standard form by expressing the components of generalised motion in terms of a standard [uncorrelated] Markovian process, z(t): Author Summary Models are essential to make sense of scientific data, but they may also play a central role in how we assimilate sensory information. In this paper, we introduce a general model that generates or predicts diverse sorts of data. As such, it subsumes many common models used in data analysis and statistical testing. We show that this model can be fitted to data using a single and generic procedure, which means we can place a large array of data analysis procedures within the same unifying framework. Critically, we then show that the brain has, in principle, the machinery to implement this scheme. This suggests that the brain has the capacity to analyse sensory input using the most sophisticated algorithms currently employed by scientists and possibly models that are even more elaborate. The implications of this work are that we can understand the structure and function of the brain as an inference machine. Furthermore, we can ascribe various aspects of brain anatomy and physiology to specific computational quantities, which may help understand both normal brain function and how aberrant inferences result from pathological processes associated with psychiatric disorders.
The first line encodes the autocorrelation function or spectral density of the fluctuations w(t) in term of smoothness parameters, c 1 ,…c n , where n is the order of generalised motion. These parameters can be regarded as the coefficients of a polynomial expansion P n (h t )w = z(t) (see [6], Equation 4.288 and below). The second line obtains by substituting Equation 2 into the first and prescribes a standard state-space model, whose states cover generalised motion; x [0] ,…,x [n] . When n = 0 we recover the state equation in Equation 1, namely, ẋ = f(x,v)+z. This corresponds to the standard Markovian approximation because the random fluctuations are uncorrelated and w = z; from Equation 3. When n = 1)w+c 1 w9 = z, the fluctuations w(t) correspond to an exponentially correlated process, with a decay time of c 1 ([6], p 121). However, generally n = ': ''Therefore we cannot describe an actual process within the framework of Markov process theory, and the more accurately we wish to approximate such a process by a Markov process, the more components the latter must have.'' ( [6], p 165). See also [7] (pp 122-125) for a related treatment.
If there is a formal equivalence between standard and generalised state-space models, why not use the standard formulation, with a suitably high-order approximation? The answer is that we do not need to; by retaining an explicit formulation in generalised coordinates we can devise a simple inversion scheme (Equation 23) that outperforms standard Markovian techniques like Kalman filtering. This simplicity is important because we want to understand how the brain inverts dynamic models. This requires a relatively simple neuronal implementation that could have emerged through natural selection. From now on, we will reserve 'state-space models' (SSM) for standard n = 0 models that discount generalised motion and, implicitly, serial correlations among the random terms. This means we can treat SSMs as special cases of generalised state-space models, in which the precision of generalised motion on the states noise is zero.
Probabilistic dynamic models. Given the form of generalised state-space models we now consider what they entail as probabilistic models of observed signals. We can write Equation 2 compactly asỹ Where the predicted response g = [g,g9,g0,…] T and motion f˜= [f,f 9,f 0,…] T in the absence of random fluctuations are and D is a block-matrix derivative operator, whose first leading-diagonal contains identity matrices. This operator simply shifts the vectors of generalised motion so x [i] that is replaced by Gaussian assumptions about the fluctuations p~z ~Nz z : 0,S S z À Á provide the likelihood, p(ỹ|x,ṽ). Similarly, Gaussian assumptions about state-noise pw w ð Þ~Nw w : 0,S S w À Á furnish empirical priors, p(x|ṽ) in terms of predicted motion We will assume Gaussian priors pṽ v ð Þ~Nṽ v :g g,S S v À Á on the generalised causes, with meang g :~g g v and covarianceS S v . The density on the hidden states p(x|ṽ) is part of the prior on quantities needed to evaluate the likelihood of the response or output. This prior means that low-order motion constrains high-order motion (and vice versa). These constraints are discounted in standard statespace models because the precision on the generalised motion of a standard Markovian process is zero. This means the only constraint is mediated by the prior p(ẋ|x,v). However, it is clear from Equation 5 that high-order terms contribute. In this work, we exploit these constraints by adopting more plausible models of noise, which are encoded by theircovariancesS S l ð Þ z andS S l ð Þ w (orprecisionsP P l ð Þ z and P P l ð Þ w ). These are functions of unknown hyperparameters, l which control the amplitude and smoothness of the random fluctuations. Figure 1 (left) shows the directed graph depicting the conditional dependencies implied by this model. Next, we consider hierarchal models that provide another form of hierarchical constraint. It is useful to note that hierarchical models are special cases of Equation 1, in the sense that they are formed by introducing conditional independencies (i.e., removing edges in Bayesian dependency graphs).
Hierarchical forms. HDMs have the following form, which generalises the (m = 1) model above Again, f (i) := f(x (i) ,v (i) ) and g (i) := g(x (i) ,v (i) ) are continuous nonlinear functions of the states. The processes z (i) and w (i) are conditionally independent fluctuations that enter each level of the hierarchy. These play the role of observation error or noise at the first level and induce random fluctuations in the states at higher levels. The causes v = [v (1) ,…,v (m) ] T link levels, whereas the hidden states x = [x (1) ,…,x (m) ] T link dynamics over time. The corresponding directed graphical model is shown in Figure 1 (right). In hierarchical form, the output of one level acts as an input to the next. When the state-equations are linear, the hierarchy performs successive convolutions of the highest level input, with random fluctuations entering at each level. However, inputs from higher levels can also enter nonlinearly into the state equations and can be regarded as changing its control parameters to produce quite complicated generalised convolutions with 'deep' (i.e., hierarchical) structure.
The conditional independence of the fluctuations at different hierarchical levels means that the HDM has a Markov property over levels, which simplifies attending inference schemes. See [8] for a discussion of approximate Bayesian inference in conditionally independent hierarchical models of static data. Consider the empirical prior implied by Equation 6 where the full prior pṽ v m ð Þ À Á~Nṽ v m ð Þ :g g,S S v À Á is now restricted to the last level. Equation 7 is similar in form to the prior in Equation 5 but now factorises over levels; where higher causes place empirical priors on the dynamics of the level below. The factorisation in Equation 7 is important because one can appeal to empirical Bayes to interpret the conditional dependences. In empirical Bayes [9], factorisations of the likelihood create empirical priors that share properties of both the likelihood and priors. For example, the prediction g (i) = g(x (i) ,ṽ (i) ) plays the role of a prior expectation on ṽ (i21) , yet it has to be estimated in terms of x (i) ,ṽ (i) . In short, a hierarchical form endows models with the ability to construct their own priors. These formal or structural priors are central to many inference and estimation procedures, ranging from mixed-effects analyses in classical covariance component analysis to automatic relevance determination in machine learning. The hierarchical form and generalised motion in HDMs furnishes them with both structural and dynamic empirical priors respectively.
The precisions and temporal smoothness. In generalised coordinates, the precision,P P z~S c ð Þ6P l ð Þ z is the Kronecker tensor product of a temporal precision matrix, S(c) and the precision over random fluctuations, which has a block diagonal form in hierarchical models; similarly forP P w . The temporal precision encodes temporal dependencies among the random fluctuations and can be expressed as a function of their autocorrelations Here € r r 0 ð Þ is the second derivative of the autocorrelation function evaluated at zero. This is a ubiquitous measure of roughness in the theory of stochastic processes [10]. Note that when the random fluctuations are uncorrelated, the curvature (and higher derivatives) of the autocorrelation are infinite. In this instance, the precision of high-order motion falls to zero. This is the limiting case assumed by state-space models; it corresponds to the assumption that incremental fluctuations are independent (c.f., a Wiener process or random walk). Although, this is a convenient assumption that is exploited in conventional Bayesian filtering schemes and appropriate for physical systems with Brownian processes, it is less plausible for biological and other systems, where random fluctuations are themselves generated by dynamical systems ( [6], p 81). S(c) can be evaluated for any analytic autocorrelation function. For convenience, we assume that the temporal correlations have the same Gaussian form. This gives Here, c is the precision parameter of a Gaussian autocorrelation function. Typically, c.1, which ensures the precisions of highorder motion converge quickly. This is important because it enables us to truncate the representation of an infinite number of generalised coordinates to a relatively small number; because highorder prediction errors have a vanishingly small precision. An order of n = 6 is sufficient in most cases [1]. A typical example is shown in Figure 2, in generalised coordinates and after projection onto the time-bins (using a Taylor expansion, whose coefficients comprise the matrix Ẽ ). It can be seen that the precision falls quickly with order and, in this case, we can consider just six orders of motion, with no loss of precision. When dealing with discrete time-series it is necessary to map the trajectory implicit in the generalised motion of the response onto discrete samples, [y(t 1 ),…,y(t N )] T = Ẽ ỹ(t) (note that this is not necessary with continuous data such as sensory data sampled by the brain). After this projection, the precision falls quickly over time-bins (Figure 2, right). This means samples in the remote past or future do not contribute to the likelihood and the inversion of discrete time-series data can proceed using local samples around the current time bin; i.e., it can operate 'on-line'.
Energy functions. We can now write down the exact form of the generative model. For dynamic models, under Gaussian assumptions about the random terms, we have a simple quadratic form (ignoring constants) The auxiliary variablesẽ e t ð Þ comprise prediction errors for the generalised response and motion of hidden states, where g(t) and f˜(t) are the respective predictions, whose precision is encoded byP P l ð Þ. The use of prediction errors simplifies exposition and may be used in neurobiological implementations (i.e., encoded explicitly in the brain; see last section and [4]). For hierarchical models, the prediction error on the response is supplemented with prediction errors on the causes   Note that the data and priors enter the prediction error at the lowest and highest level respectively. At intermediate levels the prediction errors, v (i21) 2g (i) mediate empirical priors on the causes. In the next section, we will use a variational inversion of the HDM, which entails message passing between hierarchical levels. These messages are the prediction errors and their influence rests on the derivatives of the prediction error with respect to the unknown states This form highlights the role of causes in linking successive hierarchical levels (the D T matrix) and the role of hidden states in linking successive temporal derivatives (the D matrix). The D T in the upper-left block reflects the fact that that the prediction error on the causes depends on causes at that level and the lower level being predicted; e (i)v = v (i21) 2g(x (i) ,v (i) ). The D in the lowerright block plays a homologous role, in that the prediction error on the motion of hidden states depends on motion at that order and the higher order; These constraints on the structural and dynamic form of the system are specified by the functions g = [g (1) ,…,g (m) ] T and f = [f (1) ,…,f (m) ] T , respectively. The partial derivatives of these functions have a block diagonal form, reflecting the model's hierarchical separability Note that the partial derivatives of g(x,v) have an extra row to accommodate the top level. To complete model specification we need priors on the parameters and hyperparameters. We will assume these are Gaussian, where (ignoring constants) Summary. In this section, we have introduced hierarchical dynamic models in generalised coordinates of motion. These models are about as complicated as one could imagine; they comprise causes and hidden states, whose dynamics can be coupled with arbitrary (analytic) nonlinear functions. Furthermore, these states can have random fluctuations with unknown amplitude and arbitrary (analytic) autocorrelation functions. A key aspect of the model is its hierarchical form, which induces empirical priors on the causes. These recapitulate the constraints on hidden states, furnished by the hierarchy implicit in generalised motion. We now consider how these models are inverted.

Model Inversion
This section considers variational inversion of models under mean-field and Laplace approximations, with a special focus on HDMs. This treatment provides a heuristic summary of the material in [2]. Variational Bayes is a generic approach to model inversion that approximates the conditional density p(W|y,m) on some model parameters, W, given a model m and data y. This is achieved by optimising the sufficient statistics (e.g., mean and variance) of an approximate conditional density q(W)with respect to a lower bound on the evidence (marginal or integrated likelihood) p(y|m) of the model itself. These two quantities are used for inference on the parameters of any given model and on the model per se. [11][12][13][14][15]. The log-evidence for any parametric model can be expressed in terms of a free-energy F(ỹ,q) and a divergence term, for any density q(W) on the unknown quantities The free-energy comprises the internal energy, U(y,W) = ln p(y,W) expected under q(W) and an entropy term, which is a measure of its uncertainty. In this paper, energies are the negative of the corresponding quantities in physics; this ensures the free-energy increases with log-evidence. Equation 15 indicates that F(ỹ,q) is a lower-bound on the log-evidence because the cross-entropy or divergence term is always positive. The objective is to optimise q(W) by maximising the free-energy and then use F<ln p(ỹ|m) as a lower-bound approximation to the log-evidence for model comparison or averaging. Maximising the free-energy minimises the divergence, rendering q(W)<p(W|y,m) an approximate posterior, which is exact for simple (e.g., linear) systems. This can then be used for inference on the parameters of the model selected.
Invoking an arbitrary density, q(W) converts a difficult integration problem (inherent in computing the evidence; see discussion) into an easier optimisation problem. This rests on inducing a bound that can be optimised with respect to q(W). To finesse optimisation, one usually assumes q(W) factorises over a partition of the parameters In statistical physics this is called a mean-field approximation. This factorisation means that one assumes the dependencies between different sorts of parameters can be ignored. It is a ubiquitous assumption in statistics and machine learning. Perhaps the most common example is a partition into parameters coupling causes to responses and hyperparameters controlling the amplitude or variance of random effects. This partition greatly simplifies the calculation of things like t-tests and implies that, having seen some data, knowing their variance does not tell you anything more about their mean. Under our hierarchical dynamic model we will appeal to separation of temporal scales and assume, x,] T are generalised states. This means that, in addition to the partition into parameters and hyperparameters, we assume conditional independence between quantities that change (states) and quantities that do not (parameters and hyperparameters). In this dynamic setting q(u(t)) and the free-energy become functionals of time. By analogy with Lagrangian mechanics, this calls on the notion of action. Action is the anti-derivative or pathintegral of energy. We will denote the action associated with the free energy by F , such that h t F = F. We now seek q(W i ) that maximise the action. It is fairly easy to show [2] that the solution for the states is a function of their instantaneous energy, where V(t) = h t V u is their variational energy. The variational energy of the states is simply their instantaneous energy averaged over their Markov blanket (i.e., averaged over the conditional density of the parameters and hyperparameters). Because the states are time-varying quantities, their conditional density is a function of time-dependent energy. In contrast, the conditional density of the parameters and hyperparameters are functions of their variational action, which are fixed for a given period of observation.
Where U h = ln p(h) and U l = ln p(l) are the prior energies of the parameters and hyperparameters respectively and play the role of integration constants in the corresponding variational actions; V h and V l . These equations provide closed-form expressions for the conditional or variational density in terms of the internal energy defined by our model; Equation 10. They are intuitively sensible, because the conditional density of the states should reflect the instantaneous energy; Equation 17. Whereas the conditional density of the parameters can only be determined after all the data have been observed; Equation 18. In other words, the variational energy involves the prior energy and an integral of time-dependent energy. In the absence of data, when the integrals are zero, the conditional density reduces to the prior density.
If the analytic forms of Equations 17 and 18 were tractable (e.g., through the use of conjugate priors), q(W i ) could be optimised directly by iterating these self-consistent nonlinear equations. This is known as variational Bayes; see [16] for an excellent treatment of static conjugate-exponential models. However, we will take a simpler approach that does not require bespoke update equations. This is based on a fixed-form approximation to the variational density.
The Laplace approximation. Under the Laplace approximation, the marginals of the conditional density assume a Gaussian form q(W i ) = N(W i : m i ,C i ) with sufficient statistics m i and C i , corresponding to the conditional mean and covariance of the ith marginal. For consistency, we will use m i for the conditional means or modes and g i for prior means. Similarly, we will use S i and C i for the prior and conditional covariances and P i and P i for the corresponding inverses (i.e., precisions).
The advantage of the Laplace assumption is that the conditional covariance is a simple function of the modes. Under the Laplace assumption, the internal and variational actions are (ignoring constants) C u := C(t) u is the conditional covariance of the states at time tM[0,N]. The quantities W(t) i represent the contribution to the variational action from other marginals and mediate the effect of the uncertainty they encode on each other. We will refer to these as mean-field terms.
Conditional precisions. By differentiating Equation 19 with respect to the covariances and solving for zero, it is easy to show that the conditional precisions are the negative curvatures of the internal action [2]. Unless stated otherwise, all gradients and curvatures are evaluated at the mode or mean.
Notice that the precisions of the parameters and hyperparameters increase with observation time, as one would expect. For our HDM the gradients and curvatures of the internal energy are where the covariance,S S is the inverse ofP P. The ith element of the energy gradient; U t ð Þ li~L li U t ð Þ is the derivative with respect to the ith hyperparameter (similarly for the curvatures). We have assumed that the precision of the random fluctuations is linear in the hyperparameters, where Q i~LliP P, and L llP P~0. The derivatives of the generalised prediction error with respect to the generalised states are provided in Equation 12. The corresponding derivatives with respect to each parameter, e e h t ð Þ~ẽ e uh m u t ð Þ rest on second derivatives of the model's functions that mediate interactions between each parameter and the statesẽ These also quantify how the states and parameters affect each other through mean-field effects (see below). Summary. The Laplace approximation gives a compact and simple form for the conditional precisions; and reduces the problem of inversion to finding the conditional modes. This generally proceeds in a series of iterated steps, in which the mode of each parameter set is updated. These updates optimise the variational actions in Equation 19 with respect to m i , using the sufficient statistics (conditional mean and covariance) of the other sets. We have discussed static cases of this fixed-form scheme previously and how it reduces to expectation maximisation (EM; [17]) and restricted maximum likelihood (ReML; [18]) for linear models [15]. We now consider each of the steps entailed by our mean-field partition.

Dynamic Expectation Maximisation
As with conventional variational schemes, we can update the modes of our three parameter sets in three distinct steps. However, the step dealing with the state (D-step) must integrate its conditional modem m :~m u t ð Þ over time to accumulate the quantities necessary for updating the parameters (E-step) and hyperparameters (M-step). We now consider optimising the modes or conditional means in each of these steps.
The D-step. In static systems, the mode of the conditional density maximises variational energy, such that h u V(t) = 0; this is the solution to a gradient ascent scheme; _ m m m m~V t ð Þ u . In dynamic systems, we also require the path of the mode to be the mode of the path; _ m m m m~Dm m. These two conditions are satisfied by the solution to the ansatz Here _ m m m m{Dm m can be regarded as motion in a frame of reference that moves along the trajectory encoded in generalised coordinates. Critically, the stationary solution in this moving frame of reference maximises variational action. This can be seen easily by noting _ m m m m{Dm m~0 means the gradient of the variational energy is zero and This is sufficient for the mode to maximise variational action. In other words, changes in variational action, V u , with respect to variations of the path of the mode are zero (c.f., Hamilton's principle of stationary action). Intuitively, this means tiny perturbations to its path do not change the variational energy and it has the greatest variational action (i.e., path-integral of variational energy) of all possible paths. Another way of looking at this is to consider the problem of finding the path of the conditional mode. However, the mode is in generalised coordinates and already encodes its path. This means we have to optimise the path of the mode subject to the constraint that _ m m m m~Dm m, which ensures the path of the mode and the mode of the path are the same. The solution to Equation 23 ensures that variational energy is maximised and the path is self-consistent. Note that this is a very different (and simpler) construction in relation to incremental schemes such as Bayesian filtering.
Equation 23 prescribes the trajectory of the conditional mode, which can be realised with a local linearization [19] by integrating over Dt to recover its evolution over discrete intervals For simplicity, we have suppressed the dependency of V(u,t) on the data. However, it is necessary to augment Equation 25 with any time-varying quantities that affect the variational energy. The form of the ensuing Jacobian =(t) is These forms reflect the fact that data and priors only affect the prediction error at the first and last levels respectively. The only remaining quantities we require are the gradients and curvatures of the variational energy, which are simply The mean-field term, W(t) l does not contribute to the D-step because it is not a function of the states. This means uncertainly about the hyperparameters does not affect the update for the states. This is because we assumed the precision was linear in the hyperparameters. The updates in Equation 25 provide the conditional trajectorym m t ð Þ at each time point. Usually, Dt is the time between observations but could be smaller, if nonlinearities in the model render local linearity assumptions untenable.
The E-and M-steps. Exactly the same update procedure can be used for the E-and M-steps. However, in this instance there are no generalised coordinates to consider. Furthermore, we can set the interval between updates to be arbitrarily long because the parameters are updated after the time-series has been integrated. If DtR' is sufficiently large, the matrix exponential in Equation 25 disappears (because the curvature of the Jacobian is negative definite) giving Equation 29 is a conventional Gauss-Newton update scheme. In this sense, the D-Step can be regarded as a generalization of classical ascent schemes to generalised coordinates that cover dynamic systems. For our HDM, the requisite gradients and curvatures of variational action for the E-step are Similarly, for the hyperparameters Although uncertainty about the hyperparameters does not affect the states and parameters, uncertainty about both the states and parameters affect the hyperparameter update. These steps represent a full variational scheme. A simplified version, which discounts uncertainty about the parameters and states in the D and E-steps, would be the analogue of an EM scheme. This simplification is easy to implement by removing W(t) h and W(t) u from the D-and E-steps respectively. We will pursue this in the context of neurobiological implementations in the last section.
Summary. These updates furnish a variational scheme under the Laplace approximation. To further simplify things, we will assume Dt = 1, such that sampling intervals serve as units of time. With these simplifications, the DEM scheme can be summarised as iterating until convergence D-step (states).
In this section, we have seen how the inversion of dynamic models can be formulated as an optimization of action. This action is the anti-derivative or path-integral of free-energy associated with changing states and a constant (of integration) corresponding to the prior energy of time-invariant parameters. By assuming a fixed-form (Laplace) approximation to the conditional density, one can reduce optimisation to finding the conditional modes of unknown quantities, because their conditional covariance is simply the curvature of the internal action (evaluated at the mode). The conditional modes of (mean-field) marginals optimise variational action, which can be framed in terms of gradient ascent. For the states, this entails finding a path or trajectory with stationary variational action. This can be formulated as a gradient ascent in a frame of reference that moves along the path encoded in generalised coordinates.

Results
In this section, we review the model and inversion scheme of the previous section in light of established procedures for supervised and self-supervised learning. This section considers HDMs from the pragmatic point of view of statistics and machine learning, where the data are empirical and arrive as discrete data sequences. In the next section, we revisit these models and their inversion from the point of view of the brain, where the data are sensory and continuous. This section aims to establish the generality of HDMs by showing that many well-known approaches to data can be cast as an inverting a HDM under simplifying assumptions. It recapitulates the unifying perspective of Roweis and Ghahramani [20] with a special focus on hierarchical models and the triple estimation problems DEM can solve. We start with supervised learning and then move to unsupervised schemes. Supervised schemes are called for when causes are known but the parameters are not. Conversely, the parameters may be known and we may want to estimate causes or hidden states. This leads to a distinction between identification of a model's parameters and estimation of its states. When neither the states nor parameters are known, the learning is unsupervised. We will consider models in which the parameters are unknown, the states are unknown or both are unknown. Within each class, we will start with static models and then consider dynamic models.
All the schemes described in this paper are available in Matlab code as academic freeware (http://www.fil.ion.ucl.ac.uk\spm). The simulation figures in this paper can be reproduced from a graphical user interface called from the DEM toolbox.

Models with Unknown Parameters
In these models the causes are known and enter as priors g with infinite precision; S v = 0. Furthermore, if the model is static or, more generally when g x = 0, we can ignore hidden states and dispense with the D-step.
Static models and neural networks. Usually, supervised learning entails learning the parameters of static nonlinear generative models with known causes. This corresponds to a HDM with infinitely precise priors at the last level, any number of subordinate levels (with no hidden states) One could regard this model as a neural network with m hidden layers. From the neural network perspective, the objective is to optimise the parameters of a nonlinear mapping from data y to the desired output g, using back propagation of errors or related approaches [21]. This mapping corresponds to inversion of the generative model that maps causes to data; g (i) : gRy. This inverse problem is solved by DEM. However, unlike back propagation of errors or universal approximation in neural networks [22], DEM is not simply a nonlinear function approximation device. This is because the network connections parameterise a generative model as opposed to its inverse; h: yRg (i.e., recognition model). This means that the parameters specify how states cause data and can therefore be used to generate data. Furthermore, unlike many neural network or PDP (parallel distributed processing) schemes, DEM enables Bayesian inference through an explicit parameterisation of the conditional densities of the parameters.
Nonlinear system identification. In nonlinear optimisation, we want to identify the parameters of a static, nonlinear function that maps known causes to responses. This is a trivial case of the static model above that obtains when the hierarchical order reduces to m = 1 The conditional estimates of h (1) optimise the mapping g (1) : gRy for any specified form of generating function. Because there are no dynamics, the generalised motion of the response is zero, rendering the D-step and generalised coordinates redundant. Therefore, identification or inversion of these models reduces to conventional expectation-maximisation (EM), in which the parameters and hyperparameters are optimised recursively, through a coordinate ascent on the variational energy implicit in the E and M-steps. Expectation-maximisation has itself some ubiquitous special cases, when applied to simple linear models: The general linear model. Consider the linear model, with a response that has been elicited using known causes, y = h (1) g+z (1) . If we start with an initial estimate of the parameters, h (1) = 0, the Estep reduces to These are the standard results for the conditional expectation and covariance of a general linear model, under parametric (i.e., Gaussian error) assumptions. From this perspective, the known causes g T play the role of explanatory variables that are referred to collectively in classical statistics as a design matrix. This can be seen more easily by considering the transpose of the linear model in Equation 34; y T = g T h (1)T +z (1)T . In this form, the causes are referred to as explanatory or independent variables and the data as response or dependent variables. A significant association between these two sets of variables is usually established by testing the null hypothesis that h (1) = 0. This proceeds either by comparing the evidence for (full or alternate) models with and (reduced or null) models without the appropriate explanatory variables or using the conditional density of the parameters, under the full model. If we have flat priors on the parameters, P h = 0, the conditional moments in Equation (35) become maximum likelihood (ML) estimators. Finally, under i.i.d. (identically and independently distributed) assumptions about the errors, the dependency on the hyperparameters disappears (because the precisions cancel) and we obtain ordinary least squares (OLS) estimates; m h = g 2 y T , where g 2 = (gg T ) 21 g is the generalised inverse.
It is interesting to note that transposing the general linear model is equivalent to the switching the roles of the causes and parameters; h (1)T « g. Under this transposition, one could replace the D-step with the E-step. This gives exactly the same results because the two updates are formally identical for static models, under which The exponential term disappears because the update is integrated until convergence; i.e., Dt = '. At this point, generalised motion is zero and an embedding order of n = 0)D = 0 is sufficient. This is a useful perspective because it suggests that static models can be regarded as models of steady-state or equilibrium responses, for systems with fixed point attractors.
Identifying dynamic systems. In the identification of nonlinear dynamic systems, one tries to characterise the architecture that transforms known inputs into measured outputs. This transformation is generally modelled as a generalised convolution [23]. When then inputs are known deterministic quantities the following m = 1 dynamic model applies Here g and y play the role of inputs (priors) and outputs (responses) respectively. Note that there is no state-noise; i.e., S w = 0 because the states are known. In this context, the hidden states become a deterministic nonlinear convolution of the causes [23]. This means there is no conditional uncertainty about the states (given the parameters) and the D-step reduces to integrating the stateequation to produce deterministic outputs. The E-Step updates the conditional parameters, based on the resulting prediction error and the M-Step estimates the precision of the observation error. The ensuing scheme is described in detail in [24], where it is applied to nonlinear hemodynamic models of fMRI time-series. This is an EM scheme that has been used widely to invert deterministic dynamic causal models of biological time-series. In part, the motivation to develop DEM was to generalise EM to handle state-noise or random fluctuations in hidden states. The extension of EM schemes into generalised coordinates had not yet been fully explored and represents a potentially interesting way of harnessing serial correlations in observation noise to optimise the estimates of a system's parameters. This extension is trivial to implement with DEM by specifying very high precisions on the causes and state-noise.

Models with Unknown States
In these models, the parameters are known and enter as priors g h with infinite precision, S h = 0. This renders the E-Step redundant. We will review estimation under static models and then consider Bayesian deconvolution and filtering with dynamic models. Static models imply the generalised motion of causal states is zero and therefore it is sufficient to represent conditional uncertainty on their amplitude; i.e., n = 0)D = 0. As noted above the D-step for static models is integrated until convergence to a fixed point, which entails setting Dt = '; see [15]. Note that making n = 0 renders the roughness parameter irrelevant because this only affects the precision of generalised motion.
Estimation with static models. In static systems, the problem reduces to estimating the causes of inputs after they are passed through some linear or nonlinear mapping to generate observed responses. For simple nonlinear estimation, in the absence of prior expectations about the causes, we have the nonlinear hierarchal model This is the same as Equation 33 but with unknown causes. Here, the D-Step performs a nonlinear optimisation of the states to estimate their most likely values and the M-Step estimates the variance components at each level. As mentioned above, for static systems, Dt = ' and n = 0. This renders it a classical Gauss-Newton scheme for nonlinear model estimation Empirical priors are embedded in the scheme through the hierarchical construction of the prediction errors, e and their precision P, in the usual way; see Equation 11 and [15] for more details.
Linear models and parametric empirical Bayes. When the model above is linear, we have the ubiquitous hierarchical linear observation model used in Parametric Empirical Bayes (PEB; [8]) and mixed-effects analysis of covariance (ANCOVA) analyses.
Here the D-Step converges after a single iteration because the linearity of this model renders the Laplace assumption exact. In this context, the M-Step becomes a classical restricted maximum likelihood (ReML) estimation of the hierarchical covariance components, S (i)z . It is interesting to note that the ReML objective function and the variational energy are formally identical under this model [15,18]. Figure 3 shows a comparative evaluation of ReML and DEM using the same data. The estimates are similar but not identical. This is because DEM hyperparameterises the covariance as a linear mixture of precisions, whereas the ReML scheme used a linear mixture of covariance components.
Covariance component estimation and Gaussian process models. When there are many more causes then observations, a common device is to eliminate the causes in Equation 40 by recursive substitution to give a model that generates sample covariances and is formulated in terms of covariance components (i.e., hyperparameters). Causes were sampled from the unit normal density to generate a response, which was used to recover the causes, given the parameters. Slight differences in the hyperparameter estimates (upper right), due to a different hyperparameterisation, have little effect on the conditional means of the unknown causes (lower right), which are almost indistinguishable. doi:10.1371/journal.pcbi.1000211.g003 Inversion then reduces to iterating the M-step. The causes can then be recovered from the hyperparameters using Equation 39 and the matrix inversion lemma. This can be useful when inverting ill-posed linear models (e.g., the electromagnetic inversion problem; [25]). Furthermore, by using shrinkage hyperpriors one gets a behaviour known as automatic relevance determination (ARD), where irrelevant components are essentially switched off [26]. This leads to sparse models of the data that are optimised automatically. The model in Equation 41 is also referred to as a Gaussian process model [27][28][29]. The basic idea behind Gaussian process modelling is to replace priors p(v) on the parameters of the mapping, g(v): vRy with a prior on the space of mappings; p(g(v)). The simplest is a Gaussian process prior (GPP), specified by a Gaussian covariance function of the response, S(y|l). The form of this GPP is furnished by the hierarchical structure of the HDM.
Deconvolution and dynamic models. In deconvolution problems, the objective is to estimate the inputs to a dynamic system given its response and parameters.
This model is similar to Equation 37 but now we have random fluctuations on the unknown states. Estimation of the states proceeds in the D-Step. Recall the E-Step is redundant because the parameters are known. When S (1) is known, the M-Step is also unnecessary and DEM reduces to deconvolution. This is related to Bayesian deconvolution or filtering under state-space models: State-space models and filtering. State-space models have the following form in discrete time and rest on a vector autoregressive (VAR) formulation where w t is a standard noise term. These models are parameterised by a system matrix A, an input matrix B, and an observation matrix g x . State-space models are special cases of linear HDMs, where the system-noise can be treated as a cause with random fluctuations Dt Notice that we have had to suppress state-noise in the HDM to make a simple state-space model. These models are adopted by conventional approaches for inference on hidden states in dynamic models: Deconvolution under HDMs is related to Bayesian approaches to inference on states using Bayesian belief update procedures (i.e., incremental or recursive Bayesian filters). The conventional approach to online Bayesian tracking of nonlinear or non-Gaussian systems employs extended Kalman filtering [30] or sequential Monte Carlo methods such as particle filtering. These Bayesian filters try to find the posterior densities of the hidden states in a recursive and computationally expedient fashion, assuming that the parameters and hyperparameters of the system are known. The extended Kalman filter is a generalisation of the Kalman filter in which the linear operators, of the state-space equations, are replaced by their partial derivatives evaluated at the current conditional mean. See also Wang and Titterington [31] for a careful analysis of variational Bayes for continuous linear dynamical systems and [32] for a review of the statistical literature on continuous nonlinear dynamical systems. These treatments belong to the standard class of schemes that assume Wiener or diffusion processes for state-noise and, unlike HDM, do not consider generalised motion.
In terms of establishing the generality of the HDM, it is sufficient to note that Bayesian filters simply estimate the conditional density on the hidden states of a HDM. As intimated in the introduction, their underlying state-space models assume that z t and w t are serially independent to induce a Markov property over sequential observations. This pragmatic but questionable assumption means the generalised motion of the random terms have zero precision and there is no point in representing generalised states. We have presented a fairly thorough comparative evaluation of DEM and extended Kalman filtering (and particle filtering) in [2]. DEM is consistently more accurate because it harvests empirical priors in generalised coordinates of motion. Furthermore, DEM can be used for both inference on hidden states and the random fluctuations driving them, because it uses an explicit conditional density q(x,ṽ) over both.

Models with Unknown States and Parameters
In all the examples below, both the parameters and states are unknown. This entails a dual or triple estimation problem, depending on whether the hyperparameters are known. We will start with simple static models and work towards more complicated dynamic variants. See [33] for a comprehensive review of unsupervised learning for many of the models in this section. This class of models is often discussed under the rhetoric of blind source separation (BSS), because the inversion is blind to the parameters that control the mapping from sources or causes to observed signals.
Principal components analysis. The Principal Components Analysis (PCA) model assumes that uncorrelated causes are mixed linearly to form a static observation. This is a m = 1 model with no observation noise; i.e., S (1)z = 0.
where priors on v (1) = z (2) render them orthonormal S v = I. There is no M-Step here because there are no hyperparameters to estimate. The D-Step estimates the causes under the unitary shrinkage priors on their amplitude and the E-Step updates the parameters to account for the data. Clearly, there are more efficient ways of inverting this model than using DEM; for example, using the eigenvectors of the sample covariance of the data. However, our point is that PCA is a special case of an HDM and that any optimal solution will optimise variational action or energy. Nonlinear PCA is exactly the same but allowing for a nonlinear generating function.
See [34] for an example of nonlinear PCA with a bilinear model applied to neuroimaging data to disclose interactions among modes of brain activity. Factor analysis and probabilistic PCA. The model for factor analysis is exactly the same as for PCA but allowing for observation error When the covariance of the observation error is spherical; e.g., S (1)z = l (1)z I, this is also known as a probabilistic PCA model [35]. The critical distinction, from the point of view of the HDM, is that the M-Step is now required to estimate the error variance. See Figure 4 for a simple example of factor analysis using DEM.
Nonlinear variants of factor analysis obtain by analogy with Equation 46. Independent component analysis. Independent component analysis (ICA) decomposes the observed response into a linear mixture of non-Gaussian causes [36]. Non-Gaussian causal states are implemented simply in m = 2 hierarchical models with a nonlinear transformation at higher levels. ICA corresponds to Where, as for PCA, S v = I. The nonlinear function g (2) transforms a Gaussian cause, specified by the priors at the third level, into a non-Gaussian cause and plays the role of a probability integral transform. Note that there are no hyperparameters to estimate and consequently there is no M-Step. It is interesting to examine the relationship between nonlinear PCA and ICA; the key difference is that the nonlinearity is in the first level in PCA, as opposed to the second in ICA. Usually, in ICA the probability integral transform is pre-specified to render the second-level causes supra-Gaussian. From the point of view of a HDM this corresponds to specifying precise priors on the second-level parameters. However, DEM can fit unknown distributions by providing conditional estimates of both the mixing matrix h (1) and the probability integral transform implicit in g(v (2) ,h (2) ). Sparse coding. In the same way that factor analysis is a generalisation of PCA to non-Gaussian causes, ICA can be extended to form sparse-coding models of the sort proposed by Olshausen and Fields [37] by allowing observation error.
This is exactly the same as the ICA model but with the addition of observation error. By choosing g (2) to create heavy-tailed (supra-Gaussian) second-level causes, sparse encoding is assured in the sense that the causes will have small values on most occasions and large values on only a few. Note the M-Step comes into play again for these models. All the models considered so far are for static data. We now turn to BSS in dynamic systems. Blind deconvolution. Blind deconvolution tries to estimate the causes of an observed response without knowing the parameters of the dynamical system producing it. This represents the least constrained problem we consider and calls upon the same HDM used for system identification. An empirical example of triple estimation of states, parameters and hyperparameters can be found in [2]. This example uses functional magnetic resonance imaging time-series from a brain region to estimate not only the underlying neuronal and hemodynamic states causing signals but the parameters coupling experimental manipulations to neuronal activity. See Friston et al. [2] for further examples, ranging from the simple convolution model considered next, through to systems showing autonomous dynamics and deterministic chaos. Here we conclude with a simple m = 2 linear convolution model (Equation 42), as specified in Table 1.
In this model, causes or inputs perturb the hidden states, which decay exponentially to produce an output that is a linear mixture of hidden states. Our example used a single input, two hidden states and four outputs. To generate data, we used a deterministic Gaussian bump function input v (1) = exp( 1 / 4 (t212) 2 ) and the following parameters During inversion, the cause is unknown and was subject to mildly informative (zero mean and unit precision) shrinkage priors. We also treated two of the parameters as unknown; one parameter from the observation function (the first) and one from the state equation (the second). These parameters had true values of 0.125 and 20.5, respectively, and uninformative shrinkage priors. The priors on the hyperparameters, sometimes referred to as hyperpriors were similarly uninformative. These Gaussian hyperpriors effectively place lognormal hyperpriors on the precisions (strictly speaking, this invalidates the assumption of a linear hyperparameterisation but the effects are numerically small), because the precisions scale as exp(l z ) and exp(l w ). Figure 5 shows a schematic of the generative model and the implicit recognition scheme based on prediction errors. This scheme can be regarded as a message passing scheme that is considered in more depth in the next section. Figure 6 summarises the results after convergence of DEM (about sixteen iterations using an embedding order of n = 6, with a roughness hyperparameter, c = 4). Each row corresponds to a level in the model, with causes on the left and hidden states on the right. The first (upper left) panel shows the predicted response and the error on this response. For the hidden states (upper right) and causes (lower left) the conditional mode is depicted by a coloured line and the 90% conditional confidence intervals by the grey area.
It can be seen that there is a pleasing correspondence between the conditional mean and veridical states (grey lines). Furthermore, the true values lie largely within the 90% confidence intervals; similarly for the parameters. This example illustrates the recovery of states, parameters and hyperparameters from observed timeseries, given just the form of a model.
Summary. This section has tried to show that the HDM encompasses many standard static and dynamic observation models. It is further evident than many of these models could be extended easily within the hierarchical framework. Figure 7 illustrates this by providing a ontology of models that rests on the various constraints under which HDMs are specified. This partial list suggests that only a proportion of potential models have been covered in this section.
In summary, we have seen that endowing dynamical models with a hierarchical architecture provides a general framework that covers many models used for estimation, identification and unsupervised learning. A hierarchical structure, in conjunction with nonlinearities, can emulate non-Gaussian behaviours, even when random effects are Gaussian. In a dynamic context, the level at which the random effects enter controls whether the system is deterministic or stochastic and nonlinearities determine whether their effects are additive or multiplicative. DEM was devised to find the conditional moments of the unknown quantities in these nonlinear, hierarchical and dynamic models. As such it emulates procedures as diverse as independent components analysis and Bayesian filtering, using a single scheme. In the final section, we show that a DEM-like scheme might be implemented in the brain. If this is true, the brain could, in principle, employ any of the models considered in this section to make inferences about the sensory data it harvests. Parameters and causes were sampled from the unit normal density to generate a response, which was then used for their estimation. The aim was to recover the causes without knowing the parameters, which is effected with reasonable accuracy (upper). The conditional estimates of the causes and parameters are shown in lower panels, along with the increase in free-energy or log-evidence, with the number of DEM iterations (lower left). Note that there is an arbitrary affine mapping between the conditional means of the causes and their true values, which we estimated, post hoc to show the correspondence in the upper panel. doi:10.1371/journal.pcbi.1000211.g004 Table 1. Specification of a linear convolution model. Neuronal Implementation In this final section, we revisit DEM and show that it can be formulated as a relatively simple neuronal network that bears many similarities to real networks in the brain. We have made the analogy between the DEM and perception in previous communications; here we focus on the nature of recognition in generalised coordinates. In brief, deconvolution of hidden states and causes from sensory data (D-step) may correspond to perceptual inference; optimising the parameters of the model (E-step) may correspond to perceptual learning through changes in synaptic efficacy and optimising the precision hyperparameters (M-step) may correspond to encoding perceptual salience and uncertainty, through neuromodulatory mechanisms.
Hierarchical models in the brain. A key architectural principle of the brain is its hierarchical organisation [38][39][40][41]. This has been established most thoroughly in the visual system, where lower (primary) areas receive sensory input and higher areas adopt a multimodal or associational role. The neurobiological notion of a hierarchy rests upon the distinction between forward and backward connections [42][43][44][45]. This distinction is based upon the specificity of cortical layers that are the predominant sources and origins of extrinsic connections (extrinsic connections couple remote cortical regions, whereas intrinsic connections are confined to the cortical sheet). Forward connections arise largely in superficial pyramidal cells, in supra-granular layers and terminate on spiny stellate cells of layer four in higher cortical areas [40,46]. Conversely, backward connections arise largely from deep pyramidal cells in infra-granular layers and target cells in the infra and supra-granular layers of lower cortical areas. Intrinsic connections mediate lateral interactions between neurons that are a few millimetres away. There is a key functional asymmetry between forward and backward connections that renders backward connections more modulatory or nonlinear in their effects on neuronal responses (e.g., [44]; see also Hupe et al. [47]). This is consistent with the deployment of voltage-sensitive NMDA receptors in the supra-granular layers that are targeted by backward connections [48]. Typically, the synaptic dynamics of backward connections have slower time constants. This has led to the notion that forward connections are driving and illicit an obligatory response in higher levels, whereas backward connections have both driving and modulatory effects and operate over larger spatial and temporal scales. and four outputs y 1 ,…,y 4 . The lines denote the dependencies of the variables on each other, summarised by the equations (in this example both the equations were simple linear mappings). This is effectively a linear convolution model, mapping one cause to four outputs, which form the inputs to the recognition model (solid arrow). The inputs to the four data or sensory channels are also shown as an image in the insert. doi:10.1371/journal.pcbi.1000211.g005 The hierarchical structure of the brain speaks to hierarchical models of sensory input. We now consider how this functional architecture can be understood under the inversion of HDMs by the brain. We first consider inference on states or perception.
Perceptual inference. If we assume that the activity of neurons encode the conditional mode of states, then the D-step specifies the neuronal dynamics entailed by perception or recognizing states of the world from sensory data. Furthermore, if we ignore mean-field terms; i.e., discount the effects of conditional uncertainty about the parameters when optimising the states, Equation 23 prescribes very simple recognition dynamics _ m m m m~V t ð Þ u zDm m Dm m{ẽ e T u j Where, j~e P Pẽ e~ẽ e{Lj is prediction error multiplied by its precision, which we have re-parameterised in terms of a covariance component, L~e S S{I. Here, the matrix L can be thought of as lateral connections among error-units. Equation 51 is an ordinary differential equation that describes how neuronal states self-organise, when exposed to sensory input. The form of Equation 51 is quite revealing, it suggests two distinct populations of neurons; state-units whose activity encodesm m t ð Þ and error-units encoding j(t), with one error-unit for each state. Furthermore, the activities of error-units are a function of the states and the dynamics of state-units are a function of prediction error. This means the two populations pass messages to each other and to themselves. The messages passed among the states, Dm m mediate empirical priors on their motion, while the lateral connections among the error-units, 2Lj weight prediction errors in proportion to their precision.
Hierarchical message passing. If we unpack these equations we can see the hierarchical nature of this message passing (see Figure 8).
This shows that error-units receive messages from the states in the same level and the level above, whereas states are driven by errorunits in the same level and the level below. Critically, inference requires only the prediction error from the lower level j (i) and the level in question, j (i+1) . These constitute bottom-up and lateral messages that drive conditional meansm m i ð Þ towards a better prediction, to explain away the prediction error in the level below. These top-down and lateral predictions correspond to g (i ) and f˜( i ) . This is the essence of recurrent message passing between hierarchical levels to optimise free-energy or suppress prediction error; i.e., recognition dynamics.
The connections from error to state-units have a simple form that depends on the gradients of the model's functions; from Equation 12 These pass prediction errors forward to state-units in the higher level and laterally to state-units at the same level. The reciprocal influences of the state on the error-units are mediated by backward connections and lateral interactions. In summary, all connections between error and state-units are reciprocal, where the only connections that link levels are forward connections conveying prediction error to state-units and reciprocal backward connections that mediate predictions (see Figure 8). We can identify error-units with superficial pyramidal cells, because the only messages that pass up the hierarchy are Figure 6. The predictions and conditional densities on the states and parameters of the linear convolution model of the previous figure. Each row corresponds to a level, with causes on the left and hidden states on the right. In this case, the model has just two levels. The first (upper left) panel shows the predicted response and the error on this response (their sum corresponds to the observed data). For the hidden states (upper right) and causes (lower left) the conditional mode is depicted by a coloured line and the 90% conditional confidence intervals by the grey area. These are sometimes referred to as ''tubes''. Finally, the grey lines depict the true values used to generate the response. Here, we estimated the hyperparameters, parameters and the states. This is an example of triple estimation, where we are trying to infer the states of the system as well as the parameters governing its causal architecture. The hyperparameters correspond to the precision of random fluctuations in the response and the hidden states. The free parameters correspond to a single parameter from the state equation and one from the observer equation that govern the dynamics of the hidden states and response, respectively. It can be seen that the true value of the causal state lies within the 90% confidence interval and that we could infer with substantial confidence that the cause was non-zero, when it occurs. Similarly, the true parameter values lie within fairly tight confidence intervals (red bars in the lower right). doi:10.1371/journal.pcbi.1000211.g006 prediction errors and superficial pyramidal cells originate forward connections in the brain. This is useful because it is these cells that are primarily responsible for electroencephalographic (EEG) signals that can be measured non-invasively. Similarly the only messages that are passed down the hierarchy are the predictions from state-units that are necessary to form prediction errors in lower levels. The sources of extrinsic backward connections are largely the deep pyramidal cells and one might deduce that these encode the expected causes of sensory states (see [49] and Figure 9). Critically, the motion of each state-unit is a linear mixture of bottom-up prediction error; see Equation 52. This is exactly what is observed physiologically; in that bottom-up driving inputs elicit obligatory responses that do not depend on other bottom-up inputs. The prediction error itself is formed by predictions conveyed by backward and lateral connections. These influences embody the nonlinearities implicit in g (i ) and f˜( i ) . Again, this is entirely consistent with the nonlinear or modulatory characteristics of backward connections.
Encoding generalised motion. Equation 51 is cast in terms of generalised states. This suggests that the brain has an explicit representation of generalised motion. In other words, there are separable neuronal codes for different orders of motion. This is perfectly consistent with empirical evidence for distinct populations of neurons encoding elemental visual features and their motion (e.g., motion-sensitive area V5; [39]). The analysis in this paper suggests that acceleration and higher-order motion are also encoded; each order providing constraints on a lower order, through Dm m. Here, D represents a fixed connectivity matrix that mediates these temporal constraints. Notice that _ m m m m~Dm m only wheñ e e T u j~0. This means it is perfectly possible to represent the motion of a state that is inconsistent with the state of motion. The motion after-effect is a nice example of this, where a motion percept coexists with no change in the perceived location of visual stimuli. The encoding of generalised motion may mean that we represent paths or trajectories of sensory dynamics over short periods of time and that there is no perceptual instant (c.f., the remembered present; [50]). One could speculate that the encoding of different orders of motion may involve rate codes in distinct neuronal populations or multiplexed temporal codes in the same populations (e.g., in different frequency bands). See [51] for a neurobiologically realistic treatment of temporal dynamics in decision-making during motion perception and [52] for a discussion of synchrony and attentive learning in laminar thalamocortical circuits.
When dealing with empirical data-sequences one has to contend with sparse and discrete sampling. Analogue systems, like the brain can sample generalised motion directly. When sampling sensory data, one can imagine easily how receptors generatem m 0 ð Þ :~ỹ y. Indeed, it would be surprising to find any sensory system that did not respond to a high-order derivative of changing sensory fields (e.g., acoustic edge detection; offset units in the visual system, etc; [53]). Note that sampling high-order derivatives is formally equivalent to high-pass filtering sensory data. A simple consequence of encoding generalised motion is, in electrophysiological terms, the emergence of spatiotemporal receptive fields that belie selectivity to particular sensory trajectories.
Perceptual learning and plasticity. The conditional expectations of the parameters, m h control the construction of prediction error through backward and lateral connections. This suggests that they are encoded in the strength of extrinsic and intrinsic connections. If we define effective connectivity as the rate of change of a unit's response with respect to its inputs, Equation 51 suggests an interesting antisymmetry in the effective connectivity between the state and error-units. The effective connectivity from the states to the error-units is Lm m j~ẽ e u . This is simply the negative transpose of the effective connectivity that mediates recognition dynamics; L j _ m m m m~{ẽ e T u . In other words, the effective connection from any state to any error-unit has the same strength (but opposite sign) of the reciprocal connection from the error to the state-unit. This means we would expect to see connections reciprocated in the brain, which is generally the case [39,40]. Furthermore, we would not expect to see positive feedback loops; c.f., [54]. We now consider the synaptic efficacies underlying effective connectivity.
If synaptic efficacy encodes the parameter estimates, we can cast parameter optimisation as changing synaptic connections. These changes have a relatively simple form that is recognisable as associative plasticity. To show this, we will make the simplifying but plausible assumption that the brain's generative model is based on nonlinear functions a of linear mixtures of states Under this assumption h i ð Þ j correspond to matrices of synaptic strengths or weights and a can be understood as a neuronal Figure 8. Schematic detailing the neuronal architectures that encode an ensemble density on the states and parameters of one level in a hierarchical model. This schematic shows the speculative cells of origin of forward driving connections that convey prediction error from a lower area to a higher area and the backward connections that are used to construct predictions. These predictions try to explain away input from lower areas by suppressing prediction error. In this scheme, the sources of forward connections are the superficial pyramidal cell population and the sources of backward connections are the deep pyramidal cell population. The differential equations relate to the optimisation scheme detailed in the main text and their constituent terms are placed alongside the corresponding connections. The state-units and their efferents are in black and the error-units in red, with causes on the left and hidden states on the right. For simplicity, we have assumed the output of each level is a function of, and only of, the hidden states. This induces a hierarchy over levels and, within each level, a hierarchical relationship between states, where hidden states predict causes. doi:10.1371/journal.pcbi.1000211.g008 activation function that models nonlinear summation of presynaptic inputs over the dendritic tree [55]. This means that the synaptic connection to the ith error from the jth state depends on only one parameter, m h ij which changes according to Equation 29 This suggests that plasticity comprises an associative term a h ij and a decay term mediating priors on the parameters. The dynamics of the associative term are given by Equation 21 (and exploiting the Kronecker form of Equation 22). The integral of this associative term is simply the covariance between presynaptic input and postsynaptic prediction error, summed over orders of motion. In short, it mediates associative or Hebbian plasticity. The product of pre and postsynaptic signals j T i e m m j is modulated by an activitydependent term, a' i , which is the gradient of the activation function at its current level of input (and is constant for linear models). Critically, updating the conditional estimates of the parameters, through synaptic efficacies, m h ij , uses local information that is available at each error-unit. Furthermore, the same information is available at the synaptic terminal of the reciprocal connection, where the ith error-unit delivers presynaptic inputs to the jth state. In principle, this enables reciprocal connections to change in tandem. Finally, because plasticity is governed by two coupled ordinary differential equations (Equation 55), connection strengths should change more slowly than the neuronal activity they mediate. These theoretical predictions are entirely consistent with empirical and computational characterisations of plasticity [56,57].
Perceptual salience and uncertainty. Equation 51 shows that the influence of prediction error is scaled by its precision e P P or covariance e S S~LzI that is a function of m l . This means that the relative influence of bottom-up, lateral and top-down effects are modulated by the conditional expectation of the hyperparameters. This selective modulation of afferents mirrors the gain-control mechanisms invoked for attention; e.g., [58,59]. Furthermore, they enact the sorts of mechanisms implicated in biased competition models of spatial and object-based attention mediating visual search [60,61]. Equation 51 formulates this bias or gain-control in terms of lateral connections, L~e S S{I among error-units. This means hyperparameter optimisation would be realised, in the brain, as neuromodulation or plasticity of lateral interactions among errorunits. If we assume that the covariance is a linear mixture of covariance components, R i among non-overlapping subsets of error-units, then Where L i~Ri m l i . Under this hyperparameterisation, m l i modulates subsets of connections to encode a partition of the covariance. Because each set of connections is a function of only one hyperparameter, their plasticity is prescribed simply by Equation 31 The quantities m l i might correspond to specialised (e.g., noradrenergic or cholinergic) systems in the brain that broadcast their effects to the ith subset of error-units to modulate their responsiveness to each other. The activities of these units change relatively slowly, in proportion to an associative term a l i and decay that mediates hyperpriors. The associative term is basically the difference between the sample covariance of precision-weighted prediction errors and the precision expected, under the current value of m l i . As above, changes in m l i occur more slowly than the fast dynamics of the states because they are driven by a l i , which accumulates energy gradients to optimise variational action. One could think of m l i as the synaptic efficacy of lateral or intrinsic connections that depend upon classical neuromodulatory inputs and other slower synaptic dynamics (e.g., after-hyperpolarisation potentials and molecular signalling). The physiological aspects of these dynamics provide an interesting substrate for attentional mechanisms in the brain (see Schroeder et al., [62] for review) and are not unrelated to the ideas in [63]. These authors posit a role for acetylcholine (an ascending modulatory neurotransmitter) in mediating expected uncertainty. This is entirely consistent with the dynamics of a l i that are driven by the amplitude of prediction errors encoding the relative precision of sensory signals and empirical priors. Modulatory neurotransmitters have, characteristically, much slower time constants, in terms of their synaptic effects, than glutamatergic neurotransmission that is employed by cortico-cortical extrinsic connections.
The mean-field partition. The mean-field approximation q(W) = q(u(t))q(h)q(l) enables inference about perceptual states, causal regularities and context, without representing the joint distribution explicitly; c.f., [64]. However, the optimisation of one set of sufficient statistics is a function of the others. This has a fundamental implication for optimisation in the brain (see Figure 10). For example, 'activity-dependent plasticity' and 'functional segregation' speak to reciprocal influences between changes in states and connections; in that changes in connections depend upon activity and changes in activity depend upon connections. Things get more interesting when we consider three sets, because quantities encoding precision must be affected by and affect neuronal activity and plasticity. This places strong constraints on the neurobiological candidates for these hyperparameters. Happily, the ascending neuromodulatory neurotransmitter systems, such as dopaminergic and cholinergic projections, have exactly the right characteristics: they are driven by activity in presynaptic connections and can affect activity though classical neuromodulatory effects at the post-synaptic membrane [65], while also enabling potentiation of connection strengths [66,67]. Furthermore, it is exactly these systems that Figure 10. The ensemble density and its mean-field partition. q(W) is the ensemble density and is encoded in terms of the sufficient statistics of its marginals. These statistics or variational parameters (e.g., mean or expectation) change to extremise free-energy to render the ensemble density an approximate conditional density on the causes of sensory input. The mean-field partition corresponds to a factorization over the sets comprising the partition. Here, we have used three sets (neural activity, modulation and connectivity). Critically, the optimisation of the parameters of any one set depends on the parameters of the other sets. In this figure, we have focused on means or expectations m i of the marginal densities, q(W i ) = N(W i : m i ,C i ). doi:10.1371/journal.pcbi.1000211.g010 have been implicated in value-learning [68][69][70], attention and the encoding of uncertainty [63,71].
Summary. We have seen that the brain has, in principle, the infrastructure needed to invert hierarchical dynamic models of the sort considered in previous sections. It is perhaps remarkable that such a comprehensive treatment of generative models can be reduced to recognition dynamics that are as simple as Equation 51. Having said this, the notion that the brain inverts hierarchical models, using a DEM-like scheme, speaks to a range of empirical facts about the brain: N The hierarchical organisation of cortical areas (c.f., [39]) N Each area comprises distinct neuronal subpopulations, encoding expected states of the world and prediction error (c.f., [72]).
N Extrinsic forward connections convey prediction error (from superficial pyramidal cells) and backward connections mediate predictions, based on hidden and causal states (from deep pyramidal cells) [49].
N Recurrent dynamics are intrinsically stable because they are trying to suppress prediction error [54,64].
N Principal cells elaborating predictions (e.g., deep pyramidal cells) may show distinct (low-pass) dynamics, relative to those encoding error (e.g., superficial pyramidal cells) N Lateral interactions may encode the relative precision of prediction errors and change in a way that is consistent with classical neuromodulation (c.f., [63,71]).
N The rescaling of prediction errors by recurrent connections, in proportion to their precision, affords a form of cortical bias or gain control [73,74].
N The dynamics of plasticity and modulation of lateral interactions encoding precision or uncertainty (which optimise a path-integral of variational energy) must be slower than the dynamics of neuronal activity (which optimise variational energy per se) N Neuronal activity, synaptic efficacy and neuromodulation must all affect each other; activity-dependent plasticity and neuromodulation shape neuronal responses and: N Neuromodulatory factors play a dual role in modulating postsynaptic responsiveness (e.g., through modulating in afterhyperpolarising currents) and synaptic plasticity [66,67].
These observations pertain to the anatomy and physiology of neuronal architectures; see Friston et al. [5] for a discussion of operational and cognitive issues, under a free-energy principle for the brain.
We have tried to establish the generality of HDMs as a model that may be used by the brain. However, there are many alternative formulations that could be considered. Perhaps the work of Archambeau et al. [75] is formally the closest to one presented in this paper. These authors propose an approach that is very similar to DEM but is framed in terms of SDEs. A related formulation, with particle annihilation and symmetry breaking, has been proposed [76] as a mechanism for learning. This work adopts a path integral approach to optimal control theory and reinforcement learning. Cortical processing as the statistical result of the activity of neural ensembles is an established and important idea (e.g., [77,78]). Although, computationally intensive (see Discussion), particle filtering can be efficient [79]; Furthermore, it can be combined with local linearised sequential methods (e.g., the Ensemble Kalman Filter; [80]) to provide data assimilation methods for huge data sets. In fact, Schiff and Sauer [81] have recently proposed an Ensemble Kalman Filter for the control of cortical dynamics that could have biological and engineering significance. Finally, [82] proposes a path integral approach to particle filtering for data assimilation. The common theme here is the use of ensembles to represent more realistic and complicated conditional densities. Although the biological relevance of these exciting developments remains to be established they may provide insights into neuronal computations. They also speak to ensembles of HDMs, under the Laplace assumption, to approximate the conditional density with a mixture of Gaussians (Nelson Trujillo-Barreto -personal communication).
Clearly, the theoretical treatment of this section calls for an enormous amount of empirical verification and hypothesis testing, not least to disambiguate among alternative theories and architectures. We have laid out the neurobiological and psychophysical motivation for the neuronal implementation of DEM in [3] and [4]. These papers deal with inference in the brain and motivate an overarching free-energy principle, using the notion of equilibrium densities and active agents. In [83] we address the face validity of the neuronal scheme described in this section, using synthetic birds and the perceptual categorisation of birdsong. These papers try to emulate empirical LFP and EEG studies to establish the sorts of electrophysiological responses one would expect to see in paradigms, such as those used to elicit the mismatch negativity [84,85].

Discussion
We are now in a position to revisit some of the basic choices behind the DEM scheme, in light of its neuronal implementation. Some of these choices are generic and some are specific to neuronal inversion. All can be framed in terms of assumptions about the existence and form of the approximating conditional density, q(W). The first choice was to optimise a bound on the logevidence, as opposed to the evidence itself. This choice is mandated by the fact that the evidence entails an integral pỹ yjm ð Þ~ð pỹ y,qjm ð Þdq ð58Þ that is not generally tractable. In other words, this integral has no general analytic solution, particularly when the generative model is nonlinear. This means the brain is obliged to induce a bound approximation, through q(W).
The second choice was to use a fixed-form for q(W), as opposed to a free-form. This is a more pragmatic choice that is dictated by implementational constraints. A fixed-form approximation allows one to represent the density in terms of a small number of quantities; its sufficient statistics. A free-form approximation would require an infinite number of quantities encoding the density over its support. Clearly, a fixed-form is assumption is imperative for the brain and predominates in most practical applications. In engineering and machine learning, free-form densities are usually approximated by the sample density of a large number of 'particles' that populate state-space. For dynamic systems in generalised coordinates, the ensuing scheme is known as variational filtering [1]. For standard SSMs, which ignore the high-order motion of random fluctuations, particle filtering is the most common approach. However, the dimensionality of the representational problems entailed by neuronal computations probably precludes particle-based (i.e., free-form) representations: consider face recognition, a paradigm example in perceptual inference. Faces can be represented in a perceptual space of about thirty dimensions (i.e., faces have about thirty discriminable attributes). To populate a thirty-dimensional space we would need at least 2 30 particles, where each particle could correspond to the activity of thirty neurons (note that the conditional mean can be encoded with a single particle). The brain has about 2 11 neurons at its disposal. Arguments like this suggest that free-form approximations and their attending sampling schemes are not really viable in a neuronal context (although they have been considered; see [86] and above) The third choice was a mean-field approximation; q(W) = q(u(t))q(h)q(l). This allowed us to separate the optimisation of states from parameters, using separation of temporal scales. This allowed us to optimise the states online, while learning the parameters offline. The motivation here was more didactic; in that special cases of the ensuing scheme are formally equivalent to established analyses of discrete data sequences (e.g., expectation maximisation and restricted maximum likelihood). However, the mean-field factorisation is less critical in neuronal implementations because the brain optimises both states and parameters online. We portrayed the neuronal implementation as a DEM scheme in which conditional uncertainty about the parameters was ignored when optimising the states and vice versa (i.e., mean field-effects were ignored). Alternatively, we could have relaxed the mean-field assumption and treated the solutions to Equations 51 and 52 as optimising the mean of q(u(t),h) simultaneously. In this case, meanfield effects coupling states and parameters are no longer required.
The fourth assumption was that the fixed-form of q(W) was Gaussian. This Laplace assumption affords an important simplification that may be relevant for recognition schemes that deal with large amounts of data. Under the Laplace assumption, only the conditional mean has to be optimised (because the conditional covariance is a function of the mean). The resulting recognition dynamics (Equation 51) are simple and neuronally plausible. The Laplace assumption enforces a unimodal approximation but does not require the density of underlying causes to be Gaussian. This is because nonlinearities in HDMs can implement probability integral transforms. A common example is the use of log-normal densities for non-negative scale parameters. These are simple to implement under the Laplace assumption with a log-transform, W = ln b, which endows b with a log-normal density (we use this for precision hyperparameters; see the triple estimation example above). The unimodal constraint may seem restrictive; however, we know of no psychophysical or electrophysiological evidence for multimodal representations in the brain. In fact, the psychophysics of ambiguous stimuli and related bistable perceptual phenomena suggest that we can only represent one conditional cause or percept at a time.
The final choice was to include generalised motion under q(W). The alternative would have been to assume the precisions of z9,z0,…w9,w0,…; the generalised motion of the random fluctuations, were zero (i.e., assume a serially uncorrelated process). It is important to appreciate that generalised motion always exists; the choice is whether to ignore it or not. Variational filtering and DEM assume high-order motion exists to infinite order. This is because random fluctuations in biophysical systems are almost invariably the product of dynamical systems, which renders their serial correlations analytic ( [6], p 83; [23]). The resulting optimisation scheme is very simple (Equation 23) and is basically a restatement of Hamilton's principle of stationary action. If one ignores serial correlations, one could recourse to Extended Kalman filtering (EKF) or related Bayesian assimilation procedures for standard SSMs. From the perspective of DEM, these conventional procedures have an unduly complicated construction and deal only with a special (n = 0) case of dynamic models. In [2], we show that DEM and EKF give numerically identical results, when serial correlations are suppressed.
It is interesting to consider DEM in relation to common distinctions among inversion schemes: sequential data assimilation (SDA) vs. path integral approaches or integration vs. solution of differential equations. DEM blurs these distinctions somewhat: on the one hand, DEM is a path integral approach because the unknown quantities optimise action (the path integral of energy). On the other hand, it operates online and assimilates data with a differential equation (23), whose solution has stationary action. Furthermore, this equation can be integrated over time; indeed this is the mechanism suggested for neuronal schemes. However, when using DEM to analyse discrete data (e.g., the examples in the third section), this differential equation is solved over sampling intervals, using local linearization; c.f., [19].

Summary
In summary, any generic inversion scheme needs to induce a lower-bound on the log-evidence by invoking an approximating conditional density q(W) that, for dynamic systems, covers generalised motion. Physical constraints on the representation of q(W) enforce a fixed parameterised form so that is can be encoded in terms of its parameters or sufficient statistics. The Laplace or Gaussian assumption about this fixed-form affords a substantial simplification of recognition dynamics at the price of restricting recognition to unimodal probabilistic representations; a price that evolution may well have paid to optimise neuronal schemes. The mean-field approximation is ubiquitous in the statistics but may not be necessary in an online or neuronal setting.

Conclusion
In conclusion, we have seen how the inversion of a fairly generic hierarchical and dynamical model of sensory inputs can be transcribed onto neuronal quantities that optimise a variational bound on the evidence for that model This optimisation corresponds, under some simplifying assumptions, to suppression of prediction error at all levels in a cortical hierarchy. This suppression rests upon a balance between bottom-up (prediction error) influences and top-down (empirical prior) influences that are balanced by representations of their precision (uncertainty). These representations may be mediated by classical neuromodulatory effects and slow postsynaptic cellular processes that are driven by overall levels of prediction error.
The ideas presented in this paper have a long history, starting with the notion of neuronal energy [87]; covering ideas like efficient coding and analysis by synthesis [88,89] to more recent formulations in terms of Bayesian inversion and predictive coding (e.g., [90,91]). The specific contribution of this work is to establish the generality of models that may, at least in principle, be entertained by the brain.