The two kinds of free energy and the Bayesian revolution

The concept of free energy has its origins in 19th century thermodynamics, but has recently found its way into the behavioral and neural sciences, where it has been promoted for its wide applicability and has even been suggested as a fundamental principle of understanding intelligent behavior and brain function. We argue that there are essentially two different notions of free energy in current models of intelligent agency, that can both be considered as applications of Bayesian inference to the problem of action selection: one that appears when trading off accuracy and uncertainty based on a general maximum entropy principle, and one that formulates action selection in terms of minimizing an error measure that quantifies deviations of beliefs and policies from given reference models. The first approach provides a normative rule for action selection in the face of model uncertainty or when information processing capabilities are limited. The second approach directly aims to formulate the action selection problem as an inference problem in the context of Bayesian brain theories, also known as Active Inference in the literature. We elucidate the main ideas and discuss critical technical and conceptual issues revolving around these two notions of free energy that both claim to apply at all levels of decision-making, from the high-level deliberation of reasoning down to the low-level information processing of perception.


Introduction
There is a surprising line of thought connecting some of the greatest scientists of the last centuries, including Immanuel Kant, Hermann von Helmholtz, Ludwig E. Boltzmann, and Claude E. Shannon, whereby model-based processes of action, perception, and communication are explained with concepts borrowed from statistical physics. Inspired by Kant's Copernican revolution and motivated from his own studies of the physiology of the sensory system, Helmholtz was one of the first proponents of the analysis-by-synthesis approach to perception [1], whereby a perceiver is not simply conceptualized as some kind of tabula rasa recording raw external stimuli, but rather relies on internal models of the world to match and anticipate sensory inputs. The internal model paradigm is now ubiquitous in the cognitive and neural sciences and has even led some researchers to propose a Bayesian brain hypothesis, whereby the brain would essentially be a prediction and inference engine based on internal models [2][3][4]. Coincidentally, Helmholtz also invented the notion of the Helmholtz free energy that plays an [...] distributions. This way, any behavioral, environmental, and hidden variables can be related by their statistics, and dynamical changes can be modelled by changes in their distributions.
Consider, for example, the simple probabilistic model illustrated in Fig 1, consisting of the (for simplicity, discrete) variables past and future soil quality $\mathbf{S} := (S, S')$, past and future crop yields $\mathbf{X} := (X, X')$, and fertilization $A$. The graphical model shown in the figure corresponds to the joint probability $p_0(\mathbf{X}, \mathbf{S}, A)$ given by the factorization
$$p_0(\mathbf{X}, \mathbf{S}, A) = p_0(S)\, p_0(X|S)\, p_0(A)\, p_0(S'|S, A)\, p_0(X'|S') \tag{1}$$
where $p_0(S)$ is the base probability of the past soil quality $S$, $p_0(X|S)$ is the probability of crop yields $X$ depending on the past soil quality $S$, and so forth. Given the joint distribution we can now ask questions about each of the variables. For example, we could ask about the probability distribution $p(S|X = x)$ of soil quality $S$ if we are told that the crop yields $X$ are equal to a value $x$. We can obtain the answer from the probabilistic model $p_0$ by doing Bayesian inference, yielding the Bayes' posterior
$$p(S|X) = \frac{p(S, X)}{\sum_s p(s, X)} \tag{2}$$
where the dependencies on $X'$, $S'$, and $A$ have been summed out to calculate the marginal $p(S, X)$. In general, Bayesian inference in a probabilistic model means to determine the probability of some queried unobserved variables given the knowledge of some observed variables. This can be viewed as transforming the prior probabilistic model $p_0$ to a posterior model $p$, under which the observed values have probability one, and unobserved variables have probabilities given by the corresponding Bayes' posteriors.
Fig 1. Graphical representation of an exemplary probabilistic model. The arrows (edges) indicate causal relationships between the random variables (nodes). The full joint distribution $p_0$ over all random variables is sometimes also referred to as a generative model, because it contains the complete knowledge about the random variables and their dependencies and therefore allows one to generate simulated data. Such a model could for example be used by a farmer to infer the soil quality $S$ based on the crop yields $X$ through Bayesian inference, which allows one to determine a priori unknown distributions such as $p(S|X)$ from the generative model $p_0$ via marginalization and conditionalization. https://doi.org/10.1371/journal.pcbi.1008420.g001
In principle, Bayesian inference requires only two different kinds of operations, namely marginalization, i.e., summing out unobserved variables that have not been queried, such as $X'$, $S'$ and $A$ above, and conditionalization, i.e., renormalizing the joint distribution over observed and queried variables (which may itself be the result of a previous marginalization, such as $p(S, X)$ above) to obtain the required conditional distribution over the queried variables. In practice, however, inference is a hard computational problem and many more efficient inference methods are available that may provide approximate solutions to the exact Bayes' posteriors, including belief propagation [40], expectation propagation [41], variational Bayesian inference [42], and Monte Carlo algorithms [43]. Also note that inference is trivial if the sought-after conditional distribution of the queried variable is already given by one of the conditional distributions that jointly specify the probabilistic model, e.g., $p(X|S) = p_0(X|S)$.
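The two operations of marginalization and conditionalization can be illustrated with a minimal sketch in Python. All numerical values, the state labels (`"poor"`, `"rich"`, `"low"`, `"high"`, `"none"`, `"fertilize"`), and the function names are hypothetical and chosen only for illustration; the future variables $S'$ and $X'$ are omitted because their conditional distributions sum to one and therefore drop out of $p(S, X)$.

```python
# Hypothetical numbers for a two-valued version of the soil/crop model.
p0_S = {"poor": 0.6, "rich": 0.4}                     # p0(S)
p0_X_given_S = {"poor": {"low": 0.8, "high": 0.2},    # p0(X|S)
                "rich": {"low": 0.3, "high": 0.7}}
p0_A = {"none": 0.5, "fertilize": 0.5}                # p0(A)


def joint(s, x, a):
    """Joint p0(S, X, A); S' and X' are already summed out (their factors sum to 1)."""
    return p0_S[s] * p0_X_given_S[s][x] * p0_A[a]


def posterior_S_given_X(x):
    """Bayes' posterior p(S | X = x) via marginalization and conditionalization."""
    # Marginalization: sum out the unobserved, unqueried variable A.
    p_sx = {s: sum(joint(s, x, a) for a in p0_A) for s in p0_S}
    # Conditionalization: renormalize over the queried variable S.
    norm = sum(p_sx.values())
    return {s: v / norm for s, v in p_sx.items()}


post = posterior_S_given_X("high")
print(post)
```

With these illustrative numbers, observing a high yield shifts the belief towards rich soil relative to the prior $p_0(S)$.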
Probabilistic models can be used not only as external (observer) models, but also as internal models that are employed by the agent itself, or by a designer of the agent, in order to determine a desired course of action. In this latter case, actions could either be thought of as deterministic parameters of the probabilistic model that influence the future (influence diagrams) or as random variables that are part of the probabilistic model themselves (prior models) [44]. Either way, internal models allow making predictions over future consequences in order to find actions or distributions over actions that lead to desirable outcomes, for example actions that produce high rewards in the future. In mechanistic or process model interpretations, some of the specification procedures to find such actions are themselves meant to represent what the agent is actually doing while reasoning, whereas as if interpretations simply use these methods as tools to arrive at distributions that describe the agent's behavior. Free energy is one of the concepts that appears in both types of methods.

The two notions of free energy
Vaguely speaking, free energy can refer to any quantity that is of the form
$$\text{free energy} = \text{energy} - \text{const.} \times \text{entropy} \tag{3}$$
where energy is an expected value of some quantity of interest, entropy refers to a quantity measuring disorder, uncertainty, or complexity, that must be specified in the given context, and const. is a constant term that translates between units of entropy and energy, and is related to the temperature in physically motivated free energy expressions. From relation (3), it is not surprising that free energy sometimes appears enshrouded by mystery, as it relies on an understanding of entropy, and "nobody really knows what entropy is anyway", as John von Neumann famously quipped [45]. Historically, the concept of free energy goes back to the roots of thermodynamics, where it was introduced to measure the maximum amount of work that can be extracted from a thermodynamic system at a constant temperature and volume. If, for example, all the molecules in a box move to the left, we can use this kinetic energy to drive a turbine. If, however, the same kinetic energy is distributed as random molecular motion, it cannot be fully transformed into work. Therefore, only part of the total energy $E$ is usable, because the exact positions and momenta of the molecules, the so-called microstates, are unknown. In this case, the maximum usable part of the energy $E$ is the Helmholtz free energy, defined as
$$F = E - TS \tag{4}$$
where $T$ is the temperature and $S$ is the thermodynamic entropy. In general, the transformation between two macrostates with free energies $F_1$ and $F_2$ allows the extraction of work $W \le F_2 - F_1$. While the two notions of free energy that we discuss in the following are vaguely inspired by the physical original, their motivations are rather distinct and the main reason they share the nomenclature is due to their general form (3) resembling the Helmholtz free energy (4).

Free energy from constraints
The first notion of free energy is closely tied to the principle of maximum entropy [46], which appears in virtually all branches of science. From this vantage point, the physical free energy is merely a special instance of a more general inference problem where we hold probabilistic beliefs about unknown quantities (e.g., the exact energy values of the molecules in a gas) and we can only make coarse measurements or observations (e.g., the temperature of the gas) that we can use to update our beliefs about these hidden variables. The principle of maximum entropy suggests that, among the beliefs that are compatible with the observation, we should choose the most "unbiased" belief, in the sense that it corresponds to a maximum number of possible assignments of the hidden variables.
3.1.1 Wallis' motivation of the maximum entropy principle.
Consider the random experiment of distributing $N$ elements randomly in $n$ equally probable buckets with $N \gg n$, where the resulting number of elements $N_i$ in bucket $i \in \{1, \dots, n\}$ determines the probability $p(z_i) := N_i/N$. In principle, we could generate any distribution $p$ over a finite set $\Omega = \{z_1, \dots, z_n\}$ this way; however, a uniform distribution that reflects the equiprobable assignment clearly is much more likely than a Dirac distribution where all the probability mass is concentrated in one bucket. Here, the reason is that there are many possible assignments of elements among the buckets that generate the uniform distribution, whereas there is only one for a Dirac distribution. In fact, the number of possibilities of how to distribute $N$ elements among $n$ buckets with $N_i$ elements in the $i$th bucket is
$$\omega = \frac{N!}{N_1! \cdots N_n!} \tag{5}$$
because $N!$ is the number of possible permutations of all $N$ elements, which overcounts by the number of permutations of elements inside the same bucket and thus has to be divided by the number of permutations $N_i!$ for all $i = 1, \dots, n$. In the absence of any further measurement constraints, the number of possibilities (5) is maximized by $N_i = N/n$ for all $i$, and thus the typical distribution $p^*$ over $\Omega$ in this case is the uniform distribution, i.e., $p^*(z_i) = \frac{1}{n}$ for all $i$. Consider now the problem of having to determine a typical distribution $p^*$ over $\Omega$ such that the expected value $\mathbb{E}_{p^*}[\mathcal{E}] =: \langle \mathcal{E} \rangle_{p^*}$ of some quantity $\mathcal{E}$ equals a measured value $\varepsilon$. A simple example would be the experiment of throwing $N$ dice and taking $\mathcal{E}$ to be the number of dots, i.e., $\mathcal{E}(z_1) = 1, \dots, \mathcal{E}(z_6) = 6$, and trying to find the typical distribution $p^*$ over outcomes $z_1, \dots, z_6$ under the constraint that the average number of dots is, say, $\varepsilon = 2$.
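The counting argument can be checked numerically. The following sketch, with hypothetical occupation numbers for $N = 6000$ elements in $n = 6$ buckets, evaluates the logarithm of the multiplicity (5) through the log-gamma function and compares $\frac{1}{N} \ln \omega$ with the Shannon entropy of the corresponding frequencies $N_i/N$; the function names are illustrative only.

```python
from math import lgamma, log


def log_multiplicity(counts):
    """ln of N! / (N_1! ... N_n!), computed via lgamma(k + 1) = ln k!."""
    n_total = sum(counts)
    return lgamma(n_total + 1) - sum(lgamma(c + 1) for c in counts)


def entropy(probs):
    """Shannon entropy in nats; terms with zero probability contribute nothing."""
    return -sum(p * log(p) for p in probs if p > 0)


N = 6000
uniform = [N // 6] * 6                      # equal occupation of all buckets
skewed = [3500, 1500, 500, 300, 150, 50]    # hypothetical uneven occupation
dirac = [N, 0, 0, 0, 0, 0]                  # all elements in one bucket

for counts in (uniform, skewed, dirac):
    probs = [c / N for c in counts]
    print(log_multiplicity(counts) / N, entropy(probs))
```

For large $N$ the two printed columns nearly agree, which suggests why maximizing the number of possibilities amounts to maximizing entropy.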
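The solution to this problem is analogous to the case of no constraints, but this time we only consider realizations that are compatible with the measurement constraint, that is, we let $(N_1, \dots, N_n)$ belong to the set of permissible occupation vectors
$$\Gamma_\varepsilon := \Big\{ (N_1, \dots, N_n) \;\Big|\; \sum_{i=1}^n N_i = N, \ \sum_{i=1}^n \frac{N_i}{N}\, \mathcal{E}(z_i) = \varepsilon \Big\}.$$
A typical distribution $p^*$ for a constraint $\varepsilon$ can then be determined by a candidate in $\Gamma_\varepsilon$ with the maximum number $\omega$ of possibilities (5). By assumption, $N$ is much larger than $n$, so that we can get rid of the factorials by making use of Stirling's approximation $\ln N! = N \ln N - N + O(\ln N)$. In particular, when letting $N, N_i \to \infty$ such that $p(z_i) = N_i/N$ remains finite, we obtain
$$\frac{1}{N} \ln \omega \to -\sum_{i=1}^n p(z_i) \ln p(z_i) =: H(p),$$
so that finding a candidate with the maximum number of possibilities amounts to maximizing the entropy $H(p)$ subject to the measurement constraint,
$$p^* = \operatorname{argmax}_p H(p) \quad \text{subject to} \quad \langle \mathcal{E} \rangle_p = \varepsilon. \tag{6}$$
The constrained optimization problem (6) can be translated into an unconstrained problem by introducing a Lagrange multiplier $\beta$, known as the inverse temperature due to the analogy to thermodynamics and the Helmholtz free energy (4), which has to be chosen post hoc such that the constraint is satisfied. This results in the minimization of the Lagrangian
$$F(p) := \langle \mathcal{E} \rangle_p - \frac{1}{\beta} H(p) \tag{7}$$
which takes the form of a free energy (3). As we shall see later, $F$ takes its minimum at the Boltzmann distribution known from statistical mechanics, given by
$$p^*(z) = \frac{1}{Z}\, e^{-\beta \mathcal{E}(z)} \tag{8}$$
where $Z = \sum_{z \in \Omega} e^{-\beta \mathcal{E}(z)}$ denotes the normalization constant. Note that the argument in the previous section implicitly assumes a uniform reference distribution, because the buckets are assumed to be equiprobable. When replacing this assumption by the assumption of a general distribution $p_0$ over $\Omega$, we obtain the principle of minimum relative entropy [48], where the so-called Kullback-Leibler (KL) divergence $D_{KL}(p \| p_0) = \langle \log(p/p_0) \rangle_p$ is minimized with respect to $p$ subject to a constraint $\langle \mathcal{E} \rangle_p = \varepsilon$. Analogous to the maximum entropy principle, this translates to the unconstrained minimization of the Lagrangian
$$F(p) := \langle \mathcal{E} \rangle_p + \frac{1}{\beta} D_{KL}(p \| p_0) \tag{9}$$
with solution given by $p^*(z) = \frac{1}{Z}\, p_0(z)\, e^{-\beta \mathcal{E}(z)}$.
3.1.3 The trade-off between energy and uncertainty.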
An important feature of the minimization of the free energies (7) and (9) consists in the balancing of the two competing terms of energy and entropy (cf. Fig 2). This trade-off between maximal uncertainty (uniform distribution, or p 0 ) on the one hand and minimal energy (e.g., a delta distribution) on the other hand is the core of the maximum entropy principle. The inverse temperature β plays the role of a trade-off parameter that controls how these two counteracting forces are weighted.
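As an illustration of how the inverse temperature is chosen post hoc, the following sketch treats the dice example with the constraint that the average number of dots is 2. It assumes that simple bisection on $\beta$ is an acceptable way to satisfy the constraint (any one-dimensional root finder would do), and all function names are hypothetical.

```python
from math import exp, log

faces = [1, 2, 3, 4, 5, 6]  # values of the constrained quantity: dots per face


def boltzmann(beta):
    """Boltzmann distribution proportional to exp(-beta * dots)."""
    weights = [exp(-beta * e) for e in faces]
    z = sum(weights)
    return [w / z for w in weights]


def mean_energy(beta):
    return sum(p * e for p, e in zip(boltzmann(beta), faces))


def entropy(p):
    return -sum(pi * log(pi) for pi in p if pi > 0)


def free_energy(p, beta):
    """Expected energy minus entropy divided by beta, as in the free energy (7)."""
    return sum(pi * e for pi, e in zip(p, faces)) - entropy(p) / beta


def solve_beta(target, lo=-50.0, hi=50.0, iters=200):
    """Bisection; valid because the mean energy is strictly decreasing in beta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mean_energy(mid) > target:
            lo = mid   # mean still too high: a larger beta is needed
        else:
            hi = mid
    return 0.5 * (lo + hi)


beta_star = solve_beta(2.0)
p_star = boltzmann(beta_star)
print(beta_star, [round(p, 4) for p in p_star])
```

At $\beta = 0$ the sketch returns the uniform distribution with mean 3.5, and pushing the mean down to 2 requires a positive $\beta$ that shifts probability mass towards the low faces.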
The maximum entropy principle goes back to the principle of insufficient reason [49][50][51], which states that two events should be assigned the same probability if there is no reason to think otherwise. It has been hailed as a principled method to determine prior distributions and to incorporate novel information into existing probabilistic knowledge. In fact, Bayesian inference can be cast in terms of relative entropy minimization with constraints given by the available information [52]. Applications of this idea can also be found in the machine learning literature, where subtracting (or adding) an entropy term from an expected value of a function that must be optimized is known as entropy regularization and plays an important role in modern reinforcement learning algorithms [8,9] to encourage exploration [53] as well as to penalize overly deterministic policies resulting in biased reward estimates [54].
From now on, we refer to a free energy expression that is motivated from a trade-off between an energy and an entropy term, such as (7) and (9), as free energy from constraints, in order to distinguish it from the notion of free energy introduced in the following section, which, despite its resemblance, has a different motivation.

Variational free energy
There is another, distinct appearance of the term "free energy" outside of physics, which is a priori not motivated from a trade-off between an energy and entropy term, but from possible efficiency gains when representing Bayes' rule in terms of an optimization problem. This technique is mainly used in variational Bayesian inference [55], originally introduced by Hinton and van Camp [42]. As before, for simplicity all random variables are discrete, but most expressions can directly be translated to the continuous case by replacing probability distributions by probability densities and sums by the corresponding integrals.

Variational Bayesian inference.
As we have seen in Section 2, Bayesian inference consists in the calculation of a conditional probability distribution over unknown variables given the values of known variables. In the simplest case of two variables, say $X$ and $Z$, and a probabilistic model of the form $p_0(X, Z) = p_0(X|Z)\, p_0(Z)$, Bayesian inference applies if $X$ is observed and $Z$ is queried. Analogous to (2), the exact Bayes' posterior $p(Z|X = x)$ is defined by the renormalization of $p_0(x, Z)$ in order to obtain a distribution over $Z$ that respects the new information $X = x$,
$$p(Z|X = x) = \frac{p_0(x, Z)}{Z(x)} \tag{10}$$
with the normalization constant $Z(x) = \sum_z p_0(x, z) = p(X = x)$. In variational Bayesian inference, however, this Bayes' posterior is not calculated directly by renormalizing $p_0(x, Z)$ with respect to $Z$, but indirectly by approximating it by a distribution $q(Z)$ that is adjusted through the minimization of an error measure that quantifies the deviation from the exact Bayes' posterior. Importantly, the value of this error measure can be determined without having to know the exact Bayes' posterior. To see this, note that the KL divergence between $q(Z)$ and the exact Bayes' posterior (10) satisfies
$$D_{KL}\big(q(Z) \,\|\, p(Z|X = x)\big) = \Big\langle \log \frac{q(Z)}{p_0(x, Z)} \Big\rangle_{q(Z)} + \log Z(x) =: F\big(q(Z) \,\|\, p_0(x, Z)\big) + \log Z(x) \tag{11}$$
i.e., it can be decomposed into the sum of a constant term and a term that does not depend on the normalization $Z(x)$. In particular, a good approximation $q(Z)$ of the exact Bayes' posterior (10) will effectively minimize this KL divergence, which, due to (11), can be done by minimizing $F(q(Z) \| p_0(x, Z))$. The optimum of this minimization is achieved exactly at the Bayes' posterior (10), which is known as the variational characterization of Bayes' rule,
$$p(Z|X = x) = \operatorname{argmin}_{q(Z)} F\big(q(Z) \,\|\, p_0(x, Z)\big). \tag{12}$$
This result is a special case of (14) in the following section.
Fig 2. Minimizing the free energy from constraints (7) requires trading off the competing terms of energy $\langle \mathcal{E} \rangle_p$ and entropy $H(p)$, here shown exemplarily for the case of three elements. Assuming there exists a unique minimal element $z^* = \operatorname{argmin}_z \mathcal{E}(z)$, minimizing only $\langle \mathcal{E} \rangle_p$ over all probability distributions $p$ results in the (Dirac delta) distribution $\delta_{z^*}$ that assigns zero probability to all $z_i \neq z^*$ and probability one to $z_i = z^*$, and therefore has zero entropy. In contrast, minimizing only the term $-\frac{1}{\beta} H(p)$ is equivalent to maximizing $H(p)$ and therefore would result in the uniform distribution that gives equal probability to all elements. The resulting Boltzmann distribution $p^*$ interpolates between these two extreme solutions of minimal energy ($\beta \to \infty$) and maximum entropy ($\beta \to 0$).

Variational free energy, an extension of relative entropy.
Any non-negative function $\phi$ on a finite space $\Omega$ can be normalized to obtain a probability distribution $p_\phi = \phi / \sum_z \phi(z)$ on $\Omega$ that differs from $\phi$ only by a scaling constant. In cases when it is not beneficial to carry out the sum $\sum_z \phi(z)$ explicitly, such a normalization might be replaced by the minimization of the variational free energy
$$F(q \| \phi) := \Big\langle \log \frac{q}{\phi} \Big\rangle_q \tag{13}$$
with respect to the so-called trial distributions $q$, because we have
$$p_\phi = \operatorname{argmin}_q F(q \| \phi). \tag{14}$$
Thus, instead of normalizing $\phi$ directly, one fits auxiliary distributions $q$ to approximate the shape of $\phi$ in the space of probability distributions (cf. Fig 3). If this optimization process has no constraints, then the trial distributions are adjusted until $p_\phi$ is achieved. In the case of constraints, for instance if the trial distributions are parametrized by a non-exhaustive parametrization (e.g., Gaussians), then the optimized trial distributions approximate $p_\phi$ as closely as possible within this parametrization. The minimal value of $F(q \| \phi)$ is
$$\min_q F(q \| \phi) = -\log \sum_z \phi(z). \tag{15}$$
In particular, this implies that $-F(q \| \phi) \le \log \sum_z \phi(z)$ for all $q$, so that varying $-F(q \| \phi)$ with arbitrary trial distributions $q$ always provides a lower bound to the logarithm of the unknown normalization constant $\sum_z \phi(z)$. In Bayesian inference this is the normalization constant in Bayes' rule and called the model evidence, which is why the negative variational free energy is also called evidence lower bound (ELBO). The proof of (14) and (15) directly follows from Jensen's inequality and only relies on the concavity of the logarithm. As we have seen in the previous section, in variational Bayesian inference, the reference $\phi$ usually takes the form of a joint distribution evaluated at the observed variables, e.g., $\phi(Z) = p_0(x, Z)$, in which case (14) recovers (12). The variational free energy (13) is a free energy in the sense of (3) since, by the additivity of the logarithm under multiplication ($\log ab = \log a + \log b$),
$$F(q \| \phi) = \langle -\log \phi \rangle_q - H(q) \tag{16}$$
with energy term $\langle -\log \phi \rangle_q$ and entropy term $H(q)$.
Note that, for the choice $\phi = e^{-\beta \mathcal{E}}$, Eq (14) becomes the Boltzmann distribution (8) and the variational free energy (16) formally corresponds to the free energy from constraints (7). Variational free energy can be regarded as an extension of relative entropy with the reference distribution being replaced by a non-normalized reference function, since in the case when $\phi$ is already normalized, that is, if $\sum_z \phi(z) = 1$, the free energy (13) coincides with the KL divergence $D_{KL}(q \| \phi)$. In particular, while relative entropy is a measure for the dissimilarity of two probability distributions, where the minimum is achieved if both distributions are equal, variational free energy is a measure for the dissimilarity between a probability distribution $q$ and a (generally non-normalized) function $\phi$, where the minimum with respect to $q$ is achieved at $p_\phi$. Accordingly, we can think of the variational free energy as a specific error measure between probability distributions and reference functions. In principle, one could design many other error measures that have the same minimum. This means that a statement in a probabilistic setting that a distribution $q^*$ minimizes a variational free energy $F(q \| \phi)$ with respect to a given reference $\phi$ is analogous to a statement in a non-probabilistic setting that some number $x = x^*$ minimizes the value of an error measure $\epsilon(x, y)$ (e.g., the squared error $\epsilon(x, y) = (x - y)^2$) with respect to a given reference value $y$.
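The bound (15) and the relation between variational free energy and KL divergence can be verified numerically. The sketch below uses a hypothetical unnormalized function `phi` on four elements and randomly drawn trial distributions; the names `free_energy`, `kl`, and `random_distribution` are illustrative and not taken from any library.

```python
from math import log
import random

phi = [0.2, 1.5, 0.3, 4.0]   # hypothetical positive, unnormalized reference function


def free_energy(q, phi):
    """Variational free energy F(q || phi) = <log(q / phi)>_q."""
    return sum(qi * log(qi / fi) for qi, fi in zip(q, phi) if qi > 0)


def kl(q, p):
    """KL divergence between two normalized distributions."""
    return sum(qi * log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


total = sum(phi)
p_phi = [f / total for f in phi]   # direct normalization of phi

rng = random.Random(0)             # fixed seed for reproducibility


def random_distribution(n):
    w = [rng.random() + 1e-12 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]


trials = [random_distribution(len(phi)) for _ in range(1000)]
elbos = [-free_energy(q, phi) for q in trials]

print(max(elbos), log(total), -free_energy(p_phi, phi))
```

Every trial value of $-F(q \| \phi)$ stays below $\log \sum_z \phi(z)$, and the bound is attained at $q = p_\phi$, in line with (14) and (15).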

Approximate and iterative inference.
Representing Bayes' rule as an optimization problem over auxiliary distributions $q$ has two main applications that both can simplify the inference process (cf. Fig 4). First, it allows one to approximate exact Bayes' posteriors by restricting the optimization space, for example using a non-exhaustive parametrization, e.g., an exponential family. Second, it enables iterative inference algorithms consisting of multiple simpler optimization steps, for example by optimizing with respect to each term in a factorized representation of $q$ separately. A popular choice is the mean-field approximation, which combines both of these simplifications, as it assumes independence between hidden states, effectively reducing the search space from joint distributions to factorized ones, and moreover allows one to optimize with respect to each factor in alternation. Note, however, that mean-field approximations have limited use in sequential environments, where independence of subsequent states cannot be assumed and therefore less restrictive assumptions must be used instead [56].
Fig 3. The normalization of a function $\phi$ to obtain a probability distribution $p_\phi$ is equivalent to fitting trial distributions $q$ to the shape of $\phi$ by minimizing free energy. In two dimensions, the normalization of a point $\phi = (\phi_1, \phi_2)$ corresponds to a (non-orthogonal) projection onto the plane of probability vectors (A). For continuous domains, where probability distributions are represented by densities, normalization corresponds to a rescaling of $\phi$ such that the area below the graph equals 1 (B). Instead, when minimizing variational free energy (red colour), the trial distributions $q$ are varied until they fit to the shape of the unnormalized function $\phi$ (perfectly at $q = p_\phi$). https://doi.org/10.1371/journal.pcbi.1008420.g003
Many efficient iterative algorithms for exact and approximate inference can be viewed as examples of variational free energy minimization, for example the EM algorithm [6,57], belief propagation [40,58], and other message passing algorithms [41,59,60,61,62]. While the (Bayesian) EM algorithm [7] and Pearl's belief propagation [58] both can be seen as minimizing the same variational free energy, just with different assumptions on the approximate posteriors, it is shown in [61] that many other message passing algorithms such as [41,59,60] can also be cast as minimizing some type of free energy, the only difference being the choice of the divergence measure as the entropy term. Simple versions of these algorithms have often existed before their free energy formulations were available, but the variational representations usually allowed for extensions and refinements; see [6,7,63,64] in the case of EM and [58,62,65,66] in the case of message passing.
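To make the idea of iterative free energy minimization concrete, the following sketch applies alternating mean-field updates to a hypothetical positive function $\phi(z_1, z_2)$ on a $2 \times 3$ grid. It assumes the standard coordinate-wise update, in which each factor is set proportional to the exponential of the expected log-reference under the other factor; the table `phi` and all function names are illustrative.

```python
from math import exp, log

phi = [[4.0, 1.0, 0.5],
       [1.0, 2.0, 3.0]]   # hypothetical phi(z1, z2) > 0, not a product of two factors


def normalize(w):
    s = sum(w)
    return [x / s for x in w]


def free_energy(q1, q2):
    """F(q || phi) for the factorized trial distribution q(z1, z2) = q1(z1) q2(z2)."""
    return sum(q1[i] * q2[j] * log(q1[i] * q2[j] / phi[i][j])
               for i in range(len(q1)) for j in range(len(q2)))


def update_q1(q2):
    # Minimizer over q1 for fixed q2: q1(i) proportional to exp(<log phi(i, .)>_q2).
    return normalize([exp(sum(q2[j] * log(phi[i][j]) for j in range(len(q2))))
                      for i in range(len(phi))])


def update_q2(q1):
    # Minimizer over q2 for fixed q1: q2(j) proportional to exp(<log phi(., j)>_q1).
    return normalize([exp(sum(q1[i] * log(phi[i][j]) for i in range(len(q1))))
                      for j in range(len(phi[0]))])


q1, q2 = [0.5, 0.5], [1 / 3, 1 / 3, 1 / 3]
history = [free_energy(q1, q2)]
for _ in range(50):
    q1 = update_q1(q2)
    history.append(free_energy(q1, q2))
    q2 = update_q2(q1)
    history.append(free_energy(q1, q2))

exact_min = -log(sum(sum(row) for row in phi))   # unconstrained minimum, cf. (15)
print(history[0], history[-1], exact_min)
```

Each partial update can only lower the free energy, while the factorization constraint leaves a gap to the unconstrained minimum, which is the sense in which mean-field inference is approximate.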
We now turn to the question of how the two notions of free energy introduced in this section are related to recent theories of intelligent agency.

The basic idea
The concept of free energy from constraints as a trade-off between energy and uncertainty can be used in models of perception-action systems, where entropy quantifies information processing complexity required for decision-making (e.g., planning a path for fleeing a predator) and energy corresponds to performance (e.g., distinguishing better and worse flight directions). The notion of decision in this context is very broad and can be applied to any internal variable in the perception-action pipeline [67] that is not given directly by the environment. In particular, it also subsumes perception itself, where the decision variables are given by the hidden causes that are being inferred from observations.
Fig 4. In practice, this variational representation is often exploited to simplify a given inference problem, either by reducing the search space of distributions, for example through a restrictive parametrization resulting in approximate inference, or by splitting up the optimization into multiple partial optimization steps that are potentially easier to solve than the original problem but might still converge to the exact solution. These two simplifications can also be combined, for example in the case of mean-field assumptions where the space of distributions is reduced and an efficient iterative inference algorithm is obtained at the same time. https://doi.org/10.1371/journal.pcbi.1008420.g004
In rational choice theory [68], a decision-maker selects decisions $x^*$ from a set of options $\Omega$ such that a utility function $U$ defined on $\Omega$ is maximized,
$$x^* = \operatorname{argmax}_{x \in \Omega} U(x). \tag{17}$$
The utility values $U(x)$ could either be objective, for example a monetary gain, or subjective, in which case they represent the decision-maker's preferences. In general, the utility does not have to be defined directly on $\Omega$, but could be derived from utility values that are attached to certain states, for example to the configurations of the playboard in a board game. In the case of perception, utility values are usually given by (log-)likelihood functions, in which case utility maximization without constraints corresponds to greedy inference such as maximum likelihood estimation. Note that, for simplicity, in this section we consider one-step decision problems. Sequential tasks can either be seen as multiple one-step problems where the utility of a given step might depend on the policy over future steps, or as path planning problems where an action represents a full action path or policy [18,69,70,71].
While ideal rational decision-makers are assumed to perfectly optimize a given utility function $U$, real behavior is often stochastic, meaning that multiple exposures to the same problem lead to different decisions. Such non-deterministic behavior could be a consequence of model uncertainty, as in Bayesian inference or various stochastic gambling schemes, or a consequence of satisficing [72], where decision-makers do not choose the single best option, but simply one option that is good enough. Abstractly, this means that the choice of a single decision is replaced by the choice of a distribution over decisions. More generally, also considering prior information that the decision-maker might have from previous experience, the process of deliberation during decision-making might be expressed as the transformation of a prior $p_0$ to a posterior distribution $p$.
When assuming that deliberation has a cost C(p, p 0 ), then arriving at narrow posterior distributions should intuitively be more costly than choosing distributions that contain more uncertainty (cf. Fig 5A). In other words, deliberation costs must be increasing with the amount of uncertainty that is reduced by the transformation from p 0 to p. Uncertainty reduction can be understood as making the probabilities of options less equal to each other, rigorously expressed by the mathematical concept of majorization [73]. This notion of uncertainty can also be generalized to include prior information, so that the degree of uncertainty reduction corresponds to more or less deviations from the prior [74].
Maximizing expected utility $\langle U \rangle_p$ with respect to $p$ under restrictions on processing costs $C(p, p_0)$ is a constrained optimization problem that can be interpreted as a particular model of bounded rationality [72], explaining non-rational behavior of decision-makers that may be unable to select the single best option due to their limited information processing capability. Similarly to the free energy trade-off between energy and entropy (cf. Fig 2), this results in a trade-off between utility $\langle U \rangle_p$ and processing costs $C(p, p_0)$,
$$F_\beta(p) := \langle U \rangle_p - \frac{1}{\beta}\, C(p, p_0). \tag{18}$$
Here, the trade-off parameter $\beta$ is analogous to the inverse temperature in statistical mechanics (cf. Eq (7)) and parametrizes the optimal trade-offs $p^*_\beta = \operatorname{argmax}_p F_\beta(p)$ between utility and cost, which define an efficiency frontier separating the space of perception-action systems into bounded-optimal, non-optimal, and non-admissible systems (cf. Fig 5).
When assuming that the total transformation cost is the same independent of whether a decision problem is solved in one step or multiple sub-steps (additivity under coarse-graining), the trade-off in (18) takes the general form (3) of a free energy in the sense of energy (utility) minus entropy (cost), because then the cost function is uniquely given by the relative entropy
$$C(p, p_0) = D_{KL}(p \| p_0). \tag{19}$$
Note that the additivity of (19) also implies a coarse-graining property of the free energy (18) in the case when the decision is split into multiple steps, such that the utility of preceding decisions is effectively given by the free energy of following decisions. Therefore, in this case, free energy can be seen as a certainty-equivalent value of the subordinate decision problems, i.e., the amount of utility the agent would have to receive to be indifferent between this guaranteed utility and the potential expected utility of the subsequent decision steps, taking into account the associated information processing costs. The special case (19) has been studied extensively in multiple contexts, including quantal response equilibria in the game-theoretic literature [10,14], rational inattention and costly contemplation [11,75], bounded rationality with KL costs [12,19], KL control [76,77], entropy regularization [8,9], robustness [15,16], the emergence of heuristics [78], thermodynamic models of computation [79], and the analysis of information flow in perception-action systems [17,18]. While (19) is often regarded as an abstract measure of uncertainty reduction or a generic proxy for information processing costs, it can also be viewed as a physical capacity constraint, where the information that is required to achieve a certain expected utility is considered to be sent over a channel to the actuator [24,80,81,82,83].
This view is also consistent with the maximum entropy principle, as (18) and (19) favor distributions p that can be generated from p 0 most easily in terms of statistics, and therefore with minimum communication complexity between p 0 and p [84].
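The efficiency frontier arising from the trade-off (18) with the KL cost (19) can be traced numerically. The sketch below uses hypothetical utilities over four options and a uniform prior, and relies on the maximizer of (18) under a KL cost having the Boltzmann form $p^*_\beta \propto p_0\, e^{\beta U}$, analogous to the solution of (9); all names and numbers are illustrative.

```python
from math import exp, log

U = [1.0, 0.8, 0.2, 0.0]           # hypothetical utilities of four options
p0 = [0.25, 0.25, 0.25, 0.25]      # hypothetical prior over options


def bounded_optimal(beta):
    """Posterior proportional to p0 * exp(beta * U)."""
    w = [p * exp(beta * u) for p, u in zip(p0, U)]
    z = sum(w)
    return [x / z for x in w]


def expected_utility(p):
    return sum(pi * u for pi, u in zip(p, U))


def kl(p, q):
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def free_energy(p, beta):
    """Expected utility minus KL cost divided by beta, cf. (18) and (19)."""
    return expected_utility(p) - kl(p, p0) / beta


betas = [0.0, 0.5, 1.0, 2.0, 5.0, 20.0]
frontier = [(kl(bounded_optimal(b), p0), expected_utility(bounded_optimal(b)))
            for b in betas]
for b, (cost, utility) in zip(betas, frontier):
    print(b, round(cost, 4), round(utility, 4))
```

Each printed pair is one point on the frontier: larger $\beta$ buys more expected utility at the price of a larger deviation from the prior, and at the optimum the free energy equals the certainty-equivalent value $\frac{1}{\beta} \log \sum_x p_0(x)\, e^{\beta U(x)}$.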

Fig 5. Exemplary efficiency curve resulting from the trade-off between utility and costs, which separates non-optimal from non-admissible behavior. The points on the curve correspond to bounded-optimal agents that optimally trade off utility against uncertainty, analogous to the rate-distortion curve in information theory.
https://doi.org/10.1371/journal.pcbi.1008420.g005

A simple example
Ingredients. Consider the probabilistic model shown in Fig 1 with the joint distribution $p_0(X, S, A)$ that is specified by the factors in the decomposition (1). Here, $S$ and $X$ denote the current environmental state and the corresponding observation, and $A$ denotes the action that must be determined in order to drive the system into a new state $S'$ with observation $X'$. The decision-making problem is specified by assuming that we are given a utility function $U$ over future observations $X'$, which the decision-maker seeks to maximize by selecting an action $A$ while only having access to the current observation $X$. This means that the decision-maker has control over the distribution $p(A|X)$, which replaces the prior $p_0(A)$ in the factorization (1), giving the posterior model

$$p(X, S, A, S', X') = p_0(S)\, p_0(X|S)\, p(A|X)\, p_0(S'|S, A)\, p_0(X'|S'). \tag{20}$$

Free energy from constraints. Further assuming that the decision-maker is subject to an information processing constraint $D_{\mathrm{KL}}(p \,\|\, p_0) \le C_0$, for some non-negative bound $C_0$, results in the unconstrained optimization problem $\max_p F(p)$ with free energy given by (18), where the trade-off parameter $\beta$ is tuned to comply with the bound $C_0$. Since the action distribution $p(A|X)$ is the only distribution in the posterior model (20) that changes during decision-making, i.e., during the transformation from prior to posterior, the total free energy simplifies to

$$F_A[p(A|X)] = \Big\langle \langle V(X, A)\rangle_{p(A|X)} - \tfrac{1}{\beta}\, D_{\mathrm{KL}}\big(p(A|X) \,\|\, p_0(A)\big) \Big\rangle_{p(X)},$$

where we have written $p_0(x|s)\,p_0(s) = p(s|x)\,p(x)$ using Bayes' rule (2), and

$$V(X, A) := \sum_{s, s', x'} p(s|X)\, p_0(s'|s, A)\, p_0(x'|s')\, U(x').$$

Note that here the expectation with respect to $p(X)$ does not affect the optimization with respect to $p(A|X)$, since it can be performed pointwise for each particular realization $x$ of $X$. In fact, we would have obtained the same result when conditioning on an arbitrary value $X = x$ from the outset. However, in general, optimal information processing strategies may depend on the entire distribution $p(X)$ and therefore cannot be obtained from considering single observations $x$ only, for example when also optimizing with respect to the prior $p_0(A)$, see e.g., [85].

Free energy maximization. The optimal action distribution $p^*(A|X)$ maximizing $F_A$ is a Boltzmann distribution (8) with "energy" $V(X, A)$ and prior $p_0(A)$,

$$p^*(A|X) = \frac{1}{Z(X)}\, p_0(A)\, e^{\beta V(X, A)},$$

where $Z(X) := \sum_a p_0(a)\, e^{\beta V(X, a)}$. Note that in order to evaluate the utility $V$, it is required to determine the Bayes' posterior $p(S|X)$. This shows how, in a utility-based approach, the need to perform Bayesian inference results directly from the assumption about which variables are observed and which are not.
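The three steps of this example (Bayes' posterior over $S$, expected utility $V$, Boltzmann distribution over $A$) can be sketched for a small discrete model. The two-state world, its transition and observation tables, the utility values, and the function names below are hypothetical and serve only to illustrate the formulas; they are not taken from the source or its notebooks.

```python
import numpy as np

def bayes_posterior(p0_s, p0_x_given_s, x):
    """p(s|x) proportional to p0(x|s) * p0(s)."""
    unnorm = p0_x_given_s[:, x] * p0_s
    return unnorm / unnorm.sum()

def expected_utility(post_s, p0_next_s, p0_next_x, U):
    """V(x, a) = sum_{s,s',x'} p(s|x) p0(s'|s,a) p0(x'|s') U(x').
    Index letters: s = state, a = action, t = next state s', y = next observation x'."""
    return np.einsum("s,sat,ty,y->a", post_s, p0_next_s, p0_next_x, U)

def boltzmann_policy(prior_a, V, beta):
    """p*(a|x) proportional to p0(a) * exp(beta * V(x, a))."""
    weights = prior_a * np.exp(beta * (V - V.max()))  # shift for numerical stability
    return weights / weights.sum()

# Hypothetical model: two states, two observations, two actions.
p0_s = np.array([0.5, 0.5])
p0_x_given_s = np.array([[0.9, 0.1],
                         [0.2, 0.8]])            # rows: s, columns: x
p0_next_s = np.array([[[1.0, 0.0], [0.0, 1.0]],   # s = 0: action 0 stays, action 1 flips
                      [[0.0, 1.0], [1.0, 0.0]]])  # s = 1: action 0 stays, action 1 flips
U = np.array([0.0, 1.0])                          # observation x' = 1 is preferred
p0_a = np.array([0.5, 0.5])

x = 0                                             # current observation
post_s = bayes_posterior(p0_s, p0_x_given_s, x)
V = expected_utility(post_s, p0_next_s, p0_x_given_s, U)
policy = boltzmann_policy(p0_a, V, beta=3.0)
print(post_s, V, policy)
```

For $\beta \to 0$ the policy stays at the prior $p_0(A)$, and for large $\beta$ it concentrates on the action with the highest expected utility, here the action that flips the state.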

Critical points
The main idea of free energy in the context of information processing with limited resources is that any computation can be thought of abstractly as a transformation from a distribution $p_0$ of prior knowledge to a posterior distribution $p$ that encapsulates an advanced state of knowledge resulting from deliberation. The progress that is made through such a transformation is quantitatively captured by two measures: the expected utility $\langle U\rangle_p$ that quantifies the quality of $p$, and $C(p, p_0)$ that measures the cost of uncertainty reduction from $p_0$ to $p$. Clearly, the critical point of this framework is the choice of the cost function $C$. In particular, we could ask whether there is some kind of universal cost function that is applicable to any perception-action process or whether there are only problem-specific instantiations. Of course, having a universal measure that allows applying the same concepts to extremely diverse systems is both a boon and a bane, because the practical insights it may provide for any concrete instance could be very limited. This is the root of a number of critical issues:
i. What is the cost C? An important restriction of all deliberation costs of the form $C(p, p_0)$ is that they only depend on the initial and final distributions and ignore the process of how to get from $p_0$ to $p$. When varying a single resource (e.g., processing time) we can use $C(p, p_0)$ as a process-independent proxy for the resource. However, if there are multiple resources involved (e.g., processing time, memory, and power consumption), a single cost cannot tell us how these resources are weighted optimally without making further process-dependent assumptions. In general, the theory makes no suggestions whatsoever about mechanical processes that could implement resource-optimal strategies; it only serves as a baseline for comparison. Finally, simply requiring the measure to be monotonic in the uncertainty reduction does not uniquely determine the form of $C$, as there have been multiple proposals of uncertainty measures in the literature (see e.g., [86]), where relative entropy is just one possibility. However, relative entropy is distinguished from all other uncertainty measures by its additivity property, which for example makes it possible to express optimal probabilistic updates from $p_0$ to $p$ in terms of additions or subtractions of utilities, such as log-likelihoods for evidence accumulation in Bayesian inference.
ii. What is the utility? When systems are engineered, utilities are usually assumed to be given such that desired behavior is specified by utility maximization. However, when we observe perception-action systems, it is often not so clear what the utility should be, or in fact, whether there even exists a utility that captures the observed behavior in terms of utility maximization. This question of the identifiability of a utility function is studied extensively in the economic sciences, where the basic idea is that systems reveal their preferences through their actual choices and that these preferences have to satisfy certain consistency axioms in order to guarantee the existence of a utility function. In practice, to guarantee unique identifiability these axioms are usually rather strong, for example ignoring the effects of history and context when choosing between different items, or ignoring the possibility that there might be multiple objectives. When not making these strong assumptions, utility becomes a rather generic concept, like the concept of probability, and additional assumptions like soft-maximization are necessary to translate from utilities to choice probabilities.
iii. The problem of infinite regress. One of the main conceptual issues with the interpretation of $C$ as a deliberation cost is that the original utility optimization problem is simply replaced by another optimization problem that may be even more difficult to solve. This novel optimization problem might again require resources to be solved and could therefore be described by a higher-level deliberation cost, thus leading to an infinite regress. In fact, any decision-making model that assumes that decision-makers reason about processing resources is affected by this problem [87,88]. A possible way out is to consider the utility-information trade-off simply as an as-if description, since perception-action systems that are subject to a utility-information trade-off do not necessarily have to reason or know about their deliberation costs. It is straightforward, for example, to design processes that probabilistically optimize a given utility with no explicit notion of free energy, but for which the resulting choice distribution looks like an optimal free energy trade-off to an outside observer [89].
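As an illustration of such an as-if description, consider a process that repeatedly proposes an option from the prior and accepts it with a probability that grows with its utility. This rejection-sampling scheme is an assumption chosen here for illustration, not necessarily the construction used in [89]; the prior, utilities, and $\beta$ are hypothetical. The process never evaluates a free energy, yet its choice distribution is the Boltzmann distribution $p_0(a)\,e^{\beta U(a)}/Z$.

```python
import numpy as np

def sample_choice(prior, utility, beta, rng):
    """Propose from the prior, accept with probability exp(beta * (U - U_max)).
    No free energy is computed anywhere in this loop."""
    u_max = utility.max()
    while True:
        a = rng.choice(len(prior), p=prior)
        if rng.random() < np.exp(beta * (utility[a] - u_max)):
            return a

# Hypothetical three-option problem.
prior = np.array([0.5, 0.3, 0.2])
utility = np.array([0.0, 1.0, 2.0])
beta = 1.0

rng = np.random.default_rng(0)
samples = np.array([sample_choice(prior, utility, beta, rng) for _ in range(5000)])
empirical = np.bincount(samples, minlength=len(prior)) / len(samples)

# What an outside observer would predict from the free energy trade-off.
boltzmann = prior * np.exp(beta * utility)
boltzmann /= boltzmann.sum()

print(empirical, boltzmann)
```

In this sketch the number of rejected proposals plays the role of the resource: a larger $\beta$ lowers the acceptance rate and therefore requires more samples per decision.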
In summary, the free energy trade-off between utility and information primarily serves as a normative model for optimal probability assignments in information-processing nodes or networks. Like other Bayesian approaches, it can also serve as a guide for constructing and interpreting systems, although it is in general not a mechanistic model of behavior. In that respect it shares the fate of its cousins in thermodynamics and coding theory [90] in that they provide theoretical bounds on optimality but devise no mechanism for processes to achieve these bounds.

The basic idea
Variational free energy is the main ingredient used in the Free Energy Principle for biological systems in the neuroscience literature [26,33,35,91], which has been considered as "arguably the most ambitious theory of the brain available today" [92]. Since variational free energy in itself is just a mathematical construct to measure the dissimilarity between distributions and functions (see Section 3), the biological content of the Free Energy Principle must come from somewhere else. The basic biological phenomenon that the Free Energy Principle purports to explain is homeostasis, the ability to actively maintain certain relevant variables (e.g., blood sugar) within a preferred range. Usually, homeostasis is applied as an explanatory principle in physiology whereby the actual value of a variable is compared to a target value and corrections to deviation errors are made through a feedback loop. However, homeostasis has also been proposed as an explanatory principle for complex behavior in the cybernetic literature [93-96]; for example, maintaining blood sugar may entail complex feedback loops of learning to hunt, to trade, and to buy food. Crucially, being able to exploit the environment in order to attain favorable sensory states requires implicit or explicit knowledge of the environment that could either be pre-programmed (e.g., insect locomotion) or learnt (e.g., playing the piano).
The Free Energy Principle was originally suggested as a theory of cortical responses [33] by promoting the free energy formulation of predictive coding that was introduced by Dayan and Hinton with the Helmholtz machine [5]. It found its most recent incarnation in what is known as Active Inference, which attempts to extend variational Bayesian inference to the problem of action selection. Here, the target value of homeostasis is expressed through a probability distribution $p_{des}$ under which desired sensory states have a high probability. The required knowledge about the environment is expressed through a generative model $p_0$ that relates observations, hidden causes, and actions. Because the generative model makes it possible to predict future states and observations, actions can be chosen in such a way that the predicted consequences conform to the desired distribution. In Active Inference, this is achieved by merging the generative and the desired distributions, $p_0$ and $p_{des}$, into a single reference function $\phi$ to which trial distributions $q$ over the unknown variables are fitted by minimizing the variational free energy $F(q\,\|\,\phi)$. This free energy minimization is analogous to variational Bayesian inference, where the reference is always given by a joint distribution evaluated at observed quantities (cf. Section 3.2.1). In the resulting homeostatic process, the trial distributions $q$ play the role of internal variables that are manipulated in order to achieve desired sensory consequences that are not directly controllable. Minimizing variational free energy by the alternating variation of trial distributions over actions, $q_{\mathrm{Actions}}$, and trial distributions over hidden states, $q_{\mathrm{States}}$, is then equated with processes of action and perception.
In a nutshell, the central tenet of the Free Energy Principle states that organisms maintain homeostasis through minimization of variational free energy between a trial distribution q and a reference function ϕ by acting and perceiving. Sometimes the even stronger statement is made that minimizing variational free energy is mandatory for homeostatic systems [97,98].

A simple example
Ingredients. Applying the Active Inference recipe (cf. Fig 7) to our running example from Fig 1 with current and future states $S$, $S'$, current and future observations $X$, $X'$, and action $A$, we need a generative model $p_0$, a desired distribution $p_{des}$, and trial distributions $q$. The generative model $p_0(X, S, A)$ is specified by the factors in the decomposition (1), the desired distribution $p_{des}(X')$ is a given fixed probability distribution over future sensory states $X'$, and the trial distributions $q$ are probability distributions over all unknown variables, $S$, $S'$, $X'$, and $A$.
In most treatments of Active Inference in the literature, the trial distributions $q$ are simplified, either by a full mean-field approximation over states and actions [34,35], by a partial mean-field approximation where the dependency on actions is kept but the states are treated independently of each other [99,100], or more recently [101,102] by the so-called Bethe approximation [58,65], where subsequent states are allowed to interact. In the partial mean-field assumption of [99], the trial distribution over $X'$ is fixed and given by $p_0(X'|S')$, while for $A$, $S$, and $S'$ the trial distributions are variable but restricted to be of the mean-field form for $S$ and $S'$,

$$q(S, S', A) = q(S)\, q(S'|A)\, q(A), \tag{23}$$

i.e., the hidden states $S$ and $S'$ are assumed to be independent given $A$. While mean-field approximations can be good enough for simple perceptual inference, where a single hidden cause might be responsible for a set of observations, they can be too strong a simplification for sequential decision-making problems where the next state $S'$ depends on the previous state $S$. In fact, as can be seen for example in S2 Notebook, mean-field assumptions may fail to show goal-directed behavior even for very simple tasks such as navigation in a grid world. A less restrictive assumption would be a Bethe approximation, a special case of Kikuchi's cluster variation method [103], which allows $S$ and $S'$ as well as $S'$ and $X'$ to be stochastically dependent (cf. Section C in S1 Appendix, where we derive the update equations under the Bethe assumption for the simple example of this section). In general, the Bethe approximation achieves exact marginals in tree-like models, such as the models that are considered in the Active Inference literature, because it results in update equations that are equivalent to Pearl's belief propagation algorithm [40,58].

Reference function. The reference $\phi$ is constructed by combining the two distributions $p_{des}$ and $p_0$. To do so, there have been several proposals in the Active Inference literature, which fall into one of two categories: either a specific value function $Q$ is defined (containing $p_{des}$), which is multiplied to the generative model using a soft-max function [35,99,100], giving the reference (24), or the desired distribution is multiplied directly to the generative model [101], giving the reference (25). While the reference function in (25) is already completely specified, we still need to know how to determine the value function $Q$ in the case of (24). For the partial mean-field assumption (23) it is defined in the literature [99,100] as

$$Q(a) := \big\langle U(X', S')\big\rangle_{q(X', S'|A=a)} + H\big(q(X'|A=a)\big), \tag{26}$$

where $U(x', s') := \log p_{des}(x') + \log p_0(x'|s')$ favors both desirable and plausible future observations $x'$. While here desirability and plausibility are built into the value function $Q$ idiosyncratically, in utility-based approaches (cf. Section 4.2) only desirability has to be put into the design of the utility function, because there the likelihood $p_0(X'|S')$ of future observations is automatically taken into account by the expected utility $V$ that is (soft-)maximized by (21). Moreover, since $Q$ can be rewritten as

$$Q(a) = -D_{\mathrm{KL}}\big(q(X'|A=a) \,\|\, p_{des}(X')\big) - \big\langle H\big(p_0(X'|S')\big)\big\rangle_{q(S'|A=a)},$$

the extra entropy term in (26) has the effect of favoring actions whose consequences more or less match the desired distribution, while also explicitly punishing actions that lead to a high variability of observations (by requiring a low average entropy of $p_0(X'|S')$), rather than trying to produce the single most desired outcome (see the discussion at the end of Section 5.3). Note also that the value function $Q$ depends (non-linearly) on the trial distribution $q(S'|A)$, because $q(X'|A) = \sum_{s'} p_0(X'|s')\, q(s'|A)$ is itself a function of $q(S'|A)$, which is problematic during free energy minimization (see (ii) in Section 5.3).

Free energy minimization. Once the form of the trial distributions $q$ (e.g., a partial mean-field assumption (23) or a Bethe approximation, see S1 Appendix) and the reference $\phi$ are defined, the variational free energy is simply determined by $F(q\,\|\,\phi)$. In the case of a mean-field assumption, the resulting free energy minimization problem is solved approximately by performing an alternating optimization scheme, in which the variational free energy is minimized separately with respect to each of the variable factors in a factorization of $q$, for example by alternating between $\min_{q(S)} F$, $\min_{q(S'|A)} F$, and $\min_{q(A)} F$ in the case of the partial mean-field assumption (23), where in each step the factors that are not optimized are kept fixed (cf. Fig 7). In S1 Appendix we derive the update equations for the cases (24) and (25) under mean-field and Bethe approximations for the one-step example discussed in this section. Mean-field solutions for the general case of arbitrarily many timesteps together with their exact solutions can be found in S1 Notebook, where we also highlight the theoretical differences between various proposed formulations of Active Inference. The effect of some of these differences can be seen in the grid world simulations in S2 Notebook.
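The two expressions for the value function $Q$ given above, the definition (26) and its rewriting in terms of a KL divergence and an average entropy, can be compared numerically. The sketch below uses hypothetical tables for $p_0(X'|S')$, $q(S'|A)$, and $p_{des}(X')$ with strictly positive entries; it is an illustration of the identity only, not the update scheme derived in S1 Appendix.

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy along the given axis (all entries assumed positive)."""
    return -np.sum(p * np.log(p), axis=axis)

# Hypothetical tables.
p0_x1_given_s1 = np.array([[0.9, 0.1],
                           [0.5, 0.5]])   # rows: s', columns: x'
q_s1_given_a = np.array([[0.8, 0.2],
                         [0.1, 0.9]])     # rows: a, columns: s'
p_des = np.array([0.3, 0.7])              # desired distribution over x'

# Predicted observations: q(x'|a) = sum_{s'} p0(x'|s') q(s'|a)
q_x1_given_a = q_s1_given_a @ p0_x1_given_s1

# Definition (26): Q(a) = <U(X', S')>_{q(X', S'|a)} + H(q(X'|a)),
# with U(x', s') = log p_des(x') + log p0(x'|s').
U = np.log(p_des)[None, :] + np.log(p0_x1_given_s1)            # U[s', x']
expected_U = np.einsum("at,ty,ty->a", q_s1_given_a, p0_x1_given_s1, U)
Q_definition = expected_U + entropy(q_x1_given_a)

# Rewritten form: Q(a) = -KL(q(X'|a) || p_des) - <H(p0(X'|S'))>_{q(S'|a)}.
kl = np.sum(q_x1_given_a * np.log(q_x1_given_a / p_des), axis=1)
average_entropy = q_s1_given_a @ entropy(p0_x1_given_s1)
Q_rewritten = -kl - average_entropy

print(Q_definition, Q_rewritten)
```

Both arrays coincide, which makes the role of the two terms explicit: the first rewards matching $p_{des}$, the second penalizes states $S'$ whose observations are highly variable.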

Critical points
The main idea behind Active Inference is to express the problem of action selection in a similar manner to the perceptual problem of Bayesian inference over hidden causes. In Bayesian inference, agents are equipped with likelihood models $p_0(X|Z)$ that determine the desirability of different hypotheses $Z$ under known data $X$. In Active Inference, agents are equipped with a given desired distribution $p_{des}(X')$ over future outcomes that ultimately determines the desirability of actions $A$. An important difference that arises is that perceptual inference has to condition on past observations $X = x$, whereas naive inference over actions would have to condition on desired future outcomes $X' = x'$.
For a single desired future observation $x'$, Bayesian inference could be applied in a straightforward way by simply conditioning the generative model $p_0$ on $X' = x'$. Similarly, one could condition on a desired distribution $p_{des}(X')$ using Jeffrey's conditioning rule [104], resulting in $p(A|p_{des}) = \sum_{x'} p(A|x')\, p_{des}(x')$, which could be implemented by first sampling a goal $x' \sim p_{des}(X')$ and then inferring $p(A|x')$ given the single desired observation $x'$. However, one of the problems with such a naive approach is that the choice of a goal is solely determined by its desirability, whereas its realizability for the decision-maker is not taken into account. This is because by conditioning on $p_{des}$, the decision-maker effectively seeks to choose actions in order to reproduce or match the desired distribution.
To overcome this problem, Control as Inference or Planning as Inference approaches in the machine learning literature [77,105-108] do not directly condition on desired future observations but on future success, by introducing an auxiliary binary random variable $R$ such that $R = 1$ encodes the occurrence of desired outcomes. The auxiliary variable $R$ comes with a probability distribution $p_0(R|X', \ldots)$ that determines how well the outcomes satisfy desirability criteria of the decision-maker, usually defined in terms of the reward or utility attached to certain outcomes (see the discussion in (iii) below). The extra variable gives the necessary flexibility to infer successful actions by simply conditioning on $R = 1$. The advantage of such an approach over direct Jeffrey conditionalization given a desired distribution over future observations can be seen in the grid world simulations in S2 Notebook, especially the ability to choose an outcome that is not only desirable but also achievable (see also Fig 8).

Fig 8. Consequences of assuming a desired distribution $p_{des}$ for action planning under purely inference-based methods, expected utility, and Active Inference, in the case of a simple example with two actions, one with a deterministic outcome and one with random outcomes. As can be seen from the displayed equations, conditioning on $p_{des}$ (Jeffrey conditionalization) and conditioning on success (Control as Inference/direct Active Inference) only differ in the order of normalizing and taking the expectation over $X'$. While conditioning on $p_{des}$ requires first sampling a target outcome from $p_{des}$ before an action from $p(A|x')$ can be planned, conditioning on success directly weighs the desirability of an outcome $p_{des}(x')$ by its realizability $p(x'|A)$. From this point of view, the expected utility approach is very similar to Control as Inference (which can also be seen in the grid world environment S2 Notebook), since it also weighs the utility of an outcome with its realizability before soft-maximizing. It only differs in how it treats the desired distribution as an exponentiated utility, moving the utility values closer together so that option $A = 1$ is slightly preferred. The early version [34] of Active Inference is similar to Jeffrey conditioning, because decision-makers are also assumed to match the desired distribution, by defining the value function $Q$ as a KL divergence between the predicted and desired distributions. In later versions of Q-value Active Inference [35,99,100], the value function $Q$ is modified by an additional entropy term that explicitly punishes observations with high variability. Consequently, even when the effect of the action on future observations is kept the same, i.e., the predictive distribution $p(X'|A) = \sum_{s'} p_0(X'|s')\, p_0(s'|A)$ remains as depicted in the left-hand column, the preference over actions now changes completely depending on $p_0(X'|S')$, whereas in the other approaches only the predictive distribution $p(X'|A)$ and $p_{des}(X')$ influence planning. While there might be circumstances where this extra punishment of high outcome variability could be beneficial, it is questionable from a normative point of view why anything other than the predicted outcome probability $p(X'|A)$ should be considered for planning. See S2 Appendix for details about the choices made in the example.
https://doi.org/10.1371/journal.pcbi.1008420.g008

Active Inference tries to overcome the same problem of reconciling realizability and desirability, but without explicitly introducing extra random variables and without explicitly conditioning on the future. Instead, the desired distribution is combined with the generative model to form a new reference function $\phi$ such that the posteriors $q^*$ resulting from the minimization of the free energy $F(q\,\|\,\phi)$ contain a baked-in tendency to reach the desired future encoded by $\phi$. This approach is the root of a number of critical issues with current formulations of Active Inference:
i. How to incorporate the desired distribution into the reference?
Instead of using Bayesian conditioning directly in order to condition the generative model $p_0$ on the desired future, in Active Inference it is required that the reference $\phi$ contains the desired distribution in such a way that actions sampled from the resulting posterior model are more likely if they lead to the desired future. As can be seen already for the one-step case in (24) and (25), the method of how to incorporate the desired distribution into the reference function is not unique and does not follow from first principles. There have been essentially two different proposals in the literature on Active Inference of how to combine the two distributions $p_{des}$ and $p_0$ into $\phi$ (cf. Fig 7): either a hand-crafted value function $Q$ is designed that specifically modifies the action probability of the generative model, or the probability over futures $X'$ under the generative model $p_0$ is modified by directly multiplying $p_{des}$ to the likelihood $p_0(X'|S')$. We discuss both of these proposals in (ii) and (iii) below.
ii. Proposal 1: Q-value Active Inference [34,35,99,100]. In the most popular formulation of Active Inference, the probability over actions in the reference $\phi$ is defined by $\frac{1}{Z}\, p_0(A)\, e^{Q(A)}$, where the value function $Q$ (also called the "expected free energy") depends non-linearly on the trial distributions $q$, as can be seen exemplarily in (26) for the one-step case under the partial mean-field assumption of [99,100], where $q(S'|A)$ enters $Q$ through $q(X'|A) = \sum_{s'} p_0(X'|s')\, q(s'|A)$. Note that, because of this non-linearity, the alternating free energy minimization would have no closed-form solutions (cf. S1 Appendix). This means that both the trial distributions $q$ and the reference $\phi = \phi(q)$ will change when $q$ is varied during the minimization of the total variational free energy $F(q\,\|\,\phi(q))$, as would be required when stipulating a single free energy functional for optimization. This highlights an important conceptual difference to variational Bayesian inference, where one assumes a fixed reference $\phi$, resulting from the evaluation of a fixed probabilistic model $p_0$ at known variables (see Section 3.2.1), to which distributions $q$ are fitted by minimizing $F(q\,\|\,\phi)$. In contrast, when changing the reference $\phi(q)$ during the optimization process, it is no longer clear what is actually achieved by this minimization. As demonstrated by S2 Notebook, this issue has immediate practical implications, as respecting or ignoring the extra $q$-dependency can result in very different behavior even in simple grid world simulations. In the Active Inference literature, however, the extra $q$-dependency of $Q$ is largely ignored. Instead of optimizing the full free energy $F(q\,\|\,\phi(q))$ with respect to state and action distributions, one alternatingly optimizes the free energy over states $F_A$ for each action $A$ and then the full free energy with respect to action distributions only, so that action and perception effectively optimize two different free energies. It is crucial to note, however, that unlike in variational Bayesian inference with fixed reference, this separation does not follow from the formalism of variational free energy, but is a design choice of the Active Inference framework that imposes this separation by force (see S4 Appendix for more details). This way, both separate optimizations can be considered as variational inference in each single update, even though when alternating them the reference $\phi$ still changes across the combined optimization process. This is in contrast to alternating optimization schemes in variational inference (e.g., in the Bayesian EM algorithm) where the reference $\phi$ does not change between optimization steps. Thus, there are two choices: either Q-value Active Inference is regarded as some kind of approximation to variational inference under a single total free energy, or one has to give up the idea of a single free energy function that is optimized. Either way, the combined process of action and perception does not correspond to a single variational inference process. Finally, another important practical issue with Q-value Active Inference models is that the definition of $Q$ relies on a mean-field approximation of the trial distributions $q$, under which hidden states are assumed to be stochastically independent. This simplification is too strong for sequential decision-making tasks, which renders the approach unfit for environments where the current state depends stochastically on previous states (see S2 Notebook for a demonstration).
iii. Proposal 2: direct Active Inference [101]. When multiplying $p_{des}$ to the generative model directly, as in (25), the resulting reference $\phi$ is no longer given by a joint distribution of observations, states, and actions (since in general $\sum_{x'} p_{des}(x')\, p_0(x'|S') \neq 1$). Instead, this formulation of Active Inference turns out to be a special case of previous Control as Inference approaches in the machine learning literature [105,107], where one conditions on an auxiliary success variable $R$. In particular, for our running example from Fig 1, Control as Inference extends the generative model by $R$, giving the joint distribution (27), and then conditions actions on both the history and future success ($R = 1$). For our one-step example, this results in the Bayes' posterior (28). It is straightforward to identify $p_{des}(X')$ of Active Inference as a particular choice of a success probability $p_0(R = 1|X')$, or equivalently, $\log p_{des}(X')$ as a reward function $r = r(X')$, so that the joint distribution (27) reduces to the reference function $\phi$ in (25). Thus, the version of Active Inference in [101] is simply a variational formulation of Control as Inference that approximates exact posteriors of the form (28), like other previous variational Bayes' approaches [107,109,110].
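The difference between conditioning on the desired distribution (Jeffrey conditionalization) and conditioning on success ($R = 1$ with $p_0(R = 1|X') = p_{des}(X')$) can be made concrete in a few lines. The sketch below assumes a hypothetical setting in the spirit of Fig 8, with one deterministic and one random action; the numbers are invented for illustration and are not those of S2 Appendix.

```python
import numpy as np

# Hypothetical setting: two actions, three outcomes.
p0_a = np.array([0.5, 0.5])                    # prior over actions
p_x1_given_a = np.array([[1.0, 0.0, 0.0],      # action 0: deterministic outcome
                         [1/3, 1/3, 1/3]])     # action 1: random outcomes
p_des = np.array([0.5, 0.25, 0.25])            # desired distribution over x'

# Jeffrey conditionalization: normalize first, then average over p_des.
#   p(a | p_des) = sum_{x'} p(a | x') p_des(x')
joint = p0_a[:, None] * p_x1_given_a           # p(a, x')
p_a_given_x1 = joint / joint.sum(axis=0)       # p(a | x'), normalized per outcome
jeffrey = p_a_given_x1 @ p_des

# Conditioning on success: average first, then normalize.
#   p(a | R = 1) proportional to p0(a) * sum_{x'} p(x' | a) p_des(x')
success = p0_a * (p_x1_given_a @ p_des)
success /= success.sum()

print("Jeffrey:", jeffrey)
print("Success:", success)
```

With these numbers, Jeffrey conditionalization prefers the random action because it reproduces more of the desired distribution, whereas conditioning on success prefers the deterministic action because it reaches the most desired outcome with certainty.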
In summary, the assumption of a desired distribution $p_{des}$ over future outcomes has led to various attempts in the Active Inference literature of using probabilistic inference to determine profitable actions. Either an action distribution $\frac{1}{Z}\, p_0(A)\, e^{Q(A)}$ is built into the reference function, which presupposes optimal behavior by designing a value function $Q$ that leads to desired consequences, or the outcome probability under the generative model $p_0$ is modified directly by multiplying $p_{des}$ to $p_0$. The latter case is the variational version of Control as Inference, well-known in the machine learning literature [77,105-110]. Considering the issues of Q-value Active Inference discussed above, and the fact that Control as Inference does not rely on a desired distribution over outcomes, we could ask whether formulating preferences by assuming a desired distribution is well-advised. As can be seen from Fig 8, the difference between purely inference-based methods, expected utility approaches, and Active Inference is mainly in how they treat the desired distribution. Should $p_{des}$ be matched, or is it good enough if actions are chosen that lead to a high desired outcome probability? While Control as Inference and utility-based models essentially take the latter approach, Q-value Active Inference answers this question by requiring that the desired distribution should be matched as long as the average entropy of $p_0(X'|S')$ is small.

6 So what does free energy bring to the table?

A practical tool
It is unquestionable that the concept of free energy has seen many fruitful practical applications outside of physics in the statistical and machine learning literature. As has been discussed in Section 3, these applications generally fall into one of two categories: the principle of maximum entropy, and a variational formulation of Bayesian inference. Here, the principle of maximum entropy is interpreted in a wider sense of optimizing a trade-off between uncertainty (entropy) and the expected value of some quantity of interest (energy), which in practice often appears in the form of regularized optimization problems (e.g., to prevent overfitting) or as a general inference method for determining unbiased priors and posteriors (cf. Section 3.1). In the variational formulation of Bayes' rule, free energy plays the role of an error measure that enables approximate inference by constraining the space of distributions over which free energy is optimized. It can also inform the design of efficient iterative inference algorithms that result from an alternating optimization scheme in which each step optimizes the full variational free energy only partially, such as the Bayesian EM algorithm, belief propagation, and other message passing algorithms (cf. Section 3.2).
It is important to realize that, while the mathematical expressions of a free energy from constraints with "energy" $E$ and trade-off parameter $\beta$ and a variational free energy with reference $\phi$ can formally be transformed into each other by $\phi = e^{-\beta E}$, the two kinds of free energy are inherently distinct, both methodically and by their motivation. In the case of the free energy from constraints, we are given a constraint on some quantity $E$ and we are trying to fulfil this constraint with minimum bias by selecting a distribution that trades off the two competing terms $E$ and entropy. This trade-off also gives the reason for the existence of the Lagrange multiplier $\beta$ that has to be determined according to the constraint. In this sense the free energy from constraints is just a special case of the far more general Lagrangian method when applied to the optimization of expected values $\langle E\rangle_p$ under entropy constraints (or the other way around). In contrast, variational free energy is simply a tool to represent the normalization of a reference function $\phi$ in terms of an optimization problem, and therefore does not a priori assume the existence of some quantity $E$ that we may have observed in an experiment or that has any other constraints attached, nor does one explicitly consider entropy to be constrained or optimized. Therefore, even though starting from a (positive) reference function $\phi$ we can always invent the existence of some quantity $E$ and some multiplier $\beta$ such that $\phi = e^{-\beta E}$, this does not explain why these quantities should exist or why they should be mapped into each other in that particular way. The Lagrangian method, on the other hand, explains why for a given constraint on $E$ we have a Lagrange multiplier $\beta$, how it is determined, and why the equilibrium distribution has the form $p^* = \frac{1}{Z}\, e^{-\beta E}$.
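The two readings contrasted above can be placed side by side in a small numerical sketch. It assumes the variational free energy has the form $F(q\,\|\,\phi) = \langle \log(q/\phi)\rangle_q$, a uniform prior, and a hypothetical four-level energy with a target expectation; all names and numbers are illustrative. First, $\beta$ is determined from a constraint on $\langle E\rangle_p$ (the Lagrangian reading). Second, the resulting $\phi = e^{-\beta E}$ is normalized by minimizing $F$ (the variational reading), whose minimum value is $-\log Z$.

```python
import numpy as np

def gibbs(E, beta):
    """Maximum entropy distribution p* = exp(-beta * E) / Z."""
    weights = np.exp(-beta * (E - E.min()))  # shift for numerical stability
    return weights / weights.sum()

def solve_beta(E, target, lo=0.0, hi=50.0, iters=200):
    """Bisection for the Lagrange multiplier: <E> under gibbs(E, beta)
    decreases monotonically in beta, so raise beta while <E> is too high."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if gibbs(E, mid) @ E > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def variational_free_energy(q, phi):
    """F(q || phi) = sum_i q_i * log(q_i / phi_i) (assumed form)."""
    return np.sum(q * np.log(q / phi))

# Hypothetical energy levels and constraint <E> = 1.
E = np.array([0.0, 1.0, 2.0, 3.0])
target = 1.0

# Free energy from constraints: beta follows from the constraint.
beta = solve_beta(E, target)
p_star = gibbs(E, beta)

# Variational free energy: phi is given, minimization only normalizes it.
phi = np.exp(-beta * E)
Z = phi.sum()
F_min = variational_free_energy(p_star, phi)
F_uniform = variational_free_energy(np.full(4, 0.25), phi)

# The same number written as "energy minus entropy".
trade_off = beta * (p_star @ E) + np.sum(p_star * np.log(p_star))

print(beta, p_star @ E, F_min, -np.log(Z), trade_off)
```

In the first step $\beta$ has a meaning, since it is fixed by the constraint on $E$; in the second step any positive $\phi$ would do, and $E$ and $\beta$ are merely one way of writing it.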

Theories of intelligent agency
These practical use-cases of free energy formulations have also influenced models of intelligent behavior. In the cognitive and behavioral sciences, intelligent agency has been modelled in a number of different frameworks, including logic-based symbolic models, connectionist models, statistical decision-making models, and dynamical systems approaches. Even though statistical thinking in a broader sense can in principle be applied to any of the other frameworks as well, statistical models of cognition in a more narrow sense have often focused on Bayesian models, where agents are equipped with probabilistic models of their environment allowing them to infer unknown variables in order to select actions that lead to desirable consequences [14,76,111]. Naturally, the inference of unknown variables in such models can be achieved by a plethora of methods including the two types of free energy approaches of maximum entropy and variational Bayes. However, both free energy formulations go one step further in that they attempt to extend both principles from the case of inference to the case of action selection: utility optimization with information constraints based on free energy from constraints and Active Inference based on variational free energy. While sharing similar mathematical concepts, both approaches differ in syntax and semantics. An apparent apple of discord is the concept of utility [112]. Utility optimization with information constraints requires the determination of a utility function, whereas Active Inference requires the determination of a reference function. In the economic literature, subjective utility functions that quantify the preferences of decision-makers are typically restrictive in order to ensure identifiability when certain consistency axioms are satisfied. In contrast, in Active Inference the reference function involves determining a desired distribution given by the preferred frequency of outcomes. 
However, these differences start to vanish when weakening the utility concept to something like log-probabilities, such that the utility framework becomes more similar to the concept of probability that is able to explain arbitrary behavior. Moreover, Active Inference has to solve the additional problem of marrying up the agent's probabilistic model with its desired distribution into a single reference function (cf. Section 5.3). The solution to this problem is not unique, in particular it lies outside the scope of variational Bayesian inference, but it is critical for the resulting behavior because it determines the exact solutions that are approximated by free energy minimization. In fact, as can be seen in simple simulations such as S2 Notebook, the various proposals for this merging that can be found in the Active Inference literature behave very differently.
Also, both approaches differ fundamentally in their motivation. The motivation of utility optimization with information constraints is to capture the trade-off between precision and uncertainty that underlies information processing. This trade-off takes the form of a free energy once an informational cost function has been chosen (cf. Section 4.3). Note that Bayes' rule can be seen as the minimum of a free energy from constraints with log-likelihoods as utilities, even though this equivalence is not the primary motivation of this trade-off. In contrast, Active Inference is motivated by casting the problem of action selection itself as an inference process [34], as this allows one to express both action and perception as the result of minimizing the same function, the variational free energy. However, there is no mystery in having such a single optimization function, because the underlying probabilistic model already contains both action and perception variables in a single functional format and the variational free energy is just a function of that model. Moreover, while approximate inference can be formulated on the basis of variational free energy, inference in general does not rely on this concept; in particular, inference over actions can easily be done without free energy [77,105-107,113].
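The remark that Bayes' rule minimizes a free energy from constraints with log-likelihoods as utilities can be illustrated with a small numerical sketch. The prior, the likelihood values, and the function names below are hypothetical, and the free energy is assumed to be written as KL(q‖prior)/β − ⟨U⟩_q with β = 1.

```python
import numpy as np

prior = np.array([0.5, 0.3, 0.2])        # hypothetical prior over hypotheses
likelihood = np.array([0.1, 0.4, 0.7])   # hypothetical p(observation | hypothesis)

def free_energy(q, utility, prior, beta):
    """Assumed form: KL(q || prior) / beta - <utility>_q, to be minimized."""
    kl = np.sum(q * np.log(q / prior))
    return kl / beta - np.dot(q, utility)

def optimal_posterior(utility, prior, beta):
    """Minimizer of the trade-off: q* proportional to prior * exp(beta * utility)."""
    w = prior * np.exp(beta * (utility - utility.max()))
    return w / w.sum()

utility = np.log(likelihood)             # log-likelihoods play the role of utilities
q_star = optimal_posterior(utility, prior, beta=1.0)

# Bayes' rule computed directly, for comparison.
evidence = np.sum(prior * likelihood)
bayes_posterior = prior * likelihood / evidence

F_min = free_energy(q_star, utility, prior, beta=1.0)   # equals -log(evidence)

# A randomly drawn alternative distribution does no better.
rng = np.random.default_rng(1)
q_other = rng.dirichlet(np.ones(3))
F_other = free_energy(q_other, utility, prior, beta=1.0)
```

With β = 1 the minimizer coincides with the Bayesian posterior and the minimum equals the negative log evidence; other values of β would give tempered posteriors, which is where the trade-off reading goes beyond Bayes' rule.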
However, there are also plenty of similarities between the two free energy approaches. For example, the assumption of a soft-max action distribution in Active Inference is similar to the posterior solutions resulting from utility optimization with information constraints. Moreover, the assumption of a desired future distribution relates to constrained computational resources, because the uncertainty constraint in a desired distribution over future states may not only be a consequence of environmental uncertainty, but could also originate from stochastic preferences of a satisficing decision-maker that accepts a wide range of outcomes. In fact, as we have seen in the discussion around Fig 8, various methods for inference over actions differ in how they treat preferences given by a distribution over desired outcomes: Some of them try to match the predictive and desired distributions, while others simply seek to reach states whose outcomes have a high desired probability. In S2 Notebook, we provide a comparison of the discussed methods using grid world simulations, in order to see their resulting behavior also in a sequential decision-making task.
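The difference between matching a desired distribution and seeking outcomes with high desired probability can be made concrete with a toy comparison. The sketch below is not the example from Fig 8 or the S2 Notebook; the desired distribution, the two options, and both scoring rules are hypothetical stand-ins chosen only to show that the two criteria can disagree.

```python
import numpy as np

desired = np.array([0.5, 0.3, 0.2])   # hypothetical desired distribution over outcomes

# Hypothetical predictive distributions over outcomes for two options.
options = {
    "deterministic": np.array([1.0, 0.0, 0.0]),
    "uncertain": np.array([1 / 3, 1 / 3, 1 / 3]),
}

def kl_to_desired(pred, desired):
    """KL(pred || desired); small when the prediction matches the desired distribution."""
    mask = pred > 0
    return float(np.sum(pred[mask] * np.log(pred[mask] / desired[mask])))

def expected_desired_probability(pred, desired):
    """Average desired probability of the outcomes the option produces."""
    return float(np.dot(pred, desired))

# Criterion 1: match the predictive to the desired distribution.
best_by_matching = min(options, key=lambda k: kl_to_desired(options[k], desired))

# Criterion 2: reach outcomes that have a high desired probability.
best_by_reaching = max(
    options, key=lambda k: expected_desired_probability(options[k], desired)
)
```

Under these assumed numbers the matching criterion favors the uncertain option, because its spread resembles the desired distribution, while the second criterion favors the deterministic option that always yields the most desired outcome.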
A remarkable resemblance among both approaches is the exclusive appearance of relative entropy to measure dissimilarity. In the Active Inference literature it is often claimed that every homeostatic system must minimize variational free energy [97], which is simply an extension of relative entropy for non-normalized reference functions (cf. Section 3.2.2). In utility-based approaches, the relative entropy (19) is typically used to measure the amount of information processing, even though theoretically other cost functions would be conceivable [74]. For a given homeostatic process, the KL divergence measures the dissimilarity between the current distribution and the limiting distribution and therefore is reduced while approximating the equilibrium. Similarly, in utility-based decision-making models, relative entropy measures the dissimilarity between the current posterior and the prior. In the Active Inference literature the stepwise minimization of variational free energy that goes along with KL minimization is often equated with the minimization of sensory surprise (see S3 Appendix for a more detailed explanation), an idea that stems from maximum likelihood algorithms, but that has been challenged as a general principle (see [114] and the response [115]). Similarly, one could in principle rewrite free energy from constraints in terms of informational surprise, which would however simply be a rewording of the probabilistic concepts in log-space. The same kind of rewording is well-known between probabilistic inference and the minimum description length principle [116] that also operates in log-space, and thus reformulates the inference problem as a surprise minimization problem without adding any new features or properties.

Biological relevance
So far we have seen how free energy is used as a technical instrument to solve inference problems and its corresponding appearance in different models of intelligent agency. Crucially, these kinds of models can be applied to any input-output system, be it a human that reacts to sensory stimuli, a cell that tries to maintain homeostasis, or a particle trapped by a physical potential. Given the existing literature that has widely applied the concept of free energy to biological systems, we may ask whether there are any specific biological implications of these models.
Considering free energy from constraints, the trade-off between utility and information processing costs provides a normative model of decision-making under resource constraints that extends previous optimality models based on expected utility maximization and Bayesian inference. Analogous to rate-distortion curves in information theory, optimal solutions to decision-making problems are obtained that separate achievable from non-achievable regions in the information-utility plane (cf. Fig 5). The behavior of real decision-making systems under varying information constraints can be analyzed experimentally by comparing their performance with respect to the corresponding optimality curve. One can experimentally relate abstract information processing costs measured in bits to task-dependent resource costs like reaction or planning times [20,22]. Moreover, the free energy trade-off can also be used to describe networks of agents, where each agent is limited in its ability, but the system as a whole has a higher information processing capacity, for example, neurons in a brain or humans in a group. In such systems different levels of abstraction arise depending on the different positions of decision-makers in the network [23,71,85]. As we have discussed in Section 4.3, just like coding and rate-distortion theory, utility theory with information costs can only provide optimality bounds but does not specify any particular mechanism of how to achieve optimality. However, by including more and more constraints one can make a model more and more mechanistic and thereby gradually move from a normative to a more descriptive model, such as models that consider the communication channel capacity of neurons with a finite energy budget [24].
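An optimality curve of this kind can be traced numerically for a single decision. The following sketch assumes a hypothetical utility vector, a uniform prior over four actions, and an optimal distribution of the form p*(a) ∝ p0(a) exp(βU(a)); it is not the model behind Fig 5. Sweeping β yields pairs of information cost (in bits) and expected utility.

```python
import numpy as np

utility = np.array([1.0, 0.8, 0.2, 0.0])   # hypothetical utilities of four actions
prior = np.full(4, 0.25)                   # hypothetical uniform prior policy

def bounded_optimal(utility, prior, beta):
    """Assumed optimal policy: p*(a) proportional to prior(a) * exp(beta * utility(a))."""
    w = prior * np.exp(beta * (utility - utility.max()))
    return w / w.sum()

def optimality_curve(utility, prior, betas):
    """Return rows of (information cost in bits, expected utility), one per beta."""
    points = []
    for beta in betas:
        p = bounded_optimal(utility, prior, beta)
        mask = p > 0
        info_bits = np.sum(p[mask] * np.log2(p[mask] / prior[mask]))
        points.append((info_bits, np.dot(p, utility)))
    return np.array(points)

betas = np.linspace(0.0, 20.0, 41)
points = optimality_curve(utility, prior, betas)
info_bits, expected_utility = points[:, 0], points[:, 1]
```

At β = 0 the policy equals the prior (zero bits, average utility), and as β grows both coordinates increase towards the deterministic optimum, which here costs at most log2(4) = 2 bits. A measured pair of information cost and utility from an experiment could then be compared against this curve.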
Considering variational free energy, there is a vast literature on biological applications mostly focusing on neural processing (e.g., predictive coding and dopamine) [102,117,118], but there are also a number of applications aiming to explain behavior (e.g., human decision-making and hallucinations) [119]. Similarly to utility-based models, Active Inference models can be studied in terms of "as if" models, so that actual behavior can be compared to predicted behavior as long as suitable prior and likelihood models can be identified from the experiment. When applied to brain dynamics, the "as if" models are sometimes also given a mechanistic interpretation by relating iterative update equations that appear when minimizing variational free energy with dynamics in neuronal circuits. As discussed in Section 3.2.3, the update equations resulting, for example, from mean-field or Bethe approximations can often be written in message passing form, in the sense that the update for a given variable only has contributions that require the current approximate posterior of neighbouring nodes in the probabilistic model. These contributions are interpreted as local messages passed between the nodes and might be related to brain signals [102]. Other interpretations [28,91,100] obtain similar update equations by minimizing variational free energy directly through gradient descent, which can again be related to neural coding schemes like predictive coding. As these coding schemes have existed irrespective of free energy [120,121], especially since minimization of prediction errors is already seen in maximum likelihood estimation [120], the question remains whether there are any specific predictions of the Active Inference framework that cannot be explained with previous models (see [39,122] for recent discussions of this question).

Conclusion
Any theory about intelligent behavior has to answer three questions: Where am I?, where do I want to go?, and how do I get there?, corresponding to the three problems of inference and perception, goals and preferences, and planning and execution. All three problems can be addressed either in the language of probabilities or utilities. Perceptual inference can either be considered as finding parameters that maximize probabilities or likelihood utilities. Goals and preferences can either be expressed by utilities over outcomes or by desired distributions. The third question can be answered by the two free energy approaches that either determine future utilities based on model predictions, or infer actions that lead to outcomes predicted to have high desired probability or match the desired distribution. In standard decision-making models actions are usually determined by a utility function that ranks different options, whereas perceptual inference is determined by a likelihood model that quantifies how probable certain observations are. In contrast, both free energy approaches have in common that they treat all types of information processing, from action planning to perception, as the same formal process of minimizing some form of free energy. But the crucial difference is not whether they use utilities or probabilities, but how predictions and goals are interwoven into action. This article started out by tracing back the seemingly mysterious connection between Helmholtz free energy from thermodynamics and Helmholtz' view of model-based information processing that led to the analysis-by-synthesis approach of perception, as exemplified in predictive coding schemes, and, in particular, set out to discuss the role of free energy in current models of intelligent behavior.
The mystery starts to dissolve when we consider the two kinds of free energies discussed in this article, one based on the maximum entropy principle and the other based on variational free energy, a dissimilarity measure between distributions and (generally unnormalized) functions that extends the well-known KL divergence from information theory. The Helmholtz free energy is a particular example of an energy-information trade-off that results from the maximum entropy principle [46]. Analysis-by-synthesis is a particular application of inference to perception, where determining model parameters and hidden states can either be seen as a result of maximum entropy under observational constraints or of fitting parameter distributions to the model through variational free energy minimization. Thus, both notions of free energy can be formally related as entropy-regularized maximization of log-probabilities.
Conceptually, however, utility-based models with information constraints serve primarily as ultimate explanations of behavior; that is, they do not focus on mechanism, but on the goals of behavior and their realizability under ideal circumstances. They have the appeal of being a relatively straightforward generalization of standard utility theory, but they rely on abstract concepts like utility and relative entropy that may not be so straightforwardly related to experimental settings. While these normative models have no immediate mechanistic interpretation, their relevance for mechanistic models may be analogous to the relevance of optimality bounds in Shannon's information theory for practical codes [90]. In contrast, Active Inference models of behavior often mix ultimate and proximate arguments of explaining behavior [123,124], because they combine the normative aspect of optimizing variational free energy with the mechanistic interpretation of the particular form of approximate solutions to this optimization. While mean-field approaches of Active Inference may be particularly amenable to such mechanistic interpretations, they are often too simple to capture complex behavior. In contrast, the solutions of direct Active Inference resulting from a Bethe assumption are equivalent to previous Control as Inference approaches [77,105-110] that allow for Bayesian message passing formulations whose biological implementability can be debated irrespective of the existence of a free energy functional.
Finally, both kinds of free energy formulations of intelligent agency are so general and flexible in their ingredients that it might be more appropriate to consider them languages or tools to phrase and describe behavior rather than theories that explain behavior, in a sense similar to how statistics and probability theory are not biological or physical theories but simply provide a language in which we can phrase our biological and physical assumptions.

Supporting information
S1 Appendix. Derivation of exemplary update equations. We derive update equations of Q-value and direct Active Inference for the example in Section 5.2 under mean-field and Bethe approximations. (PDF)
S2 Appendix. Uncertain and deterministic options. We give additional details on the example shown in Fig 8 that illustrates the effects of assuming a particular desired distribution over three outcomes under Jeffrey conditionalization, Control as Inference, expected utility optimization, and Active Inference.
S2 Notebook. Grid world simulations. We provide implementations of the models discussed in this article in a grid world environment, both as a rendered HTML file and as a Jupyter notebook that is available on GitHub. (HTML)