Inferring the function performed by a recurrent neural network

A central goal in systems neuroscience is to understand the functions performed by neural circuits. Previous top-down models addressed this question by comparing the behaviour of an ideal model circuit, optimised to perform a given function, with neural recordings. However, this requires guessing in advance what function is being performed, which may not be possible for many neural systems. To address this, we propose an inverse reinforcement learning (RL) framework for inferring the function performed by a neural network from data. We assume that the responses of each neuron in a network are optimised so as to drive the network towards ‘rewarded’ states, that are desirable for performing a given function. We then show how one can use inverse RL to infer the reward function optimised by the network from observing its responses. This inferred reward function can be used to predict how the neural network should adapt its dynamics to perform the same function when the external environment or network structure changes. This could lead to theoretical predictions about how neural network dynamics adapt to deal with cell death and/or varying sensory stimulus statistics.


Introduction
Neural circuits have evolved to perform a range of different functions, from sensory coding to muscle control and decision making.A central goal of systems neuroscience is to elucidate what these functions are and how neural circuits implement them.A common 'top-down' approach starts by formulating a hypothesis about the function performed by a given neural system (e.g.efficient coding/decision making), which can be formalised via an objective function [1][2][3][4][5][6][7][8][9][10].This hypothesis is then tested by comparing the predicted behaviour of a model circuit that maximises the assumed objective function (possibly given constraints, such as noise/metabolic costs etc.) with recorded responses.
One of the earliest applications of this approach was sensory coding, where neural circuits are thought to efficiently encode sensory stimuli, with limited information loss [7][8][9][10][11][12][13].Over the years, top-down models have also been proposed for many central functions performed by neural circuits, such as generating the complex patterns of activity necessary for initiating motor commands [3], detecting predictive features in the environment [4], or memory storage [5].Nevertheless, it has remained difficult to make quantitative contact between top-down model predictions and data, in particular, to rigorously test which (if any) of the proposed functions is actually being carried out by a real neural circuit.
The first problem is that a pure top-down approach requires us to hypothesise the function performed by a given neural circuit, which is often not possible.Second, even if our hypothesis is correct, there may be multiple ways for a neural circuit to perform the same function, so that the predictions of the top-down model may not match the data.
Here we propose a framework for considering optimal computation by a recurrent neural network, that aims to overcome these problems.We first frame optimal computation by a recurrent neural network as a reinforcement learning (RL) problem [14][15][16][17][18][19][20][21] ( Fig 1).In this framework, a reward function is assumed to quantify how desirable each state of the network is for performing a given computation.Each neuron then optimises its responses (i.e. when to fire a spike) so as to drive the network towards rewarded states, given a constraint on the information each neuron encodes about its inputs.This framework is very general-different choices of reward function result in the network performing diverse functions, from efficient coding to decision making and optimal control.
Next, we show how this framework can be used to tackle the problem of inferring the reward function from the observed network dynamics.Previous work has proposed 'inverse RL' algorithms for inferring the original reward function from an agent's actions [22][23][24][25][26].
Here we show how this framework can be adapted to infer the reward function optimised by a recurrent neural network.Further, given certain conditions we show that the reward function can be expressed as a closed-form expression of the observed network dynamics.
We hypothesise that the inferred reward function, rather than e.g. the properties of individual neurons, is the most succinct mathematical summary of the network, that generalises across different contexts and conditions.Thus we could use our framework to quantitatively predict how the network will adapt or learn in order to perform the same function when the external context (e.g.stimulus statistics), constraints (e.g.noise level) or the structure of the network (e.g.due to cell death or experimental manipulation) change.Our framework could thus generate predictions for a wide range of experimental manipulations.

General approach
We can quantify how well a network performs a specific function (e.g.sensory coding/decision making) via an objective function L π (where π denotes the parameters that determine the network dynamics) (Fig 1A).There is a large literature describing how to optimise the dynamics of a neural network, π, to maximise specific objective functions, L π , given constraints (e.g.metabolic cost/wiring constraints etc.) [1][2][3][4][5][6][7][8][9][10].However, it is generally much harder to go in the opposite direction, to infer the objective function, L π , from observations of the network dynamics.
To address this question, we looked to the field of reinforcement learning (RL) [16][17][18][19][20], which describes how an agent should choose actions so as to maximise the reward they receive from their environment (Fig 1B).Conversely, another paradigm, called inverse RL [22][23][24][25][26], explains how to go in the opposite direction, to infer the reward associated with different states of the environment from observations of the agent's actions.By establishing a mapping between optimising neural network dynamics (Fig 1A) and optimising an agent's actions via RL (Fig 1B) [14,15] we could use inverse RL to infer the objective function optimised by a neural network from its observed dynamics.
To illustrate this, let us compare the problem faced by a single neuron embedded within a recurrent neural network (Fig 1C ) to the textbook RL problem of an agent navigating a maze (Fig 1D).The neuron's environment is determined by the activity of other neurons in the network and its external input; the agent's environment is determined by the walls of the maze.At each time, the neuron can choose whether to fire a spike, so as to drive the network towards states that are 'desirable' for performing a given function; at each time, the agent in the maze can choose which direction to move in, so as to reach 'desirable' locations, associated with a high reward.
Both problems can be formulated mathematically as Markov Decision Processes (MDPs) (Fig 1E).Each state of the system, s (i.e. the agent's position in the maze, or the state of the General approach.(A) Top-down models use an assumed objective function to derive the optimal neural dynamics.The inverse problem is to infer the objective function from observed neural responses.(B) RL uses an assumed reward function to derive an optimal set of actions that an agent should perform in a given environment.Inverse RL infers the reward function from the agent's actions.(C-D) A mapping between the neural network and textbook RL setup.(E) Both problems can be formulated as MDPs, where an agent (or neuron) can choose which actions, a, to perform to alter their state, s, and increase their reward.(F) Given a reward function and coding cost (which penalises complex policies), we can use entropy-regularised RL to derive the optimal policy (left).Here we plot a single trajectory sampled from the optimal policy (red), as well as how often the agent visits each location (shaded).Conversely, we can use inverse RL to infer the reward function from the agent's policy (centre).We can then use the inferred reward to predict how the agent's policy will change when we increase the coding cost to favour simpler (but less rewarded) trajectories (top right), or move the walls of the maze (bottom right).https://doi.org/10.1371/journal.pone.0248940.g001network and external input), is associated with a reward, r(s).At each time, the agent can choose to perform an action, a (i.e.moving in a particular direction, or firing a spike), so as to reach a new state s 0 with probability, p(s 0 |a, s).The probability that the agent performs a given action in each state, π(a|s), is called their policy.
We assume that the agent (or neuron) optimises their policy to maximise their average reward, hr(s)i pπ (s) (where h�i pπ (s) denotes the average over the steady state distribution, p π (s), with a policy π(a|s)).In addition, we assume that there is a constraint on the amount of information the agent can 'encode' about their state, I π (a;s) (this corresponds, for example, to constraining how much a neuron can encode about the rest of the network and external input).Intuitively, this constraint determines how complicated an agent's policy is allowed to be: the smaller, I π (a;s), the less the agent's actions, a, can vary as a function of the state, s.In the limit where I π (a;s) = 0 the agent's actions are completely independent of their state.More formally, we assume the agent finds a policy that maximises the following objective function: where λ is a constant that controls the strength of the constraint [20].Note that in the special case where the agent's state does not depend on previous actions (i.e.p(s 0 |a, s) = p(s 0 |s)) and the reward only depends on the agent's current state and action, this is the same as the objective function used in rate-distortion theory [27,28].We can also write this objective function as: where c π (s) is a 'coding cost', equal to the Kullback-Leibler divergence between the agent's policy and the steady-state distribution over actions, D KL [π(a|s)kp π (a)].This coding cost specifies how much the agent's policy in each state, π(a|s), differs from the average distribution over actions p π (a).In the limit where c π (s) = 0 for all s, then their choice of action choice will not depend on s (i.e.π(a|s) = p π (a) for all s).We hereon refer to the difference, r(s)−λc π (s), as the 'return' associated with each state.
In the Methods we explain how this objective function can be maximised via entropy-regularised RL [17,20] to obtain the optimal policy, which satisfies the relation: where s, s 0 and s 00 denote three consecutive states of the agent.Subtracting the average return, L π , from each term in the sum ensures that this series converges to a finite value [29].Thus, actions that drive the agent towards high-value states are preferred over actions that drive the agent towards low value states.Note the difference between a state's value, v(s), and its return, r(s)−λc π (s): a state with low return can nonetheless have a high-value if it allows the agent to transition to other states associated with a high return in the future.
Let us return to our toy example of the agent in a maze.Fig 1F (left) shows the agent's trajectory through the maze after optimising their policy using entropy regularized RL to maximise L π (Methods).In this example, a single location, in the lower-right corner of the maze, has a non-zero reward (Fig 1F , centre).However, suppose we didn't know this; could we infer the reward at each location just by observing the agent's trajectory in the maze?In the Methods we show that this can be done by finding the reward function that maximises the log-likelihood of the optimal policy, averaged over observed actions and states, hlogπ � (a|s)i data .If the coding cost is non-zero (λ > 0), this problem is generally well-posed, meaning there is a unique solution for r(s).
Once we have inferred the reward function optimised by the agent, we can then use it to predict how their behaviour will change when we alter their external environment or internal constraints.For example, we can predict how the agent's trajectory through the maze will change when we move the position of the walls (

Optimising neural network dynamics
We used these principles to infer the function performed by a recurrent neural network.We considered a model network of n neurons, each described by a binary variable, σ i = −1/1, denoting whether the neuron is silent or spiking respectively (Methods section).The network receives an external input, x.The network state is described by an n-dimensional vector of binary values, σ = (σ 1 , σ 2 , . .., σ n ) T .Both the network and external input have Markov dynamics.Neurons are updated asynchronously: at each time-step a neuron is selected at random, and its state updated by sampling from σ i � π i (σ i |σ, x).The dynamics of the network are fully specified by the set of response probabilities, π i (σ i |σ, x), and input statistics, p(x 0 |x) (where x and x 0 denote the external input at consecutive time steps).
As before, we use a reward function, r(σ, x), to express how desirable each state of the network is to perform a given functional objective.For example, if the objective of the network is to faithfully encode the external input, then an appropriate reward function might be the negative squared error: rðx; σÞ ¼ À ðx À xðσÞÞ 2 , where xðσÞ denotes an estimate of x, inferred from the network state, σ.More generally, different choices of reward function can be used to describe a large range of functions that may be performed by the network.The dynamics of the network (determined by the response probabilities for each neuron), π i are said to be optimal if they maximise the average reward, hrðσ; xÞi p p ðσ;xÞ , given a constraint on the mutual information between each neuron's responses and the rest of the network and external inputs, P n i¼1 I p ðs i ; σ =i ; xÞ (where σ /i denotes the state of every neuron except for neuron i).As described in the previous section, this coding constraint effectively limits how much each neurons responses can vary, depending on its inputs.Formally, this corresponds to maximising the objective function: where λ controls the strength of the constraint.For each neuron, this optimisation can be framed as an MDP, where the state, action, and policy correspond to the network state and external input {σ, x}, the neuron's proposed update, σ i , and the response probability π i (σ i |σ, x), respectively.Thus, we can optimise the network dynamics, by optimising each neuron's response probabilities, π i (σ i |σ, x), via entropy-regularised RL, as we did for the agent in the maze.As each update increases the objective function L π , we can alternate updates for different neurons to optimise the dynamics of the entire network.In the Methods, we show that this results in optimal response probabilities that satisfy the relation: where σ /i denotes the state of all neurons except for neuron i, and v π (σ, x) is the value associated with each state: v p ðσ; xÞ ¼ rðσ; xÞ À lc p ðσ; xÞ À L p þhrðx 0 ; σ 0 Þ À lc p ðσ 0 ; x 0 Þi p p ðσ 0 jx;σÞpðx 0 jxÞ À L p þhrðx 00 ; σ 00 Þ À lc p ðσ 00 ; x 00 Þi p p ðσ 00 jx;σÞpðx 00 jxÞ À L The coding cost, c p ðσ; xÞ ¼ P n i¼1 D KL ½p i ðs i jσ =i ; xÞ k p p ðs i Þ� penalises deviations from each neuron's average firing rate.The network dynamics are optimised by alternately updating the value function and neural response probabilities until convergence (see Methods).
To see how this works in practice, we simulated a network of 8 neurons that receive a binary input x (Fig 2A).The assumed goal of the network is to fire exactly 2 spikes when x = −1, and 6 spikes when x = 1, while minimising the coding cost.To achieve this, the reward was set to unity when the network fired the desired number of spikes, and zero otherwise (Fig 2B).We optimised the network as described above using entropy-regularised RL, and then plotted the optimal tuning curves for each neuron, which show how their spiking probability should optimally vary depending on the input, x, and number of spikes fired by other neurons (Fig 2C).We confirmed that after optimisation the number of spikes fired by the network was tightly peaked around the target values (Fig 2D).Decreasing the coding cost reduced noise in the network, decreasing variability in the total spike count.

Inferring the objective function from the neural dynamics
We next asked if one could use inverse RL to infer the reward function optimised by a neural network, just from its observed dynamics (Fig 2D).For simplicity, let us first consider a recurrent network that receives no external input.In this case, the optimal dynamics (Eq 6) correspond to Gibbs sampling from a steady-state distribution: . We can combine this with the Bellmann equality, which relates the reward, value and cost functions (according to: rðσÞ ¼ v p ðσÞ þ lc p ðσÞ À hv p ðσ 0 Þi pðσ 0 jσÞ þ L p ; see Methods) to derive an expression for the reward function: where p(σ i |σ /i ) denotes the probability that neuron i is in state σ i , given the current state of all the other neurons and C is an irrelevant constant (see Methods).Without loss of generality, we can set the coding cost, λ, to 1 (since altering λ rescales the inferred reward and coding cost by the same factor, rescaling the objective function without changing its shape).In the Methods, we show how we can recover the reward function when there is an external input.In this case, we do not obtain a closed-form expression for the reward function, but must instead infer it via maximum likelihood.
To see how our inverse RL approach could work in principle for a network that receives external inputs, we return to our earlier model network, whos 'function' was to fire a constant number of spikes, depending on the the input (i.e. 2 spikes when x = −1, 6 spikes when x = 1).Despite the simplicity of the model, it is not easy to guess the function it performs just by observing the responses of the network (Fig 2D), or neuronal tuning curves (Fig 2C).However, we can use inverse RL we to infer the reward function optimised by the network Fig 2E, from its observed dynamics.Note, that our method did not make any a priori assumptions about the parametric form of the reward function, which was allowed to vary freely as a function of the network state and input, (σ, x).Nonetheless, we can use a simple clustering algorithm (e.g.k-means) to recover the fact that the inferred reward took two binary values; further analysis reveals that the reward is only non-zero when the network fired exactly 2 spikes when x = −1, and 6 spikes when x = 1.As for the agent in the maze, we can use this inferred reward function to predict how the network dynamics will vary depending on the internal/external constraints.For example, we can predict how neural tuning curves will vary if we alter the input statistics by decreasing the probability that the binary input, Our ability to correctly infer the reward function optimised by the network will be fundamentally limited by the amount of available data.

Inferring efficiently encoded stimulus features
An influential hypothesis, called 'efficient coding', posits that sensory neural circuits have evolved to encode maximal information about sensory stimuli, given internal constraints [7][8][9][10][11][12][13].However, the theory does not specify which stimulus features are relevant to the organism, and thus should be encoded.Here we show how one could use inverse RL to: (i) infer which stimulus features are encoded by a recorded neural network, and (ii) test whether neurons encode maximal information about these features, given assumed constraints on information transmission.
Efficient coding posits that neurons maximise information encoded about some relevant stimulus feature, y(x), given constraints on the information encoded by each neuron about their inputs, x (Fig 4A).This corresponds to maximising: where λ controls the strength of the constraint.Noting that the second term is equal to the coding cost we used previously (Eq 5), we can rewrite this objective function as: where we have omitted terms which don't depend on π.Now this is exactly the same as the objective function we have been using so far (Eq 5), in the special case where the reward function, r(σ, x), is equal to the log-posterior, logp π (y(x)|σ).As a result we can maximise L π via an iterative algorithm, where on each iteration we update the reward function by setting r(x, σ) logp π (y(x)|σ), before then optimising the network dynamics, via entropy-regularised RL.Thus, thanks to the correspondence between entropy-regularised RL and efficient coding we could derive an algorithm to optimise the dynamics of a recurrent network to perform efficient coding [30].
As an illustration, we simulated a network of 7 neurons that receive a sensory input consisting of 7 binary pixels (Fig 4B,top).In this example, the 'relevant feature', y(x) was a single binary variable, which was equal to 1 if 4 or more pixels were active, and -1 otherwise (Fig 4B ,  bottom).We optimised the network dynamics using the efficient-coding algorithm described above. .We verified that the optimised network encodes significantly more information about the relevant feature than a network of independent neurons, over a large range of coding costs (Fig 4E).Now, imagine that we just observe the stimulus and neural responses (Fig 4F).Can we recover the feature, y(x) encoded by the network?To do this, we first use inverse RL to infer the reward function from observed neural responses (in exactly the same way as described in the previous section) (Fig 4G).We made no a priori assumptions about the reward function, so that it could vary freely depending on the network state and input, (σ, x).However, as we described above, in the case where the network encodes maximal information about some stimulus feature, y(x) (given assumed coding constraints), then the inferred reward, r(σ, x) should be proportional to the log-posterior, logp(y(x)|σ).In this case, given σ, the inferred reward, r(σ, x) will only depend on changes to the input, x, that alter y(x).As a result, we can use the inferred reward to recover which inputs, x, map onto the same value of y(x), and thus, what is the 'relevant feature' encoded by the network.In our example, we see that the inferred reward collapses onto two curves only (blue and red in Fig 4G ), depending on the total number of pixels in the stimulus.This allows us to deduce that the encoded feature, y(x), must be a sharp threshold on the number of simultaneously active pixels.Note that this is not easy to see by looking directly at the neuronal tuning curves, which vary smoothly with the number of active pixels (Fig 4C).Next, having recovered y(x), we can check whether the network encodes maximal information about this feature (given the assumed contraints on information transmission), by checking whether the inferred reward, r(σ, x) is proportional to the log-posterior, logp(y(x)|σ), as predicted by the theory.
Note that our general approach could also generalise to more complex efficient coding models, where the encoded variable, y, is not a binary function of the input, x.In this case, we can perform a cluster analysis (e.g.k-means) to reveal which inputs, x, map onto similar reward.If the network is performing efficient coding then these inputs should also map onto the same encoded feature, y(x).
Finally, once we have inferred the function performed by the network, we can predict how its dynamics will vary with context, such as when we alter the input statistics.For example, in our simulation, reducing the probability that input pixels are active causes the neural population to split into two cell-types, with distinct tuning curves and mean firing rates (Fig 4H) [13].

Parametric model of neural responses
The basic framework described above is limited by the fact that the number of states, n s , scales exponentially with the number of neurons (n s = 2 n ).Thus, it will quickly become infeasible to compute the optimal dynamics as the number of neurons increases.Likewise, we will need an exponential amount of data to reliably estimate the sufficient statistics of the network, required to infer the reward function (Fig 3).
For larger networks, this problem can be circumvented by using tractable parametric approximation of the value function and reward functions.As an illustration, let us consider a network with no external input.If we approximate the value function by a quadratic function of the responses, our framework predicts a steady-state response distribution of the form: p(σ) / exp (∑ i,j6 ¼i J ij σ i σ j + ∑ i h i σ i ), where J ij denotes the pairwise couplings between neurons, and h i is the bias.This corresponds to a pairwise Ising model, which has been used previously to model recorded neural responses [30,31].(Note that different value functions could be used to give different neural models; e.g.choosing v(x) = f(w � x), where x is the feed-forward input, results in a linear-nonlinear neural model.)In Methods section we derive an algorithm to optimise the coupling matrix, J, for a given reward function and coding cost.
To illustrate this, we simulated a network of 12 neurons arranged in a ring, with reward function equal to 1 if exactly 4 adjacent neurons are active together, and 0 otherwise.After optimisation, nearby neurons were found to have positive couplings, while distant neurons had negative couplings (Fig 5A).The network dynamics generate a single hill of activity which drifts smoothly in time.This is reminiscent of ring attractor models, which have been influential in modeling neural functions such as the rodent and fly head direction system [32][33][34].(Indeed, eqn.6 suggests why our framework generally leads to attractor dynamics, as each transition will tend to drive the network to higher-value 'attractor states').
As before, we can then use inverse RL to infer the reward function from the observed network dynamics.However, note that when we use a parametric approximation of the value function this problem is not well-posed, and we have to make additional assumptions about the form of the reward function.We first assumed a 'sparse' reward function, where only a small number of states, σ, are assumed to be associated with non-zero positive reward (see Methods).Using this assumption, we could well recover the true reward function from observations of the optimised neural responses (with an r 2 value greater than 0.9).
Having inferred the reward function optimised by the network, we can then use it to predict how the coupling matrix, J, and network dynamics will vary if we alter the internal/external constraints.For example, we can use the inferred reward to predict how increasing the coding cost will result in stronger positive couplings between nearby neurons and a hill of activity that sometimes jumps discontinuously between locations (Fig 5B ); removing connections between distant neurons will result in two uncoordinated peaks of activity (Fig 5C ); finally, selectively activating certain neurons will 'pin' the hill of activity to a single location (Fig 5D ).
To illustrate the effect of assuming different reward functions, we considered two different sets of assumptions (in addition to the sparse model, described above): a 'pairwise model', We optimized the parameters of a pairwise coupled network, using a reward function that was equal to 1 when exactly 4 adjacent neurons were simultaneously active, and 0 otherwise.The resulting couplings between neurons are schematized on the left, with positive couplings in red and negative couplings in blue.The exact coupling strengths are plotted in the centre.On the right we show an example of the network dynamics.Using inverse RL, we can infer the original reward function used to optimise the network from its observed dynamics.We can then use this inferred reward to predict how the network dynamics will vary when we increase the coding cost (B), remove connections between distant neurons (C) or selectively activate certain neurons (D).where the reward is assumed to be a quadratic function of the network state, and a 'global model' where the reward is assumed to depend only on global spike count (see Methods).In all three cases, the inferred reward function provided a reasonable fit to the true reward function, (averaged over states visited by the network; Fig 6A).However, only the sparse and pairwise models were able to predict how neural responses changed when, for example, we optimised the network with a higher coding cost (Fig 6B ).

Discussion
A large research effort has been devoted to developing 'top-down' models, which describe the network dynamics required to optimally perform a given function (e.g.decision making [6], control [3], efficient sensory coding [8] etc.).Here we describe how one can use inverse RL to: (i) infer the function optimised by a network from observed neural responses; (ii) predict the dynamics of a network upon perturbation.
An alternative 'bottom-up' approach is to construct phenomenological models describing how neurons respond to given sensory stimuli and/or neural inputs [30,31,35,36].In common with our approach, such models are directly fitted to neural data.However, these models generally do not set out to reveal the function performed by the network.Further, they are often poor at predicting neural responses in different contexts (e.g.varying stimulus statistics).
Here we hypothesize that it is the function performed by a neural circuit that remains invariant, not its dynamics or individual cell properties.Thus, if we can infer what this function is, we should be able to predict how the network dynamics will adapt depending on the context, so as to perform the same function under different constraints.As a result, our theory could predict how the dynamics of a recorded network will adapt in response to a large range of experimental manipulations, such as varying the stimulus statistics, blocking connections, knocking out/stimulating cells etc. (Note however, that to predict how the network adapts to these changes, we will need to infer the 'full' reward function; in some cases, this may require measuring neural responses in multiple environments).
Another approach is to use statistical methods, such as dimensionality reduction techniques, to infer information about the network dynamics (such as which states are most frequently visited), which can then be used to try and gain insight about the function it performs [37][38][39][40].For example, in the context of sensory coding, an approach called 'maximally informative dimensions' seeks to find a low-dimensional projection of the stimulus that is most informative about a neuron's responses [41].However, in contrast to our approach, such approaches do not allow us to recover the objective function optimised by the network.As a result, they do not predict how neural responses will alter depending on the the internal/external constraints.It is also not clear how to relate existing dimensionality reduction methods, such as PCA, to the dynamics of a recurrent neural network (e.g. are certain states visited frequently because of the reward, external stimulus, or internal dynamics?).Nonetheless, in future work it could be interesting to see if dimensionality reduction techniques could be used to first recover a compressed version of the data, from which we could more easily use inverse RL methods to infer the objective function.
There is an extensive literature on how neural networks could perform RL [14,15,21,[42][43][44].Our focus here was different: we sought to use tools from RL and inverse RL to infer the function performed by a recurrent neural network.Thus, we do not assume the network receives an explicit reward signal: the reward function is simply a way of expressing which states of the network are useful for performing a given function.In contrast to previous work, we treat each neuron as an independent agent, which optimises their responses to maximise the reward achieved by the network, given a constraint on how much they can encode about their inputs.As well as being required for biological realism, the coding constraint has the benefit of making the inverse RL problem well-posed.Indeed, under certain assumptions, we show that it is possible to write a closed form expression for the reward function optimised by the network, given its steady-state distribution (Eq 40).
Our framework relies on several assumptions about the network dynamics.First, we assume that the network has Markov dynamics, such that its state depends only on the preceding time-step.To relax this assumption, we could redefine the network state to include spiking activity in several time-steps.For example, we could thus include the fact that neurons are unlikely to fire two spikes within a given temporal window, called their refractory period.Of course, this increase in complexity would come at the expense of decreased computational tractability, which may necessitate approximations.Second, we assume the only constraint that neurons face is a 'coding cost', which limits how much information they encode about other neurons and external inputs.In reality, biological networks face many other constraints, such as the metabolic cost of spiking and constraints on the wiring between cells.Some of these constraints could be included explicitly as part of the inference process.For example, by assuming a specific form of approximate value function (e.g. a quadratic function of the neural responses), we can include explicit constraints about the network connectivity (e.g.pairwise connections between neurons).Other constraints (e.g. the metabolic cost of firing a spike), that are not assumed explicitly can be incorporated implicitly as part of the inferred reward function (e.g. a lower inferred reward for metabolically costly states, associated with high firing rates).Finally, we assume that all (i) neurons in the network jointly optimise the same reward function, and (ii) each neuron is optimised with full knowledge of the dynamics of the rest of the network.Future work could look at incorporating ideas from the field of multi-agent reinforcement which addresses cases where these assumptions are not necessarily true [45,46].
In addition to assumptions about the network dynamics, for large networks we will also need to assume a particular parametric form for the reward function.For example, in the context of efficient coding (Fig 4), this is equivalent to assuming a particular form for the decoder model (e.g. a linear decoder) that 'reads-out' the encoded variable from neural responses [10].To test the validity of such assumptions, we could see how well the reward function, inferred in one context, was able to generalise to predict neural responses in other contexts.
For our framework to make predictions, the network must adapt its dynamics to perform the same function under different internal/external constraints.Previous work suggests that this may hold in certain cases, in response to changes in stimulus statistics [47,48], or optogonetic perturbations [49].Even so, it may be that real neural networks only partially adapt to their new environments.It would thus be interesting, in the future, to extend our framework to deal with this.For example, we could assess whether contextual changes in the neural responses enable the network to ascend the gradient of the objective function (given the new constraints) as predicted by our model.
A central tenet of our work is that the neural network has evolved to perform a specific function optimally, given constraints.As such, we could obtain misleading results if the recorded network is only approximately optimal.To deal with this, recent work by one of the present authors [50] proposed a framework in which the neural network is assumed to satisfy a known optimality criterion approximately.In this work, the optimality criterion is formulated as a Bayesian prior, which serves to nudge the network towards desirable solutions.This contrasts with the work presented here, where the network is assumed to satisfy an unknown optimality criterion exactly.Future work could explore the intersection between these two approaches, where the network is assumed to perform an unknown optimality criterion approximately.In this case, one will likely need to limit the space of possible reward functions, so that inference problem remains well-posed.
Our work unifies several influential theories of neural coding, that were considered separately in previous work.For example, we show a direct link between entropy-regularised RL [17][18][19][20] (Fig 1), ring-attractor networks [32][33][34] (Fig 5 ), and efficient sensory coding [7][8][9][10][11][12][13] ( Fig 4).Further, given a static network without dynamics, our framework is directly equivalent to rate-distortion theory [27,28].Many of these connections are non-trivial.For example, the problem of how to train a network to efficiently code its inputs remains an open avenue of research.Thus, the realisation that efficient sensory coding by a recurrent network can be formulated as a multi-agent RL problem could help develop of future algorithms for learning efficient sensory representations (e.g., in contrast with brute force numerical optimization as in [9]).Indeed, recent work has shown how, by treating a feed-forward neural network as a multi-agent RL system, one can efficiently train the network to perform certain tasks using only local learning rules [51].However, while interesting in its own right, the generality of our framework means that we could potentially apply our theory to infer the function performed by diverse neural circuits, that have evolved to perform a broad range of different functional objectives.This contrasts with previous work, where neural data is often used to test a single top-down hypothesis, formulated in advance.

Entropy-regularised RL
We consider a Markov Decision Process (MDP) where each state of the agent s, is associated with a reward, r(s).At each time, the agent performs an action, a, sampled from a probability distribution, π(a|s), called their policy.A new state, s 0 , then occurs with a probability, p(s 0 |s, a).
We seek a policy, π(a|s), that maximises the average reward, constrained on the mutual information between between actions and states.This corresponds to maximising the Lagrangian: L p ¼ hrðsÞi p p ðsÞ À lI p ðA; SÞ ð11Þ ¼ hrðsÞ À lc p ðsÞi p p ðsÞ ð12Þ where λ is a lagrange-multiplier that determines the strength of the constraint, and c π (s) = D KL [π(a|s)kp(a)].Note that while in the rest of the paper we consider continuous tasks, where the Lagrangian is obtained by averaging over the steady-state distribution, our framework can also be applied without little changes to finite tasks, which occur within a time window, and where the Lagrangian is given by: L Þ. Unlike the continuous task described above, the optimal network dynamics may not have a corresponding equilibrium distribution [16].Now, let us can define a value function: v p ðsÞ ¼ rðsÞ À lc p ðsÞ À L p þhrðs 0 Þ À lc p ðs 0 Þi p p ðs 0 jsÞ À L p þhrðs 00 Þ À lc p ðs 00 Þi p p ðs 00 jsÞ À L where s, s 0 and s 00 denote the agent's state in three consecutive time-steps.We can write the following Bellmann equality for the value function: v p ðsÞ ¼ rðsÞ À lc p ðsÞ À L p þ hv p ðs 0 Þi p p ðs 0 jsÞ : ð14Þ Now, consider the following greedy update of the policy, π(a|s): p � ðajsÞ ¼ arg max p 0 ðrðsÞ À L p À lc p 0 ðsÞ þ hv p ðs 0 Þi p p 0 ðs 0 jsÞ Þ ð15Þ ¼ arg max p 0 ðÀ lc p 0 ðsÞ þ hv p ðs 0 Þi p p 0 ðs 0 jsÞ Þ ð16Þ To preform this maximisation, we write the following Lagrangian: where the last term is required to enforce the constraint that ∑ a π(a|s) = 1.Setting the derivative to zero with respect to γ(s) and π(a|s), we have: We can solve these equations to obtain the optimal greedy update: l hv p ðs 0 Þi pðs 0 js;aÞ : ð20Þ where Z π (s) is a normalisation constant.
Using the policy improvement theorem [16], we can show that this greedy policy update is guaranteed to increase L π .To see this, we can substitute π � into the Bellmann equation to give the following inequality: v p ðsÞ � rðsÞ À lc p � ðsÞ À L p þ hv p ðs 0 Þi p p � ðs 0 jsÞ ð21Þ L p � rðsÞ À lc p � ðsÞ þ hv p ðs 0 Þi p p � ðs 0 jsÞ À v p ðsÞ ð22Þ Next, we take the average of both sides with respect to the steady-state distribution, p π � (s): Thus, repeated application of the Bellmann recursion (Eq 14) and greedy policy update (Eq 20) will return the optimal policy, π � (a|s), which maximises L π .
Inverse entropy-regularized RL.We can write the Bellmann recursion in Eq 14 in vector form: where v, c and r are vectors with elements, v s � v(s), c s � c(s), and r s � r(s).P is a matrix with elements P ss 0 = p π (s 0 |s).We have defined r 0 � hr(s)i pπ (s) and c 0 � hc(s)i pπ (s) (and thus L π = r 0 −λc 0 ).Rearranging: We can solve this system of equations (up to an arbitrary constant, v 0 ) to find an expression for v as a linear function of the reward: Substituting into Eq 20, we can express the agent's policy directly as a function of the reward: where π a is a vector with elements, (π a ) s � π(a|s) and P a is a matrix, with elements (P a ) ss 0 � p(s 0 |a, s).
To infer the reward function, r(s) (up to an irrelevant constant and multiplicative factor, λ), we use the observed policy, π(a|s) and transition probabilities, p(s 0 |a, s), to estimate b, A, P a and p a .We then perform numerical optimisation to find the reward that maximises the loglikelihood of the optimal policy in Eq 28, h log p � ðajsÞi D , averaged over observed data, D.

Optimising a neural network via RL
We consider a recurrent neural network, with n neurons, each described by a binary variable, σ i = −1/1, denoting whether a given neuron is silent/fires a spike in each temporal window.The network receives an external input, x.The network state is described by a vector of n binary values, σ = (σ 1 , σ 2 , . .., σ n ) T .Both the network and input are assumed to have Markov dynamics.Neurons are updated asynchronously, by updating a random neuron at each time-step with probability p i ðs 0 i jsÞ.The network dynamics are thus described by: where dðs 0 j ; s j Þ ¼ 1 if s 0 j ¼ s j and 0 otherwise.Equivalently, we can say that at each time, a set of proposed updates, σ, are independently sampled from σi � Q i p i ðs i jσÞ, and then a neuron i is selected at random to be updated, such that s 0 i si .We define a reward function, r(σ, x), describing which states are 'desirable' for the network to perform a given function.The network dynamics are said to be optimal if they maximise the average reward, hr(σ, x)i p(σ,x) given a constraint on how much each neuron encodes about its inputs P n i Iðs i ; σ; xÞ.This corresponds to maximising the objective function: We can decompose the transition probability for the network (Eq 29), into the probability that a given neuron proposes an update, si , given the network state, σ, and the probability of the new network state, σ 0 , given si and σ: Thus, the problem faced by each neuron, of optimising p i ðs i jσ; xÞ so as to maximise L, is equivalent to the MDP described in Methods section, where the action a, and state s correspond to the neuron's proposed update si and state of the network and external inputs fs; xg.Thus, we can follow the exact same steps as in Methods section, to show that p i ðs i jσÞ is optimised via the following updates: vðσ; xÞ rðσ; xÞ À lcðσ; xÞ þ hvðσ 0 ; x 0 Þi pðσ 0 ;x 0 jσ;xÞ À L ð34Þ where σ /i denotes the state of all neurons except for neuron i.As updating the policy for any given neuron increases the objective function, L, we can alternate updates for different neurons to optimise the dynamics of the network.

Inferring network function via inverse RL
After convergence, we can substitute the expression for the optimal policy into the Bellman equality, to obtain: Rearranging, we have: Thus, if we can infer the value function from the observed neural responses, then we can recover the associated reward function through Eq 37.
To derive an expression for the reward, we first consider the case where there is no external input.In this case, the optimal neural dynamics (Eq 35) correspond to Gibbs sampling from: Rearranging, we have a closed-form expression for the value function, v(σ), in terms of the steady-state distribution: We can then combine Eqs 39 and 37 to obtain a closed-form expression for the reward function (up to an irrelevant constant): Since we don't know the true value of λ, we can simply set it to unity.In this case, our inferred reward will differ from the true reward by a factor of 1 l .However, since dividing both the reward and coding cost by the same factor has no effect on the shape of the objective function, L (but only alters its magnitude), this will not effect any predictions we make using the inferred reward.
With an external input, there is no closed-form solution for the value function.Instead, we can infer v(σ, x) numerically by maximising the log-likelihood, h log p � ðσ 0 jσ; xÞi D ¼ h log P n i¼1 p � i ðs 0 i jσ; xÞdðσ 0 =i ; σ =i Þi D , where p � i ðs 0 i jσ; xÞ denote optimal response probabilities.Once we know v(σ, x) we can compute the reward from Eq 37.

Approximate method for larger networks
RL model.To scale our framework to larger networks we approximate the value function, v(σ, x), by a parametric function of the network activity, σ and input, x.Without loss of generality, we can parameterise the value function as a linear combination of basis functions: v ϕ (σ, x) � ϕ T f(σ, x).From Eq 36, if the network is optimal, then the value function equals: 1 nl hϕ T f ðs;x 0 Þi pðx 0 jxÞ : ð41Þ In the exact algorithm, we updated the value function by setting it equal to vðσ; xÞ (Eq 34).Since in the parametric case this is not possible, we can instead update ϕ to minimise: where v� ϕ ðσ; xÞ is the target value function, defined as in Eq 41, with parameters, � ϕ.We follow the procedure set out in [18,19] to transform this into a stochastic gradient descent algorithm.First, we perform n batch samples from the current policy.Next, we perform a stochastic gradient descent update: where η is a constant that determines the learning rate.Finally, after doing this n epoch times, we update the target parameters, � ϕ ϕ.These steps are repeated until convergence.Inverse RL.We can infer the parameters of the value function, ϕ, by maximising the loglikelihood: h log p ϕ ðσ 0 jσ; xÞi D .We can choose the form of the value function to ensure that this is tractable.For example, if the value function is quadratic in the responses, then this corresponds to inferring the parameters of a pairwise Ising model [30,31].
Just as we did for the value function, we can express the reward function as a linear combination of basis functions: r(σ, x) = θ T g(σ, x).Thus, Eq 45 becomes: If the reward function has the same number of parameters than the approximate value function (i.e.f(σ) and g(σ) have the same size), then we can solve this equation to find θ.Alternatively, if the reward function has more parameters than the value function, then we require additional assumptions to unambiguously infer the reward.
Pairwise coupled network.We considered a network of 12 neurons arranged in a ring.We defined a reward function that was equal to 1 if exactly 4 adjacent neurons were active, and 0 otherwise.We defined the coding cost as described in the previous section, to penalise deviations from the population averaged mean firing rate.
We approximated the value function by a quadratic function, We optimised the parameters of this value function using the algorithm described previously, in the Methods section entitled 'Approximate method for larger networks', with λ = 0.05.We used batches of n batch = 40 samples, and updated the target parameters after every n epoch = 100 batches.
Concretely, substituting in the quadratic value function described above into Eq 43, we obtain the following updates for the network couplings: where, We inferred the reward function from the inferred network couplings, J and h as described in the Methods section entitled 'Approximate methods for larger networks'.As described previously, this problem is only well-posed if we assume a low-d parametric form for the reward function, or add additional assumptions.We therefore considered several different sets of assumptions.For our initial 'sparse model', we set up a linear programming problem in which we minimised l 1 = ∑ σ r(σ), under the constraint that the reward was always greater than 0 while satisfying the optimality criterion given by Eq 47.For the 'pairwise model' we assumed that r = ∑ i,j W ij σ i σ j .We fitted the parameters, W ij , so as to minimise the squared difference between the left and right hand side of Eq 47.Finally, for the 'global model' we assumed that r = ∑ j δ j,m W j , where m is the total number of active neurons and δ ij is the kronecker-delta.Parameters, W j were fitted to the data as for the pairwise model.
Finally, for the simulations shown in Fig 5, panels B-D, we ran the optimisation with λ = 0.1, 0.01 and 0.1, respectively.For panel 3C we removed connections between neurons separated by a distance of 3 or more on the ring.For panel 3D we forced two of the neurons to be continuously active.

Fig 1 .
Fig 1.General approach.(A)Top-down models use an assumed objective function to derive the optimal neural dynamics.The inverse problem is to infer the objective function from observed neural responses.(B) RL uses an assumed reward function to derive an optimal set of actions that an agent should perform in a given environment.Inverse RL infers the reward function from the agent's actions.(C-D) A mapping between the neural network and textbook RL setup.(E) Both problems can be formulated as MDPs, where an agent (or neuron) can choose which actions, a, to perform to alter their state, s, and increase their reward.(F) Given a reward function and coding cost (which penalises complex policies), we can use entropy-regularised RL to derive the optimal policy (left).Here we plot a single trajectory sampled from the optimal policy (red), as well as how often the agent visits each location (shaded).Conversely, we can use inverse RL to infer the reward function from the agent's policy (centre).We can then use the inferred reward to predict how the agent's policy will change when we increase the coding cost to favour simpler (but less rewarded) trajectories (top right), or move the walls of the maze (bottom right).
Fig 1F, lower right), or increase the coding cost so as to favour simpler (but less rewarded) trajectories (Fig 1F, upper right).

Fig 2 .
Fig 2. Training and inferring the function performed by a neural network.(A) A recurrent neural network receives a binary input, x. (B) The reward function equals 1 if the network fires 2 spikes when x = −1, or 6 spikes when x = 1.(C) After optimisation, neural tuning curves depend on the input, x, and total spike count.(D) Simulated dynamics of the network with 8 neurons (left).The total spike count (below) is tightly peaked around the rewarded values.(E) Using inverse RL on the observed network dynamics, we infer the original reward function used to optimise the network from its observed dynamics.(F) The inferred reward function is used to predict how neural tuning curves will adapt depending on contextual changes, such as varying the input statistics (e.g.decreasing p(x = 1)) (top right), or cell death (bottom right).Thick/thin lines show adapted/original tuning curves, respectively.https://doi.org/10.1371/journal.pone.0248940.g002 x, is equal to 1 (Fig 2F, upper), or remove a cell from the network while keeping other aspects of the simulation unchanged (Fig 2F, lower) (see Methods).
Fig 3A shows how the correlation between the inferred and true reward increases with the amount of data samples used to infer the reward.(Note that each discrete time-step is considered to be one data sample.)As the number of samples is increased, the distribution of inferred rewards becomes more tightly peaked around two values (Fig 3B), reflecting the fact that the true reward function was binary.Of course, with real neural data we will not have access to the 'true' reward function.In this case, we can test how well our inferred reward function is able to predict neural responses in different conditions.Fig 3C and 3D shows how the predicted response distribution (when we alter the input statistics by increasing the probability that the binary input x = 1, Fig 3C, or remove cells, Fig 3D) becomes more accurate as we increase the number of samples used to estimate the reward function.

Fig 3 .
Fig 3. Inferring the reward from limited data.(A) The r 2 -goodness of fit between the true reward, and the reward inferred using a finite number of samples (a sample is defined as an observation of the network state at a single timepoint).The solid line indicates the r 2 value averaged over 20 different simulations, while the shaded areas indicate the standard error on the mean.(B) Distribution of rewards inferred from a variable numbers of data samples.As the number of data samples is increased, the distribution of inferred rewards becomes more sharply peaked around 0 and 1 (reflecting the fact that the true reward was binary).(C) The KL-divergence between the optimal response distribution with altered input statistics (see Fig 2F, upper) and the response distribution predicted using the reward inferred in the initial condition from a variable number of samples.The solid line indicates the KL-divergence averaged over 20 different simulations, while the shaded areas indicate the standard error on the mean.A horizontal dashed line indicates the KL-divergence between the response distribution with biased input and the original condition (that was used to infer the reward).(D) Same as panel (C), but where instead of altering the input statistics, we remove cells from the network (see Fig 2F, lower).https://doi.org/10.1371/journal.pone.0248940.g003 Fig 4C shows how each neuron's spiking probability varies with both the number of active pixels and number of spikes fired by other neurons.Fig 6D shows how the optimal readout, p(y|σ), depends on the number of spiking neurons (Fig 4D)

Fig 4 .
Fig 4. Efficient coding and inverse RL. (A) The neural code was optimised to efficiently encode an external input, x, so as to maximise information about a relevant stimulus feature y(x).(B) The input, x consisted of 7 binary pixels.The relevant feature, y(x), was equal to 1 if >3 pixels were active, and -1 otherwise.(C) Optimising a network of 7 neurons to efficiently encode y(x) resulted in all neurons having identical tuning curves, which depended on the number of active pixels and total spike count.(D) The posterior probability that y = 1 varied monotonically with the spike count.(E) The optimised network encoded significantly more information about y(x) than a network of independent neurons with matching stimulus-dependent spiking probabilities, p(σ i = 1|x).The coding cost used for the simulations in the other panels is indicated by a red circle.(F-G) We use the observed responses of the network (F) to infer the reward function optimised by the network, r(σ, x) (G).If the network efficiently encodes a relevant feature, y (x), then the inferred reward (solid lines) should be proportional to the log-posterior, logp(y(x)|σ) (empty circles).This allows us to (i) recover y(x) from observed neural responses, (ii) test whether this feature is encoded efficiently by the network.(H) We can use the inferred objective to predict how varying the input statistics, by reducing the probability that pixels are active, causes the population to split into two cell types, with different tuning curves and mean firing rates (right).https://doi.org/10.1371/journal.pone.0248940.g004

Fig 5 .
Fig 5. Pairwise coupled network.(A)We optimized the parameters of a pairwise coupled network, using a reward function that was equal to 1 when exactly 4 adjacent neurons were simultaneously active, and 0 otherwise.The resulting couplings between neurons are schematized on the left, with positive couplings in red and negative couplings in blue.The exact coupling strengths are plotted in the centre.On the right we show an example of the network dynamics.Using inverse RL, we can infer the original reward function used to optimise the network from its observed dynamics.We can then use this inferred reward to predict how the network dynamics will vary when we increase the coding cost (B), remove connections between distant neurons (C) or selectively activate certain neurons (D). https://doi.org/10.1371/journal.pone.0248940.g005

Fig 6 .
Fig 6.Effect of assuming different types of reward function.We compared the inferred reward when we assumed a sparse model (i.e. a small number of states associated with non-zero positive reward) a pairwise model (i.e. the reward depends on the first and second-order response statistics) and a global model (i.e. the reward depends on the total number of active neurons only).(A) r 2 goodness of fit between the true and the inferred reward, assuming a sparse, pairwise, or global model.(B) The KL-divergence between the optimal response distribution with high coding cost (see Fig 5B) and the response distribution predicted using the reward inferred in the initial condition, assuming a sparse, pairwise, or global model.A horizontal dashed line indicates the KL-divergence between the response distribution with high coding cost and the original condition (that was used to infer the reward).https://doi.org/10.1371/journal.pone.0248940.g006 hf ðσÞgðσÞT i D θ ¼ hf ðσ; xÞr ϕ ðσ; xÞi D : ð47Þ lc p ðs 0 Þi p p ðs 0 jsÞ À L p þhrðs 00 Þ À lc p ðs 00 Þi p p ðs 00 js 0 Þp p ðs 0 jsÞ 1 l hv p ðs 0 Þi pðs 0 js;aÞ ; ð3Þwhere v π (s) is the 'value' associated with each state, defined as the total return predicted in the future if the agent starts in a given state, minus the average return, L π : v p ðsÞ ¼ rðsÞ À lc p ðsÞ À L p þhrðs 0 Þ À hrðsÞ À lc p � ðsÞi p p � ðsÞ þ hv p ðs 0 Þi p p � ðs 0 jsÞp p � ðsÞ À hv p ðsÞi p p � ðsÞ ð23Þ � hrðsÞ À lc p � ðsÞi p p � ðsÞ � L � p : xÞi pðσ;xÞ À l KL ½p i ðs i jσ; xÞ k p i ðs i Þ� is the coding cost associated with each state, and penalises deviations from each neuron's average firing rate.