Fig 1.
A quasi-mathematical description of the framework of Active Inference (based on [14]).
The flow chart on the right shows the sequence of updates occurring over a single trial consisting of time-steps t = 1…T.
Fig 2.
Mathematical outline of the framework of Active Inference.
Fig 3.
The cycle of updates under Active Inference, expanded to show the calculation of the state-action prediction error and the application of model decay.
Left: non-mathematical ‘cartoon’ explanation of the cycle of updates. Right: more detailed update cycle, to be compared with the version shown in Fig 1.
Fig 4.
Simple go/no-go task modelled under AI.
(a) Structure of the task (see main text). (b)–(d) The state-action heatmap showing inferences on the agent’s state over a rare ‘Go!’ trial. Large updates are required at t = 2, after the animal receives the ‘go’ cue, which forces it to update its action plans and state inferences. This update is proposed to cause a large, time-specific input into LC, which causes a sudden phasic burst of LC activity. The lower part of the Fig shows the full modelling of the go/no-go task, with components as described in Fig 1.
Fig 5.
Plot of state-action prediction error, simulated LC spiking and behaviour during 100 trials of the go/no-go task (for agents with a fixed value of the model decay parameter α, not linked to state-action prediction error).
Each point within the task is assumed to last 1 s and is associated with a single state-action prediction error. In (a) the raw prediction error is extracted for t = 2, when the animal receives a cue (this is the error between t = 1 and t = 2), and for t = 3, when the animal receives feedback on its response to the cue (the error between t = 2 and t = 3). Because the prediction error explicitly evaluates differences between update cycles, there is no error available for the first time point. Each trial has therefore been collapsed to two time points, each lasting 1 s. In (a) the occurrence of the ‘go’ cue causes strong peaks in prediction error. To visualise LC firing, a firing probability p is calculated for each second using the state-action prediction error (SAPE) as the input into a logistic function, so that p = 1/(1 + exp(−k(SAPE − m))), where k = 8 and m is as above. Each second was then further split into 0.1 s bins, during which the unit generated a single spike with probability p. This gives a physiologically reasonable [1,22,23] maximum firing rate of 10 Hz if p = 1. The resulting simulated LC firing rate is plotted in (b), which shows phasic LC activation when the ‘go’ cue is heard. Plot (c) is a graphical representation of behaviour during the task at times t = 2 and t = 3 for each trial, in which the position of the coloured block describes the agent’s location and the colour shows the agent’s observation after moving.
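The spike-generation procedure described in this caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the standard logistic form p = 1/(1 + exp(−k(SAPE − m))), uses k = 8 from the caption, and takes m = 0.5 and the SAPE values purely as placeholders, since neither is given here.

```python
import math
import random

def firing_probability(sape, k=8.0, m=0.5):
    """Logistic map from state-action prediction error (SAPE) to spike probability.

    k = 8 follows the caption; m = 0.5 is a placeholder midpoint (assumption).
    """
    return 1.0 / (1.0 + math.exp(-k * (sape - m)))

def simulate_spikes(sape_per_second, bins_per_second=10, k=8.0, m=0.5, rng=None):
    """Split each 1 s time point into 0.1 s bins; each bin holds one spike
    with probability p, so the rate is capped at 10 Hz when p = 1."""
    rng = rng or random.Random(0)
    spikes = []
    for sape in sape_per_second:
        p = firing_probability(sape, k, m)
        spikes.extend(1 if rng.random() < p else 0 for _ in range(bins_per_second))
    return spikes

# Hypothetical SAPE values for two trials (cue step, then feedback step).
sape_trace = [0.1, 0.05, 1.2, 0.3]
spikes = simulate_spikes(sape_trace)

# Spikes per second, i.e. the simulated firing rate in Hz for each time point.
rates_hz = [sum(spikes[i:i + 10]) for i in range(0, len(spikes), 10)]
print(rates_hz)
```

Because every 0.1 s bin can hold at most one spike, no time point can exceed 10 Hz regardless of the size of the prediction error.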
Fig 6.
Modelling a 3-arm explore/exploit task under Active Inference.
(a) shows the mathematical structure of the task. There are seven states: one neutral starting point and three arm locations, each of which can be combined with either reward or no reward. There are seven observations; here these have a 1-to-1 mapping to states (A matrix). Actions 1–4 simply move the agent to locations 1–4 respectively. The probability of obtaining a reward in a given arm (p2 for action 2, above) is held static for a fixed number of trials, with one arm granting a reward with 90% probability and the others with 10% probability. This is then switched, so that the agent must adjust its priors and its behaviour. (b) shows the state-action prediction errors and simulated LC responses over a typical run of 100 trials for an agent with a fixed value of the model decay parameter (α = 16).
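The task structure in (a) can be sketched as a small environment. This is an illustrative sketch only: the state ordering, the assumption that location 1 is the neutral start, the block length of 25 trials and all function names are hypothetical choices, not taken from the source.

```python
import random

N_STATES = 7  # 1 neutral start + 3 arms x {reward, no reward}

# Identity likelihood (A) matrix: each state produces its own observation.
A = [[1.0 if o == s else 0.0 for s in range(N_STATES)] for o in range(N_STATES)]

def state_index(arm, rewarded):
    """State layout (an assumption): 0 = start; arm a in {1, 2, 3} maps to
    2a - 1 if rewarded and 2a if unrewarded."""
    return 2 * arm - 1 if rewarded else 2 * arm

def step(action, best_arm, rng, p_high=0.9, p_low=0.1):
    """Actions 1-4 move to locations 1-4; location 1 is taken to be the start."""
    if action == 1:
        return 0
    arm = action - 1
    p = p_high if arm == best_arm else p_low
    return state_index(arm, rng.random() < p)

def best_arm_schedule(n_trials, block_length, rng):
    """Hold the high-probability arm fixed for block_length trials, then switch."""
    schedule, arm = [], rng.choice([1, 2, 3])
    for trial in range(n_trials):
        if trial > 0 and trial % block_length == 0:
            arm = rng.choice([a for a in (1, 2, 3) if a != arm])
        schedule.append(arm)
    return schedule

rng = random.Random(1)
schedule = best_arm_schedule(n_trials=100, block_length=25, rng=rng)
# An agent that always picks action 2 (arm 1) sees only states 1 and 2.
outcomes = [step(action=2, best_arm=b, rng=rng) for b in schedule]
```

Keeping the reward schedule separate from the step function makes it easy to swap the fixed block length for other switching rules.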
Fig 7.
The explore/exploit task simulated with fixed and flexible values of model decay.
(a) and (b) show the behavioural output from the explore/exploit task for agents with a fixed α parameter, specifically α = 32 (slow model decay) or α = 2 (fast model decay). The agent with α = 2 is hyperflexible in its behaviour and changes its strategy after single failed trials. In contrast, the α = 32 agent is inflexible and persists in seeking reward in the same location despite multiple failed trials. (c) and (d) show the outcome of simulations involving fixed-α agents contrasted with the performance of an agent with a flexible value of α set by the state-action prediction error. Each simulation consisted of 150 trials in which the location of the high-probability arm changed either every 15 or every 50 trials. The simulation was repeated 50 times. (c) and (d) show the average reward obtained in bins of 20 trials (shaded errors show standard error of the mean), alongside the mean total reward gained by each agent (error bars show S.E.M.; ***P<0.0001, one-way ANOVA followed by Tukey post hoc test). The less stable and more stable environments favour the α = 2 and α = 32 agents respectively; however, the flexible agent is able to perform as well (or better) in both scenarios. In (e) the location of the reward changes after random intervals, and the flexible agent clearly outperforms both of the fixed-α agents.
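The binned averages and S.E.M. bands in (c) and (d) could be computed along the following lines. This is a hedged sketch: the caption does not say exactly how the averaging was done, so taking the per-repeat mean in each bin and then the S.E.M. across repeats is an assumption, and the data below are random placeholders rather than model output.

```python
import math
import random
import statistics

def binned_mean_sem(rewards, bin_size=20):
    """rewards[r][t] is the reward (0 or 1) on trial t of repeat r.

    Returns one (mean, sem) pair per bin, computed across repeats.
    A final partial bin is kept if the trial count is not a multiple of bin_size.
    """
    n_trials = len(rewards[0])
    summary = []
    for start in range(0, n_trials, bin_size):
        per_repeat = [statistics.mean(run[start:start + bin_size]) for run in rewards]
        mean = statistics.mean(per_repeat)
        sem = statistics.stdev(per_repeat) / math.sqrt(len(per_repeat))
        summary.append((mean, sem))
    return summary

# Placeholder data: 50 repeats x 150 trials of coin-flip rewards (not model output).
rng = random.Random(0)
rewards = [[1 if rng.random() < 0.5 else 0 for _ in range(150)] for _ in range(50)]
summary = binned_mean_sem(rewards)
```

With 150 trials and bins of 20, this produces seven full bins plus one final bin of 10 trials.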
Fig 8.
Reversal learning during the go/no-go task.
(a)–(c) show the performance of an agent with a value of model decay determined by state-action prediction error during a reversal of cues in the go/no-go task. The agent begins with a well-trained understanding (via 750 trials of training) that cue 2 indicates that a reward is available. At trial 35 (t = 70) the cue/context relationship is reversed, and the agent must now learn that cue 1 indicates the ‘Go’ context. This initially causes numerous unsuccessful trials, violating the learnt model and producing high prediction errors (a). Note that prediction errors are initially elevated at both timepoints in each trial because both the previously rare cue and the subsequent lack of reward are unexpected. These prediction errors result in a lowering of the parameter decay factor (b), which in turn flattens the agent’s priors, causing more variability in behaviour. Eventually the agent learns the new contingencies and the model stabilises, with the re-emergence of phasic bursts of LC activity on ‘Go’ trials (a, c). From trial 125 onwards, the peak of phasic activity begins to transition towards the presentation of the cue rather than the reward. Plot (d) is a graphical representation of behaviour during the task at times t = 2 and t = 3 for each trial, in which the position of the coloured block describes the agent’s location and the colour shows the agent’s observation after moving. (e) shows performance over 50 repeats of the reversal learning task shown in (a), for agents with a fixed or flexible value of α. All agents begin with a near-optimal d’ value (measured over bins of 20 trials). However, only the agent with α determined by the state-action prediction error is able to return to optimal levels of performance within the 300 trials shown.
(f) and (g) show characteristics of the mean prediction error response to ‘go’ and ‘no-go’ cues during the static (non-reversed) task as reward and probability parameters are varied, for agents with a flexible value of α ((f) ***P<0.0001; one-way ANOVA between different c values, followed by Tukey post hoc test; (g) **P<0.001, ***P<0.0001; two-tailed Student’s t-test between go/no-go contexts for fixed cue probabilities, one-way ANOVA followed by Tukey post hoc test for ‘go’ peaks with different cue probabilities).
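The d’ measure used in (e) can be illustrated with the standard signal-detection formula, d’ = z(hit rate) − z(false-alarm rate), evaluated over bins of 20 trials. This is a sketch under assumptions: the caption does not state how d’ was computed, and the log-linear correction (adding 0.5 to each count) and the function names are illustrative choices, not taken from the source.

```python
from statistics import NormalDist

def d_prime(hits, misses, false_alarms, correct_rejections):
    """d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to each count) keeps rates of 0 or 1
    finite; this correction is an assumption, not taken from the source.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(fa_rate)

def d_prime_by_bin(trials, bin_size=20):
    """trials is a list of (is_go, responded) pairs, one per trial."""
    values = []
    for start in range(0, len(trials), bin_size):
        chunk = trials[start:start + bin_size]
        hits = sum(1 for go, resp in chunk if go and resp)
        misses = sum(1 for go, resp in chunk if go and not resp)
        fas = sum(1 for go, resp in chunk if not go and resp)
        crs = sum(1 for go, resp in chunk if not go and not resp)
        values.append(d_prime(hits, misses, fas, crs))
    return values

# Hypothetical behaviour: a perfect bin of 20 trials, then a bin at chance.
perfect = [(True, True)] * 4 + [(False, False)] * 16
chance = [(True, True), (True, False)] * 2 + [(False, True), (False, False)] * 8
print(d_prime_by_bin(perfect + chance))
```

A bin in which the agent responds equally often on ‘go’ and ‘no-go’ trials gives d’ = 0, while perfect discrimination gives a large positive value bounded by the correction.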