Models that learn how humans learn: the case of depression and bipolar disorders

Computational models of learning and decision-making processes in the brain play an important role in many domains. Such models typically have a constrained structure and make specific assumptions about the underlying human learning processes; these may make them underfit observed behaviours. Here we suggest an alternative method based on learning-to-learn approaches, using recurrent neural networks (RNNs) as a flexible family of models that have sufficient capacity to represent the complex learning and decision-making strategies used by humans. In this approach, an RNN is trained to predict the next action that a subject will take in a decision-making task, and in this way, learns to imitate the processes underlying subjects' choices and their learning abilities. We demonstrate the benefits of this approach with a new dataset containing behaviour of unipolar depression (n=34), bipolar (n=33) and control (n=34) participants in a two-armed bandit task. The results indicate that the new approach is better than baseline reinforcement-learning methods in terms of overall performance and its capacity to predict subjects' choices. We show that the model can be interpreted using off-policy simulations, and thereby provide a novel clustering of subjects' learning processes – something that often eludes traditional approaches to modelling and behavioural analysis.

Introduction

A computational model of learning in decision-making can be regarded as a mathematical function that inputs past experiences (such as chosen actions and the rewards that result), and outputs predictions of the actions that will be taken in the future (e.g., Busemeyer and Stout, 2002; Dezfouli et al., 2007; Montague et al., 2012; Daw et al., 2006). Typically, experimenters specify a set of assumptions that define and constrain the general structure and form of a whole class of computational models, leaving free a set of parameters that are estimated from the data. For example, in value-based decision-making, a common assumption is that subjects' choices are determined in a noisy manner by learned action values (often called Q values; Watkins, 1989), which are updated given experience of rewards. This model has two free parameters: the level of noise and the learning rate governing the updates.

Despite the flexibility afforded by the free parameters, a class of computational models can only capture learning processes that fall within the boundaries of the assumptions embedded in its structure. If the actual learning and choice method used by the subjects is more complex or otherwise different from that implied by the structure of the model, then it will under- and/or otherwise mis-fit the data.
For example, if the model assumes that subjects are using a single learning-rate parameter to update the values of actions, but in reality rewards and punishments are modulated by two different learning-rates, then the model will fail to provide a complete representation of the learning processes (e.g., Piray et al., 2014). Because of this, in practice, the process of computational modelling typically involves various forms of analysis in order to confirm that the assumptions about the behaviour are correct. If they are found wanting, then different assumptions must be made, leading to different models that must be selected between to find the one that misfits least. This iterative process has become standard scientific practice for model development, and has been influential in domains such as cognitive modelling, computational psychiatry (e.g., Busemeyer and Stout, 2002; Dezfouli et al., 2007; Montague et al., 2012), and model-based analysis of neural data (e.g., Daw et al., 2006).

By contrast, we consider an alternative approach to modelling involving minimal assumptions about the underlying learning processes used by subjects. Instead, these are captured by a very flexible class of models on the basis of observing the information the subjects see and the choices they make, and predicting the latter. This process is known as learning-to-learn (Hochreiter et al., 2001; Wang et al., 2016; Duan et al., 2016; Weinstein and Botvinick, 2017). Since the models are flexible, they can come automatically to characterize the major behavioural trends exhibited by the subjects, without requiring tweaking and engineering explicitly based on behavioural analysis of data. This approach is particularly useful when major trends in the data are not apparent in behavioural summary statistics. Even if such trends are visible, it might be complicated to create compact models that encompass them adequately.

In particular, we consider as our flexible class recurrent neural networks (RNNs), which are known to have sufficient capacity to represent a wide range of learning processes used by humans (and other animals). Learning to learn involves adjusting the weights in the networks so they can predict the choices that subjects will make as those subjects themselves learn. Once the weights have been trained, they are frozen, and the model is simulated in the actual learning task to assess its predictive capacity and to gain insights into human behaviour.

To illustrate and evaluate this approach, we focus on a relatively simple decision-making task in which subjects had a choice between two key presses that were rewarded probabilistically (a two-armed bandit task). Data from three groups were collected: healthy subjects, and patients with depression and bipolar disorders. The results showed that the new method was able to learn subjects' decision-making strategies more accurately than baseline models. Furthermore, we show that off-policy simulations of the model help visualise, and thus uncover, the properties of the learning process behind subjects' actions.

Depression scores were significantly higher in the depression group compared to the bipolar and healthy groups, and higher in the bipolar group compared to the healthy group. Mania scores were significantly higher in the bipolar group compared to the healthy group. Both patient groups had significantly reduced SOFAS scores compared to the healthy group, but did not differ from one another.
Age of mental illness onset was younger in the depression group compared to the bipolar group [t(56) = −2.14, p = 0.04]; however, duration of illness did not differ between the groups.

The instrumental learning task (Figure 1) involved participants choosing between pressing the left or right button in order to earn food rewards (an M&M chocolate or a BBQ-flavoured cracker). We refer to these two key presses as L and R for left and right button presses respectively. Fourteen healthy participants (41.2% of the group) and 13 bipolar participants (36.7% of the group) completed the task in an fMRI setting, using a 2-button Lumina response box. The remaining healthy and bipolar participants, and all depression participants, completed the task on a computer with a keyboard, where the "Z" and "?" keys were given as L and R. Although the performance of subjects was overall higher in the fMRI setting [β = 0.050, SE = 0.024, p = 0.041], the mode of task completion (in the fMRI setting vs on a computer) had no significant effect on how choices were adjusted on a trial-by-trial basis.

Footnote 2: The intercept term was a random effect at the group level (healthy or bipolar), and the mode of task completion (in fMRI settings vs on a computer) was the fixed effect; the probability of staying on the same action was the dependent variable; see section Statistical analysis for details.

Footnote 3: The intercept term was a random effect at the group level (healthy or bipolar), and the mode of task completion (in fMRI settings vs on a computer) was the fixed effect; the probability of selecting the better key was the dependent variable; see section Statistical analysis for details.

Figure 1. Structure of the decision-making task. Before the choice, no stimulus indicated which button was more likely to lead to reward. When the participant made a choice, the button chosen was highlighted (green) and on rewarded trials the reward stimulus was presented for 500 ms. After each block of trials, the participant rated how causal each button was in earning rewards.
The reward stimulus appeared in the centre of the screen for 500 ms. A tally of accumulated winnings remained at the bottom of the screen for the duration of the task. Responding was self-paced during the 12 blocks, each 40 s in length. At the end of each block, participants were asked to judge, on a 10-point scale, how likely it was that pressing each button had earned them the reward on the previous trial.

The task began with a 0.25-contingency practice block, a hunger rating (0 to 10) and a pleasantness rating for each food outcome (−5 to +5). The data from the hunger ratings and the subjective ratings (at the end of each block) were missing for some subjects and were not used in the analysis.

The set of available actions is denoted by A. Here A = {L, R}, with L and R referring to left and right key presses respectively. The set of subjects is denoted by S, and the total number of trials completed by subject s ∈ S over the whole task (all blocks) is denoted by T_s. a^s_t denotes the action taken by subject s at trial t. The reward earned at trial t is denoted by r_t, and we use a_t to refer to an action taken at time t, either by the subjects or the models (in simulations).

The architecture used is based on a recurrent neural network (rnn) and is depicted in Figure 2.

The model is composed of an LSTM layer (long short-term memory; Hochreiter and Schmidhuber, 1997) and an output softmax layer with two nodes (since there are two actions in the task). The inputs to the LSTM layer are the previous action (a_{t−1}, coded using a one-hot transformation) and the reward received after taking that action (r_{t−1} ∈ {0, 1}). The outputs of the softmax layer are the probabilities of selecting each action, denoted by π_t(a; rnn) for action a ∈ A at trial t.

In the learning-to-learn phase, the aim is to train the weights in the network so that the model learns to predict subjects' actions given their past observations (i.e., learns how they learn). For this purpose, the objective function for optimising the weights in the network (denoted by Θ) for subject set S is the log-likelihood of the observed choices,

$$ L(\Theta) = \sum_{s \in S} \sum_{t=1}^{T_s} \log \pi_t(a^s_t; \mathrm{rnn}), $$

where a^s_t is the action selected by subject s at trial t, and π_t(·; rnn) is the probability that the model assigns to each action. Note that the policy is conditioned on the previous actions and rewards in each block of training, which are not shown in the notation for simplicity.
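As a rough illustration of this setup (a sketch, not the implementation used here), the following builds the described architecture in TensorFlow/Keras and trains it by maximum likelihood; the tensor shapes, batching over blocks, and training call are assumptions made for the example:

```python
import tensorflow as tf

N_CELLS = 10     # 5, 10 and 20 cells were considered
N_ACTIONS = 2    # L and R

# Per-trial input: one-hot of the previous action (2 units) plus the previous reward (1 unit).
inputs = tf.keras.Input(shape=(None, N_ACTIONS + 1))
hidden = tf.keras.layers.LSTM(N_CELLS, return_sequences=True)(inputs)
policy = tf.keras.layers.Dense(N_ACTIONS, activation="softmax")(hidden)   # pi_t(a; rnn)
rnn = tf.keras.Model(inputs, policy)

# Maximum-likelihood training: sparse categorical cross-entropy is the negative
# log-probability of the subjects' observed actions, so minimising it maximises L(Theta).
rnn.compile(optimizer=tf.keras.optimizers.Adam(),
            loss="sparse_categorical_crossentropy")

# x: float array [n_blocks, n_trials, 3] of previous-action / previous-reward inputs,
# y: int array [n_blocks, n_trials] of the actions the subjects actually took.
# rnn.fit(x, y, epochs=...)   # number of iterations chosen by cross-validation / early stopping
```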

Models were trained using the maximum-likelihood (ML) estimation method, where Θ is a vector containing the free parameters of the model (in both the LSTM and softmax layers). The models were implemented in TensorFlow (Abadi et al., 2016) and optimised using the Adam optimizer (Kingma and Ba, 2014). Note that Θ was estimated for each group of subjects separately. Networks with different numbers N_c of LSTM cells (N_c ∈ {5, 10, 20}) were considered, and the best model was selected using leave-one-out cross-validation (see below). Early stopping was used for regularisation, and the optimal number of training iterations was selected using leave-one-out cross-validation.

The total number of free parameters (in both the LSTM layer and the softmax layer) was 190, 580, and 1960 for the networks with 5, 10, and 20 LSTM cells, respectively. In order to control for the effect of the initialisation of network weights on the final results, a single random network of each size (5, 10, 20) was generated and used to initialise the weights in the network.
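As a consistency check (our arithmetic, not stated in the text), these counts match a standard LSTM parameterisation, 4 n_c (n_i + n_c + 1) weights for n_i = 3 inputs (one-hot action plus reward) and n_c cells, plus a bias-free two-unit linear readout with 2 n_c weights:

$$
\begin{aligned}
4 \cdot 5\,(3+5+1) + 2 \cdot 5 &= 180 + 10 = 190,\\
4 \cdot 10\,(3+10+1) + 2 \cdot 10 &= 560 + 20 = 580,\\
4 \cdot 20\,(3+20+1) + 2 \cdot 20 &= 1920 + 40 = 1960.
\end{aligned}
$$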

After the learning-to-learn phase, the weights in the network were frozen and the trained model was used for three purposes: (i) cross-validation (see below), (ii) on-policy simulations and (iii) off-policy simulations. For cross-validation, the previous actions of the test subject(s) and the rewards experienced by the subject(s) were fed into the model, but unlike the learning-to-learn phase, the weights were not changed and we only recorded the model's prediction of the next action. Note that even though the weights in the network were fixed, the output of the network changed from trial to trial due to the recurrent nature of these networks. Also, due to the small sample size, we used the same set of subjects for testing the model and for the validation of the model hyper-parameters (N_c and the number of optimisation iterations).

Besides being used for calculating cross-validation statistics, the trained models were used for on-policy and off-policy simulations (with frozen weights). In the on-policy simulations, the model received its own actions and the rewards it earned as inputs (instead of receiving the actions selected by the subjects). In the off-policy simulations, the set of actions and rewards that the model received was fixed and predetermined. The details of these simulations are reported in the Results section.
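For concreteness, here is a minimal sketch of the two simulation modes, assuming the Keras-style `rnn` model from the earlier sketch; the helper names and the reward-generation scheme are illustrative:

```python
import numpy as np

def off_policy(rnn, actions, rewards):
    """Feed a fixed, predetermined action/reward sequence; return the per-trial policy."""
    x = np.zeros((1, len(actions), 3), dtype=np.float32)
    for t in range(1, len(actions)):
        x[0, t, actions[t - 1]] = 1.0        # one-hot previous action
        x[0, t, 2] = rewards[t - 1]          # previous reward
    return rnn.predict(x, verbose=0)[0]      # shape [n_trials, 2]

def on_policy(rnn, reward_prob, n_trials, rng=np.random.default_rng(0)):
    """Let the model choose; rewards are drawn from per-action reward probabilities."""
    actions, rewards = [], []
    for t in range(n_trials):
        x = np.zeros((1, t + 1, 3), dtype=np.float32)
        for i in range(1, t + 1):
            x[0, i, actions[i - 1]] = 1.0
            x[0, i, 2] = rewards[i - 1]
        p = rnn.predict(x, verbose=0)[0, -1].astype(np.float64)
        p /= p.sum()                          # guard against float rounding
        a = int(rng.choice(2, p=p))           # sample the model's own action
        actions.append(a)
        rewards.append(float(rng.random() < reward_prob[a]))
    return actions, rewards
```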

Baseline methods

We used three baseline methods, ql, qlp and gql, which are variants and generalisations of Q-learning (Watkins, 1989).

ql model. After taking action a_{t−1} at trial t − 1, the value of that action, denoted by Q_t(a_{t−1}), is updated as follows:

$$ Q_t(a_{t-1}) = Q_{t-1}(a_{t-1}) + \phi \left( r_{t-1} - Q_{t-1}(a_{t-1}) \right), $$

where φ is the learning-rate and r_{t−1} is the reward received after taking the action. Given the action values, the probability of taking action a ∈ {L, R} at trial t is

$$ \pi_t(a; \mathrm{ql}) = \frac{e^{\beta Q_t(a)}}{\sum_{a' \in A} e^{\beta Q_t(a')}}, $$

where β > 0 is a free parameter that controls the contribution of values to the choices (the balance between exploration and exploitation). The free parameters of this variant are φ and β. Note that the probability that the models predict for each action at trial t is necessarily based on the data observed before the action and reward at trial t. Further, since there are only two actions, we can write

$$ \pi_t(L; \mathrm{ql}) = \sigma\!\left( \beta \left( Q_t(L) - Q_t(R) \right) \right), $$

where σ(·) is the standard logistic sigmoid.

qlp model. This model is inspired by the fact that humans and other animals have a tendency to stick with the same action over multiple trials (i.e., perseverate), or sometimes to alternate between the actions, independently of the effects of reward (Lau and Glimcher, 2005). We therefore call this model qlp, for Q-learning with perseveration. In it, action values are updated in the same way as in the ql model, but the probability of selecting actions is

$$ \pi_t(a; \mathrm{qlp}) = \frac{e^{\beta Q_t(a) + \kappa_t(a)}}{\sum_{a' \in A} e^{\beta Q_t(a') + \kappa_t(a')}}, $$

where κ_t(a) takes the value κ if a is the action taken on the previous trial and zero otherwise. Therefore, there is a tendency to select the same action again on the next trial (if κ > 0) or to switch to the other action (if κ < 0); in the specific case that κ = 0, the qlp model reduces to ql.
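A compact sketch of these two baselines (actions coded 0 = L, 1 = R; setting `kappa=0` recovers ql, and the perseveration bonus is applied to the previously chosen action as described above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def qlp_policy(actions, rewards, phi, beta, kappa=0.0):
    """Trial-by-trial choice probabilities of ql (kappa=0) / qlp."""
    Q = np.zeros(2)
    policies = []
    prev_a = None
    for a, r in zip(actions, rewards):
        pers = np.zeros(2)
        if prev_a is not None:
            pers[prev_a] = kappa                  # perseveration bonus for the last action
        policies.append(softmax(beta * Q + pers)) # prediction before observing trial t
        Q[a] += phi * (r - Q[a])                  # delta-rule value update
        prev_a = a
    return np.array(policies)
```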

gql model. As we will show in the Results section, neither ql nor qlp fits the behaviour of the subjects in the task. As such, we aimed to develop a baseline model that could at least capture high-level behavioural trends, and we built a generalised Q-learning model, gql, to compare with rnn.
In this variant, instead of learning a single value for each action, the model learns N different values for each action; the values learned for an action differ in that they are updated using different learning-rates. The values for action a are denoted by Q(a), a vector of size N, and the corresponding learning-rates are denoted by a vector Φ of size N (0 ≤ Φ ≤ 1, element-wise). Based on this, after taking action a_{t−1} at trial t − 1, its values are updated as follows:

$$ \mathbf{Q}(a_{t-1}) \leftarrow \mathbf{Q}(a_{t-1}) + \boldsymbol{\Phi} \odot \left( r_{t-1} - \mathbf{Q}(a_{t-1}) \right), $$

where ⊙ represents the element-wise (Hadamard) product. For example, if N = 2 and Φ = [0.1, 0.05], then the model learns two different values for each action (L and R), one of which is updated with learning-rate 0.1 and the other with learning-rate 0.05. In the specific case that N = 1, the above equation reduces to the update rule used in the ql and qlp models, in which only a single value is learned for each action.

In the qlp model, the current action is affected by the last taken action (perseveration). This is generalised in gql by tracking, for each action a, a history of how often the action has been chosen recently, denoted by H(a): a value close to one means that action a was taken frequently in the past, and a value close to zero implies that the action was taken rarely. Similar to the action values, N different histories are tracked for each action, each of which is modulated by a separate learning-rate. The learning-rates are represented in a vector Ψ of size N (0 ≤ Ψ ≤ 1, element-wise). Assuming that action a_{t−1} was taken at trial t − 1, H(a) is updated as follows:

$$ \mathbf{H}(a) \leftarrow \mathbf{H}(a) + \boldsymbol{\Psi} \odot \left( \mathbb{1}[a = a_{t-1}] - \mathbf{H}(a) \right), $$

Intuitively, according to the above equation, if action a was taken on a trial, H(a) increases (the amount of increase depends on the learning-rate of each entry), and for the rest of the actions, H(other actions) decreases (again, the amount of the decrement is modulated by the learning-rates). For example, if N = 2 and Ψ = [0.1, 0.05], then for each action two choice tendencies will be learned, one of which is updated at rate 0.1 and the other at rate 0.05.

Having learned Q(a) and H(a) for each action, the next question is how these two combine to guide choices. Q-learning models assume that the contribution of values to choices is modulated by the parameter β. Here, since the model learns multiple values for each action, we assume that each value is weighted by a separate parameter, collected in a vector B of size N. Similarly, in the qlp model the contribution of perseveration to choices is controlled by the parameter κ, and here we assume that a vector K modulates the contribution of previous actions to the current choice. Based on this, the probability of taking action a at trial t is

$$ \pi_t(a; \mathrm{gql}) \propto e^{\,\mathbf{B} \cdot \mathbf{Q}(a) \;+\; \mathbf{K} \cdot \mathbf{H}(a)}, $$

where "·" refers to the inner product (and the probabilities are normalised over a ∈ A). We also add extra flexibility to the model by allowing values to interact with the history of previous actions in influencing choices. For example, if N = 2, we allow the two learned values of each action to interact with the two learned action histories of that action, leading to four interaction terms; the contribution of each interaction term to choices is determined by a matrix C of size N × N (N = 2 in this example):

$$ \pi_t(a; \mathrm{gql}) \propto e^{\,\mathbf{B} \cdot \mathbf{Q}(a) \;+\; \mathbf{K} \cdot \mathbf{H}(a) \;+\; \mathbf{Q}(a)^{\top} \mathbf{C}\, \mathbf{H}(a)}. $$

The free parameters of this model are Φ, Ψ, B, K, and C. In this paper we use models with N = 1, 2, 10, which have 5, 12 and 140 free parameters respectively. We used N = 2 for the results reported in the main text, since this model setting was able to capture several behavioural trends while still being interpretable. The results using N = 1, 10 are reported in the supplementary materials to illustrate the model's capabilities in extreme cases.
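The following sketch implements one reading of the gql update and choice rule described above; in particular, the history update toward 1 for the chosen action and toward 0 for the unchosen action is our interpretation of the text, and the parameter values passed in are illustrative:

```python
import numpy as np

def gql_policy(actions, rewards, Phi, Psi, B, K, C):
    """gql choice probabilities; Phi, Psi, B, K are length-N vectors, C is N x N."""
    N = len(Phi)
    Q = {a: np.zeros(N) for a in (0, 1)}          # N values per action
    H = {a: np.zeros(N) for a in (0, 1)}          # N choice-history traces per action
    policies = []
    for a, r in zip(actions, rewards):
        logits = np.array([B @ Q[b] + K @ H[b] + Q[b] @ C @ H[b] for b in (0, 1)])
        e = np.exp(logits - logits.max())
        policies.append(e / e.sum())              # prediction before observing trial t
        Q[a] += Phi * (r - Q[a])                  # values of the chosen action move toward r
        for b in (0, 1):                          # histories move toward 1 (chosen) or 0 (unchosen)
            H[b] += Psi * ((1.0 if b == a else 0.0) - H[b])
    return np.array(policies)
```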

Objective function. The objective function for optimising the baseline models was the same as the one chosen for rnn,

$$ L(\Theta) = \sum_{s \in S} \sum_{t=1}^{T_s} \log \pi_t(a^s_t; \mathrm{M}), $$

where, as mentioned before, a^s_t is the action selected by subject s at trial t, and π_t(·; M) is the probability that model M assigns to each action. Models were trained using the maximum-likelihood estimation method, where Θ is a vector containing the free parameters of the models. Optimisation for all models was performed using the Adam optimizer (Kingma and Ba, 2014) and the automatic differentiation provided in TensorFlow (Abadi et al., 2016). The free parameters with bounded support (φ, β, Φ, Ψ) were transformed to satisfy the constraints.

Performance measures
Two different measures were used for quantifying the predictive accuracy of the models. The first measure is the average log-probability of the models' predictions for the actions taken by subjects. For a group of subjects denoted by S, we define the negative log-probability (nlp) as follows:

$$ \mathrm{nlp} = -\frac{1}{\sum_{s \in S} T_s} \sum_{s \in S} \sum_{t=1}^{T_s} \log \pi_t(a^s_t; \mathrm{M}). $$

The other measure is the percentage of actions predicted correctly,

$$ \%\mathrm{correct} = \frac{100}{\sum_{s \in S} T_s} \sum_{s \in S} \sum_{t=1}^{T_s} \mathbb{1}\!\left[ a^s_t = \arg\max_{a \in A} \pi_t(a; \mathrm{M}) \right], $$

where 1[·] denotes the indicator function. Unlike %correct, nlp takes the probabilities of the predictions into account.
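A small sketch of the two measures for a single sequence of trials, assuming `policies` is an array of per-trial predicted distributions computed before the corresponding actions are observed (pooling trials across all subjects in a group gives the group-level values):

```python
import numpy as np

def nlp(policies, actions):
    """Average negative log-probability assigned to the actions the subject actually took."""
    policies, actions = np.asarray(policies), np.asarray(actions)
    return -np.mean(np.log(policies[np.arange(len(actions)), actions]))

def percent_correct(policies, actions):
    """Percentage of trials on which the most probable predicted action matches the choice."""
    policies, actions = np.asarray(policies), np.asarray(actions)
    return 100.0 * np.mean(policies.argmax(axis=1) == actions)
```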

Results

We first focus on a high-level evaluation of the new approach in terms of making predictions about subjects' actions and diagnostic labels. Then, in the following sections, we focus on behavioural analysis of the data using the rnn.

Prediction analysis

Model settings. For the rnn model, leave-one-out cross-validation was used to determine the number of cells and the number of optimisation iterations required for the rnn model to achieve the highest prediction accuracy. The results are shown in Figure S1: the lowest mean negative log-probability (nlp) is achieved with 10 cells in the LSTM layer, and after 1100 and 1200 optimisation iterations for the healthy and

Action prediction. Here the aim was to quantify how well the models predict the actions chosen by the subjects. We used leave-one-out cross-validation (as above) and calculated the prediction-accuracy measures for the held-out subjects. The results are reported in Figure 3. The left panel of the figure shows prediction accuracy in terms of nlp (averaged over leave-one-out cross-validation folds; lower values are better) and the right panel shows the percentage of actions predicted correctly (higher values are better). Focusing on the nlp measures, which unlike %correct take the certainty of predictions into account, we make two observations. Firstly, among the baseline models, gql provided the highest performance in terms of nlp in all three groups, which was statistically significant in the depression and bipolar groups.

Based on this, we conclude that rnn was more predictive than the baseline models. Indeed, we show in the next section that it captures some trends in the behaviour of the subjects that the other models fail to capture.

Diagnostic label prediction. Next, we sought to evaluate the new approach in terms of predicting diagnostic labels, using a leave-one-out cross-validation method. In each run, one of the subjects was held out, and an rnn model was fitted to the rest of that subject's group. This model, along with versions of the same model fitted on all the subjects in each of the other two groups, was used to predict the diagnostic label of the held-out subject. The prediction was based on which of the three models provided the best fit (lowest nlp) for that subject (a sketch of this assignment step is given below). The results are reported in Table 2. The baseline random performance is near 33%. As the table shows, the highest performance is achieved in the healthy group, in which 64% of subjects are classified correctly. On the other hand, in the depression group a significant portion of subjects are classified as healthy. The overall correct classification rate of the model is 52%, while gql achieved 50% accuracy (Table S1). We therefore conclude that although gql was unable to accurately characterise behavioural trends in the data (as we will show below), the group differences that were captured by gql were sufficient to guide diagnostic label predictions.

Footnote 4: The intercept term was a random effect at the cross-validation fold level; model (gql = 1, qlp/rnn = 0) was the fixed effect.

Figure 4 shows the probability of selecting the better key (the key with the higher reward probability).

For the case of rnn, the number of cells, optimisation iterations and model initialisation were based on the numbers obtained using cross-validation (see above). The negative log-likelihood for each model is reported in Table S5 (see Table S7 for the effect of the initialisation of the network on the negative log-likelihood of the trained rnn, and Table S6 for the negative log-likelihood when a separate model is fitted to each subject in the case of the baseline models).
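The label-assignment step mentioned above can be sketched as follows, reusing the `off_policy` helper from the earlier simulation sketch; the dictionary of group-level models and the function names are illustrative:

```python
import numpy as np

def nlp_for_subject(model, actions, rewards):
    """Average negative log-probability of a subject's actions under a trained group model."""
    policies = off_policy(model, actions, rewards)   # off-policy pass, defined in the earlier sketch
    return -np.mean(np.log(policies[np.arange(len(actions)), actions]))

def predict_label(group_models, actions, rewards):
    """group_models: e.g. {'healthy': rnn_h, 'depression': rnn_d, 'bipolar': rnn_b}."""
    scores = {label: nlp_for_subject(m, actions, rewards) for label, m in group_models.items()}
    return min(scores, key=scores.get)               # label of the best-fitting (lowest-nlp) model
```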

Models were simulated on-policy (i.e., actions were selected by the model according to the probabilities that the model assigned to each action on each trial) in the task conditions, with the same reward probabilities and the same number of trials that each subject completed in each block. The results of the simulations are shown in Figure 4 (rnn, gql, ql, qlp). In the case of rnn, similar to the subjects' data, the probability of selecting the better key was significantly higher than that of the other key in all three groups, which shows that although the model was initially unaware that the objective of the task is to collect rewards, its actions were directed toward the better key by following the strategy that it had learned from the subjects' actions. A similar pattern was observed for the gql, qlp and ql models, which is not surprising as the structure of these models includes value representations which can be used for reward maximisation.

Footnote 5: The intercept term was a random effect at the subject level; key (low reward probability = 0, high reward probability = 1) was the fixed effect; the dependent variable was the probability of selecting the key.

Footnote 6: The intercept term was a random effect at the subject level; key (low reward probability = 0, high reward probability = 1), group (healthy, depression/bipolar) and their interaction were fixed effects; the dependent variable was the probability of selecting the key.

This pattern is expected from the baseline reinforcement-learning models, i.e., ql and qlp, as in these models earning rewards increases the value of the taken action, which raises the probability of taking that action on the next trial. Indeed, such a learning process is embedded in the parametric forms of the ql and qlp models, and cannot be reversed no matter what values are assigned to the free parameters of these models. As such, we designed gql as a baseline model with more relaxed assumptions and a higher capacity than qlp and ql, which enabled it to produce the same pattern as the subjects' choices, similar to rnn. Despite this, gql provided a lower performance in terms of predicting subjects' choices compared to rnn, which shows there are behavioural trends that this model failed to represent, even though it was able to capture high-level behavioural statistics. Furthermore, it is not immediately clear how actions were directed toward the better key while, at the same time, the probability of switching to the other action after earning a reward was higher, as the latter seems to imply that actions would be diverted away from the better key. In the next section, we aim to show how these two observations can be explained using off-policy model simulations.

Off-policy simulations

In this section we use off-policy simulations of the models to uncover the learning and action-selection processes behind subjects' choices. Off-policy means that actions are not selected by the model in the simulation; instead, they are fixed and fed into the model, and the model is used only for making predictions about the next action. In this way we can control what inputs the model receives and thus examine how they affect its predictions.

Simulations of the models (rows) are shown in Figure 6 for the healthy group; each panel shows a separate simulation across 30 trials (horizontal axis). For trials 1-10, the action that was fed to the model was R, and for trials 11-30 it was L (the action fed into the model on each trial is shown in the ribbon below each panel). The rewards associated with these trials varied among the simulations (the columns) and are shown by black crosses (x) in the graphs (see Section S2 for more details on how the simulation parameters were chosen).

Focusing on the rnn model, we can see that in the first 10 trials the predicted probability of taking R is higher than that of L; this then reverses over the next 20 trials. This shows that perseveration (i.e., sticking with the previously taken action) is an element of action selection, and is also consistent with the fact that the qlp model (which has a parameter for perseveration) performs better than the ql model in the cross-validation statistics (see Figure 3). We make four further sets of observations regarding how rewards and the history of previous actions affect choices.

Footnote 8: One might note from Figure 4 that the probability of staying on an action is above 50%, irrespective of whether a reward was earned on the previous trial or not. This, however, does not provide any evidence for perseveration, as trials are not statistically independent. For example, in late training trials, a subject might have discovered which action returns more reward on average, and therefore stays on an action irrespective of reward, without necessarily relying on perseveration.
Figure 6. The ribbon below each panel shows the action which was fed to the model on each trial. In the first 10 trials, the action that the model received was R and in the next 20 trials it was L. Rewards are shown by black crosses (x) on the graphs. See text for the interpretation of the graph. Note that the model's prediction for each trial is made before seeing which action and reward were fed to the model on that trial.
The immediate effect of reward on choices. Focusing on the rnn simulations in Figure 6, one observation is that earning a reward (shown by a black cross) causes a 'dip' in the probability of staying on an action, reflecting a tendency to switch to the other action. This is consistent with the observation made in Figure 5 that the probability of switching increases after a reward. We see a similar pattern in gql, but in the ql and qlp models the pattern is reversed, i.e., the probability of choosing an action increases after a reward due to the increase in action values (the effects are rather small for qlp and may not be clear for this model), which is again consistent with the observation in Figure 5. The reason that gql is able to produce predictions different from those of ql and qlp is that in this model the contribution of action values to choices can be negative, i.e., higher values can lead to a lower probability of staying on an action (see Section S1 for more explanation).

The effect of previous rewards on choices. The next observation concerns the effect of previous rewards on the probability of switching after a reward. First we focus on the rnn model and on the trials indicated by red arrows in Figure 6. The red arrows point to the same trial number, but the number of rewards earned prior to that trial differs. As the figure shows, the probability of switching after a reward is lower in the right panel compared to the left and middle panels. The only difference between the simulations is that in the right panel two more rewards were earned before the red arrow.
Therefore, the figure shows that although the probability of switching is higher after a reward, it gets smaller as more rewards are earned from an action. Indeed, this strategy makes subjects switch more from the inferior action, as rewards are sparse on that action, and switch less from the superior action, as it is more frequently rewarded. This can reconcile the observations made in Figures 4 and 5 that more responses were made on the better key while, at the same time, the probability of switching after a reward was higher. As shown in Figure 6, the gql model produced a pattern similar to rnn, which is because this model tracks multiple values for each action (see Section S1 for details). Figure 7 shows the same simulations using rnn for all the groups (see Figures S5, S6 and S7 for the gql, qlp and ql models respectively). Comparing the predictions at the red arrows for the depression and bipolar groups, we see a pattern similar to the healthy group, although the differences are smaller in the bipolar group (see Figure S9 for the effect of the initialisation of the model).

Figure 7. The ribbon below each panel shows the action which was fed to the model on each trial. In the first 10 trials, the action that the model received was R and in the next 20 trials it was L. Rewards are shown by black crosses (x) on the graphs. See text for the interpretation of the graph. Note that the simulation conditions are the same as those shown in Figure 6, and the first row here (healthy group) is the same as the first row of Figure 6, reproduced for comparison with the other groups.

The above observations are consistent with the pattern of choices in the empirical data shown in the left panel of Figure 8, which depicts the probability of staying on an action after earning a reward as a function of how many rewards were earned after switching to that action (a similar graph using on-policy simulations of rnn is shown in Figure S11). In all three groups, the probability of staying on an action (after earning a reward) was significantly higher when more than two rewards had been earned previously than when none had.

Footnote 9: The intercept was a random effect at the subject level; whether zero rewards or more than two rewards had been earned previously was the fixed effect.

The effect of repeating an action on choices. In the previous section we investigated the effect of previous rewards on choices. In this section we elaborate on how the history of previous actions affects current choices. Focusing on the rnn simulations in the left panel of Figure 6, one observation is that after switching to action L (after trial 10) the probability of staying on that action gradually decreases; i.e., although there is a high chance that the next action will be the same as the previous one, subjects developed a tendency to switch the longer they stayed with an action. To compare this pattern with the empirical data, we calculated the probability of staying on an action as a function of how many times the action had been taken since switching, which is shown in the right panel of Figure 8 (similar graphs for rnn on-policy simulations are shown in Figure S11). As the figure shows, for the healthy group the chance of staying on an action decreases as the action is taken more times [β = −0.005, SE = 0.001, p = 0.006], which is consistent with the behaviour of rnn. With regard to the baseline models, going back to Figure 6, we do not see a similar pattern, although in gql there is a small decrement in the probability of staying on an action after earning the first reward.

Symmetric oscillations between actions. Next, we focus on the rnn simulations for the depression and bipolar groups in Figure 7. In the depression group, the probability of staying on an action is almost flat, with a slight decrement in the middle. For the bipolar group, there is a dip around 10 trials after switching to action L (which is around trial 20), and then the policy becomes flat.

Referring to the empirical data, as shown in the right panel of Figure 8, the effect of the number of actions on stay probabilities is not monotonic. In particular, for the depression and bipolar groups the probability of staying on an action immediately after switching to that action is around 50%-60% (shown by the bar at x = 0 in the right panel of Figure 8); i.e., there is a 40%-50% chance that the subject immediately switches back to the previous action. Based on this, we would expect to see a 'dip' in the simulations of the depression and bipolar groups in Figure 7 just after switching to action L, which is not the case, pointing to an inconsistency between the model predictions and the empirical data.

Footnote 10: To be consistent with the off-policy simulations, only trials in which (i) subjects did not earn a reward on that trial, and (ii) subjects had not earned a reward since switching to the current action, were included in the graph.

Footnote 11: The intercept was a random effect at the subject level; the number of times that an action had been repeated since switching to the action was the fixed effect (between zero and 15 times). The dependent variable was the probability of staying on an action.

To look more closely at the above effect, we define a run of key presses as a sequence of presses of a certain key in a row, without switching to the other action (see footnote 12). Figure 9 shows the relationship between consecutive run lengths, i.e., the length of the current run of actions as a function of the length of the previous run of actions (see Figures S13, S14 and S15 for similar graphs using on-policy simulations of rnn, gql with N = 2 and gql with N = 10, respectively). The dashed line in the figure indicates the points at which the current run length is the same as the previous run length. Being close to this line implies that subjects are performing symmetric oscillations between the two actions, i.e., going back and forth between the two actions while performing an equal number of presses on each key. In particular, as the graph shows, in the bipolar group, and to an extent in the depression group, a run of short length tends to trigger another run of a similar length. This implies that, if by chance a subject performs a run of length 1, it will initiate a sequence of oscillations between the two actions, which keeps the stay probabilities low during short runs, consistent with what we see at x = 0 in the right panel of Figure 8. This effect cannot be seen in the simulations that we showed in Figure 7, because the length of the previous run before switching to action L was 10 (there were 10 R actions), and therefore we would not expect the next run to be of length 1, nor would we expect to see a dip in policies just after the first switch.
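The run-length statistic used in Figure 9 can be computed directly from a choice sequence; a minimal sketch:

```python
def run_lengths(actions):
    """Lengths of consecutive presses of the same key, e.g. [L, R, R, L] -> [1, 2, 1]."""
    runs, prev = [], None
    for a in actions:
        if prev is not None and a == prev:
            runs[-1] += 1        # extend the current run
        else:
            runs.append(1)       # a switch starts a new run
        prev = a
    return runs

def consecutive_run_pairs(actions):
    """(previous run length, current run length) pairs, as plotted in Figure 9."""
    r = run_lengths(actions)
    return list(zip(r[:-1], r[1:]))
```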

As shown in Figure S10, the majority of runs are of length 1 in the depression and bipolar groups (around 17%, 37%, and 45% of runs are of length 1 in the healthy, depression, and bipolar groups respectively). Given this, and the specific pattern of oscillations in the depression and bipolar groups, the next question is whether, in the models, a run of length 1 will trigger oscillations, similar to the empirical data. We used a combination of off-policy and on-policy model simulations to answer this question. That is, during the off-policy phase we forced the model to make an oscillation between the two actions, and after that we let the model select actions. We expected that in the healthy group the model would converge to one of the actions, but that in the depression and bipolar groups the initial oscillations would trigger further switches. The simulations are presented in Figure 10, in which the actions on the first 9 trials were fed to the model (off-policy), and the actions on the remaining trials (trials 10-20) were selected based on which action the model assigned the highest probability (see footnote 13). As the simulations show, at the beginning the probability that the model assigns to action R is high, but after being fed the oscillations, the model predicts that future actions will oscillate in the depression and bipolar groups, but not in the healthy group, consistent with what we expected to observe.

Therefore, rnn is able to produce symmetric oscillations and its behaviour is consistent with the subjects' actions. As Figure 10 shows, besides rnn, gql was also able to produce length-1 oscillations to some extent (as shown for the bipolar group), which may be the reason that the prediction accuracy achieved by this model is significantly better than that of qlp in the bipolar and depression groups (Figure 3), in which length-1 oscillations are more common (see Section S2 for more details). However, as shown in Figure S14, gql failed to produce oscillations of longer lengths (even if we increase the capacity of gql by using N = 10; see Figure S15), while rnn was able to do so (Figure S13). This inability of the gql model is particularly problematic in the depression and healthy groups, as these two groups tend to match the lengths of consecutive runs of actions. This could be the reason that, in the cross-validation statistics, rnn is significantly better than gql in the depression and bipolar groups.

Footnote 12: For example, if the executed actions are L, R, R, L, then the length of the first run is 1 (L), the length of the second run is 2 (R, R), and the length of the third run is 1 (L).

Footnote 13: Note that in on-policy simulations, actions are typically selected probabilistically according to the probabilities that a model assigns to each action. However, in the on-policy simulations presented in this section, in order to get consistent results across simulations, actions were not selected probabilistically but were chosen based on which action received the highest prediction probability.

Summary. Firstly, rnn was able to capture the immediate effect of rewards on actions (i.e., the 'dip' after rewards), as well as the effect of previous rewards on choices. gql has the same ability, which enabled it to reproduce the behavioural summary statistics shown in Figures 4 and 5. The baseline reinforcement-learning models (qlp and ql) failed to capture either trend. Secondly, rnn was able to capture how choices change as an action is chosen repeatedly in a row, as well as the symmetric oscillations between the actions, which gql was unable to do.

Discussion
Based on recurrent neural networks, we provide a method for learning a computational model that can characterise human learning processes in decision-making tasks. Unlike previous work, the current approach makes minimal assumptions about these learning processes; we showed that this agnosticism is important for being able to explain the data. In particular, subjects apparently used a mixture of different processes to select actions, and there were some differences in these processes between the healthy and the psychiatric groups. The rnn model was able to learn these processes from data. These processes were to a large extent inconsistent with Q-learning models, and were also rather hidden in the overall performance of the subjects in the task. This is an example of how our proposed framework can outperform previous approaches. Furthermore, we show that the model can be interpreted using off-policy simulations.

Figure 10. The ribbon below each panel shows the action that was fed to the model (for the first 9 trials), and the action that was selected by the model (for the rest of the trials). During the off-policy trials, the sequence of actions that was fed to the model was R, R, R, R, R, R, L, R, L. See the text for the interpretation of the graph.
It might be possible to design different variants of Q-learning models (e.g., based on the analyses presented above) and obtain more competitive prediction accuracy. For example, although it is non-trivial, one could design a new variant of gql that is able to track the oscillatory behaviour. Our aim here was not to rule out this possibility, but to show that the strength of our analysis lies in its ability to automatically extract learning features (which were initially invisible in task performance metrics) from subjects' actions through learning to learn, without requiring feature engineering in the models. In this respect, although gql was not used in previous work, we designed and used it as a baseline model that can correctly capture high-level summary statistics of behaviour (Figures 4 and 5), although it misses deeper trends that are essential for characterising behaviour, particularly in the psychiatric groups.

Indeed, our approach inherits this benefit from neural networks, which have significantly simplified feature engineering in various domains (Lecun et al., 2015). However, our approach also inherits the black-box nature, i.e., the lack of an interpretable working mechanism, of neural networks. This might not be an issue in some applications, such as predicting the diagnostic labels of subjects; however, it needs to be addressed in applications in which the target of the study is an interpretable working mechanism. We did, however, show that running controlled experiments on the model through off-policy simulations can provide some insight into the processes behind subjects' choices. Interpreting neural networks is an active area of research in machine learning (e.g., Karpathy et al., 2015), and the approach proposed here can benefit from further developments in this area. Even as a pure black-box model, the current approach can also contribute to previous methods of computational modelling by providing a baseline for predictive accuracy. That is, as long as a candidate model does not provide performance better than or equal to that of rnn models, there are certainly accessible behavioural trends that have been missed in the model structure. This is particularly important because of the natural randomness in human choices, which makes it unclear in many scenarios whether the model at hand (e.g., a Q-learning model) has reached the limit of predictability of choices, or whether it requires further improvement.

As we showed, off-policy simulations of the model can be used to gain insights into the model's working mechanism. However, off-policy simulations need to be designed manually, to determine the inputs to the model. Here, we designed the initial off-policy simulations based on the specific questions and hypotheses that we were interested in, and using overall behavioural statistics (Figure 6; Section S2). However, an important part of the behavioural processes, i.e., the tendency of subjects to oscillate between the actions, was not visible in those simulations, and because of this we designed another set of inputs to investigate the oscillations (Figure 10). This shows that the choice of off-policy simulations affects the interpretation of the model's working mechanism. As such, although rnn can be trained automatically and without intuition into the behavioural processes behind actions (e.g., Barak, 2017), the other part, i.e., designing off-policy simulations, is not automated and does require manual hypothesis generation. Automating this process requires a method that generates representative inputs (and network outputs) that most discriminately describe the differences between the psychiatric groups. We did not address this limitation in this work, and leave it for future research.

Recurrent neural networks have previously been used to study reward-related decision-making (Song et al., 2017; Zhang et al., 2018), perceptual decision-making, performance in cognitive tasks and working memory (Miconi, 2017; Carnevale et al., 2015; Mante et al., 2013; Song et al., 2016; Barak et al., 2013), and motor patterns, motor reach and timing (Sussillo et al., 2015; Hennequin et al., 2014; Rajan et al., 2016; Laje and Buonomano, 2013). Typically, in these studies an rnn is trained based on the performance of the model in the task, which differs from the current study, in which the aim of training is to generate behaviour similar to the subjects', even if it leads to poor performance in the task. One exception is the study of Sussillo et al. (2015), in which a network was trained to generate outputs similar to the electromyographic (EMG) signals recorded in behaving animals during a motor reach task. Interestingly, that study found that even though the model was trained purely on EMG signals, the internal activity of the model resembled neural responses recorded from the motor cortex of the animals. A similar approach could be employed in future work to investigate whether brain activity during decision-making is related to the network activity.

With regard to predicting subjects' diagnostic labels, it is not surprising that the model was unable to achieve a high level of classification accuracy. The reason is that there is a high level of heterogeneity among patients with the same diagnostic label, which, for example, is reflected in the wide variation in treatments and treatment outcomes in depression (e.g., Rush et al., 2006). Such variation might be reflected in the learning and choice abilities of the subjects, in which case it might be predicted using the model's inferred labels for each subject. On the other hand, purely as a diagnostic tool, the current approach may help clinicians in situations in which using questionnaires is not feasible (e.g., due to language or cultural barriers).

In the model-fitting procedure used here, a single model was fitted to all subjects in each group, despite possible individual differences within a group. This was partly because we were interested in obtaining a single parameter set for making predictions for the held-out subject in the leave-one-out cross-validation experiments. That is, even if a mixed-effects model were fitted to the data, in the end a summary of group statistics is required for making predictions about a new subject. In other applications one might be interested in estimating parameters for each individual (either network weights or parameters of the reinforcement-learning models); in this respect, using a hierarchical model-fitting procedure would be a more appropriate approach, which has been done previously for reinforcement-learning models (e.g., Piray et al., 2014), and it would be an interesting future step to develop it for rnn models.

Along the same lines, a single rnn model, due to its rich set of parameters, might be able to learn and detect individual differences (e.g., differences in the learning-rates of subjects) in the early trials of the task, and then use this information for making predictions in later trials. For example, in the learning-to-learn phase, the model might learn that subjects have either a very high or a very low learning-rate. Then, when being evaluated in the actual learning task, the model could use its observations of a subject's choices in the early trials to determine whether the learning-rate for that specific subject is high or low, and then utilise that information to make more accurate predictions in later trials.

Determining such individual-specific traits in the early trials of the task is presumably not part of the computational processes occurring in the subject's brain during the task; it occurs in the model merely to make more accurate predictions. Therefore, if the network learns to do so, it might not be straightforward to treat such models as computational models of subjects' choices, but only as models that are able to make predictions about the choices.

Section S1

In the gql simulations presented in Figure 6, one observation is that earning a reward (shown by a black cross) causes a 'dip' in the probability of staying on an action, which shows the tendency to switch to the other action. This is consistent with the observation made in Figure 5 that the probability of switching increases after rewards. The reason that gql is able to produce predictions different from those of ql and qlp is that in this model the contribution of action values to choices can be negative, i.e., higher values can lead to a lower probability of staying on an action. Indeed, examining the learned parameters for this model (Table S4) revealed that for each action, values are updated at two different rates, a slow rate (0.145) and a fast rate (0.815), and the coefficient for the values that are updated at the faster rate is negative (−1.002). This implies that after earning a reward the value of the taken action increases quickly, but that increase leads to a lower probability of selecting the action, which produces the 'dip' in policies following rewards.

The next observation concerns the effect of previous rewards on the probability of switching after a reward. As shown in Figure 6, the gql model produced a pattern similar to rnn, and it can be seen that the probability of staying on an action increases after earning a reward, causing the depth of the dip after each reward to become smaller as more rewards are earned. This ability of gql arises because the model tracks two values for each action, one updated with a fast learning-rate and the other with a slow learning-rate. The one that updates faster plays a role in the dip following each reward. On the other hand, the value that updates more slowly (at learning-rate 0.145) has the opposite effect, since it contributes to choices with a positive coefficient (4.258); therefore, as this value increases after a reward, the probability of staying with the action increases. Based on this, allowing the model to track two different values for each action is important, and the model is not able to produce this behaviour if it tracks only one value for each action (N = 1), as shown in Figure S8.

Section S2

In the simulations shown in Figure 6, action R is fed into the model for the first 10 trials, before switching to the other action. This is based on the fact that in the empirical data, the average length of staying with an action (when one reward is earned in the middle of the execution of the action) is 9.8. The first, second and third rewards in Figure 6 are delivered after an action has been taken 4, 12, and 17 times respectively. This is based on the fact that in the empirical data, the average number of key presses needed to earn the first, second and third rewards is 4.07, 11.6, and 17.4 respectively.

In the simulations shown in Figure 10, the reason for adding leading R actions before the oscillations is to show that the models do not oscillate all the time, but only after they are fed with oscillations. Indeed, qlp is in principle able to produce 1-step oscillations (single-action runs) by assigning a negative weight to the perseveration parameter, i.e., instead of having a tendency to stay on the previously selected action, the model will have a tendency not to stay on that action. However, under this condition the model will keep oscillating between the actions from trial 1, implying that it can only produce runs of length 1 no matter what the length of the previous run of actions was, which is inconsistent with the empirical data presented in Figure 9.

Figure S6. Off-policy simulations of qlp. Each panel shows a simulation for 30 trials (horizontal axis), and the vertical axis shows the predictions of the model on each trial. The ribbon below each panel shows the action which was fed to the model on each trial. In the first 10 trials, the action that the model received was R and in the next 20 trials it was L. Rewards are shown by black crosses (x) on the graphs. See text for the interpretation of the graph. Note that the simulation conditions are the same as those depicted in Figures 6 and 7.

Figure S8. Off-policy simulations of gql with N = 1. Each panel shows a simulation for 30 trials (horizontal axis), and the vertical axis shows the predictions of the model on each trial. The ribbon below each panel shows the action which was fed to the model on each trial. In the first 10 trials, the action that the model received was R and in the next 20 trials it was L. Rewards are shown by black crosses (x) on the graphs. See text for the interpretation of the graph. Note that the simulation conditions are the same as those depicted in Figures 6 and 7.

Figure S9. The effect of the initialisation of the network on the off-policy simulations of rnn. The simulation conditions are the same as those depicted in Figures 6 and 7. Here, 15 different initial networks were generated and optimised, and the policies of the models on each trial were averaged. The grey ribbon around the policy shows the standard deviation of the policies. Each panel shows a simulation for 30 trials (horizontal axis), and the vertical axis shows the predictions of each model on each trial. The ribbon below each panel shows the action which was fed to the model on each trial. In the first 10 trials, the action that the model received was R and in the next 20 trials it was L. Rewards are shown by black crosses (x) on the graphs. See text for the interpretation of the graph.

Figure S11. rnn simulations. The graph is similar to Figure 8 but using data from rnn simulations. (Left panel) Probability of staying on an action after earning a reward as a function of the number of actions taken since switching to the current action (averaged over subjects). Each red dot represents the data for one subject. (Right panel) Probability of staying on an action as a function of the number of actions taken since switching to the current action. The red line was obtained using Loess regression (local regression), a non-parametric regression approach. The grey area around the red line represents the 95% confidence interval. Error bars represent 1 SEM.
Figure S12. gql simulations (N = 2). The graph is similar to Figure 8 but using data from gql simulations with N = 2. (Left panel) Probability of staying on an action after earning a reward as a function of the number of actions taken since switching to the current action (averaged over subjects). Each red dot represents the data for one subject. (Right panel) Probability of staying on an action as a function of the number of actions taken since switching to the current action. The red line was obtained using Loess regression (local regression), a non-parametric regression approach. The grey area around the red line represents the 95% confidence interval. Error bars represent 1 SEM.

Figure S13. rnn simulations. The graph is similar to Figure 9 but using data from rnn simulations. Median number of actions executed in a row before switching to the other action (run of actions) in each subject, as a function of the length of the previous run of actions (averaged over subjects). The dotted line shows the points at which the lengths of the previous and current runs are the same. Note that the median was used instead of the mean because we aimed to illustrate the most common 'length of current run', rather than the average run length for each subject. Error bars represent 1 SEM.

Figure S14. gql simulations (N = 2). The graph is similar to Figure 9 but using data from gql simulations with N = 2. Median number of actions executed in a row before switching to the other action (run of actions) in each subject, as a function of the length of the previous run of actions (averaged over subjects). The dotted line shows the points at which the lengths of the previous and current runs are the same. Note that the median was used instead of the mean because we aimed to illustrate the most common 'length of current run', rather than the average run length for each subject. Error bars represent 1 SEM.

Figure S15. gql simulations (N = 10). The graph is similar to Figure 9 but using data from gql simulations with N = 10. Median number of actions executed in a row before switching to the other action (run of actions) in each subject, as a function of the length of the previous run of actions (averaged over subjects). The dotted line shows the points at which the lengths of the previous and current runs are the same. Note that the median was used instead of the mean because we aimed to illustrate the most common 'length of current run', rather than the average run length for each subject. Error bars represent 1 SEM.

Table S1. Prediction of diagnostic labels using gql (N = 2). Number of subjects for each true and predicted label. The numbers in parentheses are the percentage of subjects relative to the total number of subjects in each diagnostic group.