Improving the reliability of model-based decision-making estimates in the two-stage decision task with reaction-times and drift-diffusion modeling

A well-established notion in cognitive neuroscience proposes that multiple brain systems contribute to choice behaviour. These include: (1) a model-free system that uses values cached from the outcome history of alternative actions, and (2) a model-based system that considers action outcomes and the transition structure of the environment. The widespread use of this distinction, across a range of applications, renders it important to index their distinct influences with high reliability. Here we consider the two-stage task, widely considered as a gold standard measure for the contribution of model-based and model-free systems to human choice. We tested the internal/temporal stability of measures from this task, including those estimated via an established computational model, as well as an extended model using drift-diffusion. Drift-diffusion modeling suggested that both choice in the first stage, and RTs in the second stage, are directly affected by a model-based/free trade-off parameter. Both parameter recovery and the stability of model-based estimates were poor but improved substantially when both choice and RT were used (compared to choice only), and when more trials (than conventionally used in research practice) were included in our analysis. The findings have implications for interpretation of past and future studies based on the use of the two-stage task, as well as for characterising the contribution of model-based processes to choice behaviour.


Introduction
Animal and human decision-making research suggests that when an agent deliberates on a course of action more than one control system contributes to choice. A dominant idea invokes a contribution of a model-free and a more sophisticated goal-directed model-based system. Both influence the choice process but their relative influence is assumed to vary between individuals and conditions [1]. Theoretical questions regarding human planning [2], reasoning [3], development [4], voluntary-control [5], learning under uncertainty [6], motor control [7], and deployment of attentional resources [8], all invoke this dichotomy. Human neuroimaging studies suggest these two systems rely on distinct neural substrates [9][10][11]. Additionally, various hypotheses regarding the clinical relevance of this distinction have been put forward [12], with relative deficits in a model-based system suggested as underpinning clinical conditions such as compulsivity [13,14], substance-use [15], and obesity [16]. Even moral judgments, especially of a deontological nature, are suggested to include a model-free and model-based contributions [17].
The contribution of model-free and model-based systems is often probed via a multi-step decision tree task, wherein participants are asked to make state selections (navigate their way in an artificial maze) to attain goal rewards [18]. In this paradigm, a habitual model-free learner is assumed to select actions based on a reward history alone, without considering task structure. In contrast, to make better choices, a model-based learner is assumed to avail of a cognitive map that takes account of transition structure. While in some cases these two systems can lead to the same action, the task also generates a scenario where these two systems lead to different actions, enabling measurement of their relative contribution to the decision-making process [18][19][20].
In the two-stage decision task participants navigate from a first to a second stage to gain rewards (see Fig 1). The second stage usually entails four bandits and participants make first stage choices that probabilistically determine which bandits are available at the second stage (see Fig 1). First-stage choices provide descriptive measures as well as model parameters that quantify the contribution of model-free and model-based systems [21]. Thus, when making a first-stage choice, a model-free system considers whether an action led previously to a reward. In contrast, a model-based system also considers the transition probabilities. For example, a model-based learner might select a specific action that was not rewarded on the last trial, based on an inference form task structure that it will more reliably lead to a more rewarding secondstage state.
This two-stage decision task is considered a gold-standard for measuring model-based/ model-free contributions to choice behaviour across computational [21], neuroimaging [2], behavioural [22], developmental [4] and clinical studies [13,23]. Variants of the task are also used in animal studies [24]. Despite its widespread use no study has yet provided task reliability estimates. Almost all studies make exclusive use of a metric of choice, but not reaction-time (RT), to derive estimates of model-basedness. However, a decision-making literature indicates RT data is an important source of information [25,26], with a recent study suggesting it might improve parameter identifiability in reinforcement learning paradigms [27]. It remains unclear whether a combination of choice and RT data might improve the reliability of modelbased scores.
Here, using the two-stage task we estimated the reliability of model-based estimates by exploiting a large data set which included two distinct testing time points derived from the Neuroscience in Psychiatry Network's study [28]. We started by describing a widely used computational model [18] that allows quantification of model-based/free trade-off in first-stage choices. A schematic of the two-stage task (panel A) and an example of a random walk used to generate the true expected value for each of the four bandits at the task second-stage (panel B). At the first-stage participants choose between two options (represented by abstract fractal images) that determined the presentation of the second-stage via fixed transition probabilities of 70% ('common') or 30% ('rare'). At the second-stage, participants again choose between two bandits that led to receipt of reward (£0 or £1 play pounds). Note the second-stage included two pairs of bandits where the composition of each pair was fixed, but where the value of each bandit drifted slowly and independently. More specifically the reward associated with the second-stage bandits were subjected to random walks and thus had to be constantly learned by participants.
https://doi.org/10.1371/journal.pcbi.1006803.g001 Improving reliability of model-based estimates with drift-diffusion modeling While the latter model is designed to predict choice, we extended on this model to determine whether a combination of choice and RT data might improve parameter recovery, as well as improve the reliability of model-based estimates [27]. We demonstrate a relationship between model-parameters and model-agnostic measures; one based on choices at the first stage and used widely in past studies, and the other based on RTs at the second stage that is much less reported in the literature [4,29,30]. We reported internal (within measurement) and temporal (between measurements) stability of model-parameters and model-agnostic scores, separately and combined. Overall, our study allowed estimation of psychometric properties of choicebased model-based estimates and demonstrates how these can be improve upon using RTs.
Model-parameter estimates for model-based/free trade-off w-parameter I (RL model, choice only). Daw at el., (2011) [18] reported a reinforcement learning (RL) model that allows quantification of a model-based/free trade-off from first-stage choice behaviour. Here, the value of each first-stage bandit was the sum of two components: (1) model-free value-reflecting the amount of previous reward that followed this bandit selection, and (2) model-based value-reflecting the highest value of the two bandits that is reached by a common/rare transition following this first-stage action. In simple terms, while the model-free component was "keeping tabs" of reward history following the selection of a firststage action, the model-based component was "looking forward" into the second-stage, and considered the best bandit that a first-stage action was likely to lead to (by means of a common transition). The model first updated model-free Q-values at each trial, which were initialized to zero at the beginning of the experiment, and updated at the end of each trial according to a SARSA reward prediction error algorithm [31,32].
Let a 1 /a 2 be the actions selected in the first/second stage of the task and reward at trial t be r (t) �{0,1}. The values of the actions in the second-stage were updated according to: where α was a learning rate (free parameter) and (r (t) -Q MF (a2,t) ) represented a reward prediction error. The model-free values of first-stage actions were updated using both the value of the second-stage action, and reward prediction error of the second-stage action: The overall model included five parameters: a w-parameter (model-based/model-free trade-off), a learning-rate (α, the updating rate of values given a new outcome), a decision temperature (β, the amount of decision stochasticity), an eligibility trace (λ, referring to updating first-stage decision values based on second-stage outcomes) and a choice-perseveration parameter (p, tendency to repeat first-stage choices regardless of previous outcome or transition).
w-parameter II (DDM-RL, choice & RT). While many reinforcement learning models are designed to explain agents' choices, the value differences between two choices also influences decision-time [33,34]. Moreover, it has been argued that use of RTs can increase the reliability of parameter estimates in RL models [27]. To explore whether including RTs improve reliability of w-parameter estimates, we extended the Daw et al., 2011 RL model to predict a combination of choice and RT, based on an assumption that value discriminability will be reflected in both choice and RT (with higher value differences leading to quicker RTs).
Previous studies have argued that a combination of choice and RT can be predicted by integrating Q-learning RL algorithms with evidence accumulation mechanisms [27,[34][35][36][37]. Here, we used the Wiener drift-diffusion model whereby the decision process is described as a continuous random walk (or diffusion) process [38][39][40]. In this model, evidence accumulated towards one of two boundaries, with time to reach one of the two boundaries and the identity of the attained boundary, determining decision-time and choice. At each time point, the amount of evidence in favour of one of the two alternatives drifted as a function of: where d was a drift-rate towards the selected action, and s 2 is fixed to 1 to allow scaling. The distance between the boundaries was determined by a free parameter-a, which represented the response policy. Greater distance between boundaries (a parameter) led to a more cautious policy (slower and more accurate response) while faster drift-rate (d parameter) led to greater sensitivity (quicker and more accurate decisions). Finally, X 0 (the amount of evidence when the process starts) was equal to 0.5 � a, given that we assumed no prior bias towards one of the two alternatives. The time of first passage, and which boundary was attained first, were therefore probabilistic with a probability determined by the parameter set. Next, to account for a combination of choice and decision-time generated by the value differences, Eq 7 was adjusted so that the drift-rate for each trial was the difference between the Q-values of the two alternatives, and the upper/lower boundary represented the selected and competitor choices a and a' [27,34]: Therefore ; d ¼ bðQ a; t ð Þ À Q a 0 ; t À � Þ, meant a greater value difference between the selected and alternative action translates in the model as higher drift-rates. The b parameter allowed further scaling (as both fixing s 2 = 1 and the Q-value range is arbitrary). Finally, a non-decision time parameter τ was added to the decision-time representing the duration that passed without any diffusion process (e.g., early perceptual/late motor processes), with RT being the sum of decision-time (accounted for by the first passage of time in the Wiener diffusion model) and τ. Overall, the current drift-diffusion model with reinforcement learning (DDM-RL) included five parameters from the Daw model accounting for RL algorithms (α 1 , α 2 , λ, p,w) along with three DDM parameters (b,a,τ). Estimation of log-likelihood given a combination of choice and RT per trial was obtained by using an analytic solution for Wiener's first passage of time [38] (the code was obtained from [41]).

Model-agnostic estimates for model-based/free trade-off
While the w-parameter serves as a process-based measurement of model-based/free trade-off (and thus has a clear interpretation), the literature makes use of model-agnostic measures that map directly to changes in the w-parameter. Model-agnostic measures can be beneficial as they are straightforward to calculate (can be done without model fitting or RL algorithms).
Here we examined a commonly used score based on first-stage choice [4,18,23,30], describing a greater tendency of a model-based agent to revisit a state that was previously rewarded as a function of common vs. uncommon transitions. We show that this tendency translates into systematic value differences in the second-stage. Giving the known relationship between discriminability and RT [33,42] this observation allowed us to elaborate a second score based on second-stage RTs (RT 2 ) [4,30].

MB-I (choice).
Model-based choice is estimated by calculating the interaction effects of transition (common vs. rare) and reward (rewarded vs. non-rewarded) in the previous trial on the probability of repeat of a first-stage choice in the next trial. A purely model-free learner is blind to transition structure, and therefore should display a higher probability of repeating the last first-stage choice when rewarded, regardless of transition (reward's main effect, see S1 Fig,  left panel). In contrast, a model-based learner has knowledge of transition probabilities, and makes use of that knowledge to choose the best bandit on the next trial. For a model-based learner a reward increases the probability of repeating a first-stage choice only when the previous transition was common, while after uncommon transitions reward reduces the probability of repeating first-stage action (increasing the chance that the agent would return to the state where the reward was obtained). This translates into a positive interaction of previous reward and previous transition on the probability of repeat of the first-stage action (see S1 Fig, right panel).
Let transition and reward at trial t be represented by X transition(t) �{0,1} for common and uncommon and X reward(t) �{0,1} for unrewarded and rewarded trials. Sticking with the same choice at the first-stage can be represented by Y stay �{0,1} for switch and stay choices. Then stay probability (chance of repeating first-stage choice) can be defined as P(Y stay | X transition(t), X reward(t) ). Individual MB-I (choice) scores can then be obtained by calculating the paired interaction of previous transition and reward on stay probability. This was achieved by first calculating the effect of transition separately for rewarded and unrewarded trials: Unrewarded ¼ PðY stayðtþ1Þ jX transitionðtÞ ¼ 0; X rewardðtÞ ¼ 0Þ À PðY stayðtþ1Þ jX transitionðtÞ ¼ 1; X rewardðtÞ ¼ 0Þ then calculating the difference between the two effects: Scores that are constrained by group prior (i.e., hierarchical MB-I (choice) ) were obtained by fitting a mixed-effect logistic for trial-by-trial data (the following indicates Wilkinson notation [43]): PðY stayðtþ1Þ Þ⁓ Transition ðtÞ � Reward ðtÞ þ ðTransition ðtÞ � Reward ðtÞ j SubjectÞ ð12Þ Whereby the MB-I (choice) score was the individual slope of the transition x reward interaction .

MB-II (RT).
Here, we added a less traditional MB score to ascertain whether this score increased the reliability of MB estimates [4,30]. This score is based on an assumption that more model-based participants are quicker to make a second-stage choice after common transitions, and slower after uncommon transitions, compared to model-free participants. This score was previously found to correlate positively with MB-I (choice) , and w-parameter [4,30], as well as correlate with right ventral striatum dopamine levels during task performance, mimicking an effect obtained with model-based model parameters [30]. The score was obtained as follows: with MB-II (RT) scores that are also constrained by group parameter obtained by a mixed effect linear regression whereby MB-II (RT) was the individual slope of the transition main effect (the following indicates Wilkinson notation [43]): RT 2ðtÞ ⁓ Transition ðtÞ þðTransition ðtÞ j SubjectÞ ð14Þ In the results section below, we test the relationship between process-based and modelagnostic estimates which further allowed us to describe why stronger deployment of modelbased strategies in the first-stage might lead to higher MB-II (RT).

Obtaining model-based estimates from empirical data
Psychometric properties in empirical data. We used data from the Neuroscience in Psychiatry Network's (NSPN) study [28]. This is a community-based longitudinal sample of young volunteers (age 14-24 years), living in Cambridgeshire and London regions, UK. The study was designed to measure developmental change. Participants completed a two-step task lab testing, among other measurements. Here we focused on psychometric properties of the task, across questions referring to development. Participants were excluded if they had recognised learning disabilities or neurological disorders. Our final dataset included 554 subjects Process-based estimates. To obtain w-parameters estimates at the individual level, we fitted the RL (choice) and DDM-RL (choice & RT) to empirical data, separately for each time point. Fitting was performed either with (i.e., hierarchical fit) or without group priors (i.e., individual fit, see Methods for further details). We fit two versions of the RL model, with five parameters (single α and β for both stages) and seven parameters (α 1 , α 2 , β 1 , β 2 for each stage), and compared the two by calculating Bayesian inference criteria (BIC int ) which penalizes for the number of parameters (difference of 10 points or more is considered strong evidence, with lower scores indicating better fit) [44]. BIC at both baseline [BIC int_5_parameters = 134738.6, BIC int_7_parameters = 134194.7] and follow-up [BIC int_5_parameters = 216901.9, BIC int _7_parameters = 215646] favoured the seven parameter model. Hence, this model was used for all further analysis (see S2 Table for parameters  descriptive statistics).
We also fit two versions for the DDM-RL, one with single DDM parameters (a,b,τ) at each stage and one with two sets (a 1 ,b 1 ,τ 1 ,a 2 ,b 2 ,τ 2 ) at each stage. BIC at both baseline [BIC int_8_parameters = 113648.1, BIC int _11_parameters = 104220.4] and follow-up [BIC int_8_parameters = 174546, BIC int_11_parameters = 162964.3] favoured the 11 parameter model. Hence, this model was used on further analysis (see S2 Table for parameters descriptive statistics).
Model-agnostic estimates. We next calculated model-agnostic scores, MB-I (choice) and MB-II (RT) for each individual from choice and RT data. We calculated the scores separately at each time point either with (i.e., hierarchical scores, Eqs 12 and 14) or without group priors (i.e., individual scores, see Eqs 11 and 13 and S2 Fig for histograms). To obtain an estimate of the effect-size of model-agnostic measures we calculated η 2 estimates (explained variance) for each score, across participants and across time measurements. For MB-I (choice) the transition � reward interaction factor explained 25.1% of the first-stage stay probability. For MB-II (RT) , the transition factor explained 67.3% of the mean RT 2 variability.

The relationship between process-based and model-agnostic estimates
First, we wanted to assess whether changes in w-parameter directly affected MB-I (choice) , and MB-II (RT) . To do this we systematically increased the w-parameter in an RL-DDM model, and calculated both MB-I (choice) and MB-II (RT) , in-silico. A strong relationship between all three estimates was evident (see Fig 2A).
While the relationship between the w-parameter and MB-I (choice) is well documented [18], that is not the case for MB-II (RT). This raises a question as to why does RT 2 differ as a function of model-based/free trade-off? Our assumption was that compared to a model-free agent, a model-based agent has better discriminability between the values of the two options (larger Qvalue differences) when reached by a common vs. an uncommon transition. This is because a model-based agent is more likely to choose a first-stage bandit that will lead to the maximal Qvalue out of the four second-stage alternatives via a common transition (Eq 3 and Fig 2D). Therefore, common transitions for model-based agents have a better chance of having a larger Q-value difference and hence better discriminability between the two values (see Fig 2B), whereby higher discriminability is tightly related to quicker RTs. To test this assumption we performed two complimentary in-silico analyses. First, we located in each trial the secondstage state that had the highest Q-value bandit out of the four second-stage bandits (hereafter, 'best state') and the alternative state (hereafter, 'worst state'). We then calculated the averaged ΔQ-value (maximum-minimum Q-value) for the best vs. worst state and compared the two ΔQ-values for the two states. We found that on average the best state also had better value discriminability (mean ΔQ-value = .59) compared to the worst state (mean ΔQ-value = .27; see Fig 2C). This confirms the first part of our assumption, postulating that the state that holds the highest Q-value bandit (out of the four bandits) also entails higher value discriminability. Second, since model-based agents are more likely to reach the state with the highest Q-value bandit by means of a common transition (see Eq 3 and Fig 2D), it follows that higher w-parameter will lead to higher ΔQ-value for common vs. uncommon transitions. A second analysis confirmed this assumption, showing higher ΔQ-value for common and lower ΔQ-value for uncommon transitions as w-parameters values were increased in-silico (see Fig 2B). In the RL-DDM model ΔQ-value is directly translated into higher drift-rates, leading to quicker RTs on average (see Eq 8). Thus, this result confirms our assumption that systematic differences in model-predictions for RT 2 as a function of transition type, are a direct result of model-based/ free trade-off at the first-stage.
To examine the relationship between the three scores in empirical data, MB-I (choice) , MB-II (RT) and w-parameter we averaged each score across both time points (baseline and follow-up) and calculated Pearson/Spearman correlation coefficient (see Table 1). Results suggest a strong relationship between MB-I (choice) , MB-II (RT) (with ⁓37% shared variance, see Fig 3A), and a moderate relationship of both with w-parameter (see Fig 3B and 3C). Examining Table 1 shows that hierarchical scores outperform individual ones in general. Finally, we examined whether correlations between the w-parameter and the scaling (b 2 ), or the threshold (a 2 ) We then standardized the eleven mean scores, separately for each MB score. Results showed a strong relationship between the w-parameter and both model-agnostic measures. (B) Here we illustrate how deployment of model-based strategies in the first-stage is affecting MB-II (RT) via systematic effects on second-stage value discrimination. Specifically, Panel B presents averaged ΔQ-value (max-min Q-value) for the secondstage state the agent visited. Results confirmed that higher w-parameter values lead to higher/lower value discriminability (ΔQ-value) after common/uncommon transitions, respectively. Notably, in the DDM-RL model ΔQ-values are directly and positively associated with drift-rates and hence contribute to faster RTs (see Eq 8). This result illustrate why higher w-parameter is associated with quicker/slower RT 2 after common/ uncommon transitions, respectively. (C/D) To further demonstrate how deployment of model-based strategies in the first-stage leads to systematic value differences in the second-stage we labelled in each trial the best and worst state (state that included the highest Q-value out of the four available second-stage bandits, and the alternative state). Panel C shows that across all simulation the best state was related with higher value discriminability (higher ΔQ-value), regardless of the w-parameter. Panel D further shows that higher w-parameter is related with higher probability of visiting the best state by means of common transitions (see Eq 3). Therefore, Panels C & D illustrates the reason that higher wparameter leads to higher value discriminability after common trials as illustrated in Panel B. Improving reliability of model-based estimates with drift-diffusion modeling parameters in the second-stage contribute to the empirical correlations between the w-parameter and MB-II (RT) . For example, if model-based individuals also had systematically larger thresholds and/or systematically higher scaling in the second-stage this might inflate MB-II (RT) . We calculated Spearman correlations and found a modest positive correlation between w and b 2 (r = .09, p = .03) and a non-significant correlation between a 2 (r = .06, p = .14). Importantly, Spearman partial-correlation analysis indicated that the empirical relationship reported in Table 1 between the w-parameter and MB-II (RT) remains largely unaffected after controlling for b 2 parameter estimates (Partial correlation = .41, p < .001).

Parameter recovery
To examine the influence of task length on the reliability of estimated parameters, we performed a parameter recovery analysis. Table 2 presents the correlation between the true and recovered parameters, as a function of the number of trials in the analysis. While the w-parameter reached an acceptable value only after ⁓1000 trials for the RL model, the DDM-RL model reached the same value after as little as 200 trials (see Table 2 and Fig 4A and 4B). Furthermore, the fact that the learning rate for the first-stage showed better recovery for the DDM-RL vs. RL model, suggests an overall better parameter recovery for the first-stage in the former.

Internal stability
To measure internal stability, we calculated split-half reliability scores [45][46][47]. These scores are obtained by splitting a task into two (or more) parts and then estimating the extent to which the different parts reflect the same score. Here, we adopted a common practice of splitting the task into odd and even trials [45,46]. Note that we can still calculate MB score even when omitting odd/even trials by omitting the behaviour of the previous trial from the analysis, but not the coding (rare/common, rewarded/unrewarded). We estimated a MB score separately for each part, calculated a Pearson correlation coefficient to estimate the extent that the two parts reflect the same score, and used the Spearman-Brown formula to get a final estimate of internal constancy (owing to splitting the task into two parts, the Pearson correlation scores reflects internal consistency for only half of the trials; Spearman-Brown formula allows correction of the correlation estimate to reflect internal reliability as if it was obtained in two parts, each with a complete number of trials). A conventional criterion is that the odd-even correlation should exceed .7. Table 3 summarises our findings showing good reliability for To test for the effect of the number of trials on internal stability, we performed the half-test reliability analysis using the first 20 trials alone. We then repeated the analysis, each time adding one additional trial and re-calculating the reliability scores.  Improving reliability of model-based estimates with drift-diffusion modeling One reason MB-I (choice) might have lower internal stability is that it reports on four types of trials (see Eqs 9-11). Moreover, uncommon trials are less frequent by definition, and so these estimates are potentially noisier. Consequently, we examined the internal consistency of stay probability for these four trial types separately, again using the follow-up data alone (where we had more trials).  We found for all four types an acceptable internal stability after 200 trials (see Fig 5). Why would MB-I (choice) score show low internal consistency after 200 trials, if the four conditions used to calculate it are reasonably stable? The answer may rest in the observation that difference scores are mathematically less reliable than their components, as long as the measurement noise is independent [48]. This means that only high reliability for each of the four conditions will provide reasonable reliability for the MB score itself. Therefore, it is plausible that after ⁓350 trials, when the four probabilities attain good internal reliability, will an interaction score become acceptably reliable.

Temporal stability
To assess temporal stability we calculated Pearson's correlation between baseline and followup MB estimates. Results are presented in Table 3. We found overall low temporal stability for the w-parameter in both RL and DDM-RL models (see S3 Table for the remaining parameters). Model-agnostic scores had low to medium stability with slightly better estimates for hierarchical MB-I (choice) scores compared with individual ones. We also applied a method suggested by Silva and Hare, allowing to correct for a minority of cases where the state value can cause misinterpretation of MB-I (choice) [49]. Specifically, we added two control variables to the hierarchical regression coding the fractals selected at the first-stage and second-stages (See Eq 2 in Silva and Hare study [49]), however we found very similar temporal stability (r = .41) for MB-I (choice) .
To assess whether better temporal stability were obtained by aggregating the two MB scores (choice and RTs) we turned to latent factors analysis using structural equation modelling (SEM) [50]. SEM is a multivariate method that combines factor analysis and multiple regression. It allows estimation of structural relationships between latent constructs and their measured variables (indicators). Latent factors are considered less noisy than their counterparts [51], but come with the disadvantage that these latent factors are sometimes difficult to interpret [52]. Here, we constructed two MB latent factors for baseline and follow-up time points, with each latent factor predicting MB-I (choice) and MB-II (RT) . Pearson correlation between the baseline and follow-up latent MB factors, showed better estimates compared to the separate scores (Table 3,   Improving reliability of model-based estimates with drift-diffusion modeling Finally, as the time-gap between measurements was not the same for all participants (see Procedure), we explored whether temporal reliability might increase with shorter time-gaps. We calculated a regression where baseline scores, time-gap and their paired interaction predicted follow-up scores (separately for MB I & II). For the individual MB-I (choice) we found a significant interaction (p < .01), such that a shorter time-gap predicted higher temporal stability. To assess further the magnitude of this interaction effect, we performed a median split on time-gap estimates (above/below 18 months between baseline and follow-up). We found that participants in the short time-gap group had higher temporal stability, (r = .41) compared with those tested on longer time-gaps (r = . 13). No interaction effect of time-gap on temporal stability estimates was found for hierarchical regression MB-I (choice) or MB-II (RT) scores (all p-values>.8). Finally, our data set had a subset of 61 participants that had an additional measurement, six months after baseline. S1 Text presents temporal stability scores between baseline and the six-month follow-up.

Power analysis for group level estimates
A question of great practical interest is how reliability in group-level estimates scale with group size. Specifically, an interest in group differences in model-based abilities (e.g. assessing differences between clinical and healthy populations, or effects of manipulations) renders it important to estimate the chance of finding a between-subjects effect in group comparisons, given such an effect exists in the population. We performed a power analysis using simulated data from the DDM-RL model, examining an ability to detect differences in MB estimates between two groups (control vs. experiment). Specifically, we examined statistical power (chance for a statistically significant result given an effect exists in the population) as a function of effect-size between groups (small, medium or large), the number of participants (30, 100 or 500 per group) and the number of trials (200, 500 or 1000).We generated two truncated wparameter Gaussian distributions (control, experiment) with a standard deviation of .1 (based on the empirical parameters obtained from fitting the model, see S2 Table) and means set to generate a small effect-size (group means were .49/.51, Cohen's d = . 2), medium effect-size (group mean was .475/.525, Cohen's d = .5) or large effect-size (group means .46/.54, Cohen's d = .8). For each of the three effect-sizes we sampled across 1000 iterations of N participants (30, 100 or 500) with n trials (50, 100 or 500).
For each iteration we calculated MB I&II scores (choice & RT) for each agent, and calculated the test statistics chi-square of a MANOVA analysis with both scores as dependent variables. A statistically significant chi-square means that an H0 (no group difference in modelbased estimates) can be rejected given the data. As we know the ground truth (H0 is false), we calculated the percentage of studies that would reach the correct conclusion (reject H0). Therefore, for each effect-size, sample size (N) and experiment length (n) we calculated the percentage of samples above the critical chi-square, representing the proportion of studies that would obtain a statistically significant effect. Table 4 presents the results, showing very low power for small effect sizes for all sample sizes, reasonable power for a medium effect with 500 subjects, and good power for a large effect with 500 subjects. empirical data for the four conditions that assemble MB-I (choice) (CR: common-rewarded, CU: common-unrewarded, UR: uncommon-rewarded, UU: uncommon-unrewarded, see Eq 9-11). Ribbons present 95% CI. The horizontal line represents the .7 criteria for internal stability.

Discussion
Our study examined the psychometric properties of model-based/model-free estimates in a two-stage decision task. First, we quantified the extent to which participants consider both the transition and reward history when making a first-stage choice (MB-I (choice) ). Second, we quantified the benefit/cost in second-choice RTs following common/uncommon transitions (MB-II (RT) ).
While MB-I (choice) is widely exploited in the literature [4,13,18,30], the latter is less used [4,30], raising the question as to whether it actually helps quantify MB processes and whether it can enhance the reliability of MB estimates. We considered a computational model that predicted both choice and RT (DDM-RL). A simulation analysis showed a close relationship between the process-based w-parameter (quantifying model-based/free trade-off at the first stage) and MB-II (RT) . Specifically, we found that deployment of model-based strategies at the first-stage affected the value discriminability of bandits at the second-stage. As discriminability is strongly and negatively related to RTs [33,42], it follows that these value differences should be observed at the level of RTs. A tight relationship between the w-parameter and MB-II (RT) highlights the latter as a promising index of the deployment of model-based strategies at the first-stage. Furthermore, our findings suggest that a model that predicted a combination of choice and RT had better recovery properties for the w-parameter, as well as greater internal consistency for model-agnostic measures in simulated data (both choice and RT scores).
What do our results suggest in terms of reliability of model-agnostic MB measures? First, we found that MB-I (choice) , but not MB-II (RT) had a low internal stability when the 201 trials version of the task was used. An antecedence search for Daw et al.'s (2011) study revealed 31 studies that directly used the same paradigm to quantify model-free and model-based involvement from observed data. Of these 31 studies, 25 studies used 201 trials or less, and the remainder used approximately 300 trials. The low internal reliability of MB-I (choice) seems to be attributable to the fact that this score is based on estimation of four conditions. Moreover, the fact that uncommon trials are less frequent, renders it even more challenging. According to our calculations, MB-I (choice) is expected to attain reasonable internal reliability at ⁓350 trials, at least 75% longer than is usual practice up to now. Second, as regards temporal stability, both MB-I (choice) and II (RT) fell short in terms of expected conventional criteria. The use of the group prior (hierarchical fit) vastly improved the internal consistency score for MB-I (choice) , and its temporal stability to a lesser extent. However this should be interpreted with caution, since the group prior might drive a more stable, but less valid measure. Finally, using both MB-I (choice) and II (RT) to estimate a latent model-based factor, showed much better stability Table 4 Table values show the chance of finding a statistically significant between group effect as a function of true effect-size, sample-size and number of trials in the experiment. Improving reliability of model-based estimates with drift-diffusion modeling between the two time points. This is in line with previous claims that latent factor analysis using SEM can be a simple and straightforward method of reducing measurement noise [51].

. Statistical power (percent of studies that rejected the null hypothesis, given an effect exists) for a between group design (control vs. experiment).
Overall, a DDM-RL model improved psychometric properties for model-based estimates in the two-step task as reflected in better parameter recovery, as well as improved stability of model-agnostic scores (calculated from simulated data). It also provided a link between model-based/free trade-off in first-stage choices and RT differences in second-stage choices. This link is important because it supports a claim that both choice and RT based model-agnostic scores reflect model-based/free trade-off at the first-stage, allowing use of the combination of these two scores to provide a more reliable estimation. However, when looking at the psychometric properties of the w-parameter, it seemed DDM-RL did not improve the temporal stability compared to that of the RL (choice only) model. One reason for this might be that the number of trials was not enough to allow DDM-RL to attain a stable estimation of the wparameter (121 trials in baseline and 201 at follow-up). Another possible reason might be that model-based/free trade-off in the first stage influence RTs in the second-stage via additional processes not accounted for by DDM-RL. For example, MB-II (RT) score might also be related to the expectation participants have for the common transition-related state. That is, participants with stronger deployment of model-based strategies in the first-stage, also take more time to make a decision regarding a second-stage selection, shortening RT for common trials. However, in the uncommon transition, model-based individuals would need to inhibit their expectation and switch to an unexpected state, thus prolonging response latency [53]. A model-free participant would be less sensitive to such expectations and show similar RTs on both transitions. However, DDM-RL model was unable to capture these set-shifting effects, therefore such possibilities need to the object of further study.
Two straightforward procedures can promote increased reliability in model-based estimates using the two-step paradigm. The first is relatively easy to implement, and involves acquiring more trials than commonly used in current practice. The second is consideration be given to using both first-stage choice and second-stage RTs to attain higher stability. Until now, most if not all decision models in the goal-directed literature use choice data, disregarding the time an agent takes to commit to that choice. Decision models that account for a combination of choice and RT (e.g., evidence accumulation models) might prove more reliable than models that rely on choice alone. Finally, where possible, the use of repeated measures indicating a latent model-based factor should be much preferred.
Given the noisy estimates for model-based behaviour, this raises questions concerning positive findings already reported (see introduction). Our findings would suggest that past studies underestimate the true effect of different contributions to choice behaviour. An example here is the small effect sizes found in studies that compare clinical and healthy participants [13,29]. While some studies report MB I & II [4,29,30], we suggest that use of a combination of these two scores in both model-agnostic and computational analyses should increase reliability of model-based estimates, allowing for better assessment of effect-sizes and boosting replicability.
In conclusion, a conceptual distinction between model-based vs. model-free processes in behaviour control has fostered a rich and growing literature. However, we highlight potential reliability issues that can be addressed by relatively simple measures, such as increasing the number of trials and by applying modelling to both choice and RTs. One goal of computational neuroscience is to assess cognitive processes at a single subject level. For example, predicting/explaining clinical symptoms based on computational/neurological cognitive estimates to eventually inform clinical decisions [54,55]. This requires provision of stable and reliable estimates and our study highlights ways this can be advanced.

Limitations
Two limitations need to be considered. First, test-retest stability might suffer from the fact that, in our study, at the second measurement participants had more experience with the task. This can either reduce re-test stability (both measurements are not measuring exactly the same thing) and/or might increase the apparent reliability of the second measurement, as more experience in the task can lead participants to behave more consistently. Our analysis cannot differentiate between those aspects. Second, our data were obtained from an adolescent and young adult population and the study findings cannot be generalised outside the type of population we investigated.

Ethics statement
The study was carried out in accordance with the Declaration of Helsinki and Good Clinical Practice guidelines. Ethical approval was granted by Cambridge Central Research Ethics Committee (12/EE/0250).

Participants
Data was obtained from the Neuroscience in Psychiatry Network's (NSPN) study [28]. This is a community-based longitudinal sample of young volunteers (age 14-24 years), living in Cambridgeshire and London regions, UK. The study was designed to measure developmental change. Participants were recruited by invitation sent to local general practitioners (GP), adverts in the community, schools and further education colleges. Written informed consent was given by the participants aged 16-24 years, those aged 14-15 years gave written informed assent and their parents/legal guardian provided written informed consent. Participants were recruited in an age-sex-stratified sample, for the following five age groups: 14-15, 16-17, 18-19, 20-21, and 22-24 years. Participants were invited to take part in detailed behavioural assessments including computer-based evaluations, clinical assessments and IQ measures, during three time points (N = 819, N = 63, N = 571). Only participants that had completed the measurement of interest (two-stage task) were included in the current analysis. In order not to reduce the sample-size substantially (owing to lower sample-size at the second time point) we used data only from the first and third time-points (henceforth, baseline and follow-up). However we report a subset of our analysis with the second time-point in S1 Text. Our final dataset included 554 subjects (274 males, 280 females) in two time points: baseline (mean age = 18.85, range 14.1 to 24.98) and follow-up (mean age = 20.33, range 15.11 to 26.48). Further details about recruitment, participants consent, and ethical approval can be found at Kiddle et al., (2018) [28] (Note that the Kiddle et al., study was reported before data collection was completed, and has a slightly lower amount of participants included. The current analysis is based on data collected up until March 2018. To the best of our knowledge, no more participants were tested after that date).

Procedure
For both measurements (baseline and follow-up) participants were invited to a lab session in one of the UK's collaborating institutions [28]. The mean time gap between the two measurements was 17.75 months (range 11.76 to 31.44 month). At each measurement session participants completed computer-based cognitive evaluations, clinical assessments and IQ measures. At the end of the assessment day, participants were paid a fixed amount plus a bonus based on performance. For the purpose of this study we focus on analysing data obtained from the twostage task.

Two-stage decision task
The task was the same as the one developed by Daw at el., (2011) [18], and is described in Fig  1. Participants are instructed to win as many rewards (play pounds) as possible, and were told also that they would receive a payment bonus based on overall task performance. At each of the stages, subjects select one of two stimuli within 2 seconds. The inter-trial interval was randomly selected from a uniformed distribution ranging from 1 to 2 seconds. The task included 121 trials at baseline and 201 trials at follow-up (a shorter version in baseline was given due to time constraints, and increased to 201 to match Daw et al., 2011). A short break was provided after half of the trials were completed.

Participant exclusion and pre-processing
We included participants that had a completed data set for both baseline and follow-up (N = 569). We excluded participants that responded in the two-stage task with the same key on more than 95% of the trials (two participants), or had implausible RTs (below 150 ms ) on more than 10% of the trials (13 participants; Gillan et al., 2016). This resulted in the inclusion of a total of 554 participants in our full analyses. For the remaining two-stage task data, the first trial in each block, as well as trials with implausible RTs (below 150 ms ) were omitted from the analysis (1% of the overall trials).

Parameter recovery
To perform parameter recovery analysis we randomly selected parameters values for 100 agents from uniform distribution with ranges set to α[0,1], β [1,8], λ[0,1], w[0,1], p[0,.5], b 1/ 'lmer' R package, with a Laplace approximation and bound optimization by quadratic approximation (BOBQA). Variables coding was done similar to previous studies, with Transition coded as -1 or 1 for uncommon/common transitions, and Reward as -1/1 for unrewarded / rewarded trials [4,13,49]. The logistic regression included fixed effects and random effects for the full factorial design where Stay is predicted by intercept, previous transition, previous reward and their interaction. The random effects for the transition x reward interaction was then used as MB-I(choice). For MB-II (RT) we fitted a mixed effect linear regression using the same R package, predicting RT 2. The linear regression included fixed and random effect for the intercept and transition effects. Slope random effect for the transition effect was used for MB-II (RT) scores.

Structural equation modeling
Temporal stability analysis was performed twice, once by clamping each score and once without clamping. That is, instead of estimating separate parameters for the loading of each MB score on the latent factor, we also tested model fit when 'clamping' the parameters so that each MB score has the same loading for both time-measurements. The model with clamping showed slightly better fit to the data (RMSE = .087, BIC = 139.24) compared to the one without clamping (RMSE = .126, BIC = 144.96), and was therefore used in this analysis.

Data availability
Open-Science Framework (OSF) project including: (1) a Matlab code for simulating RL and DDM-RL models (2) a .csv data file with empirical observations (fully anonymized) and (3)  To assess the ability of the DDM-RL to predict participants RTs, we calculated for each empirical RT 2 distribution at baseline/followup nine RT percentiles (.1 to .9). We then simulated for each individual 50 experiments with 1000 trials each, based on the fitted parameters, and calculated RT percentiles from simulated data. The plot suggests a good match between empirical and predicted RTs as can be seen by a linear trend between the empirical vs. simulated percentiles (with a tendency of the model to overestimate long RTs). Each color representing a different individual. (TIF)

S5 Fig. Model predictions and empirical behaviour for the task's second-stage, as a function of value discriminability.
Here, we simulated for each participants Q-values using the individual's parameters (hierarchical RL-DDM) and the sequence of events the individual experienced (i.e., rewards and transitions during performance). For each trial we calculated ΔQ-value (maximum-minimum), and averaged model predictions for choices and RTs (based on simulations of 100 decisions per trial). Therefore, for each trial we obtained empirical choices and RTs taken form participants behaviour as well as averaged choices and RTs simulated by the model, based on the trial-by-trial Q-values. Trials were then binned into five bins according to ΔQ-value of 0 to .2, .2 to .4, .4 to .6, .6 to .8 or .8 to 1 (represented in the x-axis across all plots). Results are presented separately for model-free and model-based participants (grouped by means of median split over the w-parameter estimates). (A/B) Probability of selecting the bandit with the higher Q-value, as a function of value discriminability (difference between high and low Q-value bandit). (C/D) Mean reaction-times as a function of value discriminability. Overall, these plots present a good fit between model prediction and participants' behaviour, with no visual difference between model-based and model-free behaviour. Error bars for empirical data represent standard error. Trials were binned into five bins according to ΔQ-value of 0 to .2, .2 to .4, .4 to .6, .6 to .8 or .8 to 1 (represented in the x-axis across all plots). We then further binned for each individual RTs into five bins, separately for each ΔQ-value bin (total of 25 bins pre individual). We then calculated the mean RT (in seconds) for each of the 25 bins separately for model-free and model-based participants (group by means of median split over the w-parameter estimates). (A/B) Observed second-stage RTs as a function for model-free and model-based individuals. (C/D) Model predictions for model-free and model-based individuals. Overall, we did not find any differences between model-free and model-based individuals in terms of how good the model predicted RTs. This is despite a slight tendency of the model to predict quicker RTs for the last bin, and slower RTs for the fast bin. Error bars for empirical data represent standard error. (TIF)