Reliability of gamified reinforcement learning in densely sampled longitudinal assessments

Reinforcement learning is a core facet of motivation, and alterations have been associated with various mental disorders. To build better models of individual learning, repeated measurement of value-based decision-making is crucial. However, the focus on lab-based assessment of reward learning has limited the number of measurements, and the test-retest reliability of many decision-related parameters is therefore unknown. In this paper, we present Influenca, an open-source, cross-platform application that provides a novel reward learning task complemented by ecological momentary assessment (EMA) of current mental and physiological states for repeated assessment over weeks. In this task, players have to identify the most effective medication by integrating reward values with changing win probabilities (which follow random Gaussian walks). Participants can complete up to 31 runs of 150 trials each. To encourage replay, in-game screens provide feedback on progress. Using an initial validation sample of 384 players (9,729 runs), we found that reinforcement learning parameters such as the learning rate and reward sensitivity show poor to fair intra-class correlations (ICC: 0.22–0.53), indicating substantial within- and between-subject variance. Notably, items assessing psychological state showed ICCs comparable to those of the reinforcement learning parameters. To conclude, our innovative and openly customizable app framework provides a gamified task that optimizes repeated assessments of reward learning to better quantify intra- and inter-individual differences in value-based decision-making over time.

options with high reward magnitudes compared to an integration of both reward magnitude and win probability with equal importance (Behrens et al., 2007).
In that case, decision-making with equal weighting would be achieved with γ = 1, whereas γ > 1 would lead to decisions that are predominantly based on the learned probabilities and less on differences in reward points (i.e., avoiding the risk of choosing the wrong option). In contrast, γ < 1 would lead to decisions that are predominantly based on differences in reward points, at the risk of choosing the wrong option.
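To make the role of γ concrete, here is a minimal Python sketch (an illustration, not the authors' implementation; the function name, probabilities, and magnitudes are hypothetical) of how the multiplicative weighting exponent shifts choices between a likely-but-small and an unlikely-but-large option:

```python
# Hypothetical sketch: multiplicative weighting of learned win probability
# against reward magnitude, with gamma as the weighting exponent.

def option_value(prob: float, magnitude: float, gamma: float) -> float:
    """Value of an option: win probability raised to gamma, times magnitude."""
    return (prob ** gamma) * magnitude

# Option A: likely but small reward; Option B: unlikely but large reward.
p_a, m_a = 0.7, 40
p_b, m_b = 0.3, 100

for gamma in (0.5, 1.0, 3.0):
    v_a = option_value(p_a, m_a, gamma)
    v_b = option_value(p_b, m_b, gamma)
    choice = "A" if v_a > v_b else "B"
    print(f"gamma={gamma}: V(A)={v_a:.1f}, V(B)={v_b:.1f} -> choose {choice}")
```

With these made-up values, low γ favors the high-magnitude option B, whereas high γ flips the choice to the high-probability option A, mirroring the trade-off described above.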

Model Comparisons
We compared four different models: one learning rate with free multiplicative weighting (1LR_gamma), two learning rates with equal multiplicative weighting fixed at γ = 1 (2LR), two learning rates with free multiplicative weighting (2LR_gamma), and two learning rates with additive weighting (2LR_lambda). We calculated the Bayesian information criterion (BIC) across all runs included in the analyses and additionally evaluated which model explained the data best in most runs, again using the BIC for each individual run. The winning model (BIC = 1,004,374; ΔBIC compared to less complex models between 92,585 and 242,904) included two learning rates and the additive weighting (Figure S3). Thus, we performed all subsequent analyses using the model with two learning rates and the additive mixture parameter λ.
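As a sketch of the comparison procedure (the negative log-likelihoods below are made up for illustration; the parameter counts follow the model descriptions above), the BIC for one run can be computed as k·ln(n) + 2·NLL and the model with the lowest score selected:

```python
import math

def bic(neg_log_lik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion from a negative log-likelihood."""
    return n_params * math.log(n_obs) + 2.0 * neg_log_lik

# Hypothetical per-run negative log-likelihoods and parameter counts:
# 1LR_gamma:  alpha, beta, gamma                 -> 3 free parameters
# 2LR:        alpha_win, alpha_loss, beta        -> 3 free parameters
# 2LR_gamma:  + free gamma                       -> 4 free parameters
# 2LR_lambda: + additive mixture parameter lambda -> 4 free parameters
fits = {"1LR_gamma": (95.2, 3), "2LR": (93.8, 3),
        "2LR_gamma": (90.1, 4), "2LR_lambda": (88.4, 4)}
n_trials = 150  # trials per run, as in the task

scores = {m: bic(nll, k, n_trials) for m, (nll, k) in fits.items()}
winner = min(scores, key=scores.get)
print(winner)
```

Summing these per-run BICs over all included runs gives the aggregate comparison; counting per-run winners gives the second criterion mentioned above.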

Test-retest reliability
As described in Raudenbush and Bryk (2002), we derived the unconditional ICC based on the null model, implemented with lme4 in R:

Parameter ~ (1|ID)

with the formula:

ICC = τ00 / (τ00 + σ²)

The ICC describes the reliability of a measure on the scale of a correlation coefficient, where values close to 1 reflect high similarity within participants, whereas lower ICCs indicate lower similarity within participants. Here, τ00 is the variance explained by the random intercept (ID) and σ² denotes the residual variance. This ICC assesses absolute agreement and corresponds to an ICC derived from a random effects model with repeated measures (ICC(1,k); Koo & Li, 2016; Shrout & Fleiss, 1979). Additionally, we calculated the conditional ICC taking (fixed) run effects (log-transformed) into account with the following mixed-effects model:

Parameter ~ log(run) + (1|ID)

This ICC assesses consistency when considering systematic differences between timepoints and corresponds to the ICC derived from a two-way mixed effects model with repeated measures (ICC(3,k); Koo & Li, 2016; Shrout & Fleiss, 1979). We interpreted the ICC according to recommendations by Shrout and Fleiss (1979), such that values < 0.4 reflect poor, values between 0.4 and 0.6 fair, values between 0.6 and 0.75 good, and values > 0.75 excellent reliability.
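The mixed-model ICCs were computed with lme4 in R; for a balanced design, the unconditional ICC(1,k) can equivalently be estimated from one-way ANOVA mean squares. A minimal Python sketch under that assumption (the per-run learning-rate estimates below are hypothetical):

```python
from statistics import mean

def icc_1k(data: list[list[float]]) -> float:
    """ICC(1,k): reliability of subject means over k repeated runs,
    estimated from one-way ANOVA mean squares (MSB - MSW) / MSB.
    Assumes a balanced design: every subject has the same number of runs."""
    n = len(data)     # number of subjects
    k = len(data[0])  # runs per subject
    grand = mean(x for row in data for x in row)
    subj_means = [mean(row) for row in data]
    # between-subject mean square
    msb = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    # within-subject mean square
    msw = sum((x - m) ** 2
              for row, m in zip(data, subj_means)
              for x in row) / (n * (k - 1))
    return (msb - msw) / msb

# Hypothetical learning-rate estimates: 3 players x 4 runs each.
runs = [[0.20, 0.25, 0.22, 0.24],
        [0.55, 0.50, 0.58, 0.53],
        [0.35, 0.40, 0.33, 0.38]]
print(round(icc_1k(runs), 2))
```

Stable within-subject estimates with clear between-subject spread (as in this toy example) push the ICC toward 1; large run-to-run fluctuations within players, as observed for several model parameters here, pull it down.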

Figure A. Model comparisons of four candidate computational models revealed that a model with four free parameters (learning rates for wins and losses, reward sensitivity, and an additive mixture parameter) showed the best model fit, with the lowest BIC across all runs (a) as well as in the highest number of single runs (b).

Figure C. Average reward is correlated with the model fit (fval) as well as with higher reward sensitivity (beta) and the weighting of learned win probabilities (lambda). In contrast, learning rates, especially for punishments, are negatively associated with the average reward.

Table A. Reliability measures of the basic reinforcement learning model (Behrens et al., 2007).