Modeling flexible behavior in childhood to adulthood shows age-dependent learning mechanisms and less optimal learning in autism in each age group

Flexible behavior is critical for everyday decision-making and has been implicated in restricted, repetitive behaviors (RRB) in autism spectrum disorder (ASD). However, how flexible behavior changes developmentally in ASD remains largely unknown. Here, we used a developmental approach and examined flexible behavior on a probabilistic reversal learning task in 572 children, adolescents, and adults (ASD N = 321; typical development [TD] N = 251). Using computational modeling, we quantified latent variables that index mechanisms underlying perseveration and feedback sensitivity. We then assessed these variables in relation to diagnosis, developmental stage, core autism symptomatology, and associated psychiatric symptoms. Autistic individuals showed on average more perseveration and less feedback sensitivity than TD individuals, and, across cases and controls, older age groups showed more feedback sensitivity than younger age groups. Computational modeling revealed that dominant learning mechanisms underpinning flexible behavior differed across developmental stages and reduced flexible behavior in ASD was driven by less optimal learning on average within each age group. In autistic children, perseverative errors were positively related to anxiety symptoms, and in autistic adults, perseveration (indexed by both task errors and model parameter estimates) was positively related to RRB. These findings provide novel insights into reduced flexible behavior in relation to clinical symptoms in ASD.


Introduction
Flexible behavior is a fundamental part of everyday life. It requires learning from feedback to guide decisions and adapting responses when feedback changes. These cognitive processes are implicated in a range of neurodevelopmental and neuropsychiatric conditions, including autism spectrum disorder (ASD; [1]), as well as attention-deficit hyperactivity disorder (ADHD) and anxiety, both of which frequently co-occur in ASD [2][3][4][5]. In particular, reduced flexible behavior is suggested to underpin core features of restricted, repetitive behaviors (RRB) in ASD, such as insistence on sameness. However, current evidence is inconclusive, and the mechanisms by which these impairments arise remain unclear [6,7]. Studies of neurotypical individuals show that the cognitive processes underlying flexible behavior and reinforcement learning change through childhood and adolescence into adulthood [8,9]. Therefore, a developmental approach within ASD that characterizes component learning processes is likely to bring us closer to understanding mechanisms of (in)flexible behavior and identifying therapeutic targets.
Probabilistic reversal learning (PRL) paradigms require individuals to find a balance between learning structure in an uncertain environment while remaining flexible to change [10]. Typically, participants must learn, using feedback, which of a set of stimuli is most rewarded and adapt their responses when the rule changes, in order to maximize favorable outcomes. PRL paradigms therefore provide a direct assessment of flexible choice behavior (in addition to tapping reinforcement learning), as they require information to be integrated over a number of trials in order to detect true changes and, much like interacting with our environment, this trial-and-error learning is continually updated throughout the task. Furthermore, PRL paradigms do not require tracking of extradimensional shifts, thereby constraining the recruitment of additional cognitive domains [11,12].
Previous literature has reported reduced reversal learning in ASD relative to controls and a positive relationship between reversal errors and RRB [1,13]. In contrast, others have reported poorer overall task performance but unspecific to reversal adaptation [14,15], or no differences in reversal learning nor any associations with ASD symptomatology [16,17]. It is worth noting that these inconsistencies in ASD-related changes in cognitive flexibility are also reflected in the broader literature using alternative paradigms (see [7,18] for reviews).
With respect to reinforcement learning, studies of reward processing suggest atypical or diminished neural responses to rewards in ASD [19][20][21][22], though results from adolescent studies are less consistent [23][24][25]. If reinforcement is differentially experienced in ASD, it is likely to impact on decision-making processes and behavior. In addition to establishing differences, associations between learning and phenotypic correlates warrant further study in order to elucidate whether such differences necessarily manifest in impairments related to symptom severity.
Several factors may have contributed to inconsistencies in the literature. First, previous studies have often studied single age groups or a broad age range within a small sample size. Evidence from both cognitive and neuroimaging studies attests to important developmental differences in reinforcement learning and flexible behavior in neurotypical individuals [26][27][28]. Young children often perseverate, taking longer than older children to learn new rules and switch their responses [8]. During adolescence, notable changes in goal-directed decision-making occur, often manifesting in risky decisions thought to be attributable to hypersensitivity to rewards [29][30][31]. In adulthood, there is evidence for the use of more sophisticated, "controlled" cognitive strategies [32,33]. Hence, a developmental approach in ASD is needed to ascertain whether potential impairments reflect delayed development or atypical cognitive processes.
Second, previous studies have also tended to use task performance measures that often aggregate error scores and do not directly characterize learning processes governing behavior. Computational models capture the dynamics of learning over time (emulating a participant's experience) and delineate component processes underlying PRL by approximating mechanisms that may have led to task behavior. Estimating and comparing different reinforcement learning models allows for the evaluation of competing mechanisms by quantifying how likely each model is to have generated the observed behavior. Moreover, by approximating putative mechanisms, computational models enable better mapping between behavior and neurobiology, particularly important for understanding neurodevelopmental disorders [34].
Studies of ASD using modeling have shown evidence of slower, faster, and equal rates of learning compared to neurotypical individuals. Optimal learning rates depend on the stability of the task environment. A changeable environment requires fast learning guided by recent feedback, whereas a stable environment requires slower learning over time (e.g., [35,36]). Crucially, probabilistic feedback also requires learning to ignore "misleading" punishment. Previously, autistic adults were shown to have a slower learning rate than neurotypical adults when using higher-probability reward contingencies, but they performed comparably or outperformed neurotypical adults when the contingency was near chance [21,22]. Perhaps, then, a key difficulty lies in learning regularities and ignoring irregularities, in addition to learning change per se [37]. This is consistent with previous findings of a tendency to "overlearn" volatility in ASD adults, resulting in reduced learning of probabilistic errors [38]. Whether these findings extend to children and adolescents (see [39] for differing findings) and which underlying processes are different in ASD remain to be seen.
Here, we examined learning processes underlying flexible behavior in ASD and typical development (TD) across developmental stages using a PRL paradigm. Our secondary aim was to investigate possible relationships with symptomatology in ASD. To achieve this, we (1) tested a large sample of individuals with a wide age range that was sufficiently powered to compare children, adolescents, and adults and (2) used reinforcement learning models to compare quantitative mechanistic explanations of flexible behavior and identify the latent processes on which individuals may differ. We included measures of RRB subtypes as our focus, social-communication difficulties for comparison, and associated symptoms of ADHD and anxiety as frequently co-occurring features that may also relate to atypical learning and flexible behavior. Based on previous literature, we hypothesized that younger age groups would perform less well on the task than older age groups and that autistic individuals would perform less well than neurotypical individuals. Additionally, we hypothesized differences in dominant underlying cognitive processes across development. Finally, we predicted that reduced flexible behavior would be related to higher RRB symptom severity, in particular behavioral rigidity/insistence on sameness.

Ethics statement
The study was approved by the independent local ethics committees of the participating centers (

Participants
This study was part of the EU-AIMS Longitudinal European Autism Project (LEAP; [40,41]), a multidisciplinary, multicenter study of children (6-11 years), adolescents (12-17 years), and adults (18-30 years) with and without ASD from six European sites. The current study included data from 321 individuals with an existing clinical diagnosis of ASD and 251 typically developing (TD) individuals, with full-scale IQ scores ranging from 74 to 148. Descriptive statistics for the sample are listed in Table 1. Full-scale IQ was measured using the Wechsler scales (see [41]). Although ASD individuals were additionally assessed using the Autism Diagnostic Observation Schedule [42,43] and Autism Diagnostic Interview-Revised (ADI-R, [44]), reaching instrument cutoffs was not an inclusion criterion, as clinical judgment has been found to consistently improve diagnostic stability [45]. However, task behavioral analyses were repeated in a subset of individuals who met ADI-R criteria as specified by [46] (S1 Table). Although the full EU-AIMS LEAP sample includes individuals with mild intellectual disabilities (N = 83), initial analyses showed evidence of poor task learning in this group, and thus they were omitted from further analyses. Those with only partial data (N = 3) or who chose the same stimulus throughout the task (N = 1) were excluded from analysis (see S1 Text for further sample information).

Experimental paradigm
Participants completed a computerized PRL task whereby they were instructed to choose one of two colored shapes (vertical yellow bars or horizontal blue bars) presented in two of four possible locations with an 80:20 reward/punishment contingency (Fig 1A). Positive feedback consisted of green, smiling emoticons and negative feedback of red, frowning emoticons (i.e., reward/punishment) and accompanying sounds (bell chime/buzzer, respectively). The task employed a pseudorandom fixed sequence comprising 80 trials with a reversal midway. Participants' first stimulus choice was considered correct in the acquisition phase; after the reversal, the initially incorrect stimulus became the usually rewarded stimulus and vice versa (Fig 1B and 1C). To reduce task demand and avoid potential floor effects in the younger age groups or clinical sample, the contingency ratio was higher than in some previous studies (70:30; [10,47]). Participants used arrow keys to respond and had unlimited response time per trial (see S1 Text for task instructions). This paradigm has previously been used in neurotypical individuals and other clinical groups [47,48] and was specified by the European Medicines Agency in their letter of support for EU-AIMS LEAP [49].

Analysis of task behavior
Behavioral performance on the task was assessed using accuracy during acquisition and reversal phases, perseverative errors, and win/lose feedback sensitivity. Accuracy was quantified as the proportion of correct responses. Perseverative errors were defined as two or more consecutive errors during the reversal phase (i.e., trials in which the participant chose the previously rewarded stimulus, despite negative feedback) and are reported as a proportion of reversal phase trials. Win-stay and lose-shift behaviors index the effect of an outcome on the subsequent choice. They are defined, respectively, as repeating the previous choice following reward (as a proportion of total rewarded trials) and changing the response following punishment (as a proportion of total punished trials). As in previous studies using this task [10,47,48,50,51], reaction time is not examined here because it is unlikely to capture task-relevant processes, since no response speed instructions are given nor is there a time limit for responding (see S1
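The measures defined above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' analysis code: the function names and input format are hypothetical, the final trial is skipped for win-stay/lose-shift because it has no subsequent choice, and every error inside a run of two or more consecutive errors is counted as perseverative (one reading of the definition).

```python
def win_stay_lose_shift(choices, rewarded):
    """Return (win-stay, lose-shift) proportions for one participant.

    choices:  sequence of chosen stimulus labels, one per trial.
    rewarded: sequence of booleans, True if that trial was rewarded.
    Assumption: the last trial is excluded because no choice follows it.
    """
    wins = losses = stays = shifts = 0
    for t in range(len(choices) - 1):
        repeated = choices[t + 1] == choices[t]
        if rewarded[t]:
            wins += 1
            stays += repeated          # stayed after a reward
        else:
            losses += 1
            shifts += not repeated     # switched after a punishment
    win_stay = stays / wins if wins else float("nan")
    lose_shift = shifts / losses if losses else float("nan")
    return win_stay, lose_shift


def perseverative_error_rate(reversal_errors):
    """Proportion of reversal-phase trials that fall in error runs of length >= 2.

    reversal_errors: sequence of booleans for the reversal phase only,
    True where the previously rewarded stimulus was chosen.
    """
    count = run = 0
    for is_error in reversal_errors:
        if is_error:
            run += 1
        else:
            if run >= 2:
                count += run
            run = 0
    if run >= 2:                       # close a run that reaches the last trial
        count += run
    return count / len(reversal_errors)
```

Isolated single errors do not contribute to the perseveration score under this reading, which separates sustained responding to the old rule from occasional exploratory switches.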

Reinforcement learning models
PLOS BIOLOGY

We compared three reinforcement learning models to examine different computational mechanisms driving information integration and the cognitive processes underlying learning and flexible adaptation. Each model extends the Rescorla-Wagner value update rule [52] but in different ways in terms of how information is integrated. The Rescorla-Wagner update rule assumes that individuals assign and update internal stimulus value signals based on the prediction error, i.e., the mismatch between outcome (received reward/punishment following choice of this stimulus) and prediction (expected value of choosing this stimulus). Below, we omit results from the original Rescorla-Wagner model, as all other models consistently outperformed it (see S1 Text and S2 Table).
(1) Counterfactual update model. Previous studies suggest individuals may use counterfactual updating in reversal learning tasks, as it captures the anti-correlatedness of the choice stimuli (i.e., where one is correct, the other is incorrect; [53,54]). The counterfactual update (CU) model extends the standard Rescorla-Wagner algorithm by updating the value of both choice stimuli.
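One common way to write such a counterfactual update is sketched below. The exact form is an assumption rather than a quotation: it supposes outcomes are coded as +1 (reward) and -1 (punishment), that the counterfactual outcome for the unchosen stimulus is the sign-flipped outcome, and that values are indexed so that the update on trial t yields the value for trial t + 1.

```latex
\begin{aligned}
V_{c,t+1}  &= V_{c,t}  + \eta \,\bigl( O_t - V_{c,t} \bigr) \\
V_{nc,t+1} &= V_{nc,t} + \eta \,\bigl( -O_t - V_{nc,t} \bigr)
\end{aligned}
```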
Here, the values V of both the chosen (c) and unchosen (nc) stimuli are updated on each trial t with the actual prediction error and the counterfactual prediction error, respectively. O is the outcome received. The learning rate η governs the magnitude of the value update driven by both prediction errors; put simply, it is the speed of learning. In this framework, reduced flexible behavior may be underpinned by too-frequent response switches, quantified by excessive value updating after punishment.
Fig 1 (partial caption). (A) Feedback is received in the form of a smiling green face (positive) or a sad red face (negative) and is probabilistic, meaning that some is "misleading" (e.g., trial 3). Win-stay trials are those in which individuals repeat their stimulus choice following positive feedback (e.g., trials 2 and 3), and lose-shift trials are those in which individuals change their stimulus choice following negative feedback (e.g., trials 4 and 5). (B) The structure of the task: the first stimulus chosen by each participant is correct in the acquisition phase (trials 1-40; here: yellow). Feedback was given with an 80:20 reward/punishment ratio; green blocks indicate reward and red blocks indicate punishment. In the reversal phase (trials 41-80), the true correct stimulus is reversed (here: blue) as is the contingency schedule. (C) Overall trial-by-trial behavior: all participants' data, sorted by performance, with average performance overlaid (black line) regardless of diagnosis or age group. Compare to (B) to see how task structure is experienced in practice (see S1 Data). https://doi.org/10.1371/journal.pbio.3000908.g001
(2) Reward-punishment model. Alternatively, reduced flexible task behavior may result from reduced punishment learning. Reduced punishment learning would have a disproportionate effect during the reversal phase because punishments following choices of the previously rewarded stimulus would have a diminished influence on choice behavior due to a failure to devalue this stimulus.
To assess whether this mechanism drives reduced flexible behavior, we use a different extension of the Rescorla-Wagner model, with separate learning rates for reward and punishment (reward-punishment model [R-P]; [47]). This allows for the capture of differential learning to feedback types.
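A sketch of an update with separate reward and punishment learning rates is given below. As with the previous model, the trial indexing and the coding of the outcome are assumptions made for illustration.

```latex
V_{c,t+1} = V_{c,t} +
\begin{cases}
\eta_{\mathrm{rew}} \,\bigl( O_t - V_{c,t} \bigr) & \text{if } O_t \text{ is a reward} \\
\eta_{\mathrm{pun}} \,\bigl( O_t - V_{c,t} \bigr) & \text{if } O_t \text{ is a punishment}
\end{cases}
```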
Here, η_rew is the learning rate for rewards and η_pun is the learning rate for punishment; O is the outcome received. In this model, only the chosen stimulus value is updated.
(3) Experience-weighted attraction-dynamic learning rate model (EWA-DL). Finally, reduced flexible behavior may result from a growing insensitivity to novel information. By this mechanism, a failure to update values based on new information (i.e., accumulating negative feedback denoting a true reversal) would cause perseveration of the previously rewarded response and a delayed or even complete failure to switch. We examined this mechanism using the experience-weight parameter from a reduced version of the EWA model as presented in previous work [47], where we used the formulation of a nonstationary learning rate through updating of an experience weight. This dynamic learning rate allows for interpolation between different forms of updating (from accumulating to averaging as ρ shifts from 0 to 1). Note that we do not use the original EWA model [55] exactly, as we omit the feature of blending belief-based versus reinforcement learning. To make this distinction clear, we have labeled this model EWA-DL (but note that it is identical to the model in [47]). The EWA-DL model extends classic reinforcement learning with an experience-weight parameter that captures the attribution of significance to past experience over and above new information as an individual progresses through the task. This effectively reduces the learning rate over time. Thus, in this context, perseveration would arise from a slowness, after reversal, to update the value of the now usually rewarded stimulus due to an overreliance on preceding task experience. The growth of the experience weight n and the update of the stimulus values V are defined as follows:
Here, n_{c,t} is the "experience weight" of the chosen stimulus on trial t, which is updated on every trial using the experience decay factor ρ. V_{c,t} is the value of choice c on trial t for outcome O received in response to that choice, and φ is the decay factor for the previous payoffs.
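The two updates described above are commonly written in the following form for reduced EWA models. This is a sketch under assumptions: the indexing convention (the outcome of trial t - 1 updating the value for trial t) is chosen for illustration and may differ from the authors' notation.

```latex
\begin{aligned}
n_{c,t} &= \rho \, n_{c,t-1} + 1 \\
V_{c,t} &= \frac{\varphi \, n_{c,t-1} \, V_{c,t-1} + O_{t-1}}{n_{c,t}}
\end{aligned}
```

Under this form, ρ = 0 keeps the experience weight fixed at 1, so the value is simply the decayed previous value plus the new outcome, whereas ρ close to 1 lets the experience weight grow across trials, so each new outcome is divided by a larger number and moves the value less.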
In this model, φ is equivalent to the inverse of the learning rate in Rescorla-Wagner models (or alternatively, η = 1 − φ; see also [47]). For ρ > 0, the experience weights promote more sluggish updating with time. Previous work has shown the EWA-DL to be the winning model in neurotypical adults in the same PRL task [47].
Softmax action selection. For all models, a softmax choice function was used to compute the action probability given the action values. On each trial t, the action probability of choosing option A (over B) was defined as follows: Here, β (0 < β < 5) is the inverse temperature parameter that governs the stochasticity of the choice, computed using inverse logit transfer. We set the upper bound to 5, as individual parameters are regularized by group-level parameters that prevent extreme parameter estimates (see parameter estimation section), and our data indeed showed that all β estimates are smaller than 5. We refer to β in this paper as value sensitivity, as it reflects sensitivity to the difference in stimulus values, that is, the degree to which a (perceived) difference in stimulus values determines choice (see S1 Text). Higher β values denote decisions driven by relative value, whereas lower β values denote more choice stochasticity. Additionally, a small indifference point parameter α (−0.5 < α < 0.5) is introduced, which captures any selection bias by shifting the value difference at which both options are equally likely to be selected. Including this indifference point parameter systematically improved performance of all models. The action probabilities of options A and B by definition sum to 1, i.e., P(A) + P(B) = 1.
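A minimal sketch of a two-option softmax with an indifference point follows. The logistic form, in which α shifts the value difference at which both options are equally likely, is an assumption made for illustration; the function name and argument names are hypothetical.

```python
import math


def prob_choose_a(value_a, value_b, beta, alpha=0.0):
    """Probability of choosing option A over B.

    beta:  inverse temperature ("value sensitivity"); the text bounds it to (0, 5).
    alpha: indifference point; the text bounds it to (-0.5, 0.5).
    Assumed form: logistic function of beta * ((V_A - V_B) - alpha).
    """
    return 1.0 / (1.0 + math.exp(beta * (alpha - (value_a - value_b))))


# With equal values and no bias, the two options are equally likely.
p_a = prob_choose_a(0.5, 0.5, beta=2.0)
p_b = 1.0 - p_a  # the two action probabilities sum to 1 by construction
```

In this form, a larger β makes the same value difference produce a more deterministic choice, and a nonzero α moves the point of indifference away from equal values.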

Parameter estimation and model selection/validation
Parameter estimation was performed with hierarchical Bayesian analysis (HBA) using the Stan language in R (RStan; [56,57]), adopted from the hBayesDM package [58]. Posterior inference was performed using Markov chain Monte Carlo (MCMC) sampling in RStan. The models were fit separately for each of six groups (diagnosis [ASD, TD] × developmental stage [children, adolescents, adults]) and compared within each group to assess how well they fit the data (goodness-of-fit) while accounting for model complexity. Comparison of model fit was assessed per group using Bayesian bootstrap and model averaging, whereby log-likelihoods for each model were evaluated at the posterior simulations and a weight obtained for each model. Model weights include a penalizing term for model complexity and a normalizing term according to the number of models being compared; thus, for each group, model weights sum to 1 [59]. Higher model weight indicates better model fit. We conducted model recovery analyses, and, for completeness, we also ran model fitting across age groups (see S1 Text). Finally, we established that the winning models could replicate the observed behavior using one-step-ahead prediction (e.g., [60]). Here, parameters are drawn from the joint posterior distribution and combined with the outcome sequence to predict future choices, thereby quantifying absolute model fit. That is, we let the model take random draws from each participant's joint posterior distribution to generate choices. We iterated this procedure as many times as the number of samples (i.e., 4,000) per trial per participant. We implemented two ways to assess posterior predictions. First, we computed the predictive accuracy using the number of correct predictions divided by the total number of iterations and tested if this accuracy was significantly better than chance level (i.e., 50%).
Second, we analyzed the generated data in the same way as we analyzed the observed data and compared whether results from generated data captured the behavioral pattern in our behavioral analysis (for further details on model specification and validation, see S1 Text).

Optimal learning parameters
We identified the optimal learning parameters for each model using simulation. Taking the CU model as an example, we first took the learning rate from a grid with 1,000 steps from 0 to 1 and then simulated choice data for every learning rate. We computed how often the simulated choice data matched the correct option (i.e., the more rewarding option). We repeated this simulation 10,000 times and identified the optimal learning rate as the value that resulted in the highest choice accuracy. We used the same procedure to determine the optimal learning parameter(s) for the R-P model and the EWA-DL.
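The grid-search procedure described above can be sketched as follows for the CU model. This is an illustrative sketch, not the authors' simulation code: the function names are hypothetical, the value sensitivity is fixed at an arbitrary β = 3, stimulus 0 is always treated as initially correct, outcomes are coded +1/-1, and the default grid and repetition counts are far smaller than the 1,000 steps and 10,000 repetitions reported in the text so that the example runs quickly.

```python
import math
import random


def simulate_cu_accuracy(eta, beta=3.0, n_trials=80, reversal_at=40,
                         p_reward=0.8, rng=None):
    """Simulate one run of a counterfactual-update agent; return choice accuracy."""
    rng = rng or random.Random()
    values = [0.0, 0.0]
    n_correct = 0
    for t in range(n_trials):
        correct = 0 if t < reversal_at else 1   # correct stimulus flips at reversal
        p_first = 1.0 / (1.0 + math.exp(-beta * (values[0] - values[1])))
        choice = 0 if rng.random() < p_first else 1
        n_correct += choice == correct
        # 80:20 probabilistic feedback, coded +1 (reward) / -1 (punishment)
        p_win = p_reward if choice == correct else 1.0 - p_reward
        outcome = 1.0 if rng.random() < p_win else -1.0
        # chosen value moves toward the outcome, unchosen toward its negative
        values[choice] += eta * (outcome - values[choice])
        values[1 - choice] += eta * (-outcome - values[1 - choice])
    return n_correct / n_trials


def find_optimal_eta(n_grid=21, n_sims=200, seed=0):
    """Grid-search the learning rate that maximizes mean simulated accuracy."""
    rng = random.Random(seed)
    best_eta, best_acc = None, -1.0
    for i in range(n_grid):
        eta = i / (n_grid - 1)
        acc = sum(simulate_cu_accuracy(eta, rng=rng) for _ in range(n_sims)) / n_sims
        if acc > best_acc:
            best_eta, best_acc = eta, acc
    return best_eta, best_acc
```

Because β and the simulation settings here are arbitrary, the learning rate this sketch returns should not be expected to reproduce the optimal values reported in the Results.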

Clinical measures
ASD symptomatology. Two measures were used to assess RRB symptom severity in ASD: (1) The ADI-R [44] is a structured parent/caregiver interview comprising 93 questions assessing most severe/early developmental ASD symptoms, which yields an algorithm score for RRB based on 12 items; (2) The Repetitive Behavior Scale-Revised (RBS-R; [61]) is a 43-item parent-report questionnaire tapping current RRB, which typically yields a total score and five subscales [62]. Here, we use the Ritualistic-Sameness and Stereotyped Behavior subscales as the best indices of behavioral rigidity (see S3 Table for a comparison of all subscales). To examine whether relationships were specific to RRB, ADI-R domain scores for Communication and Reciprocal Social Interaction were included, as were T-scores for the Social Communication Index on the Social Responsiveness Scale 2nd Edition (SRS-2; [63]), a parent-report questionnaire assessing current social-communication difficulties. On all measures, higher scores indicate greater symptom severity.
Comorbid symptomatology. The DSM-5 rating scale of ADHD [64] and the Beck Anxiety Inventory (BAI; [65]) were used to assess associated symptoms. For ADHD symptoms, parents of all ASD participants completed the parent-report form, and in addition, ASD adults completed the self-report form. For anxiety, adult participants completed the BAI in self-report form, whereas adolescents completed the self-report version of the anxiety subscale of the Beck Youth Inventories (BYI-II; [66]). Parents/caregivers of children completed the same BYI-II subscale in parent-report form.

Statistical analysis
All analyses were conducted in R [67]. First, we characterized the cohort with respect to sex, age, and IQ differences. Second, to examine the effects of diagnosis and age group on the task performance measures, we employed linear mixed-effects models using the lme4 package in R [68]. The models included diagnosis and age group (and for accuracy, phase) as between-participant factors (including their interaction[s]) and site as a random factor. Including sex in the models did not improve model fit. Post hoc pairwise comparisons were computed from contrasts between factors using the lsmeans package with Tukey adjustments [69]. Following the reinforcement learning model comparisons and validation using one-step-ahead predictions, we examined case-control differences on winning model parameters in each age group. Finally, we used correlational analyses to examine associations between task behavior, model parameters, and symptomatology. Symptomatology associations were conducted only in the ASD groups using Spearman's correlations owing to non-normality in scores. Significance thresholds for correlational analyses are Bonferroni-corrected for multiple comparisons: children/adolescents (.05/11), p = .0045; adults (.05/13), p = .0038. Effect sizes are reported as Cohen's d.
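The Bonferroni thresholds quoted above follow from dividing the nominal alpha by the number of tests. A small sketch of that arithmetic is shown here in Python for illustration (the analyses themselves were run in R); the helper name is hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests


# Thresholds for the correlational analyses described in the text.
children_adolescents = round(bonferroni_threshold(0.05, 11), 4)  # 11 tests
adults = round(bonferroni_threshold(0.05, 13), 4)                # 13 tests
```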

Sex, age, and IQ group differences
Diagnostic groups did not differ on sex or age, either overall or within each age group (all p > .1). However, all groups differed significantly on full-scale IQ, with TD groups scoring higher than ASD groups (p ranging .01-.005; d ranging 0.32-0.47). Therefore, for all further group comparisons, we assessed whether results changed with IQ as a confound regressor, and, in addition, we conducted analyses of task behavior in an IQ-matched subsample (S2 Text and S4 Table). Results were largely unchanged throughout (see S2 Text and S2 Fig).

Task behavior
Grouped trial-by-trial behavior is shown in Fig 2A and descriptive statistics in Table 1. All diagnostic and age groups performed above chance in both phases of the task, showing task comprehension (all p < 2.2 × 10^−16; see S3 Text, S3 Fig and S5 Table). A repeated-measures analysis of accuracy showed significant main effects of phase (F [1,566] Fig 2B).
Next, a significant main effect of diagnosis on perseverative errors was observed (F [1,565.42] = 11.07, p = .0009, d = 0.30; Fig 2C), such that ASD individuals made on average significantly more perseverative errors than TD individuals; however, there was no significant effect of age nor interaction between diagnosis and age group (p > .2). For both accuracy and perseverative
Fig 2 (partial caption). (A) Trial-by-trial data for each age group with diagnostic group averages overlaid. More evidence of task understanding in adults, as indicated by more correct task behavior and steeper shifts at reversal in comparison to children. (B) Task accuracy was greater (1) in the acquisition phase compared to the reversal phase, (2) in older age groups compared to younger, and (3) in TD individuals compared to ASD individuals. (C-E) Linear mixed-effects models showed a main effect of diagnosis for all three task performance measures (perseverative errors, win-staying, lose-shifting) and a main effect of age for win-staying (D) and lose-shifting (E) but not perseverative errors (C). For win-staying, a diagnosis × age group interaction was also found. Post hoc tests revealed ASD adolescents showed significantly reduced win-staying compared with TD adolescents (D), *** p < .001 (see S1 Data). ASD, autism spectrum disorder; TD, typical development.
Regarding feedback sensitivity, ASD individuals showed on average significantly less win-stay and more lose-shift behavior relative to TD individuals, and for both there was a main effect of age (win-stay: diagnosis [F (1, (2,390.88) = 19.50, p = 8.5 × 10^−9]). Pairwise post hoc comparisons revealed win-staying increased and lose-shifting decreased with age (Fig 2D and 2E). For win-stay behavior, the predicted interaction between diagnosis and age group was approaching significance (p = .057).
A between-diagnosis group analysis of each age group revealed ASD adolescents showed less win-staying than TD adolescents (p < .0008; Fig 2D, d = 0.54), which survived Bonferroni correction (correcting for task behavioral measures × age groups: p-value = .05/[3 × 3] = .0056). For lose-shift behavior, there was no significant interaction between diagnosis and age group (p = .3). Results were again consistent in the IQ-matched subsample and when IQ was entered as a confound regressor (S2 Text and S2 Fig).
The pattern of results reported here is also replicated in the additional analyses conducted with the subset of ASD individuals who meet ADI-R criteria (S2 Text and S2 Fig).

Model comparison and validation
Model weightings are shown in Fig 3A, and all winning models' parameters had independent contributions (S4 Fig). There were no between-diagnosis group differences in terms of model preference, only changes across development. Within both ASD and TD age groups, model weights showed that for children, the CU model provided the highest model evidence; for adolescents, the R-P model provided the highest model evidence; and for adults, the EWA-DL provided the highest model evidence. Results were unchanged when models were fitted with (z-scored) IQ as a covariate (see S6 Table). Model recovery results showed that all models' identities can be well recovered (S5 Fig). Collapsing age groups, the R-P model provided the highest model evidence in both diagnostic groups (S7 Table). One-step-ahead predictions of each group's winning model showed the models captured the key features of task behavior (e.g., the first response to negative feedback, the switch at reversal), with posterior predictive accuracy values of 0.61 and above. All models performed significantly better than chance level (p ≤ 1.23 × 10^−11). Average simulated behavior closely resembled participants' behavior (Fig 3B).

Within-model diagnostic group comparisons
We then investigated which computational mechanisms underpin poorer task performance in ASD for the different age groups. To this end, we compared diagnostic groups on parameter estimates from the winning model of each age group (Fig 3C). Simulations showed the optimal learning rate (i.e., leading to higher choice accuracy) for the CU model is 0.18 (Fig 3D, see also S1 Text), which is closer to the learning rate for TD children (M_TD = 0.19) than the learning rate for ASD children (M_ASD = 0.26). A higher learning rate in our learning schedule reflects oversensitivity to feedback (including probabilistic punishment, which should be ignored). There were no differences on the other model parameters (β, α; p > .1). Results were unchanged with IQ as a confound regressor.
Adolescents: R-P model. A repeated-measures feedback type × diagnosis linear mixed-effects model with learning rates as dependent variables showed a significant main effect of feedback type (F [1,202] = 33.04, p = 3.20 × 10^−8) and a significant interaction between feedback type and diagnosis (F [1,202] = 12.57, p = .0004), but no main effect of diagnosis (p = .1; Fig 3C). Reward learning rates were significantly larger than punishment learning rates (p < .0001, d = 0.43). Pairwise post hoc comparisons showed autistic adolescents' reward learning rate was significantly lower than TD adolescents' reward learning rate (p = .004, d = −0.39), but their punishment learning rates were not significantly different (p = .7). Additionally, TD adolescents' reward learning rate was significantly higher than both their punishment learning rate (p < .001, d = 0.74) and ASD adolescents' punishment learning rate (p < .001, d = 0.62).
Fig 3 (partial caption). (A) Evidence (model weights) for models within each diagnostic and age group. Very similar patterns are observed for TD and ASD groups; winning models for children, adolescents, and adults are the CU, R-P, and EWA-DL, respectively. (B) One-step-ahead posterior predictions for each age and diagnostic group according to winning models. Colored lines indicate diagnostic-group-averaged trial-by-trial task behavior; shaded areas indicate 95% HDI of the one-step-ahead simulation using the entire posterior distribution. Compare with actual task data in Fig 2A. Posterior predictive accuracies are also indicated on each plot (ASD: red; TD: blue). (C) Model parameter comparisons. Within each winning model and thus age group, parameter estimates were compared between diagnostic groups: (1) ASD children showed a significantly higher learning rate (η) than TD children, in which simulations showed the optimal learning rate to be 0.18; (2) ASD adolescents showed a significantly lower reward learning rate than TD adolescents, but no difference between punishment learning rates was observed; (3) ASD adults showed significantly lower φ than TD adults, the optimal value was shown to be 0.85 in simulations, and ASD adults also showed significantly greater experience decay (ρ) than TD adults, suggesting greater perseveration. (D) Learning rate simulations showing optimal learning rates for each model (Counterfactual update, compare to Fig 3C Children
In the context of the R-P model (with two learning rates), simulations showed the optimal reward and punishment learning rates for choice accuracy are 0.96 and 0.60, respectively (Fig 3D and S6 Fig). This optimal pattern of a reward learning rate higher than the related punishment learning rate is also shown in TD adolescents' learning rates, whereas autistic adolescents showed on average similar levels of reward and punishment learning and reduced learning from rewards compared to TD adolescents. In addition to reduced learning from rewards, autistic adolescents also showed significantly lower value sensitivity (β; t[169.27] = −7.24, p = 1.51 × 10^−11, d = −1.05, 95% CI −1.32 to −0.73), reflecting more stochastic choice behavior. These results suggest that reduced reward learning and lower value sensitivity drive worse task performance in ASD adolescents. Results were unchanged with IQ as a confound regressor.
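The asymmetry between reward and punishment learning can be illustrated with a short sketch. It assumes a simple form of a two-learning-rate rule: the chosen option's value moves toward the outcome (+1 or −1, an assumed coding) at rate η_rew after reward and η_pun after punishment. The function names and the outcome sequence are hypothetical. The pair (0.96, 0.60) is the simulated optimum reported above, and the equal-rate pair (0.60, 0.60) is a made-up comparison, not an estimate from this study.

```python
def rp_update(value, outcome, eta_rew, eta_pun):
    """Update one option's value with a feedback-specific learning rate."""
    eta = eta_rew if outcome > 0 else eta_pun
    return value + eta * (outcome - value)

def run(outcomes, eta_rew, eta_pun, start=0.0):
    """Apply rp_update over a sequence of outcomes for a single option."""
    value = start
    for outcome in outcomes:
        value = rp_update(value, outcome, eta_rew, eta_pun)
    return value

# Hypothetical sequence: two rewards, one misleading punishment, one reward.
outcomes = [+1, +1, -1, +1]
asymmetric = run(outcomes, eta_rew=0.96, eta_pun=0.60)  # simulated optimum
symmetric = run(outcomes, eta_rew=0.60, eta_pun=0.60)   # illustrative only
print(asymmetric, symmetric)
```

In this toy sequence the asymmetric learner recovers more of the correct option's value after the misleading punishment than the equal-rate learner does, which is one way to read the advantage of a reward learning rate above the punishment learning rate.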
For associations between task behavior and model parameters, see S4 Text and S8 Table.

Symptomatology correlations in ASD
All correlations with symptomatology are listed in S9 Table and S10 Table. Here, we discuss only those that remained significant after Bonferroni correction for multiple comparisons.
No correlations with learning rates (η, η_rew, η_pun, φ) or lose-shift behavior survived Bonferroni correction in any age group. Of note, neither task behavior nor model parameters were significantly associated with social-communication difficulties.
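As a reminder of what surviving Bonferroni correction means, the sketch below compares each p-value against α divided by the number of tests in the family. The significance level of 0.05, the function name, and the p-values are assumptions made for illustration. The actual families of tests are those listed in S9 Table and S10 Table.

```python
def bonferroni_survivors(p_values, alpha=0.05):
    """Flag each test as surviving if p < alpha / number of tests."""
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Made-up p-values for illustration only; not results from this study.
example_p = [0.001, 0.02, 0.04, 0.3]
print(bonferroni_survivors(example_p))  # threshold is 0.05 / 4 = 0.0125
```

With four tests, only the first made-up value falls below the corrected threshold, even though three of the four are below an uncorrected 0.05.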

Discussion
In this study, we examined flexible behavior on a PRL task and used reinforcement learning models to investigate underlying learning mechanisms in autistic and neurotypical children, adolescents, and adults. Overall, we found evidence that flexible behavior is, on average, reduced in autistic individuals, as indexed by poorer task performance across measures. Our results also show a developmental effect whereby older age groups outperformed younger age groups on the task. Using computational modeling of behavior, we showed that dominant learning mechanisms shift with developmental stage, but not diagnosis, and that poorer task performance in ASD is underpinned by atypical use of the age-related dominant learning mechanism in each age group. Furthermore, we found evidence for an association between perseveration and behavioral rigidity in ASD, but only in adults.
These findings emphasize the importance of a developmental framework when examining mechanistic accounts of both intact and reduced flexible behavior. Although the role of development is well documented in the neurotypical literature, particularly with respect to key brain regions for cognitive flexibility, goal-directed decision-making, and feedback learning [9,26,70], age-related differences in ASD have been relatively understudied. Examining learning mechanisms across development, we found dominant differential integration of reward and punishment feedback in both adolescent groups, corresponding with literature that suggests neurotypical adolescents are hyperresponsive to rewards [29,71]. In contrast, children's behavior was best captured by a single learning rate, and adults showed evidence of increasingly weighting their accumulating experience to inform subsequent decisions and slow down new learning. This dominant experience-weight mechanism in adults is consistent with previous neurotypical research [47]; however, our study is the first to report the same dominant mechanism in ASD adults. These results therefore suggest that cognitive and reinforcement-based processes are governed primarily by age, leading to the relative dominance of different learning mechanisms in different age groups. In this way, differential feedback learning may be developing in children and strengthened in adolescence, and experience weighting may similarly develop and then prevail in adulthood.
Previous research suggests that reversal learning (and, more broadly, cognitive flexibility) is impaired in ASD (e.g., [1,72]) and may be underpinned by the recruitment of different brain regions than in TD individuals [22]. Our findings provide support for the impairment hypothesis in that on average the ASD group was less accurate and more perseverative and showed reduced outcome sensitivity compared to the TD group. Furthermore, this pattern of results was consistent in both subsample analyses, showing robustness of findings in both an IQ-matched subsample and a subsample including only those ASD individuals who reach ADI-R criteria [46]. Notably, autistic adolescents showed reduced win-staying compared to TD adolescents, in line with previous studies that showed reduced win-staying in adults [21,22]. However, in this study, we did not find reduced win-staying specifically in autistic adults compared to TD adults.
Our computational modeling findings suggest that reduced flexible behavior in the ASD group is underpinned by significant differences in the efficient use of learning mechanisms within each age group on this task. Both the children and adult ASD groups showed faster learning rates compared to their TD counterparts. Here, faster learning rates are less optimal, as they result in reduced ability to ignore probabilistic feedback. These results are consistent with predictive coding and Bayesian accounts of ASD that suggest "overlearning" in response to feedback and difficulties ignoring noise, putatively due to precise or inflexible prediction errors [37,38]. Indeed, studies using volatile task environments or near-chance reward contingencies have reported intact learning and updating or superior performance in ASD [22,39]. In these contexts, fast learning rates are optimal, as changes are more frequent and therefore updating must be too.
Thus, findings demonstrate that altered learning rates in ASD have different effects on behavior depending on the learning environment and, in tandem, that computational models characterize differences rather than solely deficits, shedding light on environments in which differences may be expressed as strengths rather than difficulties. The computational differences in ASD appear to manifest as pronounced difficulties when the environment is less volatile, and learning when to ignore probabilistic feedback is as important as tracking change. These difficulties may underpin the marked difficulties with minor (probabilistic) deviations in routines or unexpected changes in ASD that caregivers so frequently report [73]. In different environments, faster learning may manifest in strengths; these differences have important implications for intervention development.
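The trade-off described above, in which a fast learning rate helps when contingencies change often but hurts when probabilistic feedback should be ignored, can be sketched analytically for a simplified learner. This is not the simulation reported in this study. It assumes a single value estimate updated as v ← (1 − η)v + ηr with independent outcomes r of +1 or −1, for which the stationary variance of v is ησ²/(2 − η) and the expected value after a full reversal changes sign once (1 − η)^t = 1/2. The function names and the reward probability of 0.8 are assumptions.

```python
import math

def steady_state_sd(eta, p_reward):
    """SD of the value estimate around its mean under stable contingencies."""
    outcome_var = 4.0 * p_reward * (1.0 - p_reward)  # variance of a +1/-1 outcome
    return math.sqrt(eta * outcome_var / (2.0 - eta))

def trials_to_cross_zero(eta):
    """Trials for the expected value to change sign after a full reversal."""
    return math.log(0.5) / math.log(1.0 - eta)

# Illustrative sweep; 0.8 is an assumed reward probability.
for eta in (0.1, 0.2, 0.4, 0.8):
    noise = steady_state_sd(eta, 0.8)
    lag = trials_to_cross_zero(eta)
    print(eta, round(noise, 3), round(lag, 2))
```

Under these assumptions, raising η shortens the lag after a reversal but increases the trial-to-trial noise in the value estimate, so which learning rate is best depends on how often the environment actually changes.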
In ASD adolescents, reduced flexible behavior, particularly reduced win-staying, was underpinned by reduced reward learning compared to TD adolescents. This finding is consistent with previous research showing reward circuitry dysfunction in autistic adolescents [74]. Whereas neurotypical adolescents are thought to demonstrate increased risk due to high reward sensitivity, reduced reward learning in autistic adolescents may result in reduced risk-taking and have a protective effect [75]. Reduced reward learning could also have implications for behavioral interventions. If autistic adolescents do not learn from typical rewards in the same way that TD adolescents do, the type(s) of rewards used in behavioral interventions would require adapting [76]. For example, there is evidence to suggest autistic individuals assign specific reward value to their circumscribed interests such that they may be of value in intervention design [77-79].
Reduced flexible behavior has previously been associated with RRB in ASD [1,80-82], though results are not consistent despite a strong theoretical link. Here, we observed robust, moderately strong associations between perseveration and RRB in autistic adults. We also found no evidence of associations with social-communication difficulties, providing support for the specificity to RRB. On the RBS-R, these associations were specific to the Ritualistic-Sameness and Stereotyped Behavior subscales, capturing behavioral rigidities. Previous literature has also reported associations between flexibility impairments and RRB symptom severity in ASD adults [83] with mixed findings in children and adolescents [82,84-86]. Moving forward, examining this association across developmental stages will continue to be important.
To our knowledge, this study is the first to elucidate a potential learning mechanism by which behavioral rigidity manifests in autistic adults: perseveration as a result of a reluctance or inability to switch ("getting stuck") because new information is devalued in favor of past experience, which in turn impedes updating choice behavior. Furthermore, as this mechanism has been associated with dopamine transporter differences in neurotypical adults [47], and abnormalities in the dopaminergic system have been implicated in ASD [87], this study highlights a potential mechanistic link between neurobiology and behavior worthy of further study.
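How accumulated experience can devalue new information is sketched below using one common published form of experience-weighted attraction for reversal learning. Whether it matches the EWA-DL model fitted here in every detail is an assumption. In this form, an experience weight n grows as n ← ρn + 1 and the value is updated as v ← (φ·n_prev·v + outcome)/n, so the newest outcome receives a weight of 1/n. The function names, the starting weight of 1, and the ρ values of 0.2 and 0.9 are hypothetical.

```python
def ewa_update(value, experience, outcome, phi, rho):
    """One experience-weighted attraction update (assumed form).

    experience -- experience weight n for the chosen option
    phi        -- decay applied to the previous value
    rho        -- decay applied to the previous experience weight
    """
    new_experience = rho * experience + 1.0
    new_value = (phi * experience * value + outcome) / new_experience
    return new_value, new_experience

def weight_on_new_outcome(rho, n_trials, start_experience=1.0):
    """Weight 1/n given to the newest outcome at update number n_trials."""
    experience = start_experience
    for _ in range(n_trials):
        experience = rho * experience + 1.0
    return 1.0 / experience

# Hypothetical rho values, chosen only to contrast low and high settings.
for rho in (0.2, 0.9):
    print(rho, round(weight_on_new_outcome(rho, 20), 3))
```

Under these assumptions, a larger ρ lets the experience weight build up, so each new outcome moves the value estimate less. This is the sense in which past experience is favored over new information.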
Beyond perseveration, RRB in autistic adults was positively associated with reduced value sensitivity (i.e., more stochastic choice behavior). This mechanism was also associated with more ADHD symptoms in autistic adults. Reduced value sensitivity has previously been identified as a key factor in poor task performance in anhedonia [88]. Together, these findings suggest that value sensitivity may have transdiagnostic value in explaining aspects of reduced flexible behavior. As altered decision-making is prevalent across many neurodevelopmental and neuropsychiatric disorders, examining underlying processes in relation to symptom dimensions rather than purely diagnostic categories will likely be of greater value for understanding implicated brain circuitries [89].
In autistic adolescents, we found no relationship between performance measures or learning mechanisms and clinical symptoms. In children with ASD, we observed a positive association between perseverative behavior and anxiety symptoms. Previous studies have demonstrated a relationship between anxiety and reduced flexible behavior in non-autistic adults [90,91] and children and adolescents with anxiety disorders [92]. One plausible link between perseveration and anxiety may be the intolerance of uncertainty (IU) construct, as uncertainty is inherent in probabilistic tasks. IU is a core construct in anxiety disorders [93] and a possible transdiagnostic mechanism [94] shown to be relevant for anxiety in ASD [95]. Associations between anxiety and RRB in ASD have frequently been reported [96,97]. Together, our findings broadly support the notion that reduced flexible behavior is of clinical relevance in ASD; however, the extent to which particular processes may be differentially linked to specific aspects of RRB versus commonly co-occurring features of anxiety or ADHD at different developmental stages will require further examination.

Limitations
This study has a number of limitations. Firstly, despite the large sample size and wide age range, the sample does not include children younger than 6 or adults above 30 years of age. Future research including very young children and older adults could allow for the assessment of any other age-related changes in dominant learning mechanisms. Secondly, it is important to note that each group's winning model is only relative to the other models tested here, although we note that the models capture behavior well and perform far above chance. However, it is (always) possible that other models may perform even better and further models may be developed in the future. A full model with all parameters combined was not possible because of convergence issues, emphasizing the relative dominance of learning mechanisms rather than any suggestions of mutual exclusivity. We highlight, nevertheless, that the study is the first to compare reinforcement learning models in ASD across age groups. Thirdly, our approach necessitated that we implicitly treated each diagnostic and age group as relatively homogeneous. The increasing recognition of the considerable phenotypic and etiological diversity of ASD indicates potential individual differences in learning processes within or across these a priori defined subgroups. Estimating the learning strategy for each individual would allow for a "bottom-up" approach to identifying potential subgroups based on learning strategies. Fourthly, our sample was limited to individuals with an ASD diagnosis and TD counterparts. Given that reduced flexible behavior and atypical reinforcement learning are implicated in many other areas of psychiatry, it would be informative to extend this study with a transdiagnostic sample, in the context of the research domain criteria framework (RDoC; [89]).
Additionally, given the growing literature suggesting differential reward processing in ASD, future work could assess potential differences in learning and flexible behavior in the context of different reward modalities, i.e., using different types of feedback, such as monetary stimuli. Finally, it will be crucial to verify our results through replication. The current sample has been reassessed as part of a longitudinal project, thereby providing some opportunity for this.

Conclusions
Current results suggest group-level impairments in flexible behavior across developmental stages in ASD. We show evidence of developmental shifts in dominant computational mechanisms underlying PRL that are consistent across ASD and TD individuals. Within each age group, differences in model parameter estimates showed less optimal learning in ASD, underpinning poorer task performance. Additionally, we show that perseverative behavior and, in adults, learning mechanisms were related to behavioral rigidities or co-occurring symptoms of anxiety or ADHD. Findings emphasize the importance of understanding reduced flexible behavior in ASD within a developmental framework and underline the strength of computational approaches in ASD research.

Supporting information
S1 Data. Excel spreadsheet containing, in separate sheets, the underlying numerical data for figures and figure panels: 1C, 2A-2E, 3C, 3D, 4A-4J, S1, S2A-S2L, S3A-S3B, S4, and S7.
(A) Trial-by-trial average proportion of correct responses (here, yellow in acquisition phase, blue in reversal phase) plotted separately for the groups that passed and failed the learning criterion. The red lines indicate the mean for that task phase (acquisition/reversal) and the orange lines indicate the 95% confidence intervals. Thus, both groups performed above chance in both task phases. (B) Diagnostic and age group average proportion of correct responses for each task phase, plotted separately for the pass/fail groups to confirm that performance above chance was maintained even within diagnostic and age subgroups.