Influence of learning strategy on response time during complex value-based learning and choice

Measurements of response time (RT) have long been used to infer neural processes underlying various cognitive functions such as working memory, attention, and decision making. However, it is currently unknown if RT is also informative about various stages of value-based choice, particularly how reward values are constructed. To investigate these questions, we analyzed the pattern of RT during a set of multi-dimensional learning and decision-making tasks that can prompt subjects to adopt different learning strategies. In our experiments, subjects could use reward feedback to directly learn reward values associated with possible choice options (object-based learning). Alternatively, they could learn reward values of options’ features (e.g. color, shape) and combine these values to estimate reward values for individual options (feature-based learning). We found that RT was slower when the difference between subjects’ estimates of reward probabilities for the two alternative objects on a given trial was smaller. Moreover, RT was overall faster when the preceding trial was rewarded or when the previously selected object was present. These effects, however, were mediated by an interaction between these factors such that subjects were faster when the previously selected object was present rather than absent but only after unrewarded trials. Finally, RT reflected the learning strategy (i.e. object-based or feature-based approach) adopted by the subject on a trial-by-trial basis, indicating an overall faster construction of reward value and/or value comparison during object-based learning. Altogether, these results demonstrate that the pattern of RT can be informative about how reward values are learned and constructed during complex value-based learning and decision making.

In naturalistic settings, humans must use information about an object, such as its features (i.e. color, taste, texture), to determine whether it is rewarding. This is because one cannot learn the reward values for a large number of options that exist in the real world. For example, the features of fruit in a school cafeteria line may allow students to choose between many types of fruit available. A green, squishy pear may be compared to a red, crispy apple, based on what one knows about these features. Individual features, however, are not always predictive of reward; whereas a green apple may be tasty, a green banana may not. Therefore, in order for learning to adapt to different situations, it must be adjusted to incorporate reward feedback differently depending on the nature of the environment. Currently, extensive literature addresses how humans exploit various heuristics to tackle real-world decision-making and learning tasks [19][20][21][22][23]. We have recently shown that humans use a heuristic strategy to tackle learning reward values in dynamic multi-dimensional environments [24]. More specifically, rather than directly estimating the reward value of individual options based on reward feedback (object-based learning), subjects estimate average reward values of features and then combine these values to estimate reward values for individual options (feature-based learning). Importantly, these learning strategies not only require different representations of reward values and updates based on reward feedback but also involve distinct methods for construction or comparison of reward values to make a decision.
We hypothesized that the way reward values are constructed and compared could influence how quickly a choice between two alternative options can be made. To examine this influence, we considered three stages involved in value-based decision making. First, objects and/or their features need to be detected and identified in order to evoke the corresponding representations of reward values. Second, in the case of feature-based learning, the individual values of relevant features should be combined to construct the final value of an object. Third, the final values of two alternative objects should be compared to make a decision [25][26][27][28].
Because the choice during a given trial depends primarily on the difference between the reward values of the two options on that trial [25,[29][30][31][32][33], decision making should be easier and faster when this difference is larger. This predicts that RT should decrease with the absolute value of the difference in reward values on a given trial. Moreover, if reward values of objects are directly estimated/learned (object-based learning), retrieval of reward values only requires identification of the two objects. In the case of feature-based learning, however, the subject has to construct reward values of objects after retrieving their feature values and combining them. This predicts longer RT when feature-based learning is adopted, indicating that RT could reflect the dominant learning strategy on a trial-by-trial basis when both strategies are present and compete with each other to determine choice.
To test the aforementioned predictions, here we analyzed choice and RT data during a set of multi-dimensional learning and decision-making tasks that can promote different types of learning strategies.

Overview of experiments and approach
Human subjects performed a set of multi-dimensional learning and decision-making tasks. During each trial, the subject selected between pairs of objects (colored shapes or patterned shapes) that would yield reward with different probabilities (Fig 1). These probabilities could stay the same (static environment) or unpredictably change over time (dynamic or volatile environment). Moreover, the relationship between reward probabilities of objects and their features created generalizable or non-generalizable environments. In a generalizable environment, the reward probabilities assigned to different objects could be estimated from their features based on a multiplicative rule. In contrast, in a non-generalizable environment, reward probabilities assigned to different objects could not be accurately estimated using their features. By fitting choice behavior with seven different models, we identified which of the two classes of learning strategies (feature-based or object-based) was adopted by individual subjects during each experiment. We then examined prediction of the best object-based and featurebased models on a trial-by-trial basis in order to study how RT depended on the adopted learning strategy.

Subjects
Subjects were recruited from the student population at Dartmouth College. The exclusion criterion was similar for all experiments and was defined as performance below chance level (0.5). In total, 59 subjects were recruited (34 females) to perform one or both of Experiments 1 and 2 (33 subjects performed both experiments) in a pseudo-randomized order and on separate days. The chance criterion for Experiments 1 and 2 was equal to performance threshold of 0.5406 (equal to 0.5 plus 2 times s.e.m., based on the average of 608 trials after excluding the first 10 trials of each block). As a result, data from a few subjects (8 out of 51 subjects in Experiment 1 and 19 out of 41 in Experiment 2) whose performance fell below this threshold were excluded from the analysis. An additional subject was excluded from Experiment 2 for submitting the same response throughout the experiment. For Experiment 3, 36 additional subjects were recruited (20 females). The exclusion criterion for this experiment was equal to performance threshold of 0.5447 (equal to 0.5 plus 2 times s.e.m., based on the average of 500 trials after excluding the first 30 trials of each session), resulting in exclusion of 9 subjects. For Experiment 4, 36 more subjects were recruited (22 females). A similar exclusion criterion in this experiment resulted in removal of 11 subjects whose performance fell below 0.5404 (equal to 0.5 plus 2 times s.e.m., based on the average of 612 trials after excluding the first 30 trials of each session). No subject had a history of neurological or psychiatric illness. Subjects were compensated with a combination of money and "t-points," which are extra-credit points for classes within the Dartmouth College Psychological and Brain Sciences department. After a base rate of $10/hour or 1 t-point/hour, subjects were additionally rewarded based on their performance (e.g., number of reward points), by up to $10/hour.

Ethics statement
All experimental procedures were approved by the Dartmouth College Institutional Review Board, and informed, written consent was obtained from all subjects before participating in the experiment.

Experiments 1 and 2
In these experiments, subjects completed two sessions (each composed of 768 trials and lasting approximately one hour) of a choice task during which they selected between a pair of objects on each trial (Fig 1A). Subjects were asked to choose between objects that differed in color and shape (red triangle, blue triangle, red square, blue square) and to select the object that was more likely to provide a reward in order to maximize the total number of reward points.
During each trial, the selection of an object was rewarded independently of the reward probability of the other object. This reward schedule was fixed for a block of trials (block length, L = 48), after which it changed to another reward schedule without any signal to the subject. Sixteen different reward schedules were used; eight schedules consisted of generalizable rules for how combinations of features (color or shape) predicted objects' reward probabilities, and the other eight lacked generalizable rules for how combinations of features predicted objects' reward probabilities [24]. Therefore, the main difference between Experiments 1 and 2 was that the environments in these experiments were composed of reward schedules with generalizable and non-generalizable rules, respectively. Critically, as the subjects moved between blocks of trials, reward probabilities were reversed in order to induce volatility in reward information throughout the task. On each block of the generalizable environment (Experiment 1), feature-based rules could be used to correctly predict reward throughout the task. In the non-generalizable environment (Experiment 2), however, the reward probability assigned to each object could not be determined based on the objects' features (e.g. in a multiplicative fashion). More details about the reward schedules in Experiments 1 and 2 are provided elsewhere [24].

Experiment 3
In this experiment, subjects completed two sessions, each of which included 280 choice trials interleaved with five or eight short blocks of estimation trials (each estimation block with eight trials). Subjects were asked to choose the more rewarding of two objects during each trial of the choice task. These objects were drawn from a set of eight objects constructed using combinations of three distinct patterns and three distinct shapes ( Fig 1C). The two objects presented during each trial always differed in both pattern and shape. Other aspects of the choice task were similar to those in Experiments 1 and 2, except that reward feedback was given for both objects rather than just the chosen object in order to accelerate learning. Subjects then provided estimates of each object's reward during estimation trials. Possible values for these estimates ranged from 5% to 95% in 10% increments. All subjects completed five blocks of estimation trials throughout the task (after trials 42, 84, 140, 210, and 280 of the choice task), and some subjects had three additional blocks of estimation trials (after trials 21, 63, and 252) to better assess changes in estimations over time. Each session of the experiment was approximately 45 minutes in length, with a break between sessions. The second session was similar to the first, but with different sets of shapes and patterns.
Selection of a given object was rewarded (independently of the other object on a given trial) based on a reward schedule with a moderate level of generalizability such that only some objects' reward probabilities could be determined based on their feature values. In Experiment 3, one feature (shape or pattern) was informative about reward probability, while the other feature was not. Although the informative feature was predictive of reward on average, this prediction was not generalizable to all individual objects. Additionally, the non-informative feature did not follow any generalizable rule. That is, the average reward probability across objects with a similar feature value was equal to 0.5 (e.g. shapes S1 to S3 in Fig 1C). This reward schedule ensured that subjects could not use a generalizable feature-based rule to accurately predict reward probability for all objects. Similarly to Experiments 1 and 2, the informative feature was randomly assigned and counter-balanced across subjects to minimize the effects of intrinsic pattern or shape biases. More details about the reward schedules are provided elsewhere [24].

Experiment 4
This experiment was similar to Experiment 3, except that each feature could take on one of four values for each feature dimension (i.e. there were four shapes and four patterns), resulting in an environment with higher dimensionality. Each subject completed two sessions of 336 choice trials interleaved with five or eight short blocks of estimation trials (each block with eight trials). The objects in this experiment were drawn from a set of twelve objects, which were combinations of four distinct patterns and four distinct shapes. The probabilities of reward for different objects (the reward matrix) were set such that there was one informative feature and one non-informative feature, just as in Experiment 3 ( Fig 1C). Minimum and maximum average reward values for features were similar for Experiments 3 and 4.

Model fitting
To capture subjects' learning and choice behavior, we used seven different reinforcement learning (RL) models based on object-based or feature-based learning, which rely on different assumptions about how reward values are represented and updated (see below). These models were fit to experimental data by minimizing the negative log likelihood of the predicted choice probability given different model parameters using the 'fminsearch' function in MATLAB (MathWorks, Inc., Natick, MA). Three measures of goodness-of-fit were used to determine the best model to account for the behavior in each experiment: average negative log likelihood (-LL), Akaike information criterion (AIC), and Bayesian information criterion (BIC).

Object-based RL models
This group of models, based on standard RL, directly estimated reward value of each object based on feedback from previous trials. The uncoupled object-based RL model updated only the reward value of the chosen object using separate learning rates for rewarded and unrewarded trials as follows: where t represents the trial number, V choO is the estimated reward value of the chosen object, rðtÞ is the trial outcome (1 for rewarded, 0 for unrewarded), and a rew and a unr are the learning rates for rewarded and unrewarded trials, respectively, and d is the Dirac delta function The coupled object-based RL model, on the other hand, updated reward values of both the chosen and unchosen object on each trial in opposite directions. The value of the chosen object was updated with Eq 1, whereas that of the unchosen object was updated as if the reward values were anti-correlated: where V uncO represents the estimated reward value of the unchosen object. Given the estimated value functions for each object, the probabilities of choosing one of the objects (O1 and O2) were determined using the following equation: where P O1 is the probability of choosing object 1, V O1 and V O2 are the reward values of the objects presented to the left and right, respectively, bias measures a response bias toward the left option in order to capture the subject's location bias, and s is a parameter measuring the level of stochasticity in the decision process.

Feature-based RL models
The feature-based RL models utilized the combined reward values of an object's features to estimate the reward value of an object. Feature reward values were modeled with standard RL, updated similarly to those for the object-based RL model, but where the reward value of the chosen (unchosen) object was instead the chosen (unchosen) feature. For Experiments 1 and 2 in which the alternative objects could have a feature in common (e.g. both were squares), this model resulted in the poorest fit when the reward values of both features of the chosen object were updated. As a result, in the models presented here, we only updated the reward value of the unique feature when there was a shared feature.
Similarly to the object-based RL models, the probability of choosing a shape was modeled with a logistic function: where V shapeO1 ðV colorO1 Þ and V shapeO2 ðV colorO2 Þ are the reward values associated with the shape (color) of left and right objects, respectively, bias measures a response bias toward the left option, and w shape and w color determine the influence of the two features on the final choice.

RL models with decay
We also modeled the 'forgetting' of reward values, when a given object or feature was not presented or chosen on a given trial, by decaying corresponding reward values in the uncoupled models. More specifically, we assumed that reward values of all objects or features that were not presented or chosen decayed to 0.5 (chance or equal information about reward or no reward) with a rate of d: where t is the trial number and V is estimated reward probability of an object or a feature. Forgetting of reward values was implemented through decaying toward 0.5 because forgetting implies that information about an object or feature should go to zero (i.e. maximal entropy), which corresponds to V = 0.5.

Analysis of response time
For most analyses, response time (RT) was normalized (z-scored) by first subtracting the mean RT of a given subject from their RT values and then dividing the outcome by the standard deviation of RT for that subject. This normalization allowed us to directly compare data from all subjects. In addition to RT, we also z-scored all independent variables in order to obtain the standardized regression coefficients directly. Standardized regression coefficients quantify the relative influence of predictors measured on different scales. Unlike z-scoring, a constant term in the generalized linear model (GLM) cannot address different ranges of RT for different subjects (variability relative to mean RT), which could result in subjects with more variable RT influencing the results more strongly. Because RTs were z-scored for individual subjects, we did not include an intercept in the GLM. We removed trials with RT less than 50 msec and more than 5 sec since they correspond to choices with no deliberation and when the subject was distracted, respectively. We used several GLMs with the following regressors (all z-scored) to predict the normalized RT on each trial: 1) the difference between the actual or estimated (using the model that provides the best fit of choice data) reward probabilities of the two objects presented on a given trial; 2) the trial number within a block; 3) reward outcome on the preceding trial; and 4) a 'model-adoption' index comparing the prediction of the best feature-based and objectbased models on a given trial. The model-adoption index was defined as the difference between the goodness-of-fit by the best feature-based and object-based models using BIC per trial (BIC p ) defined as: where k indicates the number of parameters in a given model. The best object-based and feature-based models were determined based on the BIC for each subject in a given experiment. We note that even though we used BIC for these analyses, similar results were obtained with AIC. Thus the model-adoption index determines the degree to which feature-based or object-based models predicted choice behavior in a given trial more accurately for an individual subject (positive values correspond to a better prediction by the object-based model).
To further investigate the relationship between RT and the adopted model on a given trial, we grouped trials into object-based, feature-based, and equivocal trials depending on which model provided a better fit (based on BIC p ) and whether the absolute value of model-adoption index was larger than a threshold. More specifically, equivocal trials were defined as trials in which the absolute value of the model-adoption index was less than 0.1. This resulted in assigning approximately 39, 39, and 22 percent of trials as object-based, feature-based, and equivocal, respectively (over all experiments). The percentages of trials assigned to these three types in a given experiment were compatible with the overall learning strategy used in that experiment (S1 Table).
We also calculated the relationship between RT and the adopted model in BIC-matched trials. More specifically, for each subject we first picked object-based and feature-based trials for which BICs were between the largest minimum and smallest maximum BIC of object-based and feature-based trials. We then sorted these trials based on their BIC values and computed the mean BIC for each set of trials, and then one by one removed the largest BIC trials from the group with the larger mean BIC, and the smallest BIC trials from the group with the smaller mean BIC. This procedure was repeated until the two means of the BIC values from the two sets of trials were as close to each other as possible. We then calculated the average normalized RT for these two sets of trials.
To test whether there was any effect of previous participation on RT for subjects who performed both Experiments 1 and 2, we also added 'previous participation' as a regressor (equal to 0 and 1 if the current experiment was the first and second experiment, respectively) to our GLMs for fitting RT during these experiments.

Results
To examine how RT depends on the construction and comparison of reward values, we used various reinforcement learning models to determine whether object-based or feature-based learning was adopted by individual subjects in a given experiment. For each model, we computed Bayesian information criterion (BIC) as the measure of goodness-of-fit in order to determine the best overall model to account for the behavior in each experiment (Table 1). Other measures of goodness of fit such as -log likelihood (-LL), and Akaike information criterion (AIC) also revealed similar results as shown in S1 This analysis showed that in Experiment 1 (dynamic generalizable environment), the feature-based models provided better fits than the object-based models, indicating that subjects adopted feature-based learning more often in this experiment. In contrast, subjects more frequently adopted object-based learning in Experiment 2 (dynamic non-generalizable environment). In Experiment 3 (static and non-generalizable environment), subjects more frequently adopted object-based learning. An examination of fits and the pattern of estimations over time revealed that in this experiment, subjects adopted feature-based learning first before slowly switching to object-based learning [24]. In contrast, in Experiment 4 (static non-generalizable environment), which involved a larger number of objects and slightly larger generalizability than Experiment 3, subjects more frequently adopted feature-based learning and did not switch to object-based learning.
Altogether, the analysis of subjects' choice behavior demonstrates that subjects adopted different learning strategies depending on the structure of reward in each environment. As mentioned above, each of these learning strategies requires a different method for the construction and/or comparison of reward value. Object-based learning relies on directly representing and The average RTs across all subjects (using median RT of all trials for each subject) were equal to 0.78±0.04s, 0.85±0.05s, 1.04±0.05s, and 1.01±0.03s for Experiments 1 to 4, respectively. A longer RT for Experiments 3 and 4 could be attributed to a larger number of objects in those experiments (8 and 12, respectively, in Experiments 3 and 4, rather than 4 in Experiments 1 and 2). Experiments 1 and 2 were identical except for the way in which reward probabilities were assigned to the four objects, resulting in generalizable and non-generalizable environments. Because of the non-generalizable reward schedules in Experiment 2, this experiment was more difficult to perform than Experiment 1. This was reflected in lower average performance (harvested reward) (Wilcoxon rank-sum test, p < 0.001; the average performance for Experiments 1 and 2 were equal to 0.564±0.004 and 0.544±0.003, respectively) and a larger number of subjects whose performance did not differ from chance in Experiment 2 relative to Experiment 1 (Pearson's chi-square test, P = .0014; 8 of 51 in Experiment 1, and 19 of 41 in Experiment 2).
Despite the fact that Experiment 2 was more difficult, the average RT across all subjects in Experiment 2 was not statistically different from that in Experiment 1 (Wilcoxon rank-sum test, P = 0.2). This observation can be explained by noting that subjects adopted object-based learning in Experiment 2 more often than in Experiment 1 (74% vs. 38% in Experiments 2 and 1, respectively), and the average RT for subjects who adopted object-based learning was smaller than that for subjects who adopted feature-based learning in both experiments (twosided Wilcoxon rank-sum test, Experiment 1: P = 0.01, Experiment 2: P = 0.04). More specifically, the average median RTs for subjects who adopted object-based and feature-based learning in Experiment 1 were equal to 0.71±0.06s and 0.85±0.05s, respectively. The average median RTs for subjects who adopted object-based and feature-based learning in Experiment 2 were equal to 0.79±0.07s and 0.88±0.07s, respectively. These results provide preliminary evidence supporting our hypothesis that construction and comparison of reward value is faster when the object-based strategy is adopted.
Given these results, we next utilized generalized linear models (GLMs) to examine the influence of several factors on the pattern of RT in each experiment. These included the absolute value of the difference in the actual or estimated probability of reward on the two objects presented on a given trial (absolute difference in reward probability), the trial number within a given experiment, reward outcome on the preceding trial, and the model-adoption index (equal to BIC p (Ft)-BIC p (Obj); see Methods). We used normalized RT (z-scored for each subject) in order to analyze data from all subjects together.
The results of the GLM fits revealed several important aspects of RT in our experiments (Table 2). First, subjects became faster over time in all experiments by an average of 300-400ms at the end relative to the beginning of the experiments (Fig 2). Moreover, in Experiments 1 and 2, in which there was a reversal in reward probabilities after every block of 48 trials (dynamic environments), RT increased at the beginning of each block and then decreased as subjects learned the reward probabilities associated with the four objects in each block (Fig  3). This resulted in average of 20ms and 40ms decrease in RT over the course of each block in Experiments 1 and 2, respectively.
Second, we found that the absolute difference in the actual or estimated reward probabilities negatively affected RT such that RT was maximal when this difference was close to zero ( Table 2; Figs 4 and 5). Importantly, the absolute difference in estimated reward probabilities (based on the fit of choice behavior of each subject) was a better predictor than the absolute difference in the actual reward probabilities (compare R 2 based on the GLMs in panels A and B of Table 2). The absolute difference in the estimated reward probabilities resulted in about 120ms change in RT (Fig 5). Third, the reward outcome on the previous trial also led to shorter RT after rewarded trials compared to after unrewarded trials. Overall, RT on trials following rewarded trials was shorter than RT following unrewarded trials by 33±12ms, 50±24ms, 50 ±11ms, and 16±13ms in Experiments 1 to 4, respectively, indicating that subjects spent more time making a choice following unrewarded trials.
Previously, it has been shown that the modulation of RT by reward on the preceding trial depends on whether or not the rewarding feature of the target stayed the same between the two trials [34]. Therefore, we also examined whether the effect of reward on the following trial depends on the presence of the previously selected object. To do so, we fit RT data with a new Table 2. Factors influencing RT. We used two sets of GLMs to predict the normalized RT as a function of the absolute difference in the actual (A) or estimated (B) probability of reward on the two objects presented on a given trial (absolute difference in actual/subjective reward probability), the trial number within a block of the experiment, the difference between BIC per trial (BIC p ) based on the best feature-based and object-based models (i.e., model-adoption index) for a given subject, and the reward outcome on the preceding trial. Reported values are the normalized regression coefficients (±s.e.m.), p-values for each coefficient (two-sided t-test), and adjusted R-squared for each experiment. No interaction term was statistically significant and thus, interactions terms are not reported here.

A Regressor
Abs. difference in actual reward prob.  GLM with additional terms: 1) whether the object selected on the previous trial was present or not (previously selected present), and 2) the interactions between reward outcome on the previous trial and previously selected object being present/absent. The results of this GLM not only revealed that presentation of an object selected on the previous trial reduced RT but also  Construction of reward value and response time showed a strong interaction between this effect and reward outcome on the previous trial (Table 3). Therefore, we next computed the average RT in four sets of trials depending on whether the preceding trial was rewarded or not and whether the option selected on the previous trial was present or absent. Interestingly, we found that subjects were faster following rewarded than unrewarded trials when the object presented on the preceding trial was not present, and this was true for all experiments (Fig 6). In contrast, when the previously selected object was present, subjects were faster following unrewarded trials (see S2 Table for detailed statistics). Overall, there was a larger proportion of trials in which the previously selected object was absent rather than present, resulting in an overall faster RT for rewarded than unrewarded trials. However, subjects were faster when the previously selected object was present rather than absent, but only after unrewarded trials. Therefore, we also found differential effects of previous reward on RT depending on the presence or absence of the object selected on the preceding trial, similarly to previous experiments.
Finally, we found that the model-adoption index (equal to BIC p (Ft)-BIC p (Obj), measuring the degree to which the object-based model provides a better fit than the feature-based model on a given trial) had a negative effect on RT (Table 2). This illustrates that subjects were faster  Table 3. Predicting RT with a GLM that also includes whether the object selected on the preceding trial was present or absent. We used a GLM to predict the normalized RT as a function of the absolute difference in the estimated (subjective) probability of reward of the two objects presented on a given trial, the trial number within a block of the experiment, the difference between BIC per trial (BIC p ) based on the best feature-based and object-based models for a given subject, the reward outcome on the previous trial, presence of an object that was selected on the previous trial (prev. selected present), and the interaction of the last two predictors. Reported values are the normalized regression coefficients (±s.e.m.), p-values for each coefficient (two-sided t-test), and adjusted R-squared for each experiment.

Regressor
Abs. difference in subjective reward prob. Construction of reward value and response time on trials in which the object-based model provided a better fit (i.e., when they adopted objectbased learning). To further investigate this relationship, we computed RT for object-based, feature-based, and equivocal trials (trials where the adopted model was ambiguous; see Methods). This analysis illustrated that subjects were faster by about 40ms when they adopted an objectbased compared to a feature-based model (Fig 7A-7B; Table 3). Moreover, RT was fastest on equivocal trials. At the same time, however, the minimum BIC based on the two best models on each trial, a measure of the best prediction by either model, was the largest for equivocal trials ( Fig 7C; Table 4). These indicate that subjects were slightly faster on equivocal trials, perhaps because on those trials they made decisions more randomly as suggested by the reduced quality of fit. To address the concern that faster RT on object-based trials could also be due to more random choice behavior on those trials, we calculated the average normalized RT on feature-based and object-based trials after matching the BIC values (see Methods). We found that even on the BIC-matched trials (i.e. trials with similar quality of fit) subjects were still faster when they adopted the object-based learning compared to when they adopted feature-based learning (one-sided ranksum test, p = 0.03, using data from all four experiments; Fig 7D). Therefore, even though the quality of fit was in general worse for object-based trials, faster RT on these trials was not due to more random behavior on those trials. Overall, these results illustrate that RT is influenced by the learning strategy adopted by subjects on a trial-by-trial basis. Because subjects also became faster over time, a shorter RT on trials when the object-based strategy was adopted could be exaggerated by a gradual shift from using feature-based to object-based strategy and shorter RT as each experiment progressed (especially in Experiment 3 in which this shift was prominent). To test this possibility, we also fit RT using a GLM model that included all interaction terms between the four predictors in Table 2. None of these  Construction of reward value and response time interactions and especially the interaction between trial number and model-adoption index (measured by the difference in BIC p values) were statistically significant (p > .05). We also calculated the average RT over the course of the experiment across all subjects separately for trials in which object-based or feature-based strategy was adopted (Fig 2). We found a consistently shorter RT on object-based trials during the entire course of each experiment. Taken together, these results suggest that the overall faster RT on object-based trials was not caused by a combination of a shift to object-based learning and faster decision making as subjects progressed through each experiment.
Because a large number of subjects performed in both Experiments 1 and 2, we also tested the effect of previous participation on RT in a given experiment (see Methods). This analysis did not reveal any effects of previous participation on RT. We also used a GLM to predict the average value of the difference in the goodness-of-fit (<BIC p (Ft)-BIC p (Obj)>) based on the current experiment (1 or 2), previous participation, and their interactions (capturing the order of experiments). The result of this GLM did not reveal any effect of previous participation or the order of experiments on the learning strategy.
Finally, to ensure that exclusion criterion did not bias our results in a certain way, we analyzed RT of excluded subjects using the same models and methods used for included subjects. We found higher BIC values for excluded compared to included subjects, suggesting that on average, choice behavior was more difficult to fit for removed subjects (S3 Table). More importantly, we found that the overall best model based on object-based or feature-based strategy (models with decay) did not provide a significantly better fit than the counterpart model for excluded subjects. This indicates that the adopted model could not be detected well in excluded subjects. Nevertheless, we also analyzed RT data from excluded subjects in all four experiments using the same GLM used for included subjects (S4 Table). This analysis revealed that the dependence of RT on the four regressors was similar for included and excluded subjects during Experiments 2 and 4 (experiments with largest percentages of excluded subjects), although the effect of the difference in BIC p was not significant for Experiment 4. The signs of all other regressors with a significant effect on RT were similar for excluded and included subjects. Altogether, these results demonstrate that excluded subjects likely did not engage in the experiment or use any certain strategy. More importantly, our exclusion criterion did not lead to removal of meaningful information nor did it bias our conclusion.

Discussion
We investigated the pattern of RT during a set of a multi-dimensional learning and decisionmaking tasks in order to provide insights into neural mechanisms underlying the construction and comparison of reward values. Specifically, we tested whether the type of learning strategies adopted by the subjects and the underlying reward value representation were reflected in the pattern of RT during binary choice. We found a decrease in RT over the course of experiments and after rewarded trials. Moreover, RT was strongly modulated by the difference in the value of reward probabilities assigned to the two alternative objects on a given trial such that it was maximal when the absolute value of the difference in the actual or estimated reward probabilities was minimal. As expected, this effect was stronger for estimated than for actual reward probabilities. More importantly, the model adopted by the subject on a trial-by-trial basis influenced RT such that subjects were faster when they made a choice using the object-based rather than the feature-based strategy. These results confirm our hypotheses that at least two stages of value-based decision making, construction of reward value for each option and value comparison between pairs of options, are influenced by the learning strategy adopted by the subject. A crucial aspect of our experiments, which allowed us to study the influence of value construction on RT, is the rich multi-dimensional task in which multiple learning strategies coexist and interact to determine choice. The subjects in our experiments could perform the task as a search (i.e. foraging) for more rewarding objects or features on each trial (especially during Experiments 3 and 4 with a large set of objects). Therefore, our results could be relevant to visual search and object identification based on reward values [35]. We found that RT for selection between two options increased as the subjective values of the two options became more similar. If we consider selection between the two objects on a given trial as a search for a more rewarding object (target), our observation dovetails with the finding in visual search that more visual similarity between the target and distractor increases RT [5]. Interestingly, there are a number of studies in which the visual search has been manipulated by differential reward assignment on targets or distractors [34,[36][37][38].
Overall, those studies have demonstrated that task-irrelevant distractors previously associated with larger rewards can slow down target detection [36,37], and this reduction in performance has been attributed to attentional capture by the rewarding feature. Interestingly, in an experiment in which different reward magnitudes (large and small) were randomly delivered on correct trials, the magnitude of reward differentially modulated RT on the following trial depending on whether the rewarding feature of the target stayed the same or not between the two trials [34]. If the feature stayed the same, subjects were faster after high-reward trials relative to low-reward trials. In contrast, if the feature switched, subjects were slower after high-reward trials relative to low-reward trials.
Our results also could be linked to the distance effect in numerical cognition where the reaction time for discriminating two numbers decreases as the difference between the two numbers increases [39,40]. Interestingly, the distance effect seems to be influenced more strongly by response-related processes than overlap between the representations of numbers [41]. The response time in our experiments, however, is influenced by both the difference between the estimated values of competing objects on each trial (value comparison) and how their values are constructed (representation of value).
The feature-based learning provides a heuristic for learning and choice in dynamic, multidimensional environments. Importantly, heuristic feature-based learning not only reduces dimensionality but also increases adaptability without compromising precision [24]. Here, we showed that subjects were faster when they made a choice using the object-based rather than the feature-based model. Based on these results, we predict that there will be a larger difference in RT when feature-based versus object-based strategy is adopted for more high-dimensional options. However, RT using the feature-based strategy could be dramatically reduced for highdimensional options if the comparison between the two options is limited to dissimilar features only, allowing the feature-based learning to be a "fast and frugal" heuristic [20]. The dependence of RT on the subject's adopted model of the environment also implies that the adopted model should be estimated and considered when interpreting the top-down effects during conventional or foraging visual search tasks [35] (see [42] for a review of top-down effects).
Finally, with regard to how value comparison contributes to RT during decision making, we found that the difference in the probability of reward assigned to the two targets presented on each trial predicted RT, similarly to our previous observation [43]. There is growing interest in economics to use RT in order to infer a decision maker's preference beyond what can be revealed only by choice [18,44,45]. For example, based on the observation that RT increases as the subjective values of options become closer to each other in a binary choice, Konovalov and Krajbich [46] have suggested that RT could be used to detect the indifference point. In another work, Clithero [44] has suggested that RT can be used along with choice to improve out-ofsample predictions. Our results suggest that even the adopted model could be predicted based on RT on a given trial. It would be interesting to test in the future whether RT can be used to improve the fit of choice behavior. However, as we showed here, RT monotonically decreases over time even during continuous learning, indicating that precautions must be taken if RT is used as a measure of the indifference point in such situations.
Altogether, by examining behavior during a set of complex, multi-dimensional learning and decision-making tasks, we show that the pattern of RT can be used to investigate different stages of value-based leaning and decision making, and more specifically, how reward value is constructed and compared on a single trial. , and Bayesian information criterion (BIC), respectively. The goodness-of-fit values are computed by averaging over all subjects (mean ± s.e.m.) separately for three featurebased RLs and their object-based counterparts and for Experiments 1 to 4. The significance level of the comparison between each model that provides the best fit in a given experiment and its object-based or feature-based counterpart is coded as: 0.01 < P < 0.05 ( Ã ), 0.001 < P < 0.01 ( ÃÃ ), and P < 0.001 ( ÃÃÃ ). (TIFF) S1 Table. Percentage of subjects who adopted feature-based learning and percentages of trials grouped as feature-based, equivocal, and object-based in each experiment. Reported values are mean±std for each experiment. During Experiments 1 and 4, more subjects adopted feature-based learning whereas during Experiments 2 and 3 more subjects adopted objectbased learning. The differences between the percentages of trials identified as feature-based and object-based were compatible with the overall adopted learning strategy in each experiment. (DOCX) S2 Table. Comparisons of RT during different types of trials depending on the presence or absence of previously selected object and rewarded or unrewarded outcome. Reported are the p-values (two-sided signed-rank test) for comparisons of the average RT between a given pair of trials (across subjects) depicted in Fig 6. (DOCX) S3 Table. The average values of the Bayesian information criterion (BIC) based on three feature-based RLs and their object-based counterparts, separately for excluded subjects in each of four experiments. Reported are the average values of BIC over all subjects (mean±s.e. m.) and p-values for comparisons of BIC values between each model and its object-based or feature-based counterparts (two-sided Wilcoxon signed-rank test). The overall best model (feature-based or object-based with decay) and its object-based or feature-based counterpart are highlighted in cyan and brown, respectively. (DOCX) S4 Table. Factors influencing RT for excluded subjects. We used a GLM to predict the normalized RT as a function of the estimated probability of reward of the two objects presented on a given trial (absolute difference in subjective reward probability), the trial number within a block of the experiment, the difference between BIC per trial (BIC p ) based on the best featurebased and object-based models (i.e., model-adoption index) for a given subject, and the reward outcome on the preceding trial. Reported values are the normalized regression coefficients (±s. e.m.), p-values for each coefficient (two-sided t-test), and adjusted R-squared for each experiment. No interaction term was statistically significant and thus, interactions terms are not reported here. (DOCX)