Similarities and differences in spatial and non-spatial cognitive maps

Learning and generalization in spatial domains is often thought to rely on a "cognitive map", representing the relationships between spatial locations. Recent research suggests that this same neural machinery is also recruited for reasoning about more abstract, conceptual forms of knowledge. Yet to what extent do spatial and conceptual reasoning share common computational principles, and what are the implications for behavior? Using a within-subject design, we studied how participants used spatial or conceptual distances to generalize and search for correlated rewards in successive multi-armed bandit tasks. Participant behavior indicated sensitivity to both spatial and conceptual distance, and was best captured by a Bayesian model that formalized distance-dependent generalization and uncertainty-guided exploration as Gaussian process regression with a radial basis function kernel. The same Gaussian process model best captured human search decisions and judgments in both domains, and could simulate realistic learning curves, where we found equivalent levels of generalization in the spatial and conceptual tasks. At the same time, we also found characteristic differences between domains. Relative to the spatial domain, participants showed reduced levels of uncertainty-directed exploration and increased levels of random exploration in the conceptual domain. Participants also displayed a one-directional transfer effect, where experience in the spatial task boosted performance in the conceptual task, but not vice versa. While confidence judgments indicated that participants were sensitive to the uncertainty of their knowledge in both tasks, they did not or could not leverage their estimates of uncertainty to guide exploration in the conceptual task. These results support the notion that value-guided learning and generalization recruit cognitive-map-dependent computational mechanisms in spatial and conceptual domains. Yet both behavioral and model-based analyses suggest domain-specific differences in how these representations map onto actions.

Thinking spatially is intuitive. We remember things in terms of places [1-3], describe the world using spatial metaphors [4,5], and commonly use concepts like "space" or "distance" in mathematical descriptions of abstract phenomena. In line with these observations, previous theories have argued that reasoning about abstract conceptual information follows the same computational principles as spatial reasoning [6-8]. This has recently gained new support from neuroscientific evidence suggesting that common neural substrates are the basis for knowledge representation across domains [9-13].

One important implication of these accounts is that reinforcement learning [14] in non-spatial domains may rely on a map-like organization of information, supported by the computation of distances or similarities between experiences. These representations of distance facilitate generalization, allowing for predictions about novel stimuli based on their similarity to previous experiences. Here, we ask to what extent the search for rewards depends on the same distance-dependent generalization across two different domains, one defined by spatial location and another by the abstract features of a Gabor patch, despite potential differences in how the stimuli and their similarities may be processed.

We formalize a computational model that incorporates distance-dependent generalization and test it in a within-subject experiment, where either spatial features or abstract conceptual features are predictive of rewards. This allows us to study the extent to which the same organizational structure of cognitive representations is used in both domains, based on examining the downstream behavioral implications for learning, decision making, and exploration.
Whereas early psychological theories described reinforcement learning as merely developing an association between stimuli, responses, and rewards [15-17], more recent studies have recognized that the structure of representations plays an important role in making value-based decisions [11,18] and is particularly important for knowing how to generalize from limited data to novel situations [19,20]. This idea dates back to Tolman, who famously argued that both rats and humans extract a "cognitive map" of their environment [21]. Such a map encodes relationships between options, such as the distances between locations in space [22], and, crucially, facilitates flexible planning and generalization. While cognitive maps were first identified as representations of physical spaces, Tolman hypothesized that similar principles may underlie the organization of knowledge in broader and more complex cognitive domains [21].

As was the case with Tolman, neuroscientific evidence for a cognitive map was initially found in the spatial domain, in particular with the discovery of spatially selective place cells in the hippocampus [23,24] and entorhinal grid cells that fire along a spatial hexagonal lattice [25]. Together with a variety of other specialized cell types that encode spatial orientation [26,27], boundaries [28,29], and distances to objects [30], this hippocampal-entorhinal machinery is often considered to provide a cognitive map facilitating navigation and self-location. Yet more recent evidence has shown that the same neural mechanisms are also active when reasoning about more abstract, conceptual relationships [31-36], characterized by arbitrary feature dimensions [37] or temporal relationships [38,39]. For example, using a technique developed to detect spatial hexagonal grid-like codes in fMRI signals [40], Constantinescu et al. found that human participants displayed a pattern of activity in the entorhinal cortex consistent with mental travel through a 2D coordinate system defined by the length of a bird's legs and neck [9]. Similarly, the same entorhinal-hippocampal system has also been found to reflect the graph structure underlying sequences of stimuli [10] or the structure of social networks [41], and even to replay non-spatial representations in the sequential order that characterized a previous decision-making task [42]. At the same time, much evidence indicates that cognitive map-related representations are not limited to medial temporal areas, but also include ventral and orbital medial prefrontal areas [9,11,40,43-45]. Relatedly, a study by Kahnt and Tobler [46] using uni-dimensional variations of Gabor stimuli showed that the generalization of rewards was modulated by dopaminergic activity in the hippocampus, indicating a role of non-spatial distance representations in reinforcement learning.

Based on these findings, we asked whether learning and searching for rewards in spatial and conceptual domains is governed by similar computational principles. Using a within-subject design comparing spatial and non-spatial reward learning, we tested whether participants used perceptual similarities in the same way as spatial distances to generalize from previous experiences and inform the exploration of novel options. To ensure commensurate stimulus discriminability between domains, participants completed a training phase where they were required to reach the same level of proficiency in discriminating the stimuli in both tasks. Among our results, we found an asymmetric task order effect, where performing the spatial task first boosted performance on the conceptual task but not vice versa. These findings provide a clearer picture of both the commonalities and differences in how people reason about and represent both spatial and abstract phenomena in complex reinforcement learning tasks.

129 participants searched for rewards in two successive multi-armed bandit tasks (Fig 1). The spatial task was represented as an 8 × 8 grid, where participants used the arrow keys to move a highlighted square to one of the 64 locations, with each location representing one option (i.e., one arm of the bandit). The conceptual task was represented using Gabor patches, where a single patch was displayed on the screen and the arrow keys changed the tilt and stripe frequency (each having 8 discrete values; see Fig S1), providing a non-spatial domain where similarities are relatively well defined. Each of the 64 options in both tasks produced normally distributed rewards, where the means of each option were correlated, such that similar locations or Gabor patches with similar stripes and tilts yielded similar rewards (Fig S2), thus providing traction for similarity-guided generalization and search. The strength of reward correlations was manipulated between subjects, with one half assigned to smooth environments (with higher reward correlations) and the other assigned to rough environments (with lower reward correlations). Importantly, both classes of environments had the same expectation of rewards across options.

The spatial and conceptual tasks were performed in counterbalanced order, with each task consisting of an initial training phase (see Methods) and then 10 rounds of bandits. Each round had a different reward distribution (drawn without replacement from the assigned class of environments), and participants were given 20 choices to acquire as many points as possible (later converted to monetary rewards). The search horizon was much smaller than the total number of options, and therefore induced an explore-exploit dilemma and motivated the need for generalization and efficient exploration.
The last round of each task was a "bonus round", where after 15 choices, participants were shown 10 unobserved options (selected at random) and asked to make judgments about the expected reward and their level of confidence (i.e., uncertainty about the expected rewards). These judgments were used to validate the internal belief representations of our models. All data and code, including interactive notebooks containing all analyses in the paper, are publicly available at https://github.com/charleywu/cognitivemaps.

Computational Models of Learning, Generalization, and Search

Multi-armed bandit problems [52,53] are a prominent framework for studying learning, where various reinforcement learning (RL) models [14] are used to model the learning of reward valuations and to predict behavior. A common element of most RL models is some form of prediction-error learning [54,55], where model predictions are updated based on the difference between the predicted and experienced outcome. One classic example of learning from prediction errors is the Rescorla-Wagner [55] model, in which the expected reward V(·) of each bandit is described as a linear combination of weights $\mathbf{w}_t$ and a one-hot stimulus vector $\mathbf{x}_t$ representing the current state $s_t$:

$$V(s_t) = \mathbf{w}_t^\top \mathbf{x}_t \quad \text{(Eq 1)}$$

Fig 1. Experiment design. a) In the spatial task, options were defined as a highlighted square in an 8 × 8 grid, where the arrow keys were used to move the highlighted location. b) In the conceptual task, each option was represented as a Gabor patch, where the arrow keys changed the tilt and the number of stripes (Fig S1). Both tasks corresponded to correlated reward distributions, where choices in similar locations or with similar Gabor features predicted similar rewards (Fig S2). c) The same design was used in both tasks. Participants first completed a training phase where they were asked to match a series of target stimuli. This used the same inputs and stimuli as the main task, where the arrow keys modified either the spatial or conceptual features, and the spacebar was used to make a selection. After reaching the learning criterion of at least 32 training trials and a run of 9 out of 10 correct, participants were shown instructions for the main task and asked to complete a comprehension check. The main task was 10 rounds long, where participants were given 20 selections in each round to maximize their cumulative reward (shown in panels a and b). The 10th round was a "bonus round", where after 15 selections participants were asked to make 10 judgments about the expected reward and associated uncertainty for unobserved stimuli from that round. After judgments were made, participants selected one of the options, observed the reward, and continued the round as usual.
Learning occurs by updating the weights $\mathbf{w}$ as a function of the prediction error:

$$\mathbf{w}_{t+1} = \mathbf{w}_t + \eta \left[ r_t - V(\mathbf{x}_t) \right] \mathbf{x}_t \quad \text{(Eq 2)}$$

where $r_t$ is the observed reward, $V(\mathbf{x}_t)$ is the reward expectation, and $0 < \eta \le 1$ is the learning rate parameter. In our task, we use a Bayesian variant of this model (the Bayesian Mean Tracker; BMT) as a benchmark that learns the value of each option independently, without generalization (see Methods). We compare it against Gaussian process (GP) regression, a Bayesian model of generalization capable of learning a wide range of smooth functions [64] and explaining biases in how people extrapolate from limited data [59].
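As a concrete illustration, the Rescorla-Wagner prediction-error update can be sketched in a few lines of Python. This is a minimal sketch, not the paper's code; the function name, option index, and learning rate are illustrative choices.

```python
import numpy as np

def rescorla_wagner_update(w, x, r, eta=0.3):
    """One Rescorla-Wagner step: move the weights toward the observed reward.

    w   : weight vector (one entry per option)
    x   : one-hot state vector for the chosen option
    r   : observed reward
    eta : learning rate, 0 < eta <= 1 (illustrative value)
    """
    V = w @ x                      # predicted reward V(x)
    delta = r - V                  # prediction error
    return w + eta * delta * x     # only the chosen option's weight changes

w = np.zeros(64)                   # 8 x 8 grid -> 64 options, all values unknown
x = np.eye(64)[10]                 # one-hot vector for choosing option 10
w = rescorla_wagner_update(w, x, r=50.0)
```

Because the state vector is one-hot, each update touches only the chosen option, which is exactly why this model cannot generalize across similar options.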

Formally, a GP defines a multivariate normal distribution $P(f)$ over possible value functions $f(s)$ that map inputs $s$ to outputs $y = f(s)$:

$$f \sim \mathcal{GP}\left(m(s), k(s, s')\right) \quad \text{(Eq 3)}$$

The GP is fully defined by the mean function $m(s)$, which is frequently set to 0 for convenience without loss of generality [47], and the kernel function $k(s, s')$ encoding prior assumptions (or inductive biases) about the underlying function. Here we use the radial basis function (RBF) kernel:

$$k(s, s') = \exp\left( -\frac{\lVert s - s' \rVert^2}{\lambda} \right) \quad \text{(Eq 4)}$$

encoding similarity as a smoothly decaying function of the squared Euclidean distance between stimuli $s$ and $s'$, measured either in spatial or conceptual distance. The length-scale parameter $\lambda$ encodes the rate of decay, where larger values correspond to broader generalization over larger distances.
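To make the kernel concrete, here is a minimal Python sketch. The division of the squared distance by λ follows the length-scale convention described above; note this is an assumption on our part, since other texts parameterize the RBF kernel with 2λ² in the denominator.

```python
import numpy as np

def rbf_kernel(s1, s2, lam=2.0):
    """RBF kernel: similarity decays with squared Euclidean distance.

    lam is the length-scale; larger values mean broader generalization.
    """
    d2 = np.sum((np.asarray(s1, float) - np.asarray(s2, float)) ** 2)
    return np.exp(-d2 / lam)

rbf_kernel((0, 0), (0, 0))            # identical stimuli -> similarity 1.0
rbf_kernel((0, 0), (0, 2), lam=2.0)   # exp(-4/2) ~ 0.135
rbf_kernel((0, 0), (0, 2), lam=4.0)   # broader generalization: exp(-4/4) ~ 0.368
```

The last two calls show the role of λ: the same pair of stimuli is treated as more similar under a larger length-scale.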

Given a set of observations $\mathcal{D}_t = [\mathbf{s}_t, \mathbf{y}_t]$ of previously observed states and associated rewards, the GP makes normally distributed posterior predictions for any novel stimulus $s'$, defined in terms of a posterior mean and variance:

$$m_t(s') = K(s', \mathbf{s}_t)\left[ K(\mathbf{s}_t, \mathbf{s}_t) + \sigma^2 I \right]^{-1} \mathbf{y}_t \quad \text{(Eq 5)}$$

$$v_t(s') = k(s', s') - K(s', \mathbf{s}_t)\left[ K(\mathbf{s}_t, \mathbf{s}_t) + \sigma^2 I \right]^{-1} K(\mathbf{s}_t, s') \quad \text{(Eq 6)}$$

The posterior mean corresponds to the expected value of $s'$, while the posterior variance captures the underlying uncertainty in the prediction. Note that the posterior mean can also be rewritten as a similarity-weighted sum:

$$m_t(s') = \sum_{i=1}^{t} w_i \, k(s_i, s') \quad \text{(Eq 7)}$$

where each $s_i$ is a previously observed input in $\mathbf{s}_t$ and the weights are collected in the vector $\mathbf{w} = \left[ K(\mathbf{s}_t, \mathbf{s}_t) + \sigma^2 I \right]^{-1} \mathbf{y}_t$, with noise variance $\sigma^2$. Intuitively, this means that GP regression is equivalent to a linearly weighted sum, but one that uses basis functions $k(\cdot, \cdot)$ to project the inputs into a feature space, instead of the discrete state vectors. To generate new predictions, every observed reward $y_i$ in $\mathbf{y}_t$ is weighted by the similarity of the associated state $s_i$ to the candidate state $s'$ based on the kernel. This similarity-weighted sum (Eq 7) is equivalent to an RBF network [65], which has featured prominently in machine learning approaches to value function approximation [14] and as a theory of the neural architecture of human generalization [66] in vision and motor control.
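The posterior computation can be sketched directly from these definitions. This is a hypothetical minimal implementation, not the paper's code; `noise` plays the role of the noise variance σ², and the kernel parameterization matches the sketch above.

```python
import numpy as np

def gp_posterior(S_obs, y_obs, S_new, lam=2.0, noise=1.0):
    """GP posterior mean and variance at novel stimuli S_new,
    given observed stimuli S_obs with rewards y_obs (RBF kernel)."""
    def K(A, B):
        # pairwise squared Euclidean distances -> RBF similarities
        d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / lam)

    S_obs = np.atleast_2d(S_obs).astype(float)
    S_new = np.atleast_2d(S_new).astype(float)
    y_obs = np.asarray(y_obs, float)

    K_oo = K(S_obs, S_obs) + noise * np.eye(len(S_obs))
    K_no = K(S_new, S_obs)
    w = np.linalg.solve(K_oo, y_obs)            # similarity weights (Eq 7)
    mean = K_no @ w                             # posterior mean
    var = 1.0 - np.sum(K_no * np.linalg.solve(K_oo, K_no.T).T, axis=1)
    return mean, var

# One observation at (0,0) with reward 10; predict there and far away.
m, v = gp_posterior([(0, 0)], [10.0], [(0, 0), (6, 6)])
```

At the observed location the prediction is pulled toward the observed reward and the variance shrinks; at a distant location the prediction reverts to the prior mean of 0 with variance near 1.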

Uncertainty-directed exploration

To transform the Bayesian reward predictions of the BMT and GP models into predictions about participant choices, we use upper confidence bound (UCB) sampling together with a softmax choice rule as a combined model of both directed and random exploration [19,50,51].

UCB sampling uses a simple weighted sum of expected reward and uncertainty:

$$q(s) = m_t(s) + \beta \sqrt{v_t(s)} \quad \text{(Eq 8)}$$

to compute a value $q$ for each option $s$, where the exploration bonus $\beta$ determines how to trade off exploring highly uncertain options against exploiting high expected rewards. This simple heuristic, although myopic, produces highly efficient learning by balancing the dual goals of reducing uncertainty and maximizing reward.

The UCB values are then put into a softmax choice rule:

$$P(s) \propto \exp\left( q(s) / \tau \right) \quad \text{(Eq 9)}$$

where the temperature parameter $\tau$ controls the amount of random exploration. Higher temperatures lead to more random choice predictions, with $\tau \to \infty$ converging on uniform sampling. Lower temperatures make more precise predictions, with $\tau \to 0$ converging on an arg max choice rule. Taken together, the exploration bonus $\beta$ and temperature $\tau$ parameters estimated on participant data allow us to assess the relative contributions of directed and undirected exploration, respectively.
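Putting the UCB rule and the softmax together, choice probabilities can be computed as in the following sketch. The parameter values are purely illustrative, not estimates from the paper.

```python
import numpy as np

def choice_probabilities(mean, var, beta=0.5, tau=1.0):
    """UCB values (directed exploration via beta) passed through a
    softmax (random exploration via tau)."""
    q = mean + beta * np.sqrt(var)       # uncertainty bonus
    z = (q - np.max(q)) / tau            # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()                   # softmax over UCB values

# Two options with equal expected reward; beta > 0 favors the uncertain one.
p = choice_probabilities(np.array([5.0, 5.0]), np.array([0.0, 4.0]),
                         beta=0.5, tau=1.0)
```

With `beta=0` the two options would be chosen with equal probability; with a positive exploration bonus, the more uncertain option is preferred, which is the signature of directed exploration.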

Behavioral Results

After pre-training, participants were highly proficient in discriminating the stimuli, achieving at least 90% accuracy in both domains (see Fig S3). While performance was strongly correlated between the spatial and conceptual tasks (Pearson's r = .53, p < .001, BF > 100; Fig 2b), participants performed systematically better in the spatial version (paired t-test: t(128) = 6.0, p < .001, d = 0.5, BF > 100). This difference in task performance can largely be explained by a one-directional transfer effect (Fig 2c). Participants performed better on the conceptual task after having experienced the spatial task (t(127) = 2.8, p = .006, d = 0.5, BF = 6.4). This was not the case for the spatial task, where performance did not differ whether it was performed first or second (t(127) = −1.7, p = .096, d = 0.3, BF = .67). Thus, experience with spatial search boosted performance on conceptual search, but not vice versa.

Participants learned effectively within each round and obtained higher rewards with each successive choice (Pearson correlation between reward and trial: r = .88, p < .001, BF > 100; Fig 2d). We also found evidence for learning across rounds in the spatial task (Pearson correlation between reward and round: r = .91, p < .001, BF = 15), but not in the conceptual task (r = .58, p = .104, BF = 1.5).

Patterns of search also differed across domains. Comparing the average Manhattan distance between consecutive choices in a two-way mixed ANOVA showed an influence of task (within: F(1,127) = 13.8, p < .001, η² = .02, BF = 67) but not environment (between: F(1,127) = 0.12, p = .73, η² = .001, BF = 0.25; Fig 2e). This reflected that participants searched with smaller step sizes in the spatial task (t(128) = −3.7, p < .001, d = 0.3, BF = 59), corresponding to a more local search strategy, but did not adapt their search distance to the environment. Note that each trial began with a randomly sampled initial stimulus, such that participants did not begin near their previous selection (see Methods). The bias towards local search was stronger than expected by chance, and search distances were modulated by reward outcomes, with lower rewards predicting larger jumps to the next choice (Table S1), while treating participants as random effects. This provides initial evidence for generalization-like behavior, where participants actively avoided areas with poor rewards and stayed near areas with rich rewards.
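The search-distance measure used in this analysis can be illustrated as follows; this is a hypothetical helper, not the paper's analysis code.

```python
import numpy as np

def mean_manhattan_step(choices):
    """Average Manhattan distance between consecutive choices, where each
    choice is an (x, y) grid position in the spatial task, or a
    (tilt, stripe-frequency) index pair in the conceptual task."""
    c = np.asarray(choices, float)
    steps = np.abs(np.diff(c, axis=0)).sum(axis=1)  # per-transition distance
    return steps.mean()

# Three consecutive choices: steps of distance 3 and 1, mean 2.0.
d = mean_manhattan_step([(0, 0), (1, 2), (1, 1)])
```

Smaller values of this statistic correspond to the more local search strategy observed in the spatial task.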

In summary, we find correlated performance across tasks, but also differences in both performance and patterns of search. Participants benefited from a one-directional transfer effect, where experience with the spatial task improved performance on the conceptual task, but not the other way around. In addition, participants made larger jumps between choices in the conceptual task and searched more locally in the spatial task. However, participants adapted these patterns in both domains in response to reward values, where lower rewards predicted a larger jump to the next choice.

To better understand how participants navigated the spatial and conceptual tasks, we used computational models to predict participant choices and judgments. Both the GP and BMT models implement directed and undirected exploration using the UCB exploration bonus β and the softmax temperature τ as free parameters. The models differ in terms of learning: the GP generalizes to novel options, using the length-scale parameter λ to modulate the extent of generalization over spatial or conceptual distances, while the BMT learns the rewards of each option independently (see Methods).

Both models were estimated using leave-one-round-out cross-validation, where we compared goodness of fit using out-of-sample prediction accuracy, described using a pseudo-R² (Fig 3a). The differences between models were reliable and meaningful, with the GP model making better predictions than the BMT in both the conceptual (t(128) = 3.9, p < .001, d = 0.06, BF > 100) and spatial tasks (t(128) = 4.3, p < .001).
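A common way to compute such a pseudo-R² is McFadden's measure relative to a uniform random-choice baseline over the 64 options. Assuming that convention (the paper does not spell it out in this excerpt), a sketch is:

```python
import numpy as np

def pseudo_r2(choice_probs, n_options=64):
    """McFadden-style pseudo-R^2 of out-of-sample predictions
    against a uniform random-choice baseline.

    choice_probs: the model's probability for each actually chosen option.
    """
    log_model = np.sum(np.log(choice_probs))
    log_random = len(choice_probs) * np.log(1.0 / n_options)
    return 1.0 - log_model / log_random

pseudo_r2(np.full(20, 1 / 64))   # chance-level predictions -> 0.0
```

Values near 0 indicate chance-level prediction, while values approaching 1 indicate predictions close to a perfect fit.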

Whereas generalization was similar between tasks, there were intriguing differences in exploration. We found substantially lower exploration bonuses (β) in the conceptual task (Z = −5.0, p < .001, r = −.44, BF > 100), indicating a large reduction of directed exploration relative to the spatial task. At the same time, there was an increase in temperature (τ) in the conceptual task (Z = 6.9, p < .001, r = −.61, BF > 100), corresponding to an increase in random, undirected exploration. These domain-specific differences in β and τ were not influenced by task order or environment (two-way ANOVA: all p > .05, BF < 1). While participants displayed levels of directed exploration in the spatial task comparable to previous studies [19,70], they displayed reduced levels of directed exploration in the conceptual task, substituting instead an increase in undirected exploration. Again, this is not due to a lack of effort, because participants made longer search trajectories in the conceptual domain (see Fig S4a). Rather, this indicates a meaningful difference in how people represent or reason about spatial and conceptual domains in order to decide which are the most promising options to explore.

Figure caption fragment (panels c-d): c) Model predictions (Table S2), with ribbons showing the 95% CI (undefined for the BMT model, which makes identical predictions for all unobserved options). d) Correspondence between participant confidence ratings and GP uncertainty, where both are rank-ordered at the individual level. Black dots show aggregate means and 95% CI, while the colored line is a linear regression.

Using parameter estimates from the search task (excluding the entire bonus round), we computed model predictions for each of the bonus round judgments as an out-of-task prediction analysis. Whereas the BMT invariably made the same predictions for all unobserved options, since it does not generalize (Fig 4c), the GP predictions corresponded closely to participants' judgments (see Table S2 for details).

Thus, participant search behavior was consistent with our GP model, and we were also able to make accurate out-of-task predictions about both expected reward and confidence judgments using parameters estimated from the search task. This correspondence between domains was also in line with our behavioral results. Performance was correlated across domains and benefited from higher outcome correlations between similar bandit options (i.e., smooth vs. rough environments). Subsequent choices tended to be more local than expected by chance, and similar options were more likely to be chosen after a high reward than after a low reward outcome.

In addition to revealing similarities, our modelling and behavioral analyses provided a diagnostic lens into differences between spatial and conceptual domains. Whereas we found similar levels of generalization in both tasks, patterns of exploration were substantially different. Although participants showed clear signs of directed exploration (i.e., seeking out more uncertain options) in the spatial domain, this was notably reduced in the conceptual task. However, as if in compensation, participants increased their random exploration in the conceptual task. This implies a reliable shift in how representations of space and concepts map onto exploratory actions.

A related account of distance-dependent generalization is the successor representation (SR), which, under the assumption of a random exploration policy, produces a nearly identical similarity metric to the RBF kernel [76], with exact equivalencies in certain cases [77]. However, the SR can also be learned online using the Temporal-Difference learning algorithm, leading to asymmetric representations of distance that are skewed by the direction of experienced transitions.

Previous work has also investigated transfer across domains [80], where inferences about the transition structure in one task can be generalized to other tasks. Whereas we used identical transition structures in both tasks, we nevertheless found asymmetric transfer between domains. A key question underlying the nature of transfer is the remapping of representations [81,82], which can be framed as a hidden state-space inference problem. Different levels of prior experience with the spatial and conceptual stimuli could give rise to different preferences for reusing task structure as opposed to learning a novel structure. This may be a potential source of the asymmetric transfer we measured in task performance.

Additionally, clustering methods (e.g., [79]) can also provide local approximations of GP inference by making predictions about novel options based on the mean of a local cluster.
For instance, a related reward-learning task on graph structures [76] found that a k-nearest neighbors model provided a surprisingly effective heuristic for capturing aspects of human judgments and decisions. However, a crucial limitation of any clustering model is that it is incapable of learning and extrapolating directional trends, which are a crucial feature of human function learning [59,60]. Alternatively, clustering could also play a role in approximate GP inference [83].

So far, we have interpreted domain differences in terms of representational similarities. But in our study it remains possible that different patterns of exploration could instead result from the different visual presentation of information in the spatial and the non-spatial task. It is, for example, conceivable that exploration in a (spatially or non-spatially) structured environment depends on the transparency of the structure in the stimulus material, or on the alignment of the input modality. In our case, the spatial structure was embedded in the stimulus itself, whereas the conceptual structure was not. Additionally, the arrow key inputs may have been more intuitive for manipulating the spatial stimuli. While generalization could be observed in both domains in previous studies, behavior was highly influenced by spatial features, even when they were irrelevant. The present study was designed to overcome these issues by presenting only task-specific features, yet future work should address the computational features that allow humans to leverage structured knowledge of the environment to guide exploration. There is also a wide range of alternative non-spatial stimuli that we have not tested (for instance, auditory [12] or linguistic stimuli [85,86]), which could be considered more "conceptual" than our Gabor stimuli or may be more familiar to participants.
Thus, it is an open empirical question to determine the limits to which spatial and different kinds of conceptual stimuli can be described using the same computational framework.

Our model also does not account for attentional mechanisms [87] or working memory constraints [88,89], which may play a crucial role in influencing how people integrate information differently across domains [90]. To ask whether feature integration differs between domains, we implemented a variant of our GP model using a Shepard kernel [65], which included an additional free parameter estimating the level of integration between the two feature dimensions (Fig S10). This model did not reveal strong differences in feature integration, yet replicated our main findings with respect to changes in exploration. Additional analyses showed asymmetries in attention to different feature dimensions, an effect modulated by task order (Fig S4d-f). Task order also modulated performance differences between domains, which only appeared when the conceptual task was performed before the spatial task (Fig 2c).

Experience with the spatial task version may have facilitated a more homogeneous mapping of the conceptual stimuli into a 2D similarity space, which in turn facilitated better performance. This asymmetric transfer may support the argument that spatial representations have been "exapted" to other, more abstract domains [6-8]. For example, the experience of different resource distributions in a spatial search task was found to carry over and influence behavior in a subsequent word generation task, with participants exposed to sparser resources adapting their search accordingly. Future work should investigate this phenomenon with alternative models that make stronger assumptions about representational differences across domains.

We also found no differences in predictions and uncertainty estimates about unseen options in the bonus round. This means that participants generalized and managed to track the uncertainties of unobserved options similarly in both domains, yet did not or could not leverage their representations of uncertainty to perform directed exploration as effectively in the conceptual task. Alternatively, differences in random exploration could also arise from limited computational precision during the learning of action values [92]. Thus, the change in random exploration we observed may be due to different computational demands across domains. Similar increases in random exploration have also been observed under direct cognitive load manipulations, such as adding working memory load [93] or limiting the available decision time [94].

While cognitive maps have been invoked to describe transfer across domains [80], state similarities in multi-task reinforcement learning [95], and target hypotheses supporting generalization [90], whether or not all of these recruit the same computational principles and neural machinery remains to be seen.

The study was approved by the ethics committee of the Max Planck Institute for Human Development, and all participants gave written informed consent.

We varied the task order between subjects, with participants completing the spatial and conceptual tasks in counterbalanced order in separate sessions. We also varied between subjects the extent of reward correlations in the search space, by randomly assigning participants to one of two classes of environments (smooth vs. rough), with smooth environments corresponding to stronger correlations; the same environment class was used for both tasks (see below).

Each session consisted of a training phase, the main search task, and a bonus round. At the beginning of each session, participants were required to complete a training task to familiarize themselves with the stimuli (spatial or conceptual), the inputs (arrow keys and spacebar), and the search space (8 × 8 feature space). Participants were shown a series of randomly selected targets and were instructed to use the arrow keys to modify a single selected stimulus (i.e., adjusting the stripe frequency and angle of a Gabor patch or moving the location of a spatial selector; Fig 1c) in order to match a target stimulus displayed below. The target stayed visible during the trial and did not have to be held in memory. The space bar was used to make a selection, and feedback was provided for 800 ms (correct or incorrect). Participants were required to complete at least 32 training trials and were allowed to proceed to the main task once they had achieved at least 90% accuracy on a run of 10 trials (i.e., 9 out of 10). See Fig S3 for an analysis of the training data.

After completing the training, participants were shown instructions for the main search task and had to complete three comprehension questions (Figs S11-S12) to ensure full understanding of the task. Specifically, the questions were designed to ensure participants understood that the spatial or conceptual features predicted reward. Each search task comprised 10 rounds of 20 trials each, with a different reward function sampled without replacement from the set of assigned environments. The reward function specified how rewards mapped onto either the spatial or conceptual features, where participants were told that options with either similar spatial features (spatial task) [19,96] or similar conceptual features (conceptual task) [20,57] would yield similar rewards. Participants were instructed to accumulate as many points as possible, which were later converted into monetary payoffs.

The tenth round of each session was a "bonus round", with additional instructions shown at the beginning of the round. The round began as usual, but after 15 choices, participants were asked to make judgments about the expected rewards (input range: [1,100]) and their level of confidence (Likert scale from least to most confident: [0,10]) for 10 unrevealed targets. These targets were uniformly sampled from the set of unselected options during the current round. After the 10 judgments, participants were asked to make a forced choice between the 10 options. The reward for the selected option was displayed and the round continued as normal. All behavioral and computational modeling analyses exclude the last round, except for the analysis of the bonus round judgments.

Participants used the arrow keys to either move a highlighted selector in the spatial task or change the features (tilt and stripe frequency) of the Gabor stimuli in the conceptual task (Fig S1). On each round, participants were given 20 trials to acquire as many points as possible.

In both tasks the last round was a "bonus round", which solicited judgments about the expected reward and their level of confidence for 10 unrevealed options. Participants were informed that the goal of the task remained the same (maximize cumulative rewards), but that after 15 selections, they would be asked to provide judgments about 10 randomly selected options, which had not yet been explored. Judgments about expected rewards were elicited using a slider from 1 to 100 (in increments of 1), while judgments about confidence were elicited using a slider from 0 to 10 (in increments of 1), with the endpoints labeled 'Least confident' and 'Most confident'. After providing the 10 judgments, participants were asked to select one of the options they just rated, and subsequently completed the round like all others.

Environments

All environments were sampled from a GP prior parameterized with a radial basis function (RBF) kernel (Eq 4), where the length-scale parameter (λ) determines the rate at which the correlations of rewards decay over (spatial or conceptual) distance. Higher λ-values correspond to stronger correlations. We generated 40 samples of each type of environment, using λ_rough = 2 and λ_smooth = 4, which were sampled without replacement and used as the underlying reward function in each task (Fig S2).
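As a rough illustration, reward environments of this kind can be sampled by evaluating the RBF kernel over the 8 × 8 grid and drawing from the resulting multivariate normal. This is a minimal sketch, not the authors' code: the kernel parameterization exp(−d²/(2λ²)) and the rescaling of rewards to [1, 100] are assumptions.

```python
import numpy as np

def rbf_kernel(X1, X2, lam):
    """RBF kernel: k(x, x') = exp(-||x - x'||^2 / (2 * lam^2)). (Assumed form of Eq 4.)"""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * lam**2))

# 8 x 8 grid of options (spatial locations, or Gabor tilt / stripe frequency)
grid = np.arange(8)
X = np.array([[i, j] for i in grid for j in grid], dtype=float)  # 64 options

def sample_environment(lam, rng):
    K = rbf_kernel(X, X, lam)
    # jitter the diagonal for numerical stability of the covariance
    f = rng.multivariate_normal(np.zeros(len(X)), K + 1e-8 * np.eye(len(X)))
    # rescale the sampled function to a payoff range (here [1, 100]; illustrative)
    return 1 + 99 * (f - f.min()) / (f.max() - f.min())

rng = np.random.default_rng(2021)
rough = sample_environment(lam=2, rng=rng)   # lambda_rough = 2
smooth = sample_environment(lam=4, rng=rng)  # lambda_smooth = 4
```

Larger λ yields smoother reward surfaces, because rewards at nearby grid positions are more strongly correlated.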

Environment type was manipulated between subjects, with the same environment type used in both conceptual and spatial tasks.

Bayesian Mean Tracker

The Bayesian Mean Tracker (BMT) is a simple but widely-applied associative learning model [69,97,98], which is a special case of the Kalman Filter with time-invariant reward distributions. The BMT can also be interpreted as a Bayesian variant of the Rescorla-Wagner model [56], making predictions about the rewards of each option j in the form of a normally distributed posterior:

P(μ_j | D_t) = N(m_{j,t}, v_{j,t})

The posterior mean m_{j,t} and variance v_{j,t} are updated iteratively using a delta-rule update based on the observed reward y_t when option j is selected at trial t:

m_{j,t} = m_{j,t-1} + δ_{j,t} G_{j,t} [y_t − m_{j,t-1}]
v_{j,t} = [1 − δ_{j,t} G_{j,t}] v_{j,t-1}

where δ_{j,t} = 1 if option j was chosen on trial t, and 0 otherwise. Rather than having a fixed learning rate, the BMT scales updates based on the Kalman Gain G_{j,t}, which is defined as:

G_{j,t} = v_{j,t-1} / (v_{j,t-1} + θ²)

where θ² is the error variance, which is estimated as a free parameter. Intuitively, the Kalman Gain acts as an adaptive learning rate: the larger the current uncertainty v_{j,t-1} relative to the error variance θ², the larger the update towards the observed reward.

Model cross-validation

As with the behavioral analyses, we omit the 10th "bonus round" in our model cross-validation. For each of the other nine rounds, we use cross validation to iteratively hold out a single round as a test set, and compute the maximum likelihood estimate using differential evolution [99] on the remaining eight rounds. Model comparisons use the summed out-of-sample prediction error on the test set, defined in terms of log loss (i.e., negative log likelihood).
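The BMT update above can be sketched in a few lines. This is an illustrative implementation of the stated equations, not the authors' code; variable names are hypothetical.

```python
import numpy as np

def bmt_update(m, v, j, y, theta_sq):
    """Update posterior means m and variances v after observing reward y for
    the chosen option j (delta_{j,t} = 1 only for the chosen option)."""
    m, v = m.copy(), v.copy()
    G = v[j] / (v[j] + theta_sq)      # Kalman gain
    m[j] += G * (y - m[j])            # delta-rule update scaled by the gain
    v[j] *= (1 - G)                   # uncertainty shrinks for the chosen option
    return m, v

# Example: diffuse prior over 64 options, then one observation
m = np.full(64, 50.0)   # prior mean (illustrative)
v = np.full(64, 500.0)  # prior variance (illustrative)
m, v = bmt_update(m, v, j=10, y=80.0, theta_sq=500.0)
# When v equals theta_sq, the Kalman gain is 0.5: the mean moves halfway to y
```

Note that unlike the GP model, the BMT updates only the chosen option, so it cannot generalize observed rewards to unobserved options.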

As an intuitive statistic for goodness of fit, we report predictive accuracy as a pseudo-R², comparing the out-of-sample log loss of a given model M_k against a random model M_rand:

R² = 1 − logloss(M_k) / logloss(M_rand)

R² = 0 indicates chance performance, while R² = 1 is a theoretically perfect model.
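A minimal sketch of this computation, assuming the random model chooses uniformly among the 64 options (so each choice contributes −log(1/64) to its log loss):

```python
import numpy as np

def pseudo_r2(nll_model, n_choices, n_options=64):
    """Pseudo-R^2 = 1 - logloss(model) / logloss(random model).

    nll_model : summed out-of-sample negative log likelihood of the model
    n_choices : number of held-out choices being predicted
    """
    nll_random = n_choices * np.log(n_options)  # -log(1/64) per choice
    return 1 - nll_model / nll_random

# A model predicting every choice perfectly has zero log loss -> R^2 = 1;
# a model no better than uniform chance has R^2 = 0.
```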

Protected exceedance probability

The protected exceedance probability (pxp) is defined in terms of a Bayesian model selection framework for group studies [71,72]. Intuitively, it can be described as a random-effects analysis, where models are treated as random effects and are allowed to differ between subjects. Inspired by a Polya urn model, we can imagine a population containing K different types of models (i.e., people best described by each model), much like an urn containing different colored marbles. If we assume that there is a fixed but unknown distribution of models in the population, what is the probability of each model being more frequent in the population than all other models in consideration?

665
This is modelled hierarchically, using variational Bayes to estimate the parameters of a Dirichlet distribution describing the posterior probabilities of each model P(m_k | y) given the data y. The exceedance probability is thus defined as the posterior probability that the frequency r_{m_k} of a model is larger than that of all other models m_{k'≠k} under consideration:

φ_k = P(r_{m_k} > r_{m_{k'}}, ∀ k' ≠ k | y)

The protected exceedance probability additionally corrects for the possibility that differences in model evidence arose by chance [72].

Statistical tests

Bayesian t-tests define the alternative hypothesis using a Cauchy distribution over effect sizes, with its scale set to √2/2, as suggested by [101]. All statistical tests are non-directional as defined by a symmetric prior (unless otherwise indicated).
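Given a Dirichlet posterior over model frequencies, the (unprotected) exceedance probability can be estimated by Monte Carlo, as in this sketch. The α values are illustrative, and the additional chance-correction of the protected version is omitted here.

```python
import numpy as np

def exceedance_prob(alpha, n_samples=50_000, seed=0):
    """Monte Carlo estimate of exceedance probabilities.

    alpha : parameters of the Dirichlet posterior over model frequencies
    Returns, for each model k, the estimated probability that its population
    frequency r_k exceeds that of every other model.
    """
    rng = np.random.default_rng(seed)
    r = rng.dirichlet(alpha, size=n_samples)  # samples of model frequencies
    winners = r.argmax(axis=1)                # most frequent model per sample
    return np.bincount(winners, minlength=len(alpha)) / n_samples

# e.g., a posterior strongly favoring the first of three models
ep = exceedance_prob(np.array([20.0, 3.0, 3.0]))
```

Because frequencies are continuous, ties in the argmax occur with probability zero, and the exceedance probabilities sum to one across models.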

Non-parametric comparisons are tested using either the frequentist Mann-Whitney-U test for independent samples, or the Wilcoxon signed-rank test for paired samples. In both cases, the Bayesian test is based on performing posterior inference over the test statistics (Kendall's r_τ for the Mann-Whitney-U test and the standardized effect size r = Z/√N for the Wilcoxon signed-rank test) and assigning a prior using parametric yoking [102]. This leads to a posterior distribution for Kendall's r_τ or the standardized effect size r, which yields an interpretable Bayes factor via the Savage-Dickey density ratio test. The null hypothesis posits that parameters do not differ between the two groups, while the alternative hypothesis posits an effect and assigns an effect size using a Cauchy distribution with the scale parameter set to 1.

We use a two-way mixed-design analysis of variance (ANOVA) to compare the means of both a fixed effects factor (smooth vs. rough environments) as a between-subjects variable and a random effects factor (conceptual vs. spatial) as a within-subjects variable. To compute the Bayes Factor, we assume independent g-priors [106] for each effect size θ_1 ∼ N(0, g_1 σ²), …, θ_p ∼ N(0, g_p σ²), where each g-value is drawn from an inverse chi-squared prior.

Mixed effects regressions are performed in a Bayesian framework with brms [108] using MCMC methods (No-U-Turn sampling [109] with the proposal acceptance probability set to .99). In all models, we use a maximal random effects structure [110], and treat participants as a random intercept. Following [111] we use generic weakly informative priors. All models were estimated over four chains of 4000 iterations, with a burn-in period of 1000 samples.

Tukey boxplots show median (line) and 1.5× IQR, with diamonds indicating group means. b) Average correct choices during the training phase.
In the last 10 trials before completing the training phase, participants had a mean accuracy of 95.0% on the spatial task and 92.7% on the conceptual task (difference of 2.3%). In contrast, in the first 10 trials of training, participants had a mean accuracy of 84.1% in the spatial task and 68.8% in the conceptual task (difference of 15.4%). c) Heatmaps of the accuracy of different target stimuli, where the x and y-axes of the conceptual heatmap indicate tilt and stripe frequency, respectively. d) The probability of error as a function of the magnitude of error (Manhattan distance from the correct response). Thus, most errors were close to the target, with higher magnitude errors being monotonically less likely to occur. Each dot is the aggregate mean, while the lines show the fixed effects of a Bayesian mixed-effects model (see Table S1), with the ribbons indicating the 95% CI. The relationship is not quite linear, but is also found using a rank correlation (r_τ = .18, p < .001, BF > 100). The dashed line indicates random chance.

d) Search trajectories decomposed into the vertical/stripe frequency dimension vs. horizontal/tilt dimension. Bars indicate group means and error bars show the 95% CI. We find more attention given to the vertical/stripe frequency dimension in both tasks, with a larger effect for the conceptual task (F(1, 127) = 26.85, p < .001, η² = .08, BF > 100), but no difference across environments (F(1, 127) = 1.03, p = .311, η² = .005, BF = 0.25). e) We compute attentional bias as Δ_dim = P(vertical/stripe frequency) − P(horizontal/tilt), where positive values indicate a stronger bias towards the vertical/stripe frequency dimension.
Attentional bias was influenced by the interaction of task order and task (F(1, 127) = 8.1, p = .005, η² = .02, BF > 100): participants were more biased towards the vertical/stripe frequency dimension in the conceptual task when the conceptual task was performed first (t(66) = −6.0, p < .001, d = 0.7, BF > 100), but these differences disappeared when the spatial task was performed first (t(61) = −1.6, p = .118, d = 0.2, BF = .45). f) Differences in attention and score. Each participant is represented as a pair of dots, where the connecting line shows the change in score and Δ_dim across tasks. We found a negative correlation between score and attention in the conceptual task only in the conceptual-first order (r_τ = −.31, p < .001, BF > 100), but not in the spatial-first order (r_τ = −.07, p = .392, BF = .24). There were no relationships between score and attention in the spatial task in either order (spatial first: r_τ = .03, p = .738, BF = .17; conceptual first: r_τ = −.03, p = .750, BF = .17).

a) The relationship between mean performance and predictive accuracy, where in all cases, the best performing participants were also the best described. b) The best performing participants were also the most diagnostic between models, but not substantially skewed towards either model. Linear regression lines strongly overlap with the dotted line at y = 0, where participants above the line were better described by the GP model. c) Model comparison split by which task was performed first vs. second. In both cases, participants were better described on their second task, although the superiority of the GP over the BMT remains when comparing only task one (paired t-test: t(128) = 4.6, p < .001, d = 0.10, BF = 1685) or only task two (t(128) = 3.5, p < .001, d = 0.08, BF = 27).
July 13, 2020 33/39

GP parameters and performance. a) We do not find a consistent relationship between λ estimates and performance, which were anecdotally correlated in the spatial task (r_τ = .13, p = .030, BF = 1.2) or negatively correlated in the conceptual task (r_τ = −.22, p < .001, BF > 100). b) Higher β estimates were strongly predictive of better performance in both conceptual (r_τ = .32, p < .001, BF > 100) and spatial tasks (r_τ = .31, p < .001, BF > 100). c) On the other hand, high temperature values predicted lower performance in both conceptual (r_τ = −.59, p < .001, BF > 100) and spatial tasks (r_τ = −.58, p < .001, BF > 100).

…suggesting participants were more sensitive to the reward values (i.e., more substantial updates to their mean estimates). Error variance was also somewhat correlated across tasks (r_τ = .18, p = .003, BF = 10). b) As with the GP model reported in the main text, we also found strong differences in exploration behavior in the BMT. We found lower estimates of the exploration bonus in the conceptual task (Z = −5.9, p < .001, r = −.52, BF > 100). The exploration bonus was also somewhat correlated between tasks (r_τ = .16, p = .006, BF = 4.8). c) Also in line with the GP results, we again find an increase in random exploration in the conceptual task (Z = −6.9, p < .001, r = −.61, BF > 100). Once more, temperature estimates were strongly correlated (r_τ = .34, p < .001, BF > 100).

We also considered an alternative form of the GP model. Instead of modeling generalization as a function of squared Euclidean distance with the RBF kernel, we use the Shepard kernel described in [65], where we instead use Minkowski distance with the free parameter ρ ∈ [0, 2]. This model is identical to the GP model reported in the main text when ρ = 2. But when ρ < 2, the input dimensions transition from integral to separable representations [112].
The lack of clear differences in model parameters motivated us to only include the standard RBF kernel in the main text. a) We find no evidence for differences in generalization between tasks (Z = −1.8, p = .039, r = −.15, BF = .32). There is also marginal evidence of correlated estimates (r_τ = .13, p = .026, BF = 1.3). b) There is anecdotal evidence of lower ρ estimates in the conceptual task (Z = −2.5, p = .006, r = −.22, BF = 2.0). The implication of a lower ρ in the conceptual domain is that the Gabor features were treated more independently, whereas the spatial dimensions were more integrated. However, the statistics suggest this is not a very robust effect. These estimates are also not correlated (r_τ = −.02, p = .684, BF = .12). c) Consistent with all the other models, we find systematically lower exploration bonuses in the conceptual task (Z = −5.5, p < .001, r = −.49, BF > 100). There was weak evidence of a correlation across tasks (r_τ = .14, p = .021, BF = 1.6). d) We find clear evidence of higher temperatures in the conceptual task (Z = −6.3, p < .001, r = −.56, BF > 100), with strong correlations across tasks (r_τ = .41, p < .001, BF > 100).

Note: We report the posterior median (Est.) and 95% highest posterior density (HPD) interval. σ² indicates the individual-level variance and τ_00 indicates the variation between individual intercepts and the average intercept. See Methods for full specification of model structure and priors.

Note: We report the posterior median (Est.) and 95% highest posterior density (HPD) interval. In the first model (Model Prediction), participant judgments in the range [1,100] are used to predict the GP posterior mean, whereas the second model (Model Uncertainty) uses confidence judgments in the range [1,11] to predict the GP posterior variance. All GP posteriors are computed based on individual participant λ-values, estimated from the corresponding bandit task. σ² indicates the individual-level variance and τ_00 indicates the variation between individual intercepts and the average intercept. See Methods for full specification of model structure and priors.