Research Article

# Cortical and Hippocampal Correlates of Deliberation during Model-Based Decisions for Rewards in Humans

• aaronmb@princeton.edu

Affiliation: Department of Psychology, Program in Cognition and Perception, New York University, New York, New York, United States of America

Current address: Princeton Neuroscience Institute, Princeton, New Jersey, United States of America.

• Affiliations: Department of Psychology, Program in Cognition and Perception, New York University, New York, New York, United States of America, Center for Neural Science, New York University, New York, New York, United States of America

• Published: December 05, 2013
• DOI: 10.1371/journal.pcbi.1003387
• Featured in PLOS Collections

## Abstract

How do we use our memories of the past to guide decisions we've never had to make before? Although extensive work describes how the brain learns to repeat rewarded actions, decisions can also be influenced by associations between stimuli or events not directly involving reward — such as when planning routes using a cognitive map or chess moves using predicted countermoves — and these sorts of associations are critical when deciding among novel options. This process is known as model-based decision making. While the learning of environmental relations that might support model-based decisions is well studied, and separately this sort of information has been inferred to impact decisions, there is little evidence concerning the full cycle by which such associations are acquired and drive choices. Of particular interest is whether decisions are directly supported by the same mnemonic systems characterized for relational learning more generally, or instead rely on other, specialized representations. Here, building on our previous work, which isolated dual representations underlying sequential predictive learning, we directly demonstrate that one such representation, encoded by the hippocampal memory system and adjacent cortical structures, supports goal-directed decisions. Using interleaved learning and decision tasks, we monitor predictive learning directly and also trace its influence on decisions for reward. We quantitatively compare the learning processes underlying multiple behavioral and fMRI observables using computational model fits. Across both tasks, a quantitatively consistent learning process explains reaction times, choices, and both expectation- and surprise-related neural activity. The same hippocampal and ventral stream regions engaged in anticipating stimuli during learning are also engaged in proportion to the difficulty of decisions. 
These results support a role for predictive associations learned by the hippocampal memory system to be recalled during choice formation.

## Author Summary

We are always learning regularities in the world around us: where things are, and in what order we might find them. Our knowledge of these contingencies can be relied upon if we later want to use them to make decisions. However, there is little agreement about the neurobiological mechanism by which learned contingencies are deployed for decision making. These are different kinds of decisions than simple habits, in which we take actions that have in the past given us reward. Neural mechanisms of habitual decisions are well-described by computational reinforcement learning approaches, but have not often been applied to ‘model-based’ decisions that depend on learned contingencies. In this article, we apply reinforcement learning to investigate model-based decisions. We tested participants on a serial reaction time task with changing sequential contingencies, and choice probes that depend on these contingencies. Fitting computational models to reaction times, we show that two sets of predictions drive simple response behavior, only one of which is used to make choices. Using fMRI, we observed learning and decision-related activity in hippocampal and ventral cortical areas that is computationally linked to the learned contingencies used to make choices. These results suggest a critical role for a hippocampal-cortical network in model-based decisions for reward.

### Introduction

Every day, we learn new information that is not immediately relevant to our current goals. We might learn the layout of a new neighborhood, or, while traveling a familiar street, happen upon a restaurant that is about to open. Though we might not receive any rewards — e.g., a friendly neighbor or a great meal — during our initial experience, we still learn our way around. If, later, we decide to seek a particular reward, we are usually quite capable of using the knowledge we gained from such exploration to achieve our goal. This is known as goal-directed or model-based decision making: the construction of plans to achieve rewards, incorporating knowledge about contingencies in the world [1][3]. The neural systems that support these forms of decisions are a focus of much ongoing research.

In this study, we provide evidence that the hippocampus and related cortical regions support the contingencies necessary to perform model-based decisions. We show that ongoing learning of the required contingencies can be measured in two kinds of behavior: simple responses and deliberative choices. Further, we show that BOLD signal in the regions of interest scales with multiple computational variables that describe the use of these contingencies to perform action selection.

#### Representations in model-based decisions

Model-based decisions stand in contrast to a simpler sort of learned decision making whose neural instantiation is better understood: simply learning to repeat rewarded behaviors [4][6]. To explain the former, more knowledge-driven path to decisions, researchers have long argued that the brain maintains internal representations of the contingency structure of a task — a “world model” or, in spatial tasks, a “cognitive map” — that can be adaptively applied to drive behavior. Like a map of space, these representations describe the relationships between situations and actions, separate from any ties to reward. The reliance on these representations is a defining characteristic of goal-directed decisions [1], [2]. Therefore, to identify the neural mechanisms of these decisions, researchers must first identify the representations that guide them.

#### From learning to action

Here, to examine in detail the process by which contingency representations are learned and inform action choice, we combined a sequential learning task [7] with an interleaved decision task in which rewards depended on contingencies learned in the first task. In the learning task, participants were presented with one of four photographic images at a time, and asked simply to press the key corresponding to that image. Which of the four images appeared next depended, probabilistically, on the image currently being viewed. The sequential learning task allowed us to measure the gradual, trial-by-trial acquisition of these probabilistic contingencies linking the four image stimuli. Participants' responses provided two observable measurements of learning: reaction time to identify each image, and image-specific BOLD activity in the ventral stream visual cortex.

Reaction times to identify an image indicated the degree to which subjects expected it, given the previous one — a classic and relatively direct measure of the learned predictive association [8][12] — and category-specific BOLD also reflected engagement of the neural representation of each image in anticipation of its presentation [13]. By fitting computational models to this progression of subject expectations, we extracted a computational signature of the learning process, the learning rate, and used it to generate timeseries of decision variables based on these learned contingencies.

This enabled us to quantitatively characterize the influence of these associations when participants were asked, in the interleaved decision probes, to draw on them to make decisions. Specifically, participants were told that one of the four images was, for a short period of time, to be associated with a reward. They were then asked to choose which of two other images would most quickly lead to that rewarded image. This manipulation has a form similar to a latent learning paradigm [14], [15], in which contingencies are learned separately from their link to reward. By requiring subjects to use knowledge of the contingencies to guide their decisions, this design allows us to probe how and whether the contingencies are used to seek trial-specific goals, a use that is exclusively the realm of model-based decision processes.

Comparing the learning rates fit to behavior and BOLD responses, we observed a striking match between hippocampal correlates of sequential learning and the learning underlying the reaction times, choices, prediction errors, and ventral visual stream activity, during both simple identification responses and deliberative decisions for reward. These results suggest that regions involved in sequential learning, including hippocampus and ventral cortical areas, indeed provide the contingency representations necessary to support model-based choice and, critically, demonstrate the use of particular associations learned by these regions during model-based decision making.

### Results

Our task trains participants on probabilistic sequential contingencies linking image stimuli (Figure 1). Then, on probe trials interspersed with the learning, the task offers participants the opportunity to make decisions for rewards, using their estimates of those sequential contingencies to inform their choices (Figure 2). Previously, we showed that two neural processes — associated with the hippocampus and striatum, respectively — develop separate estimates of the contingencies in the learning portion of this task [7]. As the hippocampal system has long been a candidate for learning the relations (e.g., maps or models) supporting flexible choice, our hypothesis is that goal-directed decisions will depend on the contingency estimates learned by the hippocampal system.

To test this hypothesis, we fit computational learning models to explain behavioral and neural observables (such as reaction times, decisions, and BOLD activity) in terms of recent experience with image transitions. Following the approach developed previously [7], for each observable we estimate a learning rate parameter, which measures how far into the past the observable is affected by previous events. Since the learning rate measures which particular events the observable is sensitive to, we use it as a signature of the underlying associative learning process. We then compare these estimates across different observables to investigate whether they might be driven by common learned associations.

We first examine reaction times for behavioral evidence of prediction learning during the sequential image presentations, verifying that the key results from the earlier study are replicated in the present design. Next, we examine how this learning is used to guide goal-directed choices for reward.

We then carry these analyses over to neuroimaging data, observing neural correlates of learned predictions across both task phases. One source of such correlates is image category-specific BOLD signals in visual ventral stream regions during the sequential learning task. During choice probes, we identify analogous content-specific activations that reflect deliberative computations supporting model-based decisions.

#### Behavior

##### Two processes learn serial order relationships.

Participants performed a sequential response task in which they were asked to press a key corresponding to one of four exemplar images, each displayed one at a time (Figure 1). The sequence was generated according to a first-order Markov process: at each step, an image's successor was chosen from a probability distribution over the four images. The distributions over next images were different for each current image. Participants were instructed as to the existence, but not the content, of this transition structure. They were told that these contingencies would change periodically, and without notice, throughout the experiment.

As has often been observed in such tasks [8], reaction times (RTs) were facilitated for images that were conditionally more probable given their predecessor (Figure 3). This impression is confirmed by a multiple linear regression with the ground-truth (programmed) conditional probability as the explanatory variable of interest. Across participants, the regression weight for this quantity was indeed significantly negative (one-sample t-test; mean effect size 0.44 ms RT per percentage point of conditional probability) and, at the individual level, reached significance for all 17 participants.

This speeding allowed us to use RT as a behavioral index of participants' image expectation, and to leverage this to study how subjects updated their expectations trial-by-trial, by fitting computational learning models to the RT timeseries. As in our previous study [7], RTs were well explained by combining two incremental learning processes [16], [17]. The processes each separately learn a table of conditional image succession probabilities, updating it incrementally in response to the prediction error at each observation, but with the size of this update in each of the independent processes controlled by a different learning rate parameter. To explain reaction times, the two conditional probability predictions are combined in a weighted average with some mixing proportion. This two-process learning model provided a better fit to RTs than a one-process model for all 17 subjects individually (average log Bayes Factor 12.53, with no individual Bayes Factor in favor of the one-process model), and for the population as a whole (summed log Bayes Factor 213.08). The means, over the population, of the model's best-fitting parameters were , , with a weight of to the slower rate. To generate regressors for fMRI we refit the group's behavior, taking all parameters as fixed effects across the population. (This regularizes the parameter estimates and allows us to examine variations in neurally implied learning rate estimates relative to a common baseline.) The fixed-effect parameter estimates were and , weighted at , which did not significantly differ from the ensemble of individual estimates.

These data are consistent with our hypothesis that sequential learning arises from two distinct learning processes, which are superimposed to produce reaction time behavior.
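As a concrete illustration, the two-process account can be sketched as a pair of delta-rule learners over the 4×4 transition table (a minimal sketch; the function name, uniform initialization, and combination rule are illustrative assumptions, not the paper's fitted code):

```python
import numpy as np

def two_process_rt_model(images, alpha_fast, alpha_slow, w, n_images=4):
    """Two delta-rule learners of conditional succession probabilities,
    updated at different rates and mixed by weight w on the slow process.
    Parameter names here are illustrative assumptions."""
    P_fast = np.full((n_images, n_images), 1.0 / n_images)
    P_slow = np.full((n_images, n_images), 1.0 / n_images)
    predictions = []
    for prev, cur in zip(images[:-1], images[1:]):
        # Combined expectation for the image that actually appeared.
        predictions.append(w * P_slow[prev, cur] + (1 - w) * P_fast[prev, cur])
        # Delta-rule update of each process toward a one-hot outcome.
        outcome = np.zeros(n_images)
        outcome[cur] = 1.0
        P_fast[prev] += alpha_fast * (outcome - P_fast[prev])
        P_slow[prev] += alpha_slow * (outcome - P_slow[prev])
    return np.array(predictions)
```

In the fitted models, RT on each trial is then taken to decrease with the combined predicted probability of the image that actually appeared.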

##### Only slow-process associations drive choice.

Our next aim was to examine how these predictions were used to make decisions for reward, and in particular to what extent decisions draw on either or both of the learning processes that drive reaction times.

At pseudorandom intervals throughout the task, participants encountered a choice probe (Figure 2) in which they were asked to use their current estimates of image contingencies to make decisions for reward.

Participants were informed that one of the four images was now worth money ($1 to $5) each time it occurred during the next several trials. They were next asked to choose from which of two other images to restart the sequence, so as to maximize their chance of winning money.

To examine how learned sequential transition probabilities influence choice behavior, we fit choices with a model in which participants chose between the two starting images on the basis of the estimated probability of each image leading to the rewarded image in one step. (We did not find evidence that participants took into account the possibility that choosing an image would lead to the rewarded image on timesteps following the first.) In particular, the model assumes that the chance of choosing an option depends on a decision variable defined as the difference between the conditional probabilities that the rewarded image would follow each of the two options. In this model, choice preferences depend on the transition probabilities learned in the preceding sequential response trials, and therefore they also depend on the learning rate. Because each learning rate implies a different series of transition probabilities, each also implies a different timeseries of choice preferences.

We fit learning models to the choices to answer the question: Which learning rate (or rates) for transition probabilities provided the best explanation for choice behavior? Considering the possibility that, like RTs, choices were due to some weighted combination of probabilities learned at two rates, we compared one- and two-process models. However, in this case a model with a single free learning rate provided a better fit for all 17 subjects individually (mean log Bayes Factor 2.31), and across the population (summed log Bayes Factor 39.26 versus the two rate model).

This single free learning rate, fit to choices, matched the slow learning rate fit to reaction times. Across subjects, the mean best-fit learning rate was 0.10 ± 0.05, which was smaller than the fast learning rate obtained for RTs but not significantly different from the slow learning rate (Figure 3). These results suggest that choices, unlike reaction times, exclusively result from associations learned at a single timescale, consistent with the slow process observed in RTs.

How are these learned transition probabilities used to compute action values? The standard model is that expected values are computed by multiplying the probability of each option image leading to the goal image by the reward value of that goal image. These expected values are then transformed into choice probabilities using a softmax function, with a free inverse-temperature parameter.
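A minimal sketch of this standard expected-value account (function and parameter names are illustrative; `beta` stands in for the softmax inverse temperature):

```python
import numpy as np

def softmax_choice_prob(p_goal_a, p_goal_b, reward, beta):
    """P(choose option A) under expected value + softmax.
    beta is the softmax inverse temperature; names are illustrative."""
    q_a = p_goal_a * reward  # expected value of restarting from image A
    q_b = p_goal_b * reward
    # Two-option softmax reduces to a logistic in the value difference.
    return float(1.0 / (1.0 + np.exp(-beta * (q_a - q_b))))
```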

Another approach, inspired by race models [18], is based on the idea that the outcome predictions driving choice might involve discrete retrievals of next-step images, proportional to the estimated transition probabilities [19], [20]. In this model, choice probabilities result from a thresholded comparison process after some number of draws from the binomial distribution defined by the transition probabilities. This approach is similar to the sort of sequential sampling processes used to model perceptual decisions [21]. Fitting this model to the set of choices by each participant gives an additional parameter: the average number of draws. Here, binomial sampling noise introduces stochasticity in the choices similar to the softmax logistic distribution often used in decision models [22], with the number of draws playing a role analogous to softmax's inverse temperature. (See Materials and Methods, section Choice models, for more details.) In fact, choices are also similarly fit by the softmax, and the foregoing results concerning learning rate are robust to either choice rule. We adopt the sampling model because its process-level description of decision noise motivates analyses of neuroimaging data during choice formation, presented below.
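The sampling rule can be illustrated by Monte Carlo simulation, under the assumption that each option accrues binomial draws that retrieve the rewarded image with its learned transition probability, with ties split at random (the tie-breaking and threshold details here are assumptions for illustration, not the paper's exact rule):

```python
import numpy as np

def sampling_choice_prob(p_a, p_b, n_draws, n_sim=20000, rng=None):
    """Monte-Carlo sketch of the binomial sampling choice rule:
    each option counts Binomial(n_draws, p) retrievals of the rewarded
    image; the option with more retrievals is chosen, ties split 50/50."""
    rng = np.random.default_rng(0) if rng is None else rng
    hits_a = rng.binomial(n_draws, p_a, size=n_sim)
    hits_b = rng.binomial(n_draws, p_b, size=n_sim)
    wins = (hits_a > hits_b) + 0.5 * (hits_a == hits_b)
    return float(wins.mean())
```

As the text notes, increasing the number of draws sharpens choices, analogously to raising the softmax inverse temperature.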

At the fixed, slow learning rate, the best-fit number of draws was 4.675 ± 1.25 samples, across subjects. As in our learning rate analysis, we estimated this as a fixed effect (4.177) for generating our fMRI regressors (see Choice difficulty in Neuroimaging results).

#### Neuroimaging

We next identified neural correlates of each learning process.

##### Stimulus anticipation in each process has distinct neural substrates.

We began by looking for correlates of participants' anticipation of the next image to appear. Specifically, we sought activity that reflected how difficult it might be to predict this next image. Previous work [7], [9], [10] has shown that BOLD activity in hippocampus and elsewhere covaries with the participants' modeled uncertainty about future events. This may reflect a process of spreading activation, by which an image triggers activations of likely successor images, which are more numerous in situations of uncertainty. Also consistent with this idea, the anterior portion of the hippocampus was recently shown more directly to reflect such anticipation in sequential relationships among abstract stimuli [23].

Here, uncertainty is formally defined as the “forward entropy,” or entropy of the model's prediction about the identity of the next image, conditional on the current one. This is a trial-by-trial function of the model's learned transition probabilities, which in turn depend on the learning rate fit to behavior. These regressors are specified as parametric modulators on delta functions placed at the onset of the currently presented image.
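Computing this regressor from a learned row-stochastic transition matrix is straightforward (a sketch; in the analysis, `P` would come from the delta-rule model fit at the relevant learning rate):

```python
import numpy as np

def forward_entropy(P, current_image):
    """Shannon entropy (bits) of the next-image prediction, conditional
    on the current image; P is a row-stochastic transition matrix."""
    p = np.asarray(P, dtype=float)[current_image]
    p = p[p > 0]  # convention: 0 * log(0) = 0
    return float(-(p * np.log2(p)).sum())
```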

The two-process model as fit to reaction times therefore gives rise to two entropy timeseries, one each from predictions generated at the fast and slow learning rates. Based on our previous results [7], we expected to find different correlates corresponding to the entropy timeseries from each process: in hippocampus for the slower learning rate and in striatum for the faster learning rate. We defined, using the AAL template library, anatomical masks of the structures in which we observed above-threshold activations in our previous study: left hippocampus for slow learning rate entropy and bilateral caudate for fast learning rate entropy [7]. Accordingly, when forward entropy was computed according to the slow learning rate process, a cluster of significantly correlated activity was observed in the region identified in our previous study, left anterior hippocampus (peak −26, −10, −18; corrected for family-wise error due to multiple comparisons over an anatomically-defined mask of left hippocampus; Figure 4).

We ran a separate regression containing an identical GLM except for the entropy regressor, which was now computed according to the fast learning rate. In this GLM, we observed activation on the tail of right caudate (peak 24, −14, 26) that was significant when corrected for multiple comparisons over an anatomically-defined mask of bilateral caudate. (A symmetric cluster in left caudate was observed at an uncorrected threshold, but did not survive correction for multiple comparisons.)

The foregoing results suggest two prediction processes that each learn at a rate corresponding to one of those observed in the RT behavior, with anatomically separate substrates. As in our previous study [7], we more directly tested the correspondence of learning rate to neural structure within a single GLM by independently estimating the learning rate that best explained entropy-related BOLD signals in each area. We located voxels of interest in an unbiased manner and fit the learning rate using a Taylor approximation to the entropy regressor's dependence on the parameter [7], [24], [25]. Neural learning rate estimates are visualized, superimposed over the behaviorally-obtained learning rates, in Figure 5.

Matching our previous results [7], the fast learning rate from RTs matched the one computed from BOLD signal in the striatum. In the mean over participants, the learning rate implied by BOLD in caudate was significantly larger than the slow learning rate fit to RTs, but not significantly different from the fast learning rate.

In our prior study [7], the slow learning rate from RTs matched the one computed from BOLD signal in the anterior hippocampus; here, though the hippocampal BOLD learning rate was numerically closer to the slow rate fit to RTs, it was statistically different from both that rate and the fast one. Importantly, however, it was not statistically distinguishable from the learning rate fit to choices, thus supporting the critical link from learning to choices, and it was also significantly smaller than the striatal learning rates computed from BOLD (paired samples).

Taken together with the behavioral model fits, these neuroimaging results and learning rate computations support the suggestion that two distinct processes learn to estimate the sequential contingencies embedded in our image identification task. Further, neural activity in two structures reflects anticipation (indexed by forward entropy) according to the estimates of each process, with learning rates that differ from one another and approximate those identified in reaction time behavior.

##### Neural decision computations are uniquely explained by the slow process.

We next sought correlates of decision computations driven by the learned transition probabilities. Our analysis of choice behavior indicated that decisions were informed by the sequential contingencies learned at a rate consistent with the slow learning rate fit to RTs. Therefore we hypothesized that activity related to decision computations would also be identified with a similar learning rate. If this indeed reflected a common underlying learning process, it would engage the anterior hippocampus, which was shown to support slow learning in the sequential learning task.

We first analyzed activity during the deliberation period leading up to the choice. Similar to our analysis of anticipatory activity during sequential response trials, we probed the neural correlates of deliberation by asking: how difficult was it for the participant to make this decision? We used as our measurement of choice difficulty the uncertainty (variance) in the decision variable (the value difference between options) that led to the current choice, computed using the choice model parameters fit to behavior (for details, see Choice models in Materials and Methods). This quantity, which was motivated by the process-level model of decision noise, is similar to the entropy measure used to define uncertainty during the learning task. The key difference is that the distribution being analyzed lumps images into two categories (rewarded vs nonrewarded) rather than predicting all four separately.

This regressor was specified at the time of onset of the choice screen.
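Under the binomial sampling model, one simple way to formalize this uncertainty is the variance of the difference between the two options' sampled proportions of goal retrievals. This sketch is an illustration under that assumption, not necessarily the paper's exact formula:

```python
def choice_difficulty(p_a, p_b, n_draws):
    """Variance of the sampled decision variable: the difference in the
    proportions of n_draws retrievals that land on the rewarded image.
    Independent binomial proportions give Var = p(1-p)/N for each option."""
    return (p_a * (1 - p_a) + p_b * (1 - p_b)) / n_draws
```

On this account, difficulty is highest when both options are near 0.5 and falls as more samples are drawn, matching the intuition that well-separated, well-learned options make for easy choices.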

In our region of prior interest, an area of left anterior hippocampus was activated, though only marginally significant after multiple-comparison correction over our anatomical mask (Figure 6b). This activation is similar to the entropy-related activation seen during the stimulus prediction task.

Does this activity reflect learning similar to one of the processes observed in RT behavior? We again estimated the learning rate implied by these BOLD correlates. The learning rate computed from anterior hippocampal BOLD during choices matched the slow learning rate fit to RT (Figure 5): it was different from the fast learning rate from RT behavior, but did not differ from the slow RT learning rate. The involvement of the hippocampal region in both phases of the task, showing the same type of learned associations, supports the idea that a common learning process supports both behaviors.

##### Choice difficulty engages a fronto-temporal memory network.

Additionally, at the whole-brain level, the choice difficulty measure revealed correlates in a broad fronto-temporal network that appears to correspond to a component of the ‘default network’, a set of brain regions that has been associated with constructive memory and mind-wandering [26], [27].

In particular, two clusters survived correction for multiple comparisons over the entire brain: a region of anterior medial PFC (peak 4, 64, −2) and a region of posterior cingulate cortex (peak −2, −18, 32; Figure 6a). Also, activation in a third component of the default network, the dorsomedial PFC (peak 14, 40, 40), survived whole-brain multiple-comparison correction for cluster extent, but not peak. Together with the above-reported anterior hippocampal cluster, the overall pattern of activation is consistent with previous observations of the fronto-temporal memory component of the default network [28].

We ruled out alternative explanations for activity in these regions in terms of other variables that might correspond to the notion of ‘choice difficulty’: across subjects, the choice difficulty regressor was not significantly correlated with reaction time, nor with the expected value of the choice.

### Materials and Methods

##### Exclusion criteria.

Data from seven participants were excluded from analysis as unusable, leaving the seventeen participants analyzed here. For three participants, this was due to failure to behaviorally demonstrate learning of the sequential contingencies embedded in the task. As in our previous study [7], we excluded subjects for failure to learn when a regression model containing only nuisance regressors (the ‘constant’ model) proved a statistically superior explanation of participant RTs than any of the other models considered here, each of which includes regressors of interest specifying the estimated conditional probability of images (see Analysis, below). Statistical superiority over the constant model was measured by the Bayesian Information Criterion (BIC; [89]), which corrects likelihood scores when comparing models with different numbers of parameters. The rationale for excluding these subjects was that if they failed to learn the contingencies, it is not possible to ask the central question of the present study: how they use this learning to guide choices.
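For reference, the BIC penalizes a model's maximized log-likelihood by the number of free parameters (a standard formula, shown here as a sketch; lower scores indicate a better penalized fit):

```python
import math

def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion: k*ln(n) - 2*ln(L); lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_lik
```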

For the others, data were unusable due to operator error with the MRI unit (one participant), excessive head motion (two participants), and a failure to enter decisions on choice trials due to misunderstood instructions (one participant). Volumes during which instantaneous motion exceeded a threshold (in mm) in any direction were excluded from analysis; a participant's data were excluded for excessive head motion when a large percentage of volumes were excluded by this criterion.

##### Task.

Participants performed a serial reaction time (SRT) task in which they observed a sequence of image presentations and were instructed to respond using a pre-trained keypress assigned to that image. The experiment was controlled by a script written in Matlab (Mathworks, Natick, MA, USA), using the Psychophysics Toolbox [90]. The stimulus set consisted of four grayscale images that were matched for size, contrast, and luminance. The images were chosen because they represent categories known to preferentially engage different areas of the ventral visual stream — bodies [32], faces [33], houses [34], and household objects [35]. Each participant viewed the same four images. During behavioral training, the keys corresponded to the innermost fingers on the home keys of a standard USA-layout keyboard (D, F, J, K). Participants were instructed to learn the responses as linking a finger and an image, rather than a key and an image (e.g. left index finger, rather than ‘F’). For the MRI sessions, the same fingers were used to respond on two MR-compatible button boxes. The mappings between the four images and four responses were one-to-one, pseudorandomly generated for each participant prior to their training session, trained to criterion prior to the fMRI session, and fixed throughout the course of training and experiment sessions. Participants were informed that the key-to-image mapping was fixed, and that they were not being evaluated on the correctness of responses.

At each trial, one of the pictures was presented in the center of the screen, where it remained for three seconds, plus or minus uniformly distributed pseudorandom jitter, up to 474 ms in increments of 59 ms (the length of one slice in the MRI session). Participants were instructed to continue pressing keys until they responded correctly or ran out of time. Correct responses triggered a gray bounding box which appeared around the image for the lesser of 300 ms or the remaining trial time (Figure 1). Thus, each image presentation occurred for the programmed amount of time, regardless of participant response. The inter-trial interval consisted of 237 ms of blank screen.

The test phase of the scanning session proceeded with three blocks of 250 trials: 210 sequential response trials, 20 reward display screens (see Choice trials, below) and 20 choice trials. The first two blocks were followed by a rest period of participant-controlled length. During the rest period, participants were presented with a screen that was blank except for a fixation cross. Scan blocks after the first were initiated manually by the operator only after the participant pressed any of the relevant keys twice, to alert the operator that they were prepared to continue the task. Total experiment time — inclusive of training, practice and test periods — was approximately 1.5 hours, conducted continuously.

##### Stimulus sequence.

For training, the sequence of images was selected according to a uniform distribution. Participants were instructed to emphasize learning the mappings between image and finger, disregarding speed of response in favor of correctly identifying the on-screen image.

In the test phase, participants were instructed to respond as quickly as they could, disfavoring accuracy, as they had already been trained to criterion. The sequence of images was generated pseudorandomly according to a first-order Markov process, meaning that the probability of viewing a particular image depended solely on the identity of the previous image, with the conditional relationship specified by a 4×4 transition matrix (Figure 1). To motivate the choice trials, unlike in our previous study [7], participants were informed that conditional probability structure existed in the task. Four transition matrices were generated pseudorandomly at the start of the experiment for each subject, in a manner designed to balance two priorities: (i) equalizing the overall presentation frequencies for each image over the long and medium term (formally: fast mixing to a uniform stationary distribution), while (ii) examining response properties across a wide sample of conditional image transition probabilities. The procedure used to generate matrices satisfying these constraints is described in detail in our previous study [7].
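The actual matrix generator is described in [7]; as an illustrative sketch only, one simple family of transition matrices with a uniform stationary distribution is the doubly stochastic matrices, which can be produced by Sinkhorn-Knopp normalization of a random positive matrix (function names are ours, and this omits the paper's additional constraints on mixing speed and probability spread):

```python
import numpy as np

def random_doubly_stochastic(n=4, iters=200, seed=0):
    """Illustrative sketch: a random matrix whose rows and columns both
    sum to 1 (Sinkhorn-Knopp normalization). Any such matrix, read as a
    transition matrix, has a uniform stationary distribution. The
    experiment's actual generator (see [7]) additionally controlled the
    mixing speed and the range of conditional probabilities sampled."""
    rng = np.random.default_rng(seed)
    m = rng.random((n, n)) + 1e-6  # strictly positive entries
    for _ in range(iters):
        m /= m.sum(axis=1, keepdims=True)  # normalize rows
        m /= m.sum(axis=0, keepdims=True)  # normalize columns
    return m
```

A quick check that the uniform distribution is stationary: for u = (1/4, …, 1/4), u @ m returns u again (up to convergence tolerance).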

Transition matrices were replaced at three evenly-spaced intervals — the second matrix was used starting on trial 188, the third matrix on trial 376, and the fourth on trial 563. Participants were informed that the structure would change, but they were not informed of when or how. The experiment display offered no indication of the shift to a different transition matrix, nor were matrix changes aligned with the onset of rest periods.

Time to first keypress was recorded as our primary behavioral dependent variable. Participants were not informed that RTs were being recorded, and no information was provided as to overall accuracy or speed either during or after the experiment. Trials on which the first keypress was incorrect were discarded from behavioral analysis.

Twenty choice rounds were interspersed throughout each of the three scanning sessions, for sixty choice rounds total per participant. Each choice round consisted of three parts (Figure 2). First, the reward display screen, visible for one second, notified the participant of which image was going to be rewarded and how much each occurrence of it would be worth. The rewarded image was chosen pseudorandomly from a uniform distribution over potential images. Reward values were whole dollar values between one and five, chosen pseudorandomly from a uniform distribution. Next, after a variable inter-stimulus interval of between two and eight seconds, chosen from a truncated exponential distribution with a mean of four, the participant was given five seconds to select between two different images. The two option images were chosen pseudorandomly from a uniform distribution, with the condition that neither be identical to the reward image. Participants were instructed to choose the image that was most likely to get them to the reward over the next few trials, and thereby earn the most money. Immediately after the choice was entered, the subsequent image was picked according to the conditional distribution implied by the image that the participant selected. The next image was then displayed after the standard ITI of 237 ms. Beginning with this first image after the choice — the ‘outcome’ image — text above each ensuing image indicated either a dollar amount (between \$1 and \$5), if it was the rewarded image, or \$0 if it was not (Figure 2), for the extent of the choice round. The length of the choice round — that is, the number of images presented with dollar figures above them — was chosen from a truncated exponential distribution with a minimum of one, a maximum of eight, and a mean of four, and adjusted to ensure a total of 80 trials across all of the choice rounds in each session.
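Sampling from a truncated exponential, as used here for the inter-stimulus interval, can be done by simple rejection; this is an illustrative sketch (the rate below gives an untruncated mean of 4 s, but truncation shifts the mean, so the experiment's generator would presumably have been tuned so that the truncated mean equals 4 s):

```python
import random

def truncated_isi(lo=2.0, hi=8.0, rate=0.25, rng=random):
    """Rejection sampling from an exponential distribution, keeping
    only draws in [lo, hi]. NOTE: rate=0.25 (untruncated mean 4 s) is
    illustrative; the truncated mean differs from 1/rate."""
    while True:
        x = rng.expovariate(rate)
        if lo <= x <= hi:
            return x
```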
To allow for equilibration of any transient effects, choice rounds did not occur within the first thirty trials of each scanning session.

#### Analysis

Our analysis proceeded in several steps meant to first characterize the associative learning process, and then use this characterization to test behavioral and neural predictions about choices. Each participant's trial-by-trial RTs for correct identifications were regressed on explanatory variables including the estimated conditional probability of the picture currently being viewed given its predecessor — defined, in separate models (described below), in a number of different ways representing different accounts of learning — together with several effects of no interest. Trials on which the first keypress was not correct were excluded from behavioral analysis. Effects of no interest included stimulus-self transitions, image identity effects, and a linear effect of trial number. Stimulus-self transitions were included to account for variance due to motor response readiness for the same keypress appearing twice in a row, above and beyond the preparation implied by any effect of the variables of interest. Image identity effects were included to account for any differential response time by each finger. Trial number effects were included to account for any monotonic shift in response time over the course of the experiment. These nuisance effects were identical across all models considered; the models differed in how they specified the explanatory variable of interest, the conditional probability of each image. In our initial analysis, the conditional probabilities were specified as the ground-truth contingencies: the probabilities actually encoded in the transition matrix.
Having established that RT reflected such learning by demonstrating a significant correlation with these idealized probabilities (Figure 2), subsequent analyses used computational models to generate a timeseries of probability estimates such as would be produced by different learning rules given the same experience history as the participant (see Learning models for details). Similarly, the learning rules for conditional probability were fit (separately) to choices in the decision trials, estimated so as to maximize the likelihood that the model would have selected the same options as did the participant, given the same series of experience (see Choice models for details). The learning models involved additional free parameters controlling the learning and decision processes (e.g. learning rates), which were jointly estimated together with the regression weights by maximum likelihood. For behavioral analysis, models were fit and parameters were estimated separately for each participant. At the group level, regression weights were tested for significance using a t-test on the individual estimates across participants [91].

To generate regressors for fMRI analysis (below) we refitted the behavioral model to estimate a single set of parameters that optimized the RT and choice likelihoods aggregated over all participants (i.e. treating the behavioral parameters as fixed effects). This approach allowed us to characterize baseline learning-related activity separately from individual variation in neurally implied learning rates relative to this common baseline. For the former, in our experience [22], [25], [38], [92]–[95], enforcing common model parameters provides a simple regularization that improves the reliability of population-level neural results.
Our neural model characterizes between-subjects variation in the learning rate parameter over this baseline, because it includes (as additional random effects across participants) the partial derivatives of each of the regressors of interest with respect to the learning rate.

##### Learning models.

Based on our previous results analyzing contingency learning in an SRT task [7], we considered learning rules of the form proposed by Rescorla and Wagner [17] (see also [15]), which update entries in a 4×4 stimulus-stimulus transition matrix in light of each trial's experience. The appropriate estimate from this matrix at each step was then used as an explanatory variable for the RTs in place of the ground-truth probabilities. Formally, at each trial the transition matrix was updated according to the following rule, for each image $i$:

$$\hat{P}_{t+1}(i \mid s_{t-1}) = \hat{P}_t(i \mid s_{t-1}) + \alpha \left( \mathbb{1}[s_t = i] - \hat{P}_t(i \mid s_{t-1}) \right) \quad (1)$$

where $s_t$ is the identity of the image observed at trial $t$, $\mathbb{1}[\cdot]$ equals 1 when its argument holds and 0 otherwise, and $\alpha$ is a free learning-rate parameter. This rule preserves the normalization of the estimated conditional distribution. Our primary model of interest for reaction times — again, drawn from our previous work [7] — was a weighted combination of two Rescorla-Wagner processes, each with a different value of the learning rate parameter $\alpha$. Each process updated its matrix as above, independently, but the behaviorally expressed estimate of conditional probability was computed by combining the output of each process according to a weighted average with weight $w$ (a free parameter):

$$\hat{P}(i \mid s) = w \, \hat{P}_{\mathrm{fast}}(i \mid s) + (1 - w) \, \hat{P}_{\mathrm{slow}}(i \mid s) \quad (2)$$

As the models considered here differ in the number of free parameters, we compared their fit to the reaction time data using Bayes factors ([96]; the ratio of posterior probabilities of the models given the data) to correct for the number of free parameters fit. We approximated the log Bayes factor using the difference between scores assigned to each model via the Laplace approximation to the model evidence [97].
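The update of Eqn 1 and the mixture of Eqn 2 can be sketched as follows (a minimal illustration; class and variable names are ours, and the parameter values are placeholders rather than fitted estimates):

```python
import numpy as np

class DualProcessRW:
    """Two Rescorla-Wagner learners over a 4x4 stimulus-stimulus
    transition matrix, combined by a weighted average (Eqns 1-2).
    Rows index the previous image, columns the next image."""
    def __init__(self, alpha_fast=0.5, alpha_slow=0.05, w=0.5, n=4):
        self.alphas = (alpha_fast, alpha_slow)
        self.w = w
        # Both matrices start at the uniform conditional distribution.
        self.P = [np.full((n, n), 1.0 / n) for _ in range(2)]

    def update(self, prev, cur):
        # Move the 'prev' row of each matrix toward a one-hot vector
        # for the observed image; this preserves each row's sum of 1.
        for P, a in zip(self.P, self.alphas):
            target = np.zeros(P.shape[1])
            target[cur] = 1.0
            P[prev] += a * (target - P[prev])

    def predict(self, prev):
        # Behaviorally expressed estimate: weighted combination (Eqn 2).
        return self.w * self.P[0][prev] + (1 - self.w) * self.P[1][prev]
```

After observing a transition from image 0 to image 1, `predict(0)` shifts probability mass toward image 1 while the row still sums to one.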
The Laplace approximation was used because it provides a fairer comparison across models whose parameters contribute differently to model complexity [98]. The evidence calculations assumed a uniform prior distribution over the values of the learning rate and weight parameters. In participants for whom the Laplace approximation was not estimable for any model (due to a non-positive-definite Hessian of the likelihood function with respect to the parameters), the Bayesian Information Criterion [89] was instead used to estimate the posterior probabilities for all models. Model comparisons were computed both per individual, and on the log Bayes factors aggregated across the population.

##### Choice models.

Each of the learning rates obtained from fitting reaction times also predicts a different series of option preferences on choice trials. We compared the relative fit to choice behavior of probability estimates at each learning rate or combination of learning rates. Each choice trial involves a choice between two options for the start image, which we index below as $a$ and $b$, and a rewarded image, $r$. We took as the decision variable the difference between the probabilities that each option would lead to the rewarded image in a single step: $\hat{P}(r \mid a) - \hat{P}(r \mid b)$, where the probabilities are the conditional image transition probabilities estimated by the learning model at the current point in the task. Motivated by race and sampling models [18], the model instantiates the decision variable on a particular trial by conducting some number $N$ of draws from a binomial distribution around each learned transition probability. The mean proportion of successes on the first option is $\hat{P}(r \mid a)$, with binomial variance $\hat{P}(r \mid a)\,(1 - \hat{P}(r \mid a))/N$, and similarly for $b$.
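This sampling model, together with the Gaussian approximation to the binomials described next, might be computed as in the following sketch (names are ours; the number of samples per option is a free parameter of the model, set here to an arbitrary illustrative value):

```python
import math

def choice_loglik(p_a, p_b, chose_a, n_samples=10):
    """Log-likelihood of the observed choice: the decision variable is
    the difference of two sample proportions, each binomial around a
    learned transition probability, approximated as Gaussian."""
    mu = p_a - p_b
    var = (p_a * (1 - p_a) + p_b * (1 - p_b)) / n_samples
    var = max(var, 1e-12)  # guard against degenerate probabilities
    # P(decision variable > 0), via the Gaussian CDF.
    p_choose_a = 0.5 * (1.0 + math.erf(mu / math.sqrt(2.0 * var)))
    p = p_choose_a if chose_a else 1.0 - p_choose_a
    return math.log(max(p, 1e-300))

def choice_difficulty(p_a, p_b, n_samples=10):
    # Per-trial difficulty: the variance of the decision variable
    # (the sum of the two binomial variances of the proportions).
    return (p_a * (1 - p_a) + p_b * (1 - p_b)) / n_samples
```

When the two options are equally likely to reach the reward, the model chooses each with probability one half; as the probabilities separate, the choice likelihood sharpens.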
We estimate the choice likelihood by adopting a Gaussian approximation to the binomials, so that the resulting decision variable (the difference in sample proportions) has a mean and variance given by the difference and sum, respectively, of the means and variances of the two sample proportions. We compute the likelihood that the subject chooses $a$ or $b$ using the CDF of this Gaussian, and aggregate the log probabilities of the options actually chosen across the experiment to compute the likelihood of the choices given different probability learning models and parameters. As fMRI regressors, we also use this model to define the per-trial choice difficulty as the variance of the decision variable (the sum of the binomial variances), and the per-category choice difficulty as the binomial variance of that category's probability estimate.

#### fMRI methods

##### Acquisition.

Imaging was performed on the 3T Siemens Allegra head-only scanner at the NYU Center for Brain Imaging, using a Nova Medical (Wakefield, MA, USA) NM011 head coil. For functional imaging, 40 T2*-weighted axial slices of 3 mm thickness and 3 mm in-plane resolution were acquired using a gradient-echo EPI sequence (TR = 2.37 seconds). Three scans of 400 acquisitions each were collected, with the first four volumes (9.48 seconds) discarded to allow for T1 equilibration effects. We also obtained a T1-weighted high-resolution anatomical image (MPRAGE, 1×1×1 mm) for normalization and localizing functional activations.

##### Imaging analysis.

Preprocessing and data analysis were performed using Statistical Parametric Mapping software version 8 (SPM8; Wellcome Department of Imaging Neuroscience, London, UK).
EPI images were realigned to the first volume to compensate for participant motion, co-registered to the anatomical image, and, to facilitate group analysis, spatially normalized to atlas space using a transformation estimated by warping the subject's anatomical image to match a template (SPM8 segment and normalize). Following the default settings in SPM, to account for warping due to normalization to the template image, data images were resampled to 2 mm (rather than 3 mm) isotropic voxels in the normalized space [24]. Finally, data were smoothed using a 6-mm full-width at half maximum Gaussian filter. For statistical analysis, data were scaled to their global mean intensity and high-pass filtered with a cutoff period of 128 seconds. Volumes during which instantaneous motion exceeded a threshold in any direction were excluded from analysis.

##### Statistical analysis.

Statistical analyses of functional time-series were conducted using general linear models (GLM), and coefficient estimates from each individual were used to compute random-effects group statistics. Delta-function onsets were specified at the beginning of each stimulus presentation, and — to control for lateralization effects — nuisance onsets were specified for presentations on which right-handed responses were required. This had the effect of mean-correcting these trials separately. All further regressors were defined as parametric modulators over the initial, two-handed stimulus presentation or choice onsets. All regressors were convolved with SPM8's canonical hemodynamic response function. We used two separate GLMs for our main body of analyses: first, one analyzing sequential response and choice trials collectively, and a second breaking them down by image category. In these GLMs we specify a number of parametric regressors derived from the model, often together with these regressors' partial derivatives with respect to the learning rate parameter.
For the main analyses, all such regressors were evaluated using a (single) learning rate taken at the midpoint between the two identified in our best-fitting behavioral model, the two-learning-rate model of Eqns 1 and 2. This enables us to detect activations related to these regressors without a bias toward one learning rate or the other, and then use the partial derivatives to estimate the learning rate that best explains the signal (see Learning rate analysis). We also performed ancillary GLM analyses to illustrate activations related to regressors computed using either learning rate identified in RT behavior. For these, the parametric regressors were substituted with the equivalent ones evaluated at one of those learning rates, and the partial derivative regressors were omitted. Such analyses were carried out in separate GLMs due to correlation between regressors generated using different values of the learning rate parameter. However, it is important to note that these models were only used for generating figures to visualize the spatial extent of activity. Our formal results fitting learning rates to activity, and comparing these estimates between areas, are each conducted within a single GLM whose regressors (the main explanatory variable of interest and its partial derivative with respect to learning rate) in different weighted sums together approximately span the continuum of learning rates (see Learning rate analysis). This allows the fit of different learning rates to an area to be formally assessed in a single model, while avoiding the problems of correlation between regressors and of specifying a discrete set of candidate learning rates a priori.

In all analyses, unless otherwise stated, activations are reported for areas where we had a prior anatomical hypothesis, after correction for family-wise error (FWE) in a small volume defined by constructing an anatomical mask comprising the regions of a priori interest.
Our anatomical regions of a priori interest were: left hippocampus for slow-process associations and bilateral caudate for fast-process associations, based on our previous results [7]; right ventral stream cortical regions for visual localizer responses and anticipatory recall of category representations: fusiform gyrus, parahippocampal gyrus, and inferior occipital lobe, based on previous reports of visual category-selective patches of cortex — bodies [32], faces [33], houses [34], and household objects [35]; and nucleus accumbens, based on numerous previous reports of Reward Prediction Error (e.g. [30], [31], [38]). Anatomical regions were defined using the Automated Anatomical Labeling (AAL) atlas [99], except nucleus accumbens, which was taken from the mask produced in [38]. Masks were dilated by 4 mm in all directions to allow for inconsistencies in alignment with the population mean structural image. Unless otherwise stated, activations outside regions of prior interest are reported if they survive whole-brain correction for family-wise error. All voxel locations are reported in MNI coordinates, and results are displayed overlaid on the average of participants' normalized anatomical scans.

##### GLM1: Main effects.

The first GLM was used to analyze main effects of sequential response and choice trials. It contained the following regressors. First, to control for non-specific effects of reaction time (which, as demonstrated by our behavioral results, was correlated with our primary regressor of interest, the conditional probability), the RT on each sequential response trial was entered into the design matrix as a parametric nuisance effect. As a result, all subsequent regressors, including all regressors of interest, were orthogonalized against this variable, ensuring that it accounted for any shared variance. We next included the conditional probability of the current image, to control for effects of surprise on the current trial.
Building on our previous work [7], this regressor was not treated as a regressor of interest in our current experiment. Our primary regressor of interest on sequential response trials was the entropy of the distribution over the subsequent stimulus, given the image currently viewed:

$$H_t = -\sum_{i} \hat{P}(i \mid s_t) \, \log \hat{P}(i \mid s_t) \quad (3)$$

where $s_t$ denotes the image displayed on trial $t$, and the sum is over all four possible image identities $i$. Whereas the conditional probability measures how ‘surprising’ the current stimulus is, this quantity, which we refer to as the ‘forward entropy’, measures the ‘expected surprise’ for the next stimulus conditional on the current one, i.e. the uniformity of the conditional probability distribution. The entropy regressor was followed by the partial derivative of this forward entropy with respect to the learning rate (see Learning rate analysis). Finally, nuisance regressors, last in orthogonalization priority, were entered to model variance due to the effects of missed trials (those in which the participant did not press any keys in the allotted time), error trials, and self-transition trials (house-house, etc.).

For decision analysis, we specified onsets at the time of the presentation of the two options, and also at the first trial of the reward round, referred to as the ‘outcome’ trial. At the time options were presented, we first specified nuisance regressors: the reaction time of the choice, and the value of the rewarded image (between \$1 and \$5). Last were our primary regressors of interest: the difficulty of the choice (see Choice models), and the partial derivative of this regressor with respect to learning rate.
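Given a row of the learned transition matrix, the forward entropy of Eqn 3 reduces to the entropy of the estimated next-image distribution; a minimal sketch (the natural logarithm is assumed, since the base only rescales the regressor):

```python
import math

def forward_entropy(p_row):
    """Entropy of the estimated distribution over the next image given
    the current one; p_row is the corresponding row of the learned
    transition matrix. Zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in p_row if p > 0.0)
```

The regressor peaks at log 4 for a uniform row (maximal expected surprise) and falls to zero when one transition is certain.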

On outcome trials, we specified as a nuisance regressor the reaction time of the response. Following was our primary regressor of interest, the Reward Prediction Error (RPE): the reward received minus the expected value of the image chosen (the probability of receiving the reward image times the round's reward value), and its partial derivative with respect to learning rate.
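As defined here, the RPE regressor on each outcome trial is simply (function and argument names are ours):

```python
def reward_prediction_error(reward_received, p_reach_reward, reward_value):
    """RPE at the outcome trial: the reward received minus the expected
    value of the chosen option, i.e. the estimated probability of
    reaching the rewarded image times the round's reward value."""
    return reward_received - p_reach_reward * reward_value
```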

##### GLM2: Image-specific effects.

We used a second GLM to analyze image-specific effects in sequential response and choice trials. Critically, nuisance onsets were specified for trials on which each image category was presented. Additional nuisance onsets were specified for right handed choices and sequential responses, to control for effects of lateralization.

Onsets of interest were specified for sequential response and choice trials. For these analyses, we specified a set of four parametric regressors, one for each image type, over the sequential response and choice onsets. As we did not want our analysis to implicitly prioritize one or another variable, we disabled SPM's serial orthogonalization. On sequential response trials, our regressors of interest were the anticipated probability of each image — body, face, house, object — occurring next. We specified reaction time as a regressor of no interest, along with regressors for missed trials, errors, and self-self trials.

For choice trial onsets, we specified as the primary regressors of interest the choice difficulty for each category separately (see Choice models). Separate timeseries for the difficulty of deciding whether each image led to reward were modeled at every decision period (irrespective of whether that image was part of the decision set), and entered as parametric modulators over these onsets. Subsequent nuisance regressors were entered for the identity of the images on the screen, the identity of the rewarded image, the image categories used as options, the reward value, and the expected value of the decision. Again, these regressors were not orthogonalized against one another.

We also considered the possibility that analyses testing probability effects (Figure 7) were biased by selecting face- and house-sensitive voxels, then testing the effect of interest in those voxels in the same trials [100]. Accordingly, we measured the correlation between the selecting and testing regressors in the final design matrix. After filtering and whitening, the selecting and testing contrasts were not strongly correlated, and the mean of the measured correlation is in the opposite direction of the effect we observed (mean correlation coefficient across subjects: 0.1399±0.0238 for the face regressors, 0.0765±0.0308 for the house regressors). That is, to whatever extent there is a bias due to voxel selection, it would tend to work against the result we obtained.

##### Learning rate analysis.

In the best-fitting behavioral model, the learned transition matrix arises from two modeled learning processes, each with a free parameter for its learning rate. Thus, a naive attempt to seek fMRI activations related to either hypothesized process separate from the other would need two separate but correlated sets of our various model-derived regressors of interest, such as entropy in sequential response trials and RPE on outcome trials. An alternative specification allows us to evade the problem of mutual correlation while also reasoning statistically about the learning rate that best explains BOLD activity related to a particular variable in a particular area.

To do this, we specify each regressor of interest in our GLMs together with its partial derivative with respect to the learning rate parameter. The weighted sum of these two regressors approximates (linearly, using a first-order Taylor expansion) how the modeled signal would change under different values of the learning rate parameter. Conversely, the best-fitting learning rate can be approximated from the betas obtained for the two regressors [7], [25], [101]. Each regressor and its partial derivative were evaluated at the learning rate midway between the two behaviorally-obtained rates. The regression weight estimated for the derivative measures how far from the midpoint, and in which direction, was the learning rate that best explained BOLD. This analysis allowed us to formally investigate the possibility that learning rates expressed across regions of the brain (and multiple distinct computational variables) differed from one another, identify the pattern by which these learning rates varied, and compare them to the learning rates obtained from behavior.

Specifically, we constructed the regressors of interest as estimated by a single process learning at the rate $\alpha_0$ — which we set to the average of the two behaviorally identified rates — and included an additional regressor measuring how the regressors would change if they had been generated from the model with a different learning rate. Technically, we defined these additional regressors as the partial derivatives of the original timeseries with respect to the learning rate parameter, evaluated at $\alpha_0$ [101]. This analysis allows us to estimate the change in learning rate, relative to the reference point $\alpha_0$, that would best explain BOLD in an area, by using a regression to estimate coefficients for the first two terms in the Taylor expansion of the dependence of the regressor on the learning rate. This takes the following form:

$$F(\alpha') \approx F(\alpha_0) + (\alpha' - \alpha_0) \, \frac{\partial F}{\partial \alpha} \bigg|_{\alpha = \alpha_0} \quad (4)$$

Here $F(\alpha)$ is the regressor of interest (i.e., the RPE or entropy timeseries), viewed as a function of the learning rate $\alpha$, and $\alpha'$ is some other learning rate at which the regressor would best fit the BOLD signal. To encode learning rates in this analysis, we used a change of variables by which the original Rescorla-Wagner learning rate was transformed by an inverse sigmoid, so that it ranged over the real numbers and estimates of it could be treated with Gaussian statistics. Thus, the learning rates reported from the fMRI response to the partial derivative (which includes a derivative of the sigmoid transform, by the chain rule) are sigmoid-transformed means of the underlying transformed variable. Similarly, the illustrated confidence bounds are the sigmoid-transformed S.E.M.s of that variable.

This linear approximation to the (nonlinear) relationship between the regressor and the learning rate parameter allows the use of a GLM to approximately estimate the learning rates that would best explain BOLD correlates of the regressor. In particular, the weight estimated for the partial derivative regressor corresponds to $(\alpha' - \alpha_0)$ (or, more particularly, $k\,[\alpha' - \alpha_0]$, if the net effect of the regressor on BOLD is scaled by multiplying both sides of the approximation by some factor $k$). This is just the degree to which the best-fit (inverse-sigmoid transformed) learning rate for explaining the BOLD response differs from $\alpha_0$, the value used to calculate our regressor of interest and its derivative.
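The partial-derivative regressors entering such a GLM can be built analytically or, as a simple numerical sketch, by a central finite difference over the learning rate (names and step size are ours; the paper additionally applies the chain rule through the sigmoid transform):

```python
def derivative_regressor(make_regressor, alpha0, eps=1e-4):
    """Central-difference approximation to dF/dalpha evaluated at
    alpha0. make_regressor(alpha) should return the model-derived
    regressor timeseries (e.g. forward entropy per trial) generated
    with learning rate alpha. eps is an illustrative step size."""
    hi = make_regressor(alpha0 + eps)
    lo = make_regressor(alpha0 - eps)
    return [(h - l) / (2.0 * eps) for h, l in zip(hi, lo)]
```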

We thus computed estimates of $\alpha'$ for each regressor (entropy or probability) at a voxel by first extracting the regression weights for the partial derivative regressor for each subject. To normalize these coefficients to a common scale in units of transformed learning rate (even if they originated from different regions), we divided these weights by the average, across subjects, of the regression weights for the corresponding regressor $F(\alpha_0)$ at the voxel, this corresponding to the overall scale factor $k$ mentioned above. Lastly, we added the reference value $\alpha_0$, converting the result into the range of our behaviorally-obtained rates. Our statistical analyses were all performed on the learning rate estimates in the transformed units, taken across the population. Specifically, we test whether the computed $\alpha'$ is statistically distinguishable from learning rate values obtained by fitting behavior, via t-tests against each (transformed) fit rate. We also test whether $\alpha'$ differs between regions, by comparing the estimates in paired-sample t-tests. For our plots of BOLD learning rates, we mapped the mean estimates and their confidence intervals through the sigmoid to depict them in units of Rescorla-Wagner learning rate.
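The per-voxel computation described in this paragraph can be sketched as follows (function names are ours; the group-mean weight of the main regressor plays the role of the scale factor $k$):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inv_sigmoid(a):
    return math.log(a / (1.0 - a))

def bold_learning_rate(beta_derivative, beta_main_group_mean, alpha0):
    """Recover a subject's BOLD-implied learning rate at a voxel:
    normalize the derivative-regressor weight by the group-mean weight
    of the main regressor (the scale factor k), add the reference point
    in inverse-sigmoid units, and map back through the sigmoid."""
    theta = inv_sigmoid(alpha0) + beta_derivative / beta_main_group_mean
    return sigmoid(theta)
```

A derivative weight of zero recovers the reference rate itself; positive weights imply a rate above the reference, negative weights a rate below it.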

To maximize power, to examine learning-rate effects in areas where there was learning-related activity, and to identify areas to allow between-region comparisons, we performed these analyses of learning rates at voxels that we selected as peaks of contrasts on the main effect of the conditional probability, entropy, or prediction error regressors (not their derivatives), again using the midpoint rate $\alpha_0$. This was one motivation for choosing $\alpha_0$ to be the midpoint of the fast and slow rates – i.e., that it is roughly equally suited to detect activity related to either rate. Additionally, the linear approximation to $F(\alpha')$ is most accurate when the difference $\alpha' - \alpha_0$ is small, suggesting a choice of $\alpha_0$ that is equally close to both relevant learning rates. We selected the voxels of peak group activation within each of our a priori regions of interest. Differences between parameters in the subsequent tests were considered reliable at a fixed significance level.

Finally, note that selecting ROIs on the basis of correlation with a regressor of interest, then estimating the learning rate there, implies a bias that is innocuous with respect to our questions of interest, which generally concern which of the extreme learning rates the BOLD activity best corresponds to. It is intuitive — and can be shown [7] — that the estimated learning rate is biased toward the midpoint used for selection, and therefore away from the extremes that our hypothesis tests concern.

### Supporting Information

Figure S1.

Multiple views of the main effects. Sagittal, coronal, and axial views of each of the effects reported in the main text. Each row displays activation corresponding to one of the parametric regressors: first, the forward entropy regressor, generated using the slow process; second, the forward entropy regressor, generated using the fast process; third, the choice difficulty regressor (views of the hippocampal correlates); fourth, the choice difficulty regressor (views of the mPFC and PCC correlates); fifth, the reward prediction error regressor. All images are displayed at an uncorrected threshold.

doi:10.1371/journal.pcbi.1003387.s001

(TIFF)

Table S1.

Clusters of more than 10 contiguous voxels correlated with the forward entropy regressor computed at the slow learning rate.

doi:10.1371/journal.pcbi.1003387.s002

(TIFF)

Table S2.

Clusters of more than 10 contiguous voxels correlated with the forward entropy regressor computed at the fast learning rate.

doi:10.1371/journal.pcbi.1003387.s003

(TIFF)

Table S3.

Clusters of more than 10 contiguous voxels correlated with the choice difficulty regressor computed at the slow learning rate.

doi:10.1371/journal.pcbi.1003387.s004

(TIFF)

Table S4.

Clusters of more than 10 contiguous voxels correlated with the reward prediction error regressor computed at the slow learning rate.

doi:10.1371/journal.pcbi.1003387.s005

(TIFF)

### Acknowledgments

The authors wish to thank Thomas A. Geib for assistance with data collection, and Lila Davachi, Ernst Fehr, Paul Glimcher, Todd Gureckis, Daphna Shohamy, Arielle Tambini, Shannon Tubridy, Anthony Wagner, and Nathan Witthoft for helpful conversations.

### Author Contributions

Conceived and designed the experiments: AMB NDD. Performed the experiments: AMB. Analyzed the data: AMB. Contributed reagents/materials/analysis tools: AMB NDD. Wrote the paper: AMB NDD.

### References

1. 1. Dickinson A, Balleine BW (2002) The role of learning in the operation of motivational systems. In: Gallistel CR, Pashler HV, editors. Stevens Handbook of Experimental Psychology. Vol. 3: Learning, Motivation and Emotion. New York, NY: John Wiley & Sons Inc. pp. 497–533.
2. 2. Dickinson A (1980) Contemporary Animal Learning Theory. Cambridge: Cambridge University Press.
3. 3. Daw ND, Niv Y, Dayan P (2005) Uncertainty-based competition between prefrontal and dorsolateral striatal systems for behavioral control. Nature Neuroscience 8: 1704–1711. doi: 10.1038/nn1560
4. 4. Thorndike EL (1911) Animal Intelligence. New York: Macmillan.
5. 5. Barto AC (1995) Adaptive Critics and the Basal Ganglia. In: Houk JC, Davis JL, Beiser DG, editors. Models of information processing in the basal ganglia, Cambridge, MA: MIT Press. pp. 215–232.
6. 6. Schultz W, Montague PR, Dayan P (1997) A Neural Substrate of Prediction and Reward. Science 275: 1593–1599. doi: 10.1126/science.275.5306.1593
7. 7. Bornstein AM, Daw ND (2012) Dissociating hippocampal and striatal contributions to sequential prediction learning. European Journal of Neuroscience 35: 1011–1023. doi: 10.1111/j.1460-9568.2011.07920.x
8. 8. Bahrick H (1954) Incidental learning under two incentive conditions. Journal of Experimental Psychology 47: 170–172. doi: 10.1037/h0053619
9. 9. Strange BA, Duggins A, Penny W, Dolan RJ, Friston KJ (2005) Information theory, novelty and hippocampal responses: unpredicted or unpredictable? Neural Networks 18: 225–230. doi: 10.1016/j.neunet.2004.12.004
10. 10. Harrison LM, Duggins A, Friston KJ (2006) Encoding uncertainty in the hippocampus. Neural Networks 19: 535–546. doi: 10.1016/j.neunet.2005.11.002
11. 11. Bestmann S, Harrison L, Blankenburg F, Mars R, Haggard P, et al. (2008) Influence of contextual uncertainty and surprise on human corticospinal excitability during preparation for action. Current Biology 18: 775–80. doi: 10.1016/j.cub.2008.04.051
12. 12. Turk-Browne N, Scholl B, Johnson M, Chun M (2009) Neural evidence of statistical learning: efficient detection of visual regularities without awareness. Journal of Cognitive Neuroscience 21: 1934–45. doi: 10.1162/jocn.2009.21131
13. 13. Turk-Browne N, Scholl B, Johnson M, Chun M (2010) Implicit Perceptual Anticipation Triggered by Statistical Learning. Journal of Neuroscience 30: 11177–87. doi: 10.1523/jneurosci.0858-10.2010
14. 14. Tolman EC (1948) Cognitive Maps in Rats and Men. Psychological Review 55: 189–208. doi: 10.1037/h0061626
15. 15. Gläscher J, Daw N, Dayan P, O'Doherty JP (2010) States versus rewards: dissociable neural prediction error signals underlying model-based and model-free reinforcement learning. Neuron 66: 585–95. doi: 10.1016/j.neuron.2010.04.016
16. 16. Bush RR, Mosteller F (1953) A Stochastic Model with Applications to Learning. The Annals of Mathematical Statistics 24: 559–585. doi: 10.1214/aoms/1177728914
17. 17. Rescorla RA, Wagner AR (1972) A Theory of Pavlovian Conditioning: Variations in the Effectiveness of Reinforcement and Nonreinforcement. In: Black AH, Prokasy WF, editors. Classical Conditioning II: Current research and theory. New York: Appleton-Century-Crofts. pp. 64–99.
18. 18. Ratcliff R (1978) A Theory of Memory Retrieval. Psychological Review 85: 59–108. doi: 10.1037/0033-295x.85.2.59
19. 19. Lengyel M, Dayan P (2008) Hippocampal Contributions to Control: The Third Way. Advances in Neural Information Processing Systems 20: 889–896.
20. 20. Erev I, Glozman I, Hertwig R (2008) What impacts the impact of rare events. Journal of Risk and Uncertainty 36: 153–177. doi: 10.1007/s11166-008-9035-z
21. 21. Gold JI, Shadlen MN (2002) Banburismus and the Brain: Decoding the Relationship between Sensory Stimuli, Decisions, and Reward. Neuron 36: 299–308. doi: 10.1016/s0896-6273(02)00971-6
22. 22. Daw ND, O'Doherty JP, Dayan P, Seymour B, Dolan RJ (2006) Cortical substrates for exploratory decisions in humans. Nature 441: 876–879. doi: 10.1038/nature04766
23. 23. Schapiro AC, Kustner LV, Turk-Browne NB (2012) Shaping of object representations in the human medial temporal lobe based on temporal regularities. Current Biology 22: 1622–7. doi: 10.1016/j.cub.2012.06.056
24. 24. Josephs O, Turner R, Friston K (1997) Event-Related fMRI. Human Brain Mapping 5: 243–248. doi: 10.1002/(sici)1097-0193(1997)5:4<243::aid-hbm7>3.3.co;2-e
25. 25. Daw ND (2010) Trial-by-trial data analysis using computational models. In: Phelps E, Robbins T, Delgado M, editors. Affect, Learning and Decision Making, Attention and Performance XXIII. Oxford University Press.
26. 26. Buckner RL, Carroll DC (2006) Self-projection and the brain. Trends in Cognitive Sciences 11: 49–57. doi: 10.1016/j.tics.2006.11.004
27. 27. Buckner RL, Andrews-Hanna JR, Schacter DL (2008) The brain's default network: anatomy, function, and relevance to disease. Annals of the New York Academy of Sciences 1124: 1–38. doi: 10.1196/annals.1440.011
28. 28. Kahn I, Andrews-Hanna JR, Vincent JL, Snyder AZ, Buckner RL (2008) Distinct Cortical Anatomy Linked to Subregions of the Medial Temporal Lobe Revealed by Intrinsic Functional Connectivity. Journal of Neurophysiology 100: 129–139. doi: 10.1152/jn.00077.2008
29. 29. Delgado MR, Nystrom LE, Fissell C, Noll DC, Fiez JA (2000) Tracking the Hemodynamic Responses to Reward and Punishment in the Striatum. Journal of Neurophysiology 84: 3072–3077.
30. 30. O'Doherty JP, Dayan P, Friston K, Critchley H, Dolan RJ (2003) Temporal Difference Models and Reward-Related Learning in the Human Brain. Neuron 38: 329–337. doi: 10.1016/s0896-6273(03)00169-7
31. 31. McClure SM, Berns GS, Montague PR (2003) Temporal prediction errors in a passive learning task activate human striatum. Neuron 38: 339–46. doi: 10.1016/s0896-6273(03)00154-5
32. 32. Downing PE, Jiang Y, Shuman M, Kanwisher N (2001) A Cortical Area Selective for Visual Processing of the Human Body. Science 293: 2470–2473. doi: 10.1126/science.1063414
33. 33. Kanwisher N, Mcdermott J, Chun MM (1997) The Fusiform Face Area: A Module in Human Extrastriate Cortex Specialized for Face Perception. Journal of Neuroscience 17: 4302–4311.
34. 34. Epstein R, Kanwisher N (1998) A cortical representation of the local visual environment. Nature 392: 598–601. doi: 10.1038/33402
35. 35. Malach R, Reppas JB, Benson RR, Kwong KK, Jlang H, et al. (1995) Object-related activity revealed by functional magnetic resonance imaging in human occipital cortex. Proceedings of the National Academy of Sciences 92: 8135–8139. doi: 10.1073/pnas.92.18.8135
36. 36. Hampton AN, Bossaerts P, O'Doherty JP (2006) The role of the ventromedial prefrontal cortex in abstract state-based inference during decision making in humans. Journal of Neuroscience 26: 8360. doi: 10.1523/jneurosci.1010-06.2006
37. 37. Hampton AN, Bossaerts P, O'Doherty JP (2008) Neural correlates of mentalizing-related computations during strategic interactions in humans. Proceedings of the National Academy of Sciences 105: 6741–6746. doi: 10.1073/pnas.0711099105
38. 38. Daw ND, Gershman SJ, Seymour B, Dayan P, Dolan RJ (2011) Model-based influences on humans' choices and striatal prediction errors. Neuron 69: 1204–1215. doi: 10.1016/j.neuron.2011.02.027
39. 39. Büchel C, Wise RJ, Mummery CJ, Poline JB, Friston KJ (1996) Nonlinear regression in parametric activation studies. NeuroImage 4: 60–6. doi: 10.1006/nimg.1996.0029
40. 40. Wittmann BC, Daw ND, Seymour B, Dolan RJ (2008) Striatal activity underlies novelty-based choice in humans. Neuron 58: 967–73. doi: 10.1016/j.neuron.2008.04.027
41. 41. Wimmer GE, Daw ND, Shohamy D (2012) Generalization of value in reinforcement learning by humans. The European Journal of Neuroscience 35: 1092–104. doi: 10.1111/j.1460-9568.2012.08017.x
42. 42. Squire LR (1992) Memory and the hippocampus: a synthesis from findings with rats, monkeys, and humans. Psychological Review 99: 195–231. doi: 10.1037/0033-295x.99.2.195
43. 43. Cohen N, Eichenbaum H (1993) Amnesia, Memory and the Hippocampal System. Cambridge, MA: MIT Press.
44. 44. Rose M, Haider H, Salari N, Buchel C (2011) Functional Dissociation of Hippocampal Mechanism during Implicit Learning Based on the Domain of Associations. Journal of Neuroscience 31: 13739–13745. doi: 10.1523/jneurosci.3020-11.2011
45. 45. Johnson A, Redish AD (2007) Neural ensembles in CA3 transiently encode paths forward of the animal at a decision point. Journal of Neuroscience 27: 12176–89. doi: 10.1523/jneurosci.3761-07.2007
46. 46. Addis DR, Wong AT, Schacter DL (2007) Remembering the past and imagining the future: common and distinct neural substrates during event construction and elaboration. Neuropsychologia 45: 1363–77. doi: 10.1016/j.neuropsychologia.2006.10.016
47. 47. Daw ND, Shohamy D (2008) The Cognitive Neuroscience of Motivation and Learning. Social Cognition 26: 593–620. doi: 10.1521/soco.2008.26.5.593
48. 48. Buckner RL (2010) The role of the hippocampus in prediction and imagination. Annual Review of Psychology 61: 27–48, C1–8. doi: 10.1146/annurev.psych.60.110707.163508
49. 49. O'Keefe J, Nadel L (1978) The hippocampus as cognitive map. Cambridge: Cambridge University Press.
50. 50. Redish AD (1999) Beyond the cognitive map: From place cells to episodic memory. Cambridge, MA: MIT Press.
51. 51. Bunsey M, Eichenbaum H (1996) Conservation of hippocampal memory function in rats and humans. Nature 379: 255–257. doi: 10.1038/379255a0
52. 52. Dusek JA, Eichenbaum H (1997) The hippocampus and memory for orderly stimulus relations. Proceedings of the National Academy of Sciences 94: 7109–7114. doi: 10.1073/pnas.94.13.7109
53. 53. Shohamy D, Wagner AD (2008) Integrating memories in the human brain: Hippocampal-midbrain encoding of overlapping event. Neuron 60: 378–89. doi: 10.1016/j.neuron.2008.09.023
54. 54. Kumaran D, Summerfield JJ, Hassabis D, Maguire EA (2009) Tracking the emergence of conceptual knowledge during human decision making. Neuron 63: 889–901. doi: 10.1016/j.neuron.2009.07.030
55. 55. Kumaran D, Melo HL, Duzel E (2012) The emergence and representation of knowledge about social and nonsocial hierarchies. Neuron 76: 653–66. doi: 10.1016/j.neuron.2012.09.035
56. 56. Wimmer G, Shohamy D (2012) Preference by association: How memory mechanisms in the hippocampus bias decisions. Science 338: 270–3. doi: 10.1126/science.1223252
57. 57. Simon DA, Daw ND (2011) Environmental statistics and the trade-off between model-based and TD learning in humans. In: Shawe-Taylor J, Zemel RS, Bartlett P, Pereira F, Weinberger K, editors. Advances in Neural Information Processing Systems 24. pp. 127–135.
58. 58. Yin HH, Knowlton BJ, Balleine BW (2004) Lesions of dorsolateral striatum preserve outcome expectancy but disrupt habit formation in instrumental learning. European Journal of Neuroscience 19: 181–189. doi: 10.1111/j.1460-9568.2004.03095.x
59. 59. Yin HH, Knowlton BJ (2006) The role of the basal ganglia in habit formation. Nature Reviews Neuroscience 7: 464–476. doi: 10.1038/nrn1919
60. 60. Yin HH, Mulcare SP, Hilário MRF, Clouse E, Davis MI, et al. (2009) Dynamic reorganization of striatal circuits during the acquisition and consolidation of a skill. Nature Neuroscience 12: 333–341. doi: 10.1038/nn.2261
61. 61. Davis DGS, Staddon JER (1990) Memory for reward in probabilistic choice: Markovian and non-Markovian properties. Behaviour 114: 37–64. doi: 10.1163/156853990x00040
62. 62. Mayr U (1996) Spatial attention and implicit sequence learning: evidence for independent learning of spatial and nonspatial sequences. Journal of Experimental Psychology: Learning, Memory and Cognition 22: 350–364. doi: 10.1037/0278-7393.22.2.350
63. 63. Willingham DB (1999) Implicit motor sequence learning is not purely perceptual. Memory & Cognition 27: 561–72. doi: 10.3758/bf03211549
64. 64. Packard MG, Hirsh R, White NM (1989) Differential Effects of Fornix and Caudate Nucleus Lesions on Two Radial Maze Tasks: Evidence for Multiple Memory Systems. Journal of Neuroscience 9: 1465–1472.
65. 65. McDonald RJ, White NM (1993) A triple dissociation of memory systems: hippocampus, amygdala, and dorsal striatum. Behavioral Neuroscience 107: 3–22. doi: 10.1037//0735-7044.107.1.3
66. 66. Knowlton BJ, Mangels JA, Squire LR (1996) A neostriatal habit learning system in humans. Science 273: 1399–402. doi: 10.1126/science.273.5280.1399
67. 67. Poldrack RA, Packard MG (2003) Competition among multiple memory systems: converging evidence from animal and human brain studies. Neuropsychologia 41: 245–51. doi: 10.1016/s0028-3932(02)00157-4
68. 68. Behrens TEJ, Woolrich MW, Walton ME, Rushworth MFS (2007) Learning the value of information in an uncertain world. Nature Neuroscience 10: 1214–21. doi: 10.1038/nn1954
69. 69. Li L, Miller EK (1993) The Representation of Stimulus Familiarity in Anterior Inferior Temporal Cortex. Journal of Neurophysiology 69: 1918–1929.
70. 70. Wiggs CL, Martin A (1998) Properties and mechanisms of perceptual priming. Current Opinion in Neurobiology 8: 227–33. doi: 10.1016/s0959-4388(98)80144-x
71. 71. McClure SM, Gilzenrat MS, Cohen JD (2005) An exploration-exploitation model based on norepinephrine and dopamine activity. In: Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, pp. 867–874.
72. 72. Summerfield C, Trittschuh EH, Monti JM, Mesulam MM, Egner T (2008) Neural repetition suppression reflects fulfilled perceptual expectations. Nature Neuroscience 11: 1004–1006. doi: 10.1038/nn.2163
73. 73. Philiastides M, Biele G, Heekeren H (2010) A mechanistic account of value computation in the human brain. Proceedings of the National Academy of Sciences 107: 9430–5. doi: 10.1073/pnas.1001732107
74. 74. Burgess PW, Dumontheil I, Gilbert SJ (2007) The gateway hypothesis of rostral prefrontal cortex (area 10) function. Trends in Cognitive Sciences 11: 290–8. doi: 10.1016/j.tics.2007.05.004
75. 75. Schacter DL, Addis DR (2007) The cognitive neuroscience of constructive memory: remembering the past and imagining the future. Philosophical transactions of the Royal Society of London Series B, Biological sciences 362: 773–86. doi: 10.1098/rstb.2007.2087
76. 76. Viard A, Doeller CF, Hartley T, Bird CM, Burgess N (2011) Anterior hippocampus and goal-directed spatial decision making. Journal of Neuroscience 31: 4613–21. doi: 10.1523/jneurosci.4640-10.2011
77. 77. Guitart-Masip M, Barnes GR, Horner A, Bauer M, Dolan RJ, et al. (2013) Synchronization of medial temporal lobe and prefrontal rhythms in human decision making. The Journal of Neuroscience 33: 442–51. doi: 10.1523/jneurosci.2573-12.2013
78. 78. Houk J, Adams J, Barto A (1995) A model of how the basal ganglia generate and use neural signals that predict reinforcement. In: Houk JC, Davis JL, Beiser DG, editors. Models of information processing in the Basal Ganglia. Cambridge, MA: MIT Press. pp. 249–270.
79. 79. Frank MJ, Seeberger LC, O'Reilly RC (2004) By carrot or by stick: cognitive reinforcement learning in Parkinsonism. Science 306: 1940–1943. doi: 10.1126/science.1102941
80. 80. Keramati M, Dezfouli A, Piray P (2011) Speed/Accuracy Trade-Off between the Habitual and the Goal-Directed Processes. PLoS Computational Biology 7: e1002055. doi: 10.1371/journal.pcbi.1002055
81. 81. Wunderlich K, Dayan P, Dolan RJ (2012) Mapping value based planning and extensively trained choice in the human brain. Nature Neuroscience 15: 786–91. doi: 10.1038/nn.3068
82. 82. Botvinick M, An J (2008) Goal-directed decision making in prefrontal cortex: A computational framework. In: Koller D, Bengio Y, Schuurmans D, Bottou L, Culotta A, editors. Advances in Neural Information Processing Systems. Volume 21. pp. 169–176.
83. 83. Solway A, Botvinick MM (2012) Goal-directed decision making as probabilistic inference: A computational framework and potential neural correlates. Psychological Review 119: 120–54. doi: 10.1037/a0026435
84. 84. Rangel A, Camerer C, Montague PR (2008) A framework for studying the neurobiology of value-based decision making. Nature Reviews Neuroscience 9: 545–56. doi: 10.1038/nrn2357
85. 85. Krajbich I, Rangel A (2011) Multialternative drift-diffusion model predicts the relationship between visual fixations and choice in value-based decisions. Proceedings of the National Academy of Sciences 108: 13852–13857. doi: 10.1073/pnas.1101328108
86. 86. Stewart N, Chater N, Brown GDA (2006) Decision by sampling. Cognitive Psychology 53: 1–26. doi: 10.1016/j.cogpsych.2005.10.003
87. 87. Bornstein AM, Daw ND (2011) Multiplicity of control in the basal ganglia: computational roles of striatal subregions. Current Opinion in Neurobiology 21: 374–80. doi: 10.1016/j.conb.2011.02.009
88. 88. Peters J, Büchel C (2010) Episodic future thinking reduces reward delay discounting through an enhancement of prefrontal-mediotemporal interactions. Neuron 66: 138–48. doi: 10.1016/j.neuron.2010.03.026
89. 89. Schwarz G (1978) Estimating the Dimension of a Model. Annals of Statistics 6: 461–464. doi: 10.1214/aos/1176344136
90. 90. Brainard DH (1997) The Psychophysics Toolbox. Spatial Vision 10: 433–6. doi: 10.1163/156856897x00357
91. 91. Holmes AP, Friston KJ (1998) Generalisability, Random Effects & Population Inference. Neuroimage 7: S754.
92. 92. Simon DA, Daw ND (2011) Neural Correlates of Forward Planning in a Spatial Decision Task in Humans. Journal of Neuroscience 31: 5526–5539. doi: 10.1523/jneurosci.4647-10.2011
93. 93. Schönberg T, Daw ND, Joel D, O'Doherty JP (2007) Reinforcement learning signals in the human striatum distinguish learners from nonlearners during reward-based decision making. Journal of Neuroscience 27: 12860–7. doi: 10.1523/jneurosci.2496-07.2007
94. 94. Schönberg T, O'Doherty JP, Joel D, Inzelberg R, Segev Y, et al. (2010) Selective impairment of prediction error signaling in human dorsolateral but not ventral striatum in Parkinson's disease patients: evidence from a model-based fMRI study. NeuroImage 49: 772–81. doi: 10.1016/j.neuroimage.2009.08.011
95. 95. Gershman SJ, Pesaran B, Daw ND (2009) Human reinforcement learning subdivides structured action spaces by learning effector-specific values. Journal of Neuroscience 29: 13524–31. doi: 10.1523/jneurosci.2469-09.2009
96. 96. Kass RE, Raftery AE (1995) Bayes Factors. Journal of the American Statistical Association 90: 773–795. doi: 10.1080/01621459.1995.10476572
97. 97. Mackay DJC (2003) Information Theory, Inference, and Learning Algorithms. Cambridge, UK: Cambridge University Press. doi:10.2277/0521642981.
98. 98. Boone EL, Ye K, Smith EP (2005) Assessment of two approximation methods for computing posterior model probabilities. Computational Statistics & Data Analysis 48: 221–234. doi: 10.1016/j.csda.2004.01.005
99. 99. Tzourio-Mazoyer N, Landeau B, Papathanassiou D, Crivello F, Etard O, et al. (2002) Automated anatomical labeling of activations in SPM using a macroscopic anatomical parcellation of the MNI MRI single-subject brain. NeuroImage 15: 273–89. doi: 10.1006/nimg.2001.0978
100. 100. Kriegeskorte N, Simmons WK, Bellgowan PSF, Baker CI (2009) Circular analysis in systems neuroscience: the dangers of double dipping. Nature Neuroscience 12: 535–40. doi: 10.1038/nn.2303
101. 101. Friston KJ, Josephs O, Rees G, Turner R (1997) Nonlinear Event-Related Responses in fMRI. Magnetic Resonance in Medicine 39: 41–52. doi: 10.1002/mrm.1910390109
