Conceived and designed the experiments: GLC BJR. Performed the experiments: GLC BJR. Analyzed the data: GLC BJR. Contributed reagents/materials/analysis tools: GLC BJR. Wrote the paper: GLC BJR.
The authors have declared that no competing interests exist.
It is often assumed that animals and people adjust their behavior to maximize reward acquisition. In visually cued reinforcement schedules, monkeys make errors in trials that are not immediately rewarded, despite having to repeat error trials. Here we show that error rates are typically smaller in trials equally distant from reward but belonging to longer schedules (referred to as the “schedule length effect”). This violates the principles of reward maximization and invariance and cannot be predicted by the standard methods of Reinforcement Learning, such as the method of temporal differences. We develop a heuristic model that accounts for all of the properties of the behavior in the reinforcement schedule task but whose predictions do not differ from those of the standard temporal difference model in choice tasks. In the modification of temporal difference learning introduced here, the effect of schedule length emerges spontaneously from the sensitivity to the immediately preceding trial. We also introduce a policy for general Markov Decision Processes, where the decision made at each node is conditioned on the motivation to perform an instrumental action, and show that the applications of our model to the reinforcement schedule task and the choice task are special cases of this general theoretical framework. Within this framework, Reinforcement Learning can approach contextual learning with the mixture of empirical findings and principled assumptions that seem to coexist in the best descriptions of animal behavior. As examples, we discuss two phenomena observed in humans that often derive from the violation of the principle of invariance: “framing,” wherein equivalent options are treated differently depending on the context in which they are presented, and the “sunk cost” effect, the greater tendency to continue an endeavor once an investment in money, effort, or time has been made. The schedule length effect might be a manifestation of these phenomena in monkeys.
Theories of rational behavior are built on a number of principles, including the assumption that subjects adjust their behavior to maximize their long-term returns and that they should work equally hard to obtain a reward in situations where the effort to obtain reward is the same (called the invariance principle). Humans, however, are sensitive to the manner in which equivalent choices are presented, or “framed,” and often have a greater tendency to continue an endeavor once an investment in money, effort, or time has been made, a phenomenon known as the “sunk cost” effect. In a similar manner, when monkeys must perform different numbers of trials to obtain a reward, they work harder as the number of trials already performed increases, even though both the work remaining and the forthcoming reward are the same in all situations. Methods from the theory of Reinforcement Learning, which usually provide learning strategies aimed at maximizing returns, cannot model this violation of invariance. Here we generalize a prominent method of Reinforcement Learning so as to explain the violation of invariance, without losing the ability to model behaviors explained by standard Reinforcement Learning models. This generalization extends our understanding of how animals and humans learn and behave.
In studying reward-seeking behavior it is often assumed that animals attempt to maximize long-term returns. This postulate often forms the basis of normative models of decision making
The idea of maximizing reward over time or effort is general and has provided an effective basis for describing decision-making where the choice between available options is basically a matter of preference. RL methods such as the method of temporal differences (TD) constitute an efficient way of solving decision problems in tasks where a subject must choose between a larger vs. a smaller reward, or between a more probable vs. a less probable reward, and predict courses of action comparable to the actual behavior observed in animals performing the same tasks
RL methods have proven less successful, however, in situations where motivation, defined as the incentive to be engaged in a task at all, plays a strong role
(A) Color discrimination task. Each trial begins with the monkey touching a bar. A visual cue (horizontal black bar) appears immediately. Four hundred milliseconds later a red dot (WAIT signal) appears in the center of the cue. After a random interval of 500–1500 ms the dot turns green (GO signal). The monkey is required to release the touch-bar between 200 and 800 ms after the green dot appears, in which case the dot turns blue (OK signal), and a drop of water is delivered 250 to 350 ms later. If the monkey fails to release the bar within the 200–800 ms interval after the GO signal, an error is registered and no water is delivered. A premature bar release (<200 ms after the GO signal) is also counted as an error. (Red, green, and blue dots are enlarged for the purpose of illustration). (B) 2-trial schedule. Each trial is a color discrimination task as in panel
Here we show that in trials equally far from reward, monkeys make fewer errors in longer schedules, when more trials have already been performed (“schedule length effect”). Thus, the value of the current trial is also modified by the number of trials already completed. This behavior violates the principle of invariance: monkeys perform differently in trials equally far from reward, depending on the number of trials already completed in the current schedule. Taken together, these results suggest that the behavior in the reward schedule task does not develop under the principles of invariance and reward-optimization, as commonly assumed when applying RL methods to understanding reward-seeking behavior.
We present an RL rule which predicts the monkeys' behavior in the reward schedule task. This rule is a heuristic generalization of TD learning. When applied to the reward schedule task, it predicts all aspects of the monkeys' behavior, including the sensitivity to the contextual effect of schedule length, which leads to the violation of the invariance principle. When applied to a task involving choice preference, the new method predicts the same behavior as does the standard TD model. Thus, the behaviors in the reward schedule and in choice tasks can be the consequence of the same learning rule.
Building on the special cases of the reward schedule and choice tasks, we then provide a general theory for Markov Decision Processes, wherein the transition to the next state is governed in a manner similar to a choice task, but is conditioned on whether the agent is sufficiently motivated to act at all, as in the reward schedule task. Finally, we link the schedule length effect to instances of “framing”
In this work we collate the behavior of 24 monkeys tested in the reward schedule task
In the presence of visual cues informing the monkey of the progress through the schedule (Valid Cue condition), the percentage of errors in all monkeys was directly related to the number of trials remaining to be completed in the schedule, i.e., the largest number of errors occurred in the trials furthest from the reward (
(A–B) Error rates as a function of schedule state for two monkeys, for both valid (circles) and random cues (“x”). Each schedule state is labeled by the fraction
In the Random Cue condition the visual cues were selected at random and bore no relationship to schedule state. In such a condition, error rates were indistinguishable across all schedule states (or idiosyncratic; “x” in
In the penultimate trials of each schedule (i.e., 1/2, 2/3, and 3/4 when available) 20 of 24 monkeys made progressively fewer errors as the schedule became longer (sign test,
In many of these studies the cues were distinguished by their brightness, which had been set according to the number of trials remaining in the schedule (
In the reward schedule task, all trials have the same cost because they all require the same action in response to the same trigger (the appearance of the green dot); trials differ only in their proximity to reward, which in turn does not depend on how many trials have already been performed. A standard reinforcement learning method can only learn to predict the proximity to reward correctly, and thus, unlike the behavior shown by the monkeys, is insensitive to the context introduced by the schedule length. We address this issue in detail in the remainder of this manuscript.
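To make this limitation concrete, the following minimal sketch (an illustration of ours, not a reproduction of the models fitted below) trains a standard TD(0) learner on schedules of 2, 3, and 4 trials, with a unit reward delivered only after the last trial; the parameter values (gamma = 0.7, alpha = 0.1) are arbitrary. The learned values depend only on the number of trials remaining before reward, so every penultimate trial acquires the same value regardless of schedule length.

import numpy as np

def td_values_for_schedule(n_trials, gamma=0.7, alpha=0.1, episodes=5000):
    """Standard TD(0) value learning for a single schedule of n_trials trials.
    States are the trials themselves; a unit reward follows the last trial,
    and the value after the schedule ends is taken to be zero."""
    v = np.zeros(n_trials)
    for _ in range(episodes):
        for t in range(n_trials):
            r = 1.0 if t == n_trials - 1 else 0.0
            v_next = v[t + 1] if t + 1 < n_trials else 0.0
            v[t] += alpha * (r + gamma * v_next - v[t])   # TD(0) update
    return v

# Values depend only on proximity to reward: every penultimate trial
# converges to gamma * 1, whatever the schedule length.
for n in (2, 3, 4):
    print(n, np.round(td_values_for_schedule(n), 3))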
We assume that performance, here measured as the percentage of correct trials in each schedule state, reflects the monkey's motivation, which in turn reflects the value of that schedule state. Both rewarded and unrewarded trials acquire value: if unrewarded trials had no value, the monkeys would not perform them because there would be no motivation to do so. Thus, value acquisition must be based on a mechanism capable of learning to predict delayed rewards, like the method of temporal differences
The model below establishes the functional connection between performance and motivational values, and provides a recipe for learning the values. In general terms, the model assumes that the agent, on any given state
The parameter
In the particular case of the reward schedule, Equation 1 specifies completely the “policy” followed by the agent. We shall clarify in a later section that Equation 1 is a special case of the policy we propose for general Markov Decision Processes.
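Since Equation 1 is not reproduced in this text, the sketch below uses a generic logistic function as a stand-in for the sigmoidal performance function; its threshold, slope, and asymptotes are illustrative parameters, not the fitted ones. It shows how such a mapping converts a set of motivational values into predicted error rates, with a residual error rate persisting even at the highest values.

import numpy as np

def p_correct(value, threshold=0.5, slope=8.0, floor=0.0, ceiling=0.97):
    """Hypothetical sigmoidal stand-in for the performance function (Equation 1).
    Maps the motivational value of a schedule state to the probability of a
    correct trial; all parameters here are illustrative, not fitted."""
    sigmoid = 1.0 / (1.0 + np.exp(-slope * (value - threshold)))
    return floor + (ceiling - floor) * sigmoid

values = np.array([0.49, 0.7, 1.0])           # e.g. values of a 3-trial schedule
print(np.round(1.0 - p_correct(values), 3))   # predicted error rates per state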
(A) Diagrammatic representation of the basic model for 3- and 2-trial schedules. (B) General pattern of error rates predicted by the basic model. For trials with the same reward proximity (pre-reward number,
In the reward schedule, a state is defined by the pair {
In the validly-cued reward schedule, learning continues until the average
As Equation 7 shows, trials equally distant from reward will acquire the same value and thus produce the same error rate under any policy (
In the schedule length effect, the value of each trial is larger than predicted by proximity to reward alone. A simple speculation on how this might arise is that the value of each trial is enhanced by having completed any previous trial in the current schedule. This idea can be implemented by modifying the temporal difference rule as follows:
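One way the idea could be realized, shown here purely as an illustration (the exact modified rule is given in the text by an equation not reproduced here), is to augment the TD target of each trial with a fraction eta of the value of the trial just completed in the same schedule. With this assumed form, penultimate trials of longer schedules reach higher values, which, passed through the performance function, yields lower error rates there.

import numpy as np

def context_sensitive_values(n_trials, gamma=0.7, alpha=0.1, eta=0.3,
                             episodes=5000):
    """Illustrative backward-looking variant of TD(0) (an assumed form, not
    necessarily the exact rule of the paper): the learning target of each
    trial is augmented by a fraction eta of the value of the trial just
    completed in the same schedule."""
    v = np.zeros(n_trials)
    for _ in range(episodes):
        for t in range(n_trials):
            r = 1.0 if t == n_trials - 1 else 0.0
            v_next = v[t + 1] if t + 1 < n_trials else 0.0
            v_prev = v[t - 1] if t > 0 else 0.0   # value of the preceding trial
            v[t] += alpha * (r + gamma * v_next + eta * v_prev - v[t])
    return v

# Penultimate trials now acquire larger values in longer schedules.
for n in (2, 3, 4):
    print(n, np.round(context_sensitive_values(n), 3))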
The model predicts equal values for random cues that bear no relationship to schedule state. This results in uniform error rates, with a small spread around the mean due to the stochasticity in the cue selection and the learning processes. Any TD-based model would make the same prediction, which is a consequence of a symmetry contained in the design of the task, i.e., all cues are associated with reward with the same frequency. In the absence of errors, the mean value of random cues is
The same properties can also be captured quantitatively, as shown by the best fits of the context-sensitive model to the data (
(A–B) Theoretical error rates predicted by the context-sensitive model (black) for both valid (circles) and random (“x”) cues. The model parameters were tuned to match the experimental error rates of
Both the basic and context-sensitive models predict that valid and random cues have roughly the same average value, as shown in
To be a valid generalization of TD learning, the context-sensitive model must predict the same qualitative behavior in situations where animal behavior is well described by the standard model. We show here that this is generally true in situations involving behavioral choices. In particular, this means that the context-sensitive model does not predict suboptimal behavior in tasks where this is not observed.
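As an informal illustration of why this holds, consider a two-option choice in which the backward-looking term adds the same quantity to the value of both options, since both are preceded by the same decision state; under a softmax choice rule (an assumed form, which may differ in detail from the policy used below), the preference depends only on the value difference and is therefore unchanged.

import numpy as np

def softmax_preference(v_a, v_b, beta=3.0):
    """Probability of choosing option A under a softmax rule (assumed form)."""
    return 1.0 / (1.0 + np.exp(-beta * (v_a - v_b)))

gamma, eta, r_a, r_b = 0.7, 0.3, 1.0, 0.5      # illustrative parameters

# Standard TD: the value of each option equals its immediate reward.
std_a, std_b = r_a, r_b

# Backward-looking variant (assumed form): both options are augmented by the
# same amount, eta times the value of the common decision state, so the value
# difference, and hence the softmax preference, is unchanged.
v_decision = gamma * max(r_a, r_b)             # rough stand-in for the node value
ctx_a, ctx_b = r_a + eta * v_decision, r_b + eta * v_decision

print(softmax_preference(std_a, std_b))        # preference under the standard rule
print(softmax_preference(ctx_a, ctx_b))        # identical preference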
A simple choice task entails the offer of two alternative options, say
(A) Two-choice task. At decision node
The argument can be generalized to an
The reason why a positive
(A) Description of the choice-schedule task with 2-trial schedules. At decision node
These conclusions hold for any choice schedule with two schedule states (i.e., with any choice of rewards and parameters values). In this more general case, parameters can be chosen so that a preference for either schedule could emerge; however, a positive value of
A positive
Reinforcement schedules and choice tasks are examples of MDPs. Formally, an MDP is a collection of states, each with an associated cost or reward, and a set of transition probabilities that govern the transitions between those states. We shall indicate with
(A) Policy for the general MDP. In the fragment of MDP shown, the agent is in state
In the more general case of an instrumental MDP, we shall define the policy, i.e. the probability of making a transition from
Regarding the specific choice of policies we have adopted, note that
Regarding the general policy Equation 12, note that ∑
Finally, this framework can be generalized to the case of transitions in continuous time. So far, transitions (including errors) could only happen at discrete time steps indexed by trial number, an adequate simplification for the purpose of this study. Formally, this means that
In reward schedule tasks, monkeys make substantially more errors in validly cued unrewarded trials than in rewarded trials. The number of errors decreases with reward proximity. Also, the error rates are typically smaller in trials equally distant from reward, but belonging to longer schedules (schedule length effect;
The monkeys do not maximize the amount of reward over the smallest number of trials, thereby violating a principle requiring maximization of reward over time; they also violate the principle of invariance in trials equally far from reward, especially penultimate trials (
We have argued that the monkeys' behavior is a direct consequence of learning the motivational values attached to each trial by using the cues. Either randomizing the cues or damaging the rhinal cortex prevents the formation of this typical pattern of error rates
In our model, a single algorithm explains the differential behavior with valid and random cues. Assuming that the average value of the schedule states is a measure of overall motivation, the model predicts that the overall motivation is similar in the valid and random cue conditions. The difference in performance in the two paradigms is a consequence of the non-linear (sigmoidal) shape of the performance function Equation 1 (cf.
The context-sensitive model also predicts that, although the behavior appears to be the same in all terminal trials, terminal trials may acquire different values (see, e.g., Equation 9). This difference is not reflected in the behavior since the latter depends on both the values (which might be different) and the performance function (Equation 1), which tends to remove value differences in the high value region (
The context-sensitive behavior is also an emergent property of the model. The model does not change the definition of the schedule states to accommodate their contextual meaning. Valid cues come to “label” the schedule states via predictive learning. The basic model translates these labels into a pattern of motivational values and error rates which only depend on reward proximity, and thus are the same in penultimate trials. This symmetry is broken in the context-sensitive model as a consequence of generalizing the temporal difference so as to look backward as well as forward, and not through a redefinition of the schedule states.
It might seem at first that the model does not take into account the cost of performing a trial, i.e., the cost of releasing the bar at the GO signal. In fact, this cost could be interpreted as the origin of the residual, non-zero error rate given by the performance function (Equation 1) when the values are maximal (approximately, the error rate in validly-cued rewarded trials). It is also possible to implement this cost so as to affect the values of each state,
Our analysis reveals the inadequacy of standard TD learning for the reward schedule task. It can be proved in general that it is not possible to capture the schedule length effect with RL methods inspired by TD learning, including TD(
The predictions of the context-sensitive model are the same as standard TD learning in a wide class of other tasks involving choice, where the values of states at decision nodes apply equally to whatever outcome of the decision. In simple choice tasks (cf.
The reward schedule and choice tasks represent two particular cases of general MDPs where the problem of making a decision can be factorized into two sub-problems, the motivation to perform at all, and the selection of one among alternative choices given the motivation to act. We have used the strategy of dividing this general problem into two parts: we have analyzed the behavior as driven by motivational value using the reward schedule task, and the behavior as driven by choice preference using choice tasks. In both cases, we have compared the standard and the novel TD model using the same policy for both. These two components are simply multiplied in general MDPs, where by definition both the motivation to act and choice selection can occur.
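A schematic rendering of this factorization is given below, with illustrative functional forms standing in for Equations 1 and 12 of the text: the probability of reaching a given successor state is the product of the motivation to act at all in the current state (a sigmoidal function of that state's value) and the probability of selecting that successor among the alternatives (here, a softmax over the successors' values); the remaining probability mass corresponds to an error. The parameter values are illustrative only.

import numpy as np

def factorized_policy(v_state, v_successors, threshold=0.5, slope=8.0, beta=3.0):
    """Schematic factorized policy for a general MDP, with illustrative
    functional forms standing in for Equations 1 and 12 of the text.
    Returns the transition probabilities to each successor state and the
    probability of an error (failure to act)."""
    motivation = 1.0 / (1.0 + np.exp(-slope * (v_state - threshold)))   # act at all?
    choice = np.exp(beta * np.asarray(v_successors, dtype=float))
    choice /= choice.sum()                                              # which successor?
    return motivation * choice, 1.0 - motivation

transitions, p_error = factorized_policy(0.8, [1.0, 0.5])
print(np.round(transitions, 3), round(p_error, 3))   # probabilities sum to one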
Our results indicate that only in the choice selection problem does the actor-critic architecture of RL
The extension of RL to capture the fundamental role of motivation in reinforcement schedules is currently a major challenge for the field, and other authors have also considered how to include motivation in RL
A dependence on the value of the preceding state implemented in our learning rule suggests an explanation of the schedule length effect as a history effect. When environmental cues are not perfect predictors of the availability of resources, monkeys' decisions about where to forage depend on past information like the history of preceding reinforcements
Current theories of reinforcement learning posit that dopaminergic neurons code for a prediction error signal analogous to
The contextual impact of the organization of the task in schedules has been found in the event-related responses of neurons in all neural structures investigated thus far in the reward schedule task, except perhaps for neurons of the area TE
In some brain regions, neuronal responses are different in trials of different schedules that might be regarded as homologous, particularly last trials of different schedules. Dopamine neurons
Neurons of the basolateral complex of the amygdala often have differential post-cue activity in first trials
Finally, there is evidence for the role of the primate striatum in learned action selection, with some authors
In the context-sensitive model, the mechanism responsible for the schedule length effect leads to the violation of invariance. The violation of this principle was invoked by Tversky and Kahneman in their description of “framing”
The schedule length effect is also reminiscent of the so-called “sunk cost” effect
It could be argued that, in the reward schedule task, the cost of performing trials is not strictly a “sunk” (wasted) cost, as it would be if the monkeys had to start the schedule anew after each error trial. However, this would only be a minor difference from other instantiations of sunk cost effects; and it could similarly be argued that the money spent in Experiment 2 of Arkes and Blumer
Various explanations of sunk cost and framing have been proposed. Arkes and Ayton
Our model does make a clear prediction in one case where framing has been found, i.e., in the increase in preference due to training with a larger cost
We stress, however, that our learning model is not meant to be a general model of the effects that frames, or sunk costs, have on humans and animals. For example, Pompilio et al
In the heuristic modification of TD learning introduced in this work, the schedule length effect emerges spontaneously from the sensitivity to the immediately preceding trial, leading to the violation of the invariance principle. Since this principle is violated in instances of framing and sunk cost effects, we have interpreted the monkeys' behavior using the framing and sunk cost analogies, even though monkeys might not be susceptible to framing or sunk cost the way humans are. We are not aware of alternative RL models predicting the violation of the principle of invariance.
In this work we collate the behavioral data from earlier studies on monkeys (
In the paradigm with random cues, the same visual stimuli are present, but each stimulus is selected pseudo-randomly with equal probability in each trial (Random Cue condition). In such a case, there is no relationship between cues and schedule states, although the schedules are still in effect.
The monkeys were not taught the “rules” of the reward schedule task but were simply exposed to it. The behavior reported in
For each monkey, the error rates were calculated as the ratio of the total number of incorrect trials (in all sessions) to the total number of trials for each schedule state. Differences in error rates across schedule states were tested with a
A sign test
Reaction times were defined as the time elapsed between the appearance of the GO signal and the bar release, and, as reported previously, were generally shorter in trials more proximal to reward
For each monkey, the theoretical error rates (
The formula Equation 6 of the main text for the equilibrium values of the basic model is exact only in the absence of errors; otherwise the values are smaller and are given by the self-consistent recursion formula:
Equation 13 can be derived as follows: at equilibrium,
The same procedure, though more involved algebraically, gives the values in the context-sensitive model:
In the Random Cue condition, the cues define the states of the model. The model learns the values of the cues using the same algorithm specified by Equations 1, 3, and 8, with
Equation 17 defines
The context-sensitive model can be solved in a similar way, with in addition non-first trials to be taken into account. The final result is
Since it is required that
Here we show that it is not possible to obtain values dependent on schedule length (like in the context-sensitive model) by using a standard TD learning rule, which considers only future trials within the current schedule. The most general such rule can be written as
It follows from this argument that, to obtain the schedule-length effect, it is necessary either to look backwards at the values of previous trials in the same schedule (as in the context-sensitive model of the main text), or to take into account trials belonging to different schedules
So far, the value of the next state at the end of each schedule had been set to zero. In other words, the learning rule following a rewarded trial is
We thank S. Fusi, J. Hertz, and A. Lerchner for useful discussions; S. Bouret, A. Clark, W. Lerchner, T. Minamimoto, J. Simmons, and C. Quaia for a careful reading of a previous version of this manuscript and for useful discussions; and two anonymous reviewers for valuable comments and suggestions.