Humans are primarily model-based and not model-free learners in the two-stage task

Todd A. Hare, Carolina Feher da Silva
2019
Distinct model-free and model-based learning processes are thought to drive both typical and dysfunctional behaviors. Data from two-stage decision tasks have seemingly shown that human behavior is driven by both processes operating in parallel. However, in this study, we show that more detailed task instructions lead participants to make primarily model-based choices that show little, if any, model-free influence. We also demonstrate that behavior in the two-stage task may falsely appear to be driven by a combination of model-based/model-free learning if purely model-based agents form inaccurate models of the task because of misunderstandings. Furthermore, we found evidence that many participants do misunderstand the task in important ways. Overall, we argue that humans formulate a wide variety of learning models. Consequently, the simple dichotomy of model-free versus model-based learning is inadequate to explain behavior in the two-stage task and connections between reward learning, habit formation, and compulsivity.

Introduction

Investigating the interaction between habitual and goal-directed processes is essential to understand both normal and abnormal behavior [1, 2, 3]. Habits are thought to be learned via model-free learning [4], a strategy that operates by strengthening or weakening associations between stimuli and actions, depending on whether the action is followed by a reward or not [5]. Conversely, another strategy known as model-based learning generates goal-directed behavior [4] and may even protect against habit formation [6]. In contrast to habits, model-based behavior selects actions by computing the current values of each action based on a model of the environment.

Two-stage learning tasks (Figure 1A) have been used frequently to dissociate model-free and model-based influences on choice behavior. Past studies employing the original form of the two-stage task (Figure 1A) have always found that healthy adult human participants use a mixture of model-free and model-based learning (e.g., [7, 20, 21]). Moreover, most studies implementing modifications to the two-stage task that were designed to promote model-based over model-free learning [21, 22, 24] find a reduced, but still substantial, influence of model-free learning on behavior. Overall, the consensus has been that the influence of model-free learning on behavior is ubiquitous and robust.

Our current findings call into question just how ubiquitous model-free learning is. In an attempt to use a two-stage task to examine features of model-free learning, we found clear evidence that participants misunderstood the task [25]. For example, we observed negative effects of reward that cannot be explained by model-free or model-based learning processes. Inspired by a version of the two-stage decision task that was adapted for use in both children and adults [20, 21], we created task instructions in the form of a story that included causes and effects within a physical system for all task events (Figure 1B-D). This simple change to the task instructions eliminated the apparent evidence for model-free learning in our participants. We obtained the same results when replicating the exact features of the original two-stage task in every way apart from the instructions and how the task's events were framed.

Figure 1: Stimuli shown for each state in three separate versions of the two-stage task. Panel D shows the stimuli representing common and rare transitions in the spaceship and magic carpet versions of the task. A: The original, abstract version of the task used by Daw et al. [7].
In each trial of the two-stage task, the participant makes choices in two consecutive stages. In the first stage, the participant chooses one of two green boxes, each of which contains a Tibetan character that identifies it. Depending on the chosen box, the participant transitions with different probabilities to a second-stage state, either the pink or the blue state. One green box takes the participant to the pink state with 0.7 probability and to the blue state with 0.3 probability, while the other takes the participant to the blue state with 0.7 probability and to the pink state with 0.3 probability. At the second stage, the participant chooses again between two boxes containing identifying Tibetan characters, which may be pink or blue depending on the state they are in. The participant then receives a reward or not. Each pink or blue box has a different reward probability, which changes randomly over the course of the experiment. The reward and transition properties remain the same in the versions of the two-stage task shown in B and C; only the instructions and visual stimuli differ. B: Spaceship version, which explains the task to participants with a story about a space explorer flying on spaceships and searching for crystals on alien planets. C: Magic carpet version, which explains the task to participants with a story about a musician flying on magic carpets and playing the flute to genies, who live on mountains, inside magic lamps. D: Depiction of common and rare transitions in the magic carpet and spaceship tasks. In the magic carpet task, common transitions are represented by the magic carpet flying directly to a mountain, and rare transitions are represented by the magic carpet being blown by the wind toward the opposite mountain. In the spaceship task, common transitions are represented by the spaceship flying directly to a planet, and rare transitions are represented by the spaceship's path being blocked by an asteroid cloud, which forces the spaceship to land on the other planet. The transitions were shown during each trial in the spaceship task. In order to more closely parallel the original task, transition stimuli were only shown during the initial practice trials of the magic carpet task.

Based on these results, we wondered whether behavior labeled as partially model-free in previous studies could, in fact, be the result of participants performing model-based learning, but using incorrect models of the task to do so. In other words, could misconceptions of the task structure cause participants to falsely appear as if they were influenced by model-free learning? To test this hypothesis, we developed simulations of purely model-based agents that used incorrect models of the two-stage task to make their choices. The results demonstrated that purely model-based learners can appear to be partially model-free if their models of the task are wrong. We also re-analyzed the openly available human choice data from [21] to look for evidence of confusion or incorrect mental models in this large data set. Consistent with our own previous data [25], we found evidence that participants often misunderstood the original two-stage task's features and acted on an incorrect model of the task.¹

¹ Note that we used these data simply because they were openly available and came from a relatively large sample of people performing the two-stage task after receiving, in our view, the best instructions among previously published studies using the two-stage task. We do not expect confusion to be more prevalent in this data set than in any other past work. However, our current results indicate that stringent comprehension checks, and double checks, should be applied whenever a two-stage task is used.
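As a concrete reference for the simulations discussed here, the basic two-stage structure (Figure 1A) can be sketched in a few lines of Python. This is a minimal illustration, not the exact code used in the study: the Gaussian random-walk drift on reward probabilities (SD 0.025, reflecting bounds at 0.25 and 0.75) is an assumption based on the original task of Daw et al. [7], and all names are ours.

```python
import random

class TwoStageTask:
    """Minimal two-stage task environment: first-stage action i commonly
    leads to second-stage state i; each second-stage option has its own
    slowly drifting reward probability."""

    def __init__(self):
        self.reward_probs = [[random.uniform(0.25, 0.75) for _ in range(2)]
                             for _ in range(2)]

    def first_stage(self, action):
        # Common transition (p = 0.7) leads to the state matching the action;
        # the rare transition (p = 0.3) leads to the other state.
        common = random.random() < 0.7
        state = action if common else 1 - action
        return state, common

    def second_stage(self, state, action):
        reward = int(random.random() < self.reward_probs[state][action])
        # Drift all reward probabilities with a reflecting Gaussian random
        # walk (SD 0.025, bounds 0.25-0.75) -- an assumed parameterization
        # following Daw et al. [7].
        for s in range(2):
            for a in range(2):
                p = self.reward_probs[s][a] + random.gauss(0, 0.025)
                if p > 0.75:
                    p = 1.5 - p   # reflect at the upper bound
                elif p < 0.25:
                    p = 0.5 - p   # reflect at the lower bound
                self.reward_probs[s][a] = p
        return reward

# Example trial:
#   task = TwoStageTask()
#   state, common = task.first_stage(0)
#   reward = task.second_stage(state, 1)
```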
Our overall findings show that truly hybrid model-free/model-based learners cannot be reliably distinguished from purely model-based learners that use the wrong model of the task. Critically, they also indicate that when the correct model of the world is easy to conceptualize and readily understood, human behavior is driven primarily by model-based learning.

Results

Model-based learning can be confused with model-free learning

Model-based agents can be confused with model-free agents. The prevalence of such misidentification in the existing literature is currently unknown, but we present a series of results indicating that it is high. False conclusions about model-free learning can arise when the model used by model-based agents breaks the assumptions that underlie the data analysis methods. We will show that purely model-based agents are misidentified as hybrid model-free/model-based agents in the two-stage task when the data are analyzed by either of the standard methods: logistic regression or reinforcement learning model fitting. We present two examples of incorrect task models that participants could potentially form. We do not suggest that these are the only or even the most probable ways that people may misunderstand the task; these incorrect task models are merely examples to demonstrate our point.

To serve as a reference, we simulated purely model-based agents that use the correct model of the task, based on the hybrid reinforcement learning model proposed by Daw et al. [7]. The hybrid reinforcement learning model combines the model-free SARSA(λ) algorithm with model-based learning and explains first-stage choices as a combination of both the model-free and model-based state-dependent action values, weighted by a model-based weight w (0 ≤ w ≤ 1). A model-based weight equal to 1 indicates a purely model-based strategy and, conversely, a model-based weight equal to 0 indicates a purely model-free strategy. The results discussed below were obtained by simulating purely model-based agents (w = 1). We also note that, consistent with recent work by Shahar et al. [26], we found that even when agents had a w equal to exactly 1, used the correct model of the task structure, and performed 1,000 simulated trials, the w parameters recovered by the hybrid reinforcement learning model were not always precisely 1 (see Figure 2F). This is expected, because parameter recovery is noisy and w cannot be greater than 1; thus, any error will be an underestimate of w.

The two alternative, purely model-based learning algorithms we created for simulated agents to use are the "unlucky symbol" algorithm and the "transition-dependent learning rates" (TDLR) algorithm. The full details of these algorithms and our simulations can be found in the Methods section. Briefly, the unlucky symbol algorithm adds to the purely model-based algorithm the mistaken belief that certain first-stage symbols decrease the reward probability of second-stage choices. We reasoned that participants may come to believe that a certain symbol is lucky or unlucky after experiencing, by chance, a winning or losing streak after repeatedly choosing that symbol. Thus, when they plan their choices, they will take into account not only the transition probabilities associated with each symbol but also how they believe the symbol affects the reward probabilities of second-stage choices. In the current example, we simulated agents that believe a certain first-stage symbol is unlucky and thus lowers the values of second-stage actions by 50%.
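To make the idea concrete, the unlucky symbol agent's planning step can be sketched as follows. The 0.7/0.3 transition probabilities and the 50% devaluation (eta = 0.5) come from the task and text above; the function and variable names are ours, and this is an illustrative sketch rather than the exact implementation used in our simulations.

```python
import numpy as np

def first_stage_values(q_second, unlucky_action, eta=0.5):
    """Model-based first-stage values under the 'unlucky symbol' belief.

    q_second: 2x2 array of learned second-stage action values (state x action).
    unlucky_action: index of the first-stage symbol believed to be unlucky.
    eta: believed devaluation factor (0.5 = reward odds halved, as in the text).
    """
    best = q_second.max(axis=1)        # value of the best option in each state
    trans = np.array([[0.7, 0.3],      # P(second-stage state | first-stage action)
                      [0.3, 0.7]])
    q_first = trans @ best             # correct model-based planning
    q_first[unlucky_action] *= eta     # mistaken belief: this symbol is unlucky
    return q_first
```

The agent then chooses via a softmax over these first-stage values; the only departure from the correct model-based algorithm is the final devaluation line.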
The TDLR algorithm is based on debriefing comments from participants in a pilot study, which suggested that they assign greater importance to outcomes observed after common (i.e., expected) relative to rare transitions. For example, one participant wrote in the debriefing questionnaire that they "imagined a redirection [to the rare second-stage state] with a successful [outcome] as a trap and did not go there again." To formalize this idea, we conceived a simple model-based learning algorithm with a higher learning rate after common transitions and a lower learning rate after rare transitions. Hence, the learning rates are transition-dependent. Note that although we created this model based on participants' feedback, we are not suggesting that it is a model that many or even one participant actually used. We suspect that participants each use a slightly different model and probably even change their mental models of the task over time. In reality, we do not know if and how a given participant misunderstands the task. The TDLR and unlucky symbol algorithms are simply two plausible "proof of principle" examples to demonstrate our point. We simulated 1,000 agents of each of the three purely model-based types (correct, unlucky symbol, and TDLR) performing a 1,000-trial two-stage task. Again, the purpose of these simulations is not to show that real human participants may be employing these specific models when performing a two-stage task. Rather, the proposed algorithms are examples intended to illustrate that when agents do not employ the assumed task model, they may generate patterns of behavior that are mistaken for a model-free influence.

The resulting data were first analyzed by logistic regression of consecutive trial pairs (Figure 2). In this analysis, the stay probability (i.e., the probability of repeating the same first-stage action) is modeled as a function of two variables: reward, indicating whether the previous trial was rewarded, and transition, indicating whether the previous trial's transition was common or rare. Model-free learning generates a main effect of reward (Figure 2A), while model-based learning generates a reward-by-transition interaction (Figure 2B) [7] (although this may not be true for all modifications of the two-stage task [27]). The core finding in most studies that have employed this task is that healthy adult participants behave like hybrid model-free/model-based agents (Figure 2C). Specifically, the logistic regression results exhibit both a main effect of reward and a reward-by-transition interaction. Our simulations show that TDLR and unlucky symbol agents, despite being purely model-based, display the same behavioral pattern as healthy adult participants and simulated hybrid agents (Figure 2D and E).
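A minimal sketch of this trial-pair analysis is shown below, assuming a statsmodels-style logistic regression and a hypothetical `trials` data frame with one row per consecutive trial pair; any GLM routine would serve equally well.

```python
import pandas as pd
import statsmodels.formula.api as smf

def stay_regression(trials: pd.DataFrame):
    """Fit stay ~ reward * transition on consecutive trial pairs.

    Expected (hypothetical) columns, coded 0/1:
      stay       -- current first-stage choice repeats the previous one
      reward     -- previous trial was rewarded
      transition -- previous trial's transition was common
    """
    # A reward main effect indicates model-free learning; a reward-by-
    # transition interaction indicates model-based learning [7].
    return smf.logit("stay ~ reward * transition", data=trials).fit()

# result = stay_regression(trials)
# print(result.summary())
```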
We then analyzed the simulated choice data by fitting them with a hybrid model-based/model-free learning algorithm based on the correct model of the task (i.e., the standard analysis procedure). The resulting distributions of the estimated model-based weights are shown in Figure 2. The median model-based weight estimated for agents using the correct task model was 0.94, and 95% of the agents had an estimated w between 0.74 and 1.00. The estimated w values for agents using the other algorithms were, however, significantly lower. Model-based agents using the unlucky symbol algorithm had a median w = 0.36, and 95% of the agents had an estimated w between 0.24 and 0.48. Agents using the TDLR algorithm had a median w = 0.80, and 95% of the agents had an estimated weight between 0.70 and 0.90. These results demonstrate that analyzing two-stage task choices with a hybrid reinforcement learning algorithm can misclassify purely model-based agents as hybrid agents if the agents do not fully understand the task and create an incorrect mental model of how it works.

Human behavior deviates from the hybrid model's assumptions

To test whether human behavior following typical task instructions violates assumptions inherent in the hybrid model, we re-analyzed the control condition data from a study of 206 participants performing the original two-stage task after receiving a common form of two-stage task instructions that were, in our view, as good as or better than those of all other previous studies [21]. Henceforth, we refer to this as the common instructions data set. First, we note that poor overall hybrid model fits and greater decision noise/exploration were significantly associated with more apparent evidence of model-free behavior. When examining how the overall model fit relates to indications of model-free behavior, we found that maximum likelihood estimates of the model-based weight for each participant correlated with the log-likelihood of the hybrid model's fit. Specifically, we observed a weak but significantly positive correlation between the model-based weight and the log-likelihood of the model fit (Spearman's rho = 0.19, P = 0.005). Similarly, we found that the softmax inverse temperature parameters in both the first-stage (Spearman's rho = 0.24, P = 0.0006) and second-stage (Spearman's rho = 0.19, P = 0.007) choice functions also correlated with model-based weights. These correlations indicate that more exploratory (i.e., noisier) choices were associated with less apparent model-based influence. […]

Discussion

Model-free reinforcement learning can readily explain the results of many animal electrophysiology and human neuroimaging studies. In particular, a highly influential finding in neuroscience is that dopamine neurons respond to rewards in a manner consistent with signalling the reward prediction errors that are fundamental to temporal difference learning algorithms, a form of model-free reward learning [39, 40, 41]. However, in the studies showing this response, there was no higher-order structure to the task; often, there was no instrumental task at all. Thus, it was not possible to use more complex or model-based learning algorithms in those cases. Conversely, when the task is more complicated, it has been found that the dopamine prediction error signal reflects knowledge of the task structure and is therefore not a model-free signal (for example, [42, 43]).
A recent optogenetic study showed that dopamine signals are both necessary and sufficient for model-based learning in rats [44], and consistent with this finding, neuroimaging studies in humans found that BOLD signals in striatal and prefrontal regions that receive strong dopaminergic innervation correlate with model-based prediction error signals [7, 45]. Moreover, although there is evidence that anatomically distinct striatal systems mediate goal-directed and habitual actions [46], to date there is no evidence for anatomically separate representations of model-free and model-based learning algorithms.

Model-free learning algorithms are generally assumed to be the computational analogs of habits, but they are not necessarily the same thing [1]. Initial theoretical work proposed the model-based versus model-free distinction to formalize the dual-system distinction between goal-directed and habitual control [4]. However, model-free learning has never been empirically shown to equate with, or even exclusively lead to, habitual behavior. Indeed, it is generally assumed that goal-directed actions can be based on model-free learning too. Consider a participant who is purely goal-directed but does not understand how the state transitions work in a two-stage task. This participant may resort to employing a simple win-stay, lose-shift strategy, which is model-free, but their behavior will not be habitual.

Apparently model-free participants also behave in ways inconsistent with the habitual tendencies that model-free learning is supposed to index. A study by Konovalov and Krajbich [47] combined eye-tracking with two-stage task choices to examine fixation patterns as a function of estimated learning type. In addition to those authors' primary conclusions, we think this work highlights discrepancies between seemingly model-free behavior and what one would expect from a habitual agent. Their analysis strategy divided participants into model-free and model-based learners based on a median (w = 0.3) split of the model-based weight parameter estimated from the hybrid reward learning algorithm. They reported that when the first-stage symbols were presented, model-based learners tended to look at most once at each symbol, as if they had decided prior to trial onset which symbol they were going to choose. In contrast, learners classified as model-free tended to make more fixations back and forth between first-stage symbols, and their choices were more closely related to fixation duration than those of the model-based group. We interpret this pattern of fixations and choices as suggesting that model-free participants made goal-directed comparisons when the first-stage symbols were presented, rather than habitual responses. This is because similar patterns of back-and-forth head movements, presumably analogous to fixations, are seen when rats are initially learning to navigate a maze [48]. Furthermore, the rats' head movements are accompanied by hippocampal representations of reward locations in the direction the animal is facing. Such behavior is seen as evidence that the animals are deliberating over choices in a goal-directed fashion. Humans also make more fixations per trial as trial difficulty increases in goal-directed choice paradigms [49].
Notably, these patterns of head movements and hippocampal place cell signaling cease once animals have had extensive training on the maze and act in an automated or habitual fashion at each decision point [48]. Thus, supposedly model-free human participants' fixation patterns during the two-stage task suggest that they are acting in a goal-directed rather than a habit-like fashion.

In contrast to habits, model-free behavior decreases with extensive training on the two-stage task. In general, the frequency and strength of habitual actions increase with experience in a given task or environment. However, Economides et al. showed that the estimated amount of model-free influence in human participants decreases over three days of training on the two-stage task [29]. They also found that, after two days of training, human behavior remains primarily model-based in the face of interference from a second task (the Stroop task) performed in parallel. Both of these results raise questions about the relative effortfulness of model-based versus model-free learning in the two-stage task. After all, although the transition model behind the two-stage task is apparently hard to explain, it is rather easy to follow once it is understood. Rats also show primarily model-based behavior after receiving extensive training on the two-stage task [23]. In fact, the rats showed little or no influence of model-free learning. Moreover, Miller et al. [23] also showed that inactivation of the dorsal hippocampus or orbitofrontal cortex impairs model-based planning but does not increase the influence of model-free learning. Instead, any influence of model-free learning in the rats remained negligible. As a whole, these results are difficult to reconcile with the idea of an ongoing competition or arbitration between model-based and model-free control over behavior.

Humans have been reported to arbitrate between model-based and model-free strategies based on both their relative accuracy and effort. We know from several lines of evidence that both humans and other animals are sensitive to, and generally seek to minimize, both physical and mental effort when possible [50]. Model-based learning is often thought to require more effort than model-free learning. A well-known aspect of the original two-stage task [7] is that model-based learning does not lead to greater accuracy or monetary payoffs compared to model-free learning [31, 21]. Thus, it has been hypothesized that an aversion to mental effort coupled with a lack of monetary benefit from model-based learning may lead participants to use a partially model-free strategy on the original two-stage task [31, 21]. Previous studies have tested this hypothesis by modifying the original two-stage task so that model-based learning strategies do achieve significantly greater accuracy and more rewards [21, 22, 24]. They found that participants displayed more model-based behavior when it paid off to use a model-based strategy.
The conclusion in those studies was that participants will employ model-based learning if it is advantageous in a cost-benefit analysis between effort and money.

Our results and those from studies with extensive training [51, 23] cannot be explained by cost-benefit or accuracy trade-offs between model-free and model-based learning. The magic carpet and spaceship tasks led to almost completely model-based behavior, but had the same equivalency in terms of profit for model-based and model-free learning as the original two-stage task [7]. The objective effort in the magic carpet task was also equivalent to that of the original two-stage task, although an interesting possibility that merits further study is that giving concrete causes for rare transitions also reduced the subjective effort of forming or using the correct mental model of the task. Similarly, the profitability of model-based learning does not change with experience. If anything, more experience with the task should allow the agent to learn that the model-based strategy is no better than the model-free strategy, if both are being computed and evaluated in parallel over a long period of time. Therefore, these two sets of results cannot be explained by an increase in model-based accuracy and profits compared to model-free learning.

Seemingly model-free behavior may instead be reduced in all three sets of experiments through better understanding of the task. Clearly, improved instructions and more experience can give participants a better understanding of the correct task model. Most, if not all, of the modified two-stage tasks also have the potential to facilitate understanding while making model-based learning more profitable. This is because, in addition to generating higher profits, the differential payoffs provide clearer feedback to participants about the correctness of their mental models. If both correct and incorrect models lead to the same average payoffs, participants may be slow to realize that their models are incorrect. Conversely, if the correct model provides a clear payoff advantage over other models, participants will be guided to change their mental models quickly through feedback from the task. Of course, increased understanding and changes in the cost-benefit ratio may jointly drive the increases in (correct) model-based behavior in modified two-stage tasks. Additional data are needed to carefully tease apart these two potential reasons for increased model-based behavior in many newer versions of the two-stage task.

Two-stage tasks have also been used to test for links between compulsive behavior and model-free learning in healthy and clinical populations. Compulsive symptoms have been found to correlate with apparent model-free behavior in the two-stage task [6, 14, 52]. Given our current results, however, the conclusion that model-free learning and compulsive behaviors are linked should be drawn with caution. We have shown that it is not clear what exactly is being measured by the two-stage task in healthy young adult humans. The same questions should be extended to other populations, including those with obsessive-compulsive disorder (OCD). In the case of OCD, is the two-stage task picking up alterations in how distinct habitual (indexed by model-free learning) and goal-directed (model-based learning) systems interact to control behavior?
Or are differences in two-stage choices driven by the ability to understand a task, create and maintain accurate mental models of it, and use these models to make decisions? It is certainly possible that OCD patients and other individuals with sub-clinical compulsive symptoms do indeed use more model-free learning during the two-stage task. However, a plausible alternative explanation for the correlations with model-free indexes in the two-stage task is […]

Methods

The TDLR algorithm

This is a simple model-based learning algorithm that has a higher learning rate after a common transition and a lower learning rate after a rare transition; hence, the learning rates are transition-dependent. The model-based TDLR algorithm has three parameters: α_c, the higher learning rate for outcomes observed after common transitions (0 ≤ α_c ≤ 1); α_r, the lower learning rate for outcomes observed after rare transitions (0 ≤ α_r < α_c); and β > 0, an inverse temperature parameter that determines the exploration-exploitation trade-off. In each trial t, based on the trial's observed outcome (r_t = 1 if the trial was rewarded, r_t = 0 otherwise), the algorithm updates the estimated value V(s_t, a_t) of the chosen second-stage action a_t in second-stage state s_t:

V(s_t, a_t) ← V(s_t, a_t) + α_t (r_t − V(s_t, a_t)),

where α_t = α_c if the transition in trial t was common and α_t = α_r if it was rare. […] We simulated agents of three types: (1) the purely model-based hybrid algorithm (w = 1) with α_1 = α_2 = 0.5, (2) the unlucky symbol algorithm with α = 0.5 and η = 0.5, and (3) the TDLR algorithm with α_c = 0.8 and α_r = 0.2. For all agents, the β parameters had a value of 5.

Analysis of the common instructions data

In [21], 206 participants recruited via Amazon Mechanical Turk performed the two-stage task for 125 trials. See [21] for further details. The behavioral data were downloaded from the first author's GitHub repository (https://github.com/wkool/tradeoffs) and re-analyzed by logistic regression and reinforcement learning model fitting, as described below.

Logistic regression of consecutive trials

This analysis was applied to all of the analyzed behavioral data sets. Consecutive trial pairs were divided into subsets depending on the presentation of first-stage stimuli. The results for each subset were then […]

Reinforcement learning model fitting

[…] values were updated correctly. Data from each participant i were described by a vector (logit(α_1^i), logit(α_2^i), logit(λ^i), log(β_1^i), …). The computed log-likelihoods obtained for each participant and model at each iteration were used to calculate the PSIS-LOO score (an approximation of leave-one-out cross-validation) of each model for each participant. To this end, the loo and compare functions of the loo R package were employed [60].

Magic carpet task description

[…] up when the carpet arrived on a mountain. During this period, a black screen was displayed. Thus, participants had to figure out the meaning of each symbol on the carpets for themselves.
The screens and the time intervals were designed to match the original abstract task [7], except for the black "nap" screen displayed during the transition, which added one extra second to every trial.

Spaceship task description

We also designed a second task, which we call the spaceship task, that differed from the original task reported in Daw et al. [7] in terms of how the first-stage options were represented. Specifically, there were four configurations of the first-stage screen rather than two. […]