Reinforcement learning, conditioning, and the brain: Successes and challenges

Tiago V. Maia
2009 Cognitive, Affective, & Behavioral Neuroscience  
The field of reinforcement learning has greatly influenced the neuroscientific study of conditioning. This article provides an introduction to reinforcement learning, followed by an examination of the successes and challenges of using reinforcement learning to understand the neural bases of conditioning. Successes reviewed include (1) the mapping of positive and negative prediction errors to the firing of dopamine neurons and of neurons in the lateral habenula, respectively; (2) the mapping of model-based and model-free reinforcement learning to the associative and sensorimotor cortico-basal ganglia-thalamo-cortical circuits, respectively; and (3) the mapping of the actor and the critic to the dorsal and ventral striatum, respectively. Challenges reviewed consist of several behavioral and neural findings that are at odds with standard reinforcement-learning models, including, among others, evidence for hyperbolic discounting and adaptive coding. The article suggests ways of reconciling reinforcement-learning models with many of the challenging findings and highlights the need for further theoretical developments where necessary.

Markov Decision Processes

The environment in reinforcement-learning problems can often be described as a Markov decision process (MDP). An MDP defines how the environment behaves in response to the agent's actions. Formally, an MDP consists not only of the aforementioned sets S and A(s), but also of two functions: a function T that defines the environment's dynamics and a function R that defines the reinforcement given to the agent. Specifically, T(s, a, s′), where s ∈ S, a ∈ A(s), and s′ ∈ S, gives the probability of transitioning to state s′ when the agent is in state s and performs action a. In other words, T determines the transition probabilities that govern the dynamics of the environment. Note that these transitions need not be deterministic: performing action a in state s may result in a transition to different states. The reinforcement that the agent receives when it is in state s, selects action a, and transitions to state s′ is given by R(s, a, s′). Sometimes such reinforcement is not deterministic even given the triplet (s, a, s′); in those cases, R(s, a, s′) is the expected value of the distribution of reinforcements when the agent is in state s, selects action a, and transitions to state s′.

This type of decision process is called a Markov process because it obeys the Markov property. In systems that obey this property, the future of the system is independent of its past, given the current state. In other words, if we know the current state, knowing additional information about previous states and reinforcements does not improve our ability to predict future states or reinforcements. More formally, let s_t, a_t, and r_t represent the state, action, and reinforcement at time t, respectively. If time starts at 0 and the current time is τ, the history of the system, H_τ, is given by H_τ = s_0, a_0, r_0, s_1, a_1, r_1, . . . , s_τ. Suppose the agent now selects action a_τ. The Markov property tells us that P(r_τ = r, s_(τ+1) = s′ | H_τ, a_τ) = P(r_τ = r, s_(τ+1) = s′ | s_τ, a_τ). In other words, knowing the current state is equivalent to knowing the entire history of the system.

MDPs play a central role in the theory of reinforcement learning precisely because the future depends only on the current state, not on the system's history. This makes the problem both easier to formalize (e.g., T and R can be
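To make the T and R functions above concrete, the following minimal Python sketch defines a toy two-state MDP; the states, actions, transition probabilities, reinforcements, and the helper function step are all invented for illustration and are not taken from the article. The step function samples the next state and reinforcement using only the current state and action, which is exactly what the Markov property licenses.

```python
import random

# Hypothetical two-state MDP used only to illustrate T and R.
# State names, actions, probabilities, and rewards are made up.
STATES = ["hungry", "sated"]
ACTIONS = {"hungry": ["press_lever", "wait"], "sated": ["wait"]}

# T[s][a] maps each possible next state s' to P(s' | s, a).
T = {
    "hungry": {
        "press_lever": {"sated": 0.8, "hungry": 0.2},  # lever press usually pays off
        "wait":        {"hungry": 1.0},
    },
    "sated": {
        "wait": {"hungry": 0.3, "sated": 0.7},
    },
}

# R[s][a][s'] is the (expected) reinforcement for the transition (s, a, s').
R = {
    "hungry": {
        "press_lever": {"sated": 1.0, "hungry": 0.0},
        "wait":        {"hungry": 0.0},
    },
    "sated": {
        "wait": {"hungry": 0.0, "sated": 0.1},
    },
}

def step(s, a):
    """Sample (s', r) given only the current state and action (Markov property)."""
    next_states = list(T[s][a].keys())
    probs = list(T[s][a].values())
    s_next = random.choices(next_states, weights=probs)[0]
    return s_next, R[s][a][s_next]

if __name__ == "__main__":
    s = "hungry"
    for t in range(5):
        a = random.choice(ACTIONS[s])  # an arbitrary (random) policy
        s_next, r = step(s, a)
        print(f"t={t}: s={s}, a={a} -> s'={s_next}, r={r}")
        s = s_next
```

Because step consults only the pair (s, a), the simulation loop needs no record of earlier states or reinforcements; that is the content of the Markov property equation above.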
doi:10.3758/cabn.9.4.343 pmid:19897789