Totally model-free reinforcement learning by actor-critic Elman networks in non-Markovian domains

E. Mizutani, S.E. Dreyfus
1998 IEEE International Joint Conference on Neural Networks Proceedings. IEEE World Congress on Computational Intelligence (Cat. No.98CH36227)  
In this paper we describe how an actor-critic reinforcement learning agent in a non-Markovian domain finds an optimal sequence of actions in a totally model-free fashion; that is, the agent neither learns transitional probabilities and associated rewards, nor by how much the state space should be augmented so that the Markov property holds. In particular, we employ an Elman-type recurrent neural network to solve non-Markovian problems, since an Elman-type network is able to implicitly and automatically render the process Markovian. A standard "actor-critic" neural network model has two separate components: the action (actor) network and the value (critic) network. In animal brains, however, those two presumably may not be distinct, but rather somehow entwined. We thus construct one Elman network with two output nodes, an actor node and a critic node, and a portion of the shared hidden layer is fed back as the context layer, which functions as a history memory to produce sensitivity to non-Markovian dependencies. The agent explores small-scale three- and four-stage triangular path networks to learn an optimal sequence of actions that maximizes the total value (or reward) associated with its transitions from vertex to vertex. The posed problem has a deterministic transition and reward associated with each allowable action (although either could be stochastic) and is rendered non-Markovian by the reward being dependent on an earlier transition. Due to the nature of neural model-free learning, the agent needs many iterations to find the optimal actions even in small-scale path problems.
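The shared architecture described above can be sketched roughly as follows. This is a minimal illustrative forward pass only, assuming tanh hidden units, a softmax-free linear actor head, and a scalar linear critic head; the layer sizes and activation choices are this sketch's assumptions, not the authors' exact configuration, and no training rule is shown.

```python
import numpy as np

class ActorCriticElman:
    """Illustrative Elman network with a shared hidden layer feeding
    both an actor head and a critic head; the hidden activations are
    copied back as the context (history-memory) input each step."""

    def __init__(self, n_in, n_hidden, n_actions, seed=0):
        rng = np.random.default_rng(seed)
        # Hidden layer sees the external input concatenated with the context.
        self.W_in = rng.normal(0.0, 0.1, (n_hidden, n_in + n_hidden))
        self.b_h = np.zeros(n_hidden)
        self.W_actor = rng.normal(0.0, 0.1, (n_actions, n_hidden))  # actor head
        self.W_critic = rng.normal(0.0, 0.1, (1, n_hidden))         # critic head
        self.context = np.zeros(n_hidden)  # history memory, initially empty

    def step(self, x):
        # Concatenate the current input with the previous hidden state.
        z = np.concatenate([x, self.context])
        h = np.tanh(self.W_in @ z + self.b_h)
        self.context = h.copy()                  # feed hidden layer back
        action_prefs = self.W_actor @ h          # actor output (action preferences)
        value = float((self.W_critic @ h)[0])    # critic output (state value)
        return action_prefs, value

# One step on a hypothetical 3-dimensional state encoding with 2 actions.
net = ActorCriticElman(n_in=3, n_hidden=5, n_actions=2)
prefs, v = net.step(np.array([1.0, 0.0, 0.0]))
```

Because the context layer carries a trace of earlier hidden states across steps, outputs can depend on transitions made earlier in the path, which is what gives the single shared network its sensitivity to the non-Markovian reward.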
doi:10.1109/ijcnn.1998.687169