Individual Q-Learning in Normal Form Games

David S. Leslie, E. J. Collins
2005 SIAM Journal on Control and Optimization
The single-agent multi-armed bandit problem can be solved by an agent that learns the values of each action using reinforcement learning (Sutton and Barto 1998). However, the multiagent version of the problem, the iterated normal form game, presents a more complex challenge, since the rewards available to each agent depend on the strategies of the others. We consider the behaviour of value-based learning agents in this situation, and show that such agents cannot generally play at a Nash equilibrium, although if smooth best responses are used a Nash distribution can be reached. We introduce a particular value-based learning algorithm, individual Q-learning, and use stochastic approximation to study the asymptotic behaviour, showing that strategies will converge to a Nash distribution almost surely in 2-player zero-sum games and 2-player partnership games. Player-dependent learning rates are then considered, and it is shown that this extension converges in some games for which many algorithms, including the basic algorithm initially considered, fail to converge.
doi:10.1137/s0363012903437976 fatcat:rmhpgsfdfjfoxejatrgatmhzgy
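
The abstract does not give the update rule, so the following is only an illustrative sketch of value-based learning with smooth best responses in a 2-player zero-sum game. Everything specific in it is an assumption rather than the paper's stated method: the Boltzmann (softmax) form of the smooth best response, the update of the played action's value scaled by the inverse of its selection probability, the step-size schedule `1 / (n + 10)`, the temperature, and the names `boltzmann`, `sample`, `run`, and `MATCHING_PENNIES`. Each player sees only its own action and reward, never the opponent's action.

```python
import math
import random

# Row player's payoffs in matching pennies; the column player receives the negative.
MATCHING_PENNIES = [[1.0, -1.0], [-1.0, 1.0]]


def boltzmann(q, temperature):
    """Smooth best response (assumed form): softmax of the action values."""
    m = max(q)  # subtract the maximum for numerical stability
    weights = [math.exp((v - m) / temperature) for v in q]
    total = sum(weights)
    return [w / total for w in weights]


def sample(probs, rng):
    """Draw an action index from a probability vector."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1


def run(payoffs, steps=20000, temperature=1.0, seed=0):
    """Two independent value-based learners in a zero-sum game.

    Returns the final mixed strategies and action-value estimates.
    """
    rng = random.Random(seed)
    q = [[0.0] * len(payoffs), [0.0] * len(payoffs[0])]
    for n in range(1, steps + 1):
        rate = 1.0 / (n + 10)  # assumed decreasing step size
        pi = [boltzmann(q[0], temperature), boltzmann(q[1], temperature)]
        a = [sample(pi[0], rng), sample(pi[1], rng)]
        row_reward = payoffs[a[0]][a[1]]
        rewards = [row_reward, -row_reward]
        for player in (0, 1):
            act = a[player]
            # Update only the played action; dividing by its selection
            # probability compensates for rarely played actions being
            # updated less often (assumed normalisation).
            q[player][act] += rate * (rewards[player] - q[player][act]) / pi[player][act]
    final = [boltzmann(q[0], temperature), boltzmann(q[1], temperature)]
    return final, q


if __name__ == "__main__":
    strategies, values = run(MATCHING_PENNIES)
    print("row strategy:", strategies[0])
    print("column strategy:", strategies[1])
```

Matching pennies is used here because it is a 2-player zero-sum game, one of the classes for which the abstract states almost-sure convergence to a Nash distribution. A single finite run of this sketch is not evidence for or against that result.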