The single-agent multi-armed bandit problem can be solved by an agent that uses reinforcement learning to learn the value of each action (Sutton and Barto 1998). However, the multiagent version of the problem, the iterated normal form game, presents a more complex challenge, since the rewards available to each agent depend on the strategies of the others. We consider the behaviour of value-based learning agents in this situation, and show that such agents cannot generally play at a Nash
doi:10.1137/s0363012903437976 fatcat:rmhpgsfdfjfoxejatrgatmhzgy
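
To make the single-agent case concrete, the following is a minimal sketch of a value-based learner on a multi-armed bandit. It is an illustration only, not the method of the paper: the function name `run_bandit`, the epsilon-greedy exploration rule, the Gaussian reward model, and all parameter values are assumptions chosen for the example.

```python
import random


def run_bandit(true_means, steps=5000, epsilon=0.1, seed=0):
    """Estimate action values with sample averages (hypothetical helper).

    true_means: assumed mean reward of each arm (unknown to the learner).
    epsilon:    assumed probability of picking an arm uniformly at random.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    values = [0.0] * n_arms  # current value estimate for each action
    counts = [0] * n_arms    # number of times each action was taken

    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any arm uniformly at random.
            action = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the highest current estimate.
            action = max(range(n_arms), key=lambda a: values[a])

        # Assumed reward model: Gaussian noise around the arm's true mean.
        reward = rng.gauss(true_means[action], 1.0)

        # Incremental sample-average update of the chosen action's value.
        counts[action] += 1
        values[action] += (reward - values[action]) / counts[action]

    return values, counts


if __name__ == "__main__":
    values, counts = run_bandit([0.2, 0.5, 1.5])
    print("estimated values:", [round(v, 2) for v in values])
    print("pull counts:", counts)
```

In the single-agent setting the reward distribution of each arm is fixed, so the estimates settle and the greedy choice becomes stable. In an iterated normal form game the reward for an action also depends on what the other agents play, so the quantity each learner is estimating shifts as the others adapt, which is the difficulty the abstract points to.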