Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems
Foundations and Trends® in Machine Learning
Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration-exploitation trade-off. This is the balance between staying with the option that gave the highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration-exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing.
... Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this monograph, we focus on two extreme cases in which the analysis of regret is particularly simple and elegant: i.i.d. payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model.
Note that the randomization of the adversary is not very important here, since we ask for bounds that hold for any opponent. On the other hand, it is fundamental to allow randomization for the forecaster (see Section 3 for details and basic results in the adversarial bandit model). This adversarial, or nonstochastic, version of the bandit problem was originally proposed as a way of playing an unknown game against an opponent. The problem of playing a game repeatedly, now a classical topic in game theory, was initiated by the groundbreaking work of James Hannan and David Blackwell. In Hannan's seminal paper, the game (i.e., the payoff matrix) is assumed to be known by the player, who also observes the opponent's moves in each play. Later, Baños considered the problem of a repeated unknown game, where in each game round the player only observes its own payoff. This problem turns out to be exactly equivalent to the adversarial bandit problem with a nonoblivious adversary. Simpler strategies for playing unknown games were more recently proposed by Foster and Vohra and in [94]. At approximately the same time, the problem was rediscovered in computer science by Auer et al. It was they who made apparent the connection to stochastic bandits by coining the term nonstochastic multi-armed bandit problem.
The third fundamental model of multi-armed bandits assumes that the reward processes are neither i.i.d. (as in stochastic bandits) nor adversarial. More precisely, arms are associated with $K$ Markov processes, each with its own state space.
Each time an arm $i$ is chosen in state $s$, a stochastic reward is drawn from a probability distribution $\nu_{i,s}$, and the state of the reward process for arm $i$ changes in a Markovian fashion, based on an underlying stochastic transition matrix $M_i$. Both the reward and the new state are revealed to the player. On the other hand, the states of arms that are not chosen remain unchanged. Going back to our initial interpretation of bandits as sequential resource allocation processes, here we may think of $K$ competing projects that are sequentially allocated a unit resource of work. However, unlike the previous bandit models, in this case the state of a project that gets the resource may change. Moreover, the underlying stochastic transition matrices $M_i$ are typically assumed to be known, so the optimal policy can be computed via dynamic programming and the problem is essentially of a computational nature. The seminal result of Gittins provides an optimal greedy policy which can be computed efficiently. A notable special case of Markovian bandits is that of Bayesian bandits. These are parametric stochastic bandits, where the parameters of the reward distributions are assumed to be drawn from known priors, and the regret is computed by also averaging over the draw of parameters from the prior. The Markovian state change associated with the selection of an arm corresponds here to updating the posterior distribution of rewards for that arm after observing a new reward. Markovian bandits are a standard model in the areas of Operations Research and Economics. However, the techniques used in their analysis are significantly different from those used to analyze stochastic and adversarial bandits. For this reason, in this monograph we do not cover Markovian bandits and their many variants.
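To make the Markovian reading of Bayesian bandits concrete, the following minimal sketch treats each arm's posterior as its state. It assumes Bernoulli rewards with a uniform Beta(1, 1) prior, which the text above does not specify. The names `BetaState`, `predictive_mean`, `step`, and `pull` are hypothetical, and the round-robin policy is only a placeholder, not the Gittins policy. Averaged over the prior, the reward in state (alpha, beta) is Bernoulli with mean alpha / (alpha + beta), so the transition law is known, and only the chosen arm's state changes.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaState:
    """State of one arm: a Beta(alpha, beta) posterior over its Bernoulli mean."""
    alpha: int = 1  # uniform Beta(1, 1) prior by default (an assumption)
    beta: int = 1

    def predictive_mean(self) -> float:
        # Probability of reward 1 in this state, averaged over the posterior.
        return self.alpha / (self.alpha + self.beta)

    def step(self, reward: int) -> "BetaState":
        # Posterior update, i.e. the Markovian transition to a successor state.
        return BetaState(self.alpha + reward, self.beta + 1 - reward)


def pull(states, arm, rng):
    """Play `arm`: draw a reward from its current state and move only that arm."""
    reward = 1 if rng.random() < states[arm].predictive_mean() else 0
    new_states = list(states)
    new_states[arm] = states[arm].step(reward)  # other arms stay unchanged
    return reward, new_states


rng = random.Random(0)
states = [BetaState() for _ in range(3)]  # K = 3 arms, all starting at the prior
for t in range(9):
    arm = t % 3  # placeholder round-robin policy, not the Gittins index policy
    reward, states = pull(states, arm, rng)
print(states)
```

Each arm's state here is a pair of counts, so the state space is countable and the two possible successor states and their probabilities are known in advance. This is what makes a dynamic programming treatment possible in principle.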