Nonstochastic bandits: Countable decision set, unbounded costs and reactive environments

Jan Poland
2008, Theoretical Computer Science
The nonstochastic multi-armed bandit problem, first studied by Auer, Cesa-Bianchi, Freund, and Schapire in 1995, is a game of repeatedly choosing one decision from a set of decisions ("experts") under partial observation: in each round t, only the cost of the decision played is observable. A regret minimization algorithm plays this game while achieving sublinear regret relative to each decision. It is known that an adversary controlling the costs of the decisions can force on the player a regret growing as t^{1/2} in the time t. In this work, we propose the first algorithm for a countably infinite set of decisions that achieves regret upper bounded by O(t^{1/2+ε}), i.e. arbitrarily close to the optimal order. To this end, we build on the "follow the perturbed leader" principle, which dates back to work by Hannan in 1957. Our results hold against an adaptive adversary, for both the expected and high-probability regret of the learner w.r.t. each decision.

In the second part of the paper, we consider reactive problem settings, that is, situations where the learner's decisions affect the future behaviour of the adversary, so that a strong strategy can draw benefit from well-chosen past actions. We present a variant of our regret minimization algorithm which still has regret of order at most t^{1/2+ε} relative to such strong strategies, and even sublinear regret not exceeding O(t^{4/5}) w.r.t. the hypothetical (without external interference) performance of a strong strategy. We show how to combine the regret minimizer with a universal class of experts, given by the countable set of programs on some fixed universal Turing machine. This defines a universal learner with sublinear regret relative to any computable strategy.
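To make the abstract's mechanism concrete, here is a minimal Python sketch of "follow the perturbed leader" with uniform exploration over a slowly growing prefix of a countably infinite expert set. All names and parameter schedules (the exploration rate gamma, the perturbation scale, the prefix growth rate eps) are illustrative assumptions for exposition, not the paper's exact construction.

```python
import math
import random

def fpl_bandit(cost_oracle, horizon, eps=0.1, seed=0):
    """Sketch of 'follow the perturbed leader' with bandit feedback over
    a countably infinite expert set (experts indexed 0, 1, 2, ...).
    Illustrative assumption: the schedules below are simplified stand-ins
    for the paper's exact construction."""
    rng = random.Random(seed)
    est = {}  # estimated cumulative cost per expert
    for t in range(1, horizon + 1):
        n_active = int(math.ceil(t ** eps)) + 1   # slowly growing prefix of experts
        gamma = min(1.0, t ** (-1.0 / 3.0))       # exploration probability (illustrative)
        scale = math.sqrt(t)                      # perturbation magnitude ~ t^{1/2}
        if rng.random() < gamma:
            # exploration round: sample uniformly, form an unbiased cost estimate
            i = rng.randrange(n_active)
            c = cost_oracle(t, i)                 # only the played cost is revealed
            est[i] = est.get(i, 0.0) + c * n_active / gamma
        else:
            # exploitation round: follow the perturbed leader
            i = min(range(n_active),
                    key=lambda j: est.get(j, 0.0) - scale * rng.expovariate(1.0))
            cost_oracle(t, i)                     # cost incurred, not used for estimation
    return est

# toy usage: expert i has constant cost i / (i + 1), so expert 0 is optimal
estimates = fpl_bandit(lambda t, i: i / (i + 1), horizon=10000)
```

Because only the played decision's cost is observed, the sketch builds unbiased cumulative cost estimates by importance weighting on exploration rounds, in the spirit of the exploration/estimation split common to FPL-based bandit algorithms.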
doi:10.1016/j.tcs.2008.02.024