Structural Causal Bandits: Where to Intervene?

Sanghack Lee, Elias Bareinboim
2018 Neural Information Processing Systems  
We study the problem of identifying the best action in a sequential decision-making setting when the reward distributions of the arms exhibit a non-trivial dependence structure, which is governed by the underlying causal model of the domain where the agent is deployed. In this setting, playing an arm corresponds to intervening on a set of variables and setting them to specific values. In this paper, we show that whenever the underlying causal model is not taken into account during the decision-making process, the standard strategies of simultaneously intervening on all variables or on all the subsets of the variables may, in general, lead to suboptimal policies, regardless of the number of interventions performed by the agent in the environment. We formally acknowledge this phenomenon and investigate structural properties implied by the underlying causal model, which lead to a complete characterization of the relationships between the arms' distributions. We leverage this characterization to build a new algorithm that takes as input a causal structure and finds a minimal, sound, and complete set of qualified arms that an agent should play to maximize its expected reward. We empirically demonstrate that the new strategy learns an optimal policy and leads to orders of magnitude faster convergence rates when compared with its causal-insensitive counterparts.

Recently, the existence of some non-trivial dependencies among arms has been acknowledged in the literature and studied under the rubric of structured bandits, which include settings such as linear [Dani et al., 2008], combinatorial [Cesa-Bianchi and Lugosi, 2012], unimodal, and Lipschitz [Magureanu et al., 2014] bandits, to name a few. For example, a linear (or combinatorial) bandit imposes that an action $x_t \in \mathbb{R}^d$ (or $\{0, 1\}^d$) at a time step $t$ incurs a cost $\ell_t^\top x_t$, where $\ell_t$ is a loss vector chosen by, e.g., an adversary. In this case, an index-based MAB algorithm, oblivious to the structural properties, can be suboptimal.

In another line of investigation, rich environments with complex dependency structures are modeled explicitly through the use of causal graphs, where nodes represent decisions and outcome variables, and directed edges represent the direct influence of one variable on another [Pearl, 2000].

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
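The abstract's claim that intervening on every variable can be suboptimal can be illustrated with a minimal sketch. The toy model below is a hypothetical example chosen for illustration, not one taken from the paper: an unobserved fair coin U drives a single manipulable variable X, and the reward Y is 1 exactly when X matches U. The names `P_U`, `expected_reward`, and `arms` are assumptions introduced here. The code computes the exact expected reward of each arm by enumerating U.

```python
# Hypothetical toy causal model (an assumption for illustration only):
#   U ~ Bernoulli(0.5), unobserved
#   X := U            unless the agent intervenes on X
#   Y := 1 if X == U else 0   (the reward)
P_U = {0: 0.5, 1: 0.5}


def expected_reward(do_x=None):
    """Exact E[Y] under do(X = do_x); do_x=None means no intervention."""
    total = 0.0
    for u, p in P_U.items():
        # An intervention replaces X's natural mechanism (X := U).
        x = u if do_x is None else do_x
        y = 1 if x == u else 0
        total += p * y
    return total


# One arm per intervention, including the empty intervention do().
arms = {"do()": None, "do(X=0)": 0, "do(X=1)": 1}
rewards = {name: expected_reward(value) for name, value in arms.items()}
best_arm = max(rewards, key=rewards.get)

for name, value in rewards.items():
    print(f"{name}: E[Y] = {value}")
print("best arm:", best_arm)
```

In this toy model the empty intervention attains an expected reward of 1.0, while both arms that fix X attain 0.5, so an agent restricted to intervening on all variables never reaches the optimum no matter how many rounds it plays.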
dblp:conf/nips/LeeB18 fatcat:ziihmwac75cexjzvo3mnomddsy