Improved Algorithms for Conservative Exploration in Bandits
[article]
2020
arXiv
pre-print
In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that during the learning process the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
arXiv:2002.03221v1
fatcat:ncivpe3v2nh2rgm7nsjl3nolyu
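The constraint described in this abstract can be made concrete with a small simulation. The sketch below is not CLUCB2; it is a toy conservative variant of UCB on a Bernoulli bandit, where the baseline arm's mean is assumed known and the budget fraction alpha, the arm means, and the confidence bonus are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy conservative bandit sketch (not the paper's CLUCB2): K = 3 arms with
# Bernoulli rewards, a known baseline arm, and a budget fraction `alpha`.
# The learner plays the optimistic (UCB) arm only when a pessimistic estimate
# of its cumulative reward stays above (1 - alpha) times the baseline's
# cumulative performance; otherwise it falls back to the baseline.
true_means = np.array([0.45, 0.55, 0.60])   # hypothetical arm means
baseline_arm, alpha, T = 0, 0.05, 5000
counts, sums = np.zeros(3), np.zeros(3)
cum_reward = 0.0

for t in range(1, T + 1):
    means = np.divide(sums, counts, out=np.zeros(3), where=counts > 0)
    bonus = np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    ucb = np.where(counts > 0, means + bonus, np.inf)
    candidate = int(np.argmax(ucb))

    # Pessimistic (lower-bound) estimate of the candidate arm's mean.
    lcb = means[candidate] - bonus[candidate] if counts[candidate] > 0 else 0.0
    # Conservative check: reward gathered so far plus the pessimistic reward of
    # the candidate must stay above (1 - alpha) * t * baseline mean.
    if cum_reward + lcb >= (1 - alpha) * t * true_means[baseline_arm]:
        arm = candidate
    else:
        arm = baseline_arm

    r = float(rng.random() < true_means[arm])
    counts[arm] += 1
    sums[arm] += r
    cum_reward += r

print("empirical means:", np.round(sums / np.maximum(counts, 1), 3))
```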
Encrypted Linear Contextual Bandit
[article]
2022
arXiv
pre-print
Contextual bandit is a general framework for online learning in sequential decision-making problems that has found application in a wide range of domains, including recommendation systems, online advertising, and clinical trials. A critical aspect of bandit methods is that they require observing the contexts (i.e., individual or group-level data) and rewards in order to solve the sequential problem. Their large-scale deployment in industrial applications has increased interest in methods that preserve the users' privacy. In this paper, we introduce a privacy-preserving bandit framework based on homomorphic encryption, which allows computations over encrypted data. The algorithm only observes encrypted information (contexts and rewards) and has no ability to decrypt it. Leveraging the properties of homomorphic encryption, we show that despite the complexity of the setting, it is possible to solve linear contextual bandits over encrypted data with a O(d√(T)) regret bound, while keeping the data encrypted.
arXiv:2103.09927v2
fatcat:hpueka6tw5cd5fscdrcs2omfdm
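The key ingredient in the abstract is that a linear score can be evaluated on ciphertexts. The sketch below uses a deliberately insecure stand-in for an additively homomorphic scheme (the Cipher class is purely illustrative, not a real cryptosystem and not the paper's protocol) to show how a server can compute the score ⟨θ, x⟩ on an encrypted context without ever decrypting it.

```python
from dataclasses import dataclass
import numpy as np

# Illustrative stand-in for an additively homomorphic ciphertext: it supports
# addition and multiplication by a plaintext scalar, which is all a linear
# score <theta, x> needs. This is NOT real encryption (no security at all);
# it only mimics the interface the abstract relies on.
@dataclass
class Cipher:
    _v: float  # hidden payload; only the user side reads it via decrypt()

    def __add__(self, other):
        return Cipher(self._v + other._v)

    def __mul__(self, scalar: float):
        return Cipher(self._v * scalar)


def encrypt(x: float) -> Cipher:   # performed on the user side
    return Cipher(x)

def decrypt(c: Cipher) -> float:   # only the user holds this key
    return c._v

# Server side: it holds plaintext weights theta (learned from encrypted
# statistics in the paper; fixed here for brevity) and receives an encrypted
# context. It computes an encrypted score without ever seeing x in the clear.
theta = np.array([0.2, -0.5, 1.0])
x_user = np.array([1.0, 0.3, 0.7])             # private user context
enc_x = [encrypt(v) for v in x_user]           # user encrypts each coordinate

enc_score = enc_x[0] * theta[0]
for c, w in zip(enc_x[1:], theta[1:]):
    enc_score = enc_score + c * w              # homomorphic dot product

print(decrypt(enc_score), float(theta @ x_user))   # identical up to rounding
```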
Bandits with Side Observations: Bounded vs. Logarithmic Regret
[article]
2018
arXiv
pre-print
We consider the classical stochastic multi-armed bandit but where, from time to time and roughly with frequency ϵ, an extra observation is gathered by the agent for free. We prove that, no matter how small ϵ is, the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with a regret smaller than ∑_i log(1/ϵ)/Δ_i, up to a multiplicative constant and loglog terms. We also prove a matching lower bound, stating that no reasonable algorithm can outperform this quantity.
arXiv:1807.03558v1
fatcat:2b6euewcznauxbfr3iy26nbz3i
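A short simulation makes the role of the free observations visible. The sketch below runs plain UCB and, with probability ϵ per round, grants a free sample of a uniformly drawn arm; the arm means, ϵ, and horizon are hypothetical and the algorithm is not the one analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.5, 0.6])          # hypothetical Bernoulli arm means
eps, T = 0.02, 20000               # side observations arrive with frequency ~eps
counts, sums = np.zeros(2), np.zeros(2)
regret = 0.0

def pull(arm):
    """Sample the arm once and update its running statistics."""
    r = float(rng.random() < mu[arm])
    counts[arm] += 1
    sums[arm] += r
    return r

for t in range(1, T + 1):
    means = np.divide(sums, counts, out=np.zeros(2), where=counts > 0)
    bonus = np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    ucb = np.where(counts > 0, means + bonus, np.inf)
    arm = int(np.argmax(ucb))
    pull(arm)
    regret += mu.max() - mu[arm]

    # Free side observation: with probability eps a uniformly drawn arm is
    # sampled at no cost, which is what keeps the regret bounded in T.
    if rng.random() < eps:
        pull(int(rng.integers(2)))

print(f"cumulative regret after {T} rounds: {regret:.1f}")
```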
Conservative Exploration in Reinforcement Learning
[article]
2020
arXiv
pre-print
While learning in an unknown Markov Decision Process (MDP), an agent should trade off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward. Although the agent will eventually learn a good or optimal policy, there is no guarantee on the quality of the intermediate policies. This lack of control is undesired in real-world applications where a minimum requirement is that the executed policies are guaranteed to perform at least as well as an existing baseline. In this paper, we introduce the notion of conservative exploration for average reward and finite horizon problems. We present two optimistic algorithms that guarantee (w.h.p.) that the conservative constraint is never violated during learning. We derive regret bounds showing that being conservative does not hinder the learning ability of these algorithms.
arXiv:2002.03218v2
fatcat:armuovsmbvbrzmiwz3cdn3vh6y
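The conservative condition in this abstract can be written as an episode-level check. The function below is a simplified illustration (not the paper's algorithms): the quantities passed in, such as the lower bounds and the budget fraction alpha, are placeholders for the confidence bounds an optimistic algorithm would actually maintain.

```python
def choose_policy(episode, cum_lower_bound, optimistic_lb, baseline_value, alpha):
    """Episode-level conservative check (simplified from the abstract, not the
    paper's exact algorithm): run the optimistic policy only if, even under a
    pessimistic estimate of its return, the cumulative performance stays above
    (1 - alpha) times what the baseline would have accumulated so far.

    cum_lower_bound : lower bound on the return accumulated in episodes 1..k-1
    optimistic_lb   : lower bound on the optimistic policy's return this episode
    baseline_value  : known (or lower-bounded) per-episode value of the baseline
    alpha           : fraction of baseline performance we are allowed to give up
    """
    threshold = (1.0 - alpha) * episode * baseline_value
    if cum_lower_bound + optimistic_lb >= threshold:
        return "optimistic"
    return "baseline"


# Hypothetical numbers: after 9 episodes with at least 43.0 return accumulated,
# an optimistic policy whose return is provably at least 3.0 may be played when
# the baseline is worth 5.0 per episode and alpha = 0.1 (threshold = 45.0).
print(choose_policy(episode=10, cum_lower_bound=43.0, optimistic_lb=3.0,
                    baseline_value=5.0, alpha=0.1))
```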
Differentially Private Exploration in Reinforcement Learning with Linear Representation
[article]
2021
arXiv
pre-print
., 2020; Garcelon et al., 2020). In this paper, we contribute to the study of DP in online reinforcement learning (RL). ...
Algorithm Garcelon et al. (2020) Vietri et al. (2020) Our -Cor. 5 Our -Cor. 6 Our -Thm. 8 Setting Tab. Tab. ...
arXiv:2112.01585v2
fatcat:ubnlne4zyrgqrfcb7gcyhzznt4
Adversarial Attacks on Linear Contextual Bandits
[article]
2020
arXiv
pre-print
Contextual bandit algorithms are applied in a wide range of domains, from advertising to recommender systems, from clinical trials to education. In many of these domains, malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior. For instance, an unscrupulous ad publisher may try to increase their own revenue at the expense of the advertisers; a seller may want to increase the exposure of their products, or thwart a competitor's advertising campaign. In this paper, we study several attack scenarios and show that a malicious agent can force a linear contextual bandit algorithm to pull any desired arm T - o(T) times over a horizon of T steps, while applying adversarial modifications to either rewards or contexts that only grow logarithmically as O(log T). We also investigate the case when a malicious agent is interested in affecting the behavior of the bandit algorithm in a single context (e.g., a specific user). We first provide sufficient conditions for the feasibility of the attack and we then propose an efficient algorithm to perform the attack. We validate our theoretical results on experiments performed on both synthetic and real-world datasets.
arXiv:2002.03839v3
fatcat:oweqjzh4erh7pfosmss2sovvcm
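A toy reward-poisoning experiment illustrates the flavor of the result. The sketch below attacks a plain stochastic UCB learner rather than a linear contextual one, and the poisoning rule (zeroing the rewards of non-target arms) is a simplification, not the paper's attack; the arm means and horizon are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.9, 0.5, 0.4])   # hypothetical means; arm 2 is the attacker's target
target, T = 2, 10000
counts, sums = np.zeros(3), np.zeros(3)
attack_cost = 0.0

for t in range(1, T + 1):
    means = np.divide(sums, counts, out=np.zeros(3), where=counts > 0)
    bonus = np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    ucb = np.where(counts > 0, means + bonus, np.inf)
    arm = int(np.argmax(ucb))
    r = float(rng.random() < mu[arm])

    # Toy reward-poisoning rule (a simplification of the idea in the abstract):
    # rewards of non-target arms are pushed down to 0 so the target arm looks
    # optimal. The attacker only pays when a non-target arm is pulled, and
    # those pulls become rare, so the total cost grows slowly with T.
    if arm != target:
        attack_cost += r          # magnitude of the applied modification
        r = 0.0

    counts[arm] += 1
    sums[arm] += r

print("target-arm pulls:", int(counts[target]), "out of", T)
print("total attack cost:", round(attack_cost, 1))
```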
No-Regret Exploration in Goal-Oriented Reinforcement Learning
[article]
2020
arXiv
pre-print
Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost. Despite the popularity of this setting, the exploration-exploitation dilemma has been sparsely studied in general SSP problems, with most of the theoretical literature focusing on different problems (i.e., fixed-horizon and infinite-horizon) or making the restrictive loop-free SSP assumption (i.e., no state can be visited twice during an episode). In this paper, we study the general SSP problem with no assumption on its dynamics (some policies may actually never reach the goal). We introduce UC-SSP, the first no-regret algorithm in this setting, and prove a regret bound scaling as 𝒪( D S √( A D K)) after K episodes for any unknown SSP with S states, A actions, positive costs and SSP-diameter D, defined as the smallest expected hitting time from any starting state to the goal. We achieve this result by crafting a novel stopping rule, such that UC-SSP may interrupt the current policy if it is taking too long to achieve the goal and switch to alternative policies that are designed to rapidly terminate the episode.
arXiv:1912.03517v3
fatcat:us4tuyaggfe6jemd5v6yccbknm
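The stopping rule mentioned at the end of the abstract can be sketched schematically. The code below uses a toy chain environment and two placeholder policies (optimistic_policy and fast_termination_policy are hypothetical stand-ins, not UC-SSP's actual policies) to show how an episode is interrupted after a pivot horizon and handed over to a policy that terminates quickly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Schematic of the episode-level stopping rule described in the abstract,
# on a hypothetical 5-state chain: if the goal is not reached within a pivot
# horizon, switch to a policy aimed at terminating the episode quickly.
GOAL = 4

def step(state, action):
    """Toy dynamics: action 1 moves right with probability 0.7, else stay."""
    if action == 1 and rng.random() < 0.7:
        state += 1
    return min(state, GOAL), 1.0          # every step incurs unit cost

def optimistic_policy(state):
    return int(rng.random() < 0.5)        # placeholder: explores a lot

def fast_termination_policy(state):
    return 1                              # placeholder: heads straight to the goal

def run_episode(pivot_horizon=20, max_steps=200):
    state, cost, t = 0, 0.0, 0
    while state != GOAL and t < max_steps:
        policy = optimistic_policy if t < pivot_horizon else fast_termination_policy
        state, c = step(state, policy(state))
        cost += c
        t += 1
    return cost

print("episode costs:", [run_episode() for _ in range(5)])
```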
A Reduction-Based Framework for Conservative Bandits and Reinforcement Learning
[article]
2022
arXiv
pre-print
2019] and tabular MDPs [Garcelon et al., 2020a]. ...
., 2016, Garcelon et al., 2020b, Katariya et al., 2019, Zhang et al., 2019, Du et al., 2020, Wang et al., 2021 and tabular RL [Garcelon et al., 2020a]. ...
arXiv:2106.11692v2
fatcat:lf5dlitu55gddosv5jq5s7w3ji
Local Differential Privacy for Regret Minimization in Reinforcement Learning
[article]
2021
arXiv
pre-print
Reinforcement learning algorithms are widely used in domains where it is desirable to provide a personalized service. In these domains it is common that user data contains sensitive information that needs to be protected from third parties. Motivated by this, we study privacy in the context of finite-horizon Markov Decision Processes (MDPs) by requiring information to be obfuscated on the user side. We formulate this notion of privacy for RL by leveraging the local differential privacy (LDP) framework. We establish a lower bound for regret minimization in finite-horizon MDPs with LDP guarantees which shows that guaranteeing privacy has a multiplicative effect on the regret. This result shows that while LDP is an appealing notion of privacy, it makes the learning problem significantly more complex. Finally, we present an optimistic algorithm that simultaneously satisfies ε-LDP requirements, and achieves √(K)/ε regret in any finite-horizon MDP after K episodes, matching the lower bound dependency on the number of episodes K.
arXiv:2010.07778v3
fatcat:76bncyh47zgr3bvtu5qf52gxnm
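The user-side obfuscation required by LDP can be illustrated on reward reports. The snippet below adds Laplace noise calibrated to ε before anything leaves the user; it is a generic local-DP sketch for a bounded reward, not the paper's algorithm, and the reward distribution and ε are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)

def ldp_report(reward: float, eps: float, sensitivity: float = 1.0) -> float:
    """User-side obfuscation: add Laplace noise calibrated to eps before the
    reward ever leaves the user, so the learner only sees privatized values.
    (A minimal local-DP sketch for a reward in [0, 1], not the paper's method.)"""
    return reward + rng.laplace(scale=sensitivity / eps)

eps = 1.0
true_rewards = rng.binomial(1, 0.6, size=5000).astype(float)   # hypothetical users
noisy_reports = np.array([ldp_report(r, eps) for r in true_rewards])

# The learner never sees `true_rewards`; it estimates the mean from noisy data.
print("true mean   :", true_rewards.mean().round(3))
print("LDP estimate:", noisy_reports.mean().round(3))
```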
Improved Algorithms for Conservative Exploration in Bandits
2020
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20)
In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that during the learning process the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
doi:10.1609/aaai.v34i04.5812
fatcat:gz52kwjj6zhsjfda6uozoj4hre
Top K Ranking for Multi-Armed Bandit with Noisy Evaluations
[article]
2022
arXiv
pre-print
We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased evaluations of the true reward of each arm, and it selects K arms with the objective of accumulating as much reward as possible over T rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show a O(T^{2/3}) regret in the general case when the observation functions are a generalized linear function of the true rewards. On the other hand, we show that an improved O(√(T)) regret can be derived when the observation functions are noisy linear functions of the true rewards. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison to alternative approaches, and further supports the interest of this setting in practice.
arXiv:2112.06517v4
fatcat:q4spwhtf3fezrpxr5auutvguma
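The interplay between the per-round evaluations and the observed rewards can be sketched as follows. The scoring rule below (a count-based blend of evaluations and empirical means) is a hypothetical illustration of the setting, not one of the paper's algorithms; the arm means, biases, and noise levels are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
mu = np.array([0.2, 0.5, 0.6, 0.8, 0.35])        # hypothetical true arm rewards
bias = np.array([0.1, -0.05, 0.0, -0.1, 0.05])   # per-arm evaluation bias
K, T = 2, 2000
counts, sums = np.zeros(5), np.zeros(5)

for t in range(1, T + 1):
    # Noisy, possibly biased evaluations of every arm, revealed for free.
    evals = mu + bias + rng.normal(0, 0.2, size=5)

    # Toy scoring rule (not the paper's): trust the evaluations early, then
    # shift weight to the empirical means of the arms actually played.
    means = np.divide(sums, counts, out=np.zeros(5), where=counts > 0)
    w = counts / (counts + 10.0)
    scores = w * means + (1.0 - w) * evals

    chosen = np.argsort(scores)[-K:]          # select the top-K arms this round
    for arm in chosen:
        r = mu[arm] + rng.normal(0, 0.1)
        counts[arm] += 1
        sums[arm] += r

print("per-arm pull counts:", counts.astype(int))
```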
Differentially Private Regret Minimization in Episodic Markov Decision Processes
[article]
2021
arXiv
pre-print
Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, and Matteo Pirotta. Local differentially private regret minimization in reinforcement learning. arXiv preprint arXiv:2010.07778, 2020. ...
On the other hand, Garcelon et al. (2020) design the first private RL algorithm – LDP-OBI – with regret and LDP guarantees. ...
arXiv:2112.10599v1
fatcat:nuiwxweo7vcajhfbjeiilo5zle
Exploration-Exploitation in Constrained MDPs
[article]
2020
arXiv
pre-print
Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Improved algorithms for conservative exploration in bandits. CoRR, abs/2002.03221, 2020a. ...
., 2017, Garcelon et al., 2020a and in RL [Garcelon et al., 2020b]. ...
arXiv:2003.02189v1
fatcat:7o4wlfdqbvfapb5ws4lmadhtsq
The role of a positive spirit in the attractiveness, sociability and success of a public place
2019
Quaestiones Geographicae
On the other hand, the intangible component has not drawn much attention despite the good will of some masters like Louis Kahn who made the materials tell the stories of his projects (Garcelon et al. ...
This tool is a questionnaire of the Likert-type scale (Lombart 2004; Aurier, Evrard 1998; Lichtlé, Plichon 2014) designed by the present authors, in the context of a doctoral thesis in progress, based ...
doi:10.2478/quageo-2019-0040
fatcat:z5qrv5g5mfdcnofolfgezoymou
Safe Optimal Design with Applications in Policy Learning
[article]
2021
arXiv
pre-print
Yang, Yunchang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, Liwei Wang, Simon S. Du. 2021. ...
arXiv:2111.04835v1
fatcat:iey4fmroszb6vgrmorifjcj62a
Showing results 1 — 15 out of 23 results