23 Hits in 1.5 sec

Improved Algorithms for Conservative Exploration in Bandits [article]

Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta
2020 arXiv   pre-print
In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that, during the learning process, the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
arXiv:2002.03221v1 fatcat:ncivpe3v2nh2rgm7nsjl3nolyu
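The conservative constraint described above can be sketched in a few lines. The fragment below is a hedged illustration in the simpler multi-armed setting (the names, the UCB1-style bonus, and the budget bookkeeping are assumptions of this sketch, not CLUCB2 itself, which works with confidence ellipsoids in the contextual linear setting): the learner plays the optimistic arm only when a lower confidence bound on the reward collected so far stays above a (1 - alpha) fraction of what the baseline would have earned.

```python
import numpy as np

def conservative_ucb_step(counts, means, t, baseline_mean, lcb_so_far, alpha=0.05):
    """One round of a conservative UCB-style rule (illustration only)."""
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)

    # Optimistic index for every arm (UCB1-style exploration bonus).
    bonus = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(counts, 1.0))
    optimistic_arm = int(np.argmax(means + bonus))

    # Conservative check: a lower confidence bound on the reward accumulated
    # so far must stay above a (1 - alpha) fraction of what t rounds of the
    # baseline would have earned; otherwise the baseline is played instead.
    if lcb_so_far >= (1.0 - alpha) * baseline_mean * t:
        return optimistic_arm   # safe to explore
    return None                 # caller falls back to the baseline arm
```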

Encrypted Linear Contextual Bandit [article]

Evrard Garcelon and Vianney Perchet and Matteo Pirotta
2022 arXiv   pre-print
Contextual bandit is a general framework for online learning in sequential decision-making problems that has found application in a wide range of domains, including recommendation systems, online advertising, and clinical trials. A critical aspect of bandit methods is that they require observing the contexts (i.e., individual- or group-level data) and rewards in order to solve the sequential problem. Their large-scale deployment in industrial applications has increased interest in methods that preserve the users' privacy. In this paper, we introduce a privacy-preserving bandit framework based on homomorphic encryption, which allows computations over encrypted data. The algorithm only observes encrypted information (contexts and rewards) and has no ability to decrypt it. Leveraging the properties of homomorphic encryption, we show that despite the complexity of the setting, it is possible to solve any linear contextual bandit problem over encrypted data with an O(d√T) regret bound, while keeping the data encrypted.
arXiv:2103.09927v2 fatcat:hpueka6tw5cd5fscdrcs2omfdm
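As a point of reference for what must be computed over ciphertexts, here is a bare-bones, unencrypted LinUCB loop; the comments flag the quantities that an encrypted variant would only ever see in encrypted form. This is a hedged sketch of standard LinUCB with invented names, not the paper's protocol.

```python
import numpy as np

class LinUCBSketch:
    """Plain LinUCB; comments mark what would be ciphertext in an
    encrypted variant (illustration only, not the paper's protocol)."""

    def __init__(self, d, reg=1.0, beta=1.0):
        self.A = reg * np.eye(d)   # Gram matrix  A = reg*I + sum_t x_t x_t^T
        self.b = np.zeros(d)       # reward-weighted sum  b = sum_t r_t x_t
        self.beta = beta           # width of the confidence bonus

    def choose(self, contexts):
        # In the encrypted setting, `contexts`, A and b are encrypted, so the
        # estimate and the scores below must be computed homomorphically.
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        scores = [x @ theta + self.beta * np.sqrt(x @ A_inv @ x) for x in contexts]
        return int(np.argmax(scores))

    def update(self, x, r):
        # Rank-one update and reward accumulation: additions and outer
        # products, i.e. the operations a homomorphic scheme must support.
        self.A += np.outer(x, x)
        self.b += r * x
```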

Bandits with Side Observations: Bounded vs. Logarithmic Regret [article]

Rémy Degenne, Evrard Garcelon, Vianney Perchet
2018 arXiv   pre-print
We consider the classical stochastic multi-armed bandit problem where, from time to time and roughly with frequency ϵ, an extra observation is gathered by the agent for free. We prove that, no matter how small ϵ is, the agent can ensure a regret uniformly bounded in time. More precisely, we construct an algorithm with regret smaller than ∑_i log(1/ϵ)/Δ_i, up to multiplicative constants and log log terms. We also prove a matching lower bound, stating that no reasonable algorithm can outperform this quantity.
arXiv:1807.03558v1 fatcat:2b6euewcznauxbfr3iy26nbz3i
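A small simulation of this setting is easy to write down. The sketch below (Bernoulli rewards, a plain UCB1 index, and a uniformly random free arm are all assumptions of the sketch, not the algorithm analyzed in the paper) simply folds the free observation, which arrives with probability ϵ each round, into the same per-arm statistics.

```python
import numpy as np

def ucb_with_side_observations(means, T=10_000, eps=0.05, rng=None):
    """UCB1 that also receives a free sample of a random arm w.p. eps."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    counts = np.zeros(K)
    sums = np.zeros(K)

    def observe(arm):                       # record a Bernoulli sample of `arm`
        counts[arm] += 1
        sums[arm] += rng.binomial(1, means[arm])

    for arm in range(K):                    # initialise with one pull per arm
        observe(arm)

    regret = 0.0
    for t in range(K, T):
        ucb = sums / counts + np.sqrt(2 * np.log(t + 1) / counts)
        arm = int(np.argmax(ucb))           # pulled arm (counts toward regret)
        observe(arm)
        regret += max(means) - means[arm]
        if rng.random() < eps:              # free side observation
            observe(rng.integers(K))
    return regret

# Example: ucb_with_side_observations([0.5, 0.45, 0.4], eps=0.1)
```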

Conservative Exploration in Reinforcement Learning [article]

Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta
2020 arXiv   pre-print
While learning in an unknown Markov Decision Process (MDP), an agent should trade off exploration to discover new information about the MDP, and exploitation of the current knowledge to maximize the reward. Although the agent will eventually learn a good or optimal policy, there is no guarantee on the quality of the intermediate policies. This lack of control is undesirable in real-world applications where a minimum requirement is that the executed policies are guaranteed to perform at least as well as an existing baseline. In this paper, we introduce the notion of conservative exploration for average-reward and finite-horizon problems. We present two optimistic algorithms that guarantee (w.h.p.) that the conservative constraint is never violated during learning. We derive regret bounds showing that being conservative does not hinder the learning ability of these algorithms.
arXiv:2002.03218v2 fatcat:armuovsmbvbrzmiwz3cdn3vh6y

Differentially Private Exploration in Reinforcement Learning with Linear Representation [article]

Paul Luyo and Evrard Garcelon and Alessandro Lazaric and Matteo Pirotta
2021 arXiv   pre-print
..., 2020; Garcelon et al., 2020). In this paper, we contribute to the study of DP in online reinforcement learning (RL).  ...  [table comparing the settings of Garcelon et al. (2020), Vietri et al. (2020), and the paper's Cor. 5, Cor. 6, and Thm. 8]  ...
arXiv:2112.01585v2 fatcat:ubnlne4zyrgqrfcb7gcyhzznt4

Adversarial Attacks on Linear Contextual Bandits [article]

Evrard Garcelon, Baptiste Roziere, Laurent Meunier, Jean Tarbouriech, Olivier Teytaud, Alessandro Lazaric, Matteo Pirotta
2020 arXiv   pre-print
Contextual bandit algorithms are applied in a wide range of domains, from advertising to recommender systems, from clinical trials to education. In many of these domains, malicious agents may have incentives to attack the bandit algorithm to induce it to perform a desired behavior. For instance, an unscrupulous ad publisher may try to increase their own revenue at the expense of the advertisers; a seller may want to increase the exposure of their products, or thwart a competitor's advertising campaign. In this paper, we study several attack scenarios and show that a malicious agent can force a linear contextual bandit algorithm to pull any desired arm T - o(T) times over a horizon of T steps, while applying adversarial modifications to either rewards or contexts that only grow logarithmically as O(log T). We also investigate the case when a malicious agent is interested in affecting the behavior of the bandit algorithm in a single context (e.g., a specific user). We first provide sufficient conditions for the feasibility of the attack and then propose an efficient algorithm to perform it. We validate our theoretical results with experiments on both synthetic and real-world datasets.
arXiv:2002.03839v3 fatcat:oweqjzh4erh7pfosmss2sovvcm
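For intuition, the reward-perturbation side of such attacks can be caricatured in a few lines. The rule below is a deliberately naive sketch (the target arm, margin, and cost accounting are assumptions made for illustration; the paper also attacks contexts and gives precise feasibility conditions): rewards of non-target arms are pushed down, so the learner pulls them only rarely and the cumulative perturbation stays small.

```python
def poison_reward(pulled_arm, true_reward, target_arm, margin=0.1):
    """Return the (possibly corrupted) reward shown to the bandit and the
    perturbation paid this round (toy attack, not the paper's algorithm)."""
    if pulled_arm == target_arm:
        return true_reward, 0.0              # never touch the target arm
    corrupted = min(true_reward, -margin)    # make every other arm look bad
    return corrupted, abs(true_reward - corrupted)
```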

No-Regret Exploration in Goal-Oriented Reinforcement Learning [article]

Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, Alessandro Lazaric
2020 arXiv   pre-print
Many popular reinforcement learning problems (e.g., navigation in a maze, some Atari games, mountain car) are instances of the episodic setting under its stochastic shortest path (SSP) formulation, where an agent has to achieve a goal state while minimizing the cumulative cost. Despite the popularity of this setting, the exploration-exploitation dilemma has been sparsely studied in general SSP problems, with most of the theoretical literature focusing on different problems (i.e., fixed-horizon and infinite-horizon) or making the restrictive loop-free SSP assumption (i.e., no state can be visited twice during an episode). In this paper, we study the general SSP problem with no assumption on its dynamics (some policies may actually never reach the goal). We introduce UC-SSP, the first no-regret algorithm in this setting, and prove a regret bound scaling as O(DS√(ADK)) after K episodes for any unknown SSP with S states, A actions, positive costs and SSP-diameter D, defined as the smallest expected hitting time from any starting state to the goal. We achieve this result by crafting a novel stopping rule, such that UC-SSP may interrupt the current policy if it is taking too long to achieve the goal and switch to alternative policies that are designed to rapidly terminate the episode.
arXiv:1912.03517v3 fatcat:us4tuyaggfe6jemd5v6yccbknm
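To make the SSP objective concrete, the snippet below computes the optimal expected cost-to-goal by value iteration for a known model (the array shapes and the absorbing, cost-free goal convention are assumptions of this sketch; UC-SSP itself tackles the harder problem where the model is unknown and regret is measured online).

```python
import numpy as np

def ssp_value_iteration(P, c, goal, iters=10_000, tol=1e-8):
    """Optimal expected cost-to-goal for a known SSP.

    P[s, a, s'] : transition probabilities, c[s, a] > 0 : per-step costs,
    goal        : index of the absorbing, cost-free goal state.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = c + P @ V              # Q[s, a] = c[s, a] + sum_s' P[s, a, s'] V[s']
        Q[goal, :] = 0.0           # reaching the goal ends the episode
        V_new = Q.min(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V
```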

A Reduction-Based Framework for Conservative Bandits and Reinforcement Learning [article]

Yunchang Yang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, Liwei Wang, Simon S. Du
2022 arXiv   pre-print
2019] and tabular MDPs [Garcelon et al., 2020a].  ...  ..., 2016, Garcelon et al., 2020b, Katariya et al., 2019, Zhang et al., 2019, Du et al., 2020, Wang et al., 2021 and tabular RL [Garcelon et al., 2020a].  ...
arXiv:2106.11692v2 fatcat:lf5dlitu55gddosv5jq5s7w3ji

Local Differential Privacy for Regret Minimization in Reinforcement Learning [article]

Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, Matteo Pirotta
2021 arXiv   pre-print
Reinforcement learning algorithms are widely used in domains where it is desirable to provide a personalized service. In these domains it is common that user data contains sensitive information that needs to be protected from third parties. Motivated by this, we study privacy in the context of finite-horizon Markov Decision Processes (MDPs) by requiring information to be obfuscated on the user side. We formulate this notion of privacy for RL by leveraging the local differential privacy (LDP) framework. We establish a lower bound for regret minimization in finite-horizon MDPs with LDP guarantees which shows that guaranteeing privacy has a multiplicative effect on the regret. This result shows that while LDP is an appealing notion of privacy, it makes the learning problem significantly more complex. Finally, we present an optimistic algorithm that simultaneously satisfies ε-LDP requirements and achieves a regret of order √(K)/ε in any finite-horizon MDP after K episodes, matching the lower bound's dependence on the number of episodes K.
arXiv:2010.07778v3 fatcat:76bncyh47zgr3bvtu5qf52gxnm
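The "obfuscated on the user side" requirement is typically met with a local randomizer. The helper below is a minimal sketch using the Laplace mechanism on a bounded reward (the bounds, the per-value privacy budget, and the function name are assumptions of the sketch, not the LDP-OBI algorithm cited elsewhere in these results): noise with scale (r_max - r_min)/ε is added before the value ever leaves the user.

```python
import numpy as np

def ldp_randomize_reward(reward, eps, r_min=0.0, r_max=1.0, rng=None):
    """epsilon-LDP release of a bounded reward via the Laplace mechanism."""
    rng = rng or np.random.default_rng()
    reward = float(np.clip(reward, r_min, r_max))   # enforce the assumed bounds
    sensitivity = r_max - r_min                     # max change from one reward
    return reward + rng.laplace(loc=0.0, scale=sensitivity / eps)
```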

Improved Algorithms for Conservative Exploration in Bandits

Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, Matteo Pirotta
2020 Proceedings of the AAAI Conference on Artificial Intelligence
In many fields such as digital marketing, healthcare, finance, and robotics, it is common to have a well-tested and reliable baseline policy running in production (e.g., a recommender system). Nonetheless, the baseline policy is often suboptimal. In this case, it is desirable to deploy online learning algorithms (e.g., a multi-armed bandit algorithm) that interact with the system to learn a better/optimal policy under the constraint that, during the learning process, the performance is almost never worse than the performance of the baseline itself. In this paper, we study the conservative learning problem in the contextual linear bandit setting and introduce a novel algorithm, the Conservative Constrained LinUCB (CLUCB2). We derive regret bounds for CLUCB2 that match existing results and empirically show that it outperforms state-of-the-art conservative bandit algorithms in a number of synthetic and real-world problems. Finally, we consider a more realistic constraint where the performance is verified only at predefined checkpoints (instead of at every step) and show how this relaxed constraint favorably impacts the regret and empirical performance of CLUCB2.
doi:10.1609/aaai.v34i04.5812 fatcat:gz52kwjj6zhsjfda6uozoj4hre

Top K Ranking for Multi-Armed Bandit with Noisy Evaluations [article]

Evrard Garcelon and Vashist Avadhanula and Alessandro Lazaric and Matteo Pirotta
2022 arXiv   pre-print
We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased evaluations of the true reward of each arm and selects K arms with the objective of accumulating as much reward as possible over T rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show an O(T^(2/3)) regret in the general case when the observation functions are a generalized linear function of the true rewards. On the other hand, we show that an improved O(√T) regret can be derived when the observation functions are noisy linear functions of the true rewards. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison to alternative approaches, and further supports the interest of this setting in practice.
arXiv:2112.06517v4 fatcat:q4spwhtf3fezrpxr5auutvguma
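One simple strategy compatible with the noisy-linear-observation case can be sketched as follows; the least-squares calibration and all names below are assumptions made for illustration, not the paper's algorithms: past (evaluation, reward) pairs are used to fit a linear map, the current evaluations are scored through it, and the K highest-scoring arms are selected.

```python
import numpy as np

def select_top_k(evaluations, past_evals, past_rewards, K):
    """Pick K arms from noisy evaluations, calibrated against observed rewards."""
    evaluations = np.asarray(evaluations, dtype=float)
    past_evals = np.asarray(past_evals, dtype=float)
    past_rewards = np.asarray(past_rewards, dtype=float)

    if past_evals.size >= 2:
        # Fit reward ~ a * evaluation + b by ordinary least squares.
        X = np.column_stack([past_evals, np.ones_like(past_evals)])
        coef, *_ = np.linalg.lstsq(X, past_rewards, rcond=None)
        scores = coef[0] * evaluations + coef[1]
    else:
        scores = evaluations        # not enough history: trust the evaluations
    return np.argsort(scores)[::-1][:K]
```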

Differentially Private Regret Minimization in Episodic Markov Decision Processes [article]

Sayak Ray Chowdhury, Xingyu Zhou
2021 arXiv   pre-print
Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, and Matteo Pirotta. Local differentially private regret minimization in reinforcement learning. arXiv preprint arXiv:2010.07778, 2020.  ...  On the other hand, Garcelon et al. (2020) design the first private RL algorithm – LDP-OBI – with regret and LDP guarantees.  ... 
arXiv:2112.10599v1 fatcat:nuiwxweo7vcajhfbjeiilo5zle

Exploration-Exploitation in Constrained MDPs [article]

Yonathan Efroni and Shie Mannor and Matteo Pirotta
2020 arXiv   pre-print
Evrard Garcelon, Mohammad Ghavamzadeh, Alessandro Lazaric, and Matteo Pirotta. Improved algorithms for conservative exploration in bandits. CoRR, abs/2002.03221, 2020a.  ...  ..., 2017, Garcelon et al., 2020a and in RL [Garcelon et al., 2020b].  ...
arXiv:2003.02189v1 fatcat:7o4wlfdqbvfapb5ws4lmadhtsq

The role of a positive spirit in the attractiveness, sociability and success of a public place

Lyes Rahmani, Maha Messaoudene
2019 Quaestiones Geographicae  
On the other hand, the intangible component has not drawn much attention despite the good will of some masters like Louis Kahn who made the materials tell the stories of his projects (Garcelon et al.  ...  This tool is a questionnaire of the Likert-type scale (Lombart 2004; Aurier, Evrard 1998; Lichtlé, Plichon 2014) designed by the present authors, in the context of a doctoral thesis in progress, based  ...
doi:10.2478/quageo-2019-0040 fatcat:z5qrv5g5mfdcnofolfgezoymou

Safe Optimal Design with Applications in Policy Learning [article]

Ruihao Zhu, Branislav Kveton
2021 arXiv   pre-print
...  Yang, Yunchang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, Liwei Wang, Simon S. Du. 2021.  ...
arXiv:2111.04835v1 fatcat:iey4fmroszb6vgrmorifjcj62a
Showing results 1 — 15 out of 23 results