
Discovering a set of policies for the worst case reward [article]

Tom Zahavy, Andre Barreto, Daniel J Mankowitz, Shaobo Hou, Brendan O'Donoghue, Iurii Kemaev, Satinder Singh
2021 arXiv   pre-print
Our main contribution is a policy iteration algorithm that builds a set of policies in order to maximize the worst-case performance of the resulting SMP on the set of tasks.  ...  We show that the worst-case performance of the resulting SMP strictly improves at each iteration, and the algorithm only stops when there does not exist a policy that leads to improved performance.  ...  For the special case where the SFs associated with any policy are in the simplex, the value of the SMP w.r.t. the worst-case reward for any set of policies is less than or equal to −1/√d.  ... 
arXiv:2102.04323v2 fatcat:36aek7m7uffwdg7w5xoq4dpwaa
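The iterative scheme the abstract describes can be illustrated with a toy sketch (the function name, the 2-D successor features, and the grid of tasks are all hypothetical, not the paper's implementation): score a set of policies by the worst task for the set-max policy (SMP), and check that enlarging the set never hurts the worst case.

```python
import numpy as np

# Hypothetical sketch (not the paper's implementation): each policy is
# summarized by a 2-D successor-feature (SF) vector psi, and a task is a
# reward-weight vector w on the unit circle. The SMP picks the best
# policy in the set for each task, so its value on task w is
# max_pi w . psi_pi; the adversary picks the task minimizing that.
def smp_worst_case(psis, n_dirs=3600):
    angles = np.linspace(0.0, 2 * np.pi, n_dirs, endpoint=False)
    ws = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # candidate tasks
    values = ws @ np.asarray(psis, dtype=float).T  # each policy on each task
    smp_value = values.max(axis=1)                 # SMP: best policy per task
    worst = int(smp_value.argmin())                # adversarial task for the set
    return float(smp_value[worst]), ws[worst]

# Growing the set can only improve (never hurt) the worst case, which is
# the monotone-improvement property the abstract states.
v1, _ = smp_worst_case([[1.0, 0.0]])
v2, _ = smp_worst_case([[1.0, 0.0], [0.0, 1.0]])
assert v2 >= v1
```

With a single policy the adversary can push the value to −1; adding a second, complementary policy lifts the worst case to about −1/√2, matching the flavor of the −1/√d bound quoted above.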

Clinician-in-the-Loop Decision Making: Reinforcement Learning with Near-Optimal Set-Valued Policies [article]

Shengpu Tang, Aditya Modi, Michael W. Sjoding, Jenna Wiens
2020 arXiv   pre-print
We consider an alternative objective -- learning set-valued policies to capture near-equivalent actions that lead to similar cumulative rewards.  ...  However, in healthcare settings, many actions may be near-equivalent with respect to the reward (e.g., survival).  ...  The views and conclusions in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the National Science  ... 
arXiv:2007.12678v1 fatcat:a4k5hgwhyre5zbgqgpy4pddpwe
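A set-valued policy of the kind described above can be sketched in a few lines (the function name and the zeta threshold are illustrative, not the paper's exact criterion):

```python
import numpy as np

# Illustrative sketch: return every action whose Q-value is within a
# zeta fraction of the best action's value; these are treated as
# near-equivalent w.r.t. the (cumulative) reward.
def near_optimal_set(q_values, zeta=0.05):
    q = np.asarray(q_values, dtype=float)
    keep = q >= q.max() - zeta * abs(q.max())
    return [int(i) for i in np.flatnonzero(keep)]

# Actions 0 and 1 are near-equivalent, so both are kept, leaving the
# final choice to the clinician in the loop.
print(near_optimal_set([0.90, 0.89, 0.40]))  # [0, 1]
```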

Unsupervised Meta-Learning for Reinforcement Learning [article]

Abhishek Gupta, Benjamin Eysenbach, Chelsea Finn, Sergey Levine
2020 arXiv   pre-print
In this work, we take a step in this direction, proposing a family of unsupervised meta-learning algorithms for reinforcement learning.  ...  We motivate and describe a general recipe for unsupervised meta-reinforcement learning, and present an instantiation of this approach.  ...  A. Proofs Lemma 1 Let π be a policy for which ρ_π^T(s) is uniform. Then π has lowest worst-case regret. Proof of Lemma 1.  ... 
arXiv:1806.04640v3 fatcat:p4exyyqnnbesvjatymjqhiz2du

Adaptive Routing with Guaranteed Delay Bounds using Safe Reinforcement Learning

Gautham Nayak Seetanadi, Karl-Erik Årzén, Martina Maggio
2020 Proceedings of the 28th International Conference on Real-Time Networks and Systems  
For known typical and worst-case delays, an algorithm was presented to (statically) determine the policy to be followed during packet transmission in terms of edge choices.  ...  In this paper we relax the assumption of knowing the typical delay, and assume only worst-case bounds are available.  ...  A similar methodology for restricting the exploration space is to use a finite set of demonstrations to discover the state space.  ... 
doi:10.1145/3394810.3394815 dblp:conf/rtns/SeetanadiAM20 fatcat:xadanpo74zfarfz72lbiprpkga
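The safety restriction sketched above, exploring only edges that cannot violate the deadline, can be illustrated as follows (names and numbers are hypothetical, not the paper's algorithm):

```python
# Illustrative sketch: when only worst-case delay bounds are known, an
# edge is safe to explore at a node only if its worst-case delay, plus a
# worst-case bound on the remainder of the path, still meets the
# end-to-end deadline.
def safe_edges(edges, deadline, elapsed):
    # edges: (edge_id, worst_case_delay, worst_case_remaining_path_delay)
    return [e for e, wc_delay, wc_rest in edges
            if elapsed + wc_delay + wc_rest <= deadline]

# Edge "b" could miss the deadline in the worst case, so it is never tried.
print(safe_edges([("a", 2, 5), ("b", 6, 5)], deadline=10, elapsed=1))  # ['a']
```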

Two Can Play That Game: An Adversarial Evaluation of a Cyber-alert Inspection System [article]

Ankit Shah, Arunesh Sinha, Rajesh Ganesan, Sushil Jajodia, Hasan Cam
2018 arXiv   pre-print
This made the defender robust to the discovered attacker policy and no further harmful attacker policies were discovered.  ...  We use a double oracle approach to retrain the defender with episodes from this discovered attacker policy.  ...  set the baseline value of avgTTA/hr for a CSOC with ρ < 1.  ... 
arXiv:1810.05921v1 fatcat:z45hqufitzbljjadfimzpcqdje
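The double-oracle loop referenced above can be sketched on a toy zero-sum matrix game (payoff numbers are made up; as a simplification, the restricted game is "solved" by uniform mixing over discovered strategies, whereas the full method computes an exact equilibrium of the restricted game):

```python
import numpy as np

# Toy double-oracle-style loop: grow each player's strategy set with a
# best response to the opponent's current restricted-game mixture, and
# stop once no new strategy (no further harmful attacker policy) appears.
payoff = np.array([[3.0, 0.0],
                   [1.0, 2.0]])  # rows: attacker actions, cols: defender

atk_set, def_set = {0}, {0}
for _ in range(10):
    atk_mix = np.zeros(2); atk_mix[list(atk_set)] = 1 / len(atk_set)
    def_mix = np.zeros(2); def_mix[list(def_set)] = 1 / len(def_set)
    # Oracles: best pure response to the opponent's current mixture.
    new_atk = int(np.argmax(payoff @ def_mix))
    new_def = int(np.argmin(atk_mix @ payoff))
    if new_atk in atk_set and new_def in def_set:
        break  # no new harmful attacker policy discovered, so stop
    atk_set.add(new_atk)
    def_set.add(new_def)
print(sorted(atk_set), sorted(def_set))  # [0] [0, 1]
```

Here the defender's set grows to cover the discovered attack, after which neither oracle finds an improving strategy and the loop terminates, mirroring the stopping condition in the abstract.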

Learning Adversarially Robust Policies in Multi-Agent Games [article]

Eric Zhao, Alexander R. Trott, Caiming Xiong, Stephan Zheng
2022 arXiv   pre-print
Predicting the worst-case outcome of a policy is thus an equilibrium selection problem -- one known to be generally NP-hard.  ...  We show that worst-case coarse-correlated equilibria can be efficiently approximated in smooth games and propose a framework that uses the worst-case evaluation scheme to learn robust player policies.  ...  Sampling worst-case members of ε-CCEs is not significantly harder than sampling worst-case members of strict CCEs (as guaranteed by Theorem 1).  ... 
arXiv:2106.05492v2 fatcat:wmxs4p2oyvebdfhrzwbq5s6xvm

A Short Survey on Probabilistic Reinforcement Learning [article]

Reazul Hasan Russel
2019 arXiv   pre-print
It is important for the agent to explore suboptimal actions as well as to pick actions with the highest known rewards.  ...  In this paper, we present a brief survey of methods available in the literature for balancing the exploration–exploitation trade-off and computing robust solutions from fixed samples in reinforcement learning  ...  The worst-case total expected reward under any policy π over this ambiguity set then provides a valid lower bound on the expected total reward with confidence at least 1 − δ.  ... 
arXiv:1901.07010v1 fatcat:ed2uhb6umbgx3cq4zjigfre5oa
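The lower-bound construction in the last snippet can be made concrete on a tiny example (the two-state MDP, its models, and all numbers are hypothetical): evaluate one fixed policy under each transition model in a small ambiguity set by solving the linear system (I − γP)v = r, then take the minimum.

```python
import numpy as np

# Illustrative sketch: the minimum value across plausible models is a
# conservative (robust) lower bound on the policy's true value.
gamma = 0.9
r = np.array([1.0, 0.0])                      # reward under the fixed policy
ambiguity_set = [
    np.array([[0.9, 0.1], [0.2, 0.8]]),       # plausible model 1
    np.array([[0.7, 0.3], [0.1, 0.9]]),       # plausible model 2
]
values = [np.linalg.solve(np.eye(2) - gamma * P, r) for P in ambiguity_set]
worst = min(float(v.min()) for v in values)   # worst case over models, states
print(round(worst, 3))
```

If the true model lies in the ambiguity set with probability at least 1 − δ, this minimum holds as a valid lower bound with the same confidence, which is exactly the statement quoted above.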

Feasible Adversarial Robust Reinforcement Learning for Underspecified Environments [article]

JB Lanier, Stephen McAleer, Pierre Baldi, Roy Fox
2022 arXiv   pre-print
Robust reinforcement learning (RL) considers the problem of learning policies that perform well in the worst case among a set of possible environment parameter values.  ...  In real-world environments, choosing the set of possible values for robust RL can be a difficult task.  ...  Figure 4: (a-c) Worst-case MuJoCo Hopper reward among task parameters in the feasible set F λ as a function of PSRO iterations for FARR and other baselines with multiple values of λ.  ... 
arXiv:2207.09597v1 fatcat:p76pfzyr7zf5jnilnx75jow4hy

Mixed Strategies for Robust Optimization of Unknown Objectives [article]

Pier Giuseppe Sessa, Ilija Bogunovic, Maryam Kamgarpour, Andreas Krause
2020 arXiv   pre-print
GP-MRO seeks to discover a robust and randomized mixed strategy, that maximizes the worst-case expected objective value.  ...  Our theoretical results characterize the number of samples required by GP-MRO to discover a robust near-optimal mixed strategy for different GP kernels of interest.  ...  StableOpt discovers a deterministic solution that is robust with respect to the worst-case realization of the uncertain parameter.  ... 
arXiv:2002.12613v2 fatcat:egt4hxnbcngclc4j3kgqibyfry
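The max-min objective in the snippet above, maximizing the worst-case expected value over randomized strategies, can be shown on a toy payoff table (a plain grid search stands in for GP-MRO's Bayesian-optimization machinery, and the payoff matrix is made up):

```python
import numpy as np

# payoff[i, j]: objective for decision i under adversarial parameter j.
# A mixed strategy p over decisions is scored by its worst case over j.
payoff = np.array([[1.0, -1.0],
                   [-1.0, 1.0]])

best_p, best_val = None, -np.inf
for p in np.linspace(0.0, 1.0, 1001):
    mix = np.array([p, 1.0 - p])
    worst = float((mix @ payoff).min())  # adversary picks the worst column
    if worst > best_val:
        best_p, best_val = float(p), worst
print(round(best_p, 3), round(best_val, 3))  # 0.5 0.0
```

Randomizing 50/50 guarantees a value of 0, while either deterministic choice can be forced down to −1; this gap is why a mixed strategy can strictly beat the best deterministic robust solution (the StableOpt-style solution mentioned in the snippet).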

Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning [article]

You Qiaoben, Chengyang Ying, Xinning Zhou, Hang Su, Jun Zhu, Bo Zhang
2021 arXiv   pre-print
In the first stage, we train a deceptive policy by hacking the environment, and discover a set of trajectories routing to the lowest reward or the worst-case performance.  ...  Despite the efficiency of previous optimization-based methods for generating adversarial noise in supervised learning, such methods might not be able to achieve the lowest cumulative reward since they  ...  To address this issue, we first introduce a deceptive policy to explore the worst case in the environment that can minimize the accumulated reward.  ... 
arXiv:2106.15860v2 fatcat:p6bpqcwx7javff2yhkl7bv2uwm
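The "deceptive policy" idea in the first stage reduces to training with a negated reward, which can be sketched as an environment wrapper (the class names and the old-style step() signature are hypothetical, not the paper's code):

```python
# Reuse any standard RL training loop unchanged, but negate the reward,
# so the learned policy steers trajectories toward the lowest cumulative
# reward, i.e. the worst-case performance.
class NegatedRewardEnv:
    def __init__(self, env):
        self.env = env

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, -reward, done, info  # maximizing -r minimizes return


class _DummyEnv:  # stand-in environment for a quick check
    def reset(self):
        return 0

    def step(self, action):
        return 0, 1.0, True, {}


env = NegatedRewardEnv(_DummyEnv())
print(env.step(0)[1])  # -1.0
```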

Long-term fairness with bounded worst-case losses

Gabriel Balan, Dana Richards, Sean Luke
2009 Autonomous Agents and Multi-Agent Systems  
We formulate the problem for the situation where the sequence of action choices continues forever; this problem may be reduced to a set of linear programs.  ...  We examine approaches to discovering sequences of actions for which the worst-off beneficiaries are treated maximally well, then secondarily the second-worst-off, and so on.  ...  Acknowledgements We thank Octav Olteanu, Joey Harrison, Zoran Duric, Alexei Samsonovich, and Alex Brodsky for their help.  ... 
doi:10.1007/s10458-009-9106-9 fatcat:lsa7q2vysrgebklld7e66ul6ly
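The reduction to linear programs mentioned in the abstract can be sketched in the standard maximin form (notation illustrative, not necessarily the paper's): choose long-run action frequencies x so that the worst-off beneficiary's utility is maximized,

```latex
\max_{x \in \Delta(A),\, z} \; z
\qquad \text{s.t.} \qquad
z \le \sum_{a \in A} x_a \, u_a(b) \quad \forall\, b \in B,
```

where u_a(b) is the utility that action a yields beneficiary b. Fixing the optimal z for the worst-off beneficiary and re-solving for the next beneficiary yields the lexicographic ("then secondarily the second-worst-off") refinement described above.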

Safe Policy Improvement with Baseline Bootstrapping [article]

Romain Laroche, Paul Trichelair, Rémi Tachet des Combes
2019 arXiv   pre-print
Finally, we implement a model-free version of SPIBB and show its benefits on a navigation task with a deep RL implementation called SPIBB-DQN, which is, to the best of our knowledge, the first RL algorithm  ...  This paper considers Safe Policy Improvement (SPI) in Batch Reinforcement Learning (Batch RL): from a fixed dataset and without direct access to the true environment, train a policy that is guaranteed  ...  Π = {π : X → Δ_A} denotes the set of stochastic policies, with Δ_A the set of probability distributions over the set of actions A.  ... 
arXiv:1712.06924v5 fatcat:q7vb7w3ugvdtpdgrpdv7ghoe2e
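The baseline-bootstrapping idea can be sketched in a much-simplified form (this is not the full SPIBB algorithm; the function name, threshold n_min, and the greedy reallocation are illustrative): actions observed too rarely in the batch keep the baseline policy's probability, and only the remaining mass is improved.

```python
import numpy as np

# Simplified sketch: bootstrap to the baseline on uncertain actions,
# place the leftover probability mass greedily on the best
# well-estimated action.
def spibb_step(baseline_probs, counts, q_values, n_min=10):
    probs = np.asarray(baseline_probs, dtype=float)
    rare = np.asarray(counts) < n_min            # uncertain, bootstrapped
    new = np.where(rare, probs, 0.0)             # keep baseline mass there
    safe = ~rare
    if safe.any():
        q = np.where(safe, np.asarray(q_values, dtype=float), -np.inf)
        new[int(np.argmax(q))] += 1.0 - new.sum()  # greedy with free mass
    return new

# Action 2 looks great (Q = 9) but was seen only twice, so it keeps its
# baseline probability instead of absorbing all the mass.
print(spibb_step([0.5, 0.3, 0.2], counts=[50, 40, 2], q_values=[1.0, 2.0, 9.0]))
```

Because poorly estimated actions never gain probability beyond the baseline, the resulting policy cannot be much worse than the baseline on the parts of the state space the data does not cover, which is the safety intuition behind SPI.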

Continuous-Time Fitted Value Iteration for Robust Policies [article]

Michael Lutter, Boris Belousov, Shie Mannor, Dieter Fox, Animesh Garg, Jan Peters
2021 arXiv   pre-print
In the case of the Hamilton-Jacobi-Isaacs equation, which includes an adversary controlling the environment and minimizing the reward, the obtained policy is also robust to perturbations of the dynamics  ...  Especially for continuous control, solving this differential equation and its extension the Hamilton-Jacobi-Isaacs equation, is important as it yields the optimal policy that achieves the maximum reward  ...  ACKNOWLEDGMENTS The research was partially conducted during the internship of M. Lutter at NVIDIA. M. Lutter, B. Belousov and J.  ... 
arXiv:2110.01954v1 fatcat:wjmsuj7l5zcwnhfdsq4h7avmva
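For reference, the two differential equations named here take the following common textbook forms (with discount ρ, dynamics f, reward r, control u, and adversary v; this is a standard statement, not necessarily the paper's exact formulation):

```latex
\rho V(x) = \max_{u} \big[ r(x, u) + \nabla_x V(x)^\top f(x, u) \big]
\qquad \text{(HJB)}
```

```latex
\rho V(x) = \max_{u} \min_{v} \big[ r(x, u, v) + \nabla_x V(x)^\top f(x, u, v) \big]
\qquad \text{(HJI)}
```

The inner minimization over the adversary v is what makes the policy recovered from the HJI value function robust to perturbations of the dynamics, as the abstract notes.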

Resolving Spurious Correlations in Causal Models of Environments via Interventions [article]

Sergei Volodin, Nevan Wichers, Jeremy Nixon
2020 arXiv   pre-print
The experimental results in a grid-world environment show that our approach leads to better causal models compared to baselines: learning the model on data from a random policy or a policy trained on the  ...  We consider the problem of inferring a causal model of a reinforcement learning environment and we propose a method to deal with spurious correlations.  ...  We reward the agent for setting a target node f_i to a target value x.  ... 
arXiv:2002.05217v2 fatcat:6ve3jdgq5ng2xcom2msxl2wlcy

Towards Mixed Optimization for Reinforcement Learning with Program Synthesis [article]

Surya Bhupatiraju, Kumar Krishna Agrawal, Rishabh Singh
2018 arXiv   pre-print
We instantiate MORL for the simple CartPole problem and show that the programmatic representation allows for high-level modifications that in turn lead to improved learning of the policies.  ...  Concretely, we propose to use synthesis techniques to obtain a symbolic representation of the learned policy, which can then be debugged manually or automatically using program repair.  ...  by the policy π for all (or a sampled set of) input states S.  ... 
arXiv:1807.00403v2 fatcat:yen3rmixgzfinmtmcgvhchb52m
Showing results 1–15 of 36,672.