Discovering a set of policies for the worst case reward
[article]
2021
arXiv
pre-print
Our main contribution is a policy iteration algorithm that builds a set of policies in order to maximize the worst-case performance of the resulting SMP (set-max policy) on the set of tasks. ...
We show that the worst-case performance of the resulting SMP strictly improves at each iteration, and the algorithm only stops when there does not exist a policy that leads to improved performance. ...
For the special case where the SFs associated with any policy are in the simplex, the value of the SMP w.r.t. the worst case reward for any set of policies is less than or equal to −1/√d. ...
arXiv:2102.04323v2
fatcat:36aek7m7uffwdg7w5xoq4dpwaa
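The −1/√d bound quoted in the snippet above follows quickly under the usual successor-features assumptions (rewards linear in features with a unit-norm weight vector); the derivation below is a sketch of that argument, not a verbatim statement from the paper.

```latex
% Sketch only: assumes rewards r_w(s,a) = w^T phi(s,a) with ||w||_2 <= 1 and
% successor features psi^pi; the paper's exact notation may differ.
% Against the worst-case reward weights, the set-max policy built from a
% policy set Pi achieves
\[
  \min_{\|w\|_2 \le 1} \; \max_{\pi \in \Pi} \; w^{\top}\psi^{\pi}.
\]
% If every psi^pi lies in the d-dimensional simplex, the adversary may pick
% w = -(1/\sqrt{d})\,\mathbf{1}, which has unit norm, giving
\[
  \min_{\|w\|_2 \le 1} \max_{\pi \in \Pi} w^{\top}\psi^{\pi}
  \;\le\; \max_{\pi \in \Pi}\Big(\!-\tfrac{1}{\sqrt{d}}\Big)\mathbf{1}^{\top}\psi^{\pi}
  \;=\; -\tfrac{1}{\sqrt{d}}
\]
% for any set of policies Pi, which is the bound cited in the snippet.
```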
Clinician-in-the-Loop Decision Making: Reinforcement Learning with Near-Optimal Set-Valued Policies
[article]
2020
arXiv
pre-print
We consider an alternative objective -- learning set-valued policies to capture near-equivalent actions that lead to similar cumulative rewards. ...
However, in healthcare settings, many actions may be near-equivalent with respect to the reward (e.g., survival). ...
arXiv:2007.12678v1
fatcat:a4k5hgwhyre5zbgqgpy4pddpwe
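One concrete way to realize the "near-equivalent actions" idea from this entry is a near-greedy set-valued policy that returns every action whose value is within a tolerance of the best one. The sketch below is illustrative; the paper's exact near-equivalence criterion, and the names q_values and zeta, are assumptions rather than its method.

```python
# Hedged sketch: a set-valued policy that returns all actions whose Q-value
# is within a tolerance `zeta` of the best action. Illustrative only.
import numpy as np

def near_greedy_action_set(q_values: np.ndarray, zeta: float) -> np.ndarray:
    """Indices of all actions within `zeta` of the maximum value,
    i.e. the set of near-equivalent actions."""
    best = q_values.max()
    return np.flatnonzero(q_values >= best - zeta)

# Example: survival-like rewards where several treatments are near-equivalent.
q = np.array([0.91, 0.90, 0.55, 0.89])
print(near_greedy_action_set(q, zeta=0.02))  # -> [0 1 3]
```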
Unsupervised Meta-Learning for Reinforcement Learning
[article]
2020
arXiv
pre-print
In this work, we take a step in this direction, proposing a family of unsupervised meta-learning algorithms for reinforcement learning. ...
We motivate and describe a general recipe for unsupervised meta-reinforcement learning, and present an instantiation of this approach. ...
A. Proofs. Lemma 1: Let π be a policy for which ρ_π^T(s) is uniform. Then π has lowest worst-case regret. Proof of Lemma 1. ...
arXiv:1806.04640v3
fatcat:p4exyyqnnbesvjatymjqhiz2du
Adaptive Routing with Guaranteed Delay Bounds using Safe Reinforcement Learning
2020
Proceedings of the 28th International Conference on Real-Time Networks and Systems
For known typical and worst-case delays, an algorithm was presented to (statically) determine the policy to be followed during the packet transmission in terms of edge choices. ...
In this paper we relax the assumption of knowing the typical delay, and we assume only worst-case bounds are available. ...
A similar methodology of restricting exploration space is to have a finite set of demonstrations to discover the state space. ...
doi:10.1145/3394810.3394815
dblp:conf/rtns/SeetanadiAM20
fatcat:xadanpo74zfarfz72lbiprpkga
Two Can Play That Game: An Adversarial Evaluation of a Cyber-alert Inspection System
[article]
2018
arXiv
pre-print
This made the defender robust to the discovered attacker policy and no further harmful attacker policies were discovered. ...
We use a double oracle approach to retrain the defender with episodes from this discovered attacker policy. ...
set the baseline value of avgTTA/hr for a CSOC with ρ < 1. ...
arXiv:1810.05921v1
fatcat:z45hqufitzbljjadfimzpcqdje
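The snippets above mention a double oracle approach; the skeleton below sketches the generic loop (restricted zero-sum game solved by LP, best-response oracles supplied by the caller). In the paper the oracles appear to be RL retraining runs against the discovered opponent policy; all names here are illustrative, not the paper's code.

```python
# Hedged sketch of a generic double-oracle loop. The best-response oracles
# and the payoff function are abstract callables supplied by the caller.
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(A: np.ndarray) -> np.ndarray:
    """Maximin mixed strategy for the row player of payoff matrix A."""
    m, n = A.shape
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # maximize the game value v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])      # v <= (x^T A)_j for each column j
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[:m]

def double_oracle(payoff, br_defender, br_attacker, init_def, init_att, iters=20):
    defenders, attackers = [init_def], [init_att]
    for _ in range(iters):
        A = np.array([[payoff(d, a) for a in attackers] for d in defenders])
        x = solve_zero_sum(A)        # defender mixture over current defenders
        y = solve_zero_sum(-A.T)     # attacker mixture over current attackers
        new_d = br_defender(attackers, y)   # best response to the attacker mixture
        new_a = br_attacker(defenders, x)   # best response to the defender mixture
        defenders.append(new_d)
        attackers.append(new_a)
        # A real implementation stops once neither best response improves
        # the value of the restricted game.
    return defenders, attackers
```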
Learning Adversarially Robust Policies in Multi-Agent Games
[article]
2022
arXiv
pre-print
Predicting the worst-case outcome of a policy is thus an equilibrium selection problem -- one known to be generally NP-Hard. ...
We show that worst-case coarse-correlated equilibria can be efficiently approximated in smooth games and propose a framework that uses the worst-case evaluation scheme to learn robust player policies. ...
Sampling worst-case members of ε-CCEs is not significantly harder than sampling worst-case members of strict CCEs (as guaranteed by Theorem 1). ...
arXiv:2106.05492v2
fatcat:wmxs4p2oyvebdfhrzwbq5s6xvm
A Short Survey on Probabilistic Reinforcement Learning
[article]
2019
arXiv
pre-print
It is important for the agent to explore suboptimal actions as well as to pick actions with highest known rewards. ...
In this paper, we present a brief survey of methods available in the literature for balancing exploration-exploitation trade off and computing robust solutions from fixed samples in reinforcement learning ...
The worst-case total expected reward under any policy π over this ambiguity set then provides a valid lower bound on the expected total reward, with confidence at least 1 − δ. ...
arXiv:1901.07010v1
fatcat:ed2uhb6umbgx3cq4zjigfre5oa
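The lower-bound statement in the last snippet can be written out as follows in generic robust-MDP notation (the survey's own symbols may differ); it assumes the ambiguity set 𝒫_δ is built from samples so that the true model lies inside it with probability at least 1 − δ.

```latex
% Sketch in generic notation: P* is the true transition model and
% \mathcal{P}_\delta an ambiguity set containing it with probability >= 1-\delta.
\[
  \Pr\!\Big[\;
    \min_{P \in \mathcal{P}_\delta}
      \mathbb{E}_P^{\pi}\Big[\textstyle\sum_{t} \gamma^{t} r_t\Big]
    \;\le\;
      \mathbb{E}_{P^*}^{\pi}\Big[\textstyle\sum_{t} \gamma^{t} r_t\Big]
  \;\Big] \;\ge\; 1-\delta
  \quad \text{for every policy } \pi .
\]
```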
Feasible Adversarial Robust Reinforcement Learning for Underspecified Environments
[article]
2022
arXiv
pre-print
Robust reinforcement learning (RL) considers the problem of learning policies that perform well in the worst case among a set of possible environment parameter values. ...
In real-world environments, choosing the set of possible values for robust RL can be a difficult task. ...
Figure 4: (a-c) Worst-case MuJoCo Hopper reward among task parameters in the feasible set F_λ as a function of PSRO iterations for FARR and other baselines with multiple values of λ. ...
arXiv:2207.09597v1
fatcat:p76pfzyr7zf5jnilnx75jow4hy
Mixed Strategies for Robust Optimization of Unknown Objectives
[article]
2020
arXiv
pre-print
GP-MRO seeks to discover a robust, randomized mixed strategy that maximizes the worst-case expected objective value. ...
Our theoretical results characterize the number of samples required by GP-MRO to discover a robust near-optimal mixed strategy for different GP kernels of interest. ...
StableOpt discovers a deterministic solution that is robust with respect to the worst-case realization of the uncertain parameter. ...
arXiv:2002.12613v2
fatcat:egt4hxnbcngclc4j3kgqibyfry
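The contrast drawn in these snippets between StableOpt's deterministic robust point and GP-MRO's randomized mixed strategy can be written out as follows; the notation is generic rather than the paper's.

```latex
% Sketch: X is the decision set, \Delta the uncertainty set, f the unknown
% objective, and \mathcal{P}(X) the set of distributions (mixed strategies) on X.
\[
  \text{StableOpt:}\quad
    \max_{x \in X} \; \min_{\delta \in \Delta} \; f(x, \delta),
  \qquad
  \text{GP-MRO:}\quad
    \max_{p \in \mathcal{P}(X)} \; \min_{\delta \in \Delta} \;
      \mathbb{E}_{x \sim p}\big[f(x, \delta)\big].
\]
```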
Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning
[article]
2021
arXiv
pre-print
In the first stage, we train a deceptive policy by hacking the environment, and discover a set of trajectories routing to the lowest reward or the worst-case performance. ...
Despite the efficiency of previous optimization-based methods for generating adversarial noise in supervised learning, such methods might not be able to achieve the lowest cumulative reward since they ...
To address this issue, we first introduce a deceptive policy to explore the worst case in the environment that can minimize the accumulated reward. ...
arXiv:2106.15860v2
fatcat:p6bpqcwx7javff2yhkl7bv2uwm
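A minimal way to obtain the "deceptive policy" of the first stage described above is to train any standard RL algorithm with the reward sign flipped, so that maximizing the wrapped reward minimizes the original cumulative reward. The wrapper below is an illustrative sketch for a gym-style environment; the paper's actual training procedure may differ.

```python
# Hedged sketch: wrap any gym-style environment (reset()/step()) so that the
# reward is negated; training on the wrapped env yields trajectories that
# approximate worst-case (lowest-reward) behaviour. Names are illustrative.
class NegatedRewardEnv:
    def __init__(self, env):
        self.env = env

    def reset(self, **kwargs):
        return self.env.reset(**kwargs)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        return obs, -reward, done, info   # flip the sign of the reward

# Usage: train_agent(NegatedRewardEnv(make_env())) with any RL trainer.
```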
Long-term fairness with bounded worst-case losses
2009
Autonomous Agents and Multi-Agent Systems
We formulate the problem for the situation where the sequence of action choices continues forever; this problem may be reduced to a set of linear programs. ...
We examine approaches to discovering sequences of actions for which the worst-off beneficiaries are treated maximally well, then secondarily the second-worst-off, and so on. ...
Acknowledgements We thank Octav Olteanu, Joey Harrison, Zoran Duric, Alexei Samsonovich, and Alex Brodsky for their help. ...
doi:10.1007/s10458-009-9106-9
fatcat:lsa7q2vysrgebklld7e66ul6ly
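The reduction mentioned in the first snippet can be sketched as a single max-min linear program for the worst-off beneficiary (the paper then iterates lexicographically for the second-worst-off, and so on); r_{a,i} and the randomized action choice x are illustrative notation, not the paper's.

```latex
% Sketch only: x_a is the probability of choosing action a, r_{a,i} the reward
% that action a yields to beneficiary i, and z the welfare of the worst-off.
\[
  \max_{x,\; z} \; z
  \quad \text{s.t.} \quad
  \sum_{a} x_a \, r_{a,i} \;\ge\; z \;\; \forall i,
  \qquad
  \sum_{a} x_a = 1, \qquad x \ge 0 .
\]
```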
Safe Policy Improvement with Baseline Bootstrapping
[article]
2019
arXiv
pre-print
Finally, we implement a model-free version of SPIBB and show its benefits on a navigation task with a deep RL implementation called SPIBB-DQN, which is, to the best of our knowledge, the first RL algorithm ...
This paper considers Safe Policy Improvement (SPI) in Batch Reinforcement Learning (Batch RL): from a fixed dataset and without direct access to the true environment, train a policy that is guaranteed ...
Π = {π : X → Δ_A} denotes the set of stochastic policies, with Δ_A the set of probability distributions over the set of actions A. ...
arXiv:1712.06924v5
fatcat:q7vb7w3ugvdtpdgrpdv7ghoe2e
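The core bootstrapping idea behind SPIBB, as summarized in these snippets, can be sketched as follows: keep the baseline policy's probabilities on rarely observed state-action pairs and put the remaining probability mass on the best estimated action. This is a simplified illustration; the array names and the count threshold n_wedge are assumptions, not the paper's code.

```python
# Hedged, simplified sketch of the SPIBB bootstrapping rule (see the paper
# for the exact algorithm). pi_b, q, counts: arrays of shape
# (n_states, n_actions); n_wedge: minimum count for a pair to be trusted.
import numpy as np

def spibb_like_policy(pi_b: np.ndarray, q: np.ndarray,
                      counts: np.ndarray, n_wedge: int) -> np.ndarray:
    pi = np.zeros_like(pi_b)
    for s in range(pi_b.shape[0]):
        rare = counts[s] < n_wedge                # "bootstrapped" pairs
        pi[s, rare] = pi_b[s, rare]               # copy the baseline there
        free_mass = 1.0 - pi[s, rare].sum()
        if (~rare).any() and free_mass > 0:
            best = np.flatnonzero(~rare)[np.argmax(q[s, ~rare])]
            pi[s, best] += free_mass              # greedy on well-estimated pairs
        else:
            pi[s] = pi_b[s]                       # nothing well-estimated: keep baseline
    return pi
```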
Continuous-Time Fitted Value Iteration for Robust Policies
[article]
2021
arXiv
pre-print
In the case of the Hamilton-Jacobi-Isaacs equation, which includes an adversary controlling the environment and minimizing the reward, the obtained policy is also robust to perturbations of the dynamics ...
Especially for continuous control, solving this differential equation and its extension, the Hamilton-Jacobi-Isaacs equation, is important as it yields the optimal policy that achieves the maximum reward ...
ACKNOWLEDGMENTS The research was partially conducted during the internship of M. Lutter at NVIDIA. M. Lutter, B. Belousov and J. ...
arXiv:2110.01954v1
fatcat:wjmsuj7l5zcwnhfdsq4h7avmva
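For reference, one common form of the two equations named in these snippets is written below; sign and discounting conventions vary across papers, so this is a sketch rather than the paper's exact statement.

```latex
% Sketch: dynamics \dot{x} = f(x,u) (or f(x,u,\xi) with adversarial input \xi),
% reward r, and discount rate \rho > 0.
\[
  \text{HJB:}\quad
    \rho\, V(x) \;=\; \max_{u} \Big[\, r(x, u) + \nabla V(x)^{\top} f(x, u) \,\Big],
\]
\[
  \text{HJI:}\quad
    \rho\, V(x) \;=\; \max_{u} \min_{\xi}
      \Big[\, r(x, u, \xi) + \nabla V(x)^{\top} f(x, u, \xi) \,\Big].
\]
```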
Resolving Spurious Correlations in Causal Models of Environments via Interventions
[article]
2020
arXiv
pre-print
The experimental results in a grid-world environment show that our approach leads to better causal models compared to baselines: learning the model on data from a random policy or a policy trained on the ...
We consider the problem of inferring a causal model of a reinforcement learning environment and we propose a method to deal with spurious correlations. ...
We reward the agent for setting a target node f_i to a target value x. ...
arXiv:2002.05217v2
fatcat:6ve3jdgq5ng2xcom2msxl2wlcy
Towards Mixed Optimization for Reinforcement Learning with Program Synthesis
[article]
2018
arXiv
pre-print
We instantiate MORL for the simple CartPole problem and show that the programmatic representation allows for high-level modifications that in turn lead to improved learning of the policies. ...
Concretely, we propose to use synthesis techniques to obtain a symbolic representation of the learned policy, which can then be debugged manually or automatically using program repair. ...
by the policy π for all (or a sampled set of) input states S. ...
arXiv:1807.00403v2
fatcat:yen3rmixgzfinmtmcgvhchb52m