1,278 Hits in 4.0 sec

Reward learning from human preferences and demonstrations in Atari [article]

Borja Ibarz and Jan Leike and Tobias Pohlen and Geoffrey Irving and Shane Legg and Dario Amodei
2018 arXiv   pre-print
In this work, we combine two approaches to learning from human feedback: expert demonstrations and trajectory preferences.  ...  We train a deep neural network to model the reward function and use its predicted reward to train a DQN-based deep reinforcement learning agent on 9 Atari games.  ...  Moreover, we thank Elizabeth Barnes for proofreading the paper and Ashwin Kakarla, Ethel Morgan, and Yannis Assael for helping us set up the human experiments.  ... 
arXiv:1811.06521v1 fatcat:pxw5cgmnrbbsxluwlaa4p3ggja
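The core objective in this line of work is fitting a reward model to pairwise preferences over trajectory segments via the Bradley-Terry model. A minimal sketch (the function names and the segment-as-list-of-states representation are illustrative, not from the paper's code):

```python
import math

def segment_return(reward_fn, segment):
    """Sum of predicted rewards over a trajectory segment."""
    return sum(reward_fn(s) for s in segment)

def preference_loss(reward_fn, seg_a, seg_b, prefer_a):
    """Cross-entropy loss under the Bradley-Terry model:
    P(seg_a > seg_b) = exp(R_a) / (exp(R_a) + exp(R_b))."""
    r_a = segment_return(reward_fn, seg_a)
    r_b = segment_return(reward_fn, seg_b)
    p_a = math.exp(r_a) / (math.exp(r_a) + math.exp(r_b))
    return -math.log(p_a) if prefer_a else -math.log(1.0 - p_a)
```

In practice `reward_fn` is a deep network trained by gradient descent on this loss, and its predictions then stand in for the environment reward when training the DQN agent.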

Extrapolating Beyond Suboptimal Demonstrations via Inverse Reinforcement Learning from Observations [article]

Daniel S. Brown, Wonjoon Goo, Prabhat Nagarajan, Scott Niekum
2019 arXiv   pre-print
In this paper, we introduce a novel reward-learning-from-observation algorithm, Trajectory-ranked Reward EXtrapolation (T-REX), that extrapolates beyond a set of (approximately) ranked demonstrations in  ...  When combined with deep reinforcement learning, T-REX outperforms state-of-the-art imitation learning and IRL methods on multiple Atari and MuJoCo benchmark tasks and achieves performance that is often  ...  PeARL research is supported in part by the NSF (IIS-1724157, IIS-1638107, IIS-1617639, IIS-1749204) and ONR (N00014-18-2243).  ... 
arXiv:1904.06387v5 fatcat:rglnjfhb2zg5reugvmxgv4sofi
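T-REX turns a ranked set of demonstrations into pairwise training signal: every (worse, better) pair contributes a softmax cross-entropy term pushing the learned reward to assign higher return to the better segment. A minimal sketch of that pairwise objective (the `trex_loss` name and plain-Python reward function are illustrative assumptions):

```python
import math

def trex_loss(reward_fn, ranked_segments):
    """Pairwise ranking loss over demonstrations ordered worst-to-best:
    for each pair (i, j) with i < j, segment j should receive higher
    predicted return. Uses -log P(j preferred) under Bradley-Terry,
    which simplifies to log(1 + exp(r_i - r_j))."""
    total, count = 0.0, 0
    for i in range(len(ranked_segments)):
        for j in range(i + 1, len(ranked_segments)):
            r_i = sum(map(reward_fn, ranked_segments[i]))
            r_j = sum(map(reward_fn, ranked_segments[j]))
            total += math.log(1.0 + math.exp(r_i - r_j))
            count += 1
    return total / count
```

Because the loss only depends on relative returns, the learned reward can assign higher values to behavior better than anything demonstrated, which is what enables extrapolation beyond suboptimal demonstrations.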

Deep reinforcement learning from human preferences [article]

Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, Dario Amodei
2017 arXiv   pre-print
These behaviors and environments are considerably more complex than any that have been previously learned from human feedback.  ...  In this work, we explore goals defined in terms of (non-expert) human preferences between pairs of trajectory segments.  ...  Finally, we thank OpenAI and DeepMind for providing a supportive research environment and for supporting and encouraging this collaboration.  ... 
arXiv:1706.03741v3 fatcat:b2phuyaq7fay7chweuqdkbo4ae

Multi-Preference Actor Critic [article]

Ishan Durugkar, Matthew Hausknecht, Adith Swaminathan, Patrick MacAlpine
2019 arXiv   pre-print
Experiments in Atari and Pendulum verify that constraints are being respected and can accelerate the learning process.  ...  However, for most Reinforcement Learning tasks, humans can provide additional insight to constrain the policy learning.  ...  We examined four different preferences in the M-PAC framework and experimentally evaluated it on Pendulum and Atari environments, and validated that access to even non-expert human demonstrations helps  ... 
arXiv:1904.03295v1 fatcat:wuyfroevgjgz7can73jb2vpxqq

Better-than-Demonstrator Imitation Learning via Automatically-Ranked Demonstrations [article]

Daniel S. Brown, Wonjoon Goo, Scott Niekum
2019 arXiv   pre-print
We empirically validate our approach on simulated robot and Atari imitation learning benchmarks and show that D-REX outperforms standard imitation learning approaches and can significantly surpass the  ...  D-REX is the first imitation learning approach to achieve significant extrapolation beyond the demonstrator's performance without additional side-information or supervision, such as rewards or human preferences  ...  PeARL research is supported in part by the NSF (IIS-1724157, IIS-1638107, IIS-1617639, IIS-1749204) and ONR (N00014-18-2243).  ... 
arXiv:1907.03976v3 fatcat:lldvi7dsjnhe3dxirzfdursdve

Understanding Learned Reward Functions [article]

Eric J. Michaud, Adam Gleave, Stuart Russell
2020 arXiv   pre-print
In such cases, a reward function must instead be learned from interacting with and observing humans.  ...  Absent significant advances in reward learning, it is thus important to be able to audit learned reward functions to verify whether they truly capture user preferences.  ...  Acknowledgments and Disclosure of Funding We thank Cody Wild for her feedback on previous drafts. We thank researchers at the Center for Human-Compatible AI for helpful discussions.  ... 
arXiv:2012.05862v1 fatcat:77tebwasinepnk25ud3cypt7da

Inverse reinforcement learning for video games [article]

Aaron Tucker and Adam Gleave and Stuart Russell
2018 arXiv   pre-print
Inverse reinforcement learning (IRL) algorithms can infer a reward from demonstrations in low-dimensional continuous control environments, but there has been little work on applying IRL to high-dimensional  ...  Deep reinforcement learning achieves superhuman performance in a range of video game environments, but requires that a designer manually specify a reward function.  ...  Acknowledgments This work was supported by the Center for Human-Compatible AI and the Open Philanthropy Project, the Future of Life Institute and the Leverhulme Trust.  ... 
arXiv:1810.10593v1 fatcat:t6co2wtxtfa6jfgoyipt6jhcn4

Deep Bayesian Reward Learning from Preferences [article]

Daniel S. Brown, Scott Niekum
2019 arXiv   pre-print
Using samples from the posterior, we demonstrate how to calculate high-confidence bounds on policy performance in the imitation learning setting, in which the ground-truth reward function is unknown.  ...  Our approach uses successor feature representations and preferences over demonstrations to efficiently generate samples from the posterior distribution over the demonstrator's reward function without requiring  ...  Introduction As robots and other autonomous agents enter our homes, schools, workplaces, and hospitals, it is important that these agents can safely learn from and adapt to a variety of human preferences  ... 
arXiv:1912.04472v1 fatcat:c2ouhzmearhupopywchlvp7ckq

Batch Reinforcement Learning from Crowds [article]

Guoxi Zhang, Hisashi Kashima
2021 arXiv   pre-print
This paper addresses the lack of reward in a batch reinforcement learning setting by learning a reward function from preferences. Generating preferences only requires a basic understanding of a task.  ...  Existing settings for lack of reward, such as behavioral cloning, rely on optimal demonstrations collected from humans.  ...  Evaluations on Atari 2600 games show the efficacy of the proposed model in learning reward functions from noisy preferences, followed by an ablation study on annotator collaboration and smoothing.  ... 
arXiv:2111.04279v1 fatcat:wkvzmd7xhbb35l2ahzgl7v3nqa

Leveraging Human Guidance for Deep Reinforcement Learning Tasks [article]

Ruohan Zhang, Faraz Torabi, Lin Guan, Dana H. Ballard, Peter Stone
2019 arXiv   pre-print
Human knowledge of how to solve these tasks can be incorporated using imitation learning, where the agent learns to imitate human demonstrated decisions.  ...  However, human guidance is not limited to the demonstrations. Other types of guidance could be more suitable for certain tasks and require less human effort.  ...  A portion of this work has taken place in the Learning Agents Research Group (LARG) at UT Austin.  ... 
arXiv:1909.09906v1 fatcat:jprzobqel5cmvczkexojifbfoa

Playing SNES in the Retro Learning Environment [article]

Nadav Bhonker, Shai Rozenberg, Itay Hubara
2017 arXiv   pre-print
In recent years, extensive research was carried out in the field of reinforcement learning and numerous algorithms were introduced, aiming to learn how to perform human tasks such as playing video games  ...  In many games the state-of-the-art algorithms outperform humans.  ...  We've encountered several games in which the learning process is highly dependent on the reward definition. This issue can be addressed and explored in RLE as reward definition can be done easily.  ... 
arXiv:1611.02205v2 fatcat:pgou5e43mjffzbcjivg5lsgjr4

Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences [article]

Daniel S. Brown, Russell Coleman, Ravi Srinivasan, Scott Niekum
2020 arXiv   pre-print
Bayesian REX can learn to play Atari games from demonstrations, without access to the game score and can generate 100,000 samples from the posterior over reward functions in only 5 minutes on a personal  ...  Bayesian reward learning from demonstrations enables rigorous safety and uncertainty analysis when performing imitation learning.  ... 
arXiv:2002.09089v4 fatcat:vk6ebzm2ijesjdp3bahj5cjdgi
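The speed claim comes from reducing Bayesian reward inference to sampling over weights of a linear reward on precomputed trajectory features, so each MCMC step avoids any policy rollouts. A toy Metropolis-Hastings sketch of that idea (the function name, the standard-normal prior, and the 1-D setup are illustrative assumptions, not the paper's exact formulation, which constrains the weight norm):

```python
import math
import random

def mh_posterior_samples(features, pairwise_prefs, n_samples=1000, step=0.1):
    """Metropolis-Hastings over linear reward weights w, with a
    Bradley-Terry likelihood on preferences over precomputed trajectory
    features. features[k] is the feature vector of trajectory k;
    pairwise_prefs holds (worse_index, better_index) pairs."""
    dim = len(features[0])
    w = [0.0] * dim

    def log_post(w):
        total = -0.5 * sum(x * x for x in w)  # standard normal prior
        for worse, better in pairwise_prefs:
            r_w = sum(a * b for a, b in zip(w, features[worse]))
            r_b = sum(a * b for a, b in zip(w, features[better]))
            total += -math.log(1.0 + math.exp(r_w - r_b))
        return total

    samples, cur = [], log_post(w)
    for _ in range(n_samples):
        prop = [x + random.gauss(0, step) for x in w]
        p = log_post(prop)
        if math.log(random.random()) < p - cur:  # MH accept rule
            w, cur = prop, p
        samples.append(list(w))
    return samples
```

Because the likelihood is just a dot product per preference pair, each step is a few multiplications, which is why generating tens of thousands of posterior samples takes minutes rather than hours; the samples then support high-confidence bounds on policy performance.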

A Study of Causal Confusion in Preference-Based Reward Learning [article]

Jeremy Tien, Jerry Zhi-Yang He, Zackory Erickson, Anca D. Dragan, Daniel Brown
2022 arXiv   pre-print
However, in recent years, there has been a growing body of anecdotal evidence that learning reward functions from preferences is prone to spurious correlations and reward gaming or hacking behaviors.  ...  states to actions, we provide the first systematic study of causal confusion in the context of learning reward functions from preferences.  ...  ACKNOWLEDGMENTS We thank the members of the InterACT lab for helpful discussion and advice. This work was supported in part by the NSF NRI, ONR YIP, and NSF CAREER awards.  ... 
arXiv:2204.06601v1 fatcat:wbjmjued4na2hoc53mqqtvs7zi

ToyBox: Better Atari Environments for Testing Reinforcement Learning Agents [article]

John Foley, Emma Tosch, Kaleigh Clary, David Jensen
2019 arXiv   pre-print
Recently, the Arcade Learning Environment (ALE) has become one of the most widely used benchmark suites for deep learning research, and state-of-the-art Reinforcement Learning (RL) agents have been shown  ...  to routinely equal or exceed human performance on many ALE tasks.  ...  Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the United States Air Force.  ... 
arXiv:1812.02850v3 fatcat:yommkjo4q5g3vcjgvi7hpf5lye

First return, then explore [article]

Adrien Ecoffet, Joost Huizinga, Joel Lehman, Kenneth O. Stanley, Jeff Clune
2021 arXiv   pre-print
We also demonstrate the practical potential of Go-Explore on a sparse-reward pick-and-place robotics task.  ...  However, reinforcement learning algorithms struggle when, as is often the case, simple and intuitive rewards provide sparse and deceptive feedback.  ...  Acknowledgements We thank Ashley Edwards, Sanyam Kapoor, Felipe Petroski Such and Jiale Zhi for their ideas, feedback, technical support, and work on aspects of Go-Explore not presented in this work.  ... 
arXiv:2004.12919v4 fatcat:m5in5nokfrgtzdd2gsmuifz7kq
Showing results 1 — 15 out of 1,278 results