4,829 Hits in 2.6 sec

Multimodal Parameter-exploring Policy Gradients

Frank Sehnke, Alex Graves, Christian Osendorfer, Jürgen Schmidhuber
2010 Ninth International Conference on Machine Learning and Applications (ICMLA)
Policy Gradients with Parameter-based Exploration (PGPE) is a novel model-free reinforcement learning method that alleviates the problem of high-variance gradient estimates encountered in normal policy  ...  This paper extends the basic PGPE algorithm to use multimodal mixture distributions for each parameter, while remaining efficient.  ...  Policy Gradients with Parameter-based Exploration (PGPE; [9]) tackles the problem directly by transferring the exploration from action space to parameter space.  ... 
doi:10.1109/icmla.2010.24 dblp:conf/icmla/SehnkeGOS10 fatcat:6cbpsaxav5fsbasrxo6nlywafa
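The parameter-space exploration that PGPE performs can be sketched in a few lines: the parameters of one deterministic policy are drawn from an independent Gaussian per parameter, and the distribution's mean and spread are nudged toward samples whose episodic return beat a baseline. A toy Python sketch (hypothetical function names; the full algorithm also uses symmetric sampling and reward normalization):

```python
import random

def sample_params(mu, sigma):
    """Parameter-based exploration: draw the parameters of one
    deterministic policy from an independent Gaussian per parameter."""
    return [random.gauss(m, s) for m, s in zip(mu, sigma)]

def pgpe_update(mu, sigma, theta, ret, baseline, lr=0.1):
    """Toy PGPE-style update: move (mu, sigma) toward parameter samples
    whose episodic return exceeds the baseline."""
    adv = ret - baseline
    new_mu = [m + lr * adv * (t - m) for m, t in zip(mu, theta)]
    new_sigma = [max(1e-6, s + lr * adv * ((t - m) ** 2 - s * s) / s)
                 for m, s, t in zip(mu, sigma, theta)]
    return new_mu, new_sigma
```

Because the sampled policy is deterministic within an episode, the per-step action noise of ordinary policy gradients disappears, which is where the variance reduction comes from.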

Multimodal Safety-Critical Scenarios Generation for Decision-Making Algorithms Evaluation [article]

Wenhao Ding, Baiming Chen, Bo Li, Kim Ji Eun, Ding Zhao
2020 arXiv   pre-print
The proposed generative model is optimized with weighted likelihood maximization and a gradient-based sampling procedure is integrated to improve the sampling efficiency.  ...  Experiments on a self-driving task demonstrate our advantages in terms of testing efficiency and multimodal modeling capability.  ...  In this paper, our data is collected from on-policy exploration.  ... 
arXiv:2009.08311v3 fatcat:knigednisrhxde3nerd5x2pcbm

Reinforcement Learning with Deep Energy-Based Policies [article]

Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, Sergey Levine
2017 arXiv   pre-print
We use the recently proposed amortized Stein variational gradient descent to learn a stochastic sampling network that approximates samples from this distribution.  ...  We apply our method to learning maximum entropy policies, resulting in a new algorithm, called soft Q-learning, that expresses the optimal policy via a Boltzmann distribution.  ...  In this paper, we explore two potential reasons for this: exploration in the presence of multimodal objectives, and compositionality attained via pretraining.  ... 
arXiv:1702.08165v2 fatcat:ustewselofbzdehsi7ahcraql4
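The Boltzmann form of the maximum-entropy policy mentioned above is easy to illustrate for a discrete action set: π(a|s) ∝ exp(Q(s,a)/α), where α is the temperature. The paper itself handles continuous actions with an amortized sampler; this is only a sketch of the distribution's shape, showing how two equally good actions keep equal probability mass instead of collapsing to one:

```python
import math

def boltzmann_policy(q_values, alpha=1.0):
    """Maximum-entropy (Boltzmann) policy over discrete actions:
    pi(a|s) proportional to exp(Q(s,a)/alpha).  Numerically stabilized
    softmax; illustration only, not the paper's continuous-action sampler."""
    logits = [q / alpha for q in q_values]
    m = max(logits)                      # subtract the max for stability
    w = [math.exp(l - m) for l in logits]
    z = sum(w)
    return [x / z for x in w]
```

With two ties and one bad action, e.g. `boltzmann_policy([1.0, 1.0, -5.0])`, the two good actions receive identical mass, which is the multimodality a greedy policy would lose.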

Learning End-to-end Multimodal Sensor Policies for Autonomous Navigation [article]

Guan-Horng Liu, Avinash Siravuru, Sai Prabhakar, Manuela Veloso, George Kantor
2017 arXiv   pre-print
Moreover, systematic ways to make policies robust to partial sensor failure are not well explored.  ...  Finally, through the visualization of gradients, we show that the learned policies are conditioned on the same latent state representation despite having diverse observation spaces - a hallmark of true  ...  ∇_{θ^µ} J = E[∇_a Q(s, a | θ^Q) ∇_{θ^µ} µ(s)] (5). [32] proved that using the policy gradient calculated in (5) to update model parameters leads to the maximum expected reward.  ... 
arXiv:1705.10422v2 fatcat:xbntkh4tbvctvlzvrcmlfeb7eq
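The deterministic policy gradient quoted in this snippet is just a chain rule: the critic's gradient with respect to the action, times the policy's gradient with respect to its parameters. A toy scalar example (invented setup, not the paper's sensor policy) with policy µ(s) = θ·s and critic Q(s, a) = −(a − a*)², where repeated ascent steps drive the policy output toward a*:

```python
def dpg_step(theta, s, a_star, lr=0.1):
    """One deterministic-policy-gradient ascent step for a toy problem:
    dJ/dtheta = (dQ/da) * (dmu/dtheta) = -2(a - a_star) * s."""
    a = theta * s                    # action from the deterministic policy
    dq_da = -2.0 * (a - a_star)      # critic gradient w.r.t. the action
    dmu_dtheta = s                   # policy gradient w.r.t. its parameter
    return theta + lr * dq_da * dmu_dtheta
```

Iterating from θ = 0 with s = 1 and a* = 1 contracts the error by a factor 0.8 per step, so θ converges to 1.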

Multimodal Model-Agnostic Meta-Learning via Task-Aware Modulation [article]

Risto Vuorio, Shao-Hua Sun, Hexiang Hu, Joseph J. Lim
2019 arXiv   pre-print
Model-agnostic meta-learners aim to acquire meta-learned parameters from similar tasks to adapt to novel tasks from the same distribution with few gradient updates.  ...  In this paper, we augment MAML with the capability to identify the mode of tasks sampled from a multimodal task distribution and adapt quickly through gradient updates.  ...  For RL problems, the inner loop updates of gradient-based meta-learning take the form of policy gradient updates.  ... 
arXiv:1910.13616v1 fatcat:fq5lgtkpwrgjxatzca4puvpf6y
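The two ingredients this abstract names, identifying a task's mode and then adapting with a few gradient updates, can be caricatured in a handful of lines. Everything below is a hypothetical list-of-floats sketch: a task embedding modulates the shared initialization, then an inner-loop gradient step (a policy-gradient step in the RL case) adapts it:

```python
def modulated_init(theta, tau):
    """MMAML-style task-aware modulation (sketch): scale the shared
    initialization element-wise by a task embedding tau before adaptation."""
    return [t * m for t, m in zip(theta, tau)]

def maml_inner_update(theta, grad_loss, alpha=0.01):
    """Inner-loop adaptation: one gradient step on the new task's loss
    starting from the (modulated) initialization."""
    return [t - alpha * g for t, g in zip(theta, grad_loss)]
```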

Probabilistic Mixture-of-Experts for Efficient Deep Reinforcement Learning [article]

Jie Ren, Yewen Li, Zihan Ding, Wei Pan, Hao Dong
2021 arXiv   pre-print
In this work, we propose a probabilistic mixture-of-experts (PMOE) implemented with a Gaussian mixture model (GMM) for multimodal policy, together with a novel gradient estimator for the indifferentiability  ...  However, grasping distinguishable skills for some tasks with non-unique optima can be essential for further improving its learning efficiency and performance, which may lead to a multimodal policy represented  ...  our experiments, where different multimodal policy approximation methods are built on top.  ... 
arXiv:2104.09122v1 fatcat:vp7pnxndtvffzf7kpe74zshdsy
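A Gaussian-mixture policy of the kind PMOE builds on is sampled hierarchically: pick an expert from the mixture weights, then sample that expert's Gaussian. A one-dimensional toy sketch (illustrates the multimodal action distribution only, not the paper's gradient estimator):

```python
import random

def sample_gmm_action(weights, means, stds):
    """Draw an action from a 1-D Gaussian-mixture policy:
    expert ~ Categorical(weights), then action ~ N(mean_k, std_k)."""
    r, acc = random.random(), 0.0
    for w, m, s in zip(weights, means, stds):
        acc += w
        if r <= acc:
            return random.gauss(m, s)
    return random.gauss(means[-1], stds[-1])   # guard against rounding
```

With well-separated means, repeated samples land near both modes, i.e. the policy keeps distinguishable skills rather than averaging them into one Gaussian.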

Robotic self-representation improves manipulation skills and transfer learning [article]

Phuong D.H. Nguyen, Manfred Eppe, Stefan Wermter
2020 arXiv   pre-print
this gap by developing a model that learns bidirectional action-effect associations to encode the representations of body schema and the peripersonal space from multisensory information, which is named multimodal  ...  : 1) Deep deterministic policy gradient (DDPG): While policy gradient methods in general refer to a parameterized, stochastic policy, deterministic policy gradient methods aim to learn parameters for a  ...  In our setting, we employ the policy gradient approach that allows the agent to select actions directly through a parameterized policy instead of consulting the value function.  ... 
arXiv:2011.06985v1 fatcat:5kyven7c75fevfwpyux7lv5lfq

Improving Evolutionary Strategies with Generative Neural Networks [article]

Louis Faury, Clement Calauzenes, Olivier Fercoq, Syrine Krichen
2019 arXiv   pre-print
The parameter θ is therefore not updated along the negative gradient of J but rather along F_θ^{-1} ∇_θ J(θ), a quantity known as the natural gradient.  ...  The latent space is optimized by natural gradient descent, and the coupling layers via an off-policy objective with a Kullback-Leibler divergence penalty.  ... 
arXiv:1901.11271v1 fatcat:rwcy6tuud5bijhv7nqiigf7oki
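The natural-gradient direction named in this snippet is the ordinary gradient preconditioned by the inverse Fisher information matrix, F_θ^{-1} ∇_θ J(θ). For a 2×2 Fisher matrix this can be written in closed form (real implementations solve F x = ∇J iteratively rather than inverting F):

```python
def natural_gradient(fisher, grad):
    """Natural-gradient direction F^{-1} * grad for a 2x2 Fisher matrix,
    via the closed-form 2x2 inverse.  Illustration only."""
    (a, b), (c, d) = fisher
    det = a * d - b * c          # assumed nonsingular (Fisher is PD)
    g0, g1 = grad
    return [(d * g0 - b * g1) / det, (a * g1 - c * g0) / det]
```

With a diagonal Fisher this reduces to rescaling each coordinate by its inverse curvature, which is why the natural gradient is invariant to reparameterization where the plain gradient is not.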

Exploiting Multimodal Reinforcement Learning for Simultaneous Machine Translation [article]

Julia Ive, Andy Mingren Li, Yishu Miao, Ozan Caglayan, Pranava Madhyastha, Lucia Specia
2021 arXiv   pre-print
This paper addresses the problem of simultaneous machine translation (SiMT) by exploring two main concepts: (a) adaptive policies to learn a good trade-off between high translation quality and low latency  ...  For that, we propose a multimodal approach to simultaneous machine translation using reinforcement learning, with strategies to integrate visual and textual information in both the agent and the environment  ...  policy gradient (REINFORCE (Williams, 1992) ).  ... 
arXiv:2102.11387v1 fatcat:jyvl6kv43bar3egxhe5m7jac2i

Egocentric Activity Recognition on a Budget

Rafael Possas, Sheila Pinto Caceres, Fabio Ramos
2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
We develop a Reinforcement Learning model-free method to learn energy-aware policies that maximize the use of low-energy cost predictors while keeping competitive accuracy levels.  ...  Our results show that a policy trained on an egocentric dataset is able to use the synergy between motion and vision sensors to effectively trade off energy expenditure and accuracy on smartglasses operating  ...  Specifically, we make the following contributions: • We propose a Reinforcement Learning (RL) Policy Gradient framework that balances energy consumption and accuracy through a customizable hyper-parameter  ... 
doi:10.1109/cvpr.2018.00625 dblp:conf/cvpr/PossasPR18 fatcat:e6vjj62dyjfxpd4qppzmz6xx5a
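One plausible way to realize the "customizable hyper-parameter" balancing energy and accuracy is a scalar reward that pays for correct predictions and charges for energy. This shaping is an assumption for illustration; the paper's exact reward may differ:

```python
def energy_aware_reward(correct, energy_cost, lam=0.5):
    """Hypothetical reward trading accuracy against energy: +1 for a
    correct prediction minus lam times the energy spent, with lam the
    tunable accuracy/energy trade-off."""
    return (1.0 if correct else 0.0) - lam * energy_cost
```

Raising `lam` makes the learned policy lean harder on the cheap motion sensors; lowering it favors the accurate but power-hungry vision predictor.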

Learning Solution Manifolds for Control Problems via Energy Minimization [article]

Miguel Zamora, Roi Poranne, Stelian Coros
2022 arXiv   pre-print
Further sampling strategies should be explored.  ...  Other gradient-based formulations to train policies are presented by [11], [12] in the context of differentiable simulators, and by [13], [6] under the umbrella of guided policy search methods  ... 
arXiv:2203.03432v1 fatcat:jzbkht7tfbg5doit4rgiioquvi

Leveraging exploration in off-policy algorithms via normalizing flows [article]

Bogdan Mazoure, Thang Doan, Audrey Durand, R Devon Hjelm, Joelle Pineau
2019 arXiv   pre-print
., multimodal) through normalizing flows (NF) and show that this significantly improves performance by accelerating the discovery of good policies while using much smaller policy representations.  ...  Approaches such as neural density models and continuous exploration (e.g., Go-Explore) have been proposed to maintain the high exploration rate necessary to find high performing and generalizable policies  ...  The main contribution of this work is to extend SAC to a richer class of multimodal exploration policies, by transforming the actions during exploration via a sequence of invertible mappings known as normalizing  ... 
arXiv:1905.06893v3 fatcat:xs3qf424d5hmxata5qzj5h2fwi
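The building block of such a flow is an invertible mapping with a tractable Jacobian log-determinant, so the density of the transformed action can be evaluated exactly. A minimal sketch using a single element-wise affine map (real normalizing flows stack learned, deeper invertible layers):

```python
import math

def affine_flow(z, scale, shift):
    """One invertible affine layer: a = scale * z + shift, plus the
    log|det Jacobian| needed for the transformed policy density."""
    a = [s * x + t for x, s, t in zip(z, scale, shift)]
    log_det = sum(math.log(abs(s)) for s in scale)
    return a, log_det

def affine_flow_inverse(a, scale, shift):
    """Exact inverse, which is what makes the density tractable."""
    return [(x - t) / s for x, s, t in zip(a, scale, shift)]
```

Sampling z from a simple base policy and pushing it through a stack of such maps yields a richer (possibly multimodal) exploration distribution while the log-probability stays computable via the change-of-variables formula.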

Adaptive Discretization for Continuous Control using Particle Filtering Policy Network [article]

Pei Xu, Ioannis Karamouzas
2020 arXiv   pre-print
The resulting policy can replace the original continuous policy of any given policy gradient algorithm without changing its underlying model architecture.  ...  In this paper, we propose a simple, yet general, framework for improving the performance of policy gradient algorithms by discretizing the continuous action space.  ...  We can maximize J(θ) by adjusting the policy parameters θ through the gradient ascent method, where the gradient of the expected reward can be determined according to the policy gradient theorem [18]  ... 
arXiv:2003.06959v3 fatcat:v2n7unruhbcc3hnikmpx5fb7rq
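The gradient-ascent scheme from the policy gradient theorem mentioned here is easiest to see in the REINFORCE form: push θ along the return-weighted score ∇_θ log π_θ(a). A toy sketch for a two-action softmax policy with a single logit (illustrates the theorem only, not the paper's particle-filtering policy):

```python
import math

def reinforce_update(theta, actions, returns, lr=0.01):
    """Toy REINFORCE: for pi(a=1) = sigmoid(theta),
    grad log pi(a=1) = 1 - pi(1) and grad log pi(a=0) = -pi(1)."""
    for a, g in zip(actions, returns):
        p1 = 1.0 / (1.0 + math.exp(-theta))
        grad_logp = (1.0 - p1) if a == 1 else -p1
        theta += lr * g * grad_logp      # ascend the expected return
    return theta
```

Rewarding action 1 pushes the logit up; rewarding action 0 pushes it down, exactly the direction the theorem prescribes.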

Efficient Baseline-Free Sampling in Parameter Exploring Policy Gradients: Super Symmetric PGPE [chapter]

Frank Sehnke
2013 Lecture Notes in Computer Science  
Policy Gradient methods that explore directly in parameter space are among the most effective and robust direct policy search methods and have drawn a lot of attention lately.  ...  The basic method from this field, Policy Gradients with Parameter-based Exploration, uses two samples that are symmetric around the current hypothesis to circumvent misleading reward in asymmetrical reward  ...  The basic method from the field of Parameter Exploring Policy Gradients (PEPG) [8], Policy Gradients with Parameter-based Exploration (PGPE) [1], uses two samples that are symmetric around the current  ... 
doi:10.1007/978-3-642-40728-4_17 fatcat:rqfpbz4sufc6bbcgns3r7t7skm
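The symmetric sampling this entry describes draws one perturbation ε and evaluates the two mirrored hypotheses µ+ε and µ−ε; the difference of their returns then drives the update without needing a reward baseline. A toy sketch (hypothetical function names, simplified update):

```python
import random

def symmetric_samples(mu, sigma):
    """One perturbation epsilon yields two parameter samples mirrored
    around the current hypothesis mu."""
    eps = [random.gauss(0.0, s) for s in sigma]
    theta_plus = [m + e for m, e in zip(mu, eps)]
    theta_minus = [m - e for m, e in zip(mu, eps)]
    return theta_plus, theta_minus, eps

def symmetric_mu_update(mu, eps, r_plus, r_minus, lr=0.1):
    """Baseline-free step: move mu along eps scaled by the return
    difference of the two mirrored rollouts."""
    step = lr * (r_plus - r_minus) / 2.0
    return [m + step * e for m, e in zip(mu, eps)]
```

Because any reward offset common to both rollouts cancels in r_plus − r_minus, asymmetrical or shifted reward scales no longer mislead the gradient estimate.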

Contextual Latent-Movements Off-Policy Optimization for Robotic Manipulation Skills [article]

Samuele Tosatto, Georgia Chalvatzaki, Jan Peters
2022 arXiv   pre-print
Parameterized movement primitives have been extensively used for imitation learning of robotic tasks.  ...  Moreover, we introduce a new contextual off-policy RL algorithm, named LAtent-Movements Policy Optimization (LAMPO).  ...  In [24], the authors introduced a DR technique of the exploration parameters of DMPs when using a path integral policy improvement algorithm [28].  ... 
arXiv:2010.13766v3 fatcat:dtra35awg5epblsesa3rtfoej4
Showing results 1 — 15 out of 4,829 results