405 Hits in 2.3 sec

Selective Credit Assignment [article]

Veronica Chelu, Diana Borsa, Doina Precup, Hado van Hasselt
2022 arXiv   pre-print
Efficient credit assignment is essential for reinforcement learning algorithms in both prediction and control settings. We describe a unified view of temporal-difference algorithms for selective credit assignment. These selective algorithms apply weightings to quantify the contribution of learning updates. We present insights into applying weightings to value-based learning and planning algorithms, and describe their role in mediating the backward credit distribution in prediction and control. Within this space, we identify some existing online learning algorithms that can assign credit selectively as special cases, and add new algorithms that assign credit backward in time counterfactually, allowing credit to be assigned off-trajectory and off-policy.
arXiv:2202.09699v1 fatcat:26zcp3tku5hqfmhtiuojcjxw4a
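As an illustration of the weighted-update idea in the abstract above, here is a minimal tabular TD(λ) sketch in which each state's eligibility is scaled by a per-state weighting. The `weight` function is an illustrative stand-in for the paper's weightings, not the authors' algorithm.

```python
import numpy as np

def weighted_td_lambda(episodes, n_states, alpha=0.1, gamma=0.9, lam=0.8,
                       weight=lambda s: 1.0):
    """Tabular TD(lambda) where each visited state's contribution to the
    eligibility trace is scaled by a weighting, mediating how credit is
    distributed backward in time."""
    v = np.zeros(n_states)
    for episode in episodes:  # episode: list of (state, reward, next_state)
        e = np.zeros(n_states)  # eligibility trace
        for s, r, s_next in episode:
            delta = r + gamma * v[s_next] - v[s]  # TD error
            e *= gamma * lam        # decay previously accumulated credit
            e[s] += weight(s)       # accumulate weighted credit for s
            v += alpha * delta * e  # distribute the TD error backward
    return v
```

Setting `weight` to zero for some states cuts them out of the backward credit flow entirely, which is the kind of selectivity the abstract describes.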

Observational Learning by Reinforcement Learning [article]

Diana Borsa, Bilal Piot, Rémi Munos, Olivier Pietquin
2017 arXiv   pre-print
Observational learning is a type of learning that occurs as a function of observing, retaining, and possibly replicating or imitating the behaviour of another agent. It is a core mechanism appearing in various instances of social learning and has been found in several intelligent species, including humans. In this paper, we investigate to what extent the explicit modelling of other agents is necessary to achieve observational learning through machine learning. In particular, we argue that observational learning can emerge from pure Reinforcement Learning (RL), potentially coupled with memory. Through simple scenarios, we demonstrate that an RL agent can leverage the information provided by observations of another agent performing a task in a shared environment. The other agent is only observed through the effect of its actions on the environment and is never explicitly modelled. Two key aspects are borrowed from observational learning: i) the observer's behaviour needs to change as a result of viewing a 'teacher' (another agent), and ii) the observer needs some motivation to make use of the other agent's behaviour. The latter is naturally modelled by RL, by correlating the learning agent's reward with the teacher agent's behaviour.
arXiv:1706.06617v1 fatcat:373blxc2rnfqvksiskyhcqezoy

The Termination Critic [article]

Anna Harutyunyan, Will Dabney, Diana Borsa, Nicolas Heess, Remi Munos, Doina Precup
2019 arXiv   pre-print
In this work, we consider the problem of autonomously discovering behavioral abstractions, or options, for reinforcement learning agents. We propose an algorithm that focuses on the termination condition, as opposed to -- as is common -- the policy. The termination condition is usually trained to optimize a control objective: an option ought to terminate if another has better value. We offer a different, information-theoretic perspective, and propose that terminations should focus instead on the compressibility of the option's encoding -- arguably a key reason for using abstractions. To achieve this algorithmically, we leverage the classical options framework, and learn the option transition model as a "critic" for the termination condition. Using this model, we derive gradients that optimize the desired criteria. We show that the resulting options are non-trivial, intuitively meaningful, and useful for learning and planning.
arXiv:1902.09996v1 fatcat:zy6jk2ck4jg7bk5xvc7zprv7d4

When should agents explore? [article]

Miruna Pîslar, David Szepesvari, Georg Ostrovski, Diana Borsa, Tom Schaul
2022 arXiv   pre-print
Exploration remains a central challenge for reinforcement learning (RL). Virtually all existing methods share the feature of a monolithic behaviour policy that changes only gradually (at best). In contrast, the exploratory behaviours of animals and humans exhibit a rich diversity, including forms of switching between modes. This paper presents an initial study of mode-switching, non-monolithic exploration for RL. We investigate different modes to switch between, at what timescales it makes sense to switch, and what signals make for good switching triggers. We also propose practical algorithmic components that make the switching mechanism adaptive and robust, which enables flexibility without an accompanying hyper-parameter-tuning burden. Finally, we report a promising and detailed analysis on Atari, using two-mode exploration and switching at sub-episodic time-scales.
arXiv:2108.11811v2 fatcat:durixjxq7nbs5f3mvhs6vygedy
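The two-mode, sub-episodic switching described above can be caricatured in a few lines. This is a sketch of one simple ("blind") trigger, not the paper's adaptive mechanism; `p_switch` and `explore_len` are illustrative values.

```python
import random

def two_mode_policy(q_values, state, mode):
    """Act greedily in 'exploit' mode, uniformly at random in 'explore' mode."""
    actions = range(len(q_values[state]))
    if mode == "explore":
        return random.choice(list(actions))
    return max(actions, key=lambda a: q_values[state][a])

def step_mode(mode, steps_left, p_switch=0.05, explore_len=10):
    """Blind intra-episodic switching: occasionally enter a bounded
    exploration period, then fall back to exploitation."""
    if mode == "exploit" and random.random() < p_switch:
        return "explore", explore_len
    if mode == "explore":
        steps_left -= 1
        if steps_left <= 0:
            return "exploit", 0
    return mode, steps_left
```

The paper's contribution is in making the trigger and switching timescale adaptive; this fixed-probability version only illustrates the non-monolithic structure.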

Expected Eligibility Traces [article]

Hado van Hasselt, Sephora Madjiheurem, Matteo Hessel, David Silver, André Barreto, Diana Borsa
2021 arXiv   pre-print
... Borsa, D., & Barreto, A. (2019). General non-linear Bellman equations. arXiv preprint arXiv:1907.03687. van Hasselt, H. & Sutton, R. S. (2015). Learning to predict independent of span. ...
arXiv:2007.01839v2 fatcat:2yxk7yab6jgxxfefxrd5xkliry

Adapting Behaviour for Learning Progress [article]

Tom Schaul, Diana Borsa, David Ding, David Szepesvari, Georg Ostrovski, Will Dabney, Simon Osindero
2019 arXiv   pre-print
..., 2018) or task specification (Borsa et al., 2019). ...
arXiv:1912.06910v1 fatcat:hehbd7uw7vcgnny4vtuhntswja

Universal Successor Features Approximators [article]

Diana Borsa, André Barreto, John Quan, Daniel Mankowitz, Rémi Munos, Hado van Hasselt, David Silver, Tom Schaul
2018 arXiv   pre-print
The ability of a reinforcement learning (RL) agent to learn about many reward functions at the same time has many potential benefits, such as the decomposition of complex tasks into simpler ones, the exchange of information between tasks, and the reuse of skills. We focus on one aspect in particular, namely the ability to generalise to unseen tasks. Parametric generalisation relies on the interpolation power of a function approximator that is given the task description as input; one of its most common forms is universal value function approximators (UVFAs). Another way to generalise to new tasks is to exploit structure in the RL problem itself. Generalised policy improvement (GPI) combines solutions of previous tasks into a policy for the unseen task; this relies on instantaneous policy evaluation of old policies under the new reward function, which is made possible through successor features (SFs). Our proposed universal successor features approximators (USFAs) combine the advantages of all of these, namely the scalability of UVFAs, the instant inference of SFs, and the strong generalisation of GPI. We discuss the challenges involved in training a USFA and its generalisation properties, and demonstrate its practical benefits and transfer abilities on a large-scale domain in which the agent has to navigate a three-dimensional environment from a first-person perspective.
arXiv:1812.07626v1 fatcat:ptxih27fezbavg47nqil4w7qry
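The GPI-over-successor-features computation that USFAs build on can be sketched concisely: at a state, evaluate every stored policy on the new task via its SFs, then act greedily over both policies and actions. The array layout here is our own convention, not the paper's.

```python
import numpy as np

def gpi_action(psis, w):
    """Generalised policy improvement over successor features.
    psis: array [n_policies, n_actions, d] holding psi^pi(s, a) for the
    current state s; w: task vector, so Q^pi(s, a) = psi^pi(s, a) . w.
    Returns the action maximising Q over both policies and actions."""
    q = psis @ w                           # [n_policies, n_actions]
    return int(np.argmax(q.max(axis=0)))   # best action under the best policy
```

A USFA additionally makes `psis` the output of a function approximator conditioned on a policy/task embedding, which is what enables generalisation beyond the stored policies.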

Learning Shared Representations in Multi-task Reinforcement Learning [article]

Diana Borsa and Thore Graepel and John Shawe-Taylor
2016 arXiv   pre-print
We investigate a paradigm in multi-task reinforcement learning (MT-RL) in which an agent is placed in an environment and needs to learn to perform a series of tasks within this space. Since the environment does not change, there is potentially a lot of common ground amongst tasks, and learning to solve them individually seems extremely wasteful. In this paper, we explicitly model and learn this shared structure as it arises in the state-action value space. We show how one can jointly learn optimal value functions by modifying the popular Value-Iteration and Policy-Iteration procedures to accommodate this shared representation assumption and leverage the power of multi-task supervised learning. Finally, we demonstrate that the proposed model and training procedures are able to infer good value functions, even in low-sample regimes. In addition to data efficiency, we show in our analysis that learning abstractions of the state space jointly across tasks leads to more robust, transferable representations with the potential for better generalization.
arXiv:1603.02041v1 fatcat:wvziahyo7nhbbjsocw4oywuuqy
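One minimal way to picture "shared structure in the state-action value space" is a low-rank factorisation of per-task value tables into shared features and per-task weights. This truncated-SVD sketch is our own illustration of the idea, not the paper's training procedure.

```python
import numpy as np

def shared_value_representation(q_tables, rank):
    """Given per-task state-action value tables (each flattened to length
    n_states * n_actions), factor their stack as Q ~ Phi @ W, where Phi is
    a shared basis over (state, action) pairs and W holds per-task weights."""
    Q = np.stack(q_tables, axis=1)                 # [n_sa, n_tasks]
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    phi = U[:, :rank] * s[:rank]                   # shared features per (s, a)
    W = Vt[:rank]                                  # per-task weight vectors
    return phi, W
```

When the tasks genuinely share structure, a small `rank` reconstructs all value tables well, and new tasks need only a new weight vector rather than a full table.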

Ray Interference: a Source of Plateaus in Deep Reinforcement Learning [article]

Tom Schaul, Diana Borsa, Joseph Modayil, Razvan Pascanu
2019 arXiv   pre-print
Rather than proposing a new method, this paper investigates an issue present in existing learning algorithms. We study the learning dynamics of reinforcement learning (RL), specifically a characteristic coupling between learning and data generation that arises because RL agents control their future data distribution. In the presence of function approximation, this coupling can lead to a problematic type of 'ray interference', characterized by learning dynamics that sequentially traverse a series of performance plateaus, effectively constraining the agent to learn one thing at a time even when learning in parallel is better. We establish the conditions under which ray interference occurs, show its relation to saddle points, and obtain the exact learning dynamics in a restricted setting. We characterize a number of its properties and discuss possible remedies.
arXiv:1904.11455v1 fatcat:2x7xkzw4hjhmth23uhmwfy5g4e
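A toy simulation can make the coupling above concrete. This is our own construction, not the paper's exact restricted setting: each objective component improves at a rate scaled by its own current value (standing in for performance-dependent data generation), so the weaker component sits on a long plateau while the stronger one is learned first.

```python
import numpy as np

def coupled_learning_curve(init, lr=0.5, steps=200):
    """Toy dynamics for components J_k in (0, 1):
        J_k <- J_k + lr * J_k * J_k * (1 - J_k)
    The extra factor of J_k models the coupling between performance and
    data generation; it makes learning sequential and plateau-like."""
    j = np.array(init, dtype=float)
    history = [j.copy()]
    for _ in range(steps):
        j = j + lr * j * j * (1.0 - j)
        history.append(j.copy())
    return np.array(history)
```

Plotting the two columns of the returned history shows the hallmark shape: the first component rises quickly while the second stays nearly flat before eventually taking off.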

The Wreath Process: A totally generative model of geometric shape based on nested symmetries [article]

Diana Borsa, Thore Graepel, Andrew Gordon
2015 arXiv   pre-print
We consider the problem of modelling noisy but highly symmetric shapes that can be viewed as hierarchies of whole-part relationships, in which higher-level objects are composed of transformed collections of lower-level objects. To this end, we propose the stochastic wreath process, a fully generative probabilistic model of drawings. Following Leyton's "Generative Theory of Shape", we represent shapes as sequences of transformation groups composed through a wreath product. This representation emphasizes the maximization of transfer: the idea that the most compact and meaningful representation of a given shape is achieved by maximizing the re-use of existing building blocks or parts. The proposed stochastic wreath process extends Leyton's theory by defining a probability distribution over geometric shapes in terms of noise processes that are aligned with the generative group structure of the shape. We propose an inference scheme for recovering the generative history of given images in terms of the wreath process, using reversible jump Markov chain Monte Carlo methods and Approximate Bayesian Computation. In the context of sketching, we demonstrate the feasibility and limitations of this approach on model-generated and real data.
arXiv:1506.03041v1 fatcat:sregnjraq5gzjhrvwnl6ekcd5i
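The nested-group composition at the heart of the wreath representation can be sketched as repeated application of transformation groups to a motif; each level transforms everything the previous levels produced. This is a bare deterministic sketch, omitting the noise processes and inference that the paper adds.

```python
import numpy as np

def apply_nested_symmetries(motif, groups):
    """Compose transformation groups in a nested (wreath-like) fashion.
    motif: initial points, shape [n, 2].
    groups: list of levels; each level is a list of (A, t) pairs, where A
    is a 2x2 linear map and t a translation, applied to all points so far."""
    points = np.asarray(motif, dtype=float)
    for transforms in groups:
        points = np.concatenate([points @ A.T + t for A, t in transforms])
    return points
```

Re-using a small motif across levels is exactly the "maximization of transfer" idea: the whole shape is described by a few group elements rather than by its individual points.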

The Option Keyboard: Combining Skills in Reinforcement Learning [article]

André Barreto, Diana Borsa, Shaobo Hou, Gheorghe Comanici, Eser Aygün, Philippe Hamel, Daniel Toyama, Jonathan Hunt, Shibl Mourad, David Silver, Doina Precup
2021 arXiv   pre-print
The Option Keyboard: Combining Skills in Reinforcement Learning (Supplementary Material). André Barreto, Diana Borsa, Shaobo Hou, Gheorghe Comanici, Eser Aygün, Philippe Hamel, Daniel Toyama, Jonathan Hunt, Shibl Mourad, David Silver, Doina Precup. DeepMind. Abstract: In this supplement ...
arXiv:2106.13105v1 fatcat:gvbme6pahfhbpczbeu62cu4g2y

Conditional Importance Sampling for Off-Policy Learning [article]

Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, Will Dabney
2020 arXiv   pre-print
The principal contribution of this paper is a conceptual framework for off-policy reinforcement learning, based on conditional expectations of importance sampling ratios. This framework yields new perspectives and understanding of existing off-policy algorithms, and reveals a broad space of unexplored algorithms. We theoretically analyse this space, and concretely investigate several algorithms that arise from this framework.
arXiv:1910.07479v2 fatcat:zic3vjldffgq7bevyzflwd4y2q
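The contrast between ordinary importance sampling and a conditioned variant can be sketched as follows. Conditioning on the return is one simple choice of statistic, picked here for illustration (the paper studies a much broader space); by the tower rule it leaves the estimator's expectation unchanged while averaging out variance in the ratios.

```python
import numpy as np

def importance_sampling_estimate(returns, rhos):
    """Ordinary off-policy estimate: the average of rho_i * G_i, where
    rho_i is the product of per-step ratios pi(a|s)/mu(a|s) along
    trajectory i and G_i is that trajectory's return."""
    returns = np.asarray(returns, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    return float(np.mean(rhos * returns))

def conditional_is_estimate(returns, rhos):
    """Conditioned variant: replace each ratio by the empirical mean of
    ratios among trajectories with the same return (conditioning on G).
    The sample mean is identical to the ordinary estimate, but the
    per-trajectory weights vary less."""
    returns = np.asarray(returns, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    cond = np.array([rhos[returns == g].mean() for g in returns])
    return float(np.mean(cond * returns))
```

On any sample the two estimates coincide exactly (grouped sums are preserved), which makes the variance reduction of the conditioned weights a free improvement in this toy setting.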

Use of Correctives and Fertilizers in Pasture in the Amazon Biome

Anderson Lange, Antonio Carlos Buchelt, Cleris Diana Borsa, Marcos Evaldo Capeletti, Evandro Luiz Schoninger, Rodrigo Sinaidi Zandonadi
2018 Nativa  
The objective of this work was to evaluate the green- and dry-mass productivity of the shoots and roots, and the accumulation of calcium and magnesium in the shoots, of Urochloa brizantha under applications of limestone, or of limestone and agricultural gypsum accompanied by fertilization with nitrogen (N), phosphorus (P), and potassium (K). The experimental design consisted of randomized blocks with four replicates and seven treatments: T0 = 0; T1 = 0.40; T2 = 0.80; T3 = 1.60; T4 = 3.20 t ha-1 of limestone; T5 = 1.60 t ha-1 of limestone plus NPK (40 kg ha-1 of N as ammonium sulfate, 120 kg ha-1 of P2O5 as single superphosphate, and 20 kg ha-1 of K2O as potassium chloride); and T6 = 1.50 t ha-1 of agricultural gypsum plus the same NPK as the previous treatment. Over the course of the experiment, eight cuts of the shoots were made, along with two evaluations of calcium and magnesium contents in the tissues and one evaluation of the root system. There was an effect on the accumulation of green mass and dry mass and on the calcium and magnesium contents in shoot tissue at the first cut, most notably with limestone and gypsum accompanied by NPK. Root accumulation in the soil profile responded linearly to the limestone rates evaluated. Keywords: limestone, gypsum, NPK, Urochloa brizantha.
doi:10.31413/nativa.v6i6.6330 fatcat:rixsfq2zm5birimt7db6rmxble

Temporal Difference Uncertainties as a Signal for Exploration [article]

Sebastian Flennerhag, Jane X. Wang, Pablo Sprechmann, Francesco Visin, Alexandre Galashov, Steven Kapturowski, Diana L. Borsa, Nicolas Heess, Andre Barreto, Razvan Pascanu
2021 arXiv   pre-print
... Borsa, D., Ding, D., Szepesvari, D., Ostrovski, G., Dabney, W., and Osindero, S. Adapting behaviour for learning progress. arXiv preprint arXiv:1912.06910, 2019. Schmidhuber, J. ...
arXiv:2010.02255v2 fatcat:5mtijqrltrekrferevd2jokgu4

Automatic Identification of Web-Based Risk Markers for Health Events

Elad Yom-Tov, Diana Borsa, Andrew C Hayward, Rachel A McKendry, Ingemar J Cox
2015 Journal of Medical Internet Research  
The escalating cost of global health care is driving the development of new technologies to identify early indicators of an individual's risk of disease. Traditionally, epidemiologists have identified such risk factors using medical databases and lengthy clinical studies, but these are often limited in size and cost, can fail to take full account of diseases carrying social stigma, and can miss transient acute risk factors. Objective: Here we report that Web search engine queries coupled with information on Wikipedia access patterns can be used to infer health events associated with an individual user and to automatically generate Web-based risk markers for some common medical conditions worldwide, from cardiovascular disease to sexually transmitted infections and mental health conditions, as well as pregnancy. Methods: Using anonymized datasets, we present methods to first distinguish individuals likely to have experienced specific health events, and to classify them into distinct categories. We then use the self-controlled case series method to find the incidence of health events in risk periods directly following a user's search for a query category, and compare it to the incidence during other periods for the same individuals. Results: Searches for pet stores were risk markers for allergy. We also identified some possible new risk markers; for example, searching for fast food and theme restaurants was associated with a transient increase in risk of myocardial infarction, suggesting this exposure goes beyond a long-term risk factor and may also act as an acute trigger of myocardial infarction. Dating and adult content websites were risk markers for sexually transmitted infections, such as human immunodeficiency virus (HIV). Conclusions: Web-based methods provide a powerful, low-cost approach to automatically identifying risk factors, and support more timely and personalized public health efforts to bring human and economic benefits. (J Med Internet Res 2015;17(1):e29)
doi:10.2196/jmir.4082 pmid:25626480 pmcid:PMC4327439 fatcat:lkozqz5ccbff3otymop2r7gguy
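The self-controlled comparison in the Methods above reduces, per query category, to an incidence-rate ratio computed within the same individuals; a minimal sketch, with the aggregation over users and categories left out:

```python
def incidence_rate_ratio(events_risk, time_risk, events_control, time_control):
    """Self-controlled comparison: the rate of health events in the risk
    window directly following an exposure (e.g. a search in a query
    category) divided by the rate during the same individuals' other
    (control) time. Times are person-time in consistent units."""
    rate_risk = events_risk / time_risk
    rate_control = events_control / time_control
    return rate_risk / rate_control
```

A ratio well above 1 flags the query category as a candidate risk marker; because each person serves as their own control, stable between-person confounders cancel out.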
Showing results 1 — 15 out of 405 results