405 Hits in 2.3 sec

Selective Credit Assignment [article]

Veronica Chelu, Diana Borsa, Doina Precup, Hado van Hasselt
2022 arXiv   pre-print
Efficient credit assignment is essential for reinforcement learning algorithms in both prediction and control settings. We describe a unified view of temporal-difference algorithms for selective credit assignment. These selective algorithms apply weightings to quantify the contribution of learning updates. We present insights into applying weightings to value-based learning and planning algorithms, and describe their role in mediating the backward credit distribution in prediction and control. Within this space, we identify some existing online learning algorithms that can assign credit selectively as special cases, and add new algorithms that assign credit backward in time counterfactually, allowing credit to be assigned off-trajectory and off-policy.
arXiv:2202.09699v1 fatcat:26zcp3tku5hqfmhtiuojcjxw4a
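As an illustration of the weighted-update idea in the abstract above, here is a minimal tabular TD(λ) sketch in which each state's eligibility is scaled by a per-state weighting. The `weight` function is an illustrative stand-in for the paper's weightings, not the authors' algorithm.

```python
import numpy as np

def weighted_td_lambda(episodes, n_states, alpha=0.1, gamma=0.9, lam=0.8,
                       weight=lambda s: 1.0):
    """Tabular TD(lambda) where each visited state's contribution to the
    eligibility trace is scaled by a weighting, mediating how credit is
    distributed backward in time."""
    v = np.zeros(n_states)
    for episode in episodes:  # episode: list of (state, reward, next_state)
        e = np.zeros(n_states)  # eligibility trace
        for s, r, s_next in episode:
            delta = r + gamma * v[s_next] - v[s]  # TD error
            e *= gamma * lam        # decay previously accumulated credit
            e[s] += weight(s)       # accumulate weighted credit for s
            v += alpha * delta * e  # distribute the TD error backward
    return v
```

Setting `weight` to zero for some states cuts them out of the backward credit flow entirely, which is the kind of selectivity the abstract describes.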

Observational Learning by Reinforcement Learning [article]

Diana Borsa, Bilal Piot, Rémi Munos, Olivier Pietquin
2017 arXiv   pre-print
Observational learning is a type of learning that occurs as a function of observing, retaining, and possibly replicating or imitating the behaviour of another agent. It is a core mechanism appearing in various instances of social learning and has been found in several intelligent species, including humans. In this paper, we investigate to what extent the explicit modelling of other agents is necessary to achieve observational learning through machine learning. In particular, we argue that observational learning can emerge from pure Reinforcement Learning (RL), potentially coupled with memory. Through simple scenarios, we demonstrate that an RL agent can leverage the information provided by observations of another agent performing a task in a shared environment. The other agent is only observed through the effect of its actions on the environment and is never explicitly modelled. Two key aspects are borrowed from observational learning: i) the observer's behaviour needs to change as a result of viewing a 'teacher' (another agent), and ii) the observer needs some motivation to make use of the other agent's behaviour. The latter is naturally modelled by RL, by correlating the learning agent's reward with the teacher agent's behaviour.
arXiv:1706.06617v1 fatcat:373blxc2rnfqvksiskyhcqezoy

The Termination Critic [article]

Anna Harutyunyan, Will Dabney, Diana Borsa, Nicolas Heess, Remi Munos, Doina Precup
2019 arXiv   pre-print
In this work, we consider the problem of autonomously discovering behavioral abstractions, or options, for reinforcement learning agents. We propose an algorithm that focuses on the termination condition, as opposed to -- as is common -- the policy. The termination condition is usually trained to optimize a control objective: an option ought to terminate if another has better value. We offer a different, information-theoretic perspective, and propose that terminations should focus instead on the compressibility of the option's encoding -- arguably a key reason for using abstractions. To achieve this algorithmically, we leverage the classical options framework, and learn the option transition model as a "critic" for the termination condition. Using this model, we derive gradients that optimize the desired criteria. We show that the resulting options are non-trivial, intuitively meaningful, and useful for learning and planning.
arXiv:1902.09996v1 fatcat:zy6jk2ck4jg7bk5xvc7zprv7d4

When should agents explore? [article]

Miruna Pîslar, David Szepesvari, Georg Ostrovski, Diana Borsa, Tom Schaul
2022 arXiv   pre-print
Exploration remains a central challenge for reinforcement learning (RL). Virtually all existing methods share the feature of a monolithic behaviour policy that changes only gradually (at best). In contrast, the exploratory behaviours of animals and humans exhibit a rich diversity, including forms of switching between modes. This paper presents an initial study of mode-switching, non-monolithic exploration for RL. We investigate different modes to switch between, at what timescales it makes sense to switch, and what signals make for good switching triggers. We also propose practical algorithmic components that make the switching mechanism adaptive and robust, which enables flexibility without an accompanying hyper-parameter-tuning burden. Finally, we report a promising and detailed analysis on Atari, using two-mode exploration and switching at sub-episodic time-scales.
arXiv:2108.11811v2 fatcat:durixjxq7nbs5f3mvhs6vygedy
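The two-mode, sub-episodic switching described above can be caricatured in a few lines. This is a sketch of one simple ("blind") trigger, not the paper's adaptive mechanism; `p_switch` and `explore_len` are illustrative values.

```python
import random

def two_mode_policy(q_values, state, mode):
    """Act greedily in 'exploit' mode, uniformly at random in 'explore' mode."""
    actions = range(len(q_values[state]))
    if mode == "explore":
        return random.choice(list(actions))
    return max(actions, key=lambda a: q_values[state][a])

def step_mode(mode, steps_left, p_switch=0.05, explore_len=10):
    """Blind intra-episodic switching: occasionally enter a bounded
    exploration period, then fall back to exploitation."""
    if mode == "exploit" and random.random() < p_switch:
        return "explore", explore_len
    if mode == "explore":
        steps_left -= 1
        if steps_left <= 0:
            return "exploit", 0
    return mode, steps_left
```

The paper's contribution is in making the trigger and switching timescale adaptive; this fixed-probability version only illustrates the non-monolithic structure.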

Expected Eligibility Traces [article]

Hado van Hasselt, Sephora Madjiheurem, Matteo Hessel, David Silver, André Barreto, Diana Borsa
2021 arXiv   pre-print
... Borsa, D., & Barreto, A. (2019). General non-linear Bellman equations. arXiv preprint arXiv:1907.03687. van Hasselt, H. & Sutton, R. S. (2015). Learning to predict independent of span. ...
arXiv:2007.01839v2 fatcat:2yxk7yab6jgxxfefxrd5xkliry

Adapting Behaviour for Learning Progress [article]

Tom Schaul, Diana Borsa, David Ding, David Szepesvari, Georg Ostrovski, Will Dabney, Simon Osindero
2019 arXiv   pre-print
..., 2018) or task specification (Borsa et al., 2019). ...
arXiv:1912.06910v1 fatcat:hehbd7uw7vcgnny4vtuhntswja

Universal Successor Features Approximators [article]

Diana Borsa, André Barreto, John Quan, Daniel Mankowitz, Rémi Munos, Hado van Hasselt, David Silver, Tom Schaul
2018 arXiv   pre-print
The ability of a reinforcement learning (RL) agent to learn about many reward functions at the same time has many potential benefits, such as the decomposition of complex tasks into simpler ones, the exchange of information between tasks, and the reuse of skills. We focus on one aspect in particular, namely the ability to generalise to unseen tasks. Parametric generalisation relies on the interpolation power of a function approximator that is given the task description as input; one of its most common forms is universal value function approximators (UVFAs). Another way to generalise to new tasks is to exploit structure in the RL problem itself. Generalised policy improvement (GPI) combines solutions of previous tasks into a policy for the unseen task; this relies on instantaneous policy evaluation of old policies under the new reward function, which is made possible through successor features (SFs). Our proposed universal successor features approximators (USFAs) combine the advantages of all of these, namely the scalability of UVFAs, the instant inference of SFs, and the strong generalisation of GPI. We discuss the challenges involved in training a USFA and its generalisation properties, and demonstrate its practical benefits and transfer abilities on a large-scale domain in which the agent has to navigate a three-dimensional environment from a first-person perspective.
arXiv:1812.07626v1 fatcat:ptxih27fezbavg47nqil4w7qry
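The GPI-over-successor-features computation that USFAs build on can be sketched concisely: at a state, evaluate every stored policy on the new task via its SFs, then act greedily over both policies and actions. The array layout here is our own convention, not the paper's.

```python
import numpy as np

def gpi_action(psis, w):
    """Generalised policy improvement over successor features.
    psis: array [n_policies, n_actions, d] holding psi^pi(s, a) for the
    current state s; w: task vector, so Q^pi(s, a) = psi^pi(s, a) . w.
    Returns the action maximising Q over both policies and actions."""
    q = psis @ w                           # [n_policies, n_actions]
    return int(np.argmax(q.max(axis=0)))   # best action under the best policy
```

A USFA additionally makes `psis` the output of a function approximator conditioned on a policy/task embedding, which is what enables generalisation beyond the stored policies.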

Learning Shared Representations in Multi-task Reinforcement Learning [article]

Diana Borsa and Thore Graepel and John Shawe-Taylor
2016 arXiv   pre-print
We investigate a paradigm in multi-task reinforcement learning (MT-RL) in which an agent is placed in an environment and needs to learn to perform a series of tasks within this space. Since the environment does not change, there is potentially a lot of common ground amongst tasks, and learning to solve them individually seems extremely wasteful. In this paper, we explicitly model and learn this shared structure as it arises in the state-action value space. We show how one can jointly learn optimal value functions by modifying the popular Value-Iteration and Policy-Iteration procedures to accommodate this shared representation assumption and leverage the power of multi-task supervised learning. Finally, we demonstrate that the proposed model and training procedures are able to infer good value functions, even in low-sample regimes. In addition to data efficiency, we show in our analysis that learning abstractions of the state space jointly across tasks leads to more robust, transferable representations with the potential for better generalization.
arXiv:1603.02041v1 fatcat:wvziahyo7nhbbjsocw4oywuuqy
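One minimal way to picture "shared structure in the state-action value space" is a low-rank factorisation of per-task value tables into shared features and per-task weights. This truncated-SVD sketch is our own illustration of the idea, not the paper's training procedure.

```python
import numpy as np

def shared_value_representation(q_tables, rank):
    """Given per-task state-action value tables (each flattened to length
    n_states * n_actions), factor their stack as Q ~ Phi @ W, where Phi is
    a shared basis over (state, action) pairs and W holds per-task weights."""
    Q = np.stack(q_tables, axis=1)                 # [n_sa, n_tasks]
    U, s, Vt = np.linalg.svd(Q, full_matrices=False)
    phi = U[:, :rank] * s[:rank]                   # shared features per (s, a)
    W = Vt[:rank]                                  # per-task weight vectors
    return phi, W
```

When the tasks genuinely share structure, a small `rank` reconstructs all value tables well, and new tasks need only a new weight vector rather than a full table.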

Ray Interference: a Source of Plateaus in Deep Reinforcement Learning [article]

Tom Schaul, Diana Borsa, Joseph Modayil, Razvan Pascanu
2019 arXiv   pre-print
Rather than proposing a new method, this paper investigates an issue present in existing learning algorithms. We study the learning dynamics of reinforcement learning (RL), specifically a characteristic coupling between learning and data generation that arises because RL agents control their future data distribution. In the presence of function approximation, this coupling can lead to a problematic type of 'ray interference', characterized by learning dynamics that sequentially traverse a series of performance plateaus, effectively constraining the agent to learn one thing at a time even when learning in parallel is better. We establish the conditions under which ray interference occurs, show its relation to saddle points, and obtain the exact learning dynamics in a restricted setting. We characterize a number of its properties and discuss possible remedies.
arXiv:1904.11455v1 fatcat:2x7xkzw4hjhmth23uhmwfy5g4e
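A toy simulation can make the coupling above concrete. This is our own construction, not the paper's exact restricted setting: each objective component improves at a rate scaled by its own current value (standing in for performance-dependent data generation), so the weaker component sits on a long plateau while the stronger one is learned first.

```python
import numpy as np

def coupled_learning_curve(init, lr=0.5, steps=200):
    """Toy dynamics for components J_k in (0, 1):
        J_k <- J_k + lr * J_k * J_k * (1 - J_k)
    The extra factor of J_k models the coupling between performance and
    data generation; it makes learning sequential and plateau-like."""
    j = np.array(init, dtype=float)
    history = [j.copy()]
    for _ in range(steps):
        j = j + lr * j * j * (1.0 - j)
        history.append(j.copy())
    return np.array(history)
```

Plotting the two columns of the returned history shows the hallmark shape: the first component rises quickly while the second stays nearly flat before eventually taking off.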

The Wreath Process: A totally generative model of geometric shape based on nested symmetries [article]

Diana Borsa, Thore Graepel, Andrew Gordon
2015 arXiv   pre-print
We consider the problem of modelling noisy but highly symmetric shapes that can be viewed as hierarchies of whole-part relationships, in which higher-level objects are composed of transformed collections of lower-level objects. To this end, we propose the stochastic wreath process, a fully generative probabilistic model of drawings. Following Leyton's "Generative Theory of Shape", we represent shapes as sequences of transformation groups composed through a wreath product. This representation emphasizes the maximization of transfer: the idea that the most compact and meaningful representation of a given shape is achieved by maximizing the re-use of existing building blocks or parts. The proposed stochastic wreath process extends Leyton's theory by defining a probability distribution over geometric shapes in terms of noise processes that are aligned with the generative group structure of the shape. We propose an inference scheme for recovering the generative history of given images in terms of the wreath process, using reversible jump Markov chain Monte Carlo methods and Approximate Bayesian Computation. In the context of sketching, we demonstrate the feasibility and limitations of this approach on model-generated and real data.
arXiv:1506.03041v1 fatcat:sregnjraq5gzjhrvwnl6ekcd5i
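The nested-group composition at the heart of the wreath representation can be sketched as repeated application of transformation groups to a motif; each level transforms everything the previous levels produced. This is a bare deterministic sketch, omitting the noise processes and inference that the paper adds.

```python
import numpy as np

def apply_nested_symmetries(motif, groups):
    """Compose transformation groups in a nested (wreath-like) fashion.
    motif: initial points, shape [n, 2].
    groups: list of levels; each level is a list of (A, t) pairs, where A
    is a 2x2 linear map and t a translation, applied to all points so far."""
    points = np.asarray(motif, dtype=float)
    for transforms in groups:
        points = np.concatenate([points @ A.T + t for A, t in transforms])
    return points
```

Re-using a small motif across levels is exactly the "maximization of transfer" idea: the whole shape is described by a few group elements rather than by its individual points.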

The Option Keyboard: Combining Skills in Reinforcement Learning [article]

André Barreto, Diana Borsa, Shaobo Hou, Gheorghe Comanici, Eser Aygün, Philippe Hamel, Daniel Toyama, Jonathan Hunt, Shibl Mourad, David Silver, Doina Precup
2021 arXiv   pre-print
The Option Keyboard: Combining Skills in Reinforcement Learning (Supplementary Material). André Barreto, Diana Borsa, Shaobo Hou, Gheorghe Comanici, Eser Aygün, Philippe Hamel, Daniel Toyama, Jonathan Hunt, Shibl Mourad, David Silver, Doina Precup. DeepMind. Abstract: In this supplement ...
arXiv:2106.13105v1 fatcat:gvbme6pahfhbpczbeu62cu4g2y

Conditional Importance Sampling for Off-Policy Learning [article]

Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, Will Dabney
2020 arXiv   pre-print
The principal contribution of this paper is a conceptual framework for off-policy reinforcement learning, based on conditional expectations of importance sampling ratios. This framework yields new perspectives and understanding of existing off-policy algorithms, and reveals a broad space of unexplored algorithms. We theoretically analyse this space, and concretely investigate several algorithms that arise from this framework.
arXiv:1910.07479v2 fatcat:zic3vjldffgq7bevyzflwd4y2q
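The contrast between ordinary importance sampling and a conditioned variant can be sketched as follows. Conditioning on the return is one simple choice of statistic, picked here for illustration (the paper studies a much broader space); by the tower rule it leaves the estimator's expectation unchanged while averaging out variance in the ratios.

```python
import numpy as np

def importance_sampling_estimate(returns, rhos):
    """Ordinary off-policy estimate: the average of rho_i * G_i, where
    rho_i is the product of per-step ratios pi(a|s)/mu(a|s) along
    trajectory i and G_i is that trajectory's return."""
    returns = np.asarray(returns, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    return float(np.mean(rhos * returns))

def conditional_is_estimate(returns, rhos):
    """Conditioned variant: replace each ratio by the empirical mean of
    ratios among trajectories with the same return (conditioning on G).
    The sample mean is identical to the ordinary estimate, but the
    per-trajectory weights vary less."""
    returns = np.asarray(returns, dtype=float)
    rhos = np.asarray(rhos, dtype=float)
    cond = np.array([rhos[returns == g].mean() for g in returns])
    return float(np.mean(cond * returns))
```

On any sample the two estimates coincide exactly (grouped sums are preserved), which makes the variance reduction of the conditioned weights a free improvement in this toy setting.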

Use of Correctives and Fertilizers in Pasture in the Amazon Biome

Anderson Lange, Antonio Carlos Buchelt, Cleris Diana Borsa, Marcos Evaldo Capeletti, Evandro Luiz Schoninger, Rodrigo Sinaidi Zandonadi
2018 Nativa  
The objective of this work was to evaluate the green- and dry-mass productivity of the shoots and roots, and the accumulation of calcium and magnesium in the shoots, of Urochloa brizantha under applications of limestone, or of limestone and agricultural gypsum accompanied by fertilization with nitrogen (N), phosphorus (P), and potassium (K). The experimental design consisted of randomized blocks with four replicates and seven treatments: T0 = 0; T1 = 0.40; T2 = 0.80; T3 = 1.60; T4 = 3.20 t ha-1 of limestone; T5 = 1.60 t ha-1 of limestone plus NPK (40 kg ha-1 of N as ammonium sulfate, 120 kg ha-1 of P2O5 as single superphosphate, and 20 kg ha-1 of K2O as potassium chloride); and T6 = 1.50 t ha-1 of agricultural gypsum plus the same NPK as the previous treatment. Over the course of the experiment, eight cuts of the shoots were made, along with two evaluations of calcium and magnesium contents in the tissues and one evaluation of the root system. There was an effect on the accumulation of green mass and dry mass and on the calcium and magnesium contents in shoot tissue at the first cut, most notably with limestone and gypsum accompanied by NPK. Root accumulation in the soil profile responded linearly to the limestone rates evaluated. Keywords: limestone, gypsum, NPK, Urochloa brizantha.
doi:10.31413/nativa.v6i6.6330 fatcat:rixsfq2zm5birimt7db6rmxble

Temporal Difference Uncertainties as a Signal for Exploration [article]

Sebastian Flennerhag, Jane X. Wang, Pablo Sprechmann, Francesco Visin, Alexandre Galashov, Steven Kapturowski, Diana L. Borsa, Nicolas Heess, Andre Barreto, Razvan Pascanu
2021 arXiv   pre-print
... Borsa, D., Ding, D., Szepesvari, D., Ostrovski, G., Dabney, W., and Osindero, S. Adapting behaviour for learning progress. arXiv preprint arXiv:1912.06910, 2019. Schmidhuber, J. ...
arXiv:2010.02255v2 fatcat:5mtijqrltrekrferevd2jokgu4

Automatic Identification of Web-Based Risk Markers for Health Events

Elad Yom-Tov, Diana Borsa, Andrew C Hayward, Rachel A McKendry, Ingemar J Cox
2015 Journal of Medical Internet Research  
The escalating cost of global health care is driving the development of new technologies to identify early indicators of an individual's risk of disease. Traditionally, epidemiologists have identified such risk factors using medical databases and lengthy clinical studies, but these are often limited in size and cost, can fail to take full account of diseases carrying social stigma, and can miss transient acute risk factors. Objective: Here we report that Web search engine queries coupled with information on Wikipedia access patterns can be used to infer health events associated with an individual user and to automatically generate Web-based risk markers for some common medical conditions worldwide, from cardiovascular disease to sexually transmitted infections and mental health conditions, as well as pregnancy. Methods: Using anonymized datasets, we present methods to first distinguish individuals likely to have experienced specific health events, and to classify them into distinct categories. We then use the self-controlled case series method to find the incidence of health events in risk periods directly following a user's search for a query category, and compare it to the incidence during other periods for the same individuals. Results: Searches for pet stores were risk markers for allergy. We also identified some possible new risk markers; for example, searching for fast food and theme restaurants was associated with a transient increase in risk of myocardial infarction, suggesting this exposure goes beyond a long-term risk factor and may also act as an acute trigger of myocardial infarction. Dating and adult content websites were risk markers for sexually transmitted infections, such as human immunodeficiency virus (HIV). Conclusions: Web-based methods provide a powerful, low-cost approach to automatically identifying risk factors, and support more timely and personalized public health efforts to bring human and economic benefits. (J Med Internet Res 2015;17(1):e29)
doi:10.2196/jmir.4082 pmid:25626480 pmcid:PMC4327439 fatcat:lkozqz5ccbff3otymop2r7gguy
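The self-controlled comparison in the Methods above reduces, per query category, to an incidence-rate ratio computed within the same individuals; a minimal sketch, with the aggregation over users and categories left out:

```python
def incidence_rate_ratio(events_risk, time_risk, events_control, time_control):
    """Self-controlled comparison: the rate of health events in the risk
    window directly following an exposure (e.g. a search in a query
    category) divided by the rate during the same individuals' other
    (control) time. Times are person-time in consistent units."""
    rate_risk = events_risk / time_risk
    rate_control = events_control / time_control
    return rate_risk / rate_control
```

A ratio well above 1 flags the query category as a candidate risk marker; because each person serves as their own control, stable between-person confounders cancel out.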
Showing results 1 — 15 out of 405 results