The Eigenoption-Critic Framework [article]

Miao Liu, Marlos C. Machado, Gerald Tesauro, Murray Campbell
2017 arXiv   pre-print
et al. [2017a,c].  ...  since the late 90's, with a large number of HRL algorithms being based on the options framework Simsek and Barto [2004], Daniel et al. [2016], Florensa et al. [2017], Konidaris and Barto [2009], Machado  ... 
arXiv:1712.04065v1 fatcat:slkplzgaczeyljtp532plykeiq

Domain-Independent Optimistic Initialization for Reinforcement Learning [article]

Marlos C. Machado, Sriram Srinivasan, Michael Bowling
2014 arXiv   pre-print
In Reinforcement Learning (RL), it is common to use optimistic initialization of value functions to encourage exploration. However, such an approach generally depends on the domain, viz., the scale of the rewards must be known, and the feature representation must have a constant norm. We present a simple approach that performs optimistic initialization with less dependence on the domain.
arXiv:1410.4604v1 fatcat:lni7etwms5a3zlmizp3icaa654

Learning Purposeful Behaviour in the Absence of Rewards [article]

Marlos C. Machado, Michael Bowling
2016 arXiv   pre-print
Artificial intelligence is commonly defined as the ability to achieve goals in the world. In the reinforcement learning framework, goals are encoded as reward functions that guide agent behaviour, and the sum of observed rewards provides a notion of progress. However, some domains have no such reward signal, or have a reward signal so sparse as to appear absent. Without reward feedback, agent behaviour is typically random, often dithering aimlessly and lacking intentionality. In this paper we present an algorithm capable of learning purposeful behaviour in the absence of rewards. The algorithm proceeds by constructing temporally extended actions (options), through the identification of purposes that are "just out of reach" of the agent's current behaviour. These purposes establish intrinsic goals for the agent to learn, ultimately resulting in a suite of behaviours that encourage the agent to visit different parts of the state space. Moreover, the approach is particularly suited for settings where rewards are very sparse, and such behaviours can help in the exploration of the environment until reward is observed.
arXiv:1605.07700v1 fatcat:6ob7d5uhnvegxntmnmlkcbzb3m

A Methodology for Player Modeling based on Machine Learning [article]

Marlos C. Machado
2013 arXiv   pre-print
. • Machado, M. C., Fantini, E. P. C., and Chaimowicz, L. Player Modeling: What is it? How to do it?.  ...  • Machado, M. C., Fantini, E. P. C., and Chaimowicz, L. Player Modeling: Towards a Common Taxonomy.  ...  C.1 Pretest Questionnaire The pretest questionnaire was the larger questionnaire players were asked to answer.  ... 
arXiv:1312.3903v1 fatcat:dshyw7sdavef3nwatus7ympag4

Introspective Agents: Confidence Measures for General Value Functions [article]

Craig Sherstan, Adam White, Marlos C. Machado, Patrick M. Pilarski
2016 arXiv   pre-print
(c) The robot makes predictions of the expected squared TD error of the primary hue prediction.  ...  Predictions of variance (not shown) produce a similar pattern to that of (c). The robot can decide to only trust predictions in portions of the world it has visited before, here the upper left.  ... 
arXiv:1606.05593v1 fatcat:l7h27asdivd3zc52ddul3ijkci

Temporal Abstraction in Reinforcement Learning with the Successor Representation [article]

Marlos C. Machado, Andre Barreto, Doina Precup
2021 arXiv   pre-print
Marlos C. Machado and Doina Precup are supported by a Canada CIFAR AI Chair.  ...  Acknowledgements This work was partially developed while Marlos C. Machado was at Google Research, Brain Team.  ...  The option computed in Eq. 21 is an approximation of the option π c+ induced by c, that is,π c+ ≈ π c+ .  ... 
arXiv:2110.05740v1 fatcat:zrguhnljlvbyhlvqlir4cpfrx4

Accelerating Learning in Constructive Predictive Frameworks with the Successor Representation [article]

Craig Sherstan, Marlos C. Machado, Patrick M. Pilarski
2018 arXiv   pre-print
University of Alberta, Canada {sherstan, machado, pilarski} Note that Dayan describes the SR as predicting future state visitation from time t onward.  ...  The first step in our algorithm is to compute the one-step average cumulant, which we do with TD error: $\delta_t = C_{t+1} - c(\phi(S_t))$. (4) If we use linear function approximation to estimate $c$ then $c(\phi(s)) = \phi$  ... 
arXiv:1803.09001v1 fatcat:2ql23yanynhkrox5dpilq5mkja

Count-Based Exploration with the Successor Representation [article]

Marlos C. Machado, Marc G. Bellemare, Michael Bowling
2019 arXiv   pre-print
Marlos C. Machado performed this work while at the University of Alberta.  ...  Reference functions $f_1(n) = c_1 \sqrt{n}$ and $f_2(n) = c_2 n$ are depicted for comparison ($c_1 = 0.19$; $c_2 = 0.006$). See text for details.  ...  ), and $C(s, s')$ is the sum of the rewards associated with the $n(s, s')$ transitions (we drop the action to simplify notation).  ... 
arXiv:1807.11622v4 fatcat:uhojfpkbybh4xdc7w5odz5eh3e

Generalization and Regularization in DQN [article]

Jesse Farebrother, Marlos C. Machado, Michael Bowling
2020 arXiv   pre-print
Marlos C. Machado performed part of this work while at the University of Alberta.  ...  We used the default value of sticky actions (Machado et al., 2018).  ...  As Machado et al. (2018), hereinafter we call each mode/difficulty pair a flavour.  ... 
arXiv:1810.00123v3 fatcat:fatmy5auovfcximggvj3hrvlgq

Player modeling: Towards a common taxonomy

Marlos C. Machado, Eduardo P. C. Fantini, Luiz Chaimowicz
2011 2011 16th International Conference on Computer Games (CGAMES)  
Marlos C. Machado is a graduate student in the Computer Science Department at Federal University of Minas Gerais.  ...  The opponents artificial intelligence can be written in C or C++ directly in the platform source code.  ... 
doi:10.1109/cgames.2011.6000359 dblp:conf/cgames/MachadoFC11 fatcat:et55vbontrfynggu6ot6vx6eua

True Online Temporal-Difference Learning [article]

Harm van Seijen, A. Rupam Mahmood, Patrick M. Pilarski, Marlos C. Machado, Richard S. Sutton
2016 arXiv   pre-print
In Appendix C, the results for all α values are shown.  ...  From the condition $\sum_{i=0}^{t-1} \Delta_i^t = 0$ it follows that $C > 0$.  ...  Then, for all time steps $t$: Proof We prove the theorem by showing that $\|\theta_t^{td} - \theta_t^{\lambda}\| / \|\theta_t^{td} - \theta_0\|$ can be approximated by $O(\alpha) / (C + O(\alpha))$ as $\alpha \to 0$, with $C > 0$.  ... 
arXiv:1512.04087v2 fatcat:m6o7pjfxyjdszizln76rrdo6ua

Combining Metaheuristics and CSP Algorithms to Solve Sudoku

Marlos C. Machado, Luiz Chaimowicz
2011 2011 Brazilian Symposium on Games and Digital Entertainment  
The ml is calculated with the following formula, where f(r, c) is the non-fixed cells number in the square (r, c).  ...  Heuristic-1 This first heuristic is the implementation of two traditional constraint satisfaction algorithms: ARC-3 (arc-consistency) and PC-2 (path consistency).  ... 
doi:10.1109/sbgames.2011.18 dblp:conf/sbgames/MachadoC11 fatcat:ppwac62zizbp7duixg776taz6e

An operator view of policy gradient methods [article]

Dibya Ghosh, Marlos C. Machado, Nicolas Le Roux
2020 arXiv   pre-print
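We have $\nabla_\theta \sum_s d^{\pi^*}(s) V^{\pi^*}(s) \, \mathrm{KL}(Q^{\pi^*} \pi^* \,\|\, \pi) \big|_{\pi=\pi^*} = \sum_s d^{\pi^*}(s) \sum_a \pi^*(a|s) Q^{\pi^*}(s, a) \frac{\partial \log \pi_\theta(a|s)}{\partial \theta} \Big|_{\theta=\theta^*}$ (28) $= 0$ by definition of $\pi^*$. (29) C Expected return of the improved policy We  ... 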
arXiv:2006.11266v3 fatcat:pfkddm36mnhgpdeib44cjff2qa

Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning [article]

Rishabh Agarwal, Marlos C. Machado, Pablo Samuel Castro, Marc G. Bellemare
2021 arXiv   pre-print
Assume o t x = 0.1 W c W d x s t and o t y = 0.1 W c W d y s t for distractor semi-orthogonal matrices W d X and W d Y , respectively.  ...  C BISIMULATION METRICS Notation. We use the notation as defined in Section 2.  ... 
arXiv:2101.05265v2 fatcat:54kisb4zjzc3zkfj2x47myquma

A Laplacian Framework for Option Discovery in Reinforcement Learning [article]

Marlos C. Machado, Marc G. Bellemare, Michael Bowling
2017 arXiv   pre-print
Correspondence to: Marlos C. Machado.  ...  C. Options Leading to Doorways in the 4-room Domain D.  ... 
arXiv:1703.00956v2 fatcat:wqphw6bc6rfj5cgcgh4lbxxpou