2,079 Hits in 3.4 sec

LESS is More: Rethinking Probabilistic Models of Human Behavior

Andreea Bobu, Dexter R. R. Scobee, Jaime F. Fisac, S. Shankar Sastry, Anca D. Dragan
2020 Proceedings of the 2020 ACM/IEEE International Conference on Human-Robot Interaction  
In contrast, human trajectories lie in a continuous space, with continuous-valued features that influence the reward function.  ...  We then analyze the implications this has for robot inference, first in toy environments where we have ground truth and find more accurate inference, and finally for a 7DOF robot arm learning from user  ...  Further studies on human behavior in more realistic settings would be useful, but complicated by lack of access to the "ground truth" reward.  ... 
doi:10.1145/3319502.3374811 dblp:conf/hri/BobuSFSD20 fatcat:c5kgjpqaizdnlmj6t5nxcywyvm
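
The entry above contrasts continuous-valued trajectory features with the Boltzmann-rational trajectory model that reward-inference work in this vein typically starts from. A minimal sketch of that baseline observation model follows; the feature values, reward weights, and rationality coefficient beta are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def boltzmann_trajectory_likelihoods(features, theta, beta=1.0):
    """P(trajectory | theta) proportional to exp(beta * theta . phi(trajectory)),
    normalized over a finite set of candidate trajectories."""
    scores = beta * features @ theta        # one score per candidate trajectory
    scores -= scores.max()                  # subtract max for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Toy example: 3 candidate trajectories with 2 continuous features each
# (hypothetical values, e.g. path length and clearance from an obstacle).
phi = np.array([[1.0, 0.2],
                [0.8, 0.9],
                [1.5, 0.1]])
theta = np.array([-1.0, 2.0])   # assumed reward weights: short paths, large clearance
print(boltzmann_trajectory_likelihoods(phi, theta))
```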

Accounting for Human Learning when Inferring Human Preferences [article]

Harry Giles, Lawrence Chan
2020 arXiv   pre-print
In addition, we find evidence that misspecification can lead to poor inference, suggesting that modelling human learning is important, especially when the human is facing an unfamiliar environment.  ...  Surprisingly, we find in some small examples that this can lead to better inference than if the human was stationary.  ...  Larger domains and approximate inference As we performed exact inference in this work, we were restricted to domains with a small set of discrete reward parameters.  ... 
arXiv:2011.05596v2 fatcat:prpe6bagfzdxbahj2txv5b35a4
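
The snippet notes that the analysis performs exact inference over a small set of discrete reward parameters. A minimal sketch of that kind of exact Bayesian update is given below, assuming a Boltzmann model of the observed human actions; the Q-values, prior, and temperature are hypothetical.

```python
import numpy as np

def exact_reward_posterior(observed_actions, q_values, prior, beta=2.0):
    """Exact posterior over a discrete set of reward parameters.
    q_values[k, a] is the value of action a under reward parameter k;
    observed_actions is a sequence of action indices taken by the human."""
    log_post = np.log(prior)
    for a in observed_actions:
        logits = beta * q_values
        log_lik = logits[:, a] - np.log(np.exp(logits).sum(axis=1))
        log_post += log_lik                  # Bayes rule in log space
    log_post -= log_post.max()
    post = np.exp(log_post)
    return post / post.sum()

# Hypothetical example: 3 candidate reward parameters, 2 actions.
q = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
print(exact_reward_posterior([1, 1, 0], q, prior=np.ones(3) / 3))
```

This stationary-human likelihood is exactly the assumption the paper relaxes by letting the human learn over time.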

Should Robots be Obedient?

Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, Stuart Russell
2017 Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence  
Finally, we study how robots can start detecting such model misspecification.  ...  Overall, our work suggests that there might be a middle ground in which robots intelligently decide when to obey human orders, but err on the side of obedience.  ...  S is a set of world states. Θ is a set of static reward parameters. The hidden state space of the POMDP is S × Θ and at each step R observes the current world state and H's order.  ... 
doi:10.24963/ijcai.2017/662 dblp:conf/ijcai/MilliHDR17 fatcat:zhvquksfp5cf5fdvxr2qi7xhtu
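
The snippet spells out the formal setup: world states S, static reward parameters Θ, a hidden state space S × Θ, and a robot that observes the current world state and the human's order at each step. A minimal sketch of the belief update over Θ that such a robot could run is shown below, assuming a Boltzmann model of how the human issues orders; the Q-table and parameters are illustrative.

```python
import numpy as np

def update_reward_belief(belief, order, state, q_fn, beta=5.0):
    """Update b(theta) after observing the human's order in the current
    world state, assuming orders are chosen Boltzmann-rationally with
    respect to Q(state, order; theta)."""
    posterior = []
    for theta, prior in enumerate(belief):
        q = np.asarray(q_fn(state, theta))       # Q-values over possible orders
        logits = beta * q - beta * q.max()       # stabilize the softmax
        lik = np.exp(logits[order]) / np.exp(logits).sum()
        posterior.append(prior * lik)
    posterior = np.array(posterior)
    return posterior / posterior.sum()

# Hypothetical usage: 2 candidate reward parameters, 3 possible orders.
q_table = {(0, 0): [1.0, 0.0, 0.0],
           (0, 1): [0.0, 1.0, 0.0]}
print(update_reward_belief([0.5, 0.5], order=1, state=0,
                           q_fn=lambda s, th: q_table[(s, th)]))
```

Whether the robot then obeys the order or acts on its updated belief is the trade-off the paper analyzes.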

Should Robots be Obedient? [article]

Smitha Milli, Dylan Hadfield-Menell, Anca Dragan, Stuart Russell
2017 arXiv   pre-print
Finally, we study how robots can start detecting such model misspecification.  ...  Overall, our work suggests that there might be a middle ground in which robots intelligently decide when to obey human orders, but err on the side of obedience.  ...  S is a set of world states. Θ is a set of static reward parameters. The hidden state space of the POMDP is S × Θ and at each step R observes the current world state and H's order.  ... 
arXiv:1705.09990v1 fatcat:shak3nj47vh7tginijzkzn56ha

Bootstrap Thompson Sampling and Sequential Decision Problems in the Behavioral Sciences

Dean Eckles, Maurits Kaptein
2019 SAGE Open  
We illustrate its robustness to model misspecification, which is a common concern in behavioral science applications.  ...  Behavioral scientists are increasingly able to conduct randomized experiments in settings that enable rapidly updating probabilities of assignment to treatments (i.e., arms).  ...  settings, especially when this dependence is otherwise difficult to account for in inference (Cameron & Miller, 2015).  ... 
doi:10.1177/2158244019851675 fatcat:xyqgfmxkm5hsfbnztj7v2nvi2y
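
The entry concerns Bootstrap Thompson Sampling (BTS), which replaces the exact posterior used by Thompson sampling with a set of bootstrap replicates. The sketch below shows the idea for Bernoulli rewards; the number of replicates, the add-one initialization, and the double-or-nothing update are assumptions of this sketch rather than the papers' exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

class BernoulliBTS:
    """Bootstrap Thompson Sampling sketch for Bernoulli rewards: each arm
    keeps J bootstrap replicates of (successes, trials); to choose an arm,
    sample one replicate per arm and pick the best estimate."""

    def __init__(self, num_arms, J=100):
        self.successes = np.ones((num_arms, J))      # add-one smoothing
        self.trials = np.full((num_arms, J), 2.0)

    def select_arm(self):
        arms = np.arange(self.successes.shape[0])
        j = rng.integers(self.successes.shape[1], size=arms.size)
        estimates = self.successes[arms, j] / self.trials[arms, j]
        return int(np.argmax(estimates))

    def update(self, arm, reward):
        # Online double-or-nothing bootstrap: each replicate sees the new
        # observation with probability 1/2, counted twice when it does.
        weights = 2.0 * rng.integers(0, 2, size=self.successes.shape[1])
        self.successes[arm] += weights * reward
        self.trials[arm] += weights

# Usage: bandit = BernoulliBTS(3); a = bandit.select_arm(); bandit.update(a, 1)
```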

Revisiting the importance of model fitting for model-based fMRI: It does matter in computational psychiatry

Kentaro Katahira, Asako Toyama, Woo-Young Ahn
2021 PLoS Computational Biology  
(e.g., depression) exhibit diminished neural responses to reward prediction errors (RPEs), which are the differences between experienced and predicted rewards.  ...  We demonstrate that parameter misspecification can critically affect the results of group comparison.  ...  Settings other than the reward contingency were the same as in Fig 7 ("Effect of model-misspecification" section).  ... 
doi:10.1371/journal.pcbi.1008738 pmid:33561125 fatcat:udu6ujra3fbbjgjx745yyjrcjy
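
The snippet refers to reward prediction errors (RPEs), the trial-by-trial differences between experienced and predicted rewards that model-based fMRI regresses against neural data. A minimal sketch of how such RPEs are generated from a basic Q-learning model is given below; the learning rate and the toy choice and reward sequences are illustrative, and misspecifying a parameter such as alpha is the kind of issue the paper examines.

```python
import numpy as np

def q_learning_rpes(choices, rewards, num_options=2, alpha=0.3):
    """Trial-by-trial reward prediction errors delta_t = r_t - Q_t(choice_t)
    from a basic Q-learning model."""
    q = np.zeros(num_options)
    rpes = []
    for c, r in zip(choices, rewards):
        delta = r - q[c]          # experienced minus predicted reward
        rpes.append(delta)
        q[c] += alpha * delta     # value update
    return np.array(rpes)

# Toy sequence: changing alpha changes the RPE regressor, which in turn
# can change group comparisons built on it.
print(q_learning_rpes(choices=[0, 0, 1, 0], rewards=[1, 0, 1, 1]))
```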

An Analysis of an Alternative Pythagorean Expected Win Percentage Model: Applications Using Major League Baseball Team Quality Simulations [article]

Justin Ehrlich, Christopher Boudreaux, James Boudreau, Shane Sanders
2021 arXiv   pre-print
We find that the difference-form CSF model outperforms the traditional Pythagorean model in terms of explanatory power and in terms of misspecification-based information loss as estimated by the Akaike  ...  We estimate expected win percentage using the traditional Pythagorean model, as well as the difference-form CSF model that is used in game theory and public choice economics.  ...  In fact, we find that the SBS-generated data set aligns closely to real-world data with respect to expected win percentage estimation in MLB.  ... 
arXiv:2112.14846v1 fatcat:u5d6nqmdavfppczpdlkjbtp5ya
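
For concreteness, here is a minimal sketch contrasting the traditional Pythagorean expectation with a difference-form (logit) contest success function of the kind used in game theory and public choice economics; the exponent and scale values are placeholders, not the paper's estimates.

```python
import numpy as np

def pythagorean_win_pct(runs_scored, runs_allowed, gamma=1.83):
    """Traditional Pythagorean expectation: RS^g / (RS^g + RA^g).
    gamma = 1.83 is a commonly used value, taken here only as a placeholder."""
    return runs_scored**gamma / (runs_scored**gamma + runs_allowed**gamma)

def difference_form_win_pct(runs_scored, runs_allowed, scale=0.006):
    """Difference-form (logit) CSF: 1 / (1 + exp(-scale * (RS - RA))).
    The scale parameter would normally be estimated from data."""
    return 1.0 / (1.0 + np.exp(-scale * (runs_scored - runs_allowed)))

print(pythagorean_win_pct(800, 700))
print(difference_form_win_pct(800, 700))
```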

Reward-rational (implicit) choice: A unifying formalism for reward learning [article]

Hong Jun Jeon, Smitha Milli, Anca D. Dragan
2020 arXiv   pre-print
Our key insight is that different types of behavior can be interpreted in a single unifying formalism - as a reward-rational choice that the human is making, often implicitly.  ...  The types of behavior interpreted as evidence of the reward function have expanded greatly in recent years.  ...  Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.  ... 
arXiv:2002.04833v4 fatcat:wa43c7vxsrhp7if4xuh6qcd7te
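
The unifying formalism treats every type of feedback as a choice the human makes, possibly implicitly, from some set of options, evaluated by the reward it implies. A minimal sketch of that reward-rational choice likelihood over a discrete set of reward hypotheses follows; the option values and hypotheses are illustrative, and the grounding of options into trajectories is abstracted into a value table.

```python
import numpy as np

def reward_rational_choice_update(prior, option_values, chosen, beta=1.0):
    """One reward-rational (implicit) choice update.
    option_values[k, c] is the value of option c under reward hypothesis k;
    the human is modeled as picking options with probability proportional
    to exp(beta * value)."""
    logits = beta * option_values
    logits -= logits.max(axis=1, keepdims=True)
    lik = np.exp(logits[:, chosen]) / np.exp(logits).sum(axis=1)
    post = prior * lik
    return post / post.sum()

# Hypothetical example: 2 reward hypotheses, 3 options. Demonstrations,
# comparisons, corrections, or an off-switch press can all be cast as
# such a choice by picking the right option set.
values = np.array([[2.0, 1.0, 0.0],
                   [0.0, 1.0, 2.0]])
print(reward_rational_choice_update(np.array([0.5, 0.5]), values, chosen=0))
```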

Thompson sampling with the online bootstrap [article]

Dean Eckles, Maurits Kaptein
2014 arXiv   pre-print
We first explain BTS and show that the performance of BTS is competitive with Thompson sampling in the well-studied Bernoulli bandit case.  ...  Thompson sampling provides a solution to bandit problems in which new observations are allocated to arms with the posterior probability that an arm is optimal.  ...  In this section we also discuss in more detail the choice of J, which can be regarded as a tuning parameter in BTS.  ... 
arXiv:1410.4009v1 fatcat:gj5ba3o2ozabjbp36l73v2nf3q
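
The snippet gives the core of Thompson sampling: allocate observations to arms with the posterior probability that each arm is optimal. A minimal Beta-Bernoulli version, the exact-posterior baseline that BTS is compared against, is sketched below; the priors, horizon, and simulated arm probabilities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def thompson_sampling_bernoulli(true_probs, horizon=1000):
    """Beta-Bernoulli Thompson sampling: draw one sample from each arm's
    Beta posterior and pull the arm with the largest draw."""
    k = len(true_probs)
    alpha, beta = np.ones(k), np.ones(k)          # Beta(1, 1) priors
    total_reward = 0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(alpha, beta)))
        reward = rng.random() < true_probs[arm]   # simulated Bernoulli reward
        alpha[arm] += reward
        beta[arm] += 1 - reward
        total_reward += reward
    return total_reward

print(thompson_sampling_bernoulli([0.1, 0.5, 0.7]))
```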

Combining Reward Information from Multiple Sources [article]

Dmitrii Krasheninnikov, Rohin Shah, Herke van Hoof
2021 arXiv   pre-print
In such a setting, we would like to retreat to a broader distribution over reward functions, in order to mitigate the effects of misspecification.  ...  We study this problem in the setting with two conflicting reward functions learned from different sources.  ...  one in which it was specified or inferred.  ... 
arXiv:2103.12142v1 fatcat:jnqlwj5eo5ajtbwhusn3oh54gi
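
The snippet argues for retreating to a broader distribution over reward functions when two learned sources conflict. The sketch below is only an illustration of that tension, not the paper's method: a product of the two distributions sharpens the belief (trusting both sources), while a mixture stays broad when they disagree.

```python
import numpy as np

def combine_reward_beliefs(p1, p2):
    """Two naive ways to combine distributions over a discrete set of
    reward functions."""
    product = p1 * p2
    product = product / product.sum()    # sharpens; assumes both sources are right
    mixture = 0.5 * (p1 + p2)            # broader; hedges between the sources
    return product, mixture

# Conflicting sources, each confident in a different reward function.
p1 = np.array([0.8, 0.1, 0.1])
p2 = np.array([0.1, 0.1, 0.8])
print(combine_reward_beliefs(p1, p2))
```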

Human irrationality: both bad and good for reward inference [article]

Lawrence Chan, Andrew Critch, Anca Dragan
2021 arXiv   pre-print
Assuming humans are (approximately) rational enables robots to infer reward functions by observing human behavior.  ...  We thus operationalize irrationality in the language of MDPs, by altering the Bellman optimality equation, and use this framework to study how these alterations would affect inference.  ...  The degree affects reward inference, with many settings naturally resulting in worse inference, especially at the extremes.  ... 
arXiv:2111.06956v1 fatcat:gu5p7uues5hmhpklfkpgfqzj5m
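
The snippet describes operationalizing irrationality by altering the Bellman optimality equation. One such alteration, replacing the hard max with a Boltzmann (log-sum-exp) backup, is sketched below; the toy MDP, discount, and temperature are illustrative, and the paper also considers alterations to other terms of the equation.

```python
import numpy as np

def soft_value_iteration(P, R, gamma=0.95, beta=2.0, iters=500):
    """Value iteration with the Bellman max replaced by a Boltzmann
    (log-sum-exp) backup, one way to model a noisily rational human.
    P has shape (A, S, S'); R has shape (S, A)."""
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        Q = R + gamma * np.einsum("asn,n->sa", P, V)   # expected next value
        m = Q.max(axis=1)
        # (1/beta) * log sum_a exp(beta * Q[s, a]), computed stably
        V = m + np.log(np.exp(beta * (Q - m[:, None])).sum(axis=1)) / beta
    return V

# Toy 2-state, 2-action MDP. As beta grows this recovers the standard
# Bellman optimality backup; small beta models a noisier human.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])
print(soft_value_iteration(P, R))
```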

Statistical Inference for Online Decision-Making: In a Contextual Bandit Setting

Haoyu Chen, Wenbin Lu, Rui Song
2020 Journal of the American Statistical Association  
Common solutions often need to learn a reward model of different actions given the contextual information and then maximize the long-term reward.  ...  It is meaningful to know if the posited model is reasonable and how the model performs in the asymptotic sense.  ...  An ideal choice of ε_t should decrease as fast as possible, provided it satisfies the conditions for making inference.  ... 
doi:10.1080/01621459.2020.1770098 pmid:33737759 pmcid:PMC7962379 fatcat:kqtyfka5j5cofnfuq2kdxak3oq
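
The snippet's ε_t is the exploration rate of the online decision rule, which should shrink quickly while still satisfying the conditions needed for valid inference. A minimal sketch of an ε-greedy step with a polynomially decaying ε_t and a linear reward model is shown below; the decay exponent and the model are assumptions of this sketch, not the paper's conditions.

```python
import numpy as np

rng = np.random.default_rng(2)

def epsilon_greedy_step(t, context, theta_hat, num_arms, decay=0.3):
    """One epsilon-greedy decision with epsilon_t = (t + 1)^(-decay):
    explore uniformly with probability epsilon_t, otherwise pick the arm
    with the highest estimated reward under theta_hat (num_arms x d)."""
    eps_t = (t + 1) ** (-decay)
    if rng.random() < eps_t:
        return int(rng.integers(num_arms)), eps_t
    return int(np.argmax(theta_hat @ context)), eps_t

# Hypothetical usage with d = 2 features and 3 arms.
theta_hat = rng.normal(size=(3, 2))
print(epsilon_greedy_step(t=10, context=np.array([1.0, 0.5]),
                          theta_hat=theta_hat, num_arms=3))
```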

PG-TS: Improved Thompson Sampling for Logistic Contextual Bandits [article]

Bianca Dumitrascu, Karen Feng, Barbara E Engelhardt
2018 arXiv   pre-print
We address the problem of regret minimization in logistic contextual bandits, where a learner decides among sequential actions or arms given their respective contexts to maximize binary rewards.  ...  PG-TS explores the action space efficiently and exploits high-reward arms, quickly converging to solutions of low regret.  ...  Both of these situations arise in the online learning setting, creating a need for novel TS approaches to inference.  ... 
arXiv:1805.07458v1 fatcat:a5rc4ujdlfh2jgj52misat6tsm
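
PG-TS itself samples from the logistic posterior with Pólya-Gamma Gibbs steps, which are beyond a short sketch; the code below substitutes a Laplace approximation to show the shape of the Thompson-sampling loop for a logistic contextual bandit. The substitution, prior variance, and toy data are all assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_ts_arm(X, y, arm_contexts, prior_var=1.0, newton_steps=25):
    """Pick an arm by Thompson sampling with a Laplace-approximate posterior
    over the logistic parameter (a stand-in for PG-TS's Polya-Gamma sampler).
    X: past contexts (n, d); y: binary rewards (n,)."""
    d = arm_contexts.shape[1]
    theta = np.zeros(d)
    for _ in range(newton_steps):                     # MAP via Newton's method
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p) - theta / prior_var
        hess = -(X.T * (p * (1 - p))) @ X - np.eye(d) / prior_var
        theta -= np.linalg.solve(hess, grad)
    cov = np.linalg.inv(-hess)                        # Laplace covariance
    sample = rng.multivariate_normal(theta, cov)      # one posterior draw
    return int(np.argmax(arm_contexts @ sample))

# Hypothetical usage: 20 past observations, d = 3, 4 candidate arm contexts.
X = rng.normal(size=(20, 3))
y = (rng.random(20) < sigmoid(X @ np.array([1.0, -1.0, 0.5]))).astype(float)
print(logistic_ts_arm(X, y, arm_contexts=rng.normal(size=(4, 3))))
```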

Zero-Shot Assistance in Novel Decision Problems [article]

Sebastiaan De Peuter, Samuel Kaski
2022 arXiv   pre-print
Finally, we show experimentally that our approach adapts to these agent biases, and results in higher cumulative reward for the agent than automation-based alternatives.  ...  To do this, we introduce a novel formalization of assistance that models these biases, allowing the assistant to infer and adapt to them.  ...  As this choice is easier than the choice in equation 2, we use a different temperature parameter β_2 here.  ... 
arXiv:2202.07364v1 fatcat:rf6zyvqgcraurbeijgqjgfsbvu

Inverse Reward Design [article]

Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, Anca Dragan
2020 arXiv   pre-print
We introduce inverse reward design (IRD) as the problem of inferring the true objective based on the designed reward and the training MDP.  ...  Our insight is that reward functions are merely observations about what the designer actually wants, and that they should be interpreted in the context in which they were designed.  ...  case w_i in our set.  ... 
arXiv:1711.02827v2 fatcat:u3vsvf7vm5blvausmkyn7mf564
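
IRD treats the designed (proxy) reward as an observation about the true objective, interpreted in the training MDP where it was designed. A minimal sketch of the resulting posterior over a discrete set of candidate true rewards is given below; the feature expectations, candidate set, and temperature are illustrative.

```python
import numpy as np

def ird_posterior(feature_exps, proxy_idx, candidate_true_rewards,
                  beta=1.0, prior=None):
    """Inverse reward design posterior over candidate true reward weights.
    feature_exps[p] are the feature expectations of behavior induced by
    proxy reward p in the training MDP; the designed proxy is modeled as
    chosen with probability proportional to exp(beta * w_true . phi)."""
    if prior is None:
        prior = np.ones(len(candidate_true_rewards))
    post = []
    for w, pr in zip(candidate_true_rewards, prior):
        scores = beta * feature_exps @ w    # value of each proxy's behavior under w
        scores -= scores.max()
        lik = np.exp(scores[proxy_idx]) / np.exp(scores).sum()
        post.append(pr * lik)
    post = np.array(post)
    return post / post.sum()

# Illustrative example: 2 possible proxies and 2 candidate true rewards w_i.
feature_exps = np.array([[1.0, 0.0],    # behavior induced by proxy 0
                         [0.5, 0.5]])   # behavior induced by proxy 1
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(ird_posterior(feature_exps, proxy_idx=0, candidate_true_rewards=candidates))
```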
Showing results 1 — 15 out of 2,079 results