
Reducing Sampling Error in Batch Temporal Difference Learning [article]

Brahma Pavse, Ishan Durugkar, Josiah Hanna, Peter Stone
2020 arXiv   pre-print
., 2018), policy evaluation (Hanna et al., 2019), and policy gradient learning (Hanna & Stone, 2019). ...  Hanna, J. and Stone, P. Reducing sampling error in the Monte Carlo policy gradient estimator.  ...
arXiv:2008.06738v1 fatcat:qpyg7ke7djgwjojoct2rierqou

Data-Efficient Policy Evaluation Through Behavior Policy Search [article]

Josiah P. Hanna, Philip S. Thomas, Peter Stone, Scott Niekum
2017 arXiv   pre-print
Josiah Hanna is supported by an NSF Graduate Research Fellowship. Peter Stone serves on the Board of Directors of Cogitai, Inc.  ... 
arXiv:1706.03469v1 fatcat:x3jiksf7gzac5fstknxata2n7m

Reinforced Grounded Action Transformation for Sim-to-Real Transfer [article]

Haresh Karnan, Siddharth Desai, Josiah P. Hanna, Garrett Warnell, Peter Stone
2020 arXiv   pre-print
Hanna and Stone demonstrate that GAT can transfer a bipedal walk from a simulator to a physical NAO robot.  ...  Policy Representation: Consistent with Hanna and Stone [2], we find that GAT works well on transferring policies where the policy representation is low dimensional.  ...
arXiv:2008.01279v1 fatcat:3mlz6cmx75bsdb4xwv75nqrn5m

Stochastic Grounded Action Transformation for Robot Learning in Simulation [article]

Siddharth Desai, Haresh Karnan, Josiah P. Hanna, Garrett Warnell, Peter Stone
2020 arXiv   pre-print
While GAT works well on fairly deterministic environments, as was shown by Hanna and Stone [6], in our experimentation we find that policies learned using GAT perform poorly when transferring to highly  ...
arXiv:2008.01281v1 fatcat:cjrhkrlkgjca7kl6in6i4dgvcu

Learning an Interpretable Traffic Signal Control Policy [article]

James Ault, Josiah P. Hanna, Guni Sharon
2020 arXiv   pre-print
Signalized intersections are managed by controllers that assign right of way (green, yellow, and red lights) to non-conflicting directions. Optimizing the actuation policy of such controllers is expected to alleviate traffic congestion and its adverse impact. Given such a safety-critical domain, the affiliated actuation policy is required to be interpretable in a way that can be understood and regulated by a human. This paper presents and analyzes several on-line optimization techniques for learning interpretable control functions. Although these techniques are defined in a general way, this paper assumes a specific class of interpretable control functions (polynomial functions) for analysis purposes. We show that such an interpretable policy function can be as effective as a deep neural network for approximating an optimized signal actuation policy. We present empirical evidence that supports the use of value-based reinforcement learning for on-line training of the control function. Specifically, we present and study three variants of the Deep Q-learning algorithm that allow the training of an interpretable policy function. Our Deep Regulatable Hardmax Q-learning variant is shown to be particularly effective in optimizing our interpretable actuation policy, resulting in up to 19.4% reduced vehicle delay compared to commonly deployed actuated signal controllers.
arXiv:1912.11023v2 fatcat:5yrsicnk6fhk5oxdzj33sqmouu
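The abstract above describes an interpretable polynomial control function whose output phase is chosen by a hardmax over Q-values; the listing does not give the exact functional form, so the following Python sketch is an illustrative assumption only (the names `polynomial_q`, `hardmax_policy`, and the coefficient layout are hypothetical, not the paper's API):

```python
import numpy as np

def polynomial_q(state, coeffs):
    """Q-value from an interpretable degree-2 polynomial of the state
    features: bias + linear term + quadratic term (hypothetical layout)."""
    return coeffs["b"] + coeffs["w1"] @ state + state @ coeffs["w2"] @ state

def hardmax_policy(state, per_phase_coeffs):
    """Actuate the signal phase whose polynomial Q-value is highest
    (a "hardmax" selection over one coefficient set per phase)."""
    scores = [polynomial_q(state, c) for c in per_phase_coeffs]
    return int(np.argmax(scores))
```

Because each phase's score is a small polynomial of named traffic features, a human regulator can read the coefficients directly, which is the interpretability argument the abstract makes against a deep network.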

Minimum Cost Matching for Autonomous Carsharing

Josiah P. Hanna, Michael Albert, Donna Chen, Peter Stone
2016 IFAC-PapersOnLine  
Carsharing programs provide an alternative to private vehicle ownership. Combining carsharing programs with autonomous vehicles would improve user access to vehicles, thereby removing one of the main challenges to widescale adoption of these programs. While the ability to easily move cars to meet demand would be significant for carsharing programs, if implemented incorrectly it could lead to worse system performance. In this paper, we seek to improve the performance of a fleet of shared autonomous vehicles through improved matching of vehicles to passengers requesting rides. We consider carsharing with autonomous vehicles as an assignment problem and examine four different methods for matching cars to users in a dynamic setting. We show how applying a recent algorithm (Scalable Collision-avoiding Role Assignment with Minimal-makespan, or SCRAM) for minimizing the maximal edge in a perfect matching can result in a more efficient, reliable, and fair carsharing system. Our results highlight some of the problems with greedy or decentralized approaches. Introducing a centralized system creates the possibility for users to strategically mis-report their locations and improve their expected wait time, so we provide a proof demonstrating that cancellation fees can be applied to eliminate the incentive to mis-report location.
doi:10.1016/j.ifacol.2016.07.757 fatcat:xxlr2beuunebhpmh7rq6eth3q4
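The min-max objective the abstract attributes to SCRAM (minimize the maximal edge in a perfect matching) can be contrasted with a greedy assignment in a small Python sketch. The brute-force search below is for illustration only, not the scalable algorithm the paper applies:

```python
from itertools import permutations

def minimax_matching(dist):
    """Perfect matching minimizing the maximum car-to-user distance
    (the objective SCRAM optimizes), by brute force over assignments.
    dist[c][u] is the distance from car c to user u."""
    n = len(dist)
    best, best_assign = float("inf"), None
    for perm in permutations(range(n)):
        worst = max(dist[car][user] for car, user in enumerate(perm))
        if worst < best:
            best, best_assign = worst, perm
    return best, best_assign

def greedy_matching(dist):
    """Each car (in order) grabs its nearest unassigned user; this can
    leave a later car with a very long trip."""
    taken, assign = set(), []
    for car_row in dist:
        user = min((u for u in range(len(car_row)) if u not in taken),
                   key=lambda u: car_row[u])
        taken.add(user)
        assign.append(user)
    return max(dist[c][u] for c, u in enumerate(assign)), tuple(assign)
```

For the cost matrix `[[1, 2], [1, 100]]`, greedy lets car 0 take user 0 and strands car 1 with a cost-100 trip, while the min-max matching caps the worst trip at 2; this is the kind of unreliability in greedy approaches the abstract highlights.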

Temporal Disentanglement of Representations for Improved Generalisation in Reinforcement Learning [article]

Mhairi Dunion, Trevor McInroe, Kevin Luck, Josiah Hanna, Stefano V. Albrecht
2022 arXiv   pre-print
In real-world robotics applications, Reinforcement Learning (RL) agents are often unable to generalise to environment variations that were not observed during training. This issue is intensified for image-based RL where a change in one variable, such as the background colour, can change many pixels in the image, and in turn can change all values in the agent's internal representation of the image. To learn more robust representations, we introduce TEmporal Disentanglement (TED), a self-supervised auxiliary task that leads to disentangled representations using the sequential nature of RL observations. We find empirically that RL algorithms with TED as an auxiliary task adapt more quickly to changes in environment variables with continued training compared to state-of-the-art representation learning methods. Due to the disentangled structure of the representation, we also find that policies trained with TED generalise better to unseen values of variables irrelevant to the task (e.g. background colour) as well as unseen values of variables that affect the optimal policy (e.g. goal positions).
arXiv:2207.05480v1 fatcat:aiqpuygzzjgsfcsbjveqyzmdr4

Importance sampling in reinforcement learning with an estimated behavior policy

Josiah P. Hanna, Scott Niekum, Peter Stone
2021 Machine Learning  
In reinforcement learning, importance sampling is a widely used method for evaluating an expectation under the distribution of data of one policy when the data has in fact been generated by a different policy. Importance sampling requires computing the likelihood ratio between the action probabilities of a target policy and those of the data-producing behavior policy. In this article, we study importance sampling where the behavior policy action probabilities are replaced by their maximum likelihood estimate under the observed data. We show this general technique reduces variance due to sampling error in Monte Carlo style estimators. We introduce two novel estimators that use this technique to estimate expected values that arise in the RL literature. We find that these general estimators reduce the variance of Monte Carlo sampling methods, leading to faster learning for policy gradient algorithms and more accurate off-policy policy evaluation. We also provide theoretical analysis showing that our new estimators are consistent and have asymptotically lower variance than Monte Carlo estimators.
doi:10.1007/s10994-020-05938-9 fatcat:djw7yjw5gzec3ggb22xropu6du
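As a hedged illustration of the core idea (not the paper's full estimators), the following Python sketch replaces the true behavior-policy action probabilities with their maximum-likelihood estimate in a one-step, state-independent (bandit-style) setting; the function names are hypothetical:

```python
from collections import Counter

def mle_behavior_probs(actions):
    """Maximum-likelihood estimate of a state-independent discrete
    behavior policy: the empirical action frequencies."""
    counts = Counter(actions)
    n = len(actions)
    return {a: c / n for a, c in counts.items()}

def importance_sampling_estimate(actions, returns, target_probs, behavior_probs):
    """Ordinary importance sampling for a one-step problem: reweight each
    observed return by the target/behavior probability ratio of its action."""
    weights = [target_probs[a] / behavior_probs[a] for a in actions]
    return sum(w * g for w, g in zip(weights, returns)) / len(actions)
```

With the true behavior probabilities, random over- or under-sampling of an action leaves residual error in the weighted average; weighting by the empirical (MLE) frequencies instead corrects exactly for the realized sampling error, which is the variance reduction the abstract describes.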

Approximation of Lorenz-Optimal Solutions in Multiobjective Markov Decision Processes [article]

Patrice Perny, Paul Weng, Judy Goldsmith, Josiah Hanna
2013 arXiv   pre-print
This paper is devoted to fair optimization in Multiobjective Markov Decision Processes (MOMDPs). A MOMDP is an extension of the MDP model for planning under uncertainty while trying to optimize several reward functions simultaneously. This applies to multiagent problems when rewards define individual utility functions, or in multicriteria problems when rewards refer to different features. In this setting, we study the determination of policies leading to Lorenz-non-dominated tradeoffs. Lorenz dominance is a refinement of Pareto dominance that was introduced in Social Choice for the measurement of inequalities. In this paper, we introduce methods to efficiently approximate the sets of Lorenz-non-dominated solutions of infinite-horizon, discounted MOMDPs. The approximations are polynomial-sized subsets of those solutions.
arXiv:1309.6856v1 fatcat:fqxtdqnspjbsll3b22mt5ajahe
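The Lorenz dominance relation used above can be checked directly: sort each reward vector in ascending order, take cumulative sums (the generalized Lorenz vector), and compare componentwise. A minimal Python sketch with hypothetical function names:

```python
def lorenz_vector(v):
    """Generalized Lorenz vector: cumulative sums of the components
    sorted in ascending order."""
    out, total = [], 0.0
    for x in sorted(v):
        total += x
        out.append(total)
    return out

def lorenz_dominates(x, y):
    """x Lorenz-dominates y iff L(x) Pareto-dominates L(y):
    componentwise >= with at least one strict inequality."""
    lx, ly = lorenz_vector(x), lorenz_vector(y)
    return all(a >= b for a, b in zip(lx, ly)) and any(a > b for a, b in zip(lx, ly))
```

For example, the reward vectors (3, 3) and (2, 4) are Pareto-incomparable, but (3, 3) Lorenz-dominates (2, 4): equal totals with a more equal split, which is how Lorenz dominance refines Pareto dominance toward fairness.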

Grounded action transformation for sim-to-real reinforcement learning

Josiah P. Hanna, Siddharth Desai, Haresh Karnan, Garrett Warnell, Peter Stone
2021 Machine Learning  
Reinforcement learning in simulation is a promising alternative to the prohibitive sample cost of reinforcement learning in the physical world. Unfortunately, policies learned in simulation often perform worse than hand-coded policies when applied on the target, physical system. Grounded simulation learning (gsl) is a general framework that promises to address this issue by altering the simulator to better match the real world (Farchy et al. 2013, in Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems (AAMAS)). This article introduces a new algorithm for gsl, Grounded Action Transformation (GAT), and applies it to learning control policies for a humanoid robot. We evaluate our algorithm in controlled experiments where we show it to allow policies learned in simulation to transfer to the real world. We then apply our algorithm to learning a fast bipedal walk on a humanoid robot and demonstrate a 43.27% improvement in forward walk velocity compared to a state-of-the-art hand-coded walk. This striking empirical success notwithstanding, further empirical analysis shows that gat may struggle when the real world has stochastic state transitions. To address this limitation we generalize gat to the stochastic gat (sgat) algorithm and empirically show that sgat leads to successful real world transfer in situations where gat may fail to find a good policy. Our results contribute to a deeper understanding of grounded simulation learning and demonstrate its effectiveness for applying reinforcement learning to learn robot control policies entirely in simulation.
doi:10.1007/s10994-021-05982-z fatcat:cednowwd7vbl7hjdszowmeccse

Multi-agent Databases via Independent Learning [article]

Chi Zhang, Olga Papaemmanouil, Josiah P. Hanna, Aditya Akella
2022 arXiv   pre-print
Machine learning is rapidly being used in database research to improve the effectiveness of numerous tasks including but not limited to query optimization, workload scheduling, and physical design. Currently, the research focus has been on replacing a single database component responsible for one task by its learning-based counterpart. However, query performance is not simply determined by the performance of a single component, but by the cooperation of multiple ones. As such, learning-based database components need to collaborate during both training and execution in order to develop policies that meet end performance goals. Thus, the paper attempts to address the question "Is it possible to design a database consisting of various learned components that cooperatively work to improve end-to-end query latency?". To answer this question, we introduce MADB (Multi-Agent DB), a proof-of-concept system that incorporates a learned query scheduler and a learned query optimizer. MADB leverages a cooperative multi-agent reinforcement learning approach that allows the two components to exchange the context of their decisions with each other and collaboratively work towards reducing the query latency. Preliminary results demonstrate that MADB can outperform the non-cooperative integration of learned components.
arXiv:2205.14323v3 fatcat:qwnd3acnabfozb3ttcmlgtbh5i

Interpretable Goal Recognition in the Presence of Occluded Factors for Autonomous Vehicles [article]

Josiah P. Hanna, Arrasy Rahman, Elliot Fosong, Francisco Eiras, Mihai Dobre, John Redford, Subramanian Ramamoorthy, Stefano V. Albrecht
2021 arXiv   pre-print
Recognising the goals or intentions of observed vehicles is a key step towards predicting the long-term future behaviour of other agents in an autonomous driving scenario. When there are unseen obstacles or occluded vehicles in a scenario, goal recognition may be confounded by the effects of these unseen entities on the behaviour of observed vehicles. Existing prediction algorithms that assume rational behaviour with respect to inferred goals may fail to make accurate long-horizon predictions because they ignore the possibility that the behaviour is influenced by such unseen entities. We introduce the Goal and Occluded Factor Inference (GOFI) algorithm which bases inference on inverse-planning to jointly infer a probabilistic belief over goals and potential occluded factors. We then show how these beliefs can be integrated into Monte Carlo Tree Search (MCTS). We demonstrate that jointly inferring goals and occluded factors leads to more accurate beliefs with respect to the true world state and allows an agent to safely navigate several scenarios where other baselines take unsafe actions leading to collisions.
arXiv:2108.02530v1 fatcat:7ntrwgsvvrfnpn4wtxzden7l4i

Delta-Tolling: Adaptive Tolling for Optimizing Traffic Throughput

Guni Sharon, Josiah Hanna, Tarun Rambha, Michael Albert, Peter Stone, Stephen D. Boyles
2016 International Joint Conference on Artificial Intelligence  
In recent years, the automotive industry has been rapidly advancing toward connected vehicles with higher degrees of autonomous capabilities. This trend opens up many new possibilities for AI-based efficient traffic management. This paper investigates traffic optimization through the setting and broadcasting of dynamic and adaptive tolls under the assumption that the cars will be able to continually reoptimize their paths as tolls change. Previous work has studied tolling policies that result in optimal traffic flow and several traffic models were developed to compute such tolls. Unfortunately, applying these models in practice is infeasible due to the dynamically changing nature of typical traffic networks. Moreover, this paper shows that previously developed tolling models that were proven to yield optimal flow in theory may not be optimal in real-life simulation. Next, this paper introduces an efficient tolling scheme, denoted ∆-tolling, for setting dynamic and adaptive tolls. We evaluate the performance of ∆-tolling using a traffic micro-simulator. ∆-tolling is shown to reduce average travel time by up to 35% over using no tolls and by up to 17% when compared to the current state-of-the-art tolling scheme.
dblp:conf/ijcai/SharonHRASB16 fatcat:nzn3sjkmffcwta7anstn2anr7m
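The listing does not spell out how ∆-tolling computes its tolls; a commonly described core idea is to toll each link in proportion to the gap between its observed and free-flow travel times. The Python sketch below is a hedged illustration under that assumption, with made-up parameter values rather than the paper's tuned settings:

```python
def delta_toll_update(toll, observed_time, free_flow_time, beta=4.0, smoothing=0.1):
    """Adaptive toll update for one link: move the toll toward
    beta * (observed - free-flow travel time). The beta and smoothing
    values here are illustrative assumptions, not the paper's parameters."""
    delta = max(0.0, observed_time - free_flow_time)
    target = beta * delta
    # exponential smoothing keeps the broadcast tolls from oscillating
    return (1 - smoothing) * toll + smoothing * target
```

An uncongested link (observed time equal to free-flow time) keeps a zero toll, while a congested link sees its toll rise until drivers reroute, which is the adaptive behaviour the abstract evaluates in micro-simulation.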

Decoupled Reinforcement Learning to Stabilise Intrinsically-Motivated Exploration [article]

Lukas Schäfer, Filippos Christianos, Josiah P. Hanna, Stefano V. Albrecht
2022 arXiv   pre-print
Intrinsic rewards can improve exploration in reinforcement learning, but the exploration process may suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we introduce Decoupled RL (DeRL) as a general framework which trains separate policies for intrinsically-motivated exploration and exploitation. Such decoupling allows DeRL to leverage the benefits of intrinsic rewards for exploration while demonstrating improved robustness and sample efficiency. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. Our results show that DeRL is more robust to varying scale and rate of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically-motivated baselines in fewer interactions. Lastly, we discuss the challenge of distribution shift and show that divergence constraint regularisers can successfully minimise instability caused by divergence of exploration and exploitation policies.
arXiv:2107.08966v3 fatcat:2uqxi6z52nfy3anjmhp6ek7gw4

ReVar: Strengthening Policy Evaluation via Reduced Variance Sampling [article]

Subhojyoti Mukherjee, Josiah P. Hanna, Robert Nowak
2022 arXiv   pre-print
Brahma Pavse, Ishan Durugkar, Josiah Hanna, and Peter Stone. Reducing sampling error in batch temporal difference learning.  ...  Josiah P Hanna, Philip S Thomas, Peter Stone, and Scott Niekum.  ...  Adaptive importance sampling was used by Hanna et al. [2017] to lower the variance of policy evaluation in MDPs.  ...
arXiv:2203.04510v3 fatcat:b4y7wckoengrlpxnhw7ydwxec4