Improving Policies without Measuring Merits
Neural Information Processing Systems, 1995
Performing policy iteration in dynamic programming should only require knowledge of relative rather than absolute measures of the utility of actions (Werbos, 1991), what Baird (1993) calls the advantages of actions at states. Nevertheless, most existing methods in dynamic programming (including Baird's) compute some form of absolute utility function. For smooth problems, advantages satisfy two differential consistency conditions (including the requirement that they be free of curl).
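The central observation can be illustrated with a minimal sketch (not the paper's algorithm): defining the advantage of an action as its Q-value minus the state's value, A(s,a) = Q(s,a) − V(s) with V(s) = max_a Q(s,a), greedy policy improvement depends only on these relative quantities. Shifting all of Q(s,·) by any per-state constant changes the absolute utilities but leaves the advantages, and hence the improved policy, unchanged. The Q-table below is hypothetical.

```python
def advantages(q_row):
    """Advantages A(s, .) for one state, given its action values Q(s, .)."""
    v = max(q_row)  # V(s) = max_a Q(s, a)
    return [q - v for q in q_row]

def greedy_action(q_row):
    """Greedy policy improvement: pick the argmax action for this state."""
    return max(range(len(q_row)), key=lambda a: q_row[a])

# Hypothetical 2-state, 3-action Q-table (illustration only).
Q = {
    "s0": [1.0, 2.5, 2.0],
    "s1": [0.0, -1.0, 0.5],
}

for s, q_row in Q.items():
    shifted = [q + 100.0 for q in q_row]  # per-state constant offset
    # Absolute utilities changed, but the relative quantities did not:
    assert advantages(shifted) == advantages(q_row)
    assert greedy_action(shifted) == greedy_action(q_row)
```

This invariance is why policy iteration in principle needs only relative measures of utility, even though most methods compute an absolute value function along the way.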