### The Simplex and Policy-Iteration Methods Are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate

Yinyu Ye
2011 Mathematics of Operations Research
We prove that the classic policy-iteration method (Howard 1960), including the Simplex method (Dantzig 1947) with the most-negative-reduced-cost pivoting rule, is a strongly polynomial-time algorithm for solving the Markov decision problem (MDP) with a fixed discount rate. Furthermore, the computational complexity of the policy-iteration method (including the Simplex method) is superior to that of the only known strongly polynomial-time interior-point algorithm ([28] 2005) for solving this problem. The result is surprising for three reasons: the Simplex method with the same pivoting rule was shown to be exponential for solving a general linear programming (LP) problem; the Simplex (or simple policy-iteration) method with the smallest-index pivoting rule was shown to be exponential for solving an MDP regardless of discount rates; and the policy-iteration method was recently shown to be exponential for solving an undiscounted MDP. We also extend the result to solving MDPs with sub-stochastic and transient state transition probability matrices.

An MDP models decision-making in situations where outcomes are partly random and partly under the control of a decision maker. The MDP is one of the most fundamental models for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. It is now used in a variety of areas, including management, economics, bioinformatics, electronic commerce, social networking, and supply chains.

More precisely, an MDP is a discrete-time stochastic control process. At each time step, the process is in some state $i$, and the decision maker may choose any action, say action $j$, that is available in state $i$. The process responds at the next time step by randomly moving into a new state $i'$ and giving the decision maker a corresponding immediate cost $c_j(i, i')$.

Let $m$ denote the total number of states. The probability that the process enters $i'$ as its new state is influenced by the chosen state-action $j$. Specifically, it is given by a state transition probability distribution

$$p_j(i, i') \ge 0, \quad i' = 1, \dots, m, \qquad \text{and} \qquad \sum_{i'=1}^{m} p_j(i, i') = 1.$$

Thus, the next state $i'$ depends on the current state $i$ and the decision maker's chosen state-action $j$, but is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP possess the Markov property.
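The state, action, cost, and transition structure described above can be made concrete with a small sketch. The two-state example below is purely illustrative and is not taken from the paper: the arrays `P` and `C`, their numerical values, and the helper names `is_valid_mdp` and `step` are all assumptions introduced here. It checks the two conditions on the transition distribution (nonnegative entries that sum to 1 over the next state) and samples a single transition, which depends only on the current state and chosen action.

```python
import random

# Hypothetical toy MDP with m = 2 states; every number is illustrative only.
# P[i][j][i2] plays the role of p_j(i, i'): the probability of moving from
# state i to state i2 when action j is chosen in state i.
# C[i][j][i2] plays the role of the immediate cost c_j(i, i').
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # state 0 has two available actions
    [[0.5, 0.5]],              # state 1 has one available action
]
C = [
    [[1.0, 4.0], [2.0, 0.5]],
    [[0.0, 3.0]],
]


def is_valid_mdp(P, tol=1e-9):
    """Check that every p_j(i, .) is nonnegative and sums to 1 over m states."""
    m = len(P)
    for actions in P:
        for row in actions:
            if len(row) != m or any(p < 0 for p in row):
                return False
            if abs(sum(row) - 1.0) > tol:
                return False
    return True


def step(P, C, i, j, rng):
    """Sample the next state i' from p_j(i, .) and return (i', c_j(i, i')).

    Only the current state i and action j are used, which is the Markov property.
    """
    next_state = rng.choices(range(len(P)), weights=P[i][j])[0]
    return next_state, C[i][j][next_state]


rng = random.Random(0)  # seeded so that repeated runs give the same sample
next_state, cost = step(P, C, 0, 1, rng)
```

Nested lists are used rather than a fixed-size array because the number of available actions may differ from state to state.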
The key decision of MDPs is to find a (stationary) policy for the decision maker: a set function $\pi = \{\pi_1, \pi_2, \dots, \pi_m\}$ that specifies the action $\pi_i$ that the decision maker will choose when in state $i$, for $i = 1, \dots, m$. The goal of the problem is to find a (stationary) policy $\pi$ that will minimize some cumulative function of the random costs, typically the expected discounted sum over an infinite horizon:

$$\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, c_{\pi_{i_t}}(i_t, i_{t+1})\right],$$

where $c_{\pi_{i_t}}(i_t, i_{t+1})$ represents the cost, at time $t$, incurred to an individual who is in state $i_t$ and takes action $\pi_{i_t}$. Here $\gamma$ is the discount rate, where $\gamma \ge 0$ and is assumed to be strictly less than 1 in this paper. This MDP problem is called the infinite-horizon discounted Markov decision problem (DMDP), which serves as the core model for MDPs. Because of the Markov property, there is an optimal stationary policy, or policy for short, for the DMDP so that it can indeed be written as a function of $i$ only; that is, $\pi$ is independent of time $t$ as described above. Let $k_i$ be the number of state-actions available in state $i$, $i = 1, \dots, m$, and let