Markov Decision Process

Lecture Notes Intro to MDPs and Exact Solution Methods

We have the initial state $s_{t}$. The agent gets to choose an action at time $t$, action $a_t$ as a consequence the environment changes.

The environment is the world around the agent. Environment is some states after the action, the state will change to $s_{t+1}$ and a reward might be emitted associated with what just happened and then this process repeats.

Markov Decision Process

MDP Definition

  • Set of state $S$: scalar, vector
  • Set of action $A$: what agent can do
  • Transition function $P(s\prime|s, a)$: a distribution over next state $s\prime$ given current state and action
  • Reward function $R(s, a, s\prime)$: reward from current state $s$ and action $a$ to next state $s\prime$
  • Start state $s_{0}$
  • Discount factor $\gamma$: how much you care about the future compared to the current time. Without the discount factor, it might be possible to collect infinite
  • Horizon $H$: how long you are going to acting, sometimes it's infinity


  • Optimal Control: given an MDP $(S, A, P, R, \gamma, H)$ to find the optimal policy $\pi^*$
  • Exact Methods:
    • Value Iteration
    • Policy Iteration
  • Limitations:
    • Iteration over / storage for all states and actions: requires small, discrete state-action space
    • Update equations require access to dynamics model
An example:

Robot Diamond Hell

Your goal is to maximize expected sum of rewards under a policy $\pi$. $\pi$ is a prescription for each state what action you want to take. The goal can be written as:
$\max_{\pi}E[\sum ^{H}_{t=0} \gamma^t R (S_t, A_t, S _{t+1} | \pi)] $

Deterministic Policy: a policy that is a deterministic mapping from state to action
Stochastic Policy: a policy that will map from state to distribution over actions

Optimal Value Function $V^*$

$$ V^*(s) = \max_\pi \mathbb{E} [\sum ^{H}_{t=0} \gamma ^t R (s_t, a_t, s _{t+1} | \pi, s_0=s)] $$

It means the sum of discounted rewards when starting from state $s$ and acting optimally.

Value Iteration

  • $ V ^\ast_0 (s) $ is the optimal value for state s when $H = 0$
    • $V ^\ast_0 (s) = 0 ,\forall s$
  • $V ^*_1 (s)$ is the optimal value for state s when $H = 1$,
    • $V ^\ast_1 (s) = \max _a \sum _{s\prime} P(s\prime | s, a)(R(s, a, s\prime) + \gamma V ^\ast_0 (s\prime))$
      • $P(s\prime | s, a)$ state distribution over current state and action
      • $P(s\prime | s, a)(R(s, a, s\prime)$ immediate transition
      • $\gamma V ^\ast_0 (s\prime)$, the discount factor times the value of the next state
  • $V ^*_t (s)$ is the optimal value for state s when $H = t$,
    • $V ^\ast_t (s) = \max _a \sum _{s\prime} P(s\prime | s, a)(R(s, a, s\prime) + \gamma V ^\ast _{t-1} (s\prime))$

Once you have enough time to essentially be guaranteed to get into the positive reward having more time is not going to help you and so the values will stop changing.

Value Update Algorithm / Bellman update

Start with $V ^\ast_0 (s) = 0 ,\forall s$.
for $k = 1, ..., H$:
For all states $s$ in $S$:
$$V ^\ast_k (s) \leftarrow \max _a \sum _{s\prime} P(s\prime | s, a)(R(s, a, s\prime) + \gamma V ^\ast _{k-1} (s\prime))$$
$$\pi ^\ast_k (s) \leftarrow arg \max _a \sum _{s\prime} P(s\prime | s, a)(R(s, a, s\prime) + \gamma V ^\ast _{k-1} (s\prime))$$

Value Iteration Convergence

Theorem: Value iteration converges. At convergence, we have found the optimal value function $V \ast$ for the discounted infinite horizon problem, which satisfies the Bellman equations:

$$\forall S \in S: V ^\ast_k (s) = \max _a \sum _{s\prime} P(s\prime | s, a)[R(s, a, s\prime) + \gamma V ^\ast _{k-1} (s\prime)]$$

$$\forall S \in S: \pi ^\ast_k (s) = arg \max _a \sum _{s\prime} P(s\prime | s, a)[R(s, a, s\prime) + \gamma V ^\ast _{k-1} (s\prime)]$$

Convergence and Contractions
  • Max-norm: $||U|| = max_s|U(s)|$
  • Theorem: For any two approximations U and V, $||U _{i+1} - V _{i+1} \leqq \gamma ||U _i - V _i|| $. i.e., any distinct approximation must get closer to each other, and value iteration converges to a unique, stable, optimal solution.
  • Theorem: $||V _{i+1} - V _i|| < \epsilon \rightarrow || V _{i+1} - V^\ast || < 2 \epsilon \gamma / (1-\gamma)$. i.e., once the change in our approximation is small, it must also be close to correct.


A Q-Value was a lot like a V value but it takes in a state and an action.

$Q^\ast (s, a) =$ expected utility starting in $s$, taking action $a$, and acting optimally. $a$ could be not optimal at all.

  • Bellman Equation:
    $$Q \ast (s, a) = \sum _{s\prime} P(s\prime | s, a) (R(s, a, s\prime) + \gamma max _{a\prime} Q ^\ast(s\prime, a\prime)$$

  • Q-Value Iteration:
    $$Q ^\ast_{k+1} (s, a) \leftarrow \sum _{s\prime} P(s\prime | s, a) (R(s, a, s\prime) + \gamma \max _{a\prime} Q^\ast _{k1} (s\prime, a\prime)) $$

Q-Values after 100 iterations

There are 4 values per state, because there are 4 actions except the exit action.

Policy Iteration

  • Policy evaluation for a given $\pi(s)$:
    $$V ^\pi_k (s) \leftarrow \sum _{s\prime} P(s\prime | s, \pi(s)) (R(s, \pi(s), s\prime) + \gamma V ^\pi _{k-1} (s))$$

  • At convergence:
    $$\forall s, V ^\pi (s) \leftarrow \sum _{s\prime} P(s\prime | s, \pi(s)) (R(s, \pi(s), s\prime) + \gamma V ^\pi(s)$$

  • Policy evaluation for stochastic policy $\pi(a|s)$:
    $$V ^\pi_k (s) \leftarrow \sum _{s\prime}\sum _{a} \pi(a|s) P(s\prime | s, \pi(s)) (R(s, \pi(s), s\prime) + \gamma V ^\pi _{k-1} (s))$$

This together is a probability of an action and next state combination to happen times reward you get for the transition plus $\gamma$ times the value with one less time step to go.

Policy improvement: find the best action according to one-step look-ahead.
$$\pi ^\ast_k (s) \leftarrow arg \max _a \sum _{s\prime} P(s\prime | s, a)(R(s, a, s\prime) + \gamma V ^\ast _{k-1} (s\prime))$$

Theorem: Policy iteration is guaranteed to converge and at convergence, the current policy and its value function are the optimal policy and the optimal value function.