Markov Decision Process
Lecture Notes Intro to MDPs and Exact Solution Methods
We have the initial state $s_{t}$. The agent gets to choose an action at time $t$, action $a_t$ as a consequence the environment changes.
The environment is the world around the agent. Environment is some states after the action, the state will change to $s_{t+1}$ and a reward might be emitted associated with what just happened and then this process repeats.
MDP Definition
 Set of state $S$: scalar, vector
 Set of action $A$: what agent can do
 Transition function $P(s\primes, a)$: a distribution over next state $s\prime$ given current state and action
 Reward function $R(s, a, s\prime)$: reward from current state $s$ and action $a$ to next state $s\prime$
 Start state $s_{0}$
 Discount factor $\gamma$: how much you care about the future compared to the current time. Without the discount factor, it might be possible to collect infinite
 Horizon $H$: how long you are going to acting, sometimes it's infinity
Outline
 Optimal Control: given an MDP $(S, A, P, R, \gamma, H)$ to find the optimal policy $\pi^*$
 Exact Methods:
 Value Iteration
 Policy Iteration
 Limitations:
 Iteration over / storage for all states and actions: requires small, discrete stateaction space
 Update equations require access to dynamics model
An example:
Your goal is to maximize expected sum of rewards under a policy $\pi$. $\pi$ is a prescription for each state what action you want to take. The goal can be written as:
$\max_{\pi}E[\sum ^{H}_{t=0} \gamma^t R (S_t, A_t, S _{t+1}  \pi)] $
Deterministic Policy: a policy that is a deterministic mapping from state to action
Stochastic Policy: a policy that will map from state to distribution over actions
Optimal Value Function $V^*$
$$ V^*(s) = \max_\pi \mathbb{E} [\sum ^{H}_{t=0} \gamma ^t R (s_t, a_t, s _{t+1}  \pi, s_0=s)] $$
It means the sum of discounted rewards when starting from state $s$ and acting optimally.
Value Iteration
 $ V ^\ast_0 (s) $ is the optimal value for state s when $H = 0$
 $V ^\ast_0 (s) = 0 ,\forall s$
 $V ^*_1 (s)$ is the optimal value for state s when $H = 1$,
 $V ^\ast_1 (s) = \max _a \sum _{s\prime} P(s\prime  s, a)(R(s, a, s\prime) + \gamma V ^\ast_0 (s\prime))$
 $P(s\prime  s, a)$ state distribution over current state and action
 $P(s\prime  s, a)(R(s, a, s\prime)$ immediate transition
 $\gamma V ^\ast_0 (s\prime)$, the discount factor times the value of the next state
 $V ^\ast_1 (s) = \max _a \sum _{s\prime} P(s\prime  s, a)(R(s, a, s\prime) + \gamma V ^\ast_0 (s\prime))$
 $V ^*_t (s)$ is the optimal value for state s when $H = t$,
 $V ^\ast_t (s) = \max _a \sum _{s\prime} P(s\prime  s, a)(R(s, a, s\prime) + \gamma V ^\ast _{t1} (s\prime))$
Once you have enough time to essentially be guaranteed to get into the positive reward having more time is not going to help you and so the values will stop changing.
Value Update Algorithm / Bellman update
Start with $V ^\ast_0 (s) = 0 ,\forall s$.
for $k = 1, ..., H$:
For all states $s$ in $S$:
$$V ^\ast_k (s) \leftarrow \max _a \sum _{s\prime} P(s\prime  s, a)(R(s, a, s\prime) + \gamma V ^\ast _{k1} (s\prime))$$
$$\pi ^\ast_k (s) \leftarrow arg \max _a \sum _{s\prime} P(s\prime  s, a)(R(s, a, s\prime) + \gamma V ^\ast _{k1} (s\prime))$$
Value Iteration Convergence
Theorem: Value iteration converges. At convergence, we have found the optimal value function $V \ast$ for the discounted infinite horizon problem, which satisfies the Bellman equations:
$$\forall S \in S: V ^\ast_k (s) = \max _a \sum _{s\prime} P(s\prime  s, a)[R(s, a, s\prime) + \gamma V ^\ast _{k1} (s\prime)]$$
$$\forall S \in S: \pi ^\ast_k (s) = arg \max _a \sum _{s\prime} P(s\prime  s, a)[R(s, a, s\prime) + \gamma V ^\ast _{k1} (s\prime)]$$
Convergence and Contractions
 Maxnorm: $U = max_sU(s)$
 Theorem: For any two approximations U and V, $U _{i+1}  V _{i+1} \leqq \gamma U _i  V _i $. i.e., any distinct approximation must get closer to each other, and value iteration converges to a unique, stable, optimal solution.
 Theorem: $V _{i+1}  V _i < \epsilon \rightarrow  V _{i+1}  V^\ast  < 2 \epsilon \gamma / (1\gamma)$. i.e., once the change in our approximation is small, it must also be close to correct.
QValues
A QValue was a lot like a V value but it takes in a state and an action.
$Q^\ast (s, a) =$ expected utility starting in $s$, taking action $a$, and acting optimally. $a$ could be not optimal at all.

Bellman Equation:
$$Q \ast (s, a) = \sum _{s\prime} P(s\prime  s, a) (R(s, a, s\prime) + \gamma max _{a\prime} Q ^\ast(s\prime, a\prime)$$ 
QValue Iteration:
$$Q ^\ast_{k+1} (s, a) \leftarrow \sum _{s\prime} P(s\prime  s, a) (R(s, a, s\prime) + \gamma \max _{a\prime} Q^\ast _{k1} (s\prime, a\prime)) $$
There are 4 values per state, because there are 4 actions except the exit action.
Policy Iteration

Policy evaluation for a given $\pi(s)$:
$$V ^\pi_k (s) \leftarrow \sum _{s\prime} P(s\prime  s, \pi(s)) (R(s, \pi(s), s\prime) + \gamma V ^\pi _{k1} (s))$$ 
At convergence:
$$\forall s, V ^\pi (s) \leftarrow \sum _{s\prime} P(s\prime  s, \pi(s)) (R(s, \pi(s), s\prime) + \gamma V ^\pi(s)$$ 
Policy evaluation for stochastic policy $\pi(as)$:
$$V ^\pi_k (s) \leftarrow \sum _{s\prime}\sum _{a} \pi(as) P(s\prime  s, \pi(s)) (R(s, \pi(s), s\prime) + \gamma V ^\pi _{k1} (s))$$
This together is a probability of an action and next state combination to happen times reward you get for the transition plus $\gamma$ times the value with one less time step to go.
Policy improvement: find the best action according to onestep lookahead.
$$\pi ^\ast_k (s) \leftarrow arg \max _a \sum _{s\prime} P(s\prime  s, a)(R(s, a, s\prime) + \gamma V ^\ast _{k1} (s\prime))$$
Theorem: Policy iteration is guaranteed to converge and at convergence, the current policy and its value function are the optimal policy and the optimal value function.