Lecture Note: Sample-based Approximations and Fitted Learning
Q-Values Bellman Equation:
$$Q ^\ast (s, a) = \sum _{s\prime} P(s\prime | s, a) (R(s, a, s\prime) + \gamma \max _{a\prime} Q ^\ast (s\prime, a\prime))$$
Q-value iteration:
$$Q _{k+1} (s, a) \leftarrow \sum
]]>Lecture Note: Sample-based Approximations and Fitted Learning
Q-Values Bellman Equation:
$$Q ^\ast (s, a) = \sum _{s\prime} P(s\prime | s, a) (R(s, a, s\prime) + \gamma \max _{a\prime} Q ^\ast (s\prime, a\prime))$$
Q-value iteration:
$$Q _{k+1} (s, a) \leftarrow \sum _{s\prime} P(s\prime| s, a) (R(s, a, s\prime) + \gamma \max _{a\prime} Q_k(s\prime, a\prime)$$
Rewrite as expectation:
$$Q _{k+1} \leftarrow \mathbb{E} _{s\prime ~ P(s\prime | s, a)} [ R(s, a, s\prime) + \gamma \max _{a\prime} Q _k(s\prime, s\prime) ] $$
Off-Policy Learning: Q-learning converges to optimal policy, even if you are acting sub-optimally.
You have to explore enough. Eventually, you have to make the learning rate small enough, but not decrease it too quickly.
All states and actions are visited infinitely often.
Lecture Notes Intro to MDPs and Exact Solution Methods
We have the initial state $s_{t}$. The agent gets to choose an action at time $t$, action $a_t$ as a consequence the environment changes.
The environment is the world around the agent. Environment is some states
]]>Lecture Notes Intro to MDPs and Exact Solution Methods
We have the initial state $s_{t}$. The agent gets to choose an action at time $t$, action $a_t$ as a consequence the environment changes.
The environment is the world around the agent. Environment is some states after the action, the state will change to $s_{t+1}$ and a reward might be emitted associated with what just happened and then this process repeats.
Your goal is to maximize expected sum of rewards under a policy $\pi$. $\pi$ is a prescription for each state what action you want to take. The goal can be written as:
$\max_{\pi}E[\sum ^{H}_{t=0} \gamma^t R (S_t, A_t, S _{t+1} | \pi)] $
Deterministic Policy: a policy that is a deterministic mapping from state to action
Stochastic Policy: a policy that will map from state to distribution over actions
$$ V^*(s) = \max_\pi \mathbb{E} [\sum ^{H}_{t=0} \gamma ^t R (s_t, a_t, s _{t+1} | \pi, s_0=s)] $$
It means the sum of discounted rewards when starting from state $s$ and acting optimally.
Once you have enough time to essentially be guaranteed to get into the positive reward having more time is not going to help you and so the values will stop changing.
Start with $V ^\ast_0 (s) = 0 ,\forall s$.
for $k = 1, ..., H$:
For all states $s$ in $S$:
$$V ^\ast_k (s) \leftarrow \max _a \sum _{s\prime} P(s\prime | s, a)(R(s, a, s\prime) + \gamma V ^\ast _{k-1} (s\prime))$$
$$\pi ^\ast_k (s) \leftarrow arg \max _a \sum _{s\prime} P(s\prime | s, a)(R(s, a, s\prime) + \gamma V ^\ast _{k-1} (s\prime))$$
Theorem: Value iteration converges. At convergence, we have found the optimal value function $V \ast$ for the discounted infinite horizon problem, which satisfies the Bellman equations:
$$\forall S \in S: V ^\ast_k (s) = \max _a \sum _{s\prime} P(s\prime | s, a)[R(s, a, s\prime) + \gamma V ^\ast _{k-1} (s\prime)]$$
$$\forall S \in S: \pi ^\ast_k (s) = arg \max _a \sum _{s\prime} P(s\prime | s, a)[R(s, a, s\prime) + \gamma V ^\ast _{k-1} (s\prime)]$$
A Q-Value was a lot like a V value but it takes in a state and an action.
$Q^\ast (s, a) =$ expected utility starting in $s$, taking action $a$, and acting optimally. $a$ could be not optimal at all.
Bellman Equation:
$$Q \ast (s, a) = \sum _{s\prime} P(s\prime | s, a) (R(s, a, s\prime) + \gamma max _{a\prime} Q ^\ast(s\prime, a\prime)$$
Q-Value Iteration:
$$Q ^\ast_{k+1} (s, a) \leftarrow \sum _{s\prime} P(s\prime | s, a) (R(s, a, s\prime) + \gamma \max _{a\prime} Q^\ast _{k1} (s\prime, a\prime)) $$
There are 4 values per state, because there are 4 actions except the exit action.
Policy evaluation for a given $\pi(s)$:
$$V ^\pi_k (s) \leftarrow \sum _{s\prime} P(s\prime | s, \pi(s)) (R(s, \pi(s), s\prime) + \gamma V ^\pi _{k-1} (s))$$
At convergence:
$$\forall s, V ^\pi (s) \leftarrow \sum _{s\prime} P(s\prime | s, \pi(s)) (R(s, \pi(s), s\prime) + \gamma V ^\pi(s)$$
Policy evaluation for stochastic policy $\pi(a|s)$:
$$V ^\pi_k (s) \leftarrow \sum _{s\prime}\sum _{a} \pi(a|s) P(s\prime | s, \pi(s)) (R(s, \pi(s), s\prime) + \gamma V ^\pi _{k-1} (s))$$
This together is a probability of an action and next state combination to happen times reward you get for the transition plus $\gamma$ times the value with one less time step to go.
Policy improvement: find the best action according to one-step look-ahead.
$$\pi ^\ast_k (s) \leftarrow arg \max _a \sum _{s\prime} P(s\prime | s, a)(R(s, a, s\prime) + \gamma V ^\ast _{k-1} (s\prime))$$
Theorem: Policy iteration is guaranteed to converge and at convergence, the current policy and its value function are the optimal policy and the optimal value function.
]]>