# Sampling-Based Approximation

Lecture Note: Sample-based Approximations and Fitted Learning

## Recap Q-Value

Q-Values Bellman Equation:

$$Q^\ast(s, a) = \sum_{s'} P(s' \mid s, a) \left( R(s, a, s') + \gamma \max_{a'} Q^\ast(s', a') \right)$$

Q-value iteration:

$$Q_{k+1}(s, a) \leftarrow \sum_{s'} P(s' \mid s, a) \left( R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right)$$

Rewrite as expectation:

$$Q_{k+1}(s, a) \leftarrow \mathbb{E}_{s' \sim P(s' \mid s, a)} \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]$$

## Tabular Q-Learning

- For a state-action pair $(s, a)$, receive a sample next state $s' \sim P(s' \mid s, a)$
- Old estimate: $Q_k(s, a)$
- New sample estimate: $target(s') = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')$
  - $R(s, a, s')$ is the immediate reward
  - $\gamma \max_{a'} Q_k(s', a')$ is the discounted future value according to our current estimate

- Incorporate the new estimate into a running average: $Q_{k+1}(s, a) \leftarrow (1 - \alpha) Q_k(s, a) + \alpha \, target(s')$

### Algorithm

- Start with $Q_0(s, a)$ for all $s, a$
- Get initial state $s$
- For $k = 1, 2, \dots$ till convergence:
  - Sample action $a$, get next state $s'$
  - If $s'$ is terminal:
    - $target = R(s, a, s')$
  - Else:
    - $target = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')$
  - $Q_{k+1}(s, a) \leftarrow (1 - \alpha) Q_k(s, a) + \alpha \, target$
  - $s \leftarrow s'$

### How to sample actions

- Choose actions uniformly at random
- Choose the action that maximizes $Q_k(s, a)$ (greedy)
- $\epsilon$-greedy: choose a random action with probability $\epsilon$, otherwise act greedily
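Putting the tabular update and $\epsilon$-greedy exploration together, here is a minimal sketch. The environment interface is an assumption: a `step(s, a)` function returning `(next_state, reward, terminal)`, with every episode starting in state `0`.

```python
import random
from collections import defaultdict

def q_learning(step, n_actions, episodes=500, alpha=0.5, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration.

    `step(s, a)` is assumed to return (next_state, reward, terminal);
    episodes are assumed to start in state 0.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0

    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # epsilon-greedy: explore with probability epsilon, else act greedily
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a2: Q[(s, a2)])
            s_next, r, done = step(s, a)
            # sample-based target: no max over a' at a terminal state
            if done:
                target = r
            else:
                target = r + gamma * max(Q[(s_next, a2)]
                                         for a2 in range(n_actions))
            # running-average update
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```

On a made-up 3-state chain where action 1 moves right and state 2 is terminal with reward +1, the learned $Q$ correctly prefers moving toward the goal.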

### Q-Learning Properties

Off-policy learning: Q-learning converges to the optimal policy even if you are acting sub-optimally (e.g. following an exploratory policy).

Conditions for convergence:

- You have to explore enough: all states and actions are visited infinitely often.
- The learning rate must eventually become small enough, but must not decrease too quickly.

#### GridWorld Demo

- States: 11 cells
- Actions: {up, down, left, right}
- Deterministic transition function
- Learning rate: 0.5
- Discount: 1
- Reward: +1 for getting diamond, -1 for falling into trap

#### Crawler Demo

- States: discretized value of 2D state: (arm angle, hand angle)
- Action: Cartesian product of {arm up, arm down} and {hand up, hand down}
- Reward: speed in the forward direction

## Value Iteration w/ Samples

- Value Iteration: $V^\ast_{i+1}(s) \leftarrow \max_a \mathbb{E}_{s' \sim P(s' \mid s, a)} \left[ R(s, a, s') + \gamma V^\ast_i(s') \right]$
- But it is unclear how to draw samples through the max over actions

## Recap: Policy Iteration

- Policy evaluation for current policy $\pi _k$
- Iterate until convergence

$$ V^{\pi_k}_{i+1}(s) \leftarrow \mathbb{E}_{s' \sim P(s' \mid s, \pi_k(s))} \left[ R(s, \pi_k(s), s') + \gamma V^{\pi_k}_i(s') \right] $$

- This update can be approximated by samples; the sample-based version is called Temporal Difference (TD) Learning.

- Then iterate until convergence: policy improvement finds the best action according to a one-step look-ahead

$$ \pi_{k+1}(s) \leftarrow \arg\max_a \mathbb{E}_{s' \sim P(s' \mid s, a)} \left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right] $$

- Unclear what to do with the max for now
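The sample-based policy evaluation step (TD learning) can be sketched as follows. The environment interface is an assumption: `step(s, a)` returns `(next_state, reward, terminal)`, and episodes start in state `0`.

```python
def td0_evaluate(step, policy, n_states, episodes=1000, alpha=0.1, gamma=0.9):
    """TD(0) policy evaluation for a fixed policy.

    Each transition gives a sample target r + gamma * V(s'),
    folded into a running average:
    V(s) <- (1 - alpha) V(s) + alpha [r + gamma V(s')].
    """
    V = [0.0] * n_states
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            s_next, r, done = step(s, policy(s))
            # no bootstrap term at a terminal state
            target = r if done else r + gamma * V[s_next]
            V[s] = (1 - alpha) * V[s] + alpha * target
            s = s_next
    return V
```

On a made-up 3-state chain with a fixed move-right policy and terminal reward +1, $V$ converges toward $(\gamma, 1)$ for the two non-terminal states, matching the Bellman equation for that policy.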

## Generalizing Across States

- Learn about some small number of training states from experience
- Generalize that experience to new, similar situations

## Approximate Q-Learning

- Instead of a table, we have a parametrized Q-function: $Q_\theta(s, a)$
  - It can be a linear function of features
  - Or a complicated neural network

- Learning rule:
  - Remember: $target(s') = R(s, a, s') + \gamma \max_{a'} Q_{\theta_k}(s', a')$
  - Update: $\theta_{k+1} \leftarrow \theta_k - \alpha \left. \nabla_\theta \left[ \tfrac{1}{2} \left( Q_\theta(s, a) - target(s') \right)^2 \right] \right|_{\theta = \theta_k}$
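A sketch of this gradient update for the linear case $Q_\theta(s, a) = \theta^\top \phi(s, a)$, where the feature map $\phi$ and environment arguments are made-up for illustration. For a linear $Q$, the gradient of $\tfrac{1}{2}(Q_\theta(s,a) - target)^2$ is simply $(Q_\theta(s,a) - target)\,\phi(s,a)$:

```python
import numpy as np

def q_update(theta, phi, s, a, r, s_next, actions,
             alpha=0.1, gamma=0.9, terminal=False):
    """One approximate Q-learning step for linear Q_theta(s, a) = theta . phi(s, a).

    The target is treated as a constant (no gradient flows through it),
    so the update is: theta <- theta - alpha * (Q_theta - target) * phi(s, a).
    """
    if terminal:
        target = r
    else:
        target = r + gamma * max(theta @ phi(s_next, a2) for a2 in actions)
    td_error = theta @ phi(s, a) - target
    return theta - alpha * td_error * phi(s, a)
```

With one-hot features, this reduces exactly to the tabular running-average update, which is a useful sanity check.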