# Sampling-Based Approximation

Lecture Note: Sample-based Approximations and Fitted Learning

## Recap: Q-Values

Bellman equation for Q-values:

$$Q^\ast(s, a) = \sum_{s'} P(s' \mid s, a) \left( R(s, a, s') + \gamma \max_{a'} Q^\ast(s', a') \right)$$

Q-value iteration:

$$Q_{k+1}(s, a) \leftarrow \sum_{s'} P(s' \mid s, a) \left( R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right)$$

Rewrite as expectation:

$$Q_{k+1}(s, a) \leftarrow \mathbb{E}_{s' \sim P(s' \mid s, a)} \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]$$
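
For contrast with the sample-based methods below, here is a minimal sketch of exact Q-value iteration when the model is known; the array layout (`P[s, a, s2]`, `R[s, a, s2]`) is an illustrative assumption, not part of the note:

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, num_iters=100):
    """Exact Q-value iteration with a known model.

    P[s, a, s2] = P(s2 | s, a), R[s, a, s2] = R(s, a, s2).
    """
    num_states, num_actions, _ = P.shape
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        V = Q.max(axis=1)                                  # max_{a'} Q_k(s', a')
        # Q_{k+1}(s, a) = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
    return Q
```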

## Tabular Q-Learning

- For a state-action pair $(s, a)$, receive a sample $s' \sim P(s' \mid s, a)$
- Old estimate: $Q_k(s, a)$
- New sample estimate: $\mathrm{target}(s') = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')$
  - $R(s, a, s')$ is the instant reward
  - $\gamma \max_{a'} Q_k(s', a')$ is the discounted future value according to our current estimate
- Incorporate the new estimate into a running average: $Q_{k+1}(s, a) \leftarrow (1 - \alpha)\, Q_k(s, a) + \alpha \, \mathrm{target}(s')$
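
As an illustration with made-up numbers: if $\alpha = 0.5$, the old estimate is $Q_k(s, a) = 2$, and the sampled target is $\mathrm{target}(s') = 4$, the update gives

$$Q_{k+1}(s, a) = (1 - 0.5) \cdot 2 + 0.5 \cdot 4 = 3,$$

i.e. the estimate moves part of the way toward the new sample.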

### Algorithm

- Start with $Q_0(s, a)$ for all $s, a$
- Get initial state $s$
- For $k = 1, 2, \dots$ until convergence:
  - Sample action $a$, get next state $s'$
  - If $s'$ is terminal:
    - $\mathrm{target} = R(s, a, s')$
  - Else:
    - $\mathrm{target} = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')$
  - $Q_{k+1}(s, a) \leftarrow (1 - \alpha)\, Q_k(s, a) + \alpha \, \mathrm{target}$
  - $s \leftarrow s'$
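
A minimal sketch of this loop in code, assuming a small environment interface (`env.reset()` returning a state index and `env.step(a)` returning `(s', r, done)`); these names are illustrative assumptions, not from the note:

```python
import numpy as np

def tabular_q_learning(env, num_states, num_actions,
                       alpha=0.5, gamma=1.0, epsilon=0.1, num_episodes=1000):
    Q = np.zeros((num_states, num_actions))       # Q_0(s, a) = 0 for all s, a
    for _ in range(num_episodes):
        s = env.reset()                           # get initial state s
        done = False
        while not done:
            # Sample an action (epsilon-greedy; see the next subsection).
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)         # s' ~ P(s' | s, a) and R(s, a, s')
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next
    return Q
```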

### How to sample actions

- Choose random actions
- Choose the action that maximizes $Q_k(s, a)$ (act greedily)
- $\epsilon$-greedy: choose a random action with probability $\epsilon$, otherwise choose the action greedily
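
A minimal sketch of the $\epsilon$-greedy rule over a tabular $Q$, under the same array-layout assumption as the sketch above:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon):
    """With probability epsilon explore (random action), otherwise exploit (argmax_a Q[s, a])."""
    num_actions = Q.shape[1]
    if np.random.rand() < epsilon:
        return np.random.randint(num_actions)
    return int(np.argmax(Q[s]))
```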

### Q-Learning Properties

Off-policy learning: Q-learning converges to the optimal policy even if you are acting sub-optimally.

Caveats: you have to explore enough, and you eventually have to make the learning rate small enough, but not decrease it too quickly.

In the limit, all states and actions must be visited infinitely often.
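
The learning-rate requirement is commonly formalized (this is the standard condition, not spelled out in the note) as

$$\sum_{t} \alpha_t = \infty, \qquad \sum_{t} \alpha_t^2 < \infty,$$

which is satisfied, for example, by $\alpha_t = 1/t$.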

#### GridWorld Demo

- States: 11 cells
- Actions: {up, down, left, right}
- Deterministic transition function
- Learning rate: 0.5
- Discount: 1
- Reward: +1 for getting the diamond, -1 for falling into the trap

#### Crawler Demo

- States: discretized value of 2D state: (arm angle, hand angle)
- Action: Cartesian product of {arm up, arm down} and {hand up, hand down}
- Reward: speed in the forward direction

## Value Iteration w/ Samples

- Value iteration: $V^\ast_{i+1}(s) \leftarrow \max_a \mathbb{E}_{s' \sim P(s' \mid s, a)} \left[ R(s, a, s') + \gamma V^\ast_i(s') \right]$
- But it is unclear how to draw samples through the max: the expectation sits inside the max over actions, so we would need sampled next states for every action just to compare them

## Recap: Policy Iteration

- Policy evaluation for the current policy $\pi_k$: iterate until convergence

$$ V^{\pi_k}_{i+1}(s) \leftarrow \mathbb{E}_{s' \sim P(s' \mid s, \pi_k(s))} \left[ R(s, \pi_k(s), s') + \gamma V^{\pi_k}_i(s') \right] $$

  This update can be approximated by samples; doing so is called Temporal Difference (TD) learning (see the sketch at the end of this section).

- Policy improvement: find the best action according to a one-step look-ahead

$$ \pi_{k+1}(s) \leftarrow \arg\max_a \mathbb{E}_{s' \sim P(s' \mid s, a)} \left[ R(s, a, s') + \gamma V^{\pi_k}(s') \right] $$

  It is unclear for now what to do with the max.

- Iterate evaluation and improvement until the policy converges.
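
A minimal sketch of the sample-based policy-evaluation update (TD learning) mentioned above, under the same assumed environment interface as earlier; `policy` is an assumed callable mapping a state to an action:

```python
import numpy as np

def td0_policy_evaluation(env, policy, num_states,
                          alpha=0.1, gamma=1.0, num_episodes=1000):
    """Estimate V^pi by replacing the expectation over s' with sampled transitions."""
    V = np.zeros(num_states)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                                 # a = pi_k(s)
            s_next, r, done = env.step(a)                 # s' ~ P(s' | s, pi_k(s))
            target = r if done else r + gamma * V[s_next]
            V[s] = (1 - alpha) * V[s] + alpha * target    # running-average update
            s = s_next
    return V
```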

## Generalizing Across States

- Learn about some small number of training states from experience
- Generalize that experience to new, similar situations

## Approximate Q-Learning

- Instead of a table, we have a parameterized Q function: $Q_\theta(s, a)$
  - It can be a linear function of features
  - Or a complicated neural network
- Learning rule:
  - Remember: $\mathrm{target}(s') = R(s, a, s') + \gamma \max_{a'} Q_{\theta_k}(s', a')$
  - Update: $\theta_{k+1} \leftarrow \theta_k - \alpha \nabla_\theta \left[ \tfrac{1}{2} \left( Q_\theta(s, a) - \mathrm{target}(s') \right)^2 \right] \Big|_{\theta = \theta_k}$
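
A minimal sketch of one such update with a linear-in-features Q function, $Q_\theta(s, a) = \theta^\top \phi(s, a)$; the feature map `phi` and the transition tuple `(s, a, r, s_next, done)` are illustrative assumptions:

```python
import numpy as np

def approx_q_update(theta, phi, s, a, r, s_next, actions, alpha, gamma, done=False):
    """One approximate Q-learning step for Q_theta(s, a) = theta . phi(s, a)."""
    # target(s') = R(s, a, s') + gamma * max_{a'} Q_{theta_k}(s', a'), treated as a constant
    if done:
        target = r
    else:
        target = r + gamma * max(theta @ phi(s_next, a2) for a2 in actions)
    # grad_theta [ 1/2 (Q_theta(s, a) - target)^2 ] = (Q_theta(s, a) - target) * phi(s, a)
    grad = (theta @ phi(s, a) - target) * phi(s, a)
    return theta - alpha * grad
```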