# Sampling-Based Approximation

Lecture Note: Sample-based Approximations and Fitted Learning

## Recap Q-Value

Q-Values Bellman Equation:

$$Q^\ast(s, a) = \sum_{s'} P(s' \mid s, a) \left( R(s, a, s') + \gamma \max_{a'} Q^\ast(s', a') \right)$$

Q-value iteration:

$$Q_{k+1}(s, a) \leftarrow \sum_{s'} P(s' \mid s, a) \left( R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right)$$

Rewrite as expectation:

$$Q_{k+1}(s, a) \leftarrow \mathbb{E}_{s' \sim P(s' \mid s, a)} \left[ R(s, a, s') + \gamma \max_{a'} Q_k(s', a') \right]$$

## Tabular Q-Learning

• For a state-action pair $(s, a)$, receive a sample next state $s' \sim P(s' \mid s, a)$
• Old estimate: $Q_k(s, a)$
• New sample estimate: $target(s') = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')$
• $R(s, a, s')$ is the instant reward
• $\gamma \max_{a'} Q_k(s', a')$ is the discounted future value according to our current estimate
• Incorporate the new estimate into a running average: $Q_{k+1}(s, a) \leftarrow (1 - \alpha) Q_k(s, a) + \alpha \, [target(s')]$
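The running-average update above can be sketched in a few lines; the states, reward, and Q-values here are made-up numbers for illustration.

```python
# One tabular Q-learning update (illustrative values, not from a real MDP).
gamma = 0.9   # discount factor
alpha = 0.5   # learning rate

# Current estimates Q_k(s', a') for the sampled next state s'
Q_next = {"left": 1.0, "right": 2.0}

Q_old = 0.5          # old estimate Q_k(s, a)
reward = 1.0         # instant reward R(s, a, s')

# target(s') = R(s, a, s') + gamma * max_a' Q_k(s', a')
target = reward + gamma * max(Q_next.values())

# Running average: Q_{k+1}(s, a) = (1 - alpha) * Q_k(s, a) + alpha * target
Q_new = (1 - alpha) * Q_old + alpha * target   # 0.5*0.5 + 0.5*2.8 = 1.65
```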

### Algorithm

• Start with $Q_0(s, a)$ for all $s, a$
• Get initial state $s$
• For $k = 1, 2, \dots$ till convergence:
• sample action $a$, get reward $R(s, a, s')$ and next state $s'$
• if $s'$ is terminal:
• $target = R(s, a, s')$
• else:
• $target = R(s, a, s') + \gamma \max_{a'} Q_k(s', a')$
• $Q_{k+1}(s, a) \leftarrow (1 - \alpha) Q_k(s, a) + \alpha \, [target]$
• $s \leftarrow s'$
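A minimal sketch of the full loop, on a hypothetical 1-D chain environment invented for illustration (states 0..4, moving right off the end gives reward +1 and terminates). Actions are sampled uniformly at random; as noted below, Q-learning still converges because it is off-policy.

```python
import random

# Tabular Q-learning on a toy deterministic chain (made-up environment).
N_STATES = 5
ACTIONS = (-1, +1)                 # move left / move right
GAMMA, ALPHA = 0.95, 0.5

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    """Deterministic transition: clip to the chain; reward 1 at the far end."""
    s2 = min(max(s + a, 0), N_STATES - 1)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

random.seed(0)
for episode in range(500):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS)     # sample an action (pure exploration)
        s2, r, done = step(s, a)
        # target = r for terminal s', else r + gamma * max_a' Q_k(s', a')
        target = r if done else r + GAMMA * max(Q[(s2, b)] for b in ACTIONS)
        Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * target
        s = s2

# Greedy policy recovered from Q: it should move right in every state.
policy = {s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N_STATES - 1)}
```

Even though the behavior here is a random walk, the greedy policy extracted from the learned Q-table is the optimal one.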

### How to sample actions

• Choose random actions
• Choose action that maximizes $Q _k(s, a)$, greedily
• $\epsilon$-greedy: choose a random action with probability $\epsilon$, otherwise choose the action greedily
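An $\epsilon$-greedy selector can be written as a small helper; the Q-table and state/action names below are made up for illustration.

```python
import random

def epsilon_greedy(Q, state, actions, eps):
    """With probability eps pick a random action, otherwise the greedy one.
    Q is assumed to be a dict mapping (state, action) to a value estimate."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

# Toy Q-table (made-up numbers)
Q = {("s0", "left"): 0.2, ("s0", "right"): 1.1}
greedy = epsilon_greedy(Q, "s0", ["left", "right"], eps=0.0)  # always greedy
```

With `eps=0.0` this reduces to the greedy choice; with `eps=1.0` it reduces to uniform random action selection.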

### Q-Learning Properties

Off-policy learning: Q-learning converges to the optimal Q-values (and hence the optimal policy) even if you act sub-optimally while collecting samples, provided that:

• all states and actions are visited infinitely often (i.e., you explore enough), and
• the learning rate is eventually made small enough, but is not decreased too quickly.

#### GridWorld Demo

• States: 11 cells
• Actions: {up, down, left, right}
• Deterministic transition function
• Learning rate: 0.5
• Discount: 1
• Reward: +1 for getting diamond, -1 for falling into trap

#### Crawler Demo

• States: discretized value of 2D state: (arm angle, hand angle)
• Action: Cartesian product of {arm up, arm down} and {hand up, hand down}
• Reward: speed in the forward direction

## Value Iteration w/ Samples

• Value iteration: $V^\ast_{i+1}(s) \leftarrow \max_a \mathbb{E}_{s' \sim P(s' \mid s, a)} [R(s, a, s') + \gamma V^\ast_i(s')]$
• But it's unclear how to use samples here: the max over actions sits outside the expectation, so a single sampled next state cannot estimate it directly

## Recap: Policy Iteration

• Policy evaluation for current policy $\pi _k$
• Iterate until convergence
$$V^{\pi_k}_{i+1}(s) \leftarrow \mathbb{E}_{s' \sim P(s' \mid s, \pi_k(s))} [R(s, \pi_k(s), s') + \gamma V^{\pi_k}_i(s')]$$
• This expectation can be approximated by samples; doing so is called temporal-difference (TD) learning.
• Policy improvement: find the best action according to a one-step look-ahead
$$\pi_{k+1}(s) \leftarrow \arg\max_a \mathbb{E}_{s' \sim P(s' \mid s, a)} [R(s, a, s') + \gamma V^{\pi_k}(s')]$$
• as before, it is unclear how to handle the max with samples alone
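The sample-based policy-evaluation step can be sketched as TD(0) on a hypothetical chain environment (invented for illustration: states 0..3, the fixed policy always moves right, and reaching state 3 gives reward +1 and terminates).

```python
import random

# TD(0) policy evaluation for a fixed policy on a toy chain (made-up MDP).
GAMMA, ALPHA = 0.9, 0.1
V = [0.0] * 4                       # value estimates V(s) for states 0..3

def step(s):
    """Follow the fixed policy pi(s) = 'right'; deterministic transitions."""
    s2 = s + 1
    r = 1.0 if s2 == 3 else 0.0
    return s2, r, s2 == 3

random.seed(0)
for _ in range(2000):
    s, done = 0, False
    while not done:
        s2, r, done = step(s)
        target = r if done else r + GAMMA * V[s2]
        # Running-average TD update: V(s) <- (1-alpha) V(s) + alpha * target
        V[s] = (1 - ALPHA) * V[s] + ALPHA * target
        s = s2
```

For this chain the estimates should approach $V^\pi = [\gamma^2, \gamma, 1] = [0.81, 0.9, 1.0]$ for states 0, 1, 2.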

## Generalizing Across States

• Learn about some small number of training states from experience
• Generalize that experience to new, similar situations

## Approximate Q-Learning

• Instead of a table, we have a parametrized Q function: $Q_\theta (s,a)$
• Can be a linear function in features
• Or a complicated neural network
• Learning rule:
• Remember: $target(s') = R(s, a, s') + \gamma \max_{a'} Q_{\theta_k}(s', a')$
• Update: $\theta_{k+1} \leftarrow \theta_k - \alpha \nabla_\theta \left[ \frac{1}{2} (Q_\theta(s, a) - target(s'))^2 \right] \Big|_{\theta = \theta_k}$
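For a linear Q-function $Q_\theta(s, a) = \theta^\top \phi(s, a)$, the gradient in the update above is just $(Q_\theta(s, a) - target(s'))\,\phi(s, a)$. A minimal sketch of one update, with a made-up feature map `phi` and a made-up transition:

```python
import numpy as np

# One approximate Q-learning update for a linear Q-function (hypothetical
# feature map and transition, for illustration only).
ALPHA, GAMMA = 0.1, 0.9

theta = np.zeros(3)

def phi(s, a):
    # Toy feature vector; real features would be domain-specific.
    return np.array([1.0, s, s * a])

def q(theta, s, a):
    return theta @ phi(s, a)

# One sampled transition (s, a, r, s'), with actions {-1, +1}
s, a, r, s2 = 1.0, 1, 0.5, 2.0
target = r + GAMMA * max(q(theta, s2, b) for b in (-1, 1))

# Gradient of 0.5*(Q_theta(s,a) - target)^2 w.r.t. theta is (Q - target)*phi
theta = theta - ALPHA * (q(theta, s, a) - target) * phi(s, a)
```

Starting from $\theta = 0$, the TD error is $0 - 0.5 = -0.5$, so this single step moves $\theta$ to $0.05 \cdot \phi(s, a)$. With a neural network, the same update is applied with the gradient computed by backpropagation.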