Sampling-Based Approximation

Lecture Note: Sample-based Approximations and Fitted Learning

Recap Q-Value

Q-Values Bellman Equation:

$$Q ^\ast (s, a) = \sum _{s\prime} P(s\prime | s, a) (R(s, a, s\prime) + \gamma \max _{a\prime} Q ^\ast (s\prime, a\prime))$$

Q-value iteration:

$$Q _{k+1} (s, a) \leftarrow \sum _{s\prime} P(s\prime| s, a) (R(s, a, s\prime) + \gamma \max _{a\prime} Q_k(s\prime, a\prime)$$

Rewrite as expectation:

$$Q _{k+1} \leftarrow \mathbb{E} _{s\prime ~ P(s\prime | s, a)} [ R(s, a, s\prime) + \gamma \max _{a\prime} Q _k(s\prime, s\prime) ] $$

Tabular Q-Learning

  • For an state-action pair $(s, a)$, receive: $s\prime ~ P(s\prime | s, a)$
  • Old estimate: $Q_k(s, a)$
  • New sample estimate: $target(s\prime) = R(s, a, s\prime) + \gamma \max _{a\prime} Q_k(s\prime, a\prime)$
    • $R(s, a, s\prime)$ is instant reward
    • $\gamma \max _{a\prime} Q _k(s\prime, a\prime)$ is the discounted version of the future value according to our current estimate
  • Incorporate the new estimate into a running average: $Q _{k+1}(s, a) \leftarrow (1 - \alpha) Q _k(s, a) + \alpha [target (s\prime)]$


  • Start with $Q _0{s, a}$ for all $s, a$,
  • Get initial state s
  • For $k = 1, 2, ... $ till convergence
    • sample action $a$, get next state $s\prime$
    • if $s\prime$ is terminal:
      • $target = R(s, a, s\prime)$
    • else:
      • $target = R(s, a, s\prime) + \gamma \max _{a\prime} Q _k(s\prime, a\prime)$
    • $Q _{k+1}(s,a) \leftarrow (1-\alpha) Q _k(s, a) + \alpha [target]$
    • $s \leftarrow s\prime$

How to sample actions

  • Choose random actions
  • Choose action that maximizes $Q _k(s, a)$, greedily
  • $\epsilon -$greedy: choose random action with prob $\epsilon$, otherwise choose action greedily

Q-Learning Properties

Off-Policy Learning: Q-learning converges to optimal policy, even if you are acting sub-optimally.

You have to explore enough. Eventually, you have to make the learning rate small enough, but not decrease it too quickly.

All states and actions are visited infinitely often.

GridWorld Demo

  • States: 11 cells
  • Actions: {up, down, left, right}
  • Deterministic transition function
  • Learning rate: 0.5
  • Discount: 1
  • Reward: +1 for getting diamond, -1 for falling into trap

Crawler Demo

  • States: discretized value of 2D state: (arm angle, hand angle)
  • Action: Cartesian product of {arm up, arm down} and {hand up, hand down}
  • Reward: speed in the forward direction

Value Iteration w/ Samples

  • Value Iteration: $V ^\ast_{i+1} (s) \leftarrow \max _{a} \mathbb{E} _{s\prime ~ P(s\prime | s, a)} [R(s, a, s\prime) + \gamma V ^\ast_i (s\prime)] $
  • But it's unclear how to draw samples through max

Recap: Policy Iteration

  • Policy evaluation for current policy $\pi _k$
    • Iterate until convergence
      $$ V ^{\pi _k} _{i+1}(s) \leftarrow \mathbb{E} _{s\prime ~ P(s\prime | s, \pi_k(s))} [R(s, \pi_k(s), s\prime) + \gamma V^{\pi_k}_i(s\prime)] $$
    • This can be approximated by samples. Called Temporal Difference Learning.
  • Policy improvement: find the best action according to one-step look-ahead
    $$ \pi_{k+1}(s) \leftarrow arg \max _a \mathbb{E} _{s\prime ~ P(s\prime | s, a)} [R(s, a, s\prime) + \gamma V^{\pi_k}(s\prime)]$$
    • unclear what to do with the max for now

Generalizing Across States

  • Learn about some small number of training states from experience
  • Generalize that experience to new, similar situations

Approximate Q-Learning

  • Instead of a table, we have a parametrized Q function: $Q_\theta (s,a)$
    • Can be a linear function in features
    • Or a complicated neural network
  • Learning rule:
    • Remember: $target(s\prime) = R(s, a, s\prime) + \gamma \max_{a\prime} Q_{\theta_k}(s\prime, a\prime)$
    • Update: $\theta _{k+1} \leftarrow \theta _k - \alpha \triangledown _\theta [\frac{1}{2} (Q _\theta (s, a) - target (s\prime)) ^ 2] | _{\theta = \theta_k} $