date | title |
---|---|
2020-12-06 | Introduction to Reinforcement Learning |
The goal of Reinforcement Learning (RL) is to learn a good strategy for the agent from experimental trials and the relatively simple feedback received. With the optimal strategy, the agent is able to actively adapt to the environment to maximize future rewards.
Bellman equations refer to a set of equations that decompose the value function into the immediate reward plus the discounted future values. $$\begin{aligned} V(s) &= \mathbb{E}[G_t | S_t = s] \\ &= \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s] \\ Q(s,a) &= \mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) | S_t = s, A_t = a] \end{aligned}$$
The optimal value functions $V_*$ and $Q_*$ satisfy the Bellman optimality equations: $$\begin{aligned} V_*(s) &= \max_{a}\sum_{s'}\sum_{r}p(s', r | s, a)[r + \gamma V_*(s')] \\ Q_*(s,a) &= \sum_{s'}\sum_{r}p(s', r | s, a)[r +\gamma\max_{a'}Q_*(s', a')] \end{aligned}$$
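As a concrete illustration of iterating this optimality backup (value iteration, one of the dynamic-programming methods discussed next), here is a minimal sketch on a tiny hypothetical MDP; the transition table `P`, the discount `GAMMA`, and the rewards are made up purely for this example.

```python
# Hypothetical 2-state, 2-action MDP used only for illustration.
# P[s][a] is a list of (probability, next_state, reward) tuples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9

def value_iteration(P, gamma, theta=1e-8):
    """Iterate the Bellman optimality backup until V converges (approximately) to V_*."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman optimality backup: best one-step lookahead over actions.
            v = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

print(value_iteration(P, GAMMA))
```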
When the model of the environment is known, following the Bellman equations, we can use Dynamic Programming (DP) to iteratively evaluate value functions and improve the policy.
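The evaluation half of this recipe can be sketched as iterative policy evaluation, shown below for a fixed policy; it reuses the hypothetical toy MDP `P` and discount `GAMMA` from the previous snippet, so it is an illustration rather than a definitive implementation.

```python
# Reuses the toy MDP `P` and discount `GAMMA` defined in the value-iteration sketch above.

def policy_evaluation(P, policy, gamma, theta=1e-8):
    """Repeatedly apply the Bellman expectation backup for a fixed policy until convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Expected one-step return under the policy, bootstrapping on current V.
            v = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

# Evaluate a uniform-random policy on the same toy MDP.
uniform = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
print(policy_evaluation(P, uniform, GAMMA))
```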
Given a policy and its value function, we can easily evaluate a change in the policy at a single state to a particular action. It is a natural extension to consider changes at all states and to all possible actions, selecting at each state the action that appears best according to $q_\pi(s, a)$.
Once a policy has been improved in this way to yield a better policy, we can compute its value function and improve it again, alternating evaluation and greedy improvement until the policy stops changing; this procedure is known as policy iteration.
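A minimal policy-iteration sketch following this idea is given below; it reuses the hypothetical `P`, `GAMMA`, and `policy_evaluation` from the previous snippets, all of which are illustrative assumptions rather than anything from the original text.

```python
# Reuses `P`, `GAMMA`, and `policy_evaluation` from the sketches above.

def q_from_v(P, V, s, gamma):
    """One-step lookahead: action values computed from the current state values."""
    return {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]) for a in P[s]}

def policy_iteration(P, gamma):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    policy = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}  # start uniform random
    while True:
        V = policy_evaluation(P, policy, gamma)
        stable = True
        for s in P:
            q = q_from_v(P, V, s, gamma)
            best = max(q, key=q.get)
            greedy = {a: (1.0 if a == best else 0.0) for a in P[s]}
            if greedy != policy[s]:
                stable = False
            policy[s] = greedy
        if stable:
            return policy, V

print(policy_iteration(P, GAMMA))
```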
Monte-Carlo (MC) methods require only experience: sample sequences of states, actions, and rewards from actual or simulated interaction with an environment. They learn from actual experience with no prior knowledge of the environment's dynamics. To compute the empirical return $G_t$, MC methods need to learn from complete episodes.
The empirical mean return for state $s$ is $$V(s) = \frac{\sum_{t=1}^T \mathbf{1}[S_t = s]\, G_t}{\sum_{t=1}^T \mathbf{1}[S_t = s]},$$ where $\mathbf{1}[S_t = s]$ is an indicator function.
This way of approximation can be easily extended to action-value functions by counting visits to state-action pairs $(s, a)$ rather than to states alone.
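A minimal first-visit MC prediction sketch is given below; the `sample_episode` callable standing in for actual environment rollouts is a hypothetical placeholder, so this is an illustration of the empirical-mean estimate under that assumption.

```python
from collections import defaultdict

def mc_prediction(sample_episode, num_episodes, gamma=0.9):
    """First-visit MC: estimate V(s) as the empirical mean of the returns
    observed after the first visit to s in each complete episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)
    for _ in range(num_episodes):
        episode = sample_episode()          # list of (state, action, reward) tuples
        G = 0.0
        for t in reversed(range(len(episode))):
            s, _, r = episode[t]
            G = r + gamma * G               # discounted return from time t onward
            if all(step[0] != s for step in episode[:t]):   # first visit to s in this episode
                returns_sum[s] += G
                returns_count[s] += 1
                V[s] = returns_sum[s] / returns_count[s]
    return dict(V)
```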
Using MC estimates of the action values, we can learn the optimal policy by iterating the following steps:

- Improve the policy greedily with respect to the current value function: $$\pi(s) = \arg\max_{a\in A}Q(s,a)$$
- Generate a new episode with the new policy $\pi$ (i.e. using algorithms like $\epsilon$-greedy helps us balance between exploitation and exploration).
- Estimate $Q$ using the new episode: $$q_\pi(s, a) = \frac{\sum_{t = 1}^T(\mathbf{1}[S_t = s, A_t = a]\sum_{k = 0}^{T-t-1}\gamma^kR_{t+k+1})}{\sum_{t=1}^T\mathbf{1}[S_t = s, A_t = a]}$$
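A sketch of this loop is below; the episode-generating function `sample_episode(policy)` and the discrete `actions` list are hypothetical placeholders, and the first-visit averaging mirrors the $q_\pi$ estimate above.

```python
import random
from collections import defaultdict

def mc_control(sample_episode, actions, num_episodes, gamma=0.9, epsilon=0.1):
    """On-policy MC control: estimate Q from episodes generated by an
    epsilon-greedy policy, and act greedily with respect to the latest Q."""
    Q = defaultdict(float)               # Q[(s, a)] action-value estimates
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)

    def policy(s):
        # Epsilon-greedy: explore with probability epsilon, otherwise exploit Q.
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        episode = sample_episode(policy)     # list of (state, action, reward) tuples
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            if all(step[:2] != (s, a) for step in episode[:t]):   # first visit to (s, a)
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
                Q[(s, a)] = returns_sum[(s, a)] / returns_count[(s, a)]
    return dict(Q)
```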
Temporal-difference (TD) learning is a combination of Monte Carlo ideas and dynamic programming (DP) ideas. Like Monte Carlo methods, TD methods can learn from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
Similar to Monte-Carlo methods, Temporal-Difference (TD) Learning is model-free and learns from episodes of experience. However, TD learning can learn from incomplete episodes.
MC regresses $V(S_t)$ toward the actual return $G_t$, which is only available once the episode ends.
TD estimates the return after a single step using the TD target $R_{t+1} + \gamma V(S_{t+1})$.
TD learning methods update value estimates toward targets that involve existing estimates, rather than relying exclusively on actual rewards and complete returns as MC methods do. This approach is known as bootstrapping. Compare the MC update (first line) with the TD(0) update (second line): $$ \begin{aligned} V(S_t) &\leftarrow V(S_t) +\alpha[G_t - V(S_t)] \\ V(S_t) &\leftarrow V(S_t) +\alpha[R_{t+1} +\gamma V(S_{t+1}) - V(S_t)] \end{aligned} $$
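A minimal tabular TD(0) sketch of this bootstrapped update follows; the `reset_fn`/`step_fn` environment interface and the uniform-random behaviour policy are assumptions made only for illustration.

```python
import random
from collections import defaultdict

def td0_prediction(reset_fn, step_fn, actions, num_episodes, alpha=0.1, gamma=0.9):
    """Tabular TD(0): after every single step, nudge V(S_t) toward the
    bootstrapped target R_{t+1} + gamma * V(S_{t+1})."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = reset_fn()                        # start state of a new episode
        done = False
        while not done:
            a = random.choice(actions)        # behaviour policy: uniform random (illustrative)
            s_next, r, done = step_fn(s, a)   # hypothetical environment transition
            target = r if done else r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])   # bootstrapped update; no full return needed
            s = s_next
    return dict(V)
```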
These are some of the basic methods used in Reinforcement Learning. There are many more advanced algorithms, but due to limited space they are not covered here. Feel free to update the wiki to keep track of the latest RL algorithms.
- Weng, Lilian. "A (Long) Peek into Reinforcement Learning." https://lilianweng.github.io/lil-log/2018/02/19/a-long-peek-into-reinforcement-learning.html
- Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press.
- Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. "Reinforcement Learning: A Survey." Journal of Artificial Intelligence Research 4 (1996): 237-285.