Following model-free prediction, we now look at how to optimize the value function without knowledge of the MDP, i.e., model-free control. There are two representative approaches: on-policy learning and off-policy learning.
On-Policy Learning
The representative examples of on-policy learning are Monte-Carlo control and Temporal-Difference control, which we will cover one by one. First, recall policy iteration from dynamic programming. The policy iteration we studied combined iterative policy evaluation with greedy policy improvement, but in fact other algorithms can be plugged into the evaluation and improvement steps: it suffices to evaluate $\pi$ with some evaluation algorithm and then improve it with some improvement algorithm.
In other words, policy iteration is highly general, and this generality lets us adapt it to the model-free setting where the model is unknown. Monte-Carlo control and Temporal-Difference control are both built on this policy iteration framework. So we now consider how to generalize policy iteration to the model-free setting. One problem arises: greedy policy improvement over $V(s)$ necessarily requires the MDP model:
$$
\begin{aligned}
\pi^\prime (s) = \underset{a \in \mathcal{A}}{\text{ argmax }} [\mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a V(s^\prime)]
\end{aligned}
$$
As you can see, improving the policy amounts to comparing $Q(s, a)$ over all possible actions $a$, and computing those values from $V(s)$ requires the reward and transition model of the MDP. Conversely, if we already know $Q(s, a)$ for every possible action $a \in \mathcal{A}$ at state $s$, policy improvement is possible without the MDP model. Therefore, in model-free control, if the policy evaluation step estimates $Q = q_\pi$ instead of $V = v_\pi$, greedy policy improvement becomes as simple as:
$$
\begin{aligned}
\pi^\prime (s) = \underset{a \in \mathcal{A}}{\text{ argmax }} Q(s, a)
\end{aligned}
$$
In theory, there is now no major obstacle to implementing model-free policy evaluation. However, if the agent always acts greedily in every state, some state-action pairs will never be visited, so their values are never properly updated and the performance of the algorithm can degrade. For example, consider a state $s$ with two possible actions $a_1$ and $a_2$. If $Q(s, a_1)$ is well estimated because it is visited frequently while $Q(s, a_2)$ is not, the agent is unlikely to understand the environment well enough to choose the right action.
Various remedies have been tried. One is to guarantee that every state-action pair can start an episode, by forcing (or assuming) a non-zero probability for each state-action pair to be selected as the initial state and action. This is called exploring starts.
$\boldsymbol{\varepsilon}$-greedy Exploration
$\boldsymbol{\varepsilon}$-greedy exploration is another way to keep the agent exploring. We set a hyperparameter $\varepsilon \in [0, 1]$ and, whenever the agent is about to act greedily, it instead takes a random action with probability $\varepsilon$. As a result, every action in the action space is selected with non-zero probability. The following is the $\boldsymbol{\varepsilon}$-greedy policy for a finite action space:
$$
\begin{aligned}
\pi (a|s) = \begin{cases}
\frac{\varepsilon}{m} + 1 - \varepsilon && \text{ if } a = a^\star = \underset{a^\prime \in \mathcal{A}(s)}{\text{ argmax }} Q(s, a^\prime) \\
\frac{\varepsilon}{m} && \text{ otherwise }
\end{cases}
\end{aligned}
$$
where $m = \| \mathcal{A}(s) \|$. And here is a theorem that ensures the improvement of $\varepsilon$-greedy policy iteration: <br>
$\mathbf{Thm.}$ **The monotone improvement of** $\boldsymbol{\varepsilon}$-**greedy policy iteration** <br>
For any $\varepsilon$-soft policy $\pi$ (e.g., an $\varepsilon$-greedy policy), the $\varepsilon$-greedy policy $\pi^\prime$ with respect to $q_\pi$ is an improvement, i.e., $v_{\pi^\prime} (s) \geq v_\pi (s)$ for all $s \in \mathcal{S}$. <br>
<details>
<summary>$\mathbf{Proof.}$ </summary>
<div markdown="1">
$$
\begin{aligned}
q_{\pi} (s, \pi^\prime (s)) & = \sum_{a \in \mathcal{A}(s)} \pi^\prime (a | s) q_\pi (s, a) \\
& = \frac{\varepsilon}{m} \sum_{a \in \mathcal{A}(s)} q_\pi (s, a) + (1 - \varepsilon) \underset{a \in \mathcal{A}(s)}{\text{ max }} q_\pi (s, a)
\end{aligned}
$$
For the first equality, refer to the dynamic programming post. We claim that $\underset{a \in \mathcal{A}(s)}{\text{ max }} q_\pi (s, a) \geq \sum_{a \in \mathcal{A}(s)} \frac{\pi (a \| s) - \frac{\varepsilon}{m}}{1 - \varepsilon} q_\pi (s, a)$. <br>
Let $w_a = \frac{\pi (a \| s) - \frac{\varepsilon}{m}}{1 - \varepsilon}$. Since $\pi$ is $\varepsilon$-soft, each $w_a \geq 0$, and $\sum_{a \in \mathcal{A}(s)} w_a = 1$ as $m = \| \mathcal{A}(s) \|$. Then $\sum_{a \in \mathcal{A}(s)} w_a \, q_\pi (s, a)$ is a weighted average of the $q_\pi (s, a)$ with non-negative weights summing to $1$, so it must be less than or equal to the largest of the $q_\pi (s, a)$. <br>
Hence,
$$
\begin{aligned}
q_\pi (s, \pi^\prime (s)) & = \sum_{a \in \mathcal{A}(s)} \pi^\prime (a | s) q_\pi (s, a) \\
& = \frac{\varepsilon}{m} \sum_{a \in \mathcal{A}} q_\pi (s, a) + (1 - \varepsilon) \underset{a \in \mathcal{A}(s)}{\text{ max }} q_\pi (s, a) \\
& \geq \frac{\varepsilon}{m} \sum_{a \in \mathcal{A}(s)} q_\pi (s, a) + (1 - \varepsilon) \sum_{a \in \mathcal{A}(s)} \frac{\pi (a | s) - \frac{\varepsilon}{m} }{1 - \varepsilon} q_\pi (s, a) \\
& = \sum_{a \in \mathcal{A}(s)} \pi (a | s) q_\pi (s, a) = v_\pi (s)._\blacksquare
\end{aligned}
$$
which proves the claim by the policy improvement theorem. <br>
</div>
</details>
<br>
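To make the exploration rule concrete, here is a minimal NumPy sketch of $\varepsilon$-greedy action selection over a tabular $Q$; the function name and array layout are my own choices for illustration, not from the lecture.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng):
    """Sample an action from the epsilon-greedy policy w.r.t. a tabular Q[state, action]."""
    num_actions = Q.shape[1]
    if rng.random() < epsilon:
        # explore: uniform over all m actions, so each gets probability epsilon / m
        return int(rng.integers(num_actions))
    # exploit: with probability 1 - epsilon take the greedy action,
    # giving it total probability (1 - epsilon) + epsilon / m
    return int(np.argmax(Q[state]))

# usage on a hypothetical 5-state, 3-action table
rng = np.random.default_rng(0)
Q = np.zeros((5, 3))
a = epsilon_greedy_action(Q, state=2, epsilon=0.1, rng=rng)
```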
Monte-Carlo Control
In Monte-Carlo policy iteration, we use Monte-Carlo prediction of the action value for policy evaluation. To estimate action values with the Monte-Carlo method, we must consider not only the state but also the action: in the first-visit and every-visit MC methods, $Q(s, a)$ is approximated by averaging the returns observed after the visits to state $s$ in which action $a$ was taken immediately. <br>
<p align="center">
<img width="400" alt="image" src="https://user-images.githubusercontent.com/88715406/231189909-45155d92-f1fe-49f3-918f-3fd9da7d7408.png">
</p>
<p align="center">
$\mathbf{Fig\ 2.}$ Monte-Carlo Policy Iteration
</p><br><br>
However, the convergence of Monte-Carlo prediction is only assured as the number of samples goes to infinity. In practice we avoid an infinite number of episodes by stopping MC policy evaluation after a finite number of steps: each evaluation iteration moves our estimate of the value function toward the true $q_{\pi_k}$, but we give up before reaching the exact value, which would require a prohibitively expensive number of iterations. (One extreme form of this idea is value iteration in DP: even though only one sweep of policy evaluation is performed between each step of policy improvement, convergence of the algorithm is still ensured by the Banach fixed point theorem.) <br>
It is then natural to alternate between evaluation and improvement on an episode-by-episode basis: after each episode, the observed returns are used for policy evaluation, and then the policy is improved at all of the states visited in the episode. <br>
<p align="center">
<img width="400" alt="image" src="https://user-images.githubusercontent.com/88715406/230547912-c0a20bb8-a9fd-4249-810b-66f9e4ee5030.png">
</p>
<p align="center">
<img width="400" alt="image" src="https://user-images.githubusercontent.com/88715406/233781036-9ca4a3e3-9f20-4fc4-ba05-cdcd19d6332a.png">
</p>
<p align="center">
$\mathbf{Fig\ 3.}$ Monte-Carlo Control
</p><br>
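As a rough illustration of the loop in $\mathbf{Fig\ 3.}$, here is a sketch of on-policy (every-visit) MC control with a fixed $\varepsilon$; it assumes a Gymnasium-style environment with discrete state and action spaces and omits a GLIE schedule for $\varepsilon$, so it is a sketch under those assumptions rather than the exact pseudocode above.

```python
import numpy as np
from collections import defaultdict

def mc_control(env, num_episodes, gamma=1.0, epsilon=0.1, seed=0):
    """On-policy every-visit MC control with a fixed epsilon-greedy policy.
    Assumes a Gymnasium-style env with discrete observation/action spaces."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))
    N = defaultdict(lambda: np.zeros(nA))     # visit counts for incremental averaging
    rng = np.random.default_rng(seed)

    for _ in range(num_episodes):
        # (1) generate an episode by following the current epsilon-greedy policy
        episode, done = [], False
        state, _ = env.reset()
        while not done:
            if rng.random() < epsilon:
                action = int(rng.integers(nA))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated

        # (2) policy evaluation: push Q(s, a) toward the observed returns;
        #     the improvement step is implicit, since action selection above
        #     is always epsilon-greedy w.r.t. the latest Q
        G = 0.0
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            N[state][action] += 1.0
            Q[state][action] += (G - Q[state][action]) / N[state][action]
    return Q
```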
On-policy Temporal Difference Control (SARSA)
We have already seen that Temporal-Difference (TD) learning has several advantages over Monte-Carlo (MC). In **Temporal-Difference control**, we use the TD method instead of MC to estimate $Q(s, a)$ and exploit those advantages. Recall that an episode consists of an alternating sequence of state–action pairs (and rewards between them):
<p align="center">
<img width="600" alt="image" src="https://user-images.githubusercontent.com/88715406/230739354-b637573f-82b4-41bf-a193-551c41cde260.png">
</p>
<p align="center">
$\mathbf{Fig\ 4.}$ Some parts of an episode
</p><br>
Here is the update formula used in policy evaluation; $Q(s, a)$ is updated at *every time step* of each episode. <br>
$$
\begin{aligned}
Q(s, a) \longleftarrow Q(s, a) + \alpha (R + \gamma Q(s^\prime, a^\prime) - Q(s, a))
\end{aligned}
$$
<p align="center">
<img width="100" alt="image" src="https://user-images.githubusercontent.com/88715406/229799838-10b5e034-79eb-43b7-a08c-1fa19c63a814.png">
</p>
<p align="center">
$\mathbf{Fig\ 5.}$ The backup diagram of SARSA
</p><br>
Here is a motivation for this target, derived from the Bellman expectation equation with one-sample Monte-Carlo estimation:
$$
\begin{aligned}
q_\pi (s, a) & = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \sum_{a^\prime \in \mathcal{A}} \pi(a^\prime | s^\prime) q_\pi (s^\prime, a^\prime) \\
& = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \mathbb{E}_{a^\prime \sim \pi( \cdot | s^\prime)} [q_\pi (s^\prime, a^\prime)] \\
& \approx \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a
q_\pi (s^\prime, A^\prime) \text{ where } A^\prime \sim \pi( \cdot | s^\prime) \\
& \approx \mathcal{R}_s^a + \gamma q_\pi (S^\prime, A^\prime) \text{ where } S^\prime \sim \mathcal{P}_s^a, A^\prime \sim \pi( \cdot | S^\prime) \\
\end{aligned}
$$
Note that the tuple $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ is all we need for an update, which gives rise to the name **SARSA** for the algorithm. <br>
Now it is straightforward to design the full control algorithm with TD prediction. As usual, it follows the pattern of generalized policy iteration, but we should use an alternative policy improvement step (or add some assumptions to the policy), such as $\varepsilon$-greedy, to assure convergence: all state–action pairs must be visited an infinite number of times. <br>
The following figure and pseudocode show the implementation of the SARSA algorithm: <br>
<p align="center">
<img width="400" alt="image" src="https://user-images.githubusercontent.com/88715406/230740333-2c7d3328-f316-4242-a53b-095d812503fb.png">
</p>
<p align="center">
$\mathbf{Fig\ 6.}$ SARSA
</p><br>
<p align="center">
<img width="500" alt="image" src="https://user-images.githubusercontent.com/88715406/229800728-a135b607-6859-470e-afe6-69879cbf4774.png">
</p>
<p align="center">
$\mathbf{Fig\ 7.}$ The pseudocode for SARSA
</p><br>
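A minimal tabular SARSA sketch corresponding to the pseudocode above might look as follows, assuming a Gymnasium-style environment with discrete state and action spaces (the hyperparameter values are arbitrary illustration choices).

```python
import numpy as np

def sarsa(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular SARSA with an epsilon-greedy policy used both to act and to bootstrap.
    Assumes a Gymnasium-style env with discrete state/action spaces."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def policy(s):
        if rng.random() < epsilon:
            return int(rng.integers(env.action_space.n))
        return int(np.argmax(Q[s]))

    for _ in range(num_episodes):
        s, _ = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            a2 = policy(s2)                        # sample A' from the same policy
            target = r + (0.0 if terminated else gamma * Q[s2, a2])
            Q[s, a] += alpha * (target - Q[s, a])  # TD(0) update on (S, A, R, S', A')
            s, a = s2, a2
    return Q
```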
Like TD prediction, SARSA can be extended to $n$-step SARSA and SARSA$(\lambda)$. <br>
$$
\begin{alignat*}{2}
&n = 1 \text{ (SARSA) } &&\longrightarrow q_t^{(1)} = R_{t+1} + \gamma Q (S_{t+1}, A_{t+1}) \\
&n = 2 &&\longrightarrow q_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q (S_{t+2}, A_{t+2}) \\
&\quad \vdots \\
&n = \infty \text{ (MC) } &&\longrightarrow q_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-t-1} R_T \\
\end{alignat*}
$$
where $q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n Q (S_{t+n}, A_{t+n})$ is the $n$-step $Q$-return. And the $q^\lambda$ return combines all $n$-step $Q$-returns:
$$
\begin{aligned}
q_t^\lambda = (1- \lambda) \sum_{n=1}^\infty \lambda^{n-1} q_t^{(n)}
\end{aligned}
$$
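For illustration, here is one way to compute the $n$-step $Q$-return and the forward-view $q_t^\lambda$ from a recorded episode. The helper names and the indexing convention (`rewards[k]` $= R_{k+1}$, `q_values[k]` $= Q(S_k, A_k)$) are my own assumptions; for an episodic task the infinite sum collapses, with the tail weight placed on the full return.

```python
def n_step_q_return(rewards, q_values, t, n, gamma):
    """q_t^{(n)} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n Q(S_{t+n}, A_{t+n}).

    rewards[k]  : R_{k+1}, the reward received after step k (length T)
    q_values[k] : Q(S_k, A_k) along the trajectory (length T)
    If t + n reaches the end of the episode, the bootstrap term is dropped (MC return).
    """
    T = len(rewards)
    end = min(t + n, T)
    G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
    if t + n < T:
        G += gamma ** n * q_values[t + n]
    return G

def lambda_q_return(rewards, q_values, t, gamma, lam):
    """Forward-view q_t^lambda = (1 - lam) * sum_n lam^{n-1} q_t^{(n)},
    with the remaining geometric weight collapsed onto the full (MC) return."""
    T = len(rewards)
    G_lam, weight = 0.0, 1.0
    for n in range(1, T - t):
        G_lam += (1 - lam) * weight * n_step_q_return(rewards, q_values, t, n, gamma)
        weight *= lam
    # remaining weight lam^{T-t-1} goes to the full return q_t^{(T-t)}
    G_lam += weight * n_step_q_return(rewards, q_values, t, T - t, gamma)
    return G_lam
```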
<br>
Off-Policy Learning
In off-policy learning, we distinguish two kinds of policies: the policy to optimize (the target policy $\pi(a \| s)$; *learning its value function is the target of the learning process*) and the policy that generates experience, i.e., actions (the behaviour policy $\mu (a \| s)$; *the policy controlling the agent and generating actions*). <br>
* Evaluate target policy $\pi (a \| s)$ to compute $v_\pi (s)$ or $q_\pi (s, a)$ <br>
* while following behaviour policy $\mu (a \| s)$ that is independent from $\pi$,
$$
\begin{aligned}
\{ S_1, A_1, R_2, \cdots, S_t \} \sim \mu
\end{aligned}
$$
There are several reasons for that: <br>
**(1)** It allows the agent to learn from other sources such as humans or other agents. <br>
**(2)** The target policy may be deterministic, while the behavior policy can continue to sample all possible actions. <br>
**(3)** The agent can re-use experience from old policies $\pi_1, \cdots, \pi_{k - 1}$. <br>
**(4)** The agent can learn about optimal policy while following exploratory policy. <br>
**(5)** The agent can learn about multiple policies while following one policy. <br>
<br>
Learning with two different policies in this way is called **off-policy** learning, since the agent learns about a policy from experience *off* (not following) that policy. Note that learning $\pi$ from episodes of $\mu$ is possible only when $\pi(a \| s) > 0 \Longrightarrow \mu (a \| s) > 0$, i.e., every action taken under $\pi$ is also taken under $\mu$. <br><br>
Importance Sampling
We may turn on-policy algorithms such as MC and TD prediction into off-policy ones. One possible technique is **importance sampling**, a general method for estimating an expectation under one distribution given samples from another. Since the samples are generated by a different policy, we need to estimate an expectation under a distribution other than the one we sampled from.
$$
\begin{aligned}
\mathbb{E}_{X \sim P} [f (X)] & = \sum P(X) f(X) \\
& = \sum Q(X) \frac{P(X)}{Q(X)} f(X) \\
& = \mathbb{E}_{X \sim Q} [\frac{P(X)}{Q(X)} f (X)] \\
& \approx \frac{1}{N} \sum_{n=1}^N \frac{P(X^{(n)})}{Q(X^{(n)})} f(X^{(n)}) \text{ where } X^{(n)} \sim Q
\end{aligned}
$$
Also, if $P(X)$ is too complex to sample from, the sampling distribution of $X$ can be changed to another distribution $Q(X)$ via this technique. The ratio of the two distributions $\frac{P(X)}{Q(X)}$ is usually called the **importance weight**, or **importance-sampling ratio**. <br>
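A tiny numerical sketch of the idea: estimating $\mathbb{E}_{X \sim P}[X^2]$ with samples drawn from a different distribution $Q$ (the particular Gaussians here are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)

# Target P = N(1, 1), proposal Q = N(0, 4); estimate E_{X~P}[X^2] using samples from Q.
def p_pdf(x): return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
def q_pdf(x): return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # X^{(n)} ~ Q
weights = p_pdf(x) / q_pdf(x)                      # importance-sampling ratios P/Q
estimate = np.mean(weights * x ** 2)               # (1/N) * sum_n w_n f(x_n)
print(estimate)   # close to E_{X ~ N(1,1)}[X^2] = Var + mean^2 = 2.0
```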
Off-Policy Monte-Carlo
Note that the probability of the state-action trajectory $S_t, A_t, S_{t+1}, A_{t+1}, \cdots, S_T$ under policy $\pi$ is
$$
\begin{aligned}
\prod_{k=t}^{T-1} \pi (A_k | S_k) \cdot p(S_{k+1} | S_k, A_k).
\end{aligned}
$$
Thus, the importance weight for this trajectory is
$$
\begin{aligned}
\rho_t^T = \frac{\prod_{k=t}^{T-1} \pi (A_k | S_k) \cdot p(S_{k+1} | S_k, A_k)}{\prod_{k=t}^{T-1} \mu (A_k | S_k) \cdot p(S_{k+1} | S_k, A_k)} = \prod_{k=t}^{T-1} \frac{\pi (A_k | S_k)}{\mu (A_k | S_k)}
\end{aligned}
$$
We can see that the ratio is independent of the environment, i.e., the transition probabilities, so it does not break the model-free nature of off-policy learning. In the off-policy version of MC learning, returns generated from $\mu$ are used to evaluate $\pi$. Hence, to correct the return generated by the behaviour policy $\mu$, we multiply by the importance-sampling ratios along the whole episode:
$$
\begin{aligned}
G_t^{\pi / \mu} & = \frac{\pi (A_t | S_t) }{\mu (A_t | S_t)} \cdot \frac{\pi (A_{t+1} | S_{t+1})}{\mu (A_{t+1} | S_{t+1})} \cdots \frac{\pi (A_{T-1} | S_{T-1})}{\mu (A_{T-1} | S_{T-1})} G_t \\
& = \frac{P(\tau | S_t)}{Q(\tau | S_t)} G_t
\end{aligned}
$$
Then we update value by modified return,
$$
\begin{aligned}
V(S_t) \longleftarrow V(S_t) + \alpha (G_t^{\pi / \mu} - V(S_t))
\end{aligned}
$$
and our estimate of $v_\pi (s)$ by Monte-Carlo method is as follows:
$$
\begin{aligned}
V(s)= \frac{\sum_{t \in \mathcal{T}(s)} \rho_t^{T(t)} G_t}{|\mathcal{T}(s)|}
\end{aligned}
$$
where $\mathcal{T}(s)$ is the set of all time steps at which state $s$ is visited. But since the product of many importance ratios can dramatically increase the variance, importance sampling over full episodes is not used very often in practice. <br><br>
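Nevertheless, a sketch of this ordinary importance-sampling MC evaluation may help; the episode format and the function signatures for `pi` and `mu` below are my own choices (every-visit averaging is used).

```python
import numpy as np
from collections import defaultdict

def off_policy_mc_prediction(episodes, pi, mu, gamma=1.0):
    """Ordinary importance-sampling estimate of v_pi from episodes generated by mu.

    episodes: list of trajectories [(S_0, A_0, R_1), (S_1, A_1, R_2), ...]
    pi, mu:   functions (action, state) -> probability under target / behaviour policy
    """
    returns = defaultdict(list)
    for episode in episodes:
        G, rho = 0.0, 1.0
        # walk the episode backwards, accumulating the return G_t and the ratio rho_t^T
        for s, a, r in reversed(episode):
            G = gamma * G + r
            rho *= pi(a, s) / mu(a, s)
            returns[s].append(rho * G)
    return {s: float(np.mean(rs)) for s, rs in returns.items()}
```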
### Off-Policy TD with importance sampling
TD may also be utilized instead of MC. As in MC, the off-policy version corrects TD targets generated by the behaviour policy $\mu$.
Recall the Bellman expectation equation for $v_\pi (s)$:
$$
\begin{aligned}
v_\pi (s) & = \sum_{a \in A} \pi (a | s) ( \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a v_\pi (s^\prime)) \\
& = \sum_{a \in A} \pi (a | s) ( \mathcal{R}_s^a + \gamma \mathbb{E}_{s^\prime \sim \mathcal{P}_{ss^\prime}^a} [ v_\pi (s^\prime) ]) \\
\end{aligned}
$$
But this is not yet off-policy, so we multiply by the importance weight:
$$
\begin{aligned}
v_\pi (s) & = \sum_{a \in A} \mu (a | s) \frac{\pi (a | s)}{\mu (a | s)} ( \mathcal{R}_s^a + \gamma \mathbb{E}_{s^\prime \sim \mathcal{P}_{ss^\prime}^a} [ v_\pi (s^\prime) ]) \\
\end{aligned}
$$
Since we consider only one time step, a single importance weight suffices. (Hence it has much lower variance than MC importance sampling.) Finally, we remove the need for the transition matrix by approximating the expectation with a sampled next state. Here is the resulting iterative update formula:
$$
\begin{aligned}
V(S_t) & \longleftarrow V(S_t) \\
& + \alpha ( \frac{\pi (A_t | S_t)}{\mu (A_t | S_t)} (R_{t+1} + \gamma V(S_{t+1})) - V(S_t))
\end{aligned}
$$
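In code, a single such update might look like the following sketch; `pi` and `mu` are assumed to be functions returning action probabilities, and `V` any indexable value table.

```python
def off_policy_td_update(V, s, a, r, s_next, pi, mu, alpha=0.1, gamma=0.99):
    """One off-policy TD(0) update of V(s) toward the importance-weighted target.
    pi, mu: functions (action, state) -> probability under target / behaviour policy."""
    rho = pi(a, s) / mu(a, s)                      # single-step importance weight
    td_target = rho * (r + gamma * V[s_next])
    V[s] += alpha * (td_target - V[s])
    return V
```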
Q-Learning
Now, let's consider off-policy learning of action-values $Q(s, a)$ **without importance sampling**, which is called **Q-learning**, one of the most important breakthroughs in RL.
Recall the Bellman optimality equation for $Q$, which is the true target for control if we know the transition matrix:
$$
\begin{aligned}
q_* (s, a) = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \underset{a^\prime \in \mathcal{A}}{\text{ max }} q_* (s^\prime, a^\prime)
\end{aligned}
$$
Note that evaluating this equation does not require a policy: there is no expectation with respect to $\pi$, so we do not need to sample from any policy. The corresponding value-iteration-style update is:
$$
\begin{aligned}
q_{k+1} (s, a) = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \underset{a^\prime \in \mathcal{A}}{\text{ max }} q_k (s^\prime, a^\prime)
\end{aligned}
$$
Now we apply MC estimation to the above; in other words, we approximate the expectation with respect to $\mathcal{P}_{ss^\prime}^a$ by sampling a single next state:
$$
\begin{aligned}
q_{k+1} (s, a) = \mathcal{R}_s^a + \gamma \underset{a^\prime \in \mathcal{A}}{\text{ max }} q_k (s^\prime, a^\prime) \text{ where } s^\prime \sim \mathcal{P}_{ss^\prime}^a
\end{aligned}
$$
So we do not need importance sampling here! Finally, Q-learning is an incremental update using this Monte-Carlo estimated target:
$$
\begin{aligned}
q_{k+1} (s, a) = q_k (s, a) + \alpha ( R_{s}^a + \gamma \underset{a^\prime}{\text{ max }} q_k (s^\prime, a^\prime) - q_k (s, a))
\end{aligned}
$$
Hence, under this update rule, the learned $Q$ directly estimates the optimal $$Q_*$$, and Q-learning is like value iteration on $Q$ with one-step MC sampling of $s^\prime$. In other words, **Q-learning aims to learn the values of the optimal greedy policy**. Notice that the state-action pair whose Q-value is updated (and the next action) is determined by the behaviour policy $\mu$, so the algorithm is off-policy. <br>
It is known that Q-learning control converges with probability 1 to the optimal action-value function, provided the behaviour policy visits every state-action pair infinitely often (so that all pairs continue to be updated) and the usual step-size conditions hold.
The following figure and pseudocode summarize Q-learning: <br>
<p align="center">
<img width="100" alt="image" src="https://user-images.githubusercontent.com/88715406/231223987-84e84ff5-cbf8-4fce-8431-7ffeab98165f.png">
</p>
<p align="center">
$\mathbf{Fig\ 8.}$ The backup diagram for Q-Learning
</p><br>
<p align="center">
<img width="500" alt="image" src="https://user-images.githubusercontent.com/88715406/231224297-3c3450b7-d98c-480c-a065-afdb268a155f.png">
</p>
<p align="center">
$\mathbf{Fig\ 9.}$ The pseudocode for Q-Learning
</p><br>
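A minimal tabular Q-learning sketch matching the pseudocode, again assuming a Gymnasium-style discrete environment; note that the behaviour policy is $\varepsilon$-greedy while the bootstrap target takes the max.

```python
import numpy as np

def q_learning(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1, seed=0):
    """Tabular Q-learning: epsilon-greedy behaviour policy, greedy (max) target.
    Assumes a Gymnasium-style env with discrete state/action spaces."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    for _ in range(num_episodes):
        s, _ = env.reset()
        done = False
        while not done:
            # behaviour policy mu: epsilon-greedy w.r.t. the current Q
            if rng.random() < epsilon:
                a = int(rng.integers(env.action_space.n))
            else:
                a = int(np.argmax(Q[s]))
            s2, r, terminated, truncated, _ = env.step(a)
            done = terminated or truncated
            # target policy pi: greedy -> bootstrap with max_a' Q(S', a')
            target = r + (0.0 if terminated else gamma * np.max(Q[s2]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```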
<details>
<summary>$\mathbf{Remark.}$ If all policies are greedy in Q-learning and SARSA, are two algorithms equivalent? </summary>
<div markdown="1">
(Reference: [**[5]**](#reference))
No. SARSA samples the next action used in the update *before updating the Q-values*, whereas Q-learning selects its next action *after updating*. (Compare the pseudocode of the two algorithms.) <br>
For an explicit example, consider the transition from $S$ to $S^\prime = S$. <br>
* **SARSA** <br>
* Initialize $S_1 = S$, and choose $$A_1 = \text{ argmax}_A \; Q_1 (S_1 = S, A)$$
* Take action $A_1$ and obtain $R$ and $S_2 = S^\prime = S$
* Choose action $$A_2 = \text{ argmax}_A \; Q_1 (S_2 = S, A)$$
* $Q_2 (S_1, A_1) = Q_1 (S_1, A_1) + \alpha \left( R + \gamma Q_1 (S_2, A_2) - Q_1(S_1, A_1) \right)$
* **Q-Learning** <br>
* Initialize $S_1 = S$
* Choose $$A_1 = \text{ argmax}_A \; Q_1 (S_1 = S, A)$$
* Take action $A_1$ and obtain $R$ and $S_2 = S^\prime = S$
* $Q_2 (S_1, A_1) = Q_1 (S_1, A_1) + \alpha \left( R + \gamma \text{ max}_A \; Q_1 (S_2, A) - Q_1(S_1, A_1) \right)$
* Choose action $$A_2 = \text{ argmax}_A \; Q_2 (S_2 = S, A)$$ <br>
Note that $A_1$ is the action used for the 1st update and $A_2$ for the 2nd update. It can be deduced that the fundamental reason for this difference is the on-policy vs. off-policy distinction.
</div>
</details><br>
<details>
<summary>$\mathbf{Remark.}$ What if we do 2-step Q-Learning? </summary>
<div markdown="1">
Like $n$-step $\text{SARSA}$ or $\text{SARSA}(\lambda)$, can we do 2-step Q-Learning?
$$
\begin{aligned}
q_{k+1} (s, a) = q_k (s, a) + \alpha (\mathcal{R}_s^a + \gamma R_{s^\prime}^{a^\prime} + \gamma^2 \underset{a^{\prime \prime} \in \mathcal{A}}{\text{ max }} q_k (s^{\prime \prime}, a^{\prime \prime}) - q_k (s, a))
\end{aligned}
$$
Actually, the above target involves the policy $\pi$: to obtain the second state and action, we must sample $a^\prime$ from $\pi$. More precisely, writing the target without the MC approximations:
$$
\begin{aligned}
q_{k+1} (s, a) & = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \sum_{a^\prime \in \mathcal{A}} \boldsymbol{\pi (a^\prime | s^\prime)} \left( \mathcal{R}_{s^\prime}^{a^\prime} + \gamma \sum_{s^{\prime \prime} \in \mathcal{S}} \mathcal{P}_{s^{\prime} s^{\prime \prime}}^{a^\prime} \underset{a^{\prime \prime} \in \mathcal{A}}{\text{ max }} q_k (s^{\prime \prime}, a^{\prime \prime}) \right) \\
& = \mathcal{R}_s^a + \gamma \sum_{s^\prime \in \mathcal{S}} \mathcal{P}_{ss^\prime}^a \sum_{a^\prime \in \mathcal{A}} \boldsymbol{\mu (a^\prime | s^\prime) \frac{\pi (a^\prime | s^\prime)}{\mu (a^\prime | s^\prime)}} \left( \mathcal{R}_{s^\prime}^{a^\prime} + \gamma \sum_{s^{\prime \prime} \in \mathcal{S}} \mathcal{P}_{s^{\prime} s^{\prime \prime}}^{a^\prime} \underset{a^{\prime \prime} \in \mathcal{A}}{\text{ max }} q_k (s^{\prime \prime}, a^{\prime \prime}) \right) \\
\end{aligned}
$$
Hence, when $a^\prime$ is sampled from $\mu$, we should correct the scale by multiplying by the importance weight to obtain a valid target.
</div>
</details><br>
SARSA vs. Q-learning
So far we have considered two versions of TD control: SARSA (on-policy) and Q-learning (off-policy). <br>
SARSA needs to sample $A_{t+1} = a^\prime$ from $\pi$ in order to approximate the Bellman expectation equation, and this sampled action is also the next action taken, since the agent follows $\pi$. In contrast, although Q-learning's target is computed as if following the target policy $\pi$ (in Q-learning, $\pi$ is the greedy policy), the agent actually follows $\mu$: the state-action pairs whose Q-values are updated are generated by $\mu$. Hence, **Q-learning is off-policy whereas SARSA is on-policy**. [**[4]**](#reference) <br>
Although they look very similar, the two algorithms are distinctly different. We also confirmed that even if all policies in the two algorithms are greedy, SARSA and Q-learning are not the same. Comparing their TD targets, **SARSA aims to learn the optimal $\varepsilon$-greedy policy, while Q-learning aims to learn the optimal greedy policy.** <br>
One of the well-known examples that highlights the effect of this slight difference is *Cliff Walking* (Example 6.6, page 106, of [Reinforcement Learning: An Introduction by Sutton and Barto](#reference)). It is very similar to FrozenLake but with a larger observation space.
In this environment, the agent must reach the goal along the edge of a cliff without falling off. There is a $-1$ reward for each step; falling off the cliff is penalized with a $-100$ reward and sends the agent instantly back to the initial state.
By SARSA and Q-learning, the agent learns the values and find the path as follows:
<p align="center">
<img width="500" alt="image" src="https://user-images.githubusercontent.com/88715406/234849536-d249711b-7e07-432c-82f9-96acfd9bf95e.png">
</p>
<p align="center">
$\mathbf{Fig\ 10.}$ The result of SARSA and Q-learning in cliff-walking task. $\varepsilon = 0.1$
</p><br>
Q-learning learns the optimal path and policy: it travels right along the edge of the cliff. SARSA, on the other hand, takes the ($\varepsilon$-greedy) action selection into account and learns the longer but much safer path: it travels through the topmost cells. However, Q-learning's online performance under the exploratory behaviour is worse than that of SARSA, which learns the roundabout policy.
But note that this result does not mean that SARSA cannot converge to the optimal policy: if $\varepsilon$ of the $\varepsilon$-greedy policy is gradually reduced during training, both methods asymptotically converge to the optimal policy. <br>
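For reference, one could reproduce this comparison with the `sarsa` and `q_learning` sketches above on Gymnasium's `CliffWalking-v0` (assuming the `gymnasium` package is installed; the episode count is an arbitrary choice).

```python
import gymnasium as gym
import numpy as np

# Gymnasium's CliffWalking-v0: 4x12 grid, -1 per step, -100 for stepping into the cliff
# (the agent is sent back to the start without terminating the episode).
env = gym.make("CliffWalking-v0")

Q_sarsa = sarsa(env, num_episodes=500, epsilon=0.1)
Q_qlearn = q_learning(env, num_episodes=500, epsilon=0.1)

# Greedy action per cell (0=up, 1=right, 2=down, 3=left): Q-learning tends to hug the
# cliff edge, while SARSA tends to prefer the longer but safer route along the top.
print(np.argmax(Q_sarsa, axis=1).reshape(4, 12))
print(np.argmax(Q_qlearn, axis=1).reshape(4, 12))
```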
See also: [Is the optimal policy the one with the highest accumulative reward? (Q-learning vs. SARSA)](https://ai.stackexchange.com/questions/34071/is-the-optimal-policy-the-one-with-the-highest-accumulative-reward-q-learning-v)
Summary
In summary, we compare the different algorithms of DP and TD learning:
<p align="center">
<img width="600" alt="image" src="https://user-images.githubusercontent.com/88715406/231224769-83fbe496-7060-458b-9e1f-5a24f7a4e827.png">
<img width="600" alt="image" src="https://user-images.githubusercontent.com/88715406/231224817-74915872-fe18-4c59-a322-dca2900b1d27.png">
</p>
<p align="center">
$\mathbf{Fig\ 11.}$ Relationship Between DP and TD
</p><br>
The notation $$x \overset{\alpha}{\longleftarrow} y$$ means $$x \leftarrow x + \alpha (y - x)$$.
<br><br><br><br>
Reference
[1] Richard S. Sutton and Andrew G. Barto, "Reinforcement Learning: An Introduction", The MIT Press. (mitpress.mit.edu)
[2] "Introduction to Reinforcement Learning" with David Silver, DeepMind. (www.deepmind.com)
[3] Lilian Weng, "A (Long) Peek into Reinforcement Learning". (lilianweng.github.io)
[4] "What is the difference between off-policy and on-policy learning?" (stats.stackexchange.com)
[5] "Are Q-learning and SARSA the same when action selection is greedy?" (ai.stackexchange.com)
[6] Stanford CS234: Reinforcement Learning, Winter 2022. (web.stanford.edu)