[RL] Convergence of DP

Convergence of Dynamic Programming

지난 글에서는 environment의 reward dynamic, transition matrix 등 모든 MDP 정보들이 알려져 있을 때 이 정보를 활용하여 planning할 수 있는 dynamic programming 알고리즘들을 배웠다. Policy iteration, value iteration들의 업데이트 공식은 직관적으로 Bellman equation을 기반으로 이루어졌었는데, 이 글에선 두 알고리즘이 실제로 optimal point로 수렴할 수 있는지를 간단하게 증명해볼 것이다.

Preliminary

먼저 증명을 위해선 몇 가지 사전 지식이 필요하다.

Contraction Mapping Theorem

$\mathbf{Definition.}$ $\boldsymbol{\gamma}$-contraction
An operator $F$ on a normed vector space $\mathcal{X}$ is a $\boldsymbol{\gamma}$-contraction for $0 < \gamma < 1$, if for all $x, y \in \mathcal{X}$,

$$
\begin{aligned}
\| F(x) - F(y) \| \leq \gamma \| x - y \|._\blacksquare
\end{aligned}
$$

한국어로는 축약 사상으로 번역되는 것 같다. 굉장히 직관적인 개념인데, 벡터 공간의 두 점 사이의 거리를 일정 비율 이하로 축소시키는 함수라고 생각하면 된다. 아래 그림에선 $d_1 \leq \gamma \cdot d_0$이 되는거다.

$\mathbf{Fig\ 1.}$ $\gamma$-contraction $T$

이러한 축약 사상에는 직관적이면서도 굉장히 중요한 성질이 존재하는데, contraction mapping을 계속 적용하게 되면 두 점 사이의 거리는 점점 축소되고, 결국에는 유일한 고정점 $x^*$으로 수렴하게 된다. 이를 Contraction Mapping Theorem이라고 한다.

$\mathbf{Fig\ 2.}$ Contraction Mapping Theorem

$\mathbf{Theorem.}$ Contraction Mapping Theorem
For a $\gamma$-contraction $F$ in a complete normed vector space $\mathcal{X}$, <br>

1. $F$ admits an unique fixed point, i.e., $F \mathbf{v} = \mathbf{v}$.
2. Iterative application of $F$ converges to an unique fixed point in $\mathcal{X}$ independent of the initial point, at a linear convergence rate determined by $\gamma._\blacksquare$

먼저 $0 \leq \gamma < 1$임을 기억하자. 다음과 같은 수열 $\{ v_n \}$을 정의하자.

\begin{aligned}
v_{n+1} = F(v_n)
\end{aligned}

즉 $v_n = F^n(v_0) \; (n \geq 1)$이다. 이때 $v_{n+1}$과 $v_n$의 거리에는 다음과 같은 관계가 성립한다.

\begin{aligned}
d(v_{n+1}, v_n) &= d(F^{n+1} (v_0), F^n (v_0)) \\
&\leq \gamma^n d(F (v_0), v_0) \\
&= \gamma^n d(v_1, v_0)
\end{aligned}

$0 \leq \gamma < 1$에 의해,

\begin{aligned}
\underset{n \to \infty}{\lim} d(v_{n+1}, v_n) = \underset{n \to \infty}{\lim} d(F(v_n), v_n) = 0
\end{aligned}

따라서 $v_n$의 극한값은 고정점임을 알 수 있다.

이제 이 고정점의 유일성을 보일 것이다. 서로 다른 2개의 고정점 $v^*, w^*$ 이 극한값으로 존재한다고 가정하자.

\begin{aligned}
0 \leq d(v^*, w^*) = d(F(v^*), F(w^*)) \leq \gamma d(v^*, w^*)
\end{aligned}

따라서 $d(v^*, w^*)$는 반드시 0일 수 밖에 없다. 따라서 $v^* = w^*._\blacksquare$

이 정리는 Banach fixed-point theorem, contractive mapping theorem, 또는 Banach-Caccioppoli theorem으로도 알려져 있다.

Bellman Expectation Backup

이제 value function들의 벡터 공간 $V$를 고려해보자. 그러면 이 벡터 공간은 최대 $\| \mathcal{S} \|$ 차원을 갖고, 공간 속 각 point는 value function을 나타낸다. 이제 우리는 Bellman backup 연산자가 이 함수 공간에서 contraction operator임을 증명할 것이다. 그러면 contraction mapping theorem에 의해 이 backup이 유일한 해로 수렴함을 알 수 있다.

Bellman expectation backup operator $F^\pi$를 다음과 같이 정의하자.

$$
\begin{aligned}
  F^\pi (\mathbf{v}) = r^\pi + \gamma T^\pi \mathbf{v}
\end{aligned}
$$

여기서 $r^\pi = \sum_{a \in \mathcal{A}} \pi (a \| s) \cdot \mathcal{R}_s^a, T^\pi = \sum_{a \in \mathcal{A}} \pi (a \| s) \cdot \mathcal{P}_{ss^\prime}^a$이다. 그리고 이 연산자는 $\gamma$-contraction이 된다

$$
\begin{aligned}
  F^\pi (\mathbf{v}) = r^\pi + \gamma T^\pi \mathbf{v}
\end{aligned}
$$

$$
\begin{aligned}
  \| F^\pi (\mathbf{u}) - F^\pi (\mathbf{v}) \|_\infty & = \| \gamma T^\pi (\mathbf{u} - \mathbf{v}) \|_\infty \\
  & \leq \gamma T^\pi ( \| \mathbf{u} - \mathbf{v} \|_\infty \cdot \mathbf{1} ) \|_\infty \\
  & = \gamma \cdot \| \mathbf{u} - \mathbf{v} \|_\infty \cdot  \| T^\pi  \cdot \mathbf{1} \|_\infty \\
  & = \gamma \cdot \| \mathbf{u} - \mathbf{v} \|_\infty
\end{aligned}
$$

여기서 $\mathbf{1} \equiv  [1, 1, \cdots, 1]^\top$. Then, note that $T^\pi \cdot \mathbf{1} = \mathbf{1}$로 정의된다. 또한 norm의 동치 관계 때문에 꼭 max norm일 필요는 없다. 따라서 Bellman expectation operator $F^\pi$는 유일한 고정점을 갖게 된다.

Convergence of Iterative Policy Evaluation

지금까지 우리는 $F^\pi$가 유일한 고정점으로 수렴한다는 것을 보이기 위해 contraction mapping임을 보였다. 이제는 이 유일한 고정점이 반드시 $v_\pi$임을 보여야 한다.

그러나 이미 우리는 $v_\pi$가 Bellman expectation equation에 의해 Bellman expectation operator $F^\pi$의 고정점임을 알고 있다. 따라서 policy evaluation의 $v_\pi$로의 수렴성을 바나흐 고정점 정리에 의해 결론지을 수 있는 것이다.

Convergence of Policy Iteration

그렇다면 greedy policy improvement를 기반으로 하는 policy iteration도 optimal point $Q^*$와 $\pi^*$로 수렴할 수 있을까? 이 섹션에서는 다음을 보임으로써 policy iteration with greedy policy improvement의 수렴성을 보일 것이다.

Show monotone improvement: $V^{\pi_{k+1}} \geq V^{\pi_k}$
Show that the algorithm have to stop after a finite number of steps for finite action space $\mathcal{A}$
Show the converged policy is the optimal.

먼저, policy improvement가 monotone improvement를 보장함을 증명할 것이다. Deterministic policy $a= \pi (s)$를 고려해보자. 우리는 이 policy를 탐욕적으로 행동함으로써 policy를 향상시킬 수 있었다:

$$
\begin{aligned}
\pi^\prime (s) = \underset{a \in \mathcal{A}}{\text{ argmax }} q_\pi (s, a)
\end{aligned}
$$

그리고 이는 모든 state $s$에 대해 value function을 한 단계 향상시킬 수 있다:

$$
\begin{aligned}
q_\pi (s, \pi^\prime(s)) = \underset{a \in \mathcal{A}}{\text{ max }} q_\pi (s, a) \leq q_\pi (s, \pi (s) = v_\pi (s)
\end{aligned}
$$

따라서 $v_\pi (S_t) \leq q_\pi (S_t, \pi^\prime (S_t))$에 의해 $v_{\pi^\prime} (s) \geq v_\pi (s)$를 보일 수 있다:

$$
\begin{aligned}
  v_\pi (s) & \leq q_\pi (s, \pi^\prime (s)) = \mathbb{E}_{\pi_\prime} [R_{t+1} + \gamma v_\pi (S_{t+1}) | S_t = s ] \\
  & \leq \mathbb{E}_{\pi_\prime} [R_{t+1} + \gamma q_\pi (S_{t+1}, \pi^\prime (S_{t+1}) | S_t = s ] \\
  & = \mathbb{E}_{\pi_\prime} [R_{t+1} + \gamma R_{t+2} + \gamma^2 v_\pi (S_{t+2}) | S_t = s ] \\
  & \leq \mathbb{E}_{\pi_\prime} [R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi (S_{t+2}, \pi^\prime (S_{t+2})) | S_t = s ]  \\
  & \leq \cdots \\
  & \leq \mathbb{E}_{\pi_\prime} [R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots | S_t = s ]  \\
  & = v_{\pi^\prime} (s)._\blacksquare
\end{aligned}
$$

우리는 유한한 action space $\mathcal{A}$ 가정했으므로, 유한한 수의 policy만 존재한다. 그렇기 때문에 알고리즘은 유한한 횟수 이내에 멈춰야 하며, 멈췄을 때 다음과 같은 등식이 성립한다:

$$
\begin{aligned}
q_\pi (s, \pi^\prime (s)) = \underset{a \in \mathcal{A}}{\text{ max }} q_\pi (s, a) = q_\pi (s, \pi (s)) = v_\pi (s)
\end{aligned}
$$

즉 Bellman optimality equation이 성립하는 것을 알 수 있다:

$$
\begin{aligned}
v_\pi (s) = \underset{a \in \mathcal{A}}{\text{ max }} q_\pi (s, a)
\end{aligned}
$$

따라서 $v_\pi (s) = v_* (s)$ for all $s \in \mathcal{S}$임과 수렴한 policy는 optimal policy임을 결론지을 수 있다.

Convergence of Value Iteration

이제 value iteration의 수렴성을 증명해보자. Policy iteration과 매우 흡사한데, 먼저 Bellman optimality backup 연산자 $T^*$를 다음과 같이 정의해보자.

$$
\begin{aligned}
  T^* (\mathbf{v}) = \underset{a \in \mathcal{A}}{\text{ max }} \mathcal{R}^a + \gamma \mathcal{P}^a \mathbf{v}
\end{aligned}
$$

그 후 동일하게, 이 연산자가 $\gamma$-contraction임을 증명할 것이다. 그 전에 먼저 다음과 같은 사실을 기억하자.

$$
\begin{aligned}
  \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{u}] - \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{v}] \leq \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{u} - (\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{v})].
\end{aligned}
$$

(이는 자명한데, 만약 $a^*$이 우변을 최대화하는 action이라면, $\mathcal{R}_s^{a^*} + \gamma \mathcal{P}^{a^*} \mathbf{v}$는 그 최댓값보다 작거나 같을 것이다.)

$$
\begin{aligned}
  \| T^* (\mathbf{v}) - T^* (\mathbf{u}) \|_\infty & = \| \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{u}] - \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{v}] \|_\infty \\
  & \leq \| \underset{a \in \mathcal{A}}{\text{ max }} [\mathcal{R}_s^a + \gamma \mathcal{P}^a \mathbf{u} - \mathcal{R}_s^a - \gamma \mathcal{P}^a \mathbf{v}] \|_\infty \\
  & = \gamma \cdot \| \underset{a \in \mathcal{A}}{\text{ max }} \mathcal{P}^a (\mathbf{u} - \mathbf{v}) \|_\infty \\
  & \leq \gamma \cdot \| \underset{a \in \mathcal{A}}{\text{ max }} \mathcal{P}^a \cdot \mathbf{1} \cdot \| \mathbf{u} - \mathbf{v} \|_\infty \|_\infty \\
  & = \gamma \cdot \| \mathbf{u} - \mathbf{v} \|_\infty \cdot \| \underset{a \in \mathcal{A}}{\text{ max }} \mathcal{P}^a \cdot \mathbf{1} \|_\infty \\
  & = \gamma \cdot \| \mathbf{u} - \mathbf{v} \|_\infty
\end{aligned}
$$

여기서 $\mathbf{1} \equiv  [1, 1, \cdots, 1]^\top$로 정의되니까, $\mathcal{P}^a \cdot \mathbf{1} = \mathbf{1}$이다.

따라서 바나흐 고정점 정리에 의해 Bellman optimality backup operator $T^*$는 유일한 고정점을 갖고, $v_*$는 Bellman optimality equation에 의해 $T^*$의 고정점임을 알 수 있다. 따라서, value iteration (iterative application of Bellman optimality operator)이 optimal value function $v_*$으로 수렴함을 결론지을 수 있다.

Reference

[1] Reinforcement Learning: An Introduction. Richard S. Sutton and Andrew G. Barto, The MIT Press.

Reinforcement Learning

The significantly expanded and updated new edition of a widely used text on reinforcement learning, one of the most active research areas in artificial intel...

mitpress.mit.edu

[2] Wikipedia, Banach fixed-point theorem

[3] CMU, Deep Reinforcement Learning and Control, Katerina Fragkiadaki

https://www.google.com/url?cd=&esrc=s&q=&rct=j&sa=t&source=web&url=https%3A%2F%2Fnlp.jbnu.ac.kr%2FAI%2Fslides_cmu_RL%2Flecture3_mdp_planning.pdf&usg=AOvVaw1HXRDAeRoltHj5ezWpmCtR&ved=2ahUKEwjw1oPLl6H-AhXEpVYBHc2FAFMQFnoECCUQAQ

www.google.com

[4] Contraction mapping

Contraction mapping

Contraction Mapping ModuLabs 강남Dynamics Lab Hancheol Choi (babchol@gmail.com) Definition (𝑋, 𝑑) is a metric space. 𝑇: 𝑋 → 𝑋 is a contraction mapping if ∃...

www.slideshare.net

'AI > Reinforcement Learning' 카테고리의 다른 글

[RL] Model-Free Control (0)	2023.06.26
[RL] Model-Free Prediction (0)	2023.06.24
[RL] Dynamic Programming (0)	2023.06.21
[RL] Markov Decision Process (2)	2023.06.16

ydev

[RL] Convergence of DP

Convergence of Dynamic Programming

Preliminary

Contraction Mapping Theorem

Bellman Expectation Backup

Convergence of Iterative Policy Evaluation

Convergence of Policy Iteration

Convergence of Value Iteration

Reference

'AI > Reinforcement Learning' 카테고리의 다른 글

티스토리툴바

[RL] Convergence of DP

Convergence of Dynamic Programming

Preliminary

Contraction Mapping Theorem

Bellman Expectation Backup

Convergence of Iterative Policy Evaluation

Convergence of Policy Iteration

Convergence of Value Iteration

Reference

'AI > Reinforcement Learning' 카테고리의 다른 글

관련글

티스토리툴바