[Generative model] Generative Adversarial Network (GAN)

Generative Adversarial Networks

Implicit generative models

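So far we have looked at explicit generative models, which explicitly model the form of the probability distribution of the observed data $\mathbf{x}$. In contrast, implicit generative models can generate data without explicitly modeling that distribution. Unlike an explicit generative model, an implicit generative model does not model the data distribution or the likelihood; instead, it defines a stochastic procedure, learned from the observed data, that generates data directly. Such models are often a more natural fit for real-world problems, because real data such as climate and weather measurements usually cannot be modeled by a simple probability distribution. In this post we look at the Generative Adversarial Network (GAN) proposed by Ian Goodfellow et al. Unlike the VAE, GAN is an implicit generative model, and it played a major role in energizing the field of generative modeling.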

Training principles of GAN

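In an implicit model we do not know the form of the data distribution, so we cannot compute the likelihood that statistical machine learning usually relies on. We therefore need a new objective $\mathcal{D}(p^\star, q)$ that measures the similarity between the true distribution $p^\star$ and our model $q$ (which is not specified explicitly, but serves as our estimate of the data distribution), or equivalently, the quality of $q$. The properties we need or want from such an objective are: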

1. The objective should be minimized as $q$ approaches $p^\star$. That is, $\underset{q}{\text{argmin }} \mathcal{D} (p^\star, q) = p^\star$

2. It must be computable from observed data or sampled data alone.

3. It must be tractable, and ideally cheap, to compute.

For example, the KL divergence satisfies the first property: it is non-negative and equals 0 only when the two distributions are exactly the same, which also makes it convenient for optimization.

$$
\begin{align*}
\mathcal{D} (p^\star, q) \geq 0; \quad \quad \mathcal{D} (p^\star, q) = 0 \Leftrightarrow q = p^\star
\end{align*}
$$
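As a small numerical illustration of these two properties, the sketch below computes the KL divergence between discrete distributions. The three-outcome distributions and the helper name `kl_divergence` are made up for this example and are not part of the GAN derivation itself.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical distributions over three outcomes.
p_star = [0.5, 0.3, 0.2]
q_far = [0.1, 0.1, 0.8]
q_near = [0.45, 0.35, 0.2]

kl_far = kl_divergence(p_star, q_far)     # clearly positive
kl_near = kl_divergence(p_star, q_near)   # small but still positive
kl_same = kl_divergence(p_star, p_star)   # zero only when q equals p*
print(kl_far, kl_near, kl_same)
```

The divergence shrinks as $q$ moves toward $p^\star$, but note that evaluating it requires the probabilities of $q$ explicitly, which is exactly what an implicit model does not provide.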

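Many other divergences and metrics, such as the Wasserstein distance, satisfy this condition as well. The problem is the third condition: most of them are impossible or hard to compute without an explicit form of the distribution $q$. GAN's key idea for overcoming this is to introduce a discriminator $D$, a model that compares $p^\star$ and $q$, and to optimize it on the given data and samples so that it compares the two distributions as accurately as possible. The difference between the two distributions revealed by the discriminator can then be used as the objective: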

$$
\begin{align*}
\mathcal{D}(p^\star, q) = \underset{D}{\text{max }} \mathcal{F} (D, p^\star, q)
\end{align*}
$$

Here $\mathcal{F}$ is a functional that depends on $p^\star$ and $q$ only through the given data and samples. If we parametrize the discriminator by $\phi$ and our model $q$ by $\theta$, this becomes:

$$
\begin{align*}
\mathcal{D}_\phi (p^\star, q_\theta) = \underset{\phi}{\text{max }} \mathcal{F} (D_\phi, p^\star, q_\theta)
\end{align*}
$$

To make $\mathcal{F}$ depend only on the given data and samples (the second condition on our objective), one of the simplest options is to use expectations, since an expectation can be approximated by a Monte Carlo estimate, that is, by a sample mean. We can then write $\mathcal{F}(D_\phi, p^\star, q_\theta)$ for some functions $f$ and $g$ as:

$$
\begin{align*}
\mathcal{F}(D_\phi, p^\star, q_\theta) = \mathbb{E}_{p^\star(\mathbf{x})} f(\mathbf{x}, \mathbf{\phi}) + \mathbb{E}_{q_\theta (\mathbf{x})} g(\mathbf{x}, \mathbf{\phi})
\end{align*}
$$
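The point that such a functional needs only samples can be sketched as follows. This is a minimal illustration under assumed settings: the two Gaussians stand in for $p^\star$ and $q_\theta$, the helper `estimate_functional` is hypothetical, and $f(x) = x$, $g(x) = -x$ are arbitrary choices (with no dependence on $\phi$) that turn the functional into a difference of means.

```python
import random

def estimate_functional(f, g, real_samples, model_samples):
    """Monte Carlo estimate of E_{p*}[f(x)] + E_{q}[g(x)] using sample means only."""
    real_term = sum(f(x) for x in real_samples) / len(real_samples)
    model_term = sum(g(x) for x in model_samples) / len(model_samples)
    return real_term + model_term

rng = random.Random(0)
real_samples = [rng.gauss(2.0, 1.0) for _ in range(20000)]   # stand-in for data from p*
model_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # stand-in for samples from q

# Assumed choice f(x) = x, g(x) = -x: the functional is the gap between the two means.
mean_gap = estimate_functional(lambda x: x, lambda x: -x, real_samples, model_samples)
print(mean_gap)  # close to 2.0, the true difference of means
```

Neither density is ever evaluated; only draws from the two distributions are used.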

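The choice of $f$ and $g$ is up to us, and it determines the form of $\mathcal{F}$. In practice, different choices of $f$ and $g$ lead to different GAN losses, such as those of the original GAN and LSGAN (variants such as conditional GAN and CycleGAN differ mainly in their conditioning and architecture rather than in $f$ and $g$).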

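The form of $\mathcal{F}$ above assumes that the model $q$ of the data distribution is parametrized by $\theta$. In an implicit generative model we do not specify $q_\theta$ explicitly. Instead, we introduce a prior distribution $q(\mathbf{z})$ and a latent variable $\mathbf{z} \sim q(\cdot)$ sampled from it, and generate data directly through a generator $\mathbf{x} = G_\theta (\mathbf{z})$ without estimating the data distribution. Note that $G_\theta$ is a model that maps a latent variable $\mathbf{z}$ sampled from the prior to a new data point $\mathbf{x}$; it is not a probability distribution. The latent variable, and in particular a sampled one, is introduced to give the generator randomness, and the prior $q(\mathbf{z})$ itself has nothing to do with the data distribution. If $\mathbf{z}$ were fixed, the generator would always produce the same data point.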

With this, $\mathcal{F}$ can be rewritten as:

$$
\begin{align*}
\mathcal{F}(D_\phi, p^\star, q_\theta) = \mathbb{E}_{p^\star(\mathbf{x})} f(\mathbf{x}, \mathbf{\phi}) + \mathbb{E}_{q (\mathbf{z})} g(\mathbf{x} = G_\theta (\mathbf{z}), \mathbf{\phi}) \text{ where } \mathbf{z} \sim q(\cdot)
\end{align*}
$$
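The sampling procedure $\mathbf{x} = G_\theta(\mathbf{z})$ can be sketched as follows. This is only a toy illustration: the affine generator and its parameters `(shift, scale)` are assumptions made for the example, whereas a real GAN generator is a neural network.

```python
import random

def generator(z, theta):
    """Hypothetical generator G_theta(z) = shift + scale * z; a deterministic map, not a density."""
    shift, scale = theta
    return shift + scale * z

theta = (3.0, 0.5)
rng = random.Random(0)

# Randomness comes only from sampling z from the prior q(z) = N(0, 1).
random_samples = [generator(rng.gauss(0.0, 1.0), theta) for _ in range(5)]

# With z fixed, the generator returns the same data point every time.
fixed_samples = [generator(0.7, theta) for _ in range(5)]
print(random_samples)
print(fixed_samples)
```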

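Now we only need to choose $f$ and $g$. We need a suitable way to compare the two distributions $p^\star$ and $q_\theta$; the simplest one, and the one GAN uses, is the density ratio $r (\mathbf{x}) = \frac{p^\star (\mathbf{x})}{q_\theta(\mathbf{x})}$. This is a reasonable choice because the ratio equals 1 for every $\mathbf{x}$ only when the two distributions are identical. Since the model is implicit, however, we cannot compute $r (\mathbf{x})$ directly, and this is where the discriminator $D$ comes in. If we model the discriminator $D(\mathbf{x}) \in [0, 1]$ as a binary classifier that decides whether a sample came from $p^\star$ or from $q_\theta$, then once it is optimized the ratio can be expressed as: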

$$
\begin{align*}
\frac{p^\star(\mathbf{x})}{q_\theta (\mathbf{x})} = \frac{D_\phi (\mathbf{x})}{1 - D_\phi (\mathbf{x})}
\end{align*}
$$

The closer $\mathbf{x}$ is to real data, i.e. to $p^\star$, the closer $D(\mathbf{x})$ is to 1; the closer it is to generated data, i.e. to $q_\theta$, the closer $D(\mathbf{x})$ is to 0. The discriminator can therefore be optimized with the Binary Cross Entropy (BCE) loss that is commonly used to train binary classifiers. For a random variable $y$ denoting the class,

$$
\begin{align*}
  V(q_\theta, p^\star) &= \underset{\phi}{\text{max }} \mathbb{E}_{p(\mathbf{x} | y) p(y)} [y \text{ log } D_\phi (\mathbf{x}) + (1 - y) \text{ log } (1 - D_\phi (\mathbf{x}))] \\
  &= \underset{\phi}{\text{max }} \left[ p(y=1) \, \mathbb{E}_{p(\mathbf{x} | y=1)} [\text{log } D_\phi (\mathbf{x})] + p(y=0) \, \mathbb{E}_{p(\mathbf{x} | y=0)} [\text{log } (1 - D_\phi (\mathbf{x}))] \right] \\
  &= \underset{\phi}{\text{max }} \frac{1}{2} \left[ \mathbb{E}_{p^\star (\mathbf{x})} [\text{log } D_\phi (\mathbf{x})] + \mathbb{E}_{q_\theta (\mathbf{x})} [\text{log } (1 - D_\phi (\mathbf{x}))] \right]
\end{align*}
$$

We maximize because the negative sign has been dropped from the usual BCE loss. The last line follows from $p(y = 1) = p(y = 0) = \frac{1}{2}$, $p(\mathbf{x} | y = 1) = p^\star (\mathbf{x})$ and $p(\mathbf{x} | y = 0) = q_\theta (\mathbf{x})$.
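The quantity inside the maximization can be estimated from samples alone, as the following sketch shows. The Gaussians used for $p^\star$ and $q_\theta$, the helper `bce_objective`, and the two hand-picked discriminators are all assumptions made for illustration.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def bce_objective(disc, real_samples, fake_samples):
    """Sample estimate of 0.5 * (E_{p*}[log D(x)] + E_{q}[log(1 - D(x))])."""
    real_term = sum(math.log(disc(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - disc(x)) for x in fake_samples) / len(fake_samples)
    return 0.5 * (real_term + fake_term)

rng = random.Random(0)
real_samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # assumed p* = N(0, 1)
fake_samples = [rng.gauss(2.0, 1.0) for _ in range(5000)]  # assumed q_theta = N(2, 1)

def blind_disc(x):
    return 0.5  # cannot tell the two classes apart

def informed_disc(x):
    return sigmoid(2.0 - 2.0 * x)  # hand-picked: treats small x as more likely real

v_blind = bce_objective(blind_disc, real_samples, fake_samples)
v_informed = bce_objective(informed_disc, real_samples, fake_samples)
print(v_blind, v_informed)
```

A discriminator that outputs 0.5 everywhere scores exactly $-\text{log } 2$, and any discriminator that separates the classes better scores higher, which is why the objective is maximized over $\phi$.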

If the discriminator is optimal, i.e. it distinguishes the two distributions as well as possible, it satisfies:

$$
\begin{align*}
\frac{p^\star(\mathbf{x})}{q_\theta (\mathbf{x})} = \frac{D^\star (\mathbf{x})}{1 - D^\star (\mathbf{x})} \Rightarrow D^\star (\mathbf{x}) = \frac{p^\star (\mathbf{x})}{p^\star (\mathbf{x}) + q_\theta (\mathbf{x})}
\end{align*}
$$
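This relation can be checked numerically in a case where both densities are known. The sketch below assumes $p^\star = \mathcal{N}(0, 1)$ and $q_\theta = \mathcal{N}(2, 1)$ purely for illustration; in an actual GAN neither density is available, which is why the discriminator has to be learned.

```python
import math

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def p_star(x):
    return normal_pdf(x, 0.0, 1.0)  # assumed data density

def q_theta(x):
    return normal_pdf(x, 2.0, 1.0)  # assumed model density

def optimal_discriminator(x):
    """D*(x) = p*(x) / (p*(x) + q_theta(x))."""
    return p_star(x) / (p_star(x) + q_theta(x))

for x in (-1.0, 1.0, 3.0):
    d = optimal_discriminator(x)
    # The odds d / (1 - d) reproduce the density ratio p*(x) / q_theta(x).
    print(x, d, d / (1.0 - d), p_star(x) / q_theta(x))
```

At $x = 1$, halfway between the two means, the densities are equal and $D^\star$ outputs 0.5.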

Ideally, we want our discriminator $D_\phi$ to satisfy this condition. To see what such a discriminator gives us, substitute $D^\star$ into the BCE objective:

$$
\begin{align*}
  V^\star (q_\theta, p^\star) &= \frac{1}{2} \mathbb{E}_{p^\star (\mathbf{x})} [\text{log } \frac{p^\star (\mathbf{x})}{p^\star (\mathbf{x}) + q_\theta (\mathbf{x})}] + \frac{1}{2} \mathbb{E}_{q_\theta (\mathbf{x})} [\text{log } (1 - \frac{p^\star (\mathbf{x})}{p^\star (\mathbf{x}) + q_\theta (\mathbf{x})})] \\
  &= \frac{1}{2} \mathbb{E}_{p^\star (\mathbf{x})} [\text{log } \frac{p^\star (\mathbf{x})}{\frac{p^\star (\mathbf{x}) + q_\theta (\mathbf{x})}{2}}] + \frac{1}{2} \mathbb{E}_{q_\theta (\mathbf{x})} [\text{log } (\frac{q_\theta (\mathbf{x})}{\frac{p^\star (\mathbf{x}) + q_\theta (\mathbf{x})}{2}})] - \text{log } 2 \\
  &= \frac{1}{2} KL(p^\star \, \| \, \frac{p^\star + q_\theta}{2}) + \frac{1}{2} KL(q_\theta \, \| \, \frac{p^\star + q_\theta}{2}) - \text{log } 2 \\
  &= JSD(p^\star, q_\theta) - \text{log } 2
\end{align*}
$$
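The identity $V^\star = JSD - \text{log } 2$ can be verified numerically on discrete distributions, where every expectation is a finite sum. The distributions and helper names below are assumptions chosen for the check.

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def optimal_bce_value(p, q):
    """V* obtained by plugging D*(x) = p(x) / (p(x) + q(x)) into the BCE objective."""
    real_term = sum(pi * math.log(pi / (pi + qi)) for pi, qi in zip(p, q))
    fake_term = sum(qi * math.log(qi / (pi + qi)) for pi, qi in zip(p, q))
    return 0.5 * real_term + 0.5 * fake_term

# Hypothetical distributions over three outcomes.
p_star = [0.5, 0.3, 0.2]
q_theta = [0.1, 0.1, 0.8]

v_star = optimal_bce_value(p_star, q_theta)
print(v_star, jsd(p_star, q_theta) - math.log(2.0))  # the two values agree
```

When $q_\theta = p^\star$ the JSD vanishes and $V^\star$ attains its minimum value $-\text{log } 2$.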

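In other words, with an optimal discriminator our objective reduces to the Jensen-Shannon divergence (JSD) up to a constant, so using the BCE loss lets us compare the two distributions properly. Because the JSD gets smaller as the two distributions get closer, optimizing the model parameters $\theta$ amounts to finding the $\theta$ that minimizes the JSD: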

$$
\begin{align*}
\underset{\mathbf{\theta}}{\text{arg min }} JSD(p^\star, q_\theta) &= \underset{\mathbf{\theta}}{\text{arg min }} \frac{1}{2} \left[ \mathbb{E}_{p^\star (\mathbf{x})} [\text{log } D^\star (\mathbf{x})] + \mathbb{E}_{q_\theta (\mathbf{x})} [\text{log } (1 - D^\star (\mathbf{x}))] \right]
\end{align*}
$$

However, we do not know the optimal discriminator $D^\star$. Replacing $D^\star$ with the discriminator $D_\phi$ optimized over $\phi$ gives:

$$
\begin{align*}
\underset{\mathbf{\theta}}{\text{arg min }} JSD(p^\star, q_\theta) &= \underset{\mathbf{\theta}}{\text{arg min }} \underset{\mathbf{\phi}}{\text{max }} \frac{1}{2} \left[ \mathbb{E}_{p^\star (\mathbf{x})} [\text{log } D_\mathbf{\phi} (\mathbf{x})] + \mathbb{E}_{q_\theta (\mathbf{x})} [\text{log } (1 - D_\mathbf{\phi} (\mathbf{x}))] \right]
\end{align*}
$$

Finally, the expression above is written in terms of the explicit model distribution $q_\mathbf{\theta}$, so rewriting it in the implicit form introduced earlier gives the final GAN objective:

$$
\begin{align*}
\underset{\mathbf{\theta}}{\text{arg min }} JSD(p^\star, q_\theta) &= \underset{\mathbf{\theta}}{\text{arg min }} \underset{\mathbf{\phi}}{\text{max }} \frac{1}{2} \mathbb{E}_{p^\star (\mathbf{x})} [\text{log } D_\mathbf{\phi} (\mathbf{x})] + \frac{1}{2} \mathbb{E}_{q (\mathbf{z})} [\text{log } (1 - D_\mathbf{\phi} (\mathbf{x} = G_\mathbf{\theta} (\mathbf{z}) ))]
\end{align*}
$$

The derivation above used the BCE loss (Bernoulli log-loss), but many other losses can be used as well.

Generative Adversarial Network

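We have now seen the training principle of GAN, which makes its mechanics and architecture easy to understand. At its core, GAN trains a discriminator, a binary classifier, to approximate a divergence between the model distribution and the true data distribution, and trains the generator by minimizing that approximation. Let us now look more closely at how an actual GAN is put together. The full architecture of GAN is as follows: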

$\mathbf{Fig\ 2.}$ Full architecture of GAN

Loss functions

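Training of the original GAN, and of most GAN variants, takes the form of a min-max game, because it improves the approximation of the divergence between the two distributions while simultaneously minimizing that approximate divergence. The original GAN approximates the Jensen-Shannon divergence (JSD) by using the BCE loss.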

$$
\begin{align*}
\underset{\theta}{\text{min }} \underset{\phi}{\text{max }} V(\mathbf{\phi}, \mathbf{\theta})
\end{align*}
$$

where $V(\mathbf{\phi}, \mathbf{\theta})$ is

$$
\begin{align*}
V(\mathbf{\phi}, \mathbf{\theta}) = \frac{1}{2} \mathbb{E}_{p^\star (\mathbf{x})} [\text{log } D_\mathbf{\phi} (\mathbf{x})] + \frac{1}{2} \mathbb{E}_{q (\mathbf{z})} [\text{log } (1 - D_\mathbf{\phi} (\mathbf{x} = G_\mathbf{\theta} (\mathbf{z}) ))]
\end{align*}
$$

However, such a zero-sum game pits the two players against each other: the discriminator is optimized to separate real data from generated samples as well as possible, while the generator is optimized to produce samples that are hard to separate from real data. In fact, early in training the generator is still poor, so the discriminator learns to tell its samples apart without much training. At this stage $D(G_\mathbf{\theta} (\mathbf{z}))$ is driven very close to $0$, which causes vanishing gradients for $G_\mathbf{\theta}$ and stalls the generator's training. The figure below plots the generator loss as a function of $D(G_\mathbf{\theta} (\mathbf{z}))$: the slope becomes very flat as $D(G_\mathbf{\theta} (\mathbf{z}))$ approaches $0$.

$\mathbf{Fig\ 3.}$ Plot of $D(G(z))$ vs. generator loss

Therefore, in practice the original GAN replaces the generator loss $\mathbb{E}_{q (\mathbf{z})} [\text{log } (1 - D_\mathbf{\phi} (G_\mathbf{\theta} (\mathbf{z})))]$ with a more intuitive one. The new loss is not mathematically identical, but it pushes the generator in the same direction: $\mathbb{E}_{q (\mathbf{z})} [- \text{log } D_\mathbf{\phi} (G_\mathbf{\theta} (\mathbf{z}))]$ drives the discriminator's predicted probability on generated data toward 1, that is, it makes the fakes hard for the discriminator to tell apart.

$$
\begin{align*}
\underset{\theta}{\text{min }} \mathbb{E}_{q (\mathbf{z})} [- \text{log } D_\mathbf{\phi} (G_\mathbf{\theta} (\mathbf{z}))] \quad \text{instead of} \quad \underset{\theta}{\text{min }} \mathbb{E}_{q (\mathbf{z})} [\text{log } (1 - D_\mathbf{\phi} (G_\mathbf{\theta} (\mathbf{z})))]
\end{align*}
$$

The modified loss allows more stable training in the early phase. When the generator is still poor and the discriminator easily separates real from fake, this non-saturating loss keeps the gradient from going to 0.

$\mathbf{Fig\ 4.}$ Generator loss and gradient of original loss and non-saturating loss
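The difference between the two generator losses can be made concrete by differentiating them with respect to the discriminator's logit. The sketch below assumes, as is common but not stated in this post, that the discriminator output is a sigmoid of a logit $a$, so that $D = \sigma(a)$; the function names are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(logit):
    """d/da log(1 - sigmoid(a)) = -sigmoid(a): vanishes when D(G(z)) is near 0."""
    return -sigmoid(logit)

def grad_non_saturating(logit):
    """d/da [-log sigmoid(a)] = -(1 - sigmoid(a)): stays near -1 when D(G(z)) is near 0."""
    return -(1.0 - sigmoid(logit))

# Very negative logits correspond to a discriminator that confidently rejects the fake.
for logit in (-8.0, -4.0, 0.0):
    print(logit, sigmoid(logit), grad_saturating(logit), grad_non_saturating(logit))
```

When the discriminator confidently rejects a generated sample, the original loss passes back almost no signal, while the non-saturating loss still provides a gradient of magnitude close to 1.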

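Finally, the objectives used to train the discriminator and the generator are as follows. Rather than optimizing the min-max objective jointly, the generator is optimized separately with the non-saturating loss: the discriminator is updated to maximize $L_D$, and the generator to minimize $L_G$.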

$$
\begin{align*}
&L_D (\mathbf{\phi}, \mathbf{\theta}) = \mathbb{E}_{p^\star (\mathbf{x})} [\text{log } D_\mathbf{\phi} (\mathbf{x})] + \mathbb{E}_{q (\mathbf{z})} [\text{log } (1 - D_\mathbf{\phi} ( G_\mathbf{\theta} (\mathbf{z}) ))] \\
&L_G (\mathbf{\phi}, \mathbf{\theta}) = \mathbb{E}_{q (\mathbf{z})} [- \text{log } D_\mathbf{\phi} (G_\mathbf{\theta} (\mathbf{z}))]
\end{align*}
$$
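Given the discriminator's outputs on a batch of real and generated data, both quantities are simple sample means. The sketch below uses made-up output values and hypothetical function names.

```python
import math

def discriminator_objective(d_real, d_fake):
    """L_D: mean log D(x) on real data plus mean log(1 - D(G(z))) on generated data."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating L_G: mean of -log D(G(z)) over generated data."""
    return sum(-math.log(d) for d in d_fake) / len(d_fake)

# Made-up discriminator outputs: confident on real data, dismissive of generated data.
d_real = [0.9, 0.8]
d_fake = [0.1, 0.2]

l_d = discriminator_objective(d_real, d_fake)
l_g = generator_loss(d_fake)
print(l_d, l_g)
```

With these values $L_D$ is close to its upper bound of 0 while $L_G$ is large, which is the situation in which the generator receives a strong learning signal.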

Gradient Descent

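GAN is trained with gradient-based updates like any other deep learning algorithm, but since the discriminator and the generator must be trained together, the procedure is slightly different. GAN updates the discriminator first, and the reason is clear once we recall why the discriminator was introduced: training the generator on the approximate divergence is only meaningful if the discriminator approximates the divergence $\mathcal{D}(p^\star, q)$ between the two distributions reasonably well.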

$\mathbf{Fig\ 5.}$ Pseudocode for general GAN training algorithm

For gradient-based updates we need the gradient of each loss. First, the gradient for the discriminator $D_\phi$ is:

$$
\begin{align*}
\nabla_\phi L_D (\mathbf{\phi}, \mathbf{\theta}) = \mathbb{E}_{p^\star (\mathbf{x})} [\nabla_\phi \text{ log } D_\mathbf{\phi} (\mathbf{x})] + \mathbb{E}_{q (\mathbf{z})} [\nabla_\phi \text{ log } (1 - D_\mathbf{\phi} ( G_\mathbf{\theta} (\mathbf{z}) ))]
\end{align*}
$$

For the generator $G_\theta$, the gradient with respect to $\theta$ is computed with the reparametrization trick, as in the VAE. Because the expectation is taken over the prior $q(\mathbf{z})$, which does not depend on $\theta$, the gradient and the expectation can be exchanged, and the result can be estimated by Monte Carlo:

$$
\begin{align*}
\nabla_\theta L_G (\mathbf{\phi}, \mathbf{\theta}) = \mathbb{E}_{q (\mathbf{z})} [- \nabla_\theta \text{ log } D_\mathbf{\phi} (G_\mathbf{\theta} (\mathbf{z}))]
\end{align*}
$$
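The following sketch checks this gradient on a one-dimensional toy problem. Everything concrete here is an assumption for illustration: the generator $G_\theta(z) = \theta + z$, the fixed logistic discriminator $D(x) = \sigma(Wx + B)$ and its parameter values. The pathwise (reparametrized) gradient is compared against a central finite difference computed on the same latent samples.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

W, B = -1.5, 1.0  # fixed, hypothetical discriminator parameters: D(x) = sigmoid(W * x + B)

def generator(z, theta):
    return theta + z  # hypothetical generator G_theta(z) = theta + z

def generator_loss(theta, zs):
    """Monte Carlo estimate of E_{q(z)}[-log D(G_theta(z))]."""
    return sum(-math.log(sigmoid(W * generator(z, theta) + B)) for z in zs) / len(zs)

def pathwise_gradient(theta, zs):
    """Mean over z of d/dtheta [-log D(G_theta(z))] = -(1 - D(G_theta(z))) * W."""
    return sum(-(1.0 - sigmoid(W * generator(z, theta) + B)) * W for z in zs) / len(zs)

rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # samples from the prior q(z) = N(0, 1)

theta, eps = 2.0, 1e-4
grad = pathwise_gradient(theta, zs)
finite_diff = (generator_loss(theta + eps, zs) - generator_loss(theta - eps, zs)) / (2 * eps)
print(grad, finite_diff)
```

Because the samples of $\mathbf{z}$ do not depend on $\theta$, differentiating inside the sample mean gives the same result as differentiating the estimated loss itself.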

The pseudocode below shows the full training algorithm of a general GAN. Besides the original GAN, many variants such as the Wasserstein GAN have losses of the following form:

$$
\begin{align*}
& L_D(\boldsymbol{\phi}, \boldsymbol{\theta})=\mathbb{E}_{p^*(\boldsymbol{x})} g\left(D_{\boldsymbol{\phi}}(\boldsymbol{x})\right)+\mathbb{E}_{q_\theta(\boldsymbol{x})} h\left(D_{\boldsymbol{\phi}}(\boldsymbol{x})\right)=\mathbb{E}_{p^*(\boldsymbol{x})} g\left(D_{\boldsymbol{\phi}}(\boldsymbol{x})\right)+\mathbb{E}_{q(\boldsymbol{z})} h\left(D_{\boldsymbol{\phi}}\left(G_{\boldsymbol{\theta}}(\boldsymbol{z})\right)\right) \\
& L_G(\boldsymbol{\phi}, \boldsymbol{\theta})=\mathbb{E}_{q_\theta(\boldsymbol{x})} l\left(D_{\boldsymbol{\phi}}(\boldsymbol{x})\right)=\mathbb{E}_{q(\boldsymbol{z})} l\left(D_{\boldsymbol{\phi}}\left(G_{\boldsymbol{\theta}}(\boldsymbol{z})\right)\right)
\end{align*}
$$

For losses of this form, the training algorithm is shown in the figure below. Here both $L_D$ and $L_G$ are written as losses to be minimized, so for the original GAN this $L_D$ is the negative of the discriminator objective given earlier. For example, the original GAN uses $g(t) = - \text{ log } t$, $h(t) = - \text{ log } (1-t)$ and $l(t) = \text{ log } (1-t)$, and with the non-saturating loss $l(t) = - \text { log } t$. The Wasserstein GAN uses $g(t) = l(t) = t$ and $h(t) = -t$.

$\mathbf{Fig\ 6.}$ Full training algorithm of general GAN
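To see the alternating updates in action, here is a minimal one-dimensional sketch of the training loop with the original GAN discriminator objective and the non-saturating generator loss. It is a toy under stated assumptions, not the algorithm in the figure verbatim: the data is assumed to be $\mathcal{N}(0, 1)$, the generator is $G_\theta(z) = \theta + z$, the discriminator is the logistic model $D_\phi(x) = \sigma(wx + b)$, gradients are derived by hand, and the function name and hyperparameters are arbitrary choices.

```python
import math
import random

def sigmoid(a):
    # Numerically stable logistic function.
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def train_toy_gan(steps=3000, batch=64, lr_d=0.05, lr_g=0.05, seed=0):
    rng = random.Random(seed)
    theta = 2.0      # generator parameter, G_theta(z) = theta + z (starts away from the data)
    w, b = 0.0, 0.0  # discriminator parameters, D_phi(x) = sigmoid(w * x + b)
    for _ in range(steps):
        # Discriminator step: gradient ascent on E[log D(x)] + E[log(1 - D(G(z)))].
        real = [rng.gauss(0.0, 1.0) for _ in range(batch)]          # assumed data distribution
        fake = [theta + rng.gauss(0.0, 1.0) for _ in range(batch)]  # generated samples
        grad_w = grad_b = 0.0
        for x in real:
            d = sigmoid(w * x + b)
            grad_w += (1.0 - d) * x   # d/dw log D(x)
            grad_b += (1.0 - d)       # d/db log D(x)
        for x in fake:
            d = sigmoid(w * x + b)
            grad_w -= d * x           # d/dw log(1 - D(x))
            grad_b -= d               # d/db log(1 - D(x))
        w += lr_d * grad_w / batch
        b += lr_d * grad_b / batch

        # Generator step: gradient descent on the non-saturating loss E[-log D(G(z))].
        grad_theta = 0.0
        for _ in range(batch):
            z = rng.gauss(0.0, 1.0)
            d = sigmoid(w * (theta + z) + b)
            grad_theta += -(1.0 - d) * w  # pathwise gradient through G_theta(z)
        theta -= lr_g * grad_theta / batch
    return theta, w, b

theta_final, w_final, b_final = train_toy_gan()
print(theta_final, w_final, b_final)
```

In this toy setting the generator's shift should drift from its starting value of 2 toward the data mean of 0, at which point the discriminator can no longer separate the two distributions and its weight `w` shrinks toward 0. Real GANs with neural networks are far less well behaved.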

Mode collapse and hopping

GAN training is known to often suffer from **mode collapse**: the generator fits an incomplete distribution that fails to cover all the modes (peaks) of the data distribution.

It may also show **mode hopping**, where the generator "hops" between different modes during training because of how the GAN mechanism works. Put simply, when the discriminator becomes very good at recognizing fake data around one mode, the generator cannot easily produce better samples there to beat it, so it gives up on that mode and jumps to another one.


<img width="600" alt="image" src="https://user-images.githubusercontent.com/88715406/221357537-8ca1699c-dc3f-4e55-89fc-1dfdbe0f3bb6.png">


Data is a mixture of Gaussians with 16 modes, and the model shows mode collapse and mode hopping.
 

However, numerous advancements have made GAN training more stable and these behaviors less common. Large batch sizes, regularizations, and more sophisticated optimization techniques are some of these examples.
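One simple way to quantify mode collapse on a synthetic mixture like the one above is to count how many modes receive at least one generated sample. The sketch below is a hypothetical diagnostic, not something taken from the figure: the 4 x 4 grid layout, the spacing, the radius, and the hand-made "healthy" and "collapsed" sample sets are all assumptions.

```python
import random

def covered_modes(samples, modes, radius=0.5):
    """Count the modes that have at least one sample within `radius` (Euclidean distance)."""
    covered = set()
    for sx, sy in samples:
        for index, (mx, my) in enumerate(modes):
            if (sx - mx) ** 2 + (sy - my) ** 2 <= radius ** 2:
                covered.add(index)
    return len(covered)

# 16 mode centers on a 4 x 4 grid (the spacing of 2.0 is an assumption).
modes = [(2.0 * i, 2.0 * j) for i in range(4) for j in range(4)]
rng = random.Random(0)

def sample_around(centers, n_per_center, std=0.05):
    """Draw points tightly clustered around each given center."""
    return [(cx + rng.gauss(0.0, std), cy + rng.gauss(0.0, std))
            for cx, cy in centers for _ in range(n_per_center)]

healthy = sample_around(modes, 20)         # stand-in for a generator covering every mode
collapsed = sample_around(modes[:2], 160)  # stand-in for a generator stuck on two modes

print(covered_modes(healthy, modes), covered_modes(collapsed, modes))
```

Tracking this count over training steps would also reveal mode hopping, since the set of covered modes would change while its size stays small.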

Summary

Beyond the Bernoulli scoring rule used above, other scoring rules have been used to train generative models via min-max optimization. The Brier scoring rule, which under discriminator optimality conditions can be shown to correspond to minimizing the Pearson $\chi^2$ divergence via arguments similar to the ones shown above, has led to LS-GAN [Mao+17]. The hinge scoring rule has become popular [Miy+18b; BDS18], and under discriminator optimality conditions corresponds to minimizing the total variation distance [NWJ+09].

The connection between proper scoring rules and distributional divergences allows the construction of convergence guarantees for the learning criteria above, under infinite capacity of the discriminator and generator: since the minimizer of distributional divergence is the true data distribution (Equation 26.3), if the discriminator is optimal and the generator has enough capacity, it will learn the data distribution. In practice however, this assumption will not hold, as discriminators are rarely optimal; we will discuss this at length in Section 26.3.

Reference

[1] Stanford CS231n, Deep Learning for Computer Vision

cs231n.stanford.edu

[2] Kevin P. Murphy, Probabilistic Machine Learning: An introduction, MIT Press 2022.

https://probml.github.io/pml-book/book1.html
