Intro

Diffusion models are unsupervised probabilistic generative models inspired by non-equilibrium thermodynamics. They have two major processes:

  1. Forward diffusion process: noise is added to the data step by step until the result is indistinguishable from random noise.
  2. Reverse diffusion process: a model learns to undo the forward diffusion, reconstructing the original data from the noise produced by the forward process.

(Figure: generative models overview.)

Forward diffusion process

Given an image, which we call $\mathbf{x}_0$, the forward diffusion process samples a series of Gaussian noises and adds them to the image over $T$ steps, producing a sequence of noisy samples $\mathbf{x}_1, \dots, \mathbf{x}_T$. Eventually, when $T \to \infty$, $\mathbf{x}_T$ is equivalent to a sample from an isotropic Gaussian distribution, which will be proved later.

Let’s assume each step can be written like this:

$$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}) \tag{1}$$

Each step is controlled by a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^{T}$.

In this equation, we can tell $\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}$ is the mean and $\beta_t \mathbf{I}$ is the variance. Since the scaling factor $\sqrt{1-\beta_t}$ is smaller than 1, each step shrinks the value of each pixel in the image and then adds noise with variance equal to $\beta_t$.
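A single forward step can be sketched with the re-parameterized form of this Gaussian (a minimal NumPy sketch; `forward_step` is a hypothetical helper name, not from the original post):

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    # One forward diffusion step:
    # q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise
```

Applying this step repeatedly shrinks the signal while accumulating noise.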

At the beginning, even small noise has a considerable impact, so we can make $\beta_t$ small to keep $\mathbf{x}_t$ close to $\mathbf{x}_{t-1}$. As $t$ grows, we need a bigger $\beta_t$ to make sure the changes can still be told apart. So we can set $\beta_1 < \beta_2 < \dots < \beta_T$, for example $\beta_1 = 10^{-4}$ and $\beta_T = 0.02$ as in the DDPM paper.
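An increasing schedule like this is often just a linear ramp (a sketch, assuming the linear DDPM-style schedule from $10^{-4}$ to $0.02$; the function name is hypothetical):

```python
import numpy as np

def linear_beta_schedule(T, beta_1=1e-4, beta_T=0.02):
    # Linearly increasing variances: tiny steps early on,
    # larger steps later in the forward process.
    return np.linspace(beta_1, beta_T, T)

betas = linear_beta_schedule(1000)
```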

It’s not difficult to see that $\mathbf{x}_t$ only depends on $\mathbf{x}_{t-1}$; therefore the forward diffusion process is a Markov process.

Re-parameterization

Before diving into the calculation, it’s better to introduce the re-parameterization trick, which will be used throughout the calculation.

Let’s assume a random variable $x \sim \mathcal{N}(\mu, \sigma^2)$ and let $\epsilon \sim \mathcal{N}(0, 1)$; then $x$ can be re-parameterized as $x = \mu + \sigma \epsilon$.

Re-parameterization transfers the randomness from $x$ to $\epsilon$, which not only simplifies the calculation but also makes the reverse diffusion process differentiable.
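The trick can be checked numerically: sampling $\epsilon \sim \mathcal{N}(0,1)$ and computing $\mu + \sigma\epsilon$ reproduces the statistics of $\mathcal{N}(\mu, \sigma^2)$, while $\mu$ and $\sigma$ now enter through a deterministic, differentiable expression:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 2.0

# Instead of sampling x ~ N(mu, sigma^2) directly, sample eps ~ N(0, 1)
# and set x = mu + sigma * eps; gradients with respect to mu and sigma
# can flow through this deterministic function of eps.
eps = rng.standard_normal(100_000)
x = mu + sigma * eps
```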

Proof

Let $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$.

Re-parameterizing equation (1), we get:

$$\mathbf{x}_t = \sqrt{\alpha_t}\,\mathbf{x}_{t-1} + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon}_{t-1}, \quad \boldsymbol{\epsilon}_{t-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Expanding $\mathbf{x}_{t-1}$ recursively and merging the Gaussians at each step (the sum of two independent zero-mean Gaussians with variances $\alpha_t(1-\alpha_{t-1})$ and $(1-\alpha_t)$ is a zero-mean Gaussian with variance $(1-\alpha_t\alpha_{t-1})$), we get:

$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \tag{2}$$

When $T \to \infty$, $\bar{\alpha}_T \to 0$, so we get:

$$\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

which proves that $\mathbf{x}_T$ is equivalent to a sample from an isotropic Gaussian.
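The closed form above can be sanity-checked numerically (a sketch, assuming a linear schedule $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$; `q_sample` is a hypothetical helper name):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)     # alpha_bar_t = prod of alpha_1..alpha_t

def q_sample(x0, t, rng):
    # Sample x_t directly from x_0 via the closed form:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
```

Since `alpha_bars[-1]` is vanishingly small for this schedule, `q_sample(x0, T - 1, rng)` is statistically indistinguishable from a standard Gaussian regardless of `x0`.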

Reverse diffusion process

If we can reverse the forward diffusion process, we will be able to recreate the original image from Gaussian noise. But we cannot easily estimate $q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$ because doing so would require the entire dataset. Therefore, we need to learn a model $p_{\theta}$ to approximate these conditional probabilities in order to run the reverse diffusion process.

Now the problem is how to learn the model $p_{\theta}(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$.

It is noteworthy that the reverse conditional probability is tractable when conditioned on $\mathbf{x}_0$:

$$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$$

Using Bayes’ rule, we have:

$$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0)\,\frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_0)}{q(\mathbf{x}_t \vert \mathbf{x}_0)}$$

From the forward diffusion process, we can get:

$$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\alpha_t}\,\mathbf{x}_{t-1}, (1-\alpha_t)\mathbf{I})$$
$$q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}}\,\mathbf{x}_0, (1-\bar{\alpha}_{t-1})\mathbf{I})$$
$$q(\mathbf{x}_t \vert \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$$

Combining the results above, and because we are only concerned with $\mathbf{x}_{t-1}$, we have:

$$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \propto \exp\Big(-\frac{1}{2}\Big(\big(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\big)\mathbf{x}_{t-1}^2 - \big(\frac{2\sqrt{\alpha_t}}{\beta_t}\mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}\mathbf{x}_0\big)\mathbf{x}_{t-1} + C(\mathbf{x}_t, \mathbf{x}_0)\Big)\Big)$$

where $C(\mathbf{x}_t, \mathbf{x}_0)$ is some function not involving $\mathbf{x}_{t-1}$, and details are omitted. Following the standard Gaussian density function, the mean and variance can be parameterized as follows:

$$\tilde{\beta}_t = 1\Big/\Big(\frac{\alpha_t}{\beta_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\Big) = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$
$$\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0$$

From (2) we can represent $\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}\big(\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_t\big)$; plugging it into the above equation, we obtain:

$$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}}\Big(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\boldsymbol{\epsilon}_t\Big)$$
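The equivalence of the two forms of the posterior mean can be verified numerically (a sketch; the linear schedule and both function names are assumptions for illustration):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)  # assumed linear schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def posterior_mean_from_x0(x_t, x0, t):
    # mu_tilde = sqrt(alpha_t)(1 - abar_{t-1})/(1 - abar_t) * x_t
    #          + sqrt(abar_{t-1}) * beta_t / (1 - abar_t) * x0
    abar_prev = alpha_bars[t - 1]
    return (np.sqrt(alphas[t]) * (1 - abar_prev) * x_t
            + np.sqrt(abar_prev) * betas[t] * x0) / (1 - alpha_bars[t])

def posterior_mean_from_eps(x_t, eps, t):
    # Same mean after substituting x0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)
    return (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
```

Feeding both functions a consistent triple $(\mathbf{x}_t, \mathbf{x}_0, \boldsymbol{\epsilon}_t)$ generated by the closed form (2) returns identical means, confirming the substitution.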

References

  1. Lilian Weng, “What are Diffusion Models?”, Lil’Log (lilianweng.github.io)
  2. 扩散模型/Diffusion Model原理讲解 (“Diffusion model principles explained”, in Chinese)