Intro

The goal of a diffusion model is to generate new images: if you have a lot of images and want more like them, you can use the images you have as training data, and a diffusion model, which is a neural network, serves as the generator.

In order to make images useful to a neural network, we need to teach the network what these images are about, including fine details, general outlines, and everything in between. One way to do that is to add different levels of noise to the training data: at low noise levels the fine details stand out, while at high noise levels only the general outlines survive.

The way to train a diffusion model is to make it learn to remove the noise we added in the noising process. Notably, during training we need to feed it normally distributed data, meaning each pixel of the image is sampled from a normal distribution, also known as a Gaussian distribution. So when we ask the generator for a new image, we can sample noise from the normal distribution and get a completely new image by using the network to remove that noise.
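The noising step can be sketched in a few lines. This is a minimal illustration, assuming the common "mix signal and Gaussian noise" formulation; the mixing coefficient `ab_t` (the fraction of original signal that survives at a given timestep) is an illustrative value, not a specific schedule:

```python
import numpy as np

rng = np.random.default_rng(0)

image = rng.random((16, 16))           # stand-in for a training image
noise = rng.standard_normal((16, 16))  # each pixel sampled from a Gaussian

ab_t = 0.5  # assumed: how much of the original signal survives at this timestep
noisy_image = np.sqrt(ab_t) * image + np.sqrt(1.0 - ab_t) * noise
```

As `ab_t` shrinks toward 0, the result approaches pure Gaussian noise, which is exactly what the sampler starts from.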

Sampling

Before talking about how to train a neural network, let’s talk about sampling, or what we do with the neural network after it’s trained.

After the NN is trained, we feed it a noise sample, the NN predicts the noise, and we subtract the predicted noise to get something closer to the image we want. This cannot fully remove all the noise in one step, so we need multiple steps to get high-quality samples.

The algorithm is shown below.

sample = random_sample
for t = T, ..., 1 do:
    extra_noise = random_sample if t > 1, else extra_noise = 0
    predicted_noise = trained_nn(sample, t)
    s1, s2, s3 = ddpm_scaling(t)
    sample = s1 * (sample - s2 * predicted_noise) + s3 * extra_noise

First, we draw a random sample; that is the original noise we have at the beginning.

Then we step through time backward, from the last timestep to the first, because this is the reverse diffusion process.

Then we sample an extra noise term, which we will discuss later.

After that, we feed the original noise sample (or the sample from the previous step) into the trained neural network to get the predicted noise.

Finally, we use a sampling algorithm called DDPM (Denoising Diffusion Probabilistic Models) to get a set of scaling factors, which we use to subtract the predicted noise from the current sample and add the extra noise.

By repeating the steps above, we can progressively remove noise from a randomly sampled starting point and get the image we want.

But the NN expects a normally distributed noisy sample as input, and once we have denoised the sample, it is no longer distributed that way. Therefore, after each step and before the next, we add extra noise scaled based on the timestep. Empirically, this stabilizes the NN so it doesn't collapse to something close to the average of the dataset.
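The whole sampling loop can be run end to end with stand-ins for the trained parts. In this sketch the linear beta schedule and the zero-predicting `trained_nn` are placeholder assumptions so the loop executes; a real model replaces both:

```python
import numpy as np

T = 50
beta = np.linspace(1e-4, 0.02, T + 1)  # assumed linear noise schedule
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)

def ddpm_scaling(t):
    s1 = 1.0 / np.sqrt(alpha[t])                # rescales the sample
    s2 = beta[t] / np.sqrt(1.0 - alpha_bar[t])  # weight on the predicted noise
    s3 = np.sqrt(beta[t])                       # scale of the extra noise
    return s1, s2, s3

def trained_nn(x, t):
    return np.zeros_like(x)  # placeholder: a real network predicts the noise

rng = np.random.default_rng(0)
sample = rng.standard_normal((16, 16))  # start from pure Gaussian noise
for t in range(T, 0, -1):
    # extra noise keeps the input to the next step normally distributed;
    # it is skipped on the final step so the output stays clean
    extra_noise = rng.standard_normal(sample.shape) if t > 1 else 0.0
    predicted_noise = trained_nn(sample, t)
    s1, s2, s3 = ddpm_scaling(t)
    sample = s1 * (sample - s2 * predicted_noise) + s3 * extra_noise
```

Note how `extra_noise` is zeroed only at `t == 1`, matching the algorithm above.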

Neural Network

The neural network architecture we use for diffusion models is a UNet.

Model Structure

The most important property of a UNet is that it takes an image as input and outputs an image of the same size. It first embeds information about the input into an embedding that compresses all the information into a smaller space, downsampling through a series of convolutional layers. It then upsamples with the same number of upsampling blocks until it produces the output.

An advantage of the UNet is that it can take in additional information. For diffusion models, the time embedding is information we can never ignore: it tells the model what noise level we are at. All we have to do for the time embedding is embed the timestep into a vector and add it to the upsampling blocks. We can also use a context embedding to control the result by multiplying it into the upsampling blocks.
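The injection pattern itself is tiny. This sketch shows the "multiply context, add time" combination described above on a dummy activation tensor; the shapes (an 8x8 feature map with 64 channels, 64-dimensional embeddings) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

h = rng.random((8, 8, 64))     # activations from an upsampling block
time_emb = rng.random(64)      # timestep embedded into a vector
context_emb = rng.random(64)   # context embedded into a vector

# context is multiplied in, time is added; both broadcast over channels
h = context_emb * h + time_emb
```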

Structure with Embedding

Training

The goal of the neural network is to predict the noise, so it needs to learn the distribution of what is not noise.

To do that, we take an image from the training data, add noise to it, give it to the NN, and ask the NN to predict the noise. We then compare the predicted noise against the actual noise that was added to the image; that is how we compute the loss.

We need to determine what the noise here is. We could step through the timesteps in order and give the network different noise samples, but realistically in training we don't want the NN to look at the same image all the time; training is more stable if it sees different images across an epoch. To unify each training iteration, we randomly sample a timestep, get the noise appropriate to that timestep (noise level), add the noise to the image, feed the result to the NN, get the predicted noise, and compute the loss.

The algorithm is shown below.

1. Sample a training image
2. Sample a timestep t; this determines the level of noise
3. Sample the noise
4. Add the noise to the image
5. Feed this into the neural network; the network predicts the noise
6. Compute the loss between the predicted and the true noise
7. Backprop and learn
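One iteration of the steps above can be sketched as follows. The zero-predicting `model` and the linear beta schedule are placeholder assumptions; in practice the model is the UNet and step 7 is handled by an autodiff framework:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
beta = np.linspace(1e-4, 0.02, T + 1)   # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - beta)

def model(x, t):
    return np.zeros_like(x)  # stand-in for the UNet's noise prediction

image = rng.random((16, 16))              # 1. sample a training image
t = rng.integers(1, T + 1)                # 2. sample a timestep
noise = rng.standard_normal(image.shape)  # 3. sample the noise
noisy = np.sqrt(alpha_bar[t]) * image + np.sqrt(1 - alpha_bar[t]) * noise  # 4. add it
predicted = model(noisy, t)               # 5. network predicts the noise
loss = np.mean((predicted - noise) ** 2)  # 6. MSE between predicted and true noise
# 7. backprop and update weights (requires an autodiff framework, omitted)
```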

Controlling

When it comes to controlling models, embeddings should never be ignored. We already mentioned the time embedding and the context embedding. Embeddings are vectors of numbers able to capture meaning. What is special about embeddings is that, because they capture meaning, text with similar content will have similar vectors.

The chart below shows how these embeddings become context to the model during training.

The magic of embeddings is that we can control the generated result when sampling, as shown in the image below.

Magic of Embedding

In summary, the context is a vector for controlling generation; it can be either a text embedding (e.g. more than 1000 in length) or categories (e.g. 5 in length).
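The two kinds of context vectors look like this in code; the concrete sizes (5 categories, a 1024-dimensional text embedding) are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Category context: a one-hot vector over 5 classes.
cat_context = np.zeros(5)
cat_context[2] = 1.0  # select the third category

# Text context: a dense vector produced by a text encoder.
text_context = rng.standard_normal(1024)
```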

Fast Sampling

We are going to talk about a new sampling method called DDIM (Denoising Diffusion Implicit Models), which is about 10 times more efficient than DDPM.

Sampling is slow because there are many timesteps and each timestep depends on the previous one (the process is Markovian).

DDIM is faster because it skips some timesteps. It predicts a rough sketch of the final output and then refines it with the denoising process.

DDIM cannot always reach the same level of quality as DDPM, but the results still look good.

Empirically, if you sample for fewer than 500 steps, DDIM is better; otherwise, DDPM is better.
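The speedup comes purely from visiting fewer timesteps. This sketch compares the two schedules; the stride of 25 is an illustrative assumption:

```python
T = 500
ddpm_steps = list(range(T, 0, -1))   # DDPM visits every timestep
ddim_steps = list(range(T, 0, -25))  # DDIM skips ahead 25 steps at a time

print(len(ddpm_steps), len(ddim_steps))  # prints "500 20"
```

Each DDIM step jumps directly between non-adjacent timesteps, which is why it can afford a rough-sketch-then-refine strategy instead of a strictly step-by-step Markov chain.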

References

  1. How Diffusion Models Work