Deep Learning: Diffusion Models - Part 2
Intro
The goal of a diffusion model is to generate more images like the ones you already have. You use your existing images as training data, and the diffusion model, which is a neural network, serves as the generator.
To make the images useful to a neural network, we need to teach the network what they are about: fine details, general outlines, and everything in between. One way to do that is to add different levels of noise to the training data. At low noise levels the fine details are still visible, while at high noise levels only the general outline remains, so the network sees both.
We train a diffusion model by having it learn to remove the noise we added during the noising process. It’s noteworthy that the noise we feed it during training must be normally distributed, meaning each pixel of the noise is sampled from a normal distribution, also known as a Gaussian distribution. So when we ask the generator for a new image, we can sample noise from the normal distribution and then get a completely new image by using the network to remove the noise.
Sampling
Before talking about how to train a neural network, let’s talk about sampling, or what we do with the neural network after it’s trained.
After the NN is trained, we feed it a noise sample and it predicts the noise. We then subtract the predicted noise from the sample to get something closer to the image we want. This cannot remove all the noise in one step, so we need multiple steps to get high-quality samples.
flowchart LR
    input("Noise sample") --> NN["Trained NN"]
    NN --> predicted_noise("Predicted noise")
    predicted_noise --> subtractor["-"]
    subtractor --> output("Output")
    input --> subtractor
    output -.-> input
The algorithm is shown below.
sample = random_sample
First, we draw a random sample; this is the original noise we have at the beginning.
Then we step through time backward, from the last timestep to the first, because this is the reverse diffusion process.
At each step we draw some extra noise, which we will discuss later.
After that, we pass the original noise sample, or the sample from the previous step, through the trained neural network to get the predicted noise.
Finally, we use a sampling algorithm called DDPM (Denoising Diffusion Probabilistic Models) to compute a set of scaling factors, then use those factors to subtract the predicted noise from the current sample and add the extra noise.
By repeating these steps, we gradually remove the noise from a randomly sampled input and get the image we want.
However, the NN expects a normally distributed noisy sample as input, and once we have denoised the sample, it’s no longer distributed that way. Therefore, after each step and before the next one, we add extra noise that is scaled based on the timestep. Empirically, this stabilizes the NN so its output doesn’t collapse to something close to the average of the dataset.
flowchart LR
    input("Noise sample") --> NN["Trained NN"]
    NN --> predicted_noise("Predicted noise")
    predicted_noise --> subtractor["-"]
    subtractor --> output("Output")
    output --> adder["+"]
    adder --> next_input("Next input")
    input --> subtractor
    next_input -.-> input
    extra_noise("Extra noise") --> adder
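The sampling loop described above can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the exact implementation: the linear noise schedule, the value of `T`, and the `predict_noise` function (a hypothetical stand-in for the trained NN that simply returns zeros) are all illustrative choices.

```python
import numpy as np

# Assumed linear noise schedule; T and the beta range are illustrative values.
T = 500
beta = np.linspace(1e-4, 0.02, T + 1)   # beta[t] for t = 0..T
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)           # cumulative product of alpha

def predict_noise(sample, t):
    """Hypothetical stand-in for the trained NN; returns a noise estimate."""
    return np.zeros_like(sample)

def ddpm_step(sample, pred_noise, t, extra_noise):
    """Subtract the scaled predicted noise, then add the scaled extra noise."""
    mean = (sample - pred_noise * beta[t] / np.sqrt(1.0 - alpha_bar[t])) / np.sqrt(alpha[t])
    return mean + np.sqrt(beta[t]) * extra_noise

def sample_ddpm(shape, rng):
    sample = rng.standard_normal(shape)          # start from pure noise
    for t in range(T, 0, -1):                    # step backward through time
        # Draw extra noise, except on the final step where none is added.
        extra_noise = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        pred_noise = predict_noise(sample, t)
        sample = ddpm_step(sample, pred_noise, t, extra_noise)
    return sample

image = sample_ddpm((3, 16, 16), np.random.default_rng(0))
```

With a real trained network in place of `predict_noise`, the loop structure stays the same; only the noise estimate changes.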
Neural Network
The neural network architecture we use for diffusion models is a UNet.
The most important thing about a UNet is that it takes an image as input and outputs an image of the same size. It first compresses the information about the input into a smaller embedding by downsampling with a series of convolutional layers. Then it upsamples with the same number of upsampling blocks until it produces the output.
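To illustrate only the shape behaviour (same size in, same size out), here is a small NumPy sketch. It is not a UNet: average pooling and nearest-neighbour repetition stand in for the learned convolutional down- and upsampling blocks, and the function names are hypothetical.

```python
import numpy as np

def downsample(x):
    """Halve height and width with 2x2 average pooling (stand-in for a conv block)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    """Double height and width by repeating each pixel (stand-in for an up block)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

x = np.ones((3, 16, 16))                 # channels, height, width
hidden = downsample(downsample(x))       # two down blocks: 16x16 -> 4x4
out = upsample(upsample(hidden))         # two up blocks:   4x4 -> 16x16
```

The number of upsampling steps matches the number of downsampling steps, which is what brings the output back to the input size.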
An advantage of the UNet is that it can take in additional information. For diffusion models, the time embedding is essential: it tells the model what noise level it is dealing with. All we have to do is embed the timestep into a vector and add it to the upsampling blocks. We can also use a context embedding to control the result by multiplying it into the upsampling blocks.
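As a rough sketch of how the two embeddings could enter an upsampling block, the NumPy snippet below multiplies a feature map by the context embedding and adds the time embedding. The `embed` function, the random weights, and all shapes are assumptions made for illustration; a real model would learn these layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, weight):
    """Hypothetical one-layer embedding: project the input to one value per channel."""
    return np.tanh(x @ weight)

channels = 8
features = rng.standard_normal((channels, 4, 4))   # feature map inside an upsampling block
t = np.array([0.3])                                 # timestep normalised to [0, 1]
context = np.array([0.0, 1.0, 0.0, 0.0, 0.0])       # category vector of length 5

time_emb = embed(t, rng.standard_normal((1, channels)))        # shape (channels,)
ctx_emb = embed(context, rng.standard_normal((5, channels)))   # shape (channels,)

# Multiply by the context embedding and add the time embedding,
# broadcasting each per-channel value over height and width.
conditioned = ctx_emb[:, None, None] * features + time_emb[:, None, None]
```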
Training
The goal of the neural network is to predict the noise, so it needs to learn the distribution of what is not noise.
To do that, we take an image from the training data, add noise to it, and give it to the NN, asking it to predict the noise. We then compare the predicted noise against the actual noise that was added to the image, and that’s how we compute the loss.
We need to determine what the noise is. We could just go through the timesteps in order and give the NN a different noise sample at each one. But realistically, in training we don’t want the NN to look at the same image all the time; training is more stable if it sees different images across an epoch. So that every iteration performs the same operation, we randomly sample a timestep, get the noise appropriate to that timestep (noise level), add the noise to the image, feed the result to the NN to get the predicted noise, and then compute the loss.
flowchart LR
    random_image("Random Image") --> adder["+"]
    random_timestep("Random Timestep") --> noise("Noise")
    noise --> adder
    adder --> noised_image("Noised Image")
    noised_image --> NN["NN"]
    NN --> predicted_noise("Predicted Noise")
    predicted_noise --> subtractor["-"]
    noise --> subtractor
    subtractor --> loss("Loss")
    loss --> NN
The algorithm is shown below.
Sample training image
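One iteration of the training procedure described above might look like the sketch below. It only computes the loss; the backpropagation and weight update are left out. The linear noise schedule, the mean squared error loss, and `predict_noise` (a hypothetical stand-in for the NN being trained) are assumptions for illustration.

```python
import numpy as np

# Assumed linear noise schedule; T and the beta range are illustrative values.
T = 500
beta = np.linspace(1e-4, 0.02, T + 1)
alpha_bar = np.cumprod(1.0 - beta)

def perturb(image, t, noise):
    """Mix the image with noise at the level given by timestep t."""
    return np.sqrt(alpha_bar[t]) * image + np.sqrt(1.0 - alpha_bar[t]) * noise

def predict_noise(noised, t):
    """Hypothetical stand-in for the NN being trained."""
    return np.zeros_like(noised)

def training_loss(image, rng):
    t = int(rng.integers(1, T + 1))             # random timestep in 1..T
    noise = rng.standard_normal(image.shape)    # the actual noise we add
    noised = perturb(image, t, noise)           # noised image fed to the NN
    pred = predict_noise(noised, t)             # predicted noise
    return float(np.mean((pred - noise) ** 2))  # compare predicted vs. actual noise

rng = np.random.default_rng(0)
image = rng.uniform(-1.0, 1.0, size=(3, 16, 16))   # a fake training image
loss = training_loss(image, rng)
```

Because the timestep is drawn at random for every image, each iteration runs the same code regardless of the noise level.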
Controlling
When it comes to controlling models, embeddings are essential. We have already mentioned the time embedding and the context embedding. Embeddings are vectors of numbers that capture meaning. Because they capture meaning, texts with similar content have similar vectors.
The chart below shows how these embeddings become context to the model during training.
flowchart LR
    text("Text") --> embedding["Embedding"]
    embedding --> vector("Vector")
    random_image("Image") --> adder["+"]
    random_timestep("Random Timestep") --> noise("Noise")
    noise --> adder
    adder --> noised_image("Noised Image")
    noised_image --> NN["NN"]
    NN --> predicted_noise("Predicted Noise")
    predicted_noise --> subtractor["-"]
    noise --> subtractor
    subtractor --> loss("Loss")
    loss --> NN
    vector --> NN
The magic of embeddings is that we can control the generated result when sampling, as shown in the image below.
In summary, context is a vector for controlling generation. It can be a text embedding (e.g. more than 1000 in length) or a category vector (e.g. 5 in length).
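For the category case, a context vector of length 5 could be built as a one-hot vector, as in the sketch below. The one-hot encoding and the `one_hot` helper are assumptions for illustration; the source only states that the category context has length 5.

```python
import numpy as np

NUM_CATEGORIES = 5

def one_hot(index, size=NUM_CATEGORIES):
    """Return a vector of zeros with a single 1.0 at the chosen category."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

context = one_hot(2)                                # context for the third category
batch = np.stack([one_hot(i) for i in (0, 2, 4)])   # one context row per image in a batch
```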
Fast Sampling
We are going to talk about another sampling method called DDIM (Denoising Diffusion Implicit Models), which can be roughly 10 times more efficient than DDPM.
Sampling is slow because there are many timesteps and each timestep depends on the previous one (the process is Markovian).
DDIM is faster because it skips some timesteps. It predicts a rough sketch of the final output and then refines it with the denoising process.
flowchart LR
    500 --> 499
    499 --> 498
    498 --> 497
    497 --> 496
    496 --> 495
    495 --> 494
    500 --> 498
    498 --> 496
    496 --> 494
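A deterministic DDIM-style step can be sketched as follows: estimate the final image from the current sample, then re-noise that estimate to the lower noise level of an earlier timestep, which may be many steps away. This is a minimal sketch; the linear noise schedule, the evenly spaced timesteps, and `predict_noise` (a hypothetical stand-in for the trained NN) are assumptions.

```python
import numpy as np

# Assumed linear noise schedule; T and the beta range are illustrative values.
T = 500
beta = np.linspace(1e-4, 0.02, T + 1)
alpha_bar = np.cumprod(1.0 - beta)
alpha_bar[0] = 1.0                      # treat t = 0 as noise-free

def predict_noise(sample, t):
    """Hypothetical stand-in for the trained NN; returns a noise estimate."""
    return np.zeros_like(sample)

def ddim_step(sample, pred_noise, t, t_prev):
    # Rough estimate of the final image from the current noisy sample.
    x0_pred = (sample - np.sqrt(1.0 - alpha_bar[t]) * pred_noise) / np.sqrt(alpha_bar[t])
    # Re-noise the estimate to the lower noise level of t_prev.
    return np.sqrt(alpha_bar[t_prev]) * x0_pred + np.sqrt(1.0 - alpha_bar[t_prev]) * pred_noise

def sample_ddim(shape, n_steps, rng):
    sample = rng.standard_normal(shape)     # start from pure noise
    step = T // n_steps                     # number of timesteps skipped per iteration
    for t in range(T, 0, -step):
        t_prev = max(t - step, 0)
        sample = ddim_step(sample, predict_noise(sample, t), t, t_prev)
    return sample

image = sample_ddim((3, 16, 16), n_steps=25, rng=np.random.default_rng(0))
```

Here 25 network evaluations replace the 500 that a full step-by-step loop over the same schedule would need.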
DDIM cannot always reach the same level of quality as DDPM, but the results still look good.
Empirically, if you sample for fewer than 500 steps, DDIM is better; otherwise, DDPM is better.