diffusion models, latent diffusion, VQ-VAE
TL;DR: Introduced Latent Diffusion Models (LDMs), which run the diffusion process in a lower-dimensional latent space rather than in pixel space. This reduces computational cost and enables diffusion models to handle higher-resolution images and complex tasks such as text-to-image generation.
Diffusion models [Sohl-Dickstein et al. 2015] are a class of generative models that operate in pixel space. They model the data distribution by gradually corrupting the input image with increasing levels of Gaussian noise, governed by a stochastic, time-dependent process that dictates how the data distribution evolves over time, and by learning to reverse this corruption (Fig. 1).
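Concretely, in the DDPM formulation [Ho et al. 2020], the forward process adds Gaussian noise according to a fixed variance schedule \( \beta_t \), which gives a closed form for sampling the noisy image \( x_t \) directly from the clean image \( x_0 \):

\[
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\mathbf{I}\right), \qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s),
\]

so \( x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \) with \( \epsilon \sim \mathcal{N}(0, \mathbf{I}) \).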
However, pixel space is high-dimensional, which makes these models computationally expensive, especially during the reverse diffusion (denoising) process.
This paper uses an autoencoder, specifically a VQ-VAE, to map images from pixel space into a compressed latent space, thereby reducing the computational cost of denoising. The latent space also turns out to be well suited for conditional sampling.
An autoencoder is composed of an encoder \( \mathcal{E} \) and a decoder \( \mathcal{D} \). The encoder maps the input image \( x \) to a lower-dimensional latent representation \( z = \mathcal{E}(x) \), and the decoder maps \( z \) back to a reconstruction \( \hat{x} = \mathcal{D}(z) \).
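As a minimal sketch in PyTorch (the layer sizes and module names here are illustrative, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal convolutional autoencoder: pixel space <-> latent space."""
    def __init__(self, in_channels=3, latent_channels=4):
        super().__init__()
        # Encoder E: downsample the image into a lower-dimensional latent z
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, latent_channels, 3, stride=2, padding=1),
        )
        # Decoder D: upsample the latent z back to pixel space
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, in_channels, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)      # x: (B, 3, H, W) -> z: (B, 4, H/8, W/8)
        x_hat = self.decoder(z)  # z -> x_hat: (B, 3, H, W)
        return x_hat, z
```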
This paper places the VQ-VAE around the diffusion process: the encoder \( \mathcal{E} \) maps the image from pixel space into the latent space, the entire diffusion process runs on the latent representation, and the decoder \( \mathcal{D} \) maps the result back to pixel space (Fig. 2).
In particular, the denoising network is a time-conditional UNet \( \epsilon_\theta \) that predicts the noise we want to remove from the noisy latent \( z_t \). The input of the UNet is a noisy latent whose noise level corresponds to timestep \( t \) of the reverse diffusion process.
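With this noise-prediction parameterization, the latent diffusion model is trained with the latent-space analogue of the usual denoising objective:

\[
\mathcal{L}_{\text{LDM}} = \mathbb{E}_{\mathcal{E}(x),\ \epsilon \sim \mathcal{N}(0, 1),\ t} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|_2^2 \right],
\]

where \( z_t \) is the noised version of \( z = \mathcal{E}(x) \) at timestep \( t \).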
One of the core contributions of this paper is to use the latent space of the VQ-VAE for conditional sampling, as formalized below.
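Concretely, a domain-specific encoder \( \tau_\theta \) maps the conditioning input \( y \) (e.g., a text prompt or a class label) to an intermediate representation that is injected into the UNet layers via cross-attention, and the objective becomes

\[
\mathcal{L}_{\text{LDM}} = \mathbb{E}_{\mathcal{E}(x),\ y,\ \epsilon \sim \mathcal{N}(0, 1),\ t} \left[ \left\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right\|_2^2 \right].
\]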
We can therefore summarize training in two stages, which are trained separately: first the autoencoder, then the diffusion model in the latent space.
A vanilla autoencoder is trained to reconstruct its input: given an image \( \mathbf{x} \), it predicts \( \mathbf{\hat{x}} \approx \mathbf{x} \).
In this paper, to keep reconstructions perceptually sharp at high compression rates, the authors instead train the autoencoder in an adversarial manner.
A patch-based discriminator \( D_\psi \) is used to distinguish the input images \( \mathbf{x} \) from the reconstructions \( \mathbf{\hat{x}} = \mathcal{D}(\mathcal{E}(\mathbf{x})) \).
The autoencoder is trained with the following loss function:
\[
\mathcal{L}_{\text{Autoencoder}} = \min_{\mathcal{E}, \mathcal{D}} \max_{\psi} \Big( \mathcal{L}_{\text{rec}}(x, \mathcal{D}(\mathcal{E}(x))) - \mathcal{L}_{\text{adv}}(\mathcal{D}(\mathcal{E}(x))) + \log D_{\psi}(x) + \mathcal{L}_{\text{reg}}(x; \mathcal{E}, \mathcal{D}) \Big),
\tag{Eq. 1} \label{eq:1}
\]
where:
- \( \mathcal{L}_{\text{rec}} \) is the reconstruction loss between \( x \) and \( \mathcal{D}(\mathcal{E}(x)) \) (a perceptual loss in the paper),
- \( \mathcal{L}_{\text{adv}} \) is the adversarial loss, with \( D_\psi \) the patch-based discriminator that scores real images,
- \( \mathcal{L}_{\text{reg}} \) is a regularization term on the latent: either a slight KL penalty toward a standard normal or, in the VQ variant, a vector-quantization layer.
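A minimal sketch of how these terms could be computed for one batch, assuming the `Autoencoder` from above and a patch-based `discriminator`; the hinge formulation for the discriminator side and the loss weightings are illustrative choices, not the paper's exact recipe:

```python
import torch
import torch.nn.functional as F

def autoencoder_losses(x, autoencoder, discriminator, reg_weight=1e-6):
    """Sketch of the terms in Eq. 1 for one batch.

    The encoder/decoder minimize: L_rec + L_adv + L_reg.
    The discriminator D_psi is trained to tell x from x_hat.
    """
    x_hat, z = autoencoder(x)

    # L_rec: reconstruction loss (the paper uses a perceptual loss;
    # plain L1 is shown here as a stand-in)
    l_rec = F.l1_loss(x_hat, x)

    # L_adv: the autoencoder tries to make reconstructions look real to D_psi
    l_adv = -torch.mean(discriminator(x_hat))

    # L_reg: regularization on the latent (e.g., a slight penalty toward N(0, I);
    # the VQ variant uses a vector-quantization layer instead)
    l_reg = reg_weight * torch.mean(z ** 2)

    # Discriminator side (hinge loss), computed on detached reconstructions
    l_disc = torch.mean(F.relu(1.0 - discriminator(x))) + \
             torch.mean(F.relu(1.0 + discriminator(x_hat.detach())))

    return l_rec + l_adv + l_reg, l_disc
```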
Fig. 1. Standard diffusion process.
Fig. 2. Diffusion process in the latent space.
During the diffusion training process (the second stage), the encoder and decoder are frozen; only the UNet \( \epsilon_\theta \) is updated, as sketched below.
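A minimal sketch of one such training step, assuming a frozen `encoder`, a time-conditional `unet`, and precomputed schedule constants `alphas_bar` (all names here are illustrative):

```python
import torch
import torch.nn.functional as F

def train_latent_diffusion_step(x, encoder, unet, optimizer, alphas_bar,
                                num_timesteps=1000):
    """One stage-2 training step: the autoencoder is frozen, only the UNet learns."""
    with torch.no_grad():                    # encoder is frozen
        z0 = encoder(x)                      # clean latent

    t = torch.randint(0, num_timesteps, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    # Forward process in latent space: z_t = sqrt(a_bar_t) z0 + sqrt(1 - a_bar_t) eps
    a_bar = alphas_bar[t].view(-1, 1, 1, 1)
    z_t = torch.sqrt(a_bar) * z0 + torch.sqrt(1.0 - a_bar) * eps

    loss = F.mse_loss(unet(z_t, t), eps)     # predict the added noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```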
Now that we have trained the model, sampling runs entirely in latent space: we draw a latent from Gaussian noise, denoise it step by step with the UNet, and decode the result with \( \mathcal{D} \) to obtain the final image.
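A minimal sampling sketch under the same assumptions (illustrative names; DDPM-style updates with \( \sigma_t^2 = \beta_t \)):

```python
import torch

@torch.no_grad()
def sample(unet, decoder, shape, betas):
    """Sample: start from Gaussian noise in latent space, denoise, then decode."""
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    z = torch.randn(shape)                                      # z_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = unet(z, t_batch)                                  # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alphas_bar[t])
        z = (z - coef * eps) / torch.sqrt(alphas[t])            # posterior mean
        if t > 0:
            z += torch.sqrt(betas[t]) * torch.randn_like(z)     # add noise except at t=0
    return decoder(z)                                           # latent -> pixels
```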