diffusion models, latent diffusion, VQ-VAE
TL;DR: Introduced Latent Diffusion Models that operate in a lower-dimensional latent space rather than the pixel space. This innovation reduced computational costs and enabled diffusion models to handle higher-resolution images and complex tasks like text-to-image generation.
Diffusion models [Sohl-Dickstein et al. 2015] are a class of generative models that operate in pixel space. They model the data distribution by gradually corrupting the input image with increasing levels of noise, governed by a stochastic, time-dependent process that dictates how the data distribution evolves over time (Fig. 1).
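Concretely, under the standard DDPM parameterization [Ho et al. 2020] (one common choice of the time-dependent process; the summary above does not fix one), the noisy sample at step $t$ has a closed form:

$$
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s),
$$

i.e. $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0,\mathbf{I})$, where $\beta_s$ is the per-step noise schedule.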
However, pixel space is high-dimensional, which makes these models computationally expensive, especially during the reverse diffusion (denoising) process.
This paper uses an autoencoder, specifically a VQ-VAE, to map images from pixel space into a lower-dimensional latent space, thereby reducing the computational cost of the denoising process. The latent space is also well suited to conditional sampling: conditioning signals such as text embeddings can be injected into the denoising network via cross-attention.
An autoencoder is composed of an encoder $\mathcal{E}$, which compresses an image $x$ into a latent representation $z = \mathcal{E}(x)$, and a decoder $\mathcal{D}$, which maps the latent back to pixel space, $\hat{x} = \mathcal{D}(z)$.
The diffusion model itself is a time-conditional UNet that operates on these latents and is trained to predict the noise added to them at each step.
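To make the moving parts concrete, here is a minimal sketch of the three components. The module names and shapes are hypothetical, chosen only for illustration; the actual networks in the paper are much deeper convolutional/attention architectures.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the three components of a latent diffusion model.
# Only the interfaces mirror the real thing.

class Encoder(nn.Module):
    """Compresses an image x (B, 3, 256, 256) to a latent z (B, 4, 32, 32)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 4, kernel_size=8, stride=8)  # 8x spatial downsampling

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Maps a latent z (B, 4, 32, 32) back to an image (B, 3, 256, 256)."""
    def __init__(self):
        super().__init__()
        self.net = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)

    def forward(self, z):
        return self.net(z)

class DenoisingUNet(nn.Module):
    """Predicts the noise that was added to a latent z_t at step t."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(4, 4, kernel_size=3, padding=1)  # placeholder for a real UNet

    def forward(self, z_t, t):
        # A real time-conditional UNet embeds t (and, for conditional
        # generation, attends to text embeddings via cross-attention);
        # both are omitted in this placeholder.
        return self.net(z_t)

encoder, decoder = Encoder(), Decoder()
```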
We can summarize training in two stages: first train the autoencoder, then train the diffusion model in the latent space. The two stages are trained separately.
A vanilla autoencoder is trained to reconstruct its own input: the encoder maps an image $x$ to a latent $z = \mathcal{E}(x)$ and the decoder produces a reconstruction $\hat{x} = \mathcal{D}(z)$.
The autoencoder is trained with the following reconstruction loss:

$$
\mathcal{L}_{\text{rec}} = \left\lVert x - \mathcal{D}(\mathcal{E}(x)) \right\rVert_2^2
$$

(In practice, the paper augments this with a perceptual loss, a patch-based adversarial loss, and a VQ or KL regularization term on the latent.)
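A minimal stage-1 training step for the reconstruction objective above might look as follows, assuming the `encoder`/`decoder` modules from the sketch earlier. This uses plain MSE only; the paper's full objective (perceptual, adversarial, and regularization terms) is omitted.

```python
import torch
import torch.nn.functional as F

# Stage 1: train the autoencoder to reconstruct images.
opt = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4
)

def autoencoder_step(x):
    z = encoder(x)               # compress to latent space
    x_hat = decoder(z)           # reconstruct the image
    loss = F.mse_loss(x_hat, x)  # plain reconstruction loss (sketch only)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```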
Fig. 1. Standard diffusion process.
Fig. 2. Diffusion process in the latent space.
During the diffusion training process, the encoder and decoder are frozen.
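A sketch of one stage-2 training step, reusing the DDPM closed form from earlier and the modules from the previous sketches (all names are assumptions carried over from those sketches). Note that the pretrained encoder and decoder are frozen and only the UNet receives gradients:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule (a common choice)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative product, i.e. ᾱ_t

# Freeze the pretrained autoencoder: stage 2 trains only the UNet.
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad_(False)

unet = DenoisingUNet()
opt = torch.optim.Adam(unet.parameters(), lr=1e-4)

def diffusion_step(x):
    with torch.no_grad():
        z0 = encoder(x)                           # frozen encoder
    t = torch.randint(0, T, (z0.shape[0],))       # random timestep per sample
    eps = torch.randn_like(z0)
    ab = alpha_bar[t].view(-1, 1, 1, 1)
    z_t = ab.sqrt() * z0 + (1 - ab).sqrt() * eps  # forward process, in latent space
    loss = F.mse_loss(unet(z_t, t), eps)          # noise-prediction (epsilon) objective
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```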
Now that the model is trained, we can generate new images by sampling Gaussian noise in the latent space, running the reverse diffusion process with the UNet, and decoding the resulting latent with the decoder, as sketched below.
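A minimal ancestral-sampling sketch under the same assumptions (DDPM-style updates with $\sigma_t^2 = \beta_t$; the paper also works with faster samplers such as DDIM). It reuses `unet`, `decoder`, `betas`, `alpha_bar`, and `T` from the sketches above:

```python
import torch

@torch.no_grad()
def sample(n=4, latent_shape=(4, 32, 32)):
    z = torch.randn(n, *latent_shape)        # start from pure noise in latent space
    for t in reversed(range(T)):
        eps = unet(z, torch.full((n,), t))   # predicted noise at step t
        alpha_t = 1.0 - betas[t]
        ab = alpha_bar[t]
        # DDPM posterior mean for z_{t-1}
        z = (z - (1.0 - alpha_t) / (1.0 - ab).sqrt() * eps) / alpha_t.sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return decoder(z)                        # decode latents back to images
```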