Score-Based Generative Modeling through Stochastic Differential Equations

Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole

In International Conference on Learning Representations (ICLR), 2021

score-based models, diffusion models

@inproceedings{song2021scorebased,
title={Score-Based Generative Modeling through Stochastic Differential Equations},
author={Yang Song and Jascha Sohl{-}Dickstein and Diederik P. Kingma and Abhishek Kumar and Stefano Ermon and Ben Poole},
year={2021},
booktitle={9th International Conference on Learning Representations, {ICLR}}
}

TL;DR: Define continuous-time stochastic processes instead of discrete steps, enabling better control and flexibility. The reverse SDE is learned using score matching, and this framework can generalize to various types of noise schedules (like the variance-exploding or variance-preserving SDEs).

1. Introduction and Background

1.1. Denoising Score Matching with Langevin Dynamics

In [Song et al. 2019] we discussed the role of Langevin dynamics in score matching. To recap, Langevin dynamics consists of a stochastic differential equation (SDE) that describes the evolution of a particle in a potential field. We can use the score function, which is the gradient of the log-density of the data distribution, and apply Langevin dynamics to generate new samples from the score function.
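To make Langevin dynamics concrete, here is a minimal NumPy sketch that samples from a toy target whose score is known in closed form: a standard Gaussian, for which \( \nabla_\mathbf{x} \log p(\mathbf{x}) = -\mathbf{x} \). The step size and step count are illustrative choices, not values from the paper:

```python
import numpy as np

def langevin_sample(score_fn, x0, step_size=0.01, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics: x <- x + (eps/2) * score(x) + sqrt(eps) * z."""
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

# Toy target: standard Gaussian, whose score is known analytically.
score = lambda x: -x  # grad log N(0, I) = -x

# 5000 independent chains, all started far from the mode.
samples = langevin_sample(score, x0=np.full(5000, 10.0))
print(samples.mean(), samples.std())  # drifts toward mean 0, std 1
```

In practice the analytic score is replaced by a learned score network; everything else in the sampler stays the same.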
Let's define a perturbation kernel: \[ p_\sigma(\mathbf{\hat{x}} \vert \mathbf{x}) := \mathcal{N}(\mathbf{\hat{x}}; \mathbf{x}, \sigma^2 \mathbf{I}) \] where:

  • \( \mathbf{x} \) is the data point.
  • \( \mathbf{\hat{x}} \) is the noisy observation.
  • \( \sigma \) is the noise scale or standard deviation of the noise.
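Sampling from this kernel is just adding scaled Gaussian noise, and the kernel's own score is available in closed form, which is what denoising score matching regresses against. A small sketch (the batch size and \( \sigma \) value are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb(x, sigma, rng):
    """Draw x_hat ~ N(x, sigma^2 I), i.e. a sample from p_sigma(x_hat | x)."""
    return x + sigma * rng.standard_normal(x.shape)

def kernel_score(x_hat, x, sigma):
    """Closed-form score of the kernel: grad_{x_hat} log p_sigma(x_hat | x)."""
    return (x - x_hat) / sigma**2

x = np.zeros(100_000)              # a batch of (identical) data points at 0
x_hat = perturb(x, sigma=2.0, rng=rng)
print(x_hat.mean(), x_hat.std())   # empirically ~ N(0, 2^2)
```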

2. Noise Perturbations

2.1. Noise Perturbations, VP and VE

Both score matching with Langevin dynamics (SMLD) and Denoising Diffusion Probabilistic Models (DDPM) [Ho et al. 2020] add noise progressively to the data over discrete time steps.
Let's derive the variance-exploding and variance-preserving continuous-time stochastic differential equations (SDEs) for these noise perturbations from their discrete expressions.

2.1.1. Variance Exploding Stochastic Differential Equation (VE-SDE)

Variance Exploding (VE) is used in SMLD and consists of progressively increasing the variance of the noise over time. This paper leverages the VE-SDE to create robust denoising trajectories and stable training. But how is this done? We will define two formulations of the stochastic process, one in terms of the standard deviation \( \sigma \) and one in terms of the variance scale \( \beta \) of the noise distribution. This will allow us to control the noise scale over time with variance-exploding or variance-preserving noise.
First, let's consider a Markov chain of \( N \) steps, each with its own noise scale, with perturbation kernels \( p_{\sigma_i}(\mathbf{x} \vert \mathbf{x}_0) \) or, in other words, the probability distributions of the noisy observations \( \mathbf{x} \) given the data point \( \mathbf{x}_0 \), where \( i \) is the time step index. The state at the \( i \)-th time step is given by the Markov chain: \[ \mathbf{x}_i = \mathbf{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2} \ \mathbf{z}_{i-1}, \quad i=1, \cdots, N \tag{Eq. 1} \label{eq:markov_chain_ve} \] where:

  • \( \mathbf{z}_{i-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \) is standard Gaussian noise.
  • \( \sigma_0 = 0 \), i.e. no perturbation is applied at the beginning of the process.
  • \( \mathbf{x}_0 \sim p_{data} \) is a sample from the data distribution.
If we take infinite time steps, \( N \rightarrow \infty \), the Markov chain becomes a continuous-time stochastic process. Let's define an infinitesimal time step \( \Delta t \) and rewrite Eq. 1 as: \[ \mathbf{x}(t + \Delta t) = \mathbf{x}(t) + \sqrt{\sigma^2(t + \Delta t) - \sigma^2(t)} \, \mathbf{z}(t) \overset{\text{1st-order Taylor expansion}}{\approx} \mathbf{x}(t) + \sqrt{\frac{\text{d} [\sigma^2(t)]}{\text{d}t} \Delta t} \, \mathbf{z}(t), \tag{Eq. 2} \label{eq:continuous_time} \] Note that we approximate the incremental variance \( \sigma^2(t + \Delta t) - \sigma^2(t) \) by \( \frac{\text{d} [\sigma^2(t)]}{\text{d}t} \Delta t \) with a 1st-order Taylor expansion, which assumes a small \( \Delta t \). Taking the limit \( \Delta t \rightarrow 0 \), where \( \sqrt{\Delta t} \, \mathbf{z}(t) \rightarrow \text{d}\mathbf{w} \), Eq. 2 becomes the Variance Exploding Stochastic Differential Equation (VE-SDE): \[ \text{d} \mathbf{x} = \sqrt{\frac{\text{d} [\sigma^2(t)]}{\text{d}t}} \, \text{d}\mathbf{w} \] The variable \( \mathbf{w} \) is the Wiener process, or Brownian motion, which drives the random fluctuations of the process over time. The naming of VE-SDE comes from the fact that the variance of the noise increases over time, growing unboundedly.
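The discrete VE chain of Eq. 1 can be sanity-checked numerically: because the per-step variances telescope, the marginal after step \( i \) is \( \mathcal{N}(\mathbf{x}_0, \sigma_i^2 \mathbf{I}) \). A minimal NumPy sketch (the geometric schedule and all constants are illustrative choices, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
# Geometric noise schedule sigma_1 < ... < sigma_N (a common SMLD-style choice).
sigmas = np.concatenate([[0.0], np.geomspace(0.01, 10.0, N)])  # sigma_0 = 0

x0 = np.full(50_000, 3.0)  # a batch of data points, all at 3.0 for easy checking
x = x0.copy()
for i in range(1, N + 1):
    # Eq. 1: each step adds just enough noise to raise total variance to sigma_i^2.
    x = x + np.sqrt(sigmas[i] ** 2 - sigmas[i - 1] ** 2) * rng.standard_normal(x.shape)

# Telescoping variances: x_N ~ N(x0, sigma_N^2 I), i.e. mean ~ 3, std ~ 10.
print(x.mean(), x.std())
```

Note how the variance keeps growing with the schedule, which is exactly the "exploding" behavior the VE-SDE formalizes.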

2.1.2. Variance Preserving Stochastic Differential Equation (VP-SDE)

Variance Preserving (VP) is used in DDPM and consists of keeping the variance of the noise constant over time. For the VE-SDE, we defined \( \sigma \) as the standard deviation of the noise distribution. In a stochastic process, we can also define \( \beta \) as the variance scale of the noise. In VP-SDE, the Markov chain for the perturbation kernels \( \{ p_{\alpha_i} ( \mathbf{x} \vert \mathbf{x}_0) \}_{i=1}^N \), where \( \alpha_i = 1 - \beta_i \), can be written as: \[ \mathbf{x}_i = \sqrt{1-\beta_i} \ \mathbf{x}_{i-1} + \sqrt{\beta_i} \ \mathbf{z}_{i-1}, \quad i=1, \cdots, N \tag{Eq. 3} \label{eq:markov_chain_vp} \] Eq. 3 is the VP counterpart of Eq. 1, written in terms of the variance scale \( \beta \) of the noise distribution. As we did with the VE-SDE, we would like to obtain the continuous-time equation for the VP-SDE. First, let's again consider a Markov chain of \( N \) steps with \( N \rightarrow \infty \). To prevent the variance from exploding as \( N \) grows, we define a set of auxiliary noise scales \( \{ \hat{\beta}_i = N \beta_i \}_{i=1}^N \). We can rewrite Eq. 3 as: \[ \mathbf{x}_i = \sqrt{1-\frac{\hat{\beta}_i}{N}} \ \mathbf{x}_{i-1} + \sqrt{\frac{\hat{\beta}_i}{N}} \ \mathbf{z}_{i-1}, \quad i=1, \cdots, N \tag{Eq. 4} \label{eq:markov_chain_vp_aux} \] In the limit \( N \rightarrow \infty \), \( \{ \hat{\beta}_i \}_{i=1}^N \) becomes a function \( \beta(t) \) of continuous time \( t \), with step size \( \Delta t = \frac{1}{N} \). We can rewrite Eq. 4 as: \[ \begin{align} \mathbf{x}(t + \Delta t) &= \sqrt{1 - \beta(t + \Delta t) \Delta t} \ \mathbf{x}(t) + \sqrt{\beta(t + \Delta t) \Delta t} \ \mathbf{z}(t) \tag{Eq. 5.1}\\ &\overset{\text{1st-order Taylor expansion}}{\approx} \mathbf{x}(t) - \frac{1}{2} \beta(t + \Delta t) \Delta t \ \mathbf{x}(t) + \sqrt{\beta(t + \Delta t) \Delta t} \ \mathbf{z}(t) \tag{Eq. 5.2}\\ &\overset{\beta(t+\Delta t) \approx \beta(t) \text{ for small } \Delta t}{\approx} \mathbf{x}(t) - \frac{1}{2} \beta(t) \Delta t \ \mathbf{x}(t) + \sqrt{\beta(t) \Delta t} \ \mathbf{z}(t) \tag{Eq. 5.3} \label{eq:continuous_time_vp} \end{align} \] where \( \sqrt{\beta(t) \Delta t} \ \mathbf{z}(t) \) is the noise added at time \( t \).
The limit when \( \Delta t \rightarrow 0 \) of Eq. 5.3 is the Variance Preserving Stochastic Differential Equation (VP-SDE) expressed as: \[ \text{d} \mathbf{x} = -\frac{1}{2} \beta(t) \mathbf{x} \text{d}t + \sqrt{\beta(t)} \text{d} \mathbf{w}(t) \tag{Eq. 6} \label{eq:vp_sde} \]
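The variance-preserving property of Eq. 3 is easy to verify numerically: if the data starts with unit variance, shrinking the signal by \( \sqrt{1-\beta_i} \) and adding \( \beta_i \) worth of noise keeps the total variance at 1 at every step. A short sketch (the linear schedule, in the spirit of DDPM, uses illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
# Linear beta schedule (illustrative values, DDPM-style).
betas = np.linspace(1e-4, 0.02, N)

x = rng.standard_normal(50_000)  # start from unit-variance "data"
for beta in betas:
    # Eq. 3: Var <- (1 - beta) * Var + beta = 1 whenever Var = 1.
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

print(x.var())  # stays ~ 1: variance preserving
```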

Summary of the VE and VP formulations:

VE-SDE:
  • Discrete: \( \mathbf{x}_i = \mathbf{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2} \ \mathbf{z}_{i-1} \)
  • Continuous: \( \mathbf{x}(t + \Delta t) = \mathbf{x}(t) + \sqrt{\sigma^2(t + \Delta t) - \sigma^2(t)} \ \mathbf{z}(t) \)
  • SDE: \( \text{d} \mathbf{x} = \sqrt{\frac{\text{d} [\sigma^2(t)]}{\text{d}t}} \, \text{d}\mathbf{w} \)

VP-SDE:
  • Discrete: \( \mathbf{x}_i = \sqrt{1 - \beta_i} \ \mathbf{x}_{i-1} + \sqrt{\beta_i} \ \mathbf{z}_{i-1} \)
  • Continuous: \( \mathbf{x}(t + \Delta t) = \mathbf{x}(t) - \frac{1}{2} \beta(t) \Delta t \ \mathbf{x}(t) + \sqrt{\beta(t) \Delta t} \ \mathbf{z}(t) \)
  • SDE: \( \text{d} \mathbf{x} = -\frac{1}{2} \beta(t) \mathbf{x} \, \text{d}t + \sqrt{\beta(t)} \, \text{d} \mathbf{w}(t) \)

3. Probability Flow ODE

The reverse-time SDE [Anderson 1982] is given by: \[ \text{d}\mathbf{x} = \left\{ \mathbf{f}(\mathbf{x},t) - \nabla \cdot [\mathbf{G}(\mathbf{x},t) \mathbf{G}(\mathbf{x},t)^\text{T}] - \mathbf{G}(\mathbf{x},t) \mathbf{G}(\mathbf{x},t)^\text{T} \nabla_\mathbf{x} \log p_t(\mathbf{x}) \right\} \text{d}t + \mathbf{G}(\mathbf{x},t) \text{d}\mathbf{\tilde{w}}, \] where \( \mathbf{\tilde{w}} \) is a Wiener process running backwards in time. One of the main contributions of this paper is the derivation of the Probability Flow Ordinary Differential Equation (PF-ODE), a deterministic continuous-time equation that describes the evolution of the data distribution without any stochastic term. It can be shown that the following probability flow ODE (Eq. 7) induces the same marginal probability density \( p_t(\mathbf{x}) \) as the SDE given by Eq. 8. \[ \text{d}\mathbf{x} = \left\{ \mathbf{f}(\mathbf{x},t) - \frac{1}{2} \nabla \cdot [\mathbf{G}(\mathbf{x},t) \mathbf{G}(\mathbf{x},t)^\text{T}] - \frac{1}{2} \mathbf{G}(\mathbf{x},t) \mathbf{G}(\mathbf{x},t)^\text{T} \nabla_\mathbf{x} \log p_t(\mathbf{x}) \right\} \text{d}t \tag{Eq. 7} \label{eq:ode} \] where:

  • \( \mathbf{f}(\mathbf{x},t) \) is the drift coefficient of the SDE.
  • \( \mathbf{G}(\mathbf{x},t) \) is the diffusion coefficient of the SDE.
  • \( \nabla_\mathbf{x} \log p_t(\mathbf{x}) \) is the score of the marginal distribution at time \( t \).
The SDE equation can be written as: \[ \text{d}\mathbf{x} = \mathbf{f}(\mathbf{x},t)\text{d}t + \mathbf{G}(\mathbf{x},t)\text{d}\mathbf{w}, \tag{Eq. 8} \label{eq:sde} \]
This equation can be written as a simplified ODE: \[ \text{d} \mathbf{x} = \mathbf{\tilde{f}}(\mathbf{x}, t)\,\text{d}t \tag{Eq. 9} \label{eq:ode_simplified} \] where: \[ \mathbf{\tilde{f}}(\mathbf{x}, t) := \mathbf{f}(\mathbf{x},t) - \frac{1}{2} \nabla \cdot [\mathbf{G}(\mathbf{x},t) \mathbf{G}(\mathbf{x},t)^\text{T}] - \frac{1}{2} \mathbf{G}(\mathbf{x},t) \mathbf{G}(\mathbf{x},t)^\text{T} \nabla_\mathbf{x} \log p_t(\mathbf{x}) \]
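For a centered Gaussian toy distribution the score is available in closed form, so the VP instance of Eq. 7 (\( \mathbf{f} = -\frac{1}{2}\beta(t)\mathbf{x} \), \( \mathbf{G} = \sqrt{\beta(t)}\,\mathbf{I} \)) can be integrated with plain Euler steps, checking that the deterministic flow reproduces the SDE marginals. A minimal sketch, assuming a constant \( \beta \) and illustrative constants throughout:

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 1.0                 # constant beta(t), for simplicity
T, n_steps = 5.0, 5000
dt = T / n_steps

def var_t(t, s0_sq=4.0):
    """Closed-form VP marginal variance for N(0, s0_sq) data."""
    return s0_sq * np.exp(-beta * t) + 1.0 - np.exp(-beta * t)

x = 2.0 * rng.standard_normal(50_000)  # data ~ N(0, 4): score known analytically
t = 0.0
for _ in range(n_steps):
    score = -x / var_t(t)                              # grad log p_t, centered Gaussian
    dx = (-0.5 * beta * x - 0.5 * beta * score) * dt   # PF-ODE drift (Eq. 7, VP case)
    x = x + dx
    t += dt

# Deterministic trajectories, yet the marginal matches the SDE: Var(x_T) -> var_t(T).
print(x.var(), var_t(T))
```

Unlike the forward SDE simulations above, no noise is injected here; each sample follows a smooth trajectory, but the population variance still evolves exactly as the stochastic process prescribes.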

© Carlos Hernández Oliván. All rights reserved.