score-based models, diffusion models
TL;DR: Define continuous-time stochastic processes instead of discrete steps, enabling better control and flexibility. The reverse SDE is learned using score matching, and this framework can generalize to various types of noise schedules (like the variance-exploding or variance-preserving SDEs).
In [Song et al. 2019] we discussed the role of Langevin dynamics in score matching.
To recap, Langevin dynamics consists of a stochastic differential equation (SDE) that describes the evolution of a particle in a potential field.
We can use the score function, which is the gradient of the log-density of the data distribution, and apply Langevin dynamics to it to generate new samples.
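As a minimal sketch of this idea, the loop below runs unadjusted Langevin dynamics on a toy 1-D Gaussian whose score is known in closed form. The function name, step size, and target distribution are illustrative assumptions, not part of the original method:

```python
import numpy as np

def langevin_sample(score_fn, x0, step_size=1e-2, n_steps=1000, rng=None):
    """Unadjusted Langevin dynamics: x <- x + (eps/2) * score(x) + sqrt(eps) * z."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

# Toy target (assumption): N(mu, sigma^2), whose score is (mu - x) / sigma^2.
mu, sigma = 3.0, 0.5
score = lambda x: (mu - x) / sigma**2
samples = langevin_sample(score, x0=np.zeros(5000), n_steps=2000)
print(samples.mean(), samples.std())  # close to mu = 3.0 and sigma = 0.5
```

With a finite step size the chain has a small bias, which is one reason annealed variants with decreasing step sizes are used in practice.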
Let's define a perturbation kernel:
\[
p_\sigma(\mathbf{\hat{x}} \vert \mathbf{x}) := \mathcal{N}(\mathbf{\hat{x}}; \mathbf{x}, \sigma^2 \mathbf{I})
\]
where \( \mathbf{\hat{x}} \) is the perturbed sample, \( \mathbf{x} \) is the clean data point, and \( \sigma \) is the standard deviation of the added Gaussian noise.
Both Score Matching with Langevin Dynamics (SMLD) and Denoising Diffusion Probabilistic Models (DDPM) [Ho et al. 2020] add noise progressively to the data over discrete time steps.
Let's derive the variance-exploding and variance-preserving continuous-time stochastic differential equations (SDEs) for the noise perturbations from their discrete expressions.
Variance Exploding (VE) is used in SMLD and consists of progressively increasing the variance of the noise over time.
This paper leverages VE-SDE to create robust denoising trajectories and stable training.
But how is this done?
We will define two formulations of the stochastic process: one in terms of the standard deviation \( \sigma \) of the noise distribution, and one in terms of its variance scale \( \beta \).
This will allow us to control the noise scale over time with variance-exploding or variance-preserving noise.
First, let's consider a Markov chain of \( N \) steps, each of them containing a noise scale with perturbation kernels \( \{ p_{\sigma_i}(\mathbf{x} \vert \mathbf{x}_0) \}_{i=1}^N \) or, in other words, the probability distributions of the noisy observations \( \mathbf{x} \) given the data point \( \mathbf{x}_0 \),
where \( i \) is the time step index. The state at the \( i \)-th time step is given by the Markov chain:
\[
\mathbf{x}_i = \mathbf{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2} \ \mathbf{z}_{i-1}, \quad i=1, \cdots, N
\tag{Eq. 1} \label{eq:markov_chain_ve}
\]
where \( \mathbf{z}_{i-1} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \) is standard Gaussian noise and the noise scales satisfy \( \sigma_1 < \sigma_2 < \cdots < \sigma_N \), with \( \sigma_0 = 0 \) to simplify the notation.
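Applying the same limiting argument that is used for VP below (Eqs. 4–6), with \( \sigma_i = \sigma(i/N) \), \( \Delta t = 1/N \), and \( N \rightarrow \infty \), the discrete VE chain in Eq. 1 has the continuous-time form:
\[
\mathbf{x}(t + \Delta t) = \mathbf{x}(t) + \sqrt{\sigma^2(t + \Delta t) - \sigma^2(t)} \ \mathbf{z}(t) \approx \mathbf{x}(t) + \sqrt{\frac{\text{d} [\sigma^2(t)]}{\text{d}t} \Delta t} \ \mathbf{z}(t)
\tag{Eq. 2} \label{eq:continuous_time_ve}
\]
so that in the limit \( \Delta t \rightarrow 0 \) we obtain the Variance Exploding SDE (VE-SDE):
\[
\text{d}\mathbf{x} = \sqrt{\frac{\text{d} [\sigma^2(t)]}{\text{d}t}} \ \text{d}\mathbf{w}
\]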
Variance Preserving (VP) is used in DDPM and consists of keeping the overall variance of the perturbed data bounded over time.
For the VE-SDE we worked with \( \sigma \), the standard deviation of the noise distribution. For the VP-SDE, we instead work with \( \beta \), the variance scale of the noise.
In VP-SDE, the Markov chain for the perturbation kernels \( \{ p_{\alpha_i} ( \mathbf{x} \vert \mathbf{x}_0) \}_{i=1}^N \), where \( \alpha_i := \prod_{j=1}^{i} (1 - \beta_j) \), can be written as:
\[
\mathbf{x}_i = \sqrt{1-\beta_i} \ \mathbf{x}_{i-1} + \sqrt{\beta_i} \ \mathbf{z}_{i-1}, \quad i=1, \cdots, N
\tag{Eq. 3} \label{eq:markov_chain_vp}
\]
Eq. 3 is the VP counterpart of Eq. 1, written in terms of the variance scale \( \beta \) of the noise distribution.
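The variance-preserving property of Eq. 3 is easy to check numerically: if \( \mathbf{x}_{i-1} \) has unit variance, then \( \mathbf{x}_i \) has variance \( (1-\beta_i) \cdot 1 + \beta_i = 1 \). A small simulation (the linear \( \beta \) schedule is an assumption, in the spirit of typical DDPM settings):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
betas = np.linspace(1e-4, 0.02, N)  # assumed linear schedule, DDPM-style

x = rng.standard_normal(100_000)    # unit-variance "data"
for beta in betas:
    z = rng.standard_normal(x.shape)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * z  # Eq. 3

print(x.var())  # stays close to 1: the chain preserves variance
```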
As we did with VE-SDE, we would like to obtain the continuous time equation for VP-SDE.
First, let's again consider a Markov chain of \( N \) steps and take the limit \( N \rightarrow \infty \). So that this limit is well defined, we introduce a set of auxiliary noise scales \( \{ \hat{\beta}_i = N \beta_i \}_{i=1}^N \), which remain bounded as \( N \) grows.
We can rewrite Eq. 3 as:
\[
\mathbf{x}_i = \sqrt{1-\frac{\hat{\beta}_i}{N}} \ \mathbf{x}_{i-1} + \sqrt{\frac{\hat{\beta}_i}{N}} \ \mathbf{z}_{i-1}, \quad i=1, \cdots, N
\tag{Eq. 4} \label{eq:markov_chain_vp_aux}
\]
In the limit \( N \rightarrow \infty \), with \( t = i/N \) and \( \Delta t = 1/N \), the auxiliary scales \( \hat{\beta}_i \) become a continuous function \( \beta(t) \), and \( \mathbf{x}_i \) and \( \mathbf{z}_i \) become \( \mathbf{x}(t) \) and \( \mathbf{z}(t) \).
We can rewrite Eq. 4 as:
\[
\begin{align}
\mathbf{x}(t + \Delta t) &= \sqrt{1 - \beta(t + \Delta t) \Delta t} \ \mathbf{x}(t) + \sqrt{\beta(t + \Delta t) \Delta t} \ \mathbf{z}(t) \tag{Eq. 5.1}\\
&\overset{\text{1st-order Taylor expansion}}{\approx} \mathbf{x}(t) - \frac{1}{2} \beta(t + \Delta t) \Delta t \ \mathbf{x}(t) + \sqrt{\beta(t + \Delta t) \Delta t} \ \mathbf{z}(t) \tag{Eq. 5.2}\\
&\overset{\text{if } \Delta t \text{ is small, } \beta(t+\Delta t) \approx \beta(t)}{\approx} \mathbf{x}(t) - \frac{1}{2} \beta(t) \Delta t \ \mathbf{x}(t) + \sqrt{\beta(t) \Delta t} \ \mathbf{z}(t)
\tag{Eq. 5.3} \label{eq:continuous_time_vp}
\end{align}
\]
where \( \sqrt{\beta(t) \Delta t} \ \mathbf{z}(t) \) is the stochastic noise increment added at time \( t \), with \( \mathbf{z}(t) \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \).
The limit when \( \Delta t \rightarrow 0 \) of Eq. 5.3 is the Variance Preserving Stochastic Differential Equation (VP-SDE) expressed as:
\[
\text{d} \mathbf{x} = -\frac{1}{2} \beta(t) \mathbf{x} \text{d}t + \sqrt{\beta(t)} \text{d} \mathbf{w}(t)
\tag{Eq. 6} \label{eq:vp_sde}
\]
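A quick sanity check of Eq. 6: simulating the VP-SDE with Euler–Maruyama drives any initial value toward a standard normal while keeping the variance bounded. The linear \( \beta(t) \) schedule below, with \( \beta_{\min} = 0.1 \) and \( \beta_{\max} = 20 \), is a commonly used choice; treat the exact values as assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
beta_min, beta_max = 0.1, 20.0                  # assumed linear schedule
beta = lambda t: beta_min + t * (beta_max - beta_min)

n_steps, n_paths = 1000, 50_000
dt = 1.0 / n_steps
x = np.full(n_paths, 2.0)                       # all paths start at x0 = 2

# Euler-Maruyama for dx = -1/2 beta(t) x dt + sqrt(beta(t)) dw
for i in range(n_steps):
    t = i * dt
    z = rng.standard_normal(n_paths)
    x = x - 0.5 * beta(t) * x * dt + np.sqrt(beta(t) * dt) * z

print(x.mean(), x.var())  # drifts toward N(0, 1): mean ~ 0, variance ~ 1
```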
| | VE-SDE | VP-SDE |
|---|---|---|
| Discrete Markov chain | \( \mathbf{x}_i = \mathbf{x}_{i-1} + \sqrt{\sigma_i^2 - \sigma_{i-1}^2} \ \mathbf{z}_{i-1} \) | \( \mathbf{x}_i = \sqrt{1 - \beta_i} \ \mathbf{x}_{i-1} + \sqrt{\beta_i} \ \mathbf{z}_{i-1} \) |
| Continuous-time form | \( \mathbf{x}(t + \Delta t) = \mathbf{x}(t) + \sqrt{\sigma^2(t + \Delta t) - \sigma^2(t)} \ \mathbf{z}(t) \) | \( \mathbf{x}(t + \Delta t) = \mathbf{x}(t) - \frac{1}{2} \beta(t) \Delta t \ \mathbf{x}(t) + \sqrt{\beta(t) \Delta t} \ \mathbf{z}(t) \) |
| SDE | \( \text{d} \mathbf{x} = \sqrt{\frac{\text{d} [\sigma^2(t)]}{\text{d}t}} \ \text{d}\mathbf{w} \) | \( \text{d} \mathbf{x} = -\frac{1}{2} \beta(t) \mathbf{x} \ \text{d}t + \sqrt{\beta(t)} \ \text{d} \mathbf{w}(t) \) |
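The same kind of numerical check works for the VE-SDE: with a geometric \( \sigma(t) \) schedule (an assumption, in the style of SMLD), the standard deviation of \( \mathbf{x}(t) \) should track \( \sigma(t) \):

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_min, sigma_max = 0.01, 50.0  # assumed geometric sigma schedule
sigma = lambda t: sigma_min * (sigma_max / sigma_min) ** t

n_steps, n_paths = 1000, 50_000
dt = 1.0 / n_steps
x = np.zeros(n_paths)  # point-mass "data" at 0

# Discretize dx = sqrt(d[sigma^2]/dt) dw using the exact per-step variance
# sigma^2(t + dt) - sigma^2(t), which matches the discrete chain in Eq. 1
for i in range(n_steps):
    t = i * dt
    dvar = sigma(t + dt) ** 2 - sigma(t) ** 2
    x = x + np.sqrt(dvar) * rng.standard_normal(n_paths)

print(x.std(), sigma(1.0))  # std of x(1) matches sigma(1) = sigma_max
```

The per-step variances telescope, so the variance of \( \mathbf{x}(t) \) literally "explodes" up to \( \sigma^2(t) \), in contrast to the bounded VP dynamics.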
The reverse-time SDE [Anderson 1982] is given by:
\[
\text{d}\mathbf{x} = \left\{ \mathbf{f}(\mathbf{x},t) - \nabla \cdot [\mathbf{G}(\mathbf{x},t) \mathbf{G}(\mathbf{x},t)^\text{T}] - \mathbf{G}(\mathbf{x},t) \mathbf{G}(\mathbf{x},t)^\text{T} \nabla_\mathbf{x} \log p_t(\mathbf{x}) \right\} \text{d}t + \mathbf{G}(\mathbf{x},t) \text{d}\mathbf{\tilde{w}}
\]
One of the main contributions of this paper is the derivation of the Probability Flow Ordinary Differential Equation (PF-ODE), a deterministic ODE whose trajectories share the marginal densities of the diffusion process. We will demonstrate that the following probability flow ODE (Eq. 7) induces the same marginal probability density \( p_t(\mathbf{x}) \) as the forward SDE \( \text{d}\mathbf{x} = \mathbf{f}(\mathbf{x},t) \, \text{d}t + \mathbf{G}(\mathbf{x},t) \, \text{d}\mathbf{w} \).
\[
\text{d}\mathbf{x} = \left\{ \mathbf{f}(\mathbf{x},t) - \frac{1}{2} \nabla \cdot [\mathbf{G}(\mathbf{x},t) \mathbf{G}(\mathbf{x},t)^\text{T}] - \frac{1}{2} \mathbf{G}(\mathbf{x},t) \mathbf{G}(\mathbf{x},t)^\text{T} \nabla_\mathbf{x} \log p_t(\mathbf{x}) \right\} \text{d}t
\tag{Eq. 7} \label{eq:ode}
\]
where \( \mathbf{f}(\mathbf{x},t) \) is the drift coefficient, \( \mathbf{G}(\mathbf{x},t) \) is the diffusion coefficient, \( \nabla_\mathbf{x} \log p_t(\mathbf{x}) \) is the score of the marginal density \( p_t \), and \( \mathbf{\tilde{w}} \) is a standard Wiener process with time flowing backwards.
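To make Eq. 7 concrete, here is a hypothetical 1-D example where the score is available in closed form: Gaussian data under the VP-SDE, so \( \mathbf{f}(x,t) = -\frac{1}{2}\beta(t)x \), \( \mathbf{G} = \sqrt{\beta(t)} \), and the divergence term vanishes because \( \mathbf{G} \) does not depend on \( \mathbf{x} \). Integrating the PF-ODE backwards from the prior deterministically recovers the data distribution; the data parameters and schedule values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
beta_min, beta_max = 0.1, 20.0
beta = lambda t: beta_min + t * (beta_max - beta_min)
B = lambda t: beta_min * t + 0.5 * (beta_max - beta_min) * t**2  # integral of beta

m0, s0 = 1.5, 0.3  # hypothetical Gaussian data distribution N(m0, s0^2)

def score(x, t):
    """Closed-form score of p_t when the data is N(m0, s0^2) under the VP-SDE."""
    a = np.exp(-0.5 * B(t))              # signal scaling at time t
    var = s0**2 * a**2 + 1.0 - a**2      # marginal variance at time t
    return -(x - m0 * a) / var

n_steps, n_paths = 1000, 50_000
dt = 1.0 / n_steps
x = rng.standard_normal(n_paths)         # prior samples ~ N(0, 1) at t = 1

# Integrate the PF-ODE dx = -1/2 beta(t) (x + score(x, t)) dt from t=1 to t=0
for i in range(n_steps, 0, -1):
    t = i * dt
    x = x + 0.5 * beta(t) * (x + score(x, t)) * dt

print(x.mean(), x.std())  # deterministically recovers ~N(1.5, 0.3^2)
```

Because every trajectory is deterministic given its starting point, the same map can be run forwards and backwards, which is what enables exact likelihood computation and latent-space manipulation in the paper.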