Diffusion Models
Introduction
In the previous lessons, we studied Variational Autoencoders (VAEs) which have a solid probabilistic foundation but rely on approximating data likelihood using a lower bound (ELBO) because exact computation is intractable.
This lesson introduces diffusion models. They define a mapping between latent variables and observed data in which the two have exactly the same dimension. Like VAEs, they approximate the data likelihood using an ELBO. However, diffusion models have a crucial difference: the encoder is entirely predetermined rather than learned. It works by gradually blending the data with white noise until only noise remains. The decoder then learns to invert this process, step by step.
The result is a generative model that is relatively easy to train and can produce samples that often exceed the realism achieved by other generative models.
Part 1 — Overview of Diffusion Models
A diffusion model consists of an encoder and a decoder. The encoder takes a data sample \(\mathbf{x}\) and maps it through a series of intermediate latent variables \(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_T\). The decoder reverses this process: it starts with \(\mathbf{z}_T\) and maps back through \(\mathbf{z}_{T-1}, \ldots, \mathbf{z}_1\) until it finally (re-)creates a data point \(\mathbf{x}\). In both the encoder and decoder, the mappings are stochastic rather than deterministic.
1.1 The Forward Process (Encoder)
The encoder is prespecified — it has no learnable parameters. It gradually blends the input with samples of white noise over a sequence of steps:
\[ \mathbf{x} \xrightarrow{\text{add noise}} \mathbf{z}_1 \xrightarrow{\text{add noise}} \mathbf{z}_2 \xrightarrow{\text{add noise}} \cdots \xrightarrow{\text{add noise}} \mathbf{z}_T \]
With enough steps \(T\), the conditional distribution \(q(\mathbf{z}_T | \mathbf{x})\) and the marginal distribution \(q(\mathbf{z}_T)\) both become the standard normal distribution. In other words, after sufficient diffusion, all traces of the original data are erased.
1.2 The Reverse Process (Decoder)
The decoder is the part we learn. Its goal is to act as the exact inverse of the forward process:
\[ \mathbf{z}_T \xrightarrow{\text{denoise}} \mathbf{z}_{T-1} \xrightarrow{\text{denoise}} \cdots \xrightarrow{\text{denoise}} \mathbf{z}_1 \xrightarrow{\text{denoise}} \mathbf{x} \]
A series of networks are trained to map backward between each adjacent pair of latent variables \(\mathbf{z}_t\) and \(\mathbf{z}_{t-1}\). The loss function encourages each network to invert the corresponding encoder step. Noise is gradually removed from the representation until a realistic-looking data example remains.
To generate a new data example \(\mathbf{x}\), we draw a sample from \(q(\mathbf{z}_T) = \text{Norm}[\mathbf{0}, \mathbf{I}]\) and pass it through the decoder.
Diffusion Model: A generative model consisting of a predetermined encoder (forward/diffusion process) that gradually adds noise to data, and a learned decoder (reverse process) that removes noise to reconstruct data. Since the encoder has no learnable parameters, all learned parameters reside in the decoder. Generation proceeds by sampling pure noise and iteratively denoising.
Part 2 — The Forward Process (Encoder) in Detail
The diffusion or forward process maps a data example \(\mathbf{x}\) through a series of intermediate variables \(\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_T\), each with the same size as \(\mathbf{x}\).
2.1 The Noise-Adding Rule
At each step, we attenuate the current signal and add fresh noise:
\[ \mathbf{z}_1 = \sqrt{1 - \beta_1} \cdot \mathbf{x} + \sqrt{\beta_1} \cdot \boldsymbol{\epsilon}_1 \]
\[ \mathbf{z}_t = \sqrt{1 - \beta_t} \cdot \mathbf{z}_{t-1} + \sqrt{\beta_t} \cdot \boldsymbol{\epsilon}_t \qquad \forall\; t \in \{2, \ldots, T\} \]
where \(\boldsymbol{\epsilon}_t\) is noise drawn from a standard normal distribution \(\text{Norm}[\mathbf{0}, \mathbf{I}]\). The first term attenuates the data (plus any noise added so far), and the second term adds more noise.
The hyperparameters \(\beta_t \in [0, 1]\) determine how quickly the noise is blended and are collectively known as the noise schedule.
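The noise-adding rule above can be sketched in a few lines of NumPy. This is a minimal illustration: the 4-dimensional toy data, the linear schedule, and the helper name `forward_step` are assumptions for the example, not part of any particular implementation.

```python
import numpy as np

def forward_step(z_prev, beta_t, rng):
    """One forward step: attenuate the current signal, then add fresh noise."""
    eps = rng.standard_normal(z_prev.shape)          # eps_t ~ Norm[0, I]
    return np.sqrt(1.0 - beta_t) * z_prev + np.sqrt(beta_t) * eps

rng = np.random.default_rng(0)
x = rng.standard_normal(4)               # a toy 4-dimensional "data" vector
betas = np.linspace(1e-4, 0.02, 1000)    # an assumed linear noise schedule

z = forward_step(x, betas[0], rng)       # z_1
for beta_t in betas[1:]:
    z = forward_step(z, beta_t, rng)     # z_2, ..., z_T
```

After all \(T\) steps, `z` is statistically indistinguishable from a draw from the standard normal distribution.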
2.2 Probabilistic Form
The forward process can equivalently be written as conditional normal distributions:
\[ q(\mathbf{z}_1 | \mathbf{x}) = \text{Norm}_{\mathbf{z}_1}\!\left[\sqrt{1 - \beta_1}\, \mathbf{x},\; \beta_1 \mathbf{I}\right] \]
\[ q(\mathbf{z}_t | \mathbf{z}_{t-1}) = \text{Norm}_{\mathbf{z}_t}\!\left[\sqrt{1 - \beta_t}\, \mathbf{z}_{t-1},\; \beta_t \mathbf{I}\right] \qquad \forall\; t \in \{2, \ldots, T\} \]
This is a Markov chain because the probability of \(\mathbf{z}_t\) is determined entirely by the value of the immediately preceding variable \(\mathbf{z}_{t-1}\). With sufficient steps \(T\), all traces of the original data are removed, and \(q(\mathbf{z}_T | \mathbf{x}) = q(\mathbf{z}_T)\) becomes a standard normal distribution.
Markov Chain: A sequence of random variables where each variable depends only on its immediate predecessor. The forward diffusion process \(q(\mathbf{z}_1, \ldots, \mathbf{z}_T | \mathbf{x})\) factorises as:
\[ q(\mathbf{z}_{1\ldots T} | \mathbf{x}) = q(\mathbf{z}_1 | \mathbf{x}) \prod_{t=2}^{T} q(\mathbf{z}_t | \mathbf{z}_{t-1}) \]
2.3 The Diffusion Kernel \(q(\mathbf{z}_t | \mathbf{x})\)
To train the decoder to invert the forward process, we need multiple samples \(\mathbf{z}_t\) at time \(t\) for the same example \(\mathbf{x}\). Generating these sequentially using the step-by-step rule is time-consuming when \(t\) is large.
Fortunately, there is a closed-form expression for \(q(\mathbf{z}_t | \mathbf{x})\), known as the diffusion kernel, which allows us to directly draw samples \(\mathbf{z}_t\) given initial data point \(\mathbf{x}\) without computing all the intermediate variables \(\mathbf{z}_1, \ldots, \mathbf{z}_{t-1}\).
Derivation
Consider the first two steps of the forward process:
\[ \mathbf{z}_1 = \sqrt{1 - \beta_1} \cdot \mathbf{x} + \sqrt{\beta_1} \cdot \boldsymbol{\epsilon}_1 \]
\[ \mathbf{z}_2 = \sqrt{1 - \beta_2} \cdot \mathbf{z}_1 + \sqrt{\beta_2} \cdot \boldsymbol{\epsilon}_2 \]
Substituting the first equation into the second:
\[ \mathbf{z}_2 = \sqrt{1 - \beta_2}\left(\sqrt{1 - \beta_1} \cdot \mathbf{x} + \sqrt{\beta_1} \cdot \boldsymbol{\epsilon}_1\right) + \sqrt{\beta_2} \cdot \boldsymbol{\epsilon}_2 \]
\[ = \sqrt{(1 - \beta_2)(1 - \beta_1)} \cdot \mathbf{x} + \sqrt{1 - \beta_2 - (1 - \beta_2)(1 - \beta_1)} \cdot \boldsymbol{\epsilon}_1 + \sqrt{\beta_2} \cdot \boldsymbol{\epsilon}_2 \]
The last two terms are independent samples from mean-zero normal distributions with variances \(1 - \beta_2 - (1 - \beta_2)(1 - \beta_1)\) and \(\beta_2\), respectively. Their sum has variance equal to the sum of the component variances, so:
\[ \mathbf{z}_2 = \sqrt{(1 - \beta_2)(1 - \beta_1)} \cdot \mathbf{x} + \sqrt{1 - (1 - \beta_2)(1 - \beta_1)} \cdot \boldsymbol{\epsilon} \]
where \(\boldsymbol{\epsilon}\) is a single sample from \(\text{Norm}[\mathbf{0}, \mathbf{I}]\).
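The variance bookkeeping in this step can be checked numerically for an arbitrary pair of \(\beta\) values (the values 0.3 and 0.5 below are arbitrary examples):

```python
import numpy as np

beta1, beta2 = 0.3, 0.5                          # arbitrary example values
v1 = 1 - beta2 - (1 - beta2) * (1 - beta1)       # variance carried by eps_1
v2 = beta2                                       # variance carried by eps_2
total = 1 - (1 - beta2) * (1 - beta1)            # variance in the collapsed form
assert np.isclose(v1 + v2, total)
```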
General Form
Continuing this telescoping process yields the general diffusion kernel:
\[ \mathbf{z}_t = \sqrt{\alpha_t} \cdot \mathbf{x} + \sqrt{1 - \alpha_t} \cdot \boldsymbol{\epsilon} \]
where \(\alpha_t = \prod_{s=1}^{t}(1 - \beta_s)\) and \(\boldsymbol{\epsilon} \sim \text{Norm}[\mathbf{0}, \mathbf{I}]\).
In probabilistic form:
\[ q(\mathbf{z}_t | \mathbf{x}) = \text{Norm}_{\mathbf{z}_t}\!\left[\sqrt{\alpha_t} \cdot \mathbf{x},\; (1 - \alpha_t)\mathbf{I}\right] \]
For any starting data point \(\mathbf{x}\), the variable \(\mathbf{z}_t\) is normally distributed with a known mean and variance. This means we can directly sample \(\mathbf{z}_t\) for any timestep \(t\) without computing the intermediate variables \(\mathbf{z}_1, \ldots, \mathbf{z}_{t-1}\).
The diffusion kernel \(q(\mathbf{z}_t | \mathbf{x}) = \text{Norm}_{\mathbf{z}_t}[\sqrt{\alpha_t} \cdot \mathbf{x},\, (1 - \alpha_t)\mathbf{I}]\) is the critical computational shortcut that makes diffusion models practical to train. It allows us to jump directly to any timestep \(t\) without simulating the entire chain, by simply computing \(\mathbf{z}_t = \sqrt{\alpha_t} \cdot \mathbf{x} + \sqrt{1 - \alpha_t} \cdot \boldsymbol{\epsilon}\).
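The shortcut is easy to realise in code. The sketch below (with an assumed linear schedule and an 8-dimensional toy example) draws \(\mathbf{z}_{500}\) directly, with no loop over the intermediate timesteps:

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)    # assumed linear noise schedule
alphas = np.cumprod(1.0 - betas)         # alpha_t = prod_{s=1}^{t} (1 - beta_s)

def sample_zt(x, t, rng):
    """Draw z_t directly from q(z_t | x) via the diffusion kernel (t is 1-indexed)."""
    eps = rng.standard_normal(x.shape)
    return np.sqrt(alphas[t - 1]) * x + np.sqrt(1.0 - alphas[t - 1]) * eps

rng = np.random.default_rng(1)
x = rng.standard_normal(8)
z500 = sample_zt(x, 500, rng)            # jump straight to timestep 500
```

Note that `alphas` decreases monotonically towards zero, so the signal fraction \(\sqrt{\alpha_t}\) shrinks and the noise fraction \(\sqrt{1-\alpha_t}\) grows as \(t\) increases.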
2.4 Marginal Distributions \(q(\mathbf{z}_t)\)
The marginal distribution \(q(\mathbf{z}_t)\) is the probability of observing a value of \(\mathbf{z}_t\) given the distribution of possible starting points \(\mathbf{x}\) and the possible diffusion paths for each starting point. It can be computed by marginalising over \(\mathbf{x}\):
\[ q(\mathbf{z}_t) = \int q(\mathbf{z}_t | \mathbf{x})\, Pr(\mathbf{x})\, d\mathbf{x} \]
If we repeatedly sample from the data distribution \(Pr(\mathbf{x})\) and superimpose the diffusion kernel \(q(\mathbf{z}_t | \mathbf{x})\) on each sample, the result is the marginal distribution \(q(\mathbf{z}_t)\). The marginal distribution cannot be written in closed form because we do not know the original data distribution \(Pr(\mathbf{x})\).
2.5 Conditional Distribution \(q(\mathbf{z}_{t-1} | \mathbf{z}_t)\)
To reverse the diffusion process, we need \(q(\mathbf{z}_{t-1} | \mathbf{z}_t)\) — the distribution over the previous step given the current step. Applying Bayes’ rule:
\[ q(\mathbf{z}_{t-1} | \mathbf{z}_t) = \frac{q(\mathbf{z}_t | \mathbf{z}_{t-1})\, q(\mathbf{z}_{t-1})}{q(\mathbf{z}_t)} \]
This is intractable since we cannot compute the marginal distribution \(q(\mathbf{z}_{t-1})\) in closed form. However, in many cases, these reverse conditionals are well-approximated by a normal distribution. This is important because when we build the decoder, we will approximate the reverse process using a normal distribution.
2.6 Conditional Diffusion Distribution \(q(\mathbf{z}_{t-1} | \mathbf{z}_t, \mathbf{x})\)
Although we cannot compute \(q(\mathbf{z}_{t-1} | \mathbf{z}_t)\), if we know the starting variable \(\mathbf{x}\), then we can compute \(q(\mathbf{z}_{t-1} | \mathbf{z}_t, \mathbf{x})\) in closed form. Starting from Bayes’ rule:
\[ q(\mathbf{z}_{t-1} | \mathbf{z}_t, \mathbf{x}) = \frac{q(\mathbf{z}_t | \mathbf{z}_{t-1}, \mathbf{x})\, q(\mathbf{z}_{t-1} | \mathbf{x})}{q(\mathbf{z}_t | \mathbf{x})} \]
Since the diffusion process is Markov, \(q(\mathbf{z}_t | \mathbf{z}_{t-1}, \mathbf{x}) = q(\mathbf{z}_t | \mathbf{z}_{t-1})\), and both \(q(\mathbf{z}_{t-1} | \mathbf{x})\) and \(q(\mathbf{z}_t | \mathbf{x})\) are known from the diffusion kernel. Combining two normal distributions using the Gaussian change of variables identity yields:
\[ q(\mathbf{z}_{t-1} | \mathbf{z}_t, \mathbf{x}) = \text{Norm}_{\mathbf{z}_{t-1}}\!\left[\frac{(1 - \alpha_{t-1})}{1 - \alpha_t}\sqrt{1 - \beta_t}\,\mathbf{z}_t + \frac{\sqrt{\alpha_{t-1}}\,\beta_t}{1 - \alpha_t}\,\mathbf{x},\;\; \frac{\beta_t(1 - \alpha_{t-1})}{1 - \alpha_t}\,\mathbf{I}\right] \]
This distribution is used to train the decoder: it is the target that the decoder must learn to approximate at each step.
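The closed-form mean and variance translate directly into code. A minimal sketch (with an assumed linear schedule and a hypothetical helper name `posterior_mean_var`):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas = np.cumprod(1.0 - betas)

def posterior_mean_var(z_t, x, t):
    """Mean and variance of q(z_{t-1} | z_t, x) for t >= 2 (t is 1-indexed)."""
    b_t = betas[t - 1]
    a_t, a_prev = alphas[t - 1], alphas[t - 2]
    mean = ((1 - a_prev) / (1 - a_t)) * np.sqrt(1 - b_t) * z_t \
         + (np.sqrt(a_prev) * b_t / (1 - a_t)) * x
    var = b_t * (1 - a_prev) / (1 - a_t)
    return mean, var

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
t = 500
eps = rng.standard_normal(4)
z_t = np.sqrt(alphas[t - 1]) * x + np.sqrt(1 - alphas[t - 1]) * eps
mean, var = posterior_mean_var(z_t, x, t)
```

Because \(\alpha_{t-1} > \alpha_t\), the posterior variance is always smaller than the forward-step variance \(\beta_t\).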
| Distribution | Formula | Computable? |
|---|---|---|
| Forward step \(q(\mathbf{z}_t \mid \mathbf{z}_{t-1})\) | \(\text{Norm}[\sqrt{1-\beta_t}\,\mathbf{z}_{t-1},\, \beta_t \mathbf{I}]\) | Yes (by definition) |
| Diffusion kernel \(q(\mathbf{z}_t \mid \mathbf{x})\) | \(\text{Norm}[\sqrt{\alpha_t}\,\mathbf{x},\, (1-\alpha_t)\mathbf{I}]\) | Yes (closed form) |
| Marginal \(q(\mathbf{z}_t)\) | \(\int q(\mathbf{z}_t \mid \mathbf{x})\,Pr(\mathbf{x})\,d\mathbf{x}\) | No (unknown \(Pr(\mathbf{x})\)) |
| Reverse \(q(\mathbf{z}_{t-1} \mid \mathbf{z}_t)\) | Bayes’ rule involving \(q(\mathbf{z}_t)\) | No (intractable) |
| Conditional reverse \(q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x})\) | Gaussian (closed form) | Yes (used for training) |
Part 3 — The Decoder (Reverse Process)
When we learn a diffusion model, we learn the reverse process. We learn a series of probabilistic mappings back from latent variable \(\mathbf{z}_T\) to \(\mathbf{z}_{T-1}\), from \(\mathbf{z}_{T-1}\) to \(\mathbf{z}_{T-2}\), and so on, until we reach the data \(\mathbf{x}\).
The true reverse distributions \(q(\mathbf{z}_{t-1} | \mathbf{z}_t)\) are complex multi-modal distributions that depend on the data distribution \(Pr(\mathbf{x})\). We approximate these as normal distributions:
\[ Pr(\mathbf{z}_T) = \text{Norm}_{\mathbf{z}_T}[\mathbf{0}, \mathbf{I}] \]
\[ Pr(\mathbf{z}_{t-1} | \mathbf{z}_t, \boldsymbol{\phi}_t) = \text{Norm}_{\mathbf{z}_{t-1}}\!\left[\mathbf{f}_t[\mathbf{z}_t, \boldsymbol{\phi}_t],\; \sigma_t^2 \mathbf{I}\right] \]
\[ Pr(\mathbf{x} | \mathbf{z}_1, \boldsymbol{\phi}_1) = \text{Norm}_{\mathbf{x}}\!\left[\mathbf{f}_1[\mathbf{z}_1, \boldsymbol{\phi}_1],\; \sigma_1^2 \mathbf{I}\right] \]
where \(\mathbf{f}_t[\mathbf{z}_t, \boldsymbol{\phi}_t]\) is a neural network that computes the mean of the normal distribution in the estimated mapping from \(\mathbf{z}_t\) to the preceding latent variable \(\mathbf{z}_{t-1}\). The variances \(\{\sigma_t^2\}\) are predetermined.
This normal approximation is reasonable when the hyperparameters \(\beta_t\) are close to zero and the number of time steps \(T\) is large.
Ancestral Sampling in Diffusion Models: To generate new examples from \(Pr(\mathbf{x})\), we start by drawing \(\mathbf{z}_T\) from \(Pr(\mathbf{z}_T) = \text{Norm}[\mathbf{0}, \mathbf{I}]\). Then we sample \(\mathbf{z}_{T-1}\) from \(Pr(\mathbf{z}_{T-1} | \mathbf{z}_T, \boldsymbol{\phi}_T)\), sample \(\mathbf{z}_{T-2}\) from \(Pr(\mathbf{z}_{T-2} | \mathbf{z}_{T-1}, \boldsymbol{\phi}_{T-1})\), and so on until we finally generate \(\mathbf{x}\) from \(Pr(\mathbf{x} | \mathbf{z}_1, \boldsymbol{\phi}_1)\).
Part 4 — Training: The Evidence Lower Bound (ELBO)
The training objective for diffusion models follows the same strategy as for VAEs: we maximise a lower bound on the log-likelihood of the data.
4.1 Joint Distribution and Likelihood
The joint distribution of the observed variable \(\mathbf{x}\) and the latent variables \(\{\mathbf{z}_t\}\) is:
\[ Pr(\mathbf{x}, \mathbf{z}_{1\ldots T} | \boldsymbol{\phi}_{1\ldots T}) = Pr(\mathbf{x} | \mathbf{z}_1, \boldsymbol{\phi}_1) \prod_{t=2}^{T} Pr(\mathbf{z}_{t-1} | \mathbf{z}_t, \boldsymbol{\phi}_t) \cdot Pr(\mathbf{z}_T) \]
The likelihood of the observed data \(Pr(\mathbf{x} | \boldsymbol{\phi}_{1\ldots T})\) is found by marginalising over the latent variables:
\[ Pr(\mathbf{x} | \boldsymbol{\phi}_{1\ldots T}) = \int Pr(\mathbf{x}, \mathbf{z}_{1\ldots T} | \boldsymbol{\phi}_{1\ldots T})\, d\mathbf{z}_{1\ldots T} \]
This integral is intractable. We therefore use Jensen’s inequality to define a lower bound, exactly as we did for the VAE.
4.2 Deriving the ELBO
To derive the lower bound, we multiply and divide the log-likelihood by the encoder distribution \(q(\mathbf{z}_{1\ldots T} | \mathbf{x})\) and apply Jensen’s inequality:
\[ \log\left[Pr(\mathbf{x} | \boldsymbol{\phi}_{1\ldots T})\right] \geq \int q(\mathbf{z}_{1\ldots T} | \mathbf{x}) \log\left[\frac{Pr(\mathbf{x}, \mathbf{z}_{1\ldots T} | \boldsymbol{\phi}_{1\ldots T})}{q(\mathbf{z}_{1\ldots T} | \mathbf{x})}\right] d\mathbf{z}_{1\ldots T} \]
This gives us the Evidence Lower Bound (ELBO):
\[ \text{ELBO}[\boldsymbol{\phi}_{1\ldots T}] = \int q(\mathbf{z}_{1\ldots T} | \mathbf{x}) \log\left[\frac{Pr(\mathbf{x}, \mathbf{z}_{1\ldots T} | \boldsymbol{\phi}_{1\ldots T})}{q(\mathbf{z}_{1\ldots T} | \mathbf{x})}\right] d\mathbf{z}_{1\ldots T} \]
In the VAE, the encoder \(q(\mathbf{z} | \mathbf{x})\) is learned so that it approximates the posterior distribution over the latent variables, which makes the bound tight. In diffusion models, the encoder is fixed and has no parameters, so the decoder must do all the work: its parameters are adjusted both to maximise the bound and to move the model posterior \(Pr(\mathbf{z}_{1\ldots T} | \mathbf{x}, \boldsymbol{\phi}_{1\ldots T})\) towards the static encoder distribution, which tightens the bound.
4.3 Simplifying the ELBO
Through algebraic manipulation — substituting the joint distribution definitions, expanding the encoder product, and applying Bayes’ rule to rewrite the denominator — the ELBO simplifies to:
\[ \text{ELBO}[\boldsymbol{\phi}_{1\ldots T}] \approx \mathbb{E}_{q(\mathbf{z}_1 | \mathbf{x})}\!\left[\log Pr(\mathbf{x} | \mathbf{z}_1, \boldsymbol{\phi}_1)\right] - \sum_{t=2}^{T} \mathbb{E}_{q(\mathbf{z}_t | \mathbf{x})}\!\left[D_{KL}\!\left[q(\mathbf{z}_{t-1} | \mathbf{z}_t, \mathbf{x}) \,\|\, Pr(\mathbf{z}_{t-1} | \mathbf{z}_t, \boldsymbol{\phi}_t)\right]\right] \]
This has a clear interpretation:
- The first term is the reconstruction term: how well does the first step of the decoder reconstruct \(\mathbf{x}\) from \(\mathbf{z}_1\)?
- The second term is a sum of KL divergences: at each intermediate step \(t\), how close is the decoder’s estimate \(Pr(\mathbf{z}_{t-1} | \mathbf{z}_t, \boldsymbol{\phi}_t)\) to the true conditional \(q(\mathbf{z}_{t-1} | \mathbf{z}_t, \mathbf{x})\)?
4.4 The Diffusion Loss Function
To fit the model, we maximise the ELBO with respect to the parameters \(\boldsymbol{\phi}_{1\ldots T}\). Recasting this as a minimisation (multiplying by minus one and approximating expectations with samples) gives the loss function:
\[ L[\boldsymbol{\phi}_{1\ldots T}] = \sum_{i=1}^{I}\left(-\log\!\left[\text{Norm}_{\mathbf{x}_i}\!\left[\mathbf{f}_1[\mathbf{z}_{i1}, \boldsymbol{\phi}_1], \sigma_1^2 \mathbf{I}\right]\right] + \sum_{t=2}^{T} \frac{1}{2\sigma_t^2} \left\|\underbrace{\frac{(1-\alpha_{t-1})}{1-\alpha_t}\sqrt{1-\beta_t}\,\mathbf{z}_{it} + \frac{\sqrt{\alpha_{t-1}}\,\beta_t}{1-\alpha_t}\,\mathbf{x}_i}_{\text{target: mean of } q(\mathbf{z}_{t-1}|\mathbf{z}_t, \mathbf{x})} - \underbrace{\mathbf{f}_t[\mathbf{z}_{it}, \boldsymbol{\phi}_t]}_{\text{predicted } \mathbf{z}_{t-1}}\right\|^2\right) \]
where \(\mathbf{x}_i\) is the \(i^{\text{th}}\) data point, and \(\mathbf{z}_{it}\) is the associated latent variable at diffusion step \(t\).
The KL divergence between two normal distributions has a closed-form expression, and many of the terms do not depend on \(\boldsymbol{\phi}\), so the expression simplifies to the squared difference between the means plus a constant.
Part 5 — Reparameterization of the Loss Function
Although the loss function derived above can be used directly, diffusion models work better with a different parameterization. Instead of predicting the previous latent variable \(\mathbf{z}_{t-1}\), the model is reparameterized to predict the noise \(\boldsymbol{\epsilon}\) that was mixed with the original data example to create the current variable \(\mathbf{z}_t\).
5.1 Reparameterizing the Target
Recall the diffusion kernel:
\[ \mathbf{z}_t = \sqrt{\alpha_t} \cdot \mathbf{x} + \sqrt{1 - \alpha_t} \cdot \boldsymbol{\epsilon} \]
Solving for \(\mathbf{x}\):
\[ \mathbf{x} = \frac{1}{\sqrt{\alpha_t}} \cdot \mathbf{z}_t - \frac{\sqrt{1 - \alpha_t}}{\sqrt{\alpha_t}} \cdot \boldsymbol{\epsilon} \]
Substituting this into the target term of the loss function (the mean of \(q(\mathbf{z}_{t-1} | \mathbf{z}_t, \mathbf{x})\)) and simplifying, the target becomes:
\[ \frac{1}{\sqrt{1-\beta_t}}\,\mathbf{z}_t - \frac{\beta_t}{\sqrt{1-\alpha_t}\sqrt{1-\beta_t}}\,\boldsymbol{\epsilon} \]
The target is now expressed purely in terms of \(\mathbf{z}_t\) (which the model receives as input) and \(\boldsymbol{\epsilon}\) (the noise that was added).
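The equivalence of the two target expressions can be verified numerically (a quick check using an assumed linear schedule and an arbitrary timestep \(t = 500\)):

```python
import numpy as np

rng = np.random.default_rng(5)
betas = np.linspace(1e-4, 0.02, 1000)
alphas = np.cumprod(1.0 - betas)
t = 500
b, a, a_prev = betas[t - 1], alphas[t - 1], alphas[t - 2]

x = rng.standard_normal(4)
eps = rng.standard_normal(4)
z_t = np.sqrt(a) * x + np.sqrt(1 - a) * eps      # diffusion kernel sample

# original target: the mean of q(z_{t-1} | z_t, x)
orig = ((1 - a_prev) / (1 - a)) * np.sqrt(1 - b) * z_t \
     + (np.sqrt(a_prev) * b / (1 - a)) * x
# reparameterised target, written purely in terms of z_t and eps
repar = z_t / np.sqrt(1 - b) - b / (np.sqrt(1 - a) * np.sqrt(1 - b)) * eps

assert np.allclose(orig, repar)
```

The check relies on the identity \(\alpha_t = \alpha_{t-1}(1 - \beta_t)\), which holds exactly for the cumulative product.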
5.2 Reparameterizing the Network
We now replace the model \(\hat{\mathbf{z}}_{t-1} = \mathbf{f}_t[\mathbf{z}_t, \boldsymbol{\phi}_t]\) with a new model \(\hat{\boldsymbol{\epsilon}} = \mathbf{g}_t[\mathbf{z}_t, \boldsymbol{\phi}_t]\) which predicts the noise \(\boldsymbol{\epsilon}\) that was mixed with \(\mathbf{x}\) to create \(\mathbf{z}_t\):
\[ \mathbf{f}_t[\mathbf{z}_t, \boldsymbol{\phi}_t] = \frac{1}{\sqrt{1 - \beta_t}}\,\mathbf{z}_t - \frac{\beta_t}{\sqrt{1 - \alpha_t}\sqrt{1 - \beta_t}}\,\mathbf{g}_t[\mathbf{z}_t, \boldsymbol{\phi}_t] \]
Substituting into the loss function and dropping the time-dependent scaling factors (weighting every timestep equally works well in practice) yields the remarkably simple final loss function:
\[ L[\boldsymbol{\phi}_{1\ldots T}] = \sum_{i=1}^{I} \sum_{t=1}^{T} \left\|\mathbf{g}_t[\mathbf{z}_{it}, \boldsymbol{\phi}_t] - \boldsymbol{\epsilon}_{it}\right\|^2 \]
where \(\mathbf{z}_{it} = \sqrt{\alpha_t} \cdot \mathbf{x}_i + \sqrt{1 - \alpha_t} \cdot \boldsymbol{\epsilon}_{it}\) using the diffusion kernel.
The entire probabilistic training objective — derived from the ELBO through KL divergences between normal distributions — simplifies to a least-squares noise prediction loss. The network simply receives a noisy image \(\mathbf{z}_t\) and predicts the noise \(\boldsymbol{\epsilon}\) that was added. This is why diffusion models are sometimes called denoising models.
5.3 Training Algorithm
The reparameterized loss leads to an elegant training procedure:
Input: Training data \(\mathbf{x}\)
Output: Model parameters \(\boldsymbol{\phi}_t\)
repeat
- for each training example \(i\) in batch do
- \(\quad\) Sample random timestep: \(t \sim \text{Uniform}\{1, \ldots, T\}\)
- \(\quad\) Sample noise: \(\boldsymbol{\epsilon} \sim \text{Norm}[\mathbf{0}, \mathbf{I}]\)
- \(\quad\) Compute individual loss: \(\ell_i = \left\|\mathbf{g}_t\!\left[\sqrt{\alpha_t}\,\mathbf{x}_i + \sqrt{1-\alpha_t}\,\boldsymbol{\epsilon},\; \boldsymbol{\phi}_t\right] - \boldsymbol{\epsilon}\right\|^2\)
- Accumulate losses for batch and take gradient step
until converged
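The procedure above can be sketched as a toy training loop. To keep the example self-contained, the network \(\mathbf{g}_t\) is replaced by a single linear map `W`; this stand-in, the toy data, and the learning rate are all illustrative assumptions, whereas a real implementation would use a U-Net conditioned on \(t\).

```python
import numpy as np

rng = np.random.default_rng(2)
T, dim = 1000, 16
betas = np.linspace(1e-4, 0.02, T)
alphas = np.cumprod(1.0 - betas)

# Hypothetical stand-in for the network g_t[z, phi_t]: a single linear map.
W = np.zeros((dim, dim))

def g(z, t):
    return z @ W                          # predicted noise

lr = 1e-3
data = rng.standard_normal((256, dim))    # toy training set

for step in range(200):
    x = data[rng.integers(0, len(data))]  # pick a training example
    t = int(rng.integers(1, T + 1))       # t ~ Uniform{1, ..., T}
    eps = rng.standard_normal(dim)        # eps ~ Norm[0, I]
    z_t = np.sqrt(alphas[t - 1]) * x + np.sqrt(1 - alphas[t - 1]) * eps
    resid = g(z_t, t) - eps               # g_t[z_t, phi_t] - eps
    loss = float(np.sum(resid ** 2))      # per-example noise-prediction loss
    W -= lr * 2 * np.outer(z_t, resid)    # gradient step on the linear model
```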
5.4 Sampling Algorithm
Once trained, generating new samples proceeds as follows:
Input: Model \(\mathbf{g}_t[\bullet, \boldsymbol{\phi}_t]\)
Output: Sample \(\mathbf{x}\)
- Sample last latent variable: \(\mathbf{z}_T \sim \text{Norm}_{\mathbf{z}}[\mathbf{0}, \mathbf{I}]\)
- for \(t = T \ldots 2\) do
- \(\quad\) Predict previous latent variable: \(\hat{\mathbf{z}}_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\,\mathbf{z}_t - \frac{\beta_t}{\sqrt{1-\alpha_t}\sqrt{1-\beta_t}}\,\mathbf{g}_t[\mathbf{z}_t, \boldsymbol{\phi}_t]\)
- \(\quad\) Draw new noise vector: \(\boldsymbol{\epsilon} \sim \text{Norm}_{\boldsymbol{\epsilon}}[\mathbf{0}, \mathbf{I}]\)
- \(\quad\) Add noise to previous latent variable: \(\mathbf{z}_{t-1} = \hat{\mathbf{z}}_{t-1} + \sigma_t\,\boldsymbol{\epsilon}\)
- Generate sample from \(\mathbf{z}_1\) without noise: \(\mathbf{x} = \frac{1}{\sqrt{1-\beta_1}}\,\mathbf{z}_1 - \frac{\beta_1}{\sqrt{1-\alpha_1}\sqrt{1-\beta_1}}\,\mathbf{g}_1[\mathbf{z}_1, \boldsymbol{\phi}_1]\)
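The sampling algorithm can be sketched directly from these formulas. In this self-contained illustration the trained network is replaced by a zero-returning stand-in, and \(\sigma_t^2 = \beta_t\) is one common (assumed) choice for the predetermined variances:

```python
import numpy as np

rng = np.random.default_rng(3)
T, dim = 1000, 16
betas = np.linspace(1e-4, 0.02, T)
alphas = np.cumprod(1.0 - betas)
sigmas = np.sqrt(betas)                  # one common choice: sigma_t^2 = beta_t

def g(z, t):
    """Stand-in for the trained noise-prediction network g_t[z, phi_t]."""
    return np.zeros_like(z)

z = rng.standard_normal(dim)             # z_T ~ Norm[0, I]
for t in range(T, 1, -1):                # t = T, ..., 2
    b, a = betas[t - 1], alphas[t - 1]
    z_hat = z / np.sqrt(1 - b) - b / (np.sqrt(1 - a) * np.sqrt(1 - b)) * g(z, t)
    z = z_hat + sigmas[t - 1] * rng.standard_normal(dim)   # re-inject noise
b, a = betas[0], alphas[0]               # final step t = 1: no noise added
x = z / np.sqrt(1 - b) - b / (np.sqrt(1 - a) * np.sqrt(1 - b)) * g(z, 1)
```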
Part 6 — Implementation: The U-Net Architecture
Diffusion models have been very successful in modelling image data. At each step, we need a model that can take a noisy image and predict the noise that was added. The natural architectural choice for this image-to-image mapping is the U-Net.
6.1 U-Net Structure
The U-Net consists of two phases connected by skip connections:
- Encoder phase: Convolutional layers progressively reduce the spatial scale of the image while increasing the number of feature channels, extracting deep semantic representations.
- Decoder phase: The network increases the spatial scale back to the original image size while reducing the number of channels.
- Skip connections: Direct connections link the encoder and decoder at corresponding spatial resolutions, preserving fine-grained spatial details that might otherwise be lost during compression.
6.2 Residual Blocks and Self-Attention
Modern diffusion U-Nets also integrate residual blocks and global self-attention layers. Connections between adjacent representations consist of residual blocks, and periodic global self-attention layers allow every spatial position to interact with every other, helping the network capture global context in complex images.
6.3 Sinusoidal Time Embeddings
The timestep \(t\) is encoded using sinusoidal time embeddings, conceptually similar to the positional encodings used in Transformers. These embeddings are:
- Computed as fixed sinusoidal functions of the timestep \(t\).
- Linearly transformed (via a learned shallow network) to match the number of channels at each stage of the U-Net.
- Added to the feature channels at various layers of the U-Net.
This injection mechanism allows the model’s behaviour to shift dynamically depending on the current timestep. Early timesteps (high noise) require different denoising strategies than late timesteps (low noise), and the time embedding provides this contextual information.
Time Embedding: A fixed sinusoidal encoding of the timestep \(t\), analogous to positional encodings in Transformers. It is linearly transformed and added to the feature channels at each stage of the U-Net, enabling a single network to handle all denoising steps by conditioning its behaviour on the current timestep.
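A minimal sketch of such an embedding is shown below. The geometric frequency spacing (base 10000) mirrors the Transformer convention; the exact spacing and the helper name `time_embedding` are assumptions of this example, and the learned linear map to each stage's channel count would follow as a separate step.

```python
import numpy as np

def time_embedding(t, dim):
    """Fixed sinusoidal embedding of timestep t (Transformer-style frequencies)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)  # geometric spacing
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

emb = time_embedding(500, 128)   # a 128-channel embedding for t = 500
```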
Part 7 — Advanced Topics and Optimisations
7.1 Accelerated Sampling: Denoising Diffusion Implicit Models (DDIMs)
A major drawback of diffusion models is that sampling requires running the U-Net through all \(T\) steps sequentially (e.g., \(T = 1000\)), making generation slow compared to other generative models.
The key observation is that the loss function (noise prediction) requires only the diffusion kernel \(q(\mathbf{z}_t | \mathbf{x}) = \text{Norm}[\sqrt{\alpha_t}\,\mathbf{x},\, (1-\alpha_t)\mathbf{I}]\). The same loss function is valid for any forward process that satisfies this relation. This gives rise to a family of compatible processes:
- Denoising Diffusion Implicit Models (DDIMs): These are no longer stochastic after the first step from \(\mathbf{x}\) to \(\mathbf{z}_1\). The deterministic reverse process does not add noise at each step.
- Accelerated sampling models: The forward process is defined only on a sub-sequence of time steps, allowing the reverse process to skip steps. Good samples can be generated with as few as 50 time steps.
This is much faster than the original \(T = 1000\) steps, though still slower than most other generative models.
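A deterministic update of this kind can be sketched using the standard DDIM step (not derived in this lesson): the network's noise estimate is used to predict the clean data point, which is then re-noised to the earlier timestep in the sub-sequence. The zero-returning stand-in for the trained network and the 20-step stride are illustrative assumptions.

```python
import numpy as np

T, dim = 1000, 8
betas = np.linspace(1e-4, 0.02, T)
alphas = np.cumprod(1.0 - betas)

def g(z, t):
    """Stand-in for the trained noise-prediction network."""
    return np.zeros_like(z)

def ddim_step(z, t, t_prev):
    """Deterministic update from z_t to z_{t_prev}; both timesteps are 1-indexed."""
    a_t, a_prev = alphas[t - 1], alphas[t_prev - 1]
    eps_hat = g(z, t)
    x_hat = (z - np.sqrt(1 - a_t) * eps_hat) / np.sqrt(a_t)    # predicted clean x
    return np.sqrt(a_prev) * x_hat + np.sqrt(1 - a_prev) * eps_hat

rng = np.random.default_rng(4)
z = rng.standard_normal(dim)             # z_T ~ Norm[0, I]
steps = list(range(T, 0, -20))           # 50 of the 1000 timesteps
for t, t_prev in zip(steps[:-1], steps[1:]):
    z = ddim_step(z, t, t_prev)
```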
7.2 Cascaded Generation for High-Resolution Images
Generating high-resolution images directly is computationally expensive. The cascade approach addresses this:
- A diffusion model generates a low-resolution base image (e.g., \(64 \times 64\)), possibly guided by class information.
- A subsequent super-resolution diffusion model generates a higher-resolution image (e.g., \(256 \times 256\)), conditioned on the low-resolution image and any class/text information.
- This can be repeated to reach even higher resolutions (e.g., \(1024 \times 1024\)).
The conditioning on the lower-resolution image is achieved by resizing it and appending it to the layers of the constituent U-Net.
7.3 Conditional Generation and Guidance
If the data has associated labels \(c\) (e.g., class labels or text captions), these can be exploited to control the generation. Two main approaches exist:
Classifier Guidance
This modifies the denoising update from \(\mathbf{z}_t\) to \(\mathbf{z}_{t-1}\) to take into account class information \(c\). An extra term involving the gradient of a classifier \(Pr(c | \mathbf{z}_t)\) is added to the sampling update:
\[ \mathbf{z}_{t-1} = \hat{\mathbf{z}}_{t-1} + \sigma_t^2 \frac{\partial \log\left[Pr(c | \mathbf{z}_t)\right]}{\partial \mathbf{z}_t} + \sigma_t\,\boldsymbol{\epsilon} \]
The classifier \(Pr(c | \mathbf{z}_t)\) is trained separately on noisy latent variables and steers the update towards making class \(c\) more likely.
Classifier-Free Guidance
This avoids learning a separate classifier by incorporating class information directly into the main model: \(\mathbf{g}_t[\mathbf{z}_t, \boldsymbol{\phi}_t, c]\). In practice, this takes the form of adding an embedding based on \(c\) to the layers of the U-Net, in a similar way to how the time step is added.
The model is jointly trained on conditional and unconditional objectives by randomly dropping the class information during training. At test time, it can generate unconditional data, conditional data, or any weighted combination of the two. Over-weighting the conditioning information tends to produce very high quality but slightly stereotypical examples.
| Property | Classifier Guidance | Classifier-Free Guidance |
|---|---|---|
| Separate classifier needed | Yes (\(Pr(c \mid \mathbf{z}_t)\)) | No |
| How conditioning enters | Gradient of classifier added to sampling step | Embedding of \(c\) added to U-Net layers |
| Training | Train classifier on noisy data separately | Drop conditioning randomly during training |
| Flexibility at test time | Fixed conditioning strength | Adjustable conditioning weight |
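The adjustable conditioning weight of classifier-free guidance amounts to a simple blend of the two noise estimates. One common parameterisation is sketched below (the toy 2-dimensional estimates and the helper name `guided_noise` are assumptions of this example):

```python
import numpy as np

def guided_noise(eps_cond, eps_uncond, weight):
    """Blend conditional and unconditional noise estimates.
    weight = 0 gives unconditional, weight = 1 purely conditional,
    and weight > 1 over-weights the conditioning information."""
    return (1 - weight) * eps_uncond + weight * eps_cond

eps_c = np.array([1.0, 0.0])             # toy conditional estimate
eps_u = np.array([0.0, 0.0])             # toy unconditional estimate
eps = guided_noise(eps_c, eps_u, 2.0)    # extrapolates past the conditional estimate
```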
Summary
In this lesson, we have developed the theory and practice of diffusion models:
- We contextualised diffusion models relative to other generative models such as VAEs, highlighting that diffusion models combine a predetermined encoder with a learned decoder.
- We defined the forward process as a Markov chain that progressively adds noise to data, and derived the diffusion kernel \(q(\mathbf{z}_t | \mathbf{x})\) that allows efficient sampling at any timestep.
- We defined the reverse process (decoder) as a series of normal distributions with learned means, and derived the ELBO training objective.
- We showed how the ELBO simplifies to a least-squares noise prediction loss through reparameterization: the network simply predicts the noise \(\boldsymbol{\epsilon}\) that was added to the data.
- We discussed the U-Net architecture with sinusoidal time embeddings that enables a single network to handle all denoising steps.
- We covered advanced topics: DDIMs and accelerated sampling for faster generation, cascaded generation for high-resolution images, and conditional guidance (classifier and classifier-free) for controlled generation such as text-to-image synthesis.
These ideas underpin modern text-to-image systems such as DALL-E, Stable Diffusion, and Imagen, which combine diffusion models with text encoders to generate high-quality images from natural language descriptions.