Conditional Image Generation with Diffusion Models

Tags: generative-models, diffusion-models, CLIP, classifier-guidance, classifier-free-guidance, text-to-image
An introduction to conditional image generation with diffusion models, covering classifier-guided diffusion, CLIP-guided diffusion, and classifier-free guidance — the techniques that underpin modern text-to-image systems such as DALL-E 2, Stable Diffusion, and Imagen.
Published: March 25, 2026

Abstract

This lesson extends the unconditional diffusion model of Chapter 18 to support conditional image generation. We begin by motivating the need for user control over generated content and then present three complementary approaches. Classifier-guided diffusion trains a noise-aware classifier on noisy latents and uses its gradient to steer the reverse diffusion process toward a desired class. CLIP-guided diffusion replaces the supervised classifier with the gradient of CLIP cosine similarity, enabling guidance toward arbitrary text descriptions without class labels. Classifier-free guidance bakes text conditioning directly into the diffusion U-Net, combining conditional and unconditional predictions at sampling time. Together, these techniques form the backbone of modern text-to-image systems such as DALL-E 2, Stable Diffusion, and Imagen.


Introduction

In the previous lessons, we studied diffusion models — generative models that learn to reverse a gradual noising process to produce high-quality samples — and CLIP — a contrastive model that aligns images and text in a shared embedding space. Diffusion models can generate impressive images, but the model described in Chapter 18 is unconditional: given a sample from the prior \(\mathrm{Pr}(\mathbf{z}_T) = \mathrm{Norm}_{\mathbf{z}_T}[\mathbf{0}, \mathbf{I}]\), the sampling algorithm (Algorithm 18.2) produces a data point \(\mathbf{x}\) drawn from the learned data distribution \(\mathrm{Pr}(\mathbf{x})\). While impressive, this gives the user no control over what is generated.

In practice, we want to condition generation on user input — for example, generating an image of a specific class (e.g., “golden retriever”) or from a free-form text description (e.g., “a golden retriever wearing a beret”). This lesson presents two complementary families of approaches:

  1. Classifier-guided diffusion (Parts 1–2): When training images carry class labels \(c \in \{1, \ldots, K\}\), we train a classifier \(\mathrm{Pr}(c \mid \mathbf{z}_t)\) on noisy latents and use its gradient to steer the reverse diffusion process toward a user-specified class.

  2. CLIP-guided / text-conditioned diffusion (Parts 3–5): When no labels are available — or when the user wants to specify a class in natural language — we exploit CLIP, a multimodal embedding model that aligns images and text in a shared latent space. This enables zero-shot guidance without any class labels.

The two approaches are not mutually exclusive. In state-of-the-art systems such as DALL-E 2 and Imagen, both ideas are combined to produce text-to-image generation of remarkable quality.


Part 1 — Classifier-Guided Diffusion

1.1 The Core Idea

Recall from Bayes’ rule that the score of the class-conditional distribution factors as:

\[ \nabla_{\mathbf{z}_t} \log \mathrm{Pr}(\mathbf{z}_t \mid c) = \nabla_{\mathbf{z}_t} \log \mathrm{Pr}(\mathbf{z}_t) + \nabla_{\mathbf{z}_t} \log \mathrm{Pr}(c \mid \mathbf{z}_t). \]

The first term on the right is the score of the unconditional diffusion model — it is exactly what the learned noise model \(g_t[\mathbf{z}_t, \boldsymbol{\phi}_t]\) approximates. The second term is the gradient of a classifier evaluated at the noisy latent \(\mathbf{z}_t\). By adding this gradient to the denoising update, we steer samples toward the class \(c\) at every reverse diffusion step.

1.2 Training a Noise-Aware Classifier

A standard image classifier trained on clean data \(\mathbf{x}\) cannot be used here, because during sampling the model operates on noisy latents \(\mathbf{z}_t\). We must train a classifier \(\mathrm{Pr}(c \mid \mathbf{z}_t)\) that accepts noisy inputs at all noise levels \(t\).

Concretely, the classifier \(f_\psi(\mathbf{z}_t, t)\) takes both the noisy latent and the timestep as inputs. It is trained on noisy samples drawn from the diffusion kernel:

\[ \mathbf{z}_t = \sqrt{\alpha_t}\, \mathbf{x} + \sqrt{1 - \alpha_t}\, \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathrm{Norm}[\mathbf{0}, \mathbf{I}], \]

where \(\alpha_t = \prod_{s=1}^{t}(1 - \beta_s)\) (equation 18.7 in the textbook). The training loss is the standard cross-entropy:

\[ \mathcal{L}_\psi = -\sum_{i=1}^{I} \log \mathrm{Pr}(c_i \mid \mathbf{z}_{t}^{(i)}), \]

where \(t\) is sampled uniformly at each training step and \(\mathbf{z}_t^{(i)}\) is generated from training example \(\mathbf{x}_i\) using the diffusion kernel. The classifier architecture often re-uses the downsampling encoder of the U-Net used for the diffusion model itself, with an added pooling head that outputs class logits.
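As a minimal NumPy sketch of this data pipeline, the classifier's training inputs can be generated directly from the diffusion kernel; the linear schedule, batch size, and image size below are illustrative choices, not values from the textbook:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noise schedule: beta_t for t = 1..T, with alpha_t = prod_{s<=t} (1 - beta_s).
T = 1000
beta = np.linspace(1e-4, 0.02, T)            # illustrative linear schedule
alpha = np.cumprod(1.0 - beta)

def noisy_latent(x, t):
    """Sample z_t from the diffusion kernel q(z_t | x)."""
    eps = rng.standard_normal(x.shape)
    return np.sqrt(alpha[t - 1]) * x + np.sqrt(1.0 - alpha[t - 1]) * eps

# One classifier training step (sketch): draw a uniform timestep per example,
# noise the clean inputs, then feed the (z_t, t) pairs to the classifier
# f_psi and minimise cross-entropy against the class labels c_i.
x_batch = rng.standard_normal((8, 32 * 32))  # toy flattened "images"
t_batch = rng.integers(1, T + 1, size=8)     # t ~ Uniform{1, ..., T}
z_batch = np.stack([noisy_latent(x, t) for x, t in zip(x_batch, t_batch)])
```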

1.3 Modified Sampling with Classifier Guidance

Once the classifier is trained, we modify the sampling algorithm. Recall the standard update from Algorithm 18.2:

\[ \hat{\mathbf{z}}_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\mathbf{z}_t - \frac{\beta_t}{\sqrt{1-\alpha_t}\sqrt{1-\beta_t}}\, g_t[\mathbf{z}_t, \boldsymbol{\phi}_t], \] \[ \mathbf{z}_{t-1} = \hat{\mathbf{z}}_{t-1} + \sigma_t \boldsymbol{\epsilon}. \]

Classifier guidance adds an extra term to this update (equation 18.41 in the textbook):

\[ \boxed{ \mathbf{z}_{t-1} = \hat{\mathbf{z}}_{t-1} + \sigma_t^2\, \nabla_{\mathbf{z}_t} \log \mathrm{Pr}(c \mid \mathbf{z}_t) + \sigma_t \boldsymbol{\epsilon}. } \]

The gradient \(\nabla_{\mathbf{z}_t} \log \mathrm{Pr}(c \mid \mathbf{z}_t)\) is computed by a single forward–backward pass through the classifier at each timestep. It points in the direction in latent space that increases the classifier’s log-probability for class \(c\), thereby nudging \(\mathbf{z}_{t-1}\) toward a region that the classifier associates with that class.

1.4 Guidance Scale

In practice, the gradient is scaled by a guidance coefficient \(s > 0\):

\[ \mathbf{z}_{t-1} = \hat{\mathbf{z}}_{t-1} + s \cdot \sigma_t^2\, \nabla_{\mathbf{z}_t} \log \mathrm{Pr}(c \mid \mathbf{z}_t) + \sigma_t \boldsymbol{\epsilon}. \]

  • When \(s = 0\), guidance is disabled and the model generates unconditionally.
  • Increasing \(s\) produces samples that more strongly resemble class \(c\), at the cost of reduced diversity and potential artifacts.
  • Empirically, values of \(s \in [1, 10]\) produce the best trade-off between image quality and fidelity to the class label.
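To make the scaled update concrete, here is a NumPy sketch in which a toy Gaussian class model \(\mathrm{Pr}(c \mid \mathbf{z}) \propto \exp(-\|\mathbf{z} - \boldsymbol{\mu}_c\|^2/2)\) stands in for the trained classifier, so its log-gradient is simply \(\boldsymbol{\mu}_c - \mathbf{z}_t\):

```python
import numpy as np

def guided_step(z_t, z_hat_prev, mu_c, sigma_t, eps, s=3.0):
    """One classifier-guided reverse step with guidance scale s.

    For the toy class model Pr(c | z) ∝ exp(-||z - mu_c||^2 / 2), the
    gradient of the log-probability is (mu_c - z_t); a real system would
    instead backpropagate through the noise-aware classifier f_psi(z_t, t).
    """
    grad_log_pc = mu_c - z_t
    return z_hat_prev + s * sigma_t**2 * grad_log_pc + sigma_t * eps

# With the noise suppressed, guidance visibly pulls the sample toward mu_c:
z_t = np.zeros(4)
z_hat = np.zeros(4)                        # standard denoising mean (toy value)
mu_c = np.ones(4)                          # centre of the target class
stepped = guided_step(z_t, z_hat, mu_c, sigma_t=0.5, eps=np.zeros(4), s=2.0)
# stepped = z_hat + 2 * 0.5**2 * (mu_c - z_t) = [0.5, 0.5, 0.5, 0.5]
```

Setting `s=0.0` recovers the unguided update, matching the first bullet above.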

1.5 Limitations

Classifier-guided diffusion requires training a separate, noise-aware classifier. This involves significant additional computational cost — the classifier must be evaluated and differentiated at every one of the \(T\) reverse diffusion steps. Furthermore, at high guidance scales, the classifier gradient can point toward adversarial directions in pixel space, producing high-classifier-confidence images that are nevertheless visually unrealistic.

These limitations motivate the classifier-free approach described in the textbook (Section 18.6.3), and the CLIP-based approach introduced next.


Part 2 — From Class Labels to Language: Motivation for Zero-Shot Guidance

Classifier-guided diffusion requires that images have discrete class labels. In many settings this is restrictive:

  • Label sets are fixed at training time. A classifier trained on ImageNet’s 1,000 classes cannot directly guide generation of a “golden retriever wearing a top hat.”
  • Collecting and annotating large labelled datasets is expensive.
  • The user may want to specify a generation target in natural language rather than by selecting from a predefined menu.

What we want is a way to measure the semantic similarity between an image \(\mathbf{x}\) and an arbitrary text description \(\mathbf{d}\), without requiring class-level supervision. CLIP provides exactly this capability.


Part 3 — CLIP Recap: Contrastive Language–Image Pretraining

3.1 Architecture

CLIP (Radford et al., 2021) consists of two encoders trained jointly:

  • An image encoder \(f_\theta^{\text{img}}: \mathbb{R}^{H \times W \times 3} \to \mathbb{R}^d\), typically a Vision Transformer (ViT) or ResNet.
  • A text encoder \(f_\theta^{\text{txt}}: \text{text} \to \mathbb{R}^d\), typically a Transformer.

Both encoders map their respective inputs into the same \(d\)-dimensional embedding space. For an image \(\mathbf{x}\) and a text string \(\mathbf{d}\), their CLIP embeddings are:

\[ \mathbf{e}^{\text{img}} = f_\theta^{\text{img}}(\mathbf{x}) \in \mathbb{R}^d, \qquad \mathbf{e}^{\text{txt}} = f_\theta^{\text{txt}}(\mathbf{d}) \in \mathbb{R}^d. \]

After \(\ell_2\)-normalisation, the similarity between an image and a text description is measured by the cosine similarity:

\[ \mathrm{sim}(\mathbf{x}, \mathbf{d}) = \frac{\mathbf{e}^{\text{img}} \cdot \mathbf{e}^{\text{txt}}}{\|\mathbf{e}^{\text{img}}\|\, \|\mathbf{e}^{\text{txt}}\|}. \]

3.2 Contrastive Training Objective

CLIP is trained on a large corpus of (image, text) pairs \(\{(\mathbf{x}_i, \mathbf{d}_i)\}_{i=1}^{N}\) scraped from the internet. For a mini-batch of \(B\) pairs, CLIP forms a \(B \times B\) similarity matrix:

\[ S_{ij} = \mathrm{sim}(\mathbf{x}_i, \mathbf{d}_j) = \frac{f_\theta^{\text{img}}(\mathbf{x}_i) \cdot f_\theta^{\text{txt}}(\mathbf{d}_j)}{\|f_\theta^{\text{img}}(\mathbf{x}_i)\|\,\|f_\theta^{\text{txt}}(\mathbf{d}_j)\|}. \]

The training objective is a symmetric cross-entropy loss that encourages high similarity on the diagonal (matched pairs) and low similarity off the diagonal (mismatched pairs):

\[ \mathcal{L}_{\text{CLIP}} = -\frac{1}{2B} \sum_{i=1}^{B} \left[ \log \frac{\exp(S_{ii} / \tau)}{\sum_{j=1}^{B} \exp(S_{ij} / \tau)} + \log \frac{\exp(S_{ii} / \tau)}{\sum_{j=1}^{B} \exp(S_{ji} / \tau)} \right], \]

where \(\tau > 0\) is a learned temperature parameter. The first term treats the image \(\mathbf{x}_i\) as the query and selects the correct text \(\mathbf{d}_i\) from all \(B\) texts; the second term reverses the roles.
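The symmetric loss can be sketched in a few lines of NumPy; the embeddings here are random stand-ins for encoder outputs:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric contrastive loss for B matched (image, text) embedding
    pairs; rows of both arrays are assumed l2-normalised."""
    logits = (img_emb @ txt_emb.T) / tau     # B x B scaled similarity matrix
    B = logits.shape[0]
    row = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))  # image -> text
    col = logits - np.log(np.exp(logits).sum(axis=0, keepdims=True))  # text -> image
    d = np.arange(B)
    return -0.5 / B * (row[d, d].sum() + col[d, d].sum())

# Matched pairs on the diagonal give a low loss; shuffling the pairing
# moves the high similarities off the diagonal and the loss rises.
rng = np.random.default_rng(2)
e = rng.standard_normal((4, 8))
e /= np.linalg.norm(e, axis=1, keepdims=True)
aligned = clip_loss(e, e)
shuffled = clip_loss(e, e[::-1])
```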

3.3 Properties of the Joint Embedding Space

After training at scale (400 million image–text pairs in the original CLIP paper), the joint embedding space has two important properties:

  1. Semantic alignment: Images and texts with similar semantic content are embedded close together. A photograph of a dog and the sentence “a photo of a dog” will have high cosine similarity.

  2. Modality bridging: The embedding space is shared across modalities. This means that arithmetic relations that hold for text embeddings often also hold for image embeddings, and vice versa.

These properties give rise to a powerful capability: zero-shot image classification.


Part 4 — Zero-Shot Image Classification with CLIP

4.1 The Zero-Shot Protocol

Suppose we have a set of \(K\) class names \(\{c_1, c_2, \ldots, c_K\}\) (e.g., the 1,000 ImageNet classes). For each class \(c_k\), we construct a natural language prompt; for example, for \(c_k =\) “cat”:

\[ \mathbf{d}_k = \text{``a photo of a cat''}. \]

We compute text embeddings \(\mathbf{e}_k^{\text{txt}} = f_\theta^{\text{txt}}(\mathbf{d}_k)\) for all \(K\) classes. Given a test image \(\mathbf{x}\), we compute its image embedding \(\mathbf{e}^{\text{img}} = f_\theta^{\text{img}}(\mathbf{x})\) and predict the class as:

\[ \hat{c} = \arg\max_{k \in \{1,\ldots,K\}} \mathrm{sim}(\mathbf{x}, \mathbf{d}_k) = \arg\max_{k} \; \mathbf{e}^{\text{img}} \cdot \mathbf{e}_k^{\text{txt}}. \]

Equivalently, this can be viewed as a probabilistic classifier:

\[ \mathrm{Pr}(c = c_k \mid \mathbf{x}) = \frac{\exp\!\left(\mathrm{sim}(\mathbf{x}, \mathbf{d}_k) / \tau\right)}{\sum_{j=1}^{K} \exp\!\left(\mathrm{sim}(\mathbf{x}, \mathbf{d}_j) / \tau\right)}. \]

This is zero-shot because the classifier was never trained on the downstream task. The class names \(c_k\) are supplied only at test time, in natural language — no labelled images of those classes are required.
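A minimal sketch of the zero-shot protocol, using hypothetical hand-made prompt embeddings in place of real CLIP encoder outputs:

```python
import numpy as np

def zero_shot_probs(img_emb, class_txt_embs, tau=0.07):
    """Zero-shot class probabilities Pr(c_k | x): a softmax over cosine
    similarities between one image embedding and K class-prompt embeddings
    (all assumed l2-normalised)."""
    logits = (class_txt_embs @ img_emb) / tau
    logits -= logits.max()                   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# Hypothetical embeddings: the image is closest to the second class prompt,
# so that class receives the highest probability.
img = np.array([0.0, 1.0, 0.0])
prompts = np.array([[1.0, 0.0, 0.0],         # e.g. "a photo of a dog"
                    [0.1, 0.9, 0.0],         # e.g. "a photo of a cat"
                    [0.0, 0.0, 1.0]])        # e.g. "a photo of a car"
prompts /= np.linalg.norm(prompts, axis=1, keepdims=True)
probs = zero_shot_probs(img, prompts)
```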

4.2 Why Zero-Shot Works

Zero-shot classification is possible because the CLIP text encoder has learned rich semantic representations of class names through its contrastive pretraining. When the text encoder reads “golden retriever,” it produces an embedding that lies near the embeddings of actual golden retriever images in the shared space. This alignment was learned implicitly from the internet-scale image–caption corpus, not from any classification-specific supervision.

4.3 Implications for Conditional Generation

The zero-shot classification protocol reveals that CLIP can assign a soft probability \(\mathrm{Pr}(c \mid \mathbf{x})\) to any image for any class specified in natural language. This is precisely the kind of signal needed for guidance. Instead of requiring a separately trained classifier with a fixed label set, we can use CLIP similarity as a continuously differentiable score that can be computed for arbitrary text descriptions.

Key Insight

CLIP provides a differentiable similarity score between images and arbitrary text. This means we can compute gradients of this score with respect to image pixels (or noisy latents), which is exactly what we need to guide a diffusion model toward generating images that match a text prompt — without any class labels.


Part 5 — CLIP-Guided Diffusion

5.1 Replacing the Classifier with CLIP Similarity

In classifier-guided diffusion (Part 1), the guidance gradient was \(\nabla_{\mathbf{z}_t} \log \mathrm{Pr}(c \mid \mathbf{z}_t)\). We now replace this with the gradient of the CLIP similarity score between the current denoised estimate and a text description \(\mathbf{d}\) provided by the user.

At diffusion step \(t\), we have the noisy latent \(\mathbf{z}_t\). We first compute the predicted clean image using the reparameterisation (equation 18.31 in the textbook):

\[ \hat{\mathbf{x}}_0 = \frac{1}{\sqrt{\alpha_t}} \mathbf{z}_t - \frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\, g_t[\mathbf{z}_t, \boldsymbol{\phi}_t]. \]
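A quick numerical check of this reparameterisation: if the network's noise estimate \(g_t\) equals the true noise \(\boldsymbol{\epsilon}\), the formula recovers \(\mathbf{x}\) exactly (the schedule value below is illustrative).

```python
import numpy as np

rng = np.random.default_rng(3)
alpha_t = 0.5                                # illustrative schedule value
x = rng.standard_normal(16)                  # clean "image"
eps = rng.standard_normal(16)                # true noise
z_t = np.sqrt(alpha_t) * x + np.sqrt(1 - alpha_t) * eps   # diffusion kernel

# Plug the exact noise in for g_t: the prediction recovers x.
x0_hat = z_t / np.sqrt(alpha_t) - np.sqrt(1 - alpha_t) / np.sqrt(alpha_t) * eps
```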

We then evaluate the CLIP similarity between this estimate and the text description:

\[ \mathcal{S}(\mathbf{z}_t, \mathbf{d}) = \mathrm{sim}\!\left(\hat{\mathbf{x}}_0(\mathbf{z}_t),\; \mathbf{d}\right) = \frac{f_\theta^{\text{img}}(\hat{\mathbf{x}}_0(\mathbf{z}_t)) \cdot f_\theta^{\text{txt}}(\mathbf{d})}{\|f_\theta^{\text{img}}(\hat{\mathbf{x}}_0(\mathbf{z}_t))\|\;\|f_\theta^{\text{txt}}(\mathbf{d})\|}. \]

The text embedding \(\mathbf{e}^{\text{txt}} = f_\theta^{\text{txt}}(\mathbf{d})\) is constant across all timesteps and can be precomputed. The CLIP-guided sampling update becomes:

\[ \boxed{ \mathbf{z}_{t-1} = \hat{\mathbf{z}}_{t-1} + s \cdot \sigma_t^2\, \nabla_{\mathbf{z}_t}\, \mathcal{S}(\mathbf{z}_t, \mathbf{d}) + \sigma_t \boldsymbol{\epsilon}. } \]

This is structurally identical to the classifier-guided update (Section 1.3), with \(\nabla_{\mathbf{z}_t} \log \mathrm{Pr}(c \mid \mathbf{z}_t)\) replaced by \(\nabla_{\mathbf{z}_t}\, \mathcal{S}(\mathbf{z}_t, \mathbf{d})\).

5.2 Gradient Computation

The CLIP guidance gradient is computed via backpropagation through the chain:

\[ \mathbf{z}_t \;\longrightarrow\; \hat{\mathbf{x}}_0(\mathbf{z}_t) \;\longrightarrow\; f_\theta^{\text{img}}(\hat{\mathbf{x}}_0) \;\longrightarrow\; \mathcal{S}(\mathbf{z}_t, \mathbf{d}). \]

Concretely, this requires:

  1. A forward pass of the diffusion U-Net \(g_t[\mathbf{z}_t, \boldsymbol{\phi}_t]\) to obtain \(\hat{\mathbf{x}}_0\).
  2. A forward pass of the CLIP image encoder \(f_\theta^{\text{img}}\) on \(\hat{\mathbf{x}}_0\) to obtain the image embedding.
  3. Computation of the cosine similarity \(\mathcal{S}\) against the precomputed text embedding.
  4. A backward pass to obtain \(\nabla_{\mathbf{z}_t} \mathcal{S}\).

Both the CLIP image encoder and the diffusion U-Net are frozen during inference — no fine-tuning is required.
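The four steps above can be sketched end to end with toy stand-ins: a linear "image encoder" and a fixed noise estimate in place of CLIP and the U-Net, and a central finite-difference gradient in place of autodiff.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha_t = 0.5
W = rng.standard_normal((8, 16))             # toy linear "image encoder"
e_txt = rng.standard_normal(8)
e_txt /= np.linalg.norm(e_txt)               # precomputed text embedding
eps_hat = rng.standard_normal(16)            # frozen noise estimate g_t(z_t)

def clip_score(z_t):
    """Steps 1-3: denoised estimate, image embedding, cosine similarity."""
    x0 = z_t / np.sqrt(alpha_t) - np.sqrt(1 - alpha_t) / np.sqrt(alpha_t) * eps_hat
    e_img = W @ x0
    return e_img @ e_txt / np.linalg.norm(e_img)

def grad_fd(f, z, h=1e-5):
    """Step 4, by central finite differences (a real system uses autodiff)."""
    g = np.zeros_like(z)
    for i in range(z.size):
        zp, zm = z.copy(), z.copy()
        zp[i] += h
        zm[i] -= h
        g[i] = (f(zp) - f(zm)) / (2 * h)
    return g

z_t = rng.standard_normal(16)
g = grad_fd(clip_score, z_t)
# A small step along the gradient increases the sample's CLIP score.
z_stepped = z_t + 1e-3 * g
```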

5.3 Conditioning the Diffusion Model Directly: Classifier-Free Guidance with CLIP

An alternative and now more commonly used approach incorporates the text conditioning inside the diffusion model rather than as an external gradient. The text embedding is injected into the U-Net layers at each timestep, analogously to how the timestep embedding is incorporated (see figure 18.9 of the textbook). The model is then parameterised as:

\[ g_t[\mathbf{z}_t, \boldsymbol{\phi}_t, \mathbf{e}^{\text{txt}}], \]

where \(\mathbf{e}^{\text{txt}} = f_\theta^{\text{txt}}(\mathbf{d})\) is the CLIP (or language model) text embedding of the user’s description.

During training, the text conditioning is randomly dropped with probability \(p_{\text{drop}}\) (e.g., 10%), replacing \(\mathbf{e}^{\text{txt}}\) with a null embedding \(\varnothing\). This jointly trains the conditional model \(g_t[\mathbf{z}_t, \boldsymbol{\phi}_t, \mathbf{e}^{\text{txt}}]\) and the unconditional model \(g_t[\mathbf{z}_t, \boldsymbol{\phi}_t, \varnothing]\) with shared parameters.
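A sketch of this conditioning dropout; the null embedding is a learned vector in practice, represented by zeros here:

```python
import numpy as np

rng = np.random.default_rng(5)
NULL_EMB = np.zeros(8)    # the null embedding (a learned vector in practice)

def maybe_drop(e_txt, p_drop=0.1):
    """Replace the text embedding with the null embedding with probability
    p_drop, so one network learns both conditional and unconditional models."""
    return NULL_EMB if rng.random() < p_drop else e_txt

# Over many training examples, roughly a p_drop fraction is unconditional:
e = np.ones(8)
drops = sum(np.array_equal(maybe_drop(e), NULL_EMB) for _ in range(10_000))
```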

At sampling time, the two models are combined using classifier-free guidance (Ho & Salimans, 2022):

\[ \tilde{g}_t = g_t[\mathbf{z}_t, \boldsymbol{\phi}_t, \varnothing] + s \cdot \left( g_t[\mathbf{z}_t, \boldsymbol{\phi}_t, \mathbf{e}^{\text{txt}}] - g_t[\mathbf{z}_t, \boldsymbol{\phi}_t, \varnothing] \right), \]

where \(s > 1\) amplifies the conditional component relative to the unconditional baseline. This approach avoids training a separate classifier entirely: each sampling step needs only two forward passes of the U-Net (conditional and unconditional, often batched together) and no backward pass.
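The combination rule itself is one line; a NumPy sketch with toy two-dimensional "noise predictions":

```python
import numpy as np

def cfg_combine(g_uncond, g_cond, s=7.5):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction toward the conditional one by the guidance scale s."""
    return g_uncond + s * (g_cond - g_uncond)

g_u = np.array([0.0, 0.0])                   # unconditional U-Net output
g_c = np.array([1.0, -1.0])                  # conditional U-Net output
combined = cfg_combine(g_u, g_c, s=2.0)
# s = 1 recovers the purely conditional prediction; s = 2 doubles the
# displacement from the unconditional baseline: combined = [2.0, -2.0].
```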

Key Insight

Classifier-free guidance is now the dominant approach in production text-to-image systems. It avoids the cost of a separate classifier, supports arbitrary text conditioning, and produces state-of-the-art image quality. The key trick is the random dropout of conditioning during training, which allows a single model to serve as both the conditional and unconditional generator.

5.4 Comparison of Guidance Approaches

The table below summarises the three guidance strategies covered in this lesson.

|                              | Classifier Guidance           | CLIP Guidance               | Classifier-Free Guidance                |
|------------------------------|-------------------------------|-----------------------------|-----------------------------------------|
| Requires labelled data       | Yes (class labels)            | No                          | No (conditioning dropped during training) |
| Requires separate classifier | Yes                           | Yes (CLIP encoders, frozen) | No                                      |
| Supports free-form text      | No                            | Yes                         | Yes                                     |
| Extra cost at sampling       | Classifier forward + backward | CLIP forward + backward     | One extra U-Net forward pass            |
| Modifies training            | No (plug-and-play)            | No (plug-and-play)          | Yes (requires retraining)               |
| Quality                      | Good                          | Good                        | State-of-the-art                        |

Summary

This lesson extended the unconditional diffusion model of Chapter 18 to support conditional image generation. The key ideas are:

  • Classifier guidance leverages a noise-aware classifier \(\mathrm{Pr}(c \mid \mathbf{z}_t, t)\) trained on noisy latents. Its gradient \(\nabla_{\mathbf{z}_t} \log \mathrm{Pr}(c \mid \mathbf{z}_t)\) is added to each reverse diffusion step, steering samples toward the desired class \(c\).

  • CLIP learns a joint image–text embedding space through contrastive pretraining on internet-scale data. The shared embedding space enables zero-shot image classification: for any set of class names described in natural language, a probability over classes can be computed without any task-specific training.

  • CLIP guidance replaces the supervised classifier gradient with the gradient of the CLIP cosine similarity \(\mathcal{S}(\mathbf{z}_t, \mathbf{d})\), enabling guidance toward an arbitrary text description \(\mathbf{d}\) with no class-label supervision.

  • Classifier-free guidance bakes the text conditioning into the diffusion U-Net itself, using CLIP or language model embeddings as conditioning inputs. By over-weighting the conditional component at sampling time, the model produces high-quality samples that closely match the text prompt.

Together, these techniques form the backbone of modern text-to-image generation systems such as DALL-E 2, Stable Diffusion, and Imagen.

References

  1. Simon J. D. Prince, “Understanding Deep Learning”, MIT Press, 2023, Chapter 18.
  2. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever, “Learning Transferable Visual Models From Natural Language Supervision”, ICML 2021.
  3. Prafulla Dhariwal & Alex Nichol, “Diffusion Models Beat GANs on Image Synthesis”, NeurIPS 2021.
  4. Jonathan Ho & Tim Salimans, “Classifier-Free Diffusion Guidance”, NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications.