Probabilistic Models: Foundation of Generative Models
Introduction
Up to this point in the course, we have treated neural networks primarily as functions that map inputs to outputs: an image goes in, a class label comes out; a sequence of tokens goes in, the next token comes out. This deterministic view is powerful but limited. To understand generative models — models that can create new data rather than merely classify existing data — we need to shift our perspective.
In this lesson, we recast machine learning in the language of probability distributions. Instead of asking “what output does this input produce?”, we ask “what is the probability of observing this data point?” This probabilistic lens opens the door to a rich family of generative models, from autoregressive language models to variational autoencoders and beyond.
Part 1 — Probability Distribution Terminology
Before diving into generative models, we establish the key probabilistic concepts that will be used throughout.
1.1 Probability Distributions
A probability distribution is a function \(p(x)\) that assigns a non-negative real number to each element \(x\) in a sample space \(\mathcal{X}\):
\[ p : \mathcal{X} \to \mathbb{R}_{\geq 0} \]
subject to the normalisation constraint:
- Discrete case: \(\displaystyle\sum_{x \in \mathcal{X}} p(x) = 1\)
- Continuous case: \(\displaystyle\int_{\mathcal{X}} p(x)\, dx = 1\)
We write \(p_\theta(x)\) to denote a learned model of \(p(x)\), where \(\theta\) represents the parameters of the model (e.g., the weights of a neural network).
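As a concrete illustration (a minimal sketch using NumPy; the distribution values here are made up), a discrete distribution over a small sample space is just an array of non-negative numbers that sums to one:

```python
import numpy as np

# A made-up discrete distribution over a four-element sample space.
p = np.array([0.1, 0.2, 0.3, 0.4])

# The two defining constraints: non-negativity and normalisation.
assert np.all(p >= 0)
assert np.isclose(p.sum(), 1.0)
```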
1.2 Joint and Conditional Distributions
When we have two random variables \(x\) and \(y\), we can talk about:
- Joint distribution \(p(x, y)\): the probability of observing both \(x\) and \(y\) together.
- Conditional distribution \(p(x \mid y)\): the probability of observing \(x\) given that \(y\) is known.
These are related by the fundamental identity:
\[ p(x, y) = p(x \mid y) \cdot p(y) \]
This identity — the chain rule of probability — will play a central role when we decompose the probability of sequences.
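The identity is easy to check numerically. The sketch below (the joint table is made up for illustration) computes the marginal and conditional from a joint distribution over two binary variables and verifies that their product recovers the joint:

```python
import numpy as np

# A made-up joint distribution p(x, y) over x ∈ {0, 1}, y ∈ {0, 1}.
p_xy = np.array([[0.1, 0.2],
                 [0.3, 0.4]])  # rows index x, columns index y

p_y = p_xy.sum(axis=0)        # marginal p(y): sum out x
p_x_given_y = p_xy / p_y      # conditional p(x | y): normalise each column

# Chain rule: p(x, y) = p(x | y) * p(y)
assert np.allclose(p_x_given_y * p_y, p_xy)
```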
1.3 From Distribution to Instance: Sampling
A distribution \(p(\cdot)\) describes a population. To get a concrete instance \(x\), we sample from \(p\):
\[ x \sim p(x) \]
The mechanics of sampling depend on the space:
Discrete space (\(x \in\) a finite set): Select an element at random so that each \(x\) is chosen with probability \(p(x)\). This is straightforward — we can enumerate all possibilities, form the cumulative distribution, and select according to a uniform draw.
Continuous space (\(x \in \mathbb{R}^d\)): We can compute expectations by integrating:
\[ \mathbb{E}_{x \sim p}[x] = \int_{\mathbb{R}^d} x \cdot p(x)\, dx \]
but actually drawing samples from a continuous distribution is much harder, especially in high dimensions. This challenge motivates much of the machinery we will develop later.
Sampling from a discrete distribution (e.g., picking the next token from a vocabulary) is computationally easy. Sampling from a continuous high-dimensional distribution (e.g., generating a realistic image) is fundamentally hard and requires specialised techniques such as variational inference or diffusion processes.
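The easy discrete case can be made concrete with a short sketch (the distribution is made up; `sample_discrete` is an illustrative helper, not a library function). It implements inverse-CDF sampling: enumerate the space, build the cumulative distribution, and map a uniform draw to an element:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up distribution over a 4-element discrete space (e.g., 4 tokens).
p = np.array([0.1, 0.2, 0.3, 0.4])
cdf = np.cumsum(p)  # [0.1, 0.3, 0.6, 1.0]

def sample_discrete(n):
    """Inverse-CDF sampling: draw u ~ Uniform(0, 1) and return the first
    index whose cumulative probability exceeds u."""
    return np.searchsorted(cdf, rng.random(n))

# Empirical frequencies approach p as the sample count grows.
draws = sample_discrete(100_000)
freqs = np.bincount(draws, minlength=4) / draws.size
assert np.allclose(freqs, p, atol=0.01)
```

No analogous enumeration exists in \(\mathbb{R}^d\), which is exactly why continuous sampling needs the heavier machinery mentioned above.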
Part 2 — Classification as a Probabilistic Model
The classification pipeline we have studied throughout this course can be understood as a probabilistic model. Consider image classification on MNIST with 10 digit classes.
2.1 The Classification Pipeline
An image \(x \in \mathbb{R}^{28 \times 28}\) is fed through a pipeline:
\[ x \xrightarrow{\text{ConvNet}} \cdot \xrightarrow{\text{MLP}} \mathbb{R}^{10} \xrightarrow{\text{Softmax}} \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_{10} \end{bmatrix} \]
The softmax output is a valid probability distribution over the label space \(y \in \{1, 2, \dots, 10\}\). In probabilistic notation, the entire neural network computes:
\[ p_\theta(y \mid x) = \text{softmax}(\text{logits}) \]
This is a conditional distribution: given an image \(x\), the model tells us how likely each label \(y\) is.
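The final step of the pipeline can be sketched as follows (the logits are made-up stand-ins for the ConvNet + MLP output). Softmax turns an arbitrary real vector into a valid conditional distribution over the labels:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged
    # because softmax is invariant to adding a constant to all logits.
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Made-up logits for a 10-class problem (e.g., MNIST digits).
logits = np.array([2.0, -1.0, 0.5, 0.0, 1.5, -0.5, 0.3, 0.8, -2.0, 1.0])
p_y_given_x = softmax(logits)

# The output is a valid probability distribution over the 10 labels.
assert np.all(p_y_given_x >= 0)
assert np.isclose(p_y_given_x.sum(), 1.0)
```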
2.2 Classification is a Discriminative Model
A classifier is a specific type of probabilistic model called a discriminative model:
- Training data consists of pairs of random variables \((x, y)\).
- The model learns the conditional distribution \(p_\theta(y \mid x)\).
- The goal is to approximate the true data-generating conditional:
\[ p_\theta(y \mid x) \approx p_{\text{data}}(y \mid x) \]
Discriminative models are excellent at answering the question “given this input, what is the output?” but they tell us nothing about the distribution of the inputs themselves.
Discriminative Model: A model that learns \(p_\theta(y \mid x)\) — the conditional distribution of labels given inputs. Examples include classifiers, regressors, and sequence taggers. Discriminative models answer: “What label should this input have?”
Part 3 — Generative Models
A generative model takes a fundamentally different approach. Rather than learning a conditional distribution \(p(y \mid x)\), a generative model learns the distribution of the data itself.
3.1 Definition
Given training data \(\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}\) consisting of samples from a random variable \(x \in \mathbb{R}^d\), a generative model seeks to learn a distribution \(p_\theta(x)\) that approximates the true data distribution:
\[ p_\theta(x) \approx p_{\text{data}}(x) \]
Notice the key difference: there are no labels \(y\). The model is learning about the data \(x\) itself, not about any particular task. If the model successfully captures \(p_{\text{data}}(x)\), then given any data point \(x\), we know its likelihood \(p_\theta(x)\).
3.2 Why Are Generative Models Useful?
If we have a model \(p_\theta(x)\) that faithfully represents the data distribution, then under certain conditions we can generate new instances of \(x\) by sampling from \(p_\theta\). That is, we can create new data points — images, text, audio — that were never observed in the training set but are plausible according to the learned distribution.
This is the foundation of all modern generative AI: image generation (DALL-E, Stable Diffusion), text generation (GPT), music generation, and more.
Generative Model: A model that learns \(p_\theta(x)\) — the distribution of the data itself (not conditioned on labels). Once learned, the model can:
- Evaluate likelihood: Given a data point \(x\), compute how likely it is under the model.
- Generate new data: Sample new instances \(x \sim p_\theta(x)\) that are not in the training set.
Part 4 — Explicit vs. Implicit Density Models
Not all generative models work the same way. A fundamental distinction is between explicit and implicit density models.
4.1 Explicit Density Models
An explicit density model directly defines and computes the probability \(p_\theta(x)\) for any given instance \(x\):
\[ x \longrightarrow \boxed{p_\theta} \longrightarrow p_\theta(x) \in \mathbb{R}_{\geq 0} \]
Given any data point \(x\), we can query the model and get back a number — the likelihood of \(x\) under the model (a probability in the discrete case, a density value in the continuous case). This is valuable because:
- We can directly optimise the likelihood during training.
- We can evaluate how well the model fits the data.
- We can compare different data points in terms of their likelihood.
4.2 Implicit Density Models
An implicit density model does not provide a way to compute \(p_\theta(x)\). Instead, it provides a mechanism to directly generate new “likely” instances:
\[ ? \longrightarrow \boxed{G_\theta} \longrightarrow x \]
The model takes some input (typically random noise) and produces a data sample \(x\). We cannot ask “how likely is this particular \(x\)?” — we can only generate samples. Generative Adversarial Networks (GANs) are the classic example of implicit density models.
| Property | Explicit Density Model | Implicit Density Model |
|---|---|---|
| Computes \(p_\theta(x)\) | Yes | No |
| Can evaluate likelihood | Yes | No |
| Can generate new samples | Sometimes (requires sampling procedure) | Yes (directly) |
| Example | Autoregressive LM, VAE | GAN |
| Training signal | Maximise likelihood | Adversarial loss |
Part 5 — Autoregressive Language Models as Explicit Density Models
Autoregressive language models, such as GPT, are a prime example of explicit density models. Let us see how they compute \(p_\theta(x)\) for a sequence of tokens.
5.1 Autoregressive Factorisation
Let \(x = [x_1, x_2, \dots, x_L]\) be a sentence, where each \(x_i\) is a token from a vocabulary \(\mathcal{V}\).
What is \(p_\theta(x)\)? We can decompose it using the chain rule of probability. Starting with:
\[ p(x_1, x_2, \dots, x_L) = p(x_1, x_2, \dots, x_{L-1}) \cdot p(x_L \mid x_1, \dots, x_{L-1}) \]
This uses the identity \(p(A, B) = p(A \mid B) \cdot p(B)\). Applying it recursively:
\[ p(x_1, x_2, \dots, x_L) = p(x_1, \dots, x_{L-2}) \cdot p(x_{L-1} \mid x_1, \dots, x_{L-2}) \cdot p(x_L \mid x_1, \dots, x_{L-1}) \]
Continuing all the way down, we obtain the autoregressive factorisation:
\[ p(x_1, x_2, \dots, x_L) = \prod_{i=1}^{L} p(x_i \mid x_{1:i-1}) \]
where \(x_{1:i-1}\) denotes the already-generated tokens, and \(x_i\) is the next token to predict.
5.2 GPT as an Explicit Density Model
A GPT-style autoregressive model computes all of these conditional probabilities in a single forward pass. Given the input sequence \([\langle\text{cls}\rangle, x_1, x_2, \dots, x_L]\) with shape \((L+1, d)\), the model produces logits with shape \((L+1, |\mathcal{V}|)\):
- Row \(i=1\): logits for \(p(x_1)\) — the probability of the first token (conditioned only on the start symbol).
- Row \(i=2\): logits for \(p(x_2 \mid x_1)\) — the probability of the second token given the first.
- …and so on, up to row \(i=L\): logits for \(p(x_L \mid x_1, x_2, \dots, x_{L-1})\) — the probability of the last token given all previous tokens.
- Row \(i=L+1\) would score a token following \(x_L\); it is not needed when evaluating the likelihood of the sequence.
Applying softmax to each row of logits gives us the conditional distributions. Multiplying them together (or equivalently, summing the log-probabilities) gives us \(p_\theta(x)\) — the likelihood of the entire sequence. This is what makes autoregressive LMs explicit density models.
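This computation can be sketched directly (a minimal example with made-up tokens and random logits standing in for a real model's forward pass; rows are assumed aligned so that row \(i\) scores \(p(x_i \mid x_{1:i-1})\)):

```python
import numpy as np

rng = np.random.default_rng(0)

L, V = 5, 8                        # sequence length, vocabulary size
tokens = rng.integers(0, V, L)     # a made-up token sequence x_1..x_L
logits = rng.normal(size=(L, V))   # stand-in for the model's output logits

# Log-softmax each row into log-conditionals, then pick out
# log p(x_i | x_{1:i-1}) for the token that actually occurred.
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
log_p_x = log_probs[np.arange(L), tokens].sum()

# Summing log-probabilities is the same as multiplying probabilities.
probs = np.exp(log_probs)
assert np.isclose(np.exp(log_p_x), probs[np.arange(L), tokens].prod())
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause for realistic sequence lengths.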
5.3 Sampling from Explicit Density Models
How do we generate new data from an explicit density model?
For autoregressive models over a discrete vocabulary, the answer is sampling:
- Since \(p(x_{i+1} \mid x_1, \dots, x_i)\) is known for all \(x_{i+1} \in \mathcal{V}\) (it is the softmax output over the vocabulary), we can sample \(x_{i+1}\) by picking from this discrete distribution.
- Repeat: condition on the growing sequence and sample the next token.
This works because each token lives in a finite, discrete vocabulary. We can enumerate all possible next tokens and select one according to their probabilities.
But what about continuous spaces? If \(x \in \mathbb{R}^d\) (e.g., pixel values of an image), sampling becomes much harder. Even if we have an explicit model \(p_\theta(x)\), we cannot simply enumerate all possible \(x\) values. This challenge motivates architectures like the Variational Autoencoder (VAE), which we will study in a subsequent lesson.
Autoregressive language models rely on a finite, discrete vocabulary for each token \(x_i\). This makes sampling straightforward: at each step, the model produces a probability distribution over \(|\mathcal{V}|\) possible next tokens, and we simply pick one. For continuous data (images, audio), sampling from an explicit density model requires more sophisticated techniques.
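The token-by-token loop above can be sketched as follows. Here `next_token_dist` is a made-up stand-in that returns an arbitrary distribution — in a real model it would be a forward pass conditioned on the prefix:

```python
import numpy as np

rng = np.random.default_rng(0)
V = 8  # vocabulary size

def next_token_dist(prefix):
    """Stand-in for a trained model: returns a made-up distribution
    p(x_{i+1} | prefix). A real LM would run a forward pass here."""
    logits = rng.normal(size=V)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def generate(length):
    seq = []
    for _ in range(length):
        p = next_token_dist(seq)             # conditional over V tokens
        seq.append(int(rng.choice(V, p=p)))  # sample the next token
    return seq

sample = generate(10)
assert len(sample) == 10 and all(0 <= t < V for t in sample)
```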
Part 6 — Maximum Likelihood Training
Given an explicit density model \(p_\theta(x)\), how do we train it? The standard approach is maximum likelihood estimation (MLE).
6.1 The Maximum Likelihood Objective
Let \(\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots, x^{(N)}\}\) be the training data. For language models, each \(x^{(i)}\) is a sequence of tokens.
We want to find parameters \(\theta\) such that the training data are likely samples of \(p_\theta\). This means we want to maximise the total log-likelihood of the data:
\[ \theta^* = \arg\max_\theta \sum_{i=1}^{N} \log p_\theta(x^{(i)}) \]
We use the logarithm for two reasons: (1) it converts products into sums, which is numerically more stable, and (2) \(\log(p)\) is a monotonically increasing function, so maximising \(\log p\) is equivalent to maximising \(p\).
6.2 The Maximum Likelihood Loss Function
Since gradient-based optimisation typically minimises a loss function, we negate the average log-likelihood to obtain the maximum likelihood loss:
\[ \mathcal{L}_{\text{ML}} = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x^{(i)}) \]
Minimising \(\mathcal{L}_{\text{ML}}\) is equivalent to maximising the log-likelihood. This loss is exactly the (average) negative log-likelihood (NLL) of the data, and it is the standard training objective for autoregressive language models.
In expectation over the data distribution, this becomes:
\[ \mathcal{L}_{\text{ML}} = -\mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_\theta(x)\right] \]
where the expectation is taken over samples \(x\) drawn from the true data distribution \(p_{\text{data}}\); in practice it is approximated by the empirical average over the training set.
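A minimal worked example (the data and the comparison distribution are made up): for a categorical model, the parameters that minimise the NLL are the empirical frequencies, and any other valid parameter vector scores worse.

```python
import numpy as np

# Made-up training data: draws from a 3-symbol alphabet.
data = np.array([0, 1, 1, 2, 1, 0, 1, 2, 1, 1])

def nll(theta, data):
    """Average negative log-likelihood of the data under a categorical
    model with probability vector theta."""
    return -np.mean(np.log(theta[data]))

# For a categorical model, the MLE is the empirical frequency vector.
theta_mle = np.bincount(data, minlength=3) / len(data)

# Any other valid parameter setting has a higher (worse) NLL.
theta_other = np.array([1 / 3, 1 / 3, 1 / 3])
assert nll(theta_mle, data) <= nll(theta_other, data)
```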
Part 7 — Connection to KL Divergence
The maximum likelihood objective has a deep connection to a fundamental quantity in information theory: the Kullback-Leibler (KL) divergence.
7.1 KL Divergence Defined
The KL divergence is a measure of how different two probability distributions are. For distributions \(p_{\text{data}}\) and \(p_\theta\):
\[ \text{KL}(p_{\text{data}} \| p_\theta) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_\theta(x)}\right] \]
Key properties:
- \(\text{KL}(p \| q) \geq 0\) always (non-negativity).
- \(\text{KL}(p \| q) = 0\) if and only if \(p = q\) (identity of indiscernibles).
- \(\text{KL}(p \| q) \neq \text{KL}(q \| p)\) in general (asymmetry — it is not a true distance).
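All three properties can be checked numerically for discrete distributions (the two distributions below are made up; `kl` is an illustrative helper assuming full support, so no division by zero occurs):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.1, 0.6, 0.3])   # made-up distributions
q = np.array([0.3, 0.4, 0.3])

assert kl(p, q) >= 0                       # non-negativity
assert np.isclose(kl(p, p), 0.0)           # zero when the arguments match
assert not np.isclose(kl(p, q), kl(q, p))  # asymmetry: not a metric
```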
7.2 Decomposing KL Divergence
Expanding the KL divergence:
\[ \text{KL}(p_{\text{data}} \| p_\theta) = \underbrace{\mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_{\text{data}}(x)\right]}_{\text{constant w.r.t. } \theta} \; \underbrace{-\,\mathbb{E}_{x \sim p_{\text{data}}}\left[\log p_\theta(x)\right]}_{\mathcal{L}_{\text{ML}}(\theta)} \]
The first term is the negative entropy of the data distribution — it depends only on \(p_{\text{data}}\) and is a constant with respect to \(\theta\). The second term is exactly our maximum likelihood objective.
Therefore:
\[ \theta^* = \arg\min_\theta \text{KL}(p_{\text{data}} \| p_\theta) \]
Minimising the maximum likelihood loss is equivalent to minimising the KL divergence between the data distribution and the model distribution. This provides a theoretical justification for MLE: the trained model \(p_{\theta^*}\) is the one that is “closest” to the true data distribution in the KL sense.
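The decomposition holds exactly and can be verified term by term (both distributions below are made up):

```python
import numpy as np

p_data  = np.array([0.2, 0.5, 0.3])   # made-up "true" distribution
p_theta = np.array([0.3, 0.4, 0.3])   # made-up model distribution

kl          = np.sum(p_data * np.log(p_data / p_theta))
neg_entropy = np.sum(p_data * np.log(p_data))     # constant w.r.t. theta
loss_ml     = -np.sum(p_data * np.log(p_theta))   # expected NLL

# KL(p_data || p_theta) = (constant) + L_ML(theta)
assert np.isclose(kl, neg_entropy + loss_ml)
```

Since the first term does not depend on \(\theta\), driving down the NLL and driving down the KL are the same optimisation.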
Part 8 — Explicit Density Models with a Target Distribution
In some settings, we want to train an explicit density model not only to fit the training data but also to be close to a known target distribution \(p^*\). This gives us a combined loss:
Given training data \(\mathcal{D} = \{x^{(1)}, x^{(2)}, \dots\}\) and a desired target distribution \(p^*\), we can define:
- \(\mathcal{L}_{\text{MLE}}(\theta)\): the maximum likelihood loss from the data.
- \(\mathcal{L}_{\text{KL}}(\theta) = \text{KL}(p^* \| p_\theta)\): the KL divergence to the target distribution.
The combined objective is:
\[ \mathcal{L}(\theta) = \mathcal{L}_{\text{MLE}}(\theta) + \lambda \, \mathcal{L}_{\text{KL}}(\theta) \]
where \(\lambda\) controls the trade-off between fitting the data and matching the target distribution. This technique of combining a data-fitting term with a KL regularisation term will be used extensively in latent variable models, particularly the Variational Autoencoder.
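For a categorical model the combined objective is short enough to write out directly (a sketch with made-up data, target, and parameters; `combined_loss` is an illustrative helper, and both terms assume full-support probability vectors):

```python
import numpy as np

def combined_loss(theta, data, p_star, lam):
    """L(theta) = L_MLE(theta) + lam * KL(p_star || p_theta) for a
    categorical model; theta and p_star are probability vectors."""
    nll = -np.mean(np.log(theta[data]))
    kl = np.sum(p_star * np.log(p_star / theta))
    return nll + lam * kl

data = np.array([0, 1, 1, 2, 1])       # made-up training symbols
p_star = np.array([1 / 3, 1 / 3, 1 / 3])  # made-up target distribution
theta = np.array([0.2, 0.6, 0.2])      # made-up model parameters

loss = combined_loss(theta, data, p_star, lam=0.5)
assert loss > 0
```

Setting \(\lambda = 0\) recovers plain MLE; a large \(\lambda\) pulls \(p_\theta\) toward \(p^*\) regardless of the data.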
Part 9 — Latent Variable Models
Latent variable models introduce a hidden (latent) variable \(z\) to make the generative process more expressive and tractable.
9.1 Two-Stage Generation
A latent variable model is a two-stage explicit model:
- Stage 1: Sample a latent variable from a prior distribution:
\[ z \sim q(z) \]
The prior \(q(z)\) is typically a simple, easy-to-sample distribution (e.g., a standard Gaussian \(\mathcal{N}(0, I)\)).
- Stage 2: Generate the observed data \(x\) conditioned on the latent variable:
\[ x \sim p_\theta(x \mid z) \]
The conditional model \(p_\theta(x \mid z)\) is a neural network (the decoder) that maps from the latent space to the data space.
Together, \(q(z)\) and \(p_\theta(x \mid z)\) define the full generative model.
9.2 Computing the Marginal Likelihood
Given a data point \(x\), what is its likelihood under the latent variable model? We must marginalise out the latent variable \(z\):
- Continuous \(z\):
\[ p_\theta(x) = \int p_\theta(x \mid z) \cdot q(z)\, dz \]
- Discrete \(z\):
\[ p_\theta(x) = \sum_{z \in \mathcal{Z}} p_\theta(x \mid z) \cdot q(z) \]
This integral (or sum) is often intractable in practice — the latent space may be high-dimensional and the integrand complex. This intractability is a central challenge in training latent variable models, and it motivates the variational approach used in VAEs.
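When exact marginalisation is out of reach, a Monte Carlo estimate \(p_\theta(x) \approx \frac{1}{K}\sum_k p_\theta(x \mid z^{(k)})\) with \(z^{(k)} \sim q(z)\) is the simplest workaround. The sketch below uses a made-up linear-Gaussian model chosen precisely because its marginal is also available in closed form, so the estimate can be checked:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy latent variable model (all choices are illustrative):
# prior q(z) = N(0, 1); decoder p(x | z) = N(x; 2z, 0.5^2).
def decoder_density(x, z, sigma=0.5):
    return np.exp(-(x - 2 * z) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Monte Carlo estimate of the marginal p(x) = E_{z ~ q}[p(x | z)].
x = 1.0
z_samples = rng.standard_normal(200_000)
p_x_mc = decoder_density(x, z_samples).mean()

# Closed form for this model: x = 2z + noise, so p(x) = N(x; 0, 2^2 + 0.5^2).
var = 2.0 ** 2 + 0.5 ** 2
p_x_exact = np.exp(-x ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
assert np.isclose(p_x_mc, p_x_exact, rtol=0.05)
```

With a neural decoder and a high-dimensional \(z\), this naive estimator becomes hopelessly inefficient — most prior samples explain \(x\) poorly — which is one way to see why VAEs need a smarter (variational) approach.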
9.3 Sampling and Inference
A latent variable model supports two key operations:
Sampling (generation of new data):
- Draw \(z \sim q(z)\) from the prior.
- Pass \(z\) through the decoder to get \(x \sim p_\theta(x \mid z)\).
This gives us a way to generate new data: sample a latent code from the simple prior, then decode it into a data point.
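The two-stage recipe can be sketched in a few lines (the dimensions, the fixed linear map, and the noise scale are all made up; a real decoder would be a trained neural network):

```python
import numpy as np

rng = np.random.default_rng(0)
d_z, d_x = 2, 4  # made-up latent and data dimensions

# A stand-in "decoder": a fixed random linear map plus Gaussian noise.
W = rng.normal(size=(d_x, d_z))

def sample_from_model():
    z = rng.standard_normal(d_z)                 # Stage 1: z ~ N(0, I)
    x = W @ z + 0.1 * rng.standard_normal(d_x)   # Stage 2: x ~ p(x | z)
    return x

x_new = sample_from_model()
assert x_new.shape == (d_x,)
```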
Inference (understanding existing data):
Given a data point \(x\), compute the posterior distribution \(p(z \mid x)\) — what latent code likely gave rise to this data point? This inverse problem is typically intractable and must be approximated (e.g., via an encoder network in a VAE).
9.4 Latent Variable Implicit Models
When we give up on computing \(p_\theta(x)\) explicitly and instead use the latent variable model purely for generation, we obtain a latent variable implicit model:
\[ q \longrightarrow z \longrightarrow \boxed{G_\theta} \longrightarrow x \]
Here \(G_\theta\) is a generator network (rather than a probabilistic decoder). We can generate samples but cannot evaluate their likelihood. GANs follow this paradigm: a simple prior \(q(z)\) feeds into a generator \(G_\theta\) that produces data samples, and the model is trained through an adversarial game rather than maximum likelihood.
Taxonomy of Generative Models:
| Model Type | Computes \(p_\theta(x)\) | Latent Variable | Example |
|---|---|---|---|
| Explicit, no latent | Yes | No | Autoregressive LM (GPT) |
| Explicit, latent | Yes (via marginalisation) | Yes | VAE |
| Implicit, latent | No | Yes | GAN |
Summary
In this lesson, we have built the probabilistic foundation for understanding generative models:
- We reviewed probability distributions, joint distributions, conditional distributions, and the mechanics of sampling.
- We recast classification as a discriminative model that learns \(p_\theta(y \mid x)\).
- We defined generative models as models that learn \(p_\theta(x)\) — the distribution of the data itself.
- We distinguished between explicit density models (which compute likelihoods) and implicit density models (which generate samples directly).
- We showed that autoregressive language models are explicit density models, using the chain rule to factorise \(p(x)\) into a product of conditional probabilities.
- We derived the maximum likelihood training objective and connected it to KL divergence.
- We introduced latent variable models as a two-stage approach: sample a latent \(z\) from a prior, then generate \(x\) conditioned on \(z\).
These ideas set the stage for the Variational Autoencoder (VAE), which combines latent variable modeling with a variational approximation to the intractable posterior.