Multilayer Perceptrons: Foundations, Stability, and Regularization

Published

January 15, 2026

Abstract

This document explores the architecture and training of Multilayer Perceptrons (MLPs). It addresses the fundamental limitations of linear models, motivates the inclusion of hidden layers and non-linear activation functions, and provides the computational framework for forward and backward propagation. Furthermore, it analyzes numerical stability issues such as vanishing and exploding gradients, and introduces regularization techniques like dropout, early stopping, and K-fold cross-validation to improve model generalization.


Introduction

Deep learning represents a paradigm shift from traditional linear models by enabling the learning of complex, non-linear representations from raw data. The Multilayer Perceptron (MLP) is the foundational architecture of deep learning, consisting of an input layer, one or more hidden layers, and an output layer. By interleaving affine transformations with non-linear activation functions, MLPs can approximate virtually any continuous function, overcoming the geometric constraints of linear classifiers.

Limitations of Linear Models and Geometric Interpretations

Linear models for binary classification rely on the assumption that classes can be separated by a single linear boundary. From a geometric perspective, this boundary is represented as a hyperplane in the input feature space. If the data is not linearly separable, a linear classifier will necessarily fail to achieve perfect accuracy, regardless of the optimization effort.

Definition

A hyperplane in a \(d\)-dimensional space is a \((d-1)\)-dimensional subspace defined by the set of points \(\mathbf{x}\) satisfying the affine equation: \(f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b = 0\), where \(\mathbf{w} \in \mathbb{R}^d\) is the weight vector (normal to the hyperplane) and \(b \in \mathbb{R}\) is the bias term.

The hyperplane partitions the space into two half-spaces. The sign of \(f(\mathbf{x})\) determines the side on which a point lies: \(f(\mathbf{x}) > 0\) assigns the input to one class, while \(f(\mathbf{x}) < 0\) assigns it to the other.
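
As a small illustration of this decision rule, the sketch below (using PyTorch, with a weight vector and bias chosen arbitrarily for demonstration) classifies points in \(\mathbb{R}^2\) by the sign of \(f(\mathbf{x})\).

Code
import torch

# Illustrative hyperplane parameters in R^2 (chosen arbitrarily for this example)
w = torch.tensor([1.0, -2.0])
b = torch.tensor(0.5)

def linear_classify(x):
    # Assign class +1 or -1 according to the sign of f(x) = w^T x + b
    f = torch.dot(w, x) + b
    return 1 if f > 0 else -1

print(linear_classify(torch.tensor([3.0, 0.0])))  # f = 3.5 > 0  -> class +1
print(linear_classify(torch.tensor([0.0, 1.0])))  # f = -1.5 < 0 -> class -1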

Important

Linearity implies the assumption of monotonicity. This means that increasing a specific feature must either always increase the model output (if the weight is positive) or always decrease it (if the weight is negative). This assumption fails for complex tasks like image recognition, where the significance of a pixel depends on the values of its neighbors.

In image processing, for instance, inverting the intensities of all pixels typically preserves the category of the depicted object, yet a linear model is forced to change its score because its output depends monotonically on each individual pixel value.

Multilayer Perceptron (MLP) Architecture

To overcome the constraints of linearity, we stack multiple fully connected layers. An MLP transforms input features into hidden representations through one or more intermediate ‘hidden’ layers before reaching the output layer. However, stacking linear layers alone is insufficient, as the composition of affine transformations remains an affine transformation.

Definition

A Multilayer Perceptron (MLP) is a feedforward neural network whose hidden layers apply transformations of the form \(\mathbf{H} = \sigma(\mathbf{X}\mathbf{W}^{(1)} + \mathbf{b}^{(1)})\), followed by an output layer such as \(\mathbf{O} = \mathbf{H}\mathbf{W}^{(2)} + \mathbf{b}^{(2)}\), where \(\mathbf{X}\) is the input, the \(\mathbf{W}^{(i)}\) and \(\mathbf{b}^{(i)}\) are learnable parameters, and \(\sigma\) is a non-linear activation function applied element-wise.

Without the non-linear function \(\sigma\), a network with any number of hidden layers collapses into a mathematically equivalent single-layer linear model; for two bias-free layers, \(\mathbf{O} = (\mathbf{X}\mathbf{W}^{(1)})\mathbf{W}^{(2)} = \mathbf{X}(\mathbf{W}^{(1)}\mathbf{W}^{(2)}) = \mathbf{X}\mathbf{W}_{\text{combined}}\).
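
A quick numerical sketch of this collapse, assuming two bias-free linear layers with arbitrary sizes: composing them is indistinguishable from applying the single combined weight matrix.

Code
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)        # 4 examples, 8 input features (sizes are arbitrary)
W1 = torch.randn(8, 16)      # first linear layer
W2 = torch.randn(16, 3)      # second linear layer

O_stacked = X @ W1 @ W2      # two stacked linear layers without an activation
O_single = X @ (W1 @ W2)     # one linear layer with the combined weight matrix

print(torch.allclose(O_stacked, O_single))  # True: the stack collapses to a single linear map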

Important

Common activation functions include:

  • ReLU (Rectified Linear Unit): \(\text{ReLU}(x) = \max(0, x)\).
  • Sigmoid: \(\sigma(x) = \frac{1}{1 + \exp(-x)}\).
  • Tanh (Hyperbolic Tangent): \(\tanh(x) = \frac{1 - \exp(-2x)}{1 + \exp(-2x)}\).

The ReLU function is currently the most popular choice for hidden layers due to its simplicity and its ability to mitigate the vanishing gradient problem during backpropagation compared to sigmoid or tanh.
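
The snippet below, a minimal sketch using PyTorch's built-in activation functions, evaluates the three functions on a few sample pre-activations to show their different output ranges and saturation behaviour.

Code
import torch

z = torch.linspace(-4.0, 4.0, steps=5)  # sample pre-activations: [-4, -2, 0, 2, 4]

print(torch.relu(z))     # max(0, x): zeroes out negative inputs, unbounded above
print(torch.sigmoid(z))  # squashes into (0, 1); saturates for large |x|
print(torch.tanh(z))     # squashes into (-1, 1); zero-centered, also saturates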

Forward and Backward Propagation

Training a neural network involves two primary computational passes: forward propagation to calculate the loss and backward propagation to calculate the gradients required for parameter updates. These processes are formalized through the use of computational graphs and the chain rule of calculus.

Algorithm

The Forward Pass for a single-hidden-layer MLP (ignoring bias for simplicity) is computed as follows:

  1. \(\mathbf{z} = \mathbf{W}^{(1)}\mathbf{x}\)
  2. \(\mathbf{h} = \phi(\mathbf{z})\)
  3. \(\mathbf{o} = \mathbf{W}^{(2)}\mathbf{h}\)
  4. \(L = l(\mathbf{o}, y)\)
  5. \(J = L + s(\mathbf{W})\), where \(J\) is the regularized objective and \(s(\mathbf{W}) = \frac{\lambda}{2}\left(\|\mathbf{W}^{(1)}\|_F^2 + \|\mathbf{W}^{(2)}\|_F^2\right)\) is an \(\ell_2\) regularization term (this choice yields the \(\lambda\mathbf{W}\) terms in the backward pass below).
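
A minimal sketch of these five steps in PyTorch, assuming illustrative layer sizes, \(\phi = \text{ReLU}\), a squared-error loss for \(l\), and the \(\ell_2\) penalty above for \(s\):

Code
import torch

torch.manual_seed(0)
x = torch.randn(5)                            # one input example with 5 features
y = torch.tensor(1.0)                         # scalar target (illustrative)
W1 = torch.randn(8, 5, requires_grad=True)    # hidden-layer weights
W2 = torch.randn(1, 8, requires_grad=True)    # output-layer weights
lam = 1e-3                                    # regularization strength lambda

z = W1 @ x                                    # step 1: intermediate variable
h = torch.relu(z)                             # step 2: hidden representation, phi = ReLU
o = W2 @ h                                    # step 3: output
L = (o - y).pow(2).sum()                      # step 4: loss l(o, y), squared error here
s = lam / 2 * (W1.pow(2).sum() + W2.pow(2).sum())  # l2 regularization term s(W)
J = L + s                                     # step 5: regularized objective
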
Algorithm

The Backward Pass (Backpropagation) computes gradients by traversing the computational graph in reverse:

  1. \(\frac{\partial J}{\partial \mathbf{o}}\): Gradient of objective w.r.t. output.
  2. \(\frac{\partial J}{\partial \mathbf{W}^{(2)}} = \frac{\partial J}{\partial \mathbf{o}} \mathbf{h}^\top + \lambda \mathbf{W}^{(2)}\)
  3. \(\frac{\partial J}{\partial \mathbf{h}} = (\mathbf{W}^{(2)})^\top \frac{\partial J}{\partial \mathbf{o}}\)
  4. \(\frac{\partial J}{\partial \mathbf{z}} = \frac{\partial J}{\partial \mathbf{h}} \odot \phi'(\mathbf{z})\)
  5. \(\frac{\partial J}{\partial \mathbf{W}^{(1)}} = \frac{\partial J}{\partial \mathbf{z}} \mathbf{x}^\top + \lambda \mathbf{W}^{(1)}\)

Backpropagation reuses intermediate values calculated during the forward pass. Consequently, training requires significantly more memory than inference because these intermediate activations must be stored until the backward pass is complete.
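
Continuing the forward-pass sketch above (reusing x, y, W1, W2, lam, z, h, o, and J), the gradients in steps 1-5 can be computed by hand and checked against PyTorch's autograd; note that the stored intermediate values z, h, and o are all needed.

Code
dJ_do = 2 * (o - y)                        # step 1: dJ/do for the squared-error loss
dJ_dW2 = torch.outer(dJ_do, h) + lam * W2  # step 2: dJ/do h^T + lambda W2
dJ_dh = W2.t() @ dJ_do                     # step 3: (W2)^T dJ/do
dJ_dz = dJ_dh * (z > 0).float()            # step 4: dJ/dh elementwise ReLU'(z)
dJ_dW1 = torch.outer(dJ_dz, x) + lam * W1  # step 5: dJ/dz x^T + lambda W1

J.backward()                               # autograd traverses the same graph in reverse
print(torch.allclose(dJ_dW2, W2.grad))     # True
print(torch.allclose(dJ_dW1, W1.grad))     # True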

Numerical Stability and Parameter Initialization

Deep networks are susceptible to numerical instability due to the repeated multiplication of gradients across many layers. This instability typically manifests as gradients that either vanish or explode, both of which prevent effective learning.

Definition

The Vanishing Gradient Problem occurs when gradients become excessively small as they are backpropagated through layers, causing parameters in early layers to stop updating. This is common when using sigmoid activations, as their gradients are at most 0.25.

When inputs to a sigmoid are very large or very small, its gradient approaches zero, leading to ‘saturation’: the corresponding parameters receive almost no gradient signal and effectively stop updating.
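
A small sketch of this saturation effect: backpropagating through a sigmoid at a few input values shows that the gradient peaks at 0.25 and becomes vanishingly small once the input is far from zero.

Code
import torch

x = torch.tensor([-8.0, -4.0, 0.0, 4.0, 8.0], requires_grad=True)
torch.sigmoid(x).sum().backward()

# The derivative peaks at 0.25 (x = 0) and is nearly zero in the saturated regions.
print(x.grad)  # approximately [3.4e-04, 1.8e-02, 2.5e-01, 1.8e-02, 3.4e-04]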

Definition

The Exploding Gradient Problem occurs when gradients grow exponentially large, leading to divergent parameter updates and numerical overflow. This often results from poor weight initialization where weights are significantly larger than 1.

Exploding gradients can destroy a model in a single iteration by overwriting weights with NaN (Not a Number) values.
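
The sketch below mimics this effect: multiplying 100 random matrices, much as backpropagation multiplies per-layer Jacobians through a deep network, produces entries of astronomical magnitude (with sufficiently small initial weights, the same product would instead collapse toward zero).

Code
import torch

torch.manual_seed(0)

P = torch.eye(4)
for _ in range(100):            # one factor per "layer"
    P = P @ torch.randn(4, 4)   # standard-normal weights are far too large here

print(P.abs().max())            # typically astronomically large (far beyond 1e20)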

Important

To reduce these problems, we use careful Parameter Initialization. A standard approach is Xavier Initialization, which draws weights from a distribution with variance \(\sigma^2 = \frac{2}{n_{in} + n_{out}}\), ensuring that the variance of activations and gradients remains stable across layers.

Random initialization is also essential to ‘break symmetry.’ If all weights were initialized to the same value, every neuron in a hidden layer would compute the same function, effectively reducing the layer to a single unit.
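
A minimal sketch applying Xavier initialization with PyTorch's built-in initializer; the layer sizes are arbitrary.

Code
import torch
from torch import nn

layer = nn.Linear(256, 128)            # n_in = 256, n_out = 128 (arbitrary sizes)
nn.init.xavier_normal_(layer.weight)   # weights drawn with variance 2 / (n_in + n_out)

print(layer.weight.var().item())       # empirically close to 2 / (256 + 128) ≈ 0.0052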

Dropout Regularization

Dropout is a powerful regularization technique used to prevent overfitting in large neural networks. It operates by injecting noise into the hidden layers during training, forcing the network to learn robust features that do not rely on the presence of specific other neurons.

Definition

Dropout is a stochastic process where, during each training iteration, each neuron in a layer is dropped (set to zero) with probability \(p\). The remaining neurons are scaled by \(\frac{1}{1-p}\) to maintain the same expected total activation.

By design, the expectation of the activation remains unchanged: \(E[h'] = p \cdot 0 + (1-p) \cdot \frac{h}{1-p} = h\).

Code
import torch

def dropout_layer(X, dropout_prob):
    """Zero each element of X with probability dropout_prob and rescale the survivors."""
    assert 0 <= dropout_prob <= 1
    if dropout_prob == 1:          # drop everything
        return torch.zeros_like(X)
    # Keep each element with probability 1 - dropout_prob and rescale by 1 / (1 - dropout_prob)
    # so that the expected activation is unchanged.
    mask = (torch.rand(X.shape) > dropout_prob).float()
    return mask * X / (1.0 - dropout_prob)
Important

Dropout is a differentiable operator because, for any given iteration, the mask is treated as a constant scaling factor. The gradient simply passes through the binary mask during backpropagation, allowing for standard gradient descent.

Dropout is typically disabled at test time. This ensures that the model provides a deterministic prediction based on all available weights.
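
A short sketch with PyTorch's built-in nn.Dropout, illustrating the switch between the stochastic training behaviour and the deterministic evaluation behaviour:

Code
import torch
from torch import nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()    # training mode: random units zeroed, survivors scaled by 1 / (1 - p)
print(drop(x))  # e.g. tensor([[2., 0., 2., 2., 0., 0., 2., 0.]]) -- varies per call

drop.eval()     # evaluation mode: dropout disabled, input passes through unchanged
print(drop(x))  # tensor([[1., 1., 1., 1., 1., 1., 1., 1.]])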

Training Strategies: Validation and Overfitting

Deep networks are often over-parametrized, meaning they have more parameters than training examples. This makes them highly prone to interpolating noise rather than learning general patterns. To combat this, we employ specific validation strategies to monitor and control model complexity.

Important

Early Stopping is a regularization technique where training is terminated as soon as the validation error stops decreasing, even if the training error is still falling. This prevents the model from entering the ‘interpolation regime’ where it begins to memorize specific training samples.

Early stopping is a ‘free’ regularizer because it reduces computation time while improving generalization.
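
A sketch of a typical early-stopping loop; train_epoch and evaluate are hypothetical helpers assumed to run one training epoch and return the current validation error, respectively.

Code
def train_with_early_stopping(model, train_epoch, evaluate, max_epochs=100, patience=5):
    # Stop once the validation error has not improved for `patience` consecutive epochs.
    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch(model)             # hypothetical helper: one pass over the training data
        val_error = evaluate(model)    # hypothetical helper: current validation error
        if val_error < best_val:
            best_val, epochs_without_improvement = val_error, 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                  # validation error stopped decreasing: terminate early
    return model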

Definition

K-Fold Cross-Validation is a robust method for model selection where the dataset is partitioned into \(K\) non-overlapping subsets. The model is trained \(K\) times, each time using \(K-1\) folds for training and 1 fold for validation. The results are averaged to provide a stable estimate of generalization performance.

This method is particularly effective when the dataset is small, as it ensures that every example is used for validation exactly once and for training in the remaining \(K-1\) runs.
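
A minimal sketch of the data-splitting step, assuming the dataset size is divisible by \(K\); train_and_evaluate is a hypothetical helper that trains on the given folds and returns a validation error.

Code
import torch

def k_fold_split(X, y, k, i):
    # Use the i-th fold for validation and the remaining k-1 folds for training.
    fold_size = X.shape[0] // k
    start, stop = i * fold_size, (i + 1) * fold_size
    X_train = torch.cat([X[:start], X[stop:]], dim=0)
    y_train = torch.cat([y[:start], y[stop:]], dim=0)
    return X_train, y_train, X[start:stop], y[start:stop]

# Average the validation error over all k runs (train_and_evaluate is an assumed helper):
# errors = [train_and_evaluate(*k_fold_split(X, y, k=5, i=i)) for i in range(5)]
# print(sum(errors) / len(errors))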

Conclusion

The Multilayer Perceptron is the cornerstone of deep learning, providing the mathematical flexibility needed to model non-linear data. However, successfully training these networks requires more than just stacking layers; it necessitates a deep understanding of activation functions, the mechanics of backpropagation, numerical stability, and robust regularization techniques. By combining MLPs with strategies like Xavier initialization, dropout, and cross-validation, practitioners can build models that not only fit training data but generalize effectively to unseen real-world challenges.

References

  1. Zhang, Aston, et al. Dive into Deep Learning. Cambridge University Press, 2023, pp. 206–245.