Convolutional Neural Networks
Introduction
In traditional feedforward neural networks, data is often treated as flat vectors, ignoring the inherent spatial or temporal structure of the input. For tabular data, this approach is appropriate. However, image data possesses a rich grid-like topology where the spatial relationship between pixels carries essential information. Treating an image merely as a flattened vector of pixel intensities discards this structure and leads to statistically and computationally inefficient models. Convolutional Neural Networks (CNNs) address these issues by incorporating inductive biases—specifically translation invariance and locality—directly into the architecture.
Inefficiency of Fully Connected Layers
Before introducing convolutions, it is necessary to understand why standard Multilayer Perceptrons (MLPs) are ill-suited for image analysis. In an MLP, a layer is often described as densely connected or fully connected. This implies that every single input neuron is connected to every single output neuron via a unique learnable weight. While this allows for maximum flexibility in modeling interactions between features, it ignores the specific structure of image data.
For a fully connected layer transforming an input vector of dimension \(d_{in}\) to a hidden layer of dimension \(d_{out}\), the number of parameters required (excluding bias) is: \[ N_{params} = d_{in}\times d_{out} \]
The term “dense” reflects this \(O(d_{in} \cdot d_{out})\) scaling, which leads to an explosion in parameter count for high-dimensional inputs.
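To make the scaling concrete, the arithmetic below uses hypothetical sizes (a 1000×1000 grayscale image flattened to a vector, and a hidden layer of 1,000 units); the specific numbers are illustrative, not drawn from any particular model.

```python
# Dense-layer parameter count for a flattened image input.
# Hypothetical sizes: 1000x1000 grayscale image, 1,000 hidden units.
d_in = 1000 * 1000   # input dimension after flattening
d_out = 1000         # hidden-layer width
n_params = d_in * d_out
print(n_params)      # one billion weights, before biases
```

A single layer already requires a billion weights, which motivates the constrained alternatives below.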
Constrained Learning and Inductive Biases
The fundamental issue with applying dense layers to images is not just the number of parameters, but what those parameters represent. In a dense layer, the model must learn patterns independently for every location in the image. For example, if a dense network learns to identify a cat in the top-left corner, it does not automatically know how to identify the same cat in the bottom-right corner. It must relearn the weights for the “bottom-right cat detector” from scratch.
To solve this, we introduce constraints based on prior knowledge of image statistics. We desire the property of translation equivariance, meaning that a shift in the input should result in a corresponding shift in the activation map, rather than a completely different activation pattern. (True translation invariance, where the output is unchanged by small shifts, is approximated later by pooling and the classification head.) Furthermore, we rely on the principle of locality: relevant patterns (like edges or textures) are local, and a neuron usually only needs to see a small window of pixels to detect a feature.
While a sufficiently large MLP could theoretically learn these constraints (given infinite data and time), it is extremely inefficient. By hard-coding these constraints into the architecture, we dramatically reduce the search space for the parameters, replacing general matrix multiplication with the specific structure of convolution.
The Convolution Operation
Motivated by the principles of translation invariance and locality, we define the convolution operation. In the context of neural networks, a convolutional layer applies a filter (or kernel) to the input data. This filter slides over the input, computing a dot product between the kernel weights and the local input patch at every position.
Mathematically, the 2D cross-correlation operation (often loosely referred to as convolution in deep learning) for an input matrix \(\mathbf{X}\) and a kernel \(\mathbf{K}\) is defined as: \[ (\mathbf{X} * \mathbf{K})_{i,j} = \sum_{a} \sum_{b} \mathbf{X}_{i+a, j+b} \cdot \mathbf{K}_{a, b} \] where the indices \(a, b\) run over the dimensions of the kernel.
Strictly speaking, the mathematical definition of convolution requires flipping the kernel relative to the input. However, since the weights of the kernel are learned from data, the distinction between cross-correlation and convolution is irrelevant for the expressiveness of the model. Deep learning frameworks generally implement cross-correlation but call it convolution.
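The operation defined above can be implemented directly with two nested loops. The sketch below is a minimal, unoptimized NumPy version (frameworks use far faster algorithms); the input and kernel values are chosen so the result is easy to verify by hand. Note that the 3×3 input and 2×2 kernel produce a 2×2 output, matching the valid-size formula given below.

```python
import numpy as np

def corr2d(X, K):
    """Valid 2D cross-correlation of input X with kernel K (no padding, stride 1)."""
    kh, kw = K.shape
    oh, ow = X.shape[0] - kh + 1, X.shape[1] - kw + 1
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Dot product of the kernel with the local input patch
            Y[i, j] = (X[i:i + kh, j:j + kw] * K).sum()
    return Y

X = np.arange(9.0).reshape(3, 3)              # [[0,1,2],[3,4,5],[6,7,8]]
K = np.array([[0.0, 1.0], [2.0, 3.0]])
print(corr2d(X, K))                           # [[19. 25.] [37. 43.]]
```

For example, the top-left output is \(0\cdot0 + 1\cdot1 + 3\cdot2 + 4\cdot3 = 19\).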
When dealing with multi-channel inputs (such as RGB images), the convolution becomes a 3D operation. The kernel must have the same depth (number of input channels, \(c_{in}\)) as the input. The response at a spatial location is the sum of the cross-correlations across all channels.
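A self-contained sketch of the multi-channel case: the kernel gains a channel axis matching \(c_{in}\), and the window dot product now also sums over channels. The two-channel input and kernel here are illustrative values chosen for easy hand-checking.

```python
import numpy as np

def corr2d_multi_in(X, K):
    """X: (c_in, h, w); K: (c_in, kh, kw). Sums cross-correlations over channels."""
    c, kh, kw = K.shape
    oh, ow = X.shape[1] - kh + 1, X.shape[2] - kw + 1
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The elementwise product runs over channels as well as the window
            Y[i, j] = (X[:, i:i + kh, j:j + kw] * K).sum()
    return Y

base = np.arange(9.0).reshape(3, 3)
X = np.stack([base, base + 1])                              # 2 input channels
K = np.stack([np.array([[0.0, 1.0], [2.0, 3.0]]),
              np.array([[1.0, 2.0], [3.0, 4.0]])])
print(corr2d_multi_in(X, K))                                # [[ 56.  72.] [104. 120.]]
```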
Given an input of spatial size \(n_h \times n_w\) and a convolution kernel of size \(k_h \times k_w\), the output spatial dimensions (assuming valid cross-correlation without padding) are: \[ (n_h - k_h + 1) \times (n_w - k_w + 1) \]
The \(1 \times 1\) Convolution
A special case of convolution arises when the kernel size is \(1 \times 1\). While this operation does not aggregate information across adjacent spatial pixels, it is widely used in deep network architectures. A \(1 \times 1\) convolution computes a linear combination of the input channels at every pixel location.
This can be viewed as a fully connected layer applied independently and identically to each pixel position. It is computationally efficient for mixing channels (changing the depth of the tensor) without altering the spatial resolution.
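Because a \(1 \times 1\) convolution is just a per-pixel linear map over channels, it can be written as a single matrix multiplication after flattening the spatial dimensions. The sketch below assumes a channels-first layout of shape \((c_{in}, h, w)\); the function name and sizes are illustrative.

```python
import numpy as np

def conv_1x1(X, W):
    """1x1 convolution. X: (c_in, h, w); W: (c_out, c_in) weight matrix."""
    c_in, h, w = X.shape
    # Flatten spatial positions, mix channels with one matmul, restore the grid.
    Y = W @ X.reshape(c_in, h * w)
    return Y.reshape(W.shape[0], h, w)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4, 4))   # 3 input channels on a 4x4 grid
W = rng.standard_normal((2, 3))      # map 3 channels down to 2
print(conv_1x1(X, W).shape)          # (2, 4, 4): depth changes, resolution does not
```

At every pixel, the output channels are the same linear combination of the input channels, which is exactly the "fully connected layer per position" view.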
Padding and Stride
Repeated application of convolutions with kernels larger than \(1 \times 1\) reduces the spatial dimensions of the feature maps, potentially losing information at the boundaries. To control the output size, we employ padding. Padding involves adding implicit rows and columns (typically zeros) around the border of the input tensor.
Conversely, we may wish to reduce the spatial resolution aggressively to lower computational costs or increase the receptive field. Stride refers to the interval at which the filter is applied. A stride of \(s\) means we skip \(s-1\) pixels between convolution steps.
For an input of height \(n_h\), kernel height \(k_h\), total vertical padding \(p_h\), and vertical stride \(s_h\), the output height is given by: \[ \left\lfloor \frac{n_h - k_h + p_h + s_h}{s_h} \right\rfloor \]
If we set padding \(p_h = k_h - 1\) and stride \(s_h = 1\), the output dimension equals the input dimension. Strides \(s > 1\) reduce the dimension by a factor of roughly \(s\).
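The output-size formula and these two special cases can be checked directly; note that \(p_h\) here is the total padding summed over both sides:

```python
import math

def conv_out(n, k, p, s):
    """Output size: floor((n - k + p + s) / s), with p the TOTAL padding."""
    return math.floor((n - k + p + s) / s)

# Same-size output: total padding p = k - 1 with stride 1
print(conv_out(28, 5, 4, 1))  # 28
# Stride 2 roughly halves the resolution
print(conv_out(28, 3, 2, 2))  # 14
```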
Pooling Layers
Pooling layers serve a dual purpose: they reduce the spatial dimensions of the feature maps (downsampling) and they provide a degree of local translation invariance. By aggregating a local window of values into a single statistic, the network becomes less sensitive to small shifts in the input image.
Max Pooling computes the maximum value within a fixed window: \[ y_{i,j} = \max_{(a,b) \in \mathcal{W}_{i,j}} x_{a,b} \] where \(\mathcal{W}_{i,j}\) is the pooling window corresponding to output location \((i,j)\).
Average Pooling computes the arithmetic mean of the elements within the window: \[ y_{i,j} = \frac{1}{|\mathcal{W}_{i,j}|} \sum_{(a,b) \in \mathcal{W}_{i,j}} x_{a,b} \]
Max pooling is generally preferred in modern architectures as it preserves the strongest features (e.g., the strongest edge detection) in the window, effectively separating signal from noise.
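Both pooling variants are simple windowed reductions; the sketch below uses stride 1 for brevity (in practice pooling usually uses a stride equal to the window size), with a hand-checkable 3×3 input.

```python
import numpy as np

def pool2d(X, k, mode="max"):
    """k x k pooling with stride 1. mode is 'max' or 'avg'."""
    oh, ow = X.shape[0] - k + 1, X.shape[1] - k + 1
    Y = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            win = X[i:i + k, j:j + k]
            Y[i, j] = win.max() if mode == "max" else win.mean()
    return Y

X = np.arange(9.0).reshape(3, 3)    # [[0,1,2],[3,4,5],[6,7,8]]
print(pool2d(X, 2))                 # [[4. 5.] [7. 8.]]
print(pool2d(X, 2, mode="avg"))     # [[2. 3.] [5. 6.]]
```

Shifting the strongest value within a window by one pixel often leaves the max-pooled output unchanged, which is the local translation invariance described above.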
Case Study: LeNet
LeNet, specifically LeNet-5, is a pioneering Convolutional Neural Network introduced by Yann LeCun and collaborators in the 1990s. It was designed for the recognition of handwritten digits (such as those on bank checks) and demonstrated the feasibility of training CNNs end-to-end via backpropagation.
Despite its modest size by modern standards, LeNet established the standard design pattern for CNNs: a convolutional encoder that progressively reduces spatial resolution while increasing channel depth, followed by a dense classification head. Training was performed using stochastic gradient descent to minimize error, a method that remains the backbone of deep learning today.
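The output-size formula from the padding-and-stride section is enough to trace the spatial resolution through LeNet-5's convolutional encoder. The sketch below follows the classic configuration (32×32 grayscale input, two 5×5 convolution stages each followed by 2×2 pooling with stride 2):

```python
def conv_out(n, k, p=0, s=1):
    # floor((n - k + p + s) / s), with p the total padding
    return (n - k + p + s) // s

n = 32                     # 32x32 grayscale input
n = conv_out(n, k=5)       # C1: 6 filters of 5x5       -> 28x28
n = conv_out(n, k=2, s=2)  # S2: 2x2 pooling, stride 2  -> 14x14
n = conv_out(n, k=5)       # C3: 16 filters of 5x5      -> 10x10
n = conv_out(n, k=2, s=2)  # S4: 2x2 pooling, stride 2  -> 5x5
print(n, 16 * n * n)       # 5 400: the 400 flattened features feed the dense head
```

The pattern is visible in the numbers: resolution falls from 32 to 5 while channel depth grows from 1 to 16, exactly the "shrink space, grow channels" encoder design that later architectures inherited.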
Conclusion
This document has outlined the fundamental components of Convolutional Neural Networks. By constraining the learning process through shared weights and local connectivity, CNNs overcome the inefficiencies of fully connected networks on high-dimensional grid data. The convolution operation, augmented by padding and stride, allows for flexible feature extraction, while pooling layers provide invariance and dimensionality reduction. These concepts culminated in early architectures like LeNet, which laid the groundwork for the modern revolution in computer vision.