Deep Residual Learning for Image Recognition
Introduction
Depth has been a primary driver of success in convolutional neural networks (CNNs). As networks grow deeper, they can integrate increasingly rich, hierarchical features in an end-to-end fashion. However, simply stacking layers leads to difficulties in convergence and optimization. While normalization techniques such as Batch Normalization have largely mitigated the vanishing/exploding gradient problem, a different obstacle emerged: the degradation problem. This lesson details how residual learning addresses this bottleneck, allowing networks with over 100 layers to be trained effectively.
1. Motivation and the Degradation Problem
In the evolution of deep learning, depth is recognized as a critical factor for success. Modern architectures often stack many layers to capture hierarchical feature representations. However, a counterintuitive phenomenon occurs when network depth increases beyond a certain point: accuracy saturates and then degrades rapidly. This degradation is not caused by overfitting, as evidenced by higher training errors in deeper models compared to shallower ones.
The Degradation Problem: With the network depth increasing, accuracy gets saturated and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error.
This suggests that not all systems are similarly easy to optimize. If a deeper model is constructed by adding identity layers to a shallower counterpart, the deeper model should, in theory, perform at least as well as the shallow one. The failure of optimization to find this solution indicates that standard solvers struggle with approximating identity mappings using stacks of nonlinear layers.
2. Empirical Evidence of Optimization Failure
Empirical studies on datasets like CIFAR-10 and ImageNet demonstrate the degradation problem clearly. When comparing a 20-layer ‘plain’ network with a 56-layer one, the 56-layer network consistently exhibits higher training error and, consequently, higher test error. This occurs despite the use of modern initialization and Batch Normalization, which ensure that gradients do not vanish.
Plain Network: A neural network architecture where layers are stacked sequentially, and the mapping learned by each layer or stack of layers is intended to directly approximate the desired underlying mapping \(H(x)\).
The fact that training error increases with depth indicates that the problem is one of optimization rather than generalization (overfitting). The solver fails to find a comparably good solution even though the deeper model's hypothesis space contains the solution space of the shallower network.
3. The Residual Learning Reformulation
To address degradation, we reformulate the learning objective. Instead of expecting a stack of layers to fit a desired mapping \(H(x)\), we let these layers fit a residual mapping \(F(x):=H(x)-x\). The original mapping is then recast as \(F(x)+x\). This approach is motivated by the hypothesis that it is easier to optimize the residual mapping than the original unreferenced mapping.
Residual Mapping: A reformulation where the layers learn the difference or ‘residue’ \(F(x)\) required to reach the target mapping \(H(x)\) from the input \(x\).
If the optimal mapping is an identity mapping, it is easier for the solver to drive the weights of the nonlinear layers toward zero than to fit an identity mapping from scratch using stacked nonlinearities.
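To state this concretely, suppose the optimal mapping for a block is the identity, \(H^{*}(x)=x\). The residual target then collapses to zero, \[F^{*}(x)=H^{*}(x)-x=0,\] so the solver only needs to drive the block's weights toward zero, after which \[y=F(x)+x\approx 0+x=x.\] A plain block, by contrast, must fit \(H^{*}(x)=x\) directly through its stack of nonlinear layers.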
In real cases, even when the identity mapping is not optimal, the residual reformulation provides better preconditioning of the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it is easier for the solver to find small perturbations with reference to the identity than to learn the function anew.
4. Architecture of the Residual Block
The residual learning principle is implemented through ‘shortcut connections’ that skip one or more layers. In the standard ResNet building block, the shortcut performs identity mapping, and its output is added to the output of the stacked layers. This addition is performed element-wise, followed by a second nonlinearity.
Consider a block with two layers, and let \(\sigma\) denote the ReLU activation function. The residual function is \[F=W_2\,\sigma(W_1x),\] where biases are omitted to simplify notation. The operation \(F+x\) is performed by a shortcut connection and element-wise addition, and the second nonlinearity is applied after the addition: \[y=\sigma(F+x).\] This structure introduces no extra parameters and no additional computational complexity beyond a negligible element-wise addition.
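The following PyTorch sketch illustrates such a two-layer block under common conventions (3×3 convolutions with Batch Normalization); the class name and channel handling are assumptions of this illustration rather than the reference implementation:

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two-layer residual block: y = relu(F(x) + x), with F(x) = W2 * relu(W1 * x)."""

    def __init__(self, channels):
        super().__init__()
        # First stacked layer: 3x3 convolution + BN + ReLU, i.e. sigma(W1 x)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        # Second stacked layer: 3x3 convolution + BN, i.e. F(x) = W2 sigma(W1 x)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # parameter-free identity shortcut
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity                      # element-wise addition: F(x) + x
        return self.relu(out)                     # second nonlinearity after the addition
```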
Identity shortcuts are particularly attractive because they do not increase the number of parameters or the computational cost (FLOPs) of the network, allowing for fair comparisons between plain and residual networks.
5. Handling Dimension Mismatch
The equation \(y=F(x)+x\) assumes that \(F\) and \(x\) have the same dimensions. In CNNs, however, dimensions change at certain points in the network, for example when the number of filters increases or the feature map size is reduced by a stride. Two primary strategies handle this mismatch in the shortcut connection: (Option A) keep the identity shortcut and pad the extra channels with zeros, which introduces no additional parameters; or (Option B) use a projection shortcut, typically a \(1\times 1\) convolution, to match the output dimensions.
While projection shortcuts (Option B) can slightly outperform zero-padding (Option A) due to the added flexibility, the difference is usually marginal. Parameter-free identity shortcuts are preferred, especially for the ‘bottleneck’ designs described later, to keep complexity and model size manageable.
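As an illustration, a minimal PyTorch sketch of a projection shortcut follows (the helper name `make_shortcut` and the placement of Batch Normalization are assumptions of this sketch, not part of the original formulation):

```python
import torch.nn as nn

def make_shortcut(in_channels, out_channels, stride):
    """Return an identity shortcut when shapes already match, otherwise a
    1x1-convolution projection (Option B) that matches both the channel
    count and the spatial size of F(x)."""
    if stride == 1 and in_channels == out_channels:
        return nn.Identity()  # shapes match: plain, parameter-free identity path
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
        nn.BatchNorm2d(out_channels),
    )
```

Option A (zero-padding the extra channels) is not shown here; it likewise adds no parameters but offers no learned flexibility.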
6. Scaling to Very Deep Networks: Bottleneck Blocks
For deeper versions of ResNet (e.g., ResNet-50/101/152), the standard two-layer residual block is modified into a ‘bottleneck’ design. This modification is driven by concerns regarding the training time and computational budget as depth increases significantly.
Bottleneck Block: A residual block consisting of three layers: \(1\times 1\), \(3\times 3\), and \(1\times 1\) convolutions. The \(1\times 1\) layers are responsible for reducing and then increasing (restoring) dimensions, leaving the \(3\times 3\) layer as a bottleneck with smaller input/output dimensions.
The bottleneck design allows for much deeper models. For example, a 152-layer ResNet has lower complexity (11.3 billion FLOPs) than a VGG-16 or VGG-19 network (15.3 or 19.6 billion FLOPs), despite being significantly deeper.
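A minimal PyTorch sketch of the bottleneck block is given below; the channel widths (e.g., 256 reduced to 64, as in typical ResNet-50 stages) and the omission of stride/downsampling handling are simplifications of this illustration:

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    """1x1 (reduce) -> 3x3 -> 1x1 (restore), with an identity shortcut around all three."""

    def __init__(self, channels, reduced):
        super().__init__()
        self.reduce = nn.Sequential(          # 1x1 conv shrinks the channel dimension
            nn.Conv2d(channels, reduced, kernel_size=1, bias=False),
            nn.BatchNorm2d(reduced), nn.ReLU(inplace=True))
        self.conv3x3 = nn.Sequential(         # 3x3 conv operates on the smaller dimension
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(reduced), nn.ReLU(inplace=True))
        self.restore = nn.Sequential(         # 1x1 conv restores the channel dimension
            nn.Conv2d(reduced, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.restore(self.conv3x3(self.reduce(x)))
        return self.relu(out + x)             # identity shortcut keeps the block cheap

# Example: block = Bottleneck(channels=256, reduced=64)
```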
7. Experimental Evidence and Results
Extensive experiments on ImageNet demonstrate that ResNets successfully overcome the degradation problem. Unlike plain networks, deeper ResNets consistently yield lower training error and better validation accuracy than their shallower counterparts. An ensemble of residual networks (including ResNet-152) achieved a 3.57% top-5 error on the ImageNet test set, winning first place in the ILSVRC 2015 classification competition.
Key Experimental Observations:
- Accuracy Gains from Depth: ResNets show a clear trend where increased depth leads to increased accuracy.
- Faster Convergence: ResNets tend to converge faster in the early stages of training compared to plain networks.
- Generalization: Representations learned by deep ResNets generalize well to other tasks like object detection and segmentation (e.g., COCO dataset).
Experiments on CIFAR-10 with networks up to 1202 layers further explored the limits of depth. While the 1202-layer model showed no optimization difficulty, it performed slightly worse than a 110-layer model, likely due to overfitting on the relatively small dataset.
Conclusion
Deep Residual Learning provides a robust solution to the degradation problem in deep neural networks. By recasting the learning objective as a residual function and using identity shortcut connections, ResNets enable the training of extremely deep architectures that are both easier to optimize and more accurate. This framework has become a foundational component of modern computer vision, establishing residual connections as an essential architectural element for deep representation learning.