Early Convolutional Neural Network Architectures: AlexNet, NiN, VGG, and GoogLeNet

Published

January 19, 2026

Abstract

This lecture explores the foundational architectures in the modern era of deep convolutional neural networks (CNNs) for image classification. We begin with AlexNet (2012), examining its pioneering use of deep layers, ReLU activations, and GPU acceleration. We then discuss Network-in-Network (NiN, 2013), which introduced 1x1 convolutions and Global Average Pooling to replace heavy fully-connected layers. We proceed to VGG (2014), which demonstrated the power of depth using modular, uniform 3x3 convolutions. Finally, we analyze GoogLeNet (Inception v1, 2014), which introduced parallel processing paths within the Inception module to increase depth and width without a corresponding explosion in computational cost.

Introduction

The resurgence of neural networks in computer vision began in earnest around 2012, driven by the availability of large-scale datasets like ImageNet and the computational power of GPUs. Prior to this, computer vision largely relied on hand-crafted features such as SIFT or HOG, followed by shallow classifiers like Support Vector Machines. The shift towards end-to-end learning, where feature extractors are learned simultaneously with classifiers, revolutionized the field. This document traces the architectural evolution of CNNs during this pivotal period, moving from the successful scaling of standard layers in AlexNet to the structural innovations of Network-in-Network, the modular depth of VGG, and the complex parallel topology of GoogLeNet.

Overview of Early CNN Development

Before 2012, convolutional neural networks (such as LeNet-5) existed but were primarily applied to smaller, low-resolution datasets like MNIST. Scaling these architectures to high-resolution, realistic images was computationally prohibitive and prone to overfitting due to limited training data. The release of the ImageNet dataset (ILSVRC), containing over a million labeled images, provided the necessary data scale. Simultaneously, the utilization of Graphics Processing Units (GPUs) for general-purpose computing allowed for the efficient training of much larger models. This convergence set the stage for a rapid succession of architectural breakthroughs that defined the modern deep learning era.

AlexNet (2012)

AlexNet, proposed by Krizhevsky et al., is widely considered the breakthrough architecture that popularized deep learning for computer vision. Winning the ILSVRC-2012 competition by a large margin, it demonstrated that deep convolutional networks could learn complex features directly from raw pixels, significantly outperforming traditional hand-crafted feature methods.

Important

Key Novel Contributions of AlexNet:

  1. Rectified Linear Units (ReLU): AlexNet replaced the standard sigmoid or tanh activation functions with ReLU (\(f(x) = \max(0, x)\)). This mitigated the saturating-gradient problems of sigmoid and tanh in deep networks, and the authors report reaching a given training error roughly six times faster than with an equivalent tanh network.
  2. Dropout: To prevent overfitting in the massive fully-connected layers, AlexNet introduced Dropout, a regularization technique where neurons are randomly set to zero during training with probability \(0.5\).
  3. Multi-GPU Training: The model was split across two GPUs to handle the memory requirements, with specific inter-GPU communication pathways.
  4. Data Augmentation: Extensive use of image translations, reflections, and principal component analysis (PCA) color shifting to artificially expand the training set.

Historically, AlexNet proved that end-to-end representation learning was viable on large-scale data. It shifted the community’s focus from feature engineering to architecture engineering.

The architecture consists of five convolutional layers followed by three fully connected layers. The input is a \(224 \times 224 \times 3\) RGB image (the paper states 224, but an input of \(227 \times 227\) is required for the first layer’s \(11 \times 11\), stride-4 convolution to produce the stated \(55 \times 55\) output).
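
To make this concrete, here is a minimal Python sketch of the standard convolution output-size formula; the helper name conv_output_size is illustrative, not from the paper.

def conv_output_size(input_size: int, kernel: int, stride: int, padding: int = 0) -> int:
    """Spatial output size of a convolution: floor((W - K + 2P) / S) + 1."""
    return (input_size - kernel + 2 * padding) // stride + 1

# AlexNet Layer 1: 11x11 filters, stride 4, no padding.
print(conv_output_size(227, 11, 4))  # 55 -> matches the 55x55x96 volume below
print(conv_output_size(224, 11, 4))  # 54 -> does not, hence the 227x227 convention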

Algorithm

System Architecture of AlexNet

Let \(X_0 \in \mathbb{R}^{224 \times 224 \times 3}\) be the input image.

  1. Layer 1 (Conv + Pool):

    Convolution: 96 filters of size \(11 \times 11\), stride 4. \[ X_1 = \text{ReLU}(\text{Conv}(X_0, W_1)) \] Output volume: \(55 \times 55 \times 96\).

    Max Pooling: \(3 \times 3\) window, stride 2. \[ P_1 = \text{MaxPool}(X_1) \] Output volume: \(27 \times 27 \times 96\).

  2. Layer 2 (Conv + Pool):

    Convolution: 256 filters of size \(5 \times 5\), padding 2. \[ X_2 = \text{ReLU}(\text{Conv}(P_1, W_2)) \] Output volume: \(27 \times 27 \times 256\).

    Max Pooling: \(3 \times 3\) window, stride 2. \[ P_2 = \text{MaxPool}(X_2) \] Output volume: \(13 \times 13 \times 256\).

  3. Layer 3 (Conv):

    Convolution: 384 filters of size \(3 \times 3\), padding 1. \[ X_3 = \text{ReLU}(\text{Conv}(P_2, W_3)) \] Output volume: \(13 \times 13 \times 384\).

  4. Layer 4 (Conv):

    Convolution: 384 filters of size \(3 \times 3\), padding 1. \[ X_4 = \text{ReLU}(\text{Conv}(X_3, W_4)) \] Output volume: \(13 \times 13 \times 384\).

  5. Layer 5 (Conv + Pool):

    Convolution: 256 filters of size \(3 \times 3\), padding 1. \[ X_5 = \text{ReLU}(\text{Conv}(X_4, W_5)) \] Output volume: \(13 \times 13 \times 256\).

    Max Pooling: \(3 \times 3\) window, stride 2. \[ P_5 = \text{MaxPool}(X_5) \] Output volume: \(6 \times 6 \times 256\).

  6. Flattening:

    \(P_5\) is flattened into a vector \(v \in \mathbb{R}^{6 \times 6 \times 256} = \mathbb{R}^{9216}\).

  7. Fully Connected Layers (FC6, FC7, FC8):

    FC6: 4096 neurons, ReLU, Dropout. \[ H_1 = \text{Dropout}(\text{ReLU}(W_6 v + b_6)) \]

    FC7: 4096 neurons, ReLU, Dropout. \[ H_2 = \text{Dropout}(\text{ReLU}(W_7 H_1 + b_7)) \]

    FC8 (Output): 1000 neurons (for 1000 classes), Softmax. \[ \hat{y} = \text{Softmax}(W_8 H_2 + b_8) \]

AlexNet contains approximately 60 million parameters. The vast majority of these parameters reside in the fully connected layers, specifically the connection between the last convolutional block and the first fully connected layer (\(9216 \times 4096\) weights).
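
The layer shapes and the parameter count above can be sanity-checked with a minimal PyTorch sketch. This is an illustrative single-GPU approximation that assumes a \(227 \times 227\) input and omits the original normalization layers and two-GPU grouped convolutions, so its parameter count is slightly higher than the paper’s.

import torch
import torch.nn as nn

# Single-GPU sketch of the AlexNet layer structure described above.
alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),     # 227x227x3 -> 55x55x96
    nn.MaxPool2d(kernel_size=3, stride=2),                     # -> 27x27x96
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),   # -> 27x27x256
    nn.MaxPool2d(kernel_size=3, stride=2),                     # -> 13x13x256
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),  # -> 13x13x384
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),  # -> 13x13x384
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),  # -> 13x13x256
    nn.MaxPool2d(kernel_size=3, stride=2),                     # -> 6x6x256
    nn.Flatten(),                                              # -> 9216
    nn.Linear(9216, 4096), nn.ReLU(), nn.Dropout(p=0.5),       # FC6
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(p=0.5),       # FC7
    nn.Linear(4096, 1000),                                     # FC8 (logits for Softmax)
)

x = torch.randn(1, 3, 227, 227)
print(alexnet(x).shape)                              # torch.Size([1, 1000])
print(sum(p.numel() for p in alexnet.parameters()))  # ~62M here; ~60M in the grouped original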

Important

The usage of convolutional layers in AlexNet is critical for feature extraction. The early layers (1 and 2) learn low-level features like edges and blobs, while deeper layers (3, 4, and 5) learn higher-level semantic features. The drastic reduction in spatial size (via stride 4 in Layer 1 and max pooling) allows the network to cover a large receptive field.

Network-in-Network (NiN) (2013)

The Network-in-Network (NiN) architecture, proposed by Lin et al., introduced fundamental structural changes to address the parameter inefficiency of AlexNet’s fully connected layers and to enhance local feature abstraction.

Definition

Mlpconv Layer

Instead of a standard linear convolution filter, NiN proposes using a “micro network” (specifically a Multilayer Perceptron or MLP) to slide over the input feature map. At every pixel location \((i, j)\) of the input, a small MLP processes the local receptive field.

This highlights two key features:

  1. Spatially Distributed Application: The same function (the MLP) is applied at every spatial location.
  2. Weight Sharing: Just like a standard convolution, the weights of this MLP are shared across the width and height of the feature map.

Important

The \(1 \times 1\) Convolution Trick

Applying a fully connected MLP at each pixel location is mathematically equivalent to a stack of \(1 \times 1\) convolutions, each followed by a ReLU activation: an MLP with two layers corresponds to two stacked \(1 \times 1\) convolutional layers. This insight allows standard convolution libraries to implement the “micro-network” structure efficiently.
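
A minimal PyTorch sketch of this equivalence is given below; the channel sizes (192 \(\to\) 160 \(\to\) 96) are illustrative.

import torch
import torch.nn as nn

# A two-layer "micro-network" applied at every spatial location is just two
# stacked 1x1 convolutions with ReLUs; the weights are shared across all
# (i, j) positions exactly as in an ordinary convolution.
mlpconv = nn.Sequential(
    nn.Conv2d(192, 160, kernel_size=1), nn.ReLU(),
    nn.Conv2d(160, 96, kernel_size=1), nn.ReLU(),
)

x = torch.randn(1, 192, 28, 28)
print(mlpconv(x).shape)  # torch.Size([1, 96, 28, 28]) -- spatial size is unchanged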

NiN fundamentally changed how classification heads are designed. In AlexNet, the spatial structure is destroyed by flattening the feature map into a giant vector before passing it through dense layers.

Algorithm

Comparison with AlexNet: Global Average Pooling (GAP)

NiN removes the dense fully connected layers entirely. Instead, it uses Global Average Pooling.

  1. Channel Generation:

    The final convolutional layer is designed to output a feature map with \(K\) channels, where \(K\) is the number of classes (e.g., 1000 for ImageNet).

  2. GAP Definition:

    For each channel \(k\), we compute the spatial average of the feature map \(f_k(i, j)\) of size \(H \times W\): \[ s_k = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} f_k(i, j) \]

  3. Logit Calculation:

    The resulting vector \(s = [s_1, s_2, ..., s_K]\) serves directly as the logits (class scores), which are then passed to a Softmax function. \[ \hat{y} = \text{Softmax}(s) \]

This approach acts as a structural regularizer. By forcing the network to produce one feature map per class and averaging each map, NiN interprets the feature maps as confidence maps for the categories. The GAP layer itself has no parameters to optimize, so it cannot overfit, and removing the dense layers makes the model far smaller than AlexNet.
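
A minimal PyTorch sketch of such a GAP head follows; the 1024 input channels are illustrative, and the feature extractor producing them is omitted.

import torch
import torch.nn as nn

num_classes = 1000

# The final convolution maps features to one channel per class; GAP then
# averages each channel over space, yielding the logits with no dense layer.
head = nn.Sequential(
    nn.Conv2d(1024, num_classes, kernel_size=1),
    nn.AdaptiveAvgPool2d(1),  # H x W -> 1 x 1 per channel
    nn.Flatten(),             # -> (batch, num_classes) logits
)

features = torch.randn(8, 1024, 6, 6)
logits = head(features)
probs = torch.softmax(logits, dim=1)
print(logits.shape)  # torch.Size([8, 1000])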

VGG (2014)

The VGG network (Visual Geometry Group), proposed by Simonyan and Zisserman, explored the relationship between network depth and performance. Unlike AlexNet, which used varying filter sizes (\(11\times11\), \(5\times5\)), VGG enforced a strict design philosophy of simplicity and uniformity.

Algorithm

Architecture of VGG

VGG is composed of a stack of convolutional blocks followed by fully connected layers. The design is uniform:

  • Convolution: All convolutional layers use \(3 \times 3\) filters with stride 1 and padding 1.

    This is the smallest receptive field size to capture the notion of “up/down, left/right, center”. A stack of two \(3 \times 3\) layers has an effective receptive field of \(5 \times 5\). A stack of three \(3 \times 3\) layers has an effective receptive field of \(7 \times 7\). This factorization allows for more non-linearities (ReLUs) and fewer parameters than using single large filters.

  • Pooling: Max pooling is performed over \(2 \times 2\) pixel windows with stride 2.

  • Structure:

    The network is a sequence of blocks: [Conv-Conv-Pool], [Conv-Conv-Pool], [Conv-Conv-Conv-Pool], etc. The number of channels starts at 64 and doubles after each pooling stage until it reaches 512: 64 \(\to\) 128 \(\to\) 256 \(\to\) 512 \(\to\) 512 (a block-builder sketch follows this list).

  • Head:

    Three fully connected layers: 4096, 4096, and 1000 channels, ending with Softmax.
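
A minimal PyTorch sketch of this block structure is shown below, using a hypothetical vgg_block helper and the VGG-16 configuration of (2, 2, 3, 3, 3) convolutions per block.

import torch
import torch.nn as nn

def vgg_block(num_convs: int, in_channels: int, out_channels: int) -> nn.Sequential:
    """Stack 3x3 convolutions (stride 1, padding 1) with ReLUs, then halve the
    spatial resolution with a 2x2 max pool, as in the uniform VGG design."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
                   nn.ReLU()]
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.Sequential(*layers)

# VGG-16 convolutional trunk: channels 64 -> 128 -> 256 -> 512 -> 512.
trunk = nn.Sequential(
    vgg_block(2, 3, 64),
    vgg_block(2, 64, 128),
    vgg_block(3, 128, 256),
    vgg_block(3, 256, 512),
    vgg_block(3, 512, 512),
)
print(trunk(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 512, 7, 7])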

Important

Parameter Counts:

  • VGG-16: Contains approx. 138 million parameters.
  • VGG-19: Contains approx. 144 million parameters.

VGG demonstrated that depth alone—if done cleanly using modular blocks—dramatically improves performance.

Despite its success and elegant design, VGG has several limitations:

  1. No Parallel Paths: The architecture is strictly sequential.
  2. No Skip Connections: Gradients must propagate through all layers, making training of very deep versions difficult.
  3. Computational Cost: It is very heavy to evaluate and train due to the large number of parameters in the fully connected layers and the high spatial resolution of early convolutions.
  4. Diminishing Returns: Performance improvements plateaued beyond 16-19 layers; additional depth alone yielded little further gain.

GoogLeNet and Inception (2014)

While VGG pushed the limit of sequential depth, GoogLeNet (based on the Inception architecture) sought to improve efficiency and performance through a more complex topology. It introduced the concept of the Inception Module.

Definition

The Inception Module

The Inception module replaces a single convolutional layer with a block containing four parallel paths. The input feature map is processed simultaneously by:

  1. \(1 \times 1\) Convolution
  2. \(1 \times 1\) Convolution \(\to\) \(3 \times 3\) Convolution
  3. \(1 \times 1\) Convolution \(\to\) \(5 \times 5\) Convolution
  4. \(3 \times 3\) Max Pooling \(\to\) \(1 \times 1\) Convolution

The outputs of these four paths are concatenated along the channel dimension to form the final output of the module. This allows the network to capture multi-scale information (features at different spatial sizes) within a single layer.

The \(1 \times 1\) convolutions inside the \(3 \times 3\) and \(5 \times 5\) paths are used for dimensionality reduction (bottleneck layers) to keep computational costs low.
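
A minimal PyTorch sketch of such a module is given below. The per-branch channel counts follow the Inception 3a configuration of the original paper, so the four branches concatenate to the 256 output channels listed in the architecture that follows.

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Four parallel paths whose outputs are concatenated along the channel axis."""

    def __init__(self, in_ch, c1, c3_reduce, c3, c5_reduce, c5, pool_proj):
        super().__init__()
        # Path 1: 1x1 convolution.
        self.p1 = nn.Sequential(nn.Conv2d(in_ch, c1, kernel_size=1), nn.ReLU())
        # Path 2: 1x1 bottleneck, then 3x3 convolution.
        self.p2 = nn.Sequential(nn.Conv2d(in_ch, c3_reduce, kernel_size=1), nn.ReLU(),
                                nn.Conv2d(c3_reduce, c3, kernel_size=3, padding=1), nn.ReLU())
        # Path 3: 1x1 bottleneck, then 5x5 convolution.
        self.p3 = nn.Sequential(nn.Conv2d(in_ch, c5_reduce, kernel_size=1), nn.ReLU(),
                                nn.Conv2d(c5_reduce, c5, kernel_size=5, padding=2), nn.ReLU())
        # Path 4: 3x3 max pooling, then 1x1 projection.
        self.p4 = nn.Sequential(nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                                nn.Conv2d(in_ch, pool_proj, kernel_size=1), nn.ReLU())

    def forward(self, x):
        return torch.cat([self.p1(x), self.p2(x), self.p3(x), self.p4(x)], dim=1)

# Inception 3a: branch outputs 64 + 128 + 32 + 32 = 256 channels.
block = InceptionModule(192, 64, 96, 128, 16, 32, 32)
print(block(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])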

Algorithm

GoogLeNet Architecture

The network begins with a standard convolutional trunk (stem) and then stacks 9 Inception modules.

  1. Stem:

    Standard convolution and pooling layers producing an output feature map of \(28 \times 28 \times 192\).

  2. Inception Blocks (Sequential Stacking):

    • Inception 3a: Input \(28 \times 28 \times 192\) \(\to\) Output \(28 \times 28 \times 256\)
    • Inception 3b: Input \(28 \times 28 \times 256\) \(\to\) Output \(28 \times 28 \times 480\)
    • Max Pooling (\(3\times3\), stride 2)
    • Inception 4a: Input \(14 \times 14 \times 480\) \(\to\) Output \(14 \times 14 \times 512\)
    • Inception 4b: Input \(14 \times 14 \times 512\) \(\to\) Output \(14 \times 14 \times 512\)
    • Inception 4c: Input \(14 \times 14 \times 512\) \(\to\) Output \(14 \times 14 \times 512\)
    • Inception 4d: Input \(14 \times 14 \times 512\) \(\to\) Output \(14 \times 14 \times 528\)
    • Inception 4e: Input \(14 \times 14 \times 528\) \(\to\) Output \(14 \times 14 \times 832\)
    • Max Pooling (\(3\times3\), stride 2)
    • Inception 5a: Input \(7 \times 7 \times 832\) \(\to\) Output \(7 \times 7 \times 832\)
    • Inception 5b: Input \(7 \times 7 \times 832\) \(\to\) Output \(7 \times 7 \times 1024\)
  3. Auxiliary Classifiers:

    To combat the vanishing gradient problem in such a deep network (22 layers), GoogLeNet uses three classification heads trained simultaneously.
    • Head #1: Connected to output of Inception 4a.
    • Head #2: Connected to output of Inception 4d.
    • Head #3 (Final): Connected to output of Inception 5b.
  4. Final Classification Head:

    • Global Average Pooling: Converts \(7 \times 7 \times 1024\) \(\to\) \(1 \times 1 \times 1024\).
    • Dropout: 40% dropout.
    • Linear Layer: Fully connected layer mapping 1024 \(\to\) 1000 classes.
    • Softmax.

Important

During training, the loss is a weighted sum of the three heads (Heads #1 and #2 help inject gradients into the earlier layers). During inference, only the final head (Head #3) is used. This architecture achieved state-of-the-art results with significantly fewer parameters (\(\sim 5\) million) compared to AlexNet (\(\sim 60\) million) or VGG-16 (\(\sim 138\) million).
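
A minimal sketch of how this weighted training loss might be assembled is given below; the 0.3 weight on the auxiliary heads follows the original paper, while the function and variable names are illustrative.

import torch
import torch.nn.functional as F

def googlenet_loss(main_logits, aux1_logits, aux2_logits, targets, aux_weight=0.3):
    """Weighted sum of the three classification heads used during training;
    at inference time only main_logits (Head #3) is used."""
    main_loss = F.cross_entropy(main_logits, targets)
    aux_loss = F.cross_entropy(aux1_logits, targets) + F.cross_entropy(aux2_logits, targets)
    return main_loss + aux_weight * aux_loss

# Example with random logits for a batch of 8 images and 1000 classes.
targets = torch.randint(0, 1000, (8,))
loss = googlenet_loss(torch.randn(8, 1000), torch.randn(8, 1000),
                      torch.randn(8, 1000), targets)
print(loss.item())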

Conclusion

The evolution from AlexNet to GoogLeNet illustrates the rapid maturation of convolutional neural network design. AlexNet proved the viability of deep learning on GPUs. Network-in-Network introduced 1x1 convolutions and Global Average Pooling, concepts that became staples in efficient design. VGG established the importance of depth and uniform modularity, while GoogLeNet demonstrated that complex, parallel topologies could achieve high accuracy with high parameter efficiency. These innovations laid the groundwork for modern architectures like ResNet and DenseNet.

References

  1. Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “Imagenet classification with deep convolutional neural networks.” Advances in neural information processing systems 25 (2012).
  2. Lin, Min, Qiang Chen, and Shuicheng Yan. “Network in network.” arXiv preprint arXiv:1312.4400 (2013).
  3. Szegedy, Christian, et al. “Going deeper with convolutions.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2015.
  4. Simonyan, Karen, and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition.” arXiv preprint arXiv:1409.1556 (2014).