YOLO: You Only Look Once — From v1 to v2

object-detection
deep-learning
computer-vision
real-time
A technical walkthrough of the YOLO family of real-time object detectors, tracing the evolution from YOLOv1 (2015) to YOLOv2/YOLO9000 (2016), covering unified detection as regression, the multi-part loss function, Darknet-19, anchor boxes with dimension clusters, and joint detection-classification training via WordTree.
Author

Ken Pu

Published

January 29, 2026

Introduction

By 2015, the dominant approach to object detection followed a multi-stage pipeline: generate region proposals, extract features per region, classify each proposal, and refine bounding boxes in a post-processing step. Systems like R-CNN took approximately 40 seconds per image. Even the faster variants — Fast R-CNN and Faster R-CNN — operated at only 0.5–7 FPS, well below real-time requirements.

The YOLO (You Only Look Once) family of detectors introduced a fundamentally different paradigm: object detection as a single regression problem. Instead of a complex pipeline of proposal generation, feature extraction, classification, and post-processing, YOLO runs a single neural network on the full image to directly predict bounding box coordinates and class probabilities.

This article covers two papers that define the YOLO approach:

  1. YOLOv1 (Redmon et al., 2015): Introduced the unified detection framework — a single convolutional network that reasons globally about the full image, achieving 45 FPS with 63.4% mAP on PASCAL VOC 2007.
  2. YOLOv2 / YOLO9000 (Redmon & Farhadi, 2016): Systematically addressed v1’s weaknesses through batch normalization, anchor boxes with learned dimension priors, multi-scale training, and a new backbone (Darknet-19), reaching 78.6% mAP at 40 FPS. The paper also introduced YOLO9000, which jointly trains on detection and classification data to detect over 9000 object categories in real-time.
| Method | mAP (VOC 2007) | FPS | Approach |
|---|---|---|---|
| R-CNN | 66.0% | 0.02 | Multi-stage pipeline |
| Fast R-CNN | 70.0% | 0.5 | Shared features, RoI pooling |
| Faster R-CNN (VGG-16) | 73.2% | 7 | Learned proposals (RPN) |
| YOLOv1 | 63.4% | 45 | Single regression network |
| YOLOv2 (544) | 78.6% | 40 | Improved single-shot |

The key insight across both papers is that speed and accuracy are not inherently opposed — with the right architectural choices, a single-shot detector can match or exceed multi-stage detectors while running at real-time speeds.

1. YOLOv1: Unified Detection

1.1 Detection as Regression

YOLOv1’s fundamental contribution is reframing object detection as a single regression problem. The entire detection pipeline — feature extraction, bounding box prediction, class probability estimation, and non-maximum suppression — is collapsed into a single neural network evaluation.

The system works as follows:

  1. Divide the input image into an \(S \times S\) grid (with \(S = 7\) for PASCAL VOC).
  2. Each grid cell is responsible for detecting objects whose center falls within that cell.
  3. Each grid cell predicts:
    • \(B\) bounding boxes (\(B = 2\)), each with 5 values: \((x, y, w, h, \text{confidence})\)
    • \(C\) conditional class probabilities: \(\Pr(\text{Class}_i \mid \text{Object})\) (with \(C = 20\) for VOC)

The output is a single tensor of shape:

\[S \times S \times (B \cdot 5 + C) = 7 \times 7 \times 30\]

1.2 Confidence Score

Each bounding box has an associated confidence score defined as:

\[\text{Confidence} = \Pr(\text{Object}) \times \text{IOU}_{\text{pred}}^{\text{truth}}\]

If no object exists in a cell, the confidence should be zero. Otherwise, the confidence equals the IoU between the predicted box and the ground truth.

At test time, the class-specific confidence for each box is computed by combining the conditional class probability with the box confidence:

\[\Pr(\text{Class}_i \mid \text{Object}) \times \Pr(\text{Object}) \times \text{IOU}_{\text{pred}}^{\text{truth}} = \Pr(\text{Class}_i) \times \text{IOU}_{\text{pred}}^{\text{truth}} \tag{1}\]

This encodes both the probability that a particular class appears in the box and how well the predicted box fits the object.

Show code
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

fig, ax = plt.subplots(figsize=(10, 10))
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.set_aspect('equal')
ax.set_title('YOLOv1: Grid-Based Detection (S=7)', fontsize=14, fontweight='bold', pad=15)

# Draw 7x7 grid
S = 7
cell_size = 10.0 / S
for i in range(S + 1):
    ax.axhline(i * cell_size, color='#BDBDBD', linewidth=0.8)
    ax.axvline(i * cell_size, color='#BDBDBD', linewidth=0.8)

# Draw a "ground truth" object (e.g., a dog)
gt_x, gt_y, gt_w, gt_h = 3.2, 2.5, 3.5, 4.0
gt_rect = patches.Rectangle((gt_x, gt_y), gt_w, gt_h, linewidth=2.5,
                             edgecolor='#4CAF50', facecolor='#4CAF50', alpha=0.12)
ax.add_patch(gt_rect)
ax.text(gt_x + gt_w / 2, gt_y + gt_h + 0.2, 'Ground Truth', ha='center',
        fontsize=10, color='#4CAF50', fontweight='bold')

# Highlight the responsible grid cell (center of object)
center_x = gt_x + gt_w / 2  # 4.95
center_y = gt_y + gt_h / 2  # 4.5
cell_col = int(center_x / cell_size)  # column 3
cell_row = int(center_y / cell_size)  # row 3
responsible_rect = patches.Rectangle(
    (cell_col * cell_size, cell_row * cell_size), cell_size, cell_size,
    linewidth=2.5, edgecolor='#F44336', facecolor='#F44336', alpha=0.25
)
ax.add_patch(responsible_rect)
ax.plot(center_x, center_y, 'r*', markersize=15, zorder=5)
ax.text(cell_col * cell_size + cell_size / 2, cell_row * cell_size + cell_size / 2 - 0.35,
        'Responsible\ncell', ha='center', va='center', fontsize=8, color='#D32F2F', fontweight='bold')

# Draw two predicted bounding boxes from the responsible cell
pred1 = patches.Rectangle((3.0, 2.8), 3.8, 3.6, linewidth=2, linestyle='--',
                           edgecolor='#1976D2', facecolor='none')
pred2 = patches.Rectangle((3.5, 2.2), 3.0, 4.5, linewidth=2, linestyle='--',
                           edgecolor='#FF9800', facecolor='none')
ax.add_patch(pred1)
ax.add_patch(pred2)

# Legend
legend_elements = [
    patches.Patch(facecolor='#4CAF50', alpha=0.3, edgecolor='#4CAF50', label='Ground truth box'),
    patches.Patch(facecolor='#F44336', alpha=0.3, edgecolor='#F44336', label='Responsible grid cell'),
    plt.Line2D([0], [0], color='#1976D2', linewidth=2, linestyle='--', label='Predicted box 1 (B=1)'),
    plt.Line2D([0], [0], color='#FF9800', linewidth=2, linestyle='--', label='Predicted box 2 (B=2)'),
    plt.Line2D([0], [0], marker='*', color='r', markersize=12, linestyle='None', label='Object center'),
]
ax.legend(handles=legend_elements, loc='upper right', fontsize=9, framealpha=0.9)

# Output tensor annotation
ax.text(5, 0.3, 'Output tensor: $7 \\times 7 \\times (2 \\cdot 5 + 20) = 7 \\times 7 \\times 30$',
        ha='center', fontsize=11, style='italic',
        bbox=dict(boxstyle='round,pad=0.4', facecolor='#FFF9C4', edgecolor='#F9A825'))

ax.set_xlabel('Image width', fontsize=11)
ax.set_ylabel('Image height', fontsize=11)
ax.invert_yaxis()
plt.tight_layout()
plt.show()
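
To make the decoding concrete, here is a minimal NumPy sketch that turns a raw \(7 \times 7 \times 30\) prediction tensor into class-specific scores via Equation (1). The channel ordering (box values first, then class probabilities) is assumed for illustration and may differ from the original Darknet layout:

import numpy as np

S, B, C = 7, 2, 20                               # grid size, boxes per cell, classes (VOC)
pred = np.random.rand(S, S, B * 5 + C)           # stand-in for a network output

boxes = pred[..., :B * 5].reshape(S, S, B, 5)    # (x, y, w, h, confidence) per box
class_probs = pred[..., B * 5:]                  # Pr(Class_i | Object) per cell

# Equation (1): class-specific confidence = Pr(Class_i | Object) * Pr(Object) * IOU
box_conf = boxes[..., 4]                         # shape (S, S, B)
class_scores = box_conf[..., None] * class_probs[:, :, None, :]   # (S, S, B, C)

print(class_scores.shape)                        # (7, 7, 2, 20): 98 boxes x 20 class scores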

2. YOLOv1 Network Architecture

The YOLOv1 network architecture is inspired by GoogLeNet but uses a simpler design: instead of inception modules, it employs alternating \(1 \times 1\) reduction layers followed by \(3 \times 3\) convolutional layers. The full network has:

  • 24 convolutional layers for feature extraction
  • 2 fully connected layers for prediction

The first 20 convolutional layers are pre-trained on ImageNet at \(224 \times 224\) resolution (achieving 88% top-5 accuracy), then the full network is fine-tuned for detection at \(448 \times 448\) resolution. Four additional convolutional layers and two fully connected layers are added with randomly initialized weights.

A lightweight variant, Fast YOLO, uses only 9 convolutional layers with fewer filters, achieving 155 FPS with 52.7% mAP.

Show code
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(16, 4.5))
ax.set_xlim(0, 16)
ax.set_ylim(0, 4.5)
ax.axis('off')
ax.set_title('YOLOv1 Architecture: 24 Conv + 2 FC Layers', fontsize=14, fontweight='bold', pad=15)

# Define architecture blocks: (x, y, w, h, label, color, sublabel)
blocks = [
    (0.2, 0.8, 1.2, 2.8, 'Input\n448x448x3', '#FFE0B2', ''),
    (1.7, 1.0, 1.3, 2.4, 'Conv 7x7\n64, s=2\n+ Pool', '#BBDEFB', '224x224'),
    (3.3, 1.2, 1.2, 2.0, 'Conv 3x3\n192\n+ Pool', '#BBDEFB', '112x112'),
    (4.8, 1.3, 1.5, 1.8, '4x [1x1, 3x3]\n128-256\n512 + Pool', '#C8E6C9', '56x56'),
    (6.6, 1.4, 1.5, 1.6, '2x [1x1, 3x3]\n256-512\n1024 + Pool', '#C8E6C9', '28x28'),
    (8.4, 1.5, 1.5, 1.4, '2x [1x1, 3x3]\n512-1024\n+ 3x3, s=2', '#E1BEE7', '14x14'),
    (10.2, 1.6, 1.3, 1.2, '2x Conv\n3x3x1024', '#E1BEE7', '7x7'),
    (11.8, 1.7, 1.1, 1.0, 'FC\n4096', '#FFCDD2', ''),
    (13.2, 1.8, 1.1, 0.9, 'FC\n7x7x30', '#FFCDD2', ''),
    (14.6, 1.85, 1.1, 0.8, 'Output\n7x7x30', '#FFF9C4', ''),
]

for x, y, w, h, label, color, sublabel in blocks:
    rect = patches.FancyBboxPatch((x, y), w, h, boxstyle='round,pad=0.08',
                                  facecolor=color, edgecolor='#555555', linewidth=1.2)
    ax.add_patch(rect)
    ax.text(x + w/2, y + h/2, label, ha='center', va='center', fontsize=7, fontweight='bold')
    if sublabel:
        ax.text(x + w/2, y - 0.15, sublabel, ha='center', fontsize=6, color='#666666', style='italic')

# Arrows
arrow_xs = [(1.4, 1.7), (3.0, 3.3), (4.5, 4.8), (6.3, 6.6), (8.1, 8.4),
            (9.9, 10.2), (11.5, 11.8), (12.9, 13.2), (14.3, 14.6)]
for x1, x2 in arrow_xs:
    ax.annotate('', xy=(x2, 2.2), xytext=(x1, 2.2),
                arrowprops=dict(arrowstyle='->', color='#555555', lw=1.5))

# Pre-training bracket
ax.annotate('', xy=(1.7, 3.9), xytext=(8.1, 3.9),
            arrowprops=dict(arrowstyle='<->', color='#1565C0', lw=1.5))
ax.text(4.9, 4.15, 'Pre-trained on ImageNet (first 20 conv layers, 224x224)',
        ha='center', fontsize=8, color='#1565C0', fontweight='bold')

# Detection bracket
ax.annotate('', xy=(8.4, 0.4), xytext=(14.6, 0.4),
            arrowprops=dict(arrowstyle='<->', color='#C62828', lw=1.5))
ax.text(11.5, 0.15, 'Added for detection (448x448)',
        ha='center', fontsize=8, color='#C62828', fontweight='bold')

plt.tight_layout()
plt.show()

3. YOLOv1 Training

3.1 Multi-Part Loss Function

YOLOv1 uses a sum-squared error loss composed of five terms. The choice of sum-squared error is motivated by ease of optimization, though it requires careful weighting to balance localization, confidence, and classification objectives.

The full loss function (Equation 3 from the paper) is:

\[\mathcal{L} = \underbrace{\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right]}_{\text{Center coordinate loss}}\]

\[+ \underbrace{\lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right]}_{\text{Width/height loss (square root)}}\]

\[+ \underbrace{\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2}_{\text{Confidence loss (object)}}\]

\[+ \underbrace{\lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2}_{\text{Confidence loss (no object)}}\]

\[+ \underbrace{\sum_{i=0}^{S^2} \mathbb{1}_i^{\text{obj}} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2}_{\text{Classification loss}}\]

where:

  • \(\mathbb{1}_{ij}^{\text{obj}}\) indicates the \(j\)-th bounding box predictor in cell \(i\) is “responsible” for an object (has highest IOU with ground truth)
  • \(\lambda_{\text{coord}} = 5\) increases the weight of coordinate predictions
  • \(\lambda_{\text{noobj}} = 0.5\) decreases the weight of confidence loss for cells without objects
  • Square roots of \(w\) and \(h\) are predicted to reduce sensitivity to size: small deviations in large boxes matter less than in small boxes
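
To make the weighting concrete, the following NumPy sketch evaluates the five terms for a single image. It is an illustrative simplification, not the Darknet implementation: the responsibility masks and targets are assumed to be precomputed, predictions are assumed nonnegative where square roots are taken, and batching is omitted:

import numpy as np

def yolo_v1_loss(pred_boxes, pred_class, target_boxes, target_class,
                 obj_mask, noobj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-squared YOLOv1-style loss for one image (illustrative sketch).

    pred_boxes, target_boxes: (S, S, B, 5) arrays of (x, y, w, h, confidence)
    pred_class, target_class: (S, S, C) class probabilities per cell
    obj_mask:   (S, S, B), 1 where box j in cell i is responsible for an object
    noobj_mask: (S, S, B), 1 where box j in cell i contains no object
    """
    # Center coordinate loss (responsible boxes only)
    xy_err = np.sum(obj_mask * np.sum(
        (pred_boxes[..., :2] - target_boxes[..., :2]) ** 2, axis=-1))

    # Width/height loss on square roots, damping the effect of large boxes
    wh_err = np.sum(obj_mask * np.sum(
        (np.sqrt(pred_boxes[..., 2:4]) - np.sqrt(target_boxes[..., 2:4])) ** 2, axis=-1))

    # Confidence loss, split into object / no-object terms
    conf_err = (pred_boxes[..., 4] - target_boxes[..., 4]) ** 2
    conf_obj = np.sum(obj_mask * conf_err)
    conf_noobj = np.sum(noobj_mask * conf_err)

    # Classification loss, only for cells that contain an object
    cell_has_obj = obj_mask.max(axis=-1)                    # (S, S)
    cls_err = np.sum(cell_has_obj * np.sum(
        (pred_class - target_class) ** 2, axis=-1))

    return (lambda_coord * (xy_err + wh_err)
            + conf_obj + lambda_noobj * conf_noobj + cls_err)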

3.2 Predictor Responsibility

At training time, only one bounding box predictor per grid cell is assigned responsibility for each ground truth object — the one with the highest current IoU with the ground truth. This leads to specialization: each predictor gets better at predicting certain sizes, aspect ratios, or classes.
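
A minimal sketch of this assignment rule, assuming boxes are given as (x_center, y_center, w, h) in common units:

import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (x_center, y_center, w, h)."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

# The predictor with the highest IoU against the ground truth is "responsible"
cell_predictions = [(0.45, 0.52, 0.30, 0.40), (0.50, 0.50, 0.60, 0.80)]   # B = 2 boxes
ground_truth = (0.48, 0.50, 0.50, 0.70)
responsible = int(np.argmax([iou(p, ground_truth) for p in cell_predictions]))
print(f"Responsible predictor: {responsible}")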

3.3 Activation Function

All layers (except the final output layer, which uses a linear activation) use the leaky ReLU activation:

\[\phi(x) = \begin{cases} x & \text{if } x > 0 \\ 0.1x & \text{otherwise} \end{cases}\]
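
In code (NumPy):

import numpy as np

def leaky_relu(x, slope=0.1):
    # phi(x) = x for x > 0, 0.1 * x otherwise
    return np.where(x > 0, x, slope * x)

print(leaky_relu(np.array([-2.0, 0.5])))   # [-0.2  0.5]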

3.4 Training Schedule and Augmentation

  • Learning rate warm-up: Start at \(10^{-3}\), slowly raise to \(10^{-2}\) during the first epochs to prevent early divergence from unstable gradients.
  • Continue with \(10^{-2}\) for 75 epochs, then \(10^{-3}\) for 30 epochs, and \(10^{-4}\) for 30 epochs.
  • Dropout with rate 0.5 after the first fully connected layer.
  • Data augmentation: random scaling and translations (up to 20% of image size), random adjustments to exposure and saturation in HSV color space (up to factor 1.5).
  • Trained for ~135 epochs on VOC 2007+2012 training data.
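
A sketch of this learning-rate schedule (the warm-up length and linear shape are assumptions; the paper only states that the rate is raised slowly over the first epochs):

def yolo_v1_learning_rate(epoch, warmup_epochs=5):
    """Approximate YOLOv1 learning-rate schedule (illustrative)."""
    if epoch < warmup_epochs:                    # warm up from 1e-3 to 1e-2
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < warmup_epochs + 75:               # 75 epochs at 1e-2
        return 1e-2
    if epoch < warmup_epochs + 75 + 30:          # 30 epochs at 1e-3
        return 1e-3
    return 1e-4                                  # final 30 epochs at 1e-4

print([round(yolo_v1_learning_rate(e), 4) for e in (0, 3, 50, 90, 120)])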

4. YOLOv1 Results and Properties

4.1 Speed

YOLOv1 achieves true real-time performance:

  • YOLO: 45 FPS (22ms per image) on a Titan X GPU
  • Fast YOLO: 155 FPS (6.4ms per image) — the fastest general-purpose object detector at the time

4.2 Detection Results on VOC 2007

Show code
import pandas as pd

data = {
    'Method': [
        '100Hz DPM', '30Hz DPM', 'Fast YOLO', 'YOLO',
        'Fastest DPM', 'R-CNN Minus R', 'Fast R-CNN',
        'Faster R-CNN VGG-16', 'Faster R-CNN ZF', 'YOLO VGG-16'
    ],
    'Train Data': [
        '2007', '2007', '2007+2012', '2007+2012',
        '2007', '2007', '2007+2012',
        '2007+2012', '2007+2012', '2007+2012'
    ],
    'mAP': [16.0, 26.1, 52.7, 63.4, 30.4, 53.5, 70.0, 73.2, 62.1, 66.4],
    'FPS': [100, 30, 155, 45, 15, 6, 0.5, 7, 18, 21],
    'Real-Time': ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No', 'No', 'No']
}
df = pd.DataFrame(data)
df.style.set_properties(**{'text-align': 'center'}).set_table_styles([
    {'selector': 'th', 'props': [('text-align', 'center'), ('font-weight', 'bold')]}
]).set_caption('Table 1: Real-Time Systems on PASCAL VOC 2007')
Table 1: Real-Time Systems on PASCAL VOC 2007

| Method | Train Data | mAP | FPS | Real-Time |
|---|---|---|---|---|
| 100Hz DPM | 2007 | 16.0 | 100 | Yes |
| 30Hz DPM | 2007 | 26.1 | 30 | Yes |
| Fast YOLO | 2007+2012 | 52.7 | 155 | Yes |
| YOLO | 2007+2012 | 63.4 | 45 | Yes |
| Fastest DPM | 2007 | 30.4 | 15 | No |
| R-CNN Minus R | 2007 | 53.5 | 6 | No |
| Fast R-CNN | 2007+2012 | 70.0 | 0.5 | No |
| Faster R-CNN VGG-16 | 2007+2012 | 73.2 | 7 | No |
| Faster R-CNN ZF | 2007+2012 | 62.1 | 18 | No |
| YOLO VGG-16 | 2007+2012 | 66.4 | 21 | No |

4.3 Error Analysis: YOLO vs Fast R-CNN

A detailed error analysis using the Hoiem et al. methodology reveals the complementary error profiles of YOLO and Fast R-CNN. The categories of errors are:

  • Correct: correct class and IoU > 0.5
  • Localization: correct class, 0.1 < IoU < 0.5
  • Similar: similar class, IoU > 0.1
  • Other: wrong class, IoU > 0.1
  • Background: IoU < 0.1 for any object
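
A small helper that applies these thresholds to a single detection (illustrative; the full analysis additionally ranks the top detections per class):

def classify_detection(iou, correct_class, similar_class):
    """Hoiem-style error category for one detection."""
    if correct_class and iou > 0.5:
        return "Correct"
    if correct_class and 0.1 < iou < 0.5:
        return "Localization"
    if similar_class and iou > 0.1:
        return "Similar"
    if iou > 0.1:
        return "Other"           # wrong class, but overlaps some object
    return "Background"          # IoU < 0.1 with every object

print(classify_detection(0.3, correct_class=True, similar_class=False))   # Localization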
Show code
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Error breakdown data from Figure 4 of YOLOv1 paper
categories = ['Correct', 'Localization', 'Similar', 'Other', 'Background']
fast_rcnn_vals = [71.6, 8.6, 4.3, 1.9, 13.6]
yolo_vals = [65.5, 19.0, 6.75, 4.0, 4.75]
colors = ['#4CAF50', '#FF9800', '#2196F3', '#9C27B0', '#F44336']

# Fast R-CNN
wedges1, texts1, autotexts1 = axes[0].pie(
    fast_rcnn_vals, labels=categories, colors=colors, autopct='%1.1f%%',
    startangle=90, textprops={'fontsize': 9}
)
axes[0].set_title('Fast R-CNN', fontsize=13, fontweight='bold')

# YOLO
wedges2, texts2, autotexts2 = axes[1].pie(
    yolo_vals, labels=categories, colors=colors, autopct='%1.1f%%',
    startangle=90, textprops={'fontsize': 9}
)
axes[1].set_title('YOLO', fontsize=13, fontweight='bold')

fig.suptitle('Error Analysis: Fast R-CNN vs YOLO (VOC 2007)',
             fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

The error analysis reveals two key findings:

  1. YOLO’s dominant error is localization (19.0% vs 8.6% for Fast R-CNN). This is the primary weakness of the grid-based approach.
  2. Fast R-CNN’s dominant error is background false positives (13.6% vs 4.75% for YOLO). Because Fast R-CNN operates on local patches, it cannot reason about global context. YOLO sees the entire image, making it far less likely to mistake background for objects.

4.4 Complementarity with R-CNN

Because YOLO and Fast R-CNN make different kinds of errors, combining them produces a significant boost. For every bounding box predicted by Fast R-CNN, if YOLO also predicts a similar box, the prediction receives a confidence boost:

| Model | mAP | Combined mAP | Gain |
|---|---|---|---|
| Fast R-CNN alone | 71.8 | | |
| + YOLO | 63.4 | 75.0 | +3.2 |
| + Fast R-CNN (2007 data) | 66.9 | 72.4 | +0.6 |
| + Fast R-CNN (VGG-M) | 59.2 | 72.4 | +0.6 |

The 3.2% boost from YOLO is far larger than the boost from other Fast R-CNN variants, confirming that the improvement comes from the complementary error profiles, not just model ensembling.

4.5 Generalization to Artwork

YOLO generalizes better to new domains than R-CNN. On the Picasso Dataset and People-Art Dataset (person detection on artwork), YOLO substantially outperforms R-CNN:

| Method | VOC 2007 AP | Picasso AP | People-Art AP |
|---|---|---|---|
| YOLO | 59.2 | 53.3 | 45 |
| R-CNN | 54.2 | 10.4 | 26 |
| DPM | 43.2 | 37.8 | 32 |

R-CNN depends on Selective Search, which is tuned for natural images and fails on artwork. YOLO models the size, shape, and layout of objects globally, which transfers across domains.

5. YOLOv1 Limitations

Despite its speed advantages, YOLOv1 has several important limitations:

  1. Spatial constraint: Each grid cell predicts only 2 boxes and can have only 1 class. This limits the number of nearby objects that can be detected — the model struggles with small objects in groups (e.g., flocks of birds).

  2. Localization errors are dominant: 19% of errors come from localization, more than all other error sources combined. The coarse grid and fully connected output layers limit spatial precision.

  3. Coarse features: Multiple downsampling layers from the \(448 \times 448\) input to the \(7 \times 7\) feature map discard fine-grained spatial information.

  4. Scale sensitivity: The loss function treats errors equally for small and large boxes. Although the square root partially addresses this, small objects remain difficult.

These specific limitations directly motivate the improvements in YOLOv2.


6. YOLOv2 — Better: Systematic Improvements

YOLOv2 addresses each of YOLOv1’s weaknesses through a series of incremental improvements. The paper carefully traces the mAP impact of each change, building from 63.4% to 78.6%.

6.1 Path from YOLO to YOLOv2

Show code
import pandas as pd

improvements = {
    'Configuration': [
        'YOLO (baseline)',
        '+ Batch Normalization',
        '+ High-Res Classifier',
        '+ Convolutional + Anchor Boxes',
        '+ New Network (Darknet-19)',
        '+ Dimension Priors + Location Prediction',
        '+ Passthrough',
        '+ Multi-Scale',
        '+ Hi-Res Detector (544)',
    ],
    'VOC 2007 mAP': [63.4, 65.8, 69.5, 69.2, 69.6, 74.4, 75.4, 76.8, 78.6],
}
df_improvements = pd.DataFrame(improvements)
df_improvements.style.set_properties(**{'text-align': 'left'}).set_table_styles([
    {'selector': 'th', 'props': [('text-align', 'left'), ('font-weight', 'bold')]}
]).set_caption('Table 2: The path from YOLO to YOLOv2 (cumulative mAP on VOC 2007)')
Table 2: The path from YOLO to YOLOv2 (cumulative mAP on VOC 2007)

| Configuration | VOC 2007 mAP |
|---|---|
| YOLO (baseline) | 63.4 |
| + Batch Normalization | 65.8 |
| + High-Res Classifier | 69.5 |
| + Convolutional + Anchor Boxes | 69.2 |
| + New Network (Darknet-19) | 69.6 |
| + Dimension Priors + Location Prediction | 74.4 |
| + Passthrough | 75.4 |
| + Multi-Scale | 76.8 |
| + Hi-Res Detector (544) | 78.6 |
Show code
import matplotlib.pyplot as plt
import numpy as np

labels = [
    'YOLO\n(baseline)', '+BatchNorm', '+Hi-Res\nClassifier',
    '+Conv +\nAnchors', '+Darknet-19', '+Dim Priors +\nLoc Pred',
    '+Passthrough', '+Multi-Scale', '+Hi-Res\n(544)'
]
mAP_vals = [63.4, 65.8, 69.5, 69.2, 69.6, 74.4, 75.4, 76.8, 78.6]

fig, ax = plt.subplots(figsize=(14, 5))
colors = plt.cm.YlOrRd(np.linspace(0.2, 0.85, len(labels)))
bars = ax.bar(range(len(labels)), mAP_vals, color=colors, edgecolor='white', linewidth=1.5)

for i, (bar, val) in enumerate(zip(bars, mAP_vals)):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
            f'{val}%', ha='center', va='bottom', fontweight='bold', fontsize=9)

ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, fontsize=8, rotation=0)
ax.set_ylabel('mAP (%)', fontsize=12)
ax.set_title('The Path from YOLO to YOLOv2: Cumulative Improvements on VOC 2007',
             fontsize=13, fontweight='bold')
ax.set_ylim(58, 82)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid(axis='y', alpha=0.2)
plt.tight_layout()
plt.show()

6.2 Batch Normalization

Adding batch normalization to all convolutional layers yields:

  • +2% mAP improvement
  • Acts as a regularizer, allowing dropout to be removed without overfitting
  • Improves convergence during training

6.3 High Resolution Classifier

YOLOv1 pre-trained its classifier at \(224 \times 224\), then abruptly switched to \(448 \times 448\) for detection. YOLOv2 adds an intermediate step: fine-tune the classifier at \(448 \times 448\) for 10 epochs on ImageNet before switching to detection training. This gives the network time to adjust its filters to higher resolution input. Result: +4% mAP.

6.4 Anchor Boxes

YOLOv1 predicts bounding box coordinates directly through fully connected layers. YOLOv2 replaces this with anchor boxes (inspired by Faster R-CNN’s RPN):

  • Remove fully connected layers entirely
  • Shrink input from \(448 \times 448\) to \(416 \times 416\) to produce a \(13 \times 13\) feature map (odd dimensions ensure a single center cell for large objects)
  • Predict offsets from anchor priors instead of raw coordinates
  • Decouple class prediction from spatial location: predict class + objectness per anchor box

Result: mAP slightly decreased (69.5 to 69.2), but recall increased from 81% to 88%, providing more room for improvement.

6.5 Dimension Clusters

Rather than hand-picking anchor box dimensions (as in Faster R-CNN), YOLOv2 uses \(k\)-means clustering on the training set bounding boxes with a custom distance metric:

\[d(\text{box}, \text{centroid}) = 1 - \text{IOU}(\text{box}, \text{centroid})\]

Using standard Euclidean distance would bias toward larger boxes. The IOU-based distance is size-invariant and directly optimizes for what matters: overlap quality.

With \(k = 5\) priors, the cluster centroids achieve 61.0 Avg IOU, comparable to 60.9 for 9 hand-picked anchors from Faster R-CNN. Using \(k = 9\) clusters reaches 67.2 Avg IOU.
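
The clustering step follows directly from this distance metric. A simplified NumPy sketch, treating each box as a (width, height) pair anchored at a common corner as in the paper's setup:

import numpy as np

def iou_wh(boxes, centroids):
    """IoU between (w, h) pairs, treating all boxes as anchored at the same corner."""
    inter = np.minimum(boxes[:, None, 0], centroids[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    union = boxes[:, 0] * boxes[:, 1]
    union = union[:, None] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_iou(boxes, k=5, iters=100, seed=0):
    """k-means with d(box, centroid) = 1 - IoU(box, centroid)."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmin(1.0 - iou_wh(boxes, centroids), axis=1)
        new_centroids = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids

# Toy example: random (w, h) pairs normalized to the image size
boxes = np.random.default_rng(1).uniform(0.05, 0.9, size=(500, 2))
print(kmeans_iou(boxes, k=5))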

6.6 Direct Location Prediction

Faster R-CNN’s region proposal network predicts unconstrained offsets relative to the anchor (e.g., \(x = t_x \cdot w_a + x_a\)), so any anchor box can end up anywhere in the image regardless of which location predicted it, causing model instability during early training.

YOLOv2 constrains the location prediction using sigmoid activation to bound the center coordinates within the grid cell:

Show code
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import numpy as np

fig, ax = plt.subplots(figsize=(8, 8))
ax.set_xlim(-0.5, 4.5)
ax.set_ylim(-0.5, 4.5)
ax.set_aspect('equal')
ax.invert_yaxis()

# Draw grid
for i in range(5):
    ax.axhline(i, color='#BDBDBD', linewidth=1, linestyle='--')
    ax.axvline(i, color='#BDBDBD', linewidth=1, linestyle='--')

# Highlight the target cell (cx=1, cy=1)
cx, cy = 1, 1
cell_rect = patches.Rectangle((cx, cy), 1, 1, linewidth=2,
                               edgecolor='#1565C0', facecolor='#E3F2FD', alpha=0.5)
ax.add_patch(cell_rect)
ax.text(cx, cy - 0.1, f'($c_x={cx}$, $c_y={cy}$)', fontsize=10, color='#1565C0')

# Prior box (dashed)
pw, ph = 1.8, 2.5
sigma_tx, sigma_ty = 0.6, 0.4  # sigmoid outputs
bx = sigma_tx + cx
by = sigma_ty + cy
tw, th = 0.3, 0.2  # network predictions for width/height
bw = pw * np.exp(tw)
bh = ph * np.exp(th)

# Draw prior box centered at cell corner
prior_rect = patches.Rectangle(
    (cx + 0.5 - pw/2, cy + 0.5 - ph/2), pw, ph,
    linewidth=2, linestyle=':', edgecolor='#9E9E9E', facecolor='none'
)
ax.add_patch(prior_rect)
ax.text(cx + 0.5 + pw/2 + 0.1, cy + 0.5, f'Prior\n$p_w={pw}$, $p_h={ph}$',
        fontsize=9, color='#757575', va='center')

# Draw predicted box
pred_rect = patches.Rectangle(
    (bx - bw/2, by - bh/2), bw, bh,
    linewidth=2.5, edgecolor='#E53935', facecolor='#FFCDD2', alpha=0.3
)
ax.add_patch(pred_rect)
ax.plot(bx, by, 'ro', markersize=8, zorder=5)
ax.text(bx + 0.1, by + 0.15, f'($b_x$, $b_y$)', fontsize=10, color='#E53935', fontweight='bold')

# Annotations
ax.annotate(r'$\sigma(t_x)$', xy=(bx, cy + 0.5), xytext=(bx, cy + 0.95),
            fontsize=11, color='#E53935', ha='center',
            arrowprops=dict(arrowstyle='<->', color='#E53935', lw=1.5))
ax.annotate(r'$\sigma(t_y)$', xy=(cx + 0.5, by), xytext=(cx + 0.95, by),
            fontsize=11, color='#E53935', va='center',
            arrowprops=dict(arrowstyle='<->', color='#E53935', lw=1.5))

# Equations box
eq_text = (
    '$b_x = \\sigma(t_x) + c_x$\n'
    '$b_y = \\sigma(t_y) + c_y$\n'
    '$b_w = p_w e^{t_w}$\n'
    '$b_h = p_h e^{t_h}$'
)
ax.text(3.0, 0.2, eq_text, fontsize=11, va='top',
        bbox=dict(boxstyle='round,pad=0.4', facecolor='#FFF9C4', edgecolor='#F9A825', alpha=0.9))

ax.set_title('YOLOv2: Bounding Box Prediction with Dimension Priors',
             fontsize=13, fontweight='bold', pad=15)
ax.set_xlabel('Grid columns', fontsize=11)
ax.set_ylabel('Grid rows', fontsize=11)
plt.tight_layout()
plt.show()

The prediction equations are:

\[b_x = \sigma(t_x) + c_x \qquad b_y = \sigma(t_y) + c_y\] \[b_w = p_w e^{t_w} \qquad b_h = p_h e^{t_h}\] \[\Pr(\text{object}) \times \text{IOU}(b, \text{object}) = \sigma(t_o)\]

where \((c_x, c_y)\) is the offset of the grid cell from the top-left corner of the image and \((p_w, p_h)\) are the prior box dimensions from clustering. The sigmoid constrains the center to fall within the grid cell, stabilizing training. Combined with dimension clusters, direct location prediction improves mAP by almost 5% over the anchor-box version.
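
Translated directly into code (illustrative; coordinates are in grid-cell units):

import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Map raw network outputs to a box in grid-cell units."""
    b_x = sigmoid(t_x) + c_x          # center constrained to fall inside cell (c_x, c_y)
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * np.exp(t_w)           # width/height rescale the clustered prior
    b_h = p_h * np.exp(t_h)
    return b_x, b_y, b_w, b_h

print(decode_box(0.4, -0.4, 0.3, 0.2, c_x=1, c_y=1, p_w=1.8, p_h=2.5))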

6.7 Fine-Grained Features (Passthrough Layer)

The \(13 \times 13\) feature map is sufficient for large objects but may miss small objects. YOLOv2 adds a passthrough layer that brings features from an earlier \(26 \times 26\) layer:

  • Adjacent spatial features are stacked into channels: \(26 \times 26 \times 512 \rightarrow 13 \times 13 \times 2048\)
  • This is concatenated with the \(13 \times 13\) feature map
  • Similar concept to skip connections in ResNet
  • Result: +1% mAP, particularly for smaller objects
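
The reorganization is a space-to-depth reshape. A NumPy sketch (the exact channel ordering of Darknet's reorg layer may differ, but the shape transformation is the same):

import numpy as np

def space_to_depth(x, block=2):
    """Stack each block x block spatial neighborhood into the channel dimension."""
    h, w, c = x.shape
    x = x.reshape(h // block, block, w // block, block, c)
    x = x.transpose(0, 2, 1, 3, 4)                      # (h/2, w/2, 2, 2, c)
    return x.reshape(h // block, w // block, block * block * c)

features_26 = np.random.rand(26, 26, 512)
features_13 = space_to_depth(features_26)               # concatenated with the 13x13 map
print(features_13.shape)                                # (13, 13, 2048)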

6.8 Multi-Scale Training

Since YOLOv2 is fully convolutional (no FC layers), it can accept images of any size. During training, every 10 batches the network randomly switches to a new input resolution drawn from multiples of 32: \(\{320, 352, 384, \ldots, 608\}\).

This forces the network to learn scale-invariant predictions and offers a smooth speed/accuracy tradeoff at test time:

| Input Size | mAP (VOC 2007) | FPS |
|---|---|---|
| 288 x 288 | 69.0 | 91 |
| 352 x 352 | 73.7 | 81 |
| 416 x 416 | 76.8 | 67 |
| 480 x 480 | 77.8 | 59 |
| 544 x 544 | 78.6 | 40 |
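
The resolution switching itself is only a few lines (sketch; data loading and resizing are placeholders):

import random

scales = list(range(320, 608 + 1, 32))       # {320, 352, ..., 608}, multiples of 32

def pick_resolution(batch_idx, current=416):
    """Every 10 batches, draw a new square input resolution."""
    if batch_idx % 10 == 0:
        return random.choice(scales)
    return current

size = 416
for batch_idx in range(30):
    size = pick_resolution(batch_idx, size)
    # images would be resized to (size, size) and fed to the fully convolutional net
print(scales)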

7. YOLOv2 — Faster: Darknet-19

7.1 Backbone Comparison

Most detection frameworks at the time used VGG-16 as their backbone, which requires 30.69 billion floating point operations per forward pass. YOLOv2 introduces Darknet-19, a new backbone that is both faster and more accurate:

| Backbone | Billion FLOPs | Top-5 Accuracy (ImageNet) |
|---|---|---|
| VGG-16 | 30.69 | 90.0% |
| YOLO (GoogLeNet-inspired) | 8.52 | 88.0% |
| Darknet-19 | 5.58 | 91.2% |

Darknet-19 achieves higher accuracy with 5.5x fewer operations than VGG-16.

7.2 Darknet-19 Architecture

Show code
import pandas as pd

darknet19 = {
    'Layer': [
        'Conv 1', 'Maxpool',
        'Conv 2', 'Maxpool',
        'Conv 3', 'Conv 4', 'Conv 5', 'Maxpool',
        'Conv 6', 'Conv 7', 'Conv 8', 'Maxpool',
        'Conv 9', 'Conv 10', 'Conv 11', 'Conv 12', 'Conv 13', 'Maxpool',
        'Conv 14', 'Conv 15', 'Conv 16', 'Conv 17', 'Conv 18',
        'Conv 19', 'Avgpool', 'Softmax'
    ],
    'Filters': [
        32, '--',
        64, '--',
        128, 64, 128, '--',
        256, 128, 256, '--',
        512, 256, 512, 256, 512, '--',
        1024, 512, 1024, 512, 1024,
        1000, '--', '--'
    ],
    'Size/Stride': [
        '3x3', '2x2/2',
        '3x3', '2x2/2',
        '3x3', '1x1', '3x3', '2x2/2',
        '3x3', '1x1', '3x3', '2x2/2',
        '3x3', '1x1', '3x3', '1x1', '3x3', '2x2/2',
        '3x3', '1x1', '3x3', '1x1', '3x3',
        '1x1', 'Global', '--'
    ],
    'Output': [
        '224x224', '112x112',
        '112x112', '56x56',
        '56x56', '56x56', '56x56', '28x28',
        '28x28', '28x28', '28x28', '14x14',
        '14x14', '14x14', '14x14', '14x14', '14x14', '7x7',
        '7x7', '7x7', '7x7', '7x7', '7x7',
        '7x7', '1000', '1000'
    ]
}
df_darknet = pd.DataFrame(darknet19)
df_darknet.style.set_properties(**{'text-align': 'center'}).set_table_styles([
    {'selector': 'th', 'props': [('text-align', 'center'), ('font-weight', 'bold')]}
]).set_caption('Table 6: Darknet-19 Architecture (19 conv + 5 maxpool layers)')
Table 3: Darknet-19 Architecture (19 conv + 5 maxpool layers)

| Layer | Filters | Size/Stride | Output |
|---|---|---|---|
| Conv 1 | 32 | 3x3 | 224x224 |
| Maxpool | -- | 2x2/2 | 112x112 |
| Conv 2 | 64 | 3x3 | 112x112 |
| Maxpool | -- | 2x2/2 | 56x56 |
| Conv 3 | 128 | 3x3 | 56x56 |
| Conv 4 | 64 | 1x1 | 56x56 |
| Conv 5 | 128 | 3x3 | 56x56 |
| Maxpool | -- | 2x2/2 | 28x28 |
| Conv 6 | 256 | 3x3 | 28x28 |
| Conv 7 | 128 | 1x1 | 28x28 |
| Conv 8 | 256 | 3x3 | 28x28 |
| Maxpool | -- | 2x2/2 | 14x14 |
| Conv 9 | 512 | 3x3 | 14x14 |
| Conv 10 | 256 | 1x1 | 14x14 |
| Conv 11 | 512 | 3x3 | 14x14 |
| Conv 12 | 256 | 1x1 | 14x14 |
| Conv 13 | 512 | 3x3 | 14x14 |
| Maxpool | -- | 2x2/2 | 7x7 |
| Conv 14 | 1024 | 3x3 | 7x7 |
| Conv 15 | 512 | 1x1 | 7x7 |
| Conv 16 | 1024 | 3x3 | 7x7 |
| Conv 17 | 512 | 1x1 | 7x7 |
| Conv 18 | 1024 | 3x3 | 7x7 |
| Conv 19 | 1000 | 1x1 | 7x7 |
| Avgpool | -- | Global | 1000 |
| Softmax | -- | -- | 1000 |

7.3 Design Principles

Darknet-19 follows several established design principles:

  • 3x3 convolutions throughout (VGG-style), with channels doubled after each pooling step
  • 1x1 filters between 3x3 convolutions to compress feature representations (Network-in-Network style)
  • Batch normalization on every convolutional layer
  • Global average pooling for final predictions (no fully connected layers)
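
These principles are easy to see in code. A sketch of one bottleneck group using PyTorch (an assumption for illustration; the original is implemented in the Darknet framework in C):

import torch
import torch.nn as nn

def conv_bn_leaky(in_ch, out_ch, kernel):
    """Darknet-19 style block: conv -> batch norm -> leaky ReLU (slope 0.1)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel, padding=kernel // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

# One 3x3 / 1x1 / 3x3 group as used between poolings (e.g., Conv 3-5 in Table 3)
bottleneck = nn.Sequential(
    conv_bn_leaky(64, 128, 3),
    conv_bn_leaky(128, 64, 1),
    conv_bn_leaky(64, 128, 3),
)
print(bottleneck(torch.zeros(1, 64, 56, 56)).shape)   # torch.Size([1, 128, 56, 56])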

7.4 Training Pipeline

  1. Classification training: Train on ImageNet at \(224 \times 224\) for 160 epochs (SGD, lr=0.1, polynomial decay)
  2. High-resolution fine-tuning: Fine-tune at \(448 \times 448\) for 10 epochs (lr=\(10^{-3}\)) — achieves 76.5% top-1
  3. Detection adaptation: Remove last conv layer, add three \(3 \times 3 \times 1024\) conv layers + \(1 \times 1\) output layer
  4. Add passthrough from the \(3 \times 3 \times 512\) layer for fine-grained features

8. YOLOv2 Results

8.1 Detection Frameworks on VOC 2007

Show code
import pandas as pd

voc2007 = {
    'Method': [
        'Fast R-CNN', 'Faster R-CNN VGG-16', 'Faster R-CNN ResNet',
        'YOLO', 'SSD300', 'SSD512',
        'YOLOv2 288x288', 'YOLOv2 352x352', 'YOLOv2 416x416',
        'YOLOv2 480x480', 'YOLOv2 544x544'
    ],
    'Train Data': [
        '07+12', '07+12', '07+12',
        '07+12', '07+12', '07+12',
        '07+12', '07+12', '07+12', '07+12', '07+12'
    ],
    'mAP': [70.0, 73.2, 76.4, 63.4, 74.3, 76.8, 69.0, 73.7, 76.8, 77.8, 78.6],
    'FPS': [0.5, 7, 5, 45, 46, 19, 91, 81, 67, 59, 40]
}
df_voc = pd.DataFrame(voc2007)
df_voc.style.set_properties(**{'text-align': 'center'}).set_table_styles([
    {'selector': 'th', 'props': [('text-align', 'center'), ('font-weight', 'bold')]}
]).set_caption('Table 3: Detection Frameworks on PASCAL VOC 2007')
Table 4: Detection Frameworks on PASCAL VOC 2007

| Method | Train Data | mAP | FPS |
|---|---|---|---|
| Fast R-CNN | 07+12 | 70.0 | 0.5 |
| Faster R-CNN VGG-16 | 07+12 | 73.2 | 7 |
| Faster R-CNN ResNet | 07+12 | 76.4 | 5 |
| YOLO | 07+12 | 63.4 | 45 |
| SSD300 | 07+12 | 74.3 | 46 |
| SSD512 | 07+12 | 76.8 | 19 |
| YOLOv2 288x288 | 07+12 | 69.0 | 91 |
| YOLOv2 352x352 | 07+12 | 73.7 | 81 |
| YOLOv2 416x416 | 07+12 | 76.8 | 67 |
| YOLOv2 480x480 | 07+12 | 77.8 | 59 |
| YOLOv2 544x544 | 07+12 | 78.6 | 40 |

8.2 Speed vs. Accuracy

Show code
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(10, 7))

# Data points
methods = [
    ('R-CNN', 0.05, 66.0, 's', '#EF5350'),
    ('Fast R-CNN', 0.5, 70.0, 's', '#EF5350'),
    ('Faster R-CNN\nVGG-16', 7, 73.2, 's', '#EF5350'),
    ('Faster R-CNN\nResNet', 5, 76.4, 's', '#EF5350'),
    ('YOLO', 45, 63.4, 'D', '#FFA726'),
    ('SSD300', 46, 74.3, '^', '#66BB6A'),
    ('SSD512', 19, 76.8, '^', '#66BB6A'),
]

# YOLOv2 multi-scale points
yolov2_fps = [91, 81, 67, 59, 40]
yolov2_mAP = [69.0, 73.7, 76.8, 77.8, 78.6]
yolov2_labels = ['288', '352', '416', '480', '544']

# Plot other methods
for name, fps, mAP, marker, color in methods:
    ax.scatter(fps, mAP, s=200, c=color, marker=marker, edgecolors='black',
               linewidth=1, zorder=5)
    offset_x = 2
    offset_y = 1.0
    if 'Faster' in name and 'ResNet' in name:
        offset_x = -8
        offset_y = -2.0
    elif 'Faster' in name:
        offset_x = -8
        offset_y = 1.0
    elif name == 'R-CNN':
        offset_x = 2
        offset_y = -2.0
    elif name == 'Fast R-CNN':
        offset_x = 2
        offset_y = -2.0
    elif name == 'SSD512':
        offset_x = -8
        offset_y = -2.0
    ax.annotate(f'{name}\n({mAP}%)', (fps, mAP),
                textcoords='offset points', xytext=(offset_x, offset_y),
                fontsize=8, fontweight='bold')

# Plot YOLOv2 line
ax.plot(yolov2_fps, yolov2_mAP, 'o-', color='#1565C0', markersize=10,
        markeredgecolor='black', markeredgewidth=1, linewidth=2, zorder=5,
        label='YOLOv2 (multi-scale)')
for fps, mAP, label in zip(yolov2_fps, yolov2_mAP, yolov2_labels):
    ax.annotate(f'{label}\n{mAP}%', (fps, mAP),
                textcoords='offset points', xytext=(3, 5),
                fontsize=8, color='#1565C0', fontweight='bold')

# Real-time threshold
ax.axvline(x=30, color='green', linestyle='--', alpha=0.4, linewidth=1.5)
ax.text(31, 62, 'Real-time\n(30 FPS)', fontsize=9, color='green', style='italic')

ax.set_xlabel('Frames Per Second (FPS)', fontsize=12)
ax.set_ylabel('mAP (%) on VOC 2007', fontsize=12)
ax.set_title('Speed vs. Accuracy on PASCAL VOC 2007 (Figure 4 from YOLOv2 paper)',
             fontsize=13, fontweight='bold')
ax.set_xlim(-2, 100)
ax.set_ylim(58, 82)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid(True, alpha=0.2)
ax.legend(loc='lower right', fontsize=10)
plt.tight_layout()
plt.show()

8.3 VOC 2012 Results

Show code
import pandas as pd

voc2012 = {
    'Method': [
        'Fast R-CNN', 'Faster R-CNN', 'YOLO',
        'SSD300', 'SSD512', 'ResNet', 'YOLOv2 544'
    ],
    'Train Data': ['07++12']*7,
    'mAP': [68.4, 70.4, 57.9, 72.4, 74.9, 73.8, 73.4],
}
df_voc12 = pd.DataFrame(voc2012)
df_voc12.style.set_properties(**{'text-align': 'center'}).set_table_styles([
    {'selector': 'th', 'props': [('text-align', 'center'), ('font-weight', 'bold')]}
]).set_caption('Table 4: PASCAL VOC 2012 Test Detection Results')
Table 5: PASCAL VOC 2012 Test Detection Results

| Method | Train Data | mAP |
|---|---|---|
| Fast R-CNN | 07++12 | 68.4 |
| Faster R-CNN | 07++12 | 70.4 |
| YOLO | 07++12 | 57.9 |
| SSD300 | 07++12 | 72.4 |
| SSD512 | 07++12 | 74.9 |
| ResNet | 07++12 | 73.8 |
| YOLOv2 544 | 07++12 | 73.4 |

8.4 COCO Results

Show code
import pandas as pd

coco = {
    'Method': [
        'Fast R-CNN', 'Faster R-CNN', 'SSD300', 'SSD512', 'YOLOv2'
    ],
    'mAP@[.5:.95]': [20.5, 24.2, 23.2, 26.8, 21.6],
    'mAP@0.5': [39.9, 45.3, 41.2, 46.5, 44.0],
    'mAP@0.75': [19.4, 23.5, 23.4, 27.8, 19.2],
}
df_coco = pd.DataFrame(coco)
df_coco.style.set_properties(**{'text-align': 'center'}).set_table_styles([
    {'selector': 'th', 'props': [('text-align', 'center'), ('font-weight', 'bold')]}
]).set_caption('Table 5: Results on COCO test-dev2015')
Table 6: Results on COCO test-dev2015

| Method | mAP@[.5:.95] | mAP@0.5 | mAP@0.75 |
|---|---|---|---|
| Fast R-CNN | 20.5 | 39.9 | 19.4 |
| Faster R-CNN | 24.2 | 45.3 | 23.5 |
| SSD300 | 23.2 | 41.2 | 23.4 |
| SSD512 | 26.8 | 46.5 | 27.8 |
| YOLOv2 | 21.6 | 44.0 | 19.2 |

On COCO, YOLOv2 achieves 44.0 mAP at IoU=0.5, comparable to SSD and Faster R-CNN. At the stricter IoU=0.75 metric, YOLOv2 lags behind, reflecting the persistent localization challenge. However, YOLOv2 runs significantly faster than all competing methods.

9. YOLO9000 — Stronger: Joint Training

9.1 The Dataset Scale Gap

A fundamental challenge in object detection is the disparity between detection and classification datasets:

| Dataset Type | Images | Categories | Label Cost |
|---|---|---|---|
| Detection (COCO) | ~120K | 80 | Expensive (bounding boxes) |
| Classification (ImageNet) | ~14M | 22K | Cheap (image-level labels) |
Labeling bounding boxes is far more expensive than image-level classification labels. YOLO9000 bridges this gap by jointly training on detection and classification data.

9.2 WordTree: Hierarchical Classification

The key challenge in combining datasets is that their label spaces are structured differently. ImageNet labels like “Norfolk terrier” and COCO labels like “dog” are not mutually exclusive — a standard softmax over all classes would be incorrect.

YOLO9000 solves this by building a WordTree — a hierarchical tree of visual concepts derived from WordNet. Instead of a flat softmax over all classes, the model predicts conditional probabilities at each node:

\[\Pr(\text{Norfolk terrier} \mid \text{terrier}), \quad \Pr(\text{Yorkshire terrier} \mid \text{terrier}), \quad \ldots\]

The absolute probability for any node is computed by multiplying conditional probabilities along the path from that node to the root:

\[\Pr(\text{Norfolk terrier}) = \Pr(\text{Norfolk terrier} \mid \text{terrier}) \times \Pr(\text{terrier} \mid \text{hunting dog}) \times \cdots \times \Pr(\text{mammal} \mid \text{animal}) \times \Pr(\text{animal} \mid \text{physical object})\]

A softmax is applied over co-hyponyms (siblings in the tree) rather than all classes. For classification, \(\Pr(\text{physical object}) = 1\); for detection, YOLOv2’s objectness predictor provides this value.

WordTree1k (1000 ImageNet classes) expands to 1369 nodes with intermediate concepts. Hierarchical Darknet-19 achieves 71.9% top-1 accuracy (vs. 72.9% flat) — only a marginal drop, with the benefit of graceful degradation on uncertain categories.

Show code
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(14, 7))
ax.set_xlim(0, 14)
ax.set_ylim(0, 7)
ax.axis('off')
ax.set_title('WordTree: Hierarchical Classification with Conditional Probabilities',
             fontsize=14, fontweight='bold', pad=15)

# Tree nodes: (x, y, label, color)
nodes = [
    (7, 6.2, 'physical object', '#E0E0E0'),
    (3.5, 5.0, 'animal', '#C8E6C9'),
    (10.5, 5.0, 'artifact', '#BBDEFB'),
    (2, 3.8, 'mammal', '#C8E6C9'),
    (5, 3.8, 'bird', '#C8E6C9'),
    (9, 3.8, 'vehicle', '#BBDEFB'),
    (12, 3.8, 'equipment', '#BBDEFB'),
    (1, 2.5, 'dog', '#A5D6A7'),
    (3, 2.5, 'cat', '#A5D6A7'),
    (8, 2.5, 'car', '#90CAF9'),
    (10, 2.5, 'airplane', '#90CAF9'),
    (0.3, 1.2, 'terrier', '#81C784'),
    (1.7, 1.2, 'hound', '#81C784'),
    (0.3, 0.2, 'Norfolk\nterrier', '#66BB6A'),
    (1.7, 0.2, 'Yorkshire\nterrier', '#66BB6A'),
]

# Draw nodes
for x, y, label, color in nodes:
    rect = patches.FancyBboxPatch((x - 0.6, y - 0.25), 1.2, 0.5,
                                  boxstyle='round,pad=0.08',
                                  facecolor=color, edgecolor='#555555', linewidth=1)
    ax.add_patch(rect)
    ax.text(x, y, label, ha='center', va='center', fontsize=8, fontweight='bold')

# Draw edges
edges = [
    (7, 5.95, 3.5, 5.25), (7, 5.95, 10.5, 5.25),  # physical object -> animal, artifact
    (3.5, 4.75, 2, 4.05), (3.5, 4.75, 5, 4.05),    # animal -> mammal, bird
    (10.5, 4.75, 9, 4.05), (10.5, 4.75, 12, 4.05),  # artifact -> vehicle, equipment
    (2, 3.55, 1, 2.75), (2, 3.55, 3, 2.75),          # mammal -> dog, cat
    (9, 3.55, 8, 2.75), (9, 3.55, 10, 2.75),        # vehicle -> car, airplane
    (1, 2.25, 0.3, 1.45), (1, 2.25, 1.7, 1.45),    # dog -> terrier, hound
    (0.3, 0.95, 0.3, 0.45), (0.3, 0.95, 1.7, 0.45),  # terrier -> Norfolk, Yorkshire
]
for x1, y1, x2, y2 in edges:
    ax.plot([x1, x2], [y1, y2], 'k-', linewidth=1, alpha=0.5)

# Softmax annotations
softmax_groups = [
    (3.5, 10.5, 5.0, 'softmax'),  # animal vs artifact
    (2, 5, 3.8, 'softmax'),        # mammal vs bird
    (1, 3, 2.5, 'softmax'),        # dog vs cat
]
for x1, x2, y, label in softmax_groups:
    mid = (x1 + x2) / 2
    ax.annotate(label, xy=(mid, y + 0.35), fontsize=7, color='#C62828',
                ha='center', fontweight='bold', style='italic',
                bbox=dict(boxstyle='round,pad=0.15', facecolor='#FFEBEE', edgecolor='#C62828', alpha=0.8))

# Probability computation
prob_text = (
    'Pr(Norfolk terrier) = Pr(Norfolk terrier | terrier)\n'
    '  x Pr(terrier | dog) x Pr(dog | mammal)\n'
    '  x Pr(mammal | animal) x Pr(animal | physical object)'
)
ax.text(8.5, 1.5, prob_text, fontsize=9, va='center',
        bbox=dict(boxstyle='round,pad=0.5', facecolor='#FFF9C4', edgecolor='#F9A825', alpha=0.9),
        family='monospace')

plt.tight_layout()
plt.show()
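
A minimal sketch of the path-product computation, using a hypothetical fragment of the tree and made-up conditional scores (the real WordTree has thousands of nodes derived from WordNet):

# Hypothetical fragment of the WordTree: child -> parent
parent = {
    "Norfolk terrier": "terrier",
    "Yorkshire terrier": "terrier",
    "terrier": "dog",
    "dog": "mammal",
    "mammal": "animal",
    "animal": "physical object",
}

# Hypothetical conditional probabilities Pr(node | parent), e.g. from per-sibling softmaxes
conditional = {
    "Norfolk terrier": 0.6, "Yorkshire terrier": 0.4,
    "terrier": 0.5, "dog": 0.7, "mammal": 0.8, "animal": 0.9,
}

def absolute_probability(node, p_root=1.0):
    """Multiply conditional probabilities along the path from node to the root."""
    prob = p_root                 # Pr(physical object); objectness at detection time
    while node in parent:
        prob *= conditional[node]
        node = parent[node]
    return prob

print(absolute_probability("Norfolk terrier"))   # 0.6 * 0.5 * 0.7 * 0.8 * 0.9 = 0.1512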

9.3 Dataset Combination and Joint Training

Dataset combination. Categories from both COCO (80 detection classes) and the top 9000 ImageNet classes are mapped to synsets in the WordTree, producing a combined tree with 9418 classes. COCO is oversampled at a 4:1 ratio to balance with ImageNet.

Joint training algorithm. During training, detection and classification images are mixed:

  • Detection images (from COCO): backpropagate the full YOLOv2 loss (coordinates, objectness, and classification)
  • Classification images (from ImageNet): backpropagate only the classification loss at the label’s level in the tree and above. For example, if the label is “dog”, no error is assigned to finer distinctions like “German Shepherd” vs. “Golden Retriever”.

YOLO9000 uses 3 priors (instead of 5) to limit output size.

9.4 YOLO9000 Results

  • 19.7 mAP overall on the ImageNet detection validation set
  • 16.0 mAP on the 156 classes that have no detection training data (only classification labels)
  • Learns animal species well (objectness predictions generalize from COCO animals)
  • Struggles with clothing and equipment (COCO lacks bounding boxes for these categories)
  • Detects 9000+ object categories in real-time
| Category Type | Example | Performance | Reason |
|---|---|---|---|
| Animals (strong) | armadillo (61.7), tiger (61.0) | High mAP | Objectness generalizes from COCO |
| Clothing (weak) | sunglasses (0.0), swimming trunks (0.0) | Near-zero mAP | COCO has no clothing bounding boxes |

10. Summary: Evolution from YOLOv1 to YOLOv2

Show code
import pandas as pd

comparison = {
    'Aspect': [
        'Backbone',
        'Box prediction',
        'Anchor priors',
        'Location encoding',
        'Normalization',
        'Input resolution',
        'Classifier pretraining',
        'Fine-grained features',
        'Number of classes',
        'VOC 2007 mAP',
        'Speed',
    ],
    'YOLOv1': [
        'Custom GoogLeNet-inspired (24 conv)',
        'Fully connected layers, direct coordinates',
        'None (grid cells only)',
        'Direct x,y relative to grid cell',
        'None (uses dropout)',
        'Fixed 448x448',
        '224x224 only',
        'None',
        '~20 (VOC)',
        '63.4%',
        '45 FPS',
    ],
    'YOLOv2': [
        'Darknet-19 (19 conv, fewer FLOPs)',
        'Convolutional, anchor box offsets',
        'k-means clustered dimension priors (k=5)',
        'Sigmoid-constrained relative to grid cell',
        'Batch normalization (no dropout)',
        'Multi-scale {320..608}',
        '224x224, then fine-tuned at 448x448',
        'Passthrough layer (26x26 to 13x13)',
        '9000+ (via WordTree joint training)',
        '78.6%',
        '40-91 FPS (resolution-dependent)',
    ]
}
df_cmp = pd.DataFrame(comparison)
df_cmp.style.set_properties(**{'text-align': 'left'}).set_table_styles([
    {'selector': 'th', 'props': [('text-align', 'left'), ('font-weight', 'bold')]}
]).set_caption('YOLOv1 vs YOLOv2: Side-by-Side Comparison')
Table 7: YOLOv1 vs YOLOv2: Side-by-Side Comparison
| Aspect | YOLOv1 | YOLOv2 |
|---|---|---|
| Backbone | Custom GoogLeNet-inspired (24 conv) | Darknet-19 (19 conv, fewer FLOPs) |
| Box prediction | Fully connected layers, direct coordinates | Convolutional, anchor box offsets |
| Anchor priors | None (grid cells only) | k-means clustered dimension priors (k=5) |
| Location encoding | Direct x,y relative to grid cell | Sigmoid-constrained relative to grid cell |
| Normalization | None (uses dropout) | Batch normalization (no dropout) |
| Input resolution | Fixed 448x448 | Multi-scale {320..608} |
| Classifier pretraining | 224x224 only | 224x224, then fine-tuned at 448x448 |
| Fine-grained features | None | Passthrough layer (26x26 to 13x13) |
| Number of classes | ~20 (VOC) | 9000+ (via WordTree joint training) |
| VOC 2007 mAP | 63.4% | 78.6% |
| Speed | 45 FPS | 40-91 FPS (resolution-dependent) |

Key Themes

Several themes run through the evolution from YOLOv1 to YOLOv2:

  1. Detection as regression: Both papers maintain the core insight that detection can be cast as a single regression problem, avoiding the overhead of proposal generation.

  2. Speed/accuracy tradeoff: YOLOv1 prioritized speed with an acceptable accuracy gap. YOLOv2 closed the accuracy gap while maintaining speed, and introduced multi-scale inference for flexible tradeoffs at test time.

  3. Simplicity of architecture: Each improvement in v2 is motivated by a specific v1 limitation. The single-network, end-to-end pipeline is preserved throughout.

  4. Learned priors over hand-designed components: From hand-picked anchor boxes to k-means clustered dimension priors, from flat classification to hierarchical WordTree prediction — the trend is toward letting data drive design decisions.

  5. Bridging detection and classification: YOLO9000 demonstrated that hierarchical label spaces can bridge the gap between richly-labeled classification datasets and sparsely-labeled detection datasets, enabling detection of thousands of categories with minimal detection annotations.

The YOLO family established a research trajectory that continues to influence modern detection systems, with YOLOv3 and beyond building directly on the foundations laid in these two papers.

References

  1. Redmon, J., Divvala, S., Girshick, R., and Farhadi, A., “You Only Look Once: Unified, Real-Time Object Detection,” CVPR 2016.
  2. Redmon, J. and Farhadi, A., “YOLO9000: Better, Faster, Stronger,” CVPR 2017.