Course Outline

CSCI 4052U — Machine Learning II

Course Description

This course builds on the foundations of machine learning to explore modern deep learning architectures for computer vision, natural language processing, and generative modelling. Students progress from basic neural network building blocks through increasingly sophisticated architectures, culminating in state-of-the-art generative AI systems.

Unit 1: Preliminaries

Review of core deep learning building blocks that serve as the foundation for all subsequent units.

  • Linear Networks: Linear regression as a neural network, regularisation techniques
  • Linear Classification: Softmax classifiers, cross-entropy loss
  • Multilayer Perceptrons: Hidden layers, activation functions, universal approximation
  • Convolutional Networks: Convolution operations, pooling, feature maps
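As a minimal sketch of the softmax classifier and cross-entropy loss covered in this unit, the following NumPy snippet computes class probabilities from logits and the mean negative log-likelihood of the true labels (the example logits and labels are illustrative, not from any course dataset):

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max before exponentiating, for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, labels):
    # Mean negative log-probability assigned to the true class of each example.
    n = probs.shape[0]
    return -np.log(probs[np.arange(n), labels]).mean()

logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([0, 1])

probs = softmax(logits)
loss = cross_entropy(probs, labels)
```

Each row of `probs` is a valid distribution over the three classes, and the loss shrinks toward zero as the model places more mass on the correct labels.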

Unit 2: ConvNets for Image Classification

A survey of landmark convolutional neural network architectures that defined the modern era of computer vision.

  • Early ConvNets: AlexNet, VGGNet, Network-in-Network, GoogLeNet/Inception
  • ResNet: Residual connections, skip connections, deep network training
  • DenseNet: Dense connectivity patterns, feature reuse
  • MobileNet: Depthwise separable convolutions, efficient architectures for deployment
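To make ResNet's central idea concrete, here is a simplified residual block using fully connected layers in place of convolutions (an illustration of the skip connection only, not the actual ResNet architecture): the input is added back to the transformed features, so when the residual branch outputs zero the block reduces to the identity.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    # Simplified residual block: out = relu(x + F(x)), where F is a small
    # two-layer transform. The skip path lets gradients bypass F entirely.
    h = relu(x @ W1)
    return relu(x + h @ W2)

x = np.array([[1.0, 2.0]])
W1 = np.ones((2, 3))
W2 = np.zeros((3, 2))   # residual branch outputs zero -> block acts as identity
out = residual_block(x, W1, W2)
```

This identity-when-zero behaviour is what makes very deep networks trainable: each block only has to learn a *correction* to its input.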

Unit 3: ConvNets for Object Detection

Extending image classification to the more challenging task of localising and classifying multiple objects within an image.

  • The R-CNN Family: R-CNN, Fast R-CNN, Faster R-CNN — from selective search to learned region proposals
  • SSD (Single Shot MultiBox Detector): Multi-scale feature maps, default boxes, single-pass detection
  • YOLO (v1 and v2): Real-time object detection, grid-based prediction, architectural improvements
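A quantity shared by all the detectors above, though not named in the list, is intersection-over-union (IoU), used to match predicted boxes to ground truth and to suppress duplicates. A minimal implementation for axis-aligned boxes given as (x1, y1, x2, y2):

```python
def iou(a, b):
    # Intersection-over-union of two boxes (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

For example, two 2×2 boxes overlapping in a unit square have IoU 1/7 (intersection 1, union 4 + 4 − 1 = 7).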

Unit 4: Language Modelling

Transitioning from vision to sequence modelling, covering the architectures that underpin modern NLP.

  • Attention and Transformers: Self-attention mechanism, multi-head attention, positional encodings, the Transformer architecture
  • Landmark Architectures: GPT (autoregressive), BERT (masked language modelling), ViT (vision transformers), T5 (text-to-text)

Unit 5: Generative AI

The probabilistic and architectural foundations of modern generative models, progressing from theory to state-of-the-art systems.

  • Probabilistic Models: Probability distributions, discriminative vs. generative models, explicit vs. implicit density, maximum likelihood, KL divergence, latent variable models
  • Variational Autoencoders: Autoencoders, latent variable models, the ELBO, reparameterisation trick, generation and disentanglement
  • Diffusion Models: Forward (noising) and reverse (denoising) processes, diffusion kernels, ELBO simplification to noise prediction, U-Net architecture, time embeddings, DDIMs, cascaded generation, conditional guidance
  • CLIP (Contrastive Language-Image Pre-training): Dual-encoder architecture, contrastive training objective, zero-shot classification, text conditioning for diffusion models
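The forward (noising) process of a diffusion model can be sampled in closed form, which is what makes training tractable: x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε with ε ~ N(0, I). The sketch below uses a DDPM-style linear noise schedule as an illustrative choice; the specific schedule values are an assumption, not taken from the course materials.

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    # Sample x_t from q(x_t | x_0) directly, without simulating every step:
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # linear schedule (illustrative values)
alpha_bar = np.cumprod(1.0 - betas)     # cumulative product of (1 - beta_t)

rng = np.random.default_rng(0)
x0 = rng.normal(size=(16,))
xt, eps = forward_diffuse(x0, t=T - 1, alpha_bar=alpha_bar, rng=rng)
```

Note that ᾱ_t decreases monotonically from near 1 to near 0, so early steps barely perturb x₀ while the final x_T is close to pure Gaussian noise; the reverse model is then trained to predict ε from x_t, which is the "ELBO simplification to noise prediction" listed above.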