Linear Neural Networks for Classification
Introduction
In the preceding sections, we addressed regression problems where the objective was to predict continuous values. We now pivot to classification, where the goal is to assign inputs to discrete categories. While the underlying ‘plumbing’ of the learning pipeline—loading data, calculating gradients, and updating parameters—remains consistent with regression, classification requires a different approach to output parametrization and loss measurement. Specifically, we move from asking ‘how much?’ to ‘which category?’
The Classification Problem and Data Representation
A classification problem involves mapping an input vector to a specific category chosen from a finite set of discrete values. Common examples include identifying whether an email is spam or non-spam, or determining which animal species is depicted in an image. Unlike regression, where the target values possess a natural ordering and numerical distance, classification categories are often qualitatively distinct.
A classification problem is a supervised learning task where the goal is to learn a function \(f: \mathbb{R}^d \to \mathcal{Y}\), where \(\mathcal{Y} = \{1, 2, \dots, q\}\) is a finite set of \(q\) discrete labels.
In many contexts, we are not only interested in a ‘hard assignment’ (picking a single class) but also in ‘soft assignments’, which involve estimating the conditional probability \(P(y=c \mid \mathbf{x})\) for each category \(c\).
To train these models, we must decide how to represent categorical labels numerically. While assigning integers (e.g., 1 for cat, 2 for dog) is efficient for storage, it implies a mathematical ordering and distance (e.g., dog is ‘greater than’ cat) that does not exist. To avoid encoding this spurious structure, we use a vector-based representation.
A one-hot encoding is a vector representation of categorical data where a label is represented by a \(q\)-dimensional vector \(\mathbf{y}\). If the label corresponds to the \(k\)-th category, then the \(k\)-th element of the vector is set to 1, and all other elements are set to 0. Formally, \(\mathbf{y} \in \{0, 1\}^q\) such that \(\sum_{j=1}^q y_j = 1\).
One-hot encoding ensures that each category is equidistant from all others in the Euclidean sense, thereby preventing the model from assuming any inherent ordinal relationship between classes.
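As a concrete illustration, the following minimal sketch builds a one-hot vector in NumPy. The helper name `one_hot` and the use of 0-based class indices are implementation conveniences of this sketch, not part of the formal definition above.

```python
import numpy as np

def one_hot(label, num_classes):
    """Return a one-hot vector of length num_classes with a 1 at index `label` (0-based)."""
    y = np.zeros(num_classes)
    y[label] = 1.0
    return y

# Example: the third category (index 2) out of q = 4 classes
print(one_hot(2, 4))  # [0. 0. 1. 0.]
```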
Softmax Regression and the Linear Model
To address classification with linear models, we require a model with multiple outputs, one for each possible class. Each output is associated with its own affine function. For an input feature vector \(\mathbf{x} \in \mathbb{R}^d\), the linear layer produces a vector of ‘logits’ \(\mathbf{o} \in \mathbb{R}^q\).
The linear classification model (or the logit layer) is defined by the transformation \(\mathbf{o} = \mathbf{W}\mathbf{x} + \mathbf{b}\), where \(\mathbf{W} \in \mathbb{R}^{q \times d}\) is the weight matrix and \(\mathbf{b} \in \mathbb{R}^q\) is the bias vector.
In this architecture, the calculation of each output \(o_i\) depends on every input feature \(x_j\), making this a fully connected layer. The weights \(W_{ij}\) represent the influence of the \(j\)-th feature on the score for the \(i\)-th class.
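A minimal sketch of this logit layer, assuming illustrative sizes \(d = 5\) and \(q = 3\) and randomly initialized parameters (the specific values carry no meaning):

```python
import numpy as np

rng = np.random.default_rng(0)

d, q = 5, 3                  # feature dimension and number of classes (illustrative)
W = rng.normal(size=(q, d))  # weight matrix, shape (q, d)
b = rng.normal(size=q)       # bias vector, shape (q,)
x = rng.normal(size=d)       # a single input feature vector

o = W @ x + b                # logits: one raw score per class
print(o.shape)               # (3,)
```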
The raw outputs \(\mathbf{o}\) can take any real value. However, to interpret these as probabilities, we need a mechanism to ‘squish’ them into a range of \([0, 1]\) such that they sum to unity. We achieve this using the softmax activation function.
The softmax function transforms a vector of logits \(\mathbf{o}\) into a probability distribution \(\hat{\mathbf{y}}\) where the \(i\)-th component is given by:
\[\hat{y}_i = \text{softmax}(\mathbf{o})_i = \frac{\exp(o_i)}{\sum_{j=1}^q \exp(o_j)}\]
The use of the exponential function ensures non-negativity, while the denominator provides normalization. Note that softmax preserves the ordering of the logits: the largest logit will always correspond to the largest probability.
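The sketch below implements the softmax formula above. Subtracting the maximum logit before exponentiating is a common numerical-stability trick that leaves the result mathematically unchanged; it is not part of the definition itself.

```python
import numpy as np

def softmax(o):
    """Map a vector of logits to a probability distribution.

    Shifting by the maximum logit avoids overflow for large logits
    without changing the resulting probabilities.
    """
    exp_o = np.exp(o - np.max(o))
    return exp_o / exp_o.sum()

o = np.array([2.0, 1.0, 0.1])
y_hat = softmax(o)
print(y_hat)        # approx. [0.659 0.242 0.099]
print(y_hat.sum())  # 1.0
```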
The predicted class label \(i^*\) is determined by finding the index that maximizes the output probability, which is equivalent to finding the index of the largest logit:
\[i^* = \text{argmax}_j \{\hat{y}_j\} = \text{argmax}_j \{o_j\}\]
This property allows us to skip the computationally expensive softmax operation during inference if only the ‘hard’ class assignment is required.
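A quick check of this ordering-preservation property on a small vector of illustrative logits:

```python
import numpy as np

o = np.array([2.0, 1.0, 0.1])                            # logits
y_hat = np.exp(o - o.max()) / np.exp(o - o.max()).sum()  # softmax probabilities

# Softmax is monotone, so the hard prediction is the same either way.
print(np.argmax(o), np.argmax(y_hat))                    # 0 0
```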
The Cross-Entropy Loss Function
Once we have a mapping from features to probabilities, we need a loss function to optimize the model. In classification, we evaluate how well the predicted distribution \(\hat{\mathbf{y}}\) matches the actual distribution \(\mathbf{y}\) (the one-hot encoded label). This is typically achieved using cross-entropy loss, which is derived from the principle of maximum likelihood estimation.
The cross-entropy loss for a single example with ground-truth \(\mathbf{y}\) and prediction \(\hat{\mathbf{y}}\) is defined as:
\[l(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{j=1}^q y_j \log \hat{y}_j\]
Because \(\mathbf{y}\) is a one-hot vector, the sum collapses to a single term. If the true class is \(k\), the loss becomes \(l = -\log \hat{y}_k\). This measures the ‘negative log-likelihood’ that the model assigns to the correct class.
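A direct sketch of this loss, assuming \(\mathbf{y}\) is a one-hot vector and \(\hat{\mathbf{y}}\) a valid probability vector. The small `eps` guard against \(\log(0)\) is a numerical convenience of the sketch, not part of the mathematical definition.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross-entropy between a one-hot label y and predicted probabilities y_hat."""
    return -np.sum(y * np.log(y_hat + eps))

y = np.array([0.0, 1.0, 0.0])      # true class is index 1
y_hat = np.array([0.2, 0.7, 0.1])  # predicted distribution
print(cross_entropy(y, y_hat))     # = -log(0.7) ≈ 0.357
```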
Justification: Minimizing the cross-entropy loss is equivalent to maximizing the likelihood of the observed labels under the model’s predicted probability distribution. This provides a rigorous statistical foundation for the loss function choice.
One significant advantage of the softmax and cross-entropy combination is the simplicity of their gradients. When we compute the derivative of the cross-entropy loss with respect to the raw logit \(o_j\), we obtain a very intuitive result.
Substituting \(\hat{y}_j = \frac{\exp(o_j)}{\sum_m \exp(o_m)}\) into \(l = -\sum_j y_j \log \hat{y}_j\) and using \(\sum_j y_j = 1\) gives
\[l(\mathbf{y}, \hat{\mathbf{y}}) = \log \sum_{m=1}^q \exp(o_m) - \sum_{j=1}^q y_j o_j.\]
Differentiating this log-sum-exp expression with respect to the logit \(o_j\) yields
\[\frac{\partial l}{\partial o_j} = \frac{\exp(o_j)}{\sum_{m=1}^q \exp(o_m)} - y_j = \hat{y}_j - y_j,\]
that is, the derivative is the difference between the probability assigned by the model and the corresponding indicator in the one-hot vector. Q.E.D.
This gradient is identical in form to the gradient of the squared error in linear regression. It represents the ‘error’ in our prediction, making it highly effective for gradient descent optimization.
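To make the result concrete, the sketch below compares the analytic gradient \(\hat{y}_j - y_j\) with a finite-difference approximation on illustrative logits and a one-hot label; the specific values are arbitrary.

```python
import numpy as np

def softmax(o):
    exp_o = np.exp(o - np.max(o))
    return exp_o / exp_o.sum()

def loss(o, y):
    """Cross-entropy of the softmax of logits o against one-hot label y."""
    return -np.sum(y * np.log(softmax(o)))

o = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])   # true class is index 1

# Analytic gradient: y_hat - y
analytic = softmax(o) - y

# Central finite-difference approximation of the same gradient
eps = 1e-6
numeric = np.array([
    (loss(o + eps * np.eye(3)[j], y) - loss(o - eps * np.eye(3)[j], y)) / (2 * eps)
    for j in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```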
Conclusion
In this document, we have established softmax regression as the standard linear baseline for classification. By representing labels with one-hot encoding and using the softmax activation function, we can map linear outputs to valid probability distributions. Finally, the cross-entropy loss provides a probabilistic objective that simplifies the learning process through an intuitive gradient that directly reflects the error between our prediction and the truth.