Table of Contents:
- Introduction
- Visualizing the loss function
- Optimization
- Computing the gradient
- Gradient descent
- Summary
Previously, we discussed two foundational components in the context of image classification:
- A score function, which maps raw image pixels to class scores using a parameterized function (e.g., a linear function).
- A loss function, which evaluates the quality of the parameters by assessing how well the predicted scores align with the ground truth labels in the training data. For instance, we explored loss functions like Softmax and SVM.
To recap, the linear score function has the form $f(x_i, W) = W x_i$, and the SVM loss we developed was formulated as:
$$L = \frac{1}{N} \sum_i \sum_{j\neq y_i} \left[ \max(0, f(x_i; W)_j - f(x_i; W)_{y_i} + 1) \right] + \alpha R(W)$$
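For concreteness, here is a minimal vectorized sketch of evaluating this loss. It is illustrative only: the function name `svm_loss` is hypothetical, data is assumed to be stored as columns of `X` with integer labels `y`, and $R(W)$ is assumed to be the L2 penalty. The function `L` used in the code later in this section can be thought of along these lines.

import numpy as np

def svm_loss(W, X, y, alpha=0.0):
  """Multiclass SVM loss as in the formula above (a hypothetical sketch).
  W: weights [10 x 3073]; X: data columns [3073 x N]; y: integer labels [N].
  alpha weighs the (assumed L2) regularization."""
  N = X.shape[1]
  scores = W.dot(X)                              # class scores, shape [10 x N]
  correct = scores[y, np.arange(N)]              # score of each correct class
  margins = np.maximum(0, scores - correct + 1)  # hinge terms with delta = 1
  margins[y, np.arange(N)] = 0                   # don't count the j == y_i terms
  data_loss = margins.sum() / N
  reg_loss = alpha * np.sum(W * W)               # R(W) = sum of squared weights
  return data_loss + reg_loss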
We saw that a setting of the parameters $W$ that produces predictions consistent with the ground truth labels $y_i$ also yields a very low loss $L$. We now introduce the third and final key component: optimization, the process of finding the set of parameters $W$ that minimizes the loss function.
Looking ahead: Once we understand how these three core components interact, we will revisit the first component (the parameterized function mapping) and extend it to functions much more complicated than a linear mapping: first entire Neural Networks, and then Convolutional Neural Networks. The loss functions and the optimization process will remain relatively unchanged.
The parameterized function mapping refers to a mathematical function that transforms input data (such as raw image pixels in the context of image classification) into output predictions (like class scores) using a set of adjustable parameters. These parameters are learned from the training data during the optimization process.
The loss functions we work with in machine learning are often defined in very high-dimensional spaces. For instance, in CIFAR-10, a linear classifier’s weight matrix has dimensions $[10 \times 3073]$, resulting in a total of 30,730 parameters. This complexity makes direct visualization challenging. However, we can build intuition by exploring slices of the high-dimensional space in one or two dimensions.
One way to achieve this is by generating a random weight matrix $W$ (a single point in the space), choosing a random direction $W_1$, and recording the loss $L(W + aW_1)$ as the scalar $a$ varies. This produces a plot with $a$ on the x-axis and the loss value on the y-axis. The same procedure works in two dimensions by evaluating $L(W + aW_1 + bW_2)$ as $a, b$ vary.
The loss function landscape for the Multiclass SVM (without regularization) is shown for a single example (left, middle) and for a hundred examples (right) in CIFAR-10. On the left, the one-dimensional loss is visualized by varying only $a$ along a random direction $W_1$; the middle and right plots show two-dimensional slices $L(W + aW_1 + bW_2)$, with blue indicating low loss and red high loss. Note the piecewise-linear structure: the losses of many examples are averaged, so the bowl on the right is an average of many piecewise-linear bowls like the one in the middle.
The left plot is a 1D slice through the high-dimensional loss landscape.
$W$ and $W_1$ are high-dimensional matrices; their dimensionality does not change. The "1D" refers to the movement along the direction $W_1$, controlled by the scalar $a$.
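To generate such a plot yourself, a sketch along these lines would work (reusing the hypothetical `svm_loss` from above and assuming training data `X_train`, `Y_train`):

import numpy as np

W = np.random.randn(10, 3073) * 0.0001  # a random point in weight space
W1 = np.random.randn(10, 3073)          # a random direction
for a in np.linspace(-1.0, 1.0, 21):    # march along the ray W + a * W1
  loss = svm_loss(W + a * W1, X_train, Y_train)
  print('a = %+.1f  loss = %f' % (a, loss))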
The piecewise-linear structure of the SVM loss function can be understood by analyzing its mathematical formulation. For a single example, the loss is defined as:
$$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^T x_i - w_{y_i}^T x_i + 1) \right]$$
This equation shows that the data loss for each example is a sum of linear functions of $W$ (through the rows $w_j$), each thresholded at zero by the $\max(0, \cdot)$ operation.
To make this explicit, consider a simple dataset with three 1-dimensional data points ($x_0, x_1, x_2$, with labels 0, 1, 2) and three classes. The full SVM loss (without regularization) for this dataset becomes:
$$\begin{aligned} L_0 &= \max(0, w_1 x_0 - w_0 x_0 + 1) + \max(0, w_2 x_0 - w_0 x_0 + 1) \\ L_1 &= \max(0, w_0 x_1 - w_1 x_1 + 1) + \max(0, w_2 x_1 - w_1 x_1 + 1) \\ L_2 &= \max(0, w_0 x_2 - w_2 x_2 + 1) + \max(0, w_1 x_2 - w_2 x_2 + 1) \\ L &= (L_0 + L_1 + L_2)/3 \end{aligned}$$
Each data point ($x_i$) and each weight ($w_j$) is a single number here. Looking at, for instance, $w_0$, some terms above are linear functions of $w_0$, and each is clamped at zero.
A 1-dimensional representation of the data loss, where the x-axis represents a single weight, and the y-axis shows the loss value. The total data loss is composed of multiple terms, each of which is either independent of the specific weight or a linear function of it, clamped to zero when inactive. The full SVM data loss extends this piecewise-linear structure to a 30,730-dimensional space.
Independent of the specific weight: For $L_0 = \max(0, w_1 x_0 - w_0 x_0 + 1) + \max(0, w_2 x_0 - w_0 x_0 + 1)$, the second term $\max(0, w_2 x_0 - w_0 x_0 + 1)$ is independent of $w_1$ because it only involves $w_2$ and $w_0$, not $w_1$.
Inactive means the term inside the $\max(0, \cdot)$ is $\leq 0$.
"Thresholded at zero" means that the value of a function is adjusted such that it cannot go below zero. If the value is positive, it remains unchanged, but if the value is zero or negative, it is "clamped" to zero. This is typically achieved using the
$\max(0, \cdot)$ operation.
- Downward Slope (varying $w_0$): Increasing the weight for the correct class improves the margin, which decreases the loss term, leading to a downward slope. This slope becomes flat (zero) once the margin condition is fully satisfied.
- Upward Slope (varying $w_1$): Increasing the weight for an incorrect class worsens the margin, increasing the loss term and causing an upward slope. This term only starts to increase once the incorrect class score approaches or exceeds the correct class score.
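To see this piecewise-linear behavior concretely, here is a small runnable sketch of the toy loss above. The three 1-D data points and the two fixed weights are made-up values chosen purely for illustration; it sweeps $w_0$ while holding $w_1, w_2$ constant:

import numpy as np

# Toy dataset: three 1-D points, one per class (values are made up).
x = [1.0, -2.0, 3.0]      # x_0 (class 0), x_1 (class 1), x_2 (class 2)
w1, w2 = 0.5, -0.3        # hold the other two weights fixed (arbitrary)

def toy_loss(w0):
  L0 = max(0, w1 * x[0] - w0 * x[0] + 1) + max(0, w2 * x[0] - w0 * x[0] + 1)
  L1 = max(0, w0 * x[1] - w1 * x[1] + 1) + max(0, w2 * x[1] - w1 * x[1] + 1)
  L2 = max(0, w0 * x[2] - w2 * x[2] + 1) + max(0, w1 * x[2] - w2 * x[2] + 1)
  return (L0 + L1 + L2) / 3

for w0 in np.linspace(-2, 2, 9):  # the printed losses trace out line segments
  print('w0 = %+.1f  loss = %.3f' % (w0, toy_loss(w0)))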
As an aside, the bowl-shaped appearance of the SVM cost function indicates that it is a convex function. Convex functions are well-studied in optimization, and there is extensive literature on methods for efficiently minimizing them. For those interested, Stanford offers a well-known course on the topic: Convex Optimization. However, once we extend our score functions $f$ to Neural Networks, our objective functions become non-convex, and the visualizations above feature not bowls but complex, bumpy terrains.
Non-differentiable loss functions. A technical detail to note is that the kinks in the SVM loss function, caused by the max operation, technically make the loss non-differentiable: at these kinks the gradient is not defined. However, the subgradient still exists and is commonly used instead; in this class we will use the terms subgradient and gradient interchangeably.
kinks: In the context of functions like the SVM loss, kinks refer to points in the graph of the function where there is an abrupt change in the slope (gradient). These points occur because the function transitions between different behaviors, such as:
- From a flat region (constant value) to a sloped region (linear growth).
- Or between two linear regions with different slopes.
Subgradient: a generalization of the gradient. A subgradient extends the concept of a gradient to non-differentiable functions. At a kink, it represents any value within the range of slopes that approximate the function's behavior. For $f(z) = \max(0, z)$, the subgradient is:
- If $z > 0$: $f'(z) = 1$ (regular gradient).
- If $z < 0$: $f'(z) = 0$ (regular gradient).
- If $z = 0$: no unique gradient exists; the subgradient can take any value in $[0, 1]$.
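In code, the usual convention is simply to pick one valid subgradient at the kink, most commonly 0:

import numpy as np

z = np.array([-1.5, 0.0, 2.0])
grad = (z > 0).astype(float)  # one valid subgradient choice: 0 at z == 0
print(grad)                   # [0. 0. 1.]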
To reiterate, the loss function serves as a measure of how well a particular set of weights W performs. The primary goal of optimization is to find the set of weights W that minimizes the loss function, leading to the best performance. We will now motivate and slowly develop an approach to optimizing the loss function. For those of you coming to this class with previous experience, this section might seem odd since the working example we'll use (the SVM loss) is a convex problem, but keep in mind that our goal is to eventually optimize Neural Networks, where we can't easily use any of the tools developed in the Convex Optimization literature.
Convexity means the loss function has a single global minimum and no local minima, making optimization straightforward: tools from convex optimization guarantee finding the global minimum efficiently using techniques like gradient descent. In contrast, neural network loss functions are non-convex, with many local minima and saddle points, making optimization much harder. Starting with a convex problem therefore makes the optimization challenge feel easier than what is actually encountered with neural networks.
Since evaluating the quality of a given set of parameters W is straightforward, the first (very bad) idea that may come to mind is to randomly try out different weights and keep the one that performs best. This process could look something like this:
# assume X_train is the data where each column is an example (e.g. 3073 x 50,000)
# assume Y_train are the labels (e.g. 1D array of 50,000)
# assume the function L evaluates the loss function
import numpy as np

bestloss = float("inf") # Python assigns the highest possible float value
for num in range(1000):
  W = np.random.randn(10, 3073) * 0.0001 # generate random parameters
  loss = L(X_train, Y_train, W) # get the loss over the entire training set
  if loss < bestloss: # keep track of the best solution
    bestloss = loss
    bestW = W
  print('in attempt %d the loss was %f, best %f' % (num, loss, bestloss))
# prints:
# in attempt 0 the loss was 9.401632, best 9.401632
# in attempt 1 the loss was 8.959668, best 8.959668
# in attempt 2 the loss was 9.044034, best 8.959668
# in attempt 3 the loss was 9.278948, best 8.959668
# in attempt 4 the loss was 8.857370, best 8.857370
# in attempt 5 the loss was 8.943151, best 8.857370
# in attempt 6 the loss was 8.605604, best 8.605604
# ... (truncated: continues for 1000 lines)
In the code above, we experimented with several random weight vectors W, and observed that some performed better than others. We can take the best-performing weights W from this search and evaluate them on the test set:
# Assume X_test is [3073 x 10000], Y_test [10000 x 1]
scores = bestW.dot(X_test) # 10 x 10000, the class scores for all test examples
# find the index with max score in each column (the predicted class)
Yte_predict = np.argmax(scores, axis = 0)
# and calculate accuracy (fraction of predictions that are correct)
np.mean(Yte_predict == Y_test)
# returns 0.1555
With the best W this gives an accuracy of about 15.5%. Given that guessing classes completely at random achieves only 10%, that's not a very bad outcome for such a brain-dead random search solution!
Core idea: iterative refinement. We can do much better. Instead of trying to find the perfect set of weights W all at once (a task that becomes increasingly difficult or even impossible for complex models like neural networks), we focus on refining an initial set of weights step by step. The idea is to start with a random W and iteratively improve it, making small adjustments to reduce the loss at each step.
Our strategy: Start with random weights and iteratively refine them over time to get lower loss
Blindfolded hiker analogy. Imagine you're hiking on hilly terrain while blindfolded, aiming to reach the bottom. The height of the terrain represents the loss, and each step you take is an attempt to lower the loss. In the case of CIFAR-10, the hills exist in a 30,730-dimensional space (since W has dimensions 10 x 3073). Each point in this high-dimensional space corresponds to a specific set of weights W, and the height at that point represents the loss for those weights. Your goal is to navigate through this high-dimensional landscape to find the best W.
The first strategy you may think of is to extend one foot in a random direction and take a step only if it moves downhill. Specifically, we begin with a random $W$, generate random perturbations $\delta W$, and move to the perturbed $W + \delta W$ whenever the loss there is lower:
W = np.random.randn(10, 3073) * 0.001 # generate random starting W
bestloss = float("inf")
for i in range(1000):
  step_size = 0.0001
  Wtry = W + np.random.randn(10, 3073) * step_size
  loss = L(X_train, Y_train, Wtry)
  if loss < bestloss:
    W = Wtry
    bestloss = loss
  print('iter %d loss is %f' % (i, bestloss))
With the same number of loss function evaluations (1000), this method achieves a test set classification accuracy of 21.4%. While this is an improvement, it remains inefficient and computationally costly.
In the previous section, we explored finding a direction in the weight space that could improve our weight vector by lowering the loss. However, instead of randomly searching for a good direction, we can calculate the best direction for updating the weights. This direction, mathematically guaranteed to lead to the steepest descent in loss (at least in the limit as the step size goes towards zero), is determined by the gradient of the loss function. Using our hiking analogy, this process is like sensing the slope of the ground beneath your feet and moving in the direction that feels steepest downhill.
For one-dimensional functions, the slope represents the instantaneous rate of change of the function at a given point. The gradient generalizes this concept to functions that take a vector of numbers instead of a single number. 🌟🌟🌟Specifically, the gradient is a vector of slopes—commonly called derivatives—with respect to each dimension of the input space.
- The gradient is a vector of slopes, where each slope is a derivative with respect to a specific input dimension.
- In other words:
- The derivatives (e.g., $\frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \dots$) are the individual slopes.
- The gradient (e.g., $\nabla f(x)$) is the collection of all these slopes in vector form.
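As a quick worked example, for $f(x_1, x_2) = x_1^2 + 3x_2$:
$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2} \right) = \left( 2x_1, \; 3 \right)$$
so at the point $(x_1, x_2) = (2, 5)$ the gradient is $(4, 3)$.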
The derivative of a one-dimensional function with respect to its input is mathematically expressed as:
$$\frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$
When the functions of interest take a vector of numbers instead of a single number, these derivatives are referred to as partial derivatives, and the gradient is simply the collection of these partial derivatives for all dimensions.
There are two main approaches to compute gradients: a slow but straightforward method known as the numerical gradient, and a fast but more error-prone method requiring calculus called the analytic gradient. Both methods are outlined below.
The formula given above lets us compute the gradient numerically. Here is a generic function that takes a function `f` and a vector `x` to evaluate the gradient on, and returns the gradient of `f` at `x`:
import numpy as np

def eval_numerical_gradient(f, x):
  """
  a naive implementation of numerical gradient of f at x
  - f should be a function that takes a single argument
  - x is the point (numpy array) to evaluate the gradient at
  """

  fx = f(x) # evaluate function value at original point
  grad = np.zeros(x.shape)
  h = 0.00001

  # iterate over all indexes in x
  it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
  while not it.finished:

    # evaluate function at x+h
    ix = it.multi_index
    old_value = x[ix]
    x[ix] = old_value + h # increment by h
    fxh = f(x) # evaluate f(x + h)
    x[ix] = old_value # restore to previous value (very important!)

    # compute the partial derivative
    grad[ix] = (fxh - fx) / h # the slope
    it.iternext() # step to next dimension

  return grad
Following the gradient formula we gave above, the code above iterates over all dimensions one by one, makes a small change h
along that dimension and calculates the partial derivative of the loss function along that dimension by seeing how much the function changed. The variable grad
holds the full gradient in the end.
$\frac{\partial f}{\partial x_i} \approx \frac{f(x_1, \dots, x_i + h, \dots, x_n) - f(x_1, \dots, x_i, \dots, x_n)}{h}$
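For instance, we can sanity-check the function above on a function whose gradient we know in closed form:

import numpy as np

f = lambda x: np.sum(x ** 2)          # analytic gradient is 2 * x
x = np.array([1.0, -2.0, 3.0])
print(eval_numerical_gradient(f, x))  # expect approximately [2., -4., 6.]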
Practical considerations:
- Step size ($h$): In theory, $h$ should approach zero, but in practice a very small but finite value (e.g., 1e-5) is sufficient. The smallest $h$ that avoids numerical issues is generally preferred.
- Centered difference formula: In practice it often works better to compute the numeric gradient using the centered difference formula $\frac{f(x + h) - f(x - h)}{2h}$. For more details, see Wikipedia on Numerical Differentiation.
With a very small $h = 10^{-15}$, floating-point precision limitations can cause $f(1 + h) - f(1)$ to incorrectly evaluate as $0.0$ or an inaccurate small value, leading to significant errors in the gradient calculation.
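A quick illustration of this failure mode, using $f(x) = x^2$ whose true derivative at $x = 1$ is exactly 2:

f = lambda x: x ** 2
for h in [1e-5, 1e-10, 1e-15]:
  print('h = %g  approx derivative = %.12f' % (h, (f(1.0 + h) - f(1.0)) / h))
# with h = 1e-5 the estimate is close to 2; by h = 1e-15,
# floating-point round-off dominates and the estimate degrades badly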
See the appendix at the end of this section for why the centered difference formula is more accurate.
We can use the function given above to compute the gradient at any point and for any function. Let's compute the gradient for the CIFAR-10 loss function at some random point in the weight space:
# to use the generic code above we want a function that takes a single argument
# (the weights in our case) so we close over X_train and Y_train
def CIFAR10_loss_fun(W):
  return L(X_train, Y_train, W)
W = np.random.rand(10, 3073) * 0.001 # random weight vector
df = eval_numerical_gradient(CIFAR10_loss_fun, W) # get the gradient
The gradient indicates the slope of the loss function along each dimension, which can be used to update the weights:
loss_original = CIFAR10_loss_fun(W) # the original loss
print('original loss: %f' % (loss_original, ))

# let's see the effect of multiple step sizes
for step_size_log in [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]:
  step_size = 10 ** step_size_log
  W_new = W - step_size * df # new position in the weight space
  loss_new = CIFAR10_loss_fun(W_new)
  print('for step size %e new loss: %f' % (step_size, loss_new))
# prints:
# original loss: 2.200718
# for step size 1.000000e-10 new loss: 2.200652
# for step size 1.000000e-09 new loss: 2.200057
# for step size 1.000000e-08 new loss: 2.194116
# for step size 1.000000e-07 new loss: 2.135493
# for step size 1.000000e-06 new loss: 1.647802
# for step size 1.000000e-05 new loss: 2.844355
# for step size 1.000000e-04 new loss: 25.558142
# for step size 1.000000e-03 new loss: 254.086573
# for step size 1.000000e-02 new loss: 2539.370888
# for step size 1.000000e-01 new loss: 25392.214036
Update in negative gradient direction. The gradient indicates the steepest rate of increase of the loss function, so the weights are updated in the negative gradient direction to decrease the loss.
Effect of step size. The gradient indicates the direction of the steepest increase in the function, but it doesn't specify how far we should move along this direction. Determining the step size (often referred to as the learning rate) is a critical and often challenging hyperparameter to set when training a neural network. In our analogy of descending a hill while blindfolded, we can feel the slope beneath our feet pointing in a certain direction, but the optimal step length remains uncertain. Taking small, cautious steps may lead to steady but slow progress (analogous to a small step size). On the other hand, taking large, confident strides might aim for faster descent but risks overshooting the target. As shown in the code example, larger step sizes can sometimes lead to higher losses due to "overstepping."
Visualizing the effect of step size. Starting at a specific point $W$, we evaluate the gradient (or rather its negative, the direction of steepest decrease in the loss) and take a step along it. Small steps are likely to lead to consistent but slow progress; large steps can lead to faster progress but are riskier, and with a large enough step size we overshoot and make the loss worse. The step size (or learning rate, as we will later call it) will become one of the most important hyperparameters we have to tune carefully.
A problem of efficiency. You may have noticed that evaluating the numerical gradient has complexity linear in the number of parameters. For instance, in the example above, with 30,730 parameters, 30,731 evaluations of the loss function were required for a single parameter update. This cost grows significantly for modern neural networks, which often have tens of millions of parameters. Clearly, this strategy is not scalable and we need something better.
The numerical gradient, computed using finite difference approximations, is straightforward to implement. However, it comes with limitations: it is approximate (since it uses a small but finite value of $h$, while the true gradient is defined in the limit as $h$ goes to zero) and very expensive to compute. The second approach, the analytic gradient, uses calculus to derive a direct formula for the gradient: it is exact and fast, but more error-prone to implement. In practice it is therefore common to compute the analytic gradient and compare it to the numerical gradient to verify the implementation; this is called a gradient check.
Let's use the example of the SVM loss function for a single datapoint:
$$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^T x_i - w_{y_i}^T x_i + \Delta) \right]$$
Alternatively, we can represent this loss using an indicator function:
$L_i = \sum_{j \neq y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \cdot (w_j^T x_i - w_{y_i}^T x_i + \Delta)$
We can differentiate the function with respect to the weights. For example, taking the gradient with respect to $w_{y_i}$ we obtain:
$$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \right) x_i$$
Differentiate means taking the derivative. Treat $\mathbb{1}(\text{condition})$ as a constant, so it stays as is during differentiation.
Here, $\mathbb{1}(\cdot)$ is the indicator function: one if the condition inside is true, zero otherwise. While the expression may look scary when written out, implementing it in code amounts to counting the number of classes that did not meet the desired margin (and hence contributed to the loss), and scaling the data vector $x_i$ by this count to obtain the gradient of the correct-class row.
For the other rows of $W$, where $j \neq y_i$, the gradient is:
$$\nabla_{w_j} L_i = \mathbb{1}(w_j^T x_i - w_{y_i}^T x_i + \Delta > 0) \, x_i$$
While these expressions may look complex, implementing them in code is straightforward. You simply calculate the contributions of the data point to the gradients of the relevant weight rows and apply the gradient update accordingly.
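For instance, a minimal sketch for a single datapoint, following the two gradient expressions above (the function name and the explicit loop are just for clarity; real implementations vectorize this):

import numpy as np

def svm_grad_single(W, x, y, delta=1.0):
  """Analytic gradient of the single-example SVM loss w.r.t. W (a sketch).
  W: [num_classes x D], x: [D], y: integer label."""
  scores = W.dot(x)                        # one score per class
  dW = np.zeros_like(W)
  for j in range(W.shape[0]):
    if j == y:
      continue
    if scores[j] - scores[y] + delta > 0:  # the indicator 1(... > 0)
      dW[j] += x                           # row j != y_i: gradient is +x_i
      dW[y] -= x                           # correct-class row: accumulates -x_i
  return dW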
🌟🌟🌟Now that we can compute the gradient of the loss function, the procedure of repeatedly evaluating the gradient and then performing a parameter update is called Gradient Descent. Its vanilla version looks as follows:
# Vanilla Gradient Descent
while True:
  weights_grad = evaluate_gradient(loss_fun, data, weights)
  weights += - step_size * weights_grad # perform parameter update
This simple loop is at the core of all Neural Network libraries. There are other ways of performing the optimization (e.g. LBFGS), but Gradient Descent is currently by far the most common and established way of optimizing Neural Network loss functions. Throughout the class we will put some bells and whistles on the details of this loop (e.g. the exact details of the update equation), but the core idea of following the gradient until we're happy with the results will remain the same.
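To make the loop concrete, here is a tiny self-contained example that runs the same update rule on a simple quadratic loss (purely illustrative; the real `loss_fun` and `evaluate_gradient` operate on data and weights):

import numpy as np

weights = np.array([4.0, -3.0])          # start at an arbitrary point
step_size = 0.1
for i in range(100):
  weights_grad = 2 * weights             # analytic gradient of sum(weights**2)
  weights += - step_size * weights_grad  # perform parameter update
print(weights)                           # approaches [0, 0], the minimum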
Mini-batch gradient descent. In large-scale machine learning tasks (e.g., ILSVRC with over a million images), computing the gradient over the entire dataset for a single parameter update is computationally expensive. A more efficient approach is to compute the gradient using mini-batches—small subsets of the data. For example, in state-of-the-art Convolutional Neural Networks (ConvNets), a mini-batch typically contains 256 examples, sampled from a dataset of 1.2 million. Using this mini-batch, the gradient is computed, and parameters are updated:
A single parameter update refers to performing one gradient descent step: it updates all parameters in the model, i.e., the entire weight matrix $W$, not just $w_1$ or individual elements $w_{ij}$.
# Vanilla Minibatch Gradient Descent
while True:
  data_batch = sample_training_data(data, 256) # sample 256 examples
  weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
  weights += - step_size * weights_grad # perform parameter update
This approach works well because training examples are often correlated. To understand this, imagine an extreme case where all 1.2 million images in ILSVRC are duplicates of just 1000 unique images (one for each class). In such a scenario, the gradients computed for all duplicates would be identical, and the loss calculated over the entire dataset would be the same as using only the 1000 unique images. While real-world datasets do not contain exact duplicates, the gradient from a mini-batch is often a good approximation of the full dataset gradient. By using mini-batches, we can perform parameter updates more frequently, leading to faster convergence.
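This approximation is easy to check numerically. The sketch below uses synthetic data and a simple squared-error loss (all names and values here are illustrative, not from the original notes) to compare a mini-batch gradient against the full-dataset gradient:

import numpy as np

np.random.seed(0)
D, N = 10, 10000
X = np.random.randn(D, N)        # synthetic examples as columns
w_true = np.random.randn(D)
y = w_true.dot(X)                # targets from an underlying linear model
w = np.zeros(D)                  # current (untrained) weights

def grad(Xb, yb, w):
  # gradient of the squared-error loss 1/(2n) * sum_i (w . x_i - y_i)^2
  return Xb.dot(w.dot(Xb) - yb) / Xb.shape[1]

full = grad(X, y, w)                           # full-dataset gradient
idx = np.random.choice(N, 256, replace=False)  # a mini-batch of 256
mini = grad(X[:, idx], y[idx], w)

cos = full.dot(mini) / (np.linalg.norm(full) * np.linalg.norm(mini))
print('cosine similarity: %.3f' % cos)         # typically close to 1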
An extreme case of mini-batch Gradient Descent is when the batch size is reduced to a single training example. This is known as Stochastic Gradient Descent (SGD) or online Gradient Descent. While technically SGD refers to updating parameters based on one example at a time, in practice, the term "SGD" is often used interchangeably with mini-batch Gradient Descent, regardless of batch size.
Using a single example per update is less common because modern computational frameworks optimize better with vectorized operations. For instance, computing the gradient for 100 examples in a single operation is often more efficient than computing it for one example 100 times.
The size of the mini-batch is a hyperparameter that is rarely tuned through cross-validation. Instead, it is typically determined by memory constraints (if any) or set to a standard value such as 32, 64, or 128. Powers of 2 are often used because they enable faster computation in many vectorized implementations.
Summary of the information flow. The dataset of $(x_i, y_i)$ pairs is given and fixed. The weights start out as random numbers and can change. During the forward pass, the score function computes class scores, stored in a vector $f$. The loss function contains two components:
- The data loss, which measures how well the scores $f$ align with the labels $y$.
- The regularization loss, which depends only on the weights.
In Gradient Descent, we compute the gradient of the weights (and optionally the data) and use these gradients to iteratively update the parameters, minimizing the loss.
In this section,
- We developed the intuition of the loss function as a high-dimensional optimization landscape in which we are trying to reach the bottom. This was likened to a blindfolded hiker trying to reach the bottom of a valley. In particular, we observed that the SVM loss is piecewise-linear and bowl-shaped.
- We motivated the idea of optimizing the loss function with iterative refinement, starting with random weights and refining them step by step until the loss is minimized.
- We saw that the gradient of a function gives the steepest ascent direction, and we discussed a simple but inefficient way of computing it numerically using the finite difference approximation (the finite difference being the value of $h$ used in computing the numerical gradient).
- We highlighted the importance of choosing the correct step size (or learning rate) for parameter updates:
  - A small step size leads to slow but steady progress.
  - A large step size may accelerate progress but increases the risk of overshooting the minimum.
- We discussed the tradeoffs between computing the numerical and analytic gradient:
  - The numerical gradient is simple to compute but approximate and computationally expensive.
  - The analytic gradient is exact and efficient but prone to implementation errors due to the mathematical derivation involved.
  - Hence, in practice we always use the analytic gradient and then perform a gradient check, in which its implementation is compared to the numerical gradient.
- We introduced the Gradient Descent algorithm, which iteratively computes gradients and updates the parameters in a loop.
Coming up: The key takeaway from this section is the ability to compute gradients of a loss function with respect to its weights and to understand their significance. This skill is fundamental for designing, training, and analyzing neural networks. **🌟🌟🌟In the next section we will develop proficiency in computing the gradient analytically using the chain rule, also referred to as backpropagation.** This will allow us to efficiently optimize relatively arbitrary loss functions that express all kinds of Neural Networks, including Convolutional Neural Networks.
Appendix: why the centered difference formula is more accurate. For a sufficiently smooth function $f$, the centered difference formula $\frac{f(x + h) - f(x - h)}{2h}$ has an error term proportional to $h^2$, making it second-order accurate.
The forward difference formula $\frac{f(x + h) - f(x)}{h}$, by contrast, has an error proportional to $h$, making it only first-order accurate. The centered difference formula is more accurate because the even-order terms of the Taylor expansion cancel, achieving second-order accuracy.
First-order accuracy (forward difference):
- The error term scales linearly with the step size $h$. For example, in the forward difference formula:
$$\frac{f(x + h) - f(x)}{h} = f'(x) + \mathcal{O}(h)$$
- The error is proportional to $h$, meaning that if $h$ is halved, the error is also roughly halved.
Second-order accuracy (centered difference):
- The error term scales quadratically with $h$. For example, in the centered difference formula:
$$\frac{f(x + h) - f(x - h)}{2h} = f'(x) + \mathcal{O}(h^2)$$
- The error is proportional to $h^2$, meaning that if $h$ is halved, the error decreases by a factor of 4 (since $h^2$ scales quadratically).
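This scaling is straightforward to verify numerically, e.g. for $f(x) = \sin(x)$ with known derivative $\cos(x)$:

import numpy as np

x = 1.0
for h in [1e-2, 5e-3, 2.5e-3]:
  forward = (np.sin(x + h) - np.sin(x)) / h
  centered = (np.sin(x + h) - np.sin(x - h)) / (2 * h)
  print('h = %.4f  forward error = %.2e  centered error = %.2e'
        % (h, abs(forward - np.cos(x)), abs(centered - np.cos(x))))
# halving h roughly halves the forward error,
# while the centered error shrinks by about a factor of 4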
Large $h$: the approximation deviates from the true gradient because of truncation error from the ignored higher-order terms. When $h$ is very small, floating-point round-off error dominates instead, as discussed earlier.
Truncation error arises when we approximate a function using only part of its Taylor series expansion, ignoring the higher-order terms. These ignored terms contribute to the overall error in the approximation.
For example, for a smooth function $f$, the Taylor expansion around $x$ is:
$$f(x + h) = f(x) + h f'(x) + \frac{h^2}{2} f''(x) + \frac{h^3}{6} f'''(x) + \cdots$$
If we approximate the derivative using the forward difference formula:
$$\frac{f(x + h) - f(x)}{h} = f'(x) + \frac{h}{2} f''(x) + \frac{h^2}{6} f'''(x) + \cdots$$
we are effectively ignoring all terms beyond the first-order term. These ignored terms contribute to the truncation error, which depends on $h$ and on the higher-order derivatives of $f$.