class: middle, center, title-slide
Lecture 4: Training neural networks
Prof. Gilles Louppe
[email protected]
How to optimize parameters efficiently?
- Optimizers
- Initialization
- Normalization
class: middle
class: middle
???
This is our goal for today: to minimize the empirical risk.
class: middle
Training a massive deep neural network is long, complex and sometimes confusing.
A first step towards understanding, debugging and optimizing neural networks is to make use of visualization tools for
- plotting losses and metrics,
- visualizing computational graphs,
- or showing additional data as the network is being trained.
exclude: True class: middle
.center[TensorBoard]
class: middle
.center[Weights & Biases (wandb.ai)]
class: middle, red-slide, center
Let me say this once again: .bold[plot your losses].
To minimize $\mathcal{L}(\theta)$, standard .bold[batch] gradient descent (GD) consists in applying the update rule
$$\begin{aligned}
g_t &= \frac{1}{N} \sum_{n=1}^N \nabla_\theta \ell(y_n, f(\mathbf{x}_n; \theta_t)) \\
\theta_{t+1} &= \theta_t - \gamma g_t,
\end{aligned}$$
where $\gamma$ is the learning rate.
class: middle
class: middle
While it makes sense to compute the gradient exactly,
- it takes time to compute and becomes inefficient for large $N$,
- it is an empirical estimation of a hidden quantity (the expected risk), and any partial sum is also an unbiased estimate, although of greater variance.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
exclude: True
class: middle
To illustrate how partial sums are good estimates, consider an ideal case where the training set is the same set of $M \ll N$ samples replicated $K$ times. Then, the full gradient decomposes as $K$ identical sums over the $M$ distinct samples, so computing it on a single copy already gives the exact direction at a fraction of the cost.
Although this is an ideal case, there is redundancy in practice that results in similar behaviors.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
To reduce the computational complexity, stochastic gradient descent (SGD) consists in updating the parameters after every sample $$\begin{aligned} g_t &= \nabla_\theta \ell(y_{n(t)}, f(\mathbf{x}_{n(t)}; \theta_t)) \\ \theta_{t+1} &= \theta_t - \gamma g_t. \end{aligned}$$
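To make the update rule concrete, here is a minimal NumPy sketch of SGD on a toy least-squares problem; the data, model, and learning rate are illustrative assumptions, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # toy inputs
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])        # toy targets (noiseless)

theta = np.zeros(5)                                 # parameters
gamma = 0.01                                        # learning rate

for t in range(10_000):
    n = rng.integers(len(X))                        # visit one sample at random
    g = (X[n] @ theta - y[n]) * X[n]                # grad of 1/2 (x_n^T theta - y_n)^2
    theta = theta - gamma * g                       # SGD update
```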
class: middle
class: middle
While being computationally faster than batch gradient descent,
- gradient estimates used by SGD can be very noisy, which may help escape from local minima;
- but SGD does not benefit from the speed-up of batch processing.
class: middle
Instead, mini-batch SGD consists in visiting the samples in mini-batches and updating the parameters each time
$$
\begin{aligned}
g_t &= \frac{1}{B} \sum_{b=1}^B \nabla_\theta \ell(y_{n(t,b)}, f(\mathbf{x}_{n(t,b)}; \theta_t)) \\
\theta_{t+1} &= \theta_t - \gamma g_t,
\end{aligned}
$$
where the order $n(t,b)$ in which the samples are visited can be either sequential or random.
- Increasing the batch size $B$ reduces the variance of the gradient estimates and enables the speed-up of batch processing.
- The interplay between $B$ and $\gamma$ is still unclear.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
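A similar sketch for the mini-batch variant, again on an assumed toy least-squares problem with an illustrative batch size $B$; the only change is that the gradient estimate is averaged over $B$ randomly chosen samples.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # same toy least-squares setup
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0])

theta, gamma, B = np.zeros(5), 0.01, 32             # parameters, learning rate, batch size

for t in range(2_000):
    idx = rng.integers(len(X), size=B)              # a random mini-batch
    residuals = X[idx] @ theta - y[idx]             # shape (B,)
    g = X[idx].T @ residuals / B                    # gradient averaged over the batch
    theta = theta - gamma * g
```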
class: middle
The gradient descent method makes strong assumptions about
- the magnitude of the local curvature to set the step size,
- the isotropy of the curvature, so that the same step size $\gamma$ makes sense in all directions.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
???
Draw plot from lecture 2 (loss and surrogate).
class: middle
.center[
$\gamma=0.01$ ]
class: middle
.center[
$\gamma=0.01$ ]
???
Draw each slice.
class: middle
.center[
$\gamma=0.1$ ]
class: middle
.center[
$\gamma=0.4$ ]
???
Draw plots too fast/just right/too slow.
exclude: True
class: middle
Let us consider a function $f : \mathbb{R}^d \to \mathbb{R}$ to minimize along a descent direction $p \in \mathbb{R}^d$, starting from a point $x \in \mathbb{R}^d$.
For a step size $\gamma > 0$, the Wolfe conditions are:
- Sufficient decrease condition:
$$f(x + \gamma p) \leq f(x) + c_1 \gamma p^T \nabla f(x)$$
- Curvature condition:
$$c_2 p^T \nabla f(x) \leq p^T \nabla f(x + \gamma p)$$

Typical values are $c_1 = 10^{-4}$ and $c_2 = 0.9$.
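As an illustration (not from the original slides), a small NumPy helper checking both conditions for a candidate step size on an assumed toy quadratic objective:

```python
import numpy as np

def wolfe_conditions(f, grad, x, p, gamma, c1=1e-4, c2=0.9):
    """Return (sufficient decrease, curvature) booleans for step size gamma."""
    decrease = f(x + gamma * p) <= f(x) + c1 * gamma * (p @ grad(x))
    curvature = c2 * (p @ grad(x)) <= p @ grad(x + gamma * p)
    return decrease, curvature

f = lambda x: 0.5 * x @ x                           # toy quadratic objective
grad = lambda x: x
x = np.array([1.0, 2.0])
print(wolfe_conditions(f, grad, x, p=-grad(x), gamma=0.1))
```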
???
- The sufficient decrease condition ensures that the function decreases sufficiently, as predicted by the slope of f in the direction p.
- The curvature condition ensures that steps are not too short by ensuring that the slope has decreased (in magnitude) by some relative amount.
exclude: True
class: middle
.center[
.width-100[]
The sufficient decrease condition ensures that the function decreases sufficiently, as predicted by the slope of $f$ in the direction $p$.
]
.footnote[Credits: Wikipedia, Wolfe conditions.]
exclude: True
class: middle
.center[
.width-100[]
The curvature condition ensures that the slope has been reduced sufficiently.
]
.footnote[Credits: Wikipedia, Wolfe conditions.]
class: middle
The Wolfe conditions could be used to design line search algorithms that automatically determine a step size $\gamma_t$.
However, in deep learning,
- these algorithms are impractical because of the size of the parameter space and the overhead they would induce,
- they might lead to overfitting when the empirical risk is minimized too well.
class: middle
A fundamental result due to Bottou and Bousquet (2011) states that stochastic optimization algorithms (e.g., SGD) yield the best generalization performance (in terms of excess error) despite being the worst optimization algorithms for minimizing the empirical risk.
That is, .bold[for a fixed computational budget, stochastic optimization algorithms reach a lower test error than more sophisticated algorithms] (second-order methods, line search algorithms, etc.) that would fit the training error too well or would consume too large a part of the computational budget at every step.
class: middle
.footnote[Credits: Dive Into Deep Learning, 2020.]
.center.width-70[![](figures/lec4/floor.png)]
In the situation of small but consistent gradients, as through valley floors, gradient descent moves very slowly.
???
Hint: the optimization process is like a ball that we keep pushing in the direction of the gradient.
class: middle, black-slide
.footnote[Image credits: Kosta Derpanis, Deep Learning in Computer Vision, 2018]
class: middle
An improvement to gradient descent is to use momentum to add inertia in the choice of the step direction, that is,
$$\begin{aligned}
u_t &= \alpha u_{t-1} - \gamma g_t \\
\theta_{t+1} &= \theta_t + u_t.
\end{aligned}$$
- The new variable $u_t$ is the velocity. It corresponds to the direction and speed by which the parameters move as the learning dynamics progresses, modeled as an exponentially decaying moving average of negative gradients.
- Gradient descent with momentum has three nice properties:
  - it can go through local barriers,
  - it accelerates if the gradient does not change much,
  - it dampens oscillations in narrow valleys.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
???
Dampening arises because of the accumulation of the gradients, which make them cancel each other.
class: middle
The hyper-parameter $\alpha$ controls how recent gradients affect the current update.
- Usually, $\alpha=0.9$, with $\alpha > \gamma$.
- If at each update we observed $g$, the step would (eventually) be
$$u = -\frac{\gamma}{1-\alpha} g.$$
- Therefore, for $\alpha=0.9$, it is like multiplying the maximum speed by $10$ relative to the current direction.
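A minimal NumPy sketch of gradient descent with momentum, using the update $u_t = \alpha u_{t-1} - \gamma g_t$, $\theta_{t+1} = \theta_t + u_t$ on an assumed anisotropic quadratic objective:

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic loss 1/2 theta^T A theta
theta = np.array([2.0, -3.0])            # parameters
u = np.zeros_like(theta)                 # velocity
gamma, alpha = 0.01, 0.9                 # learning rate and momentum

for t in range(500):
    g = A @ theta                        # gradient of the toy loss
    u = alpha * u - gamma * g            # decaying average of negative gradients
    theta = theta + u                    # move along the velocity
```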
class: middle
class: middle
An alternative, Nesterov momentum, consists in simulating a step in the direction of the velocity, then calculating the gradient at this look-ahead point and making a correction.
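A sketch of the same toy setup with the look-ahead (Nesterov) variant, where the gradient is evaluated at $\theta_t + \alpha u_{t-1}$; the objective and hyper-parameters are illustrative:

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # same toy quadratic loss
theta = np.array([2.0, -3.0])
u = np.zeros_like(theta)
gamma, alpha = 0.01, 0.9

for t in range(500):
    lookahead = theta + alpha * u        # simulate a step along the velocity
    g = A @ lookahead                    # gradient at the look-ahead point
    u = alpha * u - gamma * g            # corrected velocity
    theta = theta + u
```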
class: middle
Vanilla gradient descent assumes the isotropy of the curvature, so that the same step size $\gamma$ makes sense in all directions.
.center[
.width-45[]
.width-45[]

Isotropic vs. Anisotropic
]
class: middle
AdaGrad scales each parameter's step down by the square root of the sum of squares of all its historical gradient values,
$$\begin{aligned}
r_t &= r_{t-1} + g_t \odot g_t \\
\theta_{t+1} &= \theta_t - \frac{\gamma}{\delta + \sqrt{r_t}} \odot g_t,
\end{aligned}$$
where the operations are applied element-wise and $\delta$ is a small constant for numerical stability.
- AdaGrad eliminates the need to manually tune the learning rate. Most implementations use $\gamma=0.01$ as default.
- It is good when the objective is convex.
- $r_t$ grows unboundedly during training, which may cause the step size to shrink and eventually become infinitesimally small.
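A minimal NumPy sketch of the AdaGrad update; the toy objective, learning rate, and the small constant $\delta$ are illustrative choices:

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic loss
theta = np.array([2.0, -3.0])
r = np.zeros_like(theta)                 # sum of squared gradients, per parameter
gamma, delta = 0.01, 1e-8

for t in range(500):
    g = A @ theta
    r = r + g * g                                      # accumulate squared gradients
    theta = theta - gamma / (delta + np.sqrt(r)) * g   # per-parameter step size
```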
???
Interpretation:
- very steep, hence large gradients, the step size shrinks and eventually becomes infinitesimally small.
- very shallow, hence small gradients, the step size grows and eventually becomes very large.
Draw the two cases.
class: middle
RMSProp is the same as AdaGrad, but accumulates an exponentially decaying average of the squared gradients instead,
$$\begin{aligned}
r_t &= \rho r_{t-1} + (1-\rho) \, g_t \odot g_t \\
\theta_{t+1} &= \theta_t - \frac{\gamma}{\delta + \sqrt{r_t}} \odot g_t.
\end{aligned}$$
- It performs better in non-convex settings.
- The accumulator $r_t$ does not grow unboundedly.
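The corresponding RMSProp sketch, identical to the AdaGrad one except for the exponentially decaying accumulator; hyper-parameters are illustrative:

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic loss
theta = np.array([2.0, -3.0])
r = np.zeros_like(theta)
gamma, rho, delta = 0.001, 0.9, 1e-8

for t in range(2_000):
    g = A @ theta
    r = rho * r + (1 - rho) * g * g                  # exponentially decaying average
    theta = theta - gamma / (delta + np.sqrt(r)) * g
```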
class: middle
Adam is similar to RMSProp with momentum, but with bias correction terms for the first and second moments.
- Good defaults are $\rho_1=0.9$ and $\rho_2=0.999$.
- Adam is one of the default optimizers in deep learning, along with SGD with momentum.
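A NumPy sketch of the Adam update with bias-corrected first and second moments, using the defaults above; the toy objective and learning rate are assumptions:

```python
import numpy as np

A = np.diag([1.0, 50.0])                             # toy quadratic loss
theta = np.array([2.0, -3.0])
s, r = np.zeros_like(theta), np.zeros_like(theta)    # first and second moments
gamma, rho1, rho2, delta = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 2_001):
    g = A @ theta
    s = rho1 * s + (1 - rho1) * g                    # first moment estimate
    r = rho2 * r + (1 - rho2) * g * g                # second moment estimate
    s_hat = s / (1 - rho1 ** t)                      # bias corrections
    r_hat = r / (1 - rho2 ** t)
    theta = theta - gamma * s_hat / (delta + np.sqrt(r_hat))
```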
class: middle
- Weight decay is a regularization technique that penalizes large weights.
- For vanilla SGD, it is equivalent to adding a penalty term to the loss function,
$$\mathcal{L}(\theta) + \frac{\lambda}{2} ||\theta||^2.$$
- For more complex optimizers (such as Adam), the two formulations are no longer equivalent; weight decay is then implemented by adding the decay term directly to the update rule,
$$\theta_{t+1} = \theta_t - \gamma \left(g_t + \lambda \theta_t \right).$$
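A sketch of the weight-decay update for vanilla SGD, where adding $\lambda \theta_t$ to the gradient and penalizing the loss coincide; the toy objective and $\lambda$ are illustrative:

```python
import numpy as np

A = np.diag([1.0, 50.0])                 # toy quadratic loss
theta = np.array([2.0, -3.0])
gamma, lam = 0.01, 1e-2                  # learning rate and weight decay strength

for t in range(500):
    g = A @ theta                                # gradient of the unpenalized loss
    theta = theta - gamma * (g + lam * theta)    # decay pulls weights towards zero
```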
class: middle
.center[
.width-45[]
.width-45[]

Training without (left) and with (right) weight decay.
]
.footnote[Credits: Dive Into Deep Learning, 2023.]
class: middle
Despite per-parameter adaptive learning rate methods, it is usually helpful to anneal the learning rate $\gamma$ over the course of training.
- Step decay: reduce the learning rate by some factor every few epochs (e.g., by half every 10 epochs).
- Exponential decay: $\gamma_t = \gamma_0 \exp(-kt)$, where $\gamma_0$ and $k$ are hyper-parameters.
- $1/t$ decay: $\gamma_t = \gamma_0 / (1+kt)$, where $\gamma_0$ and $k$ are hyper-parameters.
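The three schedules written out as small Python functions, with illustrative values for $\gamma_0$, $k$, and the step-decay factor:

```python
import numpy as np

gamma_0, k = 0.1, 0.01

def step_decay(t, factor=0.5, every=10):
    return gamma_0 * factor ** (t // every)          # e.g., halve every 10 epochs

def exponential_decay(t):
    return gamma_0 * np.exp(-k * t)

def one_over_t_decay(t):
    return gamma_0 / (1 + k * t)

for t in (0, 10, 50, 100):
    print(t, step_decay(t), exponential_decay(t), one_over_t_decay(t))
```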
class: middle
.center[
.width-70[]
.caption[Step decay scheduling for training ResNets.]
]
class: middle
class: middle
class: middle
In convex problems, provided a good learning rate $\gamma$, convergence is guaranteed regardless of the initial parameter values.
In the non-convex regime, initialization is much more important! Little is known about the mathematics of initialization strategies of neural networks.
- What is known: initialization should break symmetry.
- What is known: the scale of weights is important.
class: middle, center
(demo)
???
Take the time to explain, look at weights and gradients over time.
Take the time to insist again on the loss curves.
class: middle
A first strategy is to initialize the network parameters such that activations preserve the same variance across layers.
Intuitively, this ensures that the information keeps flowing during the forward pass, without reducing or magnifying the magnitude of input signals exponentially.
class: middle
Let us assume that
- we are in a linear regime at initialization (e.g., the positive part of a ReLU or the middle of a sigmoid),
- weights $w_{ij}^l$ are initialized i.i.d.,
- biases $b^l$ are initialized to $0$,
- input features are i.i.d., with a variance denoted as $\mathbb{V}[x]$.

Then, the variance of the activations of layer $l$ can be written as
$$\mathbb{V}\left[h_i^l\right] = \sum_{j=0}^{q_{l-1}-1} \mathbb{V}\left[ w_{ij}^l h_j^{l-1} \right] = \sum_{j=0}^{q_{l-1}-1} \mathbb{V}\left[ w_{ij}^l \right] \mathbb{V}\left[ h_j^{l-1} \right],$$
assuming zero-mean weights and activations.
???
Do it on blackboard.
Use
- V(AB) = V(A)V(B) + V(A)E(B)^2 + V(B)E(A)^2 (for independent A and B)
- V(A+B) = V(A) + V(B) + 2 Cov(A,B)
class: middle
Since the weights $w_{ij}^l$ at layer $l$ share the same variance $\mathbb{V}\left[w^l\right]$ and the variance of the activations in the previous layer is the same for all units, we can drop the indices and write
$$\mathbb{V}\left[h^l\right] = q_{l-1} \mathbb{V}\left[w^l\right] \mathbb{V}\left[h^{l-1}\right].$$
Therefore, the variance of the activations is preserved across layers when
$$\mathbb{V}\left[w^l\right] = \frac{1}{q_{l-1}}.$$
This condition is enforced in LeCun's uniform initialization, which is defined as
$$w_{ij}^l \sim \mathcal{U}\left[-\sqrt{\frac{3}{q_{l-1}}}, \sqrt{\frac{3}{q_{l-1}}}\right].$$
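A hedged NumPy sketch of this initialization for a stack of linear layers, checking empirically that the activation variance stays roughly constant in the forward pass; layer sizes and batch size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [100, 100, 100, 100, 100]            # layer widths q_0, ..., q_L

h = rng.normal(size=(512, sizes[0]))         # inputs with unit variance
for q_in, q_out in zip(sizes[:-1], sizes[1:]):
    a = np.sqrt(3.0 / q_in)                  # Var[U(-a, a)] = a^2 / 3 = 1 / q_in
    W = rng.uniform(-a, a, size=(q_in, q_out))
    h = h @ W                                # linear regime, zero biases
    print(h.var())                           # stays close to 1 across layers
```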
???
Var[ U(a,b) ] = 1/12 (b-a)^2
class: middle
A similar idea can be applied to ensure that the gradients flow in the backward pass (without vanishing nor exploding), by maintaining the variance of the gradient with respect to the activations fixed across layers.
Under the same assumptions as before, $$\begin{aligned} \mathbb{V}\left[ \frac{\text{d}\hat{y}}{\text{d} h_i^l} \right] &= \mathbb{V}\left[ \sum_{j=0}^{q_{l+1}-1} \frac{\text{d} \hat{y}}{\text{d} h_j^{l+1}} \frac{\partial h_j^{l+1}}{\partial h_i^l} \right] \\ &= \mathbb{V}\left[ \sum_{j=0}^{q_{l+1}-1} \frac{\text{d} \hat{y}}{\text{d} h_j^{l+1}} w_{j,i}^{l+1} \right] \\ &= \sum_{j=0}^{q_{l+1}-1} \mathbb{V}\left[\frac{\text{d} \hat{y}}{\text{d} h_j^{l+1}}\right] \mathbb{V}\left[ w_{ji}^{l+1} \right] \end{aligned}$$
class: middle
If we further assume that
- the gradients of the activations at layer $l$ share the same variance,
- the weights at layer $l+1$ share the same variance $\mathbb{V}\left[ w^{l+1} \right]$,

then we can drop the indices and write
$$ \mathbb{V}\left[ \frac{\text{d}\hat{y}}{\text{d} h^l} \right] = q_{l+1} \mathbb{V}\left[ \frac{\text{d}\hat{y}}{\text{d} h^{l+1}} \right] \mathbb{V}\left[ w^{l+1} \right]. $$
Therefore, the variance of the gradients with respect to the activations is preserved across layers when
$$\mathbb{V}\left[w^l\right] = \frac{1}{q_{l}}.$$
class: middle
We have derived two different conditions on the variance of $w^l$,
- $\mathbb{V}\left[w^l\right] = \frac{1}{q_{l-1}}$,
- $\mathbb{V}\left[w^l\right] = \frac{1}{q_{l}}$.

A compromise is the Xavier initialization, which initializes the weights such that
$$\mathbb{V}\left[w^l\right] = \frac{2}{q_{l-1} + q_l}.$$
For example, the normalized initialization is defined as
$$w_{ij}^l \sim \mathcal{U}\left[-\sqrt{\frac{6}{q_{l-1}+q_l}}, \sqrt{\frac{6}{q_{l-1}+q_l}}\right].$$
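A sketch of the normalized initialization with the same empirical variance check as before; layer sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(q_in, q_out):
    a = np.sqrt(6.0 / (q_in + q_out))        # so that Var[w] = 2 / (q_in + q_out)
    return rng.uniform(-a, a, size=(q_in, q_out))

h = rng.normal(size=(512, 100))              # inputs with unit variance
for _ in range(4):
    h = h @ glorot_uniform(100, 100)         # linear regime, zero biases
    print(h.var())                           # approximately preserved
```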
class: middle
.footnote[Credits: Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, 2010.]
class: middle
.footnote[Credits: Glorot and Bengio, Understanding the difficulty of training deep feedforward neural networks, 2010.]
class: middle
Because
class: middle
class: middle, center
(Back to demo)
class: middle
class: middle
Previous weight initialization strategies rely on keeping the activation variance constant across layers, under the assumption that the input feature variances are the same.
That is, $\mathbb{V}[x_i] = \mathbb{V}[x_j] \triangleq \mathbb{V}[x]$ for all pairs of features $i, j$.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
class: middle
In general, this constraint is not satisfied but can be enforced by standardizing the input data feature-wise,
$$\mathbf{x}' = \frac{\mathbf{x} - \hat{\mu}}{\hat{\sigma}},$$
where $\hat{\mu}$ and $\hat{\sigma}$ are the per-feature empirical mean and standard deviation estimated over the training data.
.footnote[Credits: Scikit-Learn, Compare the effect of different scalers on data with outliers.]
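A minimal sketch of feature-wise standardization; note that the statistics are estimated on the training data only and reused at test time. The toy data is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(1000, 4))   # toy data
X_test = rng.normal(loc=5.0, scale=3.0, size=(200, 4))

mu = X_train.mean(axis=0)                    # per-feature mean (training set only)
sigma = X_train.std(axis=0)                  # per-feature standard deviation

X_train_std = (X_train - mu) / sigma
X_test_std = (X_test - mu) / sigma           # reuse the training statistics at test time
```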
???
Make them think about what would happen if data is not standardized.
class: middle
Maintaining proper statistics of the activations and derivatives is critical for training neural networks.
This constraint can be enforced explicitly during the forward pass by re-normalizing them. Batch normalization was the first method to introduce this idea.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL; Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.]
class: middle
Let us consider a minibatch of samples at training, for which $\mathbf{u}_b \in \mathbb{R}^q$, $b=1, \ldots, B$, are the intermediate values computed at some location in the network.
In batch normalization following the node $\mathbf{u}$, the per-component mean and variance are first estimated on the batch, each $\mathbf{u}_b$ is standardized accordingly, and the result is rescaled and shifted with learnable per-component parameters (see the sketch below).
During testing, the mean and variance are those computed on the entire training set (in practice, running averages accumulated during training) and are used to standardize the activations.
.footnote[Credits: Francois Fleuret, EE559 Deep Learning, EPFL.]
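A NumPy sketch of batch normalization at training time for a minibatch of intermediate values of shape $(B, q)$; the learnable scale and shift parameters and the constant $\epsilon$ are written explicitly. This is a minimal illustration, not a reference implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(loc=2.0, scale=4.0, size=(32, 16))    # minibatch of activations, shape (B, q)

scale = np.ones(16)                                  # learnable, initialized to 1
shift = np.zeros(16)                                 # learnable, initialized to 0
eps = 1e-5

mu = u.mean(axis=0)                                  # per-component batch mean
var = u.var(axis=0)                                  # per-component batch variance
u_hat = (u - mu) / np.sqrt(var + eps)                # standardize across the batch
y = scale * u_hat + shift                            # rescale and shift

# At test time, mu and var are replaced by statistics estimated on the training set
# (in practice, running averages accumulated during training).
```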
???
Exercise: How does batch normalization combine with backpropagation?
class: middle
.footnote[Credits: Ioffe and Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, 2015.]
class: middle
Layer normalization is a variant of batch normalization that normalizes the activations across the features of each sample, rather than across the samples of each feature.
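By contrast, a layer-normalization sketch normalizes each sample across its features (axis 1 instead of axis 0); same illustrative tensors as above:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(loc=2.0, scale=4.0, size=(32, 16))    # shape (B, q)

scale, shift, eps = np.ones(16), np.zeros(16), 1e-5

mu = u.mean(axis=1, keepdims=True)                   # per-sample mean over features
var = u.var(axis=1, keepdims=True)                   # per-sample variance over features
y = scale * (u - mu) / np.sqrt(var + eps) + shift    # independent of the batch composition
```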
class: end-slide, center
count: false
The end.
???
Make them do a recap:
- good default optimizer?
- good default initialization?
- preprocessing?