---
marp: true
math: true
---
- Understand the gradient descent learning algorithm.
- Learn about its main issues.
- Discover some of the main GD optimization techniques.
- The model's parameters are iteratively updated until an optimum is reached.
- Each GD iteration combines two steps: computing the gradient of the loss function, then using it to update the model's parameters.
A gradient expresses the variation of a function relative to the variation of its parameters.
- $\nabla_{\pmb{\omega}}\mathcal{L}(\pmb{\omega})$: gradient of the loss function $\mathcal{L}(\pmb{\omega})$.
- $\frac{\partial}{\partial \omega_i} \mathcal{L}(\pmb{\omega})$: partial derivative of the loss function w.r.t. its $i$th parameter.
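Stacking all partial derivatives gives the gradient vector (a standard definition, restated here for clarity, with $n$ the number of parameters):

$$\nabla_{\pmb{\omega}}\mathcal{L}(\pmb{\omega}) = \begin{pmatrix}
  \frac{\partial}{\partial \omega_1} \mathcal{L}(\pmb{\omega}) \\
  \vdots \\
  \frac{\partial}{\partial \omega_n} \mathcal{L}(\pmb{\omega})
\end{pmatrix}$$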
In order to reduce loss for the next iteration, parameters are updated in the opposite direction of the gradient.
- $\pmb{\omega_{t}}$: set of parameters at step $t$ of the gradient descent.
- $\pmb{\omega_{t+1}}$: set of parameters at step $t+1$ (after update).
- $\eta$ (sometimes denoted $\alpha$ or $\lambda$): update factor for parameters, called the learning rate.
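Putting these notations together, the gradient descent update rule can be written as:

$$\pmb{\omega_{t+1}} = \pmb{\omega_t} - \eta \nabla_{\pmb{\omega}}\mathcal{L}(\pmb{\omega_t})$$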
The gradient is computed on the whole dataset before model parameters are updated.
- Advantages: simple and safe (always converges in the right direction).
- Drawback: can become slow and even intractable with a big dataset.
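As an illustration, a minimal NumPy sketch of batch GD on a linear regression task (the dataset, learning rate and iteration count are arbitrary choices, not taken from these slides):

```python
import numpy as np

# Toy dataset: y = 3x + 2 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = 3 * x + 2 + 0.1 * rng.normal(size=(200, 1))
X = np.hstack([x, np.ones_like(x)])  # add bias column

w = np.zeros((2, 1))  # parameters (slope, intercept)
eta = 0.1             # learning rate

for _ in range(500):
    # Gradient of the MSE loss, computed on the WHOLE dataset
    grad = 2 / len(X) * X.T @ (X @ w - y)
    w -= eta * grad  # update in the opposite direction of the gradient

print(w.ravel())  # close to [3, 2]
```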
The gradient is computed on a single sample, randomly chosen from the whole dataset, before parameters are updated.
- Advantages:
- Very fast.
- Enables learning from each new sample (online learning).
- Drawbacks:
- Convergence is not guaranteed.
- No vectorization of computations.
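Under the same toy setup as the previous sketch, the stochastic variant evaluates the gradient on one randomly drawn sample per update, so the computation cannot be vectorized across the dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = 3 * x + 2 + 0.1 * rng.normal(size=(200, 1))
X = np.hstack([x, np.ones_like(x)])  # add bias column

w = np.zeros((2, 1))
eta = 0.05  # learning rate

for _ in range(20):                      # passes over the dataset (epochs)
    for i in rng.permutation(len(X)):    # one randomly chosen sample at a time
        xi, yi = X[i : i + 1], y[i : i + 1]
        grad = 2 * xi.T @ (xi @ w - yi)  # gradient on a single sample
        w -= eta * grad                  # noisy but frequent updates

print(w.ravel())  # close to [3, 2], with some noise
```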
The gradient is computed on a small set of samples, called a batch, before parameters are updated.
- Combines the advantages of batch and stochastic GD.
- Default method for many ML libraries.
- The mini-batch size varies between 10 and 1000 samples, depending on the dataset size.
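And the mini-batch variant under the same assumptions (a batch size of 32 is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = 3 * x + 2 + 0.1 * rng.normal(size=(200, 1))
X = np.hstack([x, np.ones_like(x)])  # add bias column

w = np.zeros((2, 1))
eta = 0.1        # learning rate
batch_size = 32  # illustrative value

for _ in range(100):  # epochs
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start : start + batch_size]
        Xb, yb = X[batch], y[batch]
        # Gradient averaged over the current mini-batch only
        grad = 2 / len(Xb) * Xb.T @ (Xb @ w - yb)
        w -= eta * grad

print(w.ravel())  # close to [3, 2]
```

Setting `batch_size = 1` recovers stochastic GD, while `batch_size = len(X)` recovers batch GD.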
Momentum optimization accelerates descent in the direction of the minimum by accumulating previous gradients. It can also escape plateaux faster than plain GD.
- $\pmb{m_t}$: momentum at step $t$.
- $\beta_t \in [0,1]$: friction factor that prevents gradient updates from growing too large. A typical value is $0.9$.
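One common way to write the corresponding updates, consistent with the notations above (sign conventions vary between references):

$$\pmb{m_{t+1}} = \beta_t \pmb{m_t} - \eta \nabla_{\pmb{\omega}}\mathcal{L}(\pmb{\omega_t})$$

$$\pmb{\omega_{t+1}} = \pmb{\omega_t} + \pmb{m_{t+1}}$$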
RMSprop decays the learning rate differently for each parameter, scaling down the gradient vector along the steepest dimensions. The underlying idea is to adjust the descent direction a bit more towards the global minimum.
- $\pmb{v_t}$: moving average of squared gradients at step $t$.
- $\epsilon$: smoothing term to avoid divisions by zero. A typical value is $10^{-10}$.
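A common formulation of the corresponding updates, with $\beta$ the decay rate of the moving average (typically around $0.9$) and all operations applied element-wise:

$$\pmb{v_{t+1}} = \beta \pmb{v_t} + (1 - \beta) \left( \nabla_{\pmb{\omega}}\mathcal{L}(\pmb{\omega_t}) \right)^2$$

$$\pmb{\omega_{t+1}} = \pmb{\omega_t} - \frac{\eta}{\sqrt{\pmb{v_{t+1}}} + \epsilon} \nabla_{\pmb{\omega}}\mathcal{L}(\pmb{\omega_t})$$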
Adam (Adaptive Moment Estimation) combines the ideas of momentum and RMSprop. It is the de facto choice nowadays.
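For reference, a standard way to write the Adam update (Kingma & Ba, 2014), where $\beta_1$ and $\beta_2$ are the decay rates of the two moving averages (typical values $0.9$ and $0.999$) and hats denote bias-corrected estimates:

$$\pmb{m_{t+1}} = \beta_1 \pmb{m_t} + (1 - \beta_1) \nabla_{\pmb{\omega}}\mathcal{L}(\pmb{\omega_t})$$

$$\pmb{v_{t+1}} = \beta_2 \pmb{v_t} + (1 - \beta_2) \left( \nabla_{\pmb{\omega}}\mathcal{L}(\pmb{\omega_t}) \right)^2$$

$$\pmb{\omega_{t+1}} = \pmb{\omega_t} - \frac{\eta}{\sqrt{\pmb{\hat{v}_{t+1}}} + \epsilon} \pmb{\hat{m}_{t+1}}$$

with $\pmb{\hat{m}_{t+1}} = \pmb{m_{t+1}} / (1 - \beta_1^{t+1})$ and $\pmb{\hat{v}_{t+1}} = \pmb{v_{t+1}} / (1 - \beta_2^{t+1})$.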
Gradient descent optimization is a rich subfield of Machine Learning. Read more in this article.
- Finite difference approximation of derivatives.
- Interpretations: instantaneous rate of change, slope of the tangent.
- Generally unstable and limited to a small set of functions.
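For example, the forward difference approximation of a derivative, with a small step $h$:

$$f'(x) \approx \frac{f(x + h) - f(x)}{h}$$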
- Automatic manipulation of expressions for obtaining derivative expressions.
- Used in modern mathematical software (Mathematica, Maple...).
- Can lead to expression swell: exponentially large symbolic expressions.
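A quick illustration with SymPy (assuming it is available; any computer algebra system would do):

```python
import sympy as sp

x = sp.symbols("x")
expr = sp.sin(x) * x**2

# The derivative is obtained by manipulating the expression itself
print(sp.diff(expr, x))  # x**2*cos(x) + 2*x*sin(x)
```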
- Family of techniques for efficiently computing derivatives of numeric functions.
- Can differentiate closed-form math expressions, but also algorithms using branching, loops or recursion.
- AD combines numerical and symbolic differentiation.
- General idea: apply symbolic differentiation at the elementary operation level and keep intermediate numerical results.
- AD exists in two modes: forward and reverse. Both rely on the chain rule.
- Computes gradients w.r.t. one parameter along with the function output.
- Relies on dual numbers.
- Efficient when output dimension >> number of parameters.
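A minimal dual-number sketch of forward mode, supporting only addition and multiplication for the sake of illustration:

```python
class Dual:
    """Dual number a + b*eps, with eps**2 = 0. The eps coefficient carries the derivative."""

    def __init__(self, value, deriv):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule: (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)


def f(x, y):
    return x * x + x * y  # df/dx = 2x + y


# Seed dx = 1 and dy = 0 to get df/dx at (3, 2) along with the function value
out = f(Dual(3.0, 1.0), Dual(2.0, 0.0))
print(out.value, out.deriv)  # 15.0 8.0
```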
- Computes the function output, then does a backward pass to compute gradients w.r.t. all parameters for the output.
- Efficient when number of parameters >> output dimension.
Let's define a function and represent it as a computational graph:
Intermediate values are calculated and tensor operations are memorized for future gradient computations.
The chain rule is applied to compute every intermediate gradient, starting from the output.
Many Machine Learning libraries offer out-of-the-box support for automatic gradients computation. Some prominent ones are currently TensorFlow, PyTorch and JAX.
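For instance, a minimal PyTorch sketch of reverse-mode autodiff on an arbitrary function (chosen here for illustration only):

```python
import torch

x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(5.0, requires_grad=True)

# Forward pass: intermediate results and tensor operations are recorded
y = torch.log(x1) + x1 * x2 - torch.sin(x2)

# Backward pass: the chain rule is applied from the output back to the inputs
y.backward()

print(x1.grad)  # dy/dx1 = 1/x1 + x2 = 5.5
print(x2.grad)  # dy/dx2 = x1 - cos(x2) ≈ 1.716
```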
Aka software 2.0.
"People are now building a new kind of software by assembling networks of parameterized functional blocks and by training them from examples using some form of gradient-based optimization…. It’s really very much like a regular program, except it’s parameterized, automatically differentiated, and trainable/optimizable" (Y. LeCun).