- "However, when _training_ a neural network, we do quite the opposite: all the training examples are grouped in a matrix $\\mathbf{X}$, all the labels are grouped in a vector $\\mathbf{y}$, and both $\\mathbf{X}$ and $\\mathbf{y}$ are treated as constants, while $\\mathbf{w}$ is treated as variable: specifically, we try to minimize the cost function $\\mathcal L_{\\mathbf{X}, \\mathbf{y}}(\\mathbf{w}) = g(f_{\\mathbf{X}}(\\mathbf{w}), \\mathbf{y})$, where $g$ is a function that measures the \"discrepancy\" between the predictions $f_{\\mathbf{X}}(\\mathbf{w})$ and the labels $\\mathbf{w}$, where $f_{\\mathbf{X}}(\\mathbf{w})$ represents the vector containing the predictions for each training example. Minimizing the loss function is usually performed using Gradient Descent (or a variant of GD): we start with random model parameters $\\mathbf{w}_0$, then we compute $\\nabla \\mathcal L(\\mathbf{w}_0)$ and we use this gradient vector to perform a Gradient Descent step, then we repeat the process until convergence. It is crucial to understand that the gradient of the loss function is with regards to the model parameters $\\mathbf{w}$ (_not_ the inputs $\\mathbf{x}$)."