
Commit 396a729

Fix typo in 'Gradient Descent, revisited' section

1 parent: 3f2731f

File tree

1 file changed: +11 −2 lines changed


math_differential_calculus.ipynb (+11 −2)

@@ -5392,7 +5392,7 @@
 "\n",
 "In Deep Learning, the letter $\\mathbf{x}$ is generally used to represent the input data. When you _use_ a neural network to make predictions, you feed the neural network the inputs $\\mathbf{x}$, and you get back a prediction $\\hat{y} = f(\\mathbf{x})$. The function $f$ treats the model parameters as constants. We can use more explicit notation by writing $\\hat{y} = f_\\mathbf{w}(\\mathbf{x})$, where $\\mathbf{w}$ represents the model parameters and indicates that the function relies on them, but treats them as constants.\n",
 "\n",
-"However, when _training_ a neural network, we do quite the opposite: all the training examples are grouped in a matrix $\\mathbf{X}$, all the labels are grouped in a vector $\\mathbf{y}$, and both $\\mathbf{X}$ and $\\mathbf{y}$ are treated as constants, while $\\mathbf{w}$ is treated as variable: specifically, we try to minimize the cost function $\\mathcal L_{\\mathbf{X}, \\mathbf{y}}(\\mathbf{w}) = g(f_{\\mathbf{X}}(\\mathbf{w}), \\mathbf{y})$, where $g$ is a function that measures the \"discrepancy\" between the predictions $f_{\\mathbf{X}}(\\mathbf{w})$ and the labels $\\mathbf{w}$, where $f_{\\mathbf{X}}(\\mathbf{w})$ represents the vector containing the predictions for each training example. Minimizing the loss function is usually performed using Gradient Descent (or a variant of GD): we start with random model parameters $\\mathbf{w}_0$, then we compute $\\nabla \\mathcal L(\\mathbf{w}_0)$ and we use this gradient vector to perform a Gradient Descent step, then we repeat the process until convergence. It is crucial to understand that the gradient of the loss function is with regards to the model parameters $\\mathbf{w}$ (_not_ the inputs $\\mathbf{x}$)."
+"However, when _training_ a neural network, we do quite the opposite: all the training examples are grouped in a matrix $\\mathbf{X}$, all the labels are grouped in a vector $\\mathbf{y}$, and both $\\mathbf{X}$ and $\\mathbf{y}$ are treated as constants, while $\\mathbf{w}$ is treated as variable: specifically, we try to minimize the cost function $\\mathcal L_{\\mathbf{X}, \\mathbf{y}}(\\mathbf{w}) = g(f_{\\mathbf{X}}(\\mathbf{w}), \\mathbf{y})$, where $g$ is a function that measures the \"discrepancy\" between the predictions $f_{\\mathbf{X}}(\\mathbf{w})$ and the labels $\\mathbf{y}$, where $f_{\\mathbf{X}}(\\mathbf{w})$ represents the vector containing the predictions for each training example. Minimizing the loss function is usually performed using Gradient Descent (or a variant of GD): we start with random model parameters $\\mathbf{w}_0$, then we compute $\\nabla \\mathcal L(\\mathbf{w}_0)$ and we use this gradient vector to perform a Gradient Descent step, then we repeat the process until convergence. It is crucial to understand that the gradient of the loss function is with regards to the model parameters $\\mathbf{w}$ (_not_ the inputs $\\mathbf{x}$)."
 ]
 },
 {
@@ -5995,8 +5995,17 @@
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
 "version": "3.7.6"
+},
+"pycharm": {
+"stem_cell": {
+"cell_type": "raw",
+"source": [],
+"metadata": {
+"collapsed": false
+}
+}
 }
 },
 "nbformat": 4,
 "nbformat_minor": 1
-}
+}
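
The corrected cell describes Gradient Descent on the model parameters: start from random parameters w_0, repeatedly compute the gradient of the loss with respect to w (not the inputs x), and step against it until convergence. As a minimal illustration of that procedure, here is a hedged NumPy sketch assuming a linear model f_X(w) = Xw and a mean-squared-error discrepancy g; the names gradient_descent, learning_rate, and n_steps are illustrative and do not come from the notebook.

import numpy as np

# Minimal sketch (assumption): linear model f_X(w) = X @ w and
# MSE discrepancy g(y_hat, y) = mean((y_hat - y)**2).

def loss(w, X, y):
    y_hat = X @ w                     # predictions f_X(w); X and y are constants here
    return np.mean((y_hat - y) ** 2)  # discrepancy g between predictions and labels

def gradient(w, X, y):
    # Gradient of the MSE loss with respect to w: (2/m) * X^T (Xw - y)
    m = len(y)
    return (2.0 / m) * X.T @ (X @ w - y)

def gradient_descent(X, y, learning_rate=0.1, n_steps=1000):
    rng = np.random.default_rng(42)
    w = rng.normal(size=X.shape[1])   # random initial parameters w_0
    for _ in range(n_steps):
        w = w - learning_rate * gradient(w, X, y)  # one Gradient Descent step
    return w

# Toy usage: 3 examples, 2 features (a constant bias column plus x)
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
w = gradient_descent(X, y)
print(w, loss(w, X, y))  # w approaches [1, 1] and the loss approaches 0

The same loop applies to any differentiable loss: only the gradient function changes, and throughout the iteration X and y stay fixed while w is the variable being optimized, exactly as the cell emphasizes.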
