- "However, when _training_ a neural network, we do quite the opposite: all the training examples are grouped in a matrix $\\mathbf{X}$, all the labels are grouped in a vector $\\mathbf{y}$, and both $\\mathbf{X}$ and $\\mathbf{y}$ are treated as constants, while $\\mathbf{w}$ is treated as variable: specifically, we try to minimize the cost function $\\mathcal L_{\\mathbf{X}, \\mathbf{y}}(\\mathbf{w}) = g(f_{\\mathbf{X}}(\\mathbf{w}), \\mathbf{y})$, where $g$ is a function that measures the \"discrepancy\" between the predictions $f_{\\mathbf{X}}(\\mathbf{w})$ and the labels $\\mathbf{w}$, where $f_{\\mathbf{X}}(\\mathbf{w})$ represents the vector containing the predictions for each training example. Minimizing the loss function is usually performed using Gradient Descent (or a variant of GD): we start with random model parameters $\\mathbf{w}_0$, then we compute $\\nabla \\mathcal L(\\mathbf{w}_0)$ and we use this gradient vector to perform a Gradient Descent step, then we repeat the process until convergence. It is crucial to understand that the gradient of the loss function is with regards to the model parameters $\\mathbf{w}$ (_not_ the inputs $\\mathbf{x}$)."