There are several ways to define the details of the loss function. As a first example we will develop a commonly used loss called the **Multiclass Support Vector Machine** (SVM) loss. The SVM loss is set up so that the SVM "wants" the correct class for each image to have a score higher than the incorrect classes by some fixed margin \\(\Delta\\). Notice that it's sometimes helpful to anthropomorphise the loss functions as we did above: the SVM "wants" a certain outcome in the sense that the outcome would yield a lower loss (which is good).

Let's now get more precise. Recall that for the i-th example we are given the pixels of image \\( x\_i \\) and the label \\( y\_i \\) that specifies the index of the correct class. The score function takes the pixels and computes the vector \\( f(x\_i, W) \\) of class scores, which we will abbreviate to \\(s\\) (short for scores). For example, the score for the j-th class is the j-th element: \\( s\_j = f(x\_i, W)\_j \\). The Multiclass SVM loss for the i-th example is then formalized as follows:

$$
L\_i = \sum\_{j\neq y\_i} \max(0, s\_j - s\_{y\_i} + \Delta)
$$
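The formula translates to code almost line for line. Here is a minimal, unvectorized sketch (the function name `L_i` and its argument layout are our own choices for illustration, not fixed by the formula):

```python
import numpy as np

def L_i(scores, y, delta):
  """
  Multiclass SVM loss for a single example, as a direct
  transcription of the formula above.
  - scores: array of class scores s = f(x_i, W)
  - y: integer index of the correct class y_i
  - delta: the margin hyperparameter
  """
  loss = 0.0
  for j in range(len(scores)):
    if j == y:
      continue  # the sum runs only over the incorrect classes j != y_i
    loss += max(0, scores[j] - scores[y] + delta)
  return loss
```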
**Example.** Let's unpack this with an example to see how it works. Suppose that we have three classes that receive the scores \\( s = [13, -7, 11]\\), and that the first class is the true class (i.e. \\(y\_i = 0\\)). Also assume that \\(\Delta\\) (a hyperparameter we will go into more detail about soon) is 10. The expression above sums over all incorrect classes (\\(j \neq y\_i\\)), so we get two terms:

$$
L\_i = \max(0, -7 - 13 + 10) + \max(0, 11 - 13 + 10)
$$
You can see that the first term gives zero since [-7 - 13 + 10] gives a negative number, which is then thresholded to zero with the \\(\max(0,-)\\) function. We get zero loss for this pair because the correct class score (13) was greater than the incorrect class score (-7) by at least the margin 10. In fact the difference was 20, which is much greater than 10, but the SVM only cares that the difference is at least 10; any additional difference above the margin is clamped at zero with the max operation. The second term computes [11 - 13 + 10], which gives 8. That is, even though the correct class had a higher score than the incorrect class (13 > 11), it was not greater by the desired margin of 10. The difference was only 2, which is why the loss comes out to 8 (i.e. how much higher the difference would have to be to meet the margin). In summary, the SVM loss function wants the score of the correct class \\(y\_i\\) to be larger than the incorrect class scores by at least \\(\Delta\\) (delta). If this is not the case, we will accumulate loss.
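To double-check the arithmetic, here is the same computation on the example numbers, this time in a vectorized sketch (again, variable names are ours):

```python
import numpy as np

scores = np.array([13.0, -7.0, 11.0])  # the score vector s
y = 0        # the first class is the true class
delta = 10.0

margins = np.maximum(0, scores - scores[y] + delta)  # [10, 0, 8]
margins[y] = 0  # the j = y_i term is excluded from the sum
print(margins.sum())  # prints 8.0, matching the hand computation
```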
Note that in this particular module we are working with linear score functions ( \\( f(x\_i; W) = W x\_i \\) ), so we can also rewrite the loss function in this equivalent form: