Commit 277f01f

linear-classify.md - regularizer: SQUARED l2 norm
Isn't the regularizer we are using for W the SQUARED L2 norm? If we were using the plain L2 norm we would need to take the square root over the entire sum of squares, but we are not doing that; what we actually compute is R(W) = ||W||^2. It is just a small detail, I know, but it has been bugging me. Best (: peglegpete
1 parent affb8b8 commit 277f01f
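
To make the distinction in the commit message concrete, the plain L2 norm of W takes a square root over the full sum of squares, while the regularizer in the notes keeps the raw sum of squares, i.e. the squared norm:

$$
\|W\|_2 = \sqrt{\sum_k\sum_l W_{k,l}^2},
\qquad
R(W) = \|W\|_2^2 = \sum_k\sum_l W_{k,l}^2
$$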

File tree: 1 file changed (+1 −1 lines)

linear-classify.md

Lines changed: 1 addition & 1 deletion
@@ -160,7 +160,7 @@ A last piece of terminology we'll mention before we finish with this section is

 **Regularization**. There is one bug with the loss function we presented above. Suppose that we have a dataset and a set of parameters **W** that correctly classify every example (i.e. all scores are so that all the margins are met, and \\(L_i = 0\\) for all i). The issue is that this set of **W** is not necessarily unique: there might be many similar **W** that correctly classify the examples. One easy way to see this is that if some parameters **W** correctly classify all examples (so loss is zero for each example), then any multiple of these parameters \\( \lambda W \\) where \\( \lambda > 1 \\) will also give zero loss because this transformation uniformly stretches all score magnitudes and hence also their absolute differences. For example, if the difference in scores between a correct class and a nearest incorrect class was 15, then multiplying all elements of **W** by 2 would make the new difference 30.

-In other words, we wish to encode some preference for a certain set of weights **W** over others to remove this ambiguity. We can do so by extending the loss function with a **regularization penalty** \\(R(W)\\). The most common regularization penalty is the **L2** norm that discourages large weights through an elementwise quadratic penalty over all parameters:
+In other words, we wish to encode some preference for a certain set of weights **W** over others to remove this ambiguity. We can do so by extending the loss function with a **regularization penalty** \\(R(W)\\). The most common regularization penalty is the squared **L2** norm that discourages large weights through an elementwise quadratic penalty over all parameters:

 $$
 R(W) = \sum_k\sum_l W_{k,l}^2
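
As a quick check of the formula in the changed hunk, here is a minimal NumPy sketch of the squared L2 penalty; the function name `l2_regularization` and the `reg` strength argument are illustrative, not taken from the notes.

```python
import numpy as np

def l2_regularization(W, reg=1e-3):
    """Squared L2 penalty R(W) = sum_k sum_l W[k, l]**2, scaled by reg.

    W   : weight matrix (any shape); the sum runs over every element.
    reg : regularization strength (illustrative name, not from the notes).
    """
    return reg * np.sum(W * W)

# Doubling W quadruples the penalty, which is what breaks the
# "any multiple of W also gives zero loss" ambiguity described above.
W = np.array([[1.0, -2.0], [0.5, 3.0]])
print(l2_regularization(W, reg=1.0))      # 14.25
print(l2_regularization(2 * W, reg=1.0))  # 57.0
```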
