Commit 277f01f

linear-classify.md - regularizer: SQUARED l2 norm
Isn't the regularizer we are using for W the SQUARED L2 norm? If we were using the plain L2 norm we would need to take the square root over the entire sum of squares, but we are not doing that; what we actually compute is R(W) = ||W||^2. It is just a small detail, I know, but it has been bugging me. Best (: peglegpete
1 parent affb8b8 commit 277f01f
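
To make the distinction in the commit message concrete, the plain L2 norm of W takes a square root over the full sum of squares, while the regularizer in the notes keeps the raw sum of squares, i.e. the squared norm:

$$
\|W\|_2 = \sqrt{\sum_k\sum_l W_{k,l}^2},
\qquad
R(W) = \|W\|_2^2 = \sum_k\sum_l W_{k,l}^2
$$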

File tree: 1 file changed (+1 −1 lines)

linear-classify.md

Lines changed: 1 addition & 1 deletion
@@ -160,7 +160,7 @@ A last piece of terminology we'll mention before we finish with this section is

 **Regularization**. There is one bug with the loss function we presented above. Suppose that we have a dataset and a set of parameters **W** that correctly classify every example (i.e. all scores are so that all the margins are met, and \\(L_i = 0\\) for all i). The issue is that this set of **W** is not necessarily unique: there might be many similar **W** that correctly classify the examples. One easy way to see this is that if some parameters **W** correctly classify all examples (so loss is zero for each example), then any multiple of these parameters \\( \lambda W \\) where \\( \lambda > 1 \\) will also give zero loss because this transformation uniformly stretches all score magnitudes and hence also their absolute differences. For example, if the difference in scores between a correct class and a nearest incorrect class was 15, then multiplying all elements of **W** by 2 would make the new difference 30.

-In other words, we wish to encode some preference for a certain set of weights **W** over others to remove this ambiguity. We can do so by extending the loss function with a **regularization penalty** \\(R(W)\\). The most common regularization penalty is the **L2** norm that discourages large weights through an elementwise quadratic penalty over all parameters:
+In other words, we wish to encode some preference for a certain set of weights **W** over others to remove this ambiguity. We can do so by extending the loss function with a **regularization penalty** \\(R(W)\\). The most common regularization penalty is the squared **L2** norm that discourages large weights through an elementwise quadratic penalty over all parameters:

 $$
 R(W) = \sum_k\sum_l W_{k,l}^2
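
As a quick check of the formula in the changed hunk, here is a minimal NumPy sketch of the squared L2 penalty; the function name `l2_regularization` and the `reg` strength argument are illustrative, not taken from the notes.

```python
import numpy as np

def l2_regularization(W, reg=1e-3):
    """Squared L2 penalty R(W) = sum_k sum_l W[k, l]**2, scaled by reg.

    W   : weight matrix (any shape); the sum runs over every element.
    reg : regularization strength (illustrative name, not from the notes).
    """
    return reg * np.sum(W * W)

# Doubling W quadruples the penalty, which is what breaks the
# "any multiple of W also gives zero loss" ambiguity described above.
W = np.array([[1.0, -2.0], [0.5, 3.0]])
print(l2_regularization(W, reg=1.0))      # 14.25
print(l2_regularization(2 * W, reg=1.0))  # 57.0
```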
