
This is an excerpt from Manning's book Machine Learning with R, the tidyverse, and mlr.

What does the penalty we add to the least squares estimate look like? Two penalties are frequently used: the L1 norm and the L2 norm. I’ll start by showing you what the L2 norm is and how it works, because this is the regularization method used in ridge regression. Then I’ll extend this to show you how LASSO uses the L1 norm instead, and how elastic net combines both the L1 and L2 norms.
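Before looking at each penalty in detail, here is a minimal sketch in R of the general idea, using made-up toy data and a hypothetical helper function (neither comes from the book's park example): regularized regression minimizes the ordinary sum of squared errors plus a penalty on the size of the slope.

x <- c(1, 2, 3, 4, 5)                 # hypothetical toy predictor
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)       # hypothetical toy outcome

# Hypothetical illustration of a penalized least squares loss
penalizedLoss <- function(intercept, slope, lambda, penalty = c("l2", "l1")) {
  penalty <- match.arg(penalty)
  sse <- sum((y - (intercept + slope * x))^2)          # ordinary least squares part
  pen <- if (penalty == "l2") slope^2 else abs(slope)  # penalty on the slope only
  sse + lambda * pen                                   # larger slopes -> larger loss
}

penalizedLoss(0.1, 2.0, lambda = 1, penalty = "l2")    # ridge-style loss
penalizedLoss(0.1, 2.0, lambda = 1, penalty = "l1")    # LASSO-style loss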

library(tidyverse)    # loads tibble, tidyr (gather), and ggplot2

# ridgeCoefs and lmCoefs hold the coefficient estimates extracted earlier
# in the chapter; combine them into a single tibble, one row per coefficient
coefTibInts <- tibble(Coef = rownames(ridgeCoefs),
                      Ridge = as.vector(ridgeCoefs),
                      Lm = as.vector(lmCoefs))

# Gather into long format: one row per coefficient per model
coefUntidyInts <- gather(coefTibInts, key = Model, value = Beta, -Coef)

# Plot the estimates side by side, one panel per model
ggplot(coefUntidyInts, aes(reorder(Coef, Beta), Beta, fill = Model)) +
  geom_bar(stat = "identity", col = "black") +
  facet_wrap(~Model) +
  theme_bw() +
  theme(legend.position = "none")

# The intercepts are different. The intercept isn't included when
# calculating the L2 norm; it is simply the value of the outcome when
# all the predictors are zero. Because ridge regression shrinks the
# parameter estimates of the predictors, the intercept changes as a result.

11.3. What is the L2 norm, and how does ridge regression use it?

In this section, I’ll give you a mathematical and graphical explanation of the L2 norm, show how ridge regression uses it, and explain why you would use it. Imagine that you want to predict how busy your local park will be, depending on the temperature that day. An example of what this data might look like is shown in figure 11.4.

Ridge regression modifies the least squares loss function slightly to include a term that makes the function’s value larger, the larger the parameter estimates are. As a result, the algorithm now has to balance selecting the model parameters that minimize the sum of squares and selecting parameters that minimize this new penalty. In ridge regression, this penalty is called the L2 norm, and it is very easy to calculate: we simply square all of the model parameters and add them up (all except the intercept). When we have only one continuous predictor, we have only one parameter (the slope), so the L2 norm is its square. When we have two predictors, we square the slopes for each and then add these squares together, and so on. This is illustrated for our park example in figure 11.5.
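To make the calculation concrete, here is a minimal sketch in R using hypothetical slope values (not the estimates from the park example): the L2 norm is the sum of the squared slopes, and ridge regression adds this penalty, scaled by a tuning value, to the sum of squared residuals.

# Hypothetical slope estimates for two predictors (made-up values)
slopes <- c(temperature = 4.3, weekend = 12.1)

# The L2 norm penalty: square each slope and add the squares together
# (the intercept is not included)
l2norm <- sum(slopes^2)

l2norm
# 4.3^2 + 12.1^2 = 18.49 + 146.41 = 164.9

# Ridge regression adds this penalty, multiplied by a tuning value,
# to the sum of squared residuals, so larger slopes make the loss larger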

Figure 11.5. Calculating the sum of squares and the L2 norm for the slope between temperature and the number of people at the park.