Shrinkage with LASSO and Ridge

Content for Thursday, October 26, 2023

Required Reading

  • This page.

Guiding Questions

  • What is shrinkage?
  • What do we do with too many right-hand-side variables?
  • What is LASSO?

This week is a little different

We will use the lecture slides from my friend and ace economist, Ed Rubin (U of Oregon). They are available right here. A couple of things first, though:

The data and packages

Dr. Rubin uses the Credit dataset from the ISLR package (which you may need to install). You’ll also want to install the wooldridge package for our later work:

library(ISLR)
credit = ISLR::Credit
head(credit)
##   ID  Income Limit Rating Cards Age Education Gender Student Married Ethnicity
## 1  1  14.891  3606    283     2  34        11   Male      No     Yes Caucasian
## 2  2 106.025  6645    483     3  82        15 Female     Yes     Yes     Asian
## 3  3 104.593  7075    514     4  71        11   Male      No      No     Asian
## 4  4 148.924  9504    681     3  36        11 Female      No      No     Asian
## 5  5  55.882  4897    357     2  68        16   Male      No     Yes Caucasian
## 6  6  80.180  8047    569     4  77        10   Male      No      No Caucasian
##   Balance
## 1     333
## 2     903
## 3     580
## 4     964
## 5     331
## 6    1151

We will also need to load the caret package (which you’ve used before), as well as the glmnet package, which is new for us.
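
If you haven’t installed these yet, something like the following will get everything in place (the install.packages() line is commented out so it only runs when you actually need it):

# install.packages(c("ISLR", "caret", "glmnet", "wooldridge"))
library(ISLR)        # Credit data
library(caret)       # test/train tools we've used before
library(glmnet)      # LASSO and ridge
library(wooldridge)  # wage2 data for the Try it! exercise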

Terminology

I used to have this lecture at the end of the semester, but I think the intuition behind LASSO and ridge regression helps us understand our “overfitting” problem. There are two terms I want to cover before we dive into the slides:

  1. Bias vs. Variance: We saw that super-overfit polynomial last week (where we took 20 observations and fit a 16th-degree polynomial). The model we fit was very flexible and bendy, and it did get most of the data points right. It had low bias, since it was generally right on average, but it had huge variance: its predictions swung wildly, even within a small range of spending on a given advertising mode. Bias vs. variance refers to the innate tradeoff between these two things. When we used the train and test samples to get the best out-of-sample fit, we were balancing bias and variance.

  2. Cross validation: This is the term for using two (or more) different subsets of the sample data (e.g. train and test) to fit and then evaluate a model.
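
For example, here is a minimal sketch of a train/test split on the credit data using caret’s createDataPartition (the 80/20 split is just an illustrative choice):

library(caret)

set.seed(1234)  # make the split reproducible

# put 80% of rows in the training set, the rest in the test set
train_rows   = createDataPartition(credit$Balance, p = 0.8, list = FALSE)
credit_train = credit[train_rows, ]
credit_test  = credit[-train_rows, ]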

Now, back to the lecture notes for today

Try it!

Try it! Use the wooldridge::wage2 dataset and LASSO to predict wage. Since these shrinkage methods (LASSO, ridge) work well with many right-hand-side variables, we can create some additional variables. A good candidate would be squared versions of all of the numeric variables. To do this, we’ll use mutate_if along with is.numeric. The function mutate_if, when used in a pipe %>%, checks each column against the given predicate function and, when it returns TRUE, adds a new mutation of that column. We need this because we don’t want to add a squared term for things like factor variables. It will look something like this:

library(dplyr)

data %>%
  mutate_if(is.numeric, list(squared = function(x) x^2))

We pass a list that could hold many functions, though there is only one here in the example. For each function in the list, mutate_if will create a copy of each numeric column, apply the function (here, squaring), and name the new column by appending the list name (here, ‘squared’) to the original column name.

This will result in some columns we don’t want to keep: wage_squared should be dropped since wage is the outcome we want to predict. We should also drop lwage and lwage_squared, since those are transformations of our outcome. We want our data to contain only the outcome and the possible predictors, so that we can simply pass everything but wage as our X’s and wage as our y.
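
Here is a minimal sketch of that prep, assuming the binary variables in wage2 turn out to be married, black, south, and urban (verify this against the wage2 documentation before relying on it):

library(dplyr)
library(wooldridge)

wage2_sq = wage2 %>%
  # add a squared copy of every numeric column
  mutate_if(is.numeric, list(squared = function(x) x^2)) %>%
  # drop the squared outcome and the log-outcome transformations
  select(-wage_squared, -lwage, -lwage_squared) %>%
  # drop squared copies of the (assumed) binary variables; they are identical to the originals
  select(-married_squared, -black_squared, -south_squared, -urban_squared)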

  1. Drop all of the copies of the outcome variable and transformations of the outcome variable. Also, drop any of the squared terms of binary variables – if a variable is \(\{0,1\}\), then the squared term is exactly the same. For instance, married and married_squared are exactly the same.

  2. How many predictors (right-hand-side variables) do we have?

  3. Use glmnet from the glmnet package to run a LASSO (alpha = 1) using all of the data you’ve assembled from the wage2 dataset. Select a wide range of values for lambda. You can use cv.glmnet to do the cross-validation for you (see the Rubin lecture notes and the sketch after these questions), or you can do it manually as we did with KNN and regression trees (test-train split, etc.).

  4. Find the lambda that minimizes RMSE in the test data. A cv.glmnet object stores the candidate lambdas in object$lambda, and the corresponding cross-validated RMSEs are sqrt(object$cvm). These can be used to make the plots in the Rubin lecture notes and to find the lambda that minimizes RMSE; object$lambda.min will also tell you the optimal lambda directly.

  5. Using the optimal lambda, run a final LASSO model.

  6. Use coef to extract the coefficients from the optimal model. Coefficients that are zeroed out by the LASSO are shown as ‘.’

  • Which variables were “kept” in the model?
  • Which variables were eliminated from the model at the optimal lambda?
  • Is the train RMSE lower than it would be if we just ran OLS with all of the variables?
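
To give you a sense of the whole workflow, here is a hedged sketch of steps 3 through 6. It assumes the wage2_sq data frame built in the prep sketch above; the lambda grid and the decision to drop rows with missing values are my assumptions, not the only reasonable choices.

library(glmnet)

# glmnet needs a numeric X matrix and a y vector, and it cannot handle NAs.
# Dropping incomplete rows is one simple (assumed) way to deal with missing values.
wage2_complete = na.omit(wage2_sq)

y = wage2_complete$wage
X = as.matrix(wage2_complete[, names(wage2_complete) != "wage"])

# cross-validated LASSO (alpha = 1) over a wide (assumed) grid of lambdas
set.seed(1234)
cv_lasso = cv.glmnet(X, y, alpha = 1, lambda = 10^seq(3, -3, length.out = 100))

# RMSE at each lambda, and the lambda that minimizes it
rmse_by_lambda = sqrt(cv_lasso$cvm)
best_lambda    = cv_lasso$lambda.min

# final LASSO at the optimal lambda; coefficients shown as '.' were zeroed out
final_lasso = glmnet(X, y, alpha = 1, lambda = best_lambda)
coef(final_lasso)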