Shrinkage with LASSO and Ridge
Required Reading
- This page.
Guiding Questions
- What is shrinkage?
- What do we do with too many right-hand-side variables?
- What is LASSO?
This week is a little different
We will use the lecture slides from my friend and ace economist, Ed Rubin (U of Oregon). They are available right here. A couple of things first, though:
The data and packages
Dr. Rubin uses the credit dataset from the ISLR package (which you may need to install). You'll also want to install the wooldridge package for our later work:
library(ISLR)
credit = ISLR::Credit
head(credit)
##   ID  Income Limit Rating Cards Age Education Gender Student Married Ethnicity
## 1  1  14.891  3606    283     2  34        11   Male      No     Yes Caucasian
## 2  2 106.025  6645    483     3  82        15 Female     Yes     Yes     Asian
## 3  3 104.593  7075    514     4  71        11   Male      No      No     Asian
## 4  4 148.924  9504    681     3  36        11 Female      No      No     Asian
## 5  5  55.882  4897    357     2  68        16   Male      No     Yes Caucasian
## 6  6  80.180  8047    569     4  77        10   Male      No      No Caucasian
##   Balance
## 1     333
## 2     903
## 3     580
## 4     964
## 5     331
## 6    1151
We will also need to load the caret package (which you've used before), as well as the glmnet package, which is new for us.
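If you need it, here is a minimal setup sketch. The install.packages() call is only needed once, and I'm also loading dplyr here on the assumption that we'll use its pipe and mutate_if for the exercise later.

# One-time installs (uncomment if you don't have these yet):
# install.packages(c("ISLR", "wooldridge", "caret", "glmnet", "dplyr"))

library(ISLR)       # Credit dataset used in the Rubin slides
library(wooldridge) # wage2 dataset for the Try it! exercise
library(caret)      # train/test split helpers we've used before
library(glmnet)     # LASSO and ridge regression
library(dplyr)      # %>% pipe and mutate_if for building squared terms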
Terminology
I used to have this lecture at the end of our semester, but I think the intuition behind LASSO and ridge regression helps us understand our "overfitting" problem. There are two terms I want to cover before we dive into the slides:
Bias vs. Variance: We saw that super-overfit polynomial last week (where we took 20 observations and fit a 16th-degree polynomial). The model we fit was very flexible and bendy, and it did get most of the data points right. It had low bias because it was generally right, but it had huge variance: it was all over the place, even within a small range of spending on a single advertising mode. The bias-variance tradeoff refers to the innate tension between these two things. When we used the train and test samples to get the best out-of-sample fit, we were balancing bias and variance.
Cross validation: This is the term for splitting the sample into two (or more) subsets (e.g. test and train), fitting the model on one subset, and evaluating it on the other.
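To make that concrete, here is a minimal sketch of a single train/test split using caret's createDataPartition, the same idea we used with KNN and regression trees. The 75/25 split, the seed, and the small Balance model are just illustrative choices; K-fold cross-validation, which cv.glmnet will do for us later, repeats this idea across several folds.

library(caret)
library(ISLR)

set.seed(101)  # make the random split reproducible
credit = ISLR::Credit

# Hold out 25% of the data as a test set
train_rows = createDataPartition(credit$Balance, p = 0.75, list = FALSE)
train_data = credit[train_rows, ]
test_data  = credit[-train_rows, ]

# Fit on the training data only...
fit = lm(Balance ~ Income + Limit + Rating, data = train_data)

# ...then evaluate out-of-sample fit on the test data
test_rmse = sqrt(mean((test_data$Balance - predict(fit, test_data))^2))
test_rmse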
Now, back to the lecture notes for today
Try it!
Use the wooldridge::wage2 dataset and LASSO to predict wage. Since these shrinkage methods (LASSO, ridge) work well with many right-hand-side variables, we can create some additional variables. A good candidate would be squared versions of all of the numeric variables. To do this, we'll use mutate_if along with is.numeric. The function mutate_if, when used in a pipe %>%, checks each column with the given predicate function (here, is.numeric) and, for each column where the predicate returns TRUE, adds the new mutation of that column. We need this because we don't want to add a squared term for things like factor variables. It will look something like this:
library(dplyr)  # provides %>% and mutate_if

data %>%
  mutate_if(is.numeric, list(squared = function(x) x^2))
We pass a list with potentially many functions, though there is only one here in the example. For each function passed, mutate_if will create a copy of each matching column, apply the function (here, squaring), and name the new column by appending the function's name (here, 'squared') to the original column name.
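For example, on a tiny made-up tibble (purely illustrative, not the wage2 data), the pipe below keeps the original numeric columns, adds x_squared and y_squared, and leaves the character column alone:

library(dplyr)

toy = tibble(x = 1:3, y = 4:6, grp = c("a", "b", "c"))

toy %>%
  mutate_if(is.numeric, list(squared = function(x) x^2))
# Result has columns: x, y, grp, x_squared, y_squared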
This will result in some columns we don't want to keep: wage_squared should be dropped since wage is the outcome we want to predict. We should also drop lwage and lwage_squared since those are transformations of our outcome. We want our data to contain only the outcome and all the possible predictors, so we can simply pass everything but wage as our X's and wage as our y.
- Drop all of the copies of the outcome variable and transformations of the outcome variable. Also, drop the squared terms of any binary variables: if a variable is \(\{0,1\}\), then its square is exactly the same variable. For instance, married and married_squared are identical.
- How many predictors (right-hand-side variables) do we have?
- Use glmnet from the glmnet package to run a LASSO (alpha=1) using all of the data you've assembled from the wage2 dataset. Select a wide range of values for lambda. You can use cv.glmnet to do the cross-validation for you (see the Rubin lecture notes), or you can do it manually as we did with KNN and regression trees (test-train split, etc.). A worked sketch of one possible approach follows this list.
- Find the lambda that minimizes RMSE in the test data. You can extract the sequence of lambdas tried from a cv.glmnet object by referring to object$lambda, and the corresponding RMSEs using sqrt(object$cvm). These can be used to make the plots in the Rubin lecture notes and to find the lambda that minimizes RMSE. object$lambda.min will also tell you the optimal lambda directly.
- Using the optimal lambda, run a final LASSO model.
- Use coef to extract the coefficients from the optimal model. Coefficients that are zeroed out by the LASSO are shown as '.'.
- Which variables were "kept" in the model?
- Which variables were eliminated from the model at the optimal lambda?
- Is the train RMSE lower than it would be if we just ran OLS with all of the variables?
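If you get stuck, here is a minimal sketch of one possible workflow, assuming cv.glmnet handles the cross-validation for you. The seed, the lambda grid, and the na.omit() step (glmnet cannot handle missing values, and wage2 has some) are my own illustrative choices, so treat this as a sketch rather than the answer key.

library(wooldridge)
library(dplyr)
library(glmnet)

set.seed(101)  # cv.glmnet assigns folds at random

# 1. Build the data: squared terms for every numeric column, then drop the
#    outcome transformations and the squares of the 0/1 dummy variables.
#    glmnet cannot handle missing values, so drop incomplete rows first.
wage2_sq = wage2 %>%
  na.omit() %>%
  mutate_if(is.numeric, list(squared = function(x) x^2)) %>%
  select(-lwage, -wage_squared, -lwage_squared,
         -married_squared, -black_squared, -south_squared, -urban_squared)

# How many predictors do we have? (everything except wage)
ncol(wage2_sq) - 1

# 2. glmnet wants a numeric matrix of X's and a vector y
X = as.matrix(select(wage2_sq, -wage))
y = wage2_sq$wage

# 3. Cross-validated LASSO (alpha = 1) over a wide range of lambdas
lambdas  = 10^seq(3, -3, length.out = 100)
lasso_cv = cv.glmnet(X, y, alpha = 1, lambda = lambdas)

# 4. RMSE at each lambda, and the lambda that minimizes it
cv_rmse = sqrt(lasso_cv$cvm)
plot(lasso_cv$lambda, cv_rmse, log = "x", type = "l",
     xlab = "lambda", ylab = "Cross-validated RMSE")
best_lambda = lasso_cv$lambda.min

# 5. Final LASSO at the optimal lambda
lasso_final = glmnet(X, y, alpha = 1, lambda = best_lambda)

# 6. Coefficients at the optimal lambda; zeroed-out ones print as '.'
coef(lasso_final)

# Compare train RMSE to OLS with all of the same variables
ols = lm(wage ~ ., data = wage2_sq)
sqrt(mean((y - predict(lasso_final, newx = X))^2))  # LASSO, in-sample
sqrt(mean(residuals(ols)^2))                        # OLS, in-sample

Note that the last two lines are in-sample (training) RMSEs, which is what the final question asks about; cv_rmse above is the out-of-sample, cross-validated version.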