Nonparametric Regression
Our data and goal
We want to use the `wooldridge::wage2` data on wages to generate and tune a non-parametric model of wages using a regression tree (or decision tree, same thing). Use `?wage2` (after you've loaded the `wooldridge` package) to see the data dictionary.
We've learned that our RMSE calculations have a hard time with `NA`s in the data. So let's use the `skim` output (specifically `n_missing`) to tell us which variables have too many `NA` values and should be dropped. Let's set a high bar here and drop anything that isn't 100% complete. Of course, there are other things we can do (impute the `NA`s, or make dummies for them), but for now, it's easiest to drop them.
Remember, `skim()` is for your own exploration; it's not something that you should include in your output.
Once cleaned, we should be able to use `rpart(wage ~ ., data = wage_clean)` and not have any `NA`s in our prediction. That's what we'll do in our first Breakout.
Breakout #1 - Predicting Wage
I'm going to have you form groups of 2-3 to work on live-coding. One person in the group will need to have RStudio up, be able to share their screen, and have the correct packages loaded (`caret`, `rpart`, and `rpart.plot`, plus `skimr`). Copy the `rmse` and `get_rmse` code into a blank R script (you don't have to use RMarkdown, just use a blank R script and run from there). You'll also want to have these functions from last week loaded:
# root mean squared error between actual and predicted values
rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

# RMSE for a fitted model on a dataset, with the response column given by name
get_rmse = function(model, data, response) {
  rmse(actual = subset(data, select = response, drop = TRUE),
       predicted = predict(model, data))
}
For the first breakout, all I want you to do is the following. We are trying to predict `wage` for this exercise:

- Estimate a default regression tree on `wage_clean` using the default parameters (use the whole dataset for now; we'll train-test on the problem set).
- Use `rpart.plot` to visualize your regression tree, and talk through the interpretation with each other.
- Calculate the RMSE for your regression tree.

We'll spend about 5 minutes on this. Remember, you can use `?wage2` to see the variable names. Make sure you know what variables are showing up in the plot and explaining `wage` in your model. You may find something odd at first and may need to drop more variables…
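If you get stuck, the whole exercise is only a few lines; a minimal sketch (assuming your cleaned data is named `wage_clean`):

library(rpart)
library(rpart.plot)

# 1. default regression tree on the full cleaned data
tree_default = rpart(wage ~ ., data = wage_clean)

# 2. visualize the splits and talk through what they mean
rpart.plot(tree_default)

# 3. in-sample RMSE using the helper functions above
get_rmse(tree_default, data = wage_clean, response = 'wage')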
5 minutes
Breakout #1 discussion
Let’s choose a group to share their plot and discuss the results. Use our sharing zoom link: bit.ly/KIRKPATRICK
Automating models
Let’s talk about a little code shortcut that helps iterate through your model selection.
First, recall that before, we used `list()` to store all of our models. This is because `list()` can "hold" anything at all, unlike a `matrix`, where every element must be the same type, or a `data.frame`, which needs all rows in a column to be the same data type. `list()` is also recursive, so each element in a list can be a list. Of lists. Of lists!

Lists are also really easy to add to iteratively. We can initiate an empty list using `myList <- list()`, then we can add things to it. Note that we use the double bracket `[[` to index:
myList <- list()
myList[['first']] = 'This is the first thing on my list'
print(myList)
## $first
## [1] "This is the first thing on my list"
Lists let you name the "containers" (much like you can name columns in a `data.frame`). Our first one is called "first". We can add more later, but the names always have to be unique:
myList[['second']] = c(1,2,3)
print(myList)
## $first
## [1] "This is the first thing on my list"
##
## $second
## [1] 1 2 3
And still more:
myList[['third']] = data.frame(a = c(1,2,3), b = c('Albert','Alex','Alice'))
print(myList)
## $first
## [1] "This is the first thing on my list"
##
## $second
## [1] 1 2 3
##
## $third
## a b
## 1 1 Albert
## 2 2 Alex
## 3 3 Alice
Now we have a `data.frame` in there! We can use `lapply` to do something to every element in the list:
lapply(myList, length) # apply length() to each element of myList, one at a time
## $first
## [1] 1
##
## $second
## [1] 3
##
## $third
## [1] 2
We get back a list of equal length, but each (still-named) container now holds the length of the original element's contents. (For the `data.frame`, `length()` counts columns, which is why `third` is 2.)
Loops
R has a very useful looping construct that takes the form:
for(i in c('first','second','third')){
print(i)
}
## [1] "first"
## [1] "second"
## [1] "third"
Here, R is repeating the thing in the loop (`print(i)`) with a different value for `i` each time. R repeats only what is inside the curly brackets; when it reaches the closing curly bracket, it goes back to the top, changes `i` to the next element, and repeats.
We can use this to train our models. First, we clear our list object `myList` by setting it equal to an empty list. Then, we loop over some regression tree tuning parameters. We also have to figure out how to use the loop to set a unique name for each list container. To do this, we'll use `paste0('Tuning', i)`, which will result in a character string of `Tuning0` when `i = 0`, `Tuning0.01` when `i = 0.01`, etc.
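For example:

paste0('Tuning', 0.01)
## [1] "Tuning0.01"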
myList <- list() # resets the list. Otherwise, you'll just be adding to your old list!
for(i in c(0, 0.01, 0.02)){
myList[[paste0('Tuning',i)]] = rpart(wage ~ ., data = wage_clean, cp = i)
}
If you use `names(myList)`, you'll see the result of our `paste0` naming. If you want to see the plotted results, you can use:
rpart.plot(myList[['Tuning0.01']])
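And the names confirm the naming scheme the loop built:

names(myList)
## [1] "Tuning0"    "Tuning0.01" "Tuning0.02"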
Using `lapply` to repeat something
Once we have a list built, `lapply` lets us do something to each element of the list, and then returns a list of results (of the same length as the input).

Note that in the above `myList`, we don't actually have the `cp` parameter recorded anywhere except in the name of the list element. If we want to plot with `cp` on the x-axis, we'd need to have the number. Later on, we'll cover text as data and learn how to extract it, but an easier way is to `lapply` a function that extracts the `cp` used over the list.
You can retrieve the `cp` and `minsplit` arguments from the tree by noting that the rpart object is itself technically a list, and it has an element called `control` that is itself a list, and that list has the `cp` and `minsplit` values as used in the estimation. You can use the `$` accessor on lists:
myTree = myList[[2]]
myTree$control$cp
## [1] 0.01
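The same pattern retrieves `minsplit`; since we never set it, we'd expect rpart's default of 20:

myTree$control$minsplit
## [1] 20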
The only problem here is that we don't have a function written that does `$control$cp`. We can use an anonymous function in `lapply` as follows:
cp_list = lapply(myList, function(x){
x$control$cp
})
unlist(cp_list)
## Tuning0 Tuning0.01 Tuning0.02
## 0.00 0.01 0.02
Well, it gets us the right answer, but whaaaaaat is going on? Curly brackets? `x`?

This is an "anonymous function", or a function created on the fly. Here's how it works in `lapply`:
- The first argument is the list you want to do something to.
- The second argument would usually be the function you want to apply, like `get_rmse`.
- Here, we're going to ask R to temporarily create a function that takes one argument, `x`. `x` is going to be each list element in `myList`. Think of it as a loop:
  - Take the first element of `myList` and refer to it as `x`. Run the function's code.
  - Then take the second element of `myList`, refer to it as `x`, and run the function's code.
  - Repeat until all elements of `myList` have been used.
- Once the anonymous function has been applied to each element, the result is passed back and saved as the corresponding element of the output list, always in the same position from which the `x` was taken.
So you can think of it like this (with `lapply` building the output list `cp_list` rather than modifying `myList`):
x = myList[[1]]
cp_list[[1]] = x$control$cp
x = myList[[2]]
cp_list[[2]] = x$control$cp
x = myList[[3]]
cp_list[[3]] = x$control$cp
As long as every entry in `myList` is an rpart object with `control$cp`, then you'll get those values in a list. `unlist` just coerces the list to a vector. This is also the same way we get the RMSE, but instead of the anonymous function, we write one out and use that.
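For instance, since extra arguments to `lapply` get passed along to the function, a sketch of the RMSE version might look like this (assuming `wage_clean` and the `get_rmse` helper from above):

# one RMSE per tuned tree, passing data and response through lapply
myRMSE = lapply(myList, get_rmse, data = wage_clean, response = 'wage')
unlist(myRMSE)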
Lab 7
In Lab 7 (linear models), you were asked to make increasingly complex models by adding variables. In regression trees and kNN, complexity comes from a tuning parameter. But many people have asked about ways of doing Lab 7, so here are two possible solutions.
Solution 1: Make a list of models
First, we make a vector of all of our possible explanatory variables:
wagenames = names(wage2)
wagenames = wagenames[!wagenames %in% c('wage','lwage')]
wagenames
## [1] "hours" "IQ" "KWW" "educ" "exper" "tenure" "age"
## [8] "married" "black" "south" "urban" "sibs" "brthord" "meduc"
## [15] "feduc"
And we'll do all the interactions and quadratic terms, and put them all in one long vector. (One caveat: in R's formula syntax, a term like `educ^2` is not actually a quadratic unless wrapped in `I()`, so the `^2` terms here are redundant with the main effects; they're harmless, though, for building a long parameter list.)
wagenamesintx = expand.grid(wagenames, wagenames)
wagenamesintx = paste0(wagenamesintx[,1], ':', wagenamesintx[,2])
wagenamessquared = paste0(wagenames, '^2')
wageparameters = c(wagenames, wagenamesintx, wagenamessquared)
wageparameters[c(1,10,100,200)] # just to see a few
## [1] "hours" "south" "south:tenure" "exper:brthord"
Now we need to make a list of formulas that start with `wage ~` and include an ever-increasing number of entries in `wageparameters` (and, to be nested, each always includes the previous iteration). We'll do this by taking an index of `[1:N]` where `N` gets bigger.

There are 255 possible entries, so we sequence along that; since indexes must be whole numbers, we wrap the sequence in `round()`:
complexity = round(seq(from = 1, to = length(wageparameters), length.out = 15))
modelList = lapply(complexity, function(x){
list_formula_start = 'wage ~ '
list_formula_rhs = paste0(wageparameters[1:x], collapse = ' + ')
return(paste0(list_formula_start, list_formula_rhs))
})
print(modelList[1:3])
## [[1]]
## [1] "wage ~ hours"
##
## [[2]]
## [1] "wage ~ hours + IQ + KWW + educ + exper + tenure + age + married + black + south + urban + sibs + brthord + meduc + feduc + hours:hours + IQ:hours + KWW:hours + educ:hours"
##
## [[3]]
## [1] "wage ~ hours + IQ + KWW + educ + exper + tenure + age + married + black + south + urban + sibs + brthord + meduc + feduc + hours:hours + IQ:hours + KWW:hours + educ:hours + exper:hours + tenure:hours + age:hours + married:hours + black:hours + south:hours + urban:hours + sibs:hours + brthord:hours + meduc:hours + feduc:hours + hours:IQ + IQ:IQ + KWW:IQ + educ:IQ + exper:IQ + tenure:IQ + age:IQ"
Finally, we use `lapply` to estimate the models (`lm` accepts our formulas as character strings and coerces them itself):
myOLS = lapply(modelList, function(y){
lm(y, data = wage2)
})
lapply(myOLS[1:3], summary)
## [[1]]
##
## Call:
## lm(formula = y, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -839.72 -287.21 -52.38 200.46 2131.26
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 981.315 81.575 12.03 <2e-16 ***
## hours -0.532 1.832 -0.29 0.772
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 404.6 on 933 degrees of freedom
## Multiple R-squared: 9.033e-05, Adjusted R-squared: -0.0009814
## F-statistic: 0.08429 on 1 and 933 DF, p-value: 0.7716
##
##
## [[2]]
##
## Call:
## lm(formula = y, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -823.41 -209.85 -36.73 169.56 1945.67
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -333.4545 673.8072 -0.495 0.620853
## hours -12.1500 14.0223 -0.866 0.386553
## IQ 8.4166 6.9960 1.203 0.229395
## KWW -16.7569 12.7468 -1.315 0.189112
## educ 29.6737 48.0285 0.618 0.536903
## exper 9.9812 4.4862 2.225 0.026435 *
## tenure 2.9739 2.9258 1.016 0.309808
## age 8.9932 6.0635 1.483 0.138520
## married 178.8599 46.4966 3.847 0.000132 ***
## black -100.5218 56.4193 -1.782 0.075271 .
## south -23.3841 31.0616 -0.753 0.451828
## urban 179.0588 31.5547 5.675 2.1e-08 ***
## sibs 8.6548 7.9696 1.086 0.277895
## brthord -18.0249 11.7300 -1.537 0.124871
## meduc 7.1703 6.2323 1.151 0.250362
## feduc 8.8008 5.4714 1.609 0.108214
## hours:IQ -0.1226 0.1566 -0.783 0.434026
## hours:KWW 0.4812 0.2801 1.718 0.086282 .
## hours:educ 0.1991 1.0457 0.190 0.849036
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 354.1 on 644 degrees of freedom
## (272 observations deleted due to missingness)
## Multiple R-squared: 0.262, Adjusted R-squared: 0.2414
## F-statistic: 12.7 on 18 and 644 DF, p-value: < 2.2e-16
##
##
## [[3]]
##
## Call:
## lm(formula = y, data = wage2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -876.85 -208.34 -41.58 184.31 1964.55
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2526.42104 1778.98063 1.420 0.15606
## hours -28.84096 27.34871 -1.055 0.29203
## IQ -11.05195 14.41481 -0.767 0.44354
## KWW -32.51353 21.95935 -1.481 0.13921
## educ -65.27306 87.11894 -0.749 0.45399
## exper -56.47871 43.73030 -1.292 0.19700
## tenure 61.99723 27.55945 2.250 0.02482 *
## age -20.60242 53.42036 -0.386 0.69987
## married 174.59592 291.22705 0.600 0.54904
## black 129.07837 389.95580 0.331 0.74075
## south -219.74759 209.99378 -1.046 0.29576
## urban 294.24907 215.20016 1.367 0.17201
## sibs 33.26592 57.47098 0.579 0.56291
## brthord -12.35032 81.74380 -0.151 0.87996
## meduc -14.65740 42.07236 -0.348 0.72767
## feduc 45.44872 37.63676 1.208 0.22767
## hours:IQ -0.13457 0.17547 -0.767 0.44345
## hours:KWW 0.50497 0.31608 1.598 0.11064
## hours:educ 1.70417 1.35782 1.255 0.20992
## hours:exper 1.28802 0.68683 1.875 0.06121 .
## hours:tenure -1.23958 0.44602 -2.779 0.00561 **
## hours:age -0.12173 0.82131 -0.148 0.88222
## hours:married 0.10483 6.63109 0.016 0.98739
## hours:black -5.90930 8.93377 -0.661 0.50856
## hours:south 4.48976 4.76877 0.941 0.34681
## hours:urban -2.64292 4.85060 -0.545 0.58604
## hours:sibs -0.59065 1.30658 -0.452 0.65138
## hours:brthord -0.14475 1.88695 -0.077 0.93888
## hours:meduc 0.54974 0.95729 0.574 0.56600
## hours:feduc -0.86832 0.83946 -1.034 0.30136
## IQ:KWW 0.14886 0.15533 0.958 0.33823
## IQ:educ 0.24425 0.57544 0.424 0.67138
## IQ:exper 0.09552 0.31666 0.302 0.76302
## IQ:tenure -0.04357 0.20709 -0.210 0.83342
## IQ:age 0.32399 0.41337 0.784 0.43347
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 353.2 on 628 degrees of freedom
## (272 observations deleted due to missingness)
## Multiple R-squared: 0.2838, Adjusted R-squared: 0.2451
## F-statistic: 7.32 on 34 and 628 DF, p-value: < 2.2e-16
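Solution 2: Use a loop
The loop pattern from earlier builds the same models; a minimal sketch, reusing modelList and wage2 from Solution 1 (myOLS2 is just an illustrative name):
myOLS2 = list()
for(i in seq_along(modelList)){
  # estimate the i-th formula and store it under a unique name
  myOLS2[[paste0('Model', i)]] = lm(modelList[[i]], data = wage2)
}
# e.g., summary(myOLS2[['Model1']]) should match myOLS[[1]] above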
Breakout #2
Using the loop method or the `lapply` method, generate 10 regression trees to explain `wage` in `wage_clean`. You can iterate through values of `cp`, the complexity parameter, or `minsplit`, the minimum number of observations a node must have before a split is attempted.
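Since we only varied `cp` above, here's a minimal sketch of the `minsplit` version (the ten values here are arbitrary):

treeList = list()
for(i in seq(10, 100, by = 10)){
  # rpart passes minsplit through to rpart.control
  treeList[[paste0('Minsplit', i)]] = rpart(wage ~ ., data = wage_clean, minsplit = i)
}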
10 minutes
Breakout #2 discussion
We'll ask someone to share (a few of) their group's 10 regression trees using `rpart.plot`. Use bit.ly/KIRKPATRICK to volunteer to share.
Breakout #3 (if time)
Use `lapply` to get a list of your RMSEs (one for each of your models). The anonymous function may come in handy here (though it's not necessary). Note that we are not yet splitting into `test` and `train` (which you will need to do on your lab assignment).

Once you have your list, create the plot of RMSEs similar to the one we looked at in Content this week. Note: you can use `unlist(myRMSE)` to get a numeric vector of the RMSEs (as long as all of the elements in `myRMSE` are numeric). Then, it's a matter of plotting, either with base `plot` or with `ggplot` (if you use `ggplot`, you'll have to `tidy` the RMSE by adding an index column or naming the x-axis).
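A minimal base-`plot` sketch of that last step (assuming `myRMSE` is your list of RMSEs, one per tuning value):

myRMSEvec = unlist(myRMSE)
# one point per model, in the order you tuned them
plot(seq_along(myRMSEvec), myRMSEvec, type = 'b',
     xlab = 'Model (in tuning order)', ylab = 'RMSE')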