Advanced Model Building

This is a pretty long lab. Leave plenty of time to complete it.

You must turn in a PDF document of your R Markdown code. Submit this to D2L by 11:59 PM Eastern Time on Monday, October 30th (there is no Lab 8).

Backstory and Set Up

You still work for Zillow as a junior analyst (sorry). But you’re hunting a promotion. Your job is to present some more advanced predictions of housing values in a small geographic area (Ames, IA) using historical pricing data.

As always, let’s load the data.

Ames <- read.table('https://raw.githubusercontent.com/ajkirkpatrick/FS20/postS21_rev/classdata/ames.csv', 
                   header = TRUE,
                   sep = ',') 

Data Cleaning

Do not skip this section. This isn’t your kitchen junk drawer – you can’t get away with not cleaning your data.

Oh, the Ames data yet again. It’s given us lots of trouble. Many of you have found a few variables (columns) that should be avoided. The main problem is that some columns have only one value in them, or they have only NA and one value, so once lm(...) drops the NA rows, they are left with only one value. Linear regression by OLS does not like variables that don’t vary! So, let’s be systematic about figuring out which columns in our data are to be avoided.

The skimr package is very helpful for seeing what our data contains. Install it, and then use skim(Ames) directly in the console (we’re just looking at the data at the moment – do not put skim output into your R Markdown output; it will give you an error). Take a look at the “complete rate” column – this tells us the fraction of observations in that column that are not NA. If it’s very small (see Alley), then that variable will be problematic. The “n_unique” column tells us how many distinct values the column takes – a “1” in “n_unique” is definitely going to be a problem and you must drop that variable.

You can make a note of those columns that have extremely low “complete rates” and drop them to start off. Of course, we could keep them and instead drop all observations where any of those columns are NA, but once we do that, there probably won’t be many rows left! There are about 6-7 columns that will drop so many rows that they will cause an error if we include them in a regression. Let’s drop those. (Note: the list of columns to drop in Lab 06 is a good start and should suffice, but feel free to drop others that have low complete rates.)
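If it helps, here is one way to automate those checks in base R (a sketch only: the 60% cutoff mirrors Exercise 1 below, and your list of dropped columns will differ from any example):

# install.packages("skimr")   # once, in the console
library(skimr)
# skim(Ames)                  # console only -- don't knit skim output into your PDF

# The same checks by hand: share of non-NA values, and count of distinct values
complete_rate = colMeans(!is.na(Ames))
n_unique = sapply(Ames, function(x) length(unique(na.omit(x))))

bad_cols = names(Ames)[complete_rate < 0.60 | n_unique == 1]
bad_cols   # your list will differ; "Alley" will very likely appear

Ames = Ames[, !(names(Ames) %in% bad_cols)]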

Many models do not like NA values

predict has some unusual behavior that can give unexpected results. Thus far, we have mostly used predict(myOLS), which gives the predicted values from a model using the same data that estimated the model. When we ask lm (or, later, other machine learning models) to estimate a model, it will drop any rows of our data that contain a NA value for any of the variables used in the estimation. If your regression is SalePrice ~ GrLivArea, then it will check SalePrice and GrLivArea for NA’s. If you add another variable, then you add another possible set of NA values that can be dropped, and R will estimate the model on a subset of the data.
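A tiny demonstration of this row-dropping (toy data, not Ames; nobs() reports how many rows lm() actually used):

toy = data.frame(y  = c(1, 2, 3, 4, 5),
                 x1 = c(1, NA, 3, 4, 5),   # one NA in x1
                 x2 = c(2, 4, 6, 8, 11))

nobs(lm(y ~ x2, data = toy))        # all rows used
## [1] 5
nobs(lm(y ~ x1 + x2, data = toy))   # the row with NA in x1 is silently dropped
## [1] 4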

This will mess with your measure of \(RMSE\) - if you compare two models that use different sets of the data, then you aren’t really comparing the fit very well. Because of that, we need to take a moment to check our data.

Ames has a lot of variables, and in this assignment, you’re going to be asked to construct 15 regressions of increasing complexity. So we’re going to choose 15 variables to be explanatory, plus one variable that isn’t SalePrice that we want to be the target variable we predict (in Exercise 1), plus SalePrice, which will be our target variable for Exercise 2. That makes 17 variables.

Select your 17 variables by making a character vector of the variables you think will best predict your chosen variable. Then, make a new version of Ames that contains only those 17 variables (you can use dplyr::select or any other method). Once you’ve done that, use na.omit to make a new, clean version of Ames that has (1) no NA’s in it, and (2) 17 variables of your choice (as long as one is SalePrice). Use the help for na.omit to see how it works. This way, we know every model we make will have the same number of rows in it as none will be dropped due to NA values.
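A minimal sketch of that cleaning step (the predictor names here are hypothetical stand-ins for your own picks):

my_vars = c("SalePrice", "GrLivArea", "YearBuilt", "LotArea")  # ...continue to 17 total

Amesclean = na.omit(Ames[, my_vars])
nrow(Amesclean)   # every model below will be fit to this same set of rows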

Linear Models

When exploring linear models in other classes, we often emphasize asymptotic results under distributional assumptions. That is, we make assumptions about the model in order to derive properties of large samples. This general approach is useful for creating and performing hypothesis tests. Frequently, when developing a linear regression model, part of the goal is to explain a relationship. However, this isn’t always the case. And it’s often not a valid approach, as we discussed in this week’s content.

So, we will ignore much of what we have learned in other classes (sorry, EC420) and instead simply use regression as a tool to predict. Instead of a model which supposedly explains relationships, we seek a model which minimizes errors.

To discuss linear models in the context of prediction, we return to the Ames data. Accordingly, you should utilize some of the early code from Lab 2 to hasten your progress in this lab.

Assessing Model Accuracy

There are many metrics to assess the accuracy of a regression model. Most of these measure in some way the average error that the model makes. The metric that we will be most interested in is the root-mean-square error.

\[ \text{RMSE}(\hat{f}, \text{Data}) = \sqrt{\frac{1}{n}\displaystyle\sum_{i = 1}^{n}\left(y_i - \hat{f}(\bf{x}_i)\right)^2} \]

While for the sake of comparing models, the choice between RMSE and MSE is arbitrary, we have a preference for RMSE, as it has the same units as the response variable. Also, notice that in the prediction context MSE refers to an average, whereas in an ANOVA context, the denominator for MSE may not be \(n\).

For a linear model, the estimate of \(f\), \(\hat{f}\), is given by the fitted regression line.

\[ \hat{y}({\bf{x}_i}) = \hat{f}({\bf{x}_i}) \]

We can write an R function that will be useful for performing this calculation.

rmse = function(actual, predicted) {
  # root-mean-square error: the typical size of a prediction error,
  # in the same units as the response
  sqrt(mean((actual - predicted) ^ 2))
}
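A quick sanity check on toy numbers: only the third prediction misses (by 2), so the RMSE is \(\sqrt{4/3} \approx 1.15\).

rmse(actual = c(1, 2, 3), predicted = c(1, 2, 5))
## [1] 1.154701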

Model Complexity

Aside from how well a model predicts, we will also be very interested in the complexity (flexibility) of a model. For now, we will only consider nested linear models for simplicity. In that case, the more predictors a model has, the more complex the model. For the sake of assigning a numerical value to the complexity of a linear model, we will use the number of predictors, \(p\).

We write a simple R function to extract this information from a model.

get_complexity = function(model) {
  # number of estimated coefficients, not counting the intercept
  length(coef(model)) - 1
}

When deciding how complex of a model to use, we can utilize two techniques: forward selection or backward selection. Forward selection means that we start with the simplest model (with a single predictor) and then add one at a time until we decide to stop. Backward selection means that we start with the most complex model (with every available predictor) and then remove one at a time until we decide to stop. There are many criteria for “when to stop”. Below, we’ll try to give you some intuition on the model-building process.
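To make the loop mechanics concrete, here is a generic sketch with made-up variable names (not the Ames solution, which is left to you): reformulate() is one base-R way to turn a character vector into a formula.

# mydata is a hypothetical data.frame with response y and predictors x1, x2, x3
predictors = c("x1", "x2", "x3")

models = list()
for (i in seq_along(predictors)) {
  f = reformulate(predictors[1:i], response = "y")   # y ~ x1, then y ~ x1 + x2, ...
  models[[i]] = lm(f, data = mydata)
}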

EXERCISE 1 of 2

  1. Using skimr::skim, find the variables that have a low complete rate (below 60%) and drop them. 60% isn’t a magic number by any means; the “right” number is entirely dependent on your data. It is standard practice to document the fields you have dropped from the data, so make sure you state which variables have been dropped. Also using skim, note the variables with “n_unique” equal to 1 and drop them. See Data Cleaning (above) for details.

  2. Then, choose one numeric variable (not SalePrice) to be your target variable (the variable you want to explain or predict), and the 15 variables you want to use to predict it - they can be numeric or categorical, both will work. Then, as described in Data Cleaning above, use na.omit to make a clean version of Ames where every variable is 100% complete. Include SalePrice in your cleaning, but don’t use it yet – it will be the target variable for Exercise 2.

  3. Using forward selection (that is, select one variable, then select another) create a series of models up to complexity length 15. While you can code each of the 15 regressions separately, if you really want to be efficient, try to use a loop along with a character vector of your 15 variables to “step” through the 15 regressions. There are multiple ways to specify a regression using a character vector of arguments, but I’ll leave the solution to you. Use a list object to hold your results, and use lapply along with your RMSE function(s) to get a list of the RMSE’s, one for each model.

  4. Make a data.frame or tibble of the RMSE results and the model complexity (the function unlist is helpful when you have a list of identical types, as should be the case with your measures of model complexity and RMSE; see the sketch after this list). Create a chart plotting model complexity as the \(x\)-axis variable and RMSE as the \(y\)-axis variable. Describe any patterns you see. Do you think you should use the full-size model? Why or why not? What criterion are you using to make this statement?
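For step 4, the assembly might look something like this sketch (model_list and rmse_list stand in for whatever objects you built in step 3):

results = data.frame(
  complexity = unlist(lapply(model_list, get_complexity)),
  rmse       = unlist(rmse_list)
)

plot(results$complexity, results$rmse, type = "b",
     xlab = "Model Size", ylab = "RMSE")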


Test-Train Split

Here, we will use SalePrice as our target variable. You can re-use your code for cleaning from above (you included SalePrice there, right?), and if you want to choose different predictors, feel free to do so.

There is an issue with fitting a model to all available data and then using RMSE to determine how well the model predicts: it is essentially cheating. As a linear model becomes more complex, the RSS, and thus the RMSE, can never go up. It will only go down or, in knife-edge cases, stay the same. This might seem to suggest that in order to predict well, we should use the largest possible model. In reality, however, we have fit to one specific dataset; as soon as we see new data, a large model may predict poorly. This is called overfitting.

The most common approach to overfitting is to take the dataset of interest and split it in two. One part of the data will be used to fit (train) a model, which we will call the training data. The remainder of the original data will be used to assess how well the model predicts, which we will call the test data. Test data should never be used to train a model – its purpose is to evaluate the fitted model once you’ve settled on something.1

Here we use the sample() function to obtain a random sample of the rows of the original data. We then use those row numbers (and the remaining row numbers) to split the data accordingly. Notice we use the set.seed() function to allow us to reproduce the same random split each time we perform this analysis. Sometimes we don’t want to do this; if we want to run lots of independent splits, then we do not set the initial seed.

set.seed(9)
num_obs = nrow(Amesclean)

train_index = sample(num_obs, size = trunc(0.50 * num_obs))
train_data = Amesclean[train_index, ]
test_data = Amesclean[-train_index, ]
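A quick sanity check that the split really partitions the rows:

nrow(train_data) + nrow(test_data) == num_obs
## [1] TRUE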

Of course, you’ll have different results here since you’ll have different columns in Amesclean and will thus have different numbers after using na.omit. We will look at two measures that assess how well a model is predicting: train RMSE and test RMSE.

\[ \text{RMSE}_\text{Train} = \text{RMSE}(\hat{f}, \text{Train Data}) = \sqrt{\frac{1}{n_{\text{Tr}}}\sum_{i \in \text{Train}}\left(y_i - \hat{f}(\bf{x}_i)\right)^2} \]

Here \(n_{Tr}\) is the number of observations in the train set. Train RMSE will still always go down (or stay the same) as the complexity of a linear model increases. That means train RMSE will not be useful for comparing models, but checking that it decreases is a useful sanity check.

\[ \text{RMSE}_{\text{Test}} = \text{RMSE}(\hat{f}, \text{Test Data}) = \sqrt{\frac{1}{n_{\text{Te}}}\sum_{i \in \text{Test}} \left ( y_i - \hat{f}(\bf{x}_i) \right ) ^2} \]

Here \(n_{Te}\) is the number of observations in the test set. Test RMSE uses the model fit to the training data, but evaluated on the unused test data. This is a measure of how well the fitted model will predict in general, not simply how well it fits data used to train the model, as is the case with train RMSE. What happens to test RMSE as the size of the model increases? That is what we will investigate.

We will start with the simplest possible linear model, that is, a model with no predictors.

fit_0 = lm(SalePrice ~ 1, data = train_data)
get_complexity(fit_0)
## [1] 0
# train RMSE
sqrt(mean((train_data$SalePrice - predict(fit_0, train_data)) ^ 2))
## [1] 80875.98
# test RMSE
sqrt(mean((test_data$SalePrice - predict(fit_0, test_data)) ^ 2))
## [1] 77928.62

Your results will be different, depending on what 17 variables you selected (and which rows contained NA for them). The previous two operations obtain the train and test RMSE. Since these are operations we are about to use repeatedly, we should use the function that we happen to have already written.

# train RMSE
rmse(actual = train_data$SalePrice, predicted = predict(fit_0, train_data))
## [1] 80875.98
# test RMSE
rmse(actual = test_data$SalePrice, predicted = predict(fit_0, test_data))
## [1] 77928.62

This function can actually be improved for the inputs that we are using. We would like to obtain train and test RMSE for a fitted model, given a train or test dataset, and the appropriate response variable.

get_rmse = function(model, data, response) {
  rmse(actual = subset(data, select = response, drop = TRUE),
       predicted = predict(model, data))
}

By using this function, our code becomes easier to read, and it is more obvious what task we are accomplishing.

get_rmse(model = fit_0, data = train_data, response = "SalePrice") # train RMSE
## [1] 80875.98
get_rmse(model = fit_0, data = test_data, response = "SalePrice") # test RMSE
## [1] 77928.62

Try it: Apply this basic function with different arguments. Do you understand how we’ve nested functions within functions?

Adding Flexibility to Linear Models

We started with the simplest model, one including only a constant (which gives us just the (Intercept) coefficient). Its fitted intercept is the sample mean, so every prediction is \(\bar{y}\). But we want to let our model be more complex – we want to add variables, polynomial terms, and interactions. Let’s do this.
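As a refresher on the formula syntax for these (x1, x2, and mydata are hypothetical names):

# x1:x2        the interaction term alone
# x1 * x2      both main effects plus their interaction
# I(x1^2)      a squared term (I() protects arithmetic inside a formula)
# poly(x1, 2)  a degree-2 orthogonal polynomial
fit_ex = lm(y ~ x1 * x2 + I(x1^2), data = mydata)   # mydata is hypothetical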

Try it: (This is Exercise 2, Question 1, so it’s probably a good idea to do this.) Using lm(), predict SalePrice with no fewer than five nested models of increasing complexity (each new model must include all of the terms from the previous one, plus at least one new term). Here, we probably want to use interactions and polynomial terms, but make sure your models are still nested! Call them fit_1 to fit_5. It may be easiest to write out the formulas in your code rather than automating as in Exercise 1.

Each successive model we fit will be more and more flexible using both interactions and polynomial terms. We will see the training error decrease each time the model is made more flexible. We expect the test error to decrease a number of times, then eventually start going up, as a result of overfitting. To better understand the relationship between train RMSE, test RMSE, and model complexity, we’ll explore further.

Hopefully, you tried the in-line exercise above. If so, we can create a list of the fitted models.

model_list = list(fit_0, fit_1, fit_2, fit_3, fit_4, fit_5)

We then obtain train RMSE, test RMSE, and model complexity for each using our old friend sapply().

train_rmse = sapply(model_list, get_rmse, data = train_data, response = "SalePrice")
test_rmse = sapply(model_list, get_rmse, data = test_data, response = "SalePrice")
model_complexity = sapply(model_list, get_complexity)

Once you’ve done this, notice that the sapply() call is equivalent to writing out each call by hand:

# This is the same as the sapply command above

test_rmse = c(get_rmse(fit_0, test_data, "SalePrice"),
              get_rmse(fit_1, test_data, "SalePrice"),
              get_rmse(fit_2, test_data, "SalePrice"),
              get_rmse(fit_3, test_data, "SalePrice"),
              get_rmse(fit_4, test_data, "SalePrice"),
              get_rmse(fit_5, test_data, "SalePrice"))

We can plot the results. If you execute the code below, you’ll see the train RMSE in blue, while the test RMSE is given in orange.2

plot(model_complexity, train_rmse, type = "b",
     ylim = c(0.98 * min(c(train_rmse, test_rmse)),
              1.02 * max(c(train_rmse, test_rmse))),
     col = "dodgerblue",
     xlab = "Model Size",
     ylab = "RMSE")
lines(model_complexity, test_rmse, type = "b", col = "darkorange")

We could also summarize the results as a table. fit_1 is the least flexible, and fit_5 is the most flexible. If we were to do this (see the exercise below), we would see that Train RMSE never increases as flexibility increases. However, this may not be the case for Test RMSE.

Model   Train RMSE                            Test RMSE                            Predictors
fit_1   RMSE\(_{\text{train}}\) for model 1   RMSE\(_{\text{test}}\) for model 1   put predictors here
…       …                                     …                                    …
fit_5   RMSE\(_{\text{train}}\) for model 5   RMSE\(_{\text{test}}\) for model 5   \(p\) predictors

To summarize:

  • Underfitting models: in general, high Train RMSE and high Test RMSE.
  • Overfitting models: in general, low Train RMSE and high Test RMSE.

Specifically, we say that a model is overfitting if there exists a less complex model with lower Test RMSE.3 Then a model is underfitting if there exists a more complex model with lower Test RMSE.

EXERCISE 2 of 2

  1. Using lm() and any number of regressors, predict SalePrice with no fewer than five models of increasing complexity (as in the try-it above). Complexity can be increased by adding variables or by adding interactions or polynomials of existing variables. Put the five models into a list.

  2. Calculate the Train and Test RMSE. Your goal is to have a lower Test RMSE than others in the class.

  3. Make a table exactly like the table above for the 5 models you just fit. The first column should have the name of the model (e.g. fit_1). Hint: you can get the names of the entries in a list using names(model_list), provided you named the list items when you added them (see the sketch after this list).

  4. In a short paragraph, describe the resulting model. Discuss how you arrived at this model, what interactions you’re using (if any) and how confident you are that your prediction will perform well, relative to other people in the class.
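For the hint in question 3: the naming happens when you build the list. A sketch, assuming you called your fits fit_1 through fit_5:

model_list = list(fit_1 = fit_1, fit_2 = fit_2, fit_3 = fit_3,
                  fit_4 = fit_4, fit_5 = fit_5)
names(model_list)
## [1] "fit_1" "fit_2" "fit_3" "fit_4" "fit_5"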

A final note on the analysis performed here: we paid no attention whatsoever to the “assumptions” of a linear model. We only sought a model that predicted well, and paid no attention to a model for explanation. This is especially true where we interacted variables without any theory as to why they might need to be interacted – “exterior type interacted with number of half-baths” may help with RMSE, but it certainly isn’t grounded in a story that a real estate agent might tell. Hypothesis testing did not play a role in deciding the model, only prediction accuracy. Collinearity? We don’t care. Assumptions? Still don’t care. Diagnostics? Never heard of them. (These statements are a little over the top, and not completely true, but they drive home the point that we only care about prediction. Often we latch onto methods that we have seen before, even when they are not needed.)


  1. Note that sometimes the terms evaluation set and test set are used interchangeably. We will give somewhat specific definitions to these later. For now, we will simply use a single test set alongside a single training set.↩︎

  2. The train RMSE is guaranteed to follow this non-increasing pattern as long as no data is being dropped when new variables are added (see Data Cleaning above). The same is not true of test RMSE. We often see a nice U-shaped curve. There are theoretical reasons why we should expect this, but that is on average. Because of the randomness of one test-train split, we may not always see this result. Re-perform this analysis with a different seed value and the pattern may not hold. We will discuss why we expect this next week. We will discuss how we can help create this U-shape much later. Also, we might intuitively expect train RMSE to be lower than test RMSE. Again, due to the randomness of the split, you may get (un)lucky and this will not be true.↩︎

  3. The labels of under and overfitting are relative to the best model we see. Any model more complex with higher Test RMSE is overfitting. Any model less complex with higher Test RMSE is underfitting.↩︎