Model Building

NOTE

You must turn in a PDF document rendered from your R Markdown code. Submit it to D2L by 11:59 PM Eastern Time on Monday, October 16th. Remember, your Project 1 is due on Saturday the 14th.

This week’s lab will extend last week’s lab. The introduction is a direct repeat.

Backstory and Set Up

You have recently been hired onto Zillow’s Zestimate product team as a junior analyst. As part of their regular hazing, they have given you access to a small subset of their historic sales data. Your job is to present some basic predictions for housing values in a small geographic area (Ames, IA) using this historical pricing data.

First, let’s load the data.

# Read the Ames, IA housing data from the course repository
ameslist <- read.table('https://raw.githubusercontent.com/ajkirkpatrick/FS20/postS21_rev/classdata/ames.csv',
                       header = TRUE,
                       sep = ',')
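
Before moving on, it’s worth a quick sanity check that the data loaded as expected. A minimal check, using base R only:

dim(ameslist)            # number of rows and columns
head(ameslist[, 1:6])    # first few rows of the first six columns
str(ameslist$SalePrice)  # confirm SalePrice is numeric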

Building a Model

We’re now ready to start playing with a model. We will start by using the lm() function to fit a simple linear regression model, with SalePrice as the response and GrLivArea as the predictor.

Recall that the basic lm() syntax is lm(y ~ x, data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept. Let’s quickly run this with two variables:

lm.fit = lm(SalePrice ~ GrLivArea, data = ameslist)

If we type lm.fit, some basic information about the model is output. For more detailed information, we use summary(lm.fit). This gives us p-values and standard errors for the coefficients, as well as the \(R^2\) statistic and \(F\)-statistic for the entire model.1
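
For example, with the model we just fit (run these and inspect the printed output):

lm.fit           # brief output: the call and the estimated coefficients
summary(lm.fit)  # coefficient table with standard errors and p-values, plus R^2 and the F-statistic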

Using these functions helps us see some interesting results. Note that we built (nearly) the simplest possible model:

\[\text{SalePrice} = \beta_0 + \beta_1 \cdot \text{GrLivArea} + \epsilon.\]

But even on its own, this model is instructive. It suggests that an increase in overall living area of 1 ft\(^2\) is associated with an expected increase in sale price of about $107. (Note that we cannot make causal claims!)

Saving the model as we did above is useful because we can explore other pieces of information it stores. Specifically, we can use the names() function to find out what else is stored in lm.fit. Although we can extract these quantities by name (e.g., lm.fit$coefficients), it is safer to use extractor functions like coef(lm.fit) to access them. We can also apply plot() directly to lm.fit to see some useful diagnostic plots that R generates automatically from the model.
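
As a minimal sketch of these extractor functions (all base R):

names(lm.fit)           # everything stored inside the fitted model object
coef(lm.fit)            # the estimated coefficients
confint(lm.fit)         # 95% confidence intervals for the coefficients
head(residuals(lm.fit)) # first few residuals, via the residuals() extractor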

Try it: Use plot() to explore the model above (it will make a sequence of plots; don’t put it in a code chunk, just use it for your own exploration). Do you suspect that some outliers have a large influence on the fit? We will explore this point specifically in the future.
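
If you want all four diagnostic plots on one screen while you explore (again, in the console, not in a chunk), one common approach is:

par(mfrow = c(2, 2))  # arrange the plots in a 2x2 grid
plot(lm.fit)          # residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(1, 1))  # reset the plotting layout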

We can now go crazy adding variables to our model. It’s as simple as appending them to the previous code—though you should be careful executing this, as it will overwrite your previous output:

lm.fit = lm(SalePrice ~ GrLivArea + LotArea, data = ameslist)

Try it: Does controlling for LotArea change the qualitative conclusions from the previous regression? What about the quantitative results? Does the direction of the change in the quantitative results make sense to you?
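
One way to make this comparison concrete is to keep both fits around under different names. A sketch:

lm.fit1 <- lm(SalePrice ~ GrLivArea, data = ameslist)            # simple model
lm.fit2 <- lm(SalePrice ~ GrLivArea + LotArea, data = ameslist)  # adds LotArea
coef(lm.fit1)  # GrLivArea coefficient on its own
coef(lm.fit2)  # does the GrLivArea coefficient move once LotArea is controlled for?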

EXERCISES

  1. Use the lm() function in a simple linear regression (e.g., with only one predictor) with SalePrice as the response to determine the value of a garage.

  2. Use the lm() function to perform a multiple linear regression with SalePrice as the response and all other variables from your Ames data as the predictors. You can do this easily with the formula SalePrice ~ ., which tells lm() to use all of the data’s columns (except SalePrice) on the right-hand side. To do this, you’ll need to drop a few variables first, though. Use dplyr::select(-PoolQC, -MiscFeature, -Fence, -FireplaceQu, -LotFrontage, -Exterior2nd, -Electrical) to get rid of some variables that have a lot of NA values. Use the summary() or broom::tidy() function to print the results (a starter sketch of these mechanics appears after this list). Comment on the output. For instance:

    • Is there a relationship between the predictors and the response?
    • Which predictors appear to have a statistically significant relationship to the response? (Hint: look for stars)
    • What does the coefficient for the year variable suggest?
  3. There are a few NAs in the output from the regression in Question 2. You can use broom::tidy() to save the output in a familiar data.frame-style “tibble”, and then explore it to see which variables are coming up NA. Remember what R did when we tried to give it dummy variables representing all three possible values of a factor variable (see “Parameterization” in Example 06). Keeping that in mind, scroll to the first NA in your regression output and see if you can explain why it might be NA. Please remember that we can use functions like View() to explore data, but we never put View() in a code chunk.

  4. It’s rarely a good idea to throw all the variables into a regression. We want to be smarter about building our model. We’ll use fewer variables, but include interactions. As we saw this week, the : symbol allows you to create an interaction term between two variables. Use the : symbol to fit a linear regression model with one well-chosen interaction effect plus 3-4 of the other variables of your choice. Why did you select the variables you did, and what was the result?

  5. Try a few (e.g., two) different transformations of the variables, such as \(\ln(x)\), \(x^2\), \(\sqrt{x}\). Do any of these make sense to include in a model of SalePrice? Comment on your findings.
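
To get you started on the mechanics (not the answers), here is a sketch of the pieces Exercises 2, 4, and 5 call for. The particular variables used below (OverallQual, YearBuilt, and friends) are illustrative choices, not required ones:

library(dplyr)
library(broom)

# Exercise 2: drop the NA-heavy columns, then regress SalePrice on everything else
ames_small <- ameslist %>%
  select(-PoolQC, -MiscFeature, -Fence, -FireplaceQu,
         -LotFrontage, -Exterior2nd, -Electrical)
lm.all <- lm(SalePrice ~ ., data = ames_small)
tidy(lm.all)  # the coefficient table as a tibble; look for NA estimates here

# Exercise 4: one interaction (via :) plus a handful of other predictors
lm.int <- lm(SalePrice ~ GrLivArea + LotArea + YearBuilt + GrLivArea:OverallQual,
             data = ameslist)

# Exercise 5: transformations; note that ^ inside a formula needs I()
lm.tr <- lm(log(SalePrice) ~ GrLivArea + I(GrLivArea^2) + sqrt(LotArea),
            data = ameslist)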


  1. When we use the simple regression model with a single input, the \(F\)-stat includes the intercept term. Otherwise, it does not. See Lecture 5 for more detail.↩︎