Model Building
NOTE

You must turn in a PDF of your knitted R Markdown document. Submit it to D2L by 11:59 PM Eastern Time on Monday, October 16th. Remember, Project 1 is due on Saturday the 14th.
This week’s lab will extend last week’s lab. The introduction is a direct repeat.
Backstory and Set Up
You have been recently hired to Zillow’s Zestimate product team as a junior analyst. As a part of their regular hazing, they have given you access to a small subset of their historic sales data. Your job is to present some basic predictions for housing values in a small geographic area (Ames, IA) using this historical pricing.
First, let’s load the data.
```r
ameslist <- read.table('https://raw.githubusercontent.com/ajkirkpatrick/FS20/postS21_rev/classdata/ames.csv',
                       header = TRUE,
                       sep = ',')
```
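A quick sanity check is a good habit after any data load. This is a sketch; the column names below assume the Ames file loaded as expected:

```r
# Sanity check: assumes the file loaded into `ameslist` with
# columns SalePrice and GrLivArea (per the Ames data set).
dim(ameslist)                # number of rows and columns
head(ameslist$SalePrice)     # first few sale prices
summary(ameslist$GrLivArea)  # distribution of living area
```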
Building a Model
We’re now ready to start playing with a model. We will start by using the `lm()` function to fit a simple linear regression model, with `SalePrice` as the response and `GrLivArea` as the predictor.

Recall that the basic `lm()` syntax is `lm(y ~ x, data)`, where `y` is the response, `x` is the predictor, and `data` is the data set in which these two variables are kept. Let’s quickly run this with two variables:
```r
lm.fit <- lm(SalePrice ~ GrLivArea, data = ameslist)
```
If we type `lm.fit`, some basic information about the model is output. For more detailed information, we use `summary(lm.fit)`. This gives us p-values and standard errors for the coefficients, as well as the \(R^2\) statistic and \(F\)-statistic for the entire model.[^1]
Utilizing these functions helps us see some interesting results. Note that we built (nearly) the simplest possible model:

\[\text{SalePrice} = \beta_0 + \beta_1*(\text{GrLivArea}) + \epsilon.\]

But even on its own, this model is instructive. It suggests that an increase in overall living area of 1 ft\(^2\) is correlated with an expected increase in sale price of $107. (Note that we cannot make causal claims!)
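To see where that per-square-foot figure comes from, here is a minimal sketch (it assumes the `lm.fit` object fit above is in your workspace):

```r
# The slope on GrLivArea is the expected change in SalePrice per
# additional square foot of living area (assumes lm.fit from above).
slope <- coef(lm.fit)["GrLivArea"]
slope        # the per-square-foot coefficient discussed in the text
slope * 100  # expected price change for 100 additional sq. ft.
```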
Saving the model as we did above is useful because we can explore other pieces of information it stores. Specifically, we can use the `names()` function in order to find out what else is stored in `lm.fit`. Although we can extract these quantities by name (e.g., `lm.fit$coefficients`), it is safer to use the extractor functions like `coef(lm.fit)` to access them. We can also use a handy tool like `plot()` applied directly to `lm.fit` to see some interesting data that is automatically stored by the model.
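As a sketch of those extractor functions in action (again assuming `lm.fit` is defined as above):

```r
# Explore what the fitted model object stores (assumes lm.fit exists).
names(lm.fit)            # components stored inside the lm object
coef(lm.fit)             # intercept and slope, via the safe extractor
head(residuals(lm.fit))  # residuals, another stored quantity
```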
Try it: Use `plot()` to explore the model above (it will make a sequence of plots; don’t put it in a code chunk, just use it for your own exploration). Do you suspect that some outliers have a large influence on the data? We will explore this point specifically in the future.
We can now go crazy adding variables to our model. It’s as simple as appending them to the previous code—though you should be careful executing this, as it will overwrite your previous output:
```r
lm.fit <- lm(SalePrice ~ GrLivArea + LotArea, data = ameslist)
```
Try it: Does controlling for `LotArea` change the qualitative conclusions from the previous regression? What about the quantitative results? Does the direction of the change in the quantitative results make sense to you?
EXERCISES

1. Use the `lm()` function in a simple linear regression (e.g., with only one predictor) with `SalePrice` as the response to determine the value of a garage.

2. Use the `lm()` function to perform a multiple linear regression with `SalePrice` as the response and all other variables from your `Ames` data as the predictors. You can do this easily with the formula `SalePrice ~ .`, which tells `lm` to use all of the data’s columns (except `SalePrice`) on the right-hand side. To do this, you’ll need to drop a few variables first, though. Use `dplyr::select(-PoolQC, -MiscFeature, -Fence, -FireplaceQu, -LotFrontage, -Exterior2nd, -Electrical)` to get rid of some variables that have a lot of `NA` values. Use the `summary()` or `tidy` function to print the results. Comment on the output. For instance:
    - Is there a relationship between the predictors and the response?
    - Which predictors appear to have a statistically significant relationship to the response? (Hint: look for stars.)
    - What does the coefficient for the year variable suggest?

3. There are a few `NA`s in the output from the regression in Question 2. You can use `tidy` to save the output in a familiar data.frame-style “tibble”, and then explore it to see which variables are coming up `NA`. Remember what `R` did when we tried to give it dummy variables representing all three possible values of a factor variable (see “Parameterization” in Example 06). Keeping that in mind, scroll to the first `NA` in your regression output and see if you can explain why it might be `NA`. Please remember that we can use functions like `View()` to explore data, but we never put `View()` in a code chunk.

4. It’s rarely a good idea to throw all the variables into a regression. We want to be smarter about building our model. We’ll use fewer variables, but include interactions. As we saw this week, the `:` symbol allows you to create an interaction term between two variables. Use the `:` symbol to fit a linear regression model with one well-chosen interaction effect plus 3-4 of the other variables of your choice. Why did you select the variables you did, and what was the result?

5. Try a few (e.g., two) different transformations of the variables, such as \(\ln(x)\), \(x^2\), \(\sqrt{x}\). Do any of these make sense to include in a model of `SalePrice`? Comment on your findings.
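If the interaction and transformation syntax is unfamiliar, here is a minimal sketch on the built-in `mtcars` data (deliberately not the Ames data, so it shows the formula mechanics without answering the exercises):

```r
# `:` builds an interaction term inside a formula; log() and sqrt()
# can be used directly, but arithmetic like x^2 must be wrapped in I().
fit_int  <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)  # one interaction
fit_log  <- lm(mpg ~ log(wt), data = mtcars)          # ln(x)
fit_sq   <- lm(mpg ~ wt + I(wt^2), data = mtcars)     # x^2
fit_sqrt <- lm(mpg ~ sqrt(hp), data = mtcars)         # sqrt(x)
coef(fit_int)  # the interaction coefficient appears as wt:hp
```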
[^1]: When we use the simple regression model with a single input, the \(F\)-stat includes the intercept term. Otherwise, it does not. See Lecture 5 for more detail.