Correlations and Simple Models

NOTE

You must turn in a PDF document rendered from your R Markdown code. Submit it to D2L by 11:59 PM Eastern Time on Monday, October 9th.

Backstory and Set Up

You have recently been hired onto Zillow’s Zestimate product team as a junior analyst. As part of the team’s regular hazing, they have given you access to a small subset of their historical sales data. Your job is to present some basic predictions for housing values in a small geographic area (Ames, IA) using this historical pricing.

First, let’s load the data.

ameslist  <- read.table('https://raw.githubusercontent.com/ajkirkpatrick/FS20/postS21_rev/classdata/ames.csv', 
                   header = TRUE,
                   sep = ',') 

Before we proceed, let’s note a few things about the (simple) code above. First, we have specified header = TRUE because—you guessed it—the original dataset has headers. Although simple, this is an incredibly important step because it allows R to do some smart R things. Specifically, once the headers are in, the variables are formatted as int and factor where appropriate. It is absolutely vital that we format the data correctly; otherwise, many R commands will whine at us.

Try it: Run the above, but specify header = FALSE instead. What data type are the various columns now? Then try omitting the argument altogether. What is the default behavior of the read.table function?1
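To see what is at stake, here is a minimal sketch using toy text in place of ames.csv (the column names are made up for illustration):

```r
# Sketch: how header= changes parsing, using toy text in place of ames.csv
csv_text <- "Id,LotArea,SalePrice
1,8450,208500
2,9600,181500"

with_header    <- read.table(text = csv_text, header = TRUE,  sep = ",")
without_header <- read.table(text = csv_text, header = FALSE, sep = ",")

sapply(with_header, class)     # all columns parsed as integer
sapply(without_header, class)  # the header row is read as data, so columns
                               # come back as character (or factor in R < 4.0)
```

With header = FALSE, the header row gets swallowed as a data row, which poisons the type detection for every column.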

Data Exploration and Processing

We are not going to tell you anything about this data. This is intended to replicate a real-world experience that you will all encounter in the (possibly near) future: someone hands you data and you’re expected to make sense of it. Fortunately for us, this data is (somewhat) self-contained. We’ll first check the variable names to try to divine some information. Recall, we have a handy little function for that:

names(ameslist)

Note that, when doing data exploration, we will sometimes choose to not save our output. This is a judgment call; here we’ve chosen merely to inspect the variables rather than diving in.
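Beyond names(), a quick way to see each column’s type is sapply() with class, or str() for types plus a preview of values. A sketch on one of R’s built-in datasets (the same calls work on ameslist):

```r
# Sketch: listing each column's class, shown on R's built-in iris data
sapply(iris, class)   # e.g. Species comes back as "factor"
str(iris)             # classes plus the first few values of each column
```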

Inspection yields some obvious truths. For example:

| Variable  | Explanation                    | Type |
|-----------|--------------------------------|------|
| ID        | Unique identifier for each row | int  |
| LotArea   | Size of lot (units unknown)    | int  |
| SalePrice | Sale price of house ($)        | int  |

…but we face some not-so-obvious things as well. For example:

| Variable   | Explanation               | Type   |
|------------|---------------------------|--------|
| LotShape   | ? Something about the lot | factor |
| MSSubClass | ? No clue at all          | int    |
| Condition1 | ? Seems like street info  | factor |

It will be difficult to learn anything about the data that is of type int without outside documentation. However, we can learn something more about the factor-type variables. In order to understand these a little better, we need to review some of the values that each take on.
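Reviewing a factor’s values can be sketched on a toy vector (the real column names, such as LotShape, come from the dataset itself):

```r
# Sketch: inspecting the values a factor takes, on a toy stand-in for LotShape
lot_shape <- factor(c("Reg", "IR1", "Reg", "IR2", NA))
levels(lot_shape)                  # distinct non-NA values
table(lot_shape, useNA = "ifany")  # counts of each value, including NA
```

The useNA = "ifany" argument matters: by default, table() silently drops NA entries, which is exactly the kind of thing we want to notice.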

Try it: Go through the variables in the dataset and make a note about your interpretation for each. Many will be obvious, but some require additional thought.

Although there are some variables that would be difficult to clean, there are a few that we can address with relative ease. Consider, for instance, the variable GarageType. This might not be that important, but, remember, the weather in Ames, IA is pretty crummy—a detached garage might be a dealbreaker for some would-be homebuyers. Let’s inspect the values:

> unique(ameslist$GarageType)
[1] Attchd  Detchd  BuiltIn CarPort <NA> Basment 2Types

With this, we could make an informed decision and create a new variable. Let’s create OutdoorGarage to indicate, say, homes that have any type of garage that requires the homeowner to walk outdoors after parking their car. (For those who aren’t familiar with different garage types, a car port is not insulated and is therefore considered outdoors. A detached garage presumably requires that the person walks outside after parking. The three other types are inside the main structure, and 2Types we can assume includes at least one attached garage of some sort). This is going to require a bit more coding and we will have to think through each step carefully.

First, let’s create a new object that has indicator variables (that is, a variable whose values are either zero or one) for each of the GarageType values. That is, it has a \(1\) if the variable takes on some specific value, and a \(0\) otherwise. Do this for all but one of the different values in GarageType, and your descriptive variable is now represented by numbers.
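On a toy factor, the indicator construction looks like this (the real call on ameslist appears below):

```r
# Sketch: 0/1 indicator columns built from a toy factor
garage <- data.frame(GarageType = factor(c("Attchd", "Detchd", "CarPort")))
mm <- model.matrix(~ GarageType - 1, data = garage)
mm  # one column per level; each row has a single 1 marking its level
```

The - 1 in the formula drops the intercept, so we get one column per level rather than one level being absorbed into a baseline.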

As with everything in R, there’s a handy function to do this for us:

GarageTemp <- model.matrix( ~ GarageType - 1, data = ameslist )

We now have two separate objects living in our computer’s memory: ameslist and GarageTemp—so named to indicate that it is a temporary object.2 We now need to stitch it back onto our original data; we’ll use a simple concatenation and write over our old list with the new one:

ameslist <- cbind(ameslist, GarageTemp)
Error in data.frame(..., check.names = FALSE) :
  arguments imply differing number of rows: 1460, 1379

Huh. What’s going on?

EXERCISE 1 of 5

  1. Figure out what’s going on above: where do the 1460 - 1379 = 81 rows of data go when using model.matrix? Fix the issue so that you have a working version.

Now that we’ve got that working (ha!) we can generate a new variable for our outdoor garage:

ameslist$GarageOutside <- ifelse(ameslist$GarageTypeDetchd == 1 | ameslist$GarageTypeCarPort == 1, 1, 0)
unique(ameslist$GarageOutside) 
[1]  0  1 NA

This seems to have worked. The ifelse() command above does what its name says: if some condition is met (here, either of two variables equals one), it returns a one; otherwise it returns a zero. Such functions are very handy, though as mentioned above, there are other ways of doing this. Also note that, while we fixed the NA issue above, we’ve got a new one: we definitely don’t want NA output from this operation. Accordingly, we’re going to need to deal with it somehow.
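The NA issue is visible on a toy vector: when the condition itself evaluates to NA, ifelse() returns NA rather than either branch.

```r
# Sketch: ifelse() propagates NA in the condition
x <- c(1, 0, NA)
res <- ifelse(x == 1, 1, 0)
res
# [1]  1  0 NA
```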

Try it: Utilizing a similar approach to what you did above, fix this so that the only outputs are zero and one. This requires taking a stand on what the NA values mean. If you think they correspond to a detached garage (or something functionally equivalent, like “no parking whatsoever”), then change the NA values to zero. If you think they are mistakes, then we should drop all rows with NA in this column. State what you did and why. You can do this using a subset to select which values to replace, or you can use case_when, making sure your last conditional always returns a value. Refresh yourself in Content 01.
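A base-R sketch of the subset-replacement route, on a toy vector, under the assumption that NA should be treated as zero:

```r
# Sketch: replace NA with 0 by subset assignment (assumes NA means "no outdoor garage")
garage_outside <- c(0, 1, NA, 1)
garage_outside[is.na(garage_outside)] <- 0
garage_outside
# [1] 0 1 0 1
```

The case_when() route (from dplyr) behaves the same way provided the final condition is a catch-all such as TRUE ~ 0, so that NA rows fall through to it.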

Generally speaking, this is a persistent issue, and you will spend an extraordinary amount of time dealing with missing data or data that does not encode a variable exactly as you want it. This is especially true if you deal with real-world data: you will need to learn how to handle NAs. There are a number of fixes (as always, Google is your friend) and anything that works is good. But you should spend some time thinking about this and learning at least one approach.

EXERCISES 2-5

  2. Prune the data to 6-8 of the variables of type int for which you have some reasonable intuition about their meaning. Choose those that you believe are likely to be correlated with SalePrice. These must include the variables SalePrice and GrLivArea. Save this new dataset as Ames. Produce documentation for this object in the form of a Markdown table (or see further documentation here). The table must describe each of the preserved variables, the values it can take (e.g., can it be negative?), and your definition of the variable. Counting the variable name, your table should therefore have three columns. Markdown tables are entered in the text body, not code chunks, of your .rmd, so your code creating Ames will be in a code chunk, and your table will come right after it.

  3. Produce a scatterplot matrix of the chosen variables.3

  4. Compute a matrix of correlations between these variables using the function cor(). Do the correlations match your prior beliefs? Briefly discuss the correlations between the chosen variables and SalePrice, and any correlations among the variables themselves.

  5. Produce a scatterplot of SalePrice against GrLivArea. Run a linear model using lm() to explore the relationship. Finally, use the geom_abline() function to plot the relationship that you’ve found in the simple linear regression. You’ll need to extract the intercept and slope from your lm object; see coef(...) for details.4

    • What is the largest outlier above the regression line? Report the other information recorded about this house.

(Bonus) Create a visualization that shows the rise of air conditioning over time in homes in Ames.
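The coef() mechanics for the regression step can be sketched on synthetic data (none of these numbers come from the Ames data):

```r
# Sketch: pulling intercept and slope out of an lm fit, on made-up data
set.seed(1)
area  <- runif(50, 500, 3000)
price <- 50000 + 100 * area + rnorm(50, sd = 10000)
fit   <- lm(price ~ area)
b <- coef(fit)   # b[1] is the intercept, b[2] the slope
# in ggplot2: + geom_abline(intercept = b[1], slope = b[2])
```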


  1. Of course, you could find out the defaults of the function by simply using the handy ? command. Don’t forget about this tool!↩︎

  2. It’s not exactly true that these objects are in memory. They are… sort of. But how R handles memory is complicated and silly and blah blah who cares. It’s basically in memory.↩︎

  3. If you are not familiar with this type of visualization, consult the book (Introduction to Statistical Learning), Chapters 2 and 3.↩︎

  4. We could also use geom_smooth(method = 'lm') to add the regression line, but it’s good practice to work with lm objects.↩︎