Correlations and Simple Models
NOTE
You must turn in a PDF document of your R Markdown code. Submit this to D2L by 11:59 PM Eastern Time on Monday, October 9th.
Backstory and Set Up
You have been recently hired to Zillow’s Zestimate product team as a junior analyst. As a part of their regular hazing, they have given you access to a small subset of their historic sales data. Your job is to present some basic predictions for housing values in a small geographic area (Ames, IA) using this historical pricing.
First, let’s load the data.
ameslist <- read.table('https://raw.githubusercontent.com/ajkirkpatrick/FS20/postS21_rev/classdata/ames.csv',
header = TRUE,
sep = ',')
Before we proceed, let’s note a few things about the (simple) code above. First, we have specified header = TRUE because (you guessed it) the original dataset has headers. Although simple, this is an incredibly important step because it allows R to do some smart R things. Specifically, once the headers are in, the variables are formatted as int and factor where appropriate. (In R 4.0 and later, read.table reads text columns as chr rather than factor unless you set stringsAsFactors = TRUE.) It is absolutely vital that we format the data correctly; otherwise, many R commands will whine at us.
Try it: Run the above, but specify header = FALSE instead. What data type are the various columns? Now try omitting the line altogether. What is the default behavior of the read.table function? [1]
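A minimal sketch of this comparison, assuming a tiny made-up CSV in place of the real Ames file. The toy values and the object names (toy_csv, with_header, without_header) are illustrative only; the text argument of read.table lets the example run without a network connection.

```r
# Toy CSV standing in for the real file (values are made up for illustration).
toy_csv <- "ID,LotArea,GarageType
1,8450,Attchd
2,9600,Detchd
3,11250,Attchd"

# header = TRUE: the first line supplies the column names.
with_header <- read.table(text = toy_csv, header = TRUE, sep = ",")

# header = FALSE: the first line is treated as data, so every column
# contains at least one text value and the numbers are no longer numeric.
without_header <- read.table(text = toy_csv, header = FALSE, sep = ",")

sapply(with_header, class)     # column types when names are read correctly
sapply(without_header, class)  # column types when the names end up in the data
str(with_header)               # compact overview of names, types, first values
```

Running the same two sapply() calls on ameslist is one way to answer the question above for the real data.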
Data Exploration and Processing
We are not going to tell you anything about this data. This is intended to replicate a real-world experience that you will all encounter in the (possibly near) future: someone hands you data and you’re expected to make sense of it. Fortunately for us, this data is (somewhat) self-contained. We’ll first check the variable names to try to divine some information. Recall, we have a handy little function for that:
names(ameslist)
Note that, when doing data exploration, we will sometimes choose to not save our output. This is a judgement call; here we’ve chosen to merely inspect the variables rather than diving in.
Inspection yields some obvious truths. For example:
Variable | Explanation | Type |
---|---|---|
ID | Unique identifier for each row | int |
LotArea | Size of lot (units unknown) | int |
SalePrice | Sale price of house ($) | int |
…but we face some not-so-obvious things as well. For example:
Variable | Explanation | Type |
---|---|---|
LotShape | ? Something about the lot | factor |
MSSubClass | ? No clue at all | int |
Condition1 | ? Seems like street info | factor |
It will be difficult to learn anything about the data that is of type int without outside documentation. However, we can learn something more about the factor-type variables. In order to understand these a little better, we need to review some of the values that each takes on.
Try it: Go through the variables in the dataset and make a note about your interpretation for each. Many will be obvious, but some require additional thought.
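One possible way to review the values of the categorical variables is sketched below, assuming a small made-up data frame (toy) in place of ameslist. The column values and the helper names is_categorical and value_counts are illustrative, not part of the assignment.

```r
# Toy data frame standing in for ameslist (values are made up for illustration).
toy <- data.frame(
  LotShape   = c("Reg", "IR1", "Reg", "IR2", "Reg"),
  Condition1 = c("Norm", "Feedr", "Norm", "Norm", "Artery"),
  LotArea    = c(8450L, 9600L, 11250L, 9550L, 14260L),
  stringsAsFactors = TRUE
)

# Pick out the factor (or character) columns.
is_categorical <- sapply(toy, function(col) is.factor(col) || is.character(col))

# Count how often each value occurs, including any missing values.
value_counts <- lapply(toy[is_categorical], table, useNA = "ifany")
value_counts

# Numeric columns are better summarized than tabulated.
summary(toy$LotArea)
```

Replacing toy with ameslist should give a quick overview of every categorical column at once.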
Although there are some variables that would be difficult to clean, there are a few that we can address with relative ease. Consider, for instance, the variable GarageType. This might not be that important, but, remember, the weather in Ames, IA is pretty crummy: a detached garage might be a dealbreaker for some would-be homebuyers. Let’s inspect the values:
> unique(ameslist$GarageType)
[1] Attchd Detchd BuiltIn CarPort <NA> Basment 2Types
With this, we could make an informed decision and create a new variable. Let’s create GarageOutside to indicate, say, homes that have any type of garage that requires the homeowner to walk outdoors after parking their car. (For those who aren’t familiar with different garage types, a car port is not insulated and is therefore considered outdoors. A detached garage presumably requires that the person walks outside after parking. The three other types are inside the main structure, and we can assume 2Types includes at least one attached garage of some sort.) This is going to require a bit more coding and we will have to think through each step carefully.
First, let’s create a new object that has indicator variables (that is, a variable whose values are either zero or one) for each of the GarageType values. That is, it has a \(1\) if the variable takes on some specific value, and a \(0\) otherwise. Do this for all but one of the different values in GarageType, and your descriptive variable is now represented by numbers.
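To make the idea concrete, here is a minimal sketch of building indicator variables by hand, assuming a short made-up vector (garage) in place of ameslist$GarageType. The names is_detached and is_carport are hypothetical.

```r
# Toy vector standing in for ameslist$GarageType (values are made up for illustration).
garage <- c("Attchd", "Detchd", "CarPort", "Attchd", "BuiltIn")

# Comparing against one value gives TRUE/FALSE; as.integer() turns that into 1/0.
is_detached <- as.integer(garage == "Detchd")
is_carport  <- as.integer(garage == "CarPort")

# Side-by-side view of the original values and their indicators.
data.frame(garage, is_detached, is_carport)
```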
As with everything in R, there’s a handy function to do this for us:
GarageTemp = model.matrix( ~ GarageType - 1, data=ameslist )
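A small sketch of what the - 1 in the formula does, assuming a made-up data frame (toy) with no missing values. Note that with - 1 every value gets its own column, whereas the default formula drops one value as a reference category, which is the "all but one" version described in the text.

```r
# Toy data frame (values are made up for illustration; no missing values).
toy <- data.frame(GarageType = c("Attchd", "Detchd", "CarPort", "Attchd"))

# With "- 1" the intercept is removed and every level gets its own 0/1 column.
all_levels <- model.matrix(~ GarageType - 1, data = toy)
colnames(all_levels)

# Without "- 1" there is an intercept column, and the first level
# (alphabetically, here Attchd) is dropped as the reference category.
with_reference <- model.matrix(~ GarageType, data = toy)
colnames(with_reference)
```

Either form encodes the same information; the choice mostly matters later, when the columns are used in a regression.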
We now have two separate objects living in our computer’s memory: ameslist and GarageTemp (so named to indicate that it is a temporary object). [2] We now need to stitch it back onto our original data; we’ll use a simple concatenation and write over our old list with the new one:
ameslist <- cbind(ameslist, GarageTemp)
> Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 1460, 1379
Huh. What’s going on?
EXERCISE 1 of 5
- Figure out what’s going on above. Figure out where the 1460-1379 = 81 rows of data are going when using model.matrix. Fix this issue so that you have a working version.
Now that we’ve got that working (ha!) we can generate a new variable for our outdoor garage:
ameslist$GarageOutside <- ifelse(ameslist$GarageTypeDetchd == 1 | ameslist$GarageTypeCarPort == 1, 1, 0)
unique(ameslist$GarageOutside)
[1] 0 1 NA
This seems to have worked. The ifelse() command above does what it says: if some condition is met (here, either of two variables equals one) then it returns a one; else it returns a zero. Such functions are very handy, though as mentioned above, there are other ways of doing this. Also note that, while we fixed the issue with NA above, we’ve got new issues: we definitely don’t want NA output from this operation. Accordingly, we’re going to need to deal with it somehow.
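The way NA slips through ifelse() can be seen in a minimal sketch, assuming two short made-up indicator vectors (detached and carport) rather than the real columns.

```r
# Toy indicator columns (values are made up for illustration).
detached <- c(1, 0, 0, NA)
carport  <- c(0, 0, 1, NA)

# ifelse(test, yes, no) works element by element.
# Where the test itself is NA, the result is NA rather than 0 or 1.
outside <- ifelse(detached == 1 | carport == 1, 1, 0)
outside

# is.na() flags which elements are missing.
is.na(outside)
```

The last element stays NA because R cannot tell whether an unknown value equals one.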
Try it: Utilizing a similar approach to what you did above, fix this so that the only outputs are zero and one. This requires taking a stand on what the NA values mean. If you think they correspond to a detached garage (or something functionally equivalent, like “no parking whatsoever”), then change the NA values to zero. If you think they are mistakes, then we should drop all data with NA for this column. State what you did and why. You can do this just using a subset to state which values you want to replace, or you can use case_when and make sure your last conditional always returns a value. Refresh yourself in Content 01.
Generally speaking, this is a persistent issue, and you will spend an extraordinary amount of time dealing with missing data or data that does not encode a variable exactly as you want it. This is especially true if you deal with real-world data: you will need to learn how to handle NAs. There are a number of fixes (as always, Google is your friend) and anything that works is good. But you should spend some time thinking about this and learning at least one approach.
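Two common base R approaches are sketched below, assuming a made-up data frame (toy) with a few missing values. The "None" label is a hypothetical choice for illustration; which approach is appropriate depends on what the missing values actually mean.

```r
# Toy data frame with missing values (made up for illustration).
toy <- data.frame(
  price  = c(200000, 150000, NA, 310000),
  garage = c("Attchd", NA, "Detchd", "Attchd"),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
colSums(is.na(toy))

# Option 1: keep only the rows with no missing values at all.
complete_rows <- toy[complete.cases(toy), ]

# Option 2: recode the missing values in one column to an explicit category.
recoded <- toy
recoded$garage[is.na(recoded$garage)] <- "None"
recoded
```

Dropping rows shrinks the dataset, while recoding keeps every row but commits you to an interpretation of the missing values.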
EXERCISES 2-5
2. Prune the data to 6-8 of the variables that are type = int about which you have some reasonable intuition for what they mean. Choose those that you believe are likely to be correlated with SalePrice. This must include the variables SalePrice and GrLivArea. Save this new dataset as Ames. Produce documentation for this object in the form of a Markdown table or see further documentation here. This must describe each of the preserved variables, the values it can take (e.g., can it be negative?) and your definition of the variable. Counting the variable name, this means your table should have three columns. Markdown tables are entered in the text body, not code chunks, of your .rmd, so your code creating Ames will be in a code chunk, and your table will be right after it.
3. Produce a scatterplot matrix of the chosen variables. [3]
4. Compute a matrix of correlations between these variables using the function cor(). Do the correlations match your prior beliefs? Briefly discuss the correlation between the chosen variables and SalePrice and any correlations between these variables.
5. Produce a scatterplot between SalePrice and GrLivArea. Run a linear model using lm() to explore the relationship. Finally, use the geom_abline() function to plot the relationship that you’ve found in the simple linear regression. You’ll need to extract the intercept and slope from your lm object. See coef(...) for information on this. [4]
   - What is the largest outlier that is above the regression line? Produce the other information about this house.
(Bonus) Create a visualization that shows the rise of air conditioning over time in homes in Ames.
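The mechanics of the functions named in these exercises can be sketched on a different dataset. The example below assumes the built-in mtcars data in place of Ames, with mpg and wt standing in for SalePrice and GrLivArea; the object names (cars, fit, intercept, slope, p) are illustrative. The ggplot2 step runs only if that package is installed.

```r
# mtcars ships with R; it stands in for the Ames subset here.
cars <- mtcars[, c("mpg", "wt", "hp", "disp")]

# Scatterplot matrix: one panel for every pair of variables.
pairs(cars)

# Correlation matrix, rounded for readability.
round(cor(cars), 2)

# Simple linear regression of mpg on wt.
fit <- lm(mpg ~ wt, data = cars)

# coef() returns a named vector; pull out the two pieces by name.
intercept <- coef(fit)[["(Intercept)"]]
slope     <- coef(fit)[["wt"]]

# Scatterplot with the fitted line, drawn only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(cars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_abline(intercept = intercept, slope = slope)
  print(p)
}
```

Extracting the coefficients by name rather than by position keeps the code readable and guards against mixing up the intercept and slope.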
[1] Of course, you could find out the defaults of the function by simply using the handy ? command. Don’t forget about this tool!
[2] It’s not exactly true that these objects are in memory. They are… sort of. But how R handles memory is complicated and silly and blah blah who cares. It’s basically in memory.
[3] If you are not familiar with this type of visualization, consult the book (Introduction to Statistical Learning), Chapters 2 and 3.
[4] We could also use geom_smooth(method = 'lm') to add the regression line, but it’s good practice to work with lm objects.