Applied Logistic Regression - Classification

You must turn in a PDF rendered from your R Markdown code. Submit it to D2L by 11:59 PM Eastern Time on the date listed on the syllabus calendar.

Backstory and Setup

You work for a bank. This bank is trying to predict defaults on loans (a relatively uncommon occurrence, but one that costs the bank a great deal of money when it does happen). They’ve given you a dataset on defaults (encoded as the variable y). You’re going to try to predict this.

This is some new data. The snippet below loads it.

bank <- read.table("https://raw.githubusercontent.com/ajkirkpatrick/FS20/postS21_rev/classdata/bank.csv",
                 header = TRUE,
                 sep = ",")

There’s not going to be a whole lot of wind-up here. You should be well-versed in doing these sorts of things by now (if not, look back at the previous lab for sample code).

EXERCISE 1

  1. Split the data into an 80/20 train vs. test split (a minimal sketch of one approach follows this list). Make sure you explicitly set the seed for replicability, but do not share your seed with others in the class. (We may compare some results across people.)

  2. Run a series of four logistic regressions with one, two, three, and four predictors of your choice (you can use interactions); see the sketch after this list.

  3. Create eight confusion matrices in total: four by applying your models to the training data, and four by applying them to the test data (one example is sketched after this list). Briefly discuss your findings. How do the error rate, sensitivity, and specificity change as the number of predictors increases? If you are not getting a 2x2 confusion matrix, you may need to adjust your cutoff probability: because defaults are rare, a cutoff that is set too high can cause the model to predict the same class for every observation, which collapses the matrix to a single column.
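
As a reference for step 1, here is a minimal sketch of an 80/20 split. It assumes the bank data frame loaded above; the seed shown is a placeholder, so substitute your own (and keep it to yourself).

set.seed(12345)   # placeholder seed: replace with your own and do not share it
train_idx <- sample(seq_len(nrow(bank)), size = floor(0.8 * nrow(bank)))
train <- bank[train_idx, ]    # 80% used for fitting
test  <- bank[-train_idx, ]   # 20% held out for evaluation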
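
For step 2, the series of models might look like the lines below. The predictor names x1 through x4 are hypothetical stand-ins for whichever variables you choose, and the code assumes y is coded 0/1 (or as a two-level factor).

m1 <- glm(y ~ x1, data = train, family = binomial)
m2 <- glm(y ~ x1 + x2, data = train, family = binomial)
m3 <- glm(y ~ x1 + x2 + x3, data = train, family = binomial)
m4 <- glm(y ~ x1 + x2 + x3 + x2:x3, data = train, family = binomial)   # interaction as the fourth term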
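
For step 3, one of the eight confusion matrices might be built as follows: predict on the held-out data, classify against a cutoff, and tabulate. The 0.5 cutoff is only a starting point; because defaults are rare, you may need to lower it so that both predicted classes appear. Swap in newdata = train (and the other models) for the remaining matrices.

pred_prob  <- predict(m1, newdata = test, type = "response")   # predicted default probabilities
cutoff     <- 0.5                                              # lower this if every prediction lands in one class
pred_class <- ifelse(pred_prob > cutoff, 1, 0)
table(Predicted = pred_class, Actual = test$y)                 # Actual labels depend on how y is coded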