Visualizing Large(ish) Data

NOTE

You must turn in a PDF document of your R Markdown code. Submit this to D2L by 11:59 PM Eastern Time on Monday, September 25th

Getting started

For this exercise you’ll use state-level unemployment data from 2006 to 2016 that comes from the US Bureau of Labor Statistics (if you’re curious, we describe how we built this dataset down below).

In the end, to help you master file organization, we suggest that the structure of your project directory should look something like this:

your-project-name\
  03-lab.Rmd
  data\
    unemployment.csv

The example for this week will be incredibly helpful for this exercise. Reference it.

For this week, you need to start making your plots look nice. For full credit, you will have to label axes, label the plot, and experiment with themes. Experiment with adding a labs() layer or changing colors. Make sure all plots are legible and neat. If you’re super brave, try modifying a theme and its elements. Default plots will not receive full credit.

EXERCISE 1

Use data from the US Bureau of Labor Statistics (BLS) to show the trends in employment rate for all 50 states plus DC between 2006 and 2016. Your first instinct might be to put all 50 lines on one graph. Even if using an aesthetic mapping for color, this will be a mess. Let’s give each state its own plot – use facet_grid or facet_wrap to do this. What stories does this plot tell? Which states struggled to recover from the 2008–09 recession?

Some hints/tips:

  • You won’t need to filter out any missing rows because the data here is complete -— there are no state-year combinations with missing unemployment data.

  • You’ll be plotting 51 facets. You can filter out DC if you want to have a better grid (like 5 × 10).

  • Plot the date column along the x-axis, not the year column. If you plot by year, you’ll get weird looking lines (try it for fun?), since these observations are monthly. If you really want to plot by year only, you’ll need to create a different data frame where you group by year and state and calculate the average unemployment rate for each year/state combination (i.e. group_by(year, state) %>% summarize(avg_unemployment = mean(unemployment)))

  • Try mapping other aesthetics onto the graph too. You’ll notice there are columns for region and division—play with those as colors, for instance.

  • This plot might be big, so make sure you adjust fig.width and fig.height in the chunk options so that it’s visible when you knit it - try different values and knit to see how they look. If you’re unfamiliar with chunk options, see the Using R Markdown resources.

EXERCISE 2

Use data from the BLS to create a slopegraph that compares the unemployment rate in January 2006 with the unemployment rate in January 2009, either for all 50 states at once (good luck with that!) or for a specific region or division. Make sure the plot doesn’t look too busy or crowded in the end.

What story does this plot tell? Which states in the US (or in the specific region you selected) were the most/least affected the Great Recession?

Some hints/tips:

  • You should use filter() to only select rows where the year is 2006 or 2009 (i.e. filter(year %in% c(2006, 2009)) and to select rows where the month is January (filter(month == 1) or filter(month_name == "January"))

  • In order for the year to be plotted as separate categories on the x-axis, it needs to be a factor, so use mutate(year = factor(year)) to convert it.

  • To make ggplot draw lines between the 2006 and 2009 categories, you need to include group = state in the aesthetics.

Bonus Exercise

This is entirely optional but might be fun. Then again, it might not be fun. I don’t know.

For extra fun times, if you feel like it, create a bump chart showing something from the unemployment data (perhaps the top 10 states or bottom 10 states in unemployment?) Adapt the code in the example for today’s session.

If you do this, plotting 51 lines is going to be a huge mess. But filtering the data is also a bad idea, because states could drop in and out of the top/bottom 10 over time, and we don’t want to get rid of them. Instead, you can zoom in on a specific range of data in your plot with coord_cartesian(ylim = c(1, 10)), for instance.

Turning everything in

When you’re all done, click on the “Knit” button at the top of the editing window and create a PDF. Upload the PDF file to D2L.

Postscript: how we got this unemployment data

For the curious, here’s the code we used to download the unemployment data from the BLS.

And to pull the curtain back and show how much googling is involved in data visualization (and data analysis and programming in general), here was my process for getting this data:

  1. We thought “We want to have students show variation in something domestic over time” and then we googled “us data by state”. Nothing really came up (since it was an exceedingly vague search in the first place), but some results mentioned unemployment rates, so we figured that could be cool.

  2. We googled “unemployment statistics by state over time” and found that the BLS keeps statistics on this. We clicked on the “Data Tools” link in their main navigation bar, clicked on “Unemployment”, and then clicked on the “Multi-screen data search” button for the Local Area Unemployment Statistics (LAUS).

  3. We walked through the multiple screens and got excited that we’d be able to download all unemployment stats for all states for a ton of years, but then the final page had links to 51 individual Excel files, which was dumb.

  4. So we went back to Google and searched for “download bls data r” and found a few different packages people have written to do this. The first one we clicked on was blscrapeR at GitHub, and it looked like it had been updated recently, so we went with it.

  5. We followed the examples in the blscrapeR package and downloaded data for every state.

Another day in the life of doing modern data science. This is an example of something you will be able to do by the end of this class. we had no idea people had written R packages to access BLS data, but there are (at least) 3 packages out there. After a few minutes of tinkering, we got it working and it is relatively straightforward.