Basics of ggplot

NOTE

You must turn in a PDF document of your R markdown output. Submit this to D2L by 11:59 PM on September 11th.

Note: This lab has a few “try its” along the way. They are for your own practice and are not required. The lab consists of the three “exercises.”

Our primary tool for data visualization in the course will be ggplot. Technically, we’re using ggplot2; the o.g. version lacked some of the modern features of its big brother. ggplot2 implements the grammar of graphics, a coherent and relatively straightforward system for describing and building graphs. With ggplot2, you can do more faster by learning one system and applying it in many places. Other languages provide more specific tools, but require you to learn a different tool for each application. In this class, we’ll dig into a single package for our visuals.

Using ggplot2

In order to get our hands dirty, we will first have to load ggplot2. To do this, and to access the datasets, help pages, and functions that we will use in this assignment, we will load the so-called tidyverse by running this code:

library(tidyverse)

If you run this code and get an error message “there is no package called ‘tidyverse’”, you’ll need to first install it, then run library() once again. To install packages in R, we utilize the simple function install.packages(). In this case, we would write:

install.packages("tidyverse")
library(tidyverse)

Once we’re up and running, we’re ready to dive into some basic exercises. ggplot2 works by specifying the connections between the variables in the data and the colors, points, and shapes you see on the screen. These logical connections are called aesthetic mappings or simply aesthetics.

How to use ggplot2 – the too-fast and wholly unclear recipe

  • data =: Define what your data is. For instance, below we’ll use the mpg data frame found in ggplot2 (by using ggplot2::mpg). As a reminder, a data frame is a rectangular collection of variables (in the columns) and observations (in the rows). This structure of data is often called a “table” but we’ll try to use terms slightly more precisely. The mpg data frame contains observations collected by the US Environmental Protection Agency on 38 different models of car.

  • mapping = aes(...): How to map the variables in the data to aesthetics

    • Axes, size of points, intensities of colors, which colors, shape of points, lines/points
  • Then say what type of plot you want:

    • boxplot, scatterplot, histogram, …
    • these are called ‘geoms’ in ggplot’s grammar, such as geom_point() giving scatter plots
library(ggplot2)
... + geom_point() # Produces scatterplots
... + geom_bar() # Bar plots
.... + geom_boxplot() # boxplots
... #

You link these steps by literally adding them together with + as we’ll see.

Try it: (OPTIONAL) What other types of plots are there? Try to find several more geom_ functions.

The Recipe

  1. Tell the ggplot() function what our data is.
  2. Tell ggplot() what relationships we want to see. For convenience we will put the results of the first two steps in an object called p.
  3. Tell ggplot how we want to see the relationships in our data.
  4. Layer on geoms as needed, by adding them on the p object one at a time.
  5. Use some additional functions to adjust scales, labels, tickmarks, titles.
  • e.g. scale_, labs(), and guides() functions

As you start to run more R code, you’re likely to run into problems. Don’t worry — it happens to everyone. I have been writing code in numerous languages for years, and every day I still write code that doesn’t work. Sadly, R is particularly persnickity, and its error messages are often opaque.

Start by carefully comparing the code that you’re running to the code in these notes. R is extremely picky, and a misplaced character can make all the difference. Make sure that every ( is matched with a ) and every ” is paired with another “. Sometimes you’ll run the code and nothing happens. Check the left-hand of your console: if it’s a +, it means that R doesn’t think you’ve typed a complete expression and it’s waiting for you to finish it. In this case, it’s usually easy to start from scratch again by pressing ESCAPE to abort processing the current command.

One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start.

Mapping Aesthetics vs Setting them

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp, color = 'yellow'))
p + geom_point() + scale_x_log10()

This is interesting (or annoying): the points are not yellow. How can we tell ggplot to draw yellow points?

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp, ...))
p + geom_point(...) + scale_x_log10()

Try it: (OPTIONAL) describe in your words what is going on. One way to avoid such mistakes is to read arguments inside aes(<property> = <variable>)as the property in the graph is determined by the data in .

Try it: (OPTIONAL) Write the above sentence for the original call aes(x = gdpPercap, y = lifeExp, color = 'yellow').

Aesthetics convey information about a variable in the dataset, whereas setting the color of all points to yellow conveys no information about the dataset - it changes the appearance of the plot in a way that is independent of the underlying data.

Remember: color = 'yellow' and aes(color = 'yellow') are very different, and the second makes usually no sense, as 'yellow' is treated as data.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() + geom_smooth(color = "orange", se = FALSE, size = 8, method = "lm") + scale_x_log10()

Try it: (OPTIONAL) Write down what all those arguments in geom_smooth(...) do.

p + geom_point(alpha = 0.3) +
  geom_smooth(method = "gam") +
  scale_x_log10(labels = scales::dollar) +
  labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
       title = "Economic Growth and Life Expectancy",
       subtitle = "Data Points are country-years",
       caption = "Source: Gapminder")

Coloring by continent:

library(scales)
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp, color = continent, fill = continent))
p + geom_point()
p + geom_point() + scale_x_log10(labels = dollar)
p + geom_point() + scale_x_log10(labels = dollar) + geom_smooth()

Try it: (OPTIONAL) What does fill = continent do? What do you think about the match of colors between lines and error bands?

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = continent)) + geom_smooth() + scale_x_log10()

Try it: (OPTIONAL) Notice how the above code leads to a single smooth line, not one per continent. Why?

Try it: (OPTIONAL) What is bad about the following example, assuming the graph is the one we want? Think about why you should set aesthetics at the top level rather than at the individual geometry level if that’s your intent.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = continent)) +
  geom_smooth(mapping = aes(color = continent, fill = continent)) +
  scale_x_log10() +
  geom_smooth(mapping = aes(color = continent), method = "gam")

Exercise 3 of 3

Generate two new plots with data = gapminder (note: you’ll need to install the package by the same name if you have not already). Label the axes and the header with clear, easy to understand language. In a few sentences, describe what you’ve visualized and why.

Note that this is your first foray into ggplot2; accordingly, you should try to make sure that you do not bite off more than you can chew. We will improve and refine our abilities as we progress through the semester.