Visualizations

Content for Thursday, September 21, 2023

Adding graphics to an Rmarkdown file

From the web

If you are incorporating an image (.png or .jpg) from another site on the web, you can refer to the image directly, provided the web address ends in .png, .jpg, or .gif. Google image search makes it a little hard to get directly to the image source, so click through a search until you get to the original image. Once you’re there, right-click on the image and select “Copy Image Address” (may vary by system). If you can paste the URL into a new window and get the image itself, you’re good to go with the instructions here. Some sites and formats do not host the images as a separate file; they may be generated by an app. For instance, if we go to https://msu.edu/students, the background image address is not available by right-clicking. But scrolling down, the image for the Student Information System (https://student.msu.edu/) does have its own address. In cases where the image address is not readily available, you’ll have to take a screenshot and use the instructions in the next section (or dig into the site code if you know how to do that sort of thing). Let’s work on getting this image into our output: https://msu.edu/-/media/assets/msu/images/audience-student/students-sis-home.jpg.

Note that using a web address for an image means if the image owner changes the address or removes the image, you won’t be able to re-knit your document. See the next section for downloading the image and inserting into your document.

There are two ways of inserting an image: in markdown text, or in a code chunk. Both work. I prefer the code chunk method, which uses knitr::include_graphics. This is an R function, so you use it inside a code chunk. When you include an image inside a code chunk, the code chunk options control its size and placement, such as out.width and fig.align. For instance, out.width = '75%' will use 75% of the available page width, whatever it may be. (fig.width and fig.height take a number of inches and apply to plots that R draws, so fig.width = '75%' does not work.) The downside is that you have to pull a copy of the image from the web and save it locally (earlier versions of Rmarkdown would do this for you automatically, but this feature was removed in rmarkdown v1.6 for security reasons). Here’s the code to do so:

Images inserted in code chunks

```
{r insert-image, out.width = '75%', fig.align='center', echo = TRUE}
download.file("https://msu.edu/-/media/assets/msu/images/audience-student/students-sis-home.jpg", destfile = 'temporary.jpg')
knitr::include_graphics(path = 'temporary.jpg')
```

The file name you give to destfile doesn’t matter (R will create that file), but you do need to use the matching suffix (don’t use temporary.jpg if you’re downloading a .png). By default, the file will be saved in the same folder as your .Rmd file. When include_graphics goes to read the file, it will start looking relative to the folder that contains your .Rmd file. That is, download.file will copy the image to the same place that include_graphics looks for it. See below for more on relative filepaths.

Note the code chunk option set above as well: out.width='75%', which is stated in the curly brackets that head the chunk. This is where knitr finds details about how you want to handle the output. Before, we saw that echo=T would add a copy of the code itself to the output (vs. echo=F, which outputs only the result). Similarly, out.width='75%' sizes the output to take up about 75% of the text width. For plots that R draws, you can also use fig.width=8 to set the output to 8 inches wide (close to the width of a sheet of paper), which will maximize the space used for your plot. You may need this if/when you start plotting larger things. Hint.
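
Two small conveniences are worth sketching here: deriving the local file name from the URL so the suffix always matches, and skipping the download when a local copy already exists. The helper names destfile_for() and fetch_image() are made up for this illustration; they are not part of knitr or rmarkdown.

```r
# Sketch only: `destfile_for()` and `fetch_image()` are hypothetical helpers.

# Build a local file name whose suffix matches the suffix of the URL
destfile_for <- function(url, stem = "temporary") {
  paste0(stem, ".", tools::file_ext(url))
}

# Download the image only if we do not already have a local copy
fetch_image <- function(url, destfile = destfile_for(url)) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the file as binary, which matters on Windows
    download.file(url, destfile = destfile, mode = "wb")
  }
  destfile
}

img_url <- "https://msu.edu/-/media/assets/msu/images/audience-student/students-sis-home.jpg"
destfile_for(img_url)
```

Inside a code chunk you could then write knitr::include_graphics(fetch_image(img_url)). Keeping the local copy also means the document can still knit if the image later disappears from the web.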

Images inserted via Markdown

The markdown language is what controls the text outside of the R code chunks. It has its own way of inserting images. Here the image is inserted in the text, not in a code chunk.

![](https://msu.edu/-/media/assets/msu/images/audience-student/students-sis-home.jpg)

Images from a local file

Whether you have the file on your hard drive already, right-click and download from the web to keep a copy for posterity, or save the image from a screenshot, you will often need to insert an image from a local file. In Rmarkdown, the path will always be relative to the folder containing your .Rmd file. So if you keep your .Rmd in /Users/jkirk/SSC442/Example3 and you have a folder /Users/jkirk/SSC442/Example3/images that contains a file picture.png, then you would tell Rmarkdown to find the file at ./images/picture.png (which implies its filepath is /Users/jkirk/SSC442/Example3/images/picture.png). The ./ tells R to start looking in the local directory holding the .Rmd file you’re working on.

Once you know your local relative path, you can use either of the above methods: knitr::include_graphics('./images/picture.png') or ![](./images/picture.png).

It is possible for R to find your image with an incorrect filepath when you click the “run chunk” button, but then not be able to find it when you knit. This can be very frustrating. It is almost always because you do not have the right relative filepath.
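
When this happens, it can help to ask R where it is looking. The sketch below prints the working directory and whether a relative path resolves from it; check_image_path() is a made-up helper name, and ./images/picture.png is just the example path from above. If the working directory printed in the console differs from the one printed when you knit, that difference is the likely culprit.

```r
# Sketch only: `check_image_path()` is a hypothetical helper.
check_image_path <- function(path) {
  found <- file.exists(path)
  message("Working directory: ", getwd())
  message("Looking for:       ", path,
          if (found) " (found)" else " (NOT found)")
  invisible(found)   # TRUE if the file exists relative to the working directory
}

check_image_path("./images/picture.png")
```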

Data visualization in practice

In this chapter, we will demonstrate how relatively simple ggplot2 code can create insightful and aesthetically pleasing plots. As motivation we will create plots that help us better understand trends in world health and economics. We will implement what we learned in previous sections of the class and learn how to augment the code to perfect the plots. As we go through our case study, we will describe relevant general data visualization principles.

Case study: new insights on poverty

Hans Rosling[1] was the co-founder of the Gapminder Foundation[2], an organization dedicated to educating the public by using data to dispel common myths about the so-called developing world. The organization uses data to show how actual trends in health and economics contradict the narratives that emanate from sensationalist media coverage of catastrophes, tragedies, and other unfortunate events. As stated in the Gapminder Foundation’s website:

Journalists and lobbyists tell dramatic stories. That’s their job. They tell stories about extraordinary events and unusual people. The piles of dramatic stories pile up in peoples’ minds into an over-dramatic worldview and strong negative stress feelings: “The world is getting worse!”, “It’s we vs. them!”, “Other people are strange!”, “The population just keeps growing!” and “Nobody cares!”

Hans Rosling conveyed actual data-based trends in a dramatic way of his own, using effective data visualization. This section is based on two talks that exemplify this approach to education: New Insights on Poverty[3] and The Best Stats You’ve Ever Seen[4]. Specifically, in this section, we use data to attempt to answer the following two questions:

  1. Is it a fair characterization of today’s world to say it is divided into western rich nations and the developing world in Africa, Asia, and Latin America?
  2. Has income inequality across countries worsened during the last 40 years?

To answer these questions, we will be using the gapminder dataset provided in dslabs. This dataset was created using a number of spreadsheets available from the Gapminder Foundation. You can access the table like this:

library(tidyverse)
library(dslabs)
library(ggrepel)
library(ggthemes)
gapminder = dslabs::gapminder %>% as_tibble()

Exploring the Data

Taking an exercise from the New Insights on Poverty video, we start by testing our knowledge regarding differences in child mortality across different countries. For each of the five pairs of countries below, which country do you think had the higher child mortality rate in 2015? Which pairs do you think are most similar?

  1. Sri Lanka or Turkey
  2. Poland or South Korea
  3. Malaysia or Russia
  4. Pakistan or Vietnam
  5. Thailand or South Africa

When answering these questions without data, the non-European countries are typically picked as having higher child mortality rates: Sri Lanka over Turkey, South Korea over Poland, and Malaysia over Russia. It is also common to assume that countries considered to be part of the developing world: Pakistan, Vietnam, Thailand, and South Africa, have similarly high mortality rates.

To answer these questions with data, we can use dplyr. For example, for the first comparison we see that:

dslabs::gapminder %>%
  filter(year == 2015 & country %in% c("Sri Lanka","Turkey")) %>%
  select(country, infant_mortality)
##     country infant_mortality
## 1 Sri Lanka              8.4
## 2    Turkey             11.6

Turkey has the higher infant mortality rate.

We can use this code on all comparisons and find the following:

| Country | Infant mortality | Country | Infant mortality |
|---|---|---|---|
| Sri Lanka | 8.4 | Turkey | 11.6 |
| Poland | 4.5 | South Korea | 2.9 |
| Malaysia | 6.0 | Russia | 8.2 |
| Pakistan | 65.8 | Vietnam | 17.3 |
| Thailand | 10.5 | South Africa | 33.6 |

We see that the European countries on this list have higher child mortality rates: Poland has a higher rate than South Korea, and Russia has a higher rate than Malaysia. We also see that Pakistan has a much higher rate than Vietnam, and South Africa has a much higher rate than Thailand. It turns out that when Hans Rosling gave this quiz to educated groups of people, the average score was less than 2.5 out of 5, worse than what they would have obtained had they guessed randomly. This implies that more than ignorant, we are misinformed. In this chapter we see how data visualization helps inform us.
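
Rather than rerunning the filter once per pair, all five comparisons can be done in one pass. This is a sketch of one way to do it, not necessarily how the table above was produced; the pair_lookup data frame and its pair column are introduced here for illustration.

```r
library(dplyr)
library(dslabs)

# The five pairs from the quiz
pairs <- list(
  c("Sri Lanka", "Turkey"),
  c("Poland", "South Korea"),
  c("Malaysia", "Russia"),
  c("Pakistan", "Vietnam"),
  c("Thailand", "South Africa")
)

# Tag each country with its pair number (illustrative lookup table)
pair_lookup <- data.frame(
  pair = rep(seq_along(pairs), each = 2),
  country = unlist(pairs),
  stringsAsFactors = FALSE
)

comparison <- dslabs::gapminder %>%
  filter(year == 2015) %>%
  mutate(country = as.character(country)) %>%   # join on character, not factor
  inner_join(pair_lookup, by = "country") %>%
  select(pair, country, infant_mortality) %>%
  arrange(pair, desc(infant_mortality))         # higher rate first within each pair

comparison
```

Within each pair, the first row is the country with the higher infant mortality rate.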

Slope charts

The slope chart is informative when you are comparing variables of the same type at different time points and for a relatively small number of comparisons, for example, comparing life expectancy between 2010 and 2015.

There is no geometry for slope charts in ggplot2, but we can construct one using geom_line. We need to do some tinkering to add labels. We’ll paste together a character string with the country name and the starting life expectancy, then do the same with just the later life expectancy for the right side. Below is an example comparing 2010 to 2015 for large western countries:

west <- c("Western Europe","Northern Europe","Southern Europe",
          "Northern America","Australia and New Zealand")

dat <- gapminder %>%
  filter(year%in% c(2010, 2015) & region %in% west &
           !is.na(life_expectancy) & population > 10^7) %>%
    mutate(label_first = ifelse(year == 2010, paste0(country, ": ", round(life_expectancy, 1), ' years'), NA),
           label_last = ifelse(year == 2015,  paste0(round(life_expectancy, 1),' years'), NA))

dat %>%
  mutate(location = ifelse(year == 2010, 1, 2),
         location = ifelse(year == 2015 &
                             country %in% c("United Kingdom", "Portugal"),
                           location+0.22, location),
         hjust = ifelse(year == 2010, 1, 0)) %>%
  mutate(year = as.factor(year)) %>%
  ggplot(aes(year, life_expectancy, group = country)) +
  geom_line(aes(color = country), show.legend = FALSE) +
  geom_text_repel(aes(label = label_first, color = country), direction = 'y', nudge_x = -1, seed = 1234, show.legend = FALSE) +
  geom_text_repel(aes(label = label_last, color = country), direction = 'y', nudge_x =  1, seed = 1234, show.legend = FALSE) +
  xlab("") + ylab("Life Expectancy")

An advantage of the slope chart is that it permits us to quickly get an idea of changes based on the slope of the lines. Although we are using angle as the visual cue, we also have position to determine the exact values. Comparing the improvements is a bit harder with a scatterplot:

In the scatterplot, we have followed the principle of using common axes, since we are comparing the same quantity before and after. However, if we have many points, slope charts stop being useful as it becomes hard to see all the lines.
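
The code for the scatterplot is not shown here. One possible reconstruction, assuming the same countries and years as the slope chart, is sketched below. The column names le_2010 and le_2015 and the object names wide, lims, and scatter are choices made for this sketch.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggrepel)
library(dslabs)

west <- c("Western Europe", "Northern Europe", "Southern Europe",
          "Northern America", "Australia and New Zealand")

# One row per country, with 2010 and 2015 life expectancy side by side
wide <- dslabs::gapminder %>%
  filter(year %in% c(2010, 2015) & region %in% west &
           !is.na(life_expectancy) & population > 10^7) %>%
  select(country, year, life_expectancy) %>%
  pivot_wider(names_from = year, values_from = life_expectancy,
              names_prefix = "le_")

# Shared limits so both axes use the same scale
lims <- range(c(wide$le_2010, wide$le_2015), na.rm = TRUE)

scatter <- ggplot(wide, aes(le_2010, le_2015, label = country)) +
  geom_abline(lty = 2) +              # identity line: points above it improved
  geom_point() +
  geom_text_repel() +
  coord_equal(xlim = lims, ylim = lims) +
  xlab("2010") + ylab("2015")

scatter
```

To judge the improvement for a country, you have to estimate its distance from the dashed line, which is harder than reading a slope.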

Bland-Altman plot

Since we are primarily interested in the difference, it makes sense to dedicate one of our axes to it. The Bland-Altman plot, also known as the Tukey mean-difference plot and the MA-plot, shows the difference versus the average:

dat %>%
  group_by(country) %>%
  filter(year %in% c(2010, 2015)) %>%
  dplyr::summarize(average = mean(life_expectancy),
                   difference = life_expectancy[year==2015]-life_expectancy[year==2010]) %>%
  ggplot(aes(average, difference, label = country)) +
  geom_point() +
  geom_text_repel() +
  geom_abline(lty = 2) +
  xlab("Average of 2010 and 2015") +
  ylab("Difference between 2015 and 2010")

Here, by simply looking at the y-axis, we quickly see which countries have shown the most improvement. We also get an idea of the overall value from the x-axis. You already made a similar Bland-Altman plot in an earlier problem set, so we’ll move on.

Bump charts

Finally, we can make a bump chart that shows changes in rankings over time. We’ll look at fertility in South Asia. First we need to calculate a new variable that shows the rank of each country within each year. We can do this if we group by year and then use the rank() function to rank countries by the fertility column.

sa_fe <- gapminder %>%
  filter(region == "Southern Asia") %>%
  filter(year >= 2004, year < 2015) %>%
  group_by(year) %>%
  mutate(rank = rank(fertility))
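
It is worth checking which direction rank() counts in, and what it does with ties, before reading the chart. The values below are made up purely for illustration.

```r
# Hypothetical fertility values for three made-up countries
fertility <- c(A = 1.8, B = 2.9, C = 2.3)

rank(fertility)    # ascending: the smallest value gets rank 1
rank(-fertility)   # descending: the largest value gets rank 1

# Ties get averaged ranks by default (1.5 and 1.5 here);
# ties.method = "min" gives both tied values the lower rank instead
rank(c(2.1, 2.1, 3.0))
rank(c(2.1, 2.1, 3.0), ties.method = "min")
```

So with rank(fertility) as used above, rank 1 is the country with the lowest fertility in that year.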

We then plot this with points and lines, reversing the y-axis so 1 is at the top:

ggplot(sa_fe, aes(x = year, y = rank, color = country)) +
  geom_line() +
  geom_point() +
  scale_y_reverse(breaks = 1:8)

Iran holds the number 1 spot, while Sri Lanka dropped from 2 to 6, and Bangladesh increased from 4 to 2.

As with the slopegraph, there are 8 different colors in the legend and it’s hard to line them all up with the different lines, so we can plot the text directly instead. We’ll use geom_text() again. We don’t need to repel anything, since the text should fit in each row just fine. We need to change the data argument in geom_text() though and filter the data to only include one year, otherwise we’ll get labels on every point, which is excessive. We can also adjust the theme and colors to make it cleaner.

bumpplot <- ggplot(sa_fe, aes(x = year, y = rank, color = country)) +
  geom_line(size = 2) +
  geom_point(size = 4) +
  # label each line at its left and right ends instead of using a legend
  geom_text(data = sa_fe %>% dplyr::filter(year==2004) %>% arrange(rank),
            aes(label = country, x = 2003), fontface = "bold", angle = 45) +
  geom_text(data = sa_fe %>% dplyr::filter(year==2014) %>% arrange(rank),
            aes(label = country, x = 2015), fontface = "bold", angle = 45) +
  guides(color = "none") +  # hide the legend (color = FALSE is deprecated)
  scale_y_reverse(breaks = 1:8) +
  scale_x_continuous(breaks = 2004:2014) +
  scale_color_viridis_d(option = "C", begin = 0.2, end = 0.9) +
  labs(x = NULL, y = "Rank")

bumpplot

If you want to be super fancy, you can use flags instead of country names, but that’s a little more complicated (you need to install the ggflags package). See here for an example.

Themes

We can go a little further towards a clean, easy-to-read data visualization by using themes in our plots. Themes allow us to set a particular range of plot settings in one command, and let us further tweak things like fonts, background colors, and much more. We’ve used them in passing a few times without highlighting them, but we’ll discuss them here.

A pre-built visual theme is applied by adding that theme’s ggplot function to a plot. Let’s look at two of my favorites.

theme_bw() uses the black-and-white theme, which is helpful in making a nice, clean plot:

bumpplot + theme_bw()

The background shading is gone, which gives the plot a nice, crisp feel. It adds the black outline around the plot, but doesn’t mess with the colors in the plot.

Here’s theme_minimal():

bumpplot + theme_minimal()

Themes can alter things in the plot as well. We can really strip it down and remove the Y-axis (which is rarely a good idea, but in a bump chart, it makes sense):

bumpplot + theme_void()

Now that’s clean!

In our opening unit, we had a plot that was styled after the plots in the magazine, The Economist. That’s a theme (in the ggthemes package that we loaded at the top)!

bumpplot + theme_economist()

Themes affect some of the plot elements that we haven’t gotten much into (like the length of axis ticks and the color of the panel grid behind the plot). We can use a theme, then make further changes to the theme. We won’t go into a lot of detail, but here’s an example. Use ?theme to learn more about what you can change. Half the challenge is finding the right term for the thing you want to tweak! Theme changes occur in code order, so you can update a pre-set theme with your own details:

bumpplot +   theme_bw() + theme(strip.text = element_text(face = "bold"),
                   plot.title = element_text(face = "bold"),
                   axis.text.x = element_text(angle = 45, hjust = 1),
                   panel.grid.major.y = element_blank(), # turn off all of the Y grid
                   panel.grid.minor.y = element_blank(),
                   panel.grid.minor.x = element_blank()) # turn off small x grid
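
Because theme changes apply in code order, the order of the two calls matters. A minimal sketch using the built-in mtcars data as a stand-in plot (the object names base_plot, kept, and lost are just for this illustration):

```r
library(ggplot2)

# A small stand-in plot using the built-in mtcars data
base_plot <- ggplot(mtcars, aes(wt, mpg)) + geom_point()

# Tweak AFTER the preset theme: the rotated x-axis labels are kept
kept <- base_plot +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Tweak BEFORE the preset theme: theme_bw() is a complete theme,
# so it replaces the earlier tweak and the labels are not rotated
lost <- base_plot +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  theme_bw()

kept
lost
```

If a tweak seems to be ignored, check whether a preset theme was added after it.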

Small multiples

First we can make some small multiples plots and show life expectancy over time for a handful of countries. We’ll make a list of some countries chosen at random while I scrolled through the data, and then filter our data to include only those rows. We then plot life expectancy, faceting by country.

life_expectancy_small <- gapminder %>%
  filter(country %in% c("Argentina", "Bolivia", "Brazil",
                        "Belize", "Canada", "Chile"))
ggplot(data = life_expectancy_small,
       mapping = aes(x = year, y = life_expectancy)) +
  geom_line(size = 1) +
  facet_wrap(vars(country))

Small multiples! That’s all we need to do.

We can do some fancier things, though. We can make this plot hyper minimalist with a theme:

ggplot(data = life_expectancy_small,
       mapping = aes(x = year, y = life_expectancy)) +
  geom_line(size = 1) +
  facet_wrap(vars(country), scales = "free_y") +
  theme_void() +
  theme(strip.text = element_text(face = "bold"))

We can do a whole part of a continent (poor Syria 😞)

life_expectancy_mena <- gapminder %>%
  filter(region == "Northern Africa" | region == "Western Asia")

ggplot(data = life_expectancy_mena,
       mapping = aes(x = year, y = life_expectancy)) +
  geom_line(size = 1) +
  facet_wrap(vars(country), scales = "free_y", nrow = 3) +
  theme_void() +
  theme(strip.text = element_text(face = "bold"))

We can use the geofacet package to arrange these facets by geography:

library(geofacet)

life_expectancy_eu <- gapminder %>%
  filter(region== 'Western Europe' | region=='Northern Europe' | region=='Southern Europe')

ggplot(life_expectancy_eu, aes(x = year, y = life_expectancy)) +
  geom_line(size = 1) +
  facet_geo(vars(country), grid = "europe_countries_grid1", scales = "free_y") +
  labs(x = NULL, y = NULL, title = "Life expectancy from 1960–2015",
       caption = "Source: Gapminder") +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"),
        plot.title = element_text(face = "bold"),
        axis.text.x = element_text(angle = 45, hjust = 1))

Neat!

Anybody see any problems here?

Sparklines

Sparklines are just line charts (or bar charts) that are really really small.

india_fert <- gapminder %>%
  filter(country == "India")
plot_india <- ggplot(india_fert, aes(x = year, y = fertility)) +
  geom_line() +
  theme_void()
plot_india

ggsave("india_fert.pdf", plot_india, width = 1, height = 0.15, units = "in")
ggsave("india_fert.png", plot_india, width = 1, height = 0.15, units = "in")
china_fert <- gapminder %>%
  filter(country == "China")
plot_china <- ggplot(china_fert, aes(x = year, y = fertility)) +
  geom_line() +
  theme_void()
plot_china

ggsave("china_fert.pdf", plot_china, width = 1, height = 0.15, units = "in")
ggsave("china_fert.png", plot_china, width = 1, height = 0.15, units = "in")
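
The India and China code is identical apart from the country name, so it could be wrapped in small functions. This is only a sketch; fertility_sparkline() and save_sparkline() are made-up helper names, and dropping missing fertility values is an added assumption.

```r
library(dplyr)
library(ggplot2)
library(dslabs)

# Hypothetical helper: build a fertility sparkline for one country
fertility_sparkline <- function(country_name, data = dslabs::gapminder) {
  data %>%
    filter(country == country_name, !is.na(fertility)) %>%
    ggplot(aes(x = year, y = fertility)) +
    geom_line() +
    theme_void()
}

# Hypothetical helper: save a sparkline as both PDF and PNG at sparkline size
save_sparkline <- function(plot, stem, dir = ".") {
  for (ext in c("pdf", "png")) {
    ggsave(file.path(dir, paste0(stem, ".", ext)), plot,
           width = 1, height = 0.15, units = "in")
  }
  invisible(plot)
}

plot_india <- fertility_sparkline("India")
plot_china <- fertility_sparkline("China")
```

Calling save_sparkline(plot_india, "india_fert") would then write india_fert.pdf and india_fert.png to the current folder.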

You can then use those saved tiny plots in your text (with a little html extra in there).

Both India <img class="img-inline" src="/your/path/to/india_fert.png" width = "100"/> and 
China <img class="img-inline" src="/your/path/to/china_fert.png" width = "100"/> have 
seen decreased fertility over the past 20 years.

Both India and China have seen decreased fertility over the past 20 years.

The ecological fallacy and importance of showing the data

Throughout this section, we have been comparing regions of the world. We have seen that, on average, some regions do better than others. In this section, we focus on describing the importance of variability within the groups when examining the relationship between a country’s infant mortality rates and average income.

We define a few more regions and compare the averages across regions:

The relationship between these two variables is almost perfectly linear and the graph shows a dramatic difference. While in the West less than 0.5% of infants die, in Sub-Saharan Africa the rate is higher than 6%!

Note that the plot uses a new transformation, the logit transformation.

Logit transformation

The logit transformation for a proportion or rate \(p\) is defined as:

\[f(p) = \log \left( \frac{p}{1-p} \right)\]

When \(p\) is a proportion or probability, the quantity that is being logged, \(p/(1-p)\), is called the odds. In this case \(p\) is the proportion of infants that survived. The odds tell us how many times more likely an infant is to survive than to die. The log transformation makes this symmetric. If the rates are the same, then the log odds is 0. Fold increases or decreases turn into positive and negative increments, respectively.

This scale is useful when we want to highlight differences near 0 or 1. For survival rates this is important because a survival rate of 90% is unacceptable, while a survival rate of 99% is relatively good. We would much prefer a survival rate closer to 99.9%. We want our scale to highlight these differences, and the logit does this. Note that 99.9/0.1 is about 10 times bigger than 99/1, which is about 10 times larger than 90/10. By using the log, these fold changes turn into constant increases.
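
To see the arithmetic, here is a small sketch that computes the odds and the logit for the three survival rates mentioned above. The logit() function is defined here for illustration; base R’s qlogis() computes the same quantity.

```r
# Logit of a proportion p: the log of the odds p / (1 - p)
logit <- function(p) log(p / (1 - p))

survival <- c(0.90, 0.99, 0.999)

survival / (1 - survival)   # odds: roughly 9, 99, and 999
logit(survival)             # log odds
diff(logit(survival))       # steps between neighbours are close to log(10)

logit(0.5)                  # equal chance of surviving and dying: log odds of 0
```

On the raw scale these three rates are squeezed into the top 10% of the axis; on the logit scale they are spaced roughly evenly.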

Show the data

Now, back to our plot. Based on the plot above, do we conclude that a country with a low income is destined to have low survival rate? Do we conclude that survival rates in Sub-Saharan Africa are all lower than in Southern Asia, which in turn are lower than in the Pacific Islands, and so on?

Jumping to this conclusion based on a plot showing averages is referred to as the ecological fallacy. The almost perfect relationship between survival rates and income is only observed for the averages at the region level. Once we show all the data, we see a somewhat more complicated story:

Specifically, we see that there is a large amount of variability. We see that countries from the same regions can be quite different and that countries with the same income can have different survival rates. For example, while on average Sub-Saharan Africa had the worst health and economic outcomes, there is wide variability within that group. Mauritius and Botswana are doing better than Angola and Sierra Leone, with Mauritius comparable to Western countries.
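
The code behind these plots is not shown here, but the same point can be checked numerically. The sketch below rests on several assumptions that are not taken from the text: it uses the year 2010, measures income as GDP per person per day, converts infant_mortality (deaths per 1,000) into a survival proportion, and uses the dataset’s built-in region column rather than the custom region groups used in the plots.

```r
library(dplyr)
library(dslabs)

# Country-level values (assumptions: year 2010, income = GDP per person per day)
by_country <- dslabs::gapminder %>%
  filter(year == 2010, !is.na(gdp), !is.na(population),
         !is.na(infant_mortality)) %>%
  mutate(income = gdp / population / 365,
         survival = 1 - infant_mortality / 1000)

# Region averages next to the spread hidden behind each average
by_region <- by_country %>%
  group_by(region) %>%
  summarize(n_countries = n(),
            mean_income = mean(income),
            mean_survival = mean(survival),
            min_survival = min(survival),
            max_survival = max(survival)) %>%
  arrange(mean_survival)

by_region
```

Comparing min_survival and max_survival within a region against the gaps between region means gives a rough sense of how much the averages hide.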

Case study: vaccines and infectious diseases

Vaccines have helped save millions of lives. In the 19th century, before herd immunization was achieved through vaccination programs, deaths from infectious diseases, such as smallpox and polio, were common. However, today vaccination programs have become somewhat controversial despite all the scientific evidence for their importance.

The controversy started with a paper[5] published in 1998 and led by Andrew Wakefield claiming there was a link between the administration of the measles, mumps, and rubella (MMR) vaccine and the appearance of autism and bowel disease. Despite much scientific evidence contradicting this finding, sensationalist media reports and fear-mongering from conspiracy theorists led parts of the public into believing that vaccines were harmful. As a result, many parents ceased to vaccinate their children. This dangerous practice can be potentially disastrous given that the Centers for Disease Control (CDC) estimates that vaccinations will prevent more than 21 million hospitalizations and 732,000 deaths among children born in the last 20 years (see Benefits from Immunization during the Vaccines for Children Program Era — United States, 1994-2013, MMWR[6]). The 1998 paper has since been retracted and Andrew Wakefield was eventually “struck off the UK medical register, with a statement identifying deliberate falsification in the research published in The Lancet, and was thereby barred from practicing medicine in the UK.” (source: Wikipedia[7]). Yet misconceptions persist, in part due to self-proclaimed activists who continue to disseminate misinformation about vaccines.

Effective communication of data is a strong antidote to misinformation and fear-mongering. Earlier we used an example provided by a Wall Street Journal article[8] showing data related to the impact of vaccines on battling infectious diseases. Here we reconstruct that example.

The data used for these plots were collected, organized, and distributed by the Tycho Project[9]. They include weekly reported counts for seven diseases from 1928 to 2011, from all fifty states. The yearly totals are helpfully included in the dslabs package:

library(RColorBrewer)
data(us_contagious_diseases)
names(us_contagious_diseases)
## [1] "disease"         "state"           "year"            "weeks_reporting"
## [5] "count"           "population"

We create a temporary object dat that stores only the measles data, includes a per 10,000 rate, orders states by average value of disease and removes Alaska and Hawaii since they only became states in the late 1950s. Note that there is a weeks_reporting column that tells us for how many weeks of the year data was reported. We have to adjust for that value when computing the rate.

the_disease <- "Measles"
dat <- us_contagious_diseases %>%
  filter(!state%in%c("Hawaii","Alaska") & disease == the_disease) %>%
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>%
  mutate(state = reorder(state, rate))
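
To see what the weeks_reporting adjustment does, here is the same formula applied to a single made-up state-year. All three input numbers are hypothetical.

```r
# Hypothetical numbers for one state-year
count <- 1000            # reported cases
population <- 1e6        # state population
weeks_reporting <- 26    # only half the year was reported

raw_rate <- count / population * 10000        # cases per 10,000 as reported
adj_rate <- raw_rate * 52 / weeks_reporting   # scaled up to a full 52-week year

raw_rate   # 10
adj_rate   # 20

# With zero weeks reported the rate is undefined (0 / 0 gives NaN),
# which is.na() also flags as missing
no_report_rate <- 0 / population * 10000 * 52 / 0
is.na(no_report_rate)
```

The adjustment assumes the unreported weeks looked like the reported ones, so a state that reported for half the year has its rate doubled.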

We can now easily plot disease rates per year. Here are the measles data from California:

dat %>% filter(state == "California" & !is.na(rate)) %>%
  ggplot(aes(year, rate)) +
  geom_line() +
  ylab("Cases per 10,000")  +
  geom_vline(xintercept=1963, col = "blue")

We add a vertical line at 1963 since this is when the vaccine was introduced [Centers for Disease Control and Prevention (2014). CDC Health Information for International Travel 2014 (the Yellow Book). p. 250. ISBN 9780199948505].

Now can we show data for all states in one plot? We have three variables to show: year, state, and rate. In the WSJ figure, they use the x-axis for year, the y-axis for state, and color hue to represent rates. However, the color scale they use, which goes from yellow to blue to green to orange to red, can be improved.

In our example, we want to use a sequential palette since there is no meaningful center, just low and high rates.
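
As a quick sketch of the difference, RColorBrewer labels each of its palettes as sequential, diverging, or qualitative. The object names seq_pal and div_pal are just for this illustration, and RdBu is picked only as an example of a diverging palette.

```r
library(RColorBrewer)

# Sequential: light to dark in one hue, suited to low-to-high values
seq_pal <- brewer.pal(9, "Reds")

# Diverging: two hues meeting at a neutral midpoint,
# suited to data with a meaningful center
div_pal <- brewer.pal(9, "RdBu")

# brewer.pal.info records each palette's category ("seq", "div", or "qual")
brewer.pal.info[c("Reds", "RdBu"), ]

# Preview all of the sequential palettes (draws a plot)
display.brewer.all(type = "seq")
```

For rates that run from zero upward, a sequential palette such as Reds maps darker color to higher values without implying a midpoint.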

We use the geometry geom_tile to tile the region with colors representing disease rates. We use a square root transformation to avoid having the really high counts dominate the plot. Notice that missing values are shown in grey. Note that once a disease was pretty much eradicated, some states stopped reporting cases altogether. This is why we see so much grey after 1980.

dat %>% ggplot(aes(year, state, fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand=c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept=1963, col = "blue") +
  theme_minimal() +
  theme(panel.grid = element_blank(),
        legend.position="bottom",
        text = element_text(size = 8)) +
  ggtitle(the_disease) +
  ylab("") + xlab("")

This plot makes a very striking argument for the contribution of vaccines. However, one limitation of this plot is that it uses color to represent quantity, which we earlier explained makes it harder to know exactly how high values are going. Position and lengths are better cues. If we are willing to lose state information, we can make a version of the plot that shows the values with position. We can also show the average for the US, which we compute like this:

avg <- us_contagious_diseases %>%
  filter(disease==the_disease) %>% group_by(year) %>%
  summarize(us_rate = sum(count, na.rm = TRUE) /
              sum(population, na.rm = TRUE) * 10000)

Now to make the plot we simply use the geom_line geometry:

dat %>%
  filter(!is.na(rate)) %>%
    ggplot() +
  geom_line(aes(year, rate, group = state),  color = "grey50",
            show.legend = FALSE, alpha = 0.2, size = 1) +
  geom_line(mapping = aes(year, us_rate),  data = avg, size = 1) +
  scale_y_continuous(trans = "sqrt", breaks = c(5, 25, 125, 300)) +
  ggtitle("Cases per 10,000 by state") +
  xlab("") + ylab("") +
  geom_text(data = data.frame(x = 1955, y = 50),
            mapping = aes(x, y, label="US average"),
            color="black") +
  geom_vline(xintercept=1963, col = "blue")

In theory, we could use color to represent the categorical value state, but it is hard to pick 50 distinct colors.

TRY IT


  1. Reproduce the heatmap plot we previously made but for smallpox. For this plot, do not include years in which cases were not reported in 10 or more weeks.

  2. Now reproduce the time series plot we previously made, but this time following the instructions of the previous question for smallpox.

  3. For the state of California, make a time series plot showing rates for all diseases. Include only years with 10 or more weeks reporting. Use a different color for each disease.

  4. Now do the same for the rates for the US. Hint: compute the US rate by using summarize: the total divided by total population.