Introduction to the tidyverse
Readings
- This page.
- Chapter 1 of Introduction to Statistical Learning, available here.
- Optional: The “Tidy Your Data” tutorial on RStudio Cloud Primers
Quick notes 9-5-2023
- Start labs early
- You run install.packages("packageName") once on your computer.
  - And never ever ever in your code.
- You load an already-installed package using library(packageName) in a code chunk.
  - Never in your console.
- When RMarkdown knits, it starts a whole new, empty session that has no knowledge of what you typed into the console
- Slack
- Use it.
- I would very much prefer that you post in the class-visible channels. Others can learn from your issues.
- We have a channel just for labs and R. Please use that one.
Group Projects
Your final is a group project. You will also have two “mini” projects. They comprise a large part of your grade.
You need to start planning soon.
To aid in your planning, here are the required elements of your final project.
- You must find existing data to analyze. Aggregating and merging data from multiple sources is encouraged.
- You must visualize 3 interesting features of that data.
- You must come up with some analysis, using tools from this course, which relates your data to either a prediction or a policy conclusion.
- You must think critically about your analysis and be able to identify potential issues.
- You must present your analysis as if presenting to a C-suite executive.
Your mini-projects along the way will be more structured, but will serve to guide you towards the final project.
Teams
Please form teams of 2-3 people. Once all agree to be on a team, have one person email our TA, Nick (armst427@msu.edu), and cc all of the members of the team so that nobody is surprised to be included on a team. Title the email [SSC442] - Group Formation. Tell us your team name (be creative), and list in the email the names of all of the team members and their email addresses (in addition to cc-ing those team members on the email).
If you opt not to form a team, you will be automatically added to the “willing to be randomly assigned” pool and will be paired with one or two other people.
Send this email by September 21st and we will assign un-teamed folks at the beginning of the following week. Project 1 is due in mid-October. See schedule for all the important project dates.
Guiding Question
For future lectures, the guiding questions will be more pointed and at a higher level to help steer your thinking. Here, we want to ensure you remember some basics and accordingly the questions are straightforward.
- Why do we want tidy data?
- What are the challenges associated with shaping things into a tidy format?
Randomness and Data Analytics
And the fabulous importance of probabilistic inference…
This half of the lecture is very “high-level,” which means it is talking about abstract concepts. It is also quite important.
We want to discuss why we eventually will need to utilize tons of difficult mathematics. Why do we care so much about hypothesis tests and the like?
Moreover, we can highlight why we want our data structured to behave nicely.
Learning From Data
The following are the basic requirements for statistical learning:
- A pattern exists.
- This pattern is not easily expressed in a closed mathematical form.
- You have data.
Formalization
We think of our outcome-of-interest as a response or target that we wish to predict or wish to learn something about.
We generically refer to the response as \(Y\)
Other aspects of the data are known as features, inputs, predictors, or regressors. We call one of these \(X_i\).
- The subscript \(i\) indicates that we have an \(X\) realized for every individual observation in our data
We can refer to the input vector collectively as:
\[X = \begin{bmatrix} x_{11} & x_{12} \\ x_{21} & x_{22} \\ \vdots & \vdots \\ x_{N1} & x_{N2} \end{bmatrix}\]
This is sometimes called a “rectangular array” of data. Sounds complicated, but you’ve definitely encountered it before. Excel holds “rectangular arrays” of data.
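In R, a data frame is exactly such a rectangular array. A minimal sketch (the variable names here are invented for illustration):
x1 <- c(1.2, 0.7, 3.1)
x2 <- c(10, 25, 40)
# Each row is an observation; each column is a feature
toy_data <- data.frame(x1, x2)
dim(toy_data)
## [1] 3 2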
We are seeking some unknown function that maps \(X\) to \(Y\)
Put another way, we are seeking to explain \(Y\) as follows:
\[Y = f(X) + e\]
The Target Function
We call the function \(f: \mathcal{X} \rightarrow \mathcal{Y}\) the target function
How do we find the function? We don’t! We get as close as we can, though:
- Observe data \((\mathbf{x}_1, y_1), \cdots, (\mathbf{x}_N, y_N)\)
- Use some algorithm to approximate \(f\)
- Produce final hypothesis function \(g \approx f\)
- Evaluate how well \(g\) approximates \(f\) and iterate as needed (a toy sketch of this loop follows).
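Here is a minimal sketch of that loop in R. We assume, purely for illustration, that the true target function is linear and that least squares is our learning algorithm; neither choice comes from the lecture itself:
set.seed(442)
# Simulate data from a known target f(x) = 2x + 1, plus noise e
x <- runif(100)
y <- 2 * x + 1 + rnorm(100, sd = 0.25)

# Use an algorithm (least squares) to produce a hypothesis g
g <- lm(y ~ x)

# Evaluate how well g fits the observed data (mean squared error)
mean((y - fitted(g))^2)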
Why Estimate an Unknown Function?
With a good estimate of \(f\) we can make predictions of \(Y\) at new points \(X = x\)
We can also understand which components of \(X = (X_1, X_2, \cdots, X_m)\) are important in explaining \(Y\), and which are (potentially) irrelevant
- e.g., GDP and years industrialized have a big impact on emissions, but hydro utilization typically does not.
Depending on the complexity of \(f\), we may be able to meaningfully understand how each component of \(X\) affects \(Y\).
(But we should be careful about assigning causal interpretations, more on this later)
The Parable of the Marbles
(Courtesy of Prof. Bushong)
Imagine a bag of marbles with two types of marbles: ♣️ and ♦️.
We are going to pick a sample of \(n\) marbles (with replacement).
We want to learn something about \(\mu\), the objective probability to pick a ♣️.
In addition to defining the objective probability of picking a ♣️, we have an observed fraction \(\eta\), which we define as the fraction of ♣️ in the sample.
Question: Can we say anything exact and for-sure about \(\mu\) (outside the data) after observing \(\eta\) (the data)?
No. It is possible for the sample to be all ♣️, ♣️, ♣️, ♣️, ♣️ even when the bag is 50/50 ♣️ and ♦️.
No matter what we draw, we can’t (based on that draw alone) eliminate the possibility of drawing a ♦️.
And unless we assume that the only two values in the world are ♦️ and ♣️, we can’t rule out 💩1
Question: If we really can’t say anything exact and for-sure, then why do we do things like polling (e.g. to predict the outcome of a presidential election)?
- The bad case, in which we draw a sample that is completely misleading, is possible but not probable.
Outside the Data
Put another way, since \(f\) is unknown, it can take on any value outside the data we have, no matter how large the data.
- This is called No Free Lunch
You cannot know anything for sure about \(f\) outside the data without making assumptions.
Is there any hope to know anything about \(f\) outside the data set without making assumptions about \(f\)?
Yes, if we are willing to give up the “for sure”.
Instead of
\[ Y = f(X) \]
we can work with
\[ Y = f(X) + e \]
and if we guess at an \(f\) function that gets pretty close, \(e\) will be pretty small. If \(e\) is pretty small, then we’re in good shape for prediction!
Hoeffding’s Inequality
Hoeffding’s Inequality states, loosely, that \(\eta\) (the fraction of our sample that is ♣️) is unlikely to be far from \(\mu\). More precisely:
\[\mathbb{P}\left[|\eta - \mu| > \epsilon \right] \leq 2e^{-2\epsilon^2 n}\]
Learning in which \(\eta \approx \mu\) in this sense is called probably approximately correct (PAC) learning.
An example of Hoeffding’s Inequality
Example: n = 1,000. Draw a sample and observe \(\eta\)
\(\sim\) 99% of the time, \(\mu - .05 \leq \eta \leq \mu+.05\)
- This is implied by setting \(\epsilon = 0.05\) and using \(n=1,000\)
99.9999996% of the time \(\mu - .10 \leq \eta \leq \mu + .10\)
What does this mean?
If I repeatedly pick a sample of size 1,000, observe \(\eta\) and claim that \(\mu \in \left[\eta - .05, \eta + .05\right]\) (or that the error bar is \(\pm 0.05\)), I will be right 99% of the time.
On any particular sample you may be wrong, but not often.
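We can see Hoeffding’s guarantee at work with a quick simulation (our own sketch, taking \(\mu = 0.5\) and \(\epsilon = 0.05\)):
set.seed(442)
mu <- 0.5   # true fraction of clubs in the bag
n <- 1000   # marbles drawn per sample

# Draw 10,000 samples and record eta, the observed fraction of clubs
eta <- replicate(10000, mean(runif(n) < mu))

# How often is eta more than 0.05 away from mu?
mean(abs(eta - mu) > 0.05)  # a small number, comfortably below the bound
2 * exp(-2 * 0.05^2 * n)    # the Hoeffding bound: about 0.0135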
NOTE
This week’s content is split into two “halves”: the critical data manipulation information contained below and a more-entertaining discussion of visualization included in the Exercises. That’s a pretty jarring change, so here’s a picture of my dog.
The tidyverse
In last week’s content and lab, we demonstrated how to manipulate vectors by reordering and subsetting them through indexing. However, once we start more advanced analyses, the preferred unit for data storage is not the vector but the data frame. In this lecture, we learn to work directly with data frames, which greatly facilitate the organization of information. We will be using data frames for the majority of this class and you will use them for the majority of your data science life (however long that might be). We will focus on a specific data format referred to as tidy and on a specific collection of packages, referred to as the tidyverse, that is particularly helpful for working with tidy data.
We can load all the tidyverse packages at once by installing and loading the tidyverse package:2
library(tidyverse)
We will learn how to implement the tidyverse approach throughout the book, but before delving into the details, in this chapter we introduce some of the most widely used tidyverse functionality, starting with the dplyr package for manipulating data frames and the purrr package for working with functions. Note that the tidyverse also includes a graphing package, ggplot2, the readr package, and many others. In this lesson, we first introduce the concept of tidy data and then demonstrate how we use the tidyverse to work with data frames in this format.
Tidy data
We say that a data table is in tidy format if each row represents one observation and columns represent the different variables available for each of these observations. The murders dataset is an example of a tidy data frame.
library(dslabs)
data(murders)
head(murders)
## state abb region population total
## 1 Alabama AL South 4779736 135
## 2 Alaska AK West 710231 19
## 3 Arizona AZ West 6392017 232
## 4 Arkansas AR South 2915918 93
## 5 California CA West 37253956 1257
## 6 Colorado CO West 5029196 65
Each row represents a state, with each of the five columns providing a different variable related to these states: name, abbreviation, region, population, and total murders.
To see how the same information can be provided in different formats, consider the following example:
library(dslabs)
data("gapminder") # gapminder will now be a data.frame in your "environment" (memory)
tidy_data <- gapminder %>%
filter(country %in% c("South Korea", "Germany") & !is.na(fertility)) %>%
select(country, year, fertility)
head(tidy_data, 6)
## country year fertility
## 1 Germany 1960 2.41
## 2 South Korea 1960 6.16
## 3 Germany 1961 2.44
## 4 South Korea 1961 5.99
## 5 Germany 1962 2.47
## 6 South Korea 1962 5.79
This tidy dataset provides fertility rates for two countries across the years. This is a tidy dataset because each row presents one observation with the three variables being country, year, and fertility rate. However, this dataset originally came in another format and was reshaped for the dslabs package. Originally, the data was in the following format:
## country 1960 1961 1962
## 1 Germany 2.41 2.44 2.47
## 2 South Korea 6.16 5.99 5.79
The same information is provided, but there are two important differences in the format: 1) each row includes several observations and 2) one of the variables’ values, year, is stored in the header. For the tidyverse packages to be optimally used, data need to be reshaped into tidy format, which you will learn to do throughout this course. For starters, though, we will use example datasets that are already in tidy format.
Although not immediately obvious, as you go through the book you will start to appreciate the advantages of working in a framework in which functions use tidy formats for both inputs and outputs. You will see how this permits the data analyst to focus on more important aspects of the analysis rather than the format of the data.
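As a preview of reshaping, here is a sketch using pivot_longer from the tidyr package (loaded with the tidyverse) to tidy a wide table like the one above. We rebuild the wide table by hand here, since dslabs ships the already-tidy version:
wide_data <- tibble(country = c("Germany", "South Korea"),
                    `1960` = c(2.41, 6.16),
                    `1961` = c(2.44, 5.99),
                    `1962` = c(2.47, 5.79))

# Move the years out of the column headers and into their own column
wide_data %>%
  pivot_longer(-country, names_to = "year", values_to = "fertility")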
TRY IT
- Examine the built-in dataset co2. Which of the following is true:
  - co2 is tidy data: it has one year for each row.
  - co2 is not tidy: we need at least one column with a character vector.
  - co2 is not tidy: it is a matrix instead of a data frame.
  - co2 is not tidy: to be tidy we would have to wrangle it to have three columns (year, month, and value); then each co2 observation would have a row.
- Examine the built-in dataset ChickWeight. Which of the following is true:
  - ChickWeight is not tidy: each chick has more than one row.
  - ChickWeight is tidy: each observation (a weight) is represented by one row. The chick from which this measurement came is one of the variables.
  - ChickWeight is not tidy: we are missing the year column.
  - ChickWeight is tidy: it is stored in a data frame.
- Examine the built-in dataset BOD. Which of the following is true:
  - BOD is not tidy: it only has six rows.
  - BOD is not tidy: the first column is just an index.
  - BOD is tidy: each row is an observation with two values (time and demand).
  - BOD is tidy: all small datasets are tidy by definition.
- Which of the following built-in datasets is tidy (you can pick more than one): BJsales, EuStockMarkets, DNase, Formaldehyde, Orange, UCBAdmissions.
Manipulating data frames
The dplyr package from the tidyverse introduces functions that perform some of the most common operations when working with data frames and uses names for these functions that are relatively easy to remember. For instance, to change the data table by adding a new column, we use mutate. To filter the data table to a subset of rows, we use filter. Finally, to subset the data by selecting specific columns, we use select.
Adding a column with mutate
We want all the necessary information for our analysis to be included in the data table. So the first task is to add the murder rates to our murders data frame. The function mutate takes the data frame as a first argument and the name and values of the variable as a second argument, using the convention name = values. So, to add murder rates, we use:
library(dslabs)
data("murders")
murders <- mutate(murders, rate = total / population * 100000)
Notice that here we used total and population inside the function, which are objects that are not defined in our workspace. But why don’t we get an error?
This is one of dplyr’s main features. Functions in this package, such as mutate, know to look for variables in the data frame provided in the first argument. In the call to mutate above, total will have the values in murders$total. This approach makes the code much more readable.
We can see that the new column is added:
head(murders)
## state abb region population total rate
## 1 Alabama AL South 4779736 135 2.824424
## 2 Alaska AK West 710231 19 2.675186
## 3 Arizona AZ West 6392017 232 3.629527
## 4 Arkansas AR South 2915918 93 3.189390
## 5 California CA West 37253956 1257 3.374138
## 6 Colorado CO West 5029196 65 1.292453
Note: Although we have overwritten the original murders object, this does not change the dataset that loads with data(murders). If we load the murders data again, the original will overwrite our mutated version.
Subsetting with filter
Now suppose that we want to filter the data table to only show the entries for which the murder rate is lower than 0.71. To do this we use the filter function, which takes the data table as the first argument and then the conditional statement as the second. Like mutate, we can use the unquoted variable names from murders inside the function and it will know we mean the columns and not objects in the workspace.
filter(murders, rate <= 0.71)
## state abb region population total rate
## 1 Hawaii HI West 1360301 7 0.5145920
## 2 Iowa IA North Central 3046355 21 0.6893484
## 3 New Hampshire NH Northeast 1316470 5 0.3798036
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Vermont VT Northeast 625741 2 0.3196211
Selecting columns with select
Although our data table only has six columns, some data tables include hundreds. If we want to view just a few, we can use the dplyr select function. In the code below we select three columns, assign this to a new object, and then filter the new object:
new_table <- select(murders, state, region, rate)
filter(new_table, rate <= 0.71)
## state region rate
## 1 Hawaii West 0.5145920
## 2 Iowa North Central 0.6893484
## 3 New Hampshire Northeast 0.3798036
## 4 North Dakota North Central 0.5947151
## 5 Vermont Northeast 0.3196211
In the call to select, the first argument murders is an object, but state, region, and rate are variable names.
TRY IT
- Load the dplyr package and the murders dataset.
library(dplyr)
library(dslabs)
data(murders)
You can add columns using the dplyr function mutate. This function is aware of the column names and inside the function you can call them unquoted:
murders <- mutate(murders, population_in_millions = population / 10^6)
We can write population rather than murders$population because mutate is part of dplyr. The function mutate knows we are grabbing columns from murders.
Use the function mutate to add a murders column named rate with the per 100,000 murder rate as in the example code above. Make sure you redefine murders as done in the example code above (murders <- [your code]) so we can keep using this variable.
- If rank(x) gives you the ranks of x from lowest to highest, rank(-x) gives you the ranks from highest to lowest. Use the function mutate to add a column rank containing the rank, from highest to lowest murder rate. Make sure you redefine murders so we can keep using this variable.
- With dplyr, we can use select to show only certain columns. For example, with this code we would only show the states and population sizes:
select(murders, state, population) %>% head()
Use select to show the state names and abbreviations in murders. Do not redefine murders, just show the results.
- The dplyr function filter is used to choose specific rows of the data frame to keep. Unlike select, which is for columns, filter is for rows. For example, you can show just the New York row like this:
filter(murders, state == "New York")
You can use other logical vectors to filter rows.
Use filter to show the top 5 states with the highest murder rates. After we add murder rate and rank, do not change the murders dataset, just show the result. Remember that you can filter based on the rank column.
- We can remove rows using the != operator. For example, to remove Florida, we would do this:
no_florida <- filter(murders, state != "Florida")
Create a new data frame called no_south that removes states from the South region. How many states are in this category? You can use the function nrow for this.
- We can also use %in% to filter with dplyr. You can therefore see the data from New York and Texas like this:
filter(murders, state %in% c("New York", "Texas"))
Create a new data frame called murders_nw with only the states from the Northeast and the West. How many states are in this category?
- Suppose you want to live in the Northeast or West and want the murder rate to be less than 1. We want to see the data for the states satisfying these options. Note that you can use logical operators with filter. Here is an example in which we filter to keep only small states in the Northeast region.
filter(murders, population < 5000000 & region == "Northeast")
Make sure murders has been defined with rate and rank and still has all states. Create a table called my_states that contains rows for states satisfying both conditions: it is in the Northeast or West, and the murder rate is less than 1. Use select to show only the state name, the rate, and the rank.
The pipe: %>%
With dplyr we can perform a series of operations, for example select and then filter, by sending the results of one function to another using what is called the pipe operator: %>%. Some details are included below.
We wrote code above to show three variables (state, region, rate) for states that have murder rates below 0.71. To do this, we defined the intermediate object new_table. In dplyr we can write code that looks more like a description of what we want to do without intermediate objects:
\[ \mbox{original data } \rightarrow \mbox{ select } \rightarrow \mbox{ filter } \]
For such an operation, we can use the pipe %>%. The code looks like this:
murders %>% select(state, region, rate) %>% filter(rate <= 0.71)
## state region rate
## 1 Hawaii West 0.5145920
## 2 Iowa North Central 0.6893484
## 3 New Hampshire Northeast 0.3798036
## 4 North Dakota North Central 0.5947151
## 5 Vermont Northeast 0.3196211
This line of code is equivalent to the two lines of code above. What is going on here?
In general, the pipe sends the result of the left side of the pipe to be the first argument of the function on the right side of the pipe. Here is a very simple example:
16 %>% sqrt()
## [1] 4
We can continue to pipe values along:
16 %>% sqrt() %>% log2()
## [1] 2
The above statement is equivalent to log2(sqrt(16)).
Remember that the pipe sends values to the first argument, so we can define other arguments as if the first argument is already defined:
16 %>% sqrt() %>% log(base = 2)
## [1] 2
Therefore, when using the pipe with data frames and dplyr, we no longer need to specify the required first argument since the dplyr functions we have described all take the data as the first argument. In the code we wrote:
murders %>% select(state, region, rate) %>% filter(rate <= 0.71)
murders is the first argument of the select function, and the new data frame (formerly new_table) is the first argument of the filter function.
Note that the pipe works well with functions where the first argument is the input data. Functions in tidyverse packages like dplyr have this format and can be used easily with the pipe. It’s worth noting that as of R 4.1, there is a base-R version of the pipe, |>, though it has its disadvantages. We’ll stick with %>% for now.
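For the curious, the base pipe behaves the same way in simple cases like the ones above:
16 |> sqrt() |> log(base = 2)
## [1] 2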
TRY IT
- The pipe %>% can be used to perform operations sequentially without having to define intermediate objects. Start by redefining murders to include rate and rank.
murders <- mutate(murders, rate = total / population * 100000,
rank = rank(-rate))
In the solution to the previous exercise, we did the following:
my_states <- filter(murders, region %in% c("Northeast", "West") &
rate < 1)
select(my_states, state, rate, rank)
The pipe %>% permits us to perform both operations sequentially without having to define an intermediate variable my_states. We therefore could have mutated and selected in the same line like this:
mutate(murders, rate = total / population * 100000,
rank = rank(-rate)) %>%
select(state, rate, rank)
Notice that select no longer has a data frame as the first argument. The first argument is assumed to be the result of the operation conducted right before the %>%.
Repeat the previous exercise, but now instead of creating a new object, show the result and only include the state, rate, and rank columns. Use a pipe %>% to do this in just one line.
- Reset murders to the original table by using data(murders). Use a pipe to create a new data frame called my_states that considers only states in the Northeast or West which have a murder rate lower than 1, and contains only the state, rate, and rank columns. The pipe should have four components separated by three %>%. The code should look something like this:
my_states <- murders %>%
mutate SOMETHING %>%
filter SOMETHING %>%
select SOMETHING
Summarizing data
An important part of exploratory data analysis is summarizing data. The average and standard deviation are two examples of widely used summary statistics. More informative summaries can often be achieved by first splitting data into groups. In this section, we cover two new dplyr verbs that make these computations easier: summarize and group_by. We learn to access resulting values using the pull function.
summarize
The summarize function in dplyr provides a way to compute summary statistics with intuitive and readable code. We start with a simple example based on heights. The heights dataset includes heights and sex reported by students in an in-class survey.
library(dplyr)
library(dslabs)
data(heights)
head(heights)
## sex height
## 1 Male 75
## 2 Male 70
## 3 Male 68
## 4 Male 74
## 5 Male 61
## 6 Female 65
The following code computes the average and standard deviation for females:
s <- heights %>%
filter(sex == "Female") %>%
summarize(average = mean(height), standard_deviation = sd(height))
s
## average standard_deviation
## 1 64.93942 3.760656
This takes our original data table as input, filters it to keep only females, and then produces a new summarized table with just the average and the standard deviation of heights. We get to choose the names of the columns of the resulting table. For example, above we decided to use average and standard_deviation, but we could have used other names just the same.
Because the resulting table stored in s is a data frame, we can access the components with the accessor $:
s$average
## [1] 64.93942
s$standard_deviation
## [1] 3.760656
As with most other dplyr functions, summarize is aware of the variable names and we can use them directly. So when inside the call to the summarize function we write mean(height), the function is accessing the column with the name “height” and then computing the average of the resulting numeric vector. We can compute any other summary that operates on vectors and returns a single value. For example, we can add the median, minimum, and maximum heights like this:
heights %>%
filter(sex == "Female") %>%
summarize(median = median(height), minimum = min(height),
maximum = max(height))
## median minimum maximum
## 1 64.98031 51 79
We can obtain these three values with just one line using the quantile function: for example, quantile(x, c(0, 0.5, 1)) returns the min (0th percentile), median (50th percentile), and max (100th percentile) of the vector x. However, if we attempt to use a function like this that returns two or more values inside summarize:
:
heights %>%
filter(sex == "Female") %>%
summarize(range = quantile(height, c(0, 0.5, 1)))
we may receive an error: Error: expecting result of length one, got : 2 (newer versions of dplyr instead return one row per value; see the do section below). With the function summarize, we should only call functions that return a single value. In later sections, we will learn how to deal with functions that return more than one value.
For another example of how we can use the summarize function, let’s compute the average murder rate for the United States. Remember our data table includes total murders and population size for each state, and we have already used dplyr to add a murder rate column:
murders <- murders %>% mutate(rate = total/population*100000)
Remember that the US murder rate is not the average of the state murder rates:
summarize(murders, mean(rate))
## mean(rate)
## 1 2.779125
This is because in the computation above the small states are given the same weight as the large ones. The US murder rate is the total number of murders in the US divided by the total US population. So the correct computation is:
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 100000)
us_murder_rate
## rate
## 1 3.034555
This computation counts larger states proportionally to their size which results in a larger value.
pull
The us_murder_rate object defined above represents just one number. Yet we are storing it in a data frame:
class(us_murder_rate)
## [1] "data.frame"
since, like most dplyr functions, summarize always returns a data frame.
This might be problematic if we want to use this result with functions that require a numeric value. Here we show a useful trick for accessing values stored in data when using pipes: when a data object is piped, that object and its columns can be accessed using the pull function. To understand what we mean, take a look at this line of code:
us_murder_rate %>% pull(rate)
## [1] 3.034555
This returns the value in the rate column of us_murder_rate, making it equivalent to us_murder_rate$rate.
To get a number from the original data table with one line of code we can type:
us_murder_rate <- murders %>%
summarize(rate = sum(total) / sum(population) * 100000) %>%
pull(rate)
us_murder_rate
## [1] 3.034555
which is now a numeric:
class(us_murder_rate)
## [1] "numeric"
Group then summarize with group_by
A common operation in data exploration is to first split data into groups and then compute summaries for each group. For example, we may want to compute the average and standard deviation for men’s and women’s heights separately. The group_by function helps us do this.
If we type this:
heights %>% group_by(sex)
## # A tibble: 1,050 × 2
## # Groups: sex [2]
## sex height
## <fct> <dbl>
## 1 Male 75
## 2 Male 70
## 3 Male 68
## 4 Male 74
## 5 Male 61
## 6 Female 65
## 7 Female 66
## 8 Female 62
## 9 Female 66
## 10 Male 67
## # ℹ 1,040 more rows
The result does not look very different from heights, except we see Groups: sex [2] when we print the object. Although not immediately obvious from its appearance, this is now a special data frame called a grouped data frame, and dplyr functions, in particular summarize, will behave differently when acting on this object. Conceptually, you can think of this table as many tables, with the same columns but not necessarily the same number of rows, stacked together in one object. When we summarize the data after grouping, this is what happens:
heights %>%
group_by(sex) %>%
summarize(average = mean(height), standard_deviation = sd(height))
## # A tibble: 2 × 3
## sex average standard_deviation
## <fct> <dbl> <dbl>
## 1 Female 64.9 3.76
## 2 Male 69.3 3.61
The summarize function applies the summarization to each group separately.
For another example, let’s compute the median murder rate in the four regions of the country:
murders %>%
group_by(region) %>%
summarize(median_rate = median(rate))
## # A tibble: 4 × 2
## region median_rate
## <fct> <dbl>
## 1 Northeast 1.80
## 2 South 3.40
## 3 North Central 1.97
## 4 West 1.29
Sorting data frames
When examining a dataset, it is often convenient to sort the table by the different columns. We know about the order and sort functions, but for ordering entire tables, the dplyr function arrange is useful. For example, here we order the states by population size:
murders %>%
arrange(population) %>%
head()
## state abb region population total rate
## 1 Wyoming WY West 563626 5 0.8871131
## 2 District of Columbia DC South 601723 99 16.4527532
## 3 Vermont VT Northeast 625741 2 0.3196211
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Alaska AK West 710231 19 2.6751860
## 6 South Dakota SD North Central 814180 8 0.9825837
With arrange we get to decide which column to sort by. To see the states by murder rate, from lowest to highest, we arrange by rate instead:
murders %>%
arrange(rate) %>%
head()
## state abb region population total rate
## 1 Vermont VT Northeast 625741 2 0.3196211
## 2 New Hampshire NH Northeast 1316470 5 0.3798036
## 3 Hawaii HI West 1360301 7 0.5145920
## 4 North Dakota ND North Central 672591 4 0.5947151
## 5 Iowa IA North Central 3046355 21 0.6893484
## 6 Idaho ID West 1567582 12 0.7655102
Note that the default behavior is to order in ascending order. In dplyr, the function desc transforms a vector so that it is in descending order. To sort the table in descending order, we can type:
murders %>%
arrange(desc(rate))
Nested sorting
If we are ordering by a column with ties, we can use a second column to break the tie. Similarly, a third column can be used to break ties between the first and second, and so on. Here we order by region, then within region we order by murder rate:
murders %>%
arrange(region, rate) %>%
head()
## state abb region population total rate
## 1 Vermont VT Northeast 625741 2 0.3196211
## 2 New Hampshire NH Northeast 1316470 5 0.3798036
## 3 Maine ME Northeast 1328361 11 0.8280881
## 4 Rhode Island RI Northeast 1052567 16 1.5200933
## 5 Massachusetts MA Northeast 6547629 118 1.8021791
## 6 New York NY Northeast 19378102 517 2.6679599
The top \(n\)
In the code above, we have used the function head to avoid having the page fill up with the entire dataset. If we want to see a larger proportion, we can use the top_n function. This function takes a data frame as its first argument, the number of rows to show in the second, and the variable to filter by in the third. Here is an example of how to see the top 5 rows:
murders %>% top_n(5, rate)
## state abb region population total rate
## 1 District of Columbia DC South 601723 99 16.452753
## 2 Louisiana LA South 4533372 351 7.742581
## 3 Maryland MD South 5773552 293 5.074866
## 4 Missouri MO North Central 5988927 321 5.359892
## 5 South Carolina SC South 4625364 207 4.475323
Note that rows are not sorted by rate, only filtered. If we want to sort, we need to use arrange.
Note that if the third argument is left blank, top_n filters by the last column.
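For example (a quick sketch; rate is the last column of our mutated murders table, and dplyr will print a message reporting which column it selected by):
murders %>% top_n(5)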
TRY IT
For these exercises, we will be using the data from the survey collected by the United States National Center for Health Statistics (NCHS). This center has conducted a series of health and nutrition surveys since the 1960s. Starting in 1999, about 5,000 individuals of all ages have been interviewed every year and they complete the health examination component of the survey. Part of the data is made available via the NHANES package. Once you install the NHANES package, you can load the data like this:
library(NHANES)
data(NHANES)
The NHANES data has many missing values. The mean and sd functions in R will return NA if any of the entries of the input vector is an NA. Here is an example:
library(dslabs)
data(na_example)
mean(na_example)
## [1] NA
sd(na_example)
## [1] NA
To ignore the NAs we can use the na.rm argument:
mean(na_example, na.rm = TRUE)
## [1] 2.301754
sd(na_example, na.rm = TRUE)
## [1] 1.22338
Let’s now explore the NHANES data.
- We will provide some basic facts about blood pressure. First let’s select a group to set the standard. We will use 20-to-29-year-old females. AgeDecade is a categorical variable with these ages. Note that the category is coded like ” 20-29”, with a space in front! What is the average and standard deviation of systolic blood pressure as saved in the BPSysAve variable? Save it to a variable called ref.
Hint: Use filter and summarize and use the na.rm = TRUE argument when computing the average and standard deviation. You can also filter the NA values using filter.
- Using a pipe, assign the average to a numeric variable ref_avg. Hint: Use code similar to the above and then pull.
- Now report the min and max values for the same group.
- Compute the average and standard deviation for females, but for each age group separately rather than a selected decade as in question 1. Note that the age groups are defined by AgeDecade. Hint: rather than filtering by age and gender, filter by Gender and then use group_by.
- Repeat exercise 4 for males.
- We can actually combine both summaries for exercises 4 and 5 into one line of code. This is because group_by permits us to group by more than one variable. Obtain one big summary table using group_by(AgeDecade, Gender).
- For males between the ages of 40-49, compare systolic blood pressure across race as reported in the Race1 variable. Order the resulting table from lowest to highest average systolic blood pressure.
Tibbles
Tidy data must be stored in data frames. We introduced the data frame in our section on data.frames and have been using the murders data frame throughout the unit. In an earlier section we introduced the group_by function, which permits stratifying data before computing summary statistics. But where is the group information stored in the data frame?
murders %>% group_by(region)
## # A tibble: 51 × 6
## # Groups: region [4]
## state abb region population total rate
## <chr> <chr> <fct> <dbl> <dbl> <dbl>
## 1 Alabama AL South 4779736 135 2.82
## 2 Alaska AK West 710231 19 2.68
## 3 Arizona AZ West 6392017 232 3.63
## 4 Arkansas AR South 2915918 93 3.19
## 5 California CA West 37253956 1257 3.37
## 6 Colorado CO West 5029196 65 1.29
## 7 Connecticut CT Northeast 3574097 97 2.71
## 8 Delaware DE South 897934 38 4.23
## 9 District of Columbia DC South 601723 99 16.5
## 10 Florida FL South 19687653 669 3.40
## # ℹ 41 more rows
Notice that there are no columns with this information. But, if you look closely at the output above, you see the line A tibble followed by dimensions. We can learn the class of the returned object using:
murders %>% group_by(region) %>% class()
## [1] "grouped_df" "tbl_df" "tbl" "data.frame"
The tbl, pronounced tibble, is a special kind of data frame. The functions group_by and summarize always return this type of data frame. The group_by function returns a special kind of tbl, the grouped_df. We will say more about these later. For consistency, the dplyr manipulation verbs (select, filter, mutate, and arrange) preserve the class of the input: if they receive a regular data frame they return a regular data frame, while if they receive a tibble they return a tibble. But tibbles are the preferred format in the tidyverse and as a result tidyverse functions that produce a data frame from scratch return a tibble.
Tibbles are very similar to data frames. In fact, you can think of them as a modern version of data frames. Nonetheless there are three important differences which we describe next.
Tibbles display better
The print method for tibbles is more readable than that of a data frame. To see this, compare the output of typing murders with the output of murders converted to a tibble. We can do the conversion using as_tibble(murders). If using RStudio, output for a tibble adjusts to your window size. To see this, change the width of your R console and notice how more or fewer columns are shown.
Subsets of tibbles are tibbles
If you subset the columns of a data frame, you may get back an object that is not a data frame, such as a vector or scalar. For example:
class(murders[,4])
## [1] "numeric"
is not a data frame. With tibbles this does not happen:
class(as_tibble(murders)[,4])
## [1] "tbl_df" "tbl" "data.frame"
This is useful in the tidyverse since functions require data frames as input.
With tibbles, if you want to access the vector that defines a column, and not get back a data frame, you need to use the accessor $:
class(as_tibble(murders)$population)
## [1] "numeric"
A related feature is that tibbles will give you a warning if you try to access a column that does not exist. If we accidentally write Population instead of population, this:
murders$Population
## NULL
returns a NULL with no warning, which can make it harder to debug. In contrast, if we try this with a tibble we get an informative warning:
as_tibble(murders)$Population
## Warning: Unknown or uninitialised column: `Population`.
## NULL
Tibbles can have complex entries
While data frame columns need to be vectors of numbers, strings, or logical values, tibbles can have more complex objects, such as lists or functions. Also, we can create tibbles with functions:
tibble(id = c(1, 2, 3), func = c(mean, median, sd))
## # A tibble: 3 × 2
## id func
## <dbl> <list>
## 1 1 <fn>
## 2 2 <fn>
## 3 3 <fn>
Tibbles can be grouped
The function group_by returns a special kind of tibble: a grouped tibble. This class stores information that lets you know which rows are in which groups. The tidyverse functions, in particular the summarize function, are aware of the group information.
Create a tibble using tibble instead of data.frame
It is sometimes useful to create our own data frames. To create a data frame in the tibble format, use the tibble function.
grades <- tibble(names = c("John", "Juan", "Jean", "Yao"),
exam_1 = c(95, 80, 90, 85),
exam_2 = c(90, 85, 85, 90))
Note that base R (without packages loaded) has a function with a very similar name, data.frame, that can be used to create a regular data frame rather than a tibble. One other historically important difference is that, prior to R 4.0, data.frame coerced characters into factors by default without providing a warning or message. In current versions of R, characters stay characters, as the output below shows:
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
exam_1 = c(95, 80, 90, 85),
exam_2 = c(90, 85, 85, 90))
class(grades$names)
## [1] "character"
To make this behavior explicit (and to avoid the coercion on older versions of R), we can use the rather cumbersome argument stringsAsFactors:
grades <- data.frame(names = c("John", "Juan", "Jean", "Yao"),
exam_1 = c(95, 80, 90, 85),
exam_2 = c(90, 85, 85, 90),
stringsAsFactors = FALSE)
class(grades$names)
## [1] "character"
To convert a regular data frame to a tibble, you can use the as_tibble function.
as_tibble(grades) %>% class()
## [1] "tbl_df" "tbl" "data.frame"
The dot operator
One of the advantages of using the pipe %>% is that we do not have to keep naming new objects as we manipulate the data frame. As a quick reminder, if we want to compute the median murder rate for southern states, instead of typing:
tab_1 <- filter(murders, region == "South")
tab_2 <- mutate(tab_1, rate = total / population * 10^5)
rates <- tab_2$rate
median(rates)
## [1] 3.398069
We can avoid defining any new intermediate objects by instead typing:
filter(murders, region == "South") %>%
mutate(rate = total / population * 10^5) %>%
summarize(median = median(rate)) %>%
pull(median)
## [1] 3.398069
We can do this because each of these functions takes a data frame as the first argument. But what if we want to access a component of the data frame? For example, what if the pull function was not available and we wanted to access tab_2$rate? What data frame name would we use? The answer is the dot operator.
For example, to access the rate vector without the pull function, we could use:
rates <- filter(murders, region == "South") %>%
mutate(rate = total / population * 10^5) %>%
.$rate
median(rates)
## [1] 3.398069
In the next section, we will see other instances in which using the . is useful.
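As a preview (our own sketch, not from the text): with %>%, the dot also lets us pipe a data frame into functions that do not take the data as their first argument, such as lm:
# The dot stands in for the piped data frame, here as lm's data argument
murders %>%
  filter(region == "South") %>%
  lm(rate ~ population, data = .)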
do
The tidyverse functions know how to interpret grouped tibbles. Furthermore, to facilitate stringing commands through the pipe %>%, tidyverse functions consistently take data frames and return data frames, since this assures that the output of a function is accepted as the input of another. But most R functions do not recognize grouped tibbles nor do they return data frames. The quantile function is an example we described earlier. The do function serves as a bridge between R functions such as quantile and the tidyverse. The do function understands grouped tibbles and always returns a data frame.
In the summarize section (above), we noted that if we attempt to use quantile to obtain the min, median, and max in one call, we will receive something unexpected. With older versions of dplyr, we would receive an error. With newer versions, we actually get:
data(heights)
heights %>%
filter(sex == "Female") %>%
summarize(range = quantile(height, c(0, 0.5, 1)))
We probably wanted three columns: min, median, and max. We can use the do function to fix this.
First we have to write a function that fits into the tidyverse approach: that is, it receives a data frame and returns a data frame. Note that it returns a single-row data frame.
my_summary <- function(dat){
x <- quantile(dat$height, c(0, 0.5, 1))
tibble(min = x[1], median = x[2], max = x[3])
}
We can now apply the function to the heights dataset to obtain the summaries:
heights %>%
group_by(sex) %>%
my_summary
## # A tibble: 1 × 3
## min median max
## <dbl> <dbl> <dbl>
## 1 50 68.5 82.7
But this is not what we want. We want a summary for each sex, and the code returned just one summary. This is because my_summary is not part of the tidyverse and does not know how to handle grouped tibbles. do makes this connection:
heights %>%
group_by(sex) %>%
do(my_summary(.))
## # A tibble: 2 × 4
## # Groups: sex [2]
## sex min median max
## <fct> <dbl> <dbl> <dbl>
## 1 Female 51 65.0 79
## 2 Male 50 69 82.7
Note that here we need to use the dot operator. The tibble created by group_by is piped to do. Within the call to do, the name of this tibble is . and we want to send it to my_summary. If you do not use the dot, then my_summary has no argument and returns an error telling us that argument "dat" is missing. You can see the error by typing:
heights %>%
group_by(sex) %>%
do(my_summary())
If you do not use the parentheses, then the function is not executed and instead do tries to return the function. This gives an error because do must always return a data frame. You can see the error by typing:
heights %>%
group_by(sex) %>%
do(my_summary)
So do serves as a bridge between non-tidyverse functions and the tidyverse.
The purrr package
In previous sections (and labs) we learned about the sapply function, which permitted us to apply the same function to each element of a vector. We constructed a function and used sapply to compute the sum of the first n integers for several values of n like this:
compute_s_n <- function(n){
x <- 1:n
sum(x)
}
n <- 1:25
s_n <- sapply(n, compute_s_n)
s_n
## [1] 1 3 6 10 15 21 28 36 45 55 66 78 91 105 120 136 153 171 190
## [20] 210 231 253 276 300 325
This type of operation, applying the same function or procedure to elements of an object, is quite common in data analysis. The purrr package includes functions similar to sapply but that better interact with other tidyverse functions. The main advantage is that we can better control the output type of functions. In contrast, sapply can return several different object types; for example, we might expect a numeric result from a line of code, but sapply might convert our result to character under some circumstances. purrr functions will never do this: they will return objects of a specified type or return an error if this is not possible.
The first purrr function we will learn is map, which works very similarly to sapply but always, without exception, returns a list:
library(purrr) # or library(tidyverse)
n <- 1:25
s_n <- map(n, compute_s_n)
class(s_n)
## [1] "list"
If we want a numeric vector, we can instead use map_dbl, which always returns a vector of numeric values.
s_n <- map_dbl(n, compute_s_n)
class(s_n)
## [1] "numeric"
This produces the same results as the sapply call shown above.
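To see the type guarantee in action, consider this small sketch (our own example, not from the text). sapply silently coerces mixed results to character, while map_dbl refuses:
# sapply coerces the whole result to character without warning
sapply(1:3, function(n) if (n < 3) n else "three")
## [1] "1"     "2"     "three"

# map_dbl throws an error here rather than change the type:
# map_dbl(1:3, function(n) if (n < 3) n else "three")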
A particularly useful purrr function for interacting with the rest of the tidyverse is map_df, which always returns a tibble data frame. However, the function being called needs to return a vector or a list with names. For this reason, the following code would result in an Argument 1 must have names error:
s_n <- map_df(n, compute_s_n)
We need to change the function to make this work:
compute_s_n <- function(n){
x <- 1:n
tibble(sum = sum(x))
}
s_n <- map_df(n, compute_s_n)
head(s_n)
## # A tibble: 6 × 1
## sum
## <int>
## 1 1
## 2 3
## 3 6
## 4 10
## 5 15
## 6 21
Because map_df returns a tibble, we can have more columns defined in our function and returned.
compute_s_n2 <- function(n){
x <- 1:n
tibble(sum = sum(x), sumSquared = sum(x^2))
}
s_n <- map_df(n, compute_s_n2)
head(s_n)
## # A tibble: 6 × 2
## sum sumSquared
## <int> <dbl>
## 1 1 1
## 2 3 5
## 3 6 14
## 4 10 30
## 5 15 55
## 6 21 91
The purrr package provides much more functionality not covered here. For more details you can consult this online resource.
Tidyverse conditionals
A typical data analysis will often involve one or more conditional operations. In the section on Conditionals, we described the ifelse function, which we will use extensively in this book. In this section we present two dplyr functions that provide further functionality for performing conditional operations.
case_when
The case_when function is useful for vectorizing conditional statements. It is similar to ifelse but can output any number of different values, whereas ifelse chooses between just two. Here is an example splitting numbers into negative, positive, and 0:
x <- c(-2, -1, 0, 1, 2)
case_when(x < 0 ~ "Negative",
x > 0 ~ "Positive",
x == 0 ~ "Zero")
## [1] "Negative" "Negative" "Zero" "Positive" "Positive"
A common use for this function is to define categorical variables based on existing variables. For example, suppose we want to compare the murder rates in four groups of states: New England, West Coast, South, and other. For each state, we first ask if it is in New England; if not, we ask if it is on the West Coast; if not, we ask if it is in the South; and if not, we assign other. Here is how we use case_when to do this:
murders %>%
mutate(group = case_when(
abb %in% c("ME", "NH", "VT", "MA", "RI", "CT") ~ "New England",
abb %in% c("WA", "OR", "CA") ~ "West Coast",
region == "South" ~ "South",
TRUE ~ "Other")) %>%
group_by(group) %>%
summarize(rate = sum(total) / sum(population) * 10^5)
## # A tibble: 4 × 2
## group rate
## <chr> <dbl>
## 1 New England 1.72
## 2 Other 2.71
## 3 South 3.63
## 4 West Coast 2.90
That TRUE on the fourth line of case_when serves as a catch-all. As case_when steps through the conditions, if none of them are true, it comes to the last line. Since TRUE is always true, the function will return “Other”. Leaving out the last line of case_when would result in NA values for any observation that fails the first three conditionals. This may or may not be what you want.
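A quick sketch of that behavior with a toy vector:
x <- c(-1, 0, 1)
# With no catch-all, 0 matches no condition and becomes NA
case_when(x < 0 ~ "Negative",
          x > 0 ~ "Positive")
## [1] "Negative" NA         "Positive"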
between
A common operation in data analysis is to determine if a value falls inside an interval. We can check this using conditionals. For example, to check if the elements of a vector x are between a and b we can type:
x >= a & x <= b
However, this can become cumbersome, especially within the tidyverse approach. The between function performs the same operation:
between(x, a, b)
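For example, to keep states with a murder rate between 1 and 2 per 100,000 (a sketch using the rate column we added to murders earlier):
murders %>% filter(between(rate, 1, 2))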
TRY IT
- Load the murders dataset. Which of the following is true?
  - murders is in tidy format and is stored in a tibble.
  - murders is in tidy format and is stored in a data frame.
  - murders is not in tidy format and is stored in a tibble.
  - murders is not in tidy format and is stored in a data frame.
- Use as_tibble to convert the murders data table into a tibble and save it in an object called murders_tibble.
- Use the group_by function to convert murders into a tibble that is grouped by region.
- Write tidyverse code that is equivalent to this code:
exp(mean(log(murders$population)))
Write it using the pipe so that each function is called without arguments. Use the dot operator to access the population. Hint: The code should start with murders %>%.
- Use map_df to create a data frame with three columns named n, s_n, and s_n_2. The first column should contain the numbers 1 through 100. The second and third columns should each contain the sum of 1 through \(n\) with \(n\) the row number.