Geospatial with R

Content for Thursday, December 7, 2023

A little more on Geospatial in R
- Tigris for polygons
- Tidycensus

Our last class!

A little more on Geospatial in R

We went over the basics and got some practice using geospatial data in R. Let’s look closely at the tidycensus packages. Thus far, we have used tigris to get TIGER/line census boundaries for things like counties. tidycensus does more than this – we can pull actual census data with its geospatial references (the polygons referring to the counties, tracts, block groups, and blocks) along with census data because tidycensus interfaces with tigris to combine the census data and the census geographies.

Tigris for polygons

Make sure you load tigris (for spatial data) and tidycensus for census data to attach to the spatial data:

library(tigris)
library(tidycensus)
library(tidyverse)

We already saw that we could extract counties using the tigris::counties function. Some “census geographies” change from year to year. Counties don’t (usually), but census tracts, block groups, and blocks do. These last three are all nested units of observation - tracts hold multiple block groups, each block group holds multiple blocks. Blocks are pretty small, and a lot of census data isn’t reported at that level for confidentiality. Block groups don’t report all data, either. Tracts are usually the most reliable balance between “census data availability” and “small geographic area”. We’ll use the 2010 census boundaries by specifying year = 2010.

Let’s look at MI counties, then look at Ingham County census tracts:

MI.counties = tigris::counties(state='MI', year = 2010, progress_bar = FALSE)
ggplot(MI.counties, aes(fill = NAME10) ) + 
  geom_sf() + 
  theme_minimal() + theme(legend.position = 'none')

As you can see, we get some counties that extend into the Great Lakes. That’s OK - we can use st_intersect with a map of the Great Lakes to clean that up, but I won’t do that here.

Now, the census tracts for Ingham County. We can use str_detect to find the County FIPS for Ingham:

Ingham_FIPS = MI.counties %>%
  dplyr::filter(str_detect(NAME10, 'Ingham')) %>%
  dplyr::select(COUNTYFP, STATEFP) %>%
  pull(COUNTYFP)

Ingham.tracts = tigris::tracts(state='MI', county = Ingham_FIPS, year = 2010, progress_bar = FALSE)

print(Ingham.tracts %>% dplyr::select(TRACTCE10, NAME10, GEOID10))

## Simple feature collection with 81 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.60314 ymin: 42.42194 xmax: -84.1406 ymax: 42.77664
## Geodetic CRS:  NAD83
## First 10 features:
##      TRACTCE10 NAME10     GEOID10                       geometry
## 2011    006600     66 26065006600 MULTIPOLYGON (((-84.54269 4...
## 2012    003103  31.03 26065003103 MULTIPOLYGON (((-84.52985 4...
## 2013    006800     68 26065006800 MULTIPOLYGON (((-84.56177 4...
## 2014    006700     67 26065006700 MULTIPOLYGON (((-84.56159 4...
## 2015    006001  60.01 26065006001 MULTIPOLYGON (((-84.2875 42...
## 2016    006002  60.02 26065006002 MULTIPOLYGON (((-84.21456 4...
## 2017    002101  21.01 26065002101 MULTIPOLYGON (((-84.53305 4...
## 2018    007000     70 26065007000 MULTIPOLYGON (((-84.58218 4...
## 2023    001703  17.03 26065001703 MULTIPOLYGON (((-84.58662 4...
## 2024    004494  44.94 26065004494 MULTIPOLYGON (((-84.46658 4...

Notice the GEOID10 starts with the MI FIPS (26), the Ingham FIPS (065), then the 6 digits that make up the TRACTCE10 value. This is because tracts are nested in states and counties.

Finally, let’s show all the Ingham County tracts. I’ve changed the color mapping for the col aesthetic so that the tract that holds MSU will be outlined in green:

ggplot(Ingham.tracts, aes(fill = TRACTCE10, col = (NAME10=='9800'))) + 
  geom_sf() + theme_minimal() + theme(legend.position = 'none') + scale_fill_viridis_d() +
  scale_color_manual(values = c('TRUE' = 'green','FALSE' = 'gray50'))

Tidycensus

Tidycensus takes some getting used to because census data is very complicated, has many geographies, and many subsets (e.g. you could be asking about the age of a specific combination of race and income). Here, we’re going to learn two things: how to find some basic census data by tract, and how to extract it with tidycensus.

First, we can use the tidyverse author’s “basic usage” page as a guide: https://walker-data.com/tidycensus/articles/basic-usage.html. There are important directions for getting a Census API key, which must be added to your system. tidycensus makes it easy - once you get your key (free from the US census, see the link for instructions), you just add it to your system using the census_api_key function. You only have to do it once, it records the API key in your system files, and will find it automatically on subsequent projects.

Finding variables

This is actually pretty tricky. The census asks many questions about demographics and other topics (house characteristics, earnings, health, etc.) and it can be hard to find exactly what you’re looking for. We’re going to seek some simple data: the average income, and the average “tenure” (how long someone has lived in their current residence) in each tract in Ingham County.

tidycensus has some useful functions for finding variables, especially when combined with str_detect, which helps us find certain words or phrases (our RegExp skills come in handy!) in the census variable names. Variables are identified by an ID, not by name. “P013001”, for instance, is the median age in a given geography. All variables are not available at all geographic units: “P013001” is available for all state, county, tract, and block group, but not block. The link above shows all of the geographies available.

To make things even more complicated, there is more than the decennial census. The American Community Survey (ACS) samples 1-5% of the population each year, and reports yearly but on a limited number of variables. We’ll use the ACS five-year (acs5) as it has more easily available data, even if it asks fewer questions. We’ll use the 2019 data, the most recently available.

Let’s start by finding all variables that contain “income”. Variable descriptions have the variable name as well as the “concept”. We want to use “concept” to search for our term. Each concept has multiple variables with it:

allvars = load_variables(2019, 'acs5')
incvars = allvars %>%
  dplyr::filter(str_detect(concept, 'MEDIAN INCOME'))

incvars[1:5,]

## # A tibble: 5 × 4
##   name         label                                           concept geography
##   <chr>        <chr>                                           <chr>   <chr>    
## 1 B06011PR_001 Estimate!!Median income in the past 12 months … MEDIAN… <NA>     
## 2 B06011PR_002 Estimate!!Median income in the past 12 months … MEDIAN… <NA>     
## 3 B06011PR_003 Estimate!!Median income in the past 12 months … MEDIAN… <NA>     
## 4 B06011PR_004 Estimate!!Median income in the past 12 months … MEDIAN… <NA>     
## 5 B06011PR_005 Estimate!!Median income in the past 12 months … MEDIAN… <NA>

We got lucky - the first concept is “MEDIAN INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) BY PLACE OF BIRTH”. Now, we didn’t want the “place of birth” part (though…sounds interesting), but look at the first label. It is the !!Total:, meaning it is the median income when you combine all places of birth. The second label is !!Total:!!Born in state of residence, which is a subset of the total. Same with the next. We do not need to use the subsets in order to use the !!Total value. Note the name is B06011_001.

Lots and lots of variables will have the same total - the next variable B07011_001 (after the Puerto Rico version) is the median income broken down by whether or not the household lives in the same house, instead of by place of birth. It’s long, but you can see that it is “MEDIAN INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) BY GEOGRAPHICAL MOBILITY IN THE PAST YEAR FOR CURRENT RESIDENCE IN THE UNITED STATES”

incvars[6:15,]

## # A tibble: 10 × 4
##    name         label                                          concept geography
##    <chr>        <chr>                                          <chr>   <chr>    
##  1 B06011_001   Estimate!!Median income in the past 12 months… MEDIAN… tract    
##  2 B06011_002   Estimate!!Median income in the past 12 months… MEDIAN… tract    
##  3 B06011_003   Estimate!!Median income in the past 12 months… MEDIAN… tract    
##  4 B06011_004   Estimate!!Median income in the past 12 months… MEDIAN… tract    
##  5 B06011_005   Estimate!!Median income in the past 12 months… MEDIAN… tract    
##  6 B07011PR_001 Estimate!!Median income in the past 12 months… MEDIAN… <NA>     
##  7 B07011PR_002 Estimate!!Median income in the past 12 months… MEDIAN… <NA>     
##  8 B07011PR_003 Estimate!!Median income in the past 12 months… MEDIAN… <NA>     
##  9 B07011PR_004 Estimate!!Median income in the past 12 months… MEDIAN… <NA>     
## 10 B07011PR_005 Estimate!!Median income in the past 12 months… MEDIAN… <NA>

While the values of the subsets will be different, the !!Total: variable will be the same. So, we can really choose any !!Total: variable for any set of variables in a concept that fits our search.

Let’s call up the median income using B06011_001. Note that we are naming our variables in the third line, and specifying that we only want county = '065' for Ingham:

ing.medincome = get_acs(geography = 'tract',
                        county = '065', state = 'MI',
                        variables = c(medincome = 'B06011_001'),
                        year = 2019)
head(ing.medincome)

## # A tibble: 6 × 5
##   GEOID       NAME                                     variable  estimate   moe
##   <chr>       <chr>                                    <chr>        <dbl> <dbl>
## 1 26065000100 Census Tract 1, Ingham County, Michigan  medincome    22994  6149
## 2 26065000400 Census Tract 4, Ingham County, Michigan  medincome    29059  4043
## 3 26065000600 Census Tract 6, Ingham County, Michigan  medincome    18549  4830
## 4 26065000700 Census Tract 7, Ingham County, Michigan  medincome    17196  8455
## 5 26065000800 Census Tract 8, Ingham County, Michigan  medincome    23142  5845
## 6 26065001000 Census Tract 10, Ingham County, Michigan medincome    30704  1982

Since the ACS is a sample, the variable is returned as the estimate along with a margin of error, which we’ll ignore for now. Note that the data is “tidy” for now - each row is one observation of one variable. If we have multiple variables, we’ll have to keep it tidy. We’ll compare the B06011_001 with B007011_001:

ing.medincome = get_acs(geography = 'tract',
                        county = '065', state = 'MI',
                        variables = c(medincome = 'B06011_001',
                                      medincome2 = 'B07011_001'), 
                        year = 2019) %>% arrange(GEOID)
head(ing.medincome)

## # A tibble: 6 × 5
##   GEOID       NAME                                    variable   estimate   moe
##   <chr>       <chr>                                   <chr>         <dbl> <dbl>
## 1 26065000100 Census Tract 1, Ingham County, Michigan medincome     22994  6149
## 2 26065000100 Census Tract 1, Ingham County, Michigan medincome2    22994  6149
## 3 26065000400 Census Tract 4, Ingham County, Michigan medincome     29059  4043
## 4 26065000400 Census Tract 4, Ingham County, Michigan medincome2    29059  4043
## 5 26065000600 Census Tract 6, Ingham County, Michigan medincome     18549  4830
## 6 26065000600 Census Tract 6, Ingham County, Michigan medincome2    18549  4830

We got to name the variables (useful!) as you see in the variable column. However, there is more than one row per observation now: the results are not “tidy”. But clearly the two variable’s totals are the same, so we have convinced ourselves that !!Total: works for any of the different concepts that cover median income.

Often, the data we wish to pull is a total, but it is broken down (by race, income groups, etc.). This means we want one observation per row (census tract), but multiple columns representing the total and the breakdown of a variable. Imagine “race” by tract with counts of households by race. We’ll have to use output = 'wide' in our get_acs call:

ing.medincome.wide = get_acs(geography = 'tract',
                        county = '065', state = 'MI',
                        variables = c(medincome = 'B06011_001',
                                      medincome2 = 'B07011_001'), 
                        year = 2019,
                        output = 'wide') %>% arrange(GEOID)
head(ing.medincome.wide)

## # A tibble: 6 × 6
##   GEOID       NAME                 medincomeE medincomeM medincome2E medincome2M
##   <chr>       <chr>                     <dbl>      <dbl>       <dbl>       <dbl>
## 1 26065000100 Census Tract 1, Ing…      22994       6149       22994        6149
## 2 26065000400 Census Tract 4, Ing…      29059       4043       29059        4043
## 3 26065000600 Census Tract 6, Ing…      18549       4830       18549        4830
## 4 26065000700 Census Tract 7, Ing…      17196       8455       17196        8455
## 5 26065000800 Census Tract 8, Ing…      23142       5845       23142        5845
## 6 26065001000 Census Tract 10, In…      30704       1982       30704        1982

We still get to name the variables (so name wisely!). We get an Estimate and w Margin of error for each variable. We’ll mostly stick with the estimate and ignore the MoE.

Tidycensus and geographies

Now, let’s find the other variable we’re interested in - housing tenure. We’ll keep it simple and calculate the percentage of people in a tract who lived in the same house last year.

tenvars = allvars %>%
  dplyr::filter(str_detect(concept, 'TENURE'))

tenvars[1:5,]

## # A tibble: 5 × 4
##   name         label                                           concept geography
##   <chr>        <chr>                                           <chr>   <chr>    
## 1 B07013PR_001 Estimate!!Total:                                GEOGRA… <NA>     
## 2 B07013PR_002 Estimate!!Total:!!Householder lived in owner-o… GEOGRA… <NA>     
## 3 B07013PR_003 Estimate!!Total:!!Householder lived in renter-… GEOGRA… <NA>     
## 4 B07013PR_004 Estimate!!Total:!!Same house 1 year ago:        GEOGRA… <NA>     
## 5 B07013PR_005 Estimate!!Total:!!Same house 1 year ago:!!Hous… GEOGRA… <NA>

Here, we get B07013_001, which is the total number of households, and B07014_004, which is the total number of households that lived in the same house a year ago. The ratio of these two tells us what percent didn’t move in the last year, which is what we want. We’re going to take the output in wide format so that we can calculate this ratio.

Now, with variable numbers in hand, we can pretty easily get the attached geography of interest. We just add geometry = TRUE to the get_acs call. We’re also going to ask for wide format, which gives us on row per tract and a column for each variable.

ing.tenure = get_acs(geography = 'tract',
                     county = '065', state = 'MI',
                     variables = c(samehouse = 'B07013_004',
                                   totalhouse = 'B07013_001'),
                     geometry = TRUE, output = 'wide', progress_bar = FALSE,
                     year = 2019)
ing.tenure

## Simple feature collection with 81 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.60314 ymin: 42.42195 xmax: -84.14062 ymax: 42.77664
## Geodetic CRS:  NAD83
## First 10 features:
##          GEOID                                        NAME samehouseE
## 1  26065004302 Census Tract 43.02, Ingham County, Michigan        498
## 2  26065002800    Census Tract 28, Ingham County, Michigan       2396
## 3  26065005201 Census Tract 52.01, Ingham County, Michigan       5216
## 4  26065001200    Census Tract 12, Ingham County, Michigan       1201
## 5  26065006302 Census Tract 63.02, Ingham County, Michigan       3660
## 6  26065002200    Census Tract 22, Ingham County, Michigan       1349
## 7  26065000800     Census Tract 8, Ingham County, Michigan       2601
## 8  26065004494 Census Tract 44.94, Ingham County, Michigan          0
## 9  26065004700    Census Tract 47, Ingham County, Michigan       2346
## 10 26065006301 Census Tract 63.01, Ingham County, Michigan       3580
##    samehouseM totalhouseE totalhouseM                       geometry
## 1         165        2383         326 MULTIPOLYGON (((-84.47205 4...
## 2         310        2656         296 MULTIPOLYGON (((-84.54808 4...
## 3         539        5961         481 MULTIPOLYGON (((-84.5826 42...
## 4         233        2137         239 MULTIPOLYGON (((-84.54113 4...
## 5         242        4139         201 MULTIPOLYGON (((-84.48344 4...
## 6         143        1577         110 MULTIPOLYGON (((-84.53812 4...
## 7         400        2987         403 MULTIPOLYGON (((-84.54287 4...
## 8          10           0          10 MULTIPOLYGON (((-84.46662 4...
## 9         181        2908         186 MULTIPOLYGON (((-84.43269 4...
## 10        257        4221         213 MULTIPOLYGON (((-84.48351 4...

We get an “E” at the end of the variable which stands for “Estimate” (the “M” is “Margin of Error” since the ACS is a sample). We can now calculate the percent living in the same house they were in last year:

ing.tenure = ing.tenure %>%
  dplyr::mutate(percentSameHouse = samehouseE/totalhouseE) %>%
  dplyr::select(GEOID, NAME, percentSameHouse)

ing.tenure

## Simple feature collection with 81 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -84.60314 ymin: 42.42195 xmax: -84.14062 ymax: 42.77664
## Geodetic CRS:  NAD83
## First 10 features:
##          GEOID                                        NAME percentSameHouse
## 1  26065004302 Census Tract 43.02, Ingham County, Michigan        0.2089803
## 2  26065002800    Census Tract 28, Ingham County, Michigan        0.9021084
## 3  26065005201 Census Tract 52.01, Ingham County, Michigan        0.8750210
## 4  26065001200    Census Tract 12, Ingham County, Michigan        0.5620028
## 5  26065006302 Census Tract 63.02, Ingham County, Michigan        0.8842716
## 6  26065002200    Census Tract 22, Ingham County, Michigan        0.8554217
## 7  26065000800     Census Tract 8, Ingham County, Michigan        0.8707734
## 8  26065004494 Census Tract 44.94, Ingham County, Michigan              NaN
## 9  26065004700    Census Tract 47, Ingham County, Michigan        0.8067400
## 10 26065006301 Census Tract 63.01, Ingham County, Michigan        0.8481403
##                          geometry
## 1  MULTIPOLYGON (((-84.47205 4...
## 2  MULTIPOLYGON (((-84.54808 4...
## 3  MULTIPOLYGON (((-84.5826 42...
## 4  MULTIPOLYGON (((-84.54113 4...
## 5  MULTIPOLYGON (((-84.48344 4...
## 6  MULTIPOLYGON (((-84.53812 4...
## 7  MULTIPOLYGON (((-84.54287 4...
## 8  MULTIPOLYGON (((-84.46662 4...
## 9  MULTIPOLYGON (((-84.43269 4...
## 10 MULTIPOLYGON (((-84.48351 4...

The NA’s come from tracts with zero households in them (the haunted Seven Gables Nature Area, GM’s Lansing plant, industrial areas, etc.). Let’s plot this!

ggplot(ing.tenure, aes(fill = percentSameHouse)) + 
  geom_sf() + theme_minimal() + scale_fill_viridis_c() +
  labs(fill = 'Percent Same House\n1 yr ago')

An alternative source for finding Census variables

We can also find census variables by searching the data.census.gov website. We are looking for tables, so click on “explore tables”.

First thing we want to do is set our level (e.g. block, block-group, tract, county, etc.) so that we are searching for data that exists at the level we want. No sense in searching for something you want at the county level to find out it’s only available at the state level! On the left side under “geography” click on the level you want (say, “census tract”) then choose a random state and tract – you’re only using this to find the variable’s code, not to actually retrieve data. I always use the first county in Alabama just because it’s small-ish so it doesn’t take long to load up.

With the tract(s) set on geography, go up to the search box and type in the topic you’re interested in. It’ll draw from decennial census, the ACS, and more. You can view the data (for the set of selected geographies) by clicking on the results to make sure the numbers look like you’d expect. That will also let you see the crosstabs – some data will have further breakdowns (by race, by tenure, etc.) and the crosstabs will show you what those breakdowns are, and if there is a “topline” number that serves your purpose.

Once you find the series you want, you’re getting close to being done. The variable code will be on the results (you’ll see “B….” or “H…” or “S….”). Note that and the source (ACS, decennial) and the year(s) you want. What’s missing, though? The _001 suffix! Frustrating, I know.

To find the exact variable you need, you’ll need what’s called a “table shell”. You can find them all here. Download the “table shells for all detailed data” for the data series and year you need (ACS 2021, decennial 2020, etc.) and search using your variable name. That’ll take you to the complete crosstabs available for that variable, including the _00X suffixes. Those are the variables (“table ID’s” in census terms) you’ll need to feed to tidycensus. You can name each of them as you see fit, take whichever ones you need (just the topline, or all of the crosstabs – it’s up to you).

Wrapping up

Finding census variables via tidycensus can be frustrating, and other sources exist to help lookup census tables. One useful tip is that you can use the “filter” button in your Rstudio View() window to interactively filter the concept column from load_variables. Once you find what you want there, then use the get_acs function in your code. Fundamentally, it’s a lot of information with the ability to subset in many way, which makes it hard to wade through and find what you’re looking for.

Try it! (This will be your Lab 14, due on Monday)

Load up all the necessary packages
Find the FIPS for your home county (or a county you’re interested in)
Choose a variable you’re interested in. Finding census variables can be tricky! If you want a challenge, find more than one variable that can construct something you’re interested in (like our “percent of housholds residing at the same address a year ago” measure).
Using load_variables, find the best representation of that variable
Make a map of that variable at the census tract level

Last updated on December 11, 2023