Chapter 5 Data wrangling on multiple tables
In the previous chapter, we illustrated how the five data wrangling verbs can be chained to perform operations on a single table. A single table is reminiscent of a single well-organized spreadsheet. But in the same way that a workbook can contain multiple spreadsheets, we will often work with multiple tables. In Chapter 15, we will describe how multiple tables related by unique identifiers called keys can be organized into a relational database management system.
It is more efficient for the computer to store and search tables in which “like is stored with like.”
Thus, a database maintained by the Bureau of Transportation Statistics on the arrival times of U.S. commercial flights will consist of multiple tables, each of which contains data about different things.
For example, the nycflights13 package contains one table about flights
—each row in this table is a single flight.
As there are many flights, you can imagine that this table will get very long—hundreds of thousands of rows per year.
There are other related kinds of information that we will want to know about these flights.
We would certainly be interested in the particular airline to which each flight belonged.
It would be inefficient to store the complete name of the airline (e.g., American Airlines Inc.
) in every row of the flights table. A simple code (e.g., AA
) would take up less space on disk.
For small tables, the savings of storing two characters instead of 25 is insignificant, but for large tables, it can add up to noticeable savings both in terms of the size of data on disk, and the speed with which we can search it.
However, we still want to have the full names of the airlines available if we need them.
The solution is to store the data about airlines in a separate table called airlines
, and to provide a key that links the data in the two tables together.
5.1 inner_join()
If we examine the first few rows of the flights
table, we observe that the carrier
column contains a two-character string corresponding to the airline.
library(tidyverse)
library(mdsr)
library(nycflights13)
glimpse(flights)
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 5…
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 6…
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, …
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, …
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 30…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "L…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "I…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 14…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 7…
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, …
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-…
In the airlines
table, we have those same two-character strings, but also the full names of the airline.
head(airlines, 3)
# A tibble: 3 × 2
carrier name
<chr> <chr>
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
In order to retrieve a list of flights and the full names of the airlines that managed each flight, we need to match up the rows in the flights
table with those rows in the airlines
table that have the corresponding values for the carrier
column in both tables. This is achieved with the function inner_join()
.
<- flights %>%
flights_joined inner_join(airlines, by = c("carrier" = "carrier"))
glimpse(flights_joined)
Rows: 336,776
Columns: 20
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 201…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 5…
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 6…
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, …
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, …
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 30…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "L…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "I…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 14…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 7…
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, …
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-…
$ name <chr> "United Air Lines Inc.", "United Air Lines Inc.", "…
Notice that the flights_joined
data frame now has an additional variable called name
.
This is the column from airlines
that is now included in the combined data frame.
We can view the full names of the airlines instead of the cryptic two-character codes.
%>%
flights_joined select(carrier, name, flight, origin, dest) %>%
head(3)
# A tibble: 3 × 5
carrier name flight origin dest
<chr> <chr> <int> <chr> <chr>
1 UA United Air Lines Inc. 1545 EWR IAH
2 UA United Air Lines Inc. 1714 LGA IAH
3 AA American Airlines Inc. 1141 JFK MIA
In an inner_join()
, the result set contains only those rows that have matches in both tables. In this case, all of the rows in flights
have exactly one corresponding entry in airlines
, so the number of rows in flights_joined
is the same as the number of rows in flights
(this will not always be the case).
nrow(flights)
[1] 336776
nrow(flights_joined)
[1] 336776
It is always a good idea to carefully check that the number of rows returned by a join operation is what you expected. In particular, you should carefully check for rows in one table that matched to more than one row in the other table.
5.2 left_join()
Another commonly-used type of join is a left_join()
. Here the rows of the first table are always returned, regardless of whether there is a match in the second table.
Suppose that we are only interested in flights from the New York City airports on the West Coast.
To restrict ourselves to airports in the Pacific Time Zone (UTC -8) we can filter the airports
data frame to only include those airports.
<- airports %>%
airports_pt filter(tz == -8)
nrow(airports_pt)
[1] 178
Now, if we perform an inner_join()
on flights
and airports_pt
, matching the destinations in flights
to the FAA codes in airports
, we retrieve only those flights that flew to our airports in the Pacific Time Zone.
<- flights %>%
nyc_dests_pt inner_join(airports_pt, by = c("dest" = "faa"))
nrow(nyc_dests_pt)
[1] 46324
However, if we use a left_join()
with the same conditions, we retrieve all of the rows of flights
. NA
’s are inserted into the columns where no matched data was found.
<- flights %>%
nyc_dests left_join(airports_pt, by = c("dest" = "faa"))
%>%
nyc_dests summarize(
num_flights = n(),
num_flights_pt = sum(!is.na(name)),
num_flights_not_pt = sum(is.na(name))
)
# A tibble: 1 × 3
num_flights num_flights_pt num_flights_not_pt
<int> <int> <int>
1 336776 46324 290452
Left joins are particularly useful in databases in which referential integrity is broken (not all of the keys are present—see Chapter 15).
5.3 Extended example: Manny Ramirez
In the context of baseball and the Lahman package, multiple tables are used to store information. The batting statistics of players are stored in one table (Batting
), while information about people (most of whom are players) is in a different table (Master
).
Every row in the Batting
table contains the statistics accumulated by a single player during a single stint for a single team in a single year. Thus, a player like Manny Ramirez has many rows in the Batting
table (21, in fact).
library(Lahman)
<- Batting %>%
manny filter(playerID == "ramirma02")
nrow(manny)
[1] 21
Using what we’ve learned, we can quickly tabulate Ramirez’s most common career offensive statistics.
For those new to baseball, some additional background may be helpful.
A hit (H
) occurs when a batter reaches base safely. A home run (HR
) occurs when the ball is hit out of the park or the runner advances through all of the bases during that play.
Barry Bonds has the record for most home runs (762) hit in a career. A player’s batting average (BA
) is the ratio of the number of hits to the number of eligible at-bats.
The highest career batting average in Major League Baseball history of 0.366 was achieved by Ty Cobb—season averages above 0.300 are impressive.
Finally, runs batted in (RBI
) is the number of runners (including the batter in the case of a home run) that score during that batter’s at-bat.
Hank Aaron has the record for most career RBIs with 2,297.
%>%
manny summarize(
span = paste(min(yearID), max(yearID), sep = "-"),
num_years = n_distinct(yearID),
num_teams = n_distinct(teamID),
BA = sum(H)/sum(AB),
tH = sum(H),
tHR = sum(HR),
tRBI = sum(RBI)
)
span num_years num_teams BA tH tHR tRBI
1 1993-2011 19 5 0.312 2574 555 1831
Notice how we have used the paste()
function to combine results from multiple variables into a new variable, and how we have used the n_distinct()
function to count the number of distinct rows.
In his 19-year career, Ramirez hit 555 home runs, which puts him in the top 20 among all Major League players.
However, we also see that Ramirez played for five teams during his career.
Did he perform equally well for each of them?
Breaking his statistics down by team, or by league, is as easy as adding an appropriate group_by()
command.
%>%
manny group_by(teamID) %>%
summarize(
span = paste(min(yearID), max(yearID), sep = "-"),
num_years = n_distinct(yearID),
num_teams = n_distinct(teamID),
BA = sum(H)/sum(AB),
tH = sum(H),
tHR = sum(HR),
tRBI = sum(RBI)
%>%
) arrange(span)
# A tibble: 5 × 8
teamID span num_years num_teams BA tH tHR tRBI
<fct> <chr> <int> <int> <dbl> <int> <int> <int>
1 CLE 1993-2000 8 1 0.313 1086 236 804
2 BOS 2001-2008 8 1 0.312 1232 274 868
3 LAN 2008-2010 3 1 0.322 237 44 156
4 CHA 2010-2010 1 1 0.261 18 1 2
5 TBA 2011-2011 1 1 0.0588 1 0 1
While Ramirez was very productive for Cleveland, Boston, and the Los Angeles Dodgers, his brief tours with the Chicago White Sox and Tampa Bay Rays were less than stellar. In the pipeline below, we can see that Ramirez spent the bulk of his career in the American League.
%>%
manny group_by(lgID) %>%
summarize(
span = paste(min(yearID), max(yearID), sep = "-"),
num_years = n_distinct(yearID),
num_teams = n_distinct(teamID),
BA = sum(H)/sum(AB),
tH = sum(H),
tHR = sum(HR),
tRBI = sum(RBI)
%>%
) arrange(span)
# A tibble: 2 × 8
lgID span num_years num_teams BA tH tHR tRBI
<fct> <chr> <int> <int> <dbl> <int> <int> <int>
1 AL 1993-2011 18 4 0.311 2337 511 1675
2 NL 2008-2010 3 1 0.322 237 44 156
If Ramirez played in only 19 different seasons, why were there 21 rows attributed to him? Notice that in 2008, he was traded from the Boston Red Sox to the Los Angeles Dodgers, and thus played for both teams. Similarly, in 2010 he played for both the Dodgers and the Chicago White Sox.
When summarizing data, it is critically important to understand exactly how the rows of your data frame are organized.
To see what can go wrong here, suppose we were interested in tabulating the number of seasons in which Ramirez hit at least 30 home runs. The simplest solution is:
%>%
manny filter(HR >= 30) %>%
nrow()
[1] 11
But this answer is wrong, because in 2008, Ramirez hit 20 home runs for Boston before being traded and then 17 more for the Dodgers afterwards. Neither of those rows were counted, since they were both filtered out. Thus, the year 2008 does not appear among the 11 that we counted in the previous pipeline. Recall that each row in the manny
data frame corresponds to one stint with one team in one year. On the other hand, the question asks us to consider each year, regardless of team. In order to get the right answer, we have to aggregate the rows by team. Thus, the correct solution is:
%>%
manny group_by(yearID) %>%
summarize(tHR = sum(HR)) %>%
filter(tHR >= 30) %>%
nrow()
[1] 12
Note that the filter()
operation is applied to tHR
, the total number of home runs in a season, and not HR
, the number of home runs in a single stint for a single team in a single season. (This distinction between filtering the rows of the original data versus the rows of the aggregated results will appear again in Chapter 15.)
We began this example by filtering the Batting
table for the player with playerID
equal to ramirma02
.
How did we know to use this identifier?
This player ID is known as a key, and in fact, playerID
is the primary key defined in the Master
table.
That is, every row in the Master
table is uniquely identified by the value of playerID
.
There is exactly one row in that table for which playerID
is equal to ramirma02
.
But how did we know that this ID corresponds to Manny Ramirez? We can search the Master
table.
The data in this table include characteristics about Manny Ramirez that do not change across multiple seasons (with the possible exception of his weight).
%>%
Master filter(nameLast == "Ramirez" & nameFirst == "Manny")
playerID birthYear birthMonth birthDay birthCountry birthState
1 ramirma02 1972 5 30 D.R. Distrito Nacional
birthCity deathYear deathMonth deathDay deathCountry deathState
1 Santo Domingo NA NA NA <NA> <NA>
deathCity nameFirst nameLast nameGiven weight height bats throws
1 <NA> Manny Ramirez Manuel Aristides 225 72 R R
debut finalGame retroID bbrefID deathDate birthDate
1 1993-09-02 2011-04-06 ramim002 ramirma02 <NA> 1972-05-30
The playerID
column forms a primary key in the Master
table, but it does not in the Batting
table, since as we saw previously, there were 21 rows with that playerID
. In the Batting
table, the playerID
column is known as a foreign key, in that it references a primary key in another table. For our purposes, the presence of this column in both tables allows us to link them together. This way, we can combine data from the Batting
table with data in the Master
table. We do this with inner_join()
by specifying the two tables that we want to join, and the corresponding columns in each table that provide the link. Thus, if we want to display Ramirez’s name in our previous result, as well as his age, we must join the Batting
and Master
tables together.
Always specify the by
argument that defines the join condition. Don’t rely on the defaults.
%>%
Batting filter(playerID == "ramirma02") %>%
inner_join(Master, by = c("playerID" = "playerID")) %>%
group_by(yearID) %>%
summarize(
Age = max(yearID - birthYear),
num_teams = n_distinct(teamID),
BA = sum(H)/sum(AB),
tH = sum(H),
tHR = sum(HR),
tRBI = sum(RBI)
%>%
) arrange(yearID)
# A tibble: 19 × 7
yearID Age num_teams BA tH tHR tRBI
<int> <int> <int> <dbl> <int> <int> <int>
1 1993 21 1 0.170 9 2 5
2 1994 22 1 0.269 78 17 60
3 1995 23 1 0.308 149 31 107
4 1996 24 1 0.309 170 33 112
5 1997 25 1 0.328 184 26 88
6 1998 26 1 0.294 168 45 145
7 1999 27 1 0.333 174 44 165
8 2000 28 1 0.351 154 38 122
9 2001 29 1 0.306 162 41 125
10 2002 30 1 0.349 152 33 107
11 2003 31 1 0.325 185 37 104
12 2004 32 1 0.308 175 43 130
13 2005 33 1 0.292 162 45 144
14 2006 34 1 0.321 144 35 102
15 2007 35 1 0.296 143 20 88
16 2008 36 2 0.332 183 37 121
17 2009 37 1 0.290 102 19 63
18 2010 38 2 0.298 79 9 42
19 2011 39 1 0.0588 1 0 1
Notice that even though Ramirez’s age is a constant for each season, we have to use a vector operation (i.e., max()
or first()
) in order to reduce any potential vector to a single number.
Which season was Ramirez’s best as a hitter? One relatively simple measurement of batting prowess is OPS, or On-Base Plus Slugging Percentage, which is the simple sum of two other statistics: On-Base Percentage (OBP) and Slugging Percentage (SLG). The former basically measures the proportion of time that a batter reaches base safely, whether it comes via a hit (H
), a base on balls (BB
), or from being hit by the pitch (HBP
). The latter measures the average number of bases advanced per at-bat (AB
), where a single is worth one base, a double (X2B
) is worth two, a triple (X3B
) is worth three, and a home run (HR
) is worth four. (Note that every hit is exactly one of a single, double, triple, or home run.)
Let’s add these statistics to our results and use it to rank the seasons.
<- Batting %>%
manny_by_season filter(playerID == "ramirma02") %>%
inner_join(Master, by = c("playerID" = "playerID")) %>%
group_by(yearID) %>%
summarize(
Age = max(yearID - birthYear),
num_teams = n_distinct(teamID),
BA = sum(H)/sum(AB),
tH = sum(H),
tHR = sum(HR),
tRBI = sum(RBI),
OBP = sum(H + BB + HBP) / sum(AB + BB + SF + HBP),
SLG = sum(H + X2B + 2 * X3B + 3 * HR) / sum(AB)
%>%
) mutate(OPS = OBP + SLG) %>%
arrange(desc(OPS))
manny_by_season
# A tibble: 19 × 10
yearID Age num_teams BA tH tHR tRBI OBP SLG OPS
<int> <int> <int> <dbl> <int> <int> <int> <dbl> <dbl> <dbl>
1 2000 28 1 0.351 154 38 122 0.457 0.697 1.15
2 1999 27 1 0.333 174 44 165 0.442 0.663 1.11
3 2002 30 1 0.349 152 33 107 0.450 0.647 1.10
4 2006 34 1 0.321 144 35 102 0.439 0.619 1.06
5 2008 36 2 0.332 183 37 121 0.430 0.601 1.03
6 2003 31 1 0.325 185 37 104 0.427 0.587 1.01
7 2001 29 1 0.306 162 41 125 0.405 0.609 1.01
8 2004 32 1 0.308 175 43 130 0.397 0.613 1.01
9 2005 33 1 0.292 162 45 144 0.388 0.594 0.982
10 1996 24 1 0.309 170 33 112 0.399 0.582 0.981
11 1998 26 1 0.294 168 45 145 0.377 0.599 0.976
12 1995 23 1 0.308 149 31 107 0.402 0.558 0.960
13 1997 25 1 0.328 184 26 88 0.415 0.538 0.953
14 2009 37 1 0.290 102 19 63 0.418 0.531 0.949
15 2007 35 1 0.296 143 20 88 0.388 0.493 0.881
16 1994 22 1 0.269 78 17 60 0.357 0.521 0.878
17 2010 38 2 0.298 79 9 42 0.409 0.460 0.870
18 1993 21 1 0.170 9 2 5 0.2 0.302 0.502
19 2011 39 1 0.0588 1 0 1 0.0588 0.0588 0.118
We see that Ramirez’s OPS was highest in 2000. But 2000 was the height of the steroid era, when many sluggers were putting up tremendous offensive numbers. As data scientists, we know that it would be more instructive to put Ramirez’s OPS in context by comparing it to the league average OPS in each season—the resulting ratio is often called OPS+. To do this, we will need to compute those averages. Because there is missing data in some of these columns in some of these years, we need to invoke the na.rm
argument to ignore that data.
<- Batting %>%
mlb filter(yearID %in% 1993:2011) %>%
group_by(yearID) %>%
summarize(
lg_OBP = sum(H + BB + HBP, na.rm = TRUE) /
sum(AB + BB + SF + HBP, na.rm = TRUE),
lg_SLG = sum(H + X2B + 2*X3B + 3*HR, na.rm = TRUE) /
sum(AB, na.rm = TRUE)
%>%
) mutate(lg_OPS = lg_OBP + lg_SLG)
Next, we need to match these league average OPS values to the corresponding entries for Ramirez. We can do this by joining these tables together, and computing the ratio of Ramirez’s OPS to that of the league average.
<- manny_by_season %>%
manny_ratio inner_join(mlb, by = c("yearID" = "yearID")) %>%
mutate(OPS_plus = OPS / lg_OPS) %>%
select(yearID, Age, OPS, lg_OPS, OPS_plus) %>%
arrange(desc(OPS_plus))
manny_ratio
# A tibble: 19 × 5
yearID Age OPS lg_OPS OPS_plus
<int> <int> <dbl> <dbl> <dbl>
1 2000 28 1.15 0.782 1.48
2 2002 30 1.10 0.748 1.47
3 1999 27 1.11 0.778 1.42
4 2006 34 1.06 0.768 1.38
5 2008 36 1.03 0.749 1.38
6 2003 31 1.01 0.755 1.34
7 2001 29 1.01 0.759 1.34
8 2004 32 1.01 0.763 1.32
9 2005 33 0.982 0.749 1.31
10 1998 26 0.976 0.755 1.29
11 1996 24 0.981 0.767 1.28
12 1995 23 0.960 0.755 1.27
13 2009 37 0.949 0.751 1.26
14 1997 25 0.953 0.756 1.26
15 2010 38 0.870 0.728 1.19
16 2007 35 0.881 0.758 1.16
17 1994 22 0.878 0.763 1.15
18 1993 21 0.502 0.736 0.682
19 2011 39 0.118 0.720 0.163
In this case, 2000 still ranks as Ramirez’s best season relative to his peers, but notice that his 1999 season has fallen from 2nd to 3rd. Since by definition a league average batter has an OPS+ of 1, Ramirez posted 17 consecutive seasons with an OPS that was at least 15% better than the average across the major leagues—a truly impressive feat.
Finally, not all joins are the same.
An inner_join()
requires corresponding entries in both tables.
Conversely, a left_join()
returns at least as many rows as there are in the first table, regardless of whether there are matches in the second table.
An inner_join()
is bidirectional, whereas in a left_join()
, the order in which you specify the tables matters.
Consider the career of Cal Ripken, who played in 21 seasons from 1981 to 2001. His career overlapped with Ramirez’s in the nine seasons from 1993 to 2001, so for those, the league averages we computed before are useful.
<- Batting %>%
ripken filter(playerID == "ripkeca01")
%>%
ripken inner_join(mlb, by = c("yearID" = "yearID")) %>%
nrow()
[1] 9
# same
%>%
mlb inner_join(ripken, by = c("yearID" = "yearID")) %>%
nrow()
[1] 9
For seasons when Ramirez did not play, NA
’s will be returned.
%>%
ripken left_join(mlb, by = c("yearID" = "yearID")) %>%
select(yearID, playerID, lg_OPS) %>%
head(3)
yearID playerID lg_OPS
1 1981 ripkeca01 NA
2 1982 ripkeca01 NA
3 1983 ripkeca01 NA
Conversely, by reversing the order of the tables in the join, we return the 19 seasons for which we have already computed the league averages, regardless of whether there is a match for Ripken (results not displayed).
%>%
mlb left_join(ripken, by = c("yearID" = "yearID")) %>%
select(yearID, playerID, lg_OPS)
5.4 Further resources
Sean Lahman has long curated his baseball data set, which feeds the popular website baseball-reference.com. Michael Friendly maintains the Lahman R package (Friendly et al. 2021). For the baseball enthusiast, Cleveland Indians analyst Max Marchi and Jim Albert have written an excellent book on analyzing baseball data in R (Marchi and Albert 2013). A second edition updates the code for the tidyverse (James Albert, Marchi, and Baumer 2018). Albert has also written a book describing how baseball can be used as a motivating example for teaching statistics (Jim Albert 2003).
5.5 Exercises
Problem 1 (Easy): Consider the following data frames with information about U.S. states from 1977.
<- tibble(names = state.name, twoletter = state.abb)
statenames glimpse(statenames)
Rows: 50
Columns: 2
$ names <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California"…
$ twoletter <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "G…
<- tibble(
statedata names = state.name,
income = state.x77[, 2],
illiteracy = state.x77[, 3]
)glimpse(statedata)
Rows: 50
Columns: 3
$ names <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California…
$ income <dbl> 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 4815, 4…
$ illiteracy <dbl> 2.1, 1.5, 1.8, 1.9, 1.1, 0.7, 1.1, 0.9, 1.3, 2.0, 1.9, …
Create a scatterplot of illiteracy (as a percent of population) and per capita income (in U.S. dollars) with points plus labels of the two letter state abbreviations. Add a smoother. Use the ggrepel
package to offset the names that overlap. What pattern do you observe? Are there any outlying observations?
Problem 2 (Medium): Use the Batting
, Pitching
, and Master
tables in the Lahman
package to answer the following questions.
Name every player in baseball history who has accumulated at least 300 home runs (
HR
) and at least 300 stolen bases (SB
). You can find the first and last name of the player in theMaster
data frame. Join this to your result along with the total home runs and total bases stolen for each of these elite players.Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (
W
) and at least 3,000 strikeouts (SO
).Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?
Problem 3 (Medium): Use the nycflights13
package and the flights
and planes
tables to answer the following questions:
How many planes have a missing date of manufacture?
What are the five most common manufacturers?
Has the distribution of manufacturer changed over time as reflected by the airplanes flying from NYC in 2013? (Hint: you may need to use
case_when()
to recode the manufacturer name and collapse rare vendors into a category calledOther
.)
Problem 4 (Medium): Use the nycflights13
package and the flights
and planes
tables to answer the following questions:
What is the oldest plane (specified by the
tailnum
variable) that flew from New York City airports in 2013?How many airplanes that flew from New York City are included in the
planes
table?
Problem 5 (Medium): The Relative Age Effect is an attempt to explain anomalies in the distribution of birth month among athletes. Briefly, the idea is that children born just after the age cut-off to enroll in school will be as much as 11 months older than their fellow athletes, which is enough of a disparity to give them an advantage. That advantage will then be compounded over the years, resulting in notably more professional athletes born in these months.
Display the distribution of birth months of baseball players who batted during the decade of the 2000s.
How are they distributed over the calendar year? Does this support the notion of a relative age effect? Use the
Births78
data set from themosaicData
package as a reference.
Problem 6 (Hard): Use the fec12
package to download the Federal Election Commission data for 2012. Re-create Figures 2.1 and 2.2 from the text using ggplot2
.
5.6 Supplementary exercises
Available at https://mdsr-book.github.io/mdsr2e/ch-join.html#join-online-exercises
Problem 1 (Easy): What type of join operation is depicted below?
Problem 2 (Easy): What type of join operation is depicted below?
Problem 3 (Easy): What type of join operation is depicted below?
Problem 4 (Hard): Use the FEC data to re-create Figure 2.8.