5  Data wrangling on multiple tables

In the previous chapter, we illustrated how the five data wrangling verbs can be chained to perform operations on a single table. A single table is reminiscent of a single well-organized spreadsheet. But in the same way that a workbook can contain multiple spreadsheets, we will often work with multiple tables. In Chapter 15, we will describe how multiple tables related by unique identifiers called keys can be organized into a relational database management system.

It is more efficient for the computer to store and search tables in which “like is stored with like.” Thus, a database maintained by the Bureau of Transportation Statistics on the arrival times of U.S. commercial flights will consist of multiple tables, each of which contains data about different things. For example, the nycflights13 package contains one table about flights—each row in this table is a single flight. As there are many flights, you can imagine that this table will get very long—hundreds of thousands of rows per year.

There are other related kinds of information that we will want to know about these flights. We would certainly be interested in the particular airline to which each flight belonged. It would be inefficient to store the complete name of the airline (e.g., American Airlines Inc.) in every row of the flights table. A simple code (e.g., AA) would take up less space on disk. For small tables, the savings of storing two characters instead of 25 is insignificant, but for large tables, it can add up to noticeable savings both in terms of the size of data on disk, and the speed with which we can search it. However, we still want to have the full names of the airlines available if we need them. The solution is to store the data about airlines in a separate table called airlines, and to provide a key that links the data in the two tables together.

5.1 inner_join()

If we examine the first few rows of the flights table, we observe that the carrier column contains a two-character string corresponding to the airline.

library(tidyverse)
library(mdsr)
library(nycflights13)
glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

In the airlines table, we have those same two-character strings, but also the full names of the airline.

head(airlines, 3)
# A tibble: 3 × 2
  carrier name                  
  <chr>   <chr>                 
1 9E      Endeavor Air Inc.     
2 AA      American Airlines Inc.
3 AS      Alaska Airlines Inc.  

In order to retrieve a list of flights and the full names of the airlines that managed each flight, we need to match up the rows in the flights table with those rows in the airlines table that have the corresponding values for the carrier column in both tables. This is achieved with the function inner_join().

flights_joined <- flights |>
  inner_join(airlines, by = c("carrier" = "carrier"))
glimpse(flights_joined)
Rows: 336,776
Columns: 20
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
$ name           <chr> "United Air Lines Inc.", "United Air Lines Inc.", "Amer…

Notice that the flights_joined data frame now has an additional variable called name. This is the column from airlines that is now included in the combined data frame. We can view the full names of the airlines instead of the cryptic two-character codes.

flights_joined |>
  select(carrier, name, flight, origin, dest) |>
  head(3)
# A tibble: 3 × 5
  carrier name                   flight origin dest 
  <chr>   <chr>                   <int> <chr>  <chr>
1 UA      United Air Lines Inc.    1545 EWR    IAH  
2 UA      United Air Lines Inc.    1714 LGA    IAH  
3 AA      American Airlines Inc.   1141 JFK    MIA  

In an inner_join(), the result set contains only those rows that have matches in both tables. In this case, all of the rows in flights have exactly one corresponding entry in airlines, so the number of rows in flights_joined is the same as the number of rows in flights (this will not always be the case).

nrow(flights)
[1] 336776
nrow(flights_joined)
[1] 336776
Helpful Tip

It is always a good idea to carefully check that the number of rows returned by a join operation is what you expected. In particular, you should carefully check for rows in one table that matched to more than one row in the other table.

5.2 left_join()

Another commonly-used type of join is a left_join(). Here the rows of the first table are always returned, regardless of whether there is a match in the second table.

Suppose that we are only interested in flights from the New York City airports to the West Coast. To restrict ourselves to airports in the Pacific Time Zone (UTC -8) we can filter the airports data frame to only include those airports.

airports_pt <- airports |>
  filter(tz == -8)
nrow(airports_pt)
[1] 178

Now, if we perform an inner_join() on flights and airports_pt, matching the destinations in flights to the FAA codes in airports, we retrieve only those flights that flew to our airports in the Pacific Time Zone.

nyc_dests_pt <- flights |>
  inner_join(airports_pt, by = c("dest" = "faa"))
nrow(nyc_dests_pt)
[1] 46324

However, if we use a left_join() with the same conditions, we retrieve all of the rows of flights. NA’s are inserted into the columns where no matched data was found.

nyc_dests <- flights |>
  left_join(airports_pt, by = c("dest" = "faa"))

nyc_dests |>
  summarize(
    num_flights = n(),
    num_flights_pt = sum(!is.na(name)),
    num_flights_not_pt = sum(is.na(name))
  )
# A tibble: 1 × 3
  num_flights num_flights_pt num_flights_not_pt
        <int>          <int>              <int>
1      336776          46324             290452

Left joins are particularly useful in databases in which referential integrity is broken (not all of the keys are present—see Chapter 15).

5.3 Extended example: Manny Ramirez

In the context of baseball and the Lahman package, multiple tables are used to store information. The batting statistics of players are stored in one table (Batting), while information about people (most of whom are players) is in a different table (People).

Every row in the Batting table contains the statistics accumulated by a single player during a single stint for a single team in a single year. Thus, a player like Manny Ramirez has many rows in the Batting table (21, in fact).

library(Lahman)
manny <- Batting |>
  filter(playerID == "ramirma02")
nrow(manny)
[1] 21

Using what we’ve learned, we can quickly tabulate Ramirez’s most common career offensive statistics. For those new to baseball, some additional background may be helpful. A hit (H) occurs when a batter reaches base safely. A home run (HR) occurs when the ball is hit out of the park or the runner advances through all of the bases during that play. Barry Bonds has the record for most home runs (762) hit in a career. A player’s batting average (BA) is the ratio of the number of hits to the number of eligible at-bats. The highest career batting average in Major League Baseball history of 0.366 was achieved by Ty Cobb—season averages above 0.300 are impressive. Finally, runs batted in (RBI) is the number of runners (including the batter in the case of a home run) that score during that batter’s at-bat. Hank Aaron has the record for most career RBIs with 2,297.

manny |>
  summarize(
    span = paste(min(yearID), max(yearID), sep = "-"), 
    num_years = n_distinct(yearID), 
    num_teams = n_distinct(teamID), 
    BA = sum(H)/sum(AB), 
    tH = sum(H), 
    tHR = sum(HR), 
    tRBI = sum(RBI)
  )
       span num_years num_teams        BA   tH tHR tRBI
1 1993-2011        19         5 0.3122271 2574 555 1831

Notice how we have used the paste() function to combine results from multiple variables into a new variable, and how we have used the n_distinct() function to count the number of distinct rows. In his 19-year career, Ramirez hit 555 home runs, which puts him in the top 20 among all Major League players.

However, we also see that Ramirez played for five teams during his career. Did he perform equally well for each of them? Breaking his statistics down by team, or by league, is as easy as adding an appropriate group_by() command.

manny |>
  group_by(teamID) |>
  summarize(
    span = paste(min(yearID), max(yearID), sep = "-"), 
    num_years = n_distinct(yearID), 
    num_teams = n_distinct(teamID), 
    BA = sum(H)/sum(AB), 
    tH = sum(H), 
    tHR = sum(HR), 
    tRBI = sum(RBI)
  ) |>
  arrange(span)
# A tibble: 5 × 8
  teamID span      num_years num_teams     BA    tH   tHR  tRBI
  <fct>  <chr>         <int>     <int>  <dbl> <int> <int> <int>
1 CLE    1993-2000         8         1 0.313   1086   236   804
2 BOS    2001-2008         8         1 0.312   1232   274   868
3 LAN    2008-2010         3         1 0.322    237    44   156
4 CHA    2010-2010         1         1 0.261     18     1     2
5 TBA    2011-2011         1         1 0.0588     1     0     1

While Ramirez was very productive for Cleveland, Boston, and the Los Angeles Dodgers, his brief tours with the Chicago White Sox and Tampa Bay Rays were less than stellar. In the pipeline below, we can see that Ramirez spent the bulk of his career in the American League.

manny |>
  group_by(lgID) |>
  summarize(
    span = paste(min(yearID), max(yearID), sep = "-"), 
    num_years = n_distinct(yearID), 
    num_teams = n_distinct(teamID), 
    BA = sum(H)/sum(AB), 
    tH = sum(H), 
    tHR = sum(HR), 
    tRBI = sum(RBI)
  ) |>
  arrange(span)
# A tibble: 2 × 8
  lgID  span      num_years num_teams    BA    tH   tHR  tRBI
  <fct> <chr>         <int>     <int> <dbl> <int> <int> <int>
1 AL    1993-2011        18         4 0.311  2337   511  1675
2 NL    2008-2010         3         1 0.322   237    44   156

If Ramirez played in only 19 different seasons, why were there 21 rows attributed to him? Notice that in 2008, he was traded from the Boston Red Sox to the Los Angeles Dodgers, and thus played for both teams. Similarly, in 2010 he played for both the Dodgers and the Chicago White Sox.

Helpful Tip

When summarizing data, it is critically important to understand exactly how the rows of your data frame are organized.

To see what can go wrong here, suppose we were interested in tabulating the number of seasons in which Ramirez hit at least 30 home runs. The simplest solution is:

manny |>
  filter(HR >= 30) |>
  nrow()
[1] 11

But this answer is wrong, because in 2008, Ramirez hit 20 home runs for Boston before being traded and then 17 more for the Dodgers afterwards. Neither of those rows were counted, since they were both filtered out. Thus, the year 2008 does not appear among the 11 that we counted in the previous pipeline. Recall that each row in the manny data frame corresponds to one stint with one team in one year. On the other hand, the question asks us to consider each year, regardless of team. In order to get the right answer, we have to aggregate the rows by team. Thus, the correct solution is:

manny |>
  group_by(yearID) |>
  summarize(tHR = sum(HR)) |>
  filter(tHR >= 30) |>
  nrow()
[1] 12

Note that the filter() operation is applied to tHR, the total number of home runs in a season, and not HR, the number of home runs in a single stint for a single team in a single season. (This distinction between filtering the rows of the original data versus the rows of the aggregated results will appear again in Chapter 15.)

We began this example by filtering the Batting table for the player with playerID equal to ramirma02. How did we know to use this identifier? This player ID is known as a key, and in fact, playerID is the primary key defined in the People table. That is, every row in the People table is uniquely identified by the value of playerID. There is exactly one row in that table for which playerID is equal to ramirma02.

But how did we know that this ID corresponds to Manny Ramirez? We can search the People table. The data in this table include characteristics about Manny Ramirez that do not change across multiple seasons (with the possible exception of his weight).

People |>
  filter(nameLast == "Ramirez" & nameFirst == "Manny")
   playerID birthYear birthMonth birthDay birthCountry        birthState
1 ramirma02      1972          5       30         D.R. Distrito Nacional
      birthCity deathYear deathMonth deathDay deathCountry deathState deathCity
1 Santo Domingo        NA         NA       NA         <NA>       <NA>      <NA>
  nameFirst nameLast        nameGiven weight height bats throws      debut
1     Manny  Ramirez Manuel Aristides    225     72    R      R 1993-09-02
   finalGame  retroID   bbrefID deathDate  birthDate
1 2011-04-06 ramim002 ramirma02      <NA> 1972-05-30

The playerID column forms a primary key in the People table, but it does not in the Batting table, since as we saw previously, there were 21 rows with that playerID. In the Batting table, the playerID column is known as a foreign key, in that it references a primary key in another table. For our purposes, the presence of this column in both tables allows us to link them together. This way, we can combine data from the Batting table with data in the People table. We do this with inner_join() by specifying the two tables that we want to join, and the corresponding columns in each table that provide the link. Thus, if we want to display Ramirez’s name in our previous result, as well as his age, we must join the Batting and People tables together.

Helpful Tip

Always specify the by argument that defines the join condition. Don’t rely on the defaults.

Batting |>
  filter(playerID == "ramirma02") |>
  inner_join(People, by = c("playerID" = "playerID")) |>
  group_by(yearID) |>
  summarize(
    Age = max(yearID - birthYear), 
    num_teams = n_distinct(teamID), 
    BA = sum(H)/sum(AB), 
    tH = sum(H), 
    tHR = sum(HR), 
    tRBI = sum(RBI)
  ) |>
  arrange(yearID)
# A tibble: 19 × 7
   yearID   Age num_teams     BA    tH   tHR  tRBI
    <int> <int>     <int>  <dbl> <int> <int> <int>
 1   1993    21         1 0.170      9     2     5
 2   1994    22         1 0.269     78    17    60
 3   1995    23         1 0.308    149    31   107
 4   1996    24         1 0.309    170    33   112
 5   1997    25         1 0.328    184    26    88
 6   1998    26         1 0.294    168    45   145
 7   1999    27         1 0.333    174    44   165
 8   2000    28         1 0.351    154    38   122
 9   2001    29         1 0.306    162    41   125
10   2002    30         1 0.349    152    33   107
11   2003    31         1 0.325    185    37   104
12   2004    32         1 0.308    175    43   130
13   2005    33         1 0.292    162    45   144
14   2006    34         1 0.321    144    35   102
15   2007    35         1 0.296    143    20    88
16   2008    36         2 0.332    183    37   121
17   2009    37         1 0.290    102    19    63
18   2010    38         2 0.298     79     9    42
19   2011    39         1 0.0588     1     0     1

Notice that even though Ramirez’s age is a constant for each season, we have to use a vector operation (i.e., max() or first()) in order to reduce any potential vector to a single number.

Which season was Ramirez’s best as a hitter? One relatively simple measurement of batting prowess is OPS, or On-Base Plus Slugging Percentage, which is the simple sum of two other statistics: On-Base Percentage (OBP) and Slugging Percentage (SLG). The former basically measures the proportion of time that a batter reaches base safely, whether it comes via a hit (H), a base on balls (BB), or from being hit by the pitch (HBP). The latter measures the average number of bases advanced per at-bat (AB), where a single is worth one base, a double (X2B) is worth two, a triple (X3B) is worth three, and a home run (HR) is worth four. (Note that every hit is exactly one of a single, double, triple, or home run.) Let’s add these statistics to our results and use it to rank the seasons.

manny_by_season <- Batting |>
  filter(playerID == "ramirma02") |>
  inner_join(People, by = c("playerID" = "playerID"))  |>
  group_by(yearID) |>
  summarize(
    Age = max(yearID - birthYear), 
    num_teams = n_distinct(teamID), 
    BA = sum(H)/sum(AB), 
    tH = sum(H), 
    tHR = sum(HR), 
    tRBI = sum(RBI), 
    OBP = sum(H + BB + HBP) / sum(AB + BB + SF + HBP),
    SLG = sum(H + X2B + 2 * X3B + 3 * HR) / sum(AB)
  ) |>
  mutate(OPS = OBP + SLG) |>
  arrange(desc(OPS))
manny_by_season
# A tibble: 19 × 10
   yearID   Age num_teams     BA    tH   tHR  tRBI    OBP    SLG   OPS
    <int> <int>     <int>  <dbl> <int> <int> <int>  <dbl>  <dbl> <dbl>
 1   2000    28         1 0.351    154    38   122 0.457  0.697  1.15 
 2   1999    27         1 0.333    174    44   165 0.442  0.663  1.11 
 3   2002    30         1 0.349    152    33   107 0.450  0.647  1.10 
 4   2006    34         1 0.321    144    35   102 0.439  0.619  1.06 
 5   2008    36         2 0.332    183    37   121 0.430  0.601  1.03 
 6   2003    31         1 0.325    185    37   104 0.427  0.587  1.01 
 7   2001    29         1 0.306    162    41   125 0.405  0.609  1.01 
 8   2004    32         1 0.308    175    43   130 0.397  0.613  1.01 
 9   2005    33         1 0.292    162    45   144 0.388  0.594  0.982
10   1996    24         1 0.309    170    33   112 0.399  0.582  0.981
11   1998    26         1 0.294    168    45   145 0.377  0.599  0.976
12   1995    23         1 0.308    149    31   107 0.402  0.558  0.960
13   1997    25         1 0.328    184    26    88 0.415  0.538  0.953
14   2009    37         1 0.290    102    19    63 0.418  0.531  0.949
15   2007    35         1 0.296    143    20    88 0.388  0.493  0.881
16   1994    22         1 0.269     78    17    60 0.357  0.521  0.878
17   2010    38         2 0.298     79     9    42 0.409  0.460  0.870
18   1993    21         1 0.170      9     2     5 0.2    0.302  0.502
19   2011    39         1 0.0588     1     0     1 0.0588 0.0588 0.118

We see that Ramirez’s OPS was highest in 2000. But 2000 was the height of the steroid era, when many sluggers were putting up tremendous offensive numbers. As data scientists, we know that it would be more instructive to put Ramirez’s OPS in context by comparing it to the league average OPS in each season—the resulting ratio is often called OPS+. To do this, we will need to compute those averages. Because there is missing data in some of these columns in some of these years, we need to invoke the na.rm argument to ignore that data.

mlb <- Batting |>
  filter(yearID %in% 1993:2011) |>
  group_by(yearID) |>
  summarize(
    lg_OBP = sum(H + BB + HBP, na.rm = TRUE) / 
      sum(AB + BB + SF + HBP, na.rm = TRUE), 
    lg_SLG = sum(H + X2B + 2*X3B + 3*HR, na.rm = TRUE) /
      sum(AB, na.rm = TRUE)
  ) |>
  mutate(lg_OPS = lg_OBP + lg_SLG)

Next, we need to match these league average OPS values to the corresponding entries for Ramirez. We can do this by joining these tables together, and computing the ratio of Ramirez’s OPS to that of the league average.

manny_ratio <- manny_by_season |>
  inner_join(mlb, by = c("yearID" = "yearID")) |>
  mutate(OPS_plus = OPS / lg_OPS) |>
  select(yearID, Age, OPS, lg_OPS, OPS_plus) |>
  arrange(desc(OPS_plus))
manny_ratio
# A tibble: 19 × 5
   yearID   Age   OPS lg_OPS OPS_plus
    <int> <int> <dbl>  <dbl>    <dbl>
 1   2000    28 1.15   0.782    1.48 
 2   2002    30 1.10   0.748    1.47 
 3   1999    27 1.11   0.778    1.42 
 4   2006    34 1.06   0.768    1.38 
 5   2008    36 1.03   0.749    1.38 
 6   2003    31 1.01   0.755    1.34 
 7   2001    29 1.01   0.759    1.34 
 8   2004    32 1.01   0.763    1.32 
 9   2005    33 0.982  0.749    1.31 
10   1998    26 0.976  0.755    1.29 
11   1996    24 0.981  0.767    1.28 
12   1995    23 0.960  0.755    1.27 
13   2009    37 0.949  0.751    1.26 
14   1997    25 0.953  0.756    1.26 
15   2010    38 0.870  0.728    1.19 
16   2007    35 0.881  0.758    1.16 
17   1994    22 0.878  0.763    1.15 
18   1993    21 0.502  0.736    0.682
19   2011    39 0.118  0.720    0.163

In this case, 2000 still ranks as Ramirez’s best season relative to his peers, but notice that his 1999 season has fallen from 2nd to 3rd. Since by definition a league average batter has an OPS+ of 1, Ramirez posted 17 consecutive seasons with an OPS that was at least 15% better than the average across the major leagues—a truly impressive feat.

Finally, not all joins are the same. An inner_join() requires corresponding entries in both tables. Conversely, a left_join() returns at least as many rows as there are in the first table, regardless of whether there are matches in the second table. An inner_join() is bidirectional, whereas in a left_join(), the order in which you specify the tables matters.

Consider the career of Cal Ripken, who played in 21 seasons from 1981 to 2001. His career overlapped with Ramirez’s in the nine seasons from 1993 to 2001, so for those, the league averages we computed before are useful.

ripken <- Batting |>
  filter(playerID == "ripkeca01")
ripken |>
  inner_join(mlb, by = c("yearID" = "yearID")) |>
  nrow()
[1] 9
# same
mlb |>
  inner_join(ripken, by = c("yearID" = "yearID")) |>
  nrow()
[1] 9

For seasons when Ramirez did not play, NA’s will be returned.

ripken |>
  left_join(mlb, by = c("yearID" = "yearID")) |>
  select(yearID, playerID, lg_OPS) |>
  head(3)
  yearID  playerID lg_OPS
1   1981 ripkeca01     NA
2   1982 ripkeca01     NA
3   1983 ripkeca01     NA

Conversely, by reversing the order of the tables in the join, we return the 19 seasons for which we have already computed the league averages, regardless of whether there is a match for Ripken (results not displayed).

mlb |>
  left_join(ripken, by = c("yearID" = "yearID")) |>
  select(yearID, playerID, lg_OPS)

5.4 Further resources

Sean Lahman has long curated his baseball data set, which feeds the popular website baseball-reference.com. Michael Friendly maintains the Lahman R package (Friendly et al. 2023). For the baseball enthusiast, Cleveland Guardians analyst Max Marchi and Jim Albert have written an excellent book on analyzing baseball data in R (Marchi and Albert 2013). A second edition updates the code for the tidyverse (James Albert, Marchi, and Baumer 2018). Albert has also written a book describing how baseball can be used as a motivating example for teaching statistics (Jim Albert 2003).

5.5 Exercises

Problem 1 (Easy): Consider the following data frames with information about U.S. states from 1977.

statenames <- tibble(names = state.name, twoletter = state.abb)
glimpse(statenames)
Rows: 50
Columns: 2
$ names     <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "C…
$ twoletter <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA", …
statedata <- tibble(
  names = state.name, 
  income = state.x77[, 2], 
  illiteracy = state.x77[, 3]
  )
glimpse(statedata)
Rows: 50
Columns: 3
$ names      <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", "…
$ income     <dbl> 3624, 6315, 4530, 3378, 5114, 4884, 5348, 4809, 4815, 4091,…
$ illiteracy <dbl> 2.1, 1.5, 1.8, 1.9, 1.1, 0.7, 1.1, 0.9, 1.3, 2.0, 1.9, 0.6,…

Create a scatterplot of illiteracy (as a percent of population) and per capita income (in U.S. dollars) with points plus labels of the two letter state abbreviations. Add a smoother. Use the ggrepel package to offset the names that overlap. What pattern do you observe? Are there any outlying observations?

Problem 2 (Medium): Use the Batting, Pitching, and People tables in the Lahman package to answer the following questions.

  1. Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB). You can find the first and last name of the player in the People data frame. Join this to your result along with the total home runs and total bases stolen for each of these elite players.

  2. Similarly, name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (SO).

  3. Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?

Problem 3 (Medium): Use the nycflights13 package and the flights and planes tables to answer the following questions:

  1. How many planes have a missing date of manufacture?

  2. What are the five most common manufacturers?

  3. Has the distribution of manufacturer changed over time as reflected by the airplanes flying from NYC in 2013? (Hint: you may need to use case_when() to recode the manufacturer name and collapse rare vendors into a category called Other.)

Problem 4 (Medium): Use the nycflights13 package and the flights and planes tables to answer the following questions:

  1. What is the oldest plane (specified by the tailnum variable) that flew from New York City airports in 2013?

  2. How many airplanes that flew from New York City are included in the planes table?

Problem 5 (Medium): The Relative Age Effect is an attempt to explain anomalies in the distribution of birth month among athletes. Briefly, the idea is that children born just after the age cut-off to enroll in school will be as much as 11 months older than their fellow athletes, which is enough of a disparity to give them an advantage. That advantage will then be compounded over the years, resulting in notably more professional athletes born in these months.

  1. Display the distribution of birth months of baseball players who batted during the decade of the 2000s.

  2. How are they distributed over the calendar year? Does this support the notion of a relative age effect? Use the Births78 data set from the mosaicData package as a reference.

Problem 6 (Hard): Use the fec12 package to download the Federal Election Commission data for 2012. Re-create Figures 2.1 and 2.2 from the text using ggplot2.

5.6 Supplementary exercises

Available at https://mdsr-book.github.io/mdsr2e/05-joins.html#join-online-exercises

Problem 1 (Easy): What type of join operation is depicted below?

Problem 2 (Easy): What type of join operation is depicted below?

Problem 3 (Easy): What type of join operation is depicted below?

Problem 4 (Hard): Use the FEC data to re-create Figure 2.8.