Summarizing Groups

In this lab, we will continue to develop our data transformation skills by learning how to use the group_by() function in conjunction with the summarize() verb that we learned previously.

Goal: by the end of this lab, you will be able to use group_by() to perform summary operations on groups.

Summarization with `group_by()`

Consider the following problem using the babynames table in the babynames package.

Think of a name. In which year was that name given to M and F babies most equally (i.e. closest to a 50/50 split)?

How would you do this? You could, of course, scan the data visually to estimate the percentages in each year:

babynames %>%
  filter(name == "Jackie")

But this is very inefficient and does not provide an exact solution.

The key to solving this problem is to recognize that we need to collapse the two rows corresponding to each assigned sex in each year into a single row that contains the information for both sexes (this is the group_by() part). Unfortunately, there is no way for R to know what to compute its own – we have to tell it (this is the summarize() part).

The group_by function specifies a variable on which the data frame will be collapsed. Each row in the result set will correspond to one unique value of that variable. In this case, we want to group by year. [This is sometimes called “rolling up” a data set.]

babynames %>%
  filter(name == "Jackie") %>%
  group_by(year)

This doesn’t actually do much, since we haven’t told R what to compute. summarize() takes a list of definitions for columns you want to see in the result set. The key to understanding summarize() is to note that it operates on vectors (which may contain many values), but it must return a single value. [Why?] Thus, the variables defined by the arguments to summarize() are usually aggregate functions like sum(), mean(), length(), max(), n(), etc.

babynames %>%
  filter(name == "Jackie") %>%
  group_by(year) %>%
  summarize(N = n(), 
            total = sum(n), 
            boys = sum(ifelse(sex == "M", n, 0)))

Which year had the highest number of births?

SAMPLE SOLUTION:

babynames %>%
  group_by(year) %>%
  summarize(num_births = sum(n)) %>%
  arrange(desc(num_births))

In a single pipeline, compute the earliest and latest year that each name appears?

SAMPLE SOLUTION:

babynames %>%
  group_by(name) %>%
  summarize(earliest = min(year), latest = max(year))

There are 16 names that have been assigned in all 135 years. List them.

SAMPLE SOLUTION:

babynames %>%
  group_by(name) %>%
  summarize(num_appearances = n()) %>%
  filter(num_appearances == 270)

Among popular names (let’s say at least 1% of the births in a given year), which name is the youngest – meaning that its first appearance as a popular name is the most recent?

SAMPLE SOLUTION:

babynames %>%
  mutate(is_popular = prop >= 0.01) %>%
  filter(is_popular == TRUE) %>%
  group_by(name) %>%
  summarize(earliest = min(year)) %>%
  arrange(desc(earliest))

It seems like there is more diversity of names now than in the past. How have the number of names used changed over time? Has it been the same for boys and girls?
Find the most popular names of the 1990s.

SAMPLE SOLUTION:

babynames %>%
  filter(year >= 1990 & year < 2000) %>%
  group_by(name) %>%
  summarize(num_births = sum(n)) %>%
  arrange(desc(num_births))

Use ggplot2 and group_by() to create an interesting and informative data graphic. It need not be about babynames. Post your graphic and a short description of it to Slack.

Summarizing Groups

Summarization with group_by()

More Practice (optional)

Summarization with `group_by()`