In this lab, we will learn how to write user-defined functions.

library(tidyverse)
library(babynames)

Goal: by the end of this lab, you will be able to write a function in R and execute it.

Extending a single pipeline to a function

We already know how to filter for a particular name:

babynames %>%
  filter(name == "Benjamin")

Suppose that we want to find the year in which that name was most popular (see Exercise 2 from Lab 4). To do this we need a pipeline that consists of several verbs chained together.

babynames %>%
  filter(name == "Benjamin") %>%    # keep only the rows for this name
  group_by(year) %>%
  summarize(total = sum(prop)) %>%  # add up the proportions across sexes
  arrange(desc(total)) %>%          # most popular year first
  head(1) %>%
  select(year)

But we might want to do this for many names, and it would be tedious to have to re-type – or even just re-run – the same code over and over again. An elegant solution is to write a function. For example, here we write a function called most_popular_year() that will return the year in which a specific name was most popular.

most_popular_year <- function(name_arg) {
  babynames %>%
    filter(name == name_arg) %>%
    group_by(year) %>%
    summarize(total = sum(prop)) %>%
    arrange(desc(total)) %>%
    head(1) %>%
    select(year)
}

Now we can run our function on several different names without having to re-type all of that code. Here we find the most popular year for the first names of the actors and actresses who won at the 89th Academy Awards.

most_popular_year(name_arg = "Emma")
most_popular_year(name_arg = "Viola")
most_popular_year(name_arg = "Casey")
most_popular_year(name_arg = "Mahershala")

Signatures

R doesn’t have formal type signatures for its functions the way that some other programming languages do. However, being aware of what kinds of objects your function takes, and what kind of object it returns, is usually very important.

You can always show the arguments that a given function takes by using the formals() function.

formals(most_popular_year)
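As a minimal sketch of what formals() reports, the following uses a small hypothetical function, scale_values(), that is not part of the lab. It has one argument without a default and one with a default, so you can see how each shows up.

```r
# Hypothetical helper, used only to illustrate formals()
scale_values <- function(x, multiplier = 2) {
  x * multiplier
}

# formals() returns a named, list-like object with one entry per argument
args_list <- formals(scale_values)

# The names of that object are the argument names
names(args_list)

# An argument with a default value stores that default
args_list$multiplier
```

Arguments without a default (like x here) still appear in the result, but their entry is empty.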

In this case, the most_popular_year() function takes a single argument called name_arg, which should be a character vector of length one, and returns a tbl_df.

More details about functions that exist within packages are available via help(name_of_function).

Return values

By default, an R function returns the result of the last command that is executed by the function. For most_popular_year(), there is only one “line” of code (i.e., the whole pipeline), and the result of that will be a tbl_df.

Alternatively, you can use return(value) to explicitly return an object. Every R function returns something (i.e., there is no such thing as a “void” function), although the returned value may be NULL.
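The two styles of returning a value can be sketched with a few toy functions. The names add_implicit(), add_explicit(), and do_nothing() are hypothetical and exist only for this illustration.

```r
# Implicit return: the value of the last expression is returned
add_implicit <- function(a, b) {
  a + b
}

# Explicit return: return() exits the function with the given value
add_explicit <- function(a, b) {
  return(a + b)
}

# Even a function with an empty body returns something: NULL
do_nothing <- function() {}

add_implicit(1, 2)
add_explicit(1, 2)
is.null(do_nothing())
```

Both versions of the addition function give the same result, so the choice between them is mostly a matter of style.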

Default argument values

If you want an argument to your function to have a default value, specify it in the function definition.

The way that we have defined most_popular_year(), there is no default value for name_arg. Thus, if we call the function with no arguments, it will throw an error.

most_popular_year()

In this case, this is probably the desired behavior, since it doesn’t make sense to call this function without specifying a name. However, we could have defined it with a default value, say "Benjamin".

most_popular_year_ben <- function(name_arg = "Benjamin") {
  babynames %>%
    filter(name == name_arg) %>%
    group_by(year) %>%
    summarize(total = sum(prop)) %>%
    arrange(desc(total)) %>%
    head(1) %>%
    select(year)
}

Now we can call the function without specifying the name_arg argument, but in that case we’ll get the results for "Benjamin".

most_popular_year_ben()

Of course, we can still override the default value of name_arg:

most_popular_year_ben(name_arg = "Jordan")

Scoping

How did our function know about the babynames table? Why wasn’t that an input to the function? The answer to the first question involves the notion of variable scoping, while the answer to the second question is a design choice.

The rules for variable scoping in R are…complicated. But what is important for you to understand is that when R can’t find an object locally (i.e., inside the function), it looks in the environment where the function was defined (here, the global environment) and then in the packages that are attached with library(). So when we run most_popular_year(), R will look for a data frame called babynames outside the function, and it finds the one provided by the babynames package. If it exists, then the function should work, but if not, it won’t. Thus, whether a user-defined function in R works as expected can depend on what is in the global environment and on which packages are attached. This behavior is different from most compiled programming languages (e.g., C++ or Java), but it is designed to make it easy to script with functions on-the-fly.
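The lookup behavior described above can be sketched without any packages. The variable greeting and the functions greet() and greet_local() are hypothetical names made up for this illustration.

```r
# Defined outside of any function
greeting <- "Hello"

# greet() has no local variable called greeting,
# so R uses the one defined outside the function
greet <- function(name_arg) {
  paste(greeting, name_arg)
}

# greet_local() creates its own greeting, which takes precedence
greet_local <- function(name_arg) {
  greeting <- "Goodbye"
  paste(greeting, name_arg)
}

greet("Emma")        # "Hello Emma"
greet_local("Emma")  # "Goodbye Emma"
greeting             # still "Hello": the outer variable is unchanged
```

If greeting were removed before calling greet(), the call would fail, which is the same kind of failure we get when babynames is not available.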

Note that if we unload the babynames package, thus removing the babynames table from the environment, our function no longer works.

detach("package:babynames", unload = TRUE)
# should throw an error
most_popular_year("Benjamin")

Don’t forget to bring babynames back.

library(babynames)

To be more explicit, we could pass the table that we want to search to the function. We can achieve this by re-writing the function to take a data argument:

most_popular_year2 <- function(data, name_arg) {
  data %>%
    filter(name == name_arg) %>%
    group_by(year) %>%
    summarize(total = sum(prop)) %>%
    arrange(desc(total)) %>%
    head(1) %>%
    select(year)
}
# will throw error because we didn't specify "data"
most_popular_year2(name_arg = "Casey")
# works
most_popular_year2(data = babynames, name_arg = "Casey")

This also enables us to apply our function to subsets of the original data. So we can search for the most popular year for Casey among boys and girls separately.

babynames %>%
  filter(sex == "F") %>%
  most_popular_year2(name_arg = "Casey")
babynames %>%
  filter(sex == "M") %>%
  most_popular_year2(name_arg = "Casey")

Order of arguments

Note that the order of the arguments matters only if they are not named.

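# works: arguments are matched by position (data, then name_arg)
most_popular_year2(babynames, "Emma")
# throws an error: "Emma" is matched to data and babynames to name_arg
most_popular_year2("Emma", babynames)
# works: data is matched by name, so "Emma" fills the remaining name_arg
most_popular_year2("Emma", data = babynames)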

To be safe (and explicit), name your arguments unless you have a good reason not to.
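The matching rules can also be seen in a small example that doesn't need any data. The function describe() and its argument names are hypothetical, chosen only to make the matching visible.

```r
# Hypothetical function that reports which value went to which argument
describe <- function(first, second) {
  paste("first =", first, "| second =", second)
}

# Positional: values are matched in the order the arguments are defined
by_position <- describe("a", "b")

# Named: the order in the call does not matter
by_name <- describe(second = "b", first = "a")

# Mixed: named arguments are matched first, then "b" fills the remaining one
mixed <- describe("b", first = "a")

by_position
identical(by_position, by_name)
identical(by_position, mixed)
```

All three calls assign "a" to first and "b" to second, so they return the same string.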

Exercises

These exercises use the nycflights13 data package.

  1. Write a function that, for a given carrier identifier (e.g. DL), will retrieve the five most common airport destinations from NYC in 2013, and how often the carrier flew there.

  2. Use your function to find the top five destinations for Delta Airlines (DL).

  3. Use your function to find the top five destinations for American Airlines (AA). How many of these destinations are shared with Delta?

  4. Write a function that, for a given airport code (e.g. BDL), will retrieve the five most common carriers that service that airport from NYC in 2013, and what their average arrival delay time was.