# Chapter 7 Iteration

Calculators free human beings from having to perform arithmetic computations *by hand*. Similarly, programming languages free humans from having to perform iterative computations by re-running chunks of code, or worse, copying-and-pasting a chunk of code many times, while changing just one or two things in each chunk.

For example, in *Major League Baseball* there are 30 teams, and the game has been played for over 100 years. There are a number of natural questions that we might want to ask about *each team* (e.g., which player has accrued the most hits for that team?) or about each season (e.g., which seasons had the highest levels of scoring?).
If we can write a chunk of code that will answer these questions for a single team or a single season, then we should be able to generalize that chunk of code to work for *all* teams or seasons.
Furthermore, we should be able to do this without having to re-type that chunk of code. In this section, we present a variety of techniques for automating these types of iterative operations.

## 7.1 Vectorized operations

In every programming language that we can think of, there is a way to write a *loop*. For example, you can write a `for()`

loop in **R** the same way you can with most programming languages.
Recall that the `Teams`

data frame contains one row for each team in each MLB season.

```
library(tidyverse)
library(mdsr)
library(Lahman)
names(Teams)
```

```
[1] "yearID" "lgID" "teamID" "franchID"
[5] "divID" "Rank" "G" "Ghome"
[9] "W" "L" "DivWin" "WCWin"
[13] "LgWin" "WSWin" "R" "AB"
[17] "H" "X2B" "X3B" "HR"
[21] "BB" "SO" "SB" "CS"
[25] "HBP" "SF" "RA" "ER"
[29] "ERA" "CG" "SHO" "SV"
[33] "IPouts" "HA" "HRA" "BBA"
[37] "SOA" "E" "DP" "FP"
[41] "name" "park" "attendance" "BPF"
[45] "PPF" "teamIDBR" "teamIDlahman45" "teamIDretro"
```

What might not be immediately obvious is that columns 15 through 40 of this data frame contain numerical data about how each team performed in that season. To see this, you can execute the `str()`

command to see the **str**ucture of the data frame, but we suppress that output here. For data frames, a similar alternative that is a little cleaner is `glimpse()`

.

```
str(Teams)
glimpse(Teams)
```

Suppose you are interested in computing the averages of these 26 numeric columns.
You don’t want to have to type the names of each of them, or re-type the `mean()`

command 26 times.
Seasoned programmers would identify this as a situation in which a *loop* is a natural and efficient solution.
A `for()`

loop will iterate over the selected column indices.

```
<- NULL
averages for (i in 15:40) {
- 14] <- mean(Teams[, i], na.rm = TRUE)
averages[i
}names(averages) <- names(Teams)[15:40]
averages
```

```
R AB H X2B X3B HR BB SO
680.496 5126.260 1339.657 228.330 45.910 104.929 473.020 755.573
SB CS HBP SF RA ER ERA CG
109.786 46.873 45.411 44.233 680.495 572.404 3.836 48.011
SHO SV IPouts HA HRA BBA SOA E
9.584 24.267 4010.716 1339.434 104.929 473.212 755.050 181.732
DP FP
132.669 0.966
```

This certainly works. However, there are a number of problematic aspects of this code (e.g., the use of multiple *magic numbers* like 14, 15, and 40). The use of a `for()`

loop may not be ideal.
For problems of this type, it is almost always possible (and usually preferable) to iterate without explicitly defining a loop.
**R** programmers prefer to solve this type of problem by applying an operation to each element in a vector.
This often requires only one line of code, with no appeal to indices.

It is important to understand that the fundamental architecture of **R** is based on *vectors*. That is, in contrast to *general-purpose programming languages* like *C++* or *Python* that distinguish between single items—like strings and integers—and arrays of those items, in **R** a “string” is just a character vector of length 1. There is no special kind of atomic object. Thus, if you assign a single “string” to an object, **R** still stores it as a vector.

```
<- "a string"
a class(a)
```

`[1] "character"`

`is.vector(a)`

`[1] TRUE`

`length(a)`

`[1] 1`

As a consequence of this construction, **R** is highly optimized for vectorized operations (see Appendix B for more detailed information about **R** internals). Loops, by their nature, do not take advantage of this optimization. Thus, **R** provides several tools for performing loop-like operations without actually writing a loop. This can be a challenging conceptual hurdle for those who are used to more general-purpose programming languages.

Try to avoid writing `for()`

loops, even when it seems like the easiest solution.

Many functions in **R** are *vectorized*. This means that they will perform an operation on every element of a vector by default. For example, many mathematical functions (e.g., `exp()`

) work this way.

`exp(1:3)`

`[1] 2.72 7.39 20.09`

Note that vectorized functions like `exp()`

take a vector as an input, and return a vector *of the same length* as an output.

This is importantly different behavior than so-called *summary functions*, which take a vector as an input, and return *a single value*. Summary functions (e.g., `mean()`

) are commonly useful within a call to `summarize()`

. Note that when we call `mean()`

on a vector, it only returns a single value no matter how many elements there are in the input vector.

`mean(1:3)`

`[1] 2`

Other functions in **R** are not vectorized. They may assume an input that is a vector of length one, and fail or exhibit strange behavior if given a longer vector. For example, `if()`

throws a warning if given a vector of length more than one.

```
if (c(TRUE, FALSE)) {
cat("This is a great book!")
}
```

```
Warning in if (c(TRUE, FALSE)) {: the condition has length > 1 and only the
first element will be used
```

`This is a great book!`

As you get more comfortable with **R**, you will develop intuition about which functions are vectorized. If a function is vectorized, you should make use of that fact and not iterate over it. The code below shows that computing the exponential of the first 10,000 integers by appealing to `exp()`

as a vectorized function is much, much faster than using `map_dbl()`

to iterate over the same vector. The results are identical.

```
<- 1:1e5
x ::mark(
benchexp(x),
map_dbl(x, exp)
)
```

```
# A tibble: 2 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 exp(x) 900µs 1.59ms 652. 1.53MB 21.2
2 map_dbl(x, exp) 135ms 135.08ms 7.40 788.62KB 29.6
```

Try to always make use of vectorized functions before iterating an operation.

## 7.2 Using `across()`

with **dplyr** functions

The `mutate()`

and `summarize()`

verbs described in Chapter 4 can take advantage of an *adverb* called `across()`

that applies operations programmatically. In the example above, we had to observe that columns 15 through 40 of the `Teams`

data frame were numeric, and hard-code those observations as magic numbers. Rather than relying on these observations to identify which variables are numeric—and therefore which variables it makes sense to compute the average of—we can use the `across()`

function to do this work for us. The chunk below will compute the average values of only those variables in `Teams`

that are numeric.

```
%>%
Teams summarize(across(where(is.numeric), mean, na.rm = TRUE))
```

```
yearID Rank G Ghome W L R AB H X2B X3B HR BB SO SB
1 1958 4.05 150 78 74.5 74.5 680 5126 1340 228 45.9 105 473 756 110
CS HBP SF RA ER ERA CG SHO SV IPouts HA HRA BBA SOA E DP
1 46.9 45.4 44.2 680 572 3.84 48 9.58 24.3 4011 1339 105 473 755 182 133
FP attendance BPF PPF
1 0.966 1375102 100 100
```

Note that this result included several variables (e.g., `yearID`

, `attendance`

) that were outside of the range we defined previously.

The `across()`

adverb allows us to specify the set of variables that `summarize()`

includes in different ways. In the example above, we used the *predicate function* `is.numeric()`

to identify the variables for which we wanted to compute the mean.
In the following example, we compute the mean of `yearID`

, a series of variables from `R`

(runs scored) to `SF`

(*sacrifice flies*) that apply only to offensive players, and the batting *park factor* (`BPF`

). Since we are specifying these columns without the use of a predicate function, we don’t need to use `where()`

.

```
%>%
Teams summarize(across(c(yearID, R:SF, BPF), mean, na.rm = TRUE))
```

```
yearID R AB H X2B X3B HR BB SO SB CS HBP SF BPF
1 1958 680 5126 1340 228 45.9 105 473 756 110 46.9 45.4 44.2 100
```

The `across()`

function behaves analogously with `mutate()`

.
It provides an easy way to perform an operation on a set of variables without having to type or copy-and-paste the name of each variable.

## 7.3 The `map()`

family of functions

More generally, to apply a function to each item in a list or vector, or the columns of a data frame^{12}, use `map()`

(or one of its type-specific variants). This is the main function from the **purrr** package. In this example, we calculate the mean of each of the statistics defined above, all at once. Compare this to the `for()`

loop written above. Which is syntactically simpler? Which expresses the ideas behind the code more succinctly?

```
%>%
Teams select(15:40) %>%
map_dbl(mean, na.rm = TRUE)
```

```
R AB H X2B X3B HR BB SO
680.496 5126.260 1339.657 228.330 45.910 104.929 473.020 755.573
SB CS HBP SF RA ER ERA CG
109.786 46.873 45.411 44.233 680.495 572.404 3.836 48.011
SHO SV IPouts HA HRA BBA SOA E
9.584 24.267 4010.716 1339.434 104.929 473.212 755.050 181.732
DP FP
132.669 0.966
```

The first argument to `map_dbl()`

is the thing that you want to do something to (in this case, a data frame). The second argument specifies the name of a function (the argument is named `.f`

). Any further arguments are passed as options to `.f`

. Thus, this command applies the `mean()`

function to the 15th through the 40th columns of the `Teams`

data frame, while removing any `NA`

s that might be present in any of those columns. The use of the variant `map_dbl()`

simply forces the output to be a vector of type `double`

.^{13}

Of course, we began by taking the subset of the columns that were all `numeric`

values. If you tried to take the `mean()`

of a non-numeric vector, you would get a *warning* (and a value of `NA`

).

```
%>%
Teams select(teamID) %>%
map_dbl(mean, na.rm = TRUE)
```

```
Warning in mean.default(.x[[i]], ...): argument is not numeric or logical:
returning NA
```

```
teamID
NA
```

If you can solve your problem using `across()`

and/or `where()`

as described in Section 7.2, that is probably the cleanest solution. However, we will show that the `map()`

family of functions provides a much more general set of capabilities.

## 7.4 Iterating over a one-dimensional vector

### 7.4.1 Iterating a known function

Often you will want to apply a function to each element of a vector or list. For example, the baseball franchise now known as the *Los Angeles Angels of Anaheim* has gone by several names during its history.

```
<- Teams %>%
angels filter(franchID == "ANA") %>%
group_by(teamID, name) %>%
summarize(began = first(yearID), ended = last(yearID)) %>%
arrange(began)
angels
```

```
# A tibble: 4 × 4
# Groups: teamID [3]
teamID name began ended
<fct> <chr> <int> <int>
1 LAA Los Angeles Angels 1961 1964
2 CAL California Angels 1965 1996
3 ANA Anaheim Angels 1997 2004
4 LAA Los Angeles Angels of Anaheim 2005 2020
```

The franchise began as the *Los Angeles Angels* (`LAA`

) in 1961, then became the *California Angels* (`CAL`

) in 1965, the *Anaheim Angels* (`ANA`

) in 1997, before taking their current name (`LAA`

again) in 2005. This situation is complicated by the fact that the `teamID`

`LAA`

was re-used. This sort of schizophrenic behavior is unfortunately common in many data sets.

Now, suppose we want to find the length, in number of characters, of each of those team names. We could check each one manually using the function `nchar()`

:

```
<- angels %>%
angels_names pull(name)
nchar(angels_names[1])
```

`[1] 18`

`nchar(angels_names[2])`

`[1] 17`

`nchar(angels_names[3])`

`[1] 14`

`nchar(angels_names[4])`

`[1] 29`

But this would grow tiresome if we had many names. It would be simpler, more efficient, more elegant, and scalable to apply the function `nchar()`

to each element of the vector `angel_names`

. We can accomplish this using `map_int()`

. `map_int()`

is like `map()`

or `map_dbl()`

, but it always returns an `integer`

vector.

`map_int(angels_names, nchar)`

`[1] 18 17 14 29`

The key difference between `map_int()`

and `map()`

is that the former will always return an `integer`

vector, whereas the latter will always return a `list`

. Recall that the main difference between `list`

s and `data.frame`

s is that the elements (columns) of a `data.frame`

have to have the same length, whereas the elements of a list are arbitrary. So while `map()`

is more versatile, we usually find `map_int()`

or one of the other variants to be more convenient when appropriate.

It’s often helpful to use `map()`

to figure out what the return type will be, and then switch to the appropriate type-specific `map()`

variant.

This section was designed to illustrate how `map()`

can be used to iterate a function over a vector of values. However, the choice of the `nchar()`

function was a bit silly, because `nchar()`

is already vectorized. Thus, we can use it directly!

`nchar(angels_names)`

`[1] 18 17 14 29`

### 7.4.2 Iterating an arbitrary function

One of the most powerful uses of iteration is that you can apply *any* function, including a function that you have defined (see Appendix C for a discussion of how to write user-defined functions). For example, suppose we want to display the top-5 seasons in terms of wins for each of the Angels teams.

```
<- function(data, team_name) {
top5 %>%
data filter(name == team_name) %>%
select(teamID, yearID, W, L, name) %>%
arrange(desc(W)) %>%
head(n = 5)
}
```

We can now do this for each element of our vector with a single call to `map()`

.
Note how we named the `data`

argument to ensure that the `team_name`

argument was the one that accepted the value over which we iterated.

```
%>%
angels_names map(top5, data = Teams)
```

```
[[1]]
teamID yearID W L name
1 LAA 1962 86 76 Los Angeles Angels
2 LAA 1964 82 80 Los Angeles Angels
3 LAA 1961 70 91 Los Angeles Angels
4 LAA 1963 70 91 Los Angeles Angels
[[2]]
teamID yearID W L name
1 CAL 1982 93 69 California Angels
2 CAL 1986 92 70 California Angels
3 CAL 1989 91 71 California Angels
4 CAL 1985 90 72 California Angels
5 CAL 1979 88 74 California Angels
[[3]]
teamID yearID W L name
1 ANA 2002 99 63 Anaheim Angels
2 ANA 2004 92 70 Anaheim Angels
3 ANA 1998 85 77 Anaheim Angels
4 ANA 1997 84 78 Anaheim Angels
5 ANA 2000 82 80 Anaheim Angels
[[4]]
teamID yearID W L name
1 LAA 2008 100 62 Los Angeles Angels of Anaheim
2 LAA 2014 98 64 Los Angeles Angels of Anaheim
3 LAA 2009 97 65 Los Angeles Angels of Anaheim
4 LAA 2005 95 67 Los Angeles Angels of Anaheim
5 LAA 2007 94 68 Los Angeles Angels of Anaheim
```

Alternatively, we can collect the results into a single data frame by using the `map_dfr()`

function, which combines the data frames by row. Below, we do this and then compute the average number of wins in a top-5 season for each Angels team name. Based on these data, the Los Angeles Angels of Anaheim has been the most successful incarnation of the franchise, when judged by average performance in the best five seasons.

```
%>%
angels_names map_dfr(top5, data = Teams) %>%
group_by(teamID, name) %>%
summarize(N = n(), mean_wins = mean(W)) %>%
arrange(desc(mean_wins))
```

```
# A tibble: 4 × 4
# Groups: teamID [3]
teamID name N mean_wins
<fct> <chr> <int> <dbl>
1 LAA Los Angeles Angels of Anaheim 5 96.8
2 CAL California Angels 5 90.8
3 ANA Anaheim Angels 5 88.4
4 LAA Los Angeles Angels 4 77
```

Once you’ve read Chapter 15, think about how you might do this operation in SQL. It is not that easy!

## 7.5 Iteration over subgroups

In Chapter 4, we introduced data *verbs* that could be chained to perform very powerful data wrangling operations. These functions—which come from the **dplyr** package—operate on data frames and return data frames. The `group_modify()`

function in **purrr** allows you to apply an arbitrary function that returns a data frame to the *groups* of a data frame. That is, you will first define a grouping using the `group_by()`

function, and then apply a function to each of those groups. Note that this is similar to `map_dfr()`

, in that you are mapping a function that returns a data frame over a collection of values, and returning a data frame. But whereas the values used in `map_dfr()`

are individual elements of a vector, in `group_modify()`

they are groups defined on a data frame.

### 7.5.1 Example: Expected winning percentage

As noted in Section 4.2, one of the more enduring models in *sabermetrics* is Bill James’s formula for estimating a team’s expected *winning percentage*, given knowledge only of the team’s runs scored and runs allowed to date (recall that the team that scores the most
runs wins a given game). This statistic is known—unfortunately—as Pythagorean Winning Percentage, even though it has nothing to do with Pythagoras. The formula is simple, but non-linear:

\[
\widehat{WPct} = \frac{RS^2}{RS^2 + RA^2} = \frac{1}{1 + (RA/RS)^2} \,,
\]
where \(RS\) and \(RA\) are the number of runs the team has scored and allowed, respectively. If we define \(x = RS/RA\) to be the team’s *run ratio*, then this is a function of one variable having the form \(f(x) = \frac{1}{1 + (1/x)^2}\).

This model seems to fit quite well upon visual inspection—in Figure 7.1 we show the data since 1954, along with a line representing the model. Indeed, this model has also been successful in other sports, albeit with wholly different exponents.

```
<- function(x) {
exp_wpct return(1/(1 + (1/x)^2))
}
<- Teams %>%
TeamRuns filter(yearID >= 1954) %>%
rename(RS = R) %>%
mutate(WPct = W / (W + L), run_ratio = RS/RA) %>%
select(yearID, teamID, lgID, WPct, run_ratio)
ggplot(data = TeamRuns, aes(x = run_ratio, y = WPct)) +
geom_vline(xintercept = 1, color = "darkgray", linetype = 2) +
geom_hline(yintercept = 0.5, color = "darkgray", linetype = 2) +
geom_point(alpha = 0.2) +
stat_function(fun = exp_wpct, size = 2, color = "blue") +
xlab("Ratio of Runs Scored to Runs Allowed") +
ylab("Winning Percentage")
```

However, the exponent of 2 was posited by James. One can imagine having the exponent become a parameter \(k\), and trying to find the optimal fit. Indeed, researchers have found that in baseball, the optimal value of \(k\) is not 2, but something closer to 1.85 (V. Wang 2006). It is easy enough for us to find the optimal value using the `nls()`

function. We specify the formula of the nonlinear model, the data used to fit the model, and a starting value for the search.

```
%>%
TeamRuns nls(
formula = WPct ~ 1/(1 + (1/run_ratio)^k),
start = list(k = 2)
%>%
) coef()
```

```
k
1.84
```

Furthermore, researchers investigating this model have found that the optimal value of the exponent varies based on the era during which the model is fit.
We can use the `group_modify()`

function to do this for all decades in baseball history.
First, we must write a short function (see Appendix C) that will return a data frame containing the optimal exponent, and for good measure, the number of observations during that decade.

```
<- function(x) {
fit_k <- nls(
mod formula = WPct ~ 1/(1 + (1/run_ratio)^k),
data = x,
start = list(k = 2)
)return(tibble(k = coef(mod), n = nrow(x)))
}
```

Note that this function will return the optimal value of the exponent over any time period.

`fit_k(TeamRuns)`

```
# A tibble: 1 × 2
k n
<dbl> <int>
1 1.84 1708
```

Finally, we compute the decade for each year using `mutate()`

, define the group using `group_by()`

, and apply `fit_k()`

to those decades. The use of the `~`

tells **R** to interpret the expression in parentheses as a `formula`

, rather than the name of a function. The `.x`

is a placeholder for the data frame for a particular decade.

```
%>%
TeamRuns mutate(decade = yearID %/% 10 * 10) %>%
group_by(decade) %>%
group_modify(~fit_k(.x))
```

```
# A tibble: 8 × 3
# Groups: decade [8]
decade k n
<dbl> <dbl> <int>
1 1950 1.69 96
2 1960 1.90 198
3 1970 1.74 246
4 1980 1.93 260
5 1990 1.88 278
6 2000 1.94 300
7 2010 1.77 300
8 2020 1.86 30
```

Note the variation in the optimal value of \(k\). Even though the exponent is not the same in each decade, it varies within a fairly narrow range between 1.69 and 1.94.

### 7.5.2 Example: Annual leaders

As a second example, consider the problem of identifying the team in each season that led their league in home runs. We can easily write a function that will, for a specific year and league, return a data frame with one row that contains the team with the most home runs.

```
<- function(x) {
hr_leader # x is a subset of Teams for a single year and league
%>%
x select(teamID, HR) %>%
arrange(desc(HR)) %>%
head(1)
}
```

We can verify that in 1961, the *New York Yankees* led the *American League* in home runs.

```
%>%
Teams filter(yearID == 1961 & lgID == "AL") %>%
hr_leader()
```

```
teamID HR
1 NYA 240
```

We can use `group_modify()`

to quickly find all the teams that led their league in home runs. Here, we employ the `.keep`

argument so that the grouping variables appear in the computation.

```
<- Teams %>%
hr_leaders group_by(yearID, lgID) %>%
group_modify(~hr_leader(.x), .keep = TRUE)
tail(hr_leaders, 4)
```

```
# A tibble: 4 × 4
# Groups: yearID, lgID [4]
yearID lgID teamID HR
<int> <fct> <fct> <int>
1 2019 AL MIN 307
2 2019 NL LAN 279
3 2020 AL CHA 96
4 2020 NL LAN 118
```

In this manner, we can compute the average number of home runs hit in a season by the team that hit the most.

```
%>%
hr_leaders group_by(lgID) %>%
summarize(mean_hr = mean(HR))
```

```
# A tibble: 7 × 2
lgID mean_hr
<fct> <dbl>
1 AA 40.5
2 AL 157.
3 FL 51
4 NA 13.8
5 NL 129.
6 PL 66
7 UA 32
```

We restrict our attention to the years since 1916, during which only the AL and NL leagues have existed.

```
%>%
hr_leaders filter(yearID >= 1916) %>%
group_by(lgID) %>%
summarize(mean_hr = mean(HR))
```

```
# A tibble: 2 × 2
lgID mean_hr
<fct> <dbl>
1 AL 174.
2 NL 161.
```

In Figure 7.2, we show how this number has changed over time. We note that while the top HR hitting teams were comparable across the two leagues until the mid-1970s, the AL teams have dominated since their league adopted the *designated hitter* rule in 1973.

```
%>%
hr_leaders filter(yearID >= 1916) %>%
ggplot(aes(x = yearID, y = HR, color = lgID)) +
geom_line() +
geom_point() +
geom_smooth(se = FALSE) +
geom_vline(xintercept = 1973) +
annotate(
"text", x = 1974, y = 25,
label = "AL adopts DH", hjust = "left"
+
) labs(x = "Year", y = "Home runs", color = "League")
```

## 7.6 Simulation

In the previous section, we learned how to repeat operations while iterating over the elements of a vector. It can also be useful to simply repeat an operation many times and collect the results. Obviously, if the result of the operation is *deterministic* (i.e., you get the same answer every time) then this is pointless. On the other hand, if this operation involves randomness, then you won’t get the same answer every time, and understanding the distribution of values that your random operation produces can be useful.
We will flesh out these ideas further in Chapter 13.

For example, in our investigation into the expected winning percentage in baseball (Section 7.5.1), we determined that the optimal exponent fit to the 66 seasons worth of data from 1954 to 2019 was 1.84. However, we also found that if we fit this same model separately for each decade, that optimal exponent varies from 1.69 to 1.94. This gives us a rough sense of the variability in this exponent—we observed values between 1.6 and 2, which may give some insights as to plausible values for the exponent.

Nevertheless, our choice to stratify by decade was somewhat arbitrary. A more natural question might be: What is the distribution of optimal exponents fit to a *single-season*’s worth of data? How confident should we be in that estimate of 1.84?

We can use `group_modify()`

and the function we wrote previously to compute the 66 actual values. The resulting distribution is summarized in Figure 7.3.

```
<- TeamRuns %>%
k_actual group_by(yearID) %>%
group_modify(~fit_k(.x))
%>%
k_actual ungroup() %>%
skim(k)
```

```
── Variable type: numeric ──────────────────────────────────────────────────
var n na mean sd p0 p25 p50 p75 p100
1 k 67 0 1.85 0.186 1.31 1.69 1.89 1.96 2.33
```

```
ggplot(data = k_actual, aes(x = k)) +
geom_density() +
xlab("Best fit exponent for a single season")
```

Since we only have 67 samples, we might obtain a better understanding of the sampling distribution of the mean \(k\) by *resampling*—sampling with replacement—from these 67 values. (This is a statistical technique known as the *bootstrap*, which we describe further in Chapter 9.) A simple way to do this is by mapping a sampling expression over an index of values. That is, we define `n`

to be the number of iterations we want to perform, write an expression to compute the mean of a single resample, and then use `map_dbl()`

to perform the iterations.

```
<- 10000
n
<- 1:n %>%
bstrap map_dbl(
~k_actual %>%
pull(k) %>%
sample(replace = TRUE) %>%
mean()
)
<- bstrap %>%
civals quantile(probs = c(0.025, .975))
civals
```

```
2.5% 97.5%
1.80 1.89
```

After repeating the resampling 10,000 times, we found that 95% of the resampled exponents were between 1.8 and 1.889, with our original estimate of 1.84 lying somewhere near the center of that distribution. This distribution, along the boundaries of the middle 95%, is depicted in Figure 7.4.

```
ggplot(data = enframe(bstrap, value = "k"), aes(x = k)) +
geom_density() +
xlab("Distribution of resampled means") +
geom_vline(
data = enframe(civals), aes(xintercept = value),
color = "red", linetype = 3
)
```

## 7.7 Extended example: Factors associated with BMI

*Body Mass Index* (BMI) is a common measure of a person’s size, expressed as a ratio of their body’s mass to the square of their height. What factors are associated with high BMI?

For answers, we turn to survey data collected by the National Center for Health Statistics (NCHS) and packaged as the *National Health and Nutrition Examination Survey* (NHANES). These data available in **R** through the **NHANES** package.

`library(NHANES)`

An exhaustive approach to understanding the relationship between BMI and some of the other variables is complicated by the fact that there are 75 potential explanatory varibles for any model for BMI. In Chapter 11, we develop several modeling techniques that might be useful for this purpose, but here, we focus on examining the *bivariate* relationships between BMI and the other explanatory variables.
For example, we might start by simply producing a bivariate scatterplot between BMI and age, and adding a *local regression* line to show the general trend. Figure 7.5 shows the result.

```
ggplot(NHANES, aes(x = Age, y = BMI)) +
geom_point() +
geom_smooth()
```

How can we programmatically produce an analogous image for *all* of the variables in **NHANES**?
First, we’ll write a function that takes the name of a variable as an input, and returns the plot.
Second, we’ll define a set of variables, and use `map()`

to iterate our function over that list.

The following function will take a data set, and an argument called `x_var`

that will be the name of a variable.
It produces a slightly jazzed-up version of Figure 7.5 that contains variable-specific titles, as well as information about the source.

```
<- function(.data, x_var) {
bmi_plot ggplot(.data, aes(y = BMI)) +
aes_string(x = x_var) +
geom_jitter(alpha = 0.3) +
geom_smooth() +
labs(
title = paste("BMI by", x_var),
subtitle = "NHANES",
caption = "US National Center for Health Statistics (NCHS)"
) }
```

The use of the `aes_string()`

function is necessary for **ggplot2** to understand that we want to bind the `x`

aesthetic to the variable whose name is stored in the `x_var`

object, and not a variable that is named `x_var`

!^{14}

We can then call our function on a specific variable.

`bmi_plot(NHANES, "Age")`

Or, we can specify a set of variables and then `map()`

over that set. Since `map()`

always returns a list, and a list of plots is not that useful, we use the `wrap_plots()`

function from the **patchwork** package to combine the resulting list of plots into one image.

```
c("Age", "HHIncomeMid", "PhysActiveDays",
"TVHrsDay", "AlcoholDay", "Pulse") %>%
map(bmi_plot, .data = NHANES) %>%
::wrap_plots(ncol = 2) patchwork
```

Figure 7.6 displays the results for six variables.
We won’t show the results of our ultimate goal to produce all 75 plots here, but you can try it for yourself by using the `names()`

function to retrieve the full list of variable names.
Or, you could use `across()`

to retrieve only those variables that meet a certain condition.

## 7.8 Further resources

The chapter on *functionals* in H. Wickham (2019) is the definitive source for understanding **purrr**. The name “functionals” reflects the use of a programming paradigm called *functional programming*.

For those who are already familiar with the `*apply()`

family of functions popular in base R, Jenny Bryan wrote a helpful tutorial that maps these functions to their **purrr** equivalents.

The **rlang** package lays the groundwork for *tidy evaluation*, which allows you to work programmatically with unquoted variable names. The programming with `dplyr`

vignette is the best place to start learning about tidy evaluation.
Section C.4 provides a brief introduction to the principles.

## 7.9 Exercises

**Problem 1 (Easy)**: Use the `HELPrct`

data from the `mosaicData`

to calculate the mean of all numeric variables (be sure to exclude missing values).

**Problem 2 (Easy)**: Suppose you want to visit airports in Boston (`BOS`

), New York (`JFK`

, `LGA`

), San Francisco (`SFO`

), Chicago (`ORD`

, `MDW`

), and Los Angeles (`LAX`

). You have data about flight delays in a `tibble`

called `flights`

. You have written a pipeline that, for any given airport code (e.g., `LGA`

), will return a `tibble`

with two columns, the airport code, and the average arrival delay time.

Suggest a workflow that would be most efficient for computing the average arrival delay time for all seven airports.

**Problem 3 (Medium)**: Use the `purrr::map()`

function and the `HELPrct`

data frame from the `mosaicData`

package to fit a regression model predicting `cesd`

as a function of `age`

separately for each of the levels of the `substance`

variable. Generate a table of results (estimates and confidence intervals) for the slope parameter for each level of the grouping variable.

**Problem 4 (Medium)**: The team IDs corresponding to Brooklyn baseball teams from the `Teams`

data frame from the `Lahman`

package are listed below. Use `map_int()`

to find the number of seasons in which each of those teams played by calling a function called `count_seasons`

.

```
library(Lahman)
<- c("BR1", "BR2", "BR3", "BR4", "BRO", "BRP", "BRF") bk_teams
```

**Problem 5 (Medium)**: Use data from the `NHANES`

package to create a set of scatterplots of `Pulse`

as a function of `Age`

, `BMI`

, `TVHrsDay`

, and `BPSysAve`

to create a figure like the last one in the chapter.
Be sure to create appropriate annotations (source, survey name, variables being displayed).
What do you conclude?

**Problem 6 (Hard)**: Use the `group_modify()`

function and the `Lahman`

data to replicate one of the baseball records plots (http://tinyurl.com/nytimes-records) from the *The New York Times*.

## 7.10 Supplementary exercises

Available at https://mdsr-book.github.io/mdsr2e/ch-iteration.html#iteration-online-exercises

No exercises found

This works because a data frame is stored as a

`list`

of vectors of the same length. When you supply a data frame as the first argument to`map()`

, you are giving`map()`

a list of vectors, and it iterates a function over that list.↩︎The plain function

`map()`

will always return a list.↩︎Note that we use the quoted variable names here. To get this to work with unquoted variable names, see the Programming with

`dplyr`

vignette.↩︎