These exercises are taken from the tidy data and iteration chapter from **Modern Data Science with R**: http://mdsr-book.github.io. Other materials relevant for instructors (sample activities, overview video) for this chapter can be found there.

Consider the number of home runs hit (`HR`) and home runs allowed (`HRA`) for the Chicago Cubs (CHN) baseball team. Reshape the `Teams` data from the `Lahman` package into *long* format and plot a time series conditioned on whether the HRs that involved the Cubs were hit by them or allowed by them.

SOLUTION:

```
library(mdsr)
library(Lahman)
# solution goes here
```

Write a function called `count_seasons` that, when given a `teamID`, will count the number of seasons the team played in the `Teams` data frame from the `Lahman` package.

SOLUTION:

```
library(mdsr)
library(Lahman)
# solution goes here
```

The team IDs corresponding to Brooklyn baseball teams from the `Teams` data frame from the `Lahman` package are listed below. Use `sapply()` to find the number of seasons in which each of those teams played.

SOLUTION:

```
library(mdsr)
library(Lahman)
bk_teams <- c("BR1", "BR2", "BR3", "BR4", "BRO", "BRP", "BRF")
# solution goes here
```
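The `sapply()` pattern itself can be tried out without the `Lahman` data. The following is only a minimal sketch on a made-up data frame: `toy_teams`, `count_seasons_toy()`, and the team IDs in `ids` are hypothetical stand-ins, not part of `Lahman` or `mdsr`.

```r
# Toy stand-in for Lahman::Teams (hypothetical rows, not real records)
toy_teams <- data.frame(
  teamID = c("AAA", "AAA", "AAA", "BBB", "BBB", "CCC"),
  yearID = c(1901, 1902, 1903, 1901, 1902, 1950)
)

# Hypothetical helper: number of distinct seasons recorded for one team ID
count_seasons_toy <- function(team, data = toy_teams) {
  length(unique(data$yearID[data$teamID == team]))
}

# sapply() calls the helper once per ID and returns a vector named by the IDs;
# an ID with no rows ("ZZZ") simply yields a count of 0
ids <- c("AAA", "BBB", "CCC", "ZZZ")
season_counts <- sapply(ids, count_seasons_toy)
season_counts
```

Because `sapply()` keeps the input strings as names, the result can be read directly as a lookup from team ID to season count.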

In the `Marriage` data set included in `mosaicData`, the `appdate`, `ceremonydate`, and `dob` variables are encoded as factors, even though they are dates. Use the `lubridate` package to convert those three columns into a date format.

SOLUTION:

```
library(mdsr)
Marriage %>%
  select(appdate, ceremonydate, dob) %>%
  glimpse()
```

```
## Observations: 98
## Variables: 3
## $ appdate <fct> 10/29/96, 11/12/96, 11/19/96, 12/2/96, 12/9/96, 1...
## $ ceremonydate <fct> 11/9/96, 11/12/96, 11/27/96, 12/7/96, 12/14/96, 1...
## $ dob <fct> 4/11/64, 8/6/64, 2/20/62, 5/20/56, 12/14/66, 2/21...
```

`# solution goes here`
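Before reaching for `lubridate`, it can help to see what the conversion involves. The sketch below uses base R's `as.Date()` as a dependency-free stand-in rather than the `lubridate` functions the exercise asks for, and the vector `raw` is a made-up sample in the same month/day/two-digit-year layout as the output above.

```r
# Hypothetical sample in the same layout as the Marriage date columns
raw <- factor(c("10/29/96", "11/12/96", "4/11/64"))

# Convert factor -> character -> Date; converting the factor directly
# would risk working with its integer codes instead of its labels
parsed <- as.Date(as.character(raw), format = "%m/%d/%y")
class(parsed)
format(parsed[1:2])

# Two-digit years are ambiguous: base R documents that %y reads 00-68 as 20xx,
# so a birth year written as "64" may not come back as 1964. Check the result.
format(parsed[3], "%Y")
```

Whichever parser is used, inspecting the range of the converted `dob` values is a quick way to catch a wrongly inferred century.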

Consider the values returned by the `as.numeric()` and `readr::parse_number()` functions when applied to the following vectors. Describe the results and their implication.

`x1 <- c("1900.45", "$1900.45", "1,900.45", "nearly $2000")`

`x2 <- as.factor(x1)`
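As a starting point for the comparison, the base R half can be run on its own. This is a minimal sketch covering only `as.numeric()`; `readr::parse_number()` is left out so that the block needs no extra packages, and the names `from_chr`, `from_fct_codes`, and `via_chr` are introduced here for illustration.

```r
x1 <- c("1900.45", "$1900.45", "1,900.45", "nearly $2000")
x2 <- as.factor(x1)

# On character input, anything that is not a plain number becomes NA (with a warning)
from_chr <- suppressWarnings(as.numeric(x1))
from_chr

# On factor input, as.numeric() returns the internal level codes, not the labels
from_fct_codes <- as.numeric(x2)
from_fct_codes

# Going through as.character() first recovers the character behaviour
via_chr <- suppressWarnings(as.numeric(as.character(x2)))
via_chr
```

The order of the level codes depends on how the labels sort in the current locale, so the exact codes are not shown here.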

An analyst wants to calculate the pairwise differences between the Treatment and Control values for a small data set from a crossover trial (all subjects received both treatments) that consists of the following observations.

```
tab <- xtable(ds1)
print(tab, type="html", floating=FALSE)
```

|   | id | group | vals |
|---|---|---|---|
| 1 | 1 | T | 4.00 |
| 2 | 2 | T | 6.00 |
| 3 | 3 | T | 8.00 |
| 4 | 1 | C | 5.00 |
| 5 | 2 | C | 6.00 |
| 6 | 3 | C | 10.00 |

They use the following code to create the new `diff` variable.

```
Treat <- filter(ds1, group=="T")
Control <- filter(ds1, group=="C")
all <- mutate(Treat, diff = Treat$vals - Control$vals)
all
```

Verify that this code works for this example and generates the correct values of -1, 0, and -2. Describe two problems that might arise if the data set is not sorted in a particular order or if one of the observations is missing for one of the subjects. Provide an alternative approach to generate this variable that is more robust (hint: use `tidyr::spread()`).

SOLUTION:

`# solution goes here`
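The key idea behind a more robust approach is to pair values by `id` rather than by row position. The sketch below illustrates that idea with base R's `merge()` instead of the hinted `tidyr::spread()`, so it is a swapped-in technique rather than the intended solution; `ds1` is rebuilt from the table above, and the row shuffle is an assumption added to show order independence.

```r
# Rebuild ds1 from the table above, then shuffle the rows on purpose
ds1 <- data.frame(
  id    = c(1, 2, 3, 1, 2, 3),
  group = c("T", "T", "T", "C", "C", "C"),
  vals  = c(4, 6, 8, 5, 6, 10)
)
shuffled <- ds1[c(6, 2, 4, 1, 5, 3), ]

treat   <- shuffled[shuffled$group == "T", c("id", "vals")]
control <- shuffled[shuffled$group == "C", c("id", "vals")]

# Join on id so each subject's two values are matched explicitly;
# all = TRUE keeps subjects with a missing arm, whose diff then becomes NA
paired <- merge(treat, control, by = "id", suffixes = c("_T", "_C"), all = TRUE)
paired$diff <- paired$vals_T - paired$vals_C
paired
```

Matching on a key means a missing observation shows up as an `NA` difference instead of silently shifting or recycling values.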

Generate the code to convert the following data frame to wide format.

|   | grp | sex | meanL | sdL | meanR | sdR |
|---|---|---|---|---|---|---|
| 1 | A | F | 0.22 | 0.11 | 0.34 | 0.08 |
| 2 | A | M | 0.47 | 0.33 | 0.57 | 0.33 |
| 3 | B | F | 0.33 | 0.11 | 0.40 | 0.07 |
| 4 | B | M | 0.55 | 0.31 | 0.65 | 0.27 |

The result should have the following form.

|   | grp | F.meanL | F.meanR | F.sdL | F.sdR | M.meanL | M.meanR | M.sdL | M.sdR |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 0.22 | 0.34 | 0.11 | 0.08 | 0.47 | 0.57 | 0.33 | 0.33 |
| 2 | B | 0.33 | 0.40 | 0.11 | 0.07 | 0.55 | 0.65 | 0.31 | 0.27 |

Hint: use `gather()` in conjunction with `spread()` and `unite()`.

Use the `dplyr::do()` function and the `HELPrct` data frame from the `mosaicData` package to fit a regression model predicting `cesd` as a function of `age` separately for each of the levels of the `substance` variable. Generate a table of results (estimates and confidence intervals) for each level of the grouping variable.

SOLUTION:

```
library(mdsr)
# solution goes here
```
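The split-fit-combine idea behind this exercise can be previewed with base R alone. The following is only a sketch of the pattern, not the `dplyr::do()` solution: it uses the built-in `mtcars` data as a stand-in for `HELPrct` (with `mpg`, `wt`, and `cyl` playing the roles of `cesd`, `age`, and `substance`), and `fit_one()` is a hypothetical helper.

```r
# Hypothetical helper: fit one model and return a one-row summary
fit_one <- function(d) {
  mod <- lm(mpg ~ wt, data = d)
  ci <- confint(mod)["wt", ]          # default 95% interval for the slope
  data.frame(
    estimate = unname(coef(mod)["wt"]),
    lower    = unname(ci[1]),
    upper    = unname(ci[2])
  )
}

# Split by the grouping variable, fit each piece, and stack the summaries
by_cyl  <- split(mtcars, mtcars$cyl)
results <- do.call(rbind, lapply(by_cyl, fit_one))
results$cyl <- names(by_cyl)
results
```

With `dplyr`, the same shape of computation is expressed by grouping first and then returning a small data frame per group.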

Use the `dplyr::do()` function and the `Lahman` data to replicate one of these baseball records plots (http://tinyurl.com/nytimes-records) from *The New York Times*.

SOLUTION:

```
library(mdsr)
library(Lahman)
# solution goes here
```

Use the `fec` package to download the Federal Election Commission data for 2012. Re-create Figures 2.1 and 2.2 using `ggplot2`.

SOLUTION:

`# solution goes here, after downloading the fec package.`

Using the same FEC data as the previous exercise, re-create Figure 2.8.

SOLUTION:

`# solution goes here`

Using the approach described in Section 5.5.4, find another table in Wikipedia that can be scraped and visualized. Be sure to interpret your graphical display.

SOLUTION:

```
# earlier example (please delete this)
library(rvest)
library(methods)
url <- "http://en.wikipedia.org/wiki/Mile_run_world_record_progression"
tables <- url %>%
  read_html() %>%
  html_nodes("table")
length(tables)
```

`## [1] 7`

```
Table3 <- html_table(tables[[3]])
head(Table3)
```

```
## Time Athlete Nationality Date Venue
## 1 4:52 Cadet Marshall United Kingdom 2 September 1852 Addiscome
## 2 4:45 Thomas Finch United Kingdom 3 November 1858 Oxford
## 3 4:45 St. Vincent Hammick United Kingdom 15 November 1858 Oxford
## 4 4:40 Gerald Surman United Kingdom 24 November 1859 Oxford
## 5 4:33 George Farran United Kingdom 23 May 1862 Dublin
```

Replicate the wrangling to create the `house_elections` table in the `fec` package from the original Excel source file.

SOLUTION:

```
library(mdsr)
# solution goes here
```

Replicate the functionality of the `make_babynames_dist()` function from the `mdsr` package to wrangle the original tables from the `babynames` package.

SOLUTION:

```
library(mdsr)
# solution goes here
```