Tidy data and iteration exercises

Introduction

These exercises are taken from the tidy data and iteration chapter of Modern Data Science with R (http://mdsr-book.github.io). Other materials for instructors for this chapter, such as sample activities and an overview video, can be found there.

Home runs

Consider the number of home runs hit (HR) and home runs allowed (HRA) for the Chicago Cubs (CHN) baseball team. Reshape the Teams data from the Lahman package into long format and plot a time series conditioned on whether the HRs that involved the Cubs were hit by them or allowed by them.

SOLUTION:

library(mdsr)
library(Lahman)
# solution goes here
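
One possible approach, sketched under the assumption that the Lahman, dplyr, tidyr, and ggplot2 packages are installed: keep the Cubs rows, gather the HR and HRA columns into key/value pairs, and draw one line per type. The names cubs_hr, type, count, and hr_plot are choices made for this sketch, not part of the exercise.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(Lahman)

# Long format: one row per season and home-run type (HR = hit, HRA = allowed)
cubs_hr <- Teams %>%
  filter(teamID == "CHN") %>%
  select(yearID, HR, HRA) %>%
  gather(key = "type", value = "count", HR, HRA)

# One time series per type, distinguished by color
hr_plot <- ggplot(cubs_hr, aes(x = yearID, y = count, color = type)) +
  geom_line() +
  labs(x = "Season", y = "Home runs", color = "Type")
hr_plot
```

Faceting by type (facet_wrap(~ type)) instead of coloring would be an equally reasonable way to condition the series.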

Seasons

Write a function called count_seasons that, when given a teamID, will count the number of seasons the team played in the Teams data frame from the Lahman package.

SOLUTION:

library(mdsr)   
library(Lahman)
# solution goes here

We’ll always have Brooklyn

The team IDs corresponding to Brooklyn baseball teams from the Teams data frame from the Lahman package are listed below. Use sapply() to find the number of seasons in which each of those teams played.

SOLUTION:

library(mdsr)   
library(Lahman)
bk_teams <- c("BR1", "BR2", "BR3", "BR4", "BRO", "BRP", "BRF")
# solution goes here
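
A minimal sketch of how a count_seasons function and sapply() could fit together. The data argument is an addition made for this sketch so that the function can be tried on a small made-up table; it defaults to Lahman::Teams, which is only evaluated (and so only needs Lahman to be installed) when no table is supplied. The toy_teams rows and the IDs "AAA", "BBB", and "CCC" are invented for illustration.

```r
library(dplyr)

# Count the distinct seasons for one team ID.
# `data` is a hypothetical extra argument; its default is the Lahman Teams table.
count_seasons <- function(team_id, data = Lahman::Teams) {
  data %>%
    filter(teamID == team_id) %>%
    summarize(n = n_distinct(yearID)) %>%
    pull(n)
}

# Made-up rows that mimic the teamID and yearID columns of Teams
toy_teams <- data.frame(
  teamID = c("AAA", "AAA", "AAA", "BBB"),
  yearID = c(1901, 1902, 1903, 1901)
)

# sapply() calls the function once per ID and returns a named vector;
# an ID with no rows ("CCC") yields a count of 0
toy_counts <- sapply(c("AAA", "BBB", "CCC"), count_seasons, data = toy_teams)
toy_counts

# With Lahman installed, the exercise itself would then be:
# sapply(bk_teams, count_seasons)
```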

Marriage

In the Marriage data set included in mosaicData, the appdate, ceremonydate, and dob variables are encoded as factors, even though they are dates. Use the lubridate package to convert those three columns into a date format.

SOLUTION:

library(mdsr)   
Marriage %>%
  select(appdate, ceremonydate, dob) %>%
  glimpse()

## Observations: 98
## Variables: 3
## $ appdate      <fct> 10/29/96, 11/12/96, 11/19/96, 12/2/96, 12/9/96, 1...
## $ ceremonydate <fct> 11/9/96, 11/12/96, 11/27/96, 12/7/96, 12/14/96, 1...
## $ dob          <fct> 4/11/64, 8/6/64, 2/20/62, 5/20/56, 12/14/66, 2/21...

# solution goes here
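
A hedged sketch of the conversion, run on a small data frame rebuilt from the first three values of each column in the glimpse() output above (marriage_toy is a stand-in, not the full Marriage data). The factors are converted to character before parsing with lubridate::mdy(). Because the years have only two digits, a birth year such as 64 may be resolved to 2064 rather than 1964; the last step is one assumed way to correct that, by shifting any date of birth that falls after the application date back by a century.

```r
library(dplyr)
library(lubridate)

# First rows of the glimpse() output, stored as factors like the original columns
marriage_toy <- data.frame(
  appdate      = factor(c("10/29/96", "11/12/96", "11/19/96")),
  ceremonydate = factor(c("11/9/96", "11/12/96", "11/27/96")),
  dob          = factor(c("4/11/64", "8/6/64", "2/20/62"))
)

marriage_dates <- marriage_toy %>%
  # Parse month/day/year strings into Date objects
  mutate(across(c(appdate, ceremonydate, dob), ~ mdy(as.character(.x)))) %>%
  # A birth date after the application date signals a wrong century
  mutate(dob = if_else(dob > appdate, dob - years(100), dob))

glimpse(marriage_dates)
```

The same two mutate() steps should apply unchanged to the Marriage data frame itself.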

Coercion

Consider the values returned by the as.numeric() and readr::parse_number() functions when applied to the following vectors. Describe the results and their implication.

x1 <- c("1900.45", "$1900.45", "1,900.45", "nearly $2000")
x2 <- as.factor(x1)
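
The comparison can be run directly, as in the sketch below, which assumes the readr package is installed. The result names (num_chr, num_fct, parsed_chr, parsed_fct) are chosen for this sketch. The factor is converted to character before parse_number() is called, since that function is designed for character input.

```r
library(readr)

x1 <- c("1900.45", "$1900.45", "1,900.45", "nearly $2000")
x2 <- as.factor(x1)

# as.numeric() on character input: anything that is not a plain number becomes NA
num_chr <- suppressWarnings(as.numeric(x1))
num_chr

# as.numeric() on a factor: returns the internal level codes, not the labels
num_fct <- as.numeric(x2)
num_fct

# parse_number() extracts the first number, dropping symbols and grouping commas
parsed_chr <- parse_number(x1)
parsed_chr

# For a factor, parse the labels rather than the codes
parsed_fct <- parse_number(as.character(x2))
parsed_fct
```

One implication is that as.numeric() applied to a factor fails silently: it returns plausible-looking integers that have nothing to do with the original values.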

Brittle code

An analyst wants to calculate the pairwise differences between the Treatment and Control values for a small data set from a crossover trial (all subjects received both treatments) that consists of the following observations.

tab <- xtable(ds1)
print(tab, type="html", floating=FALSE)

	id	group	vals
1	1	T	4.00
2	2	T	6.00
3	3	T	8.00
4	1	C	5.00
5	2	C	6.00
6	3	C	10.00

They use the following code to create the new diff variable.

Treat <- filter(ds1, group=="T")
Control <- filter(ds1, group=="C")
all <- mutate(Treat, diff = Treat$vals - Control$vals)
all

Verify that this code works for this example and generates the correct values of -1, 0, and -2. Describe two problems that might arise if the data set is not sorted in a particular order or if one of the observations is missing for one of the subjects. Provide an alternative approach to generate this variable that is more robust (hint: use tidyr::spread()).

SOLUTION:

# solution goes here
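
One way to follow the hint, as a sketch rather than a definitive solution: spread the data so that each subject occupies one row with a column per group, then subtract within the row. The subtraction is then matched on id instead of on row position. The helper name robust_diff and the shuffled and incomplete data frames are introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# The data set from the table above
ds1 <- data.frame(
  id    = c(1, 2, 3, 1, 2, 3),
  group = c("T", "T", "T", "C", "C", "C"),
  vals  = c(4, 6, 8, 5, 6, 10),
  stringsAsFactors = FALSE
)

# One row per subject, one column per group; the .data pronoun makes sure
# T and C refer to columns and never to R's built-in T (TRUE)
robust_diff <- function(data) {
  data %>%
    spread(key = group, value = vals) %>%
    mutate(diff = .data$T - .data$C)
}

robust_diff(ds1)

# Shuffling the rows does not change the result
shuffled <- ds1[c(6, 1, 4, 3, 2, 5), ]
robust_diff(shuffled)

# Dropping subject 2's control value gives NA for that subject instead of a
# misaligned (or recycled) subtraction
incomplete <- ds1[-5, ]
robust_diff(incomplete)
```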

Tall to wide

Generate the code to convert the following data frame to wide format.

	grp	sex	meanL	sdL	meanR	sdR
1	A	F	0.22	0.11	0.34	0.08
2	A	M	0.47	0.33	0.57	0.33
3	B	F	0.33	0.11	0.40	0.07
4	B	M	0.55	0.31	0.65	0.27

The result should look like the following display.

	grp	F.meanL	F.meanR	F.sdL	F.sdR	M.meanL	M.meanR	M.sdL	M.sdR
1	A	0.22	0.34	0.11	0.08	0.47	0.57	0.33	0.33
2	B	0.33	0.40	0.11	0.07	0.55	0.65	0.31	0.27

Hint: use gather() in conjunction with spread() and unite().
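
The hint can be sketched as three steps: gather the four summary columns into key/value pairs, unite sex with the statistic name, and spread the combined key back out. The data frame name ds and the intermediate column names stat, value, and sex_stat are assumptions of this sketch.

```r
library(dplyr)
library(tidyr)

# The data frame from the first display
ds <- data.frame(
  grp   = c("A", "A", "B", "B"),
  sex   = c("F", "M", "F", "M"),
  meanL = c(0.22, 0.47, 0.33, 0.55),
  sdL   = c(0.11, 0.33, 0.11, 0.31),
  meanR = c(0.34, 0.57, 0.40, 0.65),
  sdR   = c(0.08, 0.33, 0.07, 0.27),
  stringsAsFactors = FALSE
)

wide <- ds %>%
  # Stack meanL, sdL, meanR, sdR into one key column and one value column
  gather(key = "stat", value = "value", meanL:sdR) %>%
  # Build keys such as "F.meanL" and "M.sdR"
  unite("sex_stat", sex, stat, sep = ".") %>%
  # One column per combined key, one row per grp
  spread(key = sex_stat, value = value)

wide
```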

Multiple models

Use the dplyr::do() function and the HELPrct data frame from the mosaicData package to fit a regression model predicting cesd as a function of age separately for each of the levels of the substance variable. Generate a table of results (estimates and confidence intervals) for each level of the grouping variable.

SOLUTION:

library(mdsr)   
# solution goes here
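
A sketch of the group-wise pattern, assuming the dplyr and mosaicData packages are installed. Inside do(), the dot stands for the rows of the current group, and the block returns a small data frame per group that do() stacks together. The result name and the column names term, estimate, conf.low, and conf.high are choices made here; confint() is called with its default 95% level.

```r
library(dplyr)
library(mosaicData)

# Fit cesd ~ age within each substance group and collect estimates with 95% CIs
results <- HELPrct %>%
  group_by(substance) %>%
  do({
    mod <- lm(cesd ~ age, data = .)
    ci <- confint(mod)
    data.frame(
      term      = names(coef(mod)),
      estimate  = unname(coef(mod)),
      conf.low  = ci[, 1],
      conf.high = ci[, 2],
      row.names = NULL
    )
  })

results
```

Each substance contributes two rows, one for the intercept and one for the age slope.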

Baseball records

Use the dplyr::do() function and the Lahman data to replicate one of these baseball records plots (http://tinyurl.com/nytimes-records) from The New York Times.

SOLUTION:

library(mdsr)   
library(Lahman)
# solution goes here

FEC

Use the fec package to download the Federal Election Commission data for 2012. Re-create Figures 2.1 and 2.2 using ggplot2.

SOLUTION:

# solution goes here, after downloading the fec package.

More FEC

Using the same FEC data as the previous exercise, re-create Figure 2.8.

SOLUTION:

# solution goes here

Wikipedia

Using the approach described in Section 5.5.4, find another table in Wikipedia that can be scraped and visualized. Be sure to interpret your graphical display.

SOLUTION:

# earlier example (please delete this)
library(rvest)
library(methods)
url <- "http://en.wikipedia.org/wiki/Mile_run_world_record_progression"
tables <- url %>%
  read_html() %>%
  html_nodes("table")
length(tables)

## [1] 7

Table3 <- html_table(tables[[3]])
head(Table3)

##   Time             Athlete    Nationality             Date     Venue
## 1 4:52      Cadet Marshall United Kingdom 2 September 1852 Addiscome
## 2 4:45        Thomas Finch United Kingdom  3 November 1858    Oxford
## 3 4:45 St. Vincent Hammick United Kingdom 15 November 1858    Oxford
## 4 4:40       Gerald Surman United Kingdom 24 November 1859    Oxford
## 5 4:33       George Farran United Kingdom      23 May 1862    Dublin

Wrangling with the FEC

Replicate the wrangling to create the house_elections table in the fec package from the original Excel source file.

SOLUTION:

library(mdsr)
# solution goes here

Babynames

Replicate the functionality of the make_babynames_dist() function from the mdsr package to wrangle the original tables from the babynames package.

SOLUTION:

library(mdsr)
# solution goes here

Nicholas Horton (nhorton@amherst.edu)

February 5, 2018