These exercises are taken from the tidy data and iteration chapter from Modern Data Science with R: http://mdsr-book.github.io. Other materials relevant for instructors (sample activities, overview video) for this chapter can be found there.
Consider the number of home runs hit (HR) and home runs allowed (HRA) for the Chicago Cubs (CHN) baseball team. Reshape the Teams data from the Lahman package into long format and plot a time series conditioned on whether the HRs that involved the Cubs were hit by them or allowed by them.
SOLUTION:
library(mdsr)
library(Lahman)
# solution goes here
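As a hedged illustration of the reshaping step only (not a full solution), the sketch below uses a small made-up data frame in place of the Lahman Teams table. The values and the names toy_teams, toy_long, type, and count are assumptions introduced for illustration.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical stand-in for a few Cubs rows of Teams (made-up values)
toy_teams <- data.frame(
  yearID = c(2001, 2002, 2003),
  teamID = "CHN",
  HR = c(10, 12, 9),
  HRA = c(8, 11, 13)
)

# Stack the HR and HRA columns into key/value pairs (long format)
toy_long <- toy_teams %>%
  gather(key = "type", value = "count", HR, HRA)

# One line per type of home run, plotted over time
p <- ggplot(toy_long, aes(x = yearID, y = count, color = type)) +
  geom_line()
```

In long format the type column can be mapped directly to an aesthetic such as color, which is what makes the conditioned time series a single geom_line() call.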
Write a function called count_seasons that, when given a teamID, will count the number of seasons the team played in the Teams data frame from the Lahman package.
SOLUTION:
library(mdsr)
library(Lahman)
# solution goes here
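A minimal sketch of the idea, assuming a made-up table rather than the real Teams data. The function name count_seasons_sketch and its extra data argument are hypothetical, added so the example runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for Teams: one row per team per season (made-up values)
toy_teams <- data.frame(
  teamID = c("AAA", "AAA", "AAA", "BBB"),
  yearID = c(1901, 1902, 1903, 1901)
)

# Count the distinct seasons recorded for one team.
# `data` is an added argument so the sketch can run on the toy table.
count_seasons_sketch <- function(team, data) {
  data %>%
    filter(teamID == team) %>%
    summarize(n = n_distinct(yearID)) %>%
    pull(n)
}

count_seasons_sketch("AAA", toy_teams)
```

Counting distinct yearID values rather than rows is a design choice here: it guards against a table that holds more than one row per team and season.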
The team IDs corresponding to Brooklyn baseball teams from the Teams data frame from the Lahman package are listed below. Use sapply() to find the number of seasons in which each of those teams played.
SOLUTION:
library(mdsr)
library(Lahman)
bk_teams <- c("BR1", "BR2", "BR3", "BR4", "BRO", "BRP", "BRF")
# solution goes here
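The iteration pattern itself can be sketched in base R on made-up data. The table toy, the helper count_rows, and the team IDs below are hypothetical and stand in for the Teams data and a season-counting function.

```r
# Hypothetical table: one row per team per season (made-up values)
toy <- data.frame(
  teamID = c("AAA", "AAA", "BBB"),
  yearID = c(1900, 1901, 1900)
)

# Hypothetical helper: number of rows recorded for one team
count_rows <- function(team) {
  sum(toy$teamID == team)
}

# sapply() calls the helper once per ID and simplifies the results
# to a vector named by the input IDs
res <- sapply(c("AAA", "BBB", "CCC"), count_rows)
res
```

Because the input is a character vector, sapply() uses its elements as names for the result, so each count stays attached to its team ID.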
In the Marriage data set included in mosaicData, the appdate, ceremonydate, and dob variables are encoded as factors, even though they are dates. Use the lubridate package to convert those three columns into a date format.
SOLUTION:
library(mdsr)
Marriage %>%
  select(appdate, ceremonydate, dob) %>%
  glimpse()
## Observations: 98
## Variables: 3
## $ appdate <fct> 10/29/96, 11/12/96, 11/19/96, 12/2/96, 12/9/96, 1...
## $ ceremonydate <fct> 11/9/96, 11/12/96, 11/27/96, 12/7/96, 12/14/96, 1...
## $ dob <fct> 4/11/64, 8/6/64, 2/20/62, 5/20/56, 12/14/66, 2/21...
# solution goes here
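A hedged sketch of the conversion on a two-row made-up data frame, assuming the month/day/year layout shown in the output above. The name toy_marriage is hypothetical, and only two of the three columns are shown.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for two columns of Marriage, stored as factors
toy_marriage <- data.frame(
  appdate = factor(c("10/29/96", "11/12/96")),
  ceremonydate = factor(c("11/9/96", "11/12/96"))
)

# Convert factor -> character -> Date; mdy() expects month/day/year order
toy_dates <- toy_marriage %>%
  mutate(
    appdate = mdy(as.character(appdate)),
    ceremonydate = mdy(as.character(ceremonydate))
  )

glimpse(toy_dates)
```

Two-digit years carry no century, so the parser has to pick one by rule. It is worth checking the converted dob values in particular, since birth years may be assigned to a different century than intended.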
Consider the values returned by the as.numeric() and readr::parse_number() functions when applied to the following vectors. Describe the results and their implication.
x1 <- c("1900.45", "$1900.45", "1,900.45", "nearly $2000")
x2 <- as.factor(x1)
An analyst wants to calculate the pairwise differences between the Treatment and Control values for a small data set from a crossover trial (all subjects received both treatments) that consists of the following observations.
tab <- xtable(ds1)
print(tab, type="html", floating=FALSE)
| | id | group | vals |
|---|---|---|---|
| 1 | 1 | T | 4.00 |
| 2 | 2 | T | 6.00 |
| 3 | 3 | T | 8.00 |
| 4 | 1 | C | 5.00 |
| 5 | 2 | C | 6.00 |
| 6 | 3 | C | 10.00 |
They use the following code to create the new diff variable.
Treat <- filter(ds1, group=="T")
Control <- filter(ds1, group=="C")
all <- mutate(Treat, diff = Treat$vals - Control$vals)
all
Verify that this code works for this example and generates the correct values of -1, 0, and -2. Describe two problems that might arise if the data set is not sorted in a particular order or if one of the observations is missing for one of the subjects. Provide an alternative approach to generate this variable that is more robust (hint: use tidyr::spread()).
SOLUTION:
# solution goes here
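One possible sketch of the approach suggested by the hint, assuming the six observations from the table above. The rows are deliberately shuffled to show that the result does not depend on row order. The names ds1_shuffled and wide are hypothetical.

```r
library(dplyr)
library(tidyr)

# The six observations from the exercise, in shuffled row order
ds1_shuffled <- data.frame(
  id = c(2, 1, 3, 3, 1, 2),
  group = c("C", "T", "C", "T", "C", "T"),
  vals = c(6, 4, 10, 8, 5, 6)
)

# One row per subject, with a column for each treatment group,
# so the difference is computed within each id
wide <- ds1_shuffled %>%
  spread(key = group, value = vals) %>%
  mutate(diff = T - C)

wide
```

Because the values are matched on id rather than on row position, a subject with a missing observation would end up with an NA in one column and an NA difference, instead of being silently paired with another subject's value.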
Generate the code to convert the following data frame to wide format.
| | grp | sex | meanL | sdL | meanR | sdR |
|---|---|---|---|---|---|---|
| 1 | A | F | 0.22 | 0.11 | 0.34 | 0.08 |
| 2 | A | M | 0.47 | 0.33 | 0.57 | 0.33 |
| 3 | B | F | 0.33 | 0.11 | 0.40 | 0.07 |
| 4 | B | M | 0.55 | 0.31 | 0.65 | 0.27 |

The wide-format result should look like this:

| | grp | F.meanL | F.meanR | F.sdL | F.sdR | M.meanL | M.meanR | M.sdL | M.sdR |
|---|---|---|---|---|---|---|---|---|---|
| 1 | A | 0.22 | 0.34 | 0.11 | 0.08 | 0.47 | 0.57 | 0.33 | 0.33 |
| 2 | B | 0.33 | 0.40 | 0.11 | 0.07 | 0.55 | 0.65 | 0.31 | 0.27 |
Hint: use gather() in conjunction with spread() and unite().
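One way the three functions in the hint might be chained is sketched below. The data frame name ds and the intermediate column names stat, value, and sex_stat are assumptions introduced for illustration.

```r
library(dplyr)
library(tidyr)

# The four-row data frame from the exercise (the name `ds` is assumed)
ds <- data.frame(
  grp = c("A", "A", "B", "B"),
  sex = c("F", "M", "F", "M"),
  meanL = c(0.22, 0.47, 0.33, 0.55),
  sdL = c(0.11, 0.33, 0.11, 0.31),
  meanR = c(0.34, 0.57, 0.40, 0.65),
  sdR = c(0.08, 0.33, 0.07, 0.27)
)

wide <- ds %>%
  # Stack the four measurement columns into key/value pairs
  gather(key = "stat", value = "value", -grp, -sex) %>%
  # Combine sex and statistic into one label such as "F.meanL"
  unite("sex_stat", sex, stat, sep = ".") %>%
  # Spread the combined labels back out into columns
  spread(key = sex_stat, value = value)

wide
```

The data go long first so that sex and the statistic name can be merged into a single key. A direct spread() on sex alone would handle only one value column.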
Use the dplyr::do() function and the HELPrct data frame from the mosaicData package to fit a regression model predicting cesd as a function of age separately for each of the levels of the substance variable. Generate a table of results (estimates and confidence intervals) for each level of the grouping variable.
SOLUTION:
library(mdsr)
# solution goes here
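The grouped-model pattern can be sketched on made-up data as follows. The data frame toy, its columns g, x, and y, and the output column names are hypothetical stand-ins for HELPrct, substance, age, and cesd.

```r
library(dplyr)

# Hypothetical data: two groups with four made-up observations each
toy <- data.frame(
  g = rep(c("a", "b"), each = 4),
  x = rep(1:4, times = 2),
  y = c(2.1, 3.9, 6.2, 7.8, 5.0, 4.1, 2.9, 2.0)
)

# Fit one linear model per group; inside do(), `.` is the data for the
# current group, and the returned data frames are stacked together
fits <- toy %>%
  group_by(g) %>%
  do({
    mod <- lm(y ~ x, data = .)
    ci <- confint(mod)
    data.frame(
      term = names(coef(mod)),
      estimate = unname(coef(mod)),
      conf_low = ci[, 1],
      conf_high = ci[, 2],
      row.names = NULL
    )
  })

fits
```

Returning a data frame from each group gives one row per coefficient per group, which is already the shape needed for a results table.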
Use the dplyr::do() function and the Lahman data to replicate one of these baseball records plots (http://tinyurl.com/nytimes-records) from The New York Times.
SOLUTION:
library(mdsr)
library(Lahman)
# solution goes here
Use the fec package to download the Federal Election Commission data for 2012. Re-create Figures 2.1 and 2.2 using ggplot2.
SOLUTION:
# solution goes here, after downloading the fec package.
Using the same FEC data as the previous exercise, re-create Figure 2.8.
SOLUTION:
# solution goes here
Using the approach described in Section 5.5.4, find another table in Wikipedia that can be scraped and visualized. Be sure to interpret your graphical display.
SOLUTION:
# earlier example (please delete this)
library(rvest)
library(methods)
url <- "http://en.wikipedia.org/wiki/Mile_run_world_record_progression"
tables <- url %>%
  read_html() %>%
  html_nodes("table")
length(tables)
## [1] 7
Table3 <- html_table(tables[[3]])
head(Table3)
## Time Athlete Nationality Date Venue
## 1 4:52 Cadet Marshall United Kingdom 2 September 1852 Addiscome
## 2 4:45 Thomas Finch United Kingdom 3 November 1858 Oxford
## 3 4:45 St. Vincent Hammick United Kingdom 15 November 1858 Oxford
## 4 4:40 Gerald Surman United Kingdom 24 November 1859 Oxford
## 5 4:33 George Farran United Kingdom 23 May 1862 Dublin
Replicate the wrangling to create the house_elections table in the fec package from the original Excel source file.
SOLUTION:
library(mdsr)
# solution goes here
Replicate the functionality of the make_babynames_dist() function from the mdsr package to wrangle the original tables from the babynames package.
SOLUTION:
library(mdsr)
# solution goes here