These exercises are taken from the data wrangling chapter from Modern Data Science with R: http://mdsr-book.github.io. Other materials relevant for instructors (sample activities, overview video) for this chapter can be found there.
Each of these tasks can be performed using a single data verb. For each task, say which verb it is:
SOLUTION:
Use the nycflights13
package and the flights
data frame to answer the following questions: What month had the highest proportion of cancelled flights? What month had the lowest? Interpret any seasonal patterns.
SOLUTION:
library(mdsr)
library(nycflights13)
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
# solution goes here
Use the nycflights13
package and the flights
data frame to answer the following question: What plane (specified by the tailnum
variable) traveled the most times from New York City airports in 2013? Plot the number of trips per week over the year.
SOLUTION:
library(mdsr)
library(nycflights13)
glimpse(planes)
## Observations: 3,322
## Variables: 9
## $ tailnum <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575",...
## $ year <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1...
## $ type <chr> "Fixed wing multi engine", "Fixed wing multi engi...
## $ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE"...
## $ model <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", ...
## $ engines <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ seats <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 5...
## $ speed <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ engine <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan...
# solution goes here
Use the nycflights13
package and the flights
and planes
tables to answer the following questions: What is the oldest plane (specified by the tailnum
variable) that flew from New York City airports in 2013? How many airplanes that flew from New York City are included in the planes
table?
SOLUTION:
library(mdsr)
library(nycflights13)
# solution goes here
Use the nycflights13
package and the flights
and planes
tables to answer the following questions:
How many planes have a missing date of manufacture?
What are the five most common manufacturers?
Has the distribution of manufacturer changed over time as reflected by the airplanes flying from NYC in 2013?
(Hint: you may need to recode the manufacturer name and collapse rare vendors into a category called Other
)
SOLUTION:
library(mdsr)
library(nycflights13)
# solution goes here
Use the nycflights13
package and the weather
table to answer the following questions: What is the distribution of temperature in July, 2013? Identify any important outliers in terms of the wind_speed
variable. What is the relationship between dewp
and humid
? What is the relationship between precip
and visib
?
SOLUTION:
library(mdsr)
library(nycflights13)
# solution goes here
Use the nycflights13
package and the weather
table to answer the following questions: On how many days was there precipitation in the New York area in 2013? Were there differences in the mean visibility (visib
) based on the day of the week and/or month of the year? XX
SOLUTION:
library(mdsr)
library(nycflights13)
# solution goes here
Define two new variables in the Teams
data frame from the Lahman
package: batting average (\(BA\)) and slugging percentage (\(SLG\)). Batting average is the ratio of hits (H
) to at-bats (AB
), and slugging percentage is total bases divided by at-bats. To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.
SOLUTION:
library(mdsr)
library(Lahman)
# solution goes here
Plot a time series of SLG
since 1954 conditioned by lgID
. Is slugging percentage typically higher in the American League (AL) or the National League (NL)? Can you think of why this might be the case?
SOLUTION:
library(mdsr)
library(Lahman)
# solution goes here
Display the top 15 teams ranked in terms of slugging percentage in MLB history. Repeat this using teams since 1969.
SOLUTION:
library(mdsr)
library(Lahman)
# solution goes here
The Angels have at times been called the California Angels (CAL
), the Anaheim Angels (ANA
), and the Los Angeles Angels of Anaheim (LAA
). Find the 10 most successful seasons in Angels history. Have they ever won the World Series?
SOLUTION:
library(mdsr)
library(Lahman)
# solution goes here
Create a factor called election
that divides the yearID
into four-year blocks that correspond to U.S. presidential terms. During which term have the most home runs been hit?
SOLUTION:
library(mdsr)
library(Lahman)
# solution goes here
Name every player in baseball history who has accumulated at least 300 home runs (HR
) and at least 300 stolen bases (SB
).
SOLUTION:
library(mdsr)
library(Lahman)
# solution goes here
Name every pitcher in baseball history who has accumulated at least 300 wins (W
) and at least 3,000 strikeouts (`SO).
SOLUTION:
library(mdsr)
library(Lahman)
# solution goes here
Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?
SOLUTION:
library(mdsr)
# solution goes here
The relative age effect (https://en.wikipedia.org/wiki/Relative_age_effect) is an attempt to explain anomalies in the distribution of birth month among athletes. Briefly, the idea is that children born just after the age cut-off for participation will be as much as 11 months older than their fellow athletes, which is enough of a disparity to give them an advantage. That advantage will then be compounded over the years, resulting in notably more professional athletes born in these months. Display the distribution of birth months of baseball players who batted during the decade of the 2000s. How are they distributed over the calendar year? Does this support the notion of a relative age effect? (Online hint: use the Births78
data set from the mosaicData
package as a reference.)
SOLUTION:
library(mdsr)
glimpse(Births78)
## Observations: 365
## Variables: 4
## $ date <dttm> 1978-01-01, 1978-01-02, 1978-01-03, 1978-01-04, 197...
## $ births <int> 7701, 7527, 8825, 8859, 9043, 9208, 8084, 7611, 9172...
## $ dayofyear <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ wday <ord> Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tues...
# solution goes here
The Violations
data set in the mdsr
package contains information regarding the outcome of health inspections of restaurants in New York City. Use these data to calculate the median violation score by zip code for zip codes in Manhattan with 50 or more inspections. What pattern do you see between the number of inspections and the median score?
SOLUTION:
library(mdsr)
glimpse(Violations)
## Observations: 480,621
## Variables: 16
## $ camis <int> 30075445, 30075445, 30075445, 30075445, 300754...
## $ dba <chr> "MORRIS PARK BAKE SHOP", "MORRIS PARK BAKE SHO...
## $ boro <chr> "BRONX", "BRONX", "BRONX", "BRONX", "BRONX", "...
## $ building <int> 1007, 1007, 1007, 1007, 1007, 1007, 1007, 1007...
## $ street <chr> "MORRIS PARK AVE", "MORRIS PARK AVE", "MORRIS ...
## $ zipcode <int> 10462, 10462, 10462, 10462, 10462, 10462, 1046...
## $ phone <dbl> 7188924968, 7188924968, 7188924968, 7188924968...
## $ inspection_date <dttm> 2015-02-09, 2014-03-03, 2013-10-10, 2013-09-1...
## $ action <chr> "Violations were cited in the following area(s...
## $ violation_code <chr> "06C", "10F", NA, "04L", "04N", "04C", "04L", ...
## $ score <int> 6, 2, NA, 6, 6, 32, 32, 32, 32, 32, 32, NA, 10...
## $ grade <chr> "A", "A", NA, "A", "A", NA, NA, NA, NA, NA, NA...
## $ grade_date <dttm> 2015-02-09, 2014-03-03, NA, 2013-09-11, 2013-...
## $ record_date <dttm> 2016-01-06, 2016-01-06, 2016-01-06, 2016-01-0...
## $ inspection_type <chr> "Cycle Inspection / Initial Inspection", "Cycl...
## $ cuisine_code <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
# solution goes here
Download data on the number of deaths by firearm (http://www.fdle.state.fl.us/cms/FSAC/Documents/Excel/1971_fwd_murder_firearms.aspx) from the Florida Department of Law Enforcement. Wrangle these data and use ggplot2
to re-create Figure 6.1.
SOLUTION:
library(mdsr)
# solution goes here