Introduction

These exercises are taken from the data wrangling chapter from Modern Data Science with R: http://mdsr-book.github.io. Other materials relevant for instructors (sample activities, overview video) for this chapter can be found there.

Verbs

Each of these tasks can be performed using a single data verb. For each task, say which verb it is:

  1. Find the average of one of the variables.
  2. Add a new column that is the ratio between two variables.
  3. Sort the cases in descending order of a variable.
  4. Create a new data table that includes only those cases that meet a criterion.
  5. From a data table with three categorical variables A, B, and C, and a quantitative variable X, produce a data frame that has the same cases but only the variables A and X.

SOLUTION:

Cancelled flights

Use the nycflights13 package and the flights data frame to answer the following questions: What month had the highest proportion of cancelled flights? What month had the lowest? Interpret any seasonal patterns.

SOLUTION:

library(mdsr)
library(nycflights13)
glimpse(flights)
## Observations: 336,776
## Variables: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013...
# solution goes here

Frequent NYC visitors

Use the nycflights13 package and the flights data frame to answer the following question: What plane (specified by the tailnum variable) traveled the most times from New York City airports in 2013? Plot the number of trips per week over the year.

SOLUTION:

library(mdsr)
library(nycflights13)
glimpse(planes)
## Observations: 3,322
## Variables: 9
## $ tailnum      <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575",...
## $ year         <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1...
## $ type         <chr> "Fixed wing multi engine", "Fixed wing multi engi...
## $ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE"...
## $ model        <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", ...
## $ engines      <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ seats        <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 5...
## $ speed        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ engine       <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan...
# solution goes here

Oldest plane

Use the nycflights13 package and the flights and planes tables to answer the following questions: What is the oldest plane (specified by the tailnum variable) that flew from New York City airports in 2013? How many airplanes that flew from New York City are included in the planes table?

SOLUTION:

library(mdsr)
library(nycflights13)
# solution goes here

Who made that?

Use the nycflights13 package and the flights and planes tables to answer the following questions:

  1. How many planes have a missing date of manufacture?

  2. What are the five most common manufacturers?

  3. Has the distribution of manufacturer changed over time as reflected by the airplanes flying from NYC in 2013?

(Hint: you may need to recode the manufacturer name and collapse rare vendors into a category called Other)

SOLUTION:

library(mdsr)
library(nycflights13)
# solution goes here

Weather

Use the nycflights13 package and the weather table to answer the following questions: What is the distribution of temperature in July, 2013? Identify any important outliers in terms of the wind_speed variable. What is the relationship between dewp and humid? What is the relationship between precip and visib?

SOLUTION:

library(mdsr)
library(nycflights13)
# solution goes here

Rain, rain, go away

Use the nycflights13 package and the weather table to answer the following questions: On how many days was there precipitation in the New York area in 2013? Were there differences in the mean visibility (visib) based on the day of the week and/or month of the year? XX

SOLUTION:

library(mdsr)
library(nycflights13)
# solution goes here

Batting

Define two new variables in the Teams data frame from the Lahman package: batting average (\(BA\)) and slugging percentage (\(SLG\)). Batting average is the ratio of hits (H) to at-bats (AB), and slugging percentage is total bases divided by at-bats. To compute total bases, you get 1 for a single, 2 for a double, 3 for a triple, and 4 for a home run.

SOLUTION:

library(mdsr)
library(Lahman)
# solution goes here

AL vs NL

Plot a time series of SLG since 1954 conditioned by lgID. Is slugging percentage typically higher in the American League (AL) or the National League (NL)? Can you think of why this might be the case?

SOLUTION:

library(mdsr)
library(Lahman)
# solution goes here

Slugging percentage

Display the top 15 teams ranked in terms of slugging percentage in MLB history. Repeat this using teams since 1969.

SOLUTION:

library(mdsr)
library(Lahman)
# solution goes here

Angels

The Angels have at times been called the California Angels (CAL), the Anaheim Angels (ANA), and the Los Angeles Angels of Anaheim (LAA). Find the 10 most successful seasons in Angels history. Have they ever won the World Series?

SOLUTION:

library(mdsr)
library(Lahman)
# solution goes here

Presidential baseball

Create a factor called election that divides the yearID into four-year blocks that correspond to U.S. presidential terms. During which term have the most home runs been hit?

SOLUTION:

library(mdsr)
library(Lahman)
# solution goes here

Stolen bases

Name every player in baseball history who has accumulated at least 300 home runs (HR) and at least 300 stolen bases (SB).

SOLUTION:

library(mdsr)
library(Lahman)
# solution goes here

Wins and strikeouts

Name every pitcher in baseball history who has accumulated at least 300 wins (W) and at least 3,000 strikeouts (`SO).

SOLUTION:

library(mdsr)
library(Lahman)
# solution goes here

Home runs and batting average

Identify the name and year of every player who has hit at least 50 home runs in a single season. Which player had the lowest batting average in that season?

SOLUTION:

library(mdsr)
# solution goes here

Relative age effect

The relative age effect (https://en.wikipedia.org/wiki/Relative_age_effect) is an attempt to explain anomalies in the distribution of birth month among athletes. Briefly, the idea is that children born just after the age cut-off for participation will be as much as 11 months older than their fellow athletes, which is enough of a disparity to give them an advantage. That advantage will then be compounded over the years, resulting in notably more professional athletes born in these months. Display the distribution of birth months of baseball players who batted during the decade of the 2000s. How are they distributed over the calendar year? Does this support the notion of a relative age effect? (Online hint: use the Births78 data set from the mosaicData package as a reference.)

SOLUTION:

library(mdsr)
glimpse(Births78)
## Observations: 365
## Variables: 4
## $ date      <dttm> 1978-01-01, 1978-01-02, 1978-01-03, 1978-01-04, 197...
## $ births    <int> 7701, 7527, 8825, 8859, 9043, 9208, 8084, 7611, 9172...
## $ dayofyear <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ wday      <ord> Sun, Mon, Tues, Wed, Thurs, Fri, Sat, Sun, Mon, Tues...
# solution goes here

Restaurants

The Violations data set in the mdsr package contains information regarding the outcome of health inspections of restaurants in New York City. Use these data to calculate the median violation score by zip code for zip codes in Manhattan with 50 or more inspections. What pattern do you see between the number of inspections and the median score?

SOLUTION:

library(mdsr)
glimpse(Violations)
## Observations: 480,621
## Variables: 16
## $ camis           <int> 30075445, 30075445, 30075445, 30075445, 300754...
## $ dba             <chr> "MORRIS PARK BAKE SHOP", "MORRIS PARK BAKE SHO...
## $ boro            <chr> "BRONX", "BRONX", "BRONX", "BRONX", "BRONX", "...
## $ building        <int> 1007, 1007, 1007, 1007, 1007, 1007, 1007, 1007...
## $ street          <chr> "MORRIS PARK AVE", "MORRIS PARK AVE", "MORRIS ...
## $ zipcode         <int> 10462, 10462, 10462, 10462, 10462, 10462, 1046...
## $ phone           <dbl> 7188924968, 7188924968, 7188924968, 7188924968...
## $ inspection_date <dttm> 2015-02-09, 2014-03-03, 2013-10-10, 2013-09-1...
## $ action          <chr> "Violations were cited in the following area(s...
## $ violation_code  <chr> "06C", "10F", NA, "04L", "04N", "04C", "04L", ...
## $ score           <int> 6, 2, NA, 6, 6, 32, 32, 32, 32, 32, 32, NA, 10...
## $ grade           <chr> "A", "A", NA, "A", "A", NA, NA, NA, NA, NA, NA...
## $ grade_date      <dttm> 2015-02-09, 2014-03-03, NA, 2013-09-11, 2013-...
## $ record_date     <dttm> 2016-01-06, 2016-01-06, 2016-01-06, 2016-01-0...
## $ inspection_type <chr> "Cycle Inspection / Initial Inspection", "Cycl...
## $ cuisine_code    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8...
# solution goes here

Firearms

Download data on the number of deaths by firearm (http://www.fdle.state.fl.us/cms/FSAC/Documents/Excel/1971_fwd_murder_firearms.aspx) from the Florida Department of Law Enforcement. Wrangle these data and use ggplot2 to re-create Figure 6.1.

SOLUTION:

library(mdsr)
# solution goes here