Introduction

These exercises are taken from the grammar of graphics chapter of Modern Data Science with R (http://mdsr-book.github.io). Other materials for instructors that accompany this chapter (sample activities, overview video) can be found there.

Galton

Using the famous Galton data set from the mosaicData package:

  1. Create a scatterplot of each person’s height against their father’s height
  2. Separate your plot into facets by sex
  3. Add regression lines to all of your facets

Hint: recall that you can find out more about the data set by running the command ?Galton.

SOLUTION:

library(mdsr)
glimpse(Galton)
## Observations: 898
## Variables: 6
## $ family <fct> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5...
## $ father <dbl> 78.5, 78.5, 78.5, 78.5, 75.5, 75.5, 75.5, 75.5, 75.0, 7...
## $ mother <dbl> 67.0, 67.0, 67.0, 67.0, 66.5, 66.5, 66.5, 66.5, 64.0, 6...
## $ sex    <fct> M, F, F, F, M, M, F, F, M, F, M, M, F, F, F, M, M, M, F...
## $ height <dbl> 73.2, 69.2, 69.0, 69.0, 73.5, 72.5, 65.5, 65.5, 71.0, 6...
## $ nkids  <int> 4, 4, 4, 4, 4, 4, 4, 4, 2, 2, 5, 5, 5, 5, 5, 6, 6, 6, 6...
# solution goes here
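
One possible approach, offered as a sketch rather than the book's official answer: map father to x and height to y, split the panels with facet_wrap, and draw a least-squares line in each panel with geom_smooth(method = "lm"). Loading ggplot2 and mosaicData directly (rather than relying on mdsr to attach them) is an assumption made to keep the example self-contained, and the object name galton_plot is hypothetical.

```r
library(ggplot2)
library(mosaicData)

# Scatterplot of each person's height against their father's height,
# one panel per sex, with a linear regression line in every panel.
galton_plot <- ggplot(Galton, aes(x = father, y = height)) +
  geom_point(alpha = 0.5) +
  facet_wrap(~ sex) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Father's height", y = "Person's height")

galton_plot
```

Because geom_smooth is added after facet_wrap has split the data, the line is fit separately within each panel.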

Railtrails

Using the RailTrail data set from the mosaicData package:

  1. Create a scatterplot of the number of crossings per day (volume) against the high temperature that day (hightemp)
  2. Separate your plot into facets by weekday
  3. Add regression lines to the two facets

SOLUTION:

library(mdsr)
glimpse(RailTrail)
## Observations: 90
## Variables: 11
## $ hightemp   <int> 83, 73, 74, 95, 44, 69, 66, 66, 80, 79, 78, 65, 41,...
## $ lowtemp    <int> 50, 49, 52, 61, 52, 54, 39, 38, 55, 45, 55, 48, 49,...
## $ avgtemp    <dbl> 66.5, 61.0, 63.0, 78.0, 48.0, 61.5, 52.5, 52.0, 67....
## $ spring     <int> 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, ...
## $ summer     <int> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, ...
## $ fall       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, ...
## $ cloudcover <dbl> 7.6, 6.3, 7.5, 2.6, 10.0, 6.6, 2.4, 0.0, 3.8, 4.1, ...
## $ precip     <dbl> 0.00, 0.29, 0.32, 0.00, 0.14, 0.02, 0.00, 0.00, 0.0...
## $ volume     <int> 501, 419, 397, 385, 200, 375, 417, 629, 533, 547, 4...
## $ weekday    <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE, FALSE, F...
## $ dayType    <chr> "weekday", "weekday", "weekday", "weekend", "weekda...
# solution goes here
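
A minimal sketch of one way to do this, assuming the column names shown in the glimpse output above (hightemp, volume, weekday). The object name railtrail_plot is hypothetical, and label_both is used only so that the panel strips read "weekday: TRUE" and "weekday: FALSE" instead of bare logical values.

```r
library(ggplot2)
library(mosaicData)

# Daily trail volume against the day's high temperature,
# faceted by the logical weekday indicator, with a linear fit per panel.
railtrail_plot <- ggplot(RailTrail, aes(x = hightemp, y = volume)) +
  geom_point() +
  facet_wrap(~ weekday, labeller = label_both) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "High temperature", y = "Crossings per day")

railtrail_plot
```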

Hamilton and Angelica

Angelica Schuyler Church (https://en.wikipedia.org/wiki/Angelica_Schuyler_Church, 1756–1814) was the daughter of Philip Schuyler, a Revolutionary War general and U.S. Senator from New York, and the sister of Elizabeth Schuyler Hamilton. The town of Angelica, New York, was named after her. Generate a plot of the reported proportion of babies born with the name Angelica over time and interpret the figure.

SOLUTION:

library(mdsr)
library(babynames)
glimpse(babynames)
## Observations: 1,858,689
## Variables: 5
## $ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 188...
## $ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F...
## $ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret"...
## $ n    <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 128...
## $ prop <dbl> 0.072384329, 0.026679234, 0.020521700, 0.019865989, 0.017...
# solution goes here
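
One way to start, sketched under the assumption that dplyr and ggplot2 are installed alongside babynames: keep only the rows for the name Angelica and plot prop against year, with one line per recorded sex. The names angelica and angelica_plot are hypothetical. The interpretation is deliberately left to the reader.

```r
library(dplyr)
library(ggplot2)
library(babynames)

# Keep only the rows for the name of interest.
angelica <- babynames %>%
  filter(name == "Angelica")

# Proportion of births with this name in each year, one line per sex.
angelica_plot <- ggplot(angelica, aes(x = year, y = prop, color = sex)) +
  geom_line() +
  labs(x = "Year", y = "Proportion of births named Angelica")

angelica_plot
```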

Marriage

The following questions use the Marriage data set from the mosaicData package.

  1. Create an informative and meaningful data graphic.
  2. Identify each of the visual cues that you are using, and describe how they are related to each variable.
  3. Create a data graphic with at least five variables (either quantitative or categorical). For the purposes of this exercise, do not worry about making your visualization meaningful—just try to encode five variables into one plot.

SOLUTION:

library(mdsr)
glimpse(Marriage)
## Observations: 98
## Variables: 15
## $ bookpageID    <fct> B230p539, B230p677, B230p766, B230p892, B230p994...
## $ appdate       <fct> 10/29/96, 11/12/96, 11/19/96, 12/2/96, 12/9/96, ...
## $ ceremonydate  <fct> 11/9/96, 11/12/96, 11/27/96, 12/7/96, 12/14/96, ...
## $ delay         <int> 11, 0, 8, 5, 5, 0, 16, 0, 28, 10, 8, 0, 4, 4, 0,...
## $ officialTitle <fct> CIRCUIT JUDGE , MARRIAGE OFFICIAL, MARRIAGE OFFI...
## $ person        <fct> Groom, Groom, Groom, Groom, Groom, Groom, Groom,...
## $ dob           <fct> 4/11/64, 8/6/64, 2/20/62, 5/20/56, 12/14/66, 2/2...
## $ age           <dbl> 32.60274, 32.29041, 34.79178, 40.57808, 30.02192...
## $ race          <fct> White, White, Hispanic, Black, White, White, Whi...
## $ prevcount     <int> 0, 1, 1, 1, 0, 1, 1, 1, 0, 3, 1, 1, 0, 0, 1, 0, ...
## $ prevconc      <fct> NA, Divorce, Divorce, Divorce, NA, NA, Divorce, ...
## $ hs            <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, ...
## $ college       <int> 7, 0, 3, 4, 0, 0, 0, 0, 0, 6, 2, 1, 1, 0, 0, 4, ...
## $ dayOfBirth    <dbl> 102.00, 219.00, 51.50, 141.00, 348.50, 52.50, 28...
## $ sign          <fct> Aries, Leo, Pisces, Gemini, Saggitarius, Pisces,...
# solution goes here
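
For the third question, a sketch of one plot that encodes five variables, assuming the column names in the glimpse output above. The choice of which variable goes with which visual cue is arbitrary here, and marriage_plot is a hypothetical name. The cues are position on the x axis (age), position on the y axis (delay), color (race), shape (person), and small multiples (officialTitle).

```r
library(ggplot2)
library(mosaicData)

# Five variables in one graphic:
#   x position -> age, y position -> delay,
#   color -> race, shape -> person, facet -> officialTitle
marriage_plot <- ggplot(
  Marriage,
  aes(x = age, y = delay, color = race, shape = person)
) +
  geom_point(alpha = 0.7) +
  facet_wrap(~ officialTitle)

marriage_plot
```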

MLB teams

The MLB_teams data set in the mdsr package contains information about Major League Baseball teams over several recent seasons (the output below shows 210 team-season rows, beginning in 2008). There are several quantitative and a few categorical variables present. See how many variables you can illustrate on a single plot in R. The current record is 7. (Note: This is not good graphical practice; it is merely an exercise to help you understand how to use visual cues and aesthetics!)

SOLUTION:

library(mdsr)
glimpse(MLB_teams)
## Observations: 210
## Variables: 11
## $ yearID     <int> 2008, 2008, 2008, 2008, 2008, 2008, 2008, 2008, 200...
## $ teamID     <chr> "ARI", "ATL", "BAL", "BOS", "CHA", "CHN", "CIN", "C...
## $ lgID       <fct> NL, NL, AL, AL, AL, NL, NL, AL, NL, AL, NL, NL, AL,...
## $ W          <int> 82, 72, 68, 95, 89, 97, 74, 81, 74, 74, 84, 86, 75,...
## $ L          <int> 80, 90, 93, 67, 74, 64, 88, 81, 88, 88, 77, 75, 87,...
## $ WPct       <dbl> 0.5061728, 0.4444444, 0.4223602, 0.5864198, 0.54601...
## $ attendance <int> 2509924, 2532834, 1950075, 3048250, 2500648, 330020...
## $ normAttend <dbl> 0.5838859, 0.5892155, 0.4536477, 0.7091172, 0.58172...
## $ payroll    <int> 66202712, 102365683, 67196246, 133390035, 121189332...
## $ metroPop   <dbl> 4489109, 5614323, 2785874, 4732161, 9554598, 955459...
## $ name       <chr> "Arizona Diamondbacks", "Atlanta Braves", "Baltimor...
# solution goes here
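
As a starting point, here is a sketch that packs six variables into one plot; it does not attempt to beat the stated record. The mapping of variables to aesthetics is an arbitrary choice made for illustration, and mlb_plot is a hypothetical name.

```r
library(ggplot2)
library(mdsr)

# Six variables on one (deliberately overloaded) plot:
#   x -> payroll, y -> WPct, color -> lgID,
#   size -> attendance, alpha -> metroPop, facet -> yearID
mlb_plot <- ggplot(
  MLB_teams,
  aes(x = payroll, y = WPct, color = lgID,
      size = attendance, alpha = metroPop)
) +
  geom_point() +
  facet_wrap(~ yearID)

mlb_plot
```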

Payroll

Use the MLB_teams data in the mdsr package to create an informative data graphic that illustrates the relationship between winning percentage and payroll in context.

SOLUTION:

library(mdsr)
# solution goes here
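
A minimal sketch of one reading of "in context": show winning percentage against payroll separately for each season, so that teams are compared with their contemporaries, and add a linear trend per season. Rescaling payroll to millions and faceting by yearID are design assumptions, not requirements of the exercise; payroll_plot is a hypothetical name.

```r
library(ggplot2)
library(mdsr)

# Winning percentage against payroll (in millions), one panel per season,
# colored by league, with a linear trend line in each panel.
payroll_plot <- ggplot(MLB_teams, aes(x = payroll / 1e6, y = WPct)) +
  geom_point(aes(color = lgID)) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  facet_wrap(~ yearID) +
  labs(x = "Payroll (millions)", y = "Winning percentage", color = "League")

payroll_plot
```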

Dead names

Use the make_babynames_dist function in the mdsr package to recreate the “Deadest Names” graphic from FiveThirtyEight (http://tinyurl.com/zcbcl9o).

SOLUTION:

library(mdsr)
babynames_dist <- make_babynames_dist()
glimpse(babynames_dist)
## Observations: 1,639,490
## Variables: 9
## $ year            <dbl> 1900, 1900, 1900, 1900, 1900, 1900, 1900, 1900...
## $ sex             <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "...
## $ name            <chr> "Mary", "Helen", "Anna", "Margaret", "Ruth", "...
## $ n               <int> 16707, 6343, 6114, 5304, 4765, 4096, 3920, 389...
## $ prop            <dbl> 0.052574935, 0.019960664, 0.019240028, 0.01669...
## $ alive_prob      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ count_thousands <dbl> 16.707, 6.343, 6.114, 5.304, 4.765, 4.096, 3.9...
## $ age_today       <dbl> 114, 114, 114, 114, 114, 114, 114, 114, 114, 1...
## $ est_alive_today <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
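
One way to approach the graphic, sketched under several assumptions: for each name and sex, total the births (n) and the estimated number still alive (est_alive_today), compute the share estimated to be dead, and plot the ten highest shares per sex as horizontal bars. The popularity cutoff min_births is a hypothetical value chosen for illustration, not the threshold used by FiveThirtyEight, and the names deadest and deadest_plot are also hypothetical.

```r
library(dplyr)
library(ggplot2)
library(mdsr)

babynames_dist <- make_babynames_dist()

# Assumed cutoff: ignore names with fewer total births than this.
min_births <- 50000

deadest <- babynames_dist %>%
  filter(!is.na(est_alive_today)) %>%          # drop rows without an estimate
  group_by(name, sex) %>%
  summarise(
    total_births = sum(n),
    total_alive = sum(est_alive_today),
    .groups = "drop"
  ) %>%
  filter(total_births >= min_births) %>%
  mutate(pct_dead = 1 - total_alive / total_births) %>%
  group_by(sex) %>%
  slice_max(pct_dead, n = 10) %>%              # ten highest shares per sex
  ungroup()

# Horizontal bars, ordered by the estimated share dead, one panel per sex.
deadest_plot <- ggplot(
  deadest,
  aes(x = reorder(name, pct_dead), y = pct_dead, fill = sex)
) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ sex, scales = "free_y") +
  labs(x = NULL, y = "Estimated share of people with this name who are dead")

deadest_plot
```

Changing min_births changes which names qualify, so the result should be compared against the original graphic before treating it as a faithful recreation.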

MacLeish

The macleish package contains weather data collected every ten minutes in 2015 from two weather stations in Whately, MA. Using ggplot2, create a data graphic that displays the average temperature over each 10-minute interval (temperature) as a function of time (when).

SOLUTION:

library(mdsr)
library(macleish)
## Loading required package: etl
glimpse(whately_2015)
## Observations: 52,560
## Variables: 8
## $ when            <dttm> 2015-01-01 00:00:00, 2015-01-01 00:10:00, 201...
## $ temperature     <dbl> -9.32, -9.46, -9.44, -9.30, -9.32, -9.34, -9.3...
## $ wind_speed      <dbl> 1.399, 1.506, 1.620, 1.141, 1.223, 1.090, 1.16...
## $ wind_dir        <dbl> 225.4, 248.2, 258.3, 243.8, 238.4, 241.7, 242....
## $ rel_humidity    <dbl> 54.55, 55.38, 56.18, 56.41, 56.87, 57.25, 57.7...
## $ pressure        <int> 985, 985, 985, 985, 984, 984, 984, 984, 984, 9...
## $ solar_radiation <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ rainfall        <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
# solution goes here
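
A minimal sketch, assuming the whately_2015 data frame shown above: put when on the x axis and temperature on the y axis, and connect the readings with a line. The object name whately_plot is hypothetical.

```r
library(ggplot2)
library(macleish)

# Temperature for each 10-minute interval as a function of time.
whately_plot <- ggplot(whately_2015, aes(x = when, y = temperature)) +
  geom_line() +
  labs(x = "Time", y = "Temperature")

whately_plot
```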

NASA weather

Using data from the nasaweather package, create a scatterplot between wind and pressure with color being used to distinguish the type of storm.

SOLUTION:

library(mdsr)
library(nasaweather)
glimpse(storms)
## Observations: 2,747
## Variables: 11
## $ name     <chr> "Allison", "Allison", "Allison", "Allison", "Allison"...
## $ year     <int> 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995, 1995,...
## $ month    <int> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,...
## $ day      <int> 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7,...
## $ hour     <int> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 1...
## $ lat      <dbl> 17.4, 18.3, 19.3, 20.6, 22.0, 23.3, 24.7, 26.2, 27.6,...
## $ long     <dbl> -84.3, -84.9, -85.7, -85.8, -86.0, -86.3, -86.2, -86....
## $ pressure <int> 1005, 1004, 1003, 1001, 997, 995, 987, 988, 988, 990,...
## $ wind     <int> 30, 30, 35, 40, 50, 60, 65, 65, 65, 60, 60, 45, 30, 3...
## $ type     <chr> "Tropical Depression", "Tropical Depression", "Tropic...
## $ seasday  <int> 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 6, 6, 7, 7,...
# solution goes here
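
One possible sketch. The data set is referenced as nasaweather::storms because dplyr also ships a data set called storms, and whichever package is attached last would otherwise win; this namespacing is a precaution, not part of the exercise. The choice of which variable goes on which axis is arbitrary, and storm_scatter is a hypothetical name.

```r
library(ggplot2)
library(nasaweather)

# Wind speed against air pressure, colored by storm type.
storm_scatter <- ggplot(
  nasaweather::storms,
  aes(x = pressure, y = wind, color = type)
) +
  geom_point(alpha = 0.5)

storm_scatter
```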

More weather

Using data from the nasaweather package, use the geom_path function to plot the path of each tropical storm in the storms data table. Use color to distinguish the storms from one another, and use faceting to plot each year in its own panel.

SOLUTION:

library(mdsr)
library(nasaweather)
# solution goes here
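
A sketch of one approach, assuming that every row of nasaweather::storms should be drawn (not only rows whose type is "Tropical Storm"): geom_path connects the observations in the order they appear in the data, so each storm's longitude and latitude trace out its track. Hiding the legend is a design choice made here because there are many storm names; storm_paths is a hypothetical name.

```r
library(ggplot2)
library(nasaweather)

# One path per storm (grouped and colored by name), one panel per year.
storm_paths <- ggplot(
  nasaweather::storms,
  aes(x = long, y = lat, group = name, color = name)
) +
  geom_path() +
  facet_wrap(~ year) +
  theme(legend.position = "none") +
  labs(x = "Longitude", y = "Latitude")

storm_paths
```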