These exercises are taken from the supervised learning chapter of Modern Data Science with R: http://mdsr-book.github.io. Other materials relevant for instructors (sample activities, overview video) for this chapter can be found there.
The ability to get a good night’s sleep is correlated with many positive health outcomes. The NHANES data set in the NHANES package contains a binary variable SleepTrouble that indicates whether each person has trouble sleeping. For each of the following models, build a classifier for SleepTrouble and report its effectiveness on the NHANES training data. You may use whichever variables you like, except for SleepHrsNight. Models:
SOLUTION:
library(mdsr)
library(NHANES)
# XX create test and train
# solution goes here
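As a starting point, the sketch below compares a null model with a logistic regression for SleepTrouble. It is one possible approach, not the intended solution: the predictors Age, Gender, and BMI are an arbitrary illustrative choice, rows with missing values are simply dropped, and accuracy is computed on the same data used for fitting.

```r
library(NHANES)

# Keep complete cases for a few illustrative predictors (an assumed, arbitrary choice).
sleep <- na.omit(NHANES[, c("SleepTrouble", "Age", "Gender", "BMI")])

# Null model: always predict the most common class.
null_accuracy <- max(table(sleep$SleepTrouble)) / nrow(sleep)

# Logistic regression: glm() models the probability of the second factor level ("Yes").
mod_logit <- glm(SleepTrouble ~ Age + Gender + BMI, data = sleep, family = binomial)

# Classify as "Yes" when the fitted probability exceeds 0.5.
pred <- ifelse(fitted(mod_logit) > 0.5, "Yes", "No")
logit_accuracy <- mean(pred == sleep$SleepTrouble)

c(null = null_accuracy, logistic = logit_accuracy)
```

Any other classifier can be evaluated against the same null-model baseline.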
Repeat the previous exercise, but now use the quantitative response variable SleepHrsNight. Build and interpret the following models:
SOLUTION:
library(mdsr)
library(NHANES)
# XX create test and train
# solution goes here
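For a quantitative response, accuracy is replaced by an error measure such as root mean squared error (RMSE). The sketch below compares an intercept-only null model with a multiple regression. The predictors are again an assumed, illustrative choice, and rmse() is a helper defined here, not a function from any package.

```r
library(NHANES)

# Complete cases for the response and a few illustrative predictors (assumed choice).
hours <- na.omit(NHANES[, c("SleepHrsNight", "Age", "Gender", "BMI")])

# Null model: predict the mean number of hours for everyone.
mod_null <- lm(SleepHrsNight ~ 1, data = hours)

# Multiple regression on the illustrative predictors.
mod_lm <- lm(SleepHrsNight ~ Age + Gender + BMI, data = hours)

# Helper (defined here): in-sample root mean squared error of a fitted model.
rmse <- function(mod) sqrt(mean(residuals(mod)^2))

c(null = rmse(mod_null), regression = rmse(mod_lm))
```

The coefficients from summary(mod_lm) are in hours of sleep per unit change in each predictor, which helps with the interpretation part of the exercise.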
Repeat either of the previous exercises, but this time first separate the NHANES data set uniformly at random into 75% training and 25% testing sets. Compare the effectiveness of each model on training vs. testing data.
SOLUTION:
library(mdsr)
library(NHANES)
# XX create test and train
# solution goes here
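One way to make the 75%/25% split is to sample row indices without replacement. The sketch below assumes a logistic regression with illustrative predictors, drops rows with missing values before splitting (a simplification), and uses an arbitrary seed so that the split is reproducible. The accuracy() helper is defined here for illustration.

```r
library(NHANES)

# Complete cases for the response and illustrative predictors (assumed choice).
sleep <- na.omit(NHANES[, c("SleepTrouble", "Age", "Gender", "BMI")])

set.seed(1)  # arbitrary seed, only for reproducibility
n <- nrow(sleep)
train_idx <- sample(n, size = round(0.75 * n))  # uniform sample without replacement
train <- sleep[train_idx, ]
test <- sleep[-train_idx, ]

# Fit on the training set only.
mod <- glm(SleepTrouble ~ Age + Gender + BMI, data = train, family = binomial)

# Helper (defined here): share of correct classifications at a 0.5 threshold.
accuracy <- function(data) {
  prob <- predict(mod, newdata = data, type = "response")
  mean(ifelse(prob > 0.5, "Yes", "No") == data$SleepTrouble)
}

c(train = accuracy(train), test = accuracy(test))
```

A large gap between the two numbers would suggest overfitting to the training data.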
Repeat the first exercise, but for the variable PregnantNow. What did you learn about who is pregnant?
SOLUTION:
library(mdsr)
library(NHANES)
# XX create test and train
# solution goes here
The nasaweather package contains data about tropical storms from 1995-2005. Consider the scatterplot between the wind speed and pressure of these storms shown below.
[Figure: scatterplot of wind speed and pressure for the storms]
The type of storm is present in the data, and four types are given: extratropical, hurricane, tropical depression, and tropical storm. There are complicated and not terribly precise definitions for storm type. Build a classifier for the type of each storm as a function of its wind speed and pressure.
Why would a decision tree make a particularly good classifier for these data? Visualize your classifier in the data space in a manner similar to Figure 8.10 or 8.11.
SOLUTION:
library(mdsr)
library(nasaweather)
# solution goes here
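A minimal sketch of the classifier, assuming the rpart package (not mentioned in the exercise) is used to fit the decision tree with its default settings. The storms data frame and its type, wind, and pressure columns come from nasaweather. The type column is converted to a factor so that rpart fits a classification tree.

```r
library(nasaweather)
library(rpart)

# Work on a plain data frame copy and make the response a factor.
storms_df <- as.data.frame(storms)
storms_df$type <- factor(storms_df$type)

# Classification tree for storm type from wind speed and pressure.
mod_tree <- rpart(type ~ wind + pressure, data = storms_df)

# Predicted class for each storm observation (on the training data).
pred <- predict(mod_tree, type = "class")

# Overall accuracy and confusion matrix.
mean(pred == storms_df$type)
table(actual = storms_df$type, predicted = pred)
```

Printing mod_tree lists the split thresholds, which can be drawn as horizontal and vertical lines over the scatterplot to show the classifier in the data space.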
Fit a series of supervised learning models to predict arrival delays for flights from New York to SFO using the nycflights13 package. How do the conclusions change from the multiple regression model presented in the Statistical Foundations chapter?
SOLUTION:
library(mdsr)
library(nycflights13)
# solution goes here
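As a hedged starting point, the sketch below restricts the flights table to SFO arrivals and compares a multiple regression with a regression tree on in-sample RMSE. The predictors (hour, origin, carrier, month) are an assumed choice and may not match the model in the Statistical Foundations chapter. The rpart package and the rmse() helper are also assumptions introduced for illustration.

```r
library(nycflights13)
library(rpart)

# Flights to SFO with an observed arrival delay.
sfo <- subset(as.data.frame(flights), dest == "SFO" & !is.na(arr_delay))
sfo$origin <- factor(sfo$origin)
sfo$carrier <- factor(sfo$carrier)

# Illustrative model formula (assumed predictors).
form <- arr_delay ~ hour + origin + carrier + month

mod_lm <- lm(form, data = sfo)       # multiple regression
mod_tree <- rpart(form, data = sfo)  # regression tree

# Helper (defined here): root mean squared error of a vector of predictions.
rmse <- function(pred) sqrt(mean((sfo$arr_delay - pred)^2))

c(regression = rmse(fitted(mod_lm)), tree = rmse(predict(mod_tree)))
```

Other models from the chapter can be scored with the same rmse() helper, ideally on a held-out testing set instead of the training data.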
Use the College Scorecard Data (https://collegescorecard.ed.gov/data) to model student debt as a function of institutional characteristics using the techniques described in this chapter. Compare and contrast results from at least three methods. (Note that a considerable amount of data wrangling will be needed.)
SOLUTION:
library(mdsr)
# solution goes here