Statistical foundations exercises

Introduction

These exercises are taken from the statistical foundations chapter of Modern Data Science with R (http://mdsr-book.github.io). Other materials for instructors relevant to this chapter (sample activities, overview video) can be found there.

Gestation

Calculate and interpret a 95% confidence interval for the mean age of mothers from the classic Gestation data set from the mosaicData package.

SOLUTION:

library(mdsr)   
glimpse(Gestation)

## Observations: 1,236
## Variables: 23
## $ id        <int> 15, 20, 58, 61, 72, 100, 102, 129, 142, 148, 164, 17...
## $ pluralty  <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ outcome   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ date      <int> 1411, 1499, 1576, 1504, 1425, 1673, 1449, 1562, 1408...
## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351...
## $ sex       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ wt        <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 14...
## $ parity    <int> 1, 2, 1, 2, 1, 4, 4, 2, 3, 3, 2, 4, 3, 5, 3, 4, 3, 3...
## $ race      <int> 8, 0, 0, 0, 0, 0, 7, 7, 0, 0, 0, 0, 0, 8, 7, 7, 4, 3...
## $ age       <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, ...
## $ ed        <int> 5, 5, 2, 5, 5, 2, 2, 1, 4, 5, 5, 2, 1, 5, 2, 2, 7, 2...
## $ ht        <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, ...
## $ wt.1      <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120...
## $ drace     <fctr> 8, 0, 5, 3, 0, 3, 7, 7, 3, 0, 5, 0, 5, 0, 7, 7, 7, ...
## $ dage      <int> 31, 38, 32, 43, 24, 28, 37, 23, 26, 34, 28, 36, 28, ...
## $ ded       <int> 5, 5, 1, 4, 5, 2, 4, 4, 1, 5, 4, 1, 2, 5, 0, 0, 1, 2...
## $ dht       <int> 65, 70, NA, 68, NA, 64, NA, 71, 70, NA, NA, 74, NA, ...
## $ dwt       <int> 110, 148, NA, 197, NA, 130, NA, 192, 180, NA, NA, 18...
## $ marital   <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ inc       <int> 1, 4, 2, 8, 1, 4, NA, 2, 2, 2, NA, 2, 2, 2, 1, 1, 1,...
## $ smoke     <int> 0, 0, 1, 3, 1, 2, 0, 0, 0, 1, 3, 1, 1, 1, 0, 0, 1, 1...
## $ time      <int> 0, 0, 1, 5, 1, 2, 0, 0, 0, 1, 4, 1, 1, 1, 0, 0, 1, 1...
## $ number    <int> 0, 0, 1, 5, 5, 2, 0, 0, 0, 4, 2, 1, 1, 2, 0, 0, 5, 5...

# solution goes here
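One possible starting point, offered as a sketch rather than the intended solution: a t-based interval computed from the sample mean and its standard error. The helper name mean_ci is hypothetical, and the code assumes the age column shown in the glimpse() output above, with missing values simply dropped.

```r
# Hypothetical helper: t-based confidence interval for a mean, ignoring NAs.
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]
  n <- length(x)
  se <- sd(x) / sqrt(n)                          # standard error of the mean
  t_star <- qt(1 - (1 - level) / 2, df = n - 1)  # t multiplier for this level
  c(lower = mean(x) - t_star * se, upper = mean(x) + t_star * se)
}

# Apply to the mothers' ages only if mosaicData is installed (assumption).
if (requireNamespace("mosaicData", quietly = TRUE)) {
  print(mean_ci(mosaicData::Gestation$age))
}
```

The result should agree with t.test(Gestation$age)$conf.int, which is a convenient cross-check. The interpretation still has to be written in words: the interval describes plausible values for the population mean age, not the range of individual ages.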

More gestation

Use the bootstrap to generate and interpret a 95% confidence interval for the median age of mothers for the classic Gestation data set from the mosaicData package.

SOLUTION:

library(mdsr)   
# solution goes here
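A minimal sketch of a percentile bootstrap for the median, written in base R rather than with the mosaic resampling helpers. The function name boot_median_ci and the default of 2000 resamples are assumptions made for illustration.

```r
# Hypothetical helper: percentile bootstrap interval for a median.
boot_median_ci <- function(x, n_boot = 2000, level = 0.95) {
  x <- x[!is.na(x)]
  # Resample with replacement and record the median of each resample.
  medians <- replicate(n_boot, median(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(medians, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Apply to the mothers' ages only if mosaicData is installed (assumption).
if (requireNamespace("mosaicData", quietly = TRUE)) {
  set.seed(2017)  # arbitrary seed so the resampling is reproducible
  print(boot_median_ci(mosaicData::Gestation$age))
}
```

Because age is recorded in whole years, the bootstrap medians take only a few distinct values, so the endpoints of the interval are likely to land on integers or half-integers.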

Even more gestation

Use the bootstrap to generate a 95% confidence interval for the regression parameters in a model for weight as a function of age for the Gestation data frame from the mosaicData package.

SOLUTION:

library(mdsr)   
# solution goes here
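One way to approach this, sketched under assumptions: resample whole rows (a case-resampling bootstrap), refit the model each time, and take percentile intervals of the coefficients. The helper boot_lm_ci is hypothetical, and the formula wt ~ age assumes that the wt column in the glimpse() output is the weight the exercise refers to.

```r
# Hypothetical helper: case-resampling bootstrap for lm() coefficients.
boot_lm_ci <- function(formula, data, n_boot = 1000, level = 0.95) {
  # Keep only rows that are complete for the variables in the model.
  data <- data[complete.cases(data[all.vars(formula)]), , drop = FALSE]
  coefs <- replicate(n_boot, {
    idx <- sample(nrow(data), replace = TRUE)
    coef(lm(formula, data = data[idx, , drop = FALSE]))
  })
  alpha <- 1 - level
  # One row per coefficient, one column per percentile.
  t(apply(coefs, 1, quantile, probs = c(alpha / 2, 1 - alpha / 2)))
}

# Apply to the Gestation data only if mosaicData is installed (assumption).
if (requireNamespace("mosaicData", quietly = TRUE)) {
  set.seed(2017)  # arbitrary seed so the resampling is reproducible
  print(boot_lm_ci(wt ~ age, mosaicData::Gestation))
}
```

Comparing the result with confint(lm(wt ~ age, data = Gestation)) shows how closely the bootstrap and the usual normal-theory intervals agree for these data.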

Confidence intervals

We saw that a 95% confidence interval for a mean was constructed by taking the estimate and adding and subtracting approximately two standard errors. How many standard errors should be used if a 99% confidence interval is desired? (Hint: see xqnorm().)

SOLUTION:

library(mdsr)   
# solution goes here
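As a rough check on the hint, the multiplier can be read off the standard normal quantile function. The sketch below uses base R's qnorm() rather than xqnorm() from the hint, so it needs no extra packages; the helper name z_multiplier is hypothetical.

```r
# Hypothetical helper: normal multiplier for a two-sided interval.
z_multiplier <- function(level) {
  qnorm(1 - (1 - level) / 2)  # upper-tail quantile leaves (1 - level) / 2 in each tail
}

z_multiplier(0.95)  # about 1.96, the "two" in the rule of thumb
z_multiplier(0.99)  # about 2.58
```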

Twins

In 2010, the Minnesota Twins played their first season at Target Field. However, up through 2009, the Twins played at the Metrodome (an indoor stadium). In the Metrodome, air ventilator fans are used both to keep the roof up and to ventilate the stadium. Typically, the air is blown from all directions into the center of the stadium.

According to a retired supervisor in the Metrodome, in the late innings of some games the fans would be modified so that the ventilation air would blow out from home plate toward the outfield. The idea is that the air flow might increase the length of a fly ball. To see if manipulating the fans could possibly make any difference, a group of students at the University of Minnesota and their professor built a ‘cannon’ that used compressed air to shoot baseballs. They then did the following experiment.

Shoot balls at angles around 50 degrees with velocity of around 150 feet per second.
Shoot balls under two different settings: headwind (air blowing from outfield toward home plate) or tailwind (air blowing from home plate toward outfield).
Record other variables: weight of the ball (in grams), diameter of the ball (in cm), and distance of the ball’s flight (in feet).

Background: People who know little or nothing about baseball might find these basic facts useful. The batter stands near “home plate” and tries to hit the ball toward the outfield. A “fly ball” refers to a ball that is hit into the air. It is desirable to hit the ball as far as possible. For reasons of basic physics, the distance is maximized when the ball is hit at an intermediate angle: 45 degrees from the horizontal in the absence of air resistance, and somewhat flatter once drag is taken into account.

The variables are described in the following table.

Cond: the wind conditions, a categorical variable with levels Headwind, Tailwind
Angle: the angle of ball's trajectory
Velocity: velocity of ball in feet per second
BallWt: weight of ball in grams
BallDia: diameter of ball in inches
Dist: distance in feet of the flight of the ball

Here is the output of several models.

> lm1 <- lm(Dist ~ Cond, data=ds)  # FIRST MODEL

> summary(lm1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  350.768      2.179 160.967   <2e-16 ***
CondTail       5.865      3.281   1.788   0.0833 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 9.499 on 32 degrees of freedom
Multiple R-squared: 0.0908,     Adjusted R-squared: 0.06239
F-statistic: 3.196 on 1 and 32 DF,  p-value: 0.0833

> confint(lm1)
                2.5 %    97.5 %
(Intercept) 346.32966 355.20718
CondTail     -0.81784  12.54766

> # SECOND MODEL
> lm2 <- lm(Dist ~ Cond + Velocity + Angle + BallWt + BallDia, data=ds)
> summary(lm2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 181.7443   335.6959   0.541  0.59252
CondTail      7.6705     2.4593   3.119  0.00418 **
Velocity      1.7284     0.5433   3.181  0.00357 **
Angle        -1.6014     1.7995  -0.890  0.38110
BallWt       -3.9862     2.6697  -1.493  0.14659
BallDia     190.3715    62.5115   3.045  0.00502 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.805 on 28 degrees of freedom
Multiple R-squared: 0.5917,     Adjusted R-squared: 0.5188
F-statistic: 8.115 on 5 and 28 DF,  p-value: 7.81e-05

> confint(lm2)
                   2.5 %     97.5 %
(Intercept) -505.8974691 869.386165
CondTail       2.6328174  12.708166
Velocity       0.6155279   2.841188
Angle         -5.2874318   2.084713
BallWt        -9.4549432   1.482457
BallDia       62.3224999 318.420536

Consider the results from the model of Dist as a function of Cond (first model). Briefly summarize what this model says about the relationship between the wind conditions and the distance travelled by the ball. Make sure to say something sensible about the strength of evidence that there is any relationship at all.

SOLUTION:
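To connect the summary() and confint() output for the first model, the interval for CondTail can be reproduced from the printed estimate, standard error, and residual degrees of freedom. This is a sketch only; the helper name coef_ci is hypothetical and the inputs are the rounded values shown above.

```r
# Hypothetical helper: t-based interval for a single regression coefficient.
coef_ci <- function(estimate, se, df, level = 0.95) {
  t_star <- qt(1 - (1 - level) / 2, df = df)
  c(lower = estimate - t_star * se, upper = estimate + t_star * se)
}

# CondTail in the first model: estimate 5.865, SE 3.281, 32 residual df.
coef_ci(5.865, 3.281, df = 32)  # roughly -0.82 to 12.55
```

The interval includes 0, which matches the p-value of 0.0833: the estimated tailwind advantage of about 6 feet is not clearly distinguishable from no difference in this model.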

More Twins

Briefly summarize the model that has Dist as the response variable and includes the other variables as explanatory variables (second model) by reporting and interpreting the CondTail parameter. This second model suggests a somewhat different result for the relationship between Dist and Cond. Summarize the differences and explain in statistical terms why the inclusion of the other explanatory variables has affected the results.

SOLUTION:
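A side-by-side look at the CondTail test in the two models may help frame the comparison. The sketch below recomputes the t statistics and two-sided p-values from the printed estimates and standard errors; the helper name coef_test is hypothetical.

```r
# Hypothetical helper: t statistic and two-sided p-value for one coefficient.
coef_test <- function(estimate, se, df) {
  t_value <- estimate / se
  c(t = t_value, p = 2 * pt(-abs(t_value), df = df))
}

first  <- coef_test(5.865,  3.281,  df = 32)  # Dist ~ Cond
second <- coef_test(7.6705, 2.4593, df = 28)  # Dist ~ Cond + other covariates
rbind(first, second)
```

The standard error of CondTail falls from 3.281 to 2.4593 as the residual standard error falls from 9.499 to 6.805, and the estimate itself shifts from 5.865 to 7.6705. Both changes come from accounting for the other covariates, which is the point the written answer should address.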

Smoking and mortality

The Whickham data set in the mosaicData package includes data on age, smoking, and mortality from a one-in-six survey of the electoral roll in Whickham, a mixed urban and rural district near Newcastle upon Tyne, in the United Kingdom. The survey was conducted in 1972-1974 to study heart disease and thyroid disease. A follow-up on those in the survey was conducted twenty years later. Describe the association between smoking status and mortality in this study. Be sure to consider the role of age as a possible confounding factor.

SOLUTION:

library(mdsr)   
Whickham <- mutate(Whickham,
  agegrp = cut(age, breaks=c(1, 44, 64, 100),
  labels=c("18-44", "45-64", "65+")))
glimpse(Whickham)

## Observations: 1,314
## Variables: 4
## $ outcome <fctr> Alive, Alive, Dead, Alive, Alive, Alive, Alive, Dead,...
## $ smoker  <fctr> Yes, Yes, Yes, No, No, Yes, Yes, No, No, No, No, Yes,...
## $ age     <int> 23, 18, 71, 67, 64, 38, 45, 76, 28, 27, 28, 34, 20, 72...
## $ agegrp  <fctr> 18-44, 18-44, 65+, 65+, 45-64, 18-44, 45-64, 65+, 18-...

# solution goes here
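One way to begin, sketched with base R only: compare the crude proportion who died by smoking status with the same proportion computed within age groups. The helper name death_rate_by is hypothetical; it assumes an outcome column with the value "Dead", as in the glimpse() output above.

```r
# Hypothetical helper: proportion dead within each combination of grouping variables.
death_rate_by <- function(data, by) {
  data$dead <- data$outcome == "Dead"
  aggregate(data["dead"], by = data[by], FUN = mean)
}

# Apply to the Whickham data only if mosaicData is installed (assumption).
if (requireNamespace("mosaicData", quietly = TRUE)) {
  w <- mosaicData::Whickham
  w$agegrp <- cut(w$age, breaks = c(1, 44, 64, 100),
                  labels = c("18-44", "45-64", "65+"))
  print(death_rate_by(w, "smoker"))               # crude comparison
  print(death_rate_by(w, c("agegrp", "smoker")))  # stratified by age group
}
```

If the crude and age-stratified comparisons point in different directions, that is a sign that age is confounding the association, since age is related to both smoking status and mortality.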

Missing income

A data scientist working for a company that sells mortgages for new home purchases might be interested in determining what factors might be predictive of defaulting on the loan. Some of the borrowers in the data set have missing income. Would it be reasonable for the analyst to drop these loans from their analytic data set? Explain.

SOLUTION:

Missing data and NHANES

The NHANES data set in the NHANES package includes survey data collected by the U.S. National Center for Health Statistics (NCHS), which has conducted a series of health and nutrition surveys since the early 1960s. An investigator is interested in fitting a model to predict the probability that a female subject will have a diagnosis of diabetes. Predictors for this model include age and BMI. Imagine that only 1/10 of the data are available but that these data are sampled randomly from the full set of observations (this mechanism is called “Missing Completely at Random”, or MCAR). What implications will this sampling have on the results?

SOLUTION:

library(mdsr)   
library(NHANES)
# solution goes here
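A simulation along the following lines can make the MCAR scenario concrete: draw a random tenth of the rows, refit the model, and compare the coefficients and standard errors with the full-data fit. This is a sketch under assumptions: the helper name mcar_subsample is hypothetical, and the column names Gender, Age, BMI, and Diabetes are assumed to match the NHANES data frame.

```r
# Hypothetical helper: keep a simple random sample of the rows (MCAR).
mcar_subsample <- function(data, frac = 0.1) {
  keep <- sample(nrow(data), size = round(frac * nrow(data)))
  data[keep, , drop = FALSE]
}

# Compare full-data and one-tenth fits only if NHANES is installed (assumption).
if (requireNamespace("NHANES", quietly = TRUE)) {
  women <- subset(NHANES::NHANES, Gender == "female")
  set.seed(2017)  # arbitrary seed so the subsample is reproducible
  full_fit <- glm(Diabetes ~ Age + BMI, data = women, family = binomial)
  sub_fit  <- glm(Diabetes ~ Age + BMI, data = mcar_subsample(women),
                  family = binomial)
  print(cbind(full = coef(full_fit), tenth = coef(sub_fit)))
  print(cbind(full  = coef(summary(full_fit))[, "Std. Error"],
              tenth = coef(summary(sub_fit))[, "Std. Error"]))
}
```

Under MCAR the subsample is itself a random sample, so the estimates should scatter around the full-data values while the standard errors grow, by roughly a factor of the square root of 10.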

Missing data and NHANES 2

Imagine that only 1/10 of the data are available but that these data are sampled from the full set of observations such that missingness depends on age, with older subjects less likely to be observed than younger subjects (this mechanism is called “Covariate Dependent Missingness”, or CDM). What implications will this sampling have on the results?

SOLUTION:

Missing data and NHANES 3

Imagine that only 1/10 of the data are available but that these data are sampled from the full set of observations such that missingness depends on diabetes status (this mechanism is called “Non-Ignorable Non-Response”, or NINR). What implications will this sampling have on the results?

SOLUTION:

Nicholas Horton (nhorton@amherst.edu)

September 2, 2017
