These exercises are taken from the statistical foundations chapter of Modern Data Science with R: http://mdsr-book.github.io. Other materials for instructors relevant to this chapter (sample activities, overview video) can be found there.
Calculate and interpret a 95% confidence interval for the mean age of mothers from the classic Gestation data set from the mosaicData package.
SOLUTION:
library(mdsr)
glimpse(Gestation)
## Observations: 1,236
## Variables: 23
## $ id <int> 15, 20, 58, 61, 72, 100, 102, 129, 142, 148, 164, 17...
## $ pluralty <int> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5...
## $ outcome <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ date <int> 1411, 1499, 1576, 1504, 1425, 1673, 1449, 1562, 1408...
## $ gestation <int> 284, 282, 279, NA, 282, 286, 244, 245, 289, 299, 351...
## $ sex <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ wt <int> 120, 113, 128, 123, 108, 136, 138, 132, 120, 143, 14...
## $ parity <int> 1, 2, 1, 2, 1, 4, 4, 2, 3, 3, 2, 4, 3, 5, 3, 4, 3, 3...
## $ race <int> 8, 0, 0, 0, 0, 0, 7, 7, 0, 0, 0, 0, 0, 8, 7, 7, 4, 3...
## $ age <int> 27, 33, 28, 36, 23, 25, 33, 23, 25, 30, 27, 32, 23, ...
## $ ed <int> 5, 5, 2, 5, 5, 2, 2, 1, 4, 5, 5, 2, 1, 5, 2, 2, 7, 2...
## $ ht <int> 62, 64, 64, 69, 67, 62, 62, 65, 62, 66, 68, 64, 63, ...
## $ wt.1 <int> 100, 135, 115, 190, 125, 93, 178, 140, 125, 136, 120...
## $ drace <fctr> 8, 0, 5, 3, 0, 3, 7, 7, 3, 0, 5, 0, 5, 0, 7, 7, 7, ...
## $ dage <int> 31, 38, 32, 43, 24, 28, 37, 23, 26, 34, 28, 36, 28, ...
## $ ded <int> 5, 5, 1, 4, 5, 2, 4, 4, 1, 5, 4, 1, 2, 5, 0, 0, 1, 2...
## $ dht <int> 65, 70, NA, 68, NA, 64, NA, 71, 70, NA, NA, 74, NA, ...
## $ dwt <int> 110, 148, NA, 197, NA, 130, NA, 192, 180, NA, NA, 18...
## $ marital <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ inc <int> 1, 4, 2, 8, 1, 4, NA, 2, 2, 2, NA, 2, 2, 2, 1, 1, 1,...
## $ smoke <int> 0, 0, 1, 3, 1, 2, 0, 0, 0, 1, 3, 1, 1, 1, 0, 0, 1, 1...
## $ time <int> 0, 0, 1, 5, 1, 2, 0, 0, 0, 1, 4, 1, 1, 1, 0, 0, 1, 1...
## $ number <int> 0, 0, 1, 5, 5, 2, 0, 0, 0, 4, 2, 1, 1, 2, 0, 0, 5, 5...
# solution goes here
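One possible approach, offered as a sketch rather than the book's official solution, is a t-based interval computed from the sample mean and its standard error. The helper name `mean_ci()` is hypothetical, and the call on `Gestation$age` assumes the mosaicData package is installed.

```r
# Hypothetical helper (not from the book): t-based confidence interval for a mean.
mean_ci <- function(x, level = 0.95) {
  x <- x[!is.na(x)]                            # drop missing values
  n <- length(x)
  se <- sd(x) / sqrt(n)                        # standard error of the mean
  crit <- qt(1 - (1 - level) / 2, df = n - 1)  # t critical value
  c(lower = mean(x) - crit * se,
    estimate = mean(x),
    upper = mean(x) + crit * se)
}

# Assumes mosaicData is installed; the block is skipped otherwise.
if (requireNamespace("mosaicData", quietly = TRUE)) {
  print(mean_ci(mosaicData::Gestation$age))
  # t.test() reports the same interval and can serve as a cross-check.
  print(t.test(mosaicData::Gestation$age)$conf.int)
}
```

The interval would be read as a range of plausible values for the mean age of mothers in the population these data represent, with the usual caveat that the 95% refers to the procedure and not to any single interval.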
Use the bootstrap to generate and interpret a 95% confidence interval for the median age of mothers for the classic Gestation data set from the mosaicData package.
SOLUTION:
library(mdsr)
# solution goes here
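A minimal sketch of a percentile bootstrap for the median follows. It is one of several reasonable bootstrap intervals and is not necessarily the method the book intends. `boot_median_ci()` and its arguments are hypothetical names, and the usage line assumes mosaicData is installed.

```r
# Hypothetical helper: percentile bootstrap interval for a median.
boot_median_ci <- function(x, n_boot = 2000, level = 0.95) {
  x <- x[!is.na(x)]
  # Resample with replacement at the original sample size and
  # record the median of each resample.
  meds <- replicate(n_boot, median(sample(x, size = length(x), replace = TRUE)))
  alpha <- 1 - level
  quantile(meds, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(1)  # arbitrary seed, for reproducibility only
if (requireNamespace("mosaicData", quietly = TRUE)) {
  print(boot_median_ci(mosaicData::Gestation$age))
}
```

Because age is recorded in whole years, the bootstrap medians take only a few distinct values, so the interval endpoints may coincide with observed ages.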
Use the bootstrap to generate a 95% confidence interval for the regression parameters in a model for weight as a function of age for the Gestation data frame from the mosaicData package.
SOLUTION:
library(mdsr)
# solution goes here
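One way to sketch this is a case-resampling bootstrap: draw rows with replacement, refit the model, and take percentiles of the refitted coefficients. `boot_lm_ci()` is a hypothetical helper. The choice of `wt` as the weight variable is an assumption, since the exercise does not say which weight column is meant.

```r
# Hypothetical helper: percentile bootstrap intervals for lm() coefficients,
# resampling whole rows (cases) with replacement.
boot_lm_ci <- function(formula, data, n_boot = 1000, level = 0.95) {
  # Keep only rows that are complete for the variables in the formula.
  vars <- all.vars(formula)
  data <- data[complete.cases(data[, vars, drop = FALSE]), , drop = FALSE]
  coefs <- replicate(n_boot, {
    rows <- sample(nrow(data), replace = TRUE)
    coef(lm(formula, data = data[rows, , drop = FALSE]))
  })
  alpha <- 1 - level
  # coefs has one row per coefficient and one column per resample.
  t(apply(coefs, 1, quantile, probs = c(alpha / 2, 1 - alpha / 2)))
}

set.seed(1)  # arbitrary seed, for reproducibility only
if (requireNamespace("mosaicData", quietly = TRUE)) {
  # Assumption: `wt` is the weight variable the exercise has in mind.
  print(boot_lm_ci(wt ~ age, data = mosaicData::Gestation))
}
```

The result has one row per coefficient (intercept and slope) and can be compared with `confint()` on the ordinary `lm()` fit.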
We saw that a 95% confidence interval for a mean was constructed by taking the estimate and adding and subtracting two standard deviations. How many standard deviations should be used if a 99% confidence interval is desired? (Hint: see xqnorm().)
SOLUTION:
library(mdsr)
# solution goes here
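As a sketch, the multiplier can be read off the standard normal quantile function. The example uses base R's `qnorm()` rather than `xqnorm()` from the hint, so it runs without extra packages. `z_multiplier()` is a hypothetical helper name.

```r
# Multiplier z such that the central `level` of a standard normal
# distribution lies within +/- z of zero.
z_multiplier <- function(level) qnorm(1 - (1 - level) / 2)

z_multiplier(0.95)  # about 1.96, the source of the "two standard deviations" rule
z_multiplier(0.99)  # about 2.58
```

The "standard deviations" here are those of the sampling distribution of the estimate, that is, standard errors.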
In 2010, the Minnesota Twins played their first season at Target Field. However, up through 2009, the Twins played at the Metrodome (an indoor stadium). In the Metrodome, air ventilator fans are used both to keep the roof up and to ventilate the stadium. Typically, the air is blown from all directions into the center of the stadium.
According to a retired supervisor in the Metrodome, in the late innings of some games the fans would be modified so that the ventilation air would blow out from home plate toward the outfield. The idea is that the air flow might increase the length of a fly ball. To see if manipulating the fans could possibly make any difference, a group of students at the University of Minnesota and their professor built a ‘cannon’ that used compressed air to shoot baseballs. They then did the following experiment.
Background: People who know little or nothing about baseball might find these basic facts useful. The batter stands near “home plate” and tries to hit the ball toward the outfield. A “fly ball” refers to a ball that is hit into the air. It is desirable to hit the ball as far as possible. For reasons of basic physics, the distance is maximized when the ball is hit at an intermediate angle steeper than 45 degrees from the horizontal.
The variables are described in the following table.
Cond: the wind conditions, a categorical variable with levels Headwind, Tailwind
Angle: the angle of ball's trajectory
Velocity: velocity of ball in feet per second
BallWt: weight of ball in grams
BallDia: diameter of ball in inches
Dist: distance in feet of the flight of the ball
Here is the output of several models.
> lm1 <- lm(Dist ~ Cond, data=ds) # FIRST MODEL
> summary(lm1)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  350.768      2.179 160.967   <2e-16 ***
CondTail       5.865      3.281   1.788   0.0833 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.499 on 32 degrees of freedom
Multiple R-squared: 0.0908, Adjusted R-squared: 0.06239
F-statistic: 3.196 on 1 and 32 DF, p-value: 0.0833
> confint(lm1)
                2.5 %    97.5 %
(Intercept) 346.32966 355.20718
CondTail     -0.81784  12.54766
> # SECOND MODEL
> lm2 <- lm(Dist ~ Cond + Velocity + Angle + BallWt + BallDia, data=ds)
> summary(lm2)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 181.7443   335.6959   0.541  0.59252
CondTail      7.6705     2.4593   3.119  0.00418 **
Velocity      1.7284     0.5433   3.181  0.00357 **
Angle        -1.6014     1.7995  -0.890  0.38110
BallWt       -3.9862     2.6697  -1.493  0.14659
BallDia     190.3715    62.5115   3.045  0.00502 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.805 on 28 degrees of freedom
Multiple R-squared: 0.5917, Adjusted R-squared: 0.5188
F-statistic: 8.115 on 5 and 28 DF, p-value: 7.81e-05
> confint(lm2)
                   2.5 %     97.5 %
(Intercept) -505.8974691 869.386165
CondTail       2.6328174  12.708166
Velocity       0.6155279   2.841188
Angle         -5.2874318   2.084713
BallWt        -9.4549432   1.482457
BallDia       62.3224999 318.420536
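The intervals printed by confint() can be reproduced from the estimate, standard error, and residual degrees of freedom in each summary, which may help when interpreting them. The sketch below copies those numbers from the output above. `ci_from_summary()` is a hypothetical helper name.

```r
# Rebuild a t-based confidence interval from values printed by summary(lm).
ci_from_summary <- function(estimate, se, df, level = 0.95) {
  crit <- qt(1 - (1 - level) / 2, df = df)  # t critical value on residual df
  c(lower = estimate - crit * se, upper = estimate + crit * se)
}

ci_from_summary(5.865, 3.281, df = 32)    # CondTail, first model
ci_from_summary(7.6705, 2.4593, df = 28)  # CondTail, second model
```

Up to rounding of the printed inputs, these agree with the CondTail rows of the two confint() tables. The first interval includes zero and the second does not.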
Consider the results from the model of Dist as a function of Cond (first model). Briefly summarize what this model says about the relationship between the wind conditions and the distance travelled by the ball. Make sure to say something sensible about the strength of evidence that there is any relationship at all.
SOLUTION:
Briefly summarize the model that has Dist as the response variable and includes the other variables as explanatory variables (second model) by reporting and interpreting the CondTail parameter. This second model suggests a somewhat different result for the relationship between Dist and Cond. Summarize the differences and explain in statistical terms why the inclusion of the other explanatory variables has affected the results.
SOLUTION:
The Whickham data set in the mosaicData package includes data on age, smoking, and mortality from a one-in-six survey of the electoral roll in Whickham, a mixed urban and rural district near Newcastle upon Tyne, in the United Kingdom. The survey was conducted in 1972-1974 to study heart disease and thyroid disease. A follow-up on those in the survey was conducted twenty years later. Describe the association between smoking status and mortality in this study. Be sure to consider the role of age as a possible confounding factor.
SOLUTION:
library(mdsr)
Whickham <- mutate(Whickham,
  agegrp = cut(age, breaks = c(1, 44, 64, 100),
               labels = c("18-44", "45-64", "65+")))
glimpse(Whickham)
## Observations: 1,314
## Variables: 4
## $ outcome <fctr> Alive, Alive, Dead, Alive, Alive, Alive, Alive, Dead,...
## $ smoker <fctr> Yes, Yes, Yes, No, No, Yes, Yes, No, No, No, No, Yes,...
## $ age <int> 23, 18, 71, 67, 64, 38, 45, 76, 28, 27, 28, 34, 20, 72...
## $ agegrp <fctr> 18-44, 18-44, 65+, 65+, 45-64, 18-44, 45-64, 65+, 18-...
# solution goes here
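A starting point, sketched here in base R and not meant as the full solution, is to compare the proportion who died by smoking status, first overall and then within each age group. `death_rate()` is a hypothetical helper, and the usage lines assume mosaicData is installed.

```r
# Hypothetical helper: proportion dead within each combination of the
# grouping columns named in `by`.
death_rate <- function(data, by) {
  dead <- as.numeric(data$outcome == "Dead")
  tapply(dead, data[by], mean)
}

if (requireNamespace("mosaicData", quietly = TRUE)) {
  w <- mosaicData::Whickham
  w$agegrp <- cut(w$age, breaks = c(1, 44, 64, 100),
                  labels = c("18-44", "45-64", "65+"))
  print(death_rate(w, "smoker"))               # unadjusted comparison
  print(death_rate(w, c("agegrp", "smoker")))  # comparison within age groups
}
```

If the unadjusted comparison and the within-age-group comparisons point in different directions, that suggests confounding by age. The direction and size of any such difference should be read from the output rather than assumed.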
A data scientist working for a company that sells mortgages for new home purchases might be interested in determining what factors might be predictive of defaulting on the loan. Some of the borrowers have a missing income value in the data set. Would it be reasonable for the analyst to drop these loans from their analytic data set? Explain.
SOLUTION:
The NHANES data set in the NHANES package includes survey data collected by the U.S. National Center for Health Statistics (NCHS), which has conducted a series of health and nutrition surveys since the early 1960s. An investigator is interested in fitting a model to predict the probability that a female subject will have a diagnosis of diabetes. Predictors for this model include age and BMI. Imagine that only 1/10 of the data are available but that these data are sampled randomly from the full set of observations (this mechanism is called “Missing Completely at Random”, or MCAR). What implications will this sampling have on the results?
SOLUTION:
library(mdsr)
library(NHANES)
# solution goes here
Imagine that only 1/10 of the data are available but that these data are sampled from the full set of observations such that missingness depends on age, with older subjects less likely to be observed than younger subjects (this mechanism is called “Covariate Dependent Missingness”, or CDM). What implications will this sampling have on the results?
SOLUTION:
Imagine that only 1/10 of the data are available but that these data are sampled from the full set of observations such that missingness depends on diabetes status (this mechanism is called “Non-Ignorable Non-Response”, or NINR). What implications will this sampling have on the results?
SOLUTION:
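The three missingness mechanisms in the preceding exercises (MCAR, CDM, NINR) can be compared with a small simulation. The sketch below uses simulated data, not NHANES. The sample size, the coefficients of the data-generating model, and the observation probabilities are all illustrative assumptions, chosen so that roughly one tenth of the rows are retained under each mechanism.

```r
# Simulated stand-in for the NHANES setting; every numeric value below
# is an illustrative assumption, not an estimate from real data.
set.seed(2024)
n <- 20000
age <- runif(n, 20, 80)
bmi <- rnorm(n, mean = 28, sd = 5)
# Assumed "true" logistic model for diabetes status.
diabetes <- rbinom(n, size = 1, prob = plogis(-8 + 0.05 * age + 0.10 * bmi))
full <- data.frame(diabetes, age, bmi)

# Fit the same logistic regression on any subset and return its coefficients.
fit_coefs <- function(d) {
  coef(glm(diabetes ~ age + bmi, family = binomial, data = d))
}

# Probability that each row is observed under each mechanism.
p_mcar <- rep(0.10, n)                       # identical for everyone
p_cdm  <- 0.19 - 0.18 * (age - 20) / 60      # falls from 0.19 at age 20 to 0.01 at age 80
p_ninr <- ifelse(diabetes == 1, 0.50, 0.05)  # depends on the outcome itself

# Keep each row independently with its observation probability.
keep <- function(prob) full[runif(n) < prob, ]

results <- rbind(
  full = fit_coefs(full),
  MCAR = fit_coefs(keep(p_mcar)),
  CDM  = fit_coefs(keep(p_cdm)),
  NINR = fit_coefs(keep(p_ninr))
)
print(round(results, 3))
```

Under these assumptions, the MCAR and CDM fits should differ from the full-data fit mainly through added sampling noise, because the model conditions on age. Under NINR the slopes stay close to the full-data values while the intercept shifts, so predicted probabilities of diabetes would be distorted. Standard errors, not shown here, grow in all three cases because far fewer rows are used.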