A Packages used in this book
A.1 The mdsr package
The mdsr package contains most of the small data sets used in this book that are not available in other packages.
To install it from CRAN, use install.packages().
To get the latest development version from GitHub, use the install_github() function from the remotes package.
(See Section B.4.1 for more comprehensive information about R package maintenance.)
# this command only needs to be run once
install.packages("mdsr")
# if you want the development version
remotes::install_github("mdsr-book/mdsr")
The list of data sets provided can be retrieved using the data() function.
library(mdsr)
data(package = "mdsr")
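The listing returned by data() can also be captured as a data frame, which makes it easier to search or filter. The following is a minimal sketch: the helper name list_datasets() is hypothetical (it is not part of mdsr), and the built-in datasets package stands in for mdsr so that the code runs without any extra installation.

```r
# Hypothetical helper (not part of mdsr): tabulate the data sets a package ships.
list_datasets <- function(pkg) {
  # data() returns an object whose `results` element is a character matrix
  # with the columns "Package", "LibPath", "Item", and "Title".
  info <- data(package = pkg)$results
  data.frame(
    name = info[, "Item"],
    title = info[, "Title"],
    stringsAsFactors = FALSE
  )
}

# "datasets" ships with R, so this runs without installing anything;
# replace it with "mdsr" once that package is installed.
head(list_datasets("datasets"))
```

Calling list_datasets("mdsr") after installing the package should give the same kind of table for the data sets described in this book.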
The mdsr package includes some functions that simplify a number of tasks.
In particular, the dbConnect_scidb() function provides a shorthand for connecting to the public SQL server hosted by Amazon Web Services.
We use this function extensively in Chapter 15 and in our classes and projects.
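A typical use of dbConnect_scidb() might look like the sketch below. It rests on several assumptions: the helper list_scidb_tables() is hypothetical (it is not part of mdsr), the database name "airlines" is assumed to exist on the server, and the mdsr and DBI packages must be installed with the server reachable from your machine.

```r
# Hypothetical helper (not part of mdsr): connect, list tables, always disconnect.
list_scidb_tables <- function(dbname) {
  con <- mdsr::dbConnect_scidb(dbname)
  # Close the connection even if listing the tables fails.
  on.exit(DBI::dbDisconnect(con), add = TRUE)
  DBI::dbListTables(con)
}

# This needs network access, so it only runs in an interactive session.
# The database name "airlines" is an assumption.
if (interactive()) {
  list_scidb_tables("airlines")
}
```

Registering the disconnect with on.exit() is a design choice rather than a requirement: it keeps connections to the shared server from being left open when an error occurs.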
In keeping with best practices, mdsr no longer loads any other packages.
In every chapter in this book, a call to library(tidyverse) precedes a call to library(mdsr).
These two steps will set up an R session to replicate the code in the book.
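One way to make this setup more robust is to check that both packages are installed before loading them. This is only a sketch: the helper missing_packages() is hypothetical, and the choice to print a message rather than install automatically is an assumption about what a reader would want.

```r
# Hypothetical helper: report which of the requested packages are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "mdsr")
to_install <- missing_packages(needed)

if (length(to_install) == 0) {
  # Same order as in the book: tidyverse first, then mdsr.
  library(tidyverse)
  library(mdsr)
} else {
  message("Install first: ", paste(to_install, collapse = ", "))
}
```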
A.2 Other packages
As we discuss in Chapters 1 and 21, this book is not explicitly about “big data”—it is about mastering data science techniques for small and medium data with an eye towards big data. To that end, we need medium-sized data sets to work with. We have introduced several such data sets in this book, namely airlines, fec12, and fec16.
The airlines package, which was inspired by the nycflights13 package, gives R users the ability to download the full 33 years (and counting) of flight data from the United States Bureau of Transportation Statistics and bring it seamlessly into SQL without having to write any SQL code. It is built on the etl framework, which the macleish package also uses to retrieve hourly-updated weather data from the MacLeish field station.
The full list of packages used in this book appears below in Table A.1 and, for packages hosted on GitHub, Table A.2.
Package | Citation | Title |
---|---|---|
alr4 | (Weisberg 2018) | Data to Accompany Applied Linear Regression 4th Edition |
ape | (Paradis et al. 2021) | Analyses of Phylogenetics and Evolution |
assertthat | (Hadley Wickham 2019a) | Easy Pre and Post Assertions |
available | (Ganz et al. 2019) | Check if the Title of a Package is Available, Appropriate and Interesting |
babynames | (Hadley Wickham 2021a) | US Baby Names 1880-2017 |
bench | (Hester 2020) | High Precision Timing of R Expressions |
biglm | (Lumley 2020) | Bounded Memory Linear and Generalized Linear Models |
bookdown | (Yihui Xie 2021a) | Authoring Books and Technical Documents with R Markdown |
broom | (Robinson, Hayes, and Couch 2021) | Convert Statistical Objects into Tidy Tibbles |
DBI | (R Special Interest Group on Databases (R-SIG-DB), Wickham, and Müller 2021) | R Database Interface |
dbplyr | (Hadley Wickham, Girlich, and Ruiz 2021) | A ‘dplyr’ Back End for Databases |
discrim | (Kuhn 2021) | Model Wrappers for Discriminant Analysis |
dplyr | (Hadley Wickham, François, et al. 2021) | A Grammar of Data Manipulation |
DT | (Yihui Xie, Cheng, and Tan 2021) | A Wrapper of the JavaScript Library ‘DataTables’ |
dygraphs | (Vanderkam et al. 2018) | Interface to ‘Dygraphs’ Interactive Time Series Charting Library |
etl | (Benjamin S. Baumer 2021) | Extract-Transform-Load Framework for Medium Data |
extrafont | (Winston Chang 2014) | Tools for using fonts |
forcats | (Hadley Wickham 2021b) | Tools for Working with Categorical Variables (Factors) |
fs | (Hester and Wickham 2020) | Cross-Platform File System Operations Based on ‘libuv’ |
furrr | (Vaughan and Dancho 2021) | Apply Mapping Functions in Parallel using Futures |
future | (Bengtsson 2020) | Unified Parallel and Distributed Processing in R for Everyone |
ggmosaic | (Jeppson, Hofmann, and Cook 2021) | Mosaic Plots in the ‘ggplot2’ Framework |
ggplot2 | (Hadley Wickham, Chang, et al. 2021) | Create Elegant Data Visualisations Using the Grammar of Graphics |
ggraph | (Pedersen 2021) | An Implementation of Grammar of Graphics for Graphs and Networks |
ggrepel | (Slowikowski 2021) | Automatically Position Non-Overlapping Text Labels with ‘ggplot2’ |
ggspatial | (Dunnington 2021) | Spatial Data Framework for ggplot2 |
ggthemes | (Arnold 2021) | Extra Themes, Scales and Geoms for ‘ggplot2’ |
glmnet | (Friedman et al. 2021) | Lasso and Elastic-Net Regularized Generalized Linear Models |
googlesheets4 | (Bryan 2021) | Access Google Sheets using the Sheets API V4 |
haven | (Hadley Wickham and Miller 2021) | Import and Export ‘SPSS’, ‘Stata’ and ‘SAS’ Files |
here | (Müller 2020) | A Simpler Way to Find Your Files |
Hmisc | (Harrell 2021) | Harrell Miscellaneous |
htmlwidgets | (Vaidyanathan et al. 2020) | HTML Widgets for R |
igraph | (Csárdi et al. 2020) | Network Analysis and Visualization |
janitor | (Firke 2021) | Simple Tools for Examining and Cleaning Dirty Data |
jsonlite | (Ooms 2020) | A Simple and Robust JSON Parser and Generator for R |
kableExtra | (Zhu 2021) | Construct Complex Table with ‘kable’ and Pipe Syntax |
kknn | (Schliep and Hechenbichler 2016) | Weighted k-Nearest Neighbors |
knitr | (Yihui Xie 2021b) | A General-Purpose Package for Dynamic Report Generation in R |
Lahman | (Friendly et al. 2021) | Sean ‘Lahman’ Baseball Database |
lattice | (Sarkar 2021) | Trellis Graphics for R |
lazyeval | (Hadley Wickham 2019c) | Lazy (Non-Standard) Evaluation |
leaflet | (Cheng, Karambelkar, and Xie 2021) | Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library |
lubridate | (Spinu, Grolemund, and Wickham 2021) | Make Dealing with Dates a Little Easier |
macleish | (Benjamin S. Baumer et al. 2020) | Retrieve Data from MacLeish Field Station |
magick | (Ooms 2021) | Advanced Graphics and Image-Processing in R |
mapproj | (McIlroy et al. 2020) | Map Projections |
maps | (Brownrigg 2018) | Draw Geographical Maps |
mclust | (Fraley, Raftery, and Scrucca 2020) | Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation |
mdsr | (Benjamin S. Baumer, Horton, and Kaplan 2021) | Complement to ‘Modern Data Science with R’ |
modelr | (Hadley Wickham 2020b) | Modelling Functions that Work with the Pipe |
mosaic | (Pruim, Kaplan, and Horton 2021a) | Project MOSAIC Statistics and Mathematics Teaching Utilities |
mosaicData | (Pruim, Kaplan, and Horton 2021b) | Project MOSAIC Data Sets |
NeuralNetTools | (Beck 2018) | Visualization and Analysis Tools for Neural Networks |
NHANES | (Pruim 2015) | Data from the US National Health and Nutrition Examination Study |
nycflights13 | (Hadley Wickham 2021c) | Flights that Departed NYC in 2013 |
parsnip | (Kuhn and Vaughan 2021a) | A Common API to Modeling and Analysis Functions |
partykit | (Hothorn and Zeileis 2021) | A Toolkit for Recursive Partytioning |
patchwork | (Pedersen 2020a) | The Composer of Plots |
plotly | (Sievert et al. 2021) | Create Interactive Web Graphics via ‘plotly.js’ |
purrr | (Henry and Wickham 2020) | Functional Programming Tools |
randomForest | (Breiman et al. 2018) | Breiman and Cutler’s Random Forests for Classification and Regression |
RColorBrewer | (Neuwirth 2014) | ColorBrewer Palettes |
Rcpp | (Eddelbuettel et al. 2021) | Seamless R and C++ Integration |
readr | (Hadley Wickham and Hester 2021) | Read Rectangular Text Data |
readxl | (Hadley Wickham and Bryan 2019) | Read Excel Files |
remotes | (Hester et al. 2021) | R Package Installation from Remote Repositories, Including ‘GitHub’ |
renv | (Ushey 2021) | Project Environments |
reticulate | (Ushey, Allaire, and Tang 2021) | Interface to ‘Python’ |
rgdal | (R. Bivand, Keitt, and Rowlingson 2021) | Bindings for the ‘Geospatial’ Data Abstraction Library |
rlang | (Henry and Wickham 2021) | Functions for Base Types and Core R and ‘Tidyverse’ Features |
rmarkdown | (J. Allaire et al. 2021) | Dynamic Documents for R |
RMySQL | (Ooms et al. 2021) | Database Interface and ‘MySQL’ Driver for R |
rpart | (Therneau and Atkinson 2019) | Recursive Partitioning and Regression Trees |
RSQLite | (Müller et al. 2021) | ‘SQLite’ Interface for R |
rvest | (Hadley Wickham 2021d) | Easily Harvest (Scrape) Web Pages |
scales | (Hadley Wickham and Seidel 2020) | Scale Functions for Visualization |
sessioninfo | (Csárdi et al. 2018) | R Session Information |
sf | (Pebesma 2021) | Simple Features for R |
shiny | (Chang et al. 2021) | Web Application Framework for R |
sp | (Pebesma and Bivand 2021) | Classes and Methods for Spatial Data |
sparklyr | (Luraschi et al. 2021) | R Interface to Apache Spark |
stopwords | (Benoit, Muhr, and Watanabe 2021) | Multilingual Stopword Lists |
stringr | (Hadley Wickham 2019d) | Simple, Consistent Wrappers for Common String Operations |
styler | (Müller and Walthert 2021) | Non-Invasive Pretty Printing of R Code |
testthat | (Hadley Wickham 2021e) | Unit Testing for R |
textdata | (Hvitfeldt 2020) | Download and Load Various Text Datasets |
tidycensus | (Walker and Herman 2021) | Load US Census Boundary and Attribute Data as ‘tidyverse’ and ‘sf’-Ready Data Frames |
tidygeocoder | (Cambon et al. 2021) | Geocoding Made Easy |
tidygraph | (Pedersen 2020b) | A Tidy API for Graph Manipulation |
tidymodels | (Kuhn and Wickham 2021) | Easily Install and Load the ‘Tidymodels’ Packages |
tidyr | (Hadley Wickham 2021f) | Tidy Messy Data |
tidytext | (Robinson and Silge 2021) | Text Mining using ‘dplyr’, ‘ggplot2’, and Other Tidy Tools |
tidyverse | (Hadley Wickham 2021g) | Easily Install and Load the ‘Tidyverse’ |
tigris | (Walker 2021) | Load Census TIGER/Line Shapefiles |
tm | (Feinerer and Hornik 2020) | Text Mining Package |
units | (Pebesma et al. 2021) | Measurement Units for R Vectors |
usethis | (Hadley Wickham and Bryan 2021) | Automate Package and Project Setup |
viridis | (Garnier 2021a) | Colorblind-Friendly Color Maps for R |
viridisLite | (Garnier 2021b) | Colorblind-Friendly Color Maps (Lite Version) |
webshot | (Chang 2019) | Take Screenshots of Web Pages |
wordcloud | (Fellows 2018) | Word Clouds |
wru | (Khanna and Imai 2021) | Who are You? Bayesian Prediction of Racial Category Using Surname and Geolocation |
xaringanthemer | (Aden-Buie 2021) | Custom ‘xaringan’ CSS Themes |
xfun | (Yihui Xie 2021c) | Supporting Functions for Packages Maintained by ‘Yihui Xie’ |
xkcd | (Torres-Manzanera 2018) | Plotting ggplot2 Graphics in an XKCD Style |
yardstick | (Kuhn and Vaughan 2021b) | Tidy Characterizations of Model Performance |
Package | GitHub User | Citation | Title |
---|---|---|---|
etude | dtkaplan | (Kaplan 2021) | Utilities for Handling Textbook Exercises with Knitr |
fec12 | baumer-lab | (Tapal, Gahwagy, and Ryan 2021) | Data Package for 2012 Federal Elections |
openrouteservice | GIScience | (Oleś 2021) | Openrouteservice API Client |
streamgraph | hrbrmstr | (Rudis 2019) | Build Streamgraph Visualizations |
A.3 Further resources
More information on the mdsr package can be found at http://www.github.com/mdsr-book/mdsr.