A Packages used in this book

A.1 The mdsr package

The mdsr package contains most of the small data sets used in this book that are not available in other packages. To install the released version from CRAN, use install.packages(). To get the latest development version from GitHub, use the install_github() function from the remotes package. (See Section B.4.1 for more comprehensive information about R package maintenance.)

# this command only needs to be run once
install.packages("mdsr")
# if you want the development version
remotes::install_github("mdsr-book/mdsr")

The list of data sets provided can be retrieved using the data() function.

data(package = "mdsr")
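The object returned by data() can also be inspected programmatically. The following is a minimal sketch rather than code from the book: the helper name list_datasets is hypothetical, and it relies on the results matrix that data() returns, whose Item and Title columns hold the data set names and descriptions.

```r
# Hypothetical helper: return the data sets in a package as a data frame.
list_datasets <- function(pkg) {
  # data() returns an object whose `results` element is a character matrix
  # with columns "Package", "LibPath", "Item", and "Title".
  res <- data(package = pkg)$results
  data.frame(
    name = res[, "Item"],
    title = res[, "Title"],
    stringsAsFactors = FALSE
  )
}

# Usage (assumes mdsr is installed):
# head(list_datasets("mdsr"))
```

Returning a data frame makes it easy to filter or search the titles with the usual tools.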

The mdsr package includes some functions that simplify a number of tasks. In particular, the dbConnect_scidb() function provides a shorthand for connecting to the public SQL server hosted by Amazon Web Services. We use this function extensively in Chapter 15 and in our classes and projects.
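A connection might be opened as in the sketch below. It assumes that the server is reachable, that the database is named "airlines" and contains a flights table (as in the book's SQL chapters), and that the database driver mdsr depends on is installed; none of these is guaranteed by this appendix.

```r
library(tidyverse)
library(mdsr)

# Connect to the "airlines" database on the public server
# (assumed name; requires an internet connection).
db <- dbConnect_scidb("airlines")

# List the available tables, then reference one lazily with dplyr.
DBI::dbListTables(db)
flights <- tbl(db, "flights")

# Close the connection when finished.
DBI::dbDisconnect(db)
```

Because tbl() is lazy, no rows are transferred until a query is collected, so referencing a large table is cheap.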

In keeping with best practices, mdsr no longer loads any other packages. In every chapter in this book, a call to library(tidyverse) precedes a call to library(mdsr). These two steps will set up an R session to replicate the code in the book.
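The two-step setup can be wrapped in a small check, sketched here under the assumption that both packages come from CRAN. The helper missing_packages is hypothetical and is not part of mdsr.

```r
# Hypothetical helper: report which of the given packages are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("tidyverse", "mdsr")
to_install <- missing_packages(needed)

if (length(to_install) == 0) {
  # Load tidyverse first, then mdsr, in the order the book uses.
  library(tidyverse)
  library(mdsr)
} else {
  message("Install first: ", paste(to_install, collapse = ", "))
}
```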

A.2 Other packages

As we discuss in Chapters 1 and 21, this book is not explicitly about “big data”—it is about mastering data science techniques for small and medium data with an eye towards big data. To that end, we need medium-sized data sets to work with. We have introduced several such data sets in this book, namely airlines, fec12, and fec16.

The airlines package, which was inspired by the nycflights13 package and is built on the etl framework, gives R users the ability to download the full 33 years (and counting) of flight data from the United States Bureau of Transportation Statistics and bring it seamlessly into SQL without actually having to write any SQL code. The macleish package also uses the etl framework, in its case for hourly-updated weather data from the MacLeish field station.
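The etl workflow can be sketched with the mtcars example that ships with the etl package itself. This stands in for airlines and macleish: those packages are assumed to define analogous methods, and their arguments (for example, which years to download) may differ. The sketch also assumes that the RSQLite and dbplyr packages are installed.

```r
library(etl)
library(dplyr)

# With no database specified, etl creates a temporary SQLite database.
cars <- etl("mtcars")

# Run the three ETL phases in order: extract, transform, load.
cars %>%
  etl_extract() %>%
  etl_transform() %>%
  etl_load()

# Query the loaded table through dplyr; the SQL is generated for us.
cars %>%
  tbl("mtcars") %>%
  group_by(cyl) %>%
  summarize(n = n(), mean_mpg = mean(mpg, na.rm = TRUE))
```

The final pipeline is translated to SQL and run inside the database, which is how the data can be brought into SQL without the user writing any SQL code.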

The full list of packages used in this book appears below in Tables A.1 and A.2.

Table A.1: List of CRAN packages used in this book.
Package Citation Title
alr4 (Weisberg 2018) Data to Accompany Applied Linear Regression 4th Edition
ape (Paradis et al. 2021) Analyses of Phylogenetics and Evolution
assertthat (Hadley Wickham 2019a) Easy Pre and Post Assertions
available (Ganz et al. 2019) Check if the Title of a Package is Available, Appropriate and Interesting
babynames (Hadley Wickham 2021a) US Baby Names 1880-2017
bench (Hester 2020) High Precision Timing of R Expressions
biglm (Lumley 2020) Bounded Memory Linear and Generalized Linear Models
bookdown (Yihui Xie 2021a) Authoring Books and Technical Documents with R Markdown
broom (Robinson, Hayes, and Couch 2021) Convert Statistical Objects into Tidy Tibbles
DBI (R Special Interest Group on Databases (R-SIG-DB), Wickham, and Müller 2021) R Database Interface
dbplyr (Hadley Wickham, Girlich, and Ruiz 2021) A ‘dplyr’ Back End for Databases
discrim (Kuhn 2021) Model Wrappers for Discriminant Analysis
dplyr (Hadley Wickham, François, et al. 2021) A Grammar of Data Manipulation
DT (Yihui Xie, Cheng, and Tan 2021) A Wrapper of the JavaScript Library ‘DataTables’
dygraphs (Vanderkam et al. 2018) Interface to ‘Dygraphs’ Interactive Time Series Charting Library
etl (Benjamin S. Baumer 2021) Extract-Transform-Load Framework for Medium Data
extrafont (Winston Chang 2014) Tools for using fonts
forcats (Hadley Wickham 2021b) Tools for Working with Categorical Variables (Factors)
fs (Hester and Wickham 2020) Cross-Platform File System Operations Based on ‘libuv’
furrr (Vaughan and Dancho 2021) Apply Mapping Functions in Parallel using Futures
future (Bengtsson 2020) Unified Parallel and Distributed Processing in R for Everyone
ggmosaic (Jeppson, Hofmann, and Cook 2021) Mosaic Plots in the ‘ggplot2’ Framework
ggplot2 (Hadley Wickham, Chang, et al. 2021) Create Elegant Data Visualisations Using the Grammar of Graphics
ggraph (Pedersen 2021) An Implementation of Grammar of Graphics for Graphs and Networks
ggrepel (Slowikowski 2021) Automatically Position Non-Overlapping Text Labels with ‘ggplot2’
ggspatial (Dunnington 2021) Spatial Data Framework for ggplot2
ggthemes (Arnold 2021) Extra Themes, Scales and Geoms for ‘ggplot2’
glmnet (Friedman et al. 2021) Lasso and Elastic-Net Regularized Generalized Linear Models
googlesheets4 (Bryan 2021) Access Google Sheets using the Sheets API V4
haven (Hadley Wickham and Miller 2021) Import and Export ‘SPSS’, ‘Stata’ and ‘SAS’ Files
here (Müller 2020) A Simpler Way to Find Your Files
Hmisc (Harrell 2021) Harrell Miscellaneous
htmlwidgets (Vaidyanathan et al. 2020) HTML Widgets for R
igraph (Csárdi et al. 2020) Network Analysis and Visualization
janitor (Firke 2021) Simple Tools for Examining and Cleaning Dirty Data
jsonlite (Ooms 2020) A Simple and Robust JSON Parser and Generator for R
kableExtra (Zhu 2021) Construct Complex Table with ‘kable’ and Pipe Syntax
kknn (Schliep and Hechenbichler 2016) Weighted k-Nearest Neighbors
knitr (Yihui Xie 2021b) A General-Purpose Package for Dynamic Report Generation in R
Lahman (Friendly et al. 2021) Sean ‘Lahman’ Baseball Database
lattice (Sarkar 2021) Trellis Graphics for R
lazyeval (Hadley Wickham 2019c) Lazy (Non-Standard) Evaluation
leaflet (Cheng, Karambelkar, and Xie 2021) Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library
lubridate (Spinu, Grolemund, and Wickham 2021) Make Dealing with Dates a Little Easier
macleish (Benjamin S. Baumer et al. 2020) Retrieve Data from MacLeish Field Station
magick (Ooms 2021) Advanced Graphics and Image-Processing in R
mapproj (McIlroy et al. 2020) Map Projections
maps (Brownrigg 2018) Draw Geographical Maps
mclust (Fraley, Raftery, and Scrucca 2020) Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation
mdsr (Benjamin S. Baumer, Horton, and Kaplan 2021) Complement to ‘Modern Data Science with R’
modelr (Hadley Wickham 2020b) Modelling Functions that Work with the Pipe
mosaic (Pruim, Kaplan, and Horton 2021a) Project MOSAIC Statistics and Mathematics Teaching Utilities
mosaicData (Pruim, Kaplan, and Horton 2021b) Project MOSAIC Data Sets
NeuralNetTools (Beck 2018) Visualization and Analysis Tools for Neural Networks
NHANES (Pruim 2015) Data from the US National Health and Nutrition Examination Study
nycflights13 (Hadley Wickham 2021c) Flights that Departed NYC in 2013
parsnip (Kuhn and Vaughan 2021a) A Common API to Modeling and Analysis Functions
partykit (Hothorn and Zeileis 2021) A Toolkit for Recursive Partytioning
patchwork (Pedersen 2020a) The Composer of Plots
plotly (Sievert et al. 2021) Create Interactive Web Graphics via ‘plotly.js’
purrr (Henry and Wickham 2020) Functional Programming Tools
randomForest (Breiman et al. 2018) Breiman and Cutler’s Random Forests for Classification and Regression
RColorBrewer (Neuwirth 2014) ColorBrewer Palettes
Rcpp (Eddelbuettel et al. 2021) Seamless R and C++ Integration
readr (Hadley Wickham and Hester 2021) Read Rectangular Text Data
readxl (Hadley Wickham and Bryan 2019) Read Excel Files
remotes (Hester et al. 2021) R Package Installation from Remote Repositories, Including ‘GitHub’
renv (Ushey 2021) Project Environments
reticulate (Ushey, Allaire, and Tang 2021) Interface to ‘Python’
rgdal (R. Bivand, Keitt, and Rowlingson 2021) Bindings for the ‘Geospatial’ Data Abstraction Library
rlang (Henry and Wickham 2021) Functions for Base Types and Core R and ‘Tidyverse’ Features
rmarkdown (J. Allaire et al. 2021) Dynamic Documents for R
RMySQL (Ooms et al. 2021) Database Interface and ‘MySQL’ Driver for R
rpart (Therneau and Atkinson 2019) Recursive Partitioning and Regression Trees
RSQLite (Müller et al. 2021) ‘SQLite’ Interface for R
rvest (Hadley Wickham 2021d) Easily Harvest (Scrape) Web Pages
scales (Hadley Wickham and Seidel 2020) Scale Functions for Visualization
sessioninfo (Csárdi et al. 2018) R Session Information
sf (Pebesma 2021) Simple Features for R
shiny (Chang et al. 2021) Web Application Framework for R
sp (Pebesma and Bivand 2021) Classes and Methods for Spatial Data
sparklyr (Luraschi et al. 2021) R Interface to Apache Spark
stopwords (Benoit, Muhr, and Watanabe 2021) Multilingual Stopword Lists
stringr (Hadley Wickham 2019d) Simple, Consistent Wrappers for Common String Operations
styler (Müller and Walthert 2021) Non-Invasive Pretty Printing of R Code
testthat (Hadley Wickham 2021e) Unit Testing for R
textdata (Hvitfeldt 2020) Download and Load Various Text Datasets
tidycensus (Walker and Herman 2021) Load US Census Boundary and Attribute Data as ‘tidyverse’ and ‘sf’-Ready Data Frames
tidygeocoder (Cambon et al. 2021) Geocoding Made Easy
tidygraph (Pedersen 2020b) A Tidy API for Graph Manipulation
tidymodels (Kuhn and Wickham 2021) Easily Install and Load the ‘Tidymodels’ Packages
tidyr (Hadley Wickham 2021f) Tidy Messy Data
tidytext (Robinson and Silge 2021) Text Mining using ‘dplyr’, ‘ggplot2’, and Other Tidy Tools
tidyverse (Hadley Wickham 2021g) Easily Install and Load the ‘Tidyverse’
tigris (Walker 2021) Load Census TIGER/Line Shapefiles
tm (Feinerer and Hornik 2020) Text Mining Package
units (Pebesma et al. 2021) Measurement Units for R Vectors
usethis (Hadley Wickham and Bryan 2021) Automate Package and Project Setup
viridis (Garnier 2021a) Colorblind-Friendly Color Maps for R
viridisLite (Garnier 2021b) Colorblind-Friendly Color Maps (Lite Version)
webshot (Chang 2019) Take Screenshots of Web Pages
wordcloud (Fellows 2018) Word Clouds
wru (Khanna and Imai 2021) Who are You? Bayesian Prediction of Racial Category Using Surname and Geolocation
xaringanthemer (Aden-Buie 2021) Custom ‘xaringan’ CSS Themes
xfun (Yihui Xie 2021c) Supporting Functions for Packages Maintained by ‘Yihui Xie’
xkcd (Torres-Manzanera 2018) Plotting ggplot2 Graphics in an XKCD Style
yardstick (Kuhn and Vaughan 2021b) Tidy Characterizations of Model Performance

Table A.2: List of GitHub packages used in this book.
Package GitHub User Citation Title
etude dtkaplan (Kaplan 2021) Utilities for Handling Textbook Exercises with Knitr
fec12 baumer-lab (Tapal, Gahwagy, and Ryan 2021) Data Package for 2012 Federal Elections
openrouteservice GIScience (Oleś 2021) Openrouteservice API Client
streamgraph hrbrmstr (Rudis 2019) Build Streamgraph Visualizations
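GitHub packages are installed by passing a "user/repository" string to install_github(). The sketch below builds those strings for three of the packages in Table A.2, on the assumption that each repository is named after its package. That assumption should be checked against GitHub before installing, and openrouteservice is left out because its repository name may differ from the package name.

```r
# Packages from Table A.2 whose repository name is assumed to match
# the package name (an assumption; check each repository before installing).
github_pkgs <- data.frame(
  user = c("dtkaplan", "baumer-lab", "hrbrmstr"),
  package = c("etude", "fec12", "streamgraph"),
  stringsAsFactors = FALSE
)

# Build "user/package" slugs in the form install_github() expects.
slugs <- paste(github_pkgs$user, github_pkgs$package, sep = "/")

# Uncomment to install (requires the remotes package and internet access):
# remotes::install_github(slugs)
```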

A.3 Further resources

More information on the mdsr package can be found at http://www.github.com/mdsr-book/mdsr.