A Packages used in this book

A.1 The mdsr package

The mdsr package contains most of the small data sets used in this book that are not available in other packages. To install the released version from CRAN, use install.packages(). To get the latest development version, use the install_github() function from the remotes package. (See Section B.4.1 for more comprehensive information about R package maintenance.)

# this command only needs to be run once
install.packages("mdsr")
# if you want the development version
remotes::install_github("mdsr-book/mdsr")
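Because installation only needs to happen once, it can be convenient to check first which packages are already present. The following is a minimal sketch, not part of the mdsr package: the helper name missing_packages() is hypothetical, and the actual install call is left commented out so that running the code changes nothing on your system.

```r
# Return the subset of `pkgs` that cannot be found in the current library.
# (missing_packages() is a hypothetical helper written for this example.)
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("mdsr", "remotes"))
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```

Note that requireNamespace() loads a package's namespace when it finds it, but does not attach the package to the search path.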

The list of data sets provided can be retrieved using the data() function.

library(mdsr)
data(package = "mdsr")
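The object returned by data() can also be converted into a data frame for searching or filtering. The sketch below assumes only base R; the helper name list_datasets() is hypothetical, and the demonstration uses the datasets package that ships with R so that it runs even when mdsr is not installed.

```r
# data() returns an object whose `results` element is a character matrix
# with the columns "Package", "LibPath", "Item", and "Title".
# (list_datasets() is a hypothetical helper written for this example.)
list_datasets <- function(pkg) {
  res <- data(package = pkg)$results
  as.data.frame(res[, c("Item", "Title"), drop = FALSE],
                stringsAsFactors = FALSE)
}

# "datasets" ships with R; replace it with "mdsr" once mdsr is installed.
head(list_datasets("datasets"))
```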

The mdsr package includes some functions that simplify a number of tasks. In particular, the dbConnect_scidb() function provides a shorthand for connecting to the public SQL server hosted by Amazon Web Services. We use this function extensively in Chapter 15 and in our classes and projects.
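As a rough illustration of how dbConnect_scidb() might be used, the sketch below opens a connection, lists the available tables, and disconnects. It assumes that the database is named "airlines" and that the server is reachable from your machine; the wrapper connect_airlines() is a hypothetical helper that returns NULL rather than stopping when mdsr is not installed or the connection fails.

```r
# Try to open a connection; return NULL instead of stopping on failure
# (for example, when mdsr is not installed or there is no network access).
# (connect_airlines() is a hypothetical helper written for this example.)
connect_airlines <- function() {
  tryCatch(
    mdsr::dbConnect_scidb("airlines"),  # database name is an assumption
    error = function(e) {
      message("Could not connect: ", conditionMessage(e))
      NULL
    }
  )
}

db <- connect_airlines()
if (!is.null(db)) {
  print(DBI::dbListTables(db))  # names of the tables in the database
  DBI::dbDisconnect(db)         # always close the connection when done
}
```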

In keeping with best practices, mdsr no longer loads any other packages. In every chapter in this book, a call to library(tidyverse) precedes a call to library(mdsr). These two steps will set up an R session to replicate the code in the book.

A.2 Other packages

As we discuss in Chapters 1 and 21, this book is not explicitly about “big data”; it is about mastering data science techniques for small and medium data with an eye towards big data. To that end, we need medium-sized data sets to work with. We have introduced several such data sets in this book through the airlines, fec12, and fec16 packages.

The airlines package, which was inspired by the nycflights13 package and is built on the etl framework, gives R users the ability to download the full 33 years (and counting) of flight data from the United States Bureau of Transportation Statistics and bring it seamlessly into SQL without actually having to write any SQL code. The macleish package also uses the etl framework to retrieve hourly-updated weather data from the MacLeish field station.
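To give a sense of what working with the etl framework looks like, here is a hedged sketch using macleish. It assumes that the etl, macleish, and RSQLite packages are installed and that the field station's data can be downloaded; when either assumption fails, the code leaves weather_db as NULL instead of stopping. By default, etl() stores the data in a SQLite database in a temporary directory.

```r
# Check that the packages this sketch relies on are available.
etl_pkgs <- c("etl", "macleish", "RSQLite")
etl_ready <- all(vapply(etl_pkgs, requireNamespace, logical(1), quietly = TRUE))

weather_db <- NULL
if (etl_ready) {
  library(etl)
  library(macleish)
  # etl_update() runs the extract, transform, and load steps in sequence
  # and returns the populated etl object.
  weather_db <- tryCatch(
    etl_update(etl("macleish")),
    error = function(e) {
      message("ETL step failed: ", conditionMessage(e))
      NULL
    }
  )
}
```

No SQL appears anywhere in this code: the etl object wraps a database connection, and the etl functions populate it.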

The full list of packages used in this book appears below in Tables A.1 and A.2.

Table A.1: List of CRAN packages used in this book.
Package Citation Title
alr3 (Weisberg 2018) Data to Accompany Applied Linear Regression 3rd Edition
ape (Paradis et al. 2020) Analyses of Phylogenetics and Evolution
aRxiv (Karthik Ram and Broman 2019) Interface to the arXiv API
assertthat (Hadley Wickham 2019a) Easy Pre and Post Assertions
available (Ganz et al. 2019) Check if the Title of a Package is Available, Appropriate and Interesting
babynames (Hadley Wickham 2019c) US Baby Names 1880-2017
bench (Hester 2020) High Precision Timing of R Expressions
biglm (Lumley 2020) Bounded Memory Linear and Generalized Linear Models
bigrquery (Hadley Wickham and Bryan 2020a) An Interface to Google’s ‘BigQuery’ ‘API’
bookdown (Yihui Xie 2020a) Authoring Books and Technical Documents with R Markdown
broom (Robinson, Hayes, and Couch 2020) Convert Statistical Objects into Tidy Tibbles
caret (Kuhn 2020a) Classification and Regression Training
DBI (R Special Interest Group on Databases (R-SIG-DB), Wickham, and Müller 2019) R Database Interface
dbplyr (Hadley Wickham and Ruiz 2020) A ‘dplyr’ Back End for Databases
discrim (Kuhn 2020b) Model Wrappers for Discriminant Analysis
dplyr (Hadley Wickham, François, et al. 2020) A Grammar of Data Manipulation
DT (Yihui Xie, Cheng, and Tan 2021) A Wrapper of the JavaScript Library ‘DataTables’
dygraphs (Vanderkam et al. 2018) Interface to ‘Dygraphs’ Interactive Time Series Charting Library
etl (Benjamin S. Baumer 2020) Extract-Transform-Load Framework for Medium Data
extrafont (Chang 2020) Tools for Using Fonts
fec16 (Tapal et al. 2020) Data Package for the 2016 United States Federal Elections
flexdashboard (Iannone, Allaire, and Borges 2020) R Markdown Format for Flexible Dashboards
FlickrAPI (Ando 2019) Access to Flickr API
forcats (Hadley Wickham 2020a) Tools for Working with Categorical Variables (Factors)
fs (Hester and Wickham 2020) Cross-Platform File System Operations Based on ‘libuv’
furrr (Vaughan and Dancho 2020) Apply Mapping Functions in Parallel using Futures
future (Bengtsson 2020) Unified Parallel and Distributed Processing in R for Everyone
GGally (Schloerke et al. 2021) Extension to ‘ggplot2’
gganimate (Pedersen and Robinson 2020) A Grammar of Animated Graphics
ggmosaic (Jeppson, Hofmann, and Cook 2020) Mosaic Plots in the ‘ggplot2’ Framework
ggplot2 (Hadley Wickham, Chang, et al. 2020) Create Elegant Data Visualisations Using the Grammar of Graphics
ggraph (Pedersen 2020a) An Implementation of Grammar of Graphics for Graphs and Networks
ggrepel (Slowikowski 2020) Automatically Position Non-Overlapping Text Labels with ‘ggplot2’
ggspatial (Dunnington 2021) Spatial Data Framework for ggplot2
ggthemes (Arnold 2019a) Extra Themes, Scales and Geoms for ‘ggplot2’
glmnet (Friedman et al. 2020) Lasso and Elastic-Net Regularized Generalized Linear Models
googlesheets4 (Bryan 2020) Access Google Sheets using the Sheets API V4
gutenbergr (Robinson 2020) Download and Process Public Domain Works from Project Gutenberg
haven (Hadley Wickham and Miller 2020) Import and Export ‘SPSS,’ ‘Stata’ and ‘SAS’ Files
here (Müller 2020) A Simpler Way to Find Your Files
Hmisc (Harrell 2020) Harrell Miscellaneous
htmlwidgets (Vaidyanathan et al. 2020) HTML Widgets for R
igraph (Csárdi et al. 2020) Network Analysis and Visualization
janitor (Firke 2021) Simple Tools for Examining and Cleaning Dirty Data
jsonlite (Ooms 2020a) A Simple and Robust JSON Parser and Generator for R
kableExtra (Zhu 2020) Construct Complex Table with ‘kable’ and Pipe Syntax
kknn (Schliep and Hechenbichler 2016) Weighted k-Nearest Neighbors
knitr (Yihui Xie 2020b) A General-Purpose Package for Dynamic Report Generation in R
Lahman (Friendly et al. 2020) Sean ‘Lahman’ Baseball Database
lars (Hastie and Efron 2013) Least Angle Regression, Lasso and Forward Stagewise
lattice (Sarkar 2020) Trellis Graphics for R
lazyeval (Hadley Wickham 2019d) Lazy (Non-Standard) Evaluation
leaflet (Cheng, Karambelkar, and Xie 2021) Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library
lubridate (Spinu, Grolemund, and Wickham 2020) Make Dealing with Dates a Little Easier
macleish (Benjamin S. Baumer et al. 2020) Retrieve Data from MacLeish Field Station
magick (Ooms 2020b) Advanced Graphics and Image-Processing in R
mapproj (McIlroy et al. 2020) Map Projections
maps (Brownrigg 2018) Draw Geographical Maps
mclust (Fraley, Raftery, and Scrucca 2020) Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation
mdsr (Benjamin S. Baumer, Horton, and Kaplan 2020) Complement to ‘Modern Data Science with R’
modelr (Hadley Wickham 2020c) Modelling Functions that Work with the Pipe
mosaic (Pruim, Kaplan, and Horton 2020a) Project MOSAIC Statistics and Mathematics Teaching Utilities
mosaicData (Pruim, Kaplan, and Horton 2020b) Project MOSAIC Data Sets
network (Butts 2020a) Classes for Relational Data
NeuralNetTools (Beck 2018) Visualization and Analysis Tools for Neural Networks
NHANES (Pruim 2015) Data from the US National Health and Nutrition Examination Study
nycflights13 (Hadley Wickham 2019e) Flights that Departed NYC in 2013
packrat (Ushey et al. 2018) A Dependency Management System for Projects and their R Package Dependencies
palmerpenguins (Horst, Hill, and Gorman 2020) Palmer Archipelago (Antarctica) Penguin Data
parsnip (Kuhn and Vaughan 2020a) A Common API to Modeling and Analysis Functions
partykit (Hothorn and Zeileis 2020) A Toolkit for Recursive Partytioning
patchwork (Pedersen 2020b) The Composer of Plots
plotly (Sievert et al. 2020) Create Interactive Web Graphics via ‘plotly.js’
purrr (Henry and Wickham 2020a) Functional Programming Tools
randomForest (Breiman et al. 2018) Breiman and Cutler’s Random Forests for Classification and Regression
RColorBrewer (Neuwirth 2014) ColorBrewer Palettes
Rcpp (Eddelbuettel et al. 2020) Seamless R and C++ Integration
RCurl (Temple Lang 2020) General Network (HTTP/FTP/…) Client Interface for R
readr (Hadley Wickham and Hester 2020) Read Rectangular Text Data
readxl (Hadley Wickham and Bryan 2019) Read Excel Files
remotes (Hester et al. 2020) R Package Installation from Remote Repositories, Including ‘GitHub’
renv (Ushey 2021) Project Environments
reticulate (Ushey, Allaire, and Tang 2020) Interface to ‘Python’
rgdal (R. Bivand, Keitt, and Rowlingson 2021) Bindings for the ‘Geospatial’ Data Abstraction Library
rlang (Henry and Wickham 2020b) Functions for Base Types and Core R and ‘Tidyverse’ Features
rmarkdown (J. Allaire, Xie, et al. 2020) Dynamic Documents for R
RMySQL (Ooms et al. 2020) Database Interface and ‘MySQL’ Driver for R
rpart (Therneau and Atkinson 2019) Recursive Partitioning and Regression Trees
rsconnect (J. Allaire 2019) Deployment Interface for R Markdown Documents and Shiny Applications
RSQLite (Müller et al. 2021) ‘SQLite’ Interface for R
rvest (Hadley Wickham 2020d) Easily Harvest (Scrape) Web Pages
scales (Hadley Wickham and Seidel 2020) Scale Functions for Visualization
sessioninfo (Csárdi et al. 2018) R Session Information
sf (Pebesma 2021) Simple Features for R
shiny (Chang et al. 2020) Web Application Framework for R
shinybusy (Meyer and Perrier 2020) Busy Indicator for ‘Shiny’ Applications
sna (Butts 2020b) Tools for Social Network Analysis
sp (Pebesma and Bivand 2020) Classes and Methods for Spatial Data
sparklyr (Luraschi et al. 2020) R Interface to Apache Spark
stopwords (Benoit, Muhr, and Watanabe 2020) Multilingual Stopword Lists
stringr (Hadley Wickham 2019f) Simple, Consistent Wrappers for Common String Operations
styler (Müller and Walthert 2020) Non-Invasive Pretty Printing of R Code
testthat (Hadley Wickham 2020e) Unit Testing for R
textdata (Hvitfeldt 2020) Download and Load Various Text Datasets
tidycensus (Walker and Herman 2020) Load US Census Boundary and Attribute Data as ‘tidyverse’ and ‘sf’-Ready Data Frames
tidygeocoder (Cambon 2020) Geocoding Made Easy
tidygraph (Pedersen 2020c) A Tidy API for Graph Manipulation
tidymodels (Kuhn and Wickham 2020) Easily Install and Load the ‘Tidymodels’ Packages
tidyr (Hadley Wickham 2020g) Tidy Messy Data
tidytext (Robinson and Silge 2021) Text Mining using ‘dplyr,’ ‘ggplot2,’ and Other Tidy Tools
tidyverse (Hadley Wickham 2019g) Easily Install and Load the ‘Tidyverse’
tigris (Walker 2020a) Load Census TIGER/Line Shapefiles
tm (Feinerer and Hornik 2020) Text Mining Package
transformr (Pedersen 2020d) Polygon and Path Transformations
twitteR (Gentry 2015) R Based Twitter Client
units (Pebesma, Mailund, and Kalinowski 2020) Measurement Units for R Vectors
usethis (Hadley Wickham and Bryan 2020b) Automate Package and Project Setup
viridis (Garnier 2018a) Default Color Maps from ‘matplotlib’
viridisLite (Garnier 2018b) Default Color Maps from ‘matplotlib’ (Lite Version)
webshot (Chang 2019) Take Screenshots of Web Pages
wordcloud (Fellows 2018) Word Clouds
wru (Khanna and Imai 2020) Who are You? Bayesian Prediction of Racial Category Using Surname and Geolocation
xaringanthemer (Aden-Buie 2020) Custom ‘xaringan’ CSS Themes
xfun (Yihui Xie 2021) Miscellaneous Functions by ‘Yihui Xie’
xkcd (Torres-Manzanera 2018) Plotting ggplot2 Graphics in an XKCD Style
yardstick (Kuhn and Vaughan 2020b) Tidy Characterizations of Model Performance
Table A.2: List of GitHub packages used in this book.
Package GitHub User Citation Title
etude dtkaplan (Kaplan 2020) Utilities for Handling Textbook Exercises with Knitr
fec12 baumer-lab (Tapal, Gahwagy, and Ryan 2020) Data Package for 2012 Federal Elections
openrouteservice GIScience (Oleś 2020) Openrouteservice API Client
streamgraph hrbrmstr (Rudis 2019) Build Streamgraph Visualizations

A.3 Further resources

More information on the mdsr package can be found at http://www.github.com/mdsr-book/mdsr.