Appendix A — Packages used in the book

A.1 The mdsr package

The mdsr package contains many of the small data sets used in this book that are not available in other packages. To install it from CRAN, use install.packages(). To get the latest development version from GitHub, use the install_github() function from the remotes package. (See Section B.4.1 for more comprehensive information about R package maintenance.)

# this command only needs to be run once
install.packages("mdsr")
# if you want the development version
remotes::install_github("mdsr-book/mdsr")

The list of data sets provided can be retrieved using the data() function.

data(package = "mdsr")
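As a minimal sketch, the object returned by data() can also be inspected programmatically: it stores a character matrix in its results component, with Item and Title columns among others. The helper name list_datasets() is hypothetical and is introduced here only for illustration. The call for mdsr is guarded so that the code still runs when the package is not installed.

```r
# Hypothetical helper (not part of mdsr): tabulate the data sets in a package
list_datasets <- function(pkg) {
  res <- data(package = pkg)$results  # character matrix, one row per data set
  data.frame(name = res[, "Item"], title = res[, "Title"])
}

# Works for any installed package; "datasets" ships with R
print(head(list_datasets("datasets")))

# Only query mdsr when it is installed
if (requireNamespace("mdsr", quietly = TRUE)) {
  print(head(list_datasets("mdsr")))
}
```

Returning a data frame rather than the raw matrix makes it easy to filter or search the titles with the usual tools.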

The mdsr package includes some functions that simplify a number of tasks. In particular, the dbConnect_scidb() function provides a shorthand for connecting to the public SQL server hosted by Amazon Web Services. We use this function extensively in Chapter 15 and in our classes and projects.
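A connection might look like the following sketch. It assumes network access to the public server and that the dbplyr package is installed. The database name "airlines" and the table name "flights" are assumptions chosen for illustration.

```r
library(dplyr)
library(mdsr)

# Open a connection to the public server (needs network access).
# The database name "airlines" is an assumption for illustration.
db <- dbConnect_scidb("airlines")

# See which tables the database exposes
DBI::dbListTables(db)

# Reference a table lazily; "flights" is an assumed table name
flights <- tbl(db, "flights")
flights |>
  head(5) |>
  collect()

# Close the connection when finished
DBI::dbDisconnect(db)
```

Because tbl() is lazy, only the five requested rows are transferred when collect() is called.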

In keeping with best practices, mdsr no longer loads any other packages. In every chapter in this book, a call to library(tidyverse) precedes a call to library(mdsr). These two steps will set up an R session to replicate the code in the book.
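The start of an analysis script following this convention could look like the sketch below. The glimpse() call is only a quick check that both packages loaded. CIACountries is used as an example of a data set shipped with mdsr.

```r
# Load the tidyverse first, then mdsr, as in each chapter of the book
library(tidyverse)
library(mdsr)

# Quick check that both are available: glimpse() is provided by the
# tidyverse (via dplyr) and CIACountries is a data set from mdsr
glimpse(CIACountries)
```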

A.2 Other packages

As we discuss in Chapters 1 and 21, this book is not explicitly about “big data”—it is about mastering data science techniques for small and medium data with an eye towards big data. To that end, we need medium-sized data sets to work with. We have introduced several such data sets in this book, namely airlines, fec12, and fec16.

The airlines package, which was inspired by the nycflights13 package and is built on the etl framework, gives R users the ability to download the full 33 years (and counting) of flight data from the United States Bureau of Transportation Statistics and bring it seamlessly into SQL without having to write any SQL code. The macleish package also uses the etl framework, in its case to retrieve hourly-updated weather data from the MacLeish field station.
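The extract-transform-load workflow that these packages share can be sketched as follows, using macleish because it is available on CRAN. This is an illustration under assumptions, not the book's own code. It needs network access, and with no db argument etl() is expected to fall back to a local SQLite database in a temporary directory (which requires the RSQLite package).

```r
library(etl)
library(macleish)

# Create an etl object; without a db argument, a temporary SQLite
# database is assumed to serve as the back end
macleish <- etl("macleish")

# Run the three ETL phases explicitly
macleish |>
  etl_extract() |>    # download the raw files
  etl_transform() |>  # clean them into a loadable form
  etl_load()          # write them into the database

# List the tables that were populated
DBI::dbListTables(macleish$con)
```

The same three verbs apply to other etl-based packages. Only the string passed to etl() and any package-specific arguments change.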

The full list of packages used in this book appears below in Tables A.1 and A.2.

Table A.1: List of CRAN packages used in this book.

| Package | Citation | Title |
|---|---|---|
| DBI | R Special Interest Group on Databases (R-SIG-DB), Wickham, and Müller (2024) | R Database Interface |
| DT | Xie, Cheng, and Tan (2024) | A Wrapper of the JavaScript Library 'DataTables' |
| Hmisc | Harrell (2024) | Harrell Miscellaneous |
| Lahman | Friendly et al. (2023) | Sean 'Lahman' Baseball Database |
| NHANES | Pruim (2015) | Data from the US National Health and Nutrition Examination Study |
| NeuralNetTools | Beck (2022) | Visualization and Analysis Tools for Neural Networks |
| RColorBrewer | Neuwirth (2022) | ColorBrewer Palettes |
| RCurl | Temple Lang (2024) | General Network (HTTP/FTP/...) Client Interface for R |
| RMariaDB | Müller et al. (2023) | Database Interface and MariaDB Driver |
| Rcpp | Eddelbuettel et al. (2024) | Seamless R and C++ Integration |
| aRxiv | Ram and Broman (2024) | Interface to the arXiv API |
| alr4 | Weisberg (2018) | Data to Accompany Applied Linear Regression 4th Edition |
| ape | Paradis et al. (2024) | Analyses of Phylogenetics and Evolution |
| assertthat | Wickham (2019) | Easy Pre and Post Assertions |
| available | Ganz et al. (2022) | Check if the Title of a Package is Available, Appropriate and Interesting |
| babynames | Wickham (2021a) | US Baby Names 1880-2017 |
| bench | Hester and Vaughan (2023) | High Precision Timing of R Expressions |
| biglm | Lumley (2020) | Bounded Memory Linear and Generalized Linear Models |
| bigrquery | Wickham and Bryan (2024) | An Interface to Google's 'BigQuery' 'API' |
| broom | Robinson, Hayes, and Couch (2023) | Convert Statistical Objects into Tidy Tibbles |
| dbplyr | Wickham, Girlich, and Ruiz (2024) | A 'dplyr' Back End for Databases |
| discrim | Hvitfeldt and Kuhn (2023) | Model Wrappers for Discriminant Analysis |
| dplyr | Wickham et al. (2023) | A Grammar of Data Manipulation |
| dygraphs | Vanderkam et al. (2018) | Interface to 'Dygraphs' Interactive Time Series Charting Library |
| etl | Baumer (2023) | Extract-Transform-Load Framework for Medium Data |
| extrafont | Chang (2023) | Tools for Using Fonts |
| forcats | Wickham (2023a) | Tools for Working with Categorical Variables (Factors) |
| fs | Hester, Wickham, and Csárdi (2023) | Cross-Platform File System Operations Based on 'libuv' |
| furrr | Vaughan and Dancho (2022) | Apply Mapping Functions in Parallel using Futures |
| future | Bengtsson (2024) | Unified Parallel and Distributed Processing in R for Everyone |
| gganimate | Pedersen and Robinson (2024) | A Grammar of Animated Graphics |
| ggmosaic | Jeppson, Hofmann, and Cook (2021) | Mosaic Plots in the 'ggplot2' Framework |
| ggplot2 | Wickham, Chang, et al. (2024) | Create Elegant Data Visualisations Using the Grammar of Graphics |
| ggraph | Pedersen (2024a) | An Implementation of Grammar of Graphics for Graphs and Networks |
| ggrepel | Slowikowski (2024) | Automatically Position Non-Overlapping Text Labels with 'ggplot2' |
| ggspatial | Dunnington (2023) | Spatial Data Framework for ggplot2 |
| ggthemes | Arnold (2024) | Extra Themes, Scales and Geoms for 'ggplot2' |
| glmnet | Friedman et al. (2023) | Lasso and Elastic-Net Regularized Generalized Linear Models |
| googlesheets4 | Bryan (2023) | Access Google Sheets using the Sheets API V4 |
| haven | Wickham, Miller, and Smith (2023) | Import and Export 'SPSS', 'Stata' and 'SAS' Files |
| here | Müller (2020) | A Simpler Way to Find Your Files |
| htmlwidgets | Vaidyanathan et al. (2023) | HTML Widgets for R |
| igraph | Csárdi, Nepusz, et al. (2024) | Network Analysis and Visualization |
| janitor | Firke (2023) | Simple Tools for Examining and Cleaning Dirty Data |
| jsonlite | Ooms (2023) | A Simple and Robust JSON Parser and Generator for R |
| kableExtra | Zhu (2024) | Construct Complex Table with 'kable' and Pipe Syntax |
| kknn | Schliep and Hechenbichler (2016) | Weighted k-Nearest Neighbors |
| knitr | Xie (2024a) | A General-Purpose Package for Dynamic Report Generation in R |
| lattice | Sarkar (2023) | Trellis Graphics for R |
| leaflet | Cheng et al. (2024) | Create Interactive Web Maps with the JavaScript 'Leaflet' Library |
| lubridate | Spinu, Grolemund, and Wickham (2023) | Make Dealing with Dates a Little Easier |
| macleish | Baumer et al. (2022) | Retrieve Data from MacLeish Field Station |
| magrittr | Bache and Wickham (2022) | A Forward-Pipe Operator for R |
| mapproj | McIlroy et al. (2023) | Map Projections |
| maps | Brownrigg (2023) | Draw Geographical Maps |
| mclust | Fraley, Raftery, and Scrucca (2024) | Gaussian Mixture Modelling for Model-Based Clustering, Classification, and Density Estimation |
| mdsr | Baumer, Horton, and Kaplan (2023) | Complement to 'Modern Data Science with R' |
| modelr | Wickham (2023b) | Modelling Functions that Work with the Pipe |
| mosaic | Pruim, Kaplan, and Horton (2024) | Project MOSAIC Statistics and Mathematics Teaching Utilities |
| mosaicData | Pruim, Kaplan, and Horton (2023) | Project MOSAIC Data Sets |
| nycflights13 | Wickham (2021b) | Flights that Departed NYC in 2013 |
| parsnip | Kuhn and Vaughan (2024) | A Common API to Modeling and Analysis Functions |
| partykit | Hothorn and Zeileis (2023) | A Toolkit for Recursive Partytioning |
| patchwork | Pedersen (2024b) | The Composer of Plots |
| plotly | Sievert et al. (2024) | Create Interactive Web Graphics via 'plotly.js' |
| purrr | Wickham and Henry (2023) | Functional Programming Tools |
| randomForest | Breiman et al. (2022) | Breiman and Cutler's Random Forests for Classification and Regression |
| readr | Wickham, Hester, and Bryan (2024) | Read Rectangular Text Data |
| readxl | Wickham and Bryan (2023) | Read Excel Files |
| remotes | Csárdi, Hester, et al. (2024) | R Package Installation from Remote Repositories, Including 'GitHub' |
| renv | Ushey and Wickham (2024) | Project Environments |
| reticulate | Ushey, Allaire, and Tang (2024) | Interface to 'Python' |
| rlang | Henry and Wickham (2024) | Functions for Base Types and Core R and 'Tidyverse' Features |
| rmarkdown | Allaire et al. (2024) | Dynamic Documents for R |
| rpart | Therneau and Atkinson (2023) | Recursive Partitioning and Regression Trees |
| rvest | Wickham (2024) | Easily Harvest (Scrape) Web Pages |
| scales | Wickham, Pedersen, and Seidel (2023) | Scale Functions for Visualization |
| sessioninfo | Wickham et al. (2021) | R Session Information |
| sf | Pebesma (2024) | Simple Features for R |
| shiny | Chang et al. (2024) | Web Application Framework for R |
| sp | Pebesma and Bivand (2024) | Classes and Methods for Spatial Data |
| sparklyr | Luraschi et al. (2024) | R Interface to Apache Spark |
| stopwords | Benoit, Muhr, and Watanabe (2021) | Multilingual Stopword Lists |
| stringr | Wickham (2023c) | Simple, Consistent Wrappers for Common String Operations |
| styler | Müller and Walthert (2024) | Non-Invasive Pretty Printing of R Code |
| textdata | Hvitfeldt (2022) | Download and Load Various Text Datasets |
| tidycensus | Walker and Herman (2024) | Load US Census Boundary and Attribute Data as 'tidyverse' and 'sf'-Ready Data Frames |
| tidygeocoder | Cambon et al. (2021) | Geocoding Made Easy |
| tidygraph | Pedersen (2024c) | A Tidy API for Graph Manipulation |
| tidymodels | Kuhn and Wickham (2024) | Easily Install and Load the 'Tidymodels' Packages |
| tidyr | Wickham, Vaughan, and Girlich (2024) | Tidy Messy Data |
| tidytext | Robinson and Silge (2024) | Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools |
| tidyverse | Wickham (2023d) | Easily Install and Load the 'Tidyverse' |
| tigris | Walker (2024) | Load Census TIGER/Line Shapefiles |
| tm | Feinerer and Hornik (2024) | Text Mining Package |
| transformr | Pedersen (2024d) | Polygon and Path Transformations |
| units | Pebesma et al. (2023) | Measurement Units for R Vectors |
| usethis | Wickham, Bryan, et al. (2024) | Automate Package and Project Setup |
| viridis | Garnier (2024) | Colorblind-Friendly Color Maps for R |
| viridisLite | Garnier (2023) | Colorblind-Friendly Color Maps (Lite Version) |
| wordcloud | Fellows (2018) | Word Clouds |
| wru | Khanna et al. (2024) | Who are You? Bayesian Prediction of Racial Category Using Surname, First Name, Middle Name, and Geolocation |
| xaringanthemer | Aden-Buie (2022) | Custom 'xaringan' CSS Themes |
| xfun | Xie (2024b) | Supporting Functions for Packages Maintained by 'Yihui Xie' |
| xkcd | Torres-Manzanera (2018) | Plotting ggplot2 Graphics in an XKCD Style |
| yardstick | Kuhn, Vaughan, and Hvitfeldt (2024) | Tidy Characterizations of Model Performance |

Table A.2: List of GitHub packages used in this book.

| Package | GitHub User | Citation | Title |
|---|---|---|---|
| etude | dtkaplan | @R-etude | Utilities for Handling Textbook Exercises with Knitr |
| fec12 | baumer-lab | @R-fec12 | Data Package for 2012 Federal Elections |
| openrouteservice | GIScience | @R-openrouteservice | Openrouteservice API Client |
| streamgraph | hrbrmstr | @R-streamgraph | Build Streamgraph Visualizations |

A.3 Further resources

More information on the mdsr package can be found at