Modern Data Science with R

3rd edition (light edits and updates)

A comprehensive data science textbook for undergraduates that incorporates statistical and computational thinking to solve real-world problems with data.

Author

Benjamin S. Baumer, Daniel T. Kaplan, and Nicholas J. Horton

Published

July 25, 2024

Welcome!

3rd edition

This is a work-in-progress draft of the 3rd edition. At present, there are relatively modest changes from the second edition beyond those necessitated by changes in the R ecosystem.

Key changes include:

Transition to Quarto from RMarkdown
Transition from magrittr pipe (%>%) to base R pipe (|>)
Minor updates to specific examples (e.g., updating tables scraped from Wikipedia) and code (e.g., new group options within the dplyr package).
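
As a brief illustration of the pipe transition listed above, the sketch below computes the same summary twice: once with nested calls and once with the base R pipe. It uses only base R and the built-in mtcars data set, both chosen here for illustration rather than taken from the book. The native pipe requires R 4.1.0 or later.

```r
# Illustrative sketch (not from the book): base R only, built-in mtcars data.
# The native pipe |> requires R >= 4.1.0.

# Nested-call form: read from the inside out.
nested <- round(mean(subset(mtcars, cyl == 4)$mpg), 1)

# Native-pipe form: read from left to right.
# With magrittr, the same chain would be written with %>% in place of |>.
piped <- mtcars |>
  subset(cyl == 4) |>
  with(mean(mpg)) |>
  round(1)

# Both forms should give the same value.
identical(nested, piped)
```

For simple chains like this one, the two pipes behave the same way, so in most code the migration amounts to swapping the operator.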

At the main website for the book, you will find other reviews, instructor resources, errata, and other information.

Do you see issues or have suggestions? To submit corrections, please visit our website’s public GitHub repository and file an issue.

Known issues with the 3rd edition

This is a work in progress. At present there are a number of known issues:

nuclear reactors example (Section 6.4.4, Example: Japanese nuclear reactors) needs to be updated to account for Wikipedia changes
Python code not yet implemented (Chapter 21 Epilogue: Towards “big data”)
Spark code not yet implemented (Chapter 21 Epilogue: Towards “big data”)
SQL output captions not working (Chapter 15 Database querying using SQL)
Open street map geocoding not yet implemented (Chapter 18 Geospatial computations)
ggmosaic() warnings (Figure 3.19)
RMarkdown introduction (Appendix D — Reproducible analysis and workflow) not yet converted to Quarto examples
issues with references in Appendix A — Packages used in the book
Exercises not yet available (throughout)
Links have not all been verified (help welcomed here!)

2nd edition

The online version of the 2nd edition of Modern Data Science with R is available. You can purchase the book from CRC Press or from Amazon.

The main website for the book includes more information, including reviews, instructor resources, and errata.

To submit corrections, please visit our website’s public GitHub repository and file an issue.

1st edition

The 1st edition may still be available for purchase. Although much of the material has been updated and improved, the general framework is the same (reviews).

Copyright

© 2021 by Taylor & Francis Group, LLC. Except as permitted under U.S. copyright law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by an electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

Preface

library(mdsr)

Background and motivation

The increasing volume and sophistication of data pose new challenges for analysts, who need to be able to transform complex data sets to answer important statistical questions. A consensus report on data science for undergraduates (National Academies of Science, Engineering, and Medicine 2018) noted that data science is revolutionizing science and the workplace. They defined a data scientist as “a knowledge worker who is principally occupied with analyzing complex and massive data resources.”

Michael I. Jordan has described data science as the marriage of computational thinking and inferential (statistical) thinking. Without the skills to be able to “wrangle” or “marshal” the increasingly rich and complex data that surround us, analysts will not be able to use these data to make better decisions.

Demand is strong for graduates with these skills. According to the company ratings site Glassdoor, “data scientist” was the best job in America every year from 2016 to 2019 (Columbus 2019).

New data technologies make it possible to extract data from more sources than ever before. Streamlined data processing libraries enable data scientists to express how to restructure those data into a form suitable for analysis. Database systems facilitate the storage and retrieval of ever-larger collections of data. State-of-the-art workflow tools foster well-documented and reproducible analysis. Modern statistical and machine learning methods allow the analyst to fit and assess models as well as to undertake supervised or unsupervised learning to glean information about the underlying real-world phenomena. Contemporary data science requires tight integration of these statistical, computing, data-related, and communication skills.

Intended audience

This book is intended for readers who want to develop the appropriate skills to tackle complex data science projects and “think with data” (as coined by Diane Lambert of Google). The desire to solve problems using data is at the heart of our approach.

We acknowledge that it is impossible to cover all these topics in any level of detail within a single book: Many of the chapters could productively form the basis for a course or series of courses. Instead, our goal is to lay a foundation for analysis of real-world data and to ensure that analysts see the power of statistics and data analysis. After reading this book, readers will have greatly expanded their skill set for working with these data, and should have a newfound confidence about their ability to learn new technologies on the fly.

This book was originally conceived to support a one-semester, 13-week undergraduate course in data science. We have found that the book is also useful for more advanced students in related disciplines, or analysts who want to bolster their data science skills. At the same time, Part I of the book is accessible to a general audience with no programming or statistics experience.

Key features of this book

Focus on case studies and extended examples

We feature a series of complex, real-world extended case studies and examples from a broad range of application areas, including politics, transportation, sports, environmental science, public health, social media, and entertainment. These rich data sets require the use of sophisticated data extraction techniques, modern data visualization approaches, and refined computational approaches.

Context is king for such questions, and we have structured the book to foster the parallel developments of statistical thinking, data-related skills, and communication. Each chapter focuses on a different extended example with diverse applications, while exercises allow for the development and refinement of the skills learned in that chapter.

Structure

The book has three main sections plus supplementary appendices. Part I provides an introduction to data science, which includes an introduction to data visualization, a foundation for data management (or “wrangling”), and ethics. Part II extends key modeling notions from introductory statistics, including regression modeling, classification and prediction, statistical foundations, and simulation. Part III introduces more advanced topics, including interactive data visualization, SQL and relational databases, geospatial data, text mining, and network science.

We conclude with appendices that introduce the book’s R package, R and RStudio, key aspects of algorithmic thinking, reproducible analysis, a review of regression, and how to set up a local SQL database.

The book features extensive cross-referencing (given the inherent connections between topics and approaches).

Supporting materials

In addition to many examples and extended case studies, the book incorporates exercises at the end of each chapter along with supplementary exercises available online. Many of the exercises are quite open-ended, and are designed to allow students to explore their creativity in tackling data science questions. (A solutions manual for instructors is available from the publisher.)

The book website at https://mdsr-book.github.io/mdsr3e includes the table of contents, the full text of each chapter, and bibliography. The instructor’s website at https://mdsr-book.github.io/ contains code samples, supplementary exercises, additional activities, and a list of errata.

Changes in the second edition

Data science moves quickly. A lot has changed since we wrote the first edition. We have updated all chapters to account for many of these changes and to take advantage of state-of-the-art R packages.

First, the chapter on working with geospatial data has been expanded and split into two chapters. The first focuses on working with geospatial data, and the second focuses on geospatial computations. Both chapters now use the sf package and the new geom_sf() function in ggplot2. These changes allow students to penetrate deeper into the world of geospatial data analysis.

Second, the chapter on tidy data has undergone significant revisions. A new section on list-columns has been added, and the section on iteration has been expanded into a full chapter. This new chapter makes consistent use of the functional programming style provided by the purrr package. These changes help students develop a habit of mind around scalability: if you are copying-and-pasting code more than twice, there is probably a more efficient way to do it.
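
To make the copy-and-paste heuristic concrete, here is a minimal sketch that replaces three near-identical lines with a single iteration. It uses base R's vapply() and the built-in mtcars data so that it runs without extra packages. The column names are chosen only for illustration, and the purrr call mentioned in the comment is the analogous form in the functional style described above.

```r
# Illustrative sketch (not from the book): built-in mtcars data, base R only.
cols <- c("mpg", "hp", "wt")

# Copy-and-paste version: one line per column.
pasted <- c(
  mpg = mean(mtcars$mpg),
  hp  = mean(mtcars$hp),
  wt  = mean(mtcars$wt)
)

# Iterative version: apply the same function to each selected column.
# With purrr loaded, the analogous call would be map_dbl(mtcars[cols], mean).
iterated <- vapply(mtcars[cols], mean, numeric(1))

# The two approaches should agree.
all.equal(pasted, iterated)
```

Adding a fourth column to the iterative version means extending cols by one element, whereas the pasted version needs another hand-edited line.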

Third, the chapter on supervised learning has been split into two chapters and updated to use the tidymodels suite of packages. The first chapter now covers model evaluation in general terms, while the second introduces several models. The tidymodels ecosystem provides a consistent syntax for fitting, interpreting, and evaluating a wide variety of machine learning models, all in a manner that is consistent with the tidyverse. These changes significantly reduce the cognitive overhead of the code in these chapters.

The content of several other chapters has undergone more minor—but nonetheless substantive—revisions. All of the code in the book has been revised to adhere more closely to the tidyverse syntax and style. Exercises and solutions from the first edition have been revised, and new exercises have been added. The code from each chapter is now available on the book website. The book has been ported to bookdown, so that a full version can be found online at https://mdsr-book.github.io/mdsr2e.

Key role of technology

While many tools can be used effectively to undertake data science, and the technologies to undertake analyses are quickly changing, R and Python have emerged as two powerful and extensible environments. While it is important for data scientists to be able to use multiple technologies for their analyses, we have chosen to focus on the use of R and RStudio (an open-source integrated development environment created by Posit) to avoid cognitive overload. We describe a powerful and coherent set of tools that can be introduced within the confines of a single semester and that provide a foundation for data wrangling and exploration.

We take full advantage of the RStudio environment. This powerful and easy-to-use front end adds innumerable features to R including package support, code-completion, integrated help, a debugger, and other coding tools. In our experience, the use of RStudio dramatically increases the productivity of R users, and by tightly integrating reproducible analysis tools, helps avoid error-prone “cut-and-paste” workflows. Our students and colleagues find RStudio to be an accessible interface. No prior knowledge or experience with R or RStudio is required: we include an introduction within the Appendix.

As noted earlier, we have comprehensively integrated many substantial improvements in the tidyverse, an opinionated set of packages that provide a more consistent interface to R (Wickham 2023). Many of the design decisions embedded in the tidyverse packages address issues that have traditionally complicated the use of R for data analysis. These decisions allow novice users to make headway more quickly and develop good habits.

We used a reproducible analysis system (knitr) to generate the example code and output in this book. Code extracted from these files is provided on the book’s website. We provide a detailed discussion of the philosophy and use of these systems. In particular, we feel that the knitr and rmarkdown packages for R, which are tightly integrated with Posit’s RStudio IDE, should become a part of every R user’s toolbox. We can’t imagine working on a project without them (and we’ve incorporated reproducibility into all of our courses).

Modern data science is a team sport. To be able to fully engage, analysts must be able to pose a question, seek out data to address it, ingest this into a computing environment, model and explore, then communicate results. This is an iterative process that requires a blend of statistics and computing skills.

How to use this book

The material from this book has supported several courses to date at Amherst, Smith, and Macalester Colleges, as well as many others around the world. From our personal experience, this includes an intermediate course in data science (in 2013 and 2014 at Smith College and since 2017 at Amherst College), an introductory course in data science (since 2016 at Smith), and a capstone course in advanced data analysis (multiple years at Amherst).

The introductory data science course at Smith has no prerequisites and includes the following subset of material:

Data Visualization: three weeks, covering Chapters 1 (Prologue: Why data science?) through 3 (A grammar for graphics)
Data Wrangling: five weeks, covering Chapters 4 (Data wrangling on one table) through 7 (Iteration)
Ethics: one week, covering Chapter 8 (Data science ethics)
Database Querying: two weeks, covering Chapter 15 (Database querying using SQL)
Geospatial Data: two weeks, covering Chapter 17 (Working with geospatial data) and part of Chapter 18 (Geospatial computations)

An intermediate course at Amherst followed the approach of Baumer (2015) with a prerequisite of some statistics and some computer science and an integrated final project. The course generally covers the following chapters:

Data Visualization: two weeks, covering Chapters 1 (Prologue: Why data science?) through 3 (A grammar for graphics) and Chapter 14 (Dynamic and customized data graphics)
Data Wrangling: four weeks, covering Chapters 4 (Data wrangling on one table) through 7 (Iteration)
Ethics: one week, covering Chapter 8 (Data science ethics)
Unsupervised Learning: one week, covering Chapter 12 (Unsupervised learning)
Database Querying: one week, covering Chapter 15 (Database querying using SQL)
Geospatial Data: one week, covering Chapter 17 (Working with geospatial data) and some of Chapter 18 (Geospatial computations)
Text Mining: one week, covering Chapter 19 (Text as data)
Network Science: one week, covering Chapter 20 (Network science)

The capstone course at Amherst reviewed much of that material in more depth:

Data Visualization: three weeks, covering Chapters 1 (Prologue: Why data science?) through 3 (A grammar for graphics) and Chapter 14 (Dynamic and customized data graphics)
Data Wrangling: two weeks, covering Chapters 4 (Data wrangling on one table) through 7 (Iteration)
Ethics: one week, covering Chapter 8 (Data science ethics)
Simulation: one week, covering Chapter 13 (Simulation)
Statistical Learning: two weeks, covering Chapters 10 (Predictive modeling) through 12 (Unsupervised learning)
Databases: one week, covering Chapter 15 (Database querying using SQL) and Appendix F (Setting up a database server)
Text Mining: one week, covering Chapter 19 (Text as data)
Spatial Data: one week, covering Chapter 17 (Working with geospatial data)
Big Data: one week, covering Chapter 21 (Epilogue: Towards “big data”)

We anticipate that this book could serve as the primary text for a variety of other courses, such as a Data Science 2 course, with or without additional supplementary material.

The content in Part I, particularly the ggplot2 visualization concepts presented in Chapter 3 (A grammar for graphics) and the dplyr data wrangling operations presented in Chapter 4 (Data wrangling on one table), is fundamental and is assumed in Parts II and III. Each of the topics in Part III is independent of the others and of the material in Part II. Thus, while most instructors will want to cover most (if not all) of Part I in any course, the material in Parts II and III can be added with almost total freedom.

The material in Part II is designed to expose students with a beginner’s understanding of statistics (i.e., basic inference and linear regression) to a richer world of statistical modeling and statistical inference.

Acknowledgments

We would like to thank John Kimmel at Informa CRC/Chapman and Hall for his support and guidance. We also thank Jim Albert, Nancy Boynton, Jon Caris, Mine Çetinkaya-Rundel, Jonathan Che, Patrick Frenett, Scott Gilman, Maria-Cristiana Gîrjău, Johanna Hardin, Alana Horton, John Horton, Kinari Horton, Azka Javaid, Andrew Kim, Eunice Kim, Caroline Kusiak, Ken Kleinman, Priscilla (Wencong) Li, Amelia McNamara, Melody Owen, Randall Pruim, Tanya Riseman, Gabriel Sosa, Katie St. Clair, Amy Wagaman, Susan (Xiaofei) Wang, Hadley Wickham, J. J. Allaire and the Posit (formerly RStudio) developers, the anonymous reviewers, multiple classes at Smith and Amherst Colleges, and many others for contributions to the R and RStudio environment, comments, guidance, and/or helpful suggestions on drafts of the manuscript. Rose Porta was instrumental in proofreading and easing the transition from Sweave to R Markdown. Jessica Yu converted and tagged most of the exercises from the first edition to the new format based on etude.

Above all we greatly appreciate Cory, Maya, and Julia for their patience and support.

Northampton, MA and St. Paul, MN
August, 2023 (third edition [light edits and updates])

Northampton, MA and St. Paul, MN
December, 2020 (second edition)