Preface

Background and motivation

The increasing volume and sophistication of data poses new challenges for analysts, who need to be able to transform complex data sets to answer important statistical questions. A consensus report on data science for undergraduates (National Academies of Science, Engineering, and Medicine 2018) noted that data science is revolutionizing science and the workplace. They defined a data scientist as “a knowledge worker who is principally occupied with analyzing complex and massive data resources.”

Michael I. Jordan has described data science as the marriage of computational thinking and inferential (statistical) thinking. Without the skills to be able to “wrangle” or “marshal” the increasingly rich and complex data that surround us, analysts will not be able to use these data to make better decisions.

Demand is strong for graduates with these skills. According to the company ratings site Glassdoor, “data scientist” was the best job in America every year from 2016–2019 (Columbus 2019).

New data technologies make it possible to extract data from more sources than ever before. Streamlined data processing libraries enable data scientists to express how to restructure those data into a form suitable for analysis. Database systems facilitate the storage and retrieval of ever-larger collections of data. State-of-the-art workflow tools foster well-documented and reproducible analysis. Modern statistical and machine learning methods allow the analyst to fit and assess models as well as to undertake supervised or unsupervised learning to glean information about the underlying real-world phenomena. Contemporary data science requires tight integration of these statistical, computing, data-related, and communication skills.

Intended audience

This book is intended for readers who want to develop the appropriate skills to tackle complex data science projects and “think with data” (as coined by Diane Lambert of Google). The desire to solve problems using data is at the heart of our approach.

We acknowledge that it is impossible to cover all these topics in any level of detail within a single book: Many of the chapters could productively form the basis for a course or series of courses. Instead, our goal is to lay a foundation for analysis of real-world data and to ensure that analysts see the power of statistics and data analysis. After reading this book, readers will have greatly expanded their skill set for working with these data, and should have a newfound confidence about their ability to learn new technologies on-the-fly.

This book was originally conceived to support a one-semester, 13-week undergraduate course in data science. We have found that the book will be useful for more advanced students in related disciplines, or analysts who want to bolster their data science skills. At the same time, Part I of the book is accessible to a general audience with no programming or statistics experience.

Key features of this book

Focus on case studies and extended examples

We feature a series of complex, real-world extended case studies and examples from a broad range of application areas, including politics, transportation, sports, environmental science, public health, social media, and entertainment. These rich data sets require the use of sophisticated data extraction techniques, modern data visualization approaches, and refined computational approaches.

Context is king for such questions, and we have structured the book to foster the parallel developments of statistical thinking, data-related skills, and communication. Each chapter focuses on a different extended example with diverse applications, while exercises allow for the development and refinement of the skills learned in that chapter.

Structure

The book has three main sections plus supplementary appendices. Part I provides an introduction to data science, which includes an introduction to data visualization, a foundation for data management (or “wrangling”), and ethics. Part II extends key modeling notions from introductory statistics, including regression modeling, classification and prediction, statistical foundations, and simulation. Part III introduces more advanced topics, including interactive data visualization, SQL and relational databases, geospatial data, text mining, and network science.

We conclude with appendices that introduce the book’s R package, R and RStudio, key aspects of algorithmic thinking, reproducible analysis, a review of regression, and how to set up a local SQL database.

We provide two indices: one organized by subject and the other organized by R function and package. In addition, the book features extensive cross-referencing (given the inherent connections between topics and approaches).

Supporting materials

In addition to many examples and extended case studies, the book incorporates exercises at the end of each chapter along with supplementary exercises available online. Many of the exercises are quite open-ended, and are designed to allow students to explore their creativity in tackling data science questions. (A solutions manual for instructors is available from the publisher.)

The book website at https://mdsr-book.github.io/mdsr2e includes the table of contents, the full text of each chapter, bibliography, and subject and R indices. The instructor’s website at https://mdsr-book.github.io/ contains code samples, supplementary exercises, additional activities, and a list of errata.

Changes in the second edition

Data science moves quickly. A lot has changed since we wrote the first edition. We have updated all chapters to account for many of these changes and to take advantage of state-of-the-art R packages.

First, the chapter on working with geospatial data has been expanded and split into two chapters. The first focuses on working with geospatial data, and the second focuses on geospatial computations. Both chapters now use the sf package and the new geom_sf() function in ggplot2. These changes allow students to penetrate deeper into the world of geospatial data analysis.

Second, the chapter on tidy data has undergone significant revisions. A new section on list-columns has been added, and the section on iteration has been expanded into a full chapter. This new chapter makes consistent use of the functional programming style provided by the purrr package. These changes help students develop a habit of mind around scalability: if you are copying-and-pasting code more than twice, there is probably a more efficient way to do it.

Third, the chapter on supervised learning has been split into two chapters and updated to use the tidymodels suite of packages. The first chapter now covers model evaluation in generality, while the second introduces several models. The tidymodels ecosystem provides a consistent syntax for fitting, interpreting, and evaluating a wide variety of machine learning models, all in a manner that is consistent with the tidyverse. These changes significantly reduce the cognitive overhead of the code in this chapter.

The content of several other chapters has undergone more minor—but nonetheless substantive—revisions. All of the code in the book has been revised to adhere more closely to the tidyverse syntax and style. Exercises and solutions from the first edition have been revised, and new exercises have been added. The code from each chapter is now available on the book website. The book has been ported to bookdown, so that a full version can be found online at https://mdsr-book.github.io/mdsr2e.

Key role of technology

While many tools can be used effectively to undertake data science, and the technologies to undertake analyses are quickly changing, R and Python have emerged as two powerful and extensible environments. While it is important for data scientists to be able to use multiple technologies for their analyses, we have chosen to focus on the use of R and RStudio to avoid cognitive overload. We describe a powerful and coherent set of tools that can be introduced within the confines of a single semester and that provide a foundation for data wrangling and exploration.

We take full advantage of the RStudio environment. This powerful and easy-to-use front end adds innumerable features to R including package support, code-completion, integrated help, a debugger, and other coding tools. In our experience, the use of RStudio dramatically increases the productivity of R users, and by tightly integrating reproducible analysis tools, helps avoid error-prone “cut-and-paste” workflows. Our students and colleagues find RStudio to be an accessible interface. No prior knowledge or experience with R or RStudio is required: we include an introduction within the Appendix.

As noted earlier, we have comprehensively integrated many substantial improvements in the tidyverse, an opinionated set of packages that provide a more consistent interface to R (Hadley Wickham 2021g). Many of the design decisions embedded in the tidyverse packages address issues that have traditionally complicated the use of R for data analysis. These decisions allow novice users to make headway more quickly and develop good habits.

We used a reproducible analysis system (knitr) to generate the example code and output in this book. Code extracted from these files is provided on the book’s website. We provide a detailed discussion of the philosophy and use of these systems. In particular, we feel that the knitr and rmarkdown packages for R, which are tightly integrated with RStudio, should become a part of every R user’s toolbox. We can’t imagine working on a project without them (and we’ve incorporated reproducibility into all of our courses).

Modern data science is a team sport. To be able to fully engage, analysts must be able to pose a question, seek out data to address it, ingest this into a computing environment, model and explore, then communicate results. This is an iterative process that requires a blend of statistics and computing skills.

How to use this book

The material from this book has supported several courses to date at Amherst, Smith, and Macalester Colleges, as well as many others around the world. From our personal experience, this includes an intermediate course in data science (in 2013 and 2014 at Smith College and since 2017 at Amherst College), an introductory course in data science (since 2016 at Smith), and a capstone course in advanced data analysis (multiple years at Amherst).

The introductory data science course at Smith has no prerequisites and includes the following subset of material:

Data Visualization: three weeks, covering Chapters 1–3
Data Wrangling: five weeks, covering Chapters 4–7
Ethics: one week, covering Chapter 8
Database Querying: two weeks, covering Chapter 15
Geospatial Data: two weeks, covering Chapter 17 and some of Chapter 18

A intermediate course at Amherst followed the approach of B. S. Baumer (2015) with a pre-requisite of some statistics and some computer science and an integrated final project. The course generally covers the following chapters:

Data Visualization: two weeks, covering Chapters 1–3 and 14
Data Wrangling: four weeks, covering Chapters 4–7
Ethics: one week, covering Chapter 8
Unsupervised Learning: one week, covering Chapter 12
Database Querying: one week, covering Chapter 15
Geospatial Data: one week, covering Chapter 17 and some of Chapter 18
Text Mining: one week, covering Chapter 19
Network Science: one week, covering Chapter 20

The capstone course at Amherst reviewed much of that material in more depth:

Data Visualization: three weeks, covering Chapters 1–3 and 14
Data Wrangling: two weeks, covering Chapters 4–7
Ethics: one week, covering Chapter 8
Simulation: one week, covering Chapter 13
Statistical Learning: two weeks, covering Chapters 10–12
Databases: one week, covering Chapter 15 and Appendix F
Text Mining: one week, covering Chapter 19
Spatial Data: one week, covering Chapter 17
Big Data: one week, covering Chapter 21

We anticipate that this book could serve as the primary text for a variety of other courses, with or without additional supplementary material.

The content in Part I—particularly the ggplot2 visualization concepts presented in Chapter 3 and the dplyr data wrangling operations presented in Chapter 4—is fundamental and is assumed in Parts II and III. Each of the topics in Part III are independent of each other and the material in Part II. Thus, while most instructors will want to cover most (if not all) of Part I in any course, the material in Parts II and III can be added with almost total freedom.

The material in Part II is designed to expose students with a beginner’s understanding of statistics (i.e., basic inference and linear regression) to a richer world of statistical modeling and statistical inference.

Acknowledgments

We would like to thank John Kimmel at Informa CRC/Chapman and Hall for his support and guidance. We also thank Jim Albert, Nancy Boynton, Jon Caris, Mine Çetinkaya-Rundel, Jonathan Che, Patrick Frenett, Scott Gilman, Maria-Cristiana Gîrjău, Johanna Hardin, Alana Horton, John Horton, Kinari Horton, Azka Javaid, Andrew Kim, Eunice Kim, Caroline Kusiak, Ken Kleinman, Priscilla (Wencong) Li, Amelia McNamara, Melody Owen, Randall Pruim, Tanya Riseman, Gabriel Sosa, Katie St. Clair, Amy Wagaman, Susan (Xiaofei) Wang, Hadley Wickham, J. J. Allaire and the RStudio developers, the anonymous reviewers, multiple classes at Smith and Amherst Colleges, and many others for contributions to the R and RStudio environment, comments, guidance, and/or helpful suggestions on drafts of the manuscript. Rose Porta was instrumental in proofreading and easing the transition from Sweave to R Markdown. Jessica Yu converted and tagged most of the exercises from the first edition to the new format based on etude.

Above all we greatly appreciate Cory, Maya, and Julia for their patience and support.

Northampton, MA and St. Paul, MN
December 2020