21  Epilogue: Towards “big data”

The terms data science and big data are often used interchangeably, but this is not correct. Technically, “big data” is a part of data science: the part that deals with data that are so large that they cannot be handled by an ordinary computer. This book provides what we hope is a broad—yet principled—introduction to data science, but it does not specifically prepare the reader to work with big data. Rather, we see the concepts developed in this book as “precursors” to big data (Horton, Baumer, and Wickham 2015; Horton and Hardin 2015). In this epilogue, we explore notions of big data and point the reader towards technologies that scale for truly big data.

21.1 Notions of big data

Big data is an exceptionally hot topic, but it is not so well-defined. Wikipedia states:

Big data is a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software.

Relational database management systems, desktop statistics and software packages used to visualize data often have difficulty handling big data. The work may require “massively parallel software running on tens, hundreds, or even thousands of servers.” What qualifies as being “big data” varies depending on the capabilities of the users and their tools, and expanding capabilities make big data a moving target. “For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration” (retrieved December 2020).

Big data is often characterized by the three V’s: volume, velocity, and variety (Laney 2001). Under this definition, the qualities that make big data different are its size, how quickly it grows as it is collected, and how many different formats it may come in. In big data, the size of tables may be too large to fit on an ordinary computer, the data and queries on it may be coming in too quickly to process, or the data may be distributed across many different systems. Randall Pruim puts it more concisely: “Big data is when your workflow breaks.”

Both relative and absolute definitions of big data are meaningful. The absolute definition may be easier to understand: We simply specify a data size and agree that any data that are at least that large are “big”—otherwise they are not. The problem with this definition is that it is a moving target. It might mean petabytes (1,000 terabytes) today, but exabytes (1,000 petabytes) a few years from now. Regardless of the precise definition, it is increasingly clear that while many organizations like Google, Facebook, and Amazon are working with truly big data, most individuals—even data scientists like you and us—are not.

For us, the relative definition becomes more meaningful. A big data problem occurs when the workflow that you have been using to solve problems becomes infeasible due to the expansion in the size of your data. It is useful in this context to think about orders of magnitude of data. The evolution of baseball data illustrates how “big data problems” have arisen as the volume and variety of the data have increased over time.

  • Individual game data: Henry Chadwick started collecting boxscores (a tabular summary of each game) in the second half of the 19th century. These data (dozens or even hundreds of rows) can be stored on handwritten pieces of paper, or in a single spreadsheet. Each row might represent one game. Thus, a perfectly good workflow for working with data of this size is to store them on paper. A more sophisticated workflow would be to store them in a spreadsheet application.

  • Seasonal data: By the 1970s, decades of baseball history were recorded in a seasonal format. Here, the data are aggregated at the player-team-season level. An example of this kind of data is the Lahman database we explored in Chapter 4, which has nearly 100,000 rows in the Batting table. Note that in this seasonal format, we know how many home runs each player hit for each team, but we don’t know anything about when they were hit (e.g., in what month or what inning). Excel is limited in the number of rows that a single spreadsheet can contain. The original limit of \(2^{14} = 16,384\) rows was bumped up to \(2^{16} = 65,536\) rows in Excel 97, and the current limit, introduced in Excel 2007, is \(2^{20} \approx 1\) million rows. Until 2007, simply opening the Batting table in Excel would have been impossible. This is a big data problem, because your Excel workflow has broken due to the size of your data. On the other hand, opening the Batting table in R requires far less memory, since R does not try to display all of the data.

  • Play-by-play data: By the 1990s, Retrosheet began collecting even more granular play-by-play data. Each row contains information about one play. This means that we know exactly when each player hit each home run—what date, what inning, off of which pitcher, which other runners were on base, and even which other players were in the field. As of this writing, nearly 100 seasons, occupying more than 10 million rows, are available. This creates a big data problem for R: you would have a hard time loading these data into R on a typical personal computer. However, SQL provides a scalable solution for data of this magnitude, even on a laptop. Still, you will experience significantly better performance if these data are stored in an SQL cluster with lots of memory.

  • Camera-tracking data: The Statcast data set contains \((x,y,z)\)-coordinates for all fielders, baserunners, and the ball every \(1/15^{th}\) of a second. Thus, each row is a moment in time. These data indicate not just the outcome of each play, but exactly where each of the players on the field and the ball were as the play evolved. These data are several gigabytes per game, which translates into many terabytes per season. Thus, some sort of distributed server system would be required just to store these data. These data are “big” in the relative sense for any individual, but they are still orders of magnitude away from being “big” in the absolute sense.

What does absolutely big data look like? For an individual user, you might consider the 13.5-terabyte data set of 110 billion events released in 2015 by Yahoo! for use in machine learning research. The grand-daddy of data may be the Large Hadron Collider in Europe, which is generating 25 petabytes of data per year (CERN 2008). However, only 0.001% of all of the data that is being generated by the supercollider is being saved, because to collect it all would mean capturing nearly 500 exabytes per day. This is clearly big data.

21.2 Tools for bigger data

By now, you have a working knowledge of both R and SQL. These are battle-tested, valuable tools for working with small and medium data. Both have large user bases, ample deployment, and continue to be very actively developed. Some of that development seeks to make R and SQL more useful for truly large data. While we don’t have the space to cover these extensions in detail, in this section we outline some of the most important concepts for working with big data, and highlight some of the tools you are likely to see on this frontier of your working knowledge.

21.2.1 Data and memory structures for big data

An alternative to dplyr, data.table is a popular R package for fast SQL-style operations on very large data tables (many gigabytes of memory). It is not clear that data.table is faster or more efficient than dplyr, and it uses a different—but not necessarily better—syntax. Moreover, dplyr can use data.table itself as a backend. We have chosen to highlight dplyr in this book primarily because it fits so well syntactically with a number of other R packages we use herein (i.e., the tidyverse).
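
To give a flavor of the syntactic difference, the following sketch (using the built-in mtcars data rather than an example from this book) performs the same aggregation with dplyr and with data.table. The data.table version uses its compact DT[i, j, by] syntax.

library(dplyr)
library(data.table)

# dplyr: average mpg by number of cylinders
mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg))

# data.table: the same aggregation in DT[i, j, by] form
mtcars_dt <- as.data.table(mtcars)
mtcars_dt[, .(mean_mpg = mean(mpg)), by = cyl]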

For some problems—more common in machine learning—the number of explanatory variables \(p\) can be large (not necessarily relative to the number of observations \(n\)). In such cases, the algorithm to compute a least-squares regression model may eat up quite a bit of memory. The biglm package seeks to improve on this by providing a memory-efficient biglm() function that can be used in place of lm(). In particular, biglm can fit generalized linear models with data frames that are larger than memory. The package accomplishes this by splitting the computations into more manageable chunks—updating the results iteratively as each chunk is processed. In this manner, you can write a drop-in replacement for your existing code that will scale to data sets larger than the memory on your computer.

library(tidyverse)
library(mdsr)
library(biglm)
library(bench)
n <- 20000
p <- 500
# simulate a response and p explanatory variables as an n x (p + 1) table
d <- rnorm(n * (p + 1)) |>
  matrix(ncol = (p + 1)) |>
  as_tibble(.name_repair = "unique")
# regress the first column (named ...1) on all of the others
expl_vars <- names(d) |>
  tail(-1) |>
  paste(collapse = " + ")
my_formula <- as.formula(paste("...1 ~ ", expl_vars))

system_time(lm(my_formula, data = d))
process    real 
  4.85s   4.85s 
system_time(biglm(my_formula, data = d))
process    real 
   3.1s    3.1s 

Here we see that the computation completed more quickly (and can be updated to incorporate more observations, unlike lm()). The biglm package is also useful in settings where there are many observations but not so many predictors. A related package is bigmemory. This package extends R’s capabilities to map memory to disk, allowing you to work with larger matrices.
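
The chunked updating described above is exposed directly through biglm’s update() method. The following sketch (our own illustration, reusing the simulated d and my_formula from above with an arbitrary choice of four chunks) fits the model to the first chunk and then folds in the remaining chunks one at a time.

library(biglm)

# split the rows of d into four arbitrary chunks
chunks <- split(d, rep(1:4, length.out = nrow(d)))

# fit the model to the first chunk, then update it with each remaining chunk
fit <- biglm(my_formula, data = chunks[[1]])
for (chunk in chunks[-1]) {
  fit <- update(fit, chunk)
}
coef(fit)[1:5]

In a real application, each chunk would be read from disk (or a database) just before it is processed, so that the full data set never has to reside in memory.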

21.2.2 Compilation

Python, SQL, and R are interpreted programming languages. This means that the code that you write in these languages gets translated into machine language on-the-fly as you execute it. The process is not altogether different than when you hear someone speaking in Russian on the news, and then you hear a halting English translation with a one- or two-second delay. Most of the time, the translation happens so fast that you don’t even notice.

Imagine that instead of translating the Russian speaker’s words on-the-fly, the translator took dictation, wrote down a thoughtful translation, and then re-recorded the segment in English. You would be able to process the English-speaking segment faster—because you are fluent in English. At the same time, the translation would probably be better, since more time and care went into it, and you would likely pick up subtle nuances that were lost in the on-the-fly translation. Moreover, once the English segment is recorded, it can be watched at any time without incurring the cost of translation again.

This alternative paradigm involves a one-time translation of the code called compilation. R code is not compiled (it is interpreted), but C++ code is. The result of compilation is a binary program that can be executed by the CPU directly. This is why, for example, you can’t write a desktop application in R, and executables written in C++ will be much faster than scripts written in R or Python. (To continue this line of reasoning, binaries written in assembly language can be faster than those written in C++, and binaries written in machine language can be faster than those written in assembly.)

If C++ is so much faster than R, then why write code in R? Here again, it is a trade-off. The code written in C++ may be faster, but when your programming time is taken into account you can often accomplish your task much faster by writing in R. This is because R provides extensive libraries that are designed to reduce the amount of code that you have to write. R is also interactive, so that you can keep a session alive and continue to write new code as you run the old code. This is fundamentally different from C++ development, where you have to re-compile every time you change a single line of code. The convenience of R programming comes at the expense of speed.

However, there is a compromise. Rcpp allows you to move certain pieces of your R code to C++. The basic idea is that Rcpp provides C++ data structures that correspond to R data structures (e.g., a data.frame data structure written in C++). It is thus possible to write small pieces of C++ code that operate directly on R objects and to call that compiled code from R, with minimal additional effort on the part of the R programmer. The dplyr package makes extensive use of this functionality to improve performance.
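
As a small illustration (ours, not drawn from dplyr), the cppFunction() helper in Rcpp compiles a C++ function supplied as a string and makes it callable from R like any other function.

library(Rcpp)

# compile a C++ function on the fly; NumericVector mirrors an R numeric vector
cppFunction("
double sum_cpp(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); i++) {
    total += x[i];
  }
  return total;
}
")

sum_cpp(rnorm(1e6))  # called from R; the loop itself runs as compiled C++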

21.2.3 Parallel and distributed computing

21.2.3.1 Embarrassingly parallel computing

How do you increase a program’s capacity to work with larger data? The most obvious way is to add more memory (i.e., RAM) to your computer. This enables the program to read more data at once, enabling greater functionality without any additional programming. But what if the bottleneck is not the memory, but the processor (CPU)? A processor can only do one thing at a time. So if you have a computation that takes \(t\) units of time and you have to repeat it for \(k\) different data sets, you can expect the whole job to take about \(kt\) units of time.

For example, suppose we generate 20 sets of 1 million \((x,y)\) random pairs and want to fit a regression model to each set.

n <- 1e6
k <- 20
d <- tibble(y = rnorm(n*k), x = rnorm(n*k), set = rep(1:k, each = n))

my_lm <- function(data, formula) { # wrapper so data can be piped in; lm() expects the formula first
  lm(formula, data)
}

fit_lm <- function(data, set_id) {
  data |>
    filter(set == set_id) |>
    my_lm(y ~ x)
}

However long it takes to do it for the first set, it should take about 20 times as long to do it for all 20 sets. This is as expected, since the computation procedure was to fit the regression model for the first set, then fit it for the second set, and so on.

system_time(map(1:1, fit_lm, data = d))
process    real 
  511ms   511ms 
system_time(map(1:k, fit_lm, data = d))
process    real 
  6.56s   6.56s 

However, in this particular case, the data in each of the 20 sets has nothing to do with the data in any of the other sets. This is an example of an embarrassingly parallel problem. These data are ripe candidates for a parallelized computation. If we had 20 processors, we could fit one regression model on each CPU—all at the same time—and get our final result in about the same time as it takes to fit the model to one set of data. This would be a tremendous improvement in speed.

Unfortunately, we don’t have 20 CPUs. Nevertheless, most modern computers have multiple cores.

library(parallel)
my_cores <- detectCores()
my_cores
[1] 12

The parallel package provides functionality for parallel computation in R. The furrr package extends the future package to allow us to express embarrassingly parallel computations in our familiar purrr syntax (Vaughan and Dancho 2022). Specifically, it provides a function future_map() that works just like map() (see Chapter 7), except that it spreads the computations over multiple cores. The theoretical speed-up is a function of my_cores, but in practice this may be less for a variety of reasons (most notably, the overhead associated with combining the parallel results).

The plan() function sets up a parallel computing environment. In this case, we are using the multisession mode, which will split computations across asynchronous separate R sessions. The workers argument to plan() controls the number of cores being used for parallel computation. Next, we fit the 20 regression models using the future_map() function instead of map(). Once completed, we set the computation mode back to sequential for the remainder of this chapter.

library(furrr)
plan(multisession, workers = my_cores)

system_time(
  future_map(1:k, fit_lm, data = d)
)
process    real 
  9.98s  11.56s
plan(sequential)

In this case, the overhead associated with combining the results was larger than the savings from parallelizing the computation. But this will not always be the case.
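
The parallel package itself offers similar functionality through mclapply(), a multicore analogue of lapply() that forks the running R session (on Windows, where forking is unavailable, it falls back to a single core). A minimal sketch of our own, reusing the objects defined above:

library(parallel)

# fit the k regression models across multiple cores via forked R processes
fits <- mclapply(1:k, fit_lm, data = d, mc.cores = my_cores)
length(fits)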

21.2.3.2 GPU computing and CUDA

Another fruitful avenue to speed up computations is through use of a graphical processing unit (GPU). These devices feature a highly parallel structure that can lead to significant performance gains. CUDA is a parallel computing platform and application programming interface created by NVIDIA (one of the largest manufacturers of GPUs). The OpenCL package provides R bindings to OpenCL, an open standard for general-purpose parallel programming on GPUs and other devices.

21.2.3.3 MapReduce

MapReduce is a programming paradigm for parallel computing. To solve a task using a MapReduce framework, two functions must be written:

  1. Map(key_0, value_0): The Map() function reads in the original data (which is stored in key-value pairs), and splits it up into smaller subtasks. It returns a list of key-value pairs \((key_1, value_1)\), where the keys and values are not necessarily of the same type as the original ones.
  2. Reduce(key_1, list(value_1)): The MapReduce implementation has a method for aggregating the key-value pairs returned by the Map() function by their keys (i.e., key_1). Thus, you only have to write the Reduce() function, which takes as input a particular key_1, and a list of all the value_1’s that correspond to key_1. The Reduce() function then performs some operation on that list, and returns a list of values.

MapReduce is efficient and effective because the Map() step can be highly parallelized. Moreover, MapReduce is also fault tolerant, because if any individual Map() job fails, the controller can simply start another one. The Reduce() step often provides functionality similar to a GROUP BY operation in SQL.
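
Before turning to a real implementation, it may help to see the pattern spelled out sequentially. The following toy sketch (our own, written in R and entirely serial) mimics the map, shuffle, and reduce stages for a word count; in a genuine framework the map calls would run in parallel on many machines, and the shuffle would route each key to the machine running its reducer.

library(purrr)

docs <- list(
  doc1 = "big data is big",
  doc2 = "data science is fun"
)

# Map: emit a (key = word, value = 1) pair for every word in every document
pairs <- docs |>
  map(~ strsplit(.x, " ")[[1]]) |>
  map(~ map(.x, \(w) list(key = w, value = 1))) |>
  flatten()

# Shuffle: group the emitted values by their keys
grouped <- split(map_dbl(pairs, "value"), map_chr(pairs, "key"))

# Reduce: sum the values associated with each key
word_counts <- map_dbl(grouped, sum)
word_counts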

Example

The canonical MapReduce example is to tabulate the frequency of each word in a large number of text documents (i.e., a corpus (see Chapter 19)). In what follows, we show an implementation written in Python by Bill Howe of the University of Washington (Howe 2014). Note that at the beginning, this bit of code calls an external MapReduce library that actually implements MapReduce. The user only needs to write the two functions shown in this block of code—not the MapReduce library itself.

import MapReduce
import sys

mr = MapReduce.MapReduce()

def mapper(record):
    key = record[0]
    value = record[1]
    words = value.split()
    for w in words:
      mr.emit_intermediate(w, 1)

def reducer(key, list_of_values):
    total = 0
    for v in list_of_values:
      total += v
    mr.emit((key, total))

if __name__ == '__main__':
  inputdata = open(sys.argv[1])
  mr.execute(inputdata, mapper, reducer)

We will use this MapReduce program to compile a word count for the issues raised on GitHub for the ggplot2 package. These are stored in a JSON file (see Chapter 6) as a single JSON array. Since we want to illustrate how MapReduce can parallelize over many files, we will convert this single array into a JSON object for each issue. This will mimic the typical use case. The jsonlite package provides functionality for converting between JSON objects and native R data structures.

library(jsonlite)
url <- "https://api.github.com/repos/tidyverse/ggplot2/issues"
gg_issues <- url |>
  fromJSON() |>
  select(url, body) |>
  group_split(url) |>
  map_chr(~toJSON(as.character(.x))) |>
  write(file = here::here("book/code/map-reduce/issues.json"))

For example, the first issue is displayed below. Note that it consists of two comma-separated character strings within brackets. We can think of this as having the format: [key, value].

here::here("book/code/map-reduce/issues.json") |>
  readLines() |>
  head(1) |>
  str_wrap(width = 70) |>
  cat()
["https://api.github.com/repos/tidyverse/ggplot2/issues/5164","We're
taking the following figure out of R4DS to save
space and thought it might be a useful addition to
https://ggplot2.tidyverse.org/articles/ggplot2-specs.html.\r\n\r\n```r\r\n#|
label: fig-just\r\n#| echo: false\r\n#| fig-width: 4.5\r\n#|
fig-asp: 0.5\r\n#| out-width: \"60%\"\r\n#| fig-cap: >\r\n#| All
nine combinations of `hjust` and `vjust`.\r\n#| fig-alt: >\r\n#| A
1x1 grid. At (0,0) hjust is set to left and vjust is set to bottom.
\r\n#| At (0.5, 0) hjust is center and vjust is bottom and at (1, 0)
hjust is \r\n#| right and vjust is bottom. At (0, 0.5) hjust is left
and vjust is \r\n#| center, at (0.5, 0.5) hjust is center and vjust
is center, and at (1, 0.5) \r\n#| hjust is right and vjust is center.
Finally, at (1, 0) hjust is left and \r\n#| vjust is top, at (0.5,
1) hjust is center and vjust is top, and at (1, 1) \r\n#| hjust is
right and vjust is bottom.\r\n\r\nvjust <- c(bottom = 0, center = 0.5,
top = 1)\r\nhjust <- c(left = 0, center = 0.5, right = 1)\r\n\r\ndf
<- crossing(hj = names(hjust), vj = names(vjust)) |>\r\n mutate(\r\n
y = vjust[vj],\r\n x = hjust[hj],\r\n label = paste0(\"hjust = '\",
hj, \"'\\n\", \"vjust = '\", vj, \"'\")\r\n )\r\n\r\nggplot(df,
aes(x, y)) +\r\n geom_point(color = \"grey70\", size = 5) +\r\n
geom_point(size = 0.5, color = \"red\") +\r\n geom_text(aes(label =
label, hjust = hj, vjust = vj), size = 4) +\r\n labs(x = NULL, y =
NULL)\r\n```\r\n\r\n![image](https://user-images.githubusercontent.com/5965649/215240855-7165637a-dc92-4981-a0a7-e74853875506.png)\r\n"]

In the Python code written above (which is stored in the file wordcount.py), the mapper() function takes a record argument (i.e., one line of the issues.json file), and examines its first two elements—the key becomes the first argument (in this case, the URL of the GitHub issue) and the value becomes the second argument (the text of the issue). After splitting the value on each space, the mapper() function emits a \((key, value)\) pair for each word. Thus, the first issue shown above would generate the pairs: (We’re, 1), (taking, 1), (the, 1), etc.

The MapReduce library provides a mechanism for efficiently collecting all of the resulting pairs based on the key, which in this case corresponds to a single word. The reducer() function simply adds up all of the values associated with each key. In this case, these values are all 1s, so the resulting pair is a word and the number of times it appears (e.g., (the, 214), etc.).

We can run this Python script from within R and bring the results into R for further analysis (here we call python3 through system2(); the reticulate package provides tighter integration between R and Python). We see that the most common words in this corpus are short articles and prepositions.

library(reticulate)

py_script <- here::here("book/code/map-reduce/wordcount.py")
py_issues <- here::here("book/code/map-reduce/issues.json")

res <- system2("python3", c(py_script, py_issues), stdout = TRUE)
library(mdsr)
freq_df <- res |>
  purrr::map(jsonlite::fromJSON) |>
  purrr::map(rlang::set_names, c("word", "count")) |>
  purrr::map(
    .f = function(x) { x["word"] <- as.character(x["word"]); x }
  ) |>
  bind_rows() |>
  mutate(count = parse_number(count))
glimpse(freq_df)
Rows: 2,652
Columns: 2
$ word  <chr> "We're", "taking", "the", "following", "figure", "out", "of", "R…
$ count <dbl> 1, 2, 214, 6, 1, 4, 99, 2, 162, 1, 6, 89, 3, 42, 9, 36, 80, 5, 1…
freq_df |>
  filter(str_detect(pattern = "[a-z]", word)) |>
  arrange(desc(count)) |>
  head(10)
# A tibble: 10 × 2
   word    count
   <chr>   <dbl>
 1 the       214
 2 to        162
 3 of         99
 4 and        89
 5 is         84
 6 a          80
 7 in         62
 8 ggplot2    62
 9 for        56
10 (local)    54

MapReduce is popular and offers some advantages over SQL for some problems. When MapReduce first became popular, and Google used it to redo their webpage ranking system (see Chapter 20), there was great excitement about a coming “paradigm shift” in parallel and distributed computing. Nevertheless, advocates of SQL have challenged the notion that SQL has been completely superseded by MapReduce (Stonebraker et al. 2010).

21.2.3.4 Hadoop

As noted previously, MapReduce requires a software implementation. One popular such implementation is Hadoop MapReduce, which is one of the core components of Apache Hadoop. Hadoop is a larger software ecosystem for storing and processing large data that includes a distributed file system, Pig, Hive, Spark, and other popular open-source software tools. While we won’t be able to go into great detail about these items, we will illustrate how to interface with Spark, which has a particularly tight integration with RStudio.

21.2.3.5 Spark

One nice feature of Apache Spark—especially for our purposes—is that while it requires a distributed file system, it can implement a pseudo-distributed file system on a single machine. This makes it possible for you to experiment with Spark on your local machine even if you don’t have access to a cluster. For obvious reasons, you won’t actually see the performance boost that parallelism can bring, but you can try it out and debug your code. Furthermore, the sparklyr package makes it painless to install a local Spark cluster from within R, as well as connect to a local or remote cluster.

Once the sparklyr package is installed, we can use it to install a local Spark cluster.

library(sparklyr)
spark_install(version = "3.0") # this only needs to be done once
# if Spark is not found, you may need to set the SPARK_HOME environment variable
Sys.getenv("SPARK_HOME")

Next, we make a connection to our local Spark instance from within R. Of course, if we were connecting to a remote Spark cluster, we could modify the master argument to reflect that. Spark requires Java, so you may have to install the Java Development Kit before using Spark.1

# sudo apt-get install openjdk-8-jdk
sc <- spark_connect(master = "local", version = "3.0")
class(sc)
[1] "spark_connection"       "spark_shell_connection" "DBIConnection"         

Note that sc has class DBIConnection—this means that it can do many of the things that other dplyr connections can do. For example, the src_tbls() function works just like it did on the MySQL connection objects we saw in Chapter 15.

src_tbls(sc)
character(0)

In this case, there are no tables present in this Spark cluster, but we can add them using the copy_to() command. Here, we will load the babynames table from the babynames package.

babynames_tbl <- sc |>
  copy_to(babynames::babynames, "babynames")
src_tbls(sc)
[1] "babynames"
class(babynames_tbl)
[1] "tbl_spark" "tbl_sql"   "tbl_lazy"  "tbl"      

The babynames_tbl object is a tbl_spark, but also a tbl_sql. Again, this is analogous to what we saw in Chapter 15, where a tbl_MySQLConnection was also a tbl_sql.

babynames_tbl |>
  filter(name == "Benjamin") |>
  group_by(year) |>
  summarize(N = n(), total_births = sum(n)) |>
  arrange(desc(total_births)) |>
  head()
# Source:     spark<?> [?? x 3]
# Ordered by: desc(total_births)
   year     N total_births
  <dbl> <dbl>        <dbl>
1  1989     2        15785
2  1988     2        15279
3  1987     2        14953
4  2000     2        14864
5  1990     2        14660
6  2016     2        14641

As we will see below with Google BigQuery, even though Spark is a parallelized technology designed to supersede SQL, it is still useful to know SQL in order to use Spark. Like BigQuery, sparklyr allows you to work with a Spark cluster using the familiar dplyr interface.

As you might suspect, because babynames_tbl is a tbl_sql, it implements SQL methods common in DBI. Thus, we can also write SQL queries against our Spark cluster.

library(DBI)
dbGetQuery(sc, "SELECT year, sum(1) as N, sum(n) as total_births
                FROM babynames WHERE name == 'Benjamin' 
                GROUP BY year
                ORDER BY total_births desc
                LIMIT 6")
  year N total_births
1 1989 2        15785
2 1988 2        15279
3 1987 2        14953
4 2000 2        14864
5 1990 2        14660
6 2016 2        14641

Finally, because Spark includes not only a database infrastructure, but also a machine learning library, sparklyr allows you to fit many of the models we outlined in Chapters 11 and 12 within Spark. This means that you can rely on Spark’s big data capabilities without having to bring all of your data into R’s memory.

As a motivating example, we fit a multiple regression model for the amount of rainfall at the MacLeish field station as a function of the temperature, pressure, and relative humidity.

library(macleish)
weather_tbl <- copy_to(sc, whately_2015)
weather_tbl |>
  ml_linear_regression(rainfall ~ temperature + pressure + rel_humidity) |>
  summary()
Deviance Residuals:
       Min         1Q     Median         3Q        Max 
-0.0828638 -0.0457327 -0.0237299 -0.0009378 16.2000751 

Coefficients:
  (Intercept)   temperature      pressure  rel_humidity 
 1.281287e+00  8.603249e-06 -1.353376e-03  1.034068e-03 

R-Squared: 0.01401
Root Mean Squared Error: 0.2268
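
When you have finished working with the cluster, it is good practice to close the connection (this step is not shown in the original workflow; spark_disconnect() is part of sparklyr).

spark_disconnect(sc)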

The most recent versions of RStudio include integrated support for management of Spark clusters.

21.2.4 Alternatives to SQL

Relational database management systems can be spread across multiple computers into what is called a cluster. In fact, it is widely acknowledged that one of the things that allowed Google to grow so fast was its use of the open-source (zero cost) MySQL RDBMS running as a cluster across many identical low-cost servers. That is, rather than investing large amounts of money in big machines, they built a massive MySQL cluster over many small, cheap machines. Both MySQL and PostgreSQL provide functionality for extending a single installation to a cluster.

Helpful Tip

Use a cloud-based computing service, such as Amazon Web Services, Google Cloud Platform, or Digital Ocean, for a low-cost alternative to building your own server farm (many of these companies offer free credits for student and instructor use).

21.2.4.1 BigQuery

BigQuery is a Web service offered by Google. Internally, the BigQuery service is supported by Dremel, the open-source version of which is Apache Drill. The bigrquery package for R provides access to BigQuery from within R.

To use the BigQuery service, you need to sign up for an account with Google, but you won’t be charged unless you exceed the free limit of 10,000 requests per day (the BigQuery sandbox provides free access subject to certain limits).

If you want to use your own data, you have to upload it to Google Cloud Storage, but Google provides many data sets that you can use for free (e.g., COVID, Census, real-estate transactions). Here we illustrate how to query the shakespeare data set—which is a list of all of the words that appear in Shakespeare’s plays—to find the most common words. Note that BigQuery understands a recognizable dialect of SQL—what makes BigQuery special is that it is built on top of Google’s massive computing architecture.

library(bigrquery)
project_id <- "my-google-id"  

sql <- "
SELECT word
, count(distinct corpus) AS numPlays
, sum(word_count) AS N
FROM [publicdata:samples.shakespeare] 
GROUP BY word
ORDER BY N desc
LIMIT 10
"
bq_project_query(project_id, sql) |>
  bq_table_download()
4.9 megabytes processed
   word numPlays     N
1   the       42 25568
2     I       42 21028
3   and       42 19649
4    to       42 17361
5    of       42 16438
6     a       42 13409
7   you       42 12527
8    my       42 11291
9    in       42 10589
10   is       42  8735

21.2.4.2 NoSQL

NoSQL refers not to a specific technology, but rather to a class of database architectures that are not based on the notion—so central to SQL (and to data.frames in R)—that a table consists of a rectangular array of rows and columns. Rather than being built around tables, NoSQL databases may be built around columns, key-value pairs, documents, or graphs. Nevertheless, NoSQL databases may (or may not) include an SQL-like query language for retrieving data.

One particularly successful NoSQL database is MongoDB, which is based on a document structure. In particular, MongoDB is often used to store JSON objects (see Chapter 6), which are not necessarily tabular.
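
Although we do not use it in this book, the mongolite package provides an R client for MongoDB. The following sketch (which assumes a MongoDB server running locally on the default port) gives a sense of the document-oriented interface: data are inserted as JSON documents, and queries are themselves expressed as JSON.

library(mongolite)

# connect to a collection (created on first insert) in a local MongoDB instance
issues <- mongo(collection = "issues", db = "test", url = "mongodb://localhost")

# an R data frame is inserted as one JSON document per row
issues$insert(data.frame(user = "hadley", n_comments = 3, label = "bug"))

# queries are JSON documents too
issues$find('{"user": "hadley"}')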

21.3 Alternatives to R

Python is a widely-used general-purpose, high-level programming language. You will find adherents for both R and Python, and while there are ongoing debates about which is “better,” there is no consensus. It is probably true that—for obvious reasons—computer scientists tend to favor Python, while statisticians tend to favor R. We prefer the latter but will not make any claims about its being “better” than Python. A well-rounded data scientist should be competent in both environments.

Python is a modular environment (like R) and includes many libraries for working with data. The most R-like is Pandas, but other popular auxiliary libraries include SciPy for scientific computation, NumPy for large arrays, matplotlib for graphics, and scikit-learn for machine learning.
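
Although a full introduction to Python is beyond our scope, the reticulate package makes it easy to experiment with these libraries from within R. A minimal sketch, assuming a Python installation with pandas available:

library(reticulate)

# reticulate converts an R data frame to a pandas DataFrame (and back)
py_mtcars <- r_to_py(mtcars)
py_mtcars$groupby("cyl")$mean()       # summarize using pandas methods
head(py_to_r(py_mtcars$describe()))   # convert the result back to an R data frame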

Other popular programming languages among data scientists include Scala and Julia. Scala supports a functional programming paradigm that has been promoted by Wickham (2019) and other R users. Julia has a smaller user base but nonetheless has many strong adherents.

21.4 Closing thoughts

Advances in computing power and the internet have changed the field of statistics in ways that only the greatest visionaries could have imagined. In the 20th century, the science of extracting meaning from data focused on developing inferential techniques that required sophisticated mathematics to squeeze the most information out of small data. In the 21st century, the science of extracting meaning from data has focused on developing powerful computational tools that enable the processing of ever larger and more complex data. While the essential analytical language of the last century—mathematics—is still of great importance, the analytical language of this century is undoubtedly programming. The ability to write code is a necessary but not sufficient condition for becoming a data scientist.

We have focused on programming in R, a well-worn interpreted language designed by statisticians for computing with data. We believe that as an open-source language with a broad following, R has significant staying power. Yet we recognize that all technological tools eventually become obsolete. Nevertheless, by absorbing the lessons in this book, you will have transformed yourself into a competent, ethical, and versatile data scientist—one who possesses the essential capacities for working with a variety of data programmatically. You can build and interpret models, query databases both local and remote, make informative and interactive maps, and wrangle and visualize data in various forms. Internalizing these abilities will allow them to permeate your work in whatever field interests you, for as long as you continue to use data to inform.

21.5 Further resources

Tools for working with big data analytics are developing more quickly than any of the other topics in this book. A special issue of The American Statistician addressed the training of students in statistics and data science (Horton and Hardin 2015). The issue included articles on teaching statistics at “Google-Scale” (Chamandy, Muralidharan, and Wager 2015) and on the teaching of data science more generally (Baumer 2015; Hardin et al. 2015). The board of directors of the American Statistical Association endorsed the Curriculum Guidelines for Undergraduate Programs in Data Science written by the Park City Math Institute (PCMI) Undergraduate Faculty Group (De Veaux et al. 2017). These guidelines recommended fusing statistical thinking into the teaching of techniques to solve big data problems.

A comprehensive survey of R packages for parallel computation and high-performance computing is available through the CRAN task view on that subject. The Parallel R book is another resource (McCallum and Weston 2011).

More information about Google BigQuery can be found at their website. A tutorial for SparkR is available on Apache’s website.


  1. Please check sparklyr for current information about which versions of the JDK are supported by which versions of Spark.↩︎