Chapter 16 Database administration

In Chapter 15, we learned how to write SELECT queries to retrieve data from an existing SQL server. Of course, those queries depend on the server being configured and the proper data having been loaded into it. In this chapter, we provide the tools necessary to set up a new database and populate it. Furthermore, we present concepts that will help you construct efficient databases that enable faster query performance. While the treatment herein is not sufficient to make you a seasoned database administrator, it should be enough to allow you to start experimenting with SQL databases on your own.

As in Chapter 15, the code that you see in this chapter illustrates exchanges between a MySQL server and a client. In places where R is involved, we will make that explicit. We assume that you are able to log in to a MySQL server. (See Appendix F for instructions on how to install, configure, and log in to an SQL server.)

16.1 Constructing efficient SQL databases

While it is often helpful to think about SQL tables as being analogous to data.frames in R, there are some important differences. In R, a data.frame is a list of vectors that have the same length. Each of those vectors has a specific data type (e.g., integers, character strings, etc.), but those data types can vary across the columns. The same is true of tables in SQL, but there are additional constraints that we can impose on SQL tables that can improve both the logical integrity of our data and the performance we can achieve when searching it.

16.1.1 Creating new databases

Once you have logged into MySQL, you can see what databases are available to you by running the SHOW DATABASES command at the mysql> prompt.

SHOW DATABASES;
Database
information_schema
airlines
fec
imdb
lahman
nyctaxi

In this case, the output indicates that the airlines database already exists. If it didn’t, we could create it using the CREATE DATABASE command.

CREATE DATABASE airlines;

Since we will continue to work with the airlines database, we can save ourselves some typing by utilizing the USE command to make that connection explicit.

USE airlines;

Now that we are confined to the airlines database, there is no ambiguity in asking what tables are present.

SHOW TABLES;
Tables_in_airlines
airports
carriers
flights
planes

16.1.2 CREATE TABLE

Recall that in Chapter 15 we used the DESCRIBE statement to display the definition of each table. This lists each field, its data type, whether there are keys or indices defined on it, and whether NULL values are allowed. For example, the airports table has the following definition.

DESCRIBE airports;
Field    Type           Null  Key  Default  Extra
faa      varchar(3)     NO    PRI
name     varchar(255)   YES        NA
lat      decimal(10,7)  YES        NA
lon      decimal(10,7)  YES        NA
alt      int(11)        YES        NA
tz       smallint(4)    YES        NA
dst      char(1)        YES        NA
city     varchar(255)   YES        NA
country  varchar(255)   YES        NA

We can see from the output that the faa, name, city, and country fields are defined as varchar (or variable character) fields. These fields contain character strings, but the allowed length of those strings varies. We know that each faa code is exactly three characters, and so we have codified that in the table definition. The dst field contains only a single character, indicating whether daylight saving time is observed at each airport. The lat and lon fields contain geographic coordinates, which can have up to three digits before the decimal point (the largest possible longitude is 180) and up to seven digits after it. The tz field can hold up to a four-digit integer, while the alt field is allowed eleven digits. In this case, NULL values are allowed—and are the default—in all of the fields except for faa, which is the primary key. R translates the null value in SQL (NULL) to the null value in R (NA).

These definitions did not come out of thin air, nor were they automatically generated. In this case, we wrote them by hand in a CREATE TABLE statement, which the SHOW CREATE TABLE command displays:

SHOW CREATE TABLE airports;
CREATE TABLE `airports` (
  `faa` varchar(3) NOT NULL DEFAULT '',
  `name` varchar(255) DEFAULT NULL,
  `lat` decimal(10,7) DEFAULT NULL,
  `lon` decimal(10,7) DEFAULT NULL,
  `alt` int(11) DEFAULT NULL,
  `tz` smallint(4) DEFAULT NULL,
  `dst` char(1) DEFAULT NULL,
  `city` varchar(255) DEFAULT NULL,
  `country` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`faa`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

The CREATE TABLE command starts by defining the name of the table, and then proceeds to list the field definitions in a comma-separated list. If you want to build a database from scratch—as we do in Section 16.3—you will have to write these definitions for each table.[29] Tables that have already been created can be modified using the ALTER TABLE command. For example, the following will shorten the tz field to two digits and set its default value to zero.

ALTER TABLE airports CHANGE tz tz smallint(2) DEFAULT 0;

16.1.3 Keys

Two related but different concepts are keys and indices. The former offers some performance advantages but is primarily useful for imposing constraints on possible entries in the database, while the latter is purely about improving the speed of retrieval.

Different relational database management systems (RDBMSs) may implement a variety of different kinds of keys, but three types are most common. In each case, suppose that we have a table with \(n\) rows and \(p\) columns.

  • PRIMARY KEY: a column or set of columns in a table that uniquely identifies each row. By convention, this column is often called id. A table can have at most one primary key, and in general it is considered good practice to define a primary key on every table (although there are exceptions to this rule). If the index spans \(k < p\) columns, then even though the primary key must by definition have \(n\) rows itself, it only requires \(nk\) pieces of data, rather than the \(np\) that the full table occupies. Thus, the primary key is always smaller than the table itself, and is thus faster to search. A second critically important role of the primary key is enforcement of non-duplication. If you try to insert a row into a table that would result in a duplicate entry for the primary key, you will get an error.
  • UNIQUE KEY: a column or set of columns in a table that uniquely identifies each row, except for rows that contain NULL in some of those attributes. Unlike primary keys, a single table may have many unique keys. A typical use for these is in a lookup table. For example, Ted Turocy maintains a register of player ids for professional baseball players across multiple data providers. Each row in this table is a different player, and the primary key is a randomly-generated hash—each player gets exactly one value. However, each row also contains that same player’s id in systems designed by MLBAM, Baseball-Reference, Baseball Prospectus, Fangraphs, etc. This is tremendously useful for researchers working with multiple data providers, since they can easily link a player’s statistics in one system to his information in another. However, this ability is predicated on the uniqueness of each player’s id in each system. Moreover, many players may not have an id in every system, since some data providers track minor league baseball, or even the Japanese and Korean professional leagues. Thus, the imposition of a unique key—which allows NULLs—is necessary to maintain the integrity of these data.
  • FOREIGN KEY: a column or set of columns that reference a primary key in another table. For example, the primary key in the carriers table is carrier. The carrier column in the flights table, which consists of carrier IDs, is a foreign key that references carriers.carrier. Foreign keys don’t offer any performance enhancements, but they are important for maintaining referential integrity, especially in transactional databases that have many insertions and deletions. A sketch showing all three key types in a pair of table definitions appears after this list.
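
To make these definitions concrete, here is a hedged sketch of how all three key types might be declared in MySQL. The table and column names are invented, echoing the baseball example above; note also that foreign key constraints are enforced by the InnoDB storage engine, but not by MyISAM.

CREATE TABLE `players` (
  `id` char(8) NOT NULL,               -- primary key: exactly one row per player
  `retro_id` varchar(8) DEFAULT NULL,  -- this player's id in another provider's system
  `bbref_id` varchar(9) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY (`retro_id`),             -- unique where present; NULLs are allowed
  UNIQUE KEY (`bbref_id`)
) ENGINE=InnoDB;

CREATE TABLE `batting` (
  `player_id` char(8) NOT NULL,
  `year` smallint NOT NULL,
  FOREIGN KEY (`player_id`) REFERENCES `players` (`id`)  -- referential integrity
) ENGINE=InnoDB;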

You can use the SHOW KEYS command to identify the keys in a table. Note that the carriers table has only one key defined: a primary key on carrier.

SHOW KEYS FROM carriers;
Table     Non_unique  Key_name  Seq_in_index  Column_name  Cardinality
carriers  0           PRIMARY   1             carrier      1610

16.1.4 Indices

While keys help maintain the integrity of the data, indices impose no constraints—they simply enable faster retrieval. An index is a lookup table that helps SQL keep track of which records contain certain values. Judicious use of indices can dramatically speed up retrieval times. The technical implementation of efficient indices is an active area of research among computer scientists, and fast indices are one of the primary advantages that differentiate SQL tables from large R data frames.

Indices have to be built by the database in advance, and they are then written to the disk. Thus, indices take up space on the disk (this is one of the reasons that they aren’t implemented in R). For some tables with many indices, the size of the indices can even exceed the size of the raw data. Thus, when building indices, there is a trade-off to consider: You want just enough indices but not too many.
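
You can inspect this trade-off directly, since MySQL records the disk footprint of each table and its indices in the standard information_schema metadata. A sketch:

-- Compare the size (in bytes) of the raw data and the indices for each table:
SELECT table_name, data_length, index_length
  FROM information_schema.tables
  WHERE table_schema = 'airlines';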

Consider the task of locating all of the rows in the flights table that contain the origin value BDL. These rows are strewn about the table in no particular order. How would you find them? A simple approach would be to start with the first row, examine the origin field, grab it if it contains BDL, and otherwise move to the second row. In order to ensure that all of the matching rows are returned, this algorithm must check every single one of the \(n = 48\) million rows[30] in this table! So its speed is \(O(n)\). However, we have built an index on the origin column, and this index contains only 2,266 rows (see Table 16.1). Each row in the index corresponds to exactly one value of origin, and contains a lookup for the exact rows in the table that are specific to that value. Thus, when we ask for the rows for which origin is equal to BDL, the database will use the index to deliver those rows very quickly. In practice, the retrieval speed for indexed columns can be \(O(\ln{n})\) (or better)—which is a tremendous advantage when \(n\) is large.
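
An index like the one on origin can be declared when the table is created, or added afterwards. A minimal sketch (the index name is our choice):

-- Build a lookup table on the origin column to avoid full table scans:
CREATE INDEX Origin ON flights (origin);
-- Equivalently: ALTER TABLE flights ADD INDEX Origin (origin);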

The speed-up that indices can provide is often especially apparent when joining two large tables. To see why, consider the following toy example. Suppose we want to merge two tables on the columns whose values are listed below. To merge these records correctly, we have to do a lot of work to find the appropriate value in the second list that matches each value in the first list.

[1]  5 18  2  3  4  2  1
[1]  5  6  3 18  4  7  1  2

On the other hand, consider performing the same task on the same set of values, but with the values sorted ahead of time. Now the merging task is very fast, because we can quickly locate the matching records. In effect, by keeping the records sorted, we have moved the sorting work to before the merge, resulting in much faster merging performance. However, this requires that we sort the records in the first place and then keep them sorted. This may slow down other operations—such as inserting new records—which now have to be done more carefully.

[1]  1  2  2  3  4  5 18
[1]  1  2  3  4  5  6  7 18

SHOW INDEXES FROM flights;
Table 16.1: Indices in the flights table.
Table    Non_unique  Key_name  Seq_in_index  Column_name  Cardinality
flights  1           Year      1             year         7
flights  1           Date      1             year         7
flights  1           Date      2             month        89
flights  1           Date      3             day          2712
flights  1           Origin    1             origin       2267
flights  1           Dest      1             dest         2267
flights  1           Carrier   1             carrier      134
flights  1           tailNum   1             tailnum      37862

In MySQL the SHOW INDEXES command is equivalent to SHOW KEYS. Note that the flights table has several keys defined, but no primary key (see Table 16.1). The key Date spans the three columns year, month, and day.

16.1.5 Query plans

It is important to have the right indices built for your specific data and the queries that are likely to be run on it. Unfortunately, there is not always a straightforward answer to the question of which indices to build. For the flights table, it seems likely that many queries will involve searching for flights from a particular origin, or to a particular destination, or during a particular year (or range of years), or on a specific carrier, and so we have built indices on each of these columns. We have also built the Date index, since it seems likely that people would want to search for flights on a certain date. However, it does not seem so likely that people would search for flights in a specific month across all years, and thus we have not built an index on month alone. The Date index contains the month column, but this index can only be used if year is also part of the query.

You can ask MySQL for information about how it is going to perform a query using the EXPLAIN syntax. This will help you understand how onerous your query is, without actually running it—saving you the time of having to wait for it to execute. This output reflects the query plan returned by the MySQL server.

If we were to run a query for long flights using the distance column, the server would have to inspect each of the 48 million rows, since this column is not indexed. This is the slowest possible search, and it is often called a table scan. The 48 million figure that you see in the rows column is an estimate of the number of rows that MySQL will have to consult in order to process the query. In general, more rows mean a slower query.

EXPLAIN SELECT * FROM flights WHERE distance > 3000;
table partitions type possible_keys key key_len ref rows filtered
flights p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 ALL NA NA NA NA 47932811 33.3

On the other hand, if we search for recent flights using the year column, which has an index built on it, then we only need to consider a fraction of those rows (about 6.3 million).

EXPLAIN SELECT * FROM flights WHERE year = 2013;
table partitions type possible_keys key key_len ref rows filtered
flights p27 ALL Year,Date NA NA NA 6369482 100

Note that for the second example the server could have used either the index Year or the index Date (which contains the column year). Because of the index, only the 6.3 million flights from 2013 were consulted. Similarly, if we search by year and month, we can use the Date index.

EXPLAIN SELECT * FROM flights WHERE year = 2013 AND month = 6;
table partitions type possible_keys key key_len ref rows filtered
flights p27 ref Year,Date Date 6 const,const 714535 100

But if we search for months across all years, we can’t! The query plan results in a table scan again.

EXPLAIN SELECT * FROM flights WHERE month = 6;
table partitions type possible_keys key key_len ref rows filtered
flights p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 ALL NA NA NA NA 47932811 10

This is because although month is part of the Date index, it is the second column in the index, and thus it doesn’t help us when we aren’t filtering on year. Thus, if it were common for our users to search on month without year, it would probably be worth building an index on month. Were we to actually run these queries, there would be a significant difference in computational time.
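
If month-only searches turned out to be common, the remedy would be a one-line sketch:

-- Add a single-column index so that WHERE month = ... no longer requires a table scan:
ALTER TABLE flights ADD INDEX Month (month);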

Using indices is especially important when performing JOIN operations on large tables. In this example, both queries use indices. However, because the cardinality of the index on tailnum is much larger than the cardinality of the index on year (see Table 16.1), the number of rows in flights associated with each unique value of tailnum is smaller than the number associated with each unique value of year. Thus, the first query runs faster.

EXPLAIN 
  SELECT * FROM planes p 
  LEFT JOIN flights o ON p.tailnum = o.TailNum
  WHERE manufacturer = 'BOEING';
table partitions type possible_keys key key_len ref rows filtered
p NA ALL NA NA NA NA 3322 10
o p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 ref tailNum tailNum 9 airlines.p.tailnum 1266 100
EXPLAIN 
  SELECT * FROM planes p 
  LEFT JOIN flights o ON p.Year = o.Year
  WHERE manufacturer = 'BOEING';
table partitions type possible_keys key key_len ref rows filtered
p NA ALL NA NA NA NA 3322 10
o p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 ref Year,Date Year 3 airlines.p.year 6450117 100

16.1.6 Partitioning

Another approach to speeding up queries on large tables (like flights) is partitioning. Here, we could create partitions based on the year. For flights, this would instruct the server to physically write the flights table as a series of smaller tables, each one specific to a single value of year. At the same time, the server would create a logical supertable, so that to the user, the appearance of flights would be unchanged. This acts like a preemptive index on the year column; a sketch of the declaration follows.
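
A hedged sketch of what such a declaration might look like in MySQL (the cut points are illustrative):

ALTER TABLE flights
  PARTITION BY RANGE (year) (
    PARTITION p2013 VALUES LESS THAN (2014),   -- one physical sub-table per year
    PARTITION p2014 VALUES LESS THAN (2015),
    PARTITION p2015 VALUES LESS THAN (2016),
    PARTITION pmax VALUES LESS THAN MAXVALUE   -- catch-all for later years
  );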

If most of the queries to the flights table were for a specific year or range of years, then partitioning could significantly improve performance, since most of the rows would never be consulted. For example, if most of the queries to the flights database were for the past three years, then partitioning would reduce the search space of most queries on the full data set to the roughly 20 million flights in the last three years, instead of the 169 million rows in the last 20 years. But here again, if most of the queries to the flights table were about carriers across years, then this type of partitioning would not help at all. It is the job of the database designer to tailor the database structure to the pattern of queries coming from the users. As a data scientist, you may have to tailor the database structure to the queries that you are running.

16.2 Changing SQL data

In Chapter 15, we described how to query an SQL database using the SELECT command. Thus far in this chapter, we have discussed how to set up an SQL database, and how to optimize it for speed. None of these operations actually change data in an existing database. In this section, we will briefly touch upon the UPDATE and INSERT commands, which allow you to do exactly that.

16.2.1 Changing data

The UPDATE command allows you to reset values in a table across all rows that match certain criteria. For example, in Chapter 15 we discussed the possibility that airports could change names over time. The airport in Washington, D.C. with code DCA is now called Ronald Reagan Washington National.

SELECT faa, name FROM airports WHERE faa = 'DCA';
faa name
DCA Ronald Reagan Washington Natl

However, the “Ronald Reagan” prefix was added in 1998. If—for whatever reason—we wanted to go back to the old name, we could use an UPDATE command to change that information in the airports table.

UPDATE airports 
  SET name = 'Washington National' 
  WHERE faa = 'DCA';

An UPDATE operation can be very useful when you have to apply wholesale changes over a large number of rows. However, extreme caution is necessary, since an imprecise UPDATE query can wipe out large quantities of data, and there is no “undo” operation!

Exercise extraordinary caution when performing UPDATEs.
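
One safeguard is to preview the affected rows with a SELECT that uses the identical WHERE clause. Another, available on a transactional storage engine such as InnoDB (the MyISAM tables built in this chapter do not support transactions), is to wrap the change in a transaction. A sketch:

START TRANSACTION;
UPDATE airports SET name = 'Washington National' WHERE faa = 'DCA';
SELECT faa, name FROM airports WHERE faa = 'DCA';  -- verify the change
ROLLBACK;  -- or COMMIT; once you are satisfied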

16.2.2 Adding data

New data can be appended to an existing table with the INSERT command. There are actually three things that can happen, depending on what you want to do when you have a primary key conflict. This occurs when one of the new rows that you are trying to insert has the same primary key value as one of the existing rows in the table.

  • INSERT: Try to insert the new rows. If there is a primary key conflict, quit and throw an error.
  • INSERT IGNORE: Try to insert the new rows. If there is a primary key conflict, skip inserting the conflicting rows and leave the existing rows untouched. Continue inserting data that does not conflict.
  • REPLACE: Try to insert the new rows. If there is a primary key conflict, overwrite the existing rows with the new ones. Continue inserting data that does not conflict.

Recall that in Chapter 15 we found that the airports in Puerto Rico were not present in the airports table. If we wanted to add these manually, we could use INSERT.

INSERT INTO airports (faa, name) 
  VALUES ('SJU', 'Luis Munoz Marin International Airport');

Since faa is the primary key on this table, we can insert this row without contributing values for all of the other fields. In this case, the new row corresponding to SJU would have the faa and name fields as noted above, and the default values for all of the other fields. If we were to run this operation a second time, we would get an error, because of the primary key collision on SJU. We could avoid the error by choosing to INSERT IGNORE or REPLACE instead of INSERT.
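
A sketch of the two alternatives applied to the same row:

-- Skip the conflicting row and leave the existing one untouched:
INSERT IGNORE INTO airports (faa, name)
  VALUES ('SJU', 'Luis Munoz Marin International Airport');
-- Or overwrite the existing row with the new values:
REPLACE INTO airports (faa, name)
  VALUES ('SJU', 'Luis Munoz Marin International Airport');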

16.2.3 Importing data from a file

In practice, we rarely add new data manually in this manner. Instead, new data are most often added using the LOAD DATA command. This allows a file containing new data—usually a CSV—to be inserted in bulk. This is very common when, for example, your data come to you daily in a CSV file and you want to keep your database up to date. The primary key collision concepts described above also apply to the LOAD DATA syntax, and are important to understand for proper database maintenance. We illustrate the use of LOAD DATA in Section 16.3.

16.3 Extended example: Building a database

The extract-transform-load (ETL) paradigm is common among data professionals. The idea is that many data sources need to be extracted from some external source, transformed into a different format, and finally loaded into a database system. Often, this is an iterative process that needs to be done every day, or even every hour. In such cases, developing the infrastructure to automate these steps can result in dramatically increased productivity.

In this example, we will illustrate how to set up a MySQL database for the babynames data using the command line and SQL scripts, with R appearing only briefly to export the data to CSVs. As noted previously, while the dplyr package has made R a viable interface for querying and populating SQL databases, it is occasionally necessary to get “under the hood” with SQL. The files that correspond to this example can be found on the book website at http://mdsr-book.github.io/.

16.3.1 Extract

In this case, our data already lives in an R package, but in most cases, your data will live on a website, or be available in a different format. Our goal is to take that data from wherever it is and download it. For the babynames data, there isn’t much to do, since we already have the data in an R package. We will simply load it.

library(tidyverse)
library(babynames)

16.3.2 Transform

Since SQL tables conform to a row-and-column paradigm, our goal during the transform phase is to create CSV files (see Chapter 6) for each of the tables. In this example, we will create CSVs for the babynames and births tables. You can try to add the applicants and lifetables tables on your own. We will simply write these data to CSV files using the write_csv() command. Since the babynames table is very long (nearly 2 million rows), we will just use the more recent data.

babynames %>%
  filter(year > 1975) %>%
  write_csv("babynames.csv")
births %>%
  write_csv("births.csv")
list.files(".", pattern = ".csv")  
[1] "babynames.csv" "births.csv"   

This raises an important question: What should we call these objects? The babynames package includes a data frame called babynames with one row per sex per year per name. Having both the database and a table with the same name may be confusing. To clarify which is which, we will call the database babynamedb and the table babynames.

Spending time thinking about the naming of databases, tables, and fields before you create them can help avoid confusion later on.

16.3.3 Load into MySQL database

Next, we need to write a script that will define the table structure for these two tables in a MySQL database (instructions for creation of a database in SQLite can be found in Section F.4.4). This script will have four parts:

  1. a USE statement that ensures we are in the right schema/database
  2. a series of DROP TABLE statements that drop any old tables with the same names as the ones we are going to create
  3. a series of CREATE TABLE statements that specify the table structures
  4. a series of LOAD DATA statements that read the data from the CSVs into the appropriate tables

The first part is easy:

USE babynamedb;

This assumes that we have a local database called babynamedb—we will create it later. The second part is easy in this case, since we only have two tables. The DROP TABLE statements that follow ensure that we can run this script as many times as we want.

DROP TABLE IF EXISTS babynames;
DROP TABLE IF EXISTS births;

Be careful with the DROP TABLE statement. It destroys data.

The third step is the trickiest part—we need to define the columns precisely. The str(), summary(), and glimpse() functions are particularly useful for matching up R data types with MySQL data types. Please see the MySQL documentation for more information about what data types are supported.

glimpse(babynames)
Rows: 1,924,665
Columns: 5
$ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1…
$ sex  <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "…
$ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "I…
$ n    <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1…
$ prop <dbl> 0.07238, 0.02668, 0.02052, 0.01987, 0.01789, 0.01617, 0.01508…

In this case, we know that the year variable will only contain four-digit integers, so we can specify that this column take up only that much room in SQL. Similarly, the sex variable is just a single character, so we can restrict the width of that column as well. These savings probably won’t matter much in this example, but for large tables they can make a noticeable difference.

CREATE TABLE `babynames` (
  `year` smallint(4) NOT NULL DEFAULT 0,
  `sex` char(1) NOT NULL DEFAULT 'F',
  `name` varchar(255) NOT NULL DEFAULT '',
  `n` mediumint(7) NOT NULL DEFAULT 0,
  `prop` decimal(21,20) NOT NULL DEFAULT 0,
  PRIMARY KEY (`year`, `sex`, `name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

In this table, each row contains the information about one name for one sex in one year. Thus, each row contains a unique combination of those three variables, and we can therefore define a primary key across those three fields. Note the use of backquotes (to denote tables and variables) and the use of regular quotes (for default values).

glimpse(births)
Rows: 109
Columns: 2
$ year   <int> 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918,…
$ births <int> 2718000, 2777000, 2809000, 2840000, 2869000, 2966000, 29650…

CREATE TABLE `births` (
  `year` smallint(4) NOT NULL DEFAULT 0,
  `births` mediumint(8) NOT NULL DEFAULT 0,
  PRIMARY KEY (`year`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

Finally, we have to tell MySQL where to find the CSV files and where to put the data it finds in them. This is accomplished using the LOAD DATA command. You may also need to add a LINES TERMINATED BY '\r\n' clause (see the sketch after this code block), because lines terminate with different characters in different operating systems; Windows, Mac, and Linux users may have to tweak these commands to suit their needs. The SHOW WARNINGS commands are not necessary, but they will help with debugging.

LOAD DATA LOCAL INFILE './babynames.csv' INTO TABLE `babynames` 
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES;
SHOW WARNINGS;
LOAD DATA LOCAL INFILE './births.csv' INTO TABLE `births`
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES;
SHOW WARNINGS;
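
For instance, if the CSVs had been produced on Windows, the first statement might become (a sketch):

LOAD DATA LOCAL INFILE './babynames.csv' INTO TABLE `babynames`
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\r\n' IGNORE 1 LINES;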

Putting this all together, we have the following script:

USE babynamedb;

DROP TABLE IF EXISTS babynames;
DROP TABLE IF EXISTS births;

CREATE TABLE `babynames` (
  `year` smallint(4) NOT NULL DEFAULT 0,
  `sex` char(1) NOT NULL DEFAULT 'F',
  `name` varchar(255) NOT NULL DEFAULT '',
  `n` mediumint(7) NOT NULL DEFAULT 0,
  `prop` decimal(21,20) NOT NULL DEFAULT 0,
  PRIMARY KEY (`year`, `sex`, `name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `births` (
  `year` smallint(4) NOT NULL DEFAULT 0,
  `births` mediumint(8) NOT NULL DEFAULT 0,
  PRIMARY KEY (`year`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

LOAD DATA LOCAL INFILE './babynames.csv' INTO TABLE `babynames` 
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' IGNORE 1 LINES;
LOAD DATA LOCAL INFILE './births.csv' INTO TABLE `births`
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' IGNORE 1 LINES;

SELECT year
  , COUNT(DISTINCT name) AS numNames
  , SUM(n) AS numBirths
  FROM babynames 
  GROUP BY year
  ORDER BY numBirths DESC
  LIMIT 0,10;

Note that we have added a SELECT query just to verify that our table is populated. To load this into MySQL, we must first make sure that the babynamedb database exists, and if not, we must create it.

First, we check to see if babynamedb exists. We can do this from the command line using shell commands:

mysql -e "SHOW DATABASES;"

If it doesn’t exist, then we must create it:

mysql -e "CREATE DATABASE babynamedb;"

Finally, we run our script. The --show-warnings and -v flags are optional, but will help with debugging.

mysql --local-infile --show-warnings -v babynamedb 
  < babynamedb.mysql

In practice, if your SQL script is not perfect, you will see errors or warnings the first time you try this. But by iterating this process, you will eventually refine your script such that it works as desired. If you get error 1148, make sure that you are using the --local-infile flag.

ERROR 1148 (42000): The used command is not allowed with this MySQL version

If you get error 29, make sure that the file exists in this location and that the mysql user has permission to read it.

ERROR 29 (HY000): File './babynames.csv' not found (Errcode: 13)

Once the MySQL database has been created, the following commands can be used to access it from R using dplyr:

library(DBI)
library(dplyr)
db <- dbConnect(RMySQL::MySQL(), dbname = "babynamedb")
babynames <- tbl(db, "babynames")
babynames %>% 
  filter(name == "Benjamin") 
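
From here, dplyr verbs are translated into SQL and executed on the server. A hedged sketch (assuming the db connection above) of pulling an aggregated result back into R as a tibble:

babynames %>%
  group_by(year) %>%
  summarize(num_births = sum(n)) %>%
  collect()   # runs the aggregation on the server and returns a tibble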

16.4 Scalability

With the exception of SQLite, RDBMSs scale very well on a single computer to databases that take up dozens of gigabytes. For a dedicated server, even terabytes are workable on a single machine. Beyond this, many companies employ distributed solutions called clusters. A cluster is simply more than one machine (i.e., a node) linked together running the same RDBMS. One machine is designated as the head node, and it controls all of the other nodes. The actual data are distributed across the various nodes, and the head node manages queries—parceling them out to the appropriate cluster nodes.

A full discussion of clusters and other distributed architectures (including replication) is beyond the scope of this book. In Chapter 21, we discuss alternatives to SQL that may provide higher-end solutions for bigger data.

16.5 Further resources

The SQL in a Nutshell book (Kline et al. 2008) is a useful reference for all things SQL.

16.6 Exercises

Problem 1 (Easy): Alice is searching for cancelled flights in the flights table, and her query is running very slowly. She decides to build an index on cancelled in the hopes of speeding things up. Discuss the relative merits of her plan. What are the trade-offs? Will her query be any faster?

Problem 2 (Medium): The Master table of the Lahman database contains biographical information about baseball players. The primary key is the playerID variable. There are also variables for retroID and bbrefID, which correspond to the player’s identifier in other baseball databases. Discuss the ramifications of placing a primary, unique, or foreign key on retroID.

Problem 3 (Medium): Bob wants to analyze the on-time performance of United Airlines flights across the decade of the 1990s. Discuss how the partitioning scheme of the flights table based on year will affect the performance of Bob’s queries, relative to an unpartitioned table.

Problem 4 (Hard): Use the macleish package to download the weather data at the MacLeish Field Station. Write your own table schema from scratch and import these data into the database server of your choice.

Problem 5 (Hard): Write a full table schema for the two tables in the fueleconomy package and import them into the database server of your choice.

Problem 6 (Hard): Write a full table schema for the mtcars data set and import it into the database server of your choice.

Problem 7 (Hard): Write a full table schema for the five tables in the nasaweather package and import them into the database server of your choice.

Problem 8 (Hard): Write a full table schema for two of the ten tables in the usdanutrients package and import them into the database server of your choice.

# remotes::install_github("hadley/usdanutrients")
library(usdanutrients)
# data(package="usdanutrients")

16.7 Supplementary exercises

Available at https://mdsr-book.github.io/mdsr2e/sql-II.html#sqlII-online-exercises

Problem 1 (Easy): The flights table in the airlines database contains the following indexes:

SHOW INDEXES FROM flights;
Table    Non_unique  Key_name  Seq_in_index  Column_name  Collation  Cardinality  Sub_part  Packed  Null  Index_type  Comment  Index_comment
flights  1           Year      1             year         A          7            NA        NA      YES   BTREE
flights  1           Date      1             year         A          7            NA        NA      YES   BTREE
flights  1           Date      2             month        A          89           NA        NA      YES   BTREE
flights  1           Date      3             day          A          2712         NA        NA      YES   BTREE
flights  1           Origin    1             origin       A          2267         NA        NA            BTREE
flights  1           Dest      1             dest         A          2267         NA        NA            BTREE
flights  1           Carrier   1             carrier      A          134          NA        NA            BTREE
flights  1           tailNum   1             tailnum      A          37862        NA        NA      YES   BTREE

Consider the following queries:

SELECT * FROM flights WHERE cancelled = 1;
SELECT * FROM flights WHERE carrier = "DL";

Which query will execute faster? Justify your answer.

Problem 2 (Hard): Use the fec12 package to download and unzip the federal election data for 2012 that were used in Chapter 2. Write your own table schema from scratch and import these data into the database server of your choice.


  [29] There are ways of automatically generating table schemas, but in many cases some manual tweaking is recommended.

  [30] The db instance referenced here contains flight data from 2010–2017 only.