Chapter 16 Database administration
In Chapter 15, we learned how to write `SELECT` queries to retrieve data from an existing SQL server. Of course, these queries depend on that server being configured and the proper data being loaded into it. In this chapter, we provide the tools necessary to set up a new database and populate it. Furthermore, we present concepts that will help you construct efficient databases that enable faster query performance. While the treatment herein is not sufficient to make you a seasoned database administrator, it should be enough to allow you to start experimenting with SQL databases on your own.
As in Chapter 15, the code that you see in this chapter illustrates exchanges between a MySQL server and a client. In places where R is involved, we will make that explicit. We assume that you are able to log in to a MySQL server. (See Appendix F for instructions on how to install, configure, and log in to an SQL server.)
16.1 Constructing efficient SQL databases
While it is often helpful to think about SQL tables as being analogous to `data.frame`s in R, there are some important differences. In R, a `data.frame` is a `list` of vectors that have the same length. Each of those vectors has a specific data type (e.g., integers, character strings, etc.), but those data types can vary across the columns. The same is true of tables in SQL, but there are additional constraints that we can impose on SQL tables that can improve both the logical integrity of our data, as well as the performance we can achieve when searching it.
16.1.1 Creating new databases
Once you have logged into MySQL, you can see what databases are available to you by running the `SHOW DATABASES` command at the `mysql>` prompt.
SHOW DATABASES;
Database |
---|
information_schema |
airlines |
fec |
imdb |
lahman |
nyctaxi |
In this case, the output indicates that the `airlines` database already exists. If it didn't, we could create it using the `CREATE DATABASE` command.
CREATE DATABASE airlines;
Since we will continue to work with the `airlines` database, we can save ourselves some typing by utilizing the `USE` command to make that connection explicit.
USE airlines;
Now that we are confined to the `airlines` database, there is no ambiguity in asking what tables are present.
SHOW TABLES;
Tables_in_airlines |
---|
airports |
carriers |
flights |
planes |
16.1.2 CREATE TABLE
Recall that in Chapter 15 we used the `DESCRIBE` statement to display the definition of each table. This lists each field, its data type, whether there are keys or indices defined on it, and whether `NULL` values are allowed. For example, the `airports` table has the following definition.
DESCRIBE airports;
Field | Type | Null | Key | Default | Extra |
---|---|---|---|---|---|
faa | varchar(3) | NO | PRI | | |
name | varchar(255) | YES | | NA | |
lat | decimal(10,7) | YES | | NA | |
lon | decimal(10,7) | YES | | NA | |
alt | int(11) | YES | | NA | |
tz | smallint(4) | YES | | NA | |
dst | char(1) | YES | | NA | |
city | varchar(255) | YES | | NA | |
country | varchar(255) | YES | | NA | |
We can see from the output that the `faa`, `name`, `city`, and `country` fields are defined as `varchar` (or variable character) fields. These fields contain character strings, but the length of the strings allowed varies. We know that the `faa` code is restricted to three characters, and so we have codified that in the table definition. The `dst` field contains only a single character, indicating whether daylight saving time is observed at each airport. The `lat` and `lon` fields contain geographic coordinates, which can be three-digit numbers (i.e., the maximum value is 180) with up to seven decimal places. The `tz` field can be up to a four-digit integer, while the `alt` field is allowed eleven digits.
In this case, `NULL` values are allowed (and are the default) in all of the fields except for `faa`, which is the primary key. R translates the null value in SQL (`NULL`) to the null value in R (`NA`).
These definitions did not come out of thin air, nor were they automatically generated. In this case, we wrote them by hand, in the following `CREATE TABLE` statement:
SHOW CREATE TABLE airports;
CREATE TABLE `airports` (
  `faa` varchar(3) NOT NULL DEFAULT '',
  `name` varchar(255) DEFAULT NULL,
  `lat` decimal(10,7) DEFAULT NULL,
  `lon` decimal(10,7) DEFAULT NULL,
  `alt` int(11) DEFAULT NULL,
  `tz` smallint(4) DEFAULT NULL,
  `dst` char(1) DEFAULT NULL,
  `city` varchar(255) DEFAULT NULL,
  `country` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`faa`)
)
The `CREATE TABLE` command starts by defining the name of the table, and then proceeds to list the field definitions in a comma-separated list. If you want to build a database from scratch, as we do in Section 16.3, you will have to write these definitions for each table. Tables that are already created can be modified using the `ALTER TABLE` command. For example, the following will change the `tz` field to two digits and change the default value to zero.
ALTER TABLE airports CHANGE tz tz smallint(2) DEFAULT 0;
16.1.3 Keys
Two related but different concepts are keys and indices. The former offers some performance advantages but is primarily useful for imposing constraints on possible entries in the database, while the latter is purely about improving the speed of retrieval.
Different relational database management systems (RDBMS) may implement a variety of different kinds of keys, but three types are most common. In each case, suppose that we have a table with \(n\) rows and \(p\) columns.
- `PRIMARY KEY`: a column or set of columns in a table that uniquely identifies each row. By convention, this column is often called `id`. A table can have at most one primary key, and in general it is considered good practice to define a primary key on every table (although there are exceptions to this rule). If the key spans \(k < p\) columns, then even though the primary key must by definition have \(n\) rows itself, it only requires \(nk\) pieces of data, rather than the \(np\) that the full table occupies. Thus, the primary key is always smaller than the table itself, and is thus faster to search. A second critically important role of the primary key is enforcement of non-duplication. If you try to insert a row into a table that would result in a duplicate entry for the primary key, you will get an error.
- `UNIQUE KEY`: a column or set of columns in a table that uniquely identifies each row, except for rows that contain `NULL` in some of those attributes. Unlike primary keys, a single table may have many unique keys. A typical use for these is in a lookup table. For example, Ted Turocy maintains a register of player `id`s for professional baseball players across multiple data providers. Each row in this table is a different player, and the primary key is a randomly-generated hash: each player gets exactly one value. However, each row also contains that same player's `id` in systems designed by MLBAM, Baseball-Reference, Baseball Prospectus, Fangraphs, etc. This is tremendously useful for researchers working with multiple data providers, since they can easily link a player's statistics in one system to his information in another. However, this ability is predicated on the uniqueness of each player's `id` in each system. Moreover, many players may not have an `id` in every system, since some data providers track minor league baseball, or even the Japanese and Korean professional leagues. Thus, the imposition of a unique key, which allows `NULL`s, is necessary to maintain the integrity of these data.
- `FOREIGN KEY`: a column or set of columns that reference a primary key in another table. For example, the primary key in the `carriers` table is `carrier`. The `carrier` column in the `flights` table, which consists of carrier IDs, is a foreign key that references `carriers.carrier`. Foreign keys don't offer any performance enhancements, but they are important for maintaining referential integrity, especially in transactional databases that have many insertions and deletions. A sketch showing all three kinds of keys in a single table definition appears after this list.
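The following minimal sketch declares all three kinds of keys in one table definition. The `players` and `teams` tables and all of their columns are hypothetical (they are not part of the `airlines` database), and note that some storage engines (e.g., MyISAM) parse but do not enforce foreign key constraints.
CREATE TABLE `players` (
  `id` varchar(40) NOT NULL,            -- randomly-generated hash
  `mlbam_id` int(11) DEFAULT NULL,      -- may be NULL for some players
  `bbref_id` varchar(20) DEFAULT NULL,  -- may be NULL for some players
  `team_id` varchar(3) DEFAULT NULL,
  PRIMARY KEY (`id`),                   -- at most one per table; no duplicates
  UNIQUE KEY `mlbam` (`mlbam_id`),      -- unique wherever not NULL
  UNIQUE KEY `bbref` (`bbref_id`),
  FOREIGN KEY (`team_id`) REFERENCES `teams` (`team_id`)
);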
You can use the `SHOW KEYS` command to identify the keys in a table. Note that the `carriers` table has only one key defined: a primary key on `carrier`.
SHOW KEYS FROM carriers;
Table | Non_unique | Key_name | Seq_in_index | Column_name | Cardinality |
---|---|---|---|---|---|
carriers | 0 | PRIMARY | 1 | carrier | 1610 |
16.1.4 Indices
While keys help maintain the integrity of the data, indices impose no constraints—they simply enable faster retrieval. An index is a lookup table that helps SQL keep track of which records contain certain values. Judicious use of indices can dramatically speed up retrieval times. The technical implementation of efficient indices is an active area of research among computer scientists, and fast indices are one of the primary advantages that differentiate SQL tables from large R data frames.
Indices have to be built by the database in advance, and they are then written to the disk. Thus, indices take up space on the disk (this is one of the reasons that they aren’t implemented in R). For some tables with many indices, the size of the indices can even exceed the size of the raw data. Thus, when building indices, there is a trade-off to consider: You want just enough indices but not too many.
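Indices can be added or removed at any time after a table exists. As a minimal sketch (the index name `Dist` is our own choice, and building it on a 48-million-row table takes noticeable time and disk space):
ALTER TABLE flights ADD INDEX `Dist` (`distance`);
-- If the index does not earn its keep, dropping it reclaims the disk space.
ALTER TABLE flights DROP INDEX `Dist`;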
Consider the task of locating all of the rows in the `flights` table that contain the `origin` value `BDL`. These rows are strewn about the table in no particular order. How would you find them? A simple approach would be to start with the first row, examine the `origin` field, grab it if it contains `BDL`, and otherwise move to the second row. In order to ensure that all of the matching rows are returned, this algorithm must check every single one of the \(n=\) 48 million rows in this table! So its speed is \(O(n)\). However, we have built an index on the `origin` column, and this index contains only 2,266 rows (see Table 16.1). Each row in the index corresponds to exactly one value of `origin`, and contains a lookup for the exact rows in the table that are specific to that value. Thus, when we ask for the rows for which `origin` is equal to `BDL`, the database will use the index to deliver those rows very quickly. In practice, the retrieval speed for indexed columns can be \(O(\ln{n})\) (or better), which is a tremendous advantage when \(n\) is large.
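You can confirm that the server would use this index by asking for its query plan (the `EXPLAIN` syntax is covered in Section 16.1.5); a sketch of such a check:
EXPLAIN SELECT * FROM flights WHERE origin = 'BDL';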
The speed-up that indices can provide is often especially apparent when joining two large tables. To see why, consider the following toy example. Suppose we want to merge two tables on the columns whose values are listed below. To merge these records correctly, we have to do a lot of work to find the appropriate value in the second list that matches each value in the first list.
[1] 5 18 2 3 4 2 1
[1] 5 6 3 18 4 7 1 2
On the other hand, consider performing the same task on the same set of values, but having the values sorted ahead of time. Now, the merging task is very fast, because we can quickly locate the matching records. In effect, by keeping the records sorted, we have off-loaded the sorting task when we do a merge, resulting in much faster merging performance. However, this requires that we sort the records in the first place and then keep them sorted. This may slow down other operations—such as inserting new records—which now have to be done more carefully.
[1] 1 2 2 3 4 5 18
[1] 1 2 3 4 5 6 7 18
SHOW INDEXES FROM flights;
Table | Non_unique | Key_name | Seq_in_index | Column_name | Cardinality |
---|---|---|---|---|---|
flights | 1 | Year | 1 | year | 7 |
flights | 1 | Date | 1 | year | 7 |
flights | 1 | Date | 2 | month | 89 |
flights | 1 | Date | 3 | day | 2712 |
flights | 1 | Origin | 1 | origin | 2267 |
flights | 1 | Dest | 1 | dest | 2267 |
flights | 1 | Carrier | 1 | carrier | 134 |
flights | 1 | tailNum | 1 | tailnum | 37862 |
In MySQL, the `SHOW INDEXES` command is equivalent to `SHOW KEYS`. Note that the `flights` table has several keys defined, but no primary key (see Table 16.1). The key `Date` spans the three columns `year`, `month`, and `day`.
16.1.5 Query plans
It is important to have the right indices built for your specific data and the queries that are likely to be run on it. Unfortunately, there is not always a straightforward answer to the question of which indices to build.
For the `flights` table, it seems likely that many queries will involve searching for flights from a particular origin, or to a particular destination, or during a particular year (or range of years), or on a specific carrier, and so we have built indices on each of these columns. We have also built the `Date` index, since it seems likely that people would want to search for flights on a certain date. However, it does not seem so likely that people would search for flights in a specific month across all years, and thus we have not built an index on `month` alone. The `Date` index contains the `month` column, but this index can only be used if `year` is also part of the query.
You can ask MySQL for information about how it is going to perform a query using the `EXPLAIN` syntax. This will help you understand how onerous your query is, without actually running it, saving you the time of having to wait for it to execute.
The output below reflects the query plan returned by the MySQL server. If we were to run a query for long flights using the `distance` column, the server would have to inspect each of the 48 million rows, since this column is not indexed. This is the slowest possible search, and is often called a table scan. The 48 million number that you see in the `rows` column is an estimate of the number of rows that MySQL will have to consult in order to process the query. In general, more rows mean a slower query.
EXPLAIN SELECT * FROM flights WHERE distance > 3000;
table | partitions | type | possible_keys | key | key_len | ref | rows | filtered |
---|---|---|---|---|---|---|---|---|
flights | p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 | ALL | NA | NA | NA | NA | 47932811 | 33.3 |
On the other hand, if we search for recent flights using the `year` column, which has an index built on it, then we only need to consider a fraction of those rows (about 6.3 million).
EXPLAIN SELECT * FROM flights WHERE year = 2013;
table | partitions | type | possible_keys | key | key_len | ref | rows | filtered |
---|---|---|---|---|---|---|---|---|
flights | p27 | ALL | Year,Date | NA | NA | NA | 6369482 | 100 |
Note that for the second example the server could have used either the index `Year` or the index `Date` (which contains the column `year`). Because of the index, only the 6.3 million flights from 2013 were consulted. Similarly, if we search by year and month, we can use the `Date` index.
EXPLAIN SELECT * FROM flights WHERE year = 2013 AND month = 6;
table | partitions | type | possible_keys | key | key_len | ref | rows | filtered |
---|---|---|---|---|---|---|---|---|
flights | p27 | ref | Year,Date | Date | 6 | const,const | 714535 | 100 |
But if we search for months across all years, we can’t! The query plan results in a table scan again.
EXPLAIN SELECT * FROM flights WHERE month = 6;
table | partitions | type | possible_keys | key | key_len | ref | rows | filtered |
---|---|---|---|---|---|---|---|---|
flights | p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 | ALL | NA | NA | NA | NA | 47932811 | 10 |
This is because although `month` is part of the `Date` index, it is the second column in the index, and thus it doesn't help us when we aren't filtering on `year`. Thus, if it were common for our users to search on `month` without `year`, it would probably be worth building an index on `month`, as in the sketch below. Were we to actually run these queries, there would be a significant difference in computational time.
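A minimal sketch of what that would look like (the index name `Month` is our own choice, and building the index requires a full pass over the table):
ALTER TABLE flights ADD INDEX `Month` (`month`);
EXPLAIN SELECT * FROM flights WHERE month = 6;
With such an index in place, the second query should no longer require a table scan.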
Using indices is especially important when performing `JOIN` operations on large tables. In this example, both queries use indices. However, because the cardinality of the index on `tailnum` is smaller than the cardinality of the index on `year` (see Table 16.1), the number of rows in `flights` associated with each unique value of `tailnum` is smaller than for each unique value of `year`. Thus, the first query runs faster.
EXPLAIN
SELECT * FROM planes p
LEFT JOIN flights o ON p.tailnum = o.TailNum
WHERE manufacturer = 'BOEING';
table | partitions | type | possible_keys | key | key_len | ref | rows | filtered |
---|---|---|---|---|---|---|---|---|
p | NA | ALL | NA | NA | NA | NA | 3322 | 10 |
o | p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 | ref | tailNum | tailNum | 9 | airlines.p.tailnum | 1266 | 100 |
EXPLAIN
SELECT * FROM planes p
LEFT JOIN flights o ON p.Year = o.Year
WHERE manufacturer = 'BOEING';
table | partitions | type | possible_keys | key | key_len | ref | rows | filtered |
---|---|---|---|---|---|---|---|---|
p | NA | ALL | NA | NA | NA | NA | 3322 | 10 |
o | p1,p2,p3,p4,p5,p6,p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18,p19,p20,p21,p22,p23,p24,p25,p26,p27,p28,p29,p30,p31,p32 | ref | Year,Date | Year | 3 | airlines.p.year | 6450117 | 100 |
16.1.6 Partitioning
Another approach to speeding up queries on large tables (like `flights`) is partitioning. Here, we could create partitions based on the `year`. For `flights`, this would instruct the server to physically write the `flights` table as a series of smaller tables, each one specific to a single value of `year`. At the same time, the server would create a logical supertable, so that to the user, the appearance of `flights` would be unchanged. This acts like a preemptive index on the `year` column.
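In MySQL, a range-partitioning declaration might look like the following sketch. The partition names and boundaries here are our own, and partitioning imposes restrictions of its own (e.g., every unique key on the table must include the partitioning column):
ALTER TABLE flights
PARTITION BY RANGE (year) (
  PARTITION p_old VALUES LESS THAN (2000),
  PARTITION p_2000s VALUES LESS THAN (2010),
  PARTITION p_recent VALUES LESS THAN MAXVALUE
);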
If most of the queries to the `flights` table were for a specific year or range of years, then partitioning could significantly improve performance, since most of the rows would never be consulted. For example, if most of the queries to the `flights` database were for the past three years, then partitioning would reduce the search space of most queries on the full data set to the roughly 20 million flights in the last three years, instead of the 169 million rows in the last 20 years. But here again, if most of the queries to the `flights` table were about carriers across years, then this type of partitioning would not help at all. It is the job of the database designer to tailor the database structure to the pattern of queries coming from the users. As a data scientist, this may mean that you have to tailor the database structure to the queries that you are running.
16.2 Changing SQL data
In Chapter 15, we described how to query an SQL database using the `SELECT` command. Thus far in this chapter, we have discussed how to set up an SQL database and how to optimize it for speed. None of these operations actually change data in an existing database. In this section, we will briefly touch upon the `UPDATE` and `INSERT` commands, which allow you to do exactly that.
16.2.1 Changing data
The `UPDATE` command allows you to reset values in a table across all rows that match certain criteria. For example, in Chapter 15 we discussed the possibility that airports could change names over time. The airport in Washington, D.C. with code `DCA` is now called Ronald Reagan Washington National.
SELECT faa, name FROM airports WHERE faa = 'DCA';
faa | name |
---|---|
DCA | Ronald Reagan Washington Natl |
However, the “Ronald Reagan” prefix was added in 1998. If, for whatever reason, we wanted to go back to the old name, we could use an `UPDATE` command to change that information in the `airports` table.
UPDATE airports
SET name = 'Washington National'
WHERE faa = 'DCA';
An `UPDATE` operation can be very useful when you have to apply wholesale changes over a large number of rows. However, extreme caution is necessary, since an imprecise `UPDATE` query can wipe out large quantities of data, and there is no “undo” operation!
Exercise extraordinary caution when performing `UPDATE`s.
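One defensive habit (a sketch, not required syntax): run a `SELECT` with the exact same `WHERE` clause first, and only issue the `UPDATE` once you have verified that it matches precisely the rows you intend to change.
SELECT COUNT(*) FROM airports WHERE faa = 'DCA';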
16.2.2 Adding data
New data can be appended to an existing table with the `INSERT` commands. There are actually three things that can happen, depending on what you want to do when you have a primary key conflict. This occurs when one of the new rows that you are trying to insert has the same primary key value as one of the existing rows in the table.
- `INSERT`: Try to insert the new rows. If there is a primary key conflict, quit and throw an error.
- `INSERT IGNORE`: Try to insert the new rows. If there is a primary key conflict, skip inserting the conflicting rows and leave the existing rows untouched. Continue inserting data that does not conflict.
- `REPLACE`: Try to insert the new rows. If there is a primary key conflict, overwrite the existing rows with the new ones. Continue inserting data that does not conflict.
Recall that in Chapter 15 we found that the airports in Puerto Rico were not present in the `airports` table. If we wanted to add these manually, we could use `INSERT`.
INSERT INTO airports (faa, name)
VALUES ('SJU', 'Luis Munoz Marin International Airport');
Since `faa` is the primary key on this table, we can insert this row without contributing values for all of the other fields. In this case, the new row corresponding to `SJU` would have the `faa` and `name` fields as noted above, and the default values for all of the other fields. If we were to run this operation a second time, we would get an error, because of the primary key collision on `SJU`. We could avoid the error by choosing to `INSERT IGNORE` or `REPLACE` instead of `INSERT`.
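For example (a sketch of both variants):
INSERT IGNORE INTO airports (faa, name)
  VALUES ('SJU', 'Luis Munoz Marin International Airport');
REPLACE INTO airports (faa, name)
  VALUES ('SJU', 'Luis Munoz Marin International Airport');
The first statement silently leaves the existing `SJU` row alone; the second overwrites it with the new values (and the defaults for the unspecified fields).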
16.2.3 Importing data from a file
In practice, we rarely add new data manually in this manner. Instead, new data are most often added using the `LOAD DATA` command. This allows a file containing new data, usually a CSV, to be inserted in bulk. This is very common when, for example, your data comes to you daily in a CSV file and you want to keep your database up to date. The primary key collision concepts described above also apply to the `LOAD DATA` syntax, and are important to understand for proper database maintenance. We illustrate the use of `LOAD DATA` in Section 16.3.
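As a sketch, the `REPLACE` (or `IGNORE`) keyword placed before `INTO` controls what happens on a primary key conflict during a bulk load, just as it does for row-by-row inserts (the file name here is hypothetical):
LOAD DATA LOCAL INFILE './new_airports.csv' REPLACE INTO TABLE airports
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES;
Note that the trailing `IGNORE 1 LINES` clause is unrelated to conflict handling; it simply skips the CSV header row.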
16.3 Extended example: Building a database
The extract-transform-load (ETL) paradigm is common among data professionals. The idea is that many data sources need to be extracted from some external source, transformed into a different format, and finally loaded into a database system. Often, this is an iterative process that needs to be done every day, or even every hour. In such cases, developing the infrastructure to automate these steps can result in dramatically increased productivity.
In this example, we will illustrate how to set up a MySQL database for the babynames data using the command line and SQL, but not R. As noted previously, while the dplyr package has made R a viable interface for querying and populating SQL databases, it is occasionally necessary to get “under the hood” with SQL. The files that correspond to this example can be found on the book website at http://mdsr-book.github.io/.
16.3.1 Extract
In this case, our data already lives in an R package, but in most cases, your data will live on a website, or be available in a different format. Our goal is to take that data from wherever it is and download it. For the babynames data, there isn’t much to do, since we already have the data in an R package. We will simply load it.
library(babynames)
16.3.2 Transform
Since SQL tables conform to a row-and-column paradigm, our goal during the transform phase is to create CSV files (see Chapter 6) for each of the tables. In this example, we will create tables for the `babynames` and `births` data. You can try to add the `applicants` and `lifetables` tables on your own. We will simply write these data to CSV files using the `write_csv()` command. Since the `babynames` table is very long (nearly 1.8 million rows), we will just use the more recent data.
babynames %>%
  filter(year > 1975) %>%
  write_csv("babynames.csv")
births %>%
  write_csv("births.csv")
list.files(".", pattern = ".csv")
[1] "babynames.csv" "births.csv"
This raises an important question: What should we call these objects? The babynames package includes a data frame called `babynames` with one row per sex per year per name. Having both the database and a table with the same name may be confusing. To clarify which is which, we will call the database `babynamedb` and the table `babynames`.
Spending time thinking about the naming of databases, tables, and fields before you create them can help avoid confusion later on.
16.3.3 Load into MySQL database
Next, we need to write a script that will define the table structure for these two tables in a MySQL database (instructions for creation of a database in SQLite can be found in Section F.4.4). This script will have four parts:
- a `USE` statement that ensures we are in the right schema/database
- a series of `DROP TABLE` statements that drop any old tables with the same names as the ones we are going to create
- a series of `CREATE TABLE` statements that specify the table structures
- a series of `LOAD DATA` statements that read the data from the CSVs into the appropriate tables
The first part is easy:
USE babynamedb;
This assumes that we have a local database called `babynamedb`; we will create this later.
The second part is easy in this case, since we only have two tables. These ensure that we can run this script as many times as we want.
DROP TABLE IF EXISTS babynames;
DROP TABLE IF EXISTS births;
Be careful with the `DROP TABLE` statement. It destroys data.
The third step is the trickiest part: we need to define the columns precisely. The use of `str()`, `summary()`, and `glimpse()` is particularly useful for matching up R data types with MySQL data types. Please see the MySQL documentation for more information about what data types are supported.
glimpse(babynames)
Rows: 1,924,665
Columns: 5
$ year <dbl> 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1880, 1…
$ sex <chr> "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "F", "…
$ name <chr> "Mary", "Anna", "Emma", "Elizabeth", "Minnie", "Margaret", "I…
$ n <int> 7065, 2604, 2003, 1939, 1746, 1578, 1472, 1414, 1320, 1288, 1…
$ prop <dbl> 0.07238, 0.02668, 0.02052, 0.01987, 0.01789, 0.01617, 0.01508…
In this case, we know that the `year` variable will only contain four-digit integers, so we can specify that this column take up only that much room in SQL. Similarly, the `sex` variable is just a single character, so we can restrict the width of that column as well. These savings probably won't matter much in this example, but for large tables they can make a noticeable difference.
CREATE TABLE `babynames` (
  `year` smallint(4) NOT NULL DEFAULT 0,
  `sex` char(1) NOT NULL DEFAULT 'F',
  `name` varchar(255) NOT NULL DEFAULT '',
  `n` mediumint(7) NOT NULL DEFAULT 0,
  `prop` decimal(21,20) NOT NULL DEFAULT 0,
  PRIMARY KEY (`year`, `sex`, `name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
In this table, each row contains the information about one name for one sex in one year. Thus, each row contains a unique combination of those three variables, and we can therefore define a primary key across those three fields. Note the use of backquotes (to denote tables and variables) and the use of regular quotes (for default values).
glimpse(births)
Rows: 109
Columns: 2
$ year <int> 1909, 1910, 1911, 1912, 1913, 1914, 1915, 1916, 1917, 1918,…
$ births <int> 2718000, 2777000, 2809000, 2840000, 2869000, 2966000, 29650…
CREATE TABLE `births` (
  `year` smallint(4) NOT NULL DEFAULT 0,
  `births` mediumint(8) NOT NULL DEFAULT 0,
  PRIMARY KEY (`year`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Finally, we have to tell MySQL where to find the CSV files and where to put the data it finds in them. This is accomplished using the `LOAD DATA` command. You may also need to add a `LINES TERMINATED BY \r\n` clause, but we have omitted that for clarity (a sketch of it appears after the next code block). Please be aware that lines terminate using different characters in different operating systems, so Windows, Mac, and Linux users may have to tweak these commands to suit their needs. The `SHOW WARNINGS` commands are not necessary, but they will help with debugging.
LOAD DATA LOCAL INFILE './babynames.csv' INTO TABLE `babynames`
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES;
SHOW WARNINGS;
LOAD DATA LOCAL INFILE './births.csv' INTO TABLE `births`
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' IGNORE 1 LINES;
SHOW WARNINGS;
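For a CSV with Windows-style line endings, the same statement might include the omitted clause, as in this sketch (adjust to your file):
LOAD DATA LOCAL INFILE './babynames.csv' INTO TABLE `babynames`
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\r\n' IGNORE 1 LINES;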
Putting this all together, we have the following script:
USE babynamedb;
DROP TABLE IF EXISTS babynames;
DROP TABLE IF EXISTS births;
CREATE TABLE `babynames` (
  `year` smallint(4) NOT NULL DEFAULT 0,
  `sex` char(1) NOT NULL DEFAULT 'F',
  `name` varchar(255) NOT NULL DEFAULT '',
  `n` mediumint(7) NOT NULL DEFAULT 0,
  `prop` decimal(21,20) NOT NULL DEFAULT 0,
  PRIMARY KEY (`year`, `sex`, `name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

CREATE TABLE `births` (
  `year` smallint(4) NOT NULL DEFAULT 0,
  `births` mediumint(8) NOT NULL DEFAULT 0,
  PRIMARY KEY (`year`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

LOAD DATA LOCAL INFILE './babynames.csv' INTO TABLE `babynames`
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' IGNORE 1 LINES;
LOAD DATA LOCAL INFILE './births.csv' INTO TABLE `births`
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' IGNORE 1 LINES;

SELECT year
  , COUNT(DISTINCT name) AS numNames
  , SUM(n) AS numBirths
FROM babynames
GROUP BY year
ORDER BY numBirths DESC
LIMIT 0, 10;
Note that we have added a `SELECT` query just to verify that our table is populated. To load this into MySQL, we must first make sure that the `babynamedb` database exists, and if not, we must create it.
First, we check to see if `babynamedb` exists. We can do this from the command line using shell commands:
mysql -e "SHOW DATABASES;"
If it doesn’t exist, then we must create it:
mysql -e "CREATE DATABASE babynamedb;"
Finally, we run our script. The `--show-warnings` and `-v` flags are optional, but will help with debugging.
mysql --local-infile --show-warnings -v babynamedb < babynamedb.mysql
In practice, if your SQL script is not perfect, you will see errors or warnings the first time you try this.
But by iterating this process, you will eventually refine your script such that it works as desired.
If you get a 1148 error, make sure that you are using the `--local-infile` flag.
ERROR 1148 (42000): The used command is not allowed with this MySQL version
If you get a 29 error, make sure that the file exists in this location and that the `mysql` user has permission to read and execute it.
ERROR 29 (HY000): File './babynames.csv' not found (Errcode: 13)
Once the MySQL database has been created, the following commands can be used to access it from R using dplyr:
library(DBI)
library(dplyr)
db <- dbConnect(RMySQL::MySQL(), dbname = "babynamedb")
babynames <- tbl(db, "babynames")
babynames %>%
  filter(name == "Benjamin")
16.4 Scalability
With the exception of SQLite, RDBMSs scale very well on a single computer to databases that take up dozens of gigabytes. For a dedicated server, even terabytes are workable on a single machine. Beyond this, many companies employ distributed solutions called clusters. A cluster is simply more than one machine (i.e., a node) linked together running the same RDBMS. One machine is designated as the head node, and this machine controls all of the other nodes. The actual data are distributed across the various nodes, and the head node manages queries, parceling them out to the appropriate cluster nodes.
A full discussion of clusters and other distributed architectures (including replication) is beyond the scope of this book. In Chapter 21, we discuss alternatives to SQL that may provide higher-end solutions for bigger data.
16.5 Further resources
The SQL in a Nutshell book (Kline et al. 2008) is a useful reference for all things SQL.
16.6 Exercises
Problem 1 (Easy): Alice is searching for cancelled flights in the `flights` table, and her query is running very slowly. She decides to build an index on `cancelled` in the hopes of speeding things up. Discuss the relative merits of her plan. What are the trade-offs? Will her query be any faster?
Problem 2 (Medium): The `Master` table of the `Lahman` database contains biographical information about baseball players. The primary key is the `playerID` variable. There are also variables for `retroID` and `bbrefID`, which correspond to the player's identifier in other baseball databases. Discuss the ramifications of placing a primary, unique, or foreign key on `retroID`.
Problem 3 (Medium): Bob wants to analyze the on-time performance of United Airlines flights across the decade of the 1990s. Discuss how the partitioning scheme of the `flights` table based on `year` will affect the performance of Bob's queries, relative to an unpartitioned table.
Problem 4 (Hard): Use the `macleish` package to download the weather data at the MacLeish Field Station. Write your own table schema from scratch and import these data into the database server of your choice.
Problem 5 (Hard): Write a full table schema for the two tables in the `fueleconomy` package and import them into the database server of your choice.
Problem 6 (Hard): Write a full table schema for the `mtcars` data set and import it into the database server of your choice.
Problem 7 (Hard): Write a full table schema for the five tables in the `nasaweather` package and import them into the database server of your choice.
Problem 8 (Hard): Write a full table schema for two of the ten tables in the `usdanutrients` package and import them into the database server of your choice.
# remotes::install_github("hadley/usdanutrients")
library(usdanutrients)
# data(package="usdanutrients")
16.7 Supplementary exercises
Available at https://mdsr-book.github.io/mdsr2e/sql-II.html#sqlII-online-exercises
Problem 1 (Easy): The `flights` table in the `airlines` database contains the following indexes:
SHOW INDEXES FROM flights;
Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
---|---|---|---|---|---|---|---|---|---|---|---|---|
flights | 1 | Year | 1 | year | A | 7 | NA | NA | YES | BTREE | ||
flights | 1 | Date | 1 | year | A | 7 | NA | NA | YES | BTREE | ||
flights | 1 | Date | 2 | month | A | 89 | NA | NA | YES | BTREE | ||
flights | 1 | Date | 3 | day | A | 2712 | NA | NA | YES | BTREE | ||
flights | 1 | Origin | 1 | origin | A | 2267 | NA | NA | BTREE | |||
flights | 1 | Dest | 1 | dest | A | 2267 | NA | NA | BTREE | |||
flights | 1 | Carrier | 1 | carrier | A | 134 | NA | NA | BTREE | |||
flights | 1 | tailNum | 1 | tailnum | A | 37862 | NA | NA | YES | BTREE |
Consider the following queries:
SELECT * FROM flights WHERE cancelled = 1;
SELECT * FROM flights WHERE carrier = "DL";
Which query will execute faster? Justify your answer.
Problem 2 (Hard): Use the `fec12` package to download and unzip the federal election data for 2012 that were used in Chapter 2. Write your own table schema from scratch and import these data into the database server of your choice.