Text as Data exercises

Introduction

These exercises are taken from the text as data chapter of Modern Data Science with R (http://mdsr-book.github.io/). Other materials for instructors relevant to this chapter (sample activities, overview video) can be found there.

Speaking lines

Speaking lines in Shakespeare’s plays are identified by a line that starts with two spaces, then a string of capital letters and spaces (the character’s name) followed by a period. Use grep() to find all of the speaking lines in Macbeth. How many are there?

SOLUTION:

library(mdsr)   
library(tidyr)
library(tm)
library(wordcloud)
data(Macbeth_raw)
macbeth <- strsplit(Macbeth_raw, "\r\n")[[1]]
head(macbeth)

## [1] "This Etext file is presented by Project Gutenberg, in"           
## [2] "cooperation with World Library, Inc., from their Library of the" 
## [3] "Future and Shakespeare CDROMS.  Project Gutenberg often releases"
## [4] "Etexts that are NOT placed in the Public Domain!!"               
## [5] ""                                                                
## [6] "*This Etext has certain copyright implications you should read!*"

# solution goes here
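As a hint rather than a full solution, the pattern described in the exercise can be tried on a few hand-typed lines first. This is a minimal sketch: the vector toy and the name pattern are illustrative assumptions, and the lines only imitate the layout of the Project Gutenberg text.

```r
# Hand-typed lines that imitate the layout of the play (not read from Macbeth_raw)
toy <- c(
  "  MACBETH. So foul and fair a day I have not seen.",
  "    And nothing is but what is not.",
  "  LADY MACBETH. Give me the daggers.",
  "ACT I. SCENE I.",
  "  Enter three Witches."
)

# ^ anchors the start of the line, followed by two spaces;
# [A-Z ]+ is a run of capital letters and spaces; \\. is a literal period
pattern <- "^  [A-Z ]+\\."

hits <- grep(pattern, toy)   # indices of the lines that match
length(hits)                 # how many lines match
toy[hits]                    # the matching lines themselves
```

Because the character class also accepts spaces, a line indented by more than two spaces and followed by capitals and a period would match as well; whether that matters for the real text is worth checking by inspecting the matched lines.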

Hyphenated words

Find all the hyphenated words in one of Shakespeare’s plays.

SOLUTION:

# solution goes here
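One possible starting point, shown on made-up sentences, is to extract every match of a "letters, hyphen, letters" pattern with base R's gregexpr() and regmatches(). The vector toy and the pattern are assumptions for illustration; the pattern ignores apostrophes, so a word such as lean-fac'd would be cut short.

```r
# Made-up sentences for illustration
toy <- c(
  "The pale-hearted fear and the thick-ribbed ice",
  "No hyphens here",
  "A well-known, so-called trick"
)

# One or more letters, a hyphen, then one or more letters
pattern <- "[A-Za-z]+-[A-Za-z]+"

# gregexpr() locates every match per line; regmatches() pulls out the text
matches <- regmatches(toy, gregexpr(pattern, toy))

# Flatten the per-line results and drop repeats
hyphenated <- unique(unlist(matches))
hyphenated
```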

Most popular names

Use the babynames data table from the babynames package to find the ten most popular:

Boys’ names ending in a vowel.

SOLUTION:

# solution goes here

Names ending with joe, jo, Joe, or Jo (e.g., Billyjoe).

SOLUTION:

# solution goes here
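Both parts follow the same recipe: filter names with a regular expression anchored at the end of the string, total the counts per name, and keep the largest totals. The sketch below uses a small invented data frame whose columns sex, name, and n mirror those of babynames; the helper top_names() is hypothetical and base R is used throughout.

```r
# Invented data with the same column names as babynames (sex, name, n)
toy <- data.frame(
  sex  = c("M", "M", "M", "M", "M", "F", "F"),
  name = c("Joshua", "Noah", "Joshua", "Kyle", "Billyjoe", "Emma", "Maryjo"),
  n    = c(100, 80, 120, 50, 5, 200, 7),
  stringsAsFactors = FALSE
)

# Hypothetical helper: total n per name, then keep the k largest totals
top_names <- function(df, k = 10) {
  totals <- aggregate(n ~ name, data = df, FUN = sum)
  head(totals[order(-totals$n), ], k)
}

# $ anchors the match at the end of the name; y is not treated as a vowel here
boys_vowel <- top_names(subset(toy, sex == "M" & grepl("[aeiou]$", name)))

# [Jj]o followed by an optional e covers joe, jo, Joe, and Jo
ends_jo <- top_names(subset(toy, grepl("[Jj]oe?$", name)))

boys_vowel
ends_jo
```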

Adjectives

Find all of the adjectives in one of Shakespeare’s plays that end in more or less (note change from original question 15.4).

SOLUTION:

# solution goes here

Stage directions

Find all of the lines containing the stage direction or in one of Shakespeare’s plays (note change from original exercise 15.5).

SOLUTION:

# solution goes here

Regular expressions

Use regular expressions to determine the number of speaking lines from the Complete Works of William Shakespeare (http://www.gutenberg.org/cache/epub/100/pg100.txt). Here, we care only about how many times a character speaks—not what they say or for how long they speak.

SOLUTION:

# solution goes here

Top characters

Make a bar chart displaying the top 100 characters with the greatest number of lines. Hint: you may want to use either the stringr::str_extract() or strsplit() function here.

SOLUTION:

# solution goes here
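The counting step can be sketched on a handful of made-up speaking lines. Here base R's regexpr() and regmatches() stand in for stringr::str_extract(), and the vector speaking is an assumption for illustration; with the real data it would hold the speaking lines found in the previous exercise.

```r
# Made-up speaking lines in the layout described earlier
speaking <- c(
  "  MACBETH. First line.",
  "  LADY MACBETH. Second line.",
  "  MACBETH. Third line.",
  "  BANQUO. Fourth line."
)

# Extract the leading "  NAME." part of each line
who <- regmatches(speaking, regexpr("^  [A-Z ]+\\.", speaking))

# Drop the trailing period and the surrounding spaces
who <- trimws(sub("\\.$", "", who))

# Count lines per character, largest first, and keep at most 100
counts <- sort(table(who), decreasing = TRUE)
top <- head(counts, 100)

# Bar chart with perpendicular axis labels so long names stay readable
barplot(top, las = 2, ylab = "Number of speaking lines")
```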

Shakespeare Machine

In this problem, you will do much of the work to recreate Mark Hansen’s Shakespeare Machine. Start by watching a video clip (http://vimeo.com/54858820) of the exhibit. Use The Complete Works of William Shakespeare (see earlier exercise) and regular expressions to find all of the hyphenated words in Shakespeare Machine. How many are there? Use %in% to verify that your list contains the following hyphenated words pictured at 00:46 of the clip.

SOLUTION:

sm_words <- c("true-fix'd", "pale-hearted", "lean-fac'd", "hard-hearted", 
  "best-regarded", "thick-ribbed", "both-sides", "sea-like.", 
  "shrill-shrieking", "lust-stain'd", "tragical-historical,")
# solution goes here
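The membership check itself is a one-liner, but note that some entries in sm_words carry trailing punctuation. The sketch below uses a short hypothetical vector found in place of the real list of extracted words; stripping trailing punctuation before comparing is one possible choice, not necessarily the intended one.

```r
# Hypothetical result of the extraction step (stands in for the real list)
found <- c("pale-hearted", "hard-hearted", "sea-like")

# A few target words, one of which ends in a period
targets <- c("pale-hearted", "sea-like.", "both-sides")

# %in% returns one TRUE/FALSE per element of the left-hand vector
raw_check <- targets %in% found

# Remove punctuation at the end of each target before comparing
clean <- gsub("[[:punct:]]+$", "", targets)
clean_check <- clean %in% found

# Targets still missing after cleaning
clean[!clean_check]
```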

Wikipedia table

Find an interesting Wikipedia page with a table, scrape the data from it, and generate a figure that tells an interesting story. Include an interpretation of the figure.

SOLUTION:

# solution goes here

Stackexchange 1

The site stackexchange.com displays questions and answers on technical topics.
The following code downloads the most recent questions related to the dplyr package.

library(httr)
# Find the most recent R questions on stackoverflow
getresult <- GET("http://api.stackexchange.com",
  path = "questions",
  query = list(site = "stackoverflow.com", tagged = "dplyr"))
stop_for_status(getresult) # Ensure returned without error
questions <- content(getresult)  # Grab content
names(questions$items[[1]])    # What does the returned data look like?

##  [1] "tags"               "owner"              "is_answered"       
##  [4] "view_count"         "answer_count"       "score"             
##  [7] "last_activity_date" "creation_date"      "last_edit_date"    
## [10] "question_id"        "link"               "title"

length(questions$items)

## [1] 30

substr(questions$items[[1]]$title, 1, 68)

## [1] "Left join on date range by group ID"

substr(questions$items[[2]]$title, 1, 68)

## [1] "R: regrouping multiple rows into a single row (by value in first col"

substr(questions$items[[3]]$title, 1, 68)

## [1] "Relative frequencies / proportions with dplyr"

How many questions were returned? Without using jargon, describe in words what is being displayed and how it might be used.

SOLUTION:

# solution goes here
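To explore what was returned, it can help to pull one field out of every item. The sketch below builds a small list by hand that imitates the shape of questions; the two titles come from the output above, while the is_answered values are invented for illustration.

```r
# Hand-built stand-in for the parsed API response (is_answered values invented)
questions <- list(items = list(
  list(title = "Left join on date range by group ID", is_answered = FALSE),
  list(title = "Relative frequencies / proportions with dplyr", is_answered = TRUE)
))

# Number of questions in the response
length(questions$items)

# vapply() applies a function to each item and checks the type of each result
titles <- vapply(questions$items, function(q) q$title, character(1))
answered <- vapply(questions$items, function(q) q$is_answered, logical(1))

titles
mean(answered)   # share of questions marked as answered
```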

Stackexchange 2

Repeat the process of downloading the content from stackexchange.com related to the dplyr package and summarize the results.

SOLUTION:

# solution goes here

Nicholas Horton (nhorton@amherst.edu)

July 19, 2017
