B Introduction to R and RStudio

This chapter provides a (brief) introduction to R and RStudio. The R language is a free, open-source software environment for statistical computing and graphics (Ihaka and Gentleman 1996; R Core Team 2020). RStudio is an open-source integrated development environment (IDE) for R that adds many features and productivity tools for R (RStudio 2020). This chapter includes a short history, installation information, a sample session, background on fundamental structures and actions, information about help and documentation, and other important topics.

The R Foundation for Statistical Computing holds and administers the copyright of the R software and documentation. R is available under the terms of the Free Software Foundation’s GNU General Public License in source code form.

RStudio facilitates use of R by integrating R help and documentation, providing a workspace browser and data viewer, and supporting syntax highlighting, code completion, and smart indentation. Support for reproducible analysis is made available with the knitr package and R Markdown (see Appendix D). RStudio also facilitates the creation of dynamic web applications using Shiny (see Section 14.4). It provides support for multiple projects as well as an interface to source code control systems such as Git (and hosting services such as GitHub). It has become the default interface for many R users, and is our recommended environment for analysis.

RStudio is available as a client (standalone) for Windows, Mac OS X, and Linux. There is also a server version. Commercial products and support are available in addition to the open-source offerings (see http://www.rstudio.com/ide for details).

The first versions of R were written by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, while current development is coordinated by the R Core Team, a group of international volunteers.

The R language is quite similar to the S language, a flexible and extensible statistical environment originally developed in the 1980s at AT&T Bell Labs (now Alcatel–Lucent).

B.1 Installation

New users are encouraged to download and install R from the Comprehensive R Archive Network (CRAN, http://www.r-project.org) and install RStudio from http://www.rstudio.com/download. The sample session in the appendix of the manual An Introduction to R, also available from CRAN, is recommended reading.

The home page for the R project, located at http://r-project.org, is the best starting place for information about the software. It includes links to CRAN, which features pre-compiled binaries as well as source code for R, add-on packages, documentation (including manuals, frequently asked questions, and the R newsletter) as well as general background information. Mirrored CRAN sites with identical copies of these files exist all around the world. Updates to R and packages are regularly posted on CRAN.

B.1.1 RStudio

RStudio for Mac OS X, Windows, or Linux can be downloaded from https://rstudio.com/products/rstudio. RStudio requires R to be installed on the local machine. A server version (accessible from Web browsers) is also available for download. Documentation of the advanced features is available on the RStudio website.

B.2 Learning R

The R environment features extensive online documentation, though it can sometimes be challenging to comprehend. Each command has an associated help file that describes usage, lists arguments, provides details of actions, gives references, lists other related functions, and includes examples of its use. The help system is invoked using either the ? or help() commands.

?function
help(function)

where function is the name of the function of interest. (Alternatively, the Help tab in RStudio can be used to access the help system.)

Some commands (e.g., if) are reserved, so ?if will not generate the desired documentation. Running ?"if" will work (see also ?Reserved and ?Control). Other reserved words include else, repeat, while, function, for, in, next, break, TRUE, FALSE, NULL, Inf, NaN, and NA.

The RSiteSearch() function will search for key words or phrases in many places (including the search engine at http://search.r-project.org). The RSeek.org site can also be helpful in finding more information and examples. Examples of many functions are available using the example() function.

example(mean)

Other useful resources are help.start(), which provides a set of online manuals, and help.search(), which can be used to look up entries by description. The apropos() command returns any functions in the current search list that match a given pattern (which facilitates searching for a function based on what it does, as opposed to its name).
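
As a brief sketch of pattern-based lookup, the calls below search the attached packages for names matching a regular expression. The patterns are chosen only for illustration, and the results depend on which packages are loaded.

```r
# apropos() matches object names on the search path against a regular expression
hits <- apropos("mean")
head(hits)

# Anchoring the pattern narrows the search to names that start with "read."
apropos("^read\\.")
```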

Other resources for help available from CRAN include the R help mailing list. The StackOverflow site for R provides a series of questions and answers for common questions that are tagged as being related to R. New users are also encouraged to read the R FAQ (frequently asked questions) list. RStudio provides a curated guide to resources for learning R and its extensions.

B.3 Fundamental structures and objects

Here we provide a brief introduction to R data structures.

B.3.1 Objects and vectors

Almost everything in R is an object, which may be initially confusing to a new user. An object is simply something stored in R’s memory. Common objects include vectors, matrices, arrays, factors, data frames (akin to data sets in other systems), lists, and functions. The basic variable structure is a vector. Vectors (and other objects) are created using the <- or = assignment operators (which assign the evaluated expression on the right-hand side of the operator to the object name on the left-hand side).

x <- c(5, 7, 9, 13, -4, 8) # preferred
x =  c(5, 7, 9, 13, -4, 8)  # equivalent

The above code creates a vector of length 6 using the c() function to concatenate scalars. The = operator is used in other contexts for the specification of arguments to functions. Other assignment operators exist, as well as the assign() function (see help("<-") for more information). The exists() function conveys whether an object exists in the workspace, and the rm() command removes it. In RStudio, the “Environment” tab shows the names (and values) of all objects that exist in the current workspace.
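
The following minimal sketch shows exists(), assign(), and rm() working together. The object name z is introduced here only for illustration.

```r
x <- c(5, 7, 9, 13, -4, 8)
exists("x")          # TRUE: x is defined in the workspace

assign("z", x * 2)   # has the same effect as z <- x * 2
z

rm(x)                # remove x from the workspace
exists("x")          # FALSE once x has been removed
```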

Since vector operations are so fundamental in R, it is important to be able to access (or index) elements within these vectors. Many different ways of indexing vectors are available. Here, we introduce several of these using the vector x created above. The command x[2] returns the second element of x (the scalar 7), and x[c(2, 4)] returns the vector \((7, 13)\). The expressions x[c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)], x[1:5], and x[-6] all return a vector consisting of the first 5 elements in x (the last specifies all elements except the 6th).

x[2]
[1] 7
x[c(2, 4)]
[1]  7 13
x[c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE)]
[1]  5  7  9 13 -4
x[1:5]
[1]  5  7  9 13 -4
x[-6]
[1]  5  7  9 13 -4

Vectors are recycled if needed; for example, when comparing each of the elements of a vector to a scalar.

x > 8
[1] FALSE FALSE  TRUE  TRUE FALSE FALSE

The above expression demonstrates the use of comparison operators (see ?Comparison). Only the third and fourth elements of x are greater than 8. The comparison returns a logical vector with one TRUE or FALSE value per element (see ?Logic).

A count of elements meeting the condition can be generated using the sum() function, since TRUE is treated as 1 and FALSE as 0 when summed. Other comparison operators include == (equal), >= (greater than or equal), <= (less than or equal), and != (not equal). Care needs to be taken in the comparison using == if noninteger values are present (see all.equal()).

sum(x > 8)
[1] 2
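
To see why == needs care with noninteger values, consider the small sketch below. The values 0.1, 0.2, and 0.3 are chosen only because they cannot be represented exactly in binary floating point.

```r
a <- 0.1 + 0.2

a == 0.3                    # FALSE because of floating-point rounding
isTRUE(all.equal(a, 0.3))   # TRUE: the values agree within a small tolerance
```

Wrapping all.equal() in isTRUE() is useful because all.equal() returns a character description of the difference, rather than FALSE, when the values are not equal.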

B.3.2 Operators

There are many operators defined in R to carry out a variety of tasks. Many of these were demonstrated in the sample session (assignment, arithmetic) and previous examples (comparison). Arithmetic operations include +, -, *, /, ^ (exponentiation), %% (modulus), and %/% (integer division). More information about operators can be found using the help system (e.g., ?"+"). Background information on other operators and precedence rules can be found using help(Syntax).

Boolean operations (OR, AND, NOT, and XOR) are supported using the |, ||, &, &&, and ! operators and the xor() function. The | is an “or” operator that operates on each element of a vector, while the || is another “or” operator that works on single logical values and stops evaluation the first time that the result is true (see ?Logic).
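
The operators described above can be sketched on the vector x from the previous section (redefined here so that the example is self-contained).

```r
x <- c(5, 7, 9, 13, -4, 8)

rem  <- x %% 2     # remainder after division by 2
quot <- x %/% 2    # integer division by 2
rem
quot

(x > 8) | (x < 0)       # element-wise OR
either <- xor(x > 8, x > 6)   # TRUE where exactly one condition holds
either

# || stops as soon as the result is known, so the right-hand side never runs
short <- TRUE || stop("never evaluated")
short
```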

B.3.3 Lists

Lists in R are very general objects that can contain other objects of arbitrary types. List members can be named, or referenced using numeric indices (using the [[ operator).

newlist <- list(first = "hello", second = 42, Bob = TRUE)
is.list(newlist)
[1] TRUE
newlist
$first
[1] "hello"

$second
[1] 42

$Bob
[1] TRUE
newlist[[2]]
[1] 42
newlist$Bob
[1] TRUE
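
A related point that often trips up new users is the difference between [[ and [ when applied to a list. The sketch below rebuilds newlist so that it runs on its own.

```r
newlist <- list(first = "hello", second = 42, Bob = TRUE)

elem <- newlist[["second"]]   # the element itself (a numeric vector)
sub  <- newlist["second"]     # a list of length one that contains the element

class(elem)
class(sub)
```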

The unlist() function flattens (makes a vector out of) the elements in a list (see also relist()). Note that unlisted objects are coerced to a common type (in this case character).

unlisted <- unlist(newlist)
unlisted
  first  second     Bob 
"hello"    "42"  "TRUE" 

B.3.4 Matrices

Matrices are like two-dimensional vectors: rectangular objects where all entries have the same type. We can create a \(2 \times 3\) matrix, display it, and test for its type.

A <- matrix(x, 2, 3)
A
     [,1] [,2] [,3]
[1,]    5    9   -4
[2,]    7   13    8
is.matrix(A)    # is A a matrix?
[1] TRUE
is.vector(A)
[1] FALSE
is.matrix(x)
[1] FALSE

Note that comments are supported within R (any input given after a # character is ignored).

Indexing for matrices is done in a similar fashion as for vectors, albeit with a second dimension (denoted by a comma).

A[2, 3]
[1] 8
A[, 1]
[1] 5 7
A[1, ]
[1]  5  9 -4

B.3.5 Data frames and tibbles

Data sets are often stored in a data.frame, which is a special type of list that is more general than a matrix. This rectangular object, similar to a data table in other systems, can be thought of as a two-dimensional array with columns of vectors of the same length, but of possibly different types (as opposed to a matrix, which consists of vectors of the same type; or a list, whose elements needn’t be of the same length). The function read_csv() in the readr package returns a data.frame object.

A simple data.frame can be created using the data.frame() command. Variables can be accessed using the $ operator, as shown below (see also help(Extract)). In addition, operations can be performed by column (e.g., calculation of sample statistics). We can check to see if an object is a data.frame with is.data.frame().

y <- rep(11, length(x))
y
[1] 11 11 11 11 11 11
ds <- data.frame(x, y)
ds
   x  y
1  5 11
2  7 11
3  9 11
4 13 11
5 -4 11
6  8 11
ds$x[3]
[1] 9
is.data.frame(ds)
[1] TRUE

Tibbles are a form of simple data frames (a modern interpretation) that are described as “lazy and surly” (https://tibble.tidyverse.org). They support multiple data technologies (e.g., SQL databases), make more explicit their assumptions, and have an enhanced print method (so that output doesn’t scroll so much). Many packages in the tidyverse create tibbles by default.

tbl <- as_tibble(ds)
is.data.frame(tbl)
[1] TRUE
is_tibble(ds)
[1] FALSE
is_tibble(tbl)
[1] TRUE

The use of data.frame() differs from the use of cbind(), which yields a matrix object (unless it is given data frames as inputs).

newmat <- cbind(x, y)
newmat
      x  y
[1,]  5 11
[2,]  7 11
[3,]  9 11
[4,] 13 11
[5,] -4 11
[6,]  8 11
is.data.frame(newmat)
[1] FALSE
is.matrix(newmat)
[1] TRUE

Data frames are created from matrices using as.data.frame(), while matrices are constructed from data frames using as.matrix().
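
A short sketch of the round trip between the two structures follows. The names ds2 and mat2 are introduced only for this example.

```r
x <- c(5, 7, 9, 13, -4, 8)
y <- rep(11, length(x))
newmat <- cbind(x, y)

ds2 <- as.data.frame(newmat)   # matrix to data frame; column names are kept
is.data.frame(ds2)

mat2 <- as.matrix(ds2)         # data frame back to matrix
is.matrix(mat2)
```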

Although we strongly discourage its use, data frames can be attached to the workspace using the attach() command. The Tidyverse R Style guide (https://style.tidyverse.org) provides similar advice. Name conflicts are a common problem with attach() (see conflicts(), which reports on objects that exist with the same name in two or more places on the search path).

The search() function lists attached packages and objects. To avoid cluttering and confusing the name-space, the command detach() should be used once a data frame or package is no longer needed.

A number of R functions include a data argument to specify a data frame as a local environment. For functions without a data option, the with() and within() commands can be used to simplify reference to an object within a data frame without attaching.
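
As a minimal sketch of these alternatives to attach(), the code below uses a small data frame like the one created earlier. The new column z is hypothetical and serves only to show what within() returns.

```r
ds <- data.frame(x = c(5, 7, 9, 13, -4, 8), y = rep(11, 6))

# aggregate() accepts a data argument, so variables are looked up in ds
aggregate(x ~ y, data = ds, FUN = mean)

# mean() has no data argument; with() evaluates the expression inside ds
avg <- with(ds, mean(x))
avg

# within() evaluates assignments inside ds and returns a modified copy
ds2 <- within(ds, z <- x + y)
names(ds2)
```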

B.3.6 Attributes and classes

Many objects have a set of associated attributes (such as names of variables, dimensions, or classes) that can be displayed or sometimes changed. For example, we can find the dimension of the matrix defined earlier.

attributes(A)
$dim
[1] 2 3

Other types of objects within R include lists (ordered objects that are not necessarily rectangular), regression models (objects of class lm), and formulae (e.g., y ~ x1 + x2). R supports object-oriented programming (see help(UseMethod)). As a result, objects in R have an associated class attribute, which changes the default behavior for some operations on that object. Many functions (called generics) have special capabilities when applied to objects of a particular class. For example, when summary() is applied to an lm object, the summary.lm() function is called. Similarly, summary.aov() is called when an aov object is given as argument. These class-specific implementations of generic functions are called methods.

The class() function returns the classes to which an object belongs, while the methods() function displays all of the classes supported by a generic function.

head(methods(summary))
[1] "summary,ANY-method"             "summary,DBIObject-method"      
[3] "summary,MySQLConnection-method" "summary,MySQLDriver-method"    
[5] "summary,MySQLResult-method"     "summary.aov"                   

Objects in R can belong to multiple classes, although those classes need not be nested. As noted above, generic functions are dispatched according to the class attribute of each object. Thus, in the example below we create the tbl object, which belongs to multiple classes. When the print() function is called on tbl, R looks for a method called print.tbl_df(). If no such method is found, R looks for a method called print.tbl(). If no such method is found, R looks for a method called print.data.frame(). This process continues until a suitable method is found. If there is none, then print.default() is called.

tbl <- as_tibble(ds)
class(tbl)
[1] "tbl_df"     "tbl"        "data.frame"
print(tbl)
# A tibble: 6 × 2
      x     y
  <dbl> <dbl>
1     5    11
2     7    11
3     9    11
4    13    11
5    -4    11
6     8    11
print.data.frame(tbl)
   x  y
1  5 11
2  7 11
3  9 11
4 13 11
5 -4 11
6  8 11
print.default(tbl)
$x
[1]  5  7  9 13 -4  8

$y
[1] 11 11 11 11 11 11

attr(,"class")
[1] "tbl_df"     "tbl"        "data.frame"
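
The same dispatch mechanism applies to classes that we define ourselves. The sketch below invents a class called temperature purely for illustration; neither the class nor its print method exists in base R or in any package.

```r
# "temperature" is a made-up class used only for this example
temp <- structure(list(value = 21.5, unit = "C"), class = "temperature")

# A method is an ordinary function named generic.class
print.temperature <- function(x, ...) {
  cat("Temperature:", x$value, x$unit, "\n")
  invisible(x)
}

print(temp)            # dispatches to print.temperature()
print(unclass(temp))   # without the class attribute, default list printing is used
```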

There are a number of functions that assist with learning about an object in R. The attributes() command displays the attributes associated with an object. The typeof() function provides information about the underlying data structure of objects (e.g., logical, integer, double, complex, character, and list). The str() function displays the structure of an object, and the mode() function displays its mode (see also storage.mode()). For data frames, the glimpse() function provides a useful summary of each variable.
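
A quick sketch of these inspection functions, applied to a vector and a small data frame (both constructed here for illustration), is shown below.

```r
x <- c(5, 7, 9, 13, -4, 8)
ds <- data.frame(x = x, label = letters[1:6])

typeof(x)      # underlying type of a numeric vector created with c()
typeof(1:3)    # the colon operator creates integers
mode(x)
typeof(ds)     # a data frame is stored as a list
str(ds)        # compact display of each column
attributes(ds)
```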

A few quick notes on specific types of objects are worth relating here:

  • A vector is a one-dimensional array of items of the same data type. There are six basic data types that a vector can contain: logical, character, integer, double, complex, and raw. Vectors have a length() but not a dim(). Vectors can have—but needn’t have—names().
  • A factor is a special type of vector for categorical data. A factor has levels(). We change the reference level of a factor with relevel(). Factors are stored internally as integers that correspond to the IDs of the factor levels.

Factors can be problematic and their use is discouraged since they can complicate some aspects of data wrangling. A number of R developers have encouraged the use of the stringsAsFactors = FALSE option (which became the default in R 4.0.0).

  • A matrix is a two-dimensional array of items of the same data type. A matrix has a length() that is equal to nrow() times ncol(), or the product of dim().
  • A data.frame is a list of vectors of the same length. This is like a matrix, except that columns can be of different data types. Data frames always have names() and often have row.names().

Do not confuse a factor with a character vector.
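
The distinction between a factor and a character vector can be sketched as follows. The size labels are invented for illustration.

```r
sizes_chr <- c("small", "large", "medium", "small")
sizes_fct <- factor(sizes_chr)

levels(sizes_fct)         # levels are sorted alphabetically by default
codes <- as.integer(sizes_fct)   # the integer codes stored internally
codes

# relevel() moves the chosen level to the front, making it the reference level
releveled <- relevel(sizes_fct, ref = "small")
levels(releveled)

is.character(sizes_fct)   # a factor is not a character vector
is.character(sizes_chr)
```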

Note that data sets typically have class data.frame but are of type list. This is because, as noted above, R stores data frames as special types of lists—a list of several vectors having the same length, but possibly having different types.

class(mtcars)
[1] "data.frame"
typeof(mtcars)
[1] "list"

If you ever get confused when working with data frames and matrices, remember that a data.frame is a list (that can accommodate multiple types of objects), whereas a matrix is more like a vector (in that it can only support one type of object).

B.3.7 Options

The options() function in R can be used to change various default behaviors. For example, the digits option controls the number of digits to display in output. When options() is called to change a setting, the previous values are returned (invisibly), which allows them to be restored later. The command help(options) lists all of the settable options.
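
A minimal sketch of changing and then restoring an option follows. The names old and pi_short are introduced only for this example.

```r
old <- options(digits = 3)   # change the setting and save the previous value

pi_short <- format(pi)       # format() honors the digits option
pi_short

options(old)                 # restore the earlier setting
getOption("digits")
```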

B.3.8 Functions

Fundamental actions within R are carried out by calling functions (either built-in or user defined—see Appendix C for guidance on the latter). Multiple arguments may be given, separated by commas. The function carries out operations using the provided arguments and returns values (an object such as a vector or list) that are displayed (by default) or which can be saved by assignment to an object.

It’s a good idea to name arguments to functions. This practice minimizes errors assigning unnamed arguments to options and makes code more readable.

As an example, the quantile() function takes a numeric vector and returns the minimum, 25th percentile, median, 75th percentile, and maximum of the values in that vector. However, if an optional vector of quantiles is given, those quantiles are calculated instead.

vals <- rnorm(1000) # generate 1000 standard normal random variables
quantile(vals)
    0%    25%    50%    75%   100% 
-3.520 -0.675  0.012  0.737  3.352 
quantile(vals, c(.025, .975))
 2.5% 97.5% 
-2.00  1.98 
# Return values can be saved for later use.
res <- quantile(vals, c(.025, .975))
res[1]
2.5% 
  -2 

Arguments (options) are available for most functions. The documentation specifies the default action if named arguments are not specified. Arguments that are not named are matched to the function’s parameters by position, in the order in which they appear in the function call.

For the quantile() function, there is a type argument that allows specification of one of nine algorithms for calculating quantiles.

res <- quantile(vals, probs = c(.025, .975), type = 3)
res
 2.5% 97.5% 
-2.02  1.98 

Some functions allow a variable number of arguments. An example is the paste() function. The calling sequence is described in the documentation as follows.

paste(..., sep = " ", collapse = NULL)

To override the default behavior of a space being added between elements output by paste(), the user can specify a different value for sep.
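
A few calls to paste() illustrate the sep and collapse arguments. The strings are arbitrary examples.

```r
p1 <- paste("R", "Studio")                      # default separator is a space
p2 <- paste("R", "Studio", sep = "")            # no separator
p3 <- paste("file", 1:3, sep = "_")             # shorter arguments are recycled
p4 <- paste(c("a", "b", "c"), collapse = "+")   # collapse joins results into one string

p1
p2
p3
p4
```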

B.4 Add-ons: Packages

B.4.1 Introduction to packages

Additional functionality in R is added through packages, which consist of functions, data sets, examples, vignettes, and help files that can be downloaded from CRAN. The function install.packages() can be used to download and install packages. Alternatively, RStudio provides an easy-to-use Packages tab to install and load packages.

Throughout the book, we assume that the tidyverse and mdsr packages are loaded. In many cases, additional add-on packages (see Appendix A) need to be installed prior to running the examples in this book.

Packages that are not on CRAN can be installed using the install_github() function in the remotes package.

install.packages("mdsr")    # CRAN version
remotes::install_github("mdsr-book/mdsr")    # development version

The library() function will load an installed package. For example, to install and load Frank Harrell’s Hmisc package, two commands are needed:

install.packages("Hmisc")
library(Hmisc)

If a package is not installed, running the library() command will yield an error. Here we try to load the xaringanthemer package (which has not been installed):

> library(xaringanthemer)
Error in library(xaringanthemer) : there is no package called 'xaringanthemer'

To rectify the problem, we install the package from CRAN.

> install.packages("xaringanthemer")
trying URL 'https://cloud.r-project.org/src/contrib/xaringanthemer_0.3.0.tar.gz'
Content type 'application/x-gzip' length 1362643 bytes (1.3 MB)
==================================================
downloaded 1.3 Mb
library(xaringanthemer)

The require() function will test whether a package is available: it will load the package if it is installed, and generate a warning message (returning FALSE) if it is not, as opposed to library(), which will throw an error.
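
The difference between the two functions can be sketched with a package name that is deliberately nonexistent (notARealPackage123 is made up for this example).

```r
# require() returns FALSE (with a warning) when the package is missing
ok <- suppressWarnings(require("notARealPackage123", character.only = TRUE))
ok

# library() signals an error instead, which we catch here
res <- tryCatch(
  library("notARealPackage123", character.only = TRUE),
  error = function(e) "library() signalled an error"
)
res
```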

The names of all variables within a given data set (or more generally for sub-objects within an object) are provided by the names() command. The names of all objects defined within an R session can be generated using the objects() and ls() commands, which return a vector of character strings. RStudio includes an Environment tab that lists all the objects in the current environment.

The print() and summary() functions return the object or summaries of that object, respectively. Running print(object) at the command line is equivalent to just entering the name of the object, i.e., object.

B.4.2 Packages and name conflicts

Different package authors may choose the same name for functions that exist within base R (or within other packages). This will cause the other function or object to be masked. This can sometimes lead to confusion, when the expected version of a function is not the one that is called. The find() function can be used to determine where in the environment (workspace) a given object can be found.

find("mean")
[1] "package:base"

Sometimes it is desirable to remove a package from the workspace. For example, a package might define a function with the same name as an existing function. Packages can be detached using the syntax detach(package:PKGNAME), where PKGNAME is the name of the package. Objects with the same name that appear in multiple places in the environment can be accessed using the location::objectname syntax. As an example, to access the mean() function from the base package, the user would specify base::mean() instead of mean(). It is sometimes preferable to reference a function or object in this way rather than loading the package.

As an example where this might be useful, there are functions in the base and Hmisc packages called units(). The find command would display both (in the order in which they would be accessed).

library(Hmisc)
find("units")
[1] "package:Hmisc" "package:base" 

When the Hmisc package is loaded, the units() function from the base package is masked and would not be used by default. To specify that the version of the function from the base package should be used, prefix the function with the package name followed by two colons: base::units(). The conflicts() function reports on objects that exist with the same name in two or more places on the search path.
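
Masking can also be demonstrated without installing any package. In the sketch below we deliberately define our own function named mean(), which is not something we recommend in practice, and then use the :: syntax to reach the base version.

```r
# A user-defined mean() masks the version in the base package
mean <- function(x, ...) "masked version"

find("mean")    # when run at top level, the workspace appears before package:base

masked <- mean(c(1, 2, 3))          # calls the masking function
masked
unmasked <- base::mean(c(1, 2, 3))  # explicitly calls the base version
unmasked

rm(mean)        # remove the mask
mean(c(1, 2, 3))
```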

Running the command library(help = "PKGNAME") will display information about an installed package. Alternatively, the Packages tab in RStudio can be used to list, install, and update packages.

The session_info() function from the sessioninfo package provides improved reporting of version information about R, as well as details of loaded packages.

sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────
 setting  value                       
 version  R version 4.1.0 (2021-05-18)
 os       Ubuntu 20.04.2 LTS          
 system   x86_64, linux-gnu           
 ui       X11                         
 language (EN)                        
 collate  en_US.UTF-8                 
 ctype    en_US.UTF-8                 
 tz       America/New_York            
 date     2021-07-28                  

─ Packages ───────────────────────────────────────────────────────────────
 package        * version date       lib source        
 assertthat       0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
 backports        1.2.1   2020-12-09 [1] CRAN (R 4.1.0)
 base64enc        0.1-3   2015-07-28 [1] CRAN (R 4.1.0)
 bookdown         0.22    2021-04-22 [1] CRAN (R 4.1.0)
 broom            0.7.8   2021-06-24 [1] CRAN (R 4.1.0)
 bslib            0.2.5.1 2021-05-18 [1] CRAN (R 4.1.0)
 cellranger       1.1.0   2016-07-27 [1] CRAN (R 4.1.0)
 checkmate        2.0.0   2020-02-06 [1] CRAN (R 4.1.0)
 cli              3.0.1   2021-07-17 [1] CRAN (R 4.1.0)
 cluster          2.1.2   2021-04-17 [4] CRAN (R 4.0.5)
 colorspace       2.0-2   2021-06-24 [1] CRAN (R 4.1.0)
 crayon           1.4.1   2021-02-08 [1] CRAN (R 4.1.0)
 data.table       1.14.0  2021-02-21 [1] CRAN (R 4.1.0)
 DBI            * 1.1.1   2021-01-15 [1] CRAN (R 4.1.0)
 dbplyr           2.1.1   2021-04-06 [1] CRAN (R 4.1.0)
 digest           0.6.27  2020-10-24 [1] CRAN (R 4.1.0)
 dplyr          * 1.0.7   2021-06-18 [1] CRAN (R 4.1.0)
 ellipsis         0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
 evaluate         0.14    2019-05-28 [1] CRAN (R 4.1.0)
 fansi            0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
 forcats        * 0.5.1   2021-01-27 [1] CRAN (R 4.1.0)
 foreign          0.8-81  2020-12-22 [4] CRAN (R 4.0.3)
 Formula        * 1.2-4   2020-10-16 [1] CRAN (R 4.1.0)
 fs               1.5.0   2020-07-31 [1] CRAN (R 4.1.0)
 generics         0.1.0   2020-10-31 [1] CRAN (R 4.1.0)
 ggplot2        * 3.3.5   2021-06-25 [1] CRAN (R 4.1.0)
 glue             1.4.2   2020-08-27 [1] CRAN (R 4.1.0)
 gridExtra        2.3     2017-09-09 [1] CRAN (R 4.1.0)
 gtable           0.3.0   2019-03-25 [1] CRAN (R 4.1.0)
 haven            2.4.1   2021-04-23 [1] CRAN (R 4.1.0)
 Hmisc            4.5-0   2021-02-28 [1] CRAN (R 4.1.0)
 hms              1.1.0   2021-05-17 [1] CRAN (R 4.1.0)
 htmlTable        2.2.1   2021-05-18 [1] CRAN (R 4.1.0)
 htmltools        0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
 htmlwidgets      1.5.3   2020-12-10 [1] CRAN (R 4.1.0)
 httr             1.4.2   2020-07-20 [1] CRAN (R 4.1.0)
 jpeg             0.1-9   2021-07-24 [1] CRAN (R 4.1.0)
 jquerylib        0.1.4   2021-04-26 [1] CRAN (R 4.1.0)
 jsonlite         1.7.2   2020-12-09 [1] CRAN (R 4.1.0)
 knitr            1.33    2021-04-24 [1] CRAN (R 4.1.0)
 lattice        * 0.20-44 2021-05-02 [4] CRAN (R 4.1.0)
 latticeExtra     0.6-29  2019-12-19 [1] CRAN (R 4.1.0)
 lifecycle        1.0.0   2021-02-15 [1] CRAN (R 4.1.0)
 lubridate        1.7.10  2021-02-26 [1] CRAN (R 4.1.0)
 magrittr         2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
 Matrix           1.3-4   2021-06-01 [4] CRAN (R 4.1.0)
 mdsr           * 0.2.5   2021-03-29 [1] CRAN (R 4.1.0)
 modelr           0.1.8   2020-05-19 [1] CRAN (R 4.1.0)
 mosaicData     * 0.20.2  2021-01-16 [1] CRAN (R 4.1.0)
 munsell          0.5.0   2018-06-12 [1] CRAN (R 4.1.0)
 nnet             7.3-16  2021-05-03 [4] CRAN (R 4.0.5)
 pillar           1.6.1   2021-05-16 [1] CRAN (R 4.1.0)
 pkgconfig        2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
 png              0.1-7   2013-12-03 [1] CRAN (R 4.1.0)
 purrr          * 0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
 R6               2.5.0   2020-10-28 [1] CRAN (R 4.1.0)
 RColorBrewer     1.1-2   2014-12-07 [1] CRAN (R 4.1.0)
 Rcpp             1.0.7   2021-07-07 [1] CRAN (R 4.1.0)
 readr          * 2.0.0   2021-07-20 [1] CRAN (R 4.1.0)
 readxl           1.3.1   2019-03-13 [1] CRAN (R 4.1.0)
 repr             1.1.3   2021-01-21 [1] CRAN (R 4.1.0)
 reprex           2.0.0   2021-04-02 [1] CRAN (R 4.1.0)
 rlang            0.4.11  2021-04-30 [1] CRAN (R 4.1.0)
 rmarkdown        2.9     2021-06-15 [1] CRAN (R 4.1.0)
 RMySQL           0.10.22 2021-06-22 [1] CRAN (R 4.1.0)
 rpart            4.1-15  2019-04-12 [4] CRAN (R 4.0.0)
 rstudioapi       0.13    2020-11-12 [1] CRAN (R 4.1.0)
 rvest            1.0.1   2021-07-26 [1] CRAN (R 4.1.0)
 sass             0.4.0   2021-05-12 [1] CRAN (R 4.1.0)
 scales           1.1.1   2020-05-11 [1] CRAN (R 4.1.0)
 sessioninfo      1.1.1   2018-11-05 [1] CRAN (R 4.1.0)
 skimr            2.1.3   2021-03-07 [1] CRAN (R 4.1.0)
 stringi          1.7.3   2021-07-16 [1] CRAN (R 4.1.0)
 stringr        * 1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
 survival       * 3.2-11  2021-04-26 [4] CRAN (R 4.0.5)
 tibble         * 3.1.3   2021-07-23 [1] CRAN (R 4.1.0)
 tidyr          * 1.1.3   2021-03-03 [1] CRAN (R 4.1.0)
 tidyselect       1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
 tidyverse      * 1.3.1   2021-04-15 [1] CRAN (R 4.1.0)
 tzdb             0.1.2   2021-07-20 [1] CRAN (R 4.1.0)
 utf8             1.2.2   2021-07-24 [1] CRAN (R 4.1.0)
 vctrs            0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
 withr            2.4.2   2021-04-18 [1] CRAN (R 4.1.0)
 xaringanthemer * 0.4.0   2021-06-24 [1] CRAN (R 4.1.0)
 xfun             0.24    2021-06-15 [1] CRAN (R 4.1.0)
 xml2             1.3.2   2020-04-23 [1] CRAN (R 4.1.0)
 yaml             2.2.1   2020-02-01 [1] CRAN (R 4.1.0)

[1] /home/bbaumer/R/x86_64-pc-linux-gnu-library/4.1
[2] /usr/local/lib/R/site-library
[3] /usr/lib/R/site-library
[4] /usr/lib/R/library

The update.packages() function should be run periodically to ensure that packages are up-to-date.

As of December 2020, there were more than 16,800 packages available from CRAN. This represents a tremendous investment of time and code by many developers (Fox 2009). While each of these has met a minimal standard for inclusion, it is important to keep in mind that packages in R are created by individuals or small groups, and not endorsed by the R core group. As a result, they do not necessarily undergo the same level of testing and quality assurance that the core R system does.

B.4.3 CRAN task views

The “Task Views” on CRAN are a very useful resource for finding packages. These are curated listings of relevant packages within a particular application area (such as multivariate statistics, psychometrics, or survival analysis). Table B.1 displays the task views available as of July 2021.

Table B.1: A complete list of CRAN task views.
Task View                  Subject
Bayesian                   Bayesian Inference
ChemPhys                   Chemometrics and Computational Physics
ClinicalTrials             Clinical Trial Design, Monitoring, and Analysis
Cluster                    Cluster Analysis and Finite Mixture Models
Databases                  Databases with R
DifferentialEquations      Differential Equations
Distributions              Probability Distributions
Econometrics               Econometrics
Environmetrics             Analysis of Ecological and Environmental Data
ExperimentalDesign         Design of Experiments (DoE) and Analysis of Experimental Data
ExtremeValue               Extreme Value Analysis
Finance                    Empirical Finance
FunctionalData             Functional Data Analysis
Genetics                   Statistical Genetics
gR                         gRaphical Models in R
Graphics                   Graphic Displays and Dynamic Graphics and Graphic Devices and Visualization
HighPerformanceComputing   High-Performance and Parallel Computing with R
Hydrology                  Hydrological Data and Modeling
MachineLearning            Machine Learning and Statistical Learning
MedicalImaging             Medical Image Analysis
MetaAnalysis               Meta-Analysis
MissingData                Missing Data
ModelDeployment            Model Deployment with R
Multivariate               Multivariate Statistics
NaturalLanguageProcessing  Natural Language Processing
NumericalMathematics       Numerical Mathematics
OfficialStatistics         Official Statistics and Survey Methodology
Optimization               Optimization and Mathematical Programming
Pharmacokinetics           Analysis of Pharmacokinetic Data
Phylogenetics              Phylogenetics, Especially Comparative Methods
Psychometrics              Psychometric Models and Methods
ReproducibleResearch       Reproducible Research
Robust                     Robust Statistical Methods
SocialSciences             Statistics for the Social Sciences
Spatial                    Analysis of Spatial Data
SpatioTemporal             Handling and Analyzing Spatio-Temporal Data
Survival                   Survival Analysis
TeachingStatistics         Teaching Statistics
TimeSeries                 Time Series Analysis
Tracking                   Processing and Analysis of Tracking Data
WebTechnologies            Web Technologies and Services
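Task views can also be queried from within R. The following is a minimal sketch that assumes the contributed ctv package is installed and that a CRAN mirror is reachable. Neither is required by the text above, and the helper name list_task_views is hypothetical. It relies on ctv::available.views(), which returns one object per task view, each assumed here to carry name and topic elements.

```r
# Sketch: list the current CRAN task views from within R.
# Assumes the contributed 'ctv' package and a working internet connection.
list_task_views <- function() {
  if (!requireNamespace("ctv", quietly = TRUE)) {
    message("The ctv package is not installed; try install.packages(\"ctv\")")
    return(invisible(NULL))
  }
  views <- ctv::available.views()  # queries the configured CRAN mirror
  data.frame(
    name = sapply(views, function(v) v$name),
    topic = sapply(views, function(v) v$topic),
    stringsAsFactors = FALSE
  )
}

# Usage (left commented out because both calls need network access):
# list_task_views()
# ctv::install.views("Survival")  # installs the packages listed in one view
```

Because the set of task views changes over time, the result may not match Table B.1 exactly.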

B.5 Further resources

Advanced R is an excellent source for learning more about how R works (H. Wickham 2019). Extensive resources and documentation about R can be found at the Comprehensive R Archive Network (CRAN).

The forcats package, included in the tidyverse, is designed to facilitate data wrangling with factors.
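As a brief illustration of the kind of task forcats addresses, the sketch below reorders the levels of a small, made-up factor. The data and object names are invented for this example; fct_infreq() and fct_relevel() are functions from the forcats package.

```r
library(forcats)

# A small, invented factor; levels default to alphabetical order
mood <- factor(c("sad", "happy", "sad", "meh", "sad", "happy"))
levels(mood)

# Reorder the levels from most to least frequent
by_freq <- fct_infreq(mood)
levels(by_freq)

# Move one level to the front, leaving the others in place
meh_first <- fct_relevel(mood, "meh")
levels(meh_first)
```

Level order matters mainly for plots and model output, where it determines the order of axis categories and the reference group.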

More information regarding tibbles can be found at https://tibble.tidyverse.org.
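One practical difference between a tibble and a base data frame can be seen when subsetting a single column. This is a minimal sketch assuming the tibble package is installed; the object names cars_df and cars_tbl are illustrative.

```r
library(tibble)

cars_df <- mtcars
# Tibbles do not keep row names, so store them in a column instead
cars_tbl <- as_tibble(mtcars, rownames = "model")

# A data frame drops a single selected column to a plain vector
class(cars_df[, "mpg"])

# A tibble returns another tibble from the same operation
class(cars_tbl[, "mpg"])

# Printing a tibble shows only the first rows and the column types
cars_tbl
```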

JupyterLab and JupyterHub are alternative environments that support analysis via sophisticated notebooks for multiple languages including Julia, Python, and R.

B.6 Exercises

Problem 1 (Easy): The following code chunk throws an error when run in a fresh R session, rather than producing the output shown below it.

mtcars %>%
  select(mpg, cyl)
                     mpg cyl
Mazda RX4           21.0   6
Mazda RX4 Wag       21.0   6
Datsun 710          22.8   4
Hornet 4 Drive      21.4   6
Hornet Sportabout   18.7   8
Valiant             18.1   6
Duster 360          14.3   8
Merc 240D           24.4   4
Merc 230            22.8   4
Merc 280            19.2   6
Merc 280C           17.8   6
Merc 450SE          16.4   8
Merc 450SL          17.3   8
Merc 450SLC         15.2   8
Cadillac Fleetwood  10.4   8
Lincoln Continental 10.4   8
Chrysler Imperial   14.7   8
Fiat 128            32.4   4
Honda Civic         30.4   4
Toyota Corolla      33.9   4
Toyota Corona       21.5   4
Dodge Challenger    15.5   8
AMC Javelin         15.2   8
Camaro Z28          13.3   8
Pontiac Firebird    19.2   8
Fiat X1-9           27.3   4
Porsche 914-2       26.0   4
Lotus Europa        30.4   4
Ford Pantera L      15.8   8
Ferrari Dino        19.7   6
Maserati Bora       15.0   8
Volvo 142E          21.4   4

What is the problem?

Problem 2 (Easy): Which of these kinds of names should be wrapped with quotation marks when used in R?

  • function name
  • file name
  • the name of an argument in a named argument
  • object name

Problem 3 (Easy): A user has typed the following commands into the RStudio console.

obj1 <- 2:10
obj2 <- c(2, 5)
obj3 <- c(TRUE, FALSE)
obj4 <- 42

What values are returned by the following commands?

obj1 * 10
obj1[2:4]
obj1[-3]
obj1 + obj2
obj1 * obj3
obj1 + obj4
obj2 + obj3
sum(obj2)
sum(obj3)

Problem 4 (Easy): A user has typed the following commands into the RStudio console:

mylist <- list(x1 = "sally", x2 = 42, x3 = FALSE, x4 = 1:5)

What value does each of the following commands return?

is.list(mylist)
names(mylist)
length(mylist)
mylist[[2]]
mylist[["x1"]]
mylist$x2
length(mylist[["x4"]])
class(mylist)
typeof(mylist)
class(mylist[[4]])
typeof(mylist[[3]])

Problem 5 (Easy): What’s wrong with this statement?

help(NHANES, package <- "NHANES")

Problem 6 (Easy): Consult the documentation for CPS85 in the mosaicData package to determine the meaning of CPS.

Problem 7 (Easy): The following code chunk throws an error. Why?

library(tidyverse)
mtcars %>%
  filter(cylinders == 4)
Error: Problem with `filter()` input `..1`.
ℹ Input `..1` is `cylinders == 4`.
x object 'cylinders' not found

What is the problem?

Problem 8 (Easy): The date function returns an indication of the current time and date. What arguments does date take? What kind of object is the result from date? What kind of object is the result from Sys.time?

Problem 9 (Easy): A user has typed the following commands into the RStudio console.

a <- c(10, 15)
b <- c(TRUE, FALSE)
c <- c("happy", "sad")

What does each of the following commands return? Describe the class of the object as well as its value.

data.frame(a, b, c)
cbind(a, b)
rbind(a, b)
cbind(a, b, c)
list(a, b, c)[[2]]

Problem 10 (Easy): For each of the following assignment statements, describe the error (or note why it does not generate an error).

result1 <- sqrt 10
result2 <-- "Hello to you!"
3result <- "Hello to you"
result4 <- "Hello to you
result5 <- date()

Problem 11 (Easy): The following code chunk throws an error.

library(tidyverse)
mtcars %>%
  filter(cyl = 4)
Error: Problem with `filter()` input `..1`.
x Input `..1` is named.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `cyl == 4`?

The error suggests that you need to use == inside of filter(). Why?

Problem 12 (Medium): The following code undertakes some data analysis using data from the HELP (Health Evaluation and Linkage to Primary Care) trial.

library(mosaic)
ds <-
  read.csv("http://nhorton.people.amherst.edu/r2/datasets/helpmiss.csv")
summarise(group_by(
  select(filter(mutate(ds,
    sex = ifelse(female == 1, "F", "M")
  ), !is.na(pcs)), age, pcs, sex),
  sex
), meanage = mean(age), meanpcs = mean(pcs), n = n())

Describe in words what computations are being done. Using the pipe notation, translate this code into a more readable version.

Problem 13 (Medium): The following concepts should have some meaning to you: package, function, command, argument, assignment, object, object name, data frame, named argument, quoted character string.

Construct an example of R commands that make use of at least four of these. Label which part of your example R command corresponds to each.

B.7 Supplementary exercises

Available at https://mdsr-book.github.io/mdsr2e/appR.html#datavizI-online-exercises
