Introduction

This example builds on the database chapter from Modern Data Science with R: http://mdsr-book.github.io/.

We begin be demoinstrating how to access data from a MySQL database using R. This example utilizes a database of wideband acoustic immitance variables from humans with normal hearing (see http://www.science.smith.edu/wai-database/ and https://projectreporter.nih.gov/project_info_description.cfm?aid=8769352&icde=30039221&ddparam=&ddvalue=&ddsub=&cr=10&csb=default&cs=ASC for more details).

A relevant paper on the topic of data management and databases in R can be found at http://chance.amstat.org/2015/04/setting-the-stage.

Accessing data from a database using SQL commands

First we demonstrate how to access data using SQL (structured query language) commands and the dbGetQuery() function. We begin by setting up a connection to the database.

library(tidyverse)
library(RMySQL)
con <- dbConnect(MySQL(), host = "scidb.smith.edu",
                user = "waiuser", password = "smith_waiDB", 
                dbname = "wai")

Next a series of SQL queries can be sent to the database. These return R dataframes.

dbGetQuery(con, "SHOW TABLES")
##   Tables_in_wai
## 1  Measurements
## 2       PI_Info
## 3       Subject
dbGetQuery(con, "EXPLAIN PI_Info")
##          Field          Type Null Key Default Extra
## 1   Identifier   varchar(20)  YES        <NA>      
## 2      PI_Year       int(11)  YES        <NA>      
## 3           PI  varchar(500)  YES        <NA>      
## 4  Affiliation  varchar(500)  YES        <NA>      
## 5        Email   varchar(30)  YES        <NA>      
## 6        Title  varchar(140)  YES        <NA>      
## 7          Pub   varchar(30)  YES        <NA>      
## 8         Date      char(20)  YES        <NA>      
## 9          URL  varchar(140)  YES        <NA>      
## 10    PI_Notes varchar(1500)  YES        <NA>
ds <- dbGetQuery(con, "SELECT * from Measurements LIMIT 10")
ds
##    Identifier Sub_Number Session Left_Ear MEP Instrument    Freq
## 1   Abur_2014          1       1        0  -5          1 210.938
## 2   Abur_2014          1       1        0  -5          1 234.375
## 3   Abur_2014          1       1        0  -5          1 257.812
## 4   Abur_2014          1       1        0  -5          1 281.250
## 5   Abur_2014          1       1        0  -5          1 304.688
## 6   Abur_2014          1       1        0  -5          1 328.125
## 7   Abur_2014          1       1        0  -5          1 351.562
## 8   Abur_2014          1       1        0  -5          1 375.000
## 9   Abur_2014          1       1        0  -5          1 398.438
## 10  Abur_2014          1       1        0  -5          1 421.875
##    Absorbance      Zmag      Zang
## 1   0.0451375 110638000 -0.228113
## 2   0.0441247 100482000 -0.230561
## 3   0.0495935  90561100 -0.230213
## 4   0.0516088  83515500 -0.230959
## 5   0.0590836  77476800 -0.229652
## 6   0.0628038  71229100 -0.230026
## 7   0.0682962  66615500 -0.229576
## 8   0.0738373  61996200 -0.229327
## 9   0.0794857  58193600 -0.228984
## 10  0.0932373  54845900 -0.226507

Accessing a database using dplyr commands

Alternatively, a connection can be made to the server by creating a series of dplyr table objects.

db <- src_mysql(dbname = "wai", host = "scidb.smith.edu", user = "waiuser", 
                password = "smith_waiDB")
Measurements <- tbl(db, "Measurements")
PI_Info <- tbl(db, "PI_Info")
Subject <- tbl(db, "Subject")

Let’s explore the PI_Info table.

PI_Info %>% 
  summarise(total = n())
## # Source:   lazy query [?? x 1]
## # Database: mysql 5.5.57-0ubuntu0.14.04.1 [waiuser@scidb.smith.edu:/wai]
##   total
##   <dbl>
## 1     5
PI_Info %>% 
  collect() %>%  # collect() is a bad idea when dealing with large tables!
  data.frame()   
##      Identifier PI_Year                                            PI
## 1 Rosowski_2012    2012                              John J. Rosowski
## 2     Abur_2014    2014 Defne Abur; Nicholas J. Horton; Susan E. Voss
## 3  Shahnaz_2006    2006                     Navid Shahnaz; Karin Bork
## 4     Voss_1994    1994                                 Susan E. Voss
## 5   Werner_2010    2010                                 Douglas Keefe
##                                                                                                                                                                                                                                                              Affiliation
## 1 Eaton-Peabody Laboratory, Massachusetts Eye and Ear Infirmary, Boston; Department of Otology and Laryngology, Harvard Medical School, Boston; Speech and Hearing Bioscience and Technology Program, Harvard-MIT Division of Health Sciences and Technology, Cambridge.
## 2                                                                                                                                                                                                                                                          Smith College
## 3                                                                                                                                                                                                                                         University of British Columbia
## 4                                                                                                                                                                                                                                 Smith College, formerly AT&T Bell Labs
## 5                                                                                                                                                                                                                                            Boys Town National Hospital
##                            Email
## 1 John_Rosowski@meei.harvard.edu
## 2                svoss@smith.edu
## 3   nshahnaz@audiospeech.ubc.ca 
## 4                svoss@smith.edu
## 5     Douglas.Keefe@boystown.org
##                                                                                         Title
## 1             Ear-Canal Reflectance, Umbo Velocity, and Tympanometry in Normal-Hearing Adults
## 2                                               Intrasubject Variability in Power Reflectance
## 3                           Wideband Reflectance Norms for Caucasian and Chinese Young Adults
## 4                    Measurement of acoustic impedance and reflectance in the human ear canal
## 5 Ear-Canal Wideband Acoustic Transfer Functions of Adults and Two- to Nine-Month-Old Infants
##                              Pub       Date
## 1                  Ear & Hearing 11/06/2015
## 2              J. Am Acad Audiol 08/24/2016
## 3                 Ear & Hearing  08/24/2016
## 4 Journal of the Acoustical Soci 02/16/2017
## 5                Ear and Hearing   9/1/2017
##                                                                                                              URL
## 1                                                                    http://www.ncbi.nlm.nih.gov/pubmed/21857517
## 2                                                                                                  Not available
## 3  http://journals.lww.com/ear-hearing/Abstract/2006/12000/Wideband_Reflectance_Norms_for_Caucasian_and.15.aspx 
## 4                                                                    http://www.ncbi.nlm.nih.gov/pubmed/21857517
## 5                                                                   https://www.ncbi.nlm.nih.gov/pubmed/20517155
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             PI_Notes
## 1 HearID (Mimosa Acoustics); \nNormal Criteria as follows: \n(1) There was no history of significant middle ear disease (e.g., otitis media or effusion 2 or more years previously were not considered significant if there were no known residual consequences).\n(2) There was no history of otologic surgery, with the exception of myringotomy or tympanostomy tube placement over 2 yr prior. \n(3) The external ear and TM revealed no abnormalities on otoscopic examination. \n(4) Audiometric measurements had pure-tone thresholds of 20 dB HL or better at octave frequen- cies between 0.250 and 8 kHz. \n(5) Air-bone gaps were no greater than 15 dB at 0.25 kHz and 10 dB between frequencies of 0.5 to 4 kHz. Most subjects had air and bone thresholds between 0 and 10 dB HL with an average near 8 to 9 dB HL at the highest frequencies. \n(6) Tympanograms were Type-A peaked, with peak pressures of 100 to 50 daPa, static compliance of 0.3 to 2.0 cc, total tympanometric volumes (static compliance ear canal volume) between 0.7 and 2.7 cc, and normal-appearing shape that is neither rounded nor sharp. \n(7) All subjects included in the \x93normal hearing\x94 population were required to have two \x93normal\x94 ears (as defined by criteria described earlier).
## 2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   Database includes measurements at Position 1 and Channel B only.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
## 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  Measurements taken with a system using sysid and the Etymotic ER-2 pressure transducer and ER-7c probe microphone
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          Used an ER-1 earphone and ER-7C microphone.  Data provided by Doug Keefe and formatted by Susan Voss with help.  Lynne Werner is retired.

Let’s explore the Subjects table.

Subject %>% 
  summarise(total = n())
## # Source:   lazy query [?? x 1]
## # Database: mysql 5.5.57-0ubuntu0.14.04.1 [waiuser@scidb.smith.edu:/wai]
##   total
##   <dbl>
## 1   394
Subject %>% 
  collect()  # be careful with collect() with large tables!
## # A tibble: 394 x 12
##       Identifier Sub_Number Session_Total   Age Female  Race Ethnicity
##            <chr>      <chr>         <int> <int>  <int> <int>     <int>
##  1 Rosowski_2012          3             1    30      1     5         2
##  2 Rosowski_2012          6             1    29      0     5         2
##  3 Rosowski_2012         11             1    64      1     5         2
##  4 Rosowski_2012         12             1    42      1     5         2
##  5 Rosowski_2012         14             1    24      0     5         2
##  6 Rosowski_2012         15             1    32      1     5         2
##  7 Rosowski_2012         17             1    22      0     5         2
##  8 Rosowski_2012         18             1    33      1     5         2
##  9 Rosowski_2012         21             1    36      1     3         2
## 10 Rosowski_2012         22             1    33      0     5         2
## # ... with 384 more rows, and 5 more variables: Left_Ear_Status <int>,
## #   Right_Ear_Status <int>, Left_Ear_Area <dbl>, Right_Ear_Area <dbl>,
## #   Sub_Notes <chr>

Let’s explore the Measurements table.

Measurements %>% 
  summarise(total = n())
## # Source:   lazy query [?? x 1]
## # Database: mysql 5.5.57-0ubuntu0.14.04.1 [waiuser@scidb.smith.edu:/wai]
##    total
##    <dbl>
## 1 128998

Let’s download the data from a given subject

onesubj <- Measurements %>% 
  filter(Identifier == "Rosowski_2012", Sub_Number == 3) %>%
  collect %>%
  mutate(SessionNum = as.factor(Session))
head(onesubj)
## # A tibble: 6 x 11
##      Identifier Sub_Number Session Left_Ear   MEP Instrument    Freq
##           <chr>      <int>   <int>    <int> <dbl>      <int>   <dbl>
## 1 Rosowski_2012          3       1        1    NA          1 210.938
## 2 Rosowski_2012          3       1        1    NA          1 234.375
## 3 Rosowski_2012          3       1        1    NA          1 257.812
## 4 Rosowski_2012          3       1        1    NA          1 281.250
## 5 Rosowski_2012          3       1        1    NA          1 304.688
## 6 Rosowski_2012          3       1        1    NA          1 328.125
## # ... with 4 more variables: Absorbance <dbl>, Zmag <dbl>, Zang <dbl>,
## #   SessionNum <fctr>

Finally we can plot the results

onesubj <- onesubj %>%
  mutate(Ear = ifelse(Left_Ear == 1, "Left", "Right"))

ggplot(data = onesubj, aes(x = Freq, y = Absorbance, colour = Ear)) + 
  geom_point() +
  scale_x_log10() + 
  ggtitle("Absorbance by ear Rosowski subject 3")

Ensembl genetics data

Our second example demo describes an extensive publically acessible genomics database called Ensembl.

Ensembl

The website at https://www.ensembl.org states:

Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotate genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.

The website provides a list of what one can do with Ensembl:

  • View genes along with other annotation along the chromosome
  • View alternative transcripts (including splice variants) for a gene
  • Explore homologues and phylogenetic trees across more than 40 species for any gene
  • Compare whole genome alignments and conserved regions across species
  • View microarray sequences that match to Ensembl genes
  • View ESTs, clones, mRNA and proteins for any chromosomal region
  • Examine single nucleotide polymorphisms (SNPs) for a gene or chromosomal region
  • View SNPs across strains (rat, mouse), populations (human), or even breeds (dog)
  • View positions and sequence of mRNA and protein that align with an Ensembl gene
  • Upload your own data
  • Use BLAST, or BLAT, a similar sequence alignment search tool, against any Ensembl genome
  • Export sequence, or create a table of gene information with BioMart

Accessing data from a database using SQL commands

As before, we first demonstrate how to access data using SQL (structured query language) commands and the dbGetQuery() function. We begin by setting up a connection to the database.

library(tidyverse)
library(RMySQL)
con <- dbConnect(MySQL(), host = "ensembldb.ensembl.org",
                user = "anonymous", password = "", 
                port = 3306)

Next a series of SQL queries can be sent to the database. These return R dataframes. Let’s start by seeing what databases are available.

ds <- dbGetQuery(con, "SHOW DATABASES")
dim(ds)
## [1] 6854    1
grep("scrofa", ds$Database, value = TRUE)
##   [1] "sus_scrofa_core_56_9"            "sus_scrofa_core_57_9a"          
##   [3] "sus_scrofa_core_58_9b"           "sus_scrofa_core_59_9c"          
##   [5] "sus_scrofa_core_60_9d"           "sus_scrofa_core_61_9e"          
##   [7] "sus_scrofa_core_62_9f"           "sus_scrofa_core_63_9"           
##   [9] "sus_scrofa_core_64_9"            "sus_scrofa_core_65_9"           
##  [11] "sus_scrofa_core_66_9"            "sus_scrofa_core_67_102"         
##  [13] "sus_scrofa_core_68_102"          "sus_scrofa_core_69_102"         
##  [15] "sus_scrofa_core_70_102"          "sus_scrofa_core_71_102"         
##  [17] "sus_scrofa_core_72_102"          "sus_scrofa_core_73_102"         
##  [19] "sus_scrofa_core_74_102"          "sus_scrofa_core_75_102"         
##  [21] "sus_scrofa_core_76_102"          "sus_scrofa_core_77_102"         
##  [23] "sus_scrofa_core_78_102"          "sus_scrofa_core_79_102"         
##  [25] "sus_scrofa_core_80_102"          "sus_scrofa_core_81_102"         
##  [27] "sus_scrofa_core_82_102"          "sus_scrofa_core_83_102"         
##  [29] "sus_scrofa_core_84_102"          "sus_scrofa_core_85_102"         
##  [31] "sus_scrofa_core_86_102"          "sus_scrofa_core_87_102"         
##  [33] "sus_scrofa_core_88_102"          "sus_scrofa_core_89_102"         
##  [35] "sus_scrofa_core_90_111"          "sus_scrofa_funcgen_57_9a"       
##  [37] "sus_scrofa_funcgen_58_9b"        "sus_scrofa_funcgen_59_9c"       
##  [39] "sus_scrofa_funcgen_60_9d"        "sus_scrofa_funcgen_61_9e"       
##  [41] "sus_scrofa_funcgen_62_9f"        "sus_scrofa_funcgen_63_9"        
##  [43] "sus_scrofa_funcgen_64_9"         "sus_scrofa_funcgen_65_9"        
##  [45] "sus_scrofa_funcgen_66_9"         "sus_scrofa_funcgen_67_102"      
##  [47] "sus_scrofa_funcgen_68_102"       "sus_scrofa_funcgen_69_102"      
##  [49] "sus_scrofa_funcgen_70_102"       "sus_scrofa_funcgen_71_102"      
##  [51] "sus_scrofa_funcgen_72_102"       "sus_scrofa_funcgen_73_102"      
##  [53] "sus_scrofa_funcgen_74_102"       "sus_scrofa_funcgen_75_102"      
##  [55] "sus_scrofa_funcgen_76_102"       "sus_scrofa_funcgen_77_102"      
##  [57] "sus_scrofa_funcgen_78_102"       "sus_scrofa_funcgen_79_102"      
##  [59] "sus_scrofa_funcgen_80_102"       "sus_scrofa_funcgen_81_102"      
##  [61] "sus_scrofa_funcgen_82_102"       "sus_scrofa_funcgen_83_102"      
##  [63] "sus_scrofa_funcgen_84_102"       "sus_scrofa_funcgen_85_102"      
##  [65] "sus_scrofa_funcgen_86_102"       "sus_scrofa_funcgen_87_102"      
##  [67] "sus_scrofa_funcgen_88_102"       "sus_scrofa_funcgen_89_102"      
##  [69] "sus_scrofa_funcgen_90_111"       "sus_scrofa_otherfeatures_56_9"  
##  [71] "sus_scrofa_otherfeatures_57_9a"  "sus_scrofa_otherfeatures_58_9b" 
##  [73] "sus_scrofa_otherfeatures_59_9c"  "sus_scrofa_otherfeatures_60_9d" 
##  [75] "sus_scrofa_otherfeatures_61_9e"  "sus_scrofa_otherfeatures_62_9f" 
##  [77] "sus_scrofa_otherfeatures_63_9"   "sus_scrofa_otherfeatures_64_9"  
##  [79] "sus_scrofa_otherfeatures_65_9"   "sus_scrofa_otherfeatures_66_9"  
##  [81] "sus_scrofa_otherfeatures_67_102" "sus_scrofa_otherfeatures_68_102"
##  [83] "sus_scrofa_otherfeatures_69_102" "sus_scrofa_otherfeatures_70_102"
##  [85] "sus_scrofa_otherfeatures_71_102" "sus_scrofa_otherfeatures_72_102"
##  [87] "sus_scrofa_otherfeatures_73_102" "sus_scrofa_otherfeatures_74_102"
##  [89] "sus_scrofa_otherfeatures_75_102" "sus_scrofa_otherfeatures_76_102"
##  [91] "sus_scrofa_otherfeatures_77_102" "sus_scrofa_otherfeatures_78_102"
##  [93] "sus_scrofa_otherfeatures_79_102" "sus_scrofa_otherfeatures_80_102"
##  [95] "sus_scrofa_otherfeatures_81_102" "sus_scrofa_otherfeatures_82_102"
##  [97] "sus_scrofa_otherfeatures_83_102" "sus_scrofa_otherfeatures_84_102"
##  [99] "sus_scrofa_otherfeatures_85_102" "sus_scrofa_otherfeatures_86_102"
## [101] "sus_scrofa_otherfeatures_87_102" "sus_scrofa_otherfeatures_88_102"
## [103] "sus_scrofa_otherfeatures_89_102" "sus_scrofa_otherfeatures_90_111"
## [105] "sus_scrofa_rnaseq_67_102"        "sus_scrofa_rnaseq_68_102"       
## [107] "sus_scrofa_rnaseq_69_102"        "sus_scrofa_rnaseq_70_102"       
## [109] "sus_scrofa_rnaseq_71_102"        "sus_scrofa_rnaseq_72_102"       
## [111] "sus_scrofa_rnaseq_73_102"        "sus_scrofa_rnaseq_74_102"       
## [113] "sus_scrofa_rnaseq_75_102"        "sus_scrofa_rnaseq_76_102"       
## [115] "sus_scrofa_rnaseq_77_102"        "sus_scrofa_rnaseq_78_102"       
## [117] "sus_scrofa_rnaseq_79_102"        "sus_scrofa_rnaseq_80_102"       
## [119] "sus_scrofa_rnaseq_81_102"        "sus_scrofa_rnaseq_82_102"       
## [121] "sus_scrofa_rnaseq_83_102"        "sus_scrofa_rnaseq_84_102"       
## [123] "sus_scrofa_rnaseq_85_102"        "sus_scrofa_rnaseq_86_102"       
## [125] "sus_scrofa_rnaseq_87_102"        "sus_scrofa_rnaseq_88_102"       
## [127] "sus_scrofa_rnaseq_89_102"        "sus_scrofa_rnaseq_90_111"       
## [129] "sus_scrofa_variation_57_9a"      "sus_scrofa_variation_58_9b"     
## [131] "sus_scrofa_variation_59_9c"      "sus_scrofa_variation_60_9d"     
## [133] "sus_scrofa_variation_61_9e"      "sus_scrofa_variation_62_9f"     
## [135] "sus_scrofa_variation_63_9"       "sus_scrofa_variation_64_9"      
## [137] "sus_scrofa_variation_65_9"       "sus_scrofa_variation_66_9"      
## [139] "sus_scrofa_variation_67_102"     "sus_scrofa_variation_68_102"    
## [141] "sus_scrofa_variation_69_102"     "sus_scrofa_variation_70_102"    
## [143] "sus_scrofa_variation_71_102"     "sus_scrofa_variation_72_102"    
## [145] "sus_scrofa_variation_73_102"     "sus_scrofa_variation_74_102"    
## [147] "sus_scrofa_variation_75_102"     "sus_scrofa_variation_76_102"    
## [149] "sus_scrofa_variation_77_102"     "sus_scrofa_variation_78_102"    
## [151] "sus_scrofa_variation_79_102"     "sus_scrofa_variation_80_102"    
## [153] "sus_scrofa_variation_81_102"     "sus_scrofa_variation_82_102"    
## [155] "sus_scrofa_variation_83_102"     "sus_scrofa_variation_84_102"    
## [157] "sus_scrofa_variation_85_102"     "sus_scrofa_variation_86_102"    
## [159] "sus_scrofa_variation_87_102"     "sus_scrofa_variation_88_102"    
## [161] "sus_scrofa_variation_89_102"     "sus_scrofa_variation_90_111"    
## [163] "sus_scrofa_vega_69_102"          "sus_scrofa_vega_70_102"         
## [165] "sus_scrofa_vega_71_102"          "sus_scrofa_vega_72_102"         
## [167] "sus_scrofa_vega_73_102"          "sus_scrofa_vega_74_102"         
## [169] "sus_scrofa_vega_75_102"          "sus_scrofa_vega_76_102"         
## [171] "sus_scrofa_vega_77_102"          "sus_scrofa_vega_78_102"         
## [173] "sus_scrofa_vega_79_102"          "sus_scrofa_vega_80_102"         
## [175] "sus_scrofa_vega_81_102"          "sus_scrofa_vega_82_102"         
## [177] "sus_scrofa_vega_83_102"          "sus_scrofa_vega_84_102"         
## [179] "sus_scrofa_vega_85_102"          "sus_scrofa_vega_86_102"         
## [181] "sus_scrofa_vega_87_102"          "sus_scrofa_vega_88_102"

Let’s focus on the sus_scrofa_variation_79_102 database.

dbGetQuery(con, "USE sus_scrofa_variation_79_102;")
## data frame with 0 columns and 0 rows
dbGetQuery(con, "SHOW TABLES")
##    Tables_in_sus_scrofa_variation_79_102
## 1                                 allele
## 2                            allele_code
## 3                        associate_study
## 4                                 attrib
## 5                             attrib_set
## 6                            attrib_type
## 7             compressed_genotype_region
## 8                compressed_genotype_var
## 9                           coord_system
## 10                         display_group
## 11                         failed_allele
## 12                    failed_description
## 13           failed_structural_variation
## 14                      failed_variation
## 15                         genotype_code
## 16                            individual
## 17       individual_genotype_multiple_bp
## 18                 individual_population
## 19                    individual_synonym
## 20                       individual_type
## 21                                  meta
## 22                            meta_coord
## 23               motif_feature_variation
## 24                             phenotype
## 25                     phenotype_feature
## 26              phenotype_feature_attrib
## 27                            population
## 28                   population_genotype
## 29                  population_structure
## 30                    population_synonym
## 31          protein_function_predictions
## 32   protein_function_predictions_attrib
## 33                           publication
## 34                         read_coverage
## 35          regulatory_feature_variation
## 36                            seq_region
## 37                                source
## 38                     strain_gtype_poly
## 39                  structural_variation
## 40      structural_variation_association
## 41          structural_variation_feature
## 42           structural_variation_sample
## 43                                 study
## 44                      submitter_handle
## 45                         subsnp_handle
## 46                            subsnp_map
## 47              tagged_variation_feature
## 48                  transcript_variation
## 49                       translation_md5
## 50                             variation
## 51                      variation_attrib
## 52                    variation_citation
## 53                     variation_feature
## 54                    variation_genename
## 55                        variation_hgvs
## 56                         variation_set
## 57    variation_set_structural_variation
## 58               variation_set_structure
## 59               variation_set_variation
## 60                     variation_synonym
dbGetQuery(con, "SELECT * FROM publication")
##   publication_id
## 1              1
## 2              2
## 3              3
## 4              4
##                                                                                                                                      title
## 1                                         A combination of two variants in PRKAG3 is needed for a positive effect on meat quality in pigs.
## 2 RNA deep sequencing reveals novel candidate genes and polymorphisms in boar testis and liver tissues with divergent androstenone levels.
## 3                                                 TLR4 single nucleotide polymorphisms (SNPs) associated with Salmonella shedding in pigs.
## 4                                                  Population history and genomic signatures for high-altitude adaptation in Tibetan pigs.
##                                                                                                                                    authors
## 1                                                                                                                     Uimari P, Sironen A.
## 2 Gunawan A, Sahadevan S, Neuhoff C, Große-Brinkhaus C, Gad A, Frieden L, Tesfaye D, Tholen E, Looft C, Uddin MJ, Schellander K, Cinar MU.
## 3                                                             Kich JD, Uthe JJ, Benavides MV, Cantão ME, Zanella R, Tuggle CK, Bearson SM.
## 4                                                                                                Ai H, Yang B, Li J, Xie X, Chen H, Ren J.
##       pmid      pmcid year                          doi ucsc_id
## 1 24580963 PMC3943410 2014      10.1186/1471-2156-15-29    <NA>
## 2 23696805 PMC3655983 2013 10.1371/journal.pone.0063259    <NA>
## 3 24566961 PMC3990860 2014    10.1007/s13353-014-0199-8    <NA>
## 4 25270331 PMC4197311 2014     10.1186/1471-2164-15-834    <NA>
dbGetQuery(con, "SELECT * FROM source")
##   source_id               name  version
## 1         1              dbSNP      140
## 2         2      Archive dbSNP      140
## 3         3 Pig SNP Consortium        0
## 4         4               DGVa   201405
## 5         5       Animal_QTLdb 20140825
##                                                                                                                                                         description
## 1                                                                                                          Variants (including SNPs and indels) imported from dbSNP
## 2                                                                                                                         Former variants names imported from dbSNP
## 3                                                                                                                                             PorcineSNP60 BeadChip
## 4                                                                                                                              Database of Genomic Variants Archive
## 5 The Animal Quantitative Trait Loci (QTL) database (Animal QTLdb) is designed to house all publicly available QTL and association data on livestock animal species
##                                               url type somatic_status
## 1       http://www.ncbi.nlm.nih.gov/projects/SNP/ <NA>          mixed
## 2       http://www.ncbi.nlm.nih.gov/projects/SNP/ <NA>          mixed
## 3                                            <NA> <NA>       germline
## 4                      http://www.ebi.ac.uk/dgva/ <NA>       germline
## 5 http://www.animalgenome.org/cgi-bin/QTLdb/index <NA>       germline
##                   data_types
## 1                  variation
## 2          variation_synonym
## 3          variation_synonym
## 4 structural_variation,study
## 5    phenotype_feature,study
ds <- dbGetQuery(con, "SELECT * from individual LIMIT 1000")
glimpse(ds)
## Observations: 96
## Variables: 10
## $ individual_id        <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13...
## $ name                 <chr> "B0177", "D0067", "Y0081", "D0059", "B013...
## $ description          <chr> "IND:NIAS-AGP|AGP-PanelM84|B0177|9823|M|O...
## $ gender               <chr> "Unknown", "Unknown", "Unknown", "Unknown...
## $ father_individual_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ mother_individual_id <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ individual_type_id   <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ display              <chr> "UNDISPLAYABLE", "UNDISPLAYABLE", "UNDISP...
## $ has_coverage         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ variation_set_id     <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
ds %>%
  group_by(description) %>%
  summarize(N = n()) %>%
  arrange(desc(N))
## # A tibble: 85 x 2
##                                                                    description
##                                                                          <chr>
##  1                                          Subject from the DGVa study nstd24
##  2 IND:NIAS-AGP|AGP-PanelM84|B0075|9823|M|O|Western, SOURCE:submitter|AGP|B007
##  3 IND:NIAS-AGP|AGP-PanelM84|B0076|9823|M|O|Western, SOURCE:submitter|AGP|B007
##  4 IND:NIAS-AGP|AGP-PanelM84|B0077|9823|M|O|Western, SOURCE:submitter|AGP|B007
##  5 IND:NIAS-AGP|AGP-PanelM84|B0079|9823|M|O|Western, SOURCE:submitter|AGP|B007
##  6 IND:NIAS-AGP|AGP-PanelM84|B0126|9823|M|O|Western, SOURCE:submitter|AGP|B012
##  7 IND:NIAS-AGP|AGP-PanelM84|B0134|9823|M|O|Western, SOURCE:submitter|AGP|B013
##  8 IND:NIAS-AGP|AGP-PanelM84|B0138|9823|M|O|Western, SOURCE:submitter|AGP|B013
##  9 IND:NIAS-AGP|AGP-PanelM84|B0140|9823|M|O|Western, SOURCE:submitter|AGP|B014
## 10 IND:NIAS-AGP|AGP-PanelM84|B0157|9823|M|O|Western, SOURCE:submitter|AGP|B015
## # ... with 75 more rows, and 1 more variables: N <int>

We can track down the publication which is associated with these data.

https://www.ncbi.nlm.nih.gov/pubmed/25662601

East Balkan Swine (EBS) Sus scrofa is the only aboriginal domesticated pig breed in Bulgaria and is distributed on the western coast of the Black Sea in Bulgaria. To reveal the breed’s genetic characteristics, we analysed mitochondrial DNA (mtDNA) and Y chromosomal DNA sequences of EBS in Bulgaria. Nucleotide diversity (πn ) of the mtDNA control region, including two newly found haplotypes, in 54 EBS was higher (0.014 ± 0.007) compared with that of European (0.005 ± 0.003) and Asian (0.006 ± 0.003) domestic pigs and wild boar. The median-joining network based on the mtDNA control region showed that the EBS and wild boar in Bulgaria comprised mainly two major mtDNA clades, European clade E1 (61.3%) and Asian clade A (38.7%). The coexistence of two mtDNA clades in EBS in Bulgaria may be the relict of historical pig translocation. Among the Bulgarian EBS colonies, the geographical differences in distribution of two mtDNA clades (E1 and A) could be attributed to the source pig populations and/or historical crossbreeding with imported pigs. In addition, analysis of the Y chromosomal DNA sequences for the EBS revealed that all of the EBS had haplotype HY1, which is dominant in European domestic pigs.

Accessing a database using dplyr commands

Alternatively, a connection can be made to the server by creating a dplyr table objects for any give SQL table.

db <- src_mysql(dbname = "sus_scrofa_variation_79_102", 
                host = "ensembldb.ensembl.org", 
                user = "anonymous",
                port = 3306,
                password = "")
Allele <- tbl(db, "allele")

Let’s explore this table

Allele %>% 
  head()
## # Source:   lazy query [?? x 8]
## # Database: mysql 5.6.33
## #   [anonymous@ensembldb.ensembl.org:/sus_scrofa_variation_79_102]
##   allele_id variation_id subsnp_id allele_code_id population_id frequency
##       <int>        <dbl>     <dbl>          <dbl>         <dbl>     <dbl>
## 1         1         1169  16338075              3            15        NA
## 2         2         1169  16338075              2            15        NA
## 3         3         1169 722145465              3            NA        NA
## 4         4         1169 722145465              2            NA        NA
## 5         5         1588  23132408              3            15        NA
## 6         6         1588  23132408              4            15        NA
## # ... with 2 more variables: count <dbl>, frequency_submitter_handle <int>
Allele %>%
  group_by(allele_code_id) %>%
  summarize(meanfreq = mean(frequency), count = n())
## # Source:   lazy query [?? x 3]
## # Database: mysql 5.6.33
## #   [anonymous@ensembldb.ensembl.org:/sus_scrofa_variation_79_102]
##    allele_code_id  meanfreq    count
##             <dbl>     <dbl>    <dbl>
##  1              1 0.3816843 14503376
##  2              2 0.3861839 14500218
##  3              3 0.6516738 14686160
##  4              4 0.5927523 14690458
##  5              5 0.9642860     3276
##  6              6        NA        5
##  7              7        NA       35
##  8              8        NA        1
##  9              9        NA        1
## 10             10        NA       27
## # ... with more rows

The tutorials at https://www.ensembl.org/info/website/tutorials/index.html are helpful in pursuing more substantive analyses.