Practical 4-1: Create a data package

Overview

In this session you will take some classic biodiversity data and turn it into an R package.

Background

When we think about biodiversity data, we generally think about the hypothesis we are testing, how we are going to collect the data, and how we need to analyse it to investigate that hypothesis. Little thought is usually given to how to store the data. This is a mistake! Large amounts of hard work in the field can be wasted by mistakes in storage. Once you have regular backups in place so that you don’t physically lose your data (if you’re not doing this, stop whatever you’re doing and fix that now!), the next most important thing to consider is whether you are always using the most up-to-date, error-free copy of your dataset when you carry out analyses.

If your data is stored in one or more CSV files or Excel spreadsheets, you need to read that file into R. For that you need to know where it is. If you only have one copy of it on your computer, and you update that when you clean the data or add new records, then that is a single point of failure. If you accidentally make incorrect edits or delete the file, then you are entirely dependent on backups to recover from your mistakes (assuming you followed the instructions above and have a backup!).

If, on the other hand, you create a new file every time you update the data, then all of your R scripts that do analyses on the data now have to refer to the new file, or you will need to copy the new data file to the projects where you are doing the analysis. If you don’t do one of these everywhere you are working on the data, one or more of your analyses may be using an out-of-date version of the data, and silently give incorrect answers.

Even if your behaviour is perfect, and you never make mistakes with your data, it is usually necessary to do some processing of the data before carrying out the analyses. This may be as simple as changing some of the column names in your tables so they are easier to work with in R (no spaces, for instance), some cleaning of the data to remove missing data, or some significant processing to turn records that are easy to collect into data that is easy to analyse. Such a script, just like the data, needs to be updated, and the code needs to be present wherever you carry out an analysis. It therefore suffers from exactly the same problems as the data itself.

There is a simple(-ish!) solution though. You don’t need to carry around your own copy of ggplot() or lmer() in every script you write, so why should you have to do it with the data? There is one definitive version of ggplot() and lmer() on your computer, and you just refer to them by loading the ggplot2 and lme4 packages using the library() function. When these packages get updated to fix bugs or extend functionality, you just do it once by installing (normally updating) the R package, and then all of your scripts use the new version. So why not turn your data into an R package and do the same?

You’ll go through the process of creating a package, processing data, inserting the data into the package, and building the package so that you can use it during the project work at the end of the course.

Tasks: storing the data

First, create a new package called githubusernameBCI (replacing githubusername with your actual GitHub username) as a git repository connected to GitHub as a repo in the SBOHVM organisation. Remember that a guide to creating packages on GitHub is available on the RPiR help pages here if you’re not sure what to do.

Generally, you would want to store the raw data from your work in the git repository so everything is archived and version controlled on GitHub. However, in this work, the dataset is not yours, and it’s so big it easily exceeds file limits for storage in GitHub. So, under no circumstances commit the raw records to your git repository – it’s far too big. We’ll show you how to ignore it so that you don’t do this by mistake shortly, but just bear this in mind now so you ensure that you don’t – the moment you commit a file like this into a repo it’s hard to undo the error.

Next, you need to download the Barro Colorado Forest Census Plot Data from the Smithsonian’s repository at https://repository.si.edu/handle/10088/20925. Download files 3 and 4, the full census data (ViewFullTable.zip) and the taxonomic data on all of the species present (ViewTax.txt). Now create a folder to store these raw data in, and create the files you will use shortly to process the data:

usethis::use_data_raw("bci_2010")
usethis::use_data_raw("bci_quadrats")
usethis::use_data_raw("bci_taxa")

Copy the files you have downloaded to the data-raw folder that has just been created inside the R package, and unzip the ViewFullTable.zip file into the same folder. You can now delete the zip file if you like. Now we need to make sure that none of these files make it into your git repo by accident. Run:

usethis::edit_git_ignore("project")

And add the following lines to the end of the .gitignore file that has opened:

ViewTax.txt
ViewFullTable.zip
ViewFullTable
TSMAttributes.txt
ViewFullTable.txt
ViewFullTable.pdf

Now go to the git pane in RStudio, and make sure that none of the files you just downloaded show up in the pane. Commit everything and create a suitable commit message for these initial files. Now create a new R file in the data-raw folder. This will be the file you run to put all of the data into the package. It should contain the following:

# Put the datasets into the package
library(dplyr)

# Move to the data-raw subfolder to access the raw data
devtools::wd(".", "data-raw")

# BCI data

# Load individual BCI records
data <- read.delim(file.path("ViewFullTable", "ViewFullTable.txt"))

# Load BCI taxonomic data, create new species column, and extract species
taxa <- read.delim("ViewTax.txt") %>%
  mutate(GenusSpecies = as.factor(paste(Genus, SpeciesName))) %>%
  filter(IDLevel == "species")

# Store species counts from 2010 census in package
source("bci_2010.R")

# Store quadrat metadata in package
source("bci_quadrats.R")

# Store taxonomic data in package
source("bci_taxa.R")

This will load all of the raw data into R (this will take some time) and run the files you have created to process each of them and store them in the package. Don’t run this script yet though! First, you need to edit the processing files (bci_2010.R, bci_quadrats.R and bci_taxa.R) to do the processing correctly. Actually, we provide them for you here!

This is bci_2010.R:

library(dplyr)
library(reshape2)

# Clean data to remove secondary and dead stems of trees and species not in
# taxonomy
records <- data %>%
  filter(PrimaryStem == "main", Status == "alive", !is.na(QuadratName)) %>%
  filter(GenusSpecies %in% taxa$GenusSpecies) %>%
  select(GenusSpecies, PlotCensusNumber, QuadratName) %>%
  mutate(col = as.integer(floor(QuadratName / 100)),
         row = as.integer(QuadratName - col * 100)) %>%
  filter(row < 25)

# Extract table at a single timepoint
bci_2010 <- records %>% filter(PlotCensusNumber == 7) %>%
  select(GenusSpecies, QuadratName) %>%
  acast(GenusSpecies ~ QuadratName, fill = 0,
        value.var = "QuadratName", fun.aggregate = length)

# Name the columns Q.xxyy (zero-padded quadrat names)
colnames(bci_2010) <- sprintf("Q.%04d", as.integer(colnames(bci_2010)))

# Store in package
usethis::use_data(bci_2010, overwrite = TRUE)

It takes the massive Barro Colorado Island (BCI) dataset and extracts only the 2010 (7th) census and summarises it to extract a matrix of counts of known species of living trees in 20m x 20m quadrats across the site.
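As a rough illustration of the two key steps in that script (decoding the quadrat name, then counting stems per species and quadrat), here is a minimal sketch on invented toy records. The species names and values are made up, and base R's table() is used as a stand-in for reshape2::acast() so that the sketch runs without extra packages.

```r
# Toy records standing in for the cleaned BCI data (all values are invented)
toy <- data.frame(
  GenusSpecies = c("Aa bb", "Aa bb", "Cc dd", "Aa bb"),
  QuadratName  = c(0, 0, 101, 4924)
)

# Decode the quadrat name: the first two digits are the column,
# the last two digits are the row (both indexed from 0)
toy$col <- as.integer(floor(toy$QuadratName / 100))
toy$row <- as.integer(toy$QuadratName - toy$col * 100)

# Species-by-quadrat matrix of counts; table() stands in for acast() here
counts <- unclass(table(toy$GenusSpecies, toy$QuadratName))

# Name the columns Q.xxyy, as the real script does
colnames(counts) <- sprintf("Q.%04d", as.integer(colnames(counts)))
counts
```

Each row of the result is a species and each column a quadrat, which is the same layout as bci_2010.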

This is bci_quadrats.R:

library(dplyr)
library(tibble)

# Work out useful information about the quadrats themselves
# Note: they index from 0 and they are 20m x 20m quadrats
bci_quadrats <- records %>% select(QuadratName, row, col) %>%
  unique %>%
  mutate(x = row * 20, y = col * 20, row = row + 1, col = col + 1) %>%
  mutate(Quadrat = sprintf("Q.%04d", as.integer(QuadratName))) %>%
  arrange(Quadrat) %>%
  as_tibble

# Store in package
usethis::use_data(bci_quadrats, overwrite = TRUE)

It provides metadata about the quadrats – where they are within the site – as a tibble (a prettier version of a data frame that the RStudio team have created). Specifically, the bci_quadrats data frame translates the column names in bci_2010 into actual rows and columns in a grid or (x,y) coordinates in metres in case you want to be able to display where the quadrats are physically.
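To show how the two objects might be used together, the sketch below matches the column names of a count matrix against the Quadrat column of the metadata to attach coordinates to per-quadrat totals. The objects toy_counts and toy_quadrats are invented stand-ins for bci_2010 and bci_quadrats, so treat this as an illustration under those assumptions and not as part of the package.

```r
# Toy stand-in for bci_2010: species in rows, quadrats in columns (invented)
toy_counts <- matrix(
  c(2, 0, 0, 1, 1, 0), nrow = 2,
  dimnames = list(c("Aa bb", "Cc dd"), c("Q.0000", "Q.0101", "Q.4924"))
)

# Toy stand-in for bci_quadrats, deliberately in a different order
toy_quadrats <- data.frame(
  Quadrat = c("Q.4924", "Q.0000", "Q.0101"),
  row = c(25, 1, 2),
  col = c(50, 1, 2),
  x = c(480, 0, 20),
  y = c(980, 0, 20)
)

# Look up the metadata row for each column of the count matrix
idx <- match(colnames(toy_counts), toy_quadrats$Quadrat)

# Total number of stems per quadrat, with its position in metres
abundance <- data.frame(
  toy_quadrats[idx, c("Quadrat", "x", "y")],
  trees = unname(colSums(toy_counts))
)
abundance
```

Using match() instead of assuming both objects share the same order keeps the lookup correct even if one of them is later re-sorted.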

Finally, this is bci_taxa.R:

library(dplyr)
library(tibble)

# Discard species not identified to species level
bci_taxa <- taxa %>% filter(IDLevel == "species") %>%
  select(GenusSpecies, Genus, Family) %>%
  filter(GenusSpecies %in% rownames(bci_2010)) %>%
  unique %>% as_tibble

# Store in package
usethis::use_data(bci_taxa, overwrite = TRUE)

This will turn the huge data frame containing all of the taxonomic records of the site into a tibble called bci_taxa, which just contains the species name (GenusSpecies), genus (Genus) and family (Family) of each living tree species on Barro Colorado Island (BCI) and stores it in the package.
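The filtering logic can be seen on a small scale in the following sketch, which uses base R in place of dplyr. The taxa and the vector of recorded species are invented; only the column names (Genus, SpeciesName, Family, IDLevel, GenusSpecies) follow the scripts above.

```r
# Toy taxonomy table (invented values; column names follow the scripts above)
toy_taxa <- data.frame(
  Genus       = c("Aa", "Cc", "Ee"),
  SpeciesName = c("bb", "dd", "sp.1"),
  Family      = c("Fam1", "Fam2", "Fam2"),
  IDLevel     = c("species", "species", "genus")
)

# Build the combined name used to link taxa to census records
toy_taxa$GenusSpecies <- paste(toy_taxa$Genus, toy_taxa$SpeciesName)

# Stand-in for rownames(bci_2010): the species actually present in the counts
recorded <- c("Aa bb")

# Keep only taxa identified to species level that appear in the counts
keep <- toy_taxa$IDLevel == "species" & toy_taxa$GenusSpecies %in% recorded
toy_bci_taxa <- unique(toy_taxa[keep, c("GenusSpecies", "Genus", "Family")])
toy_bci_taxa
```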

Finally you need to do three things:

  1. Commit all of the scripts that you have just created
  2. Run the script that you created to generate the data files and commit them
  3. Add in all of the packages you are using as dependencies of this package

If you have a problem in step 2, with an error about a file missing here:

data <- read.delim(file.path("ViewFullTable", "ViewFullTable.txt"))

Then your computer may have unzipped files differently from mine – mine has created a folder in data-raw called ViewFullTable, but yours may have just unzipped the files directly into data-raw. In that case, just change the line to:

data <- read.delim("ViewFullTable.txt")

Step 3 is (hopefully!) easy, and it is something you are going to have to keep up to date as you develop any package – you need to check which libraries you load using library(xxx) in any script or demo, and which libraries you use by qualifying function calls with xxx::function_name(). You then need to run:

usethis::use_package("xxx")

to add the xxx package to those imported by your package in the DESCRIPTION file. See here for further details. Commit these changes to the repo too, and push them to GitHub. Note that you should not add your own (githubusernameBCI) package to the package dependencies of githubusernameBCI, because it is not a dependency of itself! If you have accidentally done this you need to go into the DESCRIPTION file and delete it from the Imports: entry there.
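If you want a quick way to check which packages a script refers to, something like the sketch below may help. The function find_packages() is a hypothetical helper written for this illustration, not part of usethis or devtools, and its comment stripping is deliberately crude, so check its output by eye.

```r
# Hypothetical helper: list the packages referenced in lines of R code
find_packages <- function(lines) {
  # Drop comments so commented-out code is ignored
  # (crude: this also truncates lines with a '#' inside a string)
  code <- sub("#.*$", "", lines)

  # Packages attached with library(xxx) or require(xxx), quoted or not
  m <- regmatches(code, gregexpr("(library|require)\\([\"']?[A-Za-z0-9.]+", code))
  lib <- sub("^(library|require)\\([\"']?", "", unlist(m))

  # Packages used by qualifying calls as xxx::function_name()
  ns <- unlist(regmatches(code, gregexpr("[A-Za-z0-9.]+(?=::)", code, perl = TRUE)))

  sort(unique(c(lib, ns)))
}

# Example on two lines taken from the scripts above
example_lines <- c(
  "library(dplyr)",
  "usethis::use_data(bci_taxa, overwrite = TRUE)"
)
find_packages(example_lines)
```

To apply it to a real script you could pass in the result of readLines(), for example find_packages(readLines("data-raw/bci_2010.R")), and then call usethis::use_package() for each name it reports.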

Finally, if you are going to add a package on GitHub to your dependencies (such as our RPiR package), you need to use usethis::use_dev_package() instead of usethis::use_package(). You need to know where to find it on GitHub, and then use:

usethis::use_dev_package("RPiR", remote = "SBOHVM/RPiR")

where the remote argument is the organisation and repo name. In the project, when you are using your own data package in your new project package, you will need to do this by calling something like usethis::use_dev_package("githubusernameBCI", remote = "SBOHVM/githubusernameBCI").

Tasks: documenting the package

First you need to edit the DESCRIPTION file (more here) so that it contains the right information about you and the package, and you need to create a documentation file for the package in the R folder so that ?githubusernameBCI returns a description. You had this in the package for the third practical series, but you’ll see details on how to do that here.

Create a new file in the R folder – I’m going to call it githubusernameBCI-package.R because the convention is packagename-package.R. Then describe your package (most easily lifted from the DESCRIPTION file):

#' Barro Colorado Island data package 
#'
#' Package to hold the BCI data (or whatever) -- maybe also mention something
#' about these functions now, and put that in the DESCRIPTION too. And then
#' put it in the README.md file. And don't forget to reference the source of
#' the data correctly.
#'
#' @import magrittr 
#'
#' @name githubusernameBCI-package
#' @aliases githubusernameBCI
#' @docType package 
#'
NULL

There are a few things going on here that you should notice for making packages in the future. The first is that you need to say what your package is called. Here I have given it two names: githubusernameBCI-package, set with the @name tag, and an alias of just githubusernameBCI, set with the @aliases tag. Secondly, the NULL at the end of the file is included when there is no object associated with this documentation, which is the case here, since this file contains the package documentation, as defined explicitly in the @docType tag (more info here). Finally, if you are using any packages in any of your functions, you may want to import them into your package here. This is done using @import (see more here).

You can then put the same, or a similar, package description into the README.md file in the package so that people going to GitHub will see what the package that you have created does without having to install it. Note that GitHub README pages use a special type of markdown called GitHub Flavored Markdown; more information can be found here https://docs.github.com/en/free-pro-team@latest/github/writing-on-github with more general documentation here https://guides.github.com/features/mastering-markdown/. Note that with markdown you don’t need to start lines with #'.

Second, you need to document the data that you are providing in this package, by creating file(s) in the R folder again. You’ll find details on how to create the documentation here, or an abbreviated version in our guide here.
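As a starting point, a roxygen block for one of the datasets might look like the sketch below. It assumes you put it in a file such as R/data.R (the file name is your choice), and the title and descriptions are placeholders for you to rewrite. The column names come from bci_taxa.R above, and the source URL is the Smithsonian repository page you downloaded the data from.

```r
#' Taxonomy of tree species in the 2010 BCI census
#'
#' Placeholder description: species, genus and family for each species
#' that appears in bci_2010. Replace this with your own wording.
#'
#' @format A tibble with one row per species and three columns:
#' \describe{
#'   \item{GenusSpecies}{Genus and species name, separated by a space}
#'   \item{Genus}{Genus name}
#'   \item{Family}{Family name}
#' }
#' @source Smithsonian Institution,
#'   <https://repository.si.edu/handle/10088/20925>
"bci_taxa"
```

The quoted name on the last line tells roxygen which dataset the block documents. The other two datasets can be documented in the same way, each with its own block.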

Finally, build the documentation (using devtools::document()), and commit all of the changes to git.

Make sure throughout the documentation that you credit the real source of the data as the Smithsonian.

Running the code

Now you should have a working R data package. Try installing it (using devtools::install()), restarting R and loading it. Check that the objects you have created exist. You should be able to see them in the Environment pane, but it is simpler to just type:

library(githubusernameBCI)
bci_taxa
bci_quadrats
bci_2010

You should (hopefully!) find that all of the objects exist, though you may be surprised to find that they appear to be normal data frames instead of tibbles. This actually isn’t true, and if you run:

library(githubusernameBCI)
library(tibble)
bci_taxa
bci_quadrats

You’ll see that they are tibbles. They were tibbles all along, but a tibble prints like an ordinary data frame unless the tibble package is loaded.
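The reason is that the tibble classes are stored with the object itself, while the code that prints them lives in the tibble package. The sketch below imitates this with a toy object that carries the same class vector as a tibble. It is built by hand with structure() purely for illustration, which is an assumption of this example; real tibbles should be created with the tibble package.

```r
# Toy object carrying the class vector that a tibble has
# (built by hand for illustration only)
toy <- structure(
  data.frame(x = 1:3),
  class = c("tbl_df", "tbl", "data.frame")
)

# The tibble classes travel with the object, whether or not tibble is loaded
class(toy)
inherits(toy, "tbl_df")

# It is still a data frame underneath, so data frame methods apply to it
is.data.frame(toy)
```

You can run class(bci_taxa) in a fresh R session to check the same thing for the packaged data.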

GitHub

Now push all of your changes (commits) to GitHub, and check you can install it by using devtools::install_github("SBOHVM/githubusernameBCI").