In this session you will take some classic biodiversity data and turn it into an R package.
When we think about biodiversity data, we generally think about the hypothesis we are testing, how we are going to collect the data, and how we need to analyse it to investigate that hypothesis. Little thought is usually given to how to store the data. This is a mistake! Large amounts of hard work in the field can be wasted by mistakes in storage. After regular backups so you don’t physically lose your data (if you’re not doing this, then you should stop whatever you’re doing and fix that now!), the next important thing to consider is whether you are definitely, always using the most up-to-date, error-free copy of your dataset when you carry out analyses.
If your data is stored in one or more CSV files or Excel spreadsheets, you need to read that file into R. For that you need to know where it is. If you only have one copy of it on your computer, and you update that when you clean the data or add new records, then that is a single point of failure. If you accidentally make incorrect edits or delete the file, then you are entirely dependent on backups to recover from your mistakes (assuming you followed the instructions above and have a backup!).
If, on the other hand, you create a new file every time you update the data, then all of your R scripts that do analyses on the data now have to refer to the new file, or you will need to copy the new data file to the projects where you are doing the analysis. If you don’t do either of these everywhere you are working on the data, one or more of your analyses may be using an out of date version of the data, and silently give incorrect answers.
Even if your behaviour is perfect, and you never make mistakes with your data, then it is usually necessary to do some processing to the data before carrying out the analyses. This may be as simple as changing some of the column names in your tables so they are easier to work with in R (no spaces, for instance), some cleaning of the data to remove missing data, or some significant processing to turn records that are easy to collect into data that is easy to analyse. Such a script, just like the data, needs to be updated, and the code needs to be present wherever you carry out an analysis. It therefore suffers from exactly the same problems as the data itself.
There is a simple(-ish!) solution though. You don’t need to carry around your own copy of ggplot() or lmer() in every script you write, so why should you have to do it with the data? There is one definitive version of ggplot() and lmer() on your computer, and you just refer to them by loading the ggplot2 and lme4 packages using the library() function. When these packages get updated to fix bugs or extend functionality, you just do it once by installing (normally updating) the R package, and then all of your scripts use the new version. So why not turn your data into an R package and do the same?
You’ll go through the process of creating a package, processing data, inserting the data into the package, and building the package so that you can use it during the project work at the end of the course.
First, create a new package called githubusernameBCI (replacing githubusername with your actual GitHub username) as a git repository connected to GitHub as a repo in the SBOHVM organisation. Remember that a guide to creating packages on GitHub is available on the RPiR help pages if you’re not sure what to do.
Generally, you would want to store the raw data from your work in the git repository so everything is archived and version controlled on GitHub. However, in this work, the dataset is not yours, and it is so big that it easily exceeds GitHub’s file size limits. So, under no circumstances commit the raw records to your git repository. We’ll show you how to ignore them shortly so that you don’t do this by mistake, but bear it in mind now – the moment you commit a file like this into a repo it’s hard to undo the error.
Next, you need to download the Barro Colorado Forest Census Plot Data from the Smithsonian’s repository at https://repository.si.edu/handle/10088/20925. Download files 3 and 4, the full census data (ViewFullTable.zip) and the taxonomic data on all of the species present (ViewTax.txt). Now create a folder to store these raw data in, and create the files you will use shortly to process the data:
usethis::use_data_raw("bci_2010")
usethis::use_data_raw("bci_quadrats")
usethis::use_data_raw("bci_taxa")
Copy the files you have downloaded to the data-raw folder that has just been created inside the R package, and unzip the ViewFullTable.zip file into the same folder. You can now delete the zip file if you like. Now we need to make sure that none of these files make it into your git repo by accident. Run:
usethis::edit_git_ignore("project")
And add the following lines to the end of the .gitignore file that has opened:
ViewTax.txt
ViewFullTable.zip
ViewFullTable
TSMAttributes.txt
ViewFullTable.txt
ViewFullTable.pdf
Now go to the git pane in RStudio, and make sure that none of the files you just downloaded show up in the pane. Commit everything and create a suitable commit message for these initial files. Now create a new R file in the data-raw folder. This will be the file you run to put all of the data into the package. It should contain the following:
# Put the datasets into the package
library(dplyr)
# Move to the data-raw subfolder to access the raw data
devtools::wd(".", "data-raw")
# BCI data
# Load individual BCI records
data <- read.delim(file.path("ViewFullTable", "ViewFullTable.txt"))
# Load BCI taxonomic data, create new species column, and extract species
taxa <- read.delim("ViewTax.txt") %>%
  mutate(GenusSpecies = as.factor(paste(Genus, SpeciesName))) %>%
  filter(IDLevel == "species")
# Store species counts from 2010 census in package
source("bci_2010.R")
# Store quadrat metadata in package
source("bci_quadrats.R")
# Store taxonomic data in package
source("bci_taxa.R")
This will load all of the raw data into R (this will take some time) and run the files you have created to process each of them and store them in the package. Don’t run this script yet though! First, you need to edit the processing files (bci_2010.R, bci_quadrats.R and bci_taxa.R) to do the processing correctly. Actually, we provide them for you here!
This is bci_2010.R:
library(dplyr)
library(reshape2)
# Clean data to remove secondary and dead stems of trees and species not in
# taxonomy
records <- data %>%
  filter(PrimaryStem == "main", Status == "alive", !is.na(QuadratName)) %>%
  filter(GenusSpecies %in% taxa$GenusSpecies) %>%
  select(GenusSpecies, PlotCensusNumber, QuadratName) %>%
  mutate(col = as.integer(floor(QuadratName / 100)),
         row = as.integer(QuadratName - col * 100)) %>%
  filter(row < 25)
# Extract table at a single timepoint
bci_2010 <- records %>%
  filter(PlotCensusNumber == 7) %>%
  select(GenusSpecies, QuadratName) %>%
  acast(GenusSpecies ~ QuadratName, fill = 0,
        value.var = "QuadratName", fun.aggregate = length)
# Name the columns Q.xxyy
colnames(bci_2010) <- sprintf("Q.%04d", as.integer(colnames(bci_2010)))
# Store in package
usethis::use_data(bci_2010, overwrite = TRUE)
It takes the massive Barro Colorado Island (BCI) dataset, extracts only the 2010 (7th) census, and summarises it as a matrix of counts of known species of living trees in 20m x 20m quadrats across the site.
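To make the two key steps concrete, here is a minimal base R sketch on invented toy records. It decodes a quadrat name into its column and row index in the same way as the script above, and builds a species-by-quadrat count matrix. It uses table() as a stand-in for acast(), and the species names and counts are made up for illustration.

```r
# Toy records standing in for the cleaned census data (values are invented)
toy <- data.frame(
  GenusSpecies = c("Aa bb", "Aa bb", "Cc dd", "Cc dd", "Cc dd"),
  QuadratName  = c(1L, 1L, 1L, 124L, 4924L)
)

# As in bci_2010.R: the leading digits give the column index,
# the last two digits give the row index
toy$col <- as.integer(floor(toy$QuadratName / 100))
toy$row <- as.integer(toy$QuadratName - toy$col * 100)

# Count individuals of each species in each quadrat (0 where absent);
# table() is used here instead of reshape2::acast()
counts <- unclass(table(toy$GenusSpecies, toy$QuadratName))

# Name the columns Q.xxyy, zero-padded to four digits
colnames(counts) <- sprintf("Q.%04d", as.integer(colnames(counts)))
counts
```

The zero-padding is what lets the column names be matched against the quadrat metadata later on.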
This is bci_quadrats.R:
library(dplyr)
library(tibble)
# Work out useful information about the quadrats themselves
# Note: they index from 0 and they are 20m x 20m quadrats
bci_quadrats <- records %>%
  select(QuadratName, row, col) %>%
  unique %>%
  mutate(x = row * 20, y = col * 20, row = row + 1, col = col + 1) %>%
  mutate(Quadrat = sprintf("Q.%04d", as.integer(QuadratName))) %>%
  arrange(Quadrat) %>%
  as_tibble
# Store in package
usethis::use_data(bci_quadrats, overwrite = TRUE)
It provides metadata about the quadrats – where they are within the site – as a tibble (a prettier version of a data frame that the RStudio team have created). Specifically, the bci_quadrats data frame translates the column names in bci_2010 into actual rows and columns in a grid, or (x,y) coordinates in metres, in case you want to be able to display where the quadrats are physically.
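As a rough illustration of how the two objects fit together, the sketch below uses invented toy stand-ins with the same layout as bci_2010 and bci_quadrats. It counts the species present in each quadrat, then looks up each quadrat's coordinates by matching the matrix column names against the Quadrat column. All values are made up, and the coordinates follow the convention used in bci_quadrats.R above.

```r
# Toy stand-ins (invented values) laid out like bci_2010 and bci_quadrats
toy_counts <- matrix(
  c(2, 0, 1, 3, 0, 0), nrow = 2,
  dimnames = list(c("Aa bb", "Cc dd"), c("Q.0000", "Q.0001", "Q.0100"))
)
toy_quadrats <- data.frame(
  Quadrat = c("Q.0000", "Q.0001", "Q.0100"),
  row = c(1, 2, 1), col = c(1, 1, 2),
  x = c(0, 20, 0), y = c(0, 0, 20)
)

# Number of species present in each quadrat
richness <- colSums(toy_counts > 0)

# Look up the metadata row for each column of the count matrix
idx <- match(names(richness), toy_quadrats$Quadrat)
summary_table <- data.frame(
  toy_quadrats[idx, c("Quadrat", "x", "y")],
  richness = unname(richness),
  row.names = NULL
)
summary_table
```

With the real package installed, the same pattern should apply with bci_2010 and bci_quadrats in place of the toy objects.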
Finally, this is bci_taxa.R:
library(dplyr)
library(tibble)
# Discard species not identified to species level
bci_taxa <- taxa %>%
  filter(IDLevel == "species") %>%
  select(GenusSpecies, Genus, Family) %>%
  filter(GenusSpecies %in% rownames(bci_2010)) %>%
  unique %>%
  as_tibble
# Store in package
usethis::use_data(bci_taxa, overwrite = TRUE)
This will turn the huge data frame containing all of the taxonomic records of the site into a tibble called bci_taxa, which just contains the species name (GenusSpecies), genus (Genus) and family (Family) of each living tree species on Barro Colorado Island (BCI) and stores it in the package.
Finally you need to do three things:
If you have a problem in step 2, with an error about a file missing here:
data <- read.delim(file.path("ViewFullTable", "ViewFullTable.txt"))
Then your computer may have unzipped files differently from mine – mine has created a folder in data-raw called ViewFullTable, but yours may have just unzipped the files directly into data-raw. In that case, just change the line to:
data <- read.delim("ViewFullTable.txt")
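If you would rather not edit the path by hand, one option is to let the script pick whichever location exists. The sketch below is only a suggestion: first_existing() is a hypothetical helper, not part of any package, and it assumes the working directory is already data-raw.

```r
# Hypothetical helper: return the first candidate path that exists,
# or stop with an informative error if none do
first_existing <- function(candidates) {
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) {
    stop("None of these files exist: ", paste(candidates, collapse = ", "))
  }
  found[1]
}

# Possible use inside the data-raw script (left commented out here because
# it needs the downloaded census file to be present):
# data <- read.delim(first_existing(c(
#   file.path("ViewFullTable", "ViewFullTable.txt"),
#   "ViewFullTable.txt"
# )))
```

This keeps the script working on both kinds of machine, at the cost of hiding which layout is actually in use.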
Step 3 is (hopefully!) easy, and it is something you are going to have to keep up to date as you develop any package – you need to check which libraries you load using library(xxx) in any script or demo, and which libraries you use by qualifying function calls with xxx::function_name(). You then need to run:
usethis::use_package("xxx")
to add the xxx package to those imported by your package in the DESCRIPTION file. Commit these changes to the repo too, and push them to GitHub. Note that you should not add your own (githubusernameBCI) package to the package dependencies of githubusernameBCI, because it is not a dependency of itself! If you have accidentally done this you need to go into the DESCRIPTION file and delete it from the Imports: entry there.
Finally, if you are going to add a package on GitHub to your dependencies (such as our RPiR package), you need to use usethis::use_dev_package() instead of usethis::use_package(). You need to know where to find it on GitHub, and then use:
usethis::use_dev_package("RPiR", remote = "SBOHVM/RPiR")
where the remote argument is the organisation and repo name. In the project, when you are using your own data package in your new project package, you will need to do this by calling something like usethis::use_dev_package("githubusernameBCI", remote = "SBOHVM/githubusernameBCI").
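The check described for step 3 can be partly automated. The sketch below is a rough aid rather than a complete solution: find_used_packages() is a hypothetical helper that scans lines of R code for library(xxx) calls and xxx:: prefixes using regular expressions. It will also pick up matches inside comments and strings, so treat its output as a list of candidates to review before calling usethis::use_package().

```r
# Hypothetical helper: list package names mentioned via library(pkg)
# or pkg::function_name() in a character vector of code lines
find_used_packages <- function(lines) {
  # Names inside library(...) calls, with or without quotes
  lib_hits <- regmatches(
    lines,
    gregexpr("library\\(\\s*[\"']?[A-Za-z][A-Za-z0-9.]*", lines)
  )
  lib_pkgs <- sub("library\\(\\s*[\"']?", "", unlist(lib_hits))

  # Names immediately followed by ::
  ns_hits <- regmatches(
    lines,
    gregexpr("[A-Za-z][A-Za-z0-9.]*(?=::)", lines, perl = TRUE)
  )

  sort(unique(c(lib_pkgs, unlist(ns_hits))))
}

example_lines <- c(
  "library(dplyr)",
  "library(tibble)",
  "usethis::use_data(bci_taxa, overwrite = TRUE)",
  "x <- dplyr::filter(y, z > 0)"
)
find_used_packages(example_lines)

# Possible use on the real scripts (assumes you are in the package root):
# scripts <- list.files("data-raw", pattern = "\\.R$", full.names = TRUE)
# find_used_packages(unlist(lapply(scripts, readLines)))
```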
First you need to edit the DESCRIPTION file so that it contains the right information about you and the package, and you need to create a documentation file for the package in the R folder so that ?githubusernameBCI returns a description. You had this in the package for the third practical series, and the details of how to do it are given below.
Create a new file in the R folder – I’m going to call it githubusernameBCI-package.R because the convention is packagename-package.R. Then describe your package (most easily lifted from the DESCRIPTION file):
#' Barro Colorado Island data package
#'
#' Package to hold the BCI data (or whatever) -- maybe also mention something
#' about these functions now, and put that in the DESCRIPTION too. And then
#' put it in the README.md file. And don't forget to reference the source of
#' the data correctly.
#'
#' @import magrittr
#'
#' @name githubusernameBCI-package
#' @aliases githubusernameBCI
#' @docType package
#'
NULL
There are a few things going on here that you should notice for making packages in the future. The first is that you need to say what your package is called – here I have given it two names: githubusernameBCI-package, set with @name, and an alias of just githubusernameBCI, which you do with the @aliases command. Secondly, the NULL at the end of the file is included when there is no object associated with this documentation, which is the case here, since this file contains the package documentation, as defined explicitly in the @docType tag. Finally, if you are using any packages in any of your functions, you may want to import them into your package here. This is done using @import. You can then put the same, or a similar, package description into the README.md file in the package so that people going to GitHub will see what the package that you have created does without having to install it. Note that GitHub README pages use a special type of markdown called GitHub Flavored Markdown; more information can be found at https://docs.github.com/en/free-pro-team@latest/github/writing-on-github, with more general documentation at https://guides.github.com/features/mastering-markdown/. Note that with markdown you don’t need to start lines with #'.
Second, you need to document the data that you are providing in this package, by creating file(s) in the R folder again. You’ll find details on how to create the documentation here, or an abbreviated version in our guide here.
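As a starting point, a dataset is conventionally documented in roxygen2 by writing a comment block above the name of the dataset given as a string. The sketch below shows what this might look like for bci_2010. The file name (for example R/data.R) and the wording are assumptions for you to adapt, and you would add similar blocks for bci_quadrats and bci_taxa.

```r
#' Tree counts per quadrat from the 2010 BCI census
#'
#' Counts of living trees of known species in each 20m x 20m quadrat on
#' Barro Colorado Island, taken from the 2010 (7th) census.
#'
#' @format A matrix with one row per species (named by GenusSpecies) and
#'   one column per quadrat (named Q.xxyy).
#' @source Barro Colorado Forest Census Plot Data, Smithsonian repository,
#'   <https://repository.si.edu/handle/10088/20925>
"bci_2010"
```

The @source tag is a natural place to credit the Smithsonian as the origin of the data.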
Finally, build the documentation (using devtools::document()), and commit all of the changes to git.
Make sure throughout the documentation that you credit the real source of the data as the Smithsonian.
Now you should have a working R data package. Try installing it (using devtools::install()), restarting R and loading it. Check that the objects you have created exist – you should be able to see them in the Environment pane, but it is simpler to just type their names:
library(githubusernameBCI)
bci_taxa
bci_quadrats
bci_2010
You should (hopefully!) find that all of the objects exist, though you may be surprised to find that they appear to be normal data frames instead of tibbles. This actually isn’t true, and if you run:
library(githubusernameBCI)
library(tibble)
bci_taxa
bci_quadrats
You’ll see that they are tibbles. Tibbles are still data frames underneath, and they simply print like ordinary data frames when the tibble package is not loaded.
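You can confirm this without relying on how an object prints by inspecting its class. The sketch below builds a toy object in base R that carries the same class vector as a tibble (the values are invented and it is only a stand-in for bci_taxa), and shows that it counts both as a tibble and as a data frame.

```r
# Toy stand-in for bci_taxa carrying the class vector that tibbles use
# (built with base R only; the values are invented)
toy <- structure(
  list(GenusSpecies = "Aa bb", Genus = "Aa", Family = "Xaceae"),
  class = c("tbl_df", "tbl", "data.frame"),
  row.names = 1L
)

inherits(toy, "tbl_df")  # TRUE whether or not tibble is loaded
is.data.frame(toy)       # TRUE: a tibble is also a data frame
```

With your package installed, class(bci_taxa) should let you make the same check on the real object.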
Now push all of your changes (commits) to GitHub, and check you can install it by using devtools::install_github("SBOHVM/githubusernameBCI").