Tutorials

The tutorial here represents the absolute basics of using the BioDeepTime database. More elaborate tutorials will be made available on Evolv-ED.

Add toc: true to the metadata of the .md file for a table of contents.

The Data

Download

The BioDeepTime database is made available through the ‘chronosphere’ research data API. The R client to access data can be installed from the CRAN servers with:

install.packages("chronosphere") # used to call the data
install.packages("tidyverse") # used for data manipulation and plotting
install.packages("ggrepel") # used for plotting

The most up-to-date version of the denormalized BioDeepTiem database can be accessed with:

# attach package
library(chronosphere)
Chronosphere - Evolving Earth System Variables
Important: never fetch data as a superuser / with admin. privileges!

Note that the package was split for efficient maintenance and development:
 - Plate tectonic calculations -> package 'rgplates'
 - Arrays of raster and vector spatials -> package 'via'
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::collapse() masks divDyn::collapse()
✖ tidyr::fill()     masks divDyn::fill()
✖ dplyr::filter()   masks stats::filter()
✖ dplyr::lag()      masks stats::lag()
✖ dplyr::slice()    masks divDyn::slice()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)

# download data, verbose=FALSE hides the default chatter 
bdt <- fetch("biodeeptime", verbose=FALSE)

Note that this table is rather large and might take a bit of time to load. The accessible data items can be downloaded with (datasets("biodeeptime")).

Structure

The default representation of the denormalized table is a data.frame, where every row represents one record or biogeographic observation (the presence of a taxon in a sample).

str(bdt)
'data.frame':   7437847 obs. of  39 variables:
 $ db                : chr  "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
 $ seriesID          : chr  "TS_1" "TS_1" "TS_1" "TS_1" ...
 $ seriesOriginalName: chr  "Konus Exposure, Adycha River" "Konus Exposure, Adycha River" "Konus Exposure, Adycha River" "Konus Exposure, Adycha River" ...
 $ seriesOriginalID  : chr  "11" "11" "11" "11" ...
 $ long              : num  136 136 136 136 136 ...
 $ lat               : num  67.8 67.8 67.8 67.8 67.8 ...
 $ depthUnit         : chr  "cmbct" "cmbct" "cmbct" "cmbct" ...
 $ ageModel          : chr  "4" "4" "4" "4" ...
 $ reason            : chr  "Community analysis" "Community analysis" "Community analysis" "Community analysis" ...
 $ sampleID          : chr  "S_1" "S_1" "S_1" "S_1" ...
 $ sampleOriginalID  : chr  "158" "158" "158" "158" ...
 $ sampleOriginalName: chr  NA NA NA NA ...
 $ depth             : num  0 0 0 0 0 0 0 0 0 0 ...
 $ age               : num  1321 1321 1321 1321 1321 ...
 $ ageProc           : chr  "bchron" "bchron" "bchron" "bchron" ...
 $ ageOld            : num  2463 2463 2463 2463 2463 ...
 $ ageYoung          : num  355 355 355 355 355 ...
 $ timeOriginalUnit  : chr  "Radiocarbon years BP" "Radiocarbon years BP" "Radiocarbon years BP" "Radiocarbon years BP" ...
 $ timeOriginal      : num  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
 $ timeOriginalOld   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ timeOriginalYoung : num  NA NA NA NA NA NA NA NA NA NA ...
 $ waterDepth        : num  NA NA NA NA NA NA NA NA NA NA ...
 $ preservation      : chr  NA NA NA NA ...
 $ samplingEffort    : num  NA NA NA NA NA NA NA NA NA NA ...
 $ minimumMesh       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ maximumMesh       : num  NA NA NA NA NA NA NA NA NA NA ...
 $ environment       : chr  "Terrestrial or Freshwater" "Terrestrial or Freshwater" "Terrestrial or Freshwater" "Terrestrial or Freshwater" ...
 $ samplingEffortType: chr  NA NA NA NA ...
 $ totalCount        : num  809 809 809 809 809 809 809 809 809 809 ...
 $ taxonID           : int  4 10 3 11 8 1 13 12 5 9 ...
 $ analyzedTaxon     : chr  "Valeriana" "Onagraceae" "Ericaceae" "Cyperaceae" ...
 $ species           : chr  NA NA NA NA ...
 $ genus             : chr  NA NA NA NA ...
 $ openNomenclature  : chr  NA NA NA NA ...
 $ analyzedRank      : chr  NA NA NA NA ...
 $ group             : chr  "Plants" "Plants" "Plants" "Plants" ...
 $ abundance         : num  2 6 2 7 5 1 22 12 3 5 ...
 $ abundanceUnit     : chr  "count" "count" "count" "count" ...
 $ refID             : chr  "2" "2" "2" "2" ...
 - attr(*, "chronosphere")=List of 13
  ..$ dat         : chr "biodeeptime"
  ..$ var         : chr "denormalized"
  ..$ res         : logi NA
  ..$ ver         : num 1
  ..$ datafile    : chr "biodeeptime.rds"
  ..$ item        : int 702
  ..$ reference   : chr "Jansen A. Smith, Marina C. Rillo, Ádám T. Kocsis, Maria Dornelas, David Fastovich, Huai-Hsuan M. Huang, Lukas J"| __truncated__
  ..$ bibtex      : chr "@misc{jansen_a_smith_2023_8154672,\n author = {Jansen A. Smith and\nMarina C. Rillo and\nÁdám T. Kocsis and\nMa"| __truncated__
  ..$ downloadDate: POSIXct[1:1], format: "2023-07-20 09:51:33"
  ..$ publishDate : chr "2023-07-12"
  ..$ infoURL     : logi NA
  ..$ API         : logi NA
  ..$ additional  : list()

Basic analyses

The number of time series in the database:

length(unique(bdt$seriesID))
[1] 10062

The number of records in the database:

length(bdt$db)
[1] 7437847

The number of unique sampling locations:

nrow(unique(bdt[, c("long", "lat")]))
[1] 8752

The oldest record (relative to 1950) in each database:

bdt %>% group_by(db) %>% summarize(max = max(age))
# A tibble: 9 × 2
  db                               max
  <chr>                          <dbl>
1 BioTIME                         49.4
2 Direct uploads             1146200  
3 Geobiodiversity Database 451050000. 
4 MARBEN                    77200000  
5 Neotoma                   23900000  
6 Neptune SandBox          151075752  
7 Paleobiology Database    150889174. 
8 SedTraps                       -28.4
9 Triton                    65997000  

Finding the mean age (relative to 1950) of a sample from modern databases (BioTIME and SedTraps):

modern <- bdt %>%
  filter(db == "BioTIME" | db == "SedTraps") ## filtering to only modern data

mean(modern$age) 
[1] -44.50505

Finding the mean age (relative to 1950) of a sample from fossil databases:

fossil <- bdt %>%
  filter(db != "BioTIME" & db != "SedTraps") ## filtering to exclude modern data

mean(fossil$age) 
[1] 5087525

For additional analyses and visualization, data summarization and manipulation can be done as follows:

## this manipulation summarizes information for each unique time series
overview <- bdt %>% 
  group_by(seriesID) %>%
  dplyr::summarise(db = unique(db), ## source database
                   #environment = unique(environment), ## broadly defined environment
                   group = unique(group), ## taxonomic group
                   lat = unique(lat), ## latitude
                   long = unique(long), ## longitude
                   #abundType = unique(abundanceUnit), ## abundance type of records
                   samples = length(unique(sampleID)), ## number of samples
                   richness = length(unique(analyzedTaxon)), ## number of unique species
                   meanAge = mean(age, na.rm = TRUE), ## mean age of samples 
                   minAge = min(age, na.rm = TRUE), ## minimum age of a sample
                   maxAge = max(age, na.rm = TRUE), ## maximum age of a sample
                   extent =  maxAge - minAge) ## temporal extent (duration) of the time series
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.

`summarise()` has grouped output by 'seriesID'. You can override using the
`.groups` argument.

Visualizing the data

Create a donut plot showing the contribution of source databases to BioDeepTime:

## donut plot for proportion contribution of each database
databases <- overview %>%
  group_by(db) %>%
  dplyr::summarise(count = n())

# Compute percentages
databases$fraction <- databases$count / sum(databases$count)

# Compute the cumulative percentages (top of each rectangle)
databases$ymax <- cumsum(databases$fraction)

# Compute the bottom of each rectangle
databases$ymin <- c(0, head(databases$ymax, n=-1))

# Compute label position
databases$labelPosition <- (databases$ymax + databases$ymin)/2

# Compute a good label
databases$label <- paste0(databases$db, ", ", databases$count)

# # Make the plot
ggplot(databases, aes(ymax=ymax, ymin=ymin, xmax=4, xmin=3, fill=db)) +
   geom_rect() +
   geom_label_repel(x = 4, aes(y=labelPosition, label=label), size=4) + ## x controls label position (inside/outside donut)
   scale_fill_brewer(palette = "Set3") +
   coord_polar(theta="y") +
   xlim(c(2, 4)) +
   theme_void() +
   theme(legend.position = "none") ## controls the presence of a legend