Recently I have been exploring R packages that deal with biodiversity and ecology data. This is partly to understand the fragmented and distributed data landscape that we are tackling in DiSSCo, where our goal is to build a new world-class Research Infrastructure (RI) for natural science collections. It is also an excuse to learn something new and share it with the wider community.
I discovered that R is immensely popular in biodiversity and ecology research. There is an active community that maintains various well-curated packages with easy-to-follow examples. Of course, all of this can be done with Python or any other language; often it boils down to the task at hand. However, the design of R is closely aligned with statistical methods and data exploration, which makes it handy for large datasets like species occurrence data. For example, a package called BioFTF provides a functional analysis tool that helps treat biodiversity as a multivariate concept (not just species abundance and richness). This is a very domain-specific package that takes advantage of R’s statistical features. Another useful package is spocc, which interacts with many sources of species occurrence data, including GBIF, VertNet, BISON, iNaturalist, the Berkeley ecoengine, and eBird.
In any case, this post is not a comprehensive review. My goal here is to share what I have learned and highlight the strengths of R with a simple example that combines different R packages and different data sources. Of course, I am also thinking about data interoperability, reusability and service integration to make it easier for people to use these datasets, but that is a topic for another time.
In this example, I investigate two R packages: rglobi and rentrez. rglobi provides access to the Global Biotic Interactions database where we can find species interaction data (e.g., predator-prey, pollinator-plant, pathogen-host, parasite-host). And rentrez takes us to NCBI which provides integrated access to nucleotide and protein sequence data.
Today’s object of study is Spirometra erinaceieuropaei, a tapeworm. R provides an interactive command-line interface. Here I load the rglobi library and search for the species by its scientific name. I also indicate which interaction type I am interested in:
> library(rglobi)
> sp <- get_interactions(taxon = "Spirometra erinaceieuropaei", interaction.type = "parasiteOf")
R provides a handy structure (‘str’) function that lets us peek into the data:
> str(sp)
interaction_type : chr [1:178] "parasiteOf" "parasiteOf" "parasiteOf" "parasiteOf" ...
latitude : logi [1:178] NA NA NA NA NA NA ...
longitude : logi [1:178] NA NA NA NA NA NA ...
source_specimen_life_stage : logi [1:178] NA NA NA NA NA NA ...
source_taxon_external_id : chr [1:178] "EOL:4968441" "EOL:4968441" "EOL:4968441" "EOL:4968441" ...
source_taxon_name : chr [1:178] "Spirometra erinaceieuropaei" "Spirometra erinaceieuropaei" ...
source_taxon_path : chr [1:178] "Cellular organisms | Eukaryota | Opisthokonta | Metazoa | Eumetazoa | Bilateria | Platyhelminthes | Cestoda | E"| __truncated__ ...
study_citation : logi [1:178] NA NA NA NA NA NA ...
study_source_citation : logi [1:178] NA NA NA NA NA NA ...
target_specimen_life_stage : logi [1:178] NA NA NA NA NA NA ...
target_taxon_external_id : chr [1:178] "no:match" "EOL:311234" "EOL:1178681" "EOL:330512" ...
target_taxon_name : chr [1:178] "DOG" "Litoria aurea" "Erinaceus amurensis" "Rana cancrivora" ...
target_taxon_path : chr [1:178] "" "" "" "" "" "" "" "" "" ...
From this data structure, we know that there is a column called ‘target_taxon_name’, which we can reach as ‘sp$target_taxon_name’. Now we can list the species that the tapeworm targets:
 "DOG" "Litoria aurea"
 "Erinaceus amurensis" "Rana cancrivora"
 "Rana tigrina" "Cyclops affinis"
 "Cyclops phaleratus" "Mesocyclops aspericornis"
 "Litoria caerulea" "Mustela putorius"
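Once the interactions are in a data frame, ordinary base R is enough to summarise the hosts. The sketch below is only an illustration: `sp_demo` is a small hand-made stand-in with the same column names as the output above (its rows, including the repeated one, are invented for the example), so it runs without a network connection. With the real result you would use `sp` instead.

```r
# Stand-in for the data frame returned by get_interactions().
# Column names follow the str() output above; the rows are illustrative only.
sp_demo <- data.frame(
  interaction_type = "parasiteOf",
  target_taxon_external_id = c("no:match", "EOL:311234", "EOL:1178681", "EOL:311234"),
  target_taxon_name = c("DOG", "Litoria aurea", "Erinaceus amurensis", "Litoria aurea"),
  stringsAsFactors = FALSE
)

# Distinct host names, in order of first appearance
hosts <- unique(sp_demo$target_taxon_name)

# Number of interaction records per host, most frequent first
host_counts <- sort(table(sp_demo$target_taxon_name), decreasing = TRUE)

# Host names that could not be matched to an external taxon identifier
unmatched <- sp_demo$target_taxon_name[sp_demo$target_taxon_external_id == "no:match"]

print(hosts)
print(host_counts)
print(unmatched)
```

Records flagged as "no:match" (such as the vernacular name "DOG") are worth checking by hand before they are used to query other databases.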
I can now continue my investigation with these target species. But I want to head over to NCBI to see what kind of sequence data is available. I load the rentrez library and do another taxonomy search (ideally the taxon id below would already have been in the previous dataset, but rglobi uses EOL ids for species):
> library(rentrez)
> taxon_search <- entrez_search(db="taxonomy", term="Spirometra erinaceieuropaei")
Again I can use str to understand the data structure and grab the NCBI taxon id from the search result:
> taxon_search$ids
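If the same lookup is needed for many species, it can be wrapped in a small function. This is a minimal sketch rather than anything provided by rentrez: `ncbi_taxon_ids` is a hypothetical helper, and its `search_fn` argument is an assumption added so the function can be tried with a stand-in search function when no network is available.

```r
# Hypothetical helper: look up the NCBI taxonomy id(s) for a scientific name.
# 'search_fn' defaults to rentrez::entrez_search; it is a parameter only so
# that the helper can be exercised offline with a stand-in function.
ncbi_taxon_ids <- function(name, search_fn = rentrez::entrez_search) {
  res <- search_fn(db = "taxonomy", term = name)
  ids <- res$ids
  if (length(ids) == 0) {
    warning("No NCBI taxonomy match for: ", name)
    return(character(0))
  }
  if (length(ids) > 1) {
    warning("Several taxonomy ids matched '", name, "'; inspect them before use")
  }
  ids
}

# Online usage (needs rentrez installed and a network connection):
# ncbi_taxon_ids("Spirometra erinaceieuropaei")
```

The warnings are a design choice: a name that matches nothing, or more than one taxon, is usually a data-quality signal that should not pass silently.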
Using this id (“99802”) I can track all sorts of links in NCBI. Here I am displaying the gene ids ([1:5] is a convenient way to limit the output):
> entrez_link(dbfrom='taxonomy', id=99802, db='all')$links$taxonomy_gene[1:5]
 "6446594" "6446593" "6446592" "6446591" "6446590"
I can gather summary information, such as the gene name and chromosome, for each of these gene ids:
> entrez_summary(db="gene", id="6446594")$name
> entrez_summary(db="gene", id="6446594")$chromosome
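Calling entrez_summary once per field and per gene gets repetitive. entrez_summary also accepts a vector of ids, and rentrez ships extract_from_esummary() to pull named fields out of the result. The sketch below shows the same idea in base R on placeholder records, so it runs offline: `summaries_to_df` is a hypothetical helper, and the ids, names and chromosome values in `records` are invented, not real NCBI data.

```r
# Placeholder records shaped like a list of gene summaries keyed by id.
# The ids and field values are invented for illustration.
records <- list(
  "1001" = list(name = "gene_a", chromosome = "1"),
  "1002" = list(name = "gene_b", chromosome = "")
)

# Hypothetical helper: collect the chosen fields from every record
# into one data frame with a row per id.
summaries_to_df <- function(recs, fields = c("name", "chromosome")) {
  cols <- lapply(fields, function(f) {
    vapply(recs, function(r) {
      v <- r[[f]]
      # Missing or empty fields become NA instead of breaking the table
      if (is.null(v) || length(v) == 0) NA_character_ else as.character(v)[1]
    }, character(1), USE.NAMES = FALSE)
  })
  names(cols) <- fields
  data.frame(id = names(recs), cols, stringsAsFactors = FALSE)
}

gene_table <- summaries_to_df(records)
print(gene_table)
```

A table like this is easier to filter, join and export than a series of individual lookups.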
If I want sequence information I can follow the link to the nuccore database:
> entrez_link(db = "all", id = 6446594, dbfrom = "gene")$links$gene_nuccore
 "374349411" "194097494" "193884329"
Now, with either the summary or the link function, I can get more information:
> entrez_link(db = "all", id = 374349411, dbfrom = "nuccore")$links$nuccore_protein
This R interface provides much the same information as the web-based nuccore database search. The difference is that the command-line interface offers functions that are easy to automate and combine. R packages and similar APIs let us work with distributed datasets, which helps with the task of integrating a vast, scattered data landscape.
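To make the automation point concrete, the individual steps above (taxonomy id to gene ids, gene ids to nuccore ids) can be chained in one function. This is a sketch under assumptions, not a tested pipeline: `taxon_to_nuccore` is a hypothetical helper, `link_fn` is a parameter I introduce so the chain can be tried with a stand-in instead of live NCBI calls, and the link names are the ones shown in the output above.

```r
# Hypothetical helper: follow taxonomy -> gene -> nuccore links for one taxon.
# 'link_fn' defaults to rentrez::entrez_link and is a parameter only so the
# chain can be exercised offline with a stand-in function.
taxon_to_nuccore <- function(taxon_id, max_genes = 5,
                             link_fn = rentrez::entrez_link) {
  # Step 1: gene ids linked to the taxon (same call as in the post)
  gene_ids <- link_fn(dbfrom = "taxonomy", id = taxon_id, db = "all")$links$taxonomy_gene
  gene_ids <- head(gene_ids, max_genes)

  # Step 2: nuccore ids linked to each gene
  out <- lapply(gene_ids, function(g) {
    link_fn(dbfrom = "gene", id = g, db = "all")$links$gene_nuccore
  })
  names(out) <- gene_ids
  out
}

# Online usage (needs rentrez installed and a network connection):
# taxon_to_nuccore(99802, max_genes = 5)
```

Keeping `max_genes` small limits the number of requests sent to NCBI while exploring; for larger runs it is worth reading NCBI's usage guidelines first.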
We will be talking about these and other data-intensive research issues at the Biodiversity Next conference. If you haven’t registered yet, please do so.