Specimen data, digital object, and hashed ledger

In this post, I want to continue the thought process on event-driven architecture (see my previous post on event-driven digital object architecture). This time, I add Blockchain to the mix.

https://dilbert.com/strip/2018-06-06

Below is an example using easychain (a set of Python scripts) that does a brilliant job of explaining the fundamental concepts of a blockchain. I am not implementing any particular Blockchain technology, nor am I exploring every aspect (such as distributed consensus and mining); I use only the idea of a hashed ledger to show the possibility of a decentralized, tamper-proof, transparent system.

In this example, I make use of the natural history museum data landscape, where the physical specimens are the source of primary data for various research questions. A similar infrastructure could be implemented without Blockchain, but one question is worth exploring: what value would a Blockchain implementation provide?

The examples here are also in the context of Open Science Research, FAIR, and research data management. For background reading, please see this recent article (19 November 2019) that “identified the requirements for an open scientific ecosystem and compared them with the properties of BT [Blockchain Technology] to verify whether they fit together.” Also, read this thought-provoking piece by Balázs Bodó and Alexandra Giannopoulou (University of Amsterdam — Institute for Information Law). The authors argue that decentralization should be seen as a techno-social system with various social, economic, political, and legal forces behind it.

For concepts related to research data infrastructure, see LifeBlock, proposed by LifeWatch ERIC and targeting biodiversity and ecosystem communities. Also, check out the Electron Tomography Database, which uses the Open Index Protocol, a standard for publishing metadata in the distributed ledger of the FLO blockchain.

The problem

How can data coming from heterogeneous sources be not just linked but also verified and re-used? Researchers usually annotate, aggregate, and compress data either for repository deposition or final publication. How can we ensure data integrity throughout the research data life cycle? How can we create an audit trail of data modifications from collection through annotation and analysis to publication? Of course, this can be done without Blockchain. We need to think carefully about how the decentralization and immutability aspects help us.

Below is a visual schema showing some of the data actors and processes related to natural history museum specimens. Can we imagine a Blockchain solution that works with both centralized and decentralized data sources and provides an open, transparent data ecosystem? This post is a tiny step towards that answer, and it possibly raises more questions.

In this view, each actor works with a particular process and data repository/source. We create blocks based on the workflow and link them together. In the simplest form, the blocks will include just the hashes instead of the data.
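To illustrate the idea of storing hashes rather than data, here is a minimal sketch using only Python's standard library. The record, its identifier, and the function name `record_digest` are made up for illustration; they are not taken from easychain or from the full example on GitHub.

```python
import hashlib
import json


def record_digest(record: dict) -> str:
    """Return the SHA-256 hex digest of a record serialized as canonical JSON."""
    # sort_keys and fixed separators make the serialization deterministic,
    # so the same record always yields the same digest.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical specimen record; the identifier and field names are invented.
specimen = {
    "id": "example:specimen/0001",
    "type": "physical specimen",
    "event": "deposited at museum",
}

# The ledger entry carries only a reference and the digest, not the record.
ledger_entry = {"ref": specimen["id"], "sha256": record_digest(specimen)}
print(ledger_entry)
```

Anyone holding the original record can recompute the digest and compare it with the ledger entry, while the data itself can stay in whichever repository already manages it.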

As in my previous examples, my entities and actors have digital representations and persistent identifiers. In this instance, I am using a physical specimen that was deposited at a museum and a tissue sample extracted for DNA sequencing.

See this GitHub page for the code, explanation, and the full example. Each transaction/event is a “Message” (borrowing the term from “easychain”). I then form various “blocks” with these messages and chain them to create a “Blockchain.”

Create my first block

Add an event/message:

Add another event:

Now, our block looks like this. As you can see, there is an element called “prev_hash” that links these records together.
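The steps above can be sketched roughly as follows. This is a simplified stand-in written for this post, not easychain's actual API: the `Message` and `Block` classes, their field names, and the event texts are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


def sha256_hex(text: str) -> str:
    """Hex-encoded SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class Message:
    """One event in the ledger (hypothetical class, not easychain's)."""
    data: str
    prev_hash: Optional[str] = None  # hash of the previous message, if any
    payload_hash: str = ""
    msg_hash: str = ""

    def seal(self) -> None:
        # The message hash covers both the payload and the backward link.
        self.payload_hash = sha256_hex(self.data)
        self.msg_hash = sha256_hex(str(self.prev_hash) + self.payload_hash)


@dataclass
class Block:
    """An ordered list of linked messages."""
    messages: List[Message] = field(default_factory=list)

    def add_message(self, message: Message) -> None:
        if self.messages:
            # Link the new message to its predecessor.
            message.prev_hash = self.messages[-1].msg_hash
        message.seal()
        self.messages.append(message)


# Hypothetical events for the specimen workflow.
block1 = Block()
block1.add_message(Message("specimen deposited at museum"))
block1.add_message(Message("tissue sample extracted for DNA sequencing"))

for m in block1.messages:
    print(m.data, "| prev_hash:", m.prev_hash, "| hash:", m.msg_hash)
```

The first message has no predecessor, so its `prev_hash` stays `None`; the second message's `prev_hash` equals the first message's hash.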

Seal and validate

I seal the block, which computes each message’s hash by combining the payload hash and prev_hash. Now, any tampering with a message can be detected by recalculating the hashes.
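Here is a minimal sketch of what sealing and validating could look like, using plain dictionaries. The function names are hypothetical, and the choice to include the timestamp and the previous block hash in the block hash is my assumption; it is not necessarily how easychain does it. The sketch also assumes a block holds at least one message.

```python
import hashlib
import time


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def message_hash(prev_hash, payload: str) -> str:
    """Combine the link to the previous message with the payload hash."""
    return sha256_hex(str(prev_hash) + sha256_hex(payload))


def build_messages(payloads):
    """Link payloads into a list of message dicts."""
    messages, prev = [], None
    for payload in payloads:
        h = message_hash(prev, payload)
        messages.append({"data": payload, "prev_hash": prev, "hash": h})
        prev = h
    return messages


def seal_block(messages, prev_block_hash=None):
    """Freeze a block: record a timestamp and hash the last message hash."""
    timestamp = time.time()
    block_hash = sha256_hex(
        str(prev_block_hash) + str(timestamp) + messages[-1]["hash"]
    )
    return {"messages": messages, "prev_hash": prev_block_hash,
            "timestamp": timestamp, "hash": block_hash}


def validate_block(block) -> bool:
    """Recompute every hash and compare it with the stored value."""
    prev = None
    for m in block["messages"]:
        if m["prev_hash"] != prev or m["hash"] != message_hash(prev, m["data"]):
            return False
        prev = m["hash"]
    expected = sha256_hex(
        str(block["prev_hash"]) + str(block["timestamp"]) + prev
    )
    return block["hash"] == expected


block = seal_block(build_messages(["specimen deposited",
                                   "tissue sample extracted"]))
print(validate_block(block))  # intact block

block["messages"][0]["data"] = "specimen deposited elsewhere"
print(validate_block(block))  # edited payload no longer matches its hash
```

Validation needs no secret and no trusted party: it only recomputes the hashes from the stored data and checks that they still agree.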

Now, create another block that deals with the DNA sequence workflow.

Now my block looks like this (with three items in it but no timestamp yet):

I seal and validate the block like before (it now has a final hash and a timestamp).

Chain the blocks

Let’s chain them together.

Using the Python data structure, we can see the data stored in the blocks:

Here, each block hash only needs to incorporate the last message hash, which in turn incorporates all prior hashes (similar in spirit to a Merkle tree, although the structure here is a linear hash chain).
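A short sketch shows why the last message hash is enough. The function `last_message_hash` and the event strings are made up for illustration; the point is only that each hash is computed over the previous one, so the final value depends on every earlier item.

```python
import hashlib


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def last_message_hash(payloads):
    """Fold payloads into a single running hash, oldest first."""
    running = None
    for payload in payloads:
        running = sha256_hex(str(running) + sha256_hex(payload))
    return running


events = ["specimen deposited", "tissue sample extracted", "DNA sequenced"]
original = last_message_hash(events)

# Changing the FIRST event changes the LAST hash, because every later
# hash is computed over the one before it.
altered = last_message_hash(["specimen relabelled"] + events[1:])
print(original != altered)
```

Unlike a Merkle tree, this linear chain offers no compact proof for a single item; checking any one message means recomputing the hashes from that message to the end of the block.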

Tamper data

Now, I want to modify the data (I use pickle to read and write the Python objects; details are on the GitHub page).

I create a “tampered” pickle object and modify the sequence data. Then I create a new chain based on this modified data.

If I try to validate chain3, it fails:

This happens because block1 and block2 are linked via hashes, so block2 depends on block1. Any tampering produces a different payload hash and thus breaks the link. In short, chain3 is not a clean data source; using the validate function, I can check the status of chain1 and chain2 to ensure validity. Of course, a real-life Blockchain solution will be much more complicated, but you get the picture.
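The tampering experiment can be reproduced in miniature with the sketch below. It is a simplified stand-in, not the code from the GitHub page: the helper names, the dictionary layout, and the sequence string are invented, timestamps are left out for brevity, and each block is assumed to hold at least one message.

```python
import hashlib
import pickle


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_chain(block_payloads):
    """Build linked blocks; each block hash covers the previous block hash."""
    chain, prev_block_hash = [], None
    for payloads in block_payloads:
        msg_prev, messages = None, []
        for data in payloads:
            msg_hash = sha256_hex(str(msg_prev) + sha256_hex(data))
            messages.append({"data": data, "prev_hash": msg_prev,
                             "hash": msg_hash})
            msg_prev = msg_hash
        block_hash = sha256_hex(str(prev_block_hash) + msg_prev)
        chain.append({"messages": messages, "prev_hash": prev_block_hash,
                      "hash": block_hash})
        prev_block_hash = block_hash
    return chain


def validate_chain(chain) -> bool:
    """Recompute all message and block hashes and check every link."""
    prev_block_hash = None
    for block in chain:
        msg_prev = None
        for m in block["messages"]:
            expected = sha256_hex(str(msg_prev) + sha256_hex(m["data"]))
            if m["prev_hash"] != msg_prev or m["hash"] != expected:
                return False
            msg_prev = m["hash"]
        expected_block = sha256_hex(str(prev_block_hash) + msg_prev)
        if block["prev_hash"] != prev_block_hash or block["hash"] != expected_block:
            return False
        prev_block_hash = block["hash"]
    return True


chain = make_chain([
    ["specimen deposited", "tissue sample extracted"],
    ["DNA extracted", "sequence: ACGTACGT", "sequence deposited in repository"],
])

# Round-trip through pickle, then edit the sequence data in the copy.
tampered = pickle.loads(pickle.dumps(chain))
tampered[1]["messages"][1]["data"] = "sequence: ACGTTTTT"

print(validate_chain(chain))     # untouched chain
print(validate_chain(tampered))  # edited copy
```

Note that this only detects a naive edit. Someone who can rewrite the data can also recompute every hash after it, which is one reason real systems add distributed consensus on top of the hashed ledger.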

In this example, I showed a simple demonstration of how Blockchain can link together different data sources (the items in the block could be anything we want: files, images, datasets, DOIs) and create a trail using the hashed ledger concept.

In a later post, I will delve more into the implications. I am thinking along these lines: How can centralized and decentralized systems work together? What is the role of DOIP (Digital Object Interface Protocol) here? Is immutability a desirable feature? More later.