Specimen data, digital object, and hashed ledger

In this post, I want to continue the thought process regarding event-driven architecture (see my previous post on event-driven digital object architecture). This time add Blockchain to the mix.

https://dilbert.com/strip/2018-06-06

Below is an example using easychain (a set of python scripts) that does a brilliant job in explaining the fundamental concepts of a blockchain. I am not implementing any particular Blockchain technology, also not exploring all the aspects (such as distributed consensus and mining), just the idea of hashed ledger to show the possibility of a decentralized, tamper-proof, transparent system.

In this example, I make use of the natural history museum data landscape where the physical specimens are the source of primary data for various research questions. A similar infrastructure could be implemented without Blockchain, but one question that is worth exploring: what value would a Blockchain implementation provide?

The examples here are also in the context of Open Science Research, FAIR, and research data management. For background reading, please see this recent article (19 November 2019) that “identified the requirements for an open scientific ecosystem and compared them with the properties of BT [Blockchain Technology] to verify whether they fit together.” Also, read this thought-provoking piece by Balázs Bodó and Alexandra Giannopoulou (University of Amsterdam — Institute for Information Law). The authors argue that decentralization should be seen as a techno-social system with various social, economic, political, and legal forces behind it.

For research data infrastructure related concepts, see LifeBlock, proposed by LifeWatch, ERIC, targeting Biodiversity and ecosystem communities. Also, check out the Electron Tomography Database that uses Open Index Protocol — a standard to publish metadata in the distributed ledger of the FLO blockchain.

The problem

Below is a visual schema to show some of the data actors, processes related to natural history museum specimens. Can we imagine a Blockchain solution that works with both centralized and decentralized data sources and provides an open, transparent data ecosystem? A tiny step towards that answer and possibly raising more questions.

In this view, each actor working with a particular process and data repository/source. We create blocks based on the workflow and link them together. In the simplest form, the blocks will just include the hashes instead of the data.

Like my previous examples, my entities and actors have digital representations and persistent identifiers. In this instance, I am using a physical specimen that was deposited at a museum and a tissue sample extracted for DNA sequencing.

- test/cae42177c14a8fcdeb14: Physical specimen in a museum (tied to a scientific name) 
- test/db44501292f3e4c35f8e: A digital specimen that creates a digital representation of a physical item and provides the various other links
- test/b0ac8fc9596372bc3c97: is the researcher/scientists
- test/8d485a744ebad89f6349: is the digitization manager
- EF215838 is the database record id from the DNA sequence database

See this Github page for the code, explanation, and the full example. Each transaction/event is a “Message” (borrowing the term from “easychain” ). Then I form various “blocks” with these messages and chain them to create a “Blockchain.”

Create my first block

>>> from blockchain import Message, Block, Blockchain
>>> import pickle
>>> B1 = Block()

Add an event/message:

>>> B1.add_message(Message("test/b0ac8fc9596372bc3c97 deposited test/cae42177c14a8fcdeb14 in test/a49a51ac540d68915229"))

Add another event:

B1.add_message(Message("test/8d485a744ebad89f6349 digitized test/cae42177c14a8fcdeb14"))

Now, our block looks like this. As you can see, there is an element called “prev_hash” that links these records together.

>>> B1.__dict__
{'timestamp': None, 'messages': [Message<hash: f35736c42b0cbde8ca320148802755029b5958d9148f1022d431c0b6abd49f13, prev_hash: None, sender: None, receiver: None, data: test/b0ac8fc9596372bc3c97>, Message<hash: f44bd62f8afdb868679278784dc5a509d1d64c19dab5963ae1eab7c5aadca47a, prev_hash: f35736c42b0cbde8ca320148802755029b5958d9148f1022d431c0b6abd49f13, sender: None, receiver: None, data: test/8d485a744ebad89f6349>], 'hash': None, 'prev_hash': None}

Seal and validate

>>> B1.seal()
>>> B1.validate()
>>> B1
Block<hash: 4cf253428a5d8f9ef20c64015fb88bfbb3fcd60b11ff045ed9c7d906f714d537, prev_hash: None, messages: 2, time: 1575925690.53>

Now, create another block that deals with the DNA sequence workflow.

>>> B2 = Block()
>>> B2
Block<hash: None, prev_hash: None, messages: 0, time: None>
>>> B2.add_message(Message("test/8d485a744ebad89f6349 digitized test/cae42177c14a8fcdeb14"))
>>> B2.add_message(Message("DNA accession EF215838 is sourced from test/cae42177c14a8fcdeb14"))
>>> B2.add_message(Message("EF215838 sequence is CAGATGGGCCGAAAGGCCCA"))

Now my block looks like this (with three items in it but no timestamp)

>>> B2
Block<hash: None, prev_hash: None, messages: 3, time: None>

I seal and validate the block like before (it now has a final hash and a timestamp).

>>> B2.seal()
>>> B2.validate()
>>> B2
Block<hash: 67ec631826f503dbfe702c9250ec8f73669729a073876d361e684b070192513e, prev_hash: None, messages: 3, time: 1575926036.86>

Chain the blocks

>>> chain = Blockchain()
>>> chain.add_block(B1)
>>> chain.add_block(B2)

Using the python data structure, we can see the data stored in the blocks:

>>> chain.blocks[1].messages[1].data
'DNA accession EF215838 is sourced from test/cae42177c14a8fcdeb14'
>>> chain.blocks[1].messages[2].data
'EF215838 sequence is CAGATGGGCCGAAAGGCCCA'

Here each block hash only needs to incorporate the last message hash, which includes all prior hashes (similar to Merkle tree).

Tamper data

I create a “tampered” pickle object, modify the sequence data. Then create a new chain based on this modified data.

>>> pickle.dump(chain2, open('chain.p', 'wb'))
>>> tampered = pickle.load(open('chain.p', 'rb'))
>>> tampered.blocks[1].messages[2].data = "EF215838 sequence is ACGATGGGCCGAAAGGCCCA"
>>> pickle.dump(tampered, open('chain.p', 'wb'))
>>> chain3 = pickle.load(open('chain.p', 'rb'))

If I try to validate chain3, it fails:

>>> chain3.validate() 
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "blockchain.py", line 116, in validate
raise InvalidBlockchain("Invalid blockchain at block {} caused by: {}".format(i, str(ex)))
blockchain.InvalidBlockchain: Invalid blockchain at block 1 caused by: Invalid block: Message #2 failed validation: Invalid payload hash in message: Message<hash: e28c7fcc52fca2a375c32246adc5b38b4ae5d54b32d3861c096c2d9a2b80383c, prev_hash: fd0cf09adf871458f9eb38bad7023306e539c7e9daecd48605282529f8268ee3, sender: None, receiver: None, data: EF215838 sequence is ACGA>HERE 4c65e09280b1e68199be03569c2df75d95d3f1534fb73b9e496d8e9f422b379f. In block: Block<hash: 60d76c7d06cc770f37dbf9cbaef6433bb4b5ef8b6e7d933810fb570fdb930378, prev_hash: 344182a9b9565a6d0b6633232619cd9834a23e1e19252ace87b53aebef6ecd5a, messages: 3, time: 1575926137.01>

Because blocks1 and block2 are linked via hash so block2 depends on block1. Any tampering creates a new hash payload and thus breaking the link. In short, chain3 is not a clean data source; using the validate function, I can check the status of chain1 and chain2 to ensure validity. Of course, a real-life Blockchain solution will be much more complicated, but you get the picture.

In this example, I showed a simple demonstration of how Blockchain can link together different data sources (the items in the block could be anything we want: files, images, datasets, DOIs) and create a trail using the hashed ledger concept.

In a later post, I will delve more into the implications. I am thinking along these lines how can centralized and decentralized work together? what is the role of DOIP (Digital Object Interface Prootoc) here? Is immutability a desirable feature? More later.

Data Architect@Distributed System of Scientific Collections (https://dissco.eu). PhD in Sociology. Bachelor's in Math and CS from the University of Illinois.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store