Specimen data, digital object, and hashed ledger
In this post, I want to continue the thought process regarding event-driven architecture (see my previous post on event-driven digital object architecture). This time, I add Blockchain to the mix.
Below is an example using easychain (a set of Python scripts) that does a brilliant job of explaining the fundamental concepts of a blockchain. I am not implementing any particular Blockchain technology, nor exploring all of its aspects (such as distributed consensus and mining), just the idea of a hashed ledger, to show the possibility of a decentralized, tamper-proof, transparent system.
In this example, I make use of the natural history museum data landscape, where physical specimens are the source of primary data for various research questions. A similar infrastructure could be implemented without Blockchain, but one question is worth exploring: what value would a Blockchain implementation provide?
The examples here are also in the context of Open Science Research, FAIR, and research data management. For background reading, please see this recent article (19 November 2019) that “identified the requirements for an open scientific ecosystem and compared them with the properties of BT [Blockchain Technology] to verify whether they fit together.” Also, read this thought-provoking piece by Balázs Bodó and Alexandra Giannopoulou (University of Amsterdam, Institute for Information Law). The authors argue that decentralization should be seen as a techno-social system with various social, economic, political, and legal forces behind it.
For research data infrastructure concepts, see LifeBlock, proposed by LifeWatch ERIC and targeting the biodiversity and ecosystem research communities. Also, check out the Electron Tomography Database, which uses the Open Index Protocol, a standard for publishing metadata in the distributed ledger of the FLO blockchain.
The problem
How can data coming from heterogeneous sources be not just linked but also verified and reused? Researchers usually annotate, aggregate, and compress data, either for repository deposition or for final publication. How can we ensure data integrity throughout the research data life cycle? How can we create an audit trail of data modifications from the time of collection, through annotation and analysis, to publication? Of course, this can be done without Blockchain. We need to think carefully about how the decentralization and immutability aspects help us.
Below is a visual schema showing some of the data actors and processes related to natural history museum specimens. Can we imagine a Blockchain solution that works with both centralized and decentralized data sources and provides an open, transparent data ecosystem? This post is a tiny step towards that answer, and it possibly raises more questions.
As in my previous examples, the entities and actors have digital representations and persistent identifiers. In this instance, I am using a physical specimen that was deposited at a museum and a tissue sample extracted from it for DNA sequencing.
- test/cae42177c14a8fcdeb14: the physical specimen in a museum (tied to a scientific name)
- test/db44501292f3e4c35f8e: a digital specimen that provides a digital representation of the physical item and links to the various other objects
- test/b0ac8fc9596372bc3c97: the researcher/scientist
- test/8d485a744ebad89f6349: the digitization manager
- EF215838: the record ID from the DNA sequence database
See this GitHub page for the code, explanation, and the full example. Each transaction/event is a “Message” (borrowing the term from easychain). I then form various “blocks” with these messages and chain them to create a “Blockchain.”
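To make the mechanics concrete, here is a minimal sketch of the underlying idea in plain Python. This is my own illustration with hypothetical names, not easychain's actual code: each message stores a hash of its payload combined with the hash of the previous message.

import hashlib

def sha256(text):
    # SHA-256 hex digest of a string.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

class SimpleMessage:
    # Hypothetical stand-in for easychain's Message.
    def __init__(self, data):
        self.data = data
        self.prev_hash = None  # hash of the previous message, if any
        self.hash = None

    def link(self, prev_message):
        # Chain this message to the previous (already sealed) one.
        self.prev_hash = prev_message.hash

    def seal(self):
        # The message hash covers the payload and prev_hash, so a change
        # to any earlier message ripples through every later hash.
        self.hash = sha256(sha256(self.data) + str(self.prev_hash))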
Create my first block
>>> from blockchain import Message, Block, Blockchain
>>> import pickle
>>> B1 = Block()
Add an event/message:
>>> B1.add_message(Message("test/b0ac8fc9596372bc3c97 deposited test/cae42177c14a8fcdeb14 in test/a49a51ac540d68915229"))
Add another event:
>>> B1.add_message(Message("test/8d485a744ebad89f6349 digitized test/cae42177c14a8fcdeb14"))
Now our block looks like this. As you can see, each message carries an element called “prev_hash” that links it to the previous message.
>>> B1.__dict__
{'timestamp': None, 'messages': [Message<hash: f35736c42b0cbde8ca320148802755029b5958d9148f1022d431c0b6abd49f13, prev_hash: None, sender: None, receiver: None, data: test/b0ac8fc9596372bc3c97>, Message<hash: f44bd62f8afdb868679278784dc5a509d1d64c19dab5963ae1eab7c5aadca47a, prev_hash: f35736c42b0cbde8ca320148802755029b5958d9148f1022d431c0b6abd49f13, sender: None, receiver: None, data: test/8d485a744ebad89f6349>], 'hash': None, 'prev_hash': None}
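We can confirm the linkage directly from the Python objects: the second message's prev_hash is exactly the first message's hash.

>>> B1.messages[1].prev_hash == B1.messages[0].hash
True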
Seal and validate
I seal the block, which timestamps it and computes its hash; each message’s hash already combines its payload hash and prev_hash. Now, any tampering with a message can be detected by recalculating the hashes.
>>> B1.seal()
>>> B1.validate()
>>> B1
Block<hash: 4cf253428a5d8f9ef20c64015fb88bfbb3fcd60b11ff045ed9c7d906f714d537, prev_hash: None, messages: 2, time: 1575925690.53>
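Conceptually, sealing a block might look something like the sketch below (again a hypothetical illustration, not easychain's code): the block hash only needs to fold in the last message's hash, since that already covers all the earlier messages.

import hashlib
import time

def seal_block(block):
    # Hypothetical sketch: timestamp the block and derive its hash from
    # the last message hash, the timestamp, and the block's prev_hash.
    block.timestamp = time.time()
    payload = "{}:{}:{}".format(block.messages[-1].hash, block.timestamp, block.prev_hash)
    block.hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()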
Now, create another block that deals with the DNA sequence workflow.
>>> B2 = Block()
>>> B2
Block<hash: None, prev_hash: None, messages: 0, time: None>
>>> B2.add_message(Message("test/8d485a744ebad89f6349 digitized test/cae42177c14a8fcdeb14"))
>>> B2.add_message(Message("DNA accession EF215838 is sourced from test/cae42177c14a8fcdeb14"))
>>> B2.add_message(Message("EF215838 sequence is CAGATGGGCCGAAAGGCCCA"))
Now my block looks like this (three messages in it, but no timestamp or hash yet):
>>> B2
Block<hash: None, prev_hash: None, messages: 3, time: None>
I seal and validate the block like before (it now has a final hash and a timestamp).
>>> B2.seal()
>>> B2.validate()
>>> B2
Block<hash: 67ec631826f503dbfe702c9250ec8f73669729a073876d361e684b070192513e, prev_hash: None, messages: 3, time: 1575926036.86>
Chain the blocks
Let’s chain them together.
>>> chain = Blockchain()
>>> chain.add_block(B1)
>>> chain.add_block(B2)
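As an aside, the blocks themselves should now be linked the same way the messages are. B2's prev_hash was None right after sealing, so I expect add_block to have pointed it at B1's hash (the block shown in the tampering example below does carry a populated prev_hash):

>>> chain.blocks[1].prev_hash == chain.blocks[0].hash
True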
Using the Python data structures, we can see the data stored in the blocks:
>>> chain.blocks[1].messages[1].data
'DNA accession EF215838 is sourced from test/cae42177c14a8fcdeb14'
>>> chain.blocks[1].messages[2].data
'EF215838 sequence is CAGATGGGCCGAAAGGCCCA'
Here, each block hash only needs to incorporate the last message hash, which in turn includes all prior hashes (similar to a Merkle tree).
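To see why checking the final hash is enough, here is a tiny self-contained illustration (plain hashlib, independent of easychain):

import hashlib

def h(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chain_hash(messages):
    # Fold each message into a running digest: the final digest
    # depends on every message, in order.
    digest = ""
    for message in messages:
        digest = h(digest + message)
    return digest

events = ["specimen deposited", "specimen digitized", "sequence published"]
tampered = ["specimen deposited", "specimen redigitized", "sequence published"]
print(chain_hash(events) == chain_hash(tampered))  # False: any change alters the final hash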
Tamper with the data
Now, I want to modify the data (I use pickle to read and write the Python objects; the details are on the GitHub page).
I create a “tampered” pickle object and modify the sequence data. Then I create a new chain based on the modified data.
>>> pickle.dump(chain, open('chain.p', 'wb'))
>>> tampered = pickle.load(open('chain.p', 'rb'))
>>> tampered.blocks[1].messages[2].data = "EF215838 sequence is ACGATGGGCCGAAAGGCCCA"
>>> pickle.dump(tampered, open('chain.p', 'wb'))
>>> chain3 = pickle.load(open('chain.p', 'rb'))
If I try to validate chain3, it fails:
>>> chain3.validate()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "blockchain.py", line 116, in validate
raise InvalidBlockchain("Invalid blockchain at block {} caused by: {}".format(i, str(ex)))
blockchain.InvalidBlockchain: Invalid blockchain at block 1 caused by: Invalid block: Message #2 failed validation: Invalid payload hash in message: Message<hash: e28c7fcc52fca2a375c32246adc5b38b4ae5d54b32d3861c096c2d9a2b80383c, prev_hash: fd0cf09adf871458f9eb38bad7023306e539c7e9daecd48605282529f8268ee3, sender: None, receiver: None, data: EF215838 sequence is ACGA>HERE 4c65e09280b1e68199be03569c2df75d95d3f1534fb73b9e496d8e9f422b379f. In block: Block<hash: 60d76c7d06cc770f37dbf9cbaef6433bb4b5ef8b6e7d933810fb570fdb930378, prev_hash: 344182a9b9565a6d0b6633232619cd9834a23e1e19252ace87b53aebef6ecd5a, messages: 3, time: 1575926137.01>
Because block 1 and block 2 are linked via their hashes, block 2 depends on block 1. Any tampering produces a new payload hash and thus breaks the link. In short, chain3 is not a clean data source; using the validate function, I can check that the original chain remains valid. Of course, a real-life Blockchain solution would be much more complicated, but you get the picture.
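Conceptually, validation is just a walk over the chain that recomputes every link; something along these lines would do it (my own sketch, matching the hypothetical SimpleMessage above, not easychain's implementation):

import hashlib

def validate_messages(messages):
    # messages: objects with .data, .prev_hash, and .hash attributes,
    # hashed as in the SimpleMessage sketch earlier in this post.
    sha256 = lambda text: hashlib.sha256(text.encode("utf-8")).hexdigest()
    prev_hash = None
    for i, message in enumerate(messages):
        if message.prev_hash != prev_hash:
            raise ValueError("Broken link at message #{}".format(i))
        if message.hash != sha256(sha256(message.data) + str(message.prev_hash)):
            raise ValueError("Invalid payload hash in message #{}".format(i))
        prev_hash = message.hash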
In this example, I gave a simple demonstration of how Blockchain can link together different data sources (the items in a block could be anything we want: files, images, datasets, DOIs) and create an audit trail using the hashed ledger concept.
In a later post, I will delve more into the implications. I am thinking along these lines: How can centralized and decentralized systems work together? What is the role of DOIP (the Digital Object Interface Protocol) here? Is immutability a desirable feature? More later.