Specimen data, digital object, and hashed ledger


The problem

How can data from heterogeneous sources be not just linked but also verified and reused? Researchers usually annotate, aggregate, and compress data either for repository deposition or for final publication. How can we ensure data integrity throughout the research data life cycle? How can we create an audit trail of data modifications from collection through annotation and analysis to publication? Of course, this can be done without a blockchain, so we need to think carefully about what decentralization and immutability actually add.

In this view, each actor works with a particular process and data repository or source. We create blocks based on the workflow and link them together. In the simplest form, the blocks include only the hashes rather than the data itself. The identifiers used in the example are:
- test/cae42177c14a8fcdeb14: The physical specimen in a museum (tied to a scientific name)
- test/db44501292f3e4c35f8e: The digital specimen, a digital representation of the physical item that also provides the various other links
- test/b0ac8fc9596372bc3c97: The researcher/scientist
- test/8d485a744ebad89f6349: The digitization manager
- EF215838: The record ID in the DNA sequence database
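
As a minimal sketch of the hash-only idea: the ledger entry below keeps a digest of the record, and the record itself stays in its repository. I assume SHA-256 from Python's hashlib here (this post does not depend on that choice), and the names fingerprint and ledger_entry are made up for illustration.

```python
import hashlib


def fingerprint(record: str) -> str:
    """Return the SHA-256 hex digest of a text record (assumed hash function)."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()


# One workflow event, written with the identifiers listed above.
record = "test/8d485a744ebad89f6349 digitized test/cae42177c14a8fcdeb14"
digest = fingerprint(record)

# Hypothetical ledger entry: only the digest is stored, never the record.
ledger_entry = {"hash": digest}

# Anyone holding the original record can check it against the ledger.
print(fingerprint(record) == ledger_entry["hash"])                # True
print(fingerprint(record + " (edited)") == ledger_entry["hash"])  # False
```

The digest reveals nothing useful about the record, but any change to the record produces a different digest, which is what makes the comparison a verification step.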

Create my first block

>>> from blockchain import Message, Block, Blockchain
>>> import pickle
>>> B1 = Block()
>>> B1.add_message(Message("test/b0ac8fc9596372bc3c97 deposited test/cae42177c14a8fcdeb14 in test/a49a51ac540d68915229"))
>>> B1.add_message(Message("test/8d485a744ebad89f6349 digitized test/cae42177c14a8fcdeb14"))
>>> B1.__dict__
{'timestamp': None, 'messages': [Message<hash: f35736c42b0cbde8ca320148802755029b5958d9148f1022d431c0b6abd49f13, prev_hash: None, sender: None, receiver: None, data: test/b0ac8fc9596372bc3c97>, Message<hash: f44bd62f8afdb868679278784dc5a509d1d64c19dab5963ae1eab7c5aadca47a, prev_hash: f35736c42b0cbde8ca320148802755029b5958d9148f1022d431c0b6abd49f13, sender: None, receiver: None, data: test/8d485a744ebad89f6349>], 'hash': None, 'prev_hash': None}

Seal and validate

I seal the block, which computes each message’s hash by combining the payload hash with prev_hash. Any tampering with a message can then be detected by recalculating the hashes.

>>> B1.seal()
>>> B1.validate()
>>> B1
Block<hash: 4cf253428a5d8f9ef20c64015fb88bfbb3fcd60b11ff045ed9c7d906f714d537, prev_hash: None, messages: 2, time: 1575925690.53>
>>> B2 = Block()
>>> B2
Block<hash: None, prev_hash: None, messages: 0, time: None>
>>> B2.add_message(Message("test/8d485a744ebad89f6349 digitized test/cae42177c14a8fcdeb14"))
>>> B2.add_message(Message("DNA accession EF215838 is sourced from test/cae42177c14a8fcdeb14"))
>>> B2.add_message(Message("EF215838 sequence is CAGATGGGCCGAAAGGCCCA"))
>>> B2
Block<hash: None, prev_hash: None, messages: 3, time: None>
>>> B2.seal()
>>> B2.validate()
>>> B2
Block<hash: 67ec631826f503dbfe702c9250ec8f73669729a073876d361e684b070192513e, prev_hash: None, messages: 3, time: 1575926036.86>
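
To make the hashing step concrete, here is a minimal sketch of how a message hash could combine a payload hash with prev_hash. It assumes SHA-256 and plain string concatenation; SketchMessage and sha256_hex are hypothetical names for illustration, not the Message class imported above, whose internals may differ.

```python
import hashlib
from typing import Optional


def sha256_hex(text: str) -> str:
    """SHA-256 hex digest of a string (assumed hash function)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class SketchMessage:
    """Illustrative stand-in, not the library's Message class."""

    def __init__(self, data: str, prev_hash: Optional[str] = None):
        self.data = data
        self.prev_hash = prev_hash
        self.payload_hash = sha256_hex(data)
        self.hash = self.compute_hash()

    def compute_hash(self) -> str:
        # Combine the payload hash with the previous message's hash.
        return sha256_hex(self.payload_hash + (self.prev_hash or ""))

    def is_valid(self) -> bool:
        # Recalculate both hashes from the current data and compare.
        return (self.payload_hash == sha256_hex(self.data)
                and self.hash == self.compute_hash())


m1 = SketchMessage("test/8d485a744ebad89f6349 digitized test/cae42177c14a8fcdeb14")
m2 = SketchMessage("EF215838 sequence is CAGATGGGCCGAAAGGCCCA", prev_hash=m1.hash)
print(m1.is_valid(), m2.is_valid())  # True True

# Changing the data after the fact no longer matches the stored payload hash.
m2.data = "EF215838 sequence is ACGATGGGCCGAAAGGCCCA"
print(m2.is_valid())  # False
```

Because each hash depends on the previous one, the messages form a small chain of their own inside the block.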

Chain the blocks

Let’s chain them together.

>>> chain = Blockchain()
>>> chain.add_block(B1)
>>> chain.add_block(B2)
>>> chain.blocks[1].messages[1].data
'DNA accession EF215838 is sourced from test/cae42177c14a8fcdeb14'
>>> chain.blocks[1].messages[2].data
'EF215838 sequence is CAGATGGGCCGAAAGGCCCA'

Tamper data

Now I want to modify the data. I use pickle to write and read the Python objects; details are on the GitHub page.

>>> pickle.dump(chain, open('chain.p', 'wb'))
>>> tampered = pickle.load(open('chain.p', 'rb'))
>>> tampered.blocks[1].messages[2].data = "EF215838 sequence is ACGATGGGCCGAAAGGCCCA"
>>> pickle.dump(tampered, open('chain.p', 'wb'))
>>> chain3 = pickle.load(open('chain.p', 'rb'))
>>> chain3.validate()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "blockchain.py", line 116, in validate
    raise InvalidBlockchain("Invalid blockchain at block {} caused by: {}".format(i, str(ex)))
blockchain.InvalidBlockchain: Invalid blockchain at block 1 caused by: Invalid block: Message #2 failed validation: Invalid payload hash in message: Message<hash: e28c7fcc52fca2a375c32246adc5b38b4ae5d54b32d3861c096c2d9a2b80383c, prev_hash: fd0cf09adf871458f9eb38bad7023306e539c7e9daecd48605282529f8268ee3, sender: None, receiver: None, data: EF215838 sequence is ACGA>HERE 4c65e09280b1e68199be03569c2df75d95d3f1534fb73b9e496d8e9f422b379f. In block: Block<hash: 60d76c7d06cc770f37dbf9cbaef6433bb4b5ef8b6e7d933810fb570fdb930378, prev_hash: 344182a9b9565a6d0b6633232619cd9834a23e1e19252ace87b53aebef6ecd5a, messages: 3, time: 1575926137.01>
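
The same check can be sketched without the library. The code below links blocks through prev_hash and walks the chain to find the first block whose recalculated hash no longer matches. It is only an illustration under my own assumptions (SHA-256, blocks as plain dictionaries, and the hypothetical helpers block_hash, build_chain, and first_invalid_block), not how blockchain.py is implemented.

```python
import hashlib
from typing import Dict, List, Optional


def sha256_hex(text: str) -> str:
    """SHA-256 hex digest of a string (assumed hash function)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def block_hash(messages: List[str], prev_hash: Optional[str]) -> str:
    # Hash every message, then hash the digests together with prev_hash.
    digests = "".join(sha256_hex(m) for m in messages)
    return sha256_hex(digests + (prev_hash or ""))


def build_chain(batches: List[List[str]]) -> List[Dict]:
    """Create one block per batch of messages, each linked to the previous block."""
    chain, prev = [], None
    for messages in batches:
        current = block_hash(messages, prev)
        chain.append({"messages": list(messages), "prev_hash": prev, "hash": current})
        prev = current
    return chain


def first_invalid_block(chain: List[Dict]) -> Optional[int]:
    """Return the index of the first block that fails validation, or None."""
    prev = None
    for i, block in enumerate(chain):
        if block["prev_hash"] != prev:
            return i  # broken link to the previous block
        if block["hash"] != block_hash(block["messages"], block["prev_hash"]):
            return i  # contents changed after the hash was computed
        prev = block["hash"]
    return None


sketch_chain = build_chain([
    ["test/8d485a744ebad89f6349 digitized test/cae42177c14a8fcdeb14"],
    ["DNA accession EF215838 is sourced from test/cae42177c14a8fcdeb14",
     "EF215838 sequence is CAGATGGGCCGAAAGGCCCA"],
])
print(first_invalid_block(sketch_chain))  # None

# Same kind of edit as above: change the first bases of the sequence.
sketch_chain[1]["messages"][1] = "EF215838 sequence is ACGATGGGCCGAAAGGCCCA"
print(first_invalid_block(sketch_chain))  # 1
```

As in the traceback, the validation points at block 1, because that is where the stored hash and the recalculated hash first disagree.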



