In this post I want to talk about how user stories can help us during the conceptual design of a large, distributed data infrastructure. Before thinking about connecting, aggregating and linking various data sources, I find it useful to focus on a particular scenario and think about the related data components.
Currently, I am involved (as the Data Architect) in the DiSSCo (Distributed System of Scientific Collections) project, where, with the help of the wider research and data community, we are trying to understand the natural science collections data landscape. Alex Hardisty described this process as “fishing in deep pools for something you know is there but can’t see”. Even though our user stories are collected from the natural history museum and biodiversity community, I hope this post is also useful for a general audience. I will keep some of the gory technical details (such as RDF and Darwin Core) out of the way to illustrate the basic problem of connecting and linking various data sources. This is a known challenge in the community, and various collaborative efforts, including DiSSCo, are addressing it from different perspectives (such as persistent identifiers, search and discovery, linked data, metadata standards, etc.).
Recently we gathered user stories for a specific service that is being developed in DiSSCo. This service (European Loans and Visit System) will focus on data curation and access policies across various European institutions. It will also help researchers find collection details and the facilities in the museums. The user story that I am focusing on today is: “As A Researcher I want sampling of specific specimens for DNA so that I can investigate genetic diversity loss.”
A worthy cause indeed. But how can we help facilitate this? Currently, the researcher might approach the museum directly and request the item. (The assumption here is that the researcher’s initial search was successful, or that the researcher is already aware of the museum’s collection details. If not, that is another story.) Then the museum or the researcher might contact a DNA sequencing lab. I won’t go into the full workflow here, but let’s see how the different data components of this user story play out.
The sample that was requested is part of a museum collection, where the specimen is stored along with other specimens. The entities involved here are the collection summary and the specimen record. The sample might need to go to a different facility, and this needs to be tracked and recorded. Here we will also need institutional data (about the museum, the genbank, the types of facilities and equipment available, etc.). Some of the museums might have specific expertise and equipment for DNA sequencing. We will need a record of those as well to facilitate the request. In the above scenarios we want the specimen and the person requesting the sample to be identified and linked. The researcher should not need to know anything about which institution ID links to which collection record. The system and the data infrastructure should deliver the data and the physical sample that was requested (there are indeed various digitisation aspects involved here, but accessing and tracking the physical sample is often very important for both the researcher and the museum).
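To make the linking idea a bit more concrete, here is a minimal sketch in Python of the entities mentioned above. Everything in it is an assumption made for illustration: the class names, the field names, the identifier formats (such as "spec:42") and the in-memory Registry are hypothetical and are not part of any DiSSCo service or standard. The point is only that the researcher supplies a specimen identifier, and the system follows the links to the collection and the institution.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Institution:
    institution_id: str
    name: str
    facilities: list[str] = field(default_factory=list)  # e.g. "dna-sequencing"


@dataclass
class Collection:
    collection_id: str
    institution_id: str  # link to the holding institution


@dataclass
class Specimen:
    specimen_id: str
    collection_id: str  # link to the collection summary


@dataclass
class SampleRequest:
    researcher_id: str
    specimen_id: str
    purpose: str
    # Institution IDs the physical sample has been at, in order.
    locations: list[str] = field(default_factory=list)


class Registry:
    """Hypothetical in-memory stand-in for the linked data sources."""

    def __init__(self, institutions, collections, specimens):
        self.institutions = {i.institution_id: i for i in institutions}
        self.collections = {c.collection_id: c for c in collections}
        self.specimens = {s.specimen_id: s for s in specimens}

    def holding_institution(self, specimen_id):
        # Follow specimen -> collection -> institution on the researcher's behalf.
        specimen = self.specimens[specimen_id]
        collection = self.collections[specimen.collection_id]
        return self.institutions[collection.institution_id]

    def with_facility(self, facility):
        # Institutions that record the given facility or expertise.
        return [i for i in self.institutions.values() if facility in i.facilities]

    def request_sample(self, researcher_id, specimen_id, purpose):
        # Link the person and the specimen; tracking starts at the holder.
        holder = self.holding_institution(specimen_id)
        return SampleRequest(researcher_id, specimen_id, purpose,
                             [holder.institution_id])


def move_sample(request, institution):
    # Record that the physical sample has moved to another facility.
    request.locations.append(institution.institution_id)


registry = Registry(
    institutions=[
        Institution("inst:museum-a", "Museum A"),
        Institution("inst:lab-b", "Lab B", ["dna-sequencing"]),
    ],
    collections=[Collection("coll:1", "inst:museum-a")],
    specimens=[Specimen("spec:42", "coll:1")],
)

request = registry.request_sample(
    "person:r1", "spec:42", "investigate genetic diversity loss"
)
lab = registry.with_facility("dna-sequencing")[0]
move_sample(request, lab)
print(request.locations)  # ['inst:museum-a', 'inst:lab-b']
```

In a real infrastructure the dictionaries would be replaced by separate data sources held by different institutions, and the plain string IDs by persistent identifiers, which is exactly where the difficulty lies.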
Based on this particular user story, our data landscape might look like this (a very simplified diagram below). I have tried to outline some of the items in each entity that are necessary for the linking.
As you can see in the diagram, identifying the items (in this case the specimen) in different data sources and establishing the relationships between the entities are two crucial aspects of this challenge. This cannot be done simply by creating and combining multiple databases, because different institutions, data formats, and standards are involved. Nevertheless, user stories like the one above are helping us to think about specific scenarios and the related data components. I will write more as we continue to think about this challenge and design our new DiSSCo services.