Event-driven digital object architecture for natural science collection

In this post, I want to share some initial thoughts about building a service with an event-driven, event-first approach. The examples below come from biodiversity, natural history collections, and the museum world. I also use the concept of the Digital Specimen, a specific form of the more general-purpose Digital Object (DO) defined by the Digital Object Architecture (DOA). My goal is to see how we can fuse the event-driven model with the object-oriented aspect of the Digital Object. Using domain-specific facts (for example, the fragmented data landscape), we can create semantics and properties and store them as events so that any process or service can be mapped. Disclaimer: these ideas are at a very early stage.

Why Event?

What got me thinking about “events” were the following questions: How do we deal with heterogeneous, distributed data sources and build services around them that are sustainable, scalable, and interoperable? More importantly, can we think about data atomicity, so that we focus on the smallest unit that can serve as a building block of the system? Can this smallest unit be algorithm-, language-, and schema-agnostic? Can we re-imagine our vast data landscape in a granular form and think of it as a sequence of events within a particular workflow? I don’t have answers to these questions yet, but I hope the ideas and examples here help me and others think about solutions (or apply them to existing ones).

If you want to read about the advantages and challenges of event-driven architecture, there are plenty of articles out there (in particular from the microservice and evolutionary architecture world), so I am not linking them here. But I will share this description by Neil Avery, which succinctly captures the approach:

Event-first analog: I walk into a room, generate an “entered room” event and the light turns on. This is a reaction to an event.

Event-command analog: I walk into a room, flip the light switch and the light turns on. This is a command.
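
The difference between the two analogs can be sketched in a few lines of Python. This is only an illustration: the Light and EventBus classes and the event name "entered_room" are hypothetical names invented for this sketch, not part of any library.

```python
from collections import defaultdict


class Light:
    """A light that can be switched on."""

    def __init__(self):
        self.is_on = False

    def turn_on(self):
        self.is_on = True


class EventBus:
    """Hypothetical in-memory publish/subscribe bus (illustration only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The publisher does not know who reacts to the event, if anyone.
        for handler in self._handlers[event_type]:
            handler(payload)


# Event-command: the caller tells the light what to do.
commanded_light = Light()
commanded_light.turn_on()

# Event-first: the caller only reports what happened; the light reacts.
reactive_light = Light()
bus = EventBus()
bus.subscribe("entered_room", lambda payload: reactive_light.turn_on())
bus.publish("entered_room", {"who": "me"})

print(commanded_light.is_on, reactive_light.is_on)
```

In the event-first version, more reactions (a heater, a logger) could subscribe to the same event without any change to the code that publishes it.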

What is an Event?

First things first: an event is “a significant change in state.” Put simply, it is something that happened in the past: a user viewed a page, a user clicked a button, and so on. From a system and data perspective, an event occurs when some actor acts on an entity in a domain-specific context. An excellent example is git, where every commit is stored, which lets you figure out what happened. In this model, entities and events are intertwined concepts. To build the data model, we need to understand the actors involved in each event.

What is an Entity?

An entity can be a person or an object (digital or physical) that is involved in an event. In our example, the entities are scientists, museums, and organisms (entities could also be digital content such as PDFs and images). Events are actions such as “a scientist identifies a species, collects a specimen, or deposits the specimen in a museum.” An event could also be an update to a record, such as changing the name or address of the museum. We can describe these actions in a form very similar to an RDF triple: “Scientist A collected specimen X” and “Scientist A deposited specimen X in Museum Z.”

These events are then stored in an event store database. The event data can be projected based on specific service requirements or other event triggers. I am aware that there are standards such as Darwin Core and various ontologies to describe such “events.” But let’s try to think of an “event” inside a system. Even though collecting specimens and depositing items in a museum are real-world events, here I am interested in the digital trace of those actions. The actual collection date and the date the digital data was created might differ, so for simplicity I use only the timestamp of the digital action. Additional information such as the collection date and location can become properties of a particular entity.
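
To make the idea of an event store and its projections more concrete, here is a minimal in-memory sketch. The EventStore class and its append and project methods are hypothetical names for this illustration; a real service would use a persistent store. Events are only ever appended, and a projection is a read-only view derived from the log.

```python
from datetime import datetime, timezone


class EventStore:
    """Hypothetical append-only, in-memory event store (illustration only)."""

    def __init__(self):
        self._events = []

    def append(self, actor, action, target, timestamp=None):
        # The timestamp records the digital action, not the real-world one.
        event = {
            "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "target": target,
        }
        self._events.append(event)
        return event

    def project(self, predicate):
        # A projection never modifies the log; it only selects from it.
        return [event for event in self._events if predicate(event)]


store = EventStore()
store.append("Scientist A", "collected", "specimen X")
store.append("Scientist A", "deposited", "specimen X")
store.append("Scientist B", "collected", "specimen Y")

by_scientist_a = store.project(lambda e: e["actor"] == "Scientist A")
print([e["action"] for e in by_scientist_a])
```

Different services can run different projections over the same log, which is what allows the stored events to stay generic.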

Let’s create some entities

For this example, I create Digital Objects as defined by the Digital Object Architecture. I use Cordra to create these objects in JSON, but the same could be demonstrated with other tools. A digital object (DO) is defined as a “sequence of bits” “having as an essential element an associated unique persistent identifier.”

Let’s create the entities that we introduced before: Scientist, Species/Organism, Specimen, and Museum. For demonstration purposes, I am using very simple records.

First, I create an entity representing the scientist “Thirteenth Doctor”:

{
"id": "test/b0ac8fc9596372bc3c97",
"name": "Thirteenth Doctor"
}

The “id” string above is a persistent identifier (PID): a globally or locally unique string that is persistent and may be resolvable, like a DOI or a Handle. The species we are studying is Homunculus Loxodontus, so let’s give the organism a digital identity as well (here I include only the scientific name; other identifications could also go here as properties):

{
"id": "test/cae42177c14a8fcdeb14",
"scientificName": "Homunculus Loxodontus"
}

Again the id string here is a PID. We now need the museum entity. Our museum is the Museum of Broken Relationships:

{
"id": "test/a49a51ac540d68915229",
"InstName": "Museum of Broken Relationships",
"website": "https://en.wikipedia.org/wiki/Museum_of_Broken_Relationships"
}

Let’s create some events

Now I want to create a few event schemas that capture the various actions of the Doctor. These schemas could be as generic as creating and updating records, or as specific as collection and deposit. Here is an example of an event that captures the action “The Thirteenth Doctor collected a specimen of an organism.” The “id” is the PID of the event. We know this is a collection event and that the actor was a scientist. The target was the specimen (or sample) of the organism. Various other properties could be recorded during this collection event.

{
"id": "test/98499997ff0a30ab444d",
"timestamp": "2019-11-05T11:22:03.524Z",
"event": "collection",
"entityType": "scientist",
"entityID": "test/b0ac8fc9596372bc3c97",
"targetEntityType": "specimen",
"targetEntityId": "test/b0ac8fc9596372bc3c97"
}

The next event is depositing the specimen in the museum. We probably need to introduce another category here, called “object,” because we want to say “X deposited Y in Z” (this part needs more thought). As before, we have the PID of the event, an event type, the actor, the object, and the target.

{
"id": "test/c4942d87a9f89d8929c1",
"timestamp": "2019-11-05T13:22:03.524Z",
"event": "deposit",
"entityType": "scientist",
"entityID": "test/b0ac8fc9596372bc3c97",
"objectType": "specimen",
"objectID": "ABC-123-445",
"targetEntityType": "museum",
"targetEntityId": "test/a49a51ac540d68915229",
"depositDate": "11/29/2010"
}
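
Since each event type carries its own set of fields, a small check can catch incomplete records before they are stored. The sketch below is an assumption-laden illustration: the lists of required fields are read off the two example records above, and the function name missing_fields is hypothetical.

```python
# Assumed required fields, read off the example records (not a formal schema).
COMMON_FIELDS = {
    "id", "timestamp", "event",
    "entityType", "entityID",
    "targetEntityType", "targetEntityId",
}
EXTRA_FIELDS = {
    "collection": set(),
    "deposit": {"objectType", "objectID"},
}


def missing_fields(event):
    """Return the sorted names of required fields absent from an event."""
    event_type = event.get("event")
    if event_type not in EXTRA_FIELDS:
        raise ValueError(f"unknown event type: {event_type!r}")
    required = COMMON_FIELDS | EXTRA_FIELDS[event_type]
    return sorted(required - set(event))


deposit = {
    "id": "test/c4942d87a9f89d8929c1",
    "timestamp": "2019-11-05T13:22:03.524Z",
    "event": "deposit",
    "entityType": "scientist",
    "entityID": "test/b0ac8fc9596372bc3c97",
    "objectType": "specimen",
    "objectID": "ABC-123-445",
    "targetEntityType": "museum",
    "targetEntityId": "test/a49a51ac540d68915229",
    "depositDate": "11/29/2010",
}

# The full record has nothing missing; dropping the object ID is reported.
incomplete = {key: value for key, value in deposit.items() if key != "objectID"}
print(missing_fields(deposit))
print(missing_fields(incomplete))
```

Optional properties such as depositDate are simply ignored by the check, so domain-specific extras remain possible.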

Next, the scientist creates the digital specimen. An automated process, an API, or another service could perform this step as well; all of them can be entities.

{
"id": "test/bexo841bce6ef0116d5",
"timestamp": "2019-11-05T14:22:03.524Z",
"event": "createDS",
"entityType": "scientist",
"entityID": "test/b0ac8fc9596372bc3c97",
"targetEntityType": "digitalspecimen",
"targetEntityId": "test/db44501292f3e4c35f8e"
}

The record above is a digital specimen creation event performed by the Thirteenth Doctor, who has the PID “test/b0ac8fc9596372bc3c97”. The resulting digital specimen (the output of the event above) is:

{
"id": "test/db44501292f3e4c35f8e",
"scientificName": "Homunculus Loxodontus",
"physicalSpecimenId": "ABC-123-445",
"depositedIn": "test/a49a51ac540d68915229",
"collectedBy": "test/b0ac8fc9596372bc3c97",
"depositedBy": "test/b0ac8fc9596372bc3c97",
"dscreatedBy": "test/b0ac8fc9596372bc3c97"
}
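
One way to read “output of the event” is that the digital specimen is a projection of the event log. The sketch below replays the three example events and derives the provenance fields of the record. The apply_event function and the mapping from event types to specimen properties are assumptions made for illustration, not Cordra behavior; the sketch uses the key targetEntityId throughout, and the scientific name is left out because it would come from the organism entity rather than from an event.

```python
def apply_event(specimen, event):
    """Return a copy of the specimen record with one event folded in."""
    updated = dict(specimen)
    kind = event["event"]
    if kind == "collection":
        updated["collectedBy"] = event["entityID"]
    elif kind == "deposit":
        updated["depositedBy"] = event["entityID"]
        updated["physicalSpecimenId"] = event["objectID"]
        updated["depositedIn"] = event["targetEntityId"]
    elif kind == "createDS":
        updated["id"] = event["targetEntityId"]
        updated["dscreatedBy"] = event["entityID"]
    return updated


# The example events, trimmed to the fields used and deliberately unordered.
events = [
    {"timestamp": "2019-11-05T14:22:03.524Z", "event": "createDS",
     "entityID": "test/b0ac8fc9596372bc3c97",
     "targetEntityId": "test/db44501292f3e4c35f8e"},
    {"timestamp": "2019-11-05T11:22:03.524Z", "event": "collection",
     "entityID": "test/b0ac8fc9596372bc3c97"},
    {"timestamp": "2019-11-05T13:22:03.524Z", "event": "deposit",
     "entityID": "test/b0ac8fc9596372bc3c97",
     "objectID": "ABC-123-445",
     "targetEntityId": "test/a49a51ac540d68915229"},
]

# ISO 8601 timestamps in one format sort chronologically as plain strings.
specimen = {}
for event in sorted(events, key=lambda e: e["timestamp"]):
    specimen = apply_event(specimen, event)

print(specimen)
```

Because the record is derived rather than edited in place, it can be rebuilt at any time from the log, and a new property only needs a new rule in the fold.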

How did we get here?

What sequence of events led to the creation of this digital specimen? We can now build an aggregated log from all the events to answer that question. This gives us an audit trail, and the event trail can be viewed in a tabular format.
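
As a rough sketch of such a tabular view, the three example events can be sorted by timestamp and printed as fixed-width text. The audit_table function is a hypothetical helper, and the choice of columns is an assumption; any other event property could be shown instead.

```python
def audit_table(events):
    """Return the event trail as fixed-width text, oldest event first."""
    columns = ["timestamp", "event", "entityType", "targetEntityType"]
    rows = [columns] + [
        [event[column] for column in columns]
        for event in sorted(events, key=lambda e: e["timestamp"])
    ]
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(columns))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    )


# The example events, reduced to the columns shown in the table.
trail = [
    {"timestamp": "2019-11-05T13:22:03.524Z", "event": "deposit",
     "entityType": "scientist", "targetEntityType": "museum"},
    {"timestamp": "2019-11-05T14:22:03.524Z", "event": "createDS",
     "entityType": "scientist", "targetEntityType": "digitalspecimen"},
    {"timestamp": "2019-11-05T11:22:03.524Z", "event": "collection",
     "entityType": "scientist", "targetEntityType": "specimen"},
]

print(audit_table(trail))
```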

Conclusion

So you may ask: couldn’t you create a relational database and log all of this? Yes, and that is one of the outputs that can come out of this model. But can the relational CRUD model provide the same flexibility?

We know that entities and their properties (such as a scientific name, a museum address, or the location of a specimen) change over time. Focusing on events and state changes therefore gives us snapshots of those entities. We can answer questions like “Give me all the specimens deposited by X in Z,” but we can also answer questions like “How did all these specimens get here?” or “What was the status of these specimens on Jan 1, 2001?” So we can answer not only who, what, when, where, and which, but also how.
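
A point-in-time question like the one about Jan 1, 2001 can be answered by replaying only the events up to that date. The sketch below is a minimal illustration under stated assumptions: the "update" event type, its "changes" field, the addresses, and the dates are all made up for this example, and state_at is a hypothetical helper.

```python
def state_at(events, entity_id, as_of):
    """Fold the updates for one entity up to and including `as_of`."""
    state = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        # ISO 8601 timestamps in one format compare correctly as strings.
        if event["targetEntityId"] == entity_id and event["timestamp"] <= as_of:
            state.update(event["changes"])
    return state


MUSEUM = "test/a49a51ac540d68915229"

# Made-up update events for the museum entity (illustration only).
updates = [
    {"timestamp": "2000-06-01T00:00:00.000Z", "event": "update",
     "targetEntityId": MUSEUM, "changes": {"address": "Old Street 1"}},
    {"timestamp": "2005-03-15T00:00:00.000Z", "event": "update",
     "targetEntityId": MUSEUM, "changes": {"address": "New Street 2"}},
]

print(state_at(updates, MUSEUM, "2001-01-01T00:00:00.000Z"))
print(state_at(updates, MUSEUM, "2019-11-05T00:00:00.000Z"))
```

A CRUD table that overwrites the address would only be able to answer the second query; the event log answers both.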

Event-based modeling gives us a flexible way to record the state of an entity at a specific moment. Part of that flexibility is the ability to create domain-specific event types. We might find a way to retain the fragmented landscape and still create valuable integrated services.

Data Architect at the Distributed System of Scientific Collections (https://dissco.eu). PhD in Sociology. Bachelor's in Math and CS from the University of Illinois.