FAIR Digital Object Application Case: Composing Machine Learning Training Data
Collecting data from different storage locations and using it to compose a training dataset for Machine Learning (ML) is a time-consuming task. Even if the FAIR principles are fulfilled, scientists still need to perform several steps to obtain a ready-to-use training dataset.
Scanning electron microscopy (SEM) images labeled “Biological Research” are one example. Combining them with other SEM images that carry a different label, e.g., “Microbiology”, requires relabeling, that is, assigning related labels to the same category so that the images can be used in the same ML workflow. Doing this manually for several hundred images and a few different label terms appears feasible for a scientist. If the job gets too tedious, skilled users may even write small scripts to automate it. Still, this process involves a lot of preparation, custom work, and data wrangling, and will most likely have to be repeated for the next ML task.
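The relabeling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in this application case: the category name "biology", the file names, and the helper names `normalize` and `relabel` are assumptions made for the example.

```python
def normalize(term: str) -> str:
    """Collapse whitespace and ignore case so spelling variants compare equal."""
    return " ".join(term.split()).casefold()


# Hypothetical mapping: both label terms are assigned to one ML category.
LABEL_TO_CATEGORY = {
    normalize("Biological Research"): "biology",
    normalize("Microbiology"): "biology",
}


def relabel(labeled_images, mapping=LABEL_TO_CATEGORY):
    """Map each image's label to a category.

    Returns a dict of image -> category and a list of images whose
    label has no mapping, so that they can be reviewed by hand.
    """
    relabeled, unmapped = {}, []
    for image_id, label in labeled_images.items():
        category = mapping.get(normalize(label))
        if category is None:
            unmapped.append(image_id)
        else:
            relabeled[image_id] = category
    return relabeled, unmapped
```

A script like this works for one dataset, but the mapping and the label lookup have to be rebuilt by hand for every new collection, which is the repetition the paragraph above describes.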
To automate relabeling, ideally all metadata involved must be readable, interpretable, and operable by software clients. The application of the FAIR Digital Object (FAIR DO) concept makes this possible.
As a first step, before introducing FAIR DOs, the representation of label information had to be harmonized. Labels that are stored in structured metadata documents and based on vocabulary terms are much easier to relate to each other and can be processed by machines. In this application case, the labels of the SEM images were stored in JSON documents and mapped to vocabulary terms from the UNESCO Thesaurus. This additional metadata cannot always be stored in the same repository as the original SEM images, e.g., because the images were already published and the publishing repository does not accept subsequent extension of the publication record. Therefore, this metadata was stored in an instance of MetaStore and refers to the original images, which remain unchanged wherever they are located.
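To make the idea of a harmonized label document concrete, the following sketch parses a JSON document that refers to an image and lists vocabulary terms. The document layout (the keys `image`, `labels`, `term`, `uri`) and all URLs are placeholders invented for this illustration; the actual schema used in this application case is not given here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelRecord:
    image_ref: str      # reference to the unchanged original image
    term_uris: tuple    # vocabulary term identifiers assigned to it


def parse_label_document(text: str) -> LabelRecord:
    """Read a label document in the hypothetical layout shown below."""
    doc = json.loads(text)
    term_uris = tuple(entry["uri"] for entry in doc["labels"])
    if not term_uris:
        raise ValueError("label document contains no vocabulary terms")
    return LabelRecord(image_ref=doc["image"], term_uris=term_uris)


# Placeholder document; the URLs are not real identifiers.
EXAMPLE_DOCUMENT = """
{
  "image": "https://example.org/images/sem-0001.tif",
  "labels": [
    {"term": "Microbiology", "uri": "https://example.org/vocab/microbiology"}
  ]
}
"""
```

Because the label is expressed as a vocabulary term identifier rather than free text, two documents can be compared by identifier instead of by string matching.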
After harmonizing and storing the label metadata, everything was set to apply the concept of FAIR DOs to both the images and the metadata documents containing the label information.
The reason for creating two FAIR DOs is that the same image might be used for different purposes. A FAIR DO that represents only the native image is more reusable than a FAIR DO that is further contextualized by including ML-specific metadata.
Each FAIR DO is identified by a globally unique Persistent Identifier (PID) that resolves to a PID record. This PID record is strongly typed and machine-readable, so software can interpret the information and perform predefined operations based on that interpretation. All PID records created within this application case follow the HMC guidance on the Helmholtz Kernel Information Profile and are therefore interoperable with all FAIR DOs following the HMC guidance.
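As an illustration of how a client might read such a typed record, the sketch below indexes the entries of a resolved PID record by their type. It assumes that the PIDs are Handle PIDs and that the record was fetched as JSON from the Handle System REST interface, whose responses contain a `values` list with `type` and `data.value` fields; the PID system is not stated here, and the type names in the usage example are placeholders rather than real Kernel Information Profile types.

```python
from collections import defaultdict


def index_record(handle_response: dict) -> dict:
    """Group the values of a resolved PID record by their type.

    A type may occur more than once in a record, so each type
    maps to a list of values.
    """
    by_type = defaultdict(list)
    for entry in handle_response.get("values", []):
        value = entry.get("data", {}).get("value")
        by_type[entry["type"]].append(value)
    return dict(by_type)
```

For example, a record with the placeholder types `example/location` and `example/hasMetadata` would be indexed so that a client can look up the object location or follow the link to a related FAIR DO by type, without knowing anything else about the record.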
In a final step, a Python 3-based client was implemented to make use of the machine-operability of FAIR DOs. By resolving the PIDs and reading and interpreting the information from the PID record, the client was able to perform an automatic relabeling between different terms from the UNESCO Thesaurus. This ultimately resulted in a training dataset where all SEM images from both label terms, i.e., “Biological Research” and “Microbiology”, were assigned to the same category for the ML framework.
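The overall flow of such a client can be sketched as follows. This is not the client implemented in this application case: the function name, the record attributes `describes` and `location`, and the injected `resolve` and `load_terms` callables are all assumptions chosen so that the example runs without network access.

```python
def compose_training_set(metadata_pids, resolve, load_terms,
                         term_to_category, wanted_categories):
    """Build (image location, category) pairs from metadata FAIR DOs.

    resolve(pid)         -> dict of record attributes (hypothetical keys)
    load_terms(location) -> list of label terms in a metadata document
    """
    wanted = set(wanted_categories)
    dataset = []
    for pid in metadata_pids:
        record = resolve(pid)
        # Follow the link from the metadata FAIR DO to the image FAIR DO.
        image_location = resolve(record["describes"])["location"]
        terms = load_terms(record["location"])
        # Relabel: related terms collapse into the same category.
        categories = {term_to_category[t] for t in terms
                      if t in term_to_category}
        for category in sorted(categories & wanted):
            dataset.append((image_location, category))
    return dataset
```

Passing the resolver and the document loader as parameters keeps the sketch independent of any particular PID service or repository; in practice they would wrap the PID resolution and the metadata repository access.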
Once the FAIR DOs are implemented, little interaction is required from the scientist: defining the criteria for the training dataset and providing the PIDs of the FAIR DOs to the client. This makes it very easy to compose new ML training datasets and saves the scientist a lot of valuable time, allowing them to focus on other project tasks.