2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -102,7 +102,7 @@ def read_version_from_pyproject():
     'sphinx_togglebutton',
     'sphinxcontrib.datatemplates',
     # Custom extensions, see `_ext` directory.
-    'plugin_markup',
+    # 'plugin_markup',
 ]

 language = 'en'
313 changes: 300 additions & 13 deletions docs/source/dev/data_model.md
@@ -1,27 +1,314 @@
<!--
SPDX-FileCopyrightText: 2022 German Aerospace Center (DLR)
SPDX-FileCopyrightText: 2025 German Aerospace Center (DLR)
SPDX-License-Identifier: CC-BY-SA-4.0
-->

<!--
SPDX-FileContributor: Michael Meinel
SPDX-FileContributor: Stephan Druskat <stephan.druskat@dlr.de>
-->

# HERMES Data Model
# Data model

*hermes* uses an internal data model to store the output of the different stages.
All the data is collected in a directory called `.hermes` located in the root of the project directory.
`hermes`' internal data model acts like a contract between `hermes` and plugins.
It is based on [**JSON-LD (JSON Linked Data)**](https://json-ld.org/), and
the public API simplifies interaction with the data model through Python code.

You should not need to interact with this data directly.
Instead, use {class}`hermes.model.context.HermesContext` and respective subclasses to access the data in a consistent way.
Consequently, the output of the different `hermes` commands is valid JSON-LD, serialized as JSON.
It is cached in subdirectories of the `.hermes/` directory, which is created in the root of the project directory.

The cache is for internal use only; do not interact with its data directly.

## Harvest Data
Depending on whether you develop a plugin for `hermes`, or you develop `hermes` itself, you need to know either [_some_](#json-ld-for-plugin-developers),
or _quite a few_ things about JSON-LD.

The data of the havesters is cached in the sub-directory `.hermes/harvest`.
Each harvester has a separate cache file to allow parallel harvesting.
The cache file is encoded in JSON and stored in `.hermes/harvest/HARVESTER_NAME.json`
where `HARVESTER_NAME` corresponds to the entry point name.
The following sections provide documentation of the data model.
They aim to help you get started with `hermes` plugin and core development,
even if you have no previous experience with JSON-LD.

{class}`hermes.model.context.HermesHarvestContext` encapsulates these harvester caches.
## The data model for plugin developers

If you develop a plugin for `hermes`, you will only need to work with a single Python class and the public API
it provides: {class}`hermes.model.SoftwareMetadata`.

To work with this class, it is necessary that you know _some_ things about JSON-LD.

### JSON-LD for plugin developers

```{attention}
Work in progress.
```


### Working with the `hermes` data model in plugins

> **Goal**
> Understand how plugins access the `hermes` data model and interact with it.

`hermes` aims to hide as much of the data model as possible behind a public API,
so that plugin developers do not have to deal with some of the more complex features of JSON-LD.

#### Model instances in different types of plugin

You can extend `hermes` with plugins for three different commands: `harvest`, `curate`, `deposit`.

Contributor

So "process plugins" and "post-process plugins" won't exist any more? This should (also) be mentioned in some architectural doc rather than in a side note here. As far as I can tell it is not mentioned anywhere else yet.

Contributor

Made an issue for that #447. In the parent issue it is also described.


The commands differ in how they work with instances of the data model.

- `harvest` plugins _create_ a single new model instance and return it.
- `curate` plugins are passed a single existing model instance (the output of `process`),
and return a single model instance.
- `deposit` plugins are passed a single existing model instance (the output of `curate`),
and return a single model instance.
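
The flow of model instances through the three plugin types can be sketched as follows.
This is a minimal illustration only: plain `dict`s stand in for model instances, and the function names
`harvest_plugin`, `curate_plugin` and `deposit_plugin` as well as the example values are hypothetical,
not part of the `hermes` plugin API.

```python
# Hypothetical stand-ins: plain dicts play the role of model instances,
# and the function names are illustrative, not part of the hermes API.

def harvest_plugin() -> dict:
    """Create a single new model instance and return it."""
    return {"name": ["My Research Software"]}


def curate_plugin(model: dict) -> dict:
    """Receive one existing model instance and return one model instance."""
    curated = dict(model)
    curated.setdefault("version", ["0.1.0"])
    return curated


def deposit_plugin(model: dict) -> dict:
    """Receive the curated model instance and return one model instance."""
    deposited = dict(model)
    deposited["identifier"] = ["example-id"]
    return deposited


harvested = harvest_plugin()
# The `process` step between `harvest` and `curate` is omitted for brevity.
result = deposit_plugin(curate_plugin(harvested))
print(result)
```

Each function takes at most one instance and returns exactly one, which mirrors the contract described in the list above.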

#### How plugins work with the API

```{important}
Plugins access the data model _exclusively_ through the API class {class}`hermes.model.SoftwareMetadata`.
```

The following sections show how this class works.

##### Creating a data model instance

Model instances are primarily created in `harvest` plugins, but other plugins may also create them
as targets to map existing data into.

To create a new model instance, initialize {class}`hermes.model.SoftwareMetadata`:

```{code-block} python
:caption: Initializing a default data model instance
from hermes.model import SoftwareMetadata
data = SoftwareMetadata()
```

`SoftwareMetadata` objects initialized without arguments provide the default _context_
(see [_JSON-LD for plugin developers_](#json-ld-for-plugin-developers)).
This means that now, you can use terms from the schemas included in the default context to describe software metadata.

Terms from [_CodeMeta_](https://codemeta.github.io/terms/) can be used without a prefix:

```{code-block} python
:caption: Using terms from the default schema
data["readme"] = ...
```

Terms from [_Schema.org_](https://schema.org/) can be used with the prefix `schema`:

```{code-block} python
:caption: Using terms from a non-default schema
data["schema:copyrightNotice"] = ...
```

You can also use other linked data vocabularies. To do this, you need to identify them with a prefix and register them
with the data model by passing it `extra_vocabs` as a `dict` mapping prefixes to URLs where the vocabularies are
provided as JSON-LD:

```{code-block} python
:caption: Injecting additional schemas
from hermes.model import SoftwareMetadata
# Contents served at https://bar.net/schema.jsonld:
# {
# "@context":
# {
# "baz": "https://schema.org/Thing"
# }
# }
data = SoftwareMetadata(extra_vocabs={"foo": "https://bar.net/schema.jsonld"})
data["foo:baz"] = ...
Comment on lines +107 to +121
Contributor

This is a bit of a strange example because you're using a type as a predicate. You could use something like https://schema.org/name (/alternateName/description/image/...) instead.

```
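
To illustrate what a prefix does, the following sketch expands terms to full IRIs.
It is a simplified model of JSON-LD term expansion, not the mechanism `hermes` uses internally.
The names `PREFIXES`, `TERMS` and `expand_term` are hypothetical, and the IRIs are assumptions based on the
CodeMeta terms URL linked above and the `http://schema.org/` IRIs that appear in the expanded output later in this document.

```python
# Assumption: a tiny context with two prefixes and two unprefixed terms.
PREFIXES = {
    "schema": "http://schema.org/",
    "codemeta": "https://codemeta.github.io/terms/",
}
TERMS = {
    "name": "schema:name",
    "readme": "codemeta:readme",
}


def expand_term(term: str) -> str:
    """Expand a prefixed or unprefixed term to a full IRI."""
    # Unprefixed terms are first looked up in the term definitions.
    compact = TERMS.get(term, term)
    prefix, sep, suffix = compact.partition(":")
    if sep and prefix in PREFIXES:
        return PREFIXES[prefix] + suffix
    raise KeyError(f"Term {term!r} is not defined in the context")


print(expand_term("name"))
print(expand_term("readme"))
print(expand_term("schema:copyrightNotice"))
```

Registering an extra vocabulary amounts to adding another prefix, so that terms such as `foo:baz` can be resolved in the same way.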

##### Adding data

Once you have an instance of {class}`hermes.model.SoftwareMetadata`, you can add data to it,
i.e., metadata that describes software:

```{code-block} python
:caption: Setting data values
data["name"] = "My Research Software" # A simple "Text"-type value
# → Simplified model representation : { "name": [ "My Research Software" ] }
# Cf. "Accessing data" below
data["author"] = {"name": "Foo"} # An object value that uses terms available in the defined context
Contributor

All the foos, bars, and bazes can be quite confusing as it is never obvious what they are referring to. Could we use some more meaningful examples? "author" is a real codemeta field, so why not use a "real" author? Like Josiah Carberry, or Donald E. Knuth, or Margaret Hamilton.

# → Simplified model representation : { "name": [ "My Research Software" ], "author": [ { "name": "Foo" } ] }
# Cf. "Accessing data" below
```

##### Accessing data

You need to be able to access data in the data model instance to add, edit or remove data.
Data can be accessed by using term strings, similar to how values in Python `dict`s are accessed by keys.

```{important}
When you access data from a data model instance,
it will always be returned in a **list**-like object!
```

The reason for providing data in list-like objects is that JSON-LD treats all property values as arrays.
Even if you add "single value" data to a `hermes` data model instance via the API, the underlying JSON-LD model
will treat it as an array, i.e., a list-like object:

```{code-block} python
:caption: Internal data values are arrays
data["name"] = "My Research Software" # → [ "My Research Software" ]
data["author"] = {"name": "Foo"} # → [ { "name": [ "Foo" ] } ]
```

Therefore, you access data in the same way you would access data from a Python `list`:

1. You access single values using indices, e.g., `data["name"][0]`.
2. You can use a list-like API to interact with data objects, e.g.,
`data["name"].append("Bar")`, `data["name"].extend(["Bar", "Baz"])`, `for name in data["name"]: ...`, etc.
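
The list-like behaviour described above can be mimicked with plain Python.
The class `ListValuedDict` below is a hypothetical stand-in that only demonstrates the access pattern;
it is not how {class}`hermes.model.SoftwareMetadata` is implemented.

```python
class ListValuedDict(dict):
    """Hypothetical stand-in: wraps every assigned value in a list."""

    def __setitem__(self, key, value):
        # Single values are stored as one-element lists.
        if not isinstance(value, list):
            value = [value]
        super().__setitem__(key, value)


data = ListValuedDict()
data["name"] = "My Research Software"

print(data["name"][0])  # My Research Software

# The stored value supports the usual list operations.
data["name"].append("Bar")
data["name"].extend(["Baz"])
for name in data["name"]:
    print(name)
```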

##### Interacting with data

The following longer example shows different ways that you can interact with `SoftwareMetadata` objects and the data API.

```{code-block} python
:caption: Building the data model
from hermes.model import SoftwareMetadata
# Create the model object with the default context
data = SoftwareMetadata()
# Let's create author metadata for our software!
# Below each line of code, the value of `data["author"]` is given.
data["author"] = {"name": "Foo"}
# → [{'name': ['Foo']}]
data["author"].append({"name": "Bar"})
# → [{'name': ['Foo']}, {'name': ['Bar']}]
data["author"][0]["email"] = "foo@baz.net"
# → [{'name': ['Foo'], 'email': ['foo@baz.net']}, {'name': ['Bar']}]
data["author"][1]["email"].append("bar@baz.net")
# → [{'name': ['Foo'], 'email': ['foo@baz.net']}, {'name': ['Bar'], 'email': ['bar@baz.net']}]
data["author"][1]["email"].extend(["bar@spam.org", "bar@eggs.com"])
# → [
#     {'name': ['Foo'], 'email': ['foo@baz.net']},
#     {'name': ['Bar'], 'email': ['bar@baz.net', 'bar@spam.org', 'bar@eggs.com']}
#   ]
```

The example continues to show how to iterate through data.

```{code-block} python
:caption: for-loop, containment check
for i, author in enumerate(data["author"]):
    if author["name"][0] in ["Foo", "Bar"]:
        print(f"Author {i + 1} has expected name.")
Comment on lines +202 to +204
Contributor

You can use enumerate(..., start=1) instead of i + 1

    else:
        raise ValueError("Unexpected author name found!", author["name"][0])
# Mock output:
# $> Author 1 has expected name.
# $> Author 2 has expected name.
```
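
The `enumerate(..., start=1)` variant suggested in the review comment could look like the following sketch.
It uses a plain list of `dict`s (`authors`, a hypothetical stand-in) instead of a `SoftwareMetadata` object so that it runs on its own.

```python
# Plain lists and dicts stand in for the model data here.
authors = [{"name": ["Foo"]}, {"name": ["Bar"]}]

messages = []
# start=1 makes the counter human-readable without computing i + 1.
for position, author in enumerate(authors, start=1):
    if author["name"][0] in ["Foo", "Bar"]:
        messages.append(f"Author {position} has expected name.")
    else:
        raise ValueError("Unexpected author name found!", author["name"][0])

print("\n".join(messages))
```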

```{code-block} python
:caption: Value check
for email in data["author"][0]["email"]:
    if email.endswith(".edu"):
        print("Author has an email address at an educational institution.")
    else:
        print("Cannot confirm affiliation with educational institution for author.")
# Mock output
# $> Cannot confirm affiliation with educational institution for author.
```

```{code-block} python
:caption: Value check and list comprehension
if ["bar" in email for email in data["author"][1]["email"]]:
Contributor @SKernchen, Oct 24, 2025

Suggested change
- if ["bar" in email for email in data["author"][1]["email"]]:
+ if [data["author"][1]["name"][0].lower() in email for email in data["author"][1]["email"]]:

Collaborator

A more general solution could be:

author = data["author"][0]
if all(any(name in email for name in author["name"]) for email in author["email"]):

Contributor

Suggested change
- if ["bar" in email for email in data["author"][1]["email"]]:
+ if all(["bar" in email for email in data["author"][1]["email"]]):

I think this is what you meant

Collaborator

Yes, that is true. This line is also discussed in this comment.

    print("Author has only emails with their name in it.")
# Mock output
# $> Author has only emails with their name in it.
```
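
The review comments on this block point out that the condition tests a list, not its elements.
The following sketch shows the difference with a plain list of strings (`emails`, a stand-in for the model data):
a non-empty list is always truthy, whereas `all()` checks every element.

```python
emails = ["bar@baz.net", "bar@spam.org", "bar@eggs.com"]

# A non-empty list is truthy even if every element is False.
always_true = bool(["nope" in email for email in emails])

# all() is True only if every email contains the substring.
every_email_matches = all("bar" in email for email in emails)
no_email_matches = all("nope" in email for email in emails)

print(always_true, every_email_matches, no_email_matches)  # True True False
```

Passing a generator expression to `all()` also avoids building an intermediate list.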

The example continues to show how to assert data values.

As mentioned in the [introduction to the data model](#data-model),
`hermes` uses a JSON-LD-like internal data model.
The API class {class}`hermes.model.SoftwareMetadata` hides many
of the more complex aspects of JSON-LD and makes it easy to work
with the data model.

Assertions, however, operate on the internal model objects.
Therefore, they may not work as you would expect from plain
Python data:

```{code-block} python
:caption: Naive containment assertion that raises
:emphasize-lines: 5,13
try:
    assert (
        {'name': ['Foo'], 'email': ['foo@baz.net']}
        in
        data["author"]
    )
    print("The author was found!")
except AssertionError:
    print("The author could not be found.")
    raise
# Mock output
# $> The author could not be found.
# $> AssertionError:
#      assert
#        {'email': ['foo@baz.net'], 'name': ['Foo']}
#        in
#        _LDList(
#          {'@list': [
#            {
#              'http://schema.org/name': [{'@value': 'Foo'}],
#              'http://schema.org/email': [{'@value': 'foo@baz.net'}]
#            },
#            {
#              'http://schema.org/name': [{'@value': 'Bar'}],
#              'http://schema.org/email': [
#                {'@list': [
#                  {'@value': 'bar@baz.net'}, {'@value': 'bar@spam.org'}, {'@value': 'bar@eggs.com'}
#                ]}
#              ]
#            }]
#          }
#        )
```

The mock output in the example above shows the inequality of the expected and the actual value.
The actual value is an internal data type wrapping the more complex JSON-LD data.

The `hermes` data model constructs the complex data structure of JSON-LD internally.
To make it possible to work with only the data that is important (the actual terms
and their values), the internal data model types provide a function `.to_python()`.

Contributor

It's all Python though 👀 Would .to_dict() be a better name?

Collaborator @notactuallyfinn, Oct 24, 2025

I think .to_native() or .to_native_python() would be even better, since ld_dict is also a dict and this function also exists for ld_list.
But all suggestions do not mention that the structure is simplified in the process (similar, but not synonymous with compacting the expanded JSON-LD value).

This function can be used in assertions to assert full data integrity:

```{code-block} python
:caption: Containment assertion with `to_python()`
:emphasize-lines: 5
try:
    assert (
        {'name': ['Foo'], 'email': ['foo@baz.net']}
        in
        data["author"].to_python()
    )
    print("The author was found!")
except AssertionError:
    print("The author could not be found.")
    raise
# Mock output
# $> The author was found!
```
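
To give an intuition for what such a simplification involves, the sketch below reduces the expanded structure
from the mock output above to plain terms and values.
It is not the implementation of `.to_python()`: the helper `simplify` is hypothetical,
and it assumes that only the `http://schema.org/` prefix has to be shortened.

```python
SCHEMA = "http://schema.org/"  # Assumption: only this prefix is shortened.


def simplify(node):
    """Recursively reduce expanded JSON-LD to plain terms and values."""
    if isinstance(node, list):
        items = []
        for item in node:
            if isinstance(item, dict) and "@list" in item:
                # Unwrap nested list objects into the surrounding list.
                items.extend(simplify(item["@list"]))
            else:
                items.append(simplify(item))
        return items
    if isinstance(node, dict):
        if "@value" in node:
            return node["@value"]
        if "@list" in node:
            return simplify(node["@list"])
        return {
            (key[len(SCHEMA):] if key.startswith(SCHEMA) else key): simplify(value)
            for key, value in node.items()
        }
    return node


expanded = {"@list": [
    {
        "http://schema.org/name": [{"@value": "Foo"}],
        "http://schema.org/email": [{"@value": "foo@baz.net"}],
    },
    {
        "http://schema.org/name": [{"@value": "Bar"}],
        "http://schema.org/email": [{"@list": [
            {"@value": "bar@baz.net"},
            {"@value": "bar@spam.org"},
            {"@value": "bar@eggs.com"},
        ]}],
    },
]}

target = {"name": ["Foo"], "email": ["foo@baz.net"]}
simplified = simplify(expanded)
print(target in expanded["@list"])  # False
print(target in simplified)  # True
```

The containment check only succeeds on the simplified form, which is why the assertion above calls `.to_python()` first.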

---

## See Also

- API reference: {class}`hermes.model.SoftwareMetadata`