@Aidajafarbigloo

Added a --url option to the hermes harvest command. Now, hermes harvest harvests the metadata from the local repository, and hermes harvest --url <URL> harvests metadata from the provided URL, with support for GitHub and GitLab repositories.

(e.g., hermes harvest --url https://github.com/NFDI4Energy/SMECS)
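
As a rough illustration of the command-line behaviour described above, the following is a minimal sketch using `argparse`. It is not the actual HERMES CLI code: the `build_parser` helper and the choice of the current directory as the default path are assumptions made for this example.

```python
import argparse
import pathlib


def build_parser() -> argparse.ArgumentParser:
    """Build a toy parser with a `harvest` subcommand (illustrative only)."""
    parser = argparse.ArgumentParser(prog="hermes")
    subcommands = parser.add_subparsers(dest="command", required=True)

    harvest = subcommands.add_parser("harvest")
    # Assumption: without --url, the current working directory is harvested.
    harvest.add_argument("--path", type=pathlib.Path, default=pathlib.Path.cwd(),
                         help="local repository to harvest")
    harvest.add_argument("--url", default=None,
                         help="remote GitHub/GitLab repository to harvest instead")
    return parser


# Local harvest: no URL given.
local = build_parser().parse_args(["harvest"])

# Remote harvest: the URL from the example above.
remote = build_parser().parse_args(
    ["harvest", "--url", "https://github.com/NFDI4Energy/SMECS"]
)
```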

@Aidajafarbigloo marked this pull request as draft October 29, 2024 08:49
@Aidajafarbigloo
Author

@sferenz
Could you please take a look at this pull request and share your feedback?

@Aidajafarbigloo
Author

Harvesting metadata from the provided URL (GitHub/GitLab). Command: hermes harvest --path <URL>
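
If `--path` accepts both local directories and URLs, the command has to tell the two apart somehow. The sketch below shows one possible check based on `urllib.parse`; the helper name `looks_like_url` is hypothetical and this is not necessarily how the PR decides.

```python
from urllib.parse import urlparse


def looks_like_url(value: str) -> bool:
    """Return True if `value` is an http(s) URL rather than a local path."""
    parsed = urlparse(value)
    # A Windows path such as C:\dir parses with scheme "c", so it is rejected here.
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```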

@Aidajafarbigloo
Author

@sferenz
Could you please take a look at this pull request and share your feedback?

Member

@sferenz left a comment

Thanks for the nice code! Please have a look at the comments :)

"""Load settings from the configuration file (passed in from command line)."""

- toml_data = toml.load(args.path / args.config)
+ toml_data = toml.load("." / args.config)
Member

Does this still work if a regular path is given to HERMES?

Author

Yes, specifying a directory path containing CFF or CodeMeta files is also acceptable. For example, the following command works:

hermes harvest --path C:\path\to\your\directory



- class HarvestSettings(BaseModel):
+ class _HarvestSettings(BaseModel):
Member

Why did you rename this?

Author

I didn’t intend to change the class name. I accidentally based my copy of base.py on a newer version of the file: a recent update on the develop branch of HERMES (commit a6c1a5e) made all settings classes private.

return None


def _download_to_tempfile(url: str, filename: str) -> pathlib.Path:
Member

Do you delete the tempfiles later?

Author

In the current code the temp files for CFF and CodeMeta are stored separately in C:\Temp on the local machine.

Member

Will these files be deleted after the extraction process?

Author

The files are not deleted after harvesting. However, I can modify the code to delete the temp files after extraction. Do you agree with this change?

Member

Yes, I think the temp files should be deleted at the end of the process.

Author

I updated the code to remove the temp files after the harvesting process. Could you please have a look at the changes?

Member

Thanks!

@Aidajafarbigloo
Author

> Thanks for the nice code! Please have a look at the comments :)

@sferenz Thank you for the comments.

Add functionality to remove temp files generated during remote harvesting.
Remove temp files after harvesting CFF metadata
Remove temp files after harvesting CodeMeta metadata
@Aidajafarbigloo marked this pull request as ready for review April 14, 2025 13:55
@sferenz
Member

sferenz commented Apr 14, 2025

@sdruskat This pull request is ready to merge. Can you please assign us a reviewer?

@zyzzyxdonta self-requested a review April 25, 2025 08:12
Contributor

@zyzzyxdonta left a comment

Thanks for your work!

I had a first look and I would like to suggest a slightly different approach. I think it would be beneficial to have the --url argument that you had (as indicated by your PR description). This would allow us to do the following:

  1. Create a temporary directory
  2. Download the remote repository given by --url to this directory
  3. Overwrite args.path with the temporary directory path
  4. Run the normal harvesting step
  5. Delete the temporary directory

In this case there is no need to change anything in any of the plugins (I think). Only the base harvest command needs to worry about downloading and then deleting the files.

What do you think?
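
The five steps suggested above could be sketched roughly as follows. This is only an illustration under assumptions: `harvest_from_url`, the `download` callable, and the `harvest` callable are hypothetical stand-ins, not existing HERMES functions, and the demo replaces the real download with a function that writes a placeholder file.

```python
import pathlib
import tempfile
from types import SimpleNamespace
from typing import Callable


def harvest_from_url(args, download: Callable[[str, pathlib.Path], None],
                     harvest: Callable[[object], object]):
    """Download args.url into a temporary directory, harvest it, then clean up."""
    # Steps 1 and 5: the context manager creates the directory and deletes it on exit.
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = pathlib.Path(tmp)
        download(args.url, tmp_path)   # Step 2: fetch the remote repository
        args.path = tmp_path           # Step 3: point the harvester at the download
        return harvest(args)           # Step 4: run the normal harvesting step


# Hypothetical stand-ins so the sketch runs without network access.
def fake_download(url: str, target: pathlib.Path) -> None:
    (target / "CITATION.cff").write_text("cff-version: 1.2.0\n")


def fake_harvest(args) -> list:
    return sorted(p.name for p in args.path.iterdir())


args = SimpleNamespace(url="https://github.com/NFDI4Energy/SMECS",
                       path=pathlib.Path("."))
found = harvest_from_url(args, fake_download, fake_harvest)
```

With this shape, the plugins only ever see a local `args.path`, so they would not need to know that the data came from a URL.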

return None


def remove_temp_file(file_path: pathlib.Path, temp_dir: pathlib.Path = pathlib.Path("C:/Temp")):
Contributor

C:/Temp is Windows-specific. You could use tempfile.TemporaryDirectory and place the files in there. Then, instead of deleting the files one by one, you can call .cleanup() on the TemporaryDirectory object.
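
A minimal sketch of that suggestion, assuming the two metadata files are written into one shared temporary directory. The `hermes_` prefix and the placeholder file contents are made up for illustration.

```python
import pathlib
import tempfile

# Platform-independent temporary directory instead of a hard-coded C:/Temp.
temp_dir = tempfile.TemporaryDirectory(prefix="hermes_")
temp_root = pathlib.Path(temp_dir.name)

# Place the downloaded files inside the temporary directory (placeholder content).
for filename in ("CITATION.cff", "codemeta.json"):
    (temp_root / filename).write_text("placeholder")

stored = sorted(p.name for p in temp_root.iterdir())

# One call removes the directory and everything in it.
temp_dir.cleanup()
```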

return corrected_url.replace("https:/", "https://")


def fetch_metadata_from_repo(repo_url: str, filename: str) -> t.Optional[pathlib.Path]:
Contributor

This method makes multiple HTTP requests. I think it would be nice to use the hermes user agent, just to let the services know who we are. You can do something like:

from hermes.utils import hermes_user_agent

session = requests.Session()
session.headers.update({"User-Agent": hermes_user_agent})

then use the session to make the requests:

session.get(api_url)

@Aidajafarbigloo
Author

> Thanks for your work!
>
> I had a first look and I would like to suggest a slightly different approach. I think it would be beneficial to have the --url argument that you had (as indicated by your PR description). This would allow us to do the following:
>
>   1. Create a temporary directory
>   2. Download the remote repository given by --url to this directory
>   3. Overwrite args.path with the temporary directory path
>   4. Run the normal harvesting step
>   5. Delete the temporary directory
>
> In this case there is no need to change anything in any of the plugins (I think). Only the base harvest command needs to worry about downloading and then deleting the files.
>
> What do you think?

Thanks! I’ve implemented this approach and am testing it with a few different repositories.
Quick note: cloning large repositories takes too long. It would be good to replace full clones with a shallow clone later, since we only need to check for CITATION.cff or codemeta.json.
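
One way the shallow clone could look, as a sketch only: it assumes the `git` executable is available on the PATH, and the helper names `shallow_clone_command` and `shallow_clone` are hypothetical. `--depth 1` limits the clone to the most recent commit, which is usually much faster than a full clone for large repositories.

```python
import pathlib
import subprocess
from typing import List


def shallow_clone_command(url: str, target: pathlib.Path) -> List[str]:
    """Build the git command for a clone that fetches only the latest commit."""
    return ["git", "clone", "--depth", "1", "--quiet", url, str(target)]


def shallow_clone(url: str, target: pathlib.Path) -> None:
    """Run the shallow clone; raises CalledProcessError if git fails."""
    subprocess.run(shallow_clone_command(url, target), check=True)
```

After the clone, only `target / "CITATION.cff"` and `target / "codemeta.json"` would need to be checked.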

3 participants