Feature/276 harvesting metadata from a provided repository url #278
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
@sferenz
Harvesting metadata from the provided URL (GitHub/GitLab). Command:
@sferenz
sferenz left a comment
Thanks for the nice code! Please have a look at the comments :)
src/hermes/commands/base.py
Outdated
  """Load settings from the configuration file (passed in from command line)."""

- toml_data = toml.load(args.path / args.config)
+ toml_data = toml.load("." / args.config)
Does this still work if a regular path is given to HERMES?
Yes, specifying a directory path containing CFF or CodeMeta files is also acceptable. For example, the following command works:
hermes harvest --path C:\path\to\your\directory
- class HarvestSettings(BaseModel):
+ class _HarvestSettings(BaseModel):
Why did you rename this?
I didn’t intend to change the class name. There was an issue with incorrectly pulling the original code for base.py from an updated version of it. This occurred due to a recent update in the settings classes, where all were made private in the develop branch of HERMES (commit a6c1a5e).
    return None


def _download_to_tempfile(url: str, filename: str) -> pathlib.Path:
Do you delete the tempfiles later?
In the current code the temp files for CFF and CodeMeta are stored separately in C:\Temp on the local machine.
Will these files be deleted after the extraction process?
The files are not currently deleted after harvesting. However, I can modify the code to delete the temp files after extraction. Do you agree with this change?
Yes, I think the temp files should be deleted at the end of the process.
I updated the code to remove the temp files after the harvesting process. Could you please have a look at the changes?
Thanks!
@sferenz Thank you for the comments.
Add functionality to remove temp files generated during remote harvesting.
Remove temp files after harvesting CFF metadata
Remove temp files after harvesting CodeMeta metadata
…rovided-repository-URL' to incorporate the recent updates
To support repository URL as a path
@sdruskat This pull request is ready to merge. Could you please assign us a reviewer?
Thanks for your work!
I had a first look and I would like to suggest a slightly different approach. I think it would be beneficial to have the --url argument that you had (as indicated by your PR description). This would allow us to do the following:
- Create a temporary directory
- Download the remote repository given by --url to this directory
- Overwrite args.path with the temporary directory path
- Run the normal harvesting step
- Delete the temporary directory
In this case there is no need to change anything in any of the plugins (I think). Only the base harvest command needs to worry about downloading and then deleting the files.
What do you think?
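The steps suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: `harvest_from_url`, `download` and `harvest` are hypothetical names that do not exist in HERMES, and the stand-in callables avoid any network access.

```python
import pathlib
import tempfile
from types import SimpleNamespace


def harvest_from_url(args, download, harvest):
    """Download args.url into a temporary directory, then harvest from it.

    `download(url, target_dir)` and `harvest(args)` are hypothetical callables
    standing in for the real download and harvesting steps.
    """
    original_path = args.path
    # The context manager deletes the directory and its contents on exit,
    # even if downloading or harvesting raises.
    with tempfile.TemporaryDirectory() as tmp:
        args.path = pathlib.Path(tmp)  # overwrite args.path
        try:
            download(args.url, args.path)
            return harvest(args)  # run the normal harvesting step
        finally:
            args.path = original_path  # restore the caller's path


# Usage with stand-in callables (no network access):
def fake_download(url, target_dir):
    (target_dir / "CITATION.cff").write_text("title: demo\n")


def fake_harvest(args):
    return args.path, (args.path / "CITATION.cff").read_text()


args = SimpleNamespace(path=pathlib.Path("."), url="https://example.org/repo")
used_dir, content = harvest_from_url(args, fake_download, fake_harvest)
```

With this shape, the plugins only ever see a local directory in `args.path`, which is why they would not need to change.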
    return None


def remove_temp_file(file_path: pathlib.Path, temp_dir: pathlib.Path = pathlib.Path("C:/Temp")):
C:/Temp is windows-specific. You could use tempfile.TemporaryDirectory and place the files in there. Then, instead of deleting the files one by one, you can use .cleanup() on the TemporaryDirectory object.
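A minimal sketch of that suggestion, using an explicit `TemporaryDirectory` object and a single `.cleanup()` call. The file names and contents are illustrative placeholders, not what the PR actually downloads.

```python
import pathlib
import tempfile

# Platform-independent temporary directory (no hard-coded C:/Temp).
temp_dir = tempfile.TemporaryDirectory()
temp_path = pathlib.Path(temp_dir.name)

# Illustrative file names; the actual downloaded files may differ.
cff_file = temp_path / "CITATION.cff"
codemeta_file = temp_path / "codemeta.json"
cff_file.write_text("cff-version: 1.2.0\n")
codemeta_file.write_text("{}\n")

files_before = sorted(p.name for p in temp_path.iterdir())

# One call removes the directory and everything in it,
# so the files do not have to be deleted one by one.
temp_dir.cleanup()
```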
    return corrected_url.replace("https:/", "https://")


def fetch_metadata_from_repo(repo_url: str, filename: str) -> t.Optional[pathlib.Path]:
This method makes multiple HTTP requests. I think it would be nice to use the hermes user agent, just to let the services know who we are. You can do something like:
from hermes.utils import hermes_user_agent
session = requests.Session()
session.headers.update({"User-Agent": hermes_user_agent})
Then use the session to make the requests:
session.get(api_url)
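For comparison, the same idea (set the user agent once, reuse it for every request) can be sketched with only the Python standard library instead of a `requests` session. The `USER_AGENT` value and the `make_opener` helper are placeholders for illustration and are not part of HERMES.

```python
import urllib.request

# Placeholder value; in HERMES this would come from hermes.utils.hermes_user_agent.
USER_AGENT = "hermes-example/0.0"


def make_opener(user_agent: str) -> urllib.request.OpenerDirector:
    """Return an opener that sends the given User-Agent with every request."""
    opener = urllib.request.build_opener()
    # Replaces the default "Python-urllib/x.y" user agent.
    opener.addheaders = [("User-Agent", user_agent)]
    return opener


opener = make_opener(USER_AGENT)
# opener.open(api_url) would now send the header with each request.
```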
Thanks! I’ve implemented this approach and am testing it with a few different repositories.
Added the URL option to the hermes harvest command. Now, the command hermes harvest harvests the metadata from the local repository, and hermes harvest --url <URL> harvests metadata from the provided URL, with support for GitHub and GitLab repositories (e.g., hermes harvest --url https://github.com/NFDI4Energy/SMECS).