Merged
Conversation
The valid variables file "pyearthtools.data.download.arco.variables.ERA5.valid" is only used in this context (i.e. validate ARCOERA5.__init__ inputs) and can be replaced by a list variable. Using this, I regrouped it in the same folder as the valid ERA5 valid pressure levels - which were alread a variable and not a .valid file - and removed a level of folder nesting which wasn't needed anymore.
Global variables introduce an hidden side channel that might not be desirable. I understand here that is was use to save loading time (it took 1m44s to open the full dataset for me). This could be replaced by using functools.cache, that properly reload if the kwargs have changed, or by fixing the loading time by dropping the variables not needed.
ERA_NAME_CHANGE contains all variables names, long (keys) and short (values), so it can readily be used to validated the input of ARCOERA5, no need to duplicate that list. In addition, the reverse mapping can be precomputed and use as is, instead of introducing a function (with a nested function) for this.
This can massively accelerate loading when user wants few variables. On my machine, it takes few seconds to load 1 variable, instead of 2 minutes to load the whole dataset (before filtering).
I renamed the model to a more explicict arcoera5 as ERA5 is the only provided dataset. Consequently, the global variables used for validation and name mapping were also renamed as they are only needed for ERA5, so this prefix can be dropped. I made the variables private as I don't think this implementation aspect should be exposed to the user, when they load this single module.
Pull Request Test Coverage Report for Build 15316217872Details
💛 - Coveralls |
The slow tests will load all variables, which will takes more time and much more memory (~18GB).
Collaborator
Author
|
@tennlee I have added tests and the docs to the API docs. The PR is ready :-). |
tennlee
reviewed
May 27, 2025
Collaborator
|
I have taken a brief look, the direction seems fine. I will spend some more time with it over the next few days and may have some additional feedback based on that. |
tennlee
reviewed
Jun 1, 2025
tennlee
reviewed
Jun 1, 2025
tennlee
reviewed
Jun 1, 2025
tennlee
reviewed
Jun 1, 2025
Collaborator
Author
|
@tennlee I think I have addressed the only action point following our discussion, i.e. renaming the pytest mark. Regarding the failing pre-commit, I don't manage to trigger it locally and it is related to a file I have not touched, |
Collaborator
|
Looks good. I'll investigate the pre-commit hook separately. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR implements a refactoring of the
arcomodule, with the following changes:open_zarrcallkwargschangedfunctools.cacheorjoblib.Memory) if really needed, knowing that the full dataset takes about 18GB of memory once openedMissing elements to complete to get the PR ready: