Skip to content

Comments

Refactor arco#117

Merged
tennlee merged 22 commits intoACCESS-Community-Hub:developfrom
jennan:refactor_arco
Jun 3, 2025
Merged

Refactor arco#117
tennlee merged 22 commits intoACCESS-Community-Hub:developfrom
jennan:refactor_arco

Conversation

@jennan
Copy link
Collaborator

@jennan jennan commented May 26, 2025

This PR implements a refactoring of the arco module, with the following changes:

  • remove dead code (commented out or unused helper functions) and simplification remove the need for helper function (i.e. pre-generate the inverse mapping from short variable names to long variable names)
  • move all code and validation variables in one module, as the simplifications make the code more compact and still readable in one module with less than 500 lines
  • accelerate loading of few variables by dropping all other variables in open_zarr call
    • loading the full dataset takes (all levels) takes 1min30s
    • loading 1 variables (all levels) takes 2s
  • drop of the global variable singleton approach
    • tend to avoid global variables as they create side-channels not visible for a user and I fell this is not needed if the goal was to make loading faster, in particular in a case of loading few variables
    • the original implementation wasn't re-opening the dataset if the kwargs changed
    • this could be replace by caching (either functools.cache or joblib.Memory) if really needed, knowing that the full dataset takes about 18GB of memory once opened

Missing elements to complete to get the PR ready:

  • add tests
  • add documentation at the module level

jennan added 11 commits May 22, 2025 21:40
The valid variables file "pyearthtools.data.download.arco.variables.ERA5.valid"
is only used in this context (i.e. validate ARCOERA5.__init__ inputs) and can
be replaced by a list variable. Using this, I regrouped it in the same folder
as the valid ERA5 valid pressure levels - which were alread a variable and not
a .valid file - and removed a level of folder nesting which wasn't needed
anymore.
Global variables introduce an hidden side channel that might not be desirable.
I understand here that is was use to save loading time (it took 1m44s to open
the full dataset for me). This could be replaced by using functools.cache, that
properly reload if the kwargs have changed, or by fixing the loading time by
dropping the variables not needed.
ERA_NAME_CHANGE contains all variables names, long (keys) and short (values),
so it can readily be used to validated the input of ARCOERA5, no need to
duplicate that list.

In addition, the reverse mapping can be precomputed and use as is, instead of
introducing a function (with a nested function) for this.
This can massively accelerate loading when user wants few variables. On my
machine, it takes few seconds to load 1 variable, instead of 2 minutes to load
the whole dataset (before filtering).
I renamed the model to a more explicict arcoera5 as ERA5 is the only provided
dataset. Consequently, the global variables used for validation and name
mapping were also renamed as they are only needed for ERA5, so this prefix can
be dropped. I made the variables private as I don't think this implementation
aspect should be exposed to the user, when they load this single module.
@jennan jennan requested review from millerjoel and tennlee May 26, 2025 02:23
@jennan jennan self-assigned this May 26, 2025
@coveralls
Copy link

coveralls commented May 26, 2025

Pull Request Test Coverage Report for Build 15316217872

Details

  • 93 of 98 (94.9%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.3%) to 61.117%

Changes Missing Coverage Covered Lines Changed/Added Lines %
packages/data/tests/download/test_arcoera5.py 51 53 96.23%
packages/data/src/pyearthtools/data/download/arcoera5.py 39 42 92.86%
Totals Coverage Status
Change from base Build 15313836174: 0.3%
Covered Lines: 9937
Relevant Lines: 15805

💛 - Coveralls

@jennan
Copy link
Collaborator Author

jennan commented May 27, 2025

@tennlee I have added tests and the docs to the API docs. The PR is ready :-).

@jennan jennan changed the title [draft] Refactor arco Refactor arco May 27, 2025
@tennlee
Copy link
Collaborator

tennlee commented May 27, 2025

I have taken a brief look, the direction seems fine. I will spend some more time with it over the next few days and may have some additional feedback based on that.

@jennan
Copy link
Collaborator Author

jennan commented Jun 3, 2025

@tennlee I think I have addressed the only action point following our discussion, i.e. renaming the pytest mark. Regarding the failing pre-commit, I don't manage to trigger it locally and it is related to a file I have not touched, packages/tutorial/src/pyearthtools/tutorial/ERA5DataClass.py so let me know if this should still be fixed as part of this PR.

@tennlee
Copy link
Collaborator

tennlee commented Jun 3, 2025

Looks good. I'll investigate the pre-commit hook separately.

@tennlee tennlee merged commit 9bad9ba into ACCESS-Community-Hub:develop Jun 3, 2025
6 of 7 checks passed
@jennan jennan deleted the refactor_arco branch June 4, 2025 02:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants