Refactor arco by jennan · Pull Request #117 · ACCESS-Community-Hub/PyEarthTools

jennan · 2025-05-26T02:23:13Z

This PR implements a refactoring of the arco module, with the following changes:

remove dead code (commented out or unused helper functions) and simplification remove the need for helper function (i.e. pre-generate the inverse mapping from short variable names to long variable names)
move all code and validation variables in one module, as the simplifications make the code more compact and still readable in one module with less than 500 lines
accelerate loading of few variables by dropping all other variables in open_zarr call
- loading the full dataset takes (all levels) takes 1min30s
- loading 1 variables (all levels) takes 2s
drop of the global variable singleton approach
- tend to avoid global variables as they create side-channels not visible for a user and I fell this is not needed if the goal was to make loading faster, in particular in a case of loading few variables
- the original implementation wasn't re-opening the dataset if the kwargs changed
- this could be replace by caching (either functools.cache or joblib.Memory) if really needed, knowing that the full dataset takes about 18GB of memory once opened

Missing elements to complete to get the PR ready:

add tests
add documentation at the module level

The valid variables file "pyearthtools.data.download.arco.variables.ERA5.valid" is only used in this context (i.e. validate ARCOERA5.__init__ inputs) and can be replaced by a list variable. Using this, I regrouped it in the same folder as the valid ERA5 valid pressure levels - which were alread a variable and not a .valid file - and removed a level of folder nesting which wasn't needed anymore.

Global variables introduce an hidden side channel that might not be desirable. I understand here that is was use to save loading time (it took 1m44s to open the full dataset for me). This could be replaced by using functools.cache, that properly reload if the kwargs have changed, or by fixing the loading time by dropping the variables not needed.

ERA_NAME_CHANGE contains all variables names, long (keys) and short (values), so it can readily be used to validated the input of ARCOERA5, no need to duplicate that list. In addition, the reverse mapping can be precomputed and use as is, instead of introducing a function (with a nested function) for this.

This can massively accelerate loading when user wants few variables. On my machine, it takes few seconds to load 1 variable, instead of 2 minutes to load the whole dataset (before filtering).

I renamed the model to a more explicict arcoera5 as ERA5 is the only provided dataset. Consequently, the global variables used for validation and name mapping were also renamed as they are only needed for ERA5, so this prefix can be dropped. I made the variables private as I don't think this implementation aspect should be exposed to the user, when they load this single module.

coveralls · 2025-05-26T02:29:29Z

Pull Request Test Coverage Report for Build 15316217872

Details

93 of 98 (94.9%) changed or added relevant lines in 4 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.3%) to 61.117%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
packages/data/tests/download/test_arcoera5.py	51	53	96.23%
packages/data/src/pyearthtools/data/download/arcoera5.py	39	42	92.86%

Totals
Change from base Build 15313836174:	0.3%
Covered Lines:	9937
Relevant Lines:	15805

💛 - Coveralls

The slow tests will load all variables, which will takes more time and much more memory (~18GB).

jennan · 2025-05-27T05:07:57Z

@tennlee I have added tests and the docs to the API docs. The PR is ready :-).

.github/workflows/python-app.yml

tennlee · 2025-05-27T06:48:59Z

I have taken a brief look, the direction seems fine. I will spend some more time with it over the next few days and may have some additional feedback based on that.

packages/data/src/pyearthtools/data/download/arcoera5.py

jennan · 2025-06-03T04:59:56Z

@tennlee I think I have addressed the only action point following our discussion, i.e. renaming the pytest mark. Regarding the failing pre-commit, I don't manage to trigger it locally and it is related to a file I have not touched, packages/tutorial/src/pyearthtools/tutorial/ERA5DataClass.py so let me know if this should still be fixed as part of this PR.

tennlee · 2025-06-03T05:32:29Z

Looks good. I'll investigate the pre-commit hook separately.

jennan added 11 commits May 22, 2025 21:40

remove dead code and unused imports

a4bad51

remove unused function

e7d952f

drop unused variables instead of filtering after loading everything

38ba4fa

This can massively accelerate loading when user wants few variables. On my machine, it takes few seconds to load 1 variable, instead of 2 minutes to load the whole dataset (before filtering).

reformat with ruff

7d1083d

use set to lower complexity of repeatedly searching in an list

da92f92

small reformating

e72d137

load all variables by default, expose variables and levels in public API

e12023a

jennan requested review from millerjoel and tennlee May 26, 2025 02:23

jennan self-assigned this May 26, 2025

Merge branch 'ACCESS-Community-Hub:develop' into refactor_arco

33fac3f

jennan added 7 commits May 26, 2025 02:55

fix formatting error making pre-commit fail

e938eba

add tests for ARCOERA5 class, mark heavy ones as slow

43337a3

The slow tests will load all variables, which will takes more time and much more memory (~18GB).

add test to check variable renaming

aa4d87c

fix formatting

f15a031

exclude slow tests from CI workflow

dbb1f05

add ARCOERA5 class to the documentation

d6058a9

add arcoera5 constants to the API documentation

95635ff

jennan changed the title ~~[draft] Refactor arco~~ Refactor arco May 27, 2025

tennlee reviewed May 27, 2025

View reviewed changes

.github/workflows/python-app.yml Outdated Show resolved Hide resolved

jennan added 2 commits May 27, 2025 07:57

remove space in flag to skip slow tests

b5e3a19

Merge branch 'ACCESS-Community-Hub:develop' into refactor_arco

d1eeba9

tennlee reviewed Jun 1, 2025

View reviewed changes

packages/data/src/pyearthtools/data/download/arcoera5.py Show resolved Hide resolved

tennlee reviewed Jun 1, 2025

View reviewed changes

packages/data/src/pyearthtools/data/download/arcoera5.py Show resolved Hide resolved

tennlee reviewed Jun 1, 2025

View reviewed changes

packages/data/src/pyearthtools/data/download/arcoera5.py Show resolved Hide resolved

tennlee reviewed Jun 1, 2025

View reviewed changes

packages/data/src/pyearthtools/data/download/arcoera5.py Show resolved Hide resolved

replace "slow" pytest mark with more explicit "noci" mark

3f6f334

tennlee merged commit 9bad9ba into ACCESS-Community-Hub:develop Jun 3, 2025
6 of 7 checks passed

jennan deleted the refactor_arco branch June 4, 2025 02:32

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Comments

Refactor arco#117

Refactor arco#117
tennlee merged 22 commits intoACCESS-Community-Hub:developfrom
jennan:refactor_arco

jennan commented May 26, 2025

Uh oh!

coveralls commented May 26, 2025 •

edited

Loading

Uh oh!

jennan commented May 27, 2025

Uh oh!

Uh oh!

tennlee commented May 27, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jennan commented Jun 3, 2025

Uh oh!

tennlee commented Jun 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Comments

Conversation

jennan commented May 26, 2025

Uh oh!

coveralls commented May 26, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 15316217872

Details

💛 - Coveralls

Uh oh!

jennan commented May 27, 2025

Uh oh!

Uh oh!

tennlee commented May 27, 2025

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jennan commented Jun 3, 2025

Uh oh!

tennlee commented Jun 3, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coveralls commented May 26, 2025 •

edited

Loading