[DRAFT] Feature/hadisd dataset integration#127
tennlee merged 73 commits into ACCESS-Community-Hub:develop from
Conversation
This is looking good @millerjoel. Some general thoughts from running through the notebooks. Those are just my random thoughts, so don't feel you have to act on all or any of them.
Thanks @stevehadd! I think I've addressed most of your comments, but will leave some things for another PR. The reason for not having some of the pipeline steps as part of the core library was to demonstrate a way that custom steps can be easily added to a pipeline. In the future I will likely make a separate PR to have some of these steps exist as part of PET.
I think we can merge it as-is, but there are some improvements I'd suggest making.
It's going through the conversion now; I'll add further comments once that's done.
```python
    }

    def __init__(
        self,
        station: str | list[str] | None = None,  # Allow single station, multiple stations, or None
```
It should be possible to override the base directory for users who have put the data in a different location outside of their home directory.
```python
        self,
        station: str | list[str] | None = None,  # Allow single station, multiple stations, or None
        variables: list[str] | str | None = None,
        *,
```
The optional arguments should all be after the "*" in the constructor (probably)
```python
        # Returns:
        #     list[str]: A list of all station IDs.
        # """
        # root_directory = Path(root_directory)
```
Please strip out the commented code if it's not a part of the PR
```python
            raise DataNotFoundError(f"Root directory does not exist: {root_directory}")
        station_ids = []
        for file in cached_iterdir(root_directory):
            if file.suffix == ".nc":
```
This seems to be getting the station ids from the netcdf files, not the zarr files. Does this mean people need to store both the netcdf AND the zarr files?
```python
        # return station_ids
```
```python
    def get_all_station_ids(self, root_directory: Path | str | None = None) -> list[str]:
```
Here it says root_directory defaults to HADISD_HOME/netcdf, but in the 'filesystem' method it is called with a first argument of just HADISD_HOME, leading to inconsistencies.
```python
        # Retrieve all station IDs from the dataset directory if "all" is present
        if "all" in station_ids:
            station_ids = self.get_all_station_ids(HADISD_HOME)
```
This is where the class calls get_all_station_ids without the '/netcdf' part, and it's looking through the netcdf files rather than the zarr files to determine available station numbers
```python
    (50000, 79999, "WMO_050000-079999"),
    (80000, 99999, "WMO_080000-099999"),
    (100000, 149999, "WMO_100000-149999"),
    (150000, 199999, "WMO_150000-199999"),
```
When unpacking the zarr files, why not just put everything in a single directory? Do the directories help particularly?
Stephen made the same comment, actually. There is no particular reason for this, and your suggestion is sensible. I think it was purely done to break up the download into smaller chunks.
Hi everyone. As discussed, I will shortly be merging this code, and this will close the pull request. However, the various comments still need to be addressed. Joel has agreed to continue to work through the feedback, but the code is also useful enough to offer value to users as-is.
Added a HadISD data accessor, along with some custom pipeline steps and a notebook to demonstrate the use of the accessor.