[DRAFT] Feature/hadisd dataset integration #127

Merged
tennlee merged 73 commits into ACCESS-Community-Hub:develop from millerjoel:feature/hadisd-dataset-integration
Jul 15, 2025
Conversation

@millerjoel
Collaborator

Added a HadISD data accessor along with some custom pipeline steps and a notebook to demonstrate the use of the accessor

@stevehadd
Collaborator

This is looking good @millerjoel . Some general thoughts from running through the notebooks.

  • Some of the cells in the notebooks have quite a lot in each cell. I usually find it more readable for each cell to be focused on one thing. For example, I'd suggest not having import statements in the same cell as the code that then executes functions from the imported modules. It's also generally helpful from a developer point of view: you often want to rerun something, and having smaller cells makes it easier to rerun only the bit you are interested in.
  • Although the data source divides the data into 4 files, that's not a fundamental division, just a convenient one so the individual files for download are not too big. I'd suggest that once you've processed it into zarr, there's no reason to maintain that division, so maybe consider having a single "HadISD" zarr at the top level of the dataset (i.e. ~/HadISD/zarr/), which better maps to it conceptually being a single dataset, rather than 4 separate zarr archives. That way you should also avoid the need for the load_combined_dataset function, as you will just point at the single zarr archive directory, and pyearthtools/xarray will take care of the rest.
  • I think you've told me this already, but what was the reasoning for defining the functions in the notebook rather than in the core pyearthtools python libraries? It seems like these would be useful functions to be able to reuse, and it would be easier to do so if they were in the python library. @tennlee perhaps you could comment architecturally? I'm not saying one way or the other is right, just trying to understand why the current approach was chosen, to inform my own future contributions.
  • I'm sure it's already on your todo list, but make sure you delete all the commented-out code before you merge this PR.
  • In the xgboost notebook (Tutorial with CNN training #3), where you specify X_train, y_train etc., you could combine this with the previous cell, i.e. have (X_train, y_train), (X_test, y_test) = data_prep_pipe["1969-01-01T00"]
  • Before merging you should also rerun the notebooks in the order of the cells, so that the committed version has example output. That way people can look at the output on GitHub and know what to expect when they rerun the notebook.
  • Maybe something for a follow-up, but as part of the purpose of demonstrating the dataset, it might be nice to have a more general data exploration notebook, as one would usually do some data exploration before training an ML model, so as to show people what the dataset contains. In this case we were interested from a data QC perspective, but the idea is that this dataset is more broadly applicable, so it would be good to show that in a notebook.

Those are just my random thoughts; don't feel you have to act on all or any of them.
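The single-zarr layout suggested above could be sketched as follows (a minimal sketch: the default path, the `consolidated` flag, and the `open_hadisd` name are assumptions, not the accessor's actual API):

```python
from pathlib import Path

HADISD_HOME = Path.home() / "HadISD"  # assumed default dataset location

def open_hadisd(home: Path = HADISD_HOME):
    """Open the whole HadISD dataset from one consolidated zarr store.

    With a single ~/HadISD/zarr/ archive there is no need for a separate
    load_combined_dataset() helper; xarray opens the whole store directly.
    """
    import xarray as xr  # deferred import so the sketch stays self-contained
    return xr.open_zarr(home / "zarr", consolidated=True)
```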

@millerjoel
Collaborator Author

Thanks @stevehadd! I think I've addressed most of your comments, but will leave some things for another PR. The reason for not having some of the pipeline steps as part of the core library was to demonstrate a way that custom steps can be easily added to a pipeline. In the future I will likely make a separate PR to have some of these steps exist as part of PET.
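The idea of user-defined pipeline steps can be illustrated framework-agnostically (the real pyearthtools step interface may differ; `make_pipeline` and `drop_missing` are illustrative names, not library API):

```python
from typing import Callable

# A custom pipeline step is just a callable from data to data, so users can
# compose their own steps without modifying the core library.
Step = Callable[[dict], dict]

def make_pipeline(*steps: Step) -> Step:
    """Compose steps left to right into a single callable."""
    def run(data: dict) -> dict:
        for step in steps:
            data = step(data)
        return data
    return run

def drop_missing(data: dict) -> dict:
    """Hypothetical custom QC step: remove variables with no data."""
    return {k: v for k, v in data.items() if v is not None}
```

Usage would then be e.g. `make_pipeline(drop_missing)(raw_record)`.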

@tennlee
Collaborator

tennlee commented Jul 11, 2025

I think we can merge it as-is but there are some improvements I'd suggest making.

  • Firstly, for downloading, it should be possible to select all data at once rather than manually selecting each time range one by one.
  • Secondly, I think there is a confusion in the downloader between the zip filename and the tar filename. I think the incremental downloader is checking the tar filename when it should be checking the zip filename.
  • Thirdly, in the converter, it currently raises a lot of warning messages, which are distracting given they are raised for every small netcdf file - I would suggest capturing them and dropping all except the first one.
  • Consider adding a progress bar to the converter, or a comment on how long it's expected to take
  • I tried this on the entire data set. There are something like 6500 individual station files as far as I can tell, and I think it's doing something like 100-200 per minute, so it's going to take 30-60 minutes I guess. It's also maxing out around 112% CPU, so it's not very multithreaded. I think the notebook needs to explain to people what's involved, and maybe the progress bar is warranted. Alternatively, put a stronger recommendation to start with a single file for experimentation to avoid people trying to do too much at once.
  • I did my initial download with wget, which resulted in some slight differences. In the end I just unzipped and untarred the files at the command-line as a result. That's okay because I know I wasn't following the notebook, but I thought I'd mention it. I also ended up with some corrupted gz files, and I'm not sure which. I wonder how much space the whole dataset takes up as a parquet file, which might be much more efficient than gz in this case. I might try a test later, but it could be that making a .pq distribution file of the dataset and hosting it would lower the barrier significantly for users doing the initial setup.

It's going through the conversion now, I'll add further comments post-conversion once that's done.
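The warning de-duplication and coarse progress reporting suggested above could look like this (a stdlib-only sketch; `convert_one` stands in for the per-file conversion, and the names are hypothetical):

```python
import warnings

def convert_all(files, convert_one):
    """Convert each station file, surfacing each distinct warning only once."""
    seen = set()
    files = list(files)
    for i, path in enumerate(files, 1):
        # Capture everything the per-file conversion warns about.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            convert_one(path)
        for w in caught:
            msg = str(w.message)
            if msg not in seen:          # re-emit only the first occurrence
                seen.add(msg)
                warnings.warn(msg)
        if i % 500 == 0:                 # coarse progress indicator
            print(f"converted {i}/{len(files)} files")
```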

def __init__(
self,
station: str | list[str] | None = None, # Allow single station, multiple stations, or None
Collaborator


It should be possible to override the base directory for users who have put the data in a different location outside of their home directory

self,
station: str | list[str] | None = None, # Allow single station, multiple stations, or None
variables: list[str] | str | None = None,
*,
Collaborator


The optional arguments should all be after the "*" in the constructor (probably)
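For reference, arguments placed after a bare `*` can only be passed by keyword, so moving all the optional ones behind it makes call sites self-documenting (illustrative names, not the accessor's actual signature):

```python
class Accessor:
    def __init__(
        self,
        station=None,
        *,                    # everything below is keyword-only
        variables=None,
        transforms=None,      # hypothetical optional argument
    ):
        self.station = station
        self.variables = variables
        self.transforms = transforms

# variables must now be spelled out at the call site:
acc = Accessor("010010", variables=["temperatures"])
```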

# Returns:
# list[str]: A list of all station IDs.
# """
# root_directory = Path(root_directory)
Collaborator


Please strip out the commented code if it's not a part of the PR

raise DataNotFoundError(f"Root directory does not exist: {root_directory}")
station_ids = []
for file in cached_iterdir(root_directory):
if file.suffix == ".nc":
Collaborator


This seems to be getting the station ids from the netcdf files, not the zarr files. Does this mean people need to store both the netcdf AND the zarr files?

# return station_ids


def get_all_station_ids(self, root_directory: Path | str = None) -> list[str]:
Collaborator


Here it says root_directory defaults to HADISD_HOME/netcdf, but in the 'filesystem' method it is called with a first argument of just HADISD_HOME, leading to inconsistencies.


# Retrieve all station IDs from the dataset directory if "all" is present
if "all" in station_ids:
station_ids = self.get_all_station_ids(HADISD_HOME)
Collaborator


This is where the class calls get_all_station_ids without the '/netcdf' part, and it's looking through the netcdf files rather than the zarr files to determine available station numbers

(50000, 79999, "WMO_050000-079999"),
(80000, 99999, "WMO_080000-099999"),
(100000, 149999, "WMO_100000-149999"),
(150000, 199999, "WMO_150000-199999"),
Collaborator


When unpacking the zarr files, why not just put everything in a single directory? Do the directories help particularly?

Collaborator Author


Stephen made the same comment actually. There is no reason for this, and I think your suggestion is sensible. I think it was purely done to break up the download into smaller chunks.

@tennlee
Copy link
Collaborator

tennlee commented Jul 15, 2025

Hi everyone. As discussed, I will shortly be merging this code and this will close the pull request. However, the various comments still need to be addressed. Joel has agreed to continue to work through the feedback, but the code is also useful enough to offer value to users as-is.

@tennlee tennlee marked this pull request as ready for review July 15, 2025 08:41
@tennlee tennlee merged commit 4e1433f into ACCESS-Community-Hub:develop Jul 15, 2025
5 of 6 checks passed

Labels

enhancement New feature or request

5 participants