[DRAFT] Feature/hadisd dataset integration #127

Merged
tennlee merged 73 commits into ACCESS-Community-Hub:develop from millerjoel:feature/hadisd-dataset-integration
Jul 15, 2025
Conversation

@millerjoel
Collaborator

Added a HadISD data accessor along with some custom pipeline steps and a notebook to demonstrate the use of the accessor

@stevehadd
Collaborator

This is looking good @millerjoel . Some general thoughts from running through the notebooks.

  • Some of the cells in the notebooks have quite a lot in each cell. I usually find it more readable for each cell to be focused on one thing. For example, I'd suggest not having import statements in the same cell as the code that then executes functions from the imported modules. It's also generally helpful from a developer point of view: you often want to rerun something, and having smaller cells makes it easier to rerun only the bit you are interested in.
  • Although the data source divides the data into 4 files, that's not a fundamental division, just a convenient one so the individual files for download are not too big. I'd suggest that once you've processed it into zarr, there's no reason to maintain that division, so maybe consider having a single "HadISD" zarr at the top level of the dataset (i.e. ~/HadISD/zarr/), which better maps to it conceptually being a single dataset, rather than 4 separate zarr archives. That way you should also avoid the need for the load_combined_dataset function, as you will just point at the single zarr archive directory, and pyearthtools/xarray will take care of the rest.
  • I think you've told me this already, but what was the reasoning for defining the functions in the notebook rather than in the core pyearthtools python libraries? It seems like these would be useful functions to be able to reuse, and it would be easier to do so if they were in the python library. @tennlee perhaps you could comment architecturally? I'm not saying one way or the other is right, just trying to understand why the current approach was chosen, to inform my own future contributions.
  • I'm sure it's already on your todo list, but make sure you delete all the commented-out code before you merge this PR.
  • In the xgboost notebook (Tutorial with CNN training #3), where you specify X_train, y_train etc., you could combine this with the previous cell, i.e. have (X_train, y_train), (X_test, y_test) = data_prep_pipe["1969-01-01T00"]
  • Before merging you should also rerun the notebooks in the order of the cells, so that the committed version has example output. That way people can look at the output on GitHub and know what to expect when they rerun the notebook.
  • Maybe something for a follow-up, but as part of the purpose of demonstrating the dataset, it might be nice to have a more general data exploration notebook, as one would usually do some data exploration before training an ML model, so as to show people what the dataset contains. In this case we were interested from a data QC perspective, but the idea is that this dataset is more broadly applicable, so it would be good to show that in a notebook.

Those are just my random thoughts; don't feel you have to act on all or any of them.
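The single-zarr layout suggested above could be sketched as follows (a minimal sketch: the default path, the `consolidated` flag, and the `open_hadisd` name are assumptions, not the accessor's actual API):

```python
from pathlib import Path

HADISD_HOME = Path.home() / "HadISD"  # assumed default dataset location

def open_hadisd(home: Path = HADISD_HOME):
    """Open the whole HadISD dataset from one consolidated zarr store.

    With a single ~/HadISD/zarr/ archive there is no need for a separate
    load_combined_dataset() helper; xarray opens the whole store directly.
    """
    import xarray as xr  # deferred import so the sketch stays self-contained
    return xr.open_zarr(home / "zarr", consolidated=True)
```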

@millerjoel
Collaborator Author

Thanks @stevehadd! I think I've addressed most of your comments, but will leave some things for another PR. The reason for not having some of the pipeline steps as part of the core library was to demonstrate a way that custom steps can be easily added to a pipeline. In the future I will likely make a separate PR to have some of these steps exist as part of PET.
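The idea of user-defined pipeline steps can be illustrated framework-agnostically (the real pyearthtools step interface may differ; `make_pipeline` and `drop_missing` are illustrative names, not library API):

```python
from typing import Callable

# A custom pipeline step is just a callable from data to data, so users can
# compose their own steps without modifying the core library.
Step = Callable[[dict], dict]

def make_pipeline(*steps: Step) -> Step:
    """Compose steps left to right into a single callable."""
    def run(data: dict) -> dict:
        for step in steps:
            data = step(data)
        return data
    return run

def drop_missing(data: dict) -> dict:
    """Hypothetical custom QC step: remove variables with no data."""
    return {k: v for k, v in data.items() if v is not None}
```

Usage would then be e.g. `make_pipeline(drop_missing)(raw_record)`.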

@tennlee
Collaborator

tennlee commented Jul 11, 2025

I think we can merge it as-is but there are some improvements I'd suggest making.

  • Firstly, for downloading, it should be possible to select all data at once rather than manually selecting each time range one by one.
  • Secondly, I think there is a confusion in the downloader between the zip filename and the tar filename. I think the incremental downloader is checking the tar filename when it should be checking the zip filename.
  • Thirdly, in the converter, it currently raises a lot of warning messages, which are distracting given they are raised for every small netcdf file - I would suggest capturing them and dropping all except the first one.
  • Consider adding a progress bar to the converter, or a comment on how long it's expected to take
  • I tried this on the entire data set. There are something like 6500 individual station files as far as I can tell, and I think it's doing something like 100-200 per minute, so it's going to take 30-60 minutes I guess. It's also maxing out around 112% CPU, so it's not very multithreaded. I think the notebook needs to explain to people what's involved, and maybe the progress bar is warranted. Alternatively, put a stronger recommendation to start with a single file for experimentation to avoid people trying to do too much at once.
  • I did my initial download with wget, which resulted in some slight differences. In the end I just unzipped and untarred the files at the command-line as a result. That's okay because I know I wasn't following the notebook, but I thought I'd mention it. I also ended up with some corrupted gz files, and I'm not sure which. I wonder how much space the whole dataset takes up as a parquet file, which might be much more efficient than gz in this case. I might try a test later, but it could be that making a .pq distribution file of the dataset and hosting it would lower the barrier significantly for users doing the initial setup.

It's going through the conversion now, I'll add further comments post-conversion once that's done.
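The warning de-duplication and coarse progress reporting suggested above could look like this (a stdlib-only sketch; `convert_one` stands in for the per-file conversion, and the names are hypothetical):

```python
import warnings

def convert_all(files, convert_one):
    """Convert each station file, surfacing each distinct warning only once."""
    seen = set()
    files = list(files)
    for i, path in enumerate(files, 1):
        # Capture everything the per-file conversion warns about.
        with warnings.catch_warnings(record=True) as caught:
            warnings.simplefilter("always")
            convert_one(path)
        for w in caught:
            msg = str(w.message)
            if msg not in seen:          # re-emit only the first occurrence
                seen.add(msg)
                warnings.warn(msg)
        if i % 500 == 0:                 # coarse progress indicator
            print(f"converted {i}/{len(files)} files")
```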

def __init__(
self,
station: str | list[str] | None = None, # Allow single station, multiple stations, or None
Collaborator


It should be possible to override the base directory for users who have put the data in a different location outside of their home directory

self,
station: str | list[str] | None = None, # Allow single station, multiple stations, or None
variables: list[str] | str | None = None,
*,
Collaborator


The optional arguments should all be after the "*" in the constructor (probably)
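For reference, arguments placed after a bare `*` can only be passed by keyword, so moving all the optional ones behind it makes call sites self-documenting (illustrative names, not the accessor's actual signature):

```python
class Accessor:
    def __init__(
        self,
        station=None,
        *,                    # everything below is keyword-only
        variables=None,
        transforms=None,      # hypothetical optional argument
    ):
        self.station = station
        self.variables = variables
        self.transforms = transforms

# variables must now be spelled out at the call site:
acc = Accessor("010010", variables=["temperatures"])
```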

# Returns:
# list[str]: A list of all station IDs.
# """
# root_directory = Path(root_directory)
Collaborator


Please strip out the commented code if it's not a part of the PR

raise DataNotFoundError(f"Root directory does not exist: {root_directory}")
station_ids = []
for file in cached_iterdir(root_directory):
if file.suffix == ".nc":
Collaborator


This seems to be getting the station ids from the netcdf files, not the zarr files. Does this mean people need to store both the netcdf AND the zarr files?

# return station_ids


def get_all_station_ids(self, root_directory: Path | str = None) -> list[str]:
Collaborator


Here it says root_directory defaults to HADISD_HOME/netcdf, but in the 'filesystem' method it is called with a first argument of just HADISD_HOME, leading to inconsistencies.


# Retrieve all station IDs from the dataset directory if "all" is present
if "all" in station_ids:
station_ids = self.get_all_station_ids(HADISD_HOME)
Collaborator


This is where the class calls get_all_station_ids without the '/netcdf' part, and it's looking through the netcdf files rather than the zarr files to determine available station numbers

(50000, 79999, "WMO_050000-079999"),
(80000, 99999, "WMO_080000-099999"),
(100000, 149999, "WMO_100000-149999"),
(150000, 199999, "WMO_150000-199999"),
Collaborator


When unpacking the zarr files, why not just put everything in a single directory? Do the directories help particularly?

Collaborator Author


Stephen made the same comment actually. There is no reason for this, and I think your suggestion is sensible. I think it was purely done to break up the download into smaller chunks.

@tennlee
Copy link
Collaborator

tennlee commented Jul 15, 2025

Hi everyone. As discussed, I will shortly be merging this code and this will close the pull request. However, the various comments still need to be addressed. Joel has agreed to continue to work through the feedback, but the code is also useful enough to offer value to users as-is.

@tennlee tennlee marked this pull request as ready for review July 15, 2025 08:41
@tennlee tennlee merged commit 4e1433f into ACCESS-Community-Hub:develop Jul 15, 2025
5 of 6 checks passed

Labels

enhancement New feature or request

5 participants