Merged
Commits
73 commits
33c5f79
First implementation of HadISD integration using ERA5 as template (un…
millerjoel Apr 11, 2025
eee0352
First implementation of HadISD integration using ERA5 as template (un…
millerjoel Apr 11, 2025
da7bdf0
Updata HadISDIndex class: Can now successfully load Hadisd data from …
millerjoel Apr 15, 2025
45b07ae
Create notebook for documenting HadISD dataset integration
millerjoel Apr 15, 2025
3484dd6
Update HadISDIndex class so variables can be selected using transforms
millerjoel Apr 15, 2025
ae7b063
Add small explainer comments
millerjoel Apr 15, 2025
7e4a6c1
Add transform for selecting variables when loading datasets
millerjoel Apr 16, 2025
e86805c
Merge branch 'ACCESS-Community-Hub:develop' into feature/hadisd-datas…
millerjoel Apr 16, 2025
38d60f6
Update class to support selecting variables
millerjoel Apr 16, 2025
ab92010
Add SetMissingToNaN transform class
millerjoel Apr 16, 2025
d703c3a
Update HadISDIndex Class to make use of new SetMissingToNaN transform
millerjoel Apr 16, 2025
1a6e8c6
Update HadISD note book to reflect addition of new transforms
millerjoel Apr 16, 2025
30cb721
Add new test for SetMissingToNaN transform
millerjoel Apr 16, 2025
2f56997
Merge branch 'ACCESS-Community-Hub:develop' into feature/hadisd-datas…
millerjoel Apr 16, 2025
47c0206
Create add_flagged_obs class in values.py module
millerjoel Apr 24, 2025
dd7beba
Update Hadisd notebook
millerjoel Apr 24, 2025
1360631
Add ReIndexTime transform to coordinates.py module
millerjoel Apr 24, 2025
32e3128
Merge branch 'ACCESS-Community-Hub:develop' into feature/hadisd-datas…
millerjoel Apr 24, 2025
ab870fe
Minor Hadisd changes
millerjoel Apr 30, 2025
21fdc07
Merge branch 'ACCESS-Community-Hub:develop' into feature/hadisd-datas…
millerjoel Apr 30, 2025
a4b9311
Merge branch 'ACCESS-Community-Hub:develop' into feature/hadisd-datas…
millerjoel May 12, 2025
c8dc4f1
Merge branch 'ACCESS-Community-Hub:develop' into feature/hadisd-datas…
millerjoel May 14, 2025
d31620d
Update init to reflect correct archive
millerjoel May 19, 2025
cb9aed3
Update load and filesystem methods to work with zarr
millerjoel May 19, 2025
2876235
Update notebook to support loading multiple stations and merging into…
millerjoel May 19, 2025
2479761
Add support for natively loading multifle zarr directories
millerjoel May 19, 2025
4d7d6ba
Update to test with netcdf
millerjoel May 22, 2025
77154b5
single station testing
millerjoel May 22, 2025
ac74e27
Merge branch 'ACCESS-Community-Hub:develop' into feature/hadisd-datas…
millerjoel May 30, 2025
7c9eaea
Save development changes
millerjoel May 30, 2025
cd001bf
Save development
millerjoel May 30, 2025
b4883ce
Add rough FeatureTargetSplit operation to notebook. Add to operations…
millerjoel Jun 1, 2025
58f39da
Merge branch 'ACCESS-Community-Hub:develop' into feature/hadisd-datas…
millerjoel Jun 2, 2025
45dca09
Refine class and make more readable
millerjoel Jun 2, 2025
327ee61
change path
millerjoel Jun 2, 2025
84baa50
revert changes
millerjoel Jun 2, 2025
1832345
More development and testing with xgboost and pipeline operations
millerjoel Jun 2, 2025
e588071
Add notebook for converting netcdf files to zarr
millerjoel Jun 2, 2025
e29eac9
Merge branch 'ACCESS-Community-Hub:develop' into feature/hadisd-datas…
millerjoel Jun 2, 2025
5d85d35
Updated analysis of qc flags
millerjoel Jun 3, 2025
2c5add1
Add XGBoost training plus data manipulation steps
millerjoel Jun 4, 2025
add7d85
Move HadISD to Zarr to notebooks
millerjoel Jun 5, 2025
5384f38
HadISD QC exploration notebook
millerjoel Jun 6, 2025
9b0ba58
remove notebook outputs
millerjoel Jun 23, 2025
d9d9a69
Add pipeline branches, simplify notebook
millerjoel Jun 23, 2025
bcfd651
Add HadISD config notebook for custom pipelinesteps and dictionaries
millerjoel Jun 23, 2025
e55bf8e
Remove TODOs
millerjoel Jun 23, 2025
b6a36c5
Merge branch 'ACCESS-Community-Hub:develop' into feature/hadisd-datas…
millerjoel Jun 23, 2025
eb61f3d
Remove TODOs
millerjoel Jun 23, 2025
bfef46e
add train test split pipeline step
millerjoel Jul 2, 2025
e24c020
Include train test split in pipeline
millerjoel Jul 2, 2025
e51a883
Feed output of new pipeline (train/test) to xgboost
millerjoel Jul 2, 2025
ddfa691
Update get all stations method for more universal suport
millerjoel Jul 3, 2025
9e06185
Update get all stations method for more universal suport
millerjoel Jul 3, 2025
8040256
Define user agnostic paths
millerjoel Jul 3, 2025
3e61faf
Add download notebook and rename other HadISD notebooks
millerjoel Jul 3, 2025
cad4397
Clear notebook outputs
millerjoel Jul 3, 2025
ef75783
Update notebooks
millerjoel Jul 6, 2025
b7932d1
Merge branch 'ACCESS-Community-Hub:develop' into feature/hadisd-datas…
millerjoel Jul 8, 2025
37db785
Add data config notebook
millerjoel Jul 8, 2025
e4b7a5c
Update notebooks to use config notebook
millerjoel Jul 8, 2025
8ce2411
Update init so hadis works out the box
millerjoel Jul 9, 2025
f67098f
Rename Config notebooks
Jul 9, 2025
61d1c9b
Clear notebook outputs and update Data_config
Jul 9, 2025
5db1376
Remove redundent cells from notebooks
millerjoel Jul 9, 2025
9e0194b
Notebooks ready for merge
millerjoel Jul 10, 2025
c569724
Update notebook 3
millerjoel Jul 10, 2025
0960ee6
Improve xarray/dask handling based on review comments
millerjoel Jul 10, 2025
7f83c49
Use explicit instead of falsey
millerjoel Jul 10, 2025
799433e
Make downloads more robust and idempotent so repeat downloads don't o…
millerjoel Jul 15, 2025
afa87ee
Prevent conversion to zarr for already converted netcdf files
millerjoel Jul 15, 2025
aaae6f4
Clear notebook output
millerjoel Jul 15, 2025
1f732ed
Update packages/tutorial/src/pyearthtools/tutorial/__init__.py
tennlee Jul 15, 2025
253 changes: 253 additions & 0 deletions notebooks/tutorial/HadISD/1_HadISD_Download.ipynb
@@ -0,0 +1,253 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "7c7150db",
"metadata": {},
"source": [
"# HadISD Data Download Notebook\n",
"\n",
"This notebook will help you download a subset of the HadISD dataset directly from the Met Office website. The data will be stored in a user-specified directory (or a sensible default), and extracted for further processing (e.g., conversion to Zarr).\n",
"\n",
"- **Source:** [HadISD v3.4.0.2023f](https://www.metoffice.gov.uk/hadobs/hadisd/v340_2023f/download.html)\n",
"- **Instructions:**\n",
"  1. Set the download directory (or use the default).\n",
"  2. Download the data using Python's `requests` package.\n",
"  3. Extract the `.tar.gz` archive.\n",
"  4. The extracted files will be ready for use in the next notebook (`HadISD_to_zarr.ipynb`).\n",
"\n",
"> **Note:** Download size is large. Ensure you have sufficient disk space and a stable internet connection."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e4fb7b1d",
"metadata": {},
"outputs": [],
"source": [
"import requests\n",
"from tqdm.auto import tqdm\n",
"import tarfile\n",
"import gzip\n",
"import shutil"
]
},
{
"cell_type": "markdown",
"id": "12a6abca",
"metadata": {},
"source": [
"### Retrieve Path to Download Directory\n",
"The download location will default to a folder named \"HadISD_data\" in your home directory.<br>\n",
"If you want to change this, you can do so in the `Data_Config.ipynb` configuration notebook. <br>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2eadaf27",
"metadata": {},
"outputs": [],
"source": [
"%run Data_Config.ipynb\n",
"print(f\"Data will be downloaded to: {download_dir}\") "
]
},
{
"cell_type": "markdown",
"id": "2d833606",
"metadata": {},
"source": [
"### Download HadISD Data\n",
"The following code will download the HadISD data files. Download times vary between files and with the time of day. To download a different range of WMO stations, change `wmo_id_range` in the `Data_Config.ipynb` notebook.\n",
"\n",
"The full list of available data can be found here:\n",
"https://www.metoffice.gov.uk/hadobs/hadisd/v340_2023f/download.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "feb8d671",
"metadata": {},
"outputs": [],
"source": [
"# Explain why stations are split into ranges, file size, and how it's not necessary to download all stations. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11a188d4",
"metadata": {},
"outputs": [],
"source": [
"print(f\"Downloading HadISD data for WMO range: {wmo_id_range}\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ddbebda",
"metadata": {},
"outputs": [],
"source": [
"wmo_id_range = wmo_id_range # This has been defined in Data_Config.ipynb\n",
"\n",
"wmo_str = f\"WMO_{wmo_id_range}\"\n",
"url = f\"https://www.metoffice.gov.uk/hadobs/hadisd/v340_2023f/data/{wmo_str}.tar.gz\"\n",
"tar_name = f\"{wmo_str}.tar\"\n",
"filename = download_dir / tar_name"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "08ac36fd",
"metadata": {},
"outputs": [],
"source": [
"# Get remote file size using HTTP HEAD\n",
"head = requests.head(url, allow_redirects=True, timeout=60)\n",
"remote_size = int(head.headers.get('content-length', 0))\n",
"\n",
"local_size = filename.stat().st_size if filename.exists() else 0\n",
"\n",
"if filename.exists() and local_size == remote_size:\n",
"    print(f\"File already fully downloaded: {filename} ({local_size/1024**2:.2f} MB)\")\n",
"else:\n",
"    headers = {}\n",
"    mode = 'wb'\n",
"    initial_pos = 0\n",
"    if filename.exists() and local_size < remote_size:\n",
"        # Partial file on disk: request only the missing bytes and append them\n",
"        headers['Range'] = f'bytes={local_size}-'\n",
"        mode = 'ab'\n",
"        initial_pos = local_size\n",
"        print(f\"Resuming download for {filename.name} at {local_size/1024**2:.2f} MB...\")\n",
"    else:\n",
"        print(f\"Starting download for {filename.name}...\")\n",
"\n",
"    response = requests.get(url, stream=True, headers=headers, timeout=60)\n",
"    response.raise_for_status()\n",
"    if mode == 'ab' and response.status_code != 206:\n",
"        # Server ignored the Range header and is sending the whole file, so start over\n",
"        mode = 'wb'\n",
"        initial_pos = 0\n",
"    total = remote_size\n",
"\n",
"    with open(filename, mode) as f, tqdm(\n",
"        desc=f\"Downloading {filename.name}\",\n",
"        total=total,\n",
"        initial=initial_pos,\n",
"        unit='B', unit_scale=True, unit_divisor=1024\n",
"    ) as bar:\n",
"        for chunk in response.iter_content(chunk_size=8192):\n",
"            if chunk:\n",
"                f.write(chunk)\n",
"                bar.update(len(chunk))\n",
"\n",
"    final_size = filename.stat().st_size\n",
"    if final_size == remote_size:\n",
"        print(f\"Download complete: {filename} ({final_size/1024**2:.2f} MB)\")\n",
"    else:\n",
"        print(f\"Warning: Download incomplete. Local size: {final_size}, Remote size: {remote_size}\")\n",
"\n",
"# Possibly also add a check to see if netcdf files exist for the downloaded tar file; if so, don't download again"
]
},
{
"cell_type": "markdown",
"id": "4da19a94",
"metadata": {},
"source": [
"### Extract Tar Files and Move to Netcdf Subfolder"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fb79a81c",
"metadata": {},
"outputs": [],
"source": [
"extract_dir = download_dir / tar_name.replace('.tar', '')\n",
"extract_dir.mkdir(exist_ok=True)\n",
"\n",
"extracted_files = list(extract_dir.glob('*'))\n",
"if extracted_files:\n",
"    print(f\"Extraction directory '{extract_dir}' already contains {len(extracted_files)} files. Skipping extraction.\")\n",
"elif filename.exists():\n",
"    with tarfile.open(filename, \"r:gz\") as tar:\n",
"        tar.extractall(path=extract_dir)\n",
"    extracted_files = list(extract_dir.glob('*'))\n",
"    if extracted_files:\n",
"        print(f\"Extraction successful. {len(extracted_files)} files found in {extract_dir}.\")\n",
"        # Delete the tar file after extraction\n",
"        filename.unlink()\n",
"        print(f\"Deleted tar file: {filename}\")\n",
"    else:\n",
"        print(f\"Warning: No files extracted to {extract_dir}. Tar file will not be deleted.\")\n",
"        raise RuntimeError(\"Extraction failed, tar file not deleted.\")\n",
"else:\n",
"    print(\"No tar file found and extraction directory is empty. Nothing to extract.\")\n",
"    raise FileNotFoundError(f\"Missing tar file: {filename}\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "53161550",
"metadata": {},
"outputs": [],
"source": [
"# Create subfolder for netcdf\n",
"netcdf_dir = download_dir / \"netcdf\"\n",
"netcdf_dir.mkdir(parents=True, exist_ok=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e43dcc4",
"metadata": {},
"outputs": [],
"source": [
"# Decompress each extracted .nc.gz file and move the resulting .nc file into netcdf_dir\n",
"num_files = 0\n",
"for gz_path in extract_dir.glob('*.nc.gz'):\n",
"    nc_path = gz_path.with_suffix('') # Remove .gz extension\n",
"    with gzip.open(gz_path, 'rb') as f_in, open(nc_path, 'wb') as f_out:\n",
"        shutil.copyfileobj(f_in, f_out) # Stream in chunks rather than reading the whole file into memory\n",
"    gz_path.unlink() # Delete the .gz file after extraction\n",
"    shutil.move(str(nc_path), netcdf_dir / nc_path.name)\n",
"    num_files += 1\n",
"\n",
"print(f\"{num_files} .nc files have been extracted, cleaned up, and moved to the netcdf directory: {netcdf_dir}\")\n",
"\n",
"# Delete the extraction directory after processing\n",
"try:\n",
"    shutil.rmtree(extract_dir)\n",
"    print(f\"Deleted extraction directory: {extract_dir}\")\n",
"except Exception as e:\n",
"    print(f\"Could not delete extraction directory {extract_dir}: {e}\")"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "pyearthtools",
"language": "python",
"name": "pyearthtools"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.5"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
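Note on the download cell in `1_HadISD_Download.ipynb`: the resume decision (skip, append with a `Range` header, or start over) could be pulled out into a small pure function so it can be checked without network access. The following is a minimal sketch, not code from this PR. `DownloadPlan`, `plan_download` and `should_append` are hypothetical names, and treating a missing `content-length` (reported as size 0) as "start over" is an assumption that differs slightly from the notebook.

```python
from typing import NamedTuple, Optional


class DownloadPlan(NamedTuple):
    """Hypothetical container: how to open the local file and what to request."""
    mode: str         # 'wb' to start fresh, 'ab' to append to a partial file
    headers: dict     # extra HTTP request headers (may be empty)
    initial_pos: int  # bytes already on disk, used as the progress bar start


def plan_download(local_size: int, remote_size: int) -> Optional[DownloadPlan]:
    """Return None when the local file is complete, else a plan for the rest."""
    if remote_size > 0 and local_size == remote_size:
        return None  # already fully downloaded
    if 0 < local_size < remote_size:
        # Partial file: ask only for the missing bytes and append them.
        return DownloadPlan("ab", {"Range": f"bytes={local_size}-"}, local_size)
    # No local file, unknown remote size, or local file larger than remote.
    return DownloadPlan("wb", {}, 0)


def should_append(plan: DownloadPlan, status_code: int) -> bool:
    """Append only if the server honoured the Range request (HTTP 206)."""
    return plan.mode == "ab" and status_code == 206
```

In the notebook, `local_size` and `remote_size` would come from `filename.stat().st_size` and the HEAD request's `content-length`, and the status code from the streamed GET response. If `should_append` returns false for an `'ab'` plan, the file should be reopened in `'wb'` mode, because a 200 response carries the whole archive rather than the missing tail.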