24 changes: 23 additions & 1 deletion notebooks/Gallery.ipynb
@@ -41,6 +41,26 @@
"| **Radar Visualisation** | Shows how to visualise radar data as a time-series, in 2D and in 3D | ![Image showing a top down view of radar data](https://pyearthtools.readthedocs.io/en/latest/_images/notebooks_RadarVisualisation_10_1.png) | [Radar Visualisation](./RadarVisualisation.ipynb) | 23 Aug 2025 |\n"
]
},
{
"cell_type": "markdown",
"id": "a6f26875-9a0c-40b2-87ad-39cb1f8037e9",
"metadata": {},
"source": [
"## Working with Station Data (medium requirements)\n",
"\n",
"Working with station data and integrating it with gridded data is quite complex. This series of tutorials demonstrates how to download one of the key open station databases, re-process it to suit the time-series nature of most PyEarthTools use cases, create a Data Accessor, and then combine the data with gridded data to form the basis of a heterogenous machine learning pipeline. \n",
"\n",
"These tutorials can be run on some laptops and workstations and do not require a GPU as they do not yet include model training, but may require larger amounts of RAM than devices, and some user modification may be needed to run them on less than 36GB RAM.\n",
"\n",
"| Title | Description | Image | Notebooks | Last Tested |\n",
"|-------|--------------|-------|-------------|-------------|\n",
"| **One - Introduction** | Introduction to station data | (no image) | [One - Introduction](./scorecard/One-Introduction.ipynb) | 5 Nov 2025 |\n",
"| **Two - Data Download** | Perform inital data downloading | (no image) | [Two - DataDownload](./scorecard/Two-DataDownload.ipynb) | 5 Nov 2025 |\n",
"| **Three - Small Chunks** | Group the data by decade in small groups | (no image) | [Three - SmallChunks](./scorecard/Three-SmallChunks.ipynb) | 5 Nov 2025 |\n",
"| **Four - Make Large Groupings** | Group the data by decade in large groups | (no image) | [Four - MakeLargeGroupings](./scorecard/Four-MakeLargeGroupings.ipynb) | 5 Nov 2025 |\n",
"| **Five - Data Accessor** | Integrate the data with PyEarthTools pipelines | (no image) | [Five - DataAccessor](./scorecard/Five-DataAccessor.ipynb) | 5 Nov 2025 |"
]
},
{
"cell_type": "markdown",
"id": "1f72b9c5-1d2b-4212-9009-ab147685ca83",
@@ -50,6 +70,8 @@
"\n",
"These notebooks start with the basics and work up towards more complex examples, showing how to work with the classes and functions within the package to achieve objectives.\n",
"\n",
"These tutorials require a high-performance computing environment and work with very large data volumes.\n",
"\n",
"| Title | Description | Image | Notebooks | Last Tested |\n",
"|-------|---------------|-------|------------|-------------|\n",
"| **ENSO Prediction** |The El Niño–Southern Oscillation (ENSO) is a major driver of climate variability, influencing regional and global weather patterns. It has been linked to extreme weather events across the globe, including droughts, floods, and shifts in precipitation. Weather centres around the world actively forecast ENSO to anticipate these patterns. | | | | \n",
@@ -136,7 +158,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.13.7"
"version": "3.11.7"
}
},
"nbformat": 4,
317 changes: 317 additions & 0 deletions notebooks/scorecard/Five-DataAccessor.ipynb

Large diffs are not rendered by default.

156 changes: 156 additions & 0 deletions notebooks/scorecard/Four-MakeLargeGroupings.ipynb
@@ -0,0 +1,156 @@
{
"cells": [
{
"cell_type": "code",
"execution_count": 14,
"id": "1b4ae39b-4f5f-4dc7-bee0-a798eba46719",
"metadata": {},
"outputs": [],
"source": [
"from pathlib import Path\n",
"import numpy as np\n",
"from datetime import datetime\n",
"import warnings\n",
"warnings.simplefilter(action='ignore', category=FutureWarning)\n",
"\n",
"import xarray as xr"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "8bf7f65c-875c-47ff-9ea9-8ac81128be26",
"metadata": {},
"outputs": [],
"source": [
"# A spot to put the data on disk. We keep both the data as-downloaded and the reprocessed version, so you might need up to 50GB free in order to make this work.\n",
"\n",
"PROCESSING_DIR = Path('/g/data/kd24/data') / 'hadisd' / 'processing' # We need to cache some data on disk during reprocessing\n",
"DECADAL_DIR = Path('/g/data/kd24/data') / 'hadisd' / 'by_decade' # This will hold the final form of our data\n",
"DECADAL_DIR.mkdir()"
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "db156524-9c84-4257-b351-e960f8b1adcb",
"metadata": {},
"outputs": [],
"source": [
"decades = {\n",
" 'early': ('1800', '1930'), # Just in case there is undocumented early data\n",
" '1930': ('1930', '1940'), # Dataset begins in 1930, start by decade here \n",
" '1940': ('1940', '1950'),\n",
" '1950': ('1950', '1960'), \n",
" '1960': ('1960', '1970'), \n",
" '1970': ('1970', '1980'), \n",
" '1980': ('1980', '1990'), \n",
" '1990': ('1990', '2000'), \n",
" '2000': ('2000', '2010'), \n",
" '2010': ('2010', '2020'), \n",
" '2020': ('2020', '2030')\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "e205d264-92cb-4a29-b3ae-3ef16a7404e1",
"metadata": {},
"outputs": [],
"source": [
"files_for_decades = {}\n",
"\n",
"for ix in decades.keys():\n",
" start_dec, end_dec = decades[ix]\n",
" _files_for_decade = list(PROCESSING_DIR.glob(f'*{start_dec}-{end_dec}*.nc'))\n",
" files_for_decades[ix] = _files_for_decade\n",
"\n",
"# Uncomment this to see values for debugging\n",
"# the1950s = files_for_decades['1950']\n",
"# the1950s"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "7cd6885d-635f-4028-bd75-19fed284cca3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 file groupings to be used for decade 1990\n",
"Loaded group 0\n",
"Combined group 0\n",
"Wrote group 0\n"
]
}
],
"source": [
"decade_of_interest = '1990' # In the interests of saving time, we process only one decade here\n",
"\n",
"files_for_decade = files_for_decades[decade_of_interest]\n",
"groupings = [files_for_decade[i:i + 40] for i in range(0, len(files_for_decade), 40)]\n",
"print(f\"{len(groupings)} file groupings to be used for decade {decade_of_interest}\")\n",
"for i, grouping in enumerate(groupings):\n",
" loaded = [xr.open_dataset(f) for f in grouping]\n",
" print(f\"Loaded group {i}\")\n",
" combined = xr.concat(loaded, dim='report', data_vars='all')\n",
" combined['reporting_stats'] = combined['reporting_stats'].fillna(-999.0)\n",
" print(f\"Combined group {i}\")\n",
" filename = f'all_{decade_of_interest}s_group{str(i)}.nc'\n",
" combined.to_netcdf(DECADAL_DIR / filename)\n",
" print(f\"Wrote group {i}\")"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "c2684bb3-0bf5-4b79-9c62-568bbdf5879d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Completed\n"
]
}
],
"source": [
"print(\"Completed\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0e6f340e-b2cf-46cc-8157-60926434f31c",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
92 changes: 92 additions & 0 deletions notebooks/scorecard/One-Introduction.ipynb
@@ -0,0 +1,92 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "329c0283-9192-45e4-80b0-910ce5625120",
"metadata": {},
"source": [
"## Hadley Integrated Surface Database\n"
]
},
{
"cell_type": "markdown",
"id": "51995928-e6e0-4e4b-a853-7eec52cf53a8",
"metadata": {},
"source": [
"This dataset holds the world's weather station data up until late 2025.\n",
"\n",
"![Image of weather stations](https://www.metoffice.gov.uk/hadobs/hadisd/v343_2025f/images/hadisd_gridded_station_distribution_v343_2025f.png)\n",
"\n",
"For futher information please see:\n",
"\n",
"- Dunn, R. J. H., (2019), HadISD version 3: monthly updates, Hadley Centre Technical Note\n",
"- Dunn, R. J. H., et al. (2016), Expanding HadISD: quality-controlled, sub-daily station data from 1931, Geoscientific Instrumentation, Methods and Data Systems, 5, 473-491\n",
"- Dunn, R. J. H., et al. (2014), Pairwise homogeneity assessment of HadISD, Climate of the Past, 10, 1501-1522\n",
"- Dunn, R. J. H., et al. (2012), HadISD: A Quality Controlled global synoptic report database for selected variables at long-term stations from 1973-2011, Climate of the Past, 8, 1649-1679 Smith, A., et al. (2011): The Integrated Surface Database: Recent Developments and Partnerships. Bulletin of the American Meteorological Society, 92, 704-708\n",
"\n",
"\n",
"For the product manual, see [https://www.metoffice.gov.uk/hadobs/hadisd/hadisd_v340_2023f_product_user_guide.pdf](https://www.metoffice.gov.uk/hadobs/hadisd/hadisd_v340_2023f_product_user_guide.pdf)\n",
"\n",
"For the website, see [https://www.metoffice.gov.uk/hadobs/hadisd/v343_2025f/index.html](https://www.metoffice.gov.uk/hadobs/hadisd/v343_2025f/index.html)\n",
"\n",
"It's an amazing scientific archive. The data is held in a collection of .tgz files, based on station ranges. These files contains smaller station sub-ranges, themselves gzipped netcdf files. We need to download the ones we want (potentially all of them), then double-unwrap them, and then put them into a more performant file format for quick access by time index when performing ML training or long historical verification runs.\n",
"\n",
"Eventually, we want to present these efficiently as a PyEarthTools data accessor which can be quickly indexed by time. An alternative data accessor based on station ID rather than time could be imagined, but we will focus on access by time in this tutorial series.\n",
"\n",
"Despite being packed into NetCDF files -- which is often used for lat/lon/level/time gridded data -- this data is better visualised as just one massive long list of report entries in a big logbook. Each report is a slightly more complex version of \"time, station_id, lat, lon, elevation, bunch of obs data\".\n",
"\n",
"Many underlying issues have been sorted out, like stations reporting twice under two ids, changing ids, station upgrades/replacements, plain old errors, sensor quality control and more. Many stations only report for some of the time period, some only once or for a short time, some for a very long time. What we want to do is get this into a good form for time-series use by an ML algorithm. The files on disk are roughly organised by nominal station number, for all time. So if you know what stations you want to work with, you could just pick those files. But let's face it, who wants to take the time to understand the mysterious workings of station numbers - at least at first?\n",
"\n",
"Singe station time-series modelling is a totally valid use case - e.g. fetching \"station data for Melbourne from 2020 to 2025\". That's fairly straightforward - manually look up the station number of interest, find it in the files, open that files with xarray and then select the time-frame of interest.\n",
"\n",
"Doing the same thing for a handful of stations is also not too bad. Each station file is only a few megabytes, so opening 5 of them isn't a big deal. However, opening all of them becomes a bigger deal, and trying to merge them all together using simple merge and concat operations will cause a computational failure on most platforms (including HPC platforms). Some data processing is required in order to prepare the data for the time of query we want to use.\n",
"\n",
"Translating between the 'gridded world' or global and regional modelling and the 'station world' is often done by performing a site-based forecast based on gridded inputs (e.g. siteboost or model output statistics). The translation of station data to a gridded model is done through data assimilation. These two ways of working with the data have significant implications for the data structures which will be used, and for computational efficiency. It would be really nice to have a simple API which could abstract away the messy choices, implement the tricky bits and make it easy to just 'get what we want'.\n",
"\n",
"From a PyEarthTools perspective based on wanting to develop model architectures which include both gridded and point data at the same time (rather than having a 'translation step'), this means getting the data into a structure where the primary index is date-and-time, and all relevant stations are loaded into that data structure. However, the data still can't be simply gridded, as it more represents a point cloud at each moment in time. A few decisions need to be make still. We will keep things \"simple\" by representing the data for each time step as a list of observation reports from all stations reporting at that time, with a small time delta allowed for stations reporting a few seconds off the base time due to engineering tolerences or other reasons. The \"list to grid\" step will be handled either the model, or in an observation operator step to be developed at a later time.\n",
"\n",
"This tutorial series contains the code (and explanation) for how to download the data from the Hadley Centre website, unpack it, and then re-process it on disk to have a structure which is well-suited for efficient access in the manner just described.\n",
"\n",
"The tutorials are structured in a sequence, each with a specific scope. They are:\n",
"\n",
"1. Downloading the data in the form distributed by the Hadley centre\n",
"2. Manual unpack of the data on disk for efficiency reasons (see instructions at the end of StationDownload)\n",
"3. Re-processing of the station data to break it up by decade for file size reasons\n",
"4. Grouping of individual stations into large station groupings to reduce the number of files on disk\n",
"5. Data visualisation of the global station data to demonstrate what it looks like this way\n",
"6. (to be done) Integration of this data into PyEarthTools data accessor\n",
"7. (to be done) Integration of station data into a PyEarthTools pipeline\n",
"8. (to be done) Presentation of gridded data and station data to a neural network for training and prediction"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e2c2235-5acf-4d42-95e1-96b247d91269",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}