From 35d7675a7a5a87ea2800541a6d23558c4fbfb9f9 Mon Sep 17 00:00:00 2001 From: Tennessee Leeuwenburg Date: Thu, 17 Apr 2025 11:24:33 +1000 Subject: [PATCH 1/6] WIP demonstration of loading climate data --- .../tutorial/Working with Climate Data.ipynb | 955 ++++++++++++++++++ 1 file changed, 955 insertions(+) create mode 100644 notebooks/tutorial/Working with Climate Data.ipynb diff --git a/notebooks/tutorial/Working with Climate Data.ipynb b/notebooks/tutorial/Working with Climate Data.ipynb new file mode 100644 index 00000000..36556cf5 --- /dev/null +++ b/notebooks/tutorial/Working with Climate Data.ipynb @@ -0,0 +1,955 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "bd231875-890b-48f7-86fa-d89867799dfc", + "metadata": {}, + "source": [ + "# Working with Climate Data" + ] + }, + { + "cell_type": "markdown", + "id": "21fd252f-ab46-468b-9679-ffee05ce8572", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "In general terms, \"analysis\" refers to estimates of current conditions, \"reanalysis\" refers to estimates of historical conditions, \"nowcasting\" refers to predictions of conditions which are expected in the next 90 minutes, \"short term\" refers to around 6-48 hours lead time and \"weather\" refers to 6 hours to 10 days lead time. A variety of terms like \"multi-week\", \"sub-seasonal\" and \"seasonal\" refer to the period from weeks to months. Lead times longer than a few months may be referred to as \"seasonal\" and merge into the climate time scales. Time scales for climate modelling are typically decades into the future or longer. These terms are not accepted universally, and some people may refer to any of these time scales as climate modelling. Within this tutorial, climate data is meant to describe predictions at least a year into the future.\n", + "\n", + "A \"reforecast\" is done to generate \"what the forecast would have been\". 
It is subtly different to a \"reanalysis\", because it includes the lead time component as well as the estimate at a point in time.\n", + "\n", + "Climate data uses non-Gregorian calendars, which also involve the use of date/time libraries which may be unfamiliar to many users. \n", + "\n", + "This tutorial will demonstrate the loading of climate data which includes both reforecasting outputs and predictions of the future, and merging it with reanalysis data for the purposes of validation the reforecast component of the dataset.\n", + "\n", + "This will involve loading a climate prediction run from CMIP5 and ERA5 reanalysis data.\n", + "\n", + "! Note - this tutorial is in a draft state and does not yet properly handle the integration of CMIP5 and ERA5 data correctly.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "c54f1450-09bb-4209-93b2-85f0b409cfb6", + "metadata": {}, + "outputs": [], + "source": [ + "import pyearthtools.data as petdata\n", + "import pyearthtools.pipeline as petpipe\n", + "import site_archive_nci" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "7b5b550e-3926-4c2e-8222-0a73d05b638c", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/g/data/kd24/tjl/src/PyEarthTools/packages/nci_site_archive/src/site_archive_nci/_CMIP5.py:140: UserWarning: \n", + "Under development. This class currently only allows single values - e.g. one institution, one model, one experiment id etc. 
Expansion to allowing multiple\n", + "values in under way, which will create a more complex DataSet object containing all relevant data in a single in-memory object.\n", + "\n", + " warnings.warn(UNDER_DEV_MSG)\n" + ] + } + ], + "source": [ + "# This builds an accessor that can be indexed by time, filtered according to the specified parameters\n", + "# Multiple institutions, scenarios and models are unsupported but the intention is to support that in future\n", + "# Models should generally be included in a pipeline of operations rather than used directly, but we will\n", + "# explore some of the functionality of this object regardless\n", + "cmip5_model1 = petdata.archive.CMIP5(institutions='BCC', scenarios=['rcp60'], models=['bcc-csm1-1'], interval='mon', variables='tas')" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "f50daa56-8456-44d1-bfd9-3266478dc634", + "metadata": {}, + "outputs": [], + "source": [ + "# With an exact time, you need to pick a time actually in the dataset, for fuzzy selection see the next cell\n", + "da = cmip5_model1['2010-01-16'] # Query data along primary dimension\n", + "da" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "1c0c36ce-cf18-4cad-8d89-9744003b0400", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "
<xarray.Dataset> Size: 862kB\n",
+       "Dimensions:    (time: 24, bnds: 2, lat: 64, lon: 128)\n",
+       "Coordinates:\n",
+       "  * time       (time) object 192B 2010-01-16 12:00:00 ... 2011-12-16 12:00:00\n",
+       "  * lat        (lat) float64 512B -87.86 -85.1 -82.31 ... 82.31 85.1 87.86\n",
+       "  * lon        (lon) float64 1kB 0.0 2.812 5.625 8.438 ... 351.6 354.4 357.2\n",
+       "    height     float64 8B 2.0\n",
+       "Dimensions without coordinates: bnds\n",
+       "Data variables:\n",
+       "    time_bnds  (time, bnds) object 384B dask.array<chunksize=(24, 2), meta=np.ndarray>\n",
+       "    lat_bnds   (time, lat, bnds) float64 25kB dask.array<chunksize=(24, 64, 2), meta=np.ndarray>\n",
+       "    lon_bnds   (time, lon, bnds) float64 49kB dask.array<chunksize=(24, 128, 2), meta=np.ndarray>\n",
+       "    tas        (time, lat, lon) float32 786kB dask.array<chunksize=(24, 64, 128), meta=np.ndarray>\n",
+       "Attributes: (12/24)\n",
+       "    institution:            Beijing Climate Center(BCC),China Meteorological ...\n",
+       "    institute_id:           BCC\n",
+       "    experiment_id:          rcp60\n",
+       "    source:                 bcc-csm1-1:atmosphere:  BCC_AGCM2.1 (T42L26); lan...\n",
+       "    model_id:               bcc-csm1-1\n",
+       "    forcing:                Nat Ant GHG SD Oz Sl SS Ds BC OC\n",
+       "    ...                     ...\n",
+       "    table_id:               Table Amon (11 April 2011) 1cfdc7322cf2f4a3261482...\n",
+       "    title:                  bcc-csm1-1 model output prepared for CMIP5 RCP6\n",
+       "    parent_experiment:      historical\n",
+       "    modeling_realm:         atmos\n",
+       "    realization:            1\n",
+       "    cmor_version:           2.5.6
" + ], + "text/plain": [ + " Size: 862kB\n", + "Dimensions: (time: 24, bnds: 2, lat: 64, lon: 128)\n", + "Coordinates:\n", + " * time (time) object 192B 2010-01-16 12:00:00 ... 2011-12-16 12:00:00\n", + " * lat (lat) float64 512B -87.86 -85.1 -82.31 ... 82.31 85.1 87.86\n", + " * lon (lon) float64 1kB 0.0 2.812 5.625 8.438 ... 351.6 354.4 357.2\n", + " height float64 8B 2.0\n", + "Dimensions without coordinates: bnds\n", + "Data variables:\n", + " time_bnds (time, bnds) object 384B dask.array\n", + " lat_bnds (time, lat, bnds) float64 25kB dask.array\n", + " lon_bnds (time, lon, bnds) float64 49kB dask.array\n", + " tas (time, lat, lon) float32 786kB dask.array\n", + "Attributes: (12/24)\n", + " institution: Beijing Climate Center(BCC),China Meteorological ...\n", + " institute_id: BCC\n", + " experiment_id: rcp60\n", + " source: bcc-csm1-1:atmosphere: BCC_AGCM2.1 (T42L26); lan...\n", + " model_id: bcc-csm1-1\n", + " forcing: Nat Ant GHG SD Oz Sl SS Ds BC OC\n", + " ... ...\n", + " table_id: Table Amon (11 April 2011) 1cfdc7322cf2f4a3261482...\n", + " title: bcc-csm1-1 model output prepared for CMIP5 RCP6\n", + " parent_experiment: historical\n", + " modeling_realm: atmos\n", + " realization: 1\n", + " cmor_version: 2.5.6" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# A weird bug in cftime handling means re-aggregating the interval doesn't work \n", + "# so this is just operating as a slice of the original data intervals\n", + "series = cmip5_model1.series(start='2010-01-01', end='2012-01-01', interval = (6, 'month'))\n", + "series" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "db420f81-f64e-40a8-89fe-9adc1b26652c", + "metadata": {}, + "outputs": [], + "source": [ + "# pipe = petpipe.Pipeline(\n", + "# (cmip5_model1, petdata.archive.ERA5(['2t'])),\n", + "# petpipe.operations.xarray.Merge()\n", + "# )\n", + "# pipe" + ] + }, + { + "cell_type": "code", + "execution_count": 
null,
+   "id": "47d54afd-8719-4c29-be0a-9a232a0970e2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pipe = petpipe.Pipeline(\n",
+    "    cmip5_model1\n",
+    ")\n",
+    "pipe"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}

From f4fff22431eb1b6610adb467d628d209f1ef7635 Mon Sep 17 00:00:00 2001
From: Tennessee Leeuwenburg
Date: Thu, 17 Apr 2025 16:45:35 +1000
Subject: [PATCH 2/6] Add draft functionality to recode climate calendar data

---
 .../pipeline/operations/xarray/__init__.py    |  2 +
 .../operations/xarray/_recode_calendar.py     | 60 +++++++++++++++++++
 2 files changed, 62 insertions(+)
 create mode 100644 packages/pipeline/src/pyearthtools/pipeline/operations/xarray/_recode_calendar.py

diff --git a/packages/pipeline/src/pyearthtools/pipeline/operations/xarray/__init__.py b/packages/pipeline/src/pyearthtools/pipeline/operations/xarray/__init__.py
index 6ba4a0f4..320da58d 100644
--- a/packages/pipeline/src/pyearthtools/pipeline/operations/xarray/__init__.py
+++ b/packages/pipeline/src/pyearthtools/pipeline/operations/xarray/__init__.py
@@ -37,6 +37,7 @@
 from pyearthtools.pipeline.operations.xarray.join import Merge, Concatenate
 from pyearthtools.pipeline.operations.xarray.sort import Sort
 from pyearthtools.pipeline.operations.xarray.chunk import Chunk
+from pyearthtools.pipeline.operations.xarray._recode_calendar import RecodeCalendar
 
 from pyearthtools.pipeline.operations.xarray import (
     conversion,
@@ -65,4 +66,5 @@
     "metadata",
     "normalisation",
     "remapping",
+    "RecodeCalendar",
 ]

diff --git a/packages/pipeline/src/pyearthtools/pipeline/operations/xarray/_recode_calendar.py 
b/packages/pipeline/src/pyearthtools/pipeline/operations/xarray/_recode_calendar.py
new file mode 100644
index 00000000..cdd51f76
--- /dev/null
+++ b/packages/pipeline/src/pyearthtools/pipeline/operations/xarray/_recode_calendar.py
@@ -0,0 +1,60 @@
+# Copyright Commonwealth of Australia, Bureau of Meteorology 2025.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+
+from typing import TypeVar
+
+import xarray as xr
+
+from pyearthtools.pipeline.operation import Operation
+
+T = TypeVar("T", xr.Dataset, xr.DataArray)
+
+class RecodeCalendar(Operation):
+    """
+    Climate datasets often use the cftime module to index into data using non-standard calendars.
+    This operation recodes the time coordinate of a dataset or data array to standard timestamps.
+    For now, support only exists for recoding from the "noleap" calendar to standard timestamps.
+    """
+
+    _override_interface = "Serial"
+
+    def __init__(self):
+        """
+        Record initialisation and store flags for processing
+        """
+
+        super().__init__(
+            split_tuples=True,
+            operation="apply",
+            recognised_types=(xr.Dataset, xr.DataArray),
+        )
+        self.record_initialisation()
+
+    def apply_func(self, data: T) -> T:
+        """Recode the `time` coordinate of an `xarray` object to a standard `DatetimeIndex`
+
+        Args:
+            data (T):
+                `xarray` object whose time coordinate should be recoded. 
+
+        Returns:
+            (T):
+                Object with its time coordinate recoded to standard timestamps
+        """
+
+        recoded = data.indexes["time"].to_datetimeindex()
+        data["time"] = recoded
+
+        return data

From bb52834d7d36d2877a6928640e20a0a87118b285 Mon Sep 17 00:00:00 2001
From: Tennessee Leeuwenburg
Date: Wed, 23 Apr 2025 19:31:47 +1000
Subject: [PATCH 3/6] Initial implementation of two new operators (align dates
 and GeospatialTimeSeriesJoiner)

Addition of tutorial on working with climate data
---
 .../tutorial/Working with Climate Data.ipynb  | 4652 ++++++++++++++++-
 .../pipeline/operations/xarray/__init__.py    |    2 +
 .../operations/xarray/_align_dates.py         |   76 +
 .../pipeline/operations/xarray/join.py        |   78 +-
 4 files changed, 4747 insertions(+), 61 deletions(-)
 create mode 100644 packages/pipeline/src/pyearthtools/pipeline/operations/xarray/_align_dates.py

diff --git a/notebooks/tutorial/Working with Climate Data.ipynb b/notebooks/tutorial/Working with Climate Data.ipynb
index 36556cf5..b7ac0275 100644
--- a/notebooks/tutorial/Working with Climate Data.ipynb
+++ b/notebooks/tutorial/Working with Climate Data.ipynb
@@ -15,22 +15,22 @@
    "source": [
     "## Introduction\n",
     "\n",
-    "In general terms, \"analysis\" refers to estimates of current conditions, \"reanalysis\" refers to estimates of historical conditions, \"nowcasting\" refers to predictions of conditions which are expected in the next 90 minutes, \"short term\" refers to around 6-48 hours lead time and \"weather\" refers to 6 hours to 10 days lead time. A variety of terms like \"multi-week\", \"sub-seasonal\" and \"seasonal\" refer to the period from weeks to months. Lead times longer than a few months may be referred to as \"seasonal\" and merge into the climate time scales. Time scales for climate modelling are typically decades into the future or longer. These terms are not accepted universally, and some people may refer to any of these time scales as climate modelling. 
Within this tutorial, climate data is meant to describe predictions at least a year into the future.\n", + "This tutorial shows, in detail, how to make a PyEarthTools pipeline for loading climate data. It is a precursor step to creating an ML model for bias correction. It goes into a lot of detail in order to document how to process new data sources and what steps are involved. The end result is a re-usable data loading pipeline, which can then be re-used for many projects, based on the steps and assumptions shown here.\n", + "\n", + "In general terms, an \"analysis\" refers to estimates of current conditions, \"reanalysis\" refers to estimates of historical conditions, \"nowcasting\" refers to predictions of conditions which are expected in the next 90 minutes, \"short term\" refers to around 6-48 hours lead time and \"weather\" refers to 6 hours to 10 days lead time. A variety of terms like \"multi-week\", \"sub-seasonal\" and \"seasonal\" refer to the period from weeks to months. Lead times longer than a few months may be referred to as \"seasonal\" and merge into the climate time scales. Time scales for climate modelling are typically decades into the future or longer. These terms are not accepted universally, and some people may refer to any of these time scales as climate modelling. Within this tutorial, climate data is meant to describe predictions at least a year into the future.\n", "\n", "A \"reforecast\" is done to generate \"what the forecast would have been\". It is subtly different to a \"reanalysis\", because it includes the lead time component as well as the estimate at a point in time.\n", "\n", "Climate data uses non-Gregorian calendars, which also involve the use of date/time libraries which may be unfamiliar to many users. 
\n", "\n", - "This tutorial will demonstrate the loading of climate data which includes both reforecasting outputs and predictions of the future, and merging it with reanalysis data for the purposes of validation the reforecast component of the dataset.\n", - "\n", - "This will involve loading a climate prediction run from CMIP5 and ERA5 reanalysis data.\n", + "This tutorial will demonstrate the loading of climate data which includes both reforecasting outputs and predictions of the future, and merging it with reanalysis data for the purposes of validation of the reforecast component of the dataset.\n", "\n", - "! Note - this tutorial is in a draft state and does not yet properly handle the integration of CMIP5 and ERA5 data correctly.\n" + "This will involve loading a climate prediction run from CMIP5 and ERA5 reanalysis data.\n" ] }, { "cell_type": "code", - "execution_count": 5, + "execution_count": 1, "id": "c54f1450-09bb-4209-93b2-85f0b409cfb6", "metadata": {}, "outputs": [], @@ -45,20 +45,10 @@ "execution_count": 2, "id": "7b5b550e-3926-4c2e-8222-0a73d05b638c", "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/g/data/kd24/tjl/src/PyEarthTools/packages/nci_site_archive/src/site_archive_nci/_CMIP5.py:140: UserWarning: \n", - "Under development. This class currently only allows single values - e.g. one institution, one model, one experiment id etc. 
Expansion to allowing multiple\n",
-    "values in under way, which will create a more complex DataSet object containing all relevant data in a single in-memory object.\n",
-    "\n",
-    "  warnings.warn(UNDER_DEV_MSG)\n"
-   ]
-  }
- ],
+   "outputs": [],
   "source": [
+    "%%capture\n",
+    "\n",
     "# This builds an accessor that can be indexed by time, filtered according to the specified parameters\n",
     "# Multiple institutions, scenarios and models are unsupported but the intention is to support that in future\n",
     "# Models should generally be included in a pipeline of operations rather than used directly, but we will\n",
@@ -74,8 +64,12 @@
    "outputs": [],
    "source": [
     "# With an exact time, you need to pick a time actually in the dataset, for fuzzy selection see the next cell\n",
-    "da = cmip5_model1['2010-01-16'] # Query data along primary dimension\n",
-    "da"
+    "\n",
+    "# Under-specifying the datetime will request all source data which matches Jan 2010\n",
+    "# In this case, the data is monthly, with a pseudo-day-of-month of the 16th\n",
+    "# Note for later - longitude is indexed from 0 to 360\n",
+    "ds_cmip_2010 = cmip5_model1['2010-01'] # Query data along primary dimension\n",
+    "ds_cmip_2010"
    ]
   },
   {
@@ -483,7 +477,7 @@
     "    parent_experiment:      historical\n",
     "    modeling_realm:         atmos\n",
     "    realization:            1\n",
-    "    cmor_version:           2.5.6
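The `RecodeCalendar` operation in PATCH 2 relies on xarray's `to_datetimeindex()` to map a cftime "noleap" (365-day) calendar onto standard timestamps. The essence of that mapping can be sketched with the standard library alone. The helper below is a hypothetical illustration, not part of PyEarthTools: it converts a noleap (year, day-of-year) pair to a standard `datetime`, preserving the month and day labels rather than the elapsed duration, which is what a label-preserving calendar recode does.

```python
from datetime import datetime

# Cumulative days before the start of each month in a 365-day "noleap" year.
_NOLEAP_MONTH_STARTS = [0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334]

def noleap_to_datetime(year: int, dayofyear: int) -> datetime:
    """Map a noleap-calendar date (1-based day of a 365-day year) onto a
    standard Gregorian datetime carrying the same month/day labels.

    Like recoding a cftime noleap index to a DatetimeIndex, this preserves
    the *labels* (year, month, day), not the elapsed time since an epoch.
    """
    if not 1 <= dayofyear <= 365:
        raise ValueError("noleap day-of-year must be in 1..365")
    day0 = dayofyear - 1
    # The month is the last one whose (zero-based) start day is <= day0.
    month = max(m for m, start in enumerate(_NOLEAP_MONTH_STARTS, start=1)
                if start <= day0)
    day = day0 - _NOLEAP_MONTH_STARTS[month - 1] + 1
    return datetime(year, month, day)
```

Note the trade-off this makes explicit: in a Gregorian leap year the recoded timestamps are no longer evenly spaced in elapsed time, which is why xarray emits a warning when converting non-standard calendars.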
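PATCH 3's align-dates operator addresses a join problem visible in the notebook: monthly CMIP5 output is stamped mid-month (a pseudo-day of the 16th), while monthly reanalysis data is typically stamped at the start of the month, so a naive merge on `time` matches nothing. The sketch below uses hypothetical names (it is not the actual `AlignDates` implementation, which is not shown in this excerpt): it reduces both series to `(year, month)` keys and pairs entries that share a key.

```python
from datetime import date

def align_monthly(a: dict, b: dict) -> list:
    """Pair two monthly time series whose timestamps disagree on day-of-month.

    Both inputs map date -> value. Dates are reduced to (year, month) keys;
    each returned row carries the first series' timestamp alongside both
    values. Months present in only one series are dropped.
    """
    b_by_month = {(d.year, d.month): v for d, v in b.items()}
    return [(d, v, b_by_month[(d.year, d.month)])
            for d, v in sorted(a.items())
            if (d.year, d.month) in b_by_month]
```

The same idea generalises to xarray by assigning a normalised monthly time coordinate to both datasets before merging, so that nominally different mid-month and start-of-month stamps fall on a common index.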