diff --git a/notebooks/Gallery.ipynb b/notebooks/Gallery.ipynb index 19877578..fb826ed8 100644 --- a/notebooks/Gallery.ipynb +++ b/notebooks/Gallery.ipynb @@ -28,7 +28,8 @@ "- [Accessing ERA5 Data](./tutorial/Accessing_ERA5_Data.ipynb) (working as at 29/3/2025)\n", "- [Introduction to Pipelines](./tutorial/Data_Pipelines.ipynb) (working as at 29/3/2025)\n", "- [End-to-end CNN Training Example](./tutorial/CNN_model_training.ipynb) (working as at 29/3/2025)\n", - "- [Working with Multiple Data Sources](./tutorial/MultipleSources.ipynb) (working as at 29/3/2025)" + "- [Working with Multiple Data Sources](./tutorial/MultipleSources.ipynb) (working as at 29/3/2025)\n", + "- [Working with Climate Data](./tutorial/Working_with_Climate_Data.ipynb) (working as at 23/4/2025)" ] }, { @@ -99,7 +100,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.7" + "version": "3.12.9" } }, "nbformat": 4, diff --git a/notebooks/tutorial/Working_with_Climate_Data.ipynb b/notebooks/tutorial/Working_with_Climate_Data.ipynb new file mode 100644 index 00000000..b7ac0275 --- /dev/null +++ b/notebooks/tutorial/Working_with_Climate_Data.ipynb @@ -0,0 +1,5491 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "bd231875-890b-48f7-86fa-d89867799dfc", + "metadata": {}, + "source": [ + "# Working with Climate Data" + ] + }, + { + "cell_type": "markdown", + "id": "21fd252f-ab46-468b-9679-ffee05ce8572", + "metadata": {}, + "source": [ + "## Introduction\n", + "\n", + "This tutorial shows, step by step, how to build a PyEarthTools pipeline for loading climate data. It is a precursor to creating an ML model for bias correction. It goes into considerable detail in order to document how to process new data sources and which steps are involved. 
The end result is a reusable data-loading pipeline that can serve many projects, based on the steps and assumptions shown here.\n", + "\n", + "In general terms, an \"analysis\" refers to estimates of current conditions, \"reanalysis\" refers to estimates of historical conditions, \"nowcasting\" refers to predictions of conditions expected in the next 90 minutes, \"short term\" refers to around 6-48 hours lead time and \"weather\" refers to 6 hours to 10 days lead time. A variety of terms like \"multi-week\", \"sub-seasonal\" and \"seasonal\" refer to the period from weeks to months. Lead times longer than a few months may also be referred to as \"seasonal\" and merge into the climate time scales. Time scales for climate modelling are typically decades into the future or longer. These terms are not universally agreed, and some people may refer to any of these time scales as climate modelling. Within this tutorial, climate data means predictions at least a year into the future.\n", + "\n", + "A \"reforecast\" is run to generate \"what the forecast would have been\". It is subtly different to a \"reanalysis\", because it includes the lead time component as well as the estimate at a point in time.\n", + "\n", + "Climate data often uses non-Gregorian calendars, which require date/time libraries that may be unfamiliar to many users. 
\n", + "\n", + "This tutorial will demonstrate loading climate data which includes both reforecasting outputs and predictions of the future, and merging it with reanalysis data in order to validate the reforecast component of the dataset.\n", + "\n", + "This will involve loading a climate prediction run from CMIP5, along with ERA5 reanalysis data.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "c54f1450-09bb-4209-93b2-85f0b409cfb6", + "metadata": {}, + "outputs": [], + "source": [ + "import pyearthtools.data as petdata\n", + "import pyearthtools.pipeline as petpipe\n", + "import site_archive_nci" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "7b5b550e-3926-4c2e-8222-0a73d05b638c", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "\n", + "# This builds an accessor that can be indexed by time, filtered according to the specified parameters\n", + "# Multiple institutions, scenarios and models are not yet supported, but the intention is to support them in future\n", + "# Models should generally be included in a pipeline of operations rather than used directly, but we will\n", + "# explore some of the functionality of this object regardless\n", + "cmip5_model1 = petdata.archive.CMIP5(institutions='BCC', scenarios=['rcp60'], models=['bcc-csm1-1'], interval='mon', variables='tas')" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "f50daa56-8456-44d1-bfd9-3266478dc634", + "metadata": {}, + "outputs": [], + "source": [ + "# With an exact time, you need to pick a time actually in the dataset; for fuzzy selection, see the next cell\n", + "\n", + "# Under-specifying the datetime will request all source data which matches Jan 2010\n", + "# In this case, the data is monthly, with a pseudo-day-of-month of the 16th of the month\n", + "# Note for later - longitude is indexed from 0 to 360\n", + "ds_cmip_2010 = cmip5_model1['2010-01'] # Query data along primary dimension\n", + "ds_cmip_2010" + ] + 
}, + { + "cell_type": "code", + "execution_count": 4, + "id": "1c0c36ce-cf18-4cad-8d89-9744003b0400", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
<xarray.Dataset> Size: 862kB\n", + "Dimensions: (time: 24, bnds: 2, lat: 64, lon: 128)\n", + "Coordinates:\n", + " * time (time) object 192B 2010-01-16 12:00:00 ... 2011-12-16 12:00:00\n", + " * lat (lat) float64 512B -87.86 -85.1 -82.31 ... 82.31 85.1 87.86\n", + " * lon (lon) float64 1kB 0.0 2.812 5.625 8.438 ... 351.6 354.4 357.2\n", + " height float64 8B 2.0\n", + "Dimensions without coordinates: bnds\n", + "Data variables:\n", + " time_bnds (time, bnds) object 384B dask.array<chunksize=(24, 2), meta=np.ndarray>\n", + " lat_bnds (time, lat, bnds) float64 25kB dask.array<chunksize=(24, 64, 2), meta=np.ndarray>\n", + " lon_bnds (time, lon, bnds) float64 49kB dask.array<chunksize=(24, 128, 2), meta=np.ndarray>\n", + " tas (time, lat, lon) float32 786kB dask.array<chunksize=(24, 64, 128), meta=np.ndarray>\n", + "Attributes: (12/24)\n", + " institution: Beijing Climate Center(BCC),China Meteorological ...\n", + " institute_id: BCC\n", + " experiment_id: rcp60\n", + " source: bcc-csm1-1:atmosphere: BCC_AGCM2.1 (T42L26); lan...\n", + " model_id: bcc-csm1-1\n", + " forcing: Nat Ant GHG SD Oz Sl SS Ds BC OC\n", + " ... ...\n", + " table_id: Table Amon (11 April 2011) 1cfdc7322cf2f4a3261482...\n", + " title: bcc-csm1-1 model output prepared for CMIP5 RCP6\n", + " parent_experiment: historical\n", + " modeling_realm: atmos\n", + " realization: 1\n", + " cmor_version: 2.5.6
<xarray.Dataset> Size: 6GB\n", + "Dimensions: (longitude: 1440, latitude: 721, time: 744)\n", + "Coordinates:\n", + " * longitude (longitude) float32 6kB -180.0 -179.8 -179.5 ... 179.5 179.8\n", + " * latitude (latitude) float32 3kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0\n", + " * time (time) datetime64[ns] 6kB 2010-01-01 ... 2010-01-31T23:00:00\n", + "Data variables:\n", + " 2t (time, latitude, longitude) float64 6GB dask.array<chunksize=(93, 91, 180), meta=np.ndarray>\n", + "Attributes:\n", + " Conventions: CF-1.6\n", + " history: 2020-09-28 12:42:00 UTC+1000 by era5_replication_tools-1.2....\n", + " license: Licence to use Copernicus Products: https://apps.ecmwf.int/...\n", + " summary: ERA5 is the fifth generation ECMWF atmospheric reanalysis o...\n", + " title: ERA5 single-levels reanalysis 2m_temperature 20100101-20100131
<xarray.Dataset> Size: 8MB\n", + "Dimensions: (longitude: 1440, latitude: 721, time: 1)\n", + "Coordinates:\n", + " * longitude (longitude) float32 6kB -180.0 -179.8 -179.5 ... 179.5 179.8\n", + " * latitude (latitude) float32 3kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0\n", + " * time (time) datetime64[ns] 8B 2010-01-01\n", + "Data variables:\n", + " 2t (time, latitude, longitude) float64 8MB dask.array<chunksize=(1, 721, 1440), meta=np.ndarray>\n", + "Attributes:\n", + " Conventions: CF-1.6\n", + " history: 2020-09-05 13:01:05 UTC+1000 by era5_replication_tools-1.0....\n", + " license: Licence to use Copernicus Products: https://apps.ecmwf.int/...\n", + " summary: ERA5 is the fifth generation ECMWF atmospheric reanalysis o...\n", + " title: ERA5 single-levels monthly-averaged 2m_temperature 20100101...
Pipeline\n",
+ "\tDescription `pyearthtools.pipeline` Data Pipeline\n",
+ "\n",
+ "\n",
+ "\tInitialisation \n",
+ "\t\t exceptions_to_ignore None\n",
+ "\t\t iterator None\n",
+ "\t\t sampler None\n",
+ "\tSteps \n",
+ "\t\t _.CMIP5 {'CMIP5': {'institutions': "'BCC'", 'interval': "'mon'", 'models': "['bcc-csm1-1']", 'scenarios': "['rcp60']", 'variables': "'tas'"}}\n",
+ "\t\t ERA5 {'ERA5': {'level_value': 'None', 'product': "'monthly-averaged'", 'variables': "['2t']"}}Pipeline\n",
+ "\tDescription `pyearthtools.pipeline` Data Pipeline\n",
+ "\n",
+ "\n",
+ "\tInitialisation \n",
+ "\t\t exceptions_to_ignore None\n",
+ "\t\t iterator None\n",
+ "\t\t sampler None\n",
+ "\tSteps \n",
+ "\t\t _.CMIP5 {'CMIP5': {'institutions': "'BCC'", 'interval': "'mon'", 'models': "['bcc-csm1-1']", 'scenarios': "['rcp60']", 'variables': "'tas'"}}\n",
+ "\t\t _recode_calendar.RecodeCalendar {'RecodeCalendar': {}}\n",
+ "\t\t ERA5 {'ERA5': {'level_value': 'None', 'product': "'monthly-averaged'", 'variables': "['2t']"}}\n",
+ "\t\t join.Merge {'Merge': {'merge_kwargs': 'None'}}<xarray.Dataset> Size: 29MB\n", + "Dimensions: (time: 2, latitude: 785, longitude: 1552, bnds: 2)\n", + "Coordinates:\n", + " * time (time) datetime64[ns] 16B 2010-01-01 2010-01-16T12:00:00\n", + " * latitude (latitude) float64 6kB -90.0 -89.75 -89.5 ... 89.5 89.75 90.0\n", + " * longitude (longitude) float64 12kB -180.0 -179.8 -179.5 ... 354.4 357.2\n", + " height float64 8B 2.0\n", + "Dimensions without coordinates: bnds\n", + "Data variables:\n", + " time_bnds (time, bnds) object 32B dask.array<chunksize=(1, 2), meta=np.ndarray>\n", + " lat_bnds (time, latitude, bnds) float64 25kB dask.array<chunksize=(1, 64, 2), meta=np.ndarray>\n", + " lon_bnds (time, longitude, bnds) float64 50kB dask.array<chunksize=(1, 128, 2), meta=np.ndarray>\n", + " tas (time, latitude, longitude) float32 10MB dask.array<chunksize=(1, 64, 128), meta=np.ndarray>\n", + " 2t (time, latitude, longitude) float64 19MB dask.array<chunksize=(1, 721, 1552), meta=np.ndarray>\n", + "Attributes: (12/24)\n", + " institution: Beijing Climate Center(BCC),China Meteorological ...\n", + " institute_id: BCC\n", + " experiment_id: rcp60\n", + " source: bcc-csm1-1:atmosphere: BCC_AGCM2.1 (T42L26); lan...\n", + " model_id: bcc-csm1-1\n", + " forcing: Nat Ant GHG SD Oz Sl SS Ds BC OC\n", + " ... ...\n", + " table_id: Table Amon (11 April 2011) 1cfdc7322cf2f4a3261482...\n", + " title: bcc-csm1-1 model output prepared for CMIP5 RCP6\n", + " parent_experiment: historical\n", + " modeling_realm: atmos\n", + " realization: 1\n", + " cmor_version: 2.5.6
Pipeline\n",
+ "\tDescription `pyearthtools.pipeline` Data Pipeline\n",
+ "\n",
+ "\n",
+ "\tInitialisation \n",
+ "\t\t exceptions_to_ignore None\n",
+ "\t\t iterator None\n",
+ "\t\t sampler None\n",
+ "\tSteps \n",
+ "\t\t _.CMIP5 {'CMIP5': {'institutions': "'BCC'", 'interval': "'mon'", 'models': "['bcc-csm1-1']", 'scenarios': "['rcp60']", 'variables': "'tas'"}}\n",
+ "\t\t _recode_calendar.RecodeCalendar {'RecodeCalendar': {}}\n",
+ "\t\t _align_dates.AlignDates {'AlignDates': {'to': "'01'"}}\n",
+ "\t\t ERA5 {'ERA5': {'level_value': 'None', 'product': "'monthly-averaged'", 'variables': "['2t']"}}\n",
+ "\t\t coordinates.StandardLongitude {'StandardLongitude': {'longitude_name': "'longitude'", 'type': "'0-360'"}}\n",
+ "\t\t join.GeospatialTimeSeriesMerge {'GeospatialTimeSeriesMerge': {'interpolation_method': "'nearest'", 'merge_kwargs': 'None', 'reference_dataset': 'None', 'reference_index': '0', 'time_dimension': "'time'"}}<xarray.Dataset> Size: 103kB\n", + "Dimensions: (time: 1, latitude: 64, bnds: 2, longitude: 128)\n", + "Coordinates:\n", + " height float64 8B 2.0\n", + " * latitude (latitude) float64 512B -87.86 -85.1 -82.31 ... 82.31 85.1 87.86\n", + " * longitude (longitude) float64 1kB 0.0 2.812 5.625 ... 351.6 354.4 357.2\n", + " * time (time) datetime64[ns] 8B 2010-01-01\n", + "Dimensions without coordinates: bnds\n", + "Data variables:\n", + " lat_bnds (time, latitude, bnds) float64 1kB dask.array<chunksize=(1, 64, 2), meta=np.ndarray>\n", + " lon_bnds (time, longitude, bnds) float64 2kB dask.array<chunksize=(1, 128, 2), meta=np.ndarray>\n", + " tas (time, latitude, longitude) float32 33kB dask.array<chunksize=(1, 64, 128), meta=np.ndarray>\n", + " time_bnds (time, bnds) object 16B dask.array<chunksize=(1, 2), meta=np.ndarray>\n", + " 2t (time, latitude, longitude) float64 66kB dask.array<chunksize=(1, 64, 128), meta=np.ndarray>\n", + "Attributes: (12/24)\n", + " institution: Beijing Climate Center(BCC),China Meteorological ...\n", + " institute_id: BCC\n", + " experiment_id: rcp60\n", + " source: bcc-csm1-1:atmosphere: BCC_AGCM2.1 (T42L26); lan...\n", + " model_id: bcc-csm1-1\n", + " forcing: Nat Ant GHG SD Oz Sl SS Ds BC OC\n", + " ... ...\n", + " table_id: Table Amon (11 April 2011) 1cfdc7322cf2f4a3261482...\n", + " title: bcc-csm1-1 model output prepared for CMIP5 RCP6\n", + " parent_experiment: historical\n", + " modeling_realm: atmos\n", + " realization: 1\n", + " cmor_version: 2.5.6