Offline generation of routehandles to have bfb with online generation #615
Replies: 5 comments 1 reply
-
Thank you for your detailed report. I am going to transfer this to a discussion, which seems more appropriate here. I'll add some comments there.
-
I hoped to be able to give you some help here, but reading back through your description, it looks like you have already dug as far as – or farther than – I can get quickly with what I know about the relevant parts of CMEPS. So I'll reach out to others to see if they can help.

I looked at the ESMF release notes to see if there may have been any issues fixed recently that would help with this. I do see that there was a fix for second-order conservative remapping, but my sense is that it probably doesn't apply in this case. And, based on your findings – particularly, trying to write your own offline generator – it seems like this issue indeed has some CMEPS-specific behavior tied into it, and doesn't seem to simply be a general ESMF issue. However, I will mention this to the ESMF team in case there are some performance improvements that could be applied to the vector remapping.

I can give you some general answers to some of your questions: One of the big advantages that came with CMEPS over our previous coupling infrastructure (in CESM) was that it did away with the need for offline generation of mapping files. As such, I don't think we (in CESM, anyway) have plans for any kind of official support for offline map generation. However, there have periodically been considerations of adding a capability to write the RouteHandle (mapping) information from a run so that it can be read in future runs rather than being regenerated each time. See #335 for additional thoughts on that feature. So far we haven't seen a use case where this has felt important enough to be worth prioritizing, but if you would find that feature useful, then we would welcome contributions to add it. I'm not sure that we (again, referring to CESM) can justify developing that feature within our near-term development priorities, but (pending some more discussion here) we may be able to at least support you or someone from your team in its development.
-
@billsacks We (UFS) do have a branch that contains the "write/read" RH feature already. We had it prepared for one of the operational implementations, where the layout is fixed and we know it won't be changing. However, I think some on the ESMF team are aware of some weird issues that came along with trying to use that feature (specifically, measurably slowing the post-ice and post-ocn phases), and so we didn't end up using it. They're trying to figure out why...

However, for other operational implementations (such as the new DATM + 1/12 MOM/CICE6), the initial hope is that mapfiles will be usable. The initialization cost, when you're just making 9-day runs that have to run within a certain operational window, makes any improvement in the start-up cost worth the effort. But we're talking fractions of minutes, not hours (i.e. a few minutes vs. <1 min) for the overall RH creation step in DataInitialize.

Back to mapfiles though: for our config, ocean and ice are always on the same grid (same mesh file), so we won't be generating separate "A->O" and "A->I" mapfiles. The mapfiles contain just the weights (Gerhard/Bob explained this in a meeting, which is why they can be layout-independent), so if the mesh is the same, there wouldn't be any need for two mapfiles, right?

EDIT: re-reading now, I see that your meshes are the same.
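To illustrate the layout-independence point above: a map file stores only sparse-matrix factors (source index, destination index, weight), so applying it is a plain sparse matrix-vector product over global indices, independent of any PE decomposition. A minimal sketch (the `col`/`row`/`S` names follow the ESMF weight-file convention; the toy data is invented for illustration):

```python
def apply_weights(col, row, S, src, n_dst):
    """Apply sparse regridding weights: dst[r] += S[j] * src[c] per factor."""
    dst = [0.0] * n_dst
    for c, r, s in zip(col, row, S):
        # ESMF weight files store 1-based global indices.
        dst[r - 1] += s * src[c - 1]
    return dst

# Toy example: average two source cells onto one destination cell.
col = [1, 2]      # source indices (1-based)
row = [1, 1]      # destination indices (1-based)
S   = [0.5, 0.5]  # weights
print(apply_weights(col, row, S, [2.0, 4.0], n_dst=1))  # -> [3.0]
```

Because only these factors matter, the same file works for any layout as long as the global index space (the mesh) is unchanged.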
-
I guess I would first try updating ESMF to the latest – 8.9.0 is available – and if you can point to a scaling performance issue in ESMF or in CMEPS, I think we should address it; there is no reason the online regridding should not scale.
-
Thanks all for your comments! For anyone interested, @DeniseWorthen and I also have some related discussion here: ACCESS-NRI/om3-scripts#91.

@jedwards4b This thread ACCESS-NRI/access-om3-configs#334 (comment) includes a preliminary scaling study for our global 25 km ACCESS-OM3 configuration. In the first figure (black solid line), you can see that once the mediator CPU core count exceeds 144, the initialisation time increases sharply. The two additional plots show the corresponding scaling and efficiency results as a function of MED core counts.

Thanks @jedwards4b, I'll upgrade to ESMF 8.9.0, re-run the tests, and see how it goes. Edit: the plots differ slightly because the screenshots come from our more recent scaling study.
-
TL;DR:
Our goal is to fully replicate CMEPS online routehandle generation in an offline workflow using ESMF mesh files, while remaining bit-for-bit (bfb) with the online result. This is to avoid very expensive mediator initialisation at high MED core counts. Any guidance or example code on how to do this would be greatly appreciated!
Context
We are using CMEPS as the coupler for our ACCESS-OM3 MOM6-CICE6 configuration, with DATM and DROF as data components, and are running into an initialisation cost issue – the online generation of routehandles – when scaling up the mediator core counts.

It runs well at moderate MED core counts. However, as we increase `cpl_ntasks`, the online routehandle generation becomes expensive, especially the vector mapping step (`mappatch_uv3d`) when `mapuv_with_cart3d = .true.`. This becomes a major bottleneck for higher-resolution configurations. Further evidence can be found here for a moderate-resolution configuration.

Below is what we've been doing so far and where we're stuck.
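For context on the `mapuv_with_cart3d` step: our understanding is that the vector mapping rotates (u, v) into 3D Cartesian components, regrids each component with the patch method, and rotates back, which avoids pole singularities. A self-contained sketch of just the rotation and its inverse (the regrid step in between is omitted; the function names are ours, not CMEPS's):

```python
import math

def uv_to_cart3d(u, v, lon_deg, lat_deg):
    """Rotate an eastward/northward vector into 3D Cartesian components."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    ux = -math.sin(lon) * u - math.sin(lat) * math.cos(lon) * v
    uy =  math.cos(lon) * u - math.sin(lat) * math.sin(lon) * v
    uz =  math.cos(lat) * v
    return ux, uy, uz

def cart3d_to_uv(ux, uy, uz, lon_deg, lat_deg):
    """Inverse rotation: project Cartesian components back onto east/north."""
    lon, lat = math.radians(lon_deg), math.radians(lat_deg)
    u = -math.sin(lon) * ux + math.cos(lon) * uy
    v = (-math.sin(lat) * math.cos(lon) * ux
         - math.sin(lat) * math.sin(lon) * uy
         + math.cos(lat) * uz)
    return u, v

# Round trip at an arbitrary point recovers the original vector
# (up to floating-point rounding).
u, v = cart3d_to_uv(*uv_to_cart3d(1.0, 2.0, 30.0, 45.0), 30.0, 45.0)
```

Since three scalar regrids are performed per vector pair, this step costs roughly three times a scalar patch mapping, which is consistent with it dominating initialisation at high core counts.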
Current configuration:
What we already tried offline
1. Use `ESMF_RegridWeightGen` directly

We first tried generating weight files with `ESMF_RegridWeightGen` using the same ESMF mesh files (e.g. for atm->ocn), and wired them into `nuopc.runconfig`. (I made some changes to the source code and included `*_smapname` / `*_fmapname` / `*_vmapname` attributes.)
Here is what we included in `nuopc.runconfig`.
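For concreteness, the wiring looked something like the fragment below. Note the paths and file names are illustrative, and these `*mapname` attributes come from our patched CMEPS rather than stock:

```
MED_attributes::
     atm2ocn_smapname = ./INPUT/map_atm2ocn_bilinear.nc
     atm2ocn_fmapname = ./INPUT/map_atm2ocn_conserve.nc
     atm2ocn_vmapname = ./INPUT/map_atm2ocn_patch.nc
     atm2ice_smapname = ./INPUT/map_atm2ice_bilinear.nc
     atm2ice_fmapname = ./INPUT/map_atm2ice_conserve.nc
     atm2ice_vmapname = ./INPUT/map_atm2ice_patch.nc
::
```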
Results: with `bilinear` and `conserve` weight files for the scalar mappings (e.g. `atm2ocn_smapname`, `atm2ocn_fmapname`, and similarly for ICE), we get bfb-identical results compared to the CMEPS online `mapbilnr` and `mapconsf` mappings. The remaining mismatch is in the `mappatch` / `mappatch_uv3d` mapping.

2. Offline generator mirroring CMEPS logic
I then wrote a small standalone program that tries to re-implement the CMEPS `mappatch_uv3d` generation using only the mesh files, `ESMF_MeshCreate`, and `ESMF_RouteHandleWrite`.

Results:
The test configuration is run in serial mode – the ocn and ice core counts are the same, and we use the same ocn and ice ESMF meshes – so this offline program produces two identical routehandle files.

In contrast, the RH files produced online by CMEPS are clearly different, and swapping them in `atm2ocn_vmapname` / `atm2ice_vmapname` changes the model results, confirming that each component genuinely has a distinct mapping. This suggests that the online generation uses additional information beyond the mesh plus scalar srcMaskValues/dstMaskValues. Does CMEPS incorporate additional component-specific masking or mediator state?

Current workaround
We can run a short case to allow CMEPS to generate all required routehandles online. We then copy the routehandles into `INPUT` and reference them in `nuopc.runconfig` for production runs. This is bfb with a fully online workflow.

Request
We are seeking help on: