Latent MOE support and fix KV cache quant export #768
base: main
Conversation
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Codecov Report
Additional details and impacted files:

```
@@            Coverage Diff             @@
##              main     #768      +/-  ##
==========================================
- Coverage    74.62%   74.62%    -0.01%
==========================================
  Files          192      192
  Lines        18989    18992        +3
==========================================
+ Hits         14171    14172        +1
- Misses        4818     4820        +2
```
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
```python
# Step 1: Sync amax across local experts in a SequentialMLP
for name, module in model.named_modules():
    if hasattr(module, "sync_moe_local_experts_amax"):
        module.sync_moe_local_experts_amax()
```

```python
# TODO just for testing
if "experts" in name and "weight_quantizer" in name:
    assert child.amax is not None
```
Can we move this before the distributed sync check? It does not do anything specific to distributed sync.
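A minimal sketch of the suggested reordering, assuming the surrounding calibration code currently runs this block inside a distributed-sync branch; `need_distributed_sync` and `sync_amax_across_ranks` are placeholder names, not the real helpers:

```python
# Sketch only: the local-expert amax sync and the expert amax assertion touch no
# cross-rank state, so they can run unconditionally, before any distributed sync.
for name, module in model.named_modules():
    if hasattr(module, "sync_moe_local_experts_amax"):
        module.sync_moe_local_experts_amax()

if need_distributed_sync:          # placeholder condition
    sync_amax_across_ranks(model)  # placeholder for the actual distributed amax sync
```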
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
```python
# TODO: double-check whether the MOE forward will be implemented in MoELayer or TransformerLayer.
# We do not need both layers to be patched.
@QuantModuleRegistry.register(
    {megatron_transformer_layer.TransformerLayer: "megatron_transformer_layer_TransformerLayer"}
)
```
TODO: maybe remove this, since the MOE forward will not be removed in MLM main?
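For context, a minimal sketch of what forcing tokens through every expert during calibration could look like when patching a layer's forward; `self.experts` and the uniform averaging are illustrative assumptions, not the actual ModelOpt/Megatron implementation:

```python
import torch


def _calib_moe_forward(self, hidden_states):
    """Illustrative sketch: during PTQ calibration, send every token through every
    local expert so each expert's quantizers observe real activations."""
    expert_outputs = [expert(hidden_states) for expert in self.experts]
    # The real forward keeps the router's weighted combination; a plain average is
    # used here only to return a tensor of the expected shape.
    return torch.stack(expert_outputs, dim=0).mean(dim=0)
```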
| "input_layernorm": NameRemapping("backbone.layers.{}.norm."), | ||
| "linear_qkv": QKVSlicing("backbone.layers.{}.mixer."), | ||
| "linear_proj": NameRemapping("backbone.layers.{}.mixer.o_proj."), | ||
| "core_attention": SelfAttentionScaling("backbone.layers.{}.mixer."), |
Double-check that this is only needed for export.
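As a rough illustration of what these export rules do (not the actual `NameRemapping` implementation), each rule expands a target-checkpoint prefix template for a given layer index; the helper below is hypothetical:

```python
def remap_name(template: str, layer_idx: int, param_name: str) -> str:
    """Hypothetical helper: expand an export prefix template such as
    'backbone.layers.{}.mixer.o_proj.' for a given layer index."""
    return template.format(layer_idx) + param_name


# remap_name("backbone.layers.{}.mixer.o_proj.", 3, "weight")
# -> "backbone.layers.3.mixer.o_proj.weight"
```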
Signed-off-by: jenchen13 <jennifchen@nvidia.com>
What does this PR do?
Type of change: New feature
Overview:
New Nemotron models use `TransformerLayer.forward()` instead of `MoELayer.forward()` for MOE. This is a breaking change for our quantization implementation for Nano3, which relied on patching `MoELayer.forward()` to force tokens to be routed to all experts during calibration.
- Patch `TransformerLayer.forward()` so that tokens are routed to all experts during PTQ calibration
- Fix KV cache quant export: replace the `qkv_layer.output_quantizer` export with proper `k/v_bmm_quantizer` logic (see the sketch after this list)
- TODO: potentially remove the MoELayer quant config if all MOEs in the future will use TransformerLayer instead?
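For the KV cache export fix, a minimal sketch of the general idea, assuming an FP8 KV cache and that the attention module exposes `k_bmm_quantizer` / `v_bmm_quantizer` with calibrated `amax` values; the attribute names and the E4M3 maxbound of 448.0 are assumptions, not the actual export code:

```python
def kv_cache_scaling_factors(attention_module, fp8_maxbound: float = 448.0):
    """Illustrative sketch: derive KV-cache scaling factors from the k/v BMM
    quantizers' amax instead of the QKV projection's output_quantizer."""
    k_scale = attention_module.k_bmm_quantizer.amax.float() / fp8_maxbound
    v_scale = attention_module.v_bmm_quantizer.amax.float() / fp8_maxbound
    return k_scale, v_scale
```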
Usage
# Add a code snippet demonstrating how to use this

Testing
Before your PR is "Ready for review"
Additional Information