[1/3] Diffusion ckpt export for NVFP4 & FP8 #781
base: main
Conversation
Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
Important: Review skipped. Auto incremental reviews are disabled on this repository; please check the settings in the CodeRabbit UI.

📝 Walkthrough

This PR adds HuggingFace checkpoint export support for quantized diffusion models. It introduces a new CLI option and export configuration field for specifying a checkpoint directory, then extends the unified export module to route and handle diffusion pipeline exports with quantizer management and QKV fusion.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User as User/CLI
    participant Config as ExportConfig
    participant Manager as ExportManager
    participant Export as export_hf_ckpt()
    participant Router as Route Logic
    participant DiffusionExp as _export_diffusers_checkpoint()
    participant Components as Component Handler
    User->>Config: Pass --hf-ckpt-dir
    Config->>Manager: Create with hf_ckpt_dir set
    Manager->>Export: Call export_hf_ckpt(pipe)
    Export->>Router: Detect model type
    Router->>DiffusionExp: Route DiffusionPipeline
    DiffusionExp->>Components: Extract & process components
    Components->>Components: Hide quantizers
    Components->>Components: Fuse QKV linears
    Components->>Components: Save per-component subdirs
    DiffusionExp->>DiffusionExp: Save model_index.json
    DiffusionExp->>Export: Export complete
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 passed checks (3 passed)
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@modelopt/torch/export/unified_export_hf.py`:
- Around line 1055-1057: The _get_diffusers_components currently raises for
anything not a DiffusionPipeline but _export_diffusers_checkpoint accepts
DiffusionPipeline | ModelMixin; update _get_diffusers_components to also accept
instances of ModelMixin (e.g., a standalone UNet) by detecting isinstance(model,
ModelMixin) and returning a components mapping consistent with what
_export_diffusers_checkpoint expects (for example {'unet': model} or the
appropriate single-component keys used downstream); ensure the DiffusionPipeline
branch behavior is unchanged and that callers handle the returned mapping
uniformly.
- Around line 452-463: The loop over model.named_modules() sets
fsdp_module_to_reshard for each FSDPModule but never reshards the last one
after the loop, leaving it unsharded; after the loop completes add a final check
and call to reshard on fsdp_module_to_reshard (i.e., if fsdp_module_to_reshard
is not None: fsdp_module_to_reshard.reshard()) so the last FSDPModule is
properly resharded; locate symbols model.named_modules, FSDPModule,
fsdp_module_to_reshard, and reshard() to apply the fix.
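For illustration, a hedged sketch of what the suggested fix looks like; the loop shape and variable names follow the review comment rather than the actual file, and `model` is assumed to be the module being exported:

```python
from torch.distributed.fsdp import FSDPModule  # FSDP2 wrapper; availability depends on PyTorch version

fsdp_module_to_reshard = None
for name, module in model.named_modules():
    if isinstance(module, FSDPModule):
        # Reshard the previously tracked FSDP module before tracking the next one.
        if fsdp_module_to_reshard is not None:
            fsdp_module_to_reshard.reshard()
        fsdp_module_to_reshard = module
    # ... per-module export work happens here ...

# Final check: without this, the last FSDPModule stays unsharded after the loop.
if fsdp_module_to_reshard is not None:
    fsdp_module_to_reshard.reshard()
```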
🧹 Nitpick comments (5)
modelopt/torch/export/unified_export_hf.py (5)
82-83: Move import to top of file with other imports. The `contextmanager` import should be grouped with the other imports at the top of the file (around lines 18-27) rather than inserted mid-file.
Suggested fix: move it to the imports section at the top:
from contextlib import contextmanager
954-955: Consider using logging instead of print statements. The function uses `print()` for debug output, which is inconsistent with the rest of the codebase that uses `warnings.warn()` or could use a logger. This also affects production output.
Suggested approach: replace `print()` calls with `warnings.warn()` for warning-level messages, or consider adding an optional `logger` parameter:

- print("No quantized linear modules found for QKV fusion.")
+ warnings.warn("No quantized linear modules found for QKV fusion.")
...
- print(f"Warning: Unknown model type '{model_class_name}', skipping QKV fusion.")
+ warnings.warn(f"Unknown model type '{model_class_name}', skipping QKV fusion.")
...
- print(f"Warning: Failed to run dummy forward for QKV fusion: {e}")
- print("Skipping QKV fusion. Quantization may still work but amax values won't be unified.")
+ warnings.warn(f"Failed to run dummy forward for QKV fusion: {e}. Skipping QKV fusion.")

Also applies to: 970-970, 979-980, 1014-1015, 1017-1020
1113-1125: Minor: Step numbering is inconsistent; "Step 2" is missing. The comments jump from "Step 1" (line 1113) to "Step 3" (line 1125). Consider renumbering for clarity.
1129-1129: Consider using `warnings.warn()` or a logger instead of `print()` statements. This function makes multiple `print()` calls for status messages. For consistency with the rest of the codebase and to let users control output, consider using `warnings.warn()` or passing in a logger.
Also applies to: 1147-1148, 1178-1178, 1190-1190, 1206-1206, 1229-1229
23-23: Unnecessary import of the builtin `ValueError`. `ValueError` is a Python builtin and doesn't need to be imported.
Suggested fix:
-from builtins import ValueError
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
examples/diffusers/quantization/quantize.py
modelopt/torch/export/unified_export_hf.py
🧰 Additional context used
🧬 Code graph analysis (2)
examples/diffusers/quantization/quantize.py (1)
modelopt/torch/export/unified_export_hf.py (1)
export_hf_checkpoint (1232-1303)
modelopt/torch/export/unified_export_hf.py (6)
modelopt/torch/export/layer_utils.py (1)
  is_quantlinear (346-348)
modelopt/torch/export/quant_utils.py (1)
  get_quantization_format (432-533)
modelopt/torch/quantization/utils.py (1)
  fsdp2_aware_weight_update (689-797)
modelopt/torch/quantization/conversion.py (1)
  set_quantizer_by_cfg_context (305-327)
modelopt/torch/export/convert_hf_config.py (1)
  convert_hf_quant_config_format (21-117)
modelopt/torch/export/plugins/hf_spec_export.py (2)
  spec_opt_only (107-112)
  export_spec_ckpt_state_dict (115-153)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (13)
modelopt/torch/export/unified_export_hf.py (7)
85-119: LGTM - Context manager for hiding quantizers during save. The implementation correctly backs up and restores quantizer attributes. Using `delattr`/`setattr` with a try/finally ensures quantizers are restored even if an exception occurs during save.
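For illustration, a minimal sketch of this kind of backup-and-restore context manager; the quantizer attribute names and helper name below are assumptions, not the module's actual implementation:

```python
from contextlib import contextmanager

import torch.nn as nn

# Hypothetical quantizer attribute names; the real module may track a different set.
_QUANTIZER_ATTRS = ("input_quantizer", "weight_quantizer", "output_quantizer")


@contextmanager
def hide_quantizers(model: nn.Module):
    """Temporarily detach quantizer submodules so save_pretrained() writes a plain checkpoint."""
    backup = []  # (module, attr_name, quantizer) triples to restore afterwards
    try:
        for module in list(model.modules()):
            for attr in _QUANTIZER_ATTRS:
                if hasattr(module, attr):
                    backup.append((module, attr, getattr(module, attr)))
                    delattr(module, attr)
        yield model
    finally:
        # Restore quantizers even if saving raised an exception.
        for module, attr, quantizer in backup:
            setattr(module, attr, quantizer)
```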
617-816: LGTM - Comprehensive dummy input generation for diffusion models. Good implementation with defensive coding:
- Uses `isinstance` checks with a fallback to string matching when imports fail
- Provides sensible defaults via `getattr` for missing config attributes
- Returns `None` for unsupported models, which the caller handles gracefully
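To illustrate this defensive pattern (not the PR's actual code; the SD3 input names follow the review excerpt further down, while the shapes and config defaults are assumptions):

```python
import torch


def _build_dummy_inputs(model):
    """Sketch: isinstance check with string fallback, getattr defaults, None for unsupported models."""
    cfg = getattr(model, "config", None)
    try:
        from diffusers import SD3Transformer2DModel

        is_sd3 = isinstance(model, SD3Transformer2DModel)
    except ImportError:
        is_sd3 = type(model).__name__ == "SD3Transformer2DModel"  # string-match fallback

    if is_sd3:
        # SD3Transformer2DModel: 4D hidden_states plus text conditioning and timestep.
        in_channels = getattr(cfg, "in_channels", 16)
        joint_dim = getattr(cfg, "joint_attention_dim", 4096)
        pooled_dim = getattr(cfg, "pooled_projection_dim", 2048)
        return {
            "hidden_states": torch.randn(1, in_channels, 64, 64),
            "encoder_hidden_states": torch.randn(1, 77, joint_dim),
            "pooled_projections": torch.randn(1, pooled_dim),
            "timestep": torch.tensor([1.0]),
        }

    return None  # caller handles unsupported model types gracefully
```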
818-869: LGTM - QKV projection identification logic. The pattern matching is comprehensive, covering common diffusers naming conventions (`to_q`, `to_k`, `to_v`, etc.) and correctly handles nested module paths.
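A rough sketch of this kind of name-based matching; the suffix set below is an assumption drawn from common diffusers attention naming, not the PR's exact list:

```python
# Hypothetical suffix set; diffusers attention blocks commonly use these names.
QKV_SUFFIXES = {"to_q", "to_k", "to_v", "add_q_proj", "add_k_proj", "add_v_proj"}


def is_qkv_projection(module_name: str) -> bool:
    # Handles nested paths such as "transformer_blocks.0.attn.to_q".
    leaf = module_name.rsplit(".", 1)[-1]
    return leaf in QKV_SUFFIXES
```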
871-909: LGTM - QKV grouping logic. Correctly groups QKV projections by parent attention block and distinguishes between main and added (cross-attention) QKV types.
1059-1072: LGTM - Simple quantization check. Clean implementation using a generator expression with `any()` for early termination.
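A short version of this pattern, sketched here with the `is_quantlinear` helper from `layer_utils` listed in the code graph above (the exact predicate used in the PR may differ):

```python
from modelopt.torch.export.layer_utils import is_quantlinear


def _has_quantized_modules(model) -> bool:
    # any() stops at the first quantized linear it finds (early termination).
    return any(is_quantlinear(module) for module in model.modules())
```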
1074-1086: LGTM - dtype inference with sensible default. Returns the dtype of the first parameter found, with a reasonable `float16` fallback for edge cases (e.g., models with no parameters).
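A minimal sketch of that behavior (illustrative only; the PR's helper may differ in name and details):

```python
import torch
import torch.nn as nn


def _infer_dtype(model: nn.Module) -> torch.dtype:
    # Use the dtype of the first parameter; fall back to float16 for parameter-free models.
    for param in model.parameters():
        return param.dtype
    return torch.float16
```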
1232-1261: LGTM - Clean routing between diffusers and transformers export. The updated public API correctly routes to the appropriate export function based on model type. The `components` parameter documentation clearly states it's only for diffusers pipelines.
examples/diffusers/quantization/quantize.py (6)
69-69: LGTM - Import follows existing pattern. The import is correctly placed with other modelopt imports.
352-352: LGTM - ExportConfig extension follows existing patterns. The `hf_ckpt_dir` field and its validation mirror the existing `onnx_dir` handling.
Also applies to: 368-370
870-883: LGTM - Method follows existing ExportManager patterns. Clean implementation that mirrors other export methods like `save_checkpoint` and `export_onnx`.
1016-1020: LGTM - CLI argument follows existing conventions. The `--hf-ckpt-dir` argument is consistent with the existing `--onnx-dir` pattern.
1097-1097: LGTM - Config construction follows existing pattern. Correctly handles the optional `hf_ckpt_dir` argument.
1153-1155: LGTM - HF checkpoint export integrated at an appropriate point in the flow. Placed after ONNX export, following the logical export sequence.
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

@@           Coverage Diff           @@
##             main     #781   +/-   ##
=======================================
  Coverage   74.19%   74.19%
=======================================
  Files         192      192
  Lines       19238    19238
=======================================
+ Hits        14273    14274      +1
+ Misses       4965     4964      -1

☔ View full report in Codecov by Sentry.
See #781. This is the MR that only includes the refactoring of the LLM export; please ignore the change on quantize.py from the diffusion example.

Summary by CodeRabbit

Release Notes

New Features
- Added `--hf-ckpt-dir` CLI option to save checkpoints in HuggingFace format
- Enabled support for exporting Diffusers-based pipelines
- Unified export system now handles both transformer and diffusion model architectures

Signed-off-by: Jingyu Xin <jingyux@nvidia.com>
batch_size = 1

# Try to import specific model classes for isinstance checks
try:
could you simplify the logic and use a helper function?
+1
thanks! Defined an internal helper function to handle different input types.
If you’re suggesting merging the code and using a unified single input for all diffusion models, that isn’t possible.
#781 (comment)
]:
    if qkv_only:
        raise NotImplementedError("Diffusion only")
    # Filter to only include QKV projection layers (diffusion models)
QQ: why is there a mode for qkv only?
For diffusion, we focus on QKV fusion for now due to the downstream deployment.
why not do fc1 and fc2 fusion?
I think this is simply not implemented yet; the reason is still unclear. I’ll keep tracking their code, and once it’s added, we’ll update our implementation accordingly.
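For context, QKV amax fusion as discussed here typically unifies the calibrated amax across the q/k/v projections of one attention block so they can be fused into a single GEMM at deployment. A minimal sketch under that assumption; the quantizer attribute names are illustrative, not the PR's exact implementation:

```python
import torch


def _unify_qkv_amax(q_proj, k_proj, v_proj) -> None:
    """Sketch: share one amax across q/k/v so the fused projection uses a single scale.

    Assumes per-tensor quantizer scales and the attribute names below; the PR's
    implementation may differ.
    """
    for attr in ("input_quantizer", "weight_quantizer"):
        quantizers = [getattr(proj, attr, None) for proj in (q_proj, k_proj, v_proj)]
        quantizers = [q for q in quantizers if q is not None and getattr(q, "amax", None) is not None]
        if not quantizers:
            continue
        # The fused projection must use one scale, so take the max over the group.
        shared = max(float(q.amax.max()) for q in quantizers)
        for q in quantizers:
            q.amax = torch.full_like(q.amax, shared)
```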
cjluo-nv
left a comment
Do you have any unit tests for this change?
elif is_sd3:
    # SD3Transformer2DModel: 4D hidden_states (batch, channels, height, width)
    # Requires: hidden_states, encoder_hidden_states, pooled_projections, timestep
    in_channels = getattr(cfg, "in_channels", 16)
Could we have a helper function for getting all the required info from the model to avoid repeating code?
Hmm, it’s challenging for diffusion models because the input shapes and parameter names vary from model to model. Unlike LLMs, there isn’t a unified input format that works across all diffusion models.
That means every time a new model is introduced, we have to manually adjust the inputs and add the required functions.
Just added some simple unit tests; I will add a more formal e2e test case in the 3rd MR.
return quantized_state_dict, quant_config


def _fuse_qkv_linears_diffusion(model: nn.Module) -> None:
Can we move this function to quant_utils.py?
I think this function should remain in the diffusers file, since it only applies to diffusers. However, moving _has_quantized_modules to quant_utils makes sense to me.
if isinstance(pipe, DiffusionPipeline):
    for component_name, component in all_components.items():
        # Skip nn.Module components (already handled above)
        if isinstance(component, nn.Module):
            continue

        component_export_dir = export_dir / component_name
        component_export_dir.mkdir(parents=True, exist_ok=True)

        print(f"Exporting component: {component_name} ({type(component).__name__})")

        # Handle different component types
        if hasattr(component, "save_pretrained"):
            # Tokenizers, feature extractors, image processors
            component.save_pretrained(component_export_dir)
        elif hasattr(component, "save_config"):
            # Schedulers
            component.save_config(component_export_dir)
        else:
            warnings.warn(
                f"Component '{component_name}' of type {type(component).__name__} "
                "does not have save_pretrained or save_config method. Skipping."
            )
            continue

        print(f"  Saved to: {component_export_dir}")
Can we check if the exported components are identical to the original ones?
Sorry, could you clarify what you meant by “identical to the original ones”?
What does this PR do?
Type of change: New feature
Overview:
This PR adds support for exporting quantized diffusers models (DiT, Flux, SD3, UNet, etc.) to HuggingFace checkpoint format, enabling deployment to inference frameworks like SGLang, vLLM, and TensorRT-LLM.
Changes
New file: `diffusers_utils.py`
- `hide_quantizers_from_state_dict()` context manager for clean saves

Refactored: `unified_export_hf.py`
- `_fuse_qkv_linears_diffusion()` for QKV amax fusion
- `_export_diffusers_checkpoint()` to export full pipelines (models + tokenizers + schedulers etc.)

Plans
Usage
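The original Usage section was collapsed in this view. Based on the export function referenced in the review above, programmatic usage would look roughly like the sketch below; the pipeline name and the exact keyword arguments are assumptions, not confirmed by this PR:

```python
from diffusers import DiffusionPipeline

from modelopt.torch.export.unified_export_hf import export_hf_checkpoint

# Hypothetical example: assumes the pipeline's transformer has already been
# quantized and calibrated with modelopt before export.
pipe = DiffusionPipeline.from_pretrained("black-forest-labs/FLUX.1-dev")
# ... quantize + calibrate pipe.transformer here ...
export_hf_checkpoint(pipe, export_dir="./exported_hf_ckpt")
```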
Testing
Before your PR is "Ready for review"
Additional Information
Summary by CodeRabbit
New Features
- `--hf-ckpt-dir` CLI argument for specifying the checkpoint export destination