[5525939] Allow user to select target opset in MOQ #809
base: main
Conversation
📝 Walkthrough

This change introduces opset version control to ONNX quantization, allowing users to specify the target ONNX opset version via the CLI.
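For orientation, a minimal Python usage sketch of the new parameter is shown below. The quantize() keyword names (onnx_path, quantize_mode, output_path) are assumptions drawn from the review discussion, not verified API documentation; only the opset parameter is the subject of this PR.

```python
# Hypothetical usage sketch: keyword names are assumed from the review discussion.
from modelopt.onnx.quantization import quantize

quantize(
    onnx_path="model.onnx",          # input FP32/FP16 model (assumed parameter name)
    quantize_mode="int8",            # e.g. int8 / fp8 / int4 variants / nvfp4
    opset=21,                        # new: requested target opset; raised automatically if too low
    output_path="model.quant.onnx",  # assumed parameter name
)
```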
Sequence Diagram

```mermaid
sequenceDiagram
participant User as User/CLI
participant Quantizer as quantize()
participant Preprocessor as _preprocess_onnx()
participant OpsetUtil as Opset Utils
participant Converter as convert_to_f16()
participant Model as ONNX Model
User->>Quantizer: quantize(model, opset=X)
Quantizer->>Preprocessor: _preprocess_onnx(model, opset=X)
Preprocessor->>OpsetUtil: get_qdq_precisions(model)
OpsetUtil-->>Preprocessor: precision_set
Preprocessor->>OpsetUtil: get_min_opset_for_precisions(precision_set)
OpsetUtil-->>Preprocessor: mode_min_opset
alt opset provided
Preprocessor->>Preprocessor: validate opset vs mode_min_opset
alt opset < mode_min_opset
Preprocessor->>Preprocessor: warn & upgrade to mode_min_opset
end
else opset not provided
Preprocessor->>Preprocessor: select max(original_opset, mode_min_opset)
end
Preprocessor->>Model: convert to target_opset
Preprocessor-->>Quantizer: preprocessed_model
Quantizer->>Converter: convert_to_f16(model, opset=target_opset)
Converter->>OpsetUtil: get_qdq_precisions(model)
Converter->>OpsetUtil: get_min_opset_for_precisions(precision_set)
Converter->>Converter: compute min_opset with validation
Converter-->>Quantizer: converted_model
Quantizer-->>User: quantized_model
```
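In plain Python, the selection rule sketched in the diagram boils down to roughly the following. This is a minimal sketch: the function and variable names mirror the diagram, not the actual implementation inside _preprocess_onnx.

```python
# Illustrative sketch of the opset-selection rule from the diagram above.
# Names are taken from the diagram; this is not the repository's actual code.
import warnings


def select_target_opset(original_opset: int, mode_min_opset: int, user_opset: int | None) -> int:
    """Pick the opset the preprocessed model should be converted to."""
    if user_opset is not None:
        if user_opset < mode_min_opset:
            warnings.warn(
                f"Requested opset {user_opset} is below the minimum {mode_min_opset} "
                "required by this quantization mode; upgrading."
            )
            return mode_min_opset
        return user_opset
    # No explicit request: never downgrade the model, but satisfy the mode's minimum.
    return max(original_opset, mode_min_opset)
```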
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/onnx/quantization/quantize.py (1)
440-521: Pass the effective target opset downstream.
_preprocess_onnx may upgrade the model's opset via onnx.version_converter.convert_version(), but quantize_int8 and quantize_fp8 receive the raw user opset input. This creates a mismatch: the quantizers' opset parameter may diverge from the actual model opset after preprocessing. Pass get_opset_version(onnx_model) (already imported) instead.
🔧 Suggested fix
```diff
 ) = _preprocess_onnx(
     onnx_path,
     use_external_data_format,
     output_path,
@@
     simplify,
     quantize_mode,
     opset,
 )
+target_opset = get_opset_version(onnx_model)
@@
-    opset=opset,
+    opset=target_opset,
     **kwargs,
 )
```
🤖 Fix all issues with AI agents
In `@modelopt/onnx/autocast/convert.py`:
- Around line 222-235: The docstring for the conversion routine is out of sync
with the implementation: low_precision_type uses a base_min_opset of 19 for
"fp16" (see base_min_opset and low_precision_type in convert.py) but the
docstring still claims 13; update the docstring to state that the default
minimum opset for fp16 is 19 (and bf16 is 22) and keep the note that Q/DQ nodes
may require increasing the opset (e.g., FP8/INT4/NVFP4) so the documentation
matches the logic in get_qdq_precisions, get_min_opset_for_precisions and the
base_min_opset assignment.
In `@modelopt/onnx/quantization/__main__.py`:
- Around line 289-297: Update the help text for the "--opset" argument added via
argparser.add_argument to include bf16 minimum opset info: mention that
--high_precision_dtype bf16 may require opset 22 (in addition to the existing
note about 19 for fp16, 21 for int4, 23 for nvfp4), and make the same change to
the other duplicate "--opset" help string found elsewhere in this module; ensure
the message clearly states that opset may be automatically increased if required
by operations.
In `@modelopt/onnx/quantization/quantize.py`:
- Around line 124-130: The min-opset lookup uses exact keys so variants like
"int4_awq" fall back to BASE_MIN_OPSET; normalize quantize_mode before querying
QDQ_PRECISION_MIN_OPSET: compute a normalized_mode (e.g., if "int4" in
quantize_mode -> "int4", if "nvfp4" in quantize_mode or "float4" variant ->
"float4_e2m1fn", etc.), then use QDQ_PRECISION_MIN_OPSET.get(normalized_mode,
BASE_MIN_OPSET) when setting mode_min_opset; update references in quantize.py
that use quantize_mode (including the substring checks and the get_opset_version
flow) to use the normalized value so variants resolve to the correct minimum
opset.
In `@modelopt/onnx/utils.py`:
- Around line 702-731: get_qdq_precisions currently only inspects
DequantizeLinear nodes with Constant inputs and misses QuantizeLinear nodes and
non-constant/Variable paths (activations), causing under-reporting of
precisions; update get_qdq_precisions to also iterate QuantizeLinear nodes and
extract precision from their output_dtype attribute where present, and for both
QuantizeLinear and DequantizeLinear handle Variable inputs by resolving the
tensor type via the graph/model value_info or node.output type (e.g., check
graph.value_info / model.graph.value_info / model.graph.input/output types for
the corresponding tensor and use its elem_type/name), while still keeping the
existing Constant-path logic (Constant.values.dtype.name) and preserving
detection of custom nodes like TRT_FP4DynamicQuantize.
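As a rough illustration of the broader scan described above, here is a hedged sketch. The helper name, return shape, and precision strings are illustrative and do not reproduce the actual modelopt.onnx.utils.get_qdq_precisions implementation.

```python
# Hedged sketch of an extended precision scan; not the repository's actual code.
import onnx


def collect_qdq_precisions(model: onnx.ModelProto) -> set[str]:
    graph = model.graph
    # Map tensor name -> elem_type from initializers and any available value_info.
    elem_types = {init.name: init.data_type for init in graph.initializer}
    for vi in list(graph.value_info) + list(graph.input) + list(graph.output):
        elem_types[vi.name] = vi.type.tensor_type.elem_type

    precisions: set[str] = set()
    for node in graph.node:
        if node.op_type == "QuantizeLinear":
            # QuantizeLinear may carry an explicit output_dtype attribute (opset >= 21).
            for attr in node.attribute:
                if attr.name == "output_dtype":
                    precisions.add(onnx.TensorProto.DataType.Name(attr.i).lower())
        elif node.op_type == "DequantizeLinear":
            # Fall back to the dtype of the quantized input tensor, constant or not.
            dtype = elem_types.get(node.input[0])
            if dtype:
                precisions.add(onnx.TensorProto.DataType.Name(dtype).lower())
        elif node.op_type == "TRT_FP4DynamicQuantize":
            precisions.add("float4_e2m1fn")
    return precisions
```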
🧹 Nitpick comments (1)
tests/unit/onnx/test_quantize_api.py (1)
29-34: Avoid hard‑coded min‑opset duplication in tests. To reduce drift, consider deriving these values from the production constants.
♻️ Suggested refactor
```diff
-from modelopt.onnx.utils import get_opset_version
+from modelopt.onnx.utils import BASE_MIN_OPSET, QDQ_PRECISION_MIN_OPSET, get_opset_version
@@
 MIN_OPSET = {
-    "int8": 19,
-    "fp8": 19,
-    "int4": 21,
+    "int8": BASE_MIN_OPSET,
+    "fp8": BASE_MIN_OPSET,
+    "int4": QDQ_PRECISION_MIN_OPSET["int4"],
 }
```
```python
# Opset 19 is the minimum required for fp16 scales in Q/DQ nodes
# Higher opsets required for specific quantization modes (int4: 21, nvfp4: 23)
original_opset_version = get_opset_version(onnx_model)

required_opset_version = 13
if opset_version < required_opset_version and opset_version != 1:
    opset_version = required_opset_version
    onnx_model = onnx.version_converter.convert_version(onnx_model, opset_version)
    onnx_path = os.path.join(output_dir, f"{model_name}_opset{opset_version}.onnx")

# Determine minimum required opset based on quantization mode
mode_min_opset = QDQ_PRECISION_MIN_OPSET.get(quantize_mode, BASE_MIN_OPSET)
```
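For context, the constants referenced in this hunk live in modelopt/onnx/utils.py. A rough reconstruction from the values quoted in this review is shown below; only the keys mentioned in the discussion are included, so treat it as an assumption rather than the full table.

```python
# Reconstructed from the review discussion; the real table in modelopt/onnx/utils.py
# may contain additional keys.
BASE_MIN_OPSET = 19  # fp16 scales in Q/DQ nodes; int8/fp8 fall back to this default
QDQ_PRECISION_MIN_OPSET = {
    "int4": 21,
    "float4_e2m1fn": 23,  # nvfp4
}
```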
🧩 Analysis chain
🏁 Scripts executed:

```shell
wc -l modelopt/onnx/quantization/quantize.py
cat -n modelopt/onnx/quantization/quantize.py | sed -n '110,150p'
rg -n "QDQ_PRECISION_MIN_OPSET|BASE_MIN_OPSET" modelopt/onnx/quantization/ -A 3
rg -n '"int4"|"nvfp4"|"fp4"' modelopt/onnx/quantization/ -B 2 -A 2
rg -n 'quantize_mode.*"int4"|"int4".*in.*quantize_mode' modelopt/onnx/ -B 2 -A 2
rg -n "QDQ_PRECISION_MIN_OPSET\s*=" modelopt/onnx/ -A 10
rg -n "int4_awq|int4_gptq|int4_woq|nvfp4" modelopt/onnx/
rg -n "quantize_mode" modelopt/onnx/quantization/__init__.py modelopt/onnx/quantization/__main__.py -B 2 -A 2
rg -n "def quantize\(" modelopt/onnx/quantization/quantize.py -A 15
rg -n "nvfp4_exporter|nvfp4" modelopt/onnx/ -B 3 -A 3
rg -n "int4_awq" modelopt/onnx/quantization/quantize.py -B 3 -A 3
cat -n modelopt/onnx/quantization/quantize.py | sed -n '500,560p'
```
Normalize quantize_mode before min‑opset lookup.
The code at line 523 uses substring matching (elif "int4" in quantize_mode), implying variant strings are supported. However, the min‑opset lookup at line 129 uses exact key matching, so variant values like int4_awq will incorrectly fall back to BASE_MIN_OPSET (19) instead of the required 21. Additionally, the documentation references opset 23 for nvfp4, but that key doesn't exist in QDQ_PRECISION_MIN_OPSET; only float4_e2m1fn is present. This creates a mismatch where int4 variants could proceed with insufficient opset. Normalize quantize_mode before lookup to handle variants consistently.
🔧 Suggested fix
```diff
-    mode_min_opset = QDQ_PRECISION_MIN_OPSET.get(quantize_mode, BASE_MIN_OPSET)
+    if "int4" in quantize_mode:
+        mode_min_opset = QDQ_PRECISION_MIN_OPSET["int4"]
+    elif quantize_mode in QDQ_PRECISION_MIN_OPSET:
+        mode_min_opset = QDQ_PRECISION_MIN_OPSET[quantize_mode]
+    else:
+        mode_min_opset = BASE_MIN_OPSET
```
🤖 Prompt for AI Agents
In `@modelopt/onnx/quantization/quantize.py` around lines 124 - 130, The min-opset
lookup uses exact keys so variants like "int4_awq" fall back to BASE_MIN_OPSET;
normalize quantize_mode before querying QDQ_PRECISION_MIN_OPSET: compute a
normalized_mode (e.g., if "int4" in quantize_mode -> "int4", if "nvfp4" in
quantize_mode or "float4" variant -> "float4_e2m1fn", etc.), then use
QDQ_PRECISION_MIN_OPSET.get(normalized_mode, BASE_MIN_OPSET) when setting
mode_min_opset; update references in quantize.py that use quantize_mode
(including the substring checks and the get_opset_version flow) to use the
normalized value so variants resolve to the correct minimum opset.
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##             main     #809      +/-   ##
==========================================
+ Coverage   74.13%   74.85%   +0.72%
==========================================
  Files         192      192
  Lines       19263    19314      +51
==========================================
+ Hits        14280    14457     +177
+ Misses       4983     4857     -126
```

☔ View full report in Codecov by Sentry.
setup.py (outdated)
| "onnxruntime~=1.22.0 ; platform_machine == 'aarch64' or platform_system == 'Darwin'", | ||
| "onnxruntime-gpu~=1.22.0 ; platform_machine != 'aarch64' and platform_system != 'Darwin'", | ||
| "onnxruntime~=1.23.0 ; platform_machine == 'aarch64' or platform_system == 'Darwin'", | ||
| "onnxruntime-gpu~=1.23.0 ; platform_machine != 'aarch64' and platform_system != 'Darwin'", |
Let's leave the Windows version unchanged as they saw some regressions. cc @hthadicherla
| "onnxruntime-gpu~=1.23.0 ; platform_machine != 'aarch64' and platform_system != 'Darwin'", | |
| "onnxruntime-gpu~=1.23.0 ; platform_machine != 'aarch64' and platform_system != 'Darwin' and platform_system != 'Windows'", # noqa: E501 | |
| "onnxruntime-gpu==1.22.0; platform_system == 'Windows'", |
So the regression was initially observed by @ajrasane. He saw that quantizing with the latest ORT was causing accuracy degradations in some vision models on Linux. When I tested these models later, I found the exact same regressions on Windows.
Better to leave it at 1.22 in setup.py. In the LLM quantization examples, we reinstall the latest ORT version by having ort==1.23 in requirements.txt.
OK, I reverted the setup change @kevalmorabia97
- Allow user to select the target opset
- Minimum opset will be defined according to quantization mode
- Upgrade onnxruntime to 1.23.0 to support nvfp4
- Add tests in tests/unit/onnx/test_quantize_api.py

Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Force-pushed from 9882089 to 8e80cee
Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Force-pushed from a864ab4 to 8cdbe62
What does this PR do?
Type of change: new feature
Overview:
Testing
Added unit tests:

```text
tests/unit/onnx/test_quantize_api.py::test_opset_below_minimum_upgrades_to_minimum[int8] PASSED [ 11%]
tests/unit/onnx/test_quantize_api.py::test_opset_below_minimum_upgrades_to_minimum[fp8] PASSED [ 22%]
tests/unit/onnx/test_quantize_api.py::test_opset_below_minimum_upgrades_to_minimum[int4] PASSED [ 33%]
tests/unit/onnx/test_quantize_api.py::test_opset_below_original_uses_original[int8] PASSED [ 44%]
tests/unit/onnx/test_quantize_api.py::test_opset_below_original_uses_original[fp8] PASSED [ 55%]
tests/unit/onnx/test_quantize_api.py::test_opset_below_original_uses_original[int4] PASSED [ 66%]
tests/unit/onnx/test_quantize_api.py::test_opset_above_minimum[int8] PASSED [ 77%]
tests/unit/onnx/test_quantize_api.py::test_opset_above_minimum[fp8] PASSED [ 88%]
tests/unit/onnx/test_quantize_api.py::test_opset_above_minimum[int4] PASSED [100%]
```
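For a sense of what these tests exercise, a rough sketch of the first one is shown below. This is not the actual test code: the tiny MatMul model builder, the assumed quantize() keyword signature, and the MIN_OPSET table are assumptions based on the review discussion above.

```python
# Rough sketch only: the quantize() keyword names and the MIN_OPSET table are assumed
# from the review discussion, not copied from the repository's test file.
import numpy as np
import onnx
from onnx import TensorProto, helper

from modelopt.onnx.quantization import quantize
from modelopt.onnx.utils import get_opset_version

MIN_OPSET = {"int8": 19, "fp8": 19, "int4": 21}  # per the suggested test refactor above


def _make_matmul_model(opset: int) -> onnx.ModelProto:
    """Build a minimal MatMul model so the quantizer has something to quantize."""
    weight = np.random.rand(16, 16).astype(np.float32)
    init = helper.make_tensor("W", TensorProto.FLOAT, weight.shape, weight.flatten().tolist())
    graph = helper.make_graph(
        [helper.make_node("MatMul", ["X", "W"], ["Y"])],
        "g",
        [helper.make_tensor_value_info("X", TensorProto.FLOAT, [1, 16])],
        [helper.make_tensor_value_info("Y", TensorProto.FLOAT, [1, 16])],
        [init],
    )
    return helper.make_model(graph, opset_imports=[helper.make_opsetid("", opset)])


def test_opset_below_minimum_upgrades_to_minimum(tmp_path):
    in_path, out_path = tmp_path / "model.onnx", tmp_path / "model.quant.onnx"
    onnx.save(_make_matmul_model(opset=13), str(in_path))
    # Request an opset below the mode's minimum; the quantizer should upgrade it.
    quantize(onnx_path=str(in_path), quantize_mode="int8", opset=13, output_path=str(out_path))
    assert get_opset_version(onnx.load(str(out_path))) >= MIN_OPSET["int8"]
```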
Before your PR is "Ready for review"
Additional Information
Requested as a WAR (workaround) for a Windows onnxruntime issue in 5525939, but regardless, it's a useful feature to have.
Summary by CodeRabbit

New Features
- New --opset CLI option enabling users to specify the target ONNX opset version when quantizing models.

Tests