From 6105f560c30f96c3aca6a95d2e2602eeee8d914b Mon Sep 17 00:00:00 2001 From: unknown Date: Mon, 12 Jan 2026 13:52:43 +0530 Subject: [PATCH 1/8] [Benchmark] Update md file for Perplexity and Kl divergence benchmark info Signed-off-by: unknown --- examples/windows/Benchmark.md | 51 +++++++++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/examples/windows/Benchmark.md b/examples/windows/Benchmark.md index 0105a7fad..a2324c72a 100644 --- a/examples/windows/Benchmark.md +++ b/examples/windows/Benchmark.md @@ -24,6 +24,8 @@ Memory savings and inference speedup are compared to the ONNX FP16 baseline. ### 1.2 Accuracy Comparison +#### 1.2.1 MMLU Scores + For accuracy evaluation, the [Massive Multitask Language Understanding (MMLU)](https://arxiv.org/abs/2009.03300) benchmark has been utilized. Please refer to the [detailed instructions](./accuracy_benchmark/README.md) for running the MMLU accuracy benchmark. The table below shows the MMLU 5-shot score for some models. @@ -39,3 +41,52 @@ The table below shows the MMLU 5-shot score for some models. | [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) | 61.76 | 60.73 | | [Llama3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | 60.8 | 57.71 | | [Gemma-2b-it](https://huggingface.co/google/gemma-2b-it) | 37.01 | 37.2 | + +#### 1.2.2 Perplexity (PPL) + +Perplexity measures how well a probability model predicts a sample. Lower perplexity values indicate better model quality. The following table shows perplexity values at input sequence length 1024 with chunk size of 512. + +**Learn more about Perplexity:** [Perplexity - Wikipedia](https://en.wikipedia.org/wiki/Perplexity) | [Hugging Face - Perplexity of Fixed-Length Models](https://huggingface.co/docs/transformers/en/perplexity) + +- **FP16-MB**: Baseline FP16 genai model (Model Builder) +- **Mixed AWQ-MO**: Mixed precision AWQ quantization using ModelOpt +- **Mixed RTN-MO**: Mixed precision RTN quantization using ModelOpt +- **Pure INT4 AWQ-MO**: Pure INT4 AWQ quantization using ModelOpt +- **Pure INT4 RTN-MO**: Pure INT4 RTN quantization using ModelOpt +- **Pure INT8 RTN-MO**: Pure INT8 RTN quantization using ModelOpt +- **Pure INT8 AWQ-MO**: Pure INT8 AWQ quantization using ModelOpt +- **Configuration**: Windows OS, GPU RTX 5090, nvidia-modelopt v0.39.0, onnxruntime-genai-cuda 0.9.2, onnxruntime-gpu 1.23.0, torch 2.8.0+cu128, transformers 4.49.0 + +| Model | FP16-MB | Mixed AWQ-MO | Mixed RTN-MO | Pure INT4 AWQ-MO | Pure INT4 RTN-MO | Pure INT8 RTN-MO | Pure INT8 AWQ-MO | +|:------|:--------|:-------------|:-------------|:-----------------|:-----------------|:-----------------|:-----------------| +| DeepSeek R1 Distill Qwen 1.5B | 39.447 | 41.699 | 44.332 | 44.213 | 46.304 | 39.802 | 39.713 | +| Llama 3.2 1B Instruct | 12.631 | 13.852 | 14.176 | 14.549 | 16.900 | 12.664 | 12.637 | +| Phi-3.5 Mini Instruct | 6.046 | 6.500 | 6.599 | 6.711 | 7.070 | - | - | +| Phi-4 Mini Instruct | 9.039 | 9.673 | 9.712 | 10.015 | 10.911 | - | - | +| Qwen 2.5 1.5B Instruct | 9.216 | 10.084 | 10.338 | 10.495 | 10.933 | 9.227 | 9.232 | + +For detailed instructions on evaluating perplexity, please refer to the [Perplexity Evaluation Guide](./accuracy_benchmark/perplexity_metrics/README.md). + +#### 1.2.3 KL-divergence + +KL-divergence (Kullback-Leibler divergence) quantifies the distributional difference between the quantized model and the baseline model. 
Lower KL-divergence values indicate that the quantized model's output distribution is closer to the original model. + +**Learn more about KL-divergence:** [KL Divergence - Wikipedia](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) | [Understanding KL Divergence](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained) + +- **Baseline model**: Hugging Face FP16 model +- **Quantized models**: Generated using ModelOpt fake quantization +- **Configuration**: Windows OS, GPU RTX 5090, nvidia-modelopt v0.39.0, onnxruntime-genai-cuda 0.9.2, onnxruntime-gpu 1.23.0, torch 2.8.0+cu128, transformers 4.49.0 + +| Model | Quantization Method | Block-size | KL-divergence | Notes | +|:------|:--------------------|:-----------|:--------------|:------| +| Qwen2.5-1.5B-Instruct | Base FP16 (Baseline) | - | 0.000 | Reference baseline | +| Qwen2.5-1.5B-Instruct | fake int4+int8 Blockwise-max-mixed | 128 (blockwise) | 0.336 | Blockwise quantization | +| Qwen2.5-1.5B-Instruct | fake int4+int8 max-mixed | 128, -1 (per-channel) | 0.337 | Per-channel quantization | +| Llama-3.2-3B-Instruct | Base FP16 (Baseline) | - | 0.000 | Reference baseline | +| Llama-3.2-3B-Instruct | fake int4+int8 Blockwise-awq-lite-mixed | 128 (blockwise) | 0.228 | Best: Lowest divergence | +| Llama-3.2-3B-Instruct | fake int4+int8 per-channel-awq-lite-mixed | 128, -1 (per-channel) | 0.230 | AWQ-lite per-channel | +| Llama-3.2-3B-Instruct | fake int4+int8 Blockwise-max-mixed | 128 (blockwise) | 0.238 | Max-mixed blockwise | +| Llama-3.2-3B-Instruct | fake int4+int8 per-channel-max-mixed | 128, -1 (per-channel) | 0.238 | Max-mixed per-channel | +| Llama-3.2-3B-Instruct | fake int4-Blockwise-max | 128 (blockwise) | 0.334 | INT4 only (no INT8 activation) | + +For detailed instructions on computing KL-divergence, please refer to the [KL-divergence Evaluation Guide](./accuracy_benchmark/kl_divergence_metrics/README.md). \ No newline at end of file From d4f1db562b5250033cfa7b27791ac552e3301c5d Mon Sep 17 00:00:00 2001 From: unknown Date: Mon, 12 Jan 2026 13:54:52 +0530 Subject: [PATCH 2/8] [Benchmark] Update md file for Perplexity and Kl divergence benchmark info Signed-off-by: unknown --- examples/windows/Benchmark.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/windows/Benchmark.md b/examples/windows/Benchmark.md index a2324c72a..a6201151d 100644 --- a/examples/windows/Benchmark.md +++ b/examples/windows/Benchmark.md @@ -89,4 +89,4 @@ KL-divergence (Kullback-Leibler divergence) quantifies the distributional differ | Llama-3.2-3B-Instruct | fake int4+int8 per-channel-max-mixed | 128, -1 (per-channel) | 0.238 | Max-mixed per-channel | | Llama-3.2-3B-Instruct | fake int4-Blockwise-max | 128 (blockwise) | 0.334 | INT4 only (no INT8 activation) | -For detailed instructions on computing KL-divergence, please refer to the [KL-divergence Evaluation Guide](./accuracy_benchmark/kl_divergence_metrics/README.md). \ No newline at end of file +For detailed instructions on computing KL-divergence, please refer to the [KL-divergence Evaluation Guide](./accuracy_benchmark/kl_divergence_metrics/README.md). 
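
The perplexity section added above evaluates at an input sequence length of 1024 with a chunk size of 512. As a rough illustration of that setup, the sketch below computes sliding-window perplexity with Hugging Face `transformers`; the model name, evaluation text, and windowing details are illustrative assumptions, and the authoritative recipe is the [Perplexity Evaluation Guide](./accuracy_benchmark/perplexity_metrics/README.md).

```python
# Illustrative sliding-window perplexity sketch (not the benchmark's own script).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"  # assumed example model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto").eval()

text = "Replace this with the evaluation corpus used by the benchmark."  # placeholder text
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

max_length, stride = 1024, 512  # mirrors the 1024 sequence length / 512 chunk size above
nlls, prev_end = [], 0
for begin in range(0, input_ids.size(1), stride):
    end = min(begin + max_length, input_ids.size(1))
    target_len = end - prev_end                # tokens newly scored in this window
    ids = input_ids[:, begin:end]
    labels = ids.clone()
    labels[:, :-target_len] = -100             # ignore the overlapping context tokens
    with torch.no_grad():
        loss = model(ids, labels=labels).loss  # mean NLL over the scored tokens
    nlls.append(loss * target_len)
    prev_end = end
    if end == input_ids.size(1):
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
print(f"Perplexity: {ppl.item():.3f}")
```
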
From 4c9c5c90c87d8d0984212c6ae12cc41832224ac0 Mon Sep 17 00:00:00 2001
From: unknown
Date: Thu, 22 Jan 2026 18:10:09 +0530
Subject: [PATCH 3/8] handle review comments

Signed-off-by: unknown
---
 examples/windows/Benchmark.md | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/examples/windows/Benchmark.md b/examples/windows/Benchmark.md
index a6201151d..cf49416ce 100644
--- a/examples/windows/Benchmark.md
+++ b/examples/windows/Benchmark.md
@@ -24,7 +24,7 @@ Memory savings and inference speedup are compared to the ONNX FP16 baseline.
 
 ### 1.2 Accuracy Comparison
 
-#### 1.2.1 MMLU Scores
+#### 1.2.1 MMLU 
 
 For accuracy evaluation, the [Massive Multitask Language Understanding (MMLU)](https://arxiv.org/abs/2009.03300) benchmark has been utilized.
 
 Please refer to the [detailed instructions](./accuracy_benchmark/README.md) for running the MMLU accuracy benchmark.
@@ -73,20 +73,25 @@ KL-divergence (Kullback-Leibler divergence) quantifies the distributional differ
 
 **Learn more about KL-divergence:** [KL Divergence - Wikipedia](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) | [Understanding KL Divergence](https://www.countbayesie.com/blog/2017/5/9/kullback-leibler-divergence-explained)
 
+**Supported backends:** PyTorch, onnxruntime-cuda, and onnxruntime-trt-rtx-ep are supported for evaluation.
+
 - **Baseline model**: Hugging Face FP16 model
-- **Quantized models**: Generated using ModelOpt fake quantization
+- **Quantized models**: Models where quantization is simulated (a.k.a. fake quantization), typically evaluated with the PyTorch-CUDA backend. Fake quantization means the weights are quantized and then immediately dequantized, so the quantization error is simulated while compute still runs in higher precision. The Inference Backend column in the table below indicates whether the reported results come from PyTorch simulation or ONNX Runtime-based inference.
 - **Configuration**: Windows OS, GPU RTX 5090, nvidia-modelopt v0.39.0, onnxruntime-genai-cuda 0.9.2, onnxruntime-gpu 1.23.0, torch 2.8.0+cu128, transformers 4.49.0
 
-| Model | Quantization Method | Block-size | KL-divergence | Notes |
-|:------|:--------------------|:-----------|:--------------|:------|
-| Qwen2.5-1.5B-Instruct | Base FP16 (Baseline) | - | 0.000 | Reference baseline |
-| Qwen2.5-1.5B-Instruct | fake int4+int8 Blockwise-max-mixed | 128 (blockwise) | 0.336 | Blockwise quantization |
-| Qwen2.5-1.5B-Instruct | fake int4+int8 max-mixed | 128, -1 (per-channel) | 0.337 | Per-channel quantization |
-| Llama-3.2-3B-Instruct | Base FP16 (Baseline) | - | 0.000 | Reference baseline |
-| Llama-3.2-3B-Instruct | fake int4+int8 Blockwise-awq-lite-mixed | 128 (blockwise) | 0.228 | Best: Lowest divergence |
-| Llama-3.2-3B-Instruct | fake int4+int8 per-channel-awq-lite-mixed | 128, -1 (per-channel) | 0.230 | AWQ-lite per-channel |
-| Llama-3.2-3B-Instruct | fake int4+int8 Blockwise-max-mixed | 128 (blockwise) | 0.238 | Max-mixed blockwise |
-| Llama-3.2-3B-Instruct | fake int4+int8 per-channel-max-mixed | 128, -1 (per-channel) | 0.238 | Max-mixed per-channel |
-| Llama-3.2-3B-Instruct | fake int4-Blockwise-max | 128 (blockwise) | 0.334 | INT4 only (no INT8 activation) |
+| Model | Quantization Method | Quantization Granularity | KL-divergence | Inference Backend |
+|:-----------------------|:-------------------------------------------------|:--------------------------------------------------------------------|:--------------|:------------------------------|
+| Qwen2.5-1.5B-Instruct | Base FP16 (Baseline) | - | 0.000 | PyTorch (FP16) |
+| Qwen2.5-1.5B-Instruct | int4+int8 Blockwise-max_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.336 | PyTorch (fake quantization) |
+| Qwen2.5-1.5B-Instruct | int4+int8 max_algo-mixed_quant (simulated, per-channel) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.337 | PyTorch (fake quantization) |
+| Llama-3.2-3B-Instruct | Base FP16 (Baseline) | - | 0.000 | PyTorch (FP16) |
+| Llama-3.2-3B-Instruct | int4+int8 Blockwise-awq-lite_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.228 | PyTorch (fake quantization) |
+| Llama-3.2-3B-Instruct | int4+int8 per-channel-awq-lite_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.230 | PyTorch (fake quantization) |
+| Llama-3.2-3B-Instruct | int4+int8 Blockwise-max_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.238 | PyTorch (fake quantization) |
+| Llama-3.2-3B-Instruct | int4+int8 per-channel-max_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.238 | PyTorch (fake quantization) |
+| Llama-3.2-3B-Instruct | int4 Blockwise-max_algo only (simulated) | INT4: per-block (block-size=128) | 0.334 | PyTorch (fake quantization) |
+
+
+*All KL-divergence results above are obtained via PyTorch fake-quantization simulation unless otherwise noted. Inference with ONNX Runtime backends can also be evaluated.*
 
 For detailed instructions on computing KL-divergence, please refer to the [KL-divergence Evaluation Guide](./accuracy_benchmark/kl_divergence_metrics/README.md).
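
For a rough picture of what the KL-divergence evaluation measures, the sketch below computes a mean token-level KL(P_baseline || Q_quantized) from the logits of two causal language models. The model identifiers and prompt are placeholders, and producing the simulated-quantization (fake-quantized) model with ModelOpt is outside this snippet; the [KL-divergence Evaluation Guide](./accuracy_benchmark/kl_divergence_metrics/README.md) documents the actual procedure.

```python
# Illustrative token-level KL(P_baseline || Q_quantized) sketch.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

baseline_id = "Qwen/Qwen2.5-1.5B-Instruct"       # assumed FP16 baseline
quantized_path = "path/to/fake-quantized-model"  # placeholder for the simulated-quantization model

tokenizer = AutoTokenizer.from_pretrained(baseline_id)
p_model = AutoModelForCausalLM.from_pretrained(baseline_id, torch_dtype=torch.float16, device_map="auto").eval()
q_model = AutoModelForCausalLM.from_pretrained(quantized_path, torch_dtype=torch.float16, device_map="auto").eval()

ids = tokenizer("Quantization trades precision for efficiency.", return_tensors="pt").input_ids.to(p_model.device)

with torch.no_grad():
    p_log = F.log_softmax(p_model(ids).logits.float(), dim=-1)
    q_log = F.log_softmax(q_model(ids).logits.float(), dim=-1)

# KL(P || Q) per position: sum over the vocabulary of p * (log p - log q).
kl_per_token = F.kl_div(q_log, p_log, log_target=True, reduction="none").sum(-1)
print(f"Mean KL-divergence: {kl_per_token.mean().item():.3f}")
```
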
From d1e98c08ab308eef7820b0193f5801ac4c887374 Mon Sep 17 00:00:00 2001 From: unknown Date: Thu, 22 Jan 2026 18:12:36 +0530 Subject: [PATCH 4/8] handle review comments Signed-off-by: unknown --- examples/windows/Benchmark.md | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/examples/windows/Benchmark.md b/examples/windows/Benchmark.md index cf49416ce..1a0ad2bae 100644 --- a/examples/windows/Benchmark.md +++ b/examples/windows/Benchmark.md @@ -24,7 +24,7 @@ Memory savings and inference speedup are compared to the ONNX FP16 baseline. ### 1.2 Accuracy Comparison -#### 1.2.1 MMLU +#### 1.2.1 MMLU For accuracy evaluation, the [Massive Multitask Language Understanding (MMLU)](https://arxiv.org/abs/2009.03300) benchmark has been utilized. Please refer to the [detailed instructions](./accuracy_benchmark/README.md) for running the MMLU accuracy benchmark. @@ -91,7 +91,6 @@ KL-divergence (Kullback-Leibler divergence) quantifies the distributional differ | Llama-3.2-3B-Instruct | int4+int8 per-channel-max_algo-mixed_quant (simulated) | INT4: per-block (block-size=128), INT8: per-channel (row-wise) | 0.238 | PyTorch (fake quantization) | | Llama-3.2-3B-Instruct | int4 Blockwise-max_algo only (simulated) | INT4: per-block (block-size=128) | 0.334 | PyTorch (fake quantization) | - *All KL-divergence results above are obtained via PyTorch fake quantization simulation unless otherwise noted. Inference with ONNX-runtime can also be evaluated .* For detailed instructions on computing KL-divergence, please refer to the [KL-divergence Evaluation Guide](./accuracy_benchmark/kl_divergence_metrics/README.md). From 40d033d55c9223f6547bdc34fa1591ae40d98cea Mon Sep 17 00:00:00 2001 From: unknown Date: Fri, 23 Jan 2026 17:03:10 +0530 Subject: [PATCH 5/8] Update description Signed-off-by: unknown --- examples/windows/Benchmark.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/examples/windows/Benchmark.md b/examples/windows/Benchmark.md index 1a0ad2bae..6714a61e8 100644 --- a/examples/windows/Benchmark.md +++ b/examples/windows/Benchmark.md @@ -49,12 +49,12 @@ Perplexity measures how well a probability model predicts a sample. Lower perple **Learn more about Perplexity:** [Perplexity - Wikipedia](https://en.wikipedia.org/wiki/Perplexity) | [Hugging Face - Perplexity of Fixed-Length Models](https://huggingface.co/docs/transformers/en/perplexity) - **FP16-MB**: Baseline FP16 genai model (Model Builder) -- **Mixed AWQ-MO**: Mixed precision AWQ quantization using ModelOpt -- **Mixed RTN-MO**: Mixed precision RTN quantization using ModelOpt -- **Pure INT4 AWQ-MO**: Pure INT4 AWQ quantization using ModelOpt -- **Pure INT4 RTN-MO**: Pure INT4 RTN quantization using ModelOpt -- **Pure INT8 RTN-MO**: Pure INT8 RTN quantization using ModelOpt -- **Pure INT8 AWQ-MO**: Pure INT8 AWQ quantization using ModelOpt +- **Mixed AWQ-MO**: Important linear layers in INT8, rest in INT4 (AWQ), with ModelOpt. +- **Mixed RTN-MO**: Important linear layers in INT8, rest in INT4 (RTN), with ModelOpt. +- **Pure INT4 AWQ-MO**: All linear layers INT4 (AWQ) with ModelOpt. +- **Pure INT4 RTN-MO**: All linear layers INT4 (RTN) with ModelOpt. +- **Pure INT8 RTN-MO**: All linear layers INT8 (RTN) with ModelOpt. +- **Pure INT8 AWQ-MO**: All linear layers INT8 (AWQ) with ModelOpt. 
 - **Configuration**: Windows OS, GPU RTX 5090, nvidia-modelopt v0.39.0, onnxruntime-genai-cuda 0.9.2, onnxruntime-gpu 1.23.0, torch 2.8.0+cu128, transformers 4.49.0
 
 | Model | FP16-MB | Mixed AWQ-MO | Mixed RTN-MO | Pure INT4 AWQ-MO | Pure INT4 RTN-MO | Pure INT8 RTN-MO | Pure INT8 AWQ-MO |

From f729663ca7bc1fa3eeb18501a9ae9352be1ad17f Mon Sep 17 00:00:00 2001
From: unknown
Date: Fri, 23 Jan 2026 18:05:51 +0530
Subject: [PATCH 6/8] Add information for mixed precision quantization

Signed-off-by: unknown
---
 examples/windows/onnx_ptq/genai_llm/README.md | 75 +++++++++++++++++++
 1 file changed, 75 insertions(+)

diff --git a/examples/windows/onnx_ptq/genai_llm/README.md b/examples/windows/onnx_ptq/genai_llm/README.md
index b833d44dc..fb5bc4a5b 100644
--- a/examples/windows/onnx_ptq/genai_llm/README.md
+++ b/examples/windows/onnx_ptq/genai_llm/README.md
@@ -35,6 +35,81 @@ python quantize.py --model_name=meta-llama/Meta-Llama-3-8B \
     --calib_size=32 --algo=awq_lite --dataset=cnn
 ```
 
+### Mixed Precision Quantization (INT4 + INT8)
+
+ModelOpt-Windows supports **mixed precision quantization**, where different layers in the model can be quantized to different bit-widths. This approach combines INT4 quantization for most layers (for maximum compression and speed) with INT8 quantization for important or sensitive layers (to preserve accuracy).
+
+#### Why Use Mixed Precision?
+
+Mixed precision quantization provides an optimal balance between:
+- **Model Size**: Primarily INT4 keeps the model small
+- **Inference Speed**: INT4 layers are smaller and run faster
+- **Accuracy Preservation**: Critical layers in INT8 maintain model quality
+
+Based on benchmark results, mixed precision quantization shows significant advantages:
+
+| Model | Metric | INT4 RTN | Mixed RTN (INT4+INT8) | Improvement |
+|:------|:-------|:-------------|:---------------------|:-----------|
+| DeepSeek R1 1.5B | MMLU | 32.40% | 33.90% | +1.5% |
+| | Perplexity | 46.304 | 44.332 | -2.0 (lower is better) |
+| Llama 3.2 1B | MMLU | 39.90% | 44.70% | +4.8% |
+| | Perplexity | 16.900 | 14.176 | -2.7 (lower is better) |
+| Qwen 2.5 1.5B | MMLU | 56.70% | 57.50% | +0.8% |
+| | Perplexity | 10.933 | 10.338 | -0.6 (lower is better) |
+
+As shown above, mixed precision significantly improves accuracy with minimal disk size increase (~85-109 MB).
+
+#### How Mixed Precision Works
+
+The quantization strategy selects which layers to quantize to INT8 vs INT4:
+
+1. **INT8 Layers** (Higher Precision): Important layers that significantly impact model quality. Quantized per-channel.
+
+2. **INT4 Layers** (Maximum Compression): All other layers. Quantized blockwise.
+
+This strategy preserves accuracy for the most sensitive layers while maintaining aggressive compression elsewhere.
+
+#### Using Mixed Precision Quantization
+
+**Method 1: Use the default mixed precision strategy**
+
+```bash
+python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \
+    --onnx_path="E:\models\llama3.2-1b-fp16\model.onnx" \
+    --output_path="E:\models\llama3.2-1b-int4-int8-mixed\model.onnx" \
+    --algo=awq_lite \
+    --calib_size=32 \
+    --enable_mixed_quant
+```
+
+The `--enable_mixed_quant` flag automatically applies the default strategy.
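+
+To sanity-check what the default strategy produced, you can inspect which weights ended up 4-bit versus 8-bit in the quantized ONNX graph. The sketch below is illustrative only: it assumes `onnx>=1.16` (which exposes the INT4/UINT4 tensor types) and simply counts low-precision weight and zero-point initializers; it is not part of the example scripts.
+
+```python
+# Rough check of per-layer precision in the quantized model (illustrative sketch).
+from collections import Counter
+
+import onnx
+
+model = onnx.load(r"E:\models\llama3.2-1b-int4-int8-mixed\model.onnx", load_external_data=False)
+
+DTYPE_NAMES = {
+    onnx.TensorProto.INT4: "INT4",
+    onnx.TensorProto.UINT4: "UINT4",
+    onnx.TensorProto.INT8: "INT8",
+    onnx.TensorProto.UINT8: "UINT8",
+}
+
+counts = Counter()
+for init in model.graph.initializer:
+    dtype = DTYPE_NAMES.get(init.data_type)
+    if dtype is None:
+        continue  # skip FP16 scales and other high-precision tensors
+    counts[dtype] += 1
+    if dtype in ("INT8", "UINT8"):
+        print(f"8-bit initializer: {init.name}")
+
+print(dict(counts))
+```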
+ +**Method 2: Specify custom layers for INT8** + +```bash +python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \ + --onnx_path="E:\models\llama3.2-1b-fp16\model.onnx" \ + --output_path="E:\models\llama3.2-1b-int4-int8-custom\model.onnx" \ + --algo=awq_lite \ + --calib_size=32 \ + --layers_8bit="layers.0,layers.1,layers.15,layers.16" +``` + +The `--layers_8bit` option allows you to manually specify which layers to quantize to INT8. You can use: +- Layer indices: `layers.0,layers.5,layers.10` +- Layer paths: `model/layers.0/attn/qkv_proj` +- Partial names: `qkv_proj,down_proj` + + +#### Technical Details + +- **Block Size**: INT4 layers use block-wise quantization (default block-size=128), INT8 uses per-channel quantization +- **Quantization Axis**: INT4 (per-block), INT8 (per-channel row-wise) +- **Compatibility**: Works with both `awq_lite` and `rtn_dq` algorithms +- **Automatic Detection**: The `--layers_8bit` option automatically enables mixed quantization + +For more benchmark results and detailed accuracy metrics, refer to the [Benchmark Guide](../../Benchmark.md). + #### Command Line Arguments The table below lists key command-line arguments of the ONNX PTQ example script. From 6f8311b0497dd808c9b873068a18b61a36ffa903 Mon Sep 17 00:00:00 2001 From: unknown Date: Fri, 23 Jan 2026 18:08:01 +0530 Subject: [PATCH 7/8] Add information for mixed precision quantization Signed-off-by: unknown --- examples/windows/onnx_ptq/genai_llm/README.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/examples/windows/onnx_ptq/genai_llm/README.md b/examples/windows/onnx_ptq/genai_llm/README.md index fb5bc4a5b..cba8d6e87 100644 --- a/examples/windows/onnx_ptq/genai_llm/README.md +++ b/examples/windows/onnx_ptq/genai_llm/README.md @@ -42,6 +42,7 @@ ModelOpt-Windows supports **mixed precision quantization**, where different laye #### Why Use Mixed Precision? Mixed precision quantization provides an optimal balance between: + - **Model Size**: Primarily INT4 keeps the model small - **Inference Speed**: INT4 layers run faster and smaller - **Accuracy Preservation**: Critical layers in INT8 maintain model quality @@ -71,7 +72,7 @@ This strategy preserves accuracy for the most sensitive layers while maintaining #### Using Mixed Precision Quantization -**Method 1: Use the default mixed precision strategy** +##### Method 1: Use the default mixed precision strategy ```bash python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \ @@ -84,7 +85,7 @@ python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \ The `--enable_mixed_quant` flag automatically applies the default strategy. -**Method 2: Specify custom layers for INT8** +##### Method 2: Specify custom layers for INT8 ```bash python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \ @@ -96,11 +97,11 @@ python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \ ``` The `--layers_8bit` option allows you to manually specify which layers to quantize to INT8. 
You can use: + - Layer indices: `layers.0,layers.5,layers.10` - Layer paths: `model/layers.0/attn/qkv_proj` - Partial names: `qkv_proj,down_proj` - #### Technical Details - **Block Size**: INT4 layers use block-wise quantization (default block-size=128), INT8 uses per-channel quantization From 08f342b613c9a3cf05289d598122cd947b05666d Mon Sep 17 00:00:00 2001 From: unknown Date: Sun, 25 Jan 2026 11:08:31 +0530 Subject: [PATCH 8/8] Handle review comments Signed-off-by: unknown --- examples/windows/onnx_ptq/genai_llm/README.md | 108 +++++++++--------- 1 file changed, 54 insertions(+), 54 deletions(-) diff --git a/examples/windows/onnx_ptq/genai_llm/README.md b/examples/windows/onnx_ptq/genai_llm/README.md index cba8d6e87..1a46a687e 100644 --- a/examples/windows/onnx_ptq/genai_llm/README.md +++ b/examples/windows/onnx_ptq/genai_llm/README.md @@ -35,11 +35,58 @@ python quantize.py --model_name=meta-llama/Meta-Llama-3-8B \ --calib_size=32 --algo=awq_lite --dataset=cnn ``` -### Mixed Precision Quantization (INT4 + INT8) +#### Command Line Arguments + +The table below lists key command-line arguments of the ONNX PTQ example script. + +| **Argument** | **Supported Values** | **Description** | +|---------------------------|------------------------------------------------------|-------------------------------------------------------------| +| `--calib_size` | 32 (default), 64, 128 | Specifies the calibration size. | +| `--dataset` | cnn (default), pilevel | Choose calibration dataset: cnn_dailymail or pile-val. | +| `--algo` | awq_lite (default), awq_clip, rtn, rtn_dq | Select the quantization algorithm. | +| `--onnx_path` | input .onnx file path | Path to the input ONNX model. | +| `--output_path` | output .onnx file path | Path to save the quantized ONNX model. | +| `--use_zero_point` | Default: zero-point is disabled | Use this option to enable zero-point based quantization. | +| `--block-size` | 32, 64, 128 (default) | Block size for AWQ. | +| `--awqlite_alpha_step` | 0.1 (default) | Step-size for AWQ scale search, user-defined | +| `--awqlite_run_per_subgraph` | Default: run_per_subgraph is disabled | Use this option to run AWQ scale search at the subgraph level | +| `--awqlite_disable_fuse_nodes` | Default: fuse_nodes enabled | Use this option to disable fusion of input scales in parent nodes. | +| `--awqclip_alpha_step` | 0.05 (default) | Step-size for AWQ weight clipping, user-defined | +| `--awqclip_alpha_min` | 0.5 (default) | Minimum AWQ weight-clipping threshold, user-defined | +| `--awqclip_bsz_col` | 1024 (default) | Chunk size in columns during weight clipping, user-defined | +| `--calibration_eps` | dml, cuda, cpu, NvTensorRtRtx (default: [dml,cpu]) | List of execution-providers to use for session run during calibration | +| `--no_position_ids` | Default: position_ids input enabled | Use this option to disable position_ids input in calibration data| +| `--enable_mixed_quant` | Default: disabled mixed quant | Use this option to enable mixed precsion quantization| +| `--layers_8bit` | Default: None | Use this option to Overrides default mixed quant strategy| +| `--gather_quantize_axis` | Default: None | Use this option to enable INT4 quantization of Gather nodes - choose 0 or 1| +| `--gather_block_size` | Default: 32 | Block-size for Gather node's INT4 quantization (when its enabled using gather_quantize_axis option)| + +Run the following command to view all available parameters in the script: + +```bash +python quantize.py --help +``` + +Note: + +1. 
For the `algo` argument, we have following options to choose form: awq_lite, awq_clip, rtn, rtn_dq. + - The 'awq_lite' option does core AWQ scale search and INT4 quantization. + - The 'awq_clip' option primarily does weight clipping and INT4 quantization. + - The 'rtn' option does INT4 RTN quantization with Q->DQ nodes for weights. + - The 'rtn_dq' option does INT4 RTN quantization with only DQ nodes for weights. +1. RTN algorithm doesn't use calibration-data. +1. If needed for the input base model, use `--no_position_ids` command-line option to disable + generating position_ids calibration input. The GenAI built LLM models produced with DML EP has + position_ids input but ones produced with CUDA EP, NvTensorRtRtx EP don't have position_ids input. + Use `--help` or command-line options table above to inspect default values. + +Please refer to `quantize.py` for further details on command-line parameters. + +#### Mixed Precision Quantization (INT4 + INT8) ModelOpt-Windows supports **mixed precision quantization**, where different layers in the model can be quantized to different bit-widths. This approach combines INT4 quantization for most layers (for maximum compression and speed) with INT8 quantization for important or sensitive layers (to preserve accuracy). -#### Why Use Mixed Precision? +##### Why Use Mixed Precision? Mixed precision quantization provides an optimal balance between: @@ -60,7 +107,7 @@ Based on benchmark results, mixed precision quantization shows significant advan As shown above, mixed precision significantly improves accuracy with minimal disk size increase (~85-109 MB). -#### How Mixed Precision Works +##### How Mixed Precision Works The quantization strategy selects which layers to quantize to INT8 vs INT4: @@ -70,9 +117,9 @@ The quantization strategy selects which layers to quantize to INT8 vs INT4: This strategy preserves accuracy for the most sensitive layers while maintaining aggressive compression elsewhere. -#### Using Mixed Precision Quantization +##### Using Mixed Precision Quantization -##### Method 1: Use the default mixed precision strategy +###### Method 1: Use the default mixed precision strategy ```bash python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \ @@ -85,7 +132,7 @@ python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \ The `--enable_mixed_quant` flag automatically applies the default strategy. -##### Method 2: Specify custom layers for INT8 +###### Method 2: Specify custom layers for INT8 ```bash python quantize.py --model_name=meta-llama/Meta-Llama-3.2-1B \ @@ -102,7 +149,7 @@ The `--layers_8bit` option allows you to manually specify which layers to quanti - Layer paths: `model/layers.0/attn/qkv_proj` - Partial names: `qkv_proj,down_proj` -#### Technical Details +##### Technical Details - **Block Size**: INT4 layers use block-wise quantization (default block-size=128), INT8 uses per-channel quantization - **Quantization Axis**: INT4 (per-block), INT8 (per-channel row-wise) @@ -111,53 +158,6 @@ The `--layers_8bit` option allows you to manually specify which layers to quanti For more benchmark results and detailed accuracy metrics, refer to the [Benchmark Guide](../../Benchmark.md). -#### Command Line Arguments - -The table below lists key command-line arguments of the ONNX PTQ example script. 
- -| **Argument** | **Supported Values** | **Description** | -|---------------------------|------------------------------------------------------|-------------------------------------------------------------| -| `--calib_size` | 32 (default), 64, 128 | Specifies the calibration size. | -| `--dataset` | cnn (default), pilevel | Choose calibration dataset: cnn_dailymail or pile-val. | -| `--algo` | awq_lite (default), awq_clip, rtn, rtn_dq | Select the quantization algorithm. | -| `--onnx_path` | input .onnx file path | Path to the input ONNX model. | -| `--output_path` | output .onnx file path | Path to save the quantized ONNX model. | -| `--use_zero_point` | Default: zero-point is disabled | Use this option to enable zero-point based quantization. | -| `--block-size` | 32, 64, 128 (default) | Block size for AWQ. | -| `--awqlite_alpha_step` | 0.1 (default) | Step-size for AWQ scale search, user-defined | -| `--awqlite_run_per_subgraph` | Default: run_per_subgraph is disabled | Use this option to run AWQ scale search at the subgraph level | -| `--awqlite_disable_fuse_nodes` | Default: fuse_nodes enabled | Use this option to disable fusion of input scales in parent nodes. | -| `--awqclip_alpha_step` | 0.05 (default) | Step-size for AWQ weight clipping, user-defined | -| `--awqclip_alpha_min` | 0.5 (default) | Minimum AWQ weight-clipping threshold, user-defined | -| `--awqclip_bsz_col` | 1024 (default) | Chunk size in columns during weight clipping, user-defined | -| `--calibration_eps` | dml, cuda, cpu, NvTensorRtRtx (default: [dml,cpu]) | List of execution-providers to use for session run during calibration | -| `--no_position_ids` | Default: position_ids input enabled | Use this option to disable position_ids input in calibration data| -| `--enable_mixed_quant` | Default: disabled mixed quant | Use this option to enable mixed precsion quantization| -| `--layers_8bit` | Default: None | Use this option to Overrides default mixed quant strategy| -| `--gather_quantize_axis` | Default: None | Use this option to enable INT4 quantization of Gather nodes - choose 0 or 1| -| `--gather_block_size` | Default: 32 | Block-size for Gather node's INT4 quantization (when its enabled using gather_quantize_axis option)| - -Run the following command to view all available parameters in the script: - -```bash -python quantize.py --help -``` - -Note: - -1. For the `algo` argument, we have following options to choose form: awq_lite, awq_clip, rtn, rtn_dq. - - The 'awq_lite' option does core AWQ scale search and INT4 quantization. - - The 'awq_clip' option primarily does weight clipping and INT4 quantization. - - The 'rtn' option does INT4 RTN quantization with Q->DQ nodes for weights. - - The 'rtn_dq' option does INT4 RTN quantization with only DQ nodes for weights. -1. RTN algorithm doesn't use calibration-data. -1. If needed for the input base model, use `--no_position_ids` command-line option to disable - generating position_ids calibration input. The GenAI built LLM models produced with DML EP has - position_ids input but ones produced with CUDA EP, NvTensorRtRtx EP don't have position_ids input. - Use `--help` or command-line options table above to inspect default values. - -Please refer to `quantize.py` for further details on command-line parameters. 
- ### Evaluate the Quantized Model To evaluate the quantized model, please refer to the [accuracy benchmarking](../../accuracy_benchmark/README.md) and [onnxruntime-genai performance benchmarking](https://github.com/microsoft/onnxruntime-genai/tree/main/benchmark/python).
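
Before running the full benchmarks, it can be useful to smoke-test the quantized model with a single generation. The sketch below uses the `onnxruntime-genai` Python API in the style of its recent (0.5+) examples; the model path, prompt, and search options are illustrative assumptions, so adapt them to your setup.

```python
# Quick generation smoke test with onnxruntime-genai (illustrative sketch).
import onnxruntime_genai as og

# Folder containing the quantized model and its genai_config.json (assumed path).
model = og.Model(r"E:\models\llama3.2-1b-int4-int8-mixed")
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=128)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("What is mixed precision quantization?"))

while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()
```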