
Conversation

@Edwardf0t1 Edwardf0t1 commented Jan 9, 2026

What does this PR do?

Type of change: New feature

Overview:

The primary goal of this PR is to allow the model optimizer to use image-text pair data during the calibration phase of quantization, which is likely to improve the accuracy of quantized VLMs like Nemotron VL, particularly on visual understanding tasks, compared to text-only calibration data.

  • New Feature: Adds support for VLM calibration using image-text data.
  • Dataset Integration: Introduces support for sampling from the Nemotron-VLM-Dataset-v2.
  • Refactoring: Creates a separate utility for VLM datasets to keep the main Hugging Face PTQ script (hf_ptq.py) clean, and simplifies the logic for handling multimodal inputs.
  • Bug Fixes: Addresses specific issues encountered when calibrating the Nemotron-Nano-VL-12B-V2 model with image data.
  • Documentation: Updates the README with instructions and examples for VLM calibration.

This PR complements #347; we will consolidate the llm_ptq and vlm_ptq examples in follow-up PRs.
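
As a rough illustration of the new calibration data path (a minimal sketch only, not this PR's implementation; the dataset repo id, subset name, and column names below are assumptions), the idea is to stream samples on demand from the selected subsets and build multimodal calibration inputs with the model's processor:

```python
# Sketch only: stream image-text samples and build calibration inputs.
# The dataset repo id, subset/config name, and column names are assumptions.
from datasets import load_dataset
from transformers import AutoProcessor

model_path = "<nemotron_vl_checkpoint_path>"  # placeholder checkpoint path
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# streaming=True pulls records on demand instead of downloading the full dataset.
dataset = load_dataset(
    "<nemotron_vlm_dataset_v2_repo>", name="plotqa_cot", split="train", streaming=True
)

calib_batches = []
for i, sample in enumerate(dataset):
    if i >= 512:  # mirrors --calib_size 512
        break
    # Assumed columns: a PIL image and a text prompt.
    inputs = processor(
        text=[sample["text"]], images=[sample["image"]], return_tensors="pt"
    )
    calib_batches.append(inputs)
```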

Usage

python3 hf_ptq.py \
    --pyt_ckpt_path /home/scratch.omniml_data_2/models/Nemotron-Nano-VL-12B-V2 \
    --qformat nvfp4 \
    --export_path /home/omniml_data_3/zhiyuc/checkpoints/Nemotron-Nano-VL-12B-V2-NVFP4-doccalib \
    --trust_remote_code \
    --kv_cache_qformat none \
    --calib_with_images \
    --vlm_dataset nemotron_vlm_dataset_v2 \
    --vlm_subsets sparsetables,plotqa_cot \
    --calib_size 512

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes
  • Did you update Changelog?: Not yet

Additional Information

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
copy-pr-bot bot commented Jan 9, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


codecov bot commented Jan 9, 2026

Codecov Report

❌ Patch coverage is 9.84615% with 293 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.17%. Comparing base (307fe71) to head (2611b0e).
⚠️ Report is 21 commits behind head on main.

Files with missing lines | Patch % | Lines
modelopt/torch/utils/vlm_dataset_utils.py | 8.37% | 175 Missing ⚠️
modelopt/torch/utils/nemotron_vlm_dataset_utils.py | 11.94% | 118 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #755      +/-   ##
==========================================
- Coverage   74.66%   73.17%   -1.50%     
==========================================
  Files         192      193       +1     
  Lines       18975    19352     +377     
==========================================
- Hits        14167    14160       -7     
- Misses       4808     5192     +384     

@Edwardf0t1 Edwardf0t1 self-assigned this Jan 14, 2026
@Edwardf0t1 Edwardf0t1 marked this pull request as ready for review January 14, 2026 01:16
@Edwardf0t1 Edwardf0t1 requested review from a team as code owners January 14, 2026 01:16
shengliangxu commented Jan 14, 2026

So, we only support image quantization for just nemotron-vl? If yes, why?

# limitations under the License.

"""Utility functions for getting samples and forward loop function for different vlm datasets."""
"""Utility functions for getting samples and dataloader for different VLM calibration datasets.

@ajrasane could you review this change?

@cjluo-nv (Collaborator) commented:

@Edwardf0t1 do you have experiments evaluating the accuracy impact of using the new dataset?

@Edwardf0t1 (Author) replied:

So, we only support image quantization for just nemotron-vl? If yes, why?

At this time, only Nemotron VL has been tested. We can extend the logic to support other VLMs later. Note that different VLMs may have different forward functions—e.g., the way the vision encoder interacts with the language decoder can vary across models.
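
Concretely, the model-specific piece is the calibration forward: for Nemotron VL, the multimodal kwargs have to be consumed by the full VLM rather than the extracted language model. A minimal sketch of that pattern (placeholder names such as full_model, language_model, and calib_batches; not the PR's exact code, and assuming modelopt's mtq.quantize / NVFP4_DEFAULT_CFG API):

```python
# Sketch only: quantize just the language model, but calibrate by running the
# full VLM so the vision encoder consumes pixel_values as it would at runtime.
import torch
import modelopt.torch.quantization as mtq

def calibrate_full_model(_model):
    # mtq passes in the model being quantized; ignore it and forward the full VLM
    # so multimodal kwargs (pixel_values, image flags, etc.) are handled correctly.
    with torch.no_grad():
        for batch in calib_batches:  # batches built by the VLM dataset utility
            batch = {
                k: v.to(full_model.device) if torch.is_tensor(v) else v
                for k, v in batch.items()
            }
            full_model(**batch)

language_model = mtq.quantize(language_model, mtq.NVFP4_DEFAULT_CFG, calibrate_full_model)
```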

Do you have a preferred VL model you’d like us to support next? For instance, Qwen3-VL?

@Edwardf0t1 (Author) replied:

@Edwardf0t1 do you have experiments evaluating the accuracy impact of using the new dataset?

Tested on two benchmarks, DocVQA and InfoVQA, for Nemotron Nano VL v2 with the vLLM backend:

  • BF16 baseline: DocVQA 94.2184, InfoVQA 79.1404
  • NVFP4, text-only calibration: DocVQA 93.9472, InfoVQA 77.7221
  • NVFP4, image-text calibration: DocVQA 94.0854, InfoVQA 77.9598

Image-text calibration is only marginally better in these cases, but the calibration flow in this PR should be ready. Possible follow-up experiments:

  1. Try different subsets of Nemotron-VLM-Dataset-v2, or another image-text dataset, for calibration (see the example command after this list).
  2. Check more evaluation metrics.
  3. Run benchmarks on other VLMs such as Nemotron Parse and Qwen3-VL.
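
For (1), trying a different subset mix only changes the --vlm_subsets value; the other flags below mirror the Usage section above, and the subset names are placeholders:

```sh
python3 hf_ptq.py \
    --pyt_ckpt_path <model_path> \
    --qformat nvfp4 \
    --export_path <quantized_ckpt_path> \
    --trust_remote_code \
    --kv_cache_qformat none \
    --calib_with_images \
    --vlm_dataset nemotron_vlm_dataset_v2 \
    --vlm_subsets <subset_a>,<subset_b> \
    --calib_size 512
```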


[PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.

#### VLM calibration with image-text pairs (e.g., Nemotron VL)

--qformat nvfp4 \
--export_path <quantized_ckpt_path> \
--trust_remote_code \
--calib_with_images \

qq: Can users choose which VLM dataset to use, or do we just provide one option?

calib_dataloader = None
first_text_speech_dataset = None
if model_type == "mllama":
if getattr(args, "calib_with_images", False):

why use getattr here?

):
"""Auto search quantization of multiple formats."""

if getattr(args, "calib_with_images", False):
@cjluo-nv cjluo-nv Jan 20, 2026

same here. And why not just use assert?

)
elif is_nemotron_vl_model and getattr(args, "calib_with_images", False):
# For Nemotron VL image calibration, we need an AutoProcessor to build multimodal inputs.
try:

do we need this try except?

tokenizer.padding_side = "left"

# Quantize only the language model, but keep the full_model for calibration forward.
language_model_lineage = get_language_model_from_vl(full_model)

please avoid duplicating code with the block below

# Those kwargs must be consumed by the *full* VLM model, not the extracted language_model.
if getattr(args, "calib_with_images", False) and is_nemotron_vl_model:

def calibrate_full_model(_model):

can we make these helper functions and move them out of hf_ptq?

# prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# inputs = processor(text=[prompt], images=[pil_image], ...)

def _collate_fn(examples: list[dict[str, Any]]) -> dict[str, torch.Tensor] | dict[str, Any]:

why do we need to introduce these while the original one does not?

@jingyu-ml jingyu-ml left a comment

LGTM. I only reviewed the dataset processing part, which behaves as expected, loading the dataset on demand rather than downloading the entire dataset.
