
Conversation

@larryliu0820 (Contributor) commented Jan 30, 2026

Summary

- Add int4/int8 quantization support for Parakeet TDT model export using torchao
- Add storage_offset support in CUDA AOTI shims to enable quantized weight tensor views
- Extract quantization utilities to a separate module for reusability

Changes

Quantization Support for Parakeet

Added support for quantizing encoder and decoder components with multiple configurations (see the sketch below the list):

- **Linear layers**: `4w`, `8w`, `8da4w`, `8da8w` quantization configs
- **Embedding layers**: `4w`, `8w` quantization configs
- **Packing formats**: `tile_packed_to_4d` for optimized inference on CUDA
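
For reference, here is a minimal, hypothetical sketch of how two of these config strings could map onto torchao's `quantize_` API. The class and parameter names are assumptions based on torchao's public quantization API, not a copy of this PR's code; the actual mapping (including packing-format handling) lives in `examples/models/parakeet/quantize.py`.

```python
# Hypothetical sketch, not the PR's code: class and parameter names are assumptions
# based on torchao's public quantization API; packing-format handling is omitted.
import torch
from torchao.quantization import (
    Int4WeightOnlyConfig,
    Int8DynamicActivationInt4WeightConfig,
    quantize_,
)


def quantize_linears(module: torch.nn.Module, qconfig: str, group_size: int = 32) -> None:
    if qconfig == "4w":
        # Weight-only int4, grouped along the input channel in blocks of `group_size`.
        config = Int4WeightOnlyConfig(group_size=group_size)
    elif qconfig == "8da4w":
        # Dynamic int8 activation quantization with int4 weights.
        config = Int8DynamicActivationInt4WeightConfig(group_size=group_size)
    else:
        raise ValueError(f"Unsupported linear quantization config: {qconfig}")
    # quantize_ rewrites the module in place, swapping nn.Linear weights for
    # quantized tensor subclasses.
    quantize_(module, config)
```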

New CLI Arguments

| Argument | Description |
|----------|-------------|
| `--qlinear_encoder` | Quantization config for encoder linear layers |
| `--qlinear_encoder_group_size` | Group size for encoder quantization (default: 32) |
| `--qlinear_encoder_packing_format` | Packing format for encoder |
| `--qlinear` | Quantization config for decoder linear layers |
| `--qlinear_group_size` | Group size for decoder quantization (default: 32) |
| `--qlinear_packing_format` | Packing format for decoder |
| `--qembedding` | Quantization config for embedding layer |
| `--qembedding_group_size` | Group size for embedding quantization |

CUDA Backend: Storage Offset Support

Modified `aoti_torch__reinterpret_tensor` in `backends/cuda/runtime/shims/memory.cpp` to support non-zero storage offsets, which is required for int4 quantized weight tensors:

- **Removed** the `validate_storage_offset` check that rejected non-zero offsets
- **Added** logic to compute the adjusted data pointer: `base_ptr + storage_offset * element_size`
- **Updated** memory tracking to use `base_data_ptr` for reference counting
- **Added** tracking for the offset `data_ptr` as `NOT_OWN` to enable proper tensor deletion

This enables the CUDA backend to handle tensor views created by torchao's int4 quantization, which uses the `_convert_weight_to_int4pack` and `_weight_int4pack_mm` operations that produce tensors with non-zero storage offsets.
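
For intuition (illustration only, not the shim code), the adjusted pointer is the same arithmetic PyTorch itself uses to address a view with a non-zero storage offset:

```python
# Illustration only: a view's data pointer equals the base storage pointer plus
# storage_offset * element_size -- the same adjustment the shim now computes.
import torch

base = torch.arange(16, dtype=torch.int32)
view = base[4:]  # a view into the same storage with a non-zero storage offset

assert view.storage_offset() == 4
assert view.data_ptr() == base.data_ptr() + view.storage_offset() * base.element_size()
```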

Code Organization

- Extracted the `quantize()` function to `examples/models/parakeet/quantize.py`
- Model is moved to CUDA after preprocessor export when `--backend cuda` is specified
- Example inputs are created on the correct device to match the model parameters (see the sketch below)
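
A hedged sketch of that device handling (function name and input shape are illustrative, not the script's actual code):

```python
# Hypothetical sketch: after the preprocessor has been exported (not shown), move
# the model to CUDA and build example inputs on the parameters' device.
import torch


def prepare_model_and_inputs(model: torch.nn.Module, backend: str):
    if backend == "cuda":
        model = model.to("cuda")
    device = next(model.parameters()).device
    # Example inputs must live on the model's device, otherwise export/tracing
    # fails with a device mismatch between inputs and parameters.
    example_audio = torch.randn(1, 16000, device=device)
    return model, (example_audio,)
```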

Example Usage

```bash
python examples/models/parakeet/export_parakeet_tdt.py \
    --backend cuda \
    --dtype bf16 \
    --qlinear_encoder 4w \
    --qlinear_encoder_packing_format tile_packed_to_4d \
    --qlinear 4w \
    --qlinear_packing_format tile_packed_to_4d \
    --output-dir ./parakeet_int4
```

Test Plan

- [x] Export with CUDA backend and int4 quantization completes successfully
- [x] Model runs through the encoder with storage_offset tensors
- [x] Full transcription accuracy matches eager mode
- [x] Model size is reduced with quantization

@larryliu0820 larryliu0820 requested a review from lucylq as a code owner January 30, 2026 07:32
@pytorch-bot bot commented Jan 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17061

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the `CLA Signed` label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Jan 30, 2026
@larryliu0820 larryliu0820 added the `release notes: desktop` label (desktop/laptop workstream) Jan 30, 2026