
Conversation

@larryliu0820 (Contributor) commented Jan 30, 2026

Summary

- Add int4/int8 quantization support for Parakeet TDT model export using torchao
- Add storage_offset support in CUDA AOTI shims to enable quantized weight tensor views
- Extract quantization utilities to a separate module for reusability

Changes

Quantization Support for Parakeet

Added support for quantizing encoder and decoder components with multiple configurations (see the sketch below the list):

- **Linear layers**: `4w`, `8w`, `8da4w`, `8da8w` quantization configs
- **Embedding layers**: `4w`, `8w` quantization configs
- **Packing formats**: `tile_packed_to_4d` for optimized inference on CUDA
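
For reference, here is a minimal, hypothetical sketch of how two of these config strings could map onto torchao's `quantize_` API. The class and parameter names are assumptions based on torchao's public quantization API, not a copy of this PR's code; the actual mapping (including packing-format handling) lives in `examples/models/parakeet/quantize.py`.

```python
# Hypothetical sketch, not the PR's code: class and parameter names are assumptions
# based on torchao's public quantization API; packing-format handling is omitted.
import torch
from torchao.quantization import (
    Int4WeightOnlyConfig,
    Int8DynamicActivationInt4WeightConfig,
    quantize_,
)


def quantize_linears(module: torch.nn.Module, qconfig: str, group_size: int = 32) -> None:
    if qconfig == "4w":
        # Weight-only int4, grouped along the input channel in blocks of `group_size`.
        config = Int4WeightOnlyConfig(group_size=group_size)
    elif qconfig == "8da4w":
        # Dynamic int8 activation quantization with int4 weights.
        config = Int8DynamicActivationInt4WeightConfig(group_size=group_size)
    else:
        raise ValueError(f"Unsupported linear quantization config: {qconfig}")
    # quantize_ rewrites the module in place, swapping nn.Linear weights for
    # quantized tensor subclasses.
    quantize_(module, config)
```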

New CLI Arguments

| Argument | Description |
|----------|-------------|
| `--qlinear_encoder` | Quantization config for encoder linear layers |
| `--qlinear_encoder_group_size` | Group size for encoder quantization (default: 32) |
| `--qlinear_encoder_packing_format` | Packing format for encoder |
| `--qlinear` | Quantization config for decoder linear layers |
| `--qlinear_group_size` | Group size for decoder quantization (default: 32) |
| `--qlinear_packing_format` | Packing format for decoder |
| `--qembedding` | Quantization config for embedding layer |
| `--qembedding_group_size` | Group size for embedding quantization |

CUDA Backend: Storage Offset Support

Modified `aoti_torch__reinterpret_tensor` in `backends/cuda/runtime/shims/memory.cpp` to support non-zero storage offsets, which is required for int4 quantized weight tensors:

- **Removed** the `validate_storage_offset` check that rejected non-zero offsets
- **Added** logic to compute the adjusted data pointer: `base_ptr + storage_offset * element_size`
- **Updated** memory tracking to use `base_data_ptr` for reference counting
- **Added** tracking for the offset `data_ptr` as `NOT_OWN` to enable proper tensor deletion

This enables the CUDA backend to handle tensor views created by torchao's int4 quantization, which uses the `_convert_weight_to_int4pack` and `_weight_int4pack_mm` operations that produce tensors with non-zero storage offsets.
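
For intuition (illustration only, not the shim code), the adjusted pointer is the same arithmetic PyTorch itself uses to address a view with a non-zero storage offset:

```python
# Illustration only: a view's data pointer equals the base storage pointer plus
# storage_offset * element_size -- the same adjustment the shim now computes.
import torch

base = torch.arange(16, dtype=torch.int32)
view = base[4:]  # a view into the same storage with a non-zero storage offset

assert view.storage_offset() == 4
assert view.data_ptr() == base.data_ptr() + view.storage_offset() * base.element_size()
```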

Code Organization

- Extracted the `quantize()` function to `examples/models/parakeet/quantize.py`
- Model is moved to CUDA after preprocessor export when `--backend cuda` is specified
- Example inputs are created on the correct device to match the model parameters (see the sketch below)
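
A hedged sketch of that device handling (function name and input shape are illustrative, not the script's actual code):

```python
# Hypothetical sketch: after the preprocessor has been exported (not shown), move
# the model to CUDA and build example inputs on the parameters' device.
import torch


def prepare_model_and_inputs(model: torch.nn.Module, backend: str):
    if backend == "cuda":
        model = model.to("cuda")
    device = next(model.parameters()).device
    # Example inputs must live on the model's device, otherwise export/tracing
    # fails with a device mismatch between inputs and parameters.
    example_audio = torch.randn(1, 16000, device=device)
    return model, (example_audio,)
```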

Example Usage

```bash
python examples/models/parakeet/export_parakeet_tdt.py \
    --backend cuda \
    --dtype bf16 \
    --qlinear_encoder 4w \
    --qlinear_encoder_packing_format tile_packed_to_4d \
    --qlinear 4w \
    --qlinear_packing_format tile_packed_to_4d \
    --output-dir ./parakeet_int4
```

Test Plan

- [x] Export with CUDA backend and int4 quantization completes successfully
- [x] Model runs through the encoder with storage_offset tensors
- [x] Full transcription accuracy matches eager mode
- [x] Model size is reduced with quantization

@larryliu0820 larryliu0820 requested a review from lucylq as a code owner January 30, 2026 07:32
@pytorch-bot bot commented Jan 30, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17061

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the `CLA Signed` label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) Jan 30, 2026
@larryliu0820 larryliu0820 added the `release notes: desktop` label (desktop/laptop workstream) Jan 30, 2026