Enable int4 quantization for parakeet #17061
Summary
Changes
Quantization Support for Parakeet
Added support for quantizing the encoder and decoder components with multiple configurations (a rough mapping to torchao configs is sketched below):

- `4w`, `8w`, `8da4w`, `8da8w` quantization configs for linear layers
- `4w`, `8w` quantization configs for embeddings
- `tile_packed_to_4d` packing format for optimized inference on CUDA
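For orientation only, the short config names above typically correspond to torchao quantization configs along these lines. This is a minimal sketch assuming torchao's `quantize_` API; the helper name `quantize_linears`, the group-size default, and the filter function are assumptions, not the code in this PR:

```python
# Minimal sketch, not the PR's code: map the short config names onto torchao
# configs and apply them to a module's Linear layers in place.
import torch.nn as nn
from torchao.quantization import (
    quantize_,
    Int4WeightOnlyConfig,                      # "4w"
    Int8WeightOnlyConfig,                      # "8w"
    Int8DynamicActivationInt4WeightConfig,     # "8da4w"
    Int8DynamicActivationInt8WeightConfig,     # "8da8w"
)

def quantize_linears(model: nn.Module, mode: str, group_size: int = 32) -> None:
    # group_size default and helper name are illustrative assumptions
    configs = {
        "4w": Int4WeightOnlyConfig(group_size=group_size),
        "8w": Int8WeightOnlyConfig(),
        "8da4w": Int8DynamicActivationInt4WeightConfig(group_size=group_size),
        "8da8w": Int8DynamicActivationInt8WeightConfig(),
    }
    quantize_(model, configs[mode], filter_fn=lambda m, fqn: isinstance(m, nn.Linear))
```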
New CLI Arguments

- `--qlinear_encoder`
- `--qlinear_encoder_group_size`
- `--qlinear_encoder_packing_format`
- `--qlinear`
- `--qlinear_group_size`
- `--qlinear_packing_format`
- `--qembedding`
- `--qembedding_group_size`
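As a hypothetical illustration of how these flags might be declared in `export_parakeet_tdt.py` (the flag names come from this PR; the choices per flag and the defaults are assumptions):

```python
# Hypothetical flag wiring; defaults and choices-per-flag are assumptions.
import argparse

parser = argparse.ArgumentParser()
for prefix in ("qlinear_encoder", "qlinear"):
    parser.add_argument(f"--{prefix}", choices=["4w", "8w", "8da4w", "8da8w"])
    parser.add_argument(f"--{prefix}_group_size", type=int, default=32)
    parser.add_argument(f"--{prefix}_packing_format", default=None,
                        choices=["tile_packed_to_4d"])
parser.add_argument("--qembedding", choices=["4w", "8w"])
parser.add_argument("--qembedding_group_size", type=int, default=32)
args = parser.parse_args()
```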
CUDA Backend: Storage Offset Support

Modified `aoti_torch__reinterpret_tensor` in `backends/cuda/runtime/shims/memory.cpp` to support non-zero storage offsets, which is required for int4 quantized weight tensors:

- Removed the `validate_storage_offset` check that rejected non-zero offsets
- The data pointer is now computed as `base_ptr + storage_offset * element_size`
- `base_data_ptr` is tracked for reference counting
- `data_ptr` is marked as `NOT_OWN` to enable proper tensor deletion

This enables the CUDA backend to handle tensor views created by torchao's int4 quantization, which uses the `_convert_weight_to_int4pack` and `_weight_int4pack_mm` operations that produce tensors with non-zero storage offsets.
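To illustrate the pointer arithmetic involved (this is plain PyTorch view behavior, not the shim code itself): a view's data pointer is its storage's base pointer advanced by `storage_offset * element_size`, which is exactly the computation the shim now performs.

```python
# Plain PyTorch illustration of the relationship the shim relies on;
# the C++ change mirrors this arithmetic for views with non-zero offsets.
import torch

w = torch.randn(8, 8)
view = w[2:]                       # a view sharing w's storage
print(view.storage_offset())       # 16 (skips 2 rows of 8 elements)
assert view.data_ptr() == w.data_ptr() + view.storage_offset() * w.element_size()
```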
Code Organization

The `quantize()` function now lives in `examples/models/parakeet/quantize.py` and is invoked only when `--backend cuda` is specified (see the sketch below).
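A sketch of how the export script might call into the relocated module under that condition; the module path is from this PR, but the call signature of `quantize()` shown here is an assumption:

```python
# Sketch only: conditional invocation of the relocated quantize() helper.
# The signature of quantize() is an assumption, not the PR's actual API.
import argparse
import torch.nn as nn
from examples.models.parakeet.quantize import quantize  # module path per this PR

def maybe_quantize(model: nn.Module, args: argparse.Namespace) -> None:
    # Quantization is only wired up for the CUDA backend.
    if args.backend == "cuda" and args.qlinear is not None:
        quantize(model, args.qlinear, args.qlinear_group_size,
                 args.qlinear_packing_format)
```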
Example Usage

python examples/models/parakeet/export_parakeet_tdt.py \
  --backend cuda \
  --dtype bf16 \
  --qlinear_encoder 4w \
  --qlinear_encoder_packing_format tile_packed_to_4d \
  --qlinear 4w \
  --qlinear_packing_format tile_packed_to_4d \
  --output-dir ./parakeet_int4
Test Plan
- [x] Export with CUDA backend and int4 quantization completes successfully
- [x] Model runs through encoder with storage_offset tensors
- [x] Verify full transcription accuracy matches eager mode
- [x] Verify model size reduction with quantization
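For the last item, a rough sketch of the kind of check involved; the `.pte` file paths and the idea of comparing against an unquantized bf16 export are assumptions for illustration, not the PR's actual test harness:

```python
# Rough sketch of a size-reduction check; the .pte paths are assumptions.
import os

baseline = os.path.getsize("./parakeet_bf16/model.pte")    # unquantized export
quantized = os.path.getsize("./parakeet_int4/model.pte")   # int4 export from above
print(f"model size reduced by {100 * (1 - quantized / baseline):.1f}%")
```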