PyTorch Custom Operator Integration by matthewdouglas · Pull Request #1544 · bitsandbytes-foundation/bitsandbytes

matthewdouglas · 2025-02-27T18:49:42Z

Overview

This PR introduces the initial scaffolding to integrate PyTorch Custom Operators as the primary mechanism for dispatching to device-specific operator implementations.

As outlined in the related RFC #1545, the intent is that this will supersede the previous backend registration interface that was developed on the multi-backend-refactor branch. The baseline CUDA operators are established in this PR, and the implementation for additional backends is to be ported over to this new interface.

Why Custom Ops?

Registering operators with torch.library allows us to take advantage of the existing device dispatch mechanisms in PyTorch.
We can treat calls to functionality in our CUDA kernels, or other low-level backend implementations, as opaque for improved torch.compile support.
We can provide naive implementations of operators with only PyTorch code as a fallback option.
This helps to simplify the development for additional backends, while taking an idiomatic modern PyTorch approach.
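
As a rough illustration of the dispatch mechanism described above, the sketch below declares one operator schema and registers a CPU kernel plus a shape-only `Meta` kernel for it. It uses the long-standing `torch.library.Library` class, which may differ from the registration helpers this PR actually uses. The `bnb_sketch` namespace and the kernel function names are hypothetical and exist only for this example.

```python
import torch

# Hypothetical namespace for this sketch only; not the bitsandbytes namespace.
lib = torch.library.Library("bnb_sketch", "DEF")

# Declare the operator schema once, independent of any device.
lib.define("int8_linear_matmul(Tensor A, Tensor B) -> Tensor")


def _int8_linear_matmul_cpu(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Naive CPU kernel: A @ B.T computed in float32, returned as int32.
    # Exact as long as the accumulated values stay below 2**24.
    return torch.matmul(A.float(), B.float().t()).to(torch.int32)


def _int8_linear_matmul_meta(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # Shape/dtype-only kernel: describes the output without computing it.
    return torch.empty((*A.shape[:-1], B.shape[0]), dtype=torch.int32, device=A.device)


# One registration per dispatch key; other backends would add their own.
lib.impl("int8_linear_matmul", _int8_linear_matmul_cpu, "CPU")
lib.impl("int8_linear_matmul", _int8_linear_matmul_meta, "Meta")

A = torch.tensor([[1, -2, 3], [4, 5, -6]], dtype=torch.int8)
B = torch.tensor([[7, 8, 9], [-1, 0, 1]], dtype=torch.int8)

# The dispatcher selects the kernel from the device of the inputs.
out = torch.ops.bnb_sketch.int8_linear_matmul(A, B)
meta_out = torch.ops.bnb_sketch.int8_linear_matmul(A.to("meta"), B.to("meta"))
```

Callers only ever see `torch.ops.<namespace>.<op>`; which kernel runs is decided by the input device, so adding a backend means adding a registration rather than changing call sites.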

Operator Definitions

We broadly categorize operator functionality into three feature groups, though there can be some overlap.

LLM.int8()

Inference requirements

int8_vectorwise_quant(A: Tensor, threshold: float = 0.0) -> (Tensor, Tensor, Tensor?)
- Implements the LLM.int8() quantization algorithm with the specified threshold.
- Returns an int8 quantized tensor, a float32 tensor containing the scaling stats, and an optional int32 tensor containing a list of column indices with outliers present.
int8_scaled_mm(A: Tensor, B: Tensor, row_stats: Tensor, col_stats: Tensor, bias: Tensor?, dtype=torch.float16)
- By default, this is a composition of the two operators below. A backend may implement either the single fused operator or the two operators separately.
  - int8_linear_matmul(A: Tensor, B: Tensor) -> Tensor
    - Performs an 8-bit integer matrix multiplication between two int8 matrices.
    - Returns an int32 matrix: A @ B.T
  - int8_mm_dequant(A: Tensor, row_stats: Tensor, col_stats: Tensor, dtype=torch.float16, bias: Tensor?) -> Tensor
    - Dequantizes the result of a quantized 8-bit matrix multiplication with an optional fused bias.
    - The result is returned in the specified dtype, which is always torch.float16 for the current CUDA implementation.
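
The inference operators above can be sketched in plain PyTorch as follows. This is a minimal reference under stated assumptions, not the implementation from this PR: it handles only `threshold=0.0` (no outlier columns are returned), assumes 2D inputs, and computes the integer matmul through float32 for portability.

```python
import torch


def int8_vectorwise_quant(A: torch.Tensor):
    # Row-wise absmax scaling to int8. Assumption: threshold handling
    # (outlier columns) is omitted, so only two values are returned.
    row_stats = A.abs().amax(dim=1).float()
    scale = 127.0 / row_stats.clamp_min(1e-12)
    q = torch.round(A.float() * scale.unsqueeze(1)).to(torch.int8)
    return q, row_stats


def int8_linear_matmul(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    # int32 result of A @ B.T, computed via float32 in this sketch.
    return torch.matmul(A.float(), B.float().t()).to(torch.int32)


def int8_mm_dequant(C, row_stats, col_stats, dtype=torch.float16, bias=None):
    # Undo both scalings: each operand was multiplied by 127 / absmax.
    out = C.float() * row_stats.view(-1, 1) * col_stats.view(1, -1) / (127.0 * 127.0)
    if bias is not None:
        out = out + bias.float()
    return out.to(dtype)


def int8_scaled_mm(A, B, row_stats, col_stats, bias=None, dtype=torch.float16):
    # Default composition: integer matmul followed by dequantization.
    C = int8_linear_matmul(A, B)
    return int8_mm_dequant(C, row_stats, col_stats, dtype=dtype, bias=bias)


torch.manual_seed(0)
x = torch.randn(8, 16)   # activations
w = torch.randn(4, 16)   # weights, one output feature per row

xq, x_stats = int8_vectorwise_quant(x)
wq, w_stats = int8_vectorwise_quant(w)
y = int8_scaled_mm(xq, wq, x_stats, w_stats, dtype=torch.float32)
reference = x @ w.t()    # y approximates this up to quantization error
```

Here `col_stats` holds the per-row absmax of `B`, which corresponds to the columns of the output.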

Optional

int8_vectorwise_dequant(A: Tensor, stats: Tensor)
- Dequantizes an int8 tensor that was quantized with int8_vectorwise_quant.
- A default implementation in PyTorch is provided, which should work with any backend.
- This utility is used by Transformers, Diffusers, PEFT, and others.
int8_double_quant(A: Tensor, threshold: float = 0.0)
- Quantizes the input tensor using the LLM.int8() algorithm across both dimensions.
- This is only useful for full int8 training (i.e., not LoRA). As such, we recommend implementing only int8_vectorwise_quant.
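
For int8_vectorwise_dequant, a pure-PyTorch fallback can be as small as the sketch below. It assumes the row-wise absmax scaling convention (int8 value times absmax divided by 127) and quantizes inline only to have something to round-trip; the details of the default implementation shipped in this PR may differ.

```python
import torch


def int8_vectorwise_dequant(A: torch.Tensor, stats: torch.Tensor) -> torch.Tensor:
    # Undo row-wise absmax scaling: int8 value * absmax / 127.
    return A.float() * stats.view(-1, 1) / 127.0


x = torch.tensor([[0.3, -1.0, 0.25], [2.0, 3.0, -8.0]])

# Inline row-wise quantization (threshold = 0.0), just for the round trip.
stats = x.abs().amax(dim=1)
q = torch.round(x * (127.0 / stats).unsqueeze(1)).to(torch.int8)

x_hat = int8_vectorwise_dequant(q, stats)
# Each element is recovered to within absmax / 254 of its original value.
```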

NF4/FP4

Minimal requirements

dequantize_4bit(A: Tensor, absmax: Tensor, blocksize: int, quant_type: Literal["nf4", "fp4"], shape: int[], dtype) -> Tensor
- Dequantizes a packed 4bit tensor into the specified floating point dtype.
- Note: Unlike bitsandbytes.functional.dequantize_4bit, this operator does not dequantize the absmax tensor. If double quantization is used, the absmax tensor must first be dequantized with dequantize_blockwise.
quantize_4bit(A: Tensor, blocksize: int, quant_type: Literal["nf4", "fp4"], quant_storage=torch.uint8) -> (Tensor, Tensor)
- Quantizes a floating point tensor into a packed 4bit tensor.
- Returns a tensor with the quantized data packed into bytes, backed by the storage type specified. The float32 absmax scaling factors are additionally returned.
- Note: Unlike bitsandbytes.functional.quantize_4bit, this operator does not quantize the absmax tensor. If double quantization is used, the returned absmax tensor must be quantized separately with quantize_blockwise.
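
The mechanics of blockwise 4-bit quantization and packing can be sketched as below. Several things here are assumptions for illustration: the 16-entry codebook is a uniform placeholder rather than the real NF4 or FP4 table, the `quant_type` and `quant_storage` parameters are dropped, the nibble order (first element in the high nibble) is a choice made for this sketch, and the input size is assumed to be a multiple of the block size.

```python
import torch

# Placeholder 16-entry codebook (uniform levels in [-1, 1]).
# NOT the real NF4 or FP4 table; used only to show the mechanics.
CODE = torch.linspace(-1.0, 1.0, 16)


def quantize_4bit(A: torch.Tensor, blocksize: int = 64):
    # Assumes A.numel() is a multiple of blocksize and blocksize is even.
    blocks = A.float().reshape(-1, blocksize)
    absmax = blocks.abs().amax(dim=1)
    scaled = blocks / absmax.clamp_min(1e-12).unsqueeze(1)
    # Nearest codebook entry for every element gives an index in [0, 15].
    idx = (scaled.unsqueeze(-1) - CODE).abs().argmin(dim=-1)
    idx = idx.to(torch.uint8).flatten()
    # Pack two 4-bit indices per byte (first element in the high nibble).
    packed = (idx[0::2] << 4) | idx[1::2]
    return packed, absmax


def dequantize_4bit(packed, absmax, blocksize, shape, dtype=torch.float32):
    # Unpack both nibbles, restoring the original element order.
    idx = torch.stack([packed >> 4, packed & 0x0F], dim=-1).flatten().long()
    blocks = CODE[idx].reshape(-1, blocksize) * absmax.unsqueeze(1)
    return blocks.reshape(shape).to(dtype)


torch.manual_seed(0)
w = torch.randn(4, 64)
packed, absmax = quantize_4bit(w, blocksize=64)          # 128 bytes, 4 scales
w_hat = dequantize_4bit(packed, absmax, 64, w.shape)     # float32 approximation
```

Note that `absmax` stays in float32 on both sides, matching the notes above: nested quantization of the scales is a separate step.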

Double quantization (aka `compressed_statistics` or `nested`)

dequantize_blockwise(A: Tensor, absmax: Tensor, code: Tensor, blocksize: int, dtype) -> Tensor
- Dequantizes an 8bit tensor that was quantized with quantize_blockwise.
- Returns the dequantized tensor with the specified dtype.
quantize_blockwise(A: Tensor, code: Tensor, blocksize: int) -> (Tensor, Tensor)
- Quantizes into an 8bit blocked data type defined by code.
- The blocksize will typically be 256 for usage with NF4/FP4 and optimizers.
- Returns the quantized tensor in uint8 format, along with float32 absmax.
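
A minimal sketch of the blockwise 8-bit pair, applied to a tensor of 4-bit scaling factors as double quantization would do. The uniform 256-entry `code` is a placeholder assumption (the real code tensor is supplied by the caller), the input size is assumed to be a multiple of the block size, and any offset or centering of the scales before quantization is left out.

```python
import torch

# Placeholder 256-entry code (uniform in [-1, 1]); an assumption for this
# sketch, since the real code tensor is supplied by the caller.
code = torch.linspace(-1.0, 1.0, 256)


def quantize_blockwise(A: torch.Tensor, code: torch.Tensor, blocksize: int):
    # Assumes A.numel() is a multiple of blocksize.
    blocks = A.float().reshape(-1, blocksize)
    absmax = blocks.abs().amax(dim=1)
    scaled = blocks / absmax.clamp_min(1e-12).unsqueeze(1)
    # Index of the nearest code entry for every element, stored as uint8.
    idx = (scaled.unsqueeze(-1) - code).abs().argmin(dim=-1)
    return idx.to(torch.uint8).reshape(A.shape), absmax


def dequantize_blockwise(A, absmax, code, blocksize, dtype=torch.float32):
    # Look up each code value and rescale by the block's absmax.
    values = code[A.long()].reshape(-1, blocksize) * absmax.unsqueeze(1)
    return values.reshape(A.shape).to(dtype)


torch.manual_seed(0)
# Stand-in for the float32 absmax produced by a 4-bit quantization step.
absmax_4bit = torch.rand(512) + 0.5

q, absmax2 = quantize_blockwise(absmax_4bit, code, blocksize=256)
restored = dequantize_blockwise(q, absmax2, code, 256)
# `restored` is what would be passed as absmax to the 4-bit dequantization.
```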

Optional

gemv_4bit
- Fast path for bsz=1 inference with 4bit quantization. This operator is subject to some future revision.

Optimizers

Optimizer functionality will be implemented to support the custom operators in a future update.

github-actions · 2025-02-27T18:53:16Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Titus-von-Koeller

The custom ops and the cleanup parts are looking really excellent. Thanks for this impactful work!

As we said, let's do separate PRs for:

- torch.compile
  - fixing avoidable issues with graph breaks
  - fixing issues around torch.compile of quantize functions
- deprecations
  - TestSpMMFunctional tests, and marking related functions with @deprecated
  - SwitchBackLinear, and communicating on issues / repo of open_clip
- fixing AttributeError: module 'bitsandbytes.functional' has no attribute 'double_quant'

matthewdouglas added 14 commits January 28, 2025 09:20

Sketch out first custom op registration

6268912

Add note

04e1bc6

Merge branch 'main' into customop-refactoring

d5df4c6

Initial int8 op registration

04482ff

Cleanup some deprecated functions.

2813571

Int8 ops updates; tests

4ad1d9e

Implement 4bit quant/dequant ops

e9c79cf

Fix nested quant

9d0f459

cleanup

f360a08

Test improvements

45ead33

Clean up and improve tests

6aeea81

Add higher level custom op for int8 matmul + dequant + bias

cbd1670

Add gemv 4bit custom op

db07f4e

Cleanup

23eba7a

matthewdouglas added High Priority (first issues that will be worked on) Cross Platform labels Feb 27, 2025

matthewdouglas mentioned this pull request Feb 27, 2025

[RFC] PyTorch Custom Operators & Multi-Backend Support #1545

Closed

matthewdouglas requested a review from Titus-von-Koeller February 27, 2025 19:33

zou3519 reviewed Mar 3, 2025

bitsandbytes/_ops.py (review thread resolved)

Titus-von-Koeller mentioned this pull request Mar 5, 2025

[spike] evaluate + prototype interaction of unified memory abstraction with custom_ops #1556

Closed

3 tasks

matthewdouglas added this to the v0.46.0 milestone Mar 5, 2025

matthewdouglas marked this pull request as ready for review March 5, 2025 15:28

Implement out kwarg overloads for custom ops

2d5b2cc

matthewdouglas mentioned this pull request Mar 7, 2025

Fix CPU dequantization to use nested dequantized scaling constant #1549

Merged

matthewdouglas and others added 5 commits March 7, 2025 18:37

Update PyTorch minimum to 2.1

6172770

Deprecation updates

242c602

Deprecation updates

25368bc

merge main

32345e4

Cleanup; rename int8_linear_dequant -> int8_scaled_mm

2b85100

matthewdouglas added 15 commits March 13, 2025 11:13

Merge branch 'customop-refactoring' of https://github.com/TimDettmers…

aacd408

…/bitsandbytes into customop-refactoring

Bump min pytorch to 2.2

a61c0fa

cleanup

fd74c06

Test reorganization

587120a

Remove deprecated supports_igemmlt

975c356

More cleanup

da40911

Merge branch 'main' into customop-refactoring

0b04376

Cleanup obsolete C++/CUDA code

11e2e92

Cleanup

b599401

Create 'default' backend for fallback op implementations; initial CPU…

c703d8d

… nf4 work

Stub out for multi-platform

431819d

Fix serialization tests for torch>=2.6.0

fa188f6

Add example for torch.compile e2e inference

2015127

Test update

0a11fae

Merge branch 'main' into customop-refactoring

dcc2c16

Titus-von-Koeller approved these changes Mar 25, 2025

matthewdouglas merged commit e82f72b into main Mar 25, 2025
66 checks passed

matthewdouglas mentioned this pull request Mar 26, 2025

[AMD ROCm] _validate_bnb_multi_backend_availability() incorrectly tries to alter a frozenset. #1573

Closed

DevKimbob mentioned this pull request Mar 30, 2025

Fix: Return tuple in get_cuda_version_tuple #1580

Merged

matthewdouglas mentioned this pull request Apr 8, 2025

Support for Apple silicon #252

Closed

matthewdouglas mentioned this pull request Apr 22, 2025

Stop building for CUDA toolkit < 11.8 #1605

Merged

vivekgoe mentioned this pull request May 21, 2025

supports HPU double quant #1630

Merged

matthewdouglas mentioned this pull request May 22, 2025

Wrong result 8bit blockwise quantization over float16 #1540

Closed

matthewdouglas linked an issue May 22, 2025 that may be closed by this pull request

Wrong result 8bit blockwise quantization over float16 #1540

Closed

matthewdouglas merged 35 commits into main from customop-refactoring

3 participants
