Handle non-contiguous tensors in quantize/dequantize ops #1859
matthewdouglas merged 2 commits into main
Conversation
…#1342, #1690) Add A.contiguous() calls at the top of quantize_blockwise, quantize_4bit, and their dequantize counterparts in the CUDA backend. The CUDA kernels use raw pointers and assume contiguous memory layout, so non-contiguous inputs (e.g. tensor slices with strides) produced silently incorrect results. Add regression tests verifying non-contiguous tensors produce identical results to their contiguous equivalents for all four ops. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
TimDettmers
left a comment
PR Review: #1859 — Handle non-contiguous tensors in quantize/dequantize ops
Bug fix: adds A = A.contiguous() at the top of the four core CUDA backend quantization functions (quantize_blockwise, _dequantize_blockwise_impl, quantize_4bit, _dequantize_4bit_impl). Non-contiguous tensors passed to these functions were silently producing incorrect results because the underlying CUDA kernels use get_ptr() which assumes contiguous memory layout. Includes a thorough regression test class (TestNonContiguousInputs) with 4 test methods covering all four ops.
Classification: [bug-fix] [test]
No blocking issues.
The fix is correct and well-placed. Placing .contiguous() at the CUDA backend layer (rather than in the public functional.py wrappers) is the right design — it keeps the fix close to the assumption it enforces (get_ptr() needs contiguous memory), and it naturally covers both direct op calls via torch.ops.bitsandbytes.* and calls through the higher-level functional.py API. The .contiguous() call is a no-op on already-contiguous tensors (returns self with no copy), so the common case pays zero cost.
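For orientation, the pattern looks roughly like the sketch below. This is a simplified illustration, not the actual `bitsandbytes/backends/cuda/ops.py` code: the real functions take more arguments, allocate outputs differently, and launch CUDA kernels through raw pointers; `quantize_blockwise_cuda` is a placeholder name and the kernel launch is elided.

```python
import torch

def quantize_blockwise_cuda(A: torch.Tensor, code: torch.Tensor, blocksize: int):
    # The kernel walks A's storage through a raw pointer in dense C order,
    # so a contiguous layout must be guaranteed up front. On an already-contiguous
    # tensor this returns A itself (no allocation, no copy).
    A = A.contiguous()

    out = torch.empty(A.shape, dtype=torch.uint8, device=A.device)
    n_blocks = (A.numel() + blocksize - 1) // blocksize
    absmax = torch.empty(n_blocks, dtype=torch.float32, device=A.device)

    # ... launch the CUDA kernel with get_ptr(A), get_ptr(code),
    #     get_ptr(out), get_ptr(absmax) ...
    return out, absmax
```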
Root cause analysis: The CUDA kernels receive raw data pointers via get_ptr() and iterate over them assuming contiguous C-order layout. When a tensor has non-unit strides (from slicing, transposing, etc.), the pointer still points to the start of the storage, but the kernel reads elements sequentially — skipping the stride logic — producing silently corrupted output. The fix materializes a contiguous copy before pointer extraction. This matches the root cause described in both #1342 and #1690.
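The failure mode is easy to reproduce conceptually with plain PyTorch (an illustration of the pointer-walk mismatch, not bitsandbytes code):

```python
import torch

x = torch.arange(12, dtype=torch.float32).reshape(3, 4)
col = x[:, 1]                    # view with stride (4,) and storage offset 1
print(col.is_contiguous())       # False
print(col)                       # tensor([1., 5., 9.])

# A kernel that reads numel() elements sequentially from the raw data pointer
# would effectively see these values instead -- the parent's storage, not the slice:
start = col.storage_offset()
misread = x.flatten()[start : start + col.numel()]
print(misread)                   # tensor([1., 2., 3.])

# .contiguous() materializes a dense copy holding the values the caller intended:
print(col.contiguous())          # tensor([1., 5., 9.])
```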
Scope check — other get_ptr() call sites in cuda/ops.py: The file has additional get_ptr() calls in int8_linear_matmul, int8_mm_dequant, int8_vectorwise_quant, gemv_4bit, and the optimizer update functions. These are not covered by this PR. However, the scope is appropriate: the PR targets the four ops reported in the linked issues, and the other ops have different calling conventions (int8 ops receive already-quantized int8 tensors which are always freshly allocated and contiguous; gemv_4bit receives quantized weight tensors and activation vectors that come from nn.Linear.forward which produces contiguous outputs; optimizer tensors are parameter/gradient buffers managed by PyTorch which are contiguous). A follow-up to audit the remaining call sites would be a reasonable hardening measure but is not blocking.
Non-blocking suggestions:
- Consider whether a thin `_ensure_contiguous()` helper or a comment near `get_ptr()` documenting the contiguity requirement would help prevent regressions as new ops are added (a sketch follows below).
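As an illustration of that suggestion, the helper could be as small as the sketch below. `_ensure_contiguous` is the hypothetical name from the bullet above; it does not exist in bitsandbytes today.

```python
import torch

def _ensure_contiguous(A: torch.Tensor) -> torch.Tensor:
    """Return a C-contiguous tensor suitable for raw-pointer access via get_ptr().

    Returns A itself (no copy) when A is already contiguous; otherwise makes a
    dense copy. Every op that extracts a raw pointer should call this first.
    """
    return A.contiguous()
```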
Downstream Impact
Risk level: NONE (beneficial)
This fix only adds automatic contiguity enforcement. It does not change any function signatures, return types, class attributes, or serialization formats. Downstream projects that were already passing contiguous tensors see no change. Downstream projects that were inadvertently passing non-contiguous tensors (which would have produced silently wrong results) now get correct results.
- Transformers: not affected (beneficial)
- PEFT: not affected (beneficial)
- Accelerate: not affected (beneficial)
- TGI: not affected
- vLLM: not affected
Performance Impact
Hot path affected: yes (quantize/dequantize are in the forward path for 4-bit inference)
Changes:
- Four new `.contiguous()` calls at the top of the CUDA backend quantize/dequantize functions
- `.contiguous()` on an already-contiguous tensor is a no-op (returns `self`, no allocation or copy) — see the snippet below
- For non-contiguous inputs, a copy is made — but without this fix those inputs produced wrong results, so correctness trumps the copy cost
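The no-op claim is straightforward to verify in plain PyTorch:

```python
import torch

a = torch.randn(64, 64)
assert a.is_contiguous()
assert a.contiguous() is a           # already contiguous: same object, nothing copied

b = a.t()                            # transposed view: non-contiguous
c = b.contiguous()                   # only here is a new dense buffer allocated
assert c.data_ptr() != a.data_ptr()
assert torch.equal(c, b)             # same values, dense layout
```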
Expected impact: negligible for the common case; correctness fix for the uncommon case.
Recommendation: no concern
Cross-PR Conflicts
- PR #1858 (Add k-bit blockwise quantization): overlaps on `bitsandbytes/backends/cuda/ops.py`. The changes are in different functions (this PR modifies existing quantize/dequantize functions; #1858 adds new kbit functions). Merge conflicts are unlikely; no semantic conflict.
- PRs #1860, #1861, #1863, #1864, #1865, #1866 overlap on `tests/test_linear4bit.py`. This PR's change to that file is a one-line formatting fix (joining a multi-line assert onto one line) — trivially resolvable.
- Security: Clear
- Downstream impact: None (beneficial — prevents silent corruption)
- Tests: Adequate — 4 test methods covering all 4 affected ops, with 3 dtypes, multiple blocksizes, fp4/nf4 for 4-bit ops, and an end-to-end roundtrip test (a sketch of the core comparison follows this list)
- CI: All pass (lint, CPU builds/tests across platforms, CUDA builds/tests on L40S and T4 with multiple CUDA versions, Windows CUDA, ROCm builds)
- Performance: Negligible (no-op on contiguous tensors)
- Serialization: Not affected
- torch.compile: Not affected (no op registration changes)
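The core comparison the regression tests perform looks roughly like the sketch below. It assumes the `bitsandbytes.functional.quantize_blockwise` API returning a `(tensor, QuantState)` pair and a CUDA device; it is not the actual `TestNonContiguousInputs` code (no pytest parametrization, names simplified).

```python
import torch
import bitsandbytes.functional as F

def check_noncontiguous_matches_contiguous():
    """A strided slice must quantize identically to its contiguous copy."""
    base = torch.randn(128, 256, device="cuda", dtype=torch.float16)
    sliced = base[:, ::2]                          # non-contiguous view
    assert not sliced.is_contiguous()

    q_nc, state_nc = F.quantize_blockwise(sliced, blocksize=256)
    q_c, state_c = F.quantize_blockwise(sliced.contiguous(), blocksize=256)

    assert torch.equal(q_nc, q_c)                  # identical quantized codes
    torch.testing.assert_close(state_nc.absmax, state_c.absmax)
```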
Summary
- Adds `A.contiguous()` calls at the top of `quantize_blockwise`, `dequantize_blockwise`, `quantize_4bit`, and `dequantize_4bit` in the CUDA backend
- The CUDA kernels take raw pointers (via `get_ptr()`) which assume a contiguous memory layout; non-contiguous inputs (e.g. strided slices) produced silently incorrect results

Fixes #1342
Fixes #1690
Test plan
- New `TestNonContiguousInputs` class in `tests/test_ops.py` with 4 test methods (54 CUDA parametrizations)
- `test_4bit_quant_large`: unrelated to this change

🤖 Generated with Claude Code