[Discussion] Unify CUDA and HIP kernel sources via compat.cuh portability layer #1877
Abdennacer-Badaoui wants to merge 4 commits into bitsandbytes-foundation:main
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
TimDettmers
left a comment
PR Review: [Discussion] Unify CUDA and HIP kernel sources via compat.cuh portability layer
Classification: Refactoring / RFC (discussion-only, not intended to merge as-is)
Author: @Abdennacer-Badaoui (known contributor — authored the blocksize-32/64 kernels in #1854/#1856 that this unifies)
Risk level: Low (all files are additions in csrc/examples/, no existing code is modified)
Summary
This PR proposes a design for merging the duplicated CUDA and HIP kernel source files into a unified codebase using two new portability headers: compat.cuh (host-safe) and compat_device.cuh (device-only). The current codebase maintains near-identical copies of 5 pairs of files (~6500 LOC of duplication). The proposed approach would eliminate 5 files and ~3300 lines of duplication while introducing 2 new portability headers.
The 8 example files demonstrate the full approach. This is a well-structured RFC that shows rather than tells.
CI Status
- Lint: FAIL (expected — clang-format likely hasn't been run on the new files)
- build-wheels: FAIL (unrelated — dependency on lint)
- All CUDA/HIP/CPU build & test jobs: PASS (these don't compile `csrc/examples/`)
The lint failure is expected for an RFC and is not a concern at this stage.
Design Assessment
The two-header split (compat.cuh for host-safe code, compat_device.cuh for device-only CUB/MMA) is a clean design. The rationale is solid: .cpp files compiled by gcc/g++ cannot parse CUDA device headers, so the split is necessary.
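To make the split concrete, here is a rough sketch of how the two headers would be consumed (file names are from the example set; the exact include lists are my assumption, not verified against the PR):

```cpp
// pythonInterface_unified.cpp -- built by gcc/g++, so it can only include the
// host-safe layer (types, error macros, BLAS handle plumbing):
#include "compat.cuh"

// kernels_unified.cu / ops_unified.cu -- built by nvcc/hipcc, so they can also
// include the device-only layer (CUB/hipCUB, reduction ops, MMA):
#include "compat.cuh"
#include "compat_device.cuh"
```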
Strengths:
- Namespace aliasing for CUB/hipCUB (`namespace bnb_cub = cub/hipcub`) eliminates ~90% of the mechanical `cub::` vs `hipcub::` differences with a single line. Elegant. (A hedged sketch of these compat-layer pieces follows this list.)
- Compile-time `BNB_WARP_SIZE` in `common_unified.cuh` correctly handles the GFX9 (CDNA) 64-wide warps vs RDNA/CUDA 32-wide warps. The `#ifdef __GFX9__` guard is correct for current ROCm architectures.
- `kQuantizeBlockwiseSmall` successfully unifies `kQuantizeBlockwise32` (CUDA) and `kQuantizeBlockwise64` (HIP) by parameterizing on `BNB_WARP_SIZE`. The kernel logic is structurally identical to both originals — I verified the codebook values, reduction ops, quantization packing, and store patterns match.
- `#if BNB_HIP` guards are used sparingly and only where genuinely needed:
  - `atomicMax` (CUDA CAS loop vs HIP native)
  - `Context` class (cuBLAS vs rocBLAS handle creation)
  - `gemmex`/`strided_gemmex` (different BLAS APIs)
  - `igemmlt` (hipBLAS requires explicit heuristic algo selection)
  - `blocksize==64` dispatch path in `ops_unified.cu` (only HIP with 64-wide warps needs the small-block kernel for blocksize=64)
- CMakeLists change is minimal and correct: a single `GPU_FILES` list replaces separate `CUDA_FILES`/`HIP_FILES`, with `set_source_files_properties(${GPU_FILES} PROPERTIES LANGUAGE HIP)` for HIP builds. The `<<<>>>` launch syntax works natively on HIP, so no `hipLaunchKernelGGL` wrappers are needed.
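For readers who haven't opened the example files, a rough sketch of the pattern (my paraphrase; the exact macro spellings and guards in `compat.cuh` / `compat_device.cuh` / `common_unified.cuh` may differ, and the helper name below is hypothetical):

```cpp
// Paraphrased sketch of the compat-layer idea, not the literal PR contents.
// In the PR this is split across compat.cuh (host-safe) and compat_device.cuh (CUB).
#if defined(__HIP_PLATFORM_AMD__)   // assumption: how BNB_HIP is detected may differ
  #define BNB_HIP 1
  #include <hipcub/hipcub.hpp>
  namespace bnb_cub = hipcub;       // one alias covers most cub:: vs hipcub:: differences
#else
  #define BNB_HIP 0
  #include <cub/cub.cuh>
  namespace bnb_cub = cub;
#endif

// Compile-time warp size: GFX9 (CDNA) has 64-wide warps, RDNA and CUDA have 32.
#ifdef __GFX9__
  #define BNB_WARP_SIZE 64
#else
  #define BNB_WARP_SIZE 32
#endif

// Genuinely divergent code stays behind #if BNB_HIP, e.g. float atomicMax:
__device__ inline float bnb_atomic_max_float(float* addr, float val) {  // hypothetical name
#if BNB_HIP
    return atomicMax(addr, val);    // HIP exposes a native float overload
#else
    // CUDA has no float atomicMax; emulate it with a compare-and-swap loop.
    int old = __float_as_int(*addr);
    while (val > __int_as_float(old)) {
        int assumed = old;
        old = atomicCAS(reinterpret_cast<int*>(addr), assumed, __float_as_int(val));
        if (old == assumed) break;
    }
    return __int_as_float(old);
#endif
}
```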
Technical concerns (for discussion):
- `BNB_WARP_SIZE` and blocksize=64 dispatch: In `ops_unified.cu` lines 50-61, the `blocksize==64` path has a `#if BNB_HIP` guard to dispatch to `kQuantizeBlockwiseSmall` for 4-bit types on HIP. However, `BNB_WARP_SIZE` is a device-side macro (`__GFX9__` is only defined in device code), while this dispatch decision is made in host code. How will the host-side code know whether to use the warp-64 path? The current approach uses `#if BNB_HIP` as a proxy, which is correct if the library is compiled separately for each target architecture, but could be wrong if a single HIP binary targets both CDNA (warp64) and RDNA (warp32) architectures simultaneously. This probably needs a runtime check or separate compilation for each arch, or a comment explaining the assumption. (A possible runtime-check sketch follows this list.)
- `kQuantizeBlockwiseSmall` name: The kernel is called "Small" but on HIP with warp=64, it handles blocksize=64 (not small at all). Consider `kQuantizeBlockwiseWarp` or similar to reflect that it processes warp-sized blocks. Minor naming nit.
- `compat.cuh` includes `rocblas/rocblas.h` and `hipblas/hipblas.h` unconditionally on HIP: These are heavyweight headers. If `compat.cuh` is meant to be "host-safe and lightweight," consider whether these BLAS includes belong here or in a separate BLAS compat header. Currently the `Context` class in `ops_unified.cuh` needs them, but other files including `compat.cuh` may not.
- `BNB_BLASLT_PTR_MODE_ALPHA_VEC` asymmetry: On CUDA this maps to `CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO`, on HIP to `HIPBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST`. The `BETA_ZERO` vs `BETA_HOST` difference is notable — is this an intentional difference in how the two backends handle beta, or should it be `BETA_ZERO` on both? This discrepancy exists in the current code, so it's not introduced by this PR, but the unification is a good opportunity to document why.
- Missing `bnb_blasLtPrefCreate` / `bnb_blasLtPrefSetAttr` / `bnb_blasLtAlgoGetHeuristic` macros for CUDA: These are defined for HIP in `compat.cuh` but not for CUDA, because CUDA doesn't need the heuristic path. However, they're used inside a `#if BNB_HIP` block in `ops_unified.cu`, so there's no build failure — but it means the compat header is incomplete if someone tried to use these macros on CUDA. Add a comment or `#ifdef` guard noting these are HIP-only.
- `CUDA_CHECK_RETURN` backward compat macro: Good that `compat.cuh` defines `#define CUDA_CHECK_RETURN(value) BNB_CHECK_RETURN(value)` for migration purposes. This should be documented as deprecated and removed after the full migration.
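On the first concern, one possible direction (not in the PR; the helper name below is hypothetical) is to make the dispatch decision from the device properties at runtime rather than from `#if BNB_HIP`:

```cpp
// Hypothetical host-side helper: query the actual warp size of the active device,
// so a single HIP binary targeting both CDNA (warp 64) and RDNA (warp 32) still
// picks the right blocksize==64 path. Sketch only, not part of this PR.
#if BNB_HIP
#include <hip/hip_runtime.h>
#else
#include <cuda_runtime.h>
#endif

inline int bnb_runtime_warp_size(int device_id) {
#if BNB_HIP
    hipDeviceProp_t props;
    hipGetDeviceProperties(&props, device_id);   // error handling omitted for brevity
#else
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, device_id);  // error handling omitted for brevity
#endif
    return props.warpSize;  // 64 on CDNA, 32 on RDNA and current CUDA GPUs
}

// Host-side dispatch could then look roughly like:
//   if (blocksize == 64 && bnb_runtime_warp_size(dev) == 64)
//       ... launch kQuantizeBlockwiseSmall ...
```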
Security Review
- No network access, command execution, or dynamic code execution introduced
- No new dependencies added
- No changes to `pyproject.toml`, CI workflows, or agent configuration files
- No invisible Unicode characters detected in any file
- Codebook values (FP4 and NF4 lookup tables) are byte-identical to the existing `kernels.cu`
- CMakeLists changes are limited to file list unification — no new `execute_process`, `FetchContent`, or custom commands
- Build flags unchanged
No security concerns.
Numerical Correctness
All quantization/dequantization kernel code is mechanically equivalent to the existing CUDA and HIP kernels. Specifically verified:
- `fp4_dequantization_lut` and `nf4_dequantization_lut` values are identical
- `dQuantizeFP4`, `dQuantizeNF4`, `dDequantizeFP4Tree`, `dDequantizeNF4` logic is identical
- `atomicMax` CAS loop is correctly guarded with `#if !BNB_HIP`
- `kQuantizeBlockwise` template uses `bnb_cub::` and `BNB_MAX_OP` as 1:1 replacements
- `kQuantizeBlockwiseSmall` logic matches both `kQuantizeBlockwise32` (CUDA) and `kQuantizeBlockwise64` (HIP)
- `igemmlt` preserves the HIP heuristic path and CUDA direct path
No numerical correctness concerns.
Downstream Impact
None. This PR adds files to csrc/examples/ — it does not modify any compiled source, public API, or serialization format. No downstream impact.
Cross-PR Conflicts
PR #1858 (k-bit blockwise quantization kernels) adds new CUDA kernels. If this RFC proceeds to full migration, the new kernels from #1858 would need to be written using the compat.cuh abstractions rather than raw CUDA APIs. Worth noting for sequencing.
Verdict: APPROVE (as RFC)
This is a well-designed RFC. The portability layer approach is sound, the #if BNB_HIP guards are minimal and limited to genuinely divergent code, and the unified kernel code is a faithful merge of the existing CUDA and HIP sources. The concerns listed above are discussion points for the design, not blockers.
For the full migration, I'd recommend:
- Resolve the warp-size host/device detection question (concern #1 above)
- Add compilation tests that verify the unified files build correctly for both CUDA and HIP
- Run the full test suite on both CUDA and ROCm hardware to verify numerical equivalence
- Sequence this after or coordinate with #1858 to avoid rework
@Abdennacer-Badaoui Thanks! This is essentially what I was expecting we could do. I think this is a good way forward. Most of the review comments above make sense as well!
RFC — Not intended to be merged as-is
This PR proposes a design for merging the duplicated CUDA and HIP kernel sources into a single codebase. The `csrc/examples/` directory contains the full unified files demonstrating the approach. This is meant for discussion and feedback before we proceed with a full migration.

Problem
We maintain near-identical copies of every GPU kernel:
- `kernels.cu` / `kernels.hip`
- `kernels.cuh` / `kernels_hip.cuh`
- `ops.cu` / `ops.hip`
- `ops.cuh` / `ops_hip.cuh`
- `common.cuh` / `common_hip.cuh`

The HIP files were originally auto-generated by hipify and manually patched. Every bug fix or new feature must be applied to both copies, and they inevitably drift apart.
Proposed design
Introduce two portability headers:
- `compat.cuh` — Host-safe types and macros (safe to include from `.cpp` files)
- `compat_device.cuh` — Device-only layer: CUB/hipCUB, reduction ops, MMA (include from `.cu` files only)

These resolve all mechanical CUDA/HIP differences via macros, type aliases, and namespace aliases:
- `bnb_cub::` → `cub::` on CUDA, `hipcub::` on HIP
- `bnb_bfloat16` → `__nv_bfloat16` on CUDA, `hip_bfloat16` on HIP
- `bnb_stream_t` → `cudaStream_t` / `hipStream_t`
- `BNB_MAX_OP` → `cub::Max()` / `hipcub::Max()`
- `BNB_CHECK_RETURN()` → unified error checking
- `bnb_blasLt*`, `bnb_sparse*` → cuBLAS/hipBLAS and cuSPARSE/hipSPARSE
Kernel code uses these abstractions and compiles unmodified on both platforms (see the illustrative sketch after the list below). The `<<<grid, block>>>` launch syntax works natively on HIP, so no `hipLaunchKernelGGL` wrappers are needed. For HIP builds, CMake simply sets `LANGUAGE HIP` on the `.cu` files.

`#if BNB_HIP` guards are only needed for genuinely divergent code (~10% of changes):

- `atomicMax` (CUDA needs CAS loop, HIP has native)
- `Context` class (cuBLAS vs rocBLAS handle creation)
- `igemmlt` (hipBLAS requires explicit heuristic algo selection)
- Warp-size handling (`BNB_WARP_SIZE` compile-time constants)
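As an illustration of the pattern, a minimal sketch of a kernel written purely against the compat layer; the kernel and host wrapper below are placeholders for this description, not files from `csrc/examples/`:

```cpp
// Illustrative placeholder kernel: only compat-layer names are used
// (bnb_cub, BNB_MAX_OP, bnb_stream_t), so the same source builds on both stacks.
template <int BLOCK_SIZE>
__global__ void kBlockAbsMax(const float* __restrict__ in, float* __restrict__ out, int n) {
    using BlockReduce = bnb_cub::BlockReduce<float, BLOCK_SIZE>;
    __shared__ typename BlockReduce::TempStorage temp;

    int idx = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    float v = (idx < n) ? fabsf(in[idx]) : 0.0f;

    // BNB_MAX_OP expands to cub::Max() or hipcub::Max() depending on the platform.
    float block_max = BlockReduce(temp).Reduce(v, BNB_MAX_OP);
    if (threadIdx.x == 0)
        out[blockIdx.x] = block_max;
}

void blockAbsMax(const float* in, float* out, int n, bnb_stream_t stream) {
    constexpr int kBlock = 256;
    int grid = (n + kBlock - 1) / kBlock;
    // <<<>>> compiles natively under both nvcc and hipcc (with LANGUAGE HIP set),
    // so no hipLaunchKernelGGL wrapper is needed.
    kBlockAbsMax<kBlock><<<grid, kBlock, 0, stream>>>(in, out, n);
}
```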
The split into two headers is necessary because `.cpp` files (like `pythonInterface.cpp`) are compiled by the host compiler (gcc/g++), which cannot parse CUB/device headers. Only `.cu` files compiled by nvcc/hipcc include `compat_device.cuh`.

Example files in
`csrc/examples/`:

- `compat.cuh`
- `compat_device.cuh`
- `common_unified.cuh` (replaces `common.cuh` + `common_hip.cuh`)
- `kernels_unified.cu` (replaces `kernels.cu` + `kernels.hip`)
- `ops_unified.cuh`
- `ops_unified.cu` (`#if BNB_HIP` for divergent APIs)
- `pythonInterface_unified.cpp`
- `CMakeLists_unified.txt` (`GPU_FILES` list)

End state after full migration
- Deleted: `common_hip.cuh`, `kernels.hip`, `kernels_hip.cuh`, `ops.hip`, `ops_hip.cuh`
- Kept: `common.cuh`, `kernels.cu`, `kernels.cuh`, `ops.cu`, `ops.cuh` (now unified)
- Added: `compat.cuh`, `compat_device.cuh`