Conversation

@Flakebi (Contributor) commented Dec 14, 2025

Add intrinsics for the amdgpu architecture.

I’m not sure how to add/run CI (ci/run.sh fails for me, e.g. for nvptx, because core cannot be found), but I checked that it compiles without warnings with:

$ CARGO_BUILD_RUSTFLAGS=-Ctarget-cpu=gfx900 cargo check --target amdgcn-amd-amdhsa -Zbuild-std=core

Tracking issue: rust-lang/rust#149988

@rustbot (Collaborator) commented Dec 14, 2025

r? @sayantn

rustbot has assigned @sayantn.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

@sayantn
Copy link
Contributor

sayantn commented Dec 14, 2025

I think it can be "tested" the same way we do core::arch::nvptx (it is a similar NO_RUN, NO_STD target). For those targets, we just test in CI that the crate builds. You can look at how we handle nvptx64-nvidia-cuda in .github/workflows/main.yml. Also, you would need to add this target to ci/dox.sh to check that the documentation builds.

@Flakebi force-pushed the amdgpu-intrinsics branch 5 times, most recently from d6043df to e102811, on December 14, 2025 at 20:28
@Flakebi (Contributor, Author) commented Dec 14, 2025

Thanks, I tried to replicate what’s there for nvptx, plus adding -Ctarget-cpu (amdgpu doesn’t compile without one; it has no generic CPU) and build-std=core.
I still can’t run ci/run-docker.sh or ci/run.sh locally (ci/dox.sh seems to work if I remove s390x), but I think I got it working now.
Sorry for the many force-pushes!

The diff against my first push from when opening this PR is here: https://github.com/rust-lang/stdarch/compare/b3f5bdae0efbdd5f7297d0225623bd31c7fe895b..e1028110e77561574bfb7ea349154d46b5ea7b86

@sayantn (Contributor) left a review comment

Thanks, I have left some comments.

@sayantn (Contributor) commented Dec 16, 2025

Also, I have noticed that Clang provides a different set of intrinsics than these (amdgpuintrin.h and gpuintrin.h). Is this divergence intentional? We normally stick to following the C conventions in stdarch.

@Flakebi (Contributor, Author) commented Dec 16, 2025

Thanks for the review!

Interesting, I thought core_arch was more akin to the __builtin_* family of functions in clang.
The clang __gpu intrinsics are definitely useful, though they are, unsurprisingly, missing low-level, hardware-specific intrinsics (e.g. wave_id, sethalt, and the buffer/image intrinsics, which I can’t add yet because they need SIMD type support in rustc).

I do plan to write a common gpu module for amdgpu and nvptx as well.
Maybe that one should follow gpuintrin.h?
We could, e.g., have:

mod amdgpu {
    pub mod gpu {
        use super::*;
        // gpuintrin.h’s __gpu_thread_id_x corresponds to the amdgpu workitem id.
        pub fn thread_id_x() -> u32 {
            workitem_id_x()
        }
    }
}
// Same for nvptx

mod gpu {
    #[cfg(target_arch = "amdgpu")]
    pub use amdgpu::gpu::*;
    #[cfg(target_arch = "nvptx64")]
    pub use nvptx::gpu::*;

    // + more intrinsics as in gpuintrin.h
}
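
(For what it’s worth, clang structures this similarly: gpuintrin.h includes amdgpuintrin.h or nvptxintrin.h depending on the target and builds the generic __gpu_* functions on top.)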

@sayantn (Contributor) commented Dec 16, 2025

The analogues to __builtin functions in C are #[rustc_intrinsic] functions and linked LLVM intrinsics. They are generally not exposed to users (on stable) as they are tightly linked to the compiler implementation.
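
For illustration, a minimal sketch of the linked-LLVM-intrinsic pattern (the wrapper name is made up for this example; llvm.amdgcn.workitem.id.x is a real AMDGPU intrinsic):

#![feature(link_llvm_intrinsics)] // nightly-only, required for `link_name = "llvm.*"`

unsafe extern "C" {
    // The declaration resolves directly to the LLVM intrinsic.
    #[link_name = "llvm.amdgcn.workitem.id.x"]
    fn llvm_workitem_id_x() -> u32;
}

/// Illustrative wrapper; not necessarily what this PR exposes.
pub fn workitem_id_x() -> u32 {
    unsafe { llvm_workitem_id_x() }
}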

If there are some interesting platform-specific intrinsics, we can add them, but generally we follow GCC and Clang with regard to intrinsics.

Yeah, a common gpu module can be added following gpuintrin.h.

@Flakebi (Contributor, Author) commented Dec 18, 2025

Thanks for the review, that’s a nice use case for const generics! I addressed all your inline comments.

Also thanks for linking the clang intrinsics, I tend to forget about those.
As far as I am aware, the most-used APIs for GPUs are CUDA/HIP/SYCL on the compute side and DirectX/Vulkan (through HLSL/GLSL) on the graphics side. Although clang is a building block for these, the clang intrinsics are most often not used directly. Therefore I’m rather hesitant to follow clang for GPUs.

As common GPU intrinsics are a larger topic, I started a Discourse post here (please add your thoughts!): https://internals.rust-lang.org/t/naming-gpu-things-in-the-rust-compiler-and-standard-library/23833
Though I think progress on this doesn’t need to block amdgpu intrinsics (note that I don’t plan to stabilize amdgpu intrinsics; stabilizing generic GPU intrinsics seems more desirable).

My planned intrinsic “stack” is roughly as follows, from high-level to low-level (a sketch follows the list):

  1. Rust exposes (experimental) generic gpu intrinsics, e.g. at core::gpu (naming TBD, see the Discourse post)
  2. These are implemented for amdgpu and nvptx by calling target-specific intrinsics from core::arch::amdgpu/nvptx
  3. core::arch::amdgpu/nvptx intrinsics expose (not on stable) their respective LLVM intrinsics; naming can follow the LLVM intrinsic names
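
To make the layering concrete, here is a minimal sketch using a barrier as an example (all names are illustrative and nothing is final; llvm.amdgcn.s.barrier is a real LLVM intrinsic, and clang’s rough counterpart is __gpu_sync_threads):

#![feature(link_llvm_intrinsics)] // nightly-only

// 3. The raw LLVM intrinsic, linked by its LLVM name.
unsafe extern "C" {
    #[link_name = "llvm.amdgcn.s.barrier"]
    fn llvm_s_barrier();
}

// 2. core::arch::amdgpu wrapper, following the LLVM intrinsic name.
pub unsafe fn s_barrier() {
    unsafe { llvm_s_barrier() }
}

// 1. Generic core::gpu intrinsic with a Rust-specific name
//    (cf. clang’s __gpu_sync_threads), dispatching per target.
pub unsafe fn sync_threads() {
    #[cfg(target_arch = "amdgpu")]
    unsafe {
        s_barrier()
    }
    // An nvptx64 version would call the corresponding nvptx intrinsic.
}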

We could also use clang names for core::arch::amdgpu/nvptx; then we would end up with three sets of names:

  1. Generic core::gpu intrinsics: Rust-specific names
  2. core::arch::amdgpu/nvptx: clang names
  3. LLVM intrinsics (used in core::arch): LLVM names

I would prefer following LLVM’s intrinsic names in core::arch::amdgpu/nvptx instead of clang’s names.
What do you think?

Note for my future self: I got run.sh to work with NORUN=1 NOSTD=1 TARGET=amdgcn-amd-amdhsa CARGO_UNSTABLE_BUILD_STD=core ci/run.sh.

@sayantn (Contributor) left a review comment

Thanks, I left some more comments.

@sayantn (Contributor) commented Dec 18, 2025

I don't have any strong preferences. If you feel that the lower-level LLVM intrinsics are more useful than just having the clang intrinsics, go ahead with that -- we can change them, remove them, or add the clang intrinsics too if required (as long as they are unstable).

@Flakebi (Contributor, Author) commented Dec 19, 2025

Thanks for taking such a thorough look!
I fixed the (s.)wave.barrier and permlane64.i32 intrinsics and answered your questions inline. I think the “works only on some target-cpus” issue is the most difficult one. As long as the intrinsics are unstable, I think this state is fine -- but only because I have no good idea how to fix it, so if you or someone else has an idea, please let me know.

@sayantn (Contributor) left a review comment

I discovered some more problems with the intrinsics: wave.reduce.f{add,sub,max,min} and llvm.s.barrier.leave are not in the LLVM release branch yet; they were added about a month ago. They will fail to compile once someone actually monomorphizes the functions. We should add them once the changes propagate to the Rust LLVM fork (~3 months, probably), otherwise they will cause link failures.

Also, you probably forgot to add intrinsics for the add and fadd ones.

The other changes LGTM

@Flakebi (Contributor, Author) commented Dec 19, 2025

I fixed your comments and added a tests module with instructions to check that the expected machine instructions are generated.
I changed the default target-cpu to gfx12, which seems to be the one that supports the most intrinsics (it doesn’t support s_memrealtime, s_get_waveid_in_workgroup, permlane16_swap, and permlane32_swap).

@sayantn (Contributor) left a review comment

LGTM, just one nit

Edit: sorry, my mistake. No other problems. Thanks ❤️

@sayantn added this pull request to the merge queue Dec 19, 2025
Merged via the queue into rust-lang:main with commit 6111906 Dec 19, 2025
75 checks passed