Help to understand fattn-mma-f16 #18243
Replies: 3 comments 1 reply
To avoid confusion, I assume the following data layouts in global memory: Q is column-major, K is row-major, V is column-major, VKQ is column-major.
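For concreteness, here is a minimal indexing sketch of what those layouts mean (illustrative only, not code from the kernel; `ld` stands for a hypothetical leading dimension):

```cpp
// Illustrative only: how element (i, j) of a matrix with leading dimension ld
// is addressed under the two layouts mentioned above.
__device__ __forceinline__ float get_row_major(const float * A, const int i, const int j, const int ld) {
    return A[i*ld + j]; // elements of one row are contiguous
}

__device__ __forceinline__ float get_col_major(const float * A, const int i, const int j, const int ld) {
    return A[j*ld + i]; // elements of one column are contiguous
}
```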
Got it, any suggestion for the implementation on RDNA? It looks like the only way to get it working on RDNA is to use the 16-row tile path from Turing and then load the transposed K the normal way, since the RDNA MMA layout is 16x16x16.
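To make sure we are talking about the same thing, this is roughly the "normal way" I mean: stage the K tile through shared memory and transpose it on the store, so the 16x16 operand can afterwards be read along the other dimension. All names, the 16x16 tile size, and the padding amount below are my own illustrative assumptions, not code from the kernel:

```cpp
#include <cuda_fp16.h>

constexpr int TILE  = 16; // assumed tile size for a 16x16x16 RDNA-style MMA
constexpr int K_PAD = 8;  // illustrative padding (in half elements), not a tuned value

// One warp of 32 threads copies a 16x16 half tile of K (row-major in global
// memory) into shared memory, transposing it on the store.
__device__ void load_K_tile_transposed(
        const half * __restrict__ K_gl,     // start of the 16x16 tile in global memory
        half (*K_sh)[TILE + K_PAD],         // padded shared-memory tile, written transposed
        const int ldk) {                    // row stride of K in global memory
    const int lane = threadIdx.x % 32;
    for (int e = 0; e < 8; ++e) {           // 32 threads * 8 elements = 256 = 16*16
        const int idx = lane*8 + e;
        const int row = idx / TILE;         // row    within the global tile
        const int col = idx % TILE;         // column within the global tile
        K_sh[col][row] = K_gl[row*ldk + col]; // transpose on the store
    }
}
```

A real implementation would probably choose the padding and the per-thread access pattern together (or use swizzling) to actually avoid shared memory bank conflicts; the values above are just placeholders.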
Thank you for the info, good to know that I don't need to worry about "cols_per_warp == 8". I will give it a try first, but obviously RDNA needs more care: at the very least the padding for K needs to be adjusted, since there is no ldmatrix_trans. I will keep this thread open as there might be more questions in the future. Anyway, thank you for the support.
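Concretely, the kind of adjustment I have in mind looks something like this (the macro name and the padding amounts are hypothetical placeholders, not anything from the ggml sources):

```cpp
// Hypothetical compile-time switch: with ldmatrix_trans the K tile in shared
// memory can keep the layout that instruction expects; without it (e.g. on
// RDNA) the tile gets extra padding so the transposed accesses are spread
// over more shared memory banks.
#ifdef HYPOTHETICAL_HAS_LDMATRIX_TRANS
constexpr int K_SHMEM_PAD = 0;  // transposition handled by ldmatrix_trans itself
#else
constexpr int K_SHMEM_PAD = 8;  // illustrative value in half elements, not tuned
#endif

// __shared__ half K_tile[16][16 + K_SHMEM_PAD]; // declared inside the kernel
```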
Hello @JohannesGaessler
I'm going through fattn-mma-f16 and trying to add RDNA support, but I'm not very clear on the logic of the code. fattn-wmma-f16 is similar to the original paper, but fattn-mma-f16 looks different, so I'm opening this thread to ask some questions and figure out the code logic. Thank you.
Q1: It looks like fattn-mma-f16 computes trans(V) * online_softmax(K*Q) rather than the softmax(Q*K^T)*V formulation of the original paper. May I ask the reason? (See the identity sketched after these questions.)
Q2: I've seen a lot of "cols_per_warp == 8" special cases for the Turing MMA path. Could you help explain the root cause? This matters especially because movmatrix is used on that path and RDNA doesn't have movmatrix.
Q3: It looks like the Volta path doesn't use ldmatrix_trans. Could you explain how V is handled there, since AFAIK V always has to be transposed? RDNA doesn't have ldmatrix_trans either, so this might be helpful for performance. Thank you.
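For reference, the identity I assume is behind Q1 (my reading of the code, so please correct me if this is wrong): with softmax applied row-wise,

$$
O = \operatorname{softmax}\!\left(QK^\top\right)V
\qquad\Longrightarrow\qquad
O^\top = V^\top \operatorname{softmax}\!\left(QK^\top\right)^\top
       = V^\top \operatorname{softmax}_{\mathrm{col}}\!\left(KQ^\top\right),
$$

so computing trans(V) * softmax(K*Q) produces the transpose of the paper's output, and storing that result column-major is the same memory layout as storing the paper's output row-major.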
Best Regards
Hui