Help to understand fattn-mma-f16 #18243
Replies: 3 comments 1 reply
To avoid confusion, I assume the following data layouts in global memory: Q is column-major, K is row-major, V is column-major, VKQ is column-major.
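For concreteness, here is a minimal indexing sketch of what those layouts mean (illustrative only, not code from the kernel; `ld` stands for a hypothetical leading dimension):

```cpp
// Illustrative only: how element (i, j) of a matrix with leading dimension ld
// is addressed under the two layouts mentioned above.
__device__ __forceinline__ float get_row_major(const float * A, const int i, const int j, const int ld) {
    return A[i*ld + j]; // elements of one row are contiguous
}

__device__ __forceinline__ float get_col_major(const float * A, const int i, const int j, const int ld) {
    return A[j*ld + i]; // elements of one column are contiguous
}
```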
Got it, any suggestion for the implementation on RDNA? It looks like the only way to get it working on RDNA is to use the 16-row tile path from Turing and then load the transposed K the normal way, since the RDNA MMA layout is 16x16x16.
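To make sure we are talking about the same thing, this is roughly the "normal way" I mean: stage the K tile through shared memory and transpose it on the store, so the 16x16 operand can afterwards be read along the other dimension. All names, the 16x16 tile size, and the padding amount below are my own illustrative assumptions, not code from the kernel:

```cpp
#include <cuda_fp16.h>

constexpr int TILE  = 16; // assumed tile size for a 16x16x16 RDNA-style MMA
constexpr int K_PAD = 8;  // illustrative padding (in half elements), not a tuned value

// One warp of 32 threads copies a 16x16 half tile of K (row-major in global
// memory) into shared memory, transposing it on the store.
__device__ void load_K_tile_transposed(
        const half * __restrict__ K_gl,     // start of the 16x16 tile in global memory
        half (*K_sh)[TILE + K_PAD],         // padded shared-memory tile, written transposed
        const int ldk) {                    // row stride of K in global memory
    const int lane = threadIdx.x % 32;
    for (int e = 0; e < 8; ++e) {           // 32 threads * 8 elements = 256 = 16*16
        const int idx = lane*8 + e;
        const int row = idx / TILE;         // row    within the global tile
        const int col = idx % TILE;         // column within the global tile
        K_sh[col][row] = K_gl[row*ldk + col]; // transpose on the store
    }
}
```

A real implementation would probably choose the padding and the per-thread access pattern together (or use swizzling) to actually avoid shared memory bank conflicts; the values above are just placeholders.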
Thank you for the info, good to know that I don't need to worry about "cols_per_warp == 8". I will give it a try first, but obviously RDNA needs more care: at the very least the padding for K needs to be adjusted, since there is no ldmatrix_trans. I will keep this thread open as there might be more questions in the future. Anyway, thank you for the support.
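Concretely, the kind of adjustment I have in mind looks something like this (the macro name and the padding amounts are hypothetical placeholders, not anything from the ggml sources):

```cpp
// Hypothetical compile-time switch: with ldmatrix_trans the K tile in shared
// memory can keep the layout that instruction expects; without it (e.g. on
// RDNA) the tile gets extra padding so the transposed accesses are spread
// over more shared memory banks.
#ifdef HYPOTHETICAL_HAS_LDMATRIX_TRANS
constexpr int K_SHMEM_PAD = 0;  // transposition handled by ldmatrix_trans itself
#else
constexpr int K_SHMEM_PAD = 8;  // illustrative value in half elements, not tuned
#endif

// __shared__ half K_tile[16][16 + K_SHMEM_PAD]; // declared inside the kernel
```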
Hello @JohannesGaessler
I'm going through fattn-mma-f16 and trying to add RDNA support, but I'm not very clear on the logic of the code. fattn-wmma-f16 is similar to the original paper, but fattn-mma-f16 looks different, so I'm opening this thread to ask some questions and figure out the code logic. Thank you.
Q1: It looks like fattn-mma-f16 computes trans(V) * online_softmax(K*Q) rather than the softmax(Q*K^T)*V formulation of the original paper. May I ask the reason? (See the identity sketched after these questions.)
Q2: I've seen a lot of "cols_per_warp == 8" special cases for the Turing MMA path. Could you help explain the root cause? This matters especially because movmatrix is used on that path and RDNA doesn't have movmatrix.
Q3: It looks like the Volta path doesn't use ldmatrix_trans. Could you explain how V is handled there, since AFAIK V always has to be transposed? RDNA doesn't have ldmatrix_trans either, so this might be helpful for performance. Thank you.
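For reference, the identity I assume is behind Q1 (my reading of the code, so please correct me if this is wrong): with softmax applied row-wise,

$$
O = \operatorname{softmax}\!\left(QK^\top\right)V
\qquad\Longrightarrow\qquad
O^\top = V^\top \operatorname{softmax}\!\left(QK^\top\right)^\top
       = V^\top \operatorname{softmax}_{\mathrm{col}}\!\left(KQ^\top\right),
$$

so computing trans(V) * softmax(K*Q) produces the transpose of the paper's output, and storing that result column-major is the same memory layout as storing the paper's output row-major.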
Best Regards
Hui