-
Notifications
You must be signed in to change notification settings - Fork 14.2k
vulkan: Warptile tuning for Intel Xe2/Xe3 #18178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 4 commits
0f82d0a
c908711
3441283
6b7f1e8
0cfa616
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2851,6 +2851,15 @@ static void ggml_vk_load_shaders(vk_device& device) { | |
| m_align = 64; | ||
| s_align = 32; | ||
|
|
||
| if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) { | ||
| if (device->coopmat_support && device->architecture == INTEL_XE2) { | ||
| // Xe2/Xe3 with coopmat enabled - warptile performance tuning | ||
| m_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 }; | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I wonder if this should actually be the large tile size? Also, a quick google search suggests Xe2 has 64KB of register file per core, which with 512 invocations is only 32 registers each which seems very low. But I've never worked on this hardware so I'm just speculating.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hi @jeffbolznv Sure I can look at re-enabling large warptile size for Intel here and then moving the warptile config from m_ to l_. I'll also check perf again after the change. Are you doing (64 * 1024) / 512 invocations is 128 bytes per invocation and the assumption is 4 byte width register? (to get 32 registers per invocation?)
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, that's the calculation I did.
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks Jeff, for Xe architecture each register in GRF was 32 Byte wide. But I need to look into the register situation a bit deeper |
||
| m_warptile_mmq = { 512, 128, 128, 32, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 }; | ||
| m_mmq_wg_denoms = m_wg_denoms = { 128, 128, 1 }; | ||
| } | ||
| } | ||
|
|
||
| for (uint32_t i = 0; i < GGML_TYPE_COUNT; ++i) { | ||
| ggml_type t = (ggml_type)i; | ||
| // Disable medium and large matrix multiplication if not enough shared memory is available | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah it's good to see more tunes show up 😃. Please move your tune to line 2845 so that they're all placed together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @netrunnereve thanks! I can do that but the wg_denoms change also needs to overwrite the default and be 'paired' with this tuning config to pass unit tests. Do you want me to make two separate 'if' statements with the same conditions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would just move this section above line 2841.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made the change