Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
22 changes: 15 additions & 7 deletions ggml/src/ggml-vulkan/ggml-vulkan.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -2838,18 +2838,26 @@ static void ggml_vk_load_shaders(vk_device& device) {
m_warptile_mmqid_int_k = { 128, 64, 64, 32, mul_mat_subgroup_size_16, 32, 1, 2, 2, 1, mul_mat_subgroup_size_16 };
s_warptile_mmqid_int_k = { mul_mat_subgroup_size_32, 32, 32, 32, 32, 32, 1, 2, 1, 1, mul_mat_subgroup_size_16 };

l_mmq_wg_denoms = l_wg_denoms = { 128, 128, 1 };
m_mmq_wg_denoms = m_wg_denoms = { 64, 64, 1 };
s_mmq_wg_denoms = s_wg_denoms = { 32, 32, 1 };
l_align = 128;
m_align = 64;
s_align = 32;

// chip specific tuning
if ((device->architecture == AMD_GCN) && (device->driver_id != vk::DriverId::eAmdProprietary)) {
m_warptile_mmq = m_warptile_mmq_int = { 256, 64, 64, 32, 16, 16, 2, 2, 2, 1, 16 };
m_warptile_mmqid = m_warptile_mmqid_int = { 256, 64, 64, 32, 16, 16, 2, 2, 2, 1, 16 };
}

l_mmq_wg_denoms = l_wg_denoms = {128, 128, 1 };
m_mmq_wg_denoms = m_wg_denoms = { 64, 64, 1 };
s_mmq_wg_denoms = s_wg_denoms = { 32, 32, 1 };
l_align = 128;
m_align = 64;
s_align = 32;
else if ((device->vendor_id == VK_VENDOR_ID_INTEL) && (device->driver_id == vk::DriverId::eIntelProprietaryWindows)) {
if (device->coopmat_support && device->architecture == INTEL_XE2) {
// Xe2/Xe3 with coopmat enabled - warptile performance tuning
m_warptile = { 512, 128, 128, 16, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should actually be the large tile size?

Also, a quick google search suggests Xe2 has 64KB of register file per core, which with 512 invocations is only 32 registers each which seems very low. But I've never worked on this hardware so I'm just speculating.

Copy link
Author

@virajwad virajwad Dec 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jeffbolznv Sure I can look at re-enabling large warptile size for Intel here and then moving the warptile config from m_ to l_. I'll also check perf again after the change.

Are you doing (64 * 1024) / 512 invocations is 128 bytes per invocation and the assumption is 4 byte width register? (to get 32 registers per invocation?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's the calculation I did.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Jeff, for Xe architecture each register in GRF was 32 Byte wide. But I need to look into the register situation a bit deeper

m_warptile_mmq = { 512, 128, 128, 32, subgroup_size_8, 32, 2, tm_m, tn_m, tk_m, subgroup_size_8 };
m_mmq_wg_denoms = m_wg_denoms = { 128, 128, 1 };
}
}

for (uint32_t i = 0; i < GGML_TYPE_COUNT; ++i) {
ggml_type t = (ggml_type)i;
Expand Down