
Conversation

@degasus (Contributor) commented Jul 21, 2025

  • Use `_maskz_permutexvar` instead of permute + `and` on AVX512BW, too
  • Use `xsimd::make_batch_constant`, and thus `_mm512_set`, instead of `_mm512_load(constexpr std::array)`

The first one is just a tiny optimization already done in the AVX512VBMI implementation.
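A minimal sketch of that difference, not the actual xsimd code (the function and parameter names here are illustrative only): with AVX512BW, the zero-masking variant of the permute can clear the unwanted lanes itself, so the separate AND disappears.

```cpp
#include <immintrin.h>

// Before: permute, then clear the unwanted 16-bit lanes with an extra AND.
__m512i shuffle_then_and(__m512i data, __m512i idx, __m512i lane_mask)
{
    __m512i permuted = _mm512_permutexvar_epi16(idx, data);
    return _mm512_and_si512(permuted, lane_mask);
}

// After: let the permute zero the masked-out lanes directly (zero-masking),
// saving the separate AND instruction. Available with AVX512BW.
__m512i shuffle_maskz(__m512i data, __m512i idx, __mmask32 keep)
{
    return _mm512_maskz_permutexvar_epi16(keep, idx, data);
}
```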

The second patch makes it easier for the compiler to move these constants into the .text section. GCC is not affected, but MSVC sadly is, a lot: it used to generate eight `mov [rbp + N], const` stores followed by a single `vmovdqu32 zmm, [rbp]` load - poor MSVC... And even worse for the store-forwarding unit...
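A rough before/after sketch of that codegen issue, again illustrative only (the index values and function names are placeholders, not the xsimd source):

```cpp
#include <immintrin.h>
#include <array>
#include <cstdint>

// Before: a constexpr array, loaded through a memory operand. MSVC
// materialized it as eight `mov [rbp + N], const` stores plus one
// vmovdqu32 load, which also hits the store-forwarding unit.
__m512i constant_via_load()
{
    alignas(64) static constexpr std::array<uint32_t, 16> k = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
    };
    return _mm512_load_si512(k.data());
}

// After: express the constant with a `set` intrinsic (per the PR, this is
// what xsimd::make_batch_constant ends up emitting here), so the compiler
// can constant fold it and keep the value out of the stack frame.
// Note that _mm512_set_epi32 takes its arguments highest element first.
__m512i constant_via_set()
{
    return _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                            7, 6, 5, 4, 3, 2, 1, 0);
}
```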

degasus added 2 commits July 21, 2025 20:18
This patch picks the instructions from avx512vbmi for the fast path.
Masking is faster than an additional AND instruction.
Instead of loading from an aligned array on the stack, emit the `set` intrinsic rather than the `load` intrinsic, which makes it easier for the compiler to constant fold these parts.

Sadly, MSVC needs this....
@serge-sans-paille serge-sans-paille merged commit 45cbf4b into xtensor-stack:master Jul 22, 2025
63 checks passed
2 participants