
@degasus degasus commented Jul 22, 2025

With a fast path for N = 4*i and a split version otherwise (inspired by avx512bw).

Since vpermd actually has lower latency than vpermw / vpermb on some Intel CPUs, it is also used instead of the avx512bw and avx512vbmi implementations where trivially possible.

Also reverse the order of the slow path for lower latency. It used to be:

```
SLR -> PERM ->
               OR -> PERM
SLL         ->
```

Now the latency is reduced to:

```
SLR -> PERM ->
               OR
SLL -> PERM ->
```

It should now also generate better code on avx512bw for N=63, with only one PERM (as already done for N=1).

For N=16,32,48, it prefers vshufi32x4 over vpermd, for lower latency on Zen4 and reduced register usage.
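
As a point of reference (a semantic model only, not the PR's SIMD implementation), `slide_left<N>` can be sketched on a plain byte array: every byte moves up by N positions towards higher indices, and zeros are shifted in at the bottom. The fast path above covers the case where N is a multiple of 4, so the move is a whole-dword permute; the split path handles the remaining byte offsets.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Scalar reference model of slide_left<N> on a Size-byte register:
// each byte moves up by N positions, zeros fill the low end.
// Illustration only; the actual code uses shifts and permutes.
template <std::size_t N, std::size_t Size>
std::array<std::uint8_t, Size> slide_left_ref(std::array<std::uint8_t, Size> x)
{
    std::array<std::uint8_t, Size> r{}; // zero-initialized
    for (std::size_t i = N; i < Size; ++i)
        r[i] = x[i - N];
    return r;
}
```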


degasus commented Jul 22, 2025

FYI, the reason I'm interested in slide_left is that my code requires a cumsum. Since `slide_left` takes its offset as a template parameter, the doubling loop has to be unrolled at compile time:

```cpp
template <class B, std::size_t... Is>
B cumsum_impl(B x, std::index_sequence<Is...>) {
  // One pass per power-of-two lane offset: x += slide_left<2^k * sizeof(T)>(x)
  ((x += xsimd::slide_left<(sizeof(typename B::value_type) << Is)>(x)), ...);
  return x;
}

template <class B>
B cumsum(B x) {
  return cumsum_impl(x, std::make_index_sequence<std::bit_width(B::size - 1)>{});
}
```
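
The doubling loop above is the classic log-step (Hillis–Steele) inclusive scan: after pass k, every lane holds the sum of itself and the 2^k - 1 lanes below it. A scalar model of the same scheme, assuming 8 lanes (illustration only, no xsimd dependency):

```cpp
#include <array>
#include <cstddef>

// Scalar model of the log-step prefix sum: each pass adds a copy of the
// array slid up by `step` lanes (zeros shifted in), doubling the summed span.
std::array<int, 8> cumsum_ref(std::array<int, 8> x)
{
    for (std::size_t step = 1; step < x.size(); step <<= 1) {
        std::array<int, 8> shifted{};            // slide_left by `step` lanes
        for (std::size_t i = step; i < x.size(); ++i)
            shifted[i] = x[i - step];
        for (std::size_t i = 0; i < x.size(); ++i)
            x[i] += shifted[i];                  // x += slide_left<step>(x)
    }
    return x;
}
```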

Is this a method you're interested in as part of XSIMD's public API?

serge-sans-paille commented Jul 24, 2025

cumsum... isn't that https://xsimd.readthedocs.io/en/latest/api/reducer_index.html#_CPPv4I00E10reduce_add1TRK5batchI1T1AE ?
EDIT: it's not :-/ yeah of course add it as a generic operation!

@serge-sans-paille serge-sans-paille merged commit 4b8842c into xtensor-stack:master Jul 24, 2025
63 checks passed
@degasus degasus deleted the opt_shift branch August 19, 2025 16:49