@serge-sans-paille
Contributor

It's generally slower than the sse2 version due to a latency of 5 (!) for hadd_pd.

Related to #1107
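For readers comparing the two approaches, here is a minimal sketch of what the two reductions could look like for a `__m128d` register. The helper names `reduce_add_sse2` and `reduce_add_sse3` are hypothetical and chosen for illustration; this is not necessarily the exact code in this patchset. The SSE3 variant is only compiled when the compiler advertises SSE3 support (for example with `-msse3`).

```cpp
#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: horizontal sum of both lanes using SSE2 only.
inline double reduce_add_sse2(__m128d v)
{
    __m128d hi = _mm_unpackhi_pd(v, v); // copy the upper lane down: [v1, v1]
    __m128d sum = _mm_add_sd(v, hi);    // lower lane now holds v0 + v1
    return _mm_cvtsd_f64(sum);          // extract the lower lane
}

#if defined(__SSE3__)
#include <pmmintrin.h> // SSE3 intrinsics

// Hypothetical helper: the same reduction through the SSE3 horizontal add.
inline double reduce_add_sse3(__m128d v)
{
    __m128d sum = _mm_hadd_pd(v, v); // [v0 + v1, v0 + v1]
    return _mm_cvtsd_f64(sum);       // extract the lower lane
}
#endif
```

Both helpers return the same value; the difference discussed here is only in instruction latency, so any choice between them should be confirmed by benchmarking on the target CPU.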

@serge-sans-paille serge-sans-paille force-pushed the feature/faster-reduce_add branch from 6d0c663 to 634b18f Compare April 17, 2025 12:25
Forwarding to sse is actually faster. Related to #1107
@serge-sans-paille serge-sans-paille force-pushed the feature/faster-reduce_add branch from a208210 to b0a4665 Compare April 18, 2025 10:43
@serge-sans-paille
Contributor Author

@DiamonDinoia: why did you guard the code here: 174c475#diff-c4f5d7f47f45c737cc1723af31069217834daf7da679a0b4ff255a4a6ae73c83R1413? This "intrinsic" seems to be standard, and it passes our validation without the guard.

@DiamonDinoia
Contributor

I see, I thought it was an icc-only intrinsic from the way it was used in Agner's vcl. That said, I have been using it without problems on every compiler.

@serge-sans-paille
Contributor Author

Great. I'm going to merge this patchset then. Would you mind doing the same investigation for single-precision float?

@serge-sans-paille serge-sans-paille merged commit bb5dd63 into master Apr 18, 2025
120 checks passed
@DiamonDinoia
Contributor

DiamonDinoia commented Apr 18, 2025

Thanks for merging this!

Sure, I will have a look next week when I have a moment.

Would you mind considering the API I was suggesting for reducing interleaved complex values?
