Performance Issue: Inefficient get(0) Implementation
Summary
The current implementation of get() for batch<T, A> always stores the entire batch into an aligned buffer and returns buffer[I], even for I == 0. This introduces unnecessary overhead when only the first element is needed, which is common in reduction operations.
template <class A, size_t I, class T>
XSIMD_INLINE T get(batch<T, A> const& self, ::xsimd::index<I>, requires_arch<common>) noexcept
{
alignas(A::alignment()) T buffer[batch<T, A>::size];
self.store_aligned(&buffer[0]);
return buffer[I];
}
Problem
Accessing the first element (get(0)) via a full store_aligned is much more expensive than necessary. The reduce function ends with self.get(0), which adds this unnecessary cost. If the batch is stored to a buffer anyway, the performance benefit of using reduce disappears, since we could just store everything to a buffer and do the reduction in scalar code. The entire purpose of reduction operations is to avoid copying the data to a buffer.
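To make the pattern concrete, here is a minimal sketch written directly with SSE intrinsics. It is not xsimd's actual reduce implementation, and the name horizontal_sum_via_store is hypothetical. It shows a horizontal sum whose result already sits in lane 0 of a register, yet the final step stores the whole register to an aligned buffer just to read that one lane, mirroring what the common get() does.

```cpp
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE/SSE2 intrinsics

// Hypothetical illustration, not xsimd code: sum the four float lanes of v.
inline float horizontal_sum_via_store(__m128 v) noexcept
{
    // upper = (v2, v3, v2, v3): move the high half into the low half.
    __m128 upper = _mm_movehl_ps(v, v);
    // pairs lane 0 = v0 + v2, lane 1 = v1 + v3.
    __m128 pairs = _mm_add_ps(v, upper);
    // Broadcast lane 1 of pairs so it can be added to lane 0.
    __m128 lane1 = _mm_shuffle_ps(pairs, pairs, _MM_SHUFFLE(1, 1, 1, 1));
    // total lane 0 = v0 + v1 + v2 + v3; the result is already in lane 0.
    __m128 total = _mm_add_ss(pairs, lane1);

    // Same final step as the common get(): store all four lanes to memory
    // and read back only the first one.
    alignas(16) float buffer[4];
    _mm_store_ps(buffer, total);
    return buffer[0];
}
#endif
```

The arithmetic stays in registers until the very last step. Only the read of lane 0 goes through memory.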
Proposed Solution
Introduce a first() helper for efficiently accessing the first lane of a batch:
template <class T, class A>
XSIMD_INLINE T first( batch<T, A> const& self) noexcept
{
// Example: platform-specific optimized intrinsic
return self.get_first(); // or use appropriate intrinsic depending on A
}
This could avoid the store_aligned() and instead use more efficient intrinsics like:
_mm_cvtsd_f64() (SSE2)
_mm256_castps256_ps128() + _mm_cvtss_f32() (AVX)
_mm512_cvtss_f32() (AVX512)
This would eliminate the cost of storing the entire batch just to access the first element, which should improve performance for reductions and any other first-element access patterns.
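As a rough sketch of what the per-architecture versions could look like, the snippet below uses the intrinsics listed above on raw registers. The first_lane overloads are hypothetical names for illustration and are not part of xsimd. How they would be wired into the library's dispatch (for example through requires_arch overloads) is left open, and the AVX overload is only compiled when the compiler defines __AVX__.

```cpp
#if defined(__SSE2__) || defined(_M_X64)
#include <immintrin.h> // SSE, SSE2 and (when enabled) AVX intrinsics

// Hypothetical helper: lane 0 of a 4 x float SSE register, no store needed.
inline float first_lane(__m128 v) noexcept
{
    return _mm_cvtss_f32(v);
}

// Hypothetical helper: lane 0 of a 2 x double SSE2 register.
inline double first_lane(__m128d v) noexcept
{
    return _mm_cvtsd_f64(v);
}

#if defined(__AVX__)
// Hypothetical helper: lane 0 of an 8 x float AVX register.
// The cast reinterprets the low 128 bits and generates no instruction.
inline float first_lane(__m256 v) noexcept
{
    return _mm_cvtss_f32(_mm256_castps256_ps128(v));
}
#endif
#endif
```

A usage example under the same assumptions: first_lane(_mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f)) reads 1.0f, because _mm_set_ps takes its arguments from the highest lane to the lowest.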