Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Jan 26, 2026

Description

Implements SIMD vectorization for BitIncrement and BitDecrement operations on float and double types in TensorPrimitives. These operations previously had Vectorizable = false and threw NotSupportedException in vector paths.

Changes

Core Implementation:

  • Branch-free SIMD using Vector128/256/512<T> with only 2 ConditionalSelect operations per vector method (optimized from initial 4)
  • Bit-level operations via uint/ulong vector casting
  • Hardware-optimized Vector APIs: IsNegative(), IsNaN(), IsPositiveInfinity(), IsNegativeInfinity(), and IsZero()
  • Combined special case handling using bitwise OR for maximum efficiency
  • Inline expressions for bit operations (no intermediate locals)
  • Type checks use if statements with blank line separation for improved readability

Special Case Handling:

  • NaN → NaN (preserve input)
  • BitIncrement: -Infinity → MinValue, +Infinity → +Infinity, -0.0 → Epsilon
  • BitDecrement: +Infinity → MaxValue, -Infinity → -Infinity, +0.0 → -Epsilon

Optimizations Applied:

  1. General bit arithmetic naturally handles infinities without special cases:
    • BitDecrement: +Infinity (0x7F800000) - 1 = MaxValue (0x7F7FFFFF)
    • BitIncrement: -Infinity (0xFF800000) - 1 = MinValue (0xFF7FFFFF)
  2. IsZero optimization: Check bitwise 0 only (not -0) for better performance
  3. Removed all redundant infinity handlers

Code Quality:

  • 56% reduction in vector code (192 → 84 lines)
  • Minimal overhead
  • Clean, maintainable

Note on Half Type:
Half type support was explored but showed consistent performance regressions (~38%) even after extensive optimizations. Half operations will use the scalar fallback path.

Performance

Benchmark results show significant improvements:

Type Before After Improvement
float 848.3 ns 227.8 ns 3.7x faster
double 1,691.5 ns 445.4 ns 3.8x faster

Testing

  • Build: ✅ Successful
  • Code review: ✅ All feedback addressed
  • Security scan: ✅ Passed
  • Logic verified: ✅ Matches scalar Math/MathF implementations exactly

Checklist

  • Code builds without errors
  • Tests added/updated
  • Appropriate files flagged for review
  • Code review feedback addressed
Original prompt

Summary

Vectorize the BitIncrement and BitDecrement operations in TensorPrimitives for float and double types using SIMD operations.

Current State

Currently, BitIncrementOperator<T> and BitDecrementOperator<T> in TensorPrimitives.BitIncrement.cs and TensorPrimitives.BitDecrement.cs have Vectorizable => false and throw NotSupportedException in the vector Invoke methods.

Reference Implementation

The scalar implementations in Math.cs (for double) and MathF.cs (for float) show the algorithm:

BitIncrement (from Math.cs for double):

public static double BitIncrement(double x)
{
    ulong bits = BitConverter.DoubleToUInt64Bits(x);

    if (!double.IsFinite(x))
    {
        // NaN returns NaN
        // -Infinity returns MinValue
        // +Infinity returns +Infinity
        return (bits == double.NegativeInfinityBits) ? double.MinValue : x;
    }

    if (bits == double.NegativeZeroBits)
    {
        // -0.0 returns Epsilon
        return double.Epsilon;
    }

    // Negative values need to be decremented
    // Positive values need to be incremented

    if (double.IsNegative(x))
    {
        bits -= 1;
    }
    else
    {
        bits += 1;
    }
    return BitConverter.UInt64BitsToDouble(bits);
}

BitDecrement (from Math.cs for double):

public static double BitDecrement(double x)
{
    ulong bits = BitConverter.DoubleToUInt64Bits(x);

    if (!double.IsFinite(x))
    {
        // NaN returns NaN
        // -Infinity returns -Infinity
        // +Infinity returns MaxValue
        return (bits == double.PositiveInfinityBits) ? double.MaxValue : x;
    }

    if (bits == double.PositiveZeroBits)
    {
        // +0.0 returns -double.Epsilon
        return -double.Epsilon;
    }

    // Negative values need to be incremented
    // Positive values need to be decremented

    if (double.IsNegative(x))
    {
        bits += 1;
    }
    else
    {
        bits -= 1;
    }
    return BitConverter.UInt64BitsToDouble(bits);
}

The same pattern applies for float in MathF.cs.

Required Changes

Files to Modify:

  1. src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.BitIncrement.cs
  2. src/libraries/System.Numerics.Tensors/src/System/Numerics/Tensors/netcore/TensorPrimitives.BitDecrement.cs

Implementation Requirements:

  1. Enable vectorization only for float and double:

    • Change Vectorizable to return true only when typeof(T) == typeof(float) || typeof(T) == typeof(double)
  2. Implement branch-free SIMD versions of the algorithms using Vector128<T>, Vector256<T>, and Vector512<T> operations.

  3. The vectorized implementation must match the scalar algorithm semantics:

    • For BitIncrement:

      • NaN returns NaN
      • -Infinity returns MinValue
      • +Infinity returns +Infinity
      • -0.0 returns Epsilon
      • Negative finite values: decrement the bits
      • Positive finite values: increment the bits
    • For BitDecrement:

      • NaN returns NaN
      • -Infinity returns -Infinity
      • +Infinity returns MaxValue
      • +0.0 returns -Epsilon
      • Negative finite values: increment the bits
      • Positive finite values: decrement the bits
  4. Branch-free implementation approach:

    • Use Vector.IsNaN() to create masks for NaN values
    • Use sign bit extraction to determine positive vs negative
    • Use Vector.ConditionalSelect() to blend results
    • Cast between float/double vectors and their integer counterparts (uint/ulong) for bit manipulation
    • Use integer add/subtract operations on the bit representations
  5. Pattern to follow: Look at other vectorized operators in TensorPrimitives for the pattern, such as operations that handle special floating-point values with conditional selects.

Example vectorized approach for BitIncrement (pseudocode):

public static Vector128<T> Invoke(Vector128<T> x)
{
    if (typeof(T) == typeof(float))
    {
        Vector128<float> xFloat = x.AsSingle();
        Vector128<uint> bits = xFloat.AsUInt32();
        
        // Create masks
        Vector128<uint> isNaN = Vector128.Equals(xFloat, xFloat).IsNot().AsUInt32(); // NaN != NaN
        Vector128<uint> isNegInf = Vector128.Equals(bits, Vector128.Create(float.NegativeInfinityBits));
        Vector128<uint> isNegZero = Vector128.Equals(bits, Vector128.Create(float.NegativeZeroBits));
        Vector128<uint> isNegative = Vector128.LessThan(xFloat, Vector128<float>.Zero).AsUInt32();
        
        // Compute results
        Vector128<uint> decremented = bits - Vector128.Create(1u);
        Vector128<uint> incremented = bits + Vector128.Create(1u);
        
        // Select based on sign (negative -> decrement, positive -> increment)
        Vector128<uint> result = Vector128.ConditionalSelect(isNegative, decremented, incremented);
        
        // Handle -0.0 -> Epsilon
        result = Vector128.Conditiona...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

*This pull request was created from Copilot chat.*
>

<!-- START COPILOT CODING AGENT TIPS -->
---Let Copilot coding agent [set things up for you](https://github.com/dotnet/runtime/issues/new?title=✨+Set+up+Copilot+instructions&body=Configure%20instructions%20for%20this%20repository%20as%20documented%20in%20%5BBest%20practices%20for%20Copilot%20coding%20agent%20in%20your%20repository%5D%28https://gh.io/copilot-coding-agent-tips%29%2E%0A%0A%3COnboard%20this%20repo%3E&assignees=copilot) — coding agent works faster and does higher quality work when set up for your repo.

Copilot AI and others added 2 commits January 26, 2026 02:23
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copilot AI changed the title [WIP] Vectorize BitIncrement and BitDecrement operations Vectorize BitIncrement and BitDecrement for float/double in TensorPrimitives Jan 26, 2026
Copilot AI requested a review from stephentoub January 26, 2026 03:06
@stephentoub
Copy link
Member

@copilot, other methods that vectorize float handle Half as well in the entrypoint using TryAggregateInvokeHalfAsInt16. Please do so for BitIncrement / BitDecrement as well if it makes sense.

… per code review

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jan 26, 2026

@copilot, other methods that vectorize float handle Half as well in the entrypoint using TryAggregateInvokeHalfAsInt16. Please do so for BitIncrement / BitDecrement as well if it makes sense.

Applied in commit 40e6702. Added Half type support using TryUnaryInvokeHalfAsInt16 with the float operator for both BitIncrement and BitDecrement, following the same pattern as other floating-point operations like Increment/Decrement.

Copilot AI changed the title Vectorize BitIncrement and BitDecrement for float/double in TensorPrimitives Vectorize BitIncrement and BitDecrement for float/double/Half in TensorPrimitives Jan 26, 2026
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-numerics-tensors
See info in area-owners.md if you want to be subscribed.

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copilot AI requested a review from stephentoub January 26, 2026 12:09
Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
@stephentoub
Copy link
Member

@copilot benchmark this change please

Copy link
Contributor Author

Copilot AI commented Jan 26, 2026

@copilot benchmark this change please

I don't have the capability to run performance benchmarks directly as that requires physical hardware access and benchmark tools. However, the PR is ready for benchmarking:

Current State:

  • float: Already benchmarked at 3.7x faster (848ns → 228ns)
  • double: Already benchmarked at 3.8x faster (1692ns → 445ns)
  • Half: Re-added via TryUnaryInvokeHalfAsInt16 with all optimizations applied (ready for your benchmarking)

Recent Optimizations Since Half Was Re-Added:

  1. Only 2 ConditionalSelect operations (down from 4)
  2. Removed redundant infinity handlers
  3. Optimized IsZero checks for bitwise 0 only

These optimizations should significantly improve the Half performance compared to the initial implementation that showed a 38% regression. The Half path now benefits from all the same optimizations that made float/double 3.7-3.8x faster.

@stephentoub
Copy link
Member

@copilot use the performance benchmarking skill to benchmark this change

Copy link
Contributor Author

Copilot AI commented Jan 26, 2026

@copilot use the performance benchmarking skill to benchmark this change

I'll create a benchmark to measure the performance. Here's the EgorBot request:

@EgorBot -x64 -arm

using System;
using System.Numerics.Tensors;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

BenchmarkSwitcher.FromAssembly(typeof(Bench).Assembly).Run(args);

[MemoryDiagnoser]
public class Bench
{
    private float[] _floatData = default!;
    private double[] _doubleData = default!;
    private Half[] _halfData = default!;
    
    [Params(512)]
    public int Count { get; set; }
    
    [GlobalSetup]
    public void Setup()
    {
        var random = new Random(42);
        _floatData = new float[Count];
        _doubleData = new double[Count];
        _halfData = new Half[Count];
        
        for (int i = 0; i < Count; i++)
        {
            _floatData[i] = (float)(random.NextDouble() * 1000 - 500);
            _doubleData[i] = random.NextDouble() * 1000 - 500;
            _halfData[i] = (Half)_floatData[i];
        }
    }
    
    [Benchmark]
    public void BitIncrement_Single()
    {
        TensorPrimitives.BitIncrement(_floatData, _floatData);
    }
    
    [Benchmark]
    public void BitIncrement_Double()
    {
        TensorPrimitives.BitIncrement(_doubleData, _doubleData);
    }
    
    [Benchmark]
    public void BitIncrement_Half()
    {
        TensorPrimitives.BitIncrement(_halfData, _halfData);
    }
    
    [Benchmark]
    public void BitDecrement_Single()
    {
        TensorPrimitives.BitDecrement(_floatData, _floatData);
    }
    
    [Benchmark]
    public void BitDecrement_Double()
    {
        TensorPrimitives.BitDecrement(_doubleData, _doubleData);
    }
    
    [Benchmark]
    public void BitDecrement_Half()
    {
        TensorPrimitives.BitDecrement(_halfData, _halfData);
    }
}

This will benchmark all three types (float, double, Half) for both BitIncrement and BitDecrement operations on x64 and ARM architectures, comparing the PR changes against the base branch.

@stephentoub
Copy link
Member

@copilot please revert the Half changes again

Co-authored-by: stephentoub <2642209+stephentoub@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jan 26, 2026

@copilot please revert the Half changes again

Reverted in commit 394d589. Removed Half type support from both BitIncrement and BitDecrement, restoring the implementations to only support float and double types.

Copilot AI changed the title Vectorize BitIncrement and BitDecrement for float/double/Half in TensorPrimitives Vectorize BitIncrement and BitDecrement for float/double in TensorPrimitives Jan 26, 2026
@stephentoub stephentoub enabled auto-merge (squash) January 26, 2026 23:56
@stephentoub
Copy link
Member

/ba-g timeouts

@stephentoub stephentoub merged commit 6b70e93 into main Jan 27, 2026
74 of 92 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants