feat[fastlanes]: add optimized 1024-bit transpose implementations #6135
Performance Regression: -29.9%
⚠️ Unknown Walltime execution environment detected
Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.
For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.
⚡ 3 improved benchmarks
❌ 7 regressed benchmarks
✅ 1252 untouched benchmarks
🆕 16 new benchmarks
⏩ 1290 skipped benchmarks [1]
⚠️ Please fix the performance issues or acknowledge them on CodSpeed.
Performance Changes
| | Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|---|
| ⚡ | WallTime | u8_FoR[10M] | 71.7 µs | 5.6 µs | ×13 |
| ❌ | Simulation | canonical_into_non_nullable[(10000, 100, 0.0)] | 1.9 ms | 2.7 ms | -29.9% |
| ❌ | Simulation | canonical_into_non_nullable[(10000, 100, 0.1)] | 3.7 ms | 4.5 ms | -18.26% |
| ❌ | Simulation | canonical_into_non_nullable[(10000, 100, 0.01)] | 2.1 ms | 3 ms | -27.53% |
| ⚡ | Simulation | canonical_into_nullable[(10000, 10, 0.0)] | 528.5 µs | 445.6 µs | +18.61% |
| ⚡ | Simulation | canonical_into_nullable[(10000, 100, 0.0)] | 4.9 ms | 4.1 ms | +19.6% |
| ❌ | Simulation | into_canonical_non_nullable[(10000, 100, 0.0)] | 1.9 ms | 2.7 ms | -29.38% |
| ❌ | Simulation | into_canonical_non_nullable[(10000, 100, 0.01)] | 2.2 ms | 3 ms | -26.6% |
| ❌ | Simulation | into_canonical_non_nullable[(10000, 100, 0.1)] | 3.8 ms | 4.6 ms | -17.54% |
| ❌ | Simulation | into_canonical_nullable[(10000, 100, 0.0)] | 4.4 ms | 5.2 ms | -15.61% |
| 🆕 | Simulation | transpose_baseline_throughput | N/A | 2.5 ms | N/A |
| 🆕 | Simulation | transpose_baseline | N/A | 10.9 µs | N/A |
| 🆕 | Simulation | transpose_best_throughput | N/A | 92.8 µs | N/A |
| 🆕 | Simulation | transpose_best | N/A | 2 µs | N/A |
| 🆕 | Simulation | transpose_scalar | N/A | 3.4 µs | N/A |
| 🆕 | Simulation | untranspose_best | N/A | 2.8 µs | N/A |
| 🆕 | Simulation | transpose_scalar_throughput | N/A | 661 µs | N/A |
| 🆕 | Simulation | transpose_scalar_fast | N/A | 1.7 µs | N/A |
| 🆕 | Simulation | untranspose_baseline | N/A | 10.9 µs | N/A |
| 🆕 | Simulation | transpose_scalar_fast_throughput | N/A | 64.2 µs | N/A |
| ... | ... | ... | ... | ... | ... |
ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
Comparing claude/bitpacking-transpose-optimization-tM1U4 (2cbd439) with develop (13f120f)
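
The Efficiency figures in the table appear consistent with a relative change computed as `BASE / HEAD - 1`, so a slower HEAD yields a negative percentage, and with a plain `BASE / HEAD` ratio for the large WallTime speedup. The sketch below illustrates that reading; the formula is inferred from the displayed numbers rather than taken from CodSpeed documentation, and `efficiency_percent` is a hypothetical helper name. Because the table shows rounded timings, recomputed values only approximate the reported ones.

```rust
/// Hypothetical helper: relative change of HEAD against BASE, in percent.
/// Assumes the report uses (base / head - 1) * 100, which is inferred
/// from the table values and not confirmed by the source.
fn efficiency_percent(base: f64, head: f64) -> f64 {
    (base / head - 1.0) * 100.0
}

fn main() {
    // canonical_into_nullable[(10000, 10, 0.0)]: 528.5 µs -> 445.6 µs (reported +18.61%)
    println!("{:+.2}%", efficiency_percent(528.5, 445.6));

    // canonical_into_non_nullable[(10000, 100, 0.0)]: 1.9 ms -> 2.7 ms (reported -29.9%)
    // The displayed timings are rounded, so this only lands near the reported value.
    println!("{:+.2}%", efficiency_percent(1.9, 2.7));

    // u8_FoR[10M]: 71.7 µs -> 5.6 µs, shown in the table as a speedup factor (×13)
    println!("x{:.1}", 71.7 / 5.6);
}
```

Under this reading, the headline -29.9% corresponds to the worst regressed row, not to an aggregate over all benchmarks.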
Footnotes
1. 1290 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.