feat[fastlanes]: add optimized 1024-bit transpose implementations #6135

Performance Regression: -29.9%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 3 improved benchmarks
❌ 7 regressed benchmarks
✅ 1252 untouched benchmarks
🆕 16 new benchmarks
⏩ 1290 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| | Mode | Benchmark | `BASE` | `HEAD` | Efficiency |
|---|---|---|---|---|---|
| ⚡ | WallTime | `u8_FoR[10M]` | 71.7 µs | 5.6 µs | ×13 |
| ❌ | Simulation | `canonical_into_non_nullable[(10000, 100, 0.0)]` | 1.9 ms | 2.7 ms | -29.9% |
| ❌ | Simulation | `canonical_into_non_nullable[(10000, 100, 0.1)]` | 3.7 ms | 4.5 ms | -18.26% |
| ❌ | Simulation | `canonical_into_non_nullable[(10000, 100, 0.01)]` | 2.1 ms | 3 ms | -27.53% |
| ⚡ | Simulation | `canonical_into_nullable[(10000, 10, 0.0)]` | 528.5 µs | 445.6 µs | +18.61% |
| ⚡ | Simulation | `canonical_into_nullable[(10000, 100, 0.0)]` | 4.9 ms | 4.1 ms | +19.6% |
| ❌ | Simulation | `into_canonical_non_nullable[(10000, 100, 0.0)]` | 1.9 ms | 2.7 ms | -29.38% |
| ❌ | Simulation | `into_canonical_non_nullable[(10000, 100, 0.01)]` | 2.2 ms | 3 ms | -26.6% |
| ❌ | Simulation | `into_canonical_non_nullable[(10000, 100, 0.1)]` | 3.8 ms | 4.6 ms | -17.54% |
| ❌ | Simulation | `into_canonical_nullable[(10000, 100, 0.0)]` | 4.4 ms | 5.2 ms | -15.61% |
| 🆕 | Simulation | `transpose_baseline_throughput` | N/A | 2.5 ms | N/A |
| 🆕 | Simulation | `transpose_baseline` | N/A | 10.9 µs | N/A |
| 🆕 | Simulation | `transpose_best_throughput` | N/A | 92.8 µs | N/A |
| 🆕 | Simulation | `transpose_best` | N/A | 2 µs | N/A |
| 🆕 | Simulation | `transpose_scalar` | N/A | 3.4 µs | N/A |
| 🆕 | Simulation | `untranspose_best` | N/A | 2.8 µs | N/A |
| 🆕 | Simulation | `transpose_scalar_throughput` | N/A | 661 µs | N/A |
| 🆕 | Simulation | `transpose_scalar_fast` | N/A | 1.7 µs | N/A |
| 🆕 | Simulation | `untranspose_baseline` | N/A | 10.9 µs | N/A |
| 🆕 | Simulation | `transpose_scalar_fast_throughput` | N/A | 64.2 µs | N/A |
| ... | ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.
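
For readers checking the Efficiency column by hand: the displayed values look consistent with `BASE / HEAD - 1` for percentages and `BASE / HEAD` for the `×N` speedup. That formula is inferred from the rounded numbers in this report and is an assumption, not something taken from CodSpeed's documentation. The helper name `efficiency` is hypothetical. Small mismatches against the reported percentages are expected because the report shows rounded timings.

```python
def efficiency(base: float, head: float) -> float:
    """Relative change of HEAD versus BASE.

    Assumed formula (inferred from the report, not confirmed): base / head - 1.
    Positive means HEAD is faster, negative means HEAD is slower.
    Both arguments must use the same time unit.
    """
    if head <= 0:
        raise ValueError("head time must be positive")
    return base / head - 1.0


# Rounded timings copied from the table above, as (BASE, HEAD) pairs.
rows = {
    "canonical_into_nullable[(10000, 10, 0.0)]": (528.5, 445.6),    # µs
    "canonical_into_non_nullable[(10000, 100, 0.0)]": (1.9, 2.7),   # ms
    "u8_FoR[10M]": (71.7, 5.6),                                     # µs
}

for name, (base, head) in rows.items():
    change = efficiency(base, head)
    print(f"{name}: {change * 100:+.1f}% (speedup ×{base / head:.1f})")
```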

Comparing `claude/bitpacking-transpose-optimization-tM1U4` (2cbd439) with `develop` (13f120f)

¹ 1290 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them on CodSpeed to remove them from the performance reports.

feat[fastlanes]: add 4-block VBMI transpose for 7% additional speedup
