perf(python): optimize int64 list/tuple serialization with C++ batch processing #3216

pandalee99 · 2026-01-27T03:32:34Z

Summary

Add C++ batch processing functions for int64 sequence serialization/deserialization to improve performance.

Changes:

Add Fory_PyInt64SequenceWriteToBuffer and Fory_PyInt64SequenceReadFromBuffer in C++ layer
Optimize _write_int and _read_int in CollectionSerializer for list/tuple types
Add unit tests and benchmark script

Performance

Benchmark results on 1M integers:

Operation	Before (loop)	After (batch)	Speedup
Serialize	22.80 ms	6.44 ms	3.5x

============================================================
Int64 Batch Serialization Benchmark
============================================================
Size         Original        Optimized       Speedup
------------------------------------------------------------
10,000       0.41            0.10            3.97x
100,000      2.28            0.68            3.38x
1,000,000    22.80           6.44            3.54x
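Dividing the printed times does not reproduce the Speedup column exactly (0.41 / 0.10 gives 4.10, not 3.97). That is expected if the speedup was computed from unrounded timings while the times themselves were printed with two decimals, which is how the benchmark script formats them. The sketch below is only a sanity check under that assumption: it computes the range of speedups consistent with the rounded times. `speedup_bounds` is a hypothetical helper written for this note, not part of the PR.

```python
# Timings in ms as printed in the table above:
# size -> (original, optimized, reported speedup).
REPORTED = {
    10_000: (0.41, 0.10, 3.97),
    100_000: (2.28, 0.68, 3.38),
    1_000_000: (22.80, 6.44, 3.54),
}

# A value printed with two decimals is only known to within +/- 0.005.
HALF_STEP = 0.005


def speedup_bounds(original_ms, optimized_ms, half_step=HALF_STEP):
    """Return the (lowest, highest) speedup consistent with the rounded timings."""
    lowest = (original_ms - half_step) / (optimized_ms + half_step)
    highest = (original_ms + half_step) / (optimized_ms - half_step)
    return lowest, highest


if __name__ == "__main__":
    for size, (original_ms, optimized_ms, reported) in REPORTED.items():
        low, high = speedup_bounds(original_ms, optimized_ms)
        print(
            f"{size:>9,}: reported {reported:.2f}x, "
            f"consistent range {low:.2f}x to {high:.2f}x"
        )
```

All three reported speedups fall inside their ranges, so the table is internally consistent even though the naive division differs.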

Run benchmark:

import time

import pyfory
from pyfory.buffer import Buffer


def benchmark_original_write(data, iterations=20):
    """Benchmark original loop-based write_varint64."""
    buf = Buffer.allocate(len(data) * 9 + 1024)

    # Warmup
    for _ in range(3):
        buf.writer_index = 0
        for v in data:
            buf.write_varint64(v)

    # Benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        buf.writer_index = 0
        for v in data:
            buf.write_varint64(v)
    end = time.perf_counter()

    return (end - start) / iterations * 1000


def benchmark_optimized_serialize(data, iterations=20):
    """Benchmark optimized batch serialization via Fory."""
    fory = pyfory.Fory(ref=True)

    # Warmup
    for _ in range(3):
        fory.serialize(data)

    # Benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        fory.serialize(data)
    end = time.perf_counter()

    return (end - start) / iterations * 1000


def benchmark_deserialize(data, iterations=20):
    """Benchmark deserialization."""
    fory = pyfory.Fory(ref=True)
    serialized = fory.serialize(data)

    # Warmup
    for _ in range(3):
        fory.deserialize(serialized)

    # Benchmark
    start = time.perf_counter()
    for _ in range(iterations):
        fory.deserialize(serialized)
    end = time.perf_counter()

    return (end - start) / iterations * 1000, len(serialized)


def main():
    print("=" * 60)
    print("Int64 Batch Serialization Benchmark")
    print("=" * 60)
    print(f"Cython enabled: {pyfory.ENABLE_FORY_CYTHON_SERIALIZATION}")
    print()

    sizes = [10_000, 100_000, 1_000_000]

    print(f"{'Size':<12} {'Original':<15} {'Optimized':<15} {'Speedup':<10} {'Deser':<12}")
    print("-" * 60)

    for size in sizes:
        data = list(range(size))

        original_time = benchmark_original_write(data)
        optimized_time = benchmark_optimized_serialize(data)
        deser_time, serialized_size = benchmark_deserialize(data)

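        speedup = original_time / optimized_time
        # Build the label first so the "x" stays attached to the number when padded.
        speedup_label = f"{speedup:.2f}x"

        print(
            f"{size:<12,} {original_time:<15.2f} {optimized_time:<15.2f} "
            f"{speedup_label:<10} {deser_time:<12.2f}"
        )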

    print()
    print("Notes:")
    print("- Original: loop with individual write_varint64 calls (write only)")
    print("- Optimized: full serialize including header overhead")
    print("- Times are in milliseconds")


if __name__ == "__main__":
    main()

Test plan

All existing tests pass (356 passed)
New unit tests for edge cases (large integers, negative values, empty list)
Encoding format compatible with original varint64+zigzag
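For readers unfamiliar with the encoding named in the test plan, here is a minimal pure-Python sketch of ZigZag combined with a plain LEB128-style varint, round-tripping the kinds of edge cases listed above. It is an illustration only: `encode_int64_sequence` and `decode_int64_sequence` are hypothetical names, not pyfory APIs, and Fory's actual varint64 byte layout may differ (the benchmark reserves 9 bytes per value, whereas plain LEB128 can need 10), so these bytes should not be assumed wire-compatible.

```python
# Illustrative sketch; not pyfory's implementation or wire format.
INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1
MASK64 = (1 << 64) - 1
MAX_VARINT_BYTES = 10  # worst case for plain LEB128 (assumption of this sketch)


def zigzag_encode(value):
    """Map signed int64 to unsigned so small magnitudes get small codes."""
    return ((value << 1) ^ (value >> 63)) & MASK64


def zigzag_decode(encoded):
    """Inverse of zigzag_encode."""
    return (encoded >> 1) ^ -(encoded & 1)


def encode_int64_sequence(values):
    """Encode every value in one pass into a buffer reserved up front."""
    out = bytearray(len(values) * MAX_VARINT_BYTES)
    pos = 0
    for value in values:
        if not INT64_MIN <= value <= INT64_MAX:
            raise OverflowError(f"{value} does not fit in int64")
        encoded = zigzag_encode(value)
        while encoded >= 0x80:
            out[pos] = (encoded & 0x7F) | 0x80  # low 7 bits plus continuation bit
            encoded >>= 7
            pos += 1
        out[pos] = encoded  # final byte has the continuation bit clear
        pos += 1
    return bytes(out[:pos])


def decode_int64_sequence(data, count):
    """Decode `count` values produced by encode_int64_sequence."""
    values = []
    pos = 0
    for _ in range(count):
        encoded = 0
        shift = 0
        while True:
            byte = data[pos]
            pos += 1
            encoded |= (byte & 0x7F) << shift
            if byte < 0x80:
                break
            shift += 7
        values.append(zigzag_decode(encoded))
    return values


if __name__ == "__main__":
    edge_cases = [0, 1, -1, 64, -64, INT64_MAX, INT64_MIN]
    payload = encode_int64_sequence(edge_cases)
    assert decode_int64_sequence(payload, len(edge_cases)) == edge_cases
    assert encode_int64_sequence([]) == b""
    print(f"{len(edge_cases)} values -> {len(payload)} bytes")
```

Reserving the worst-case size once and writing through an index is the part that mirrors the batch idea: the per-element work shrinks to a few shifts and stores, with no per-call bounds check or buffer growth.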

Signed-off-by: lipan02 <pandali.kk@qq.com>

cpp/fory/python/pyfory.cc

chaokunyang · 2026-01-27T04:24:16Z

+
+// Write varint64 with ZigZag encoding inline
+// Returns number of bytes written
+static inline uint32_t WriteVarint64ZigZag(uint8_t *arr, int64_t value) {


I'm refactoring the Fory Cython Buffer to forward numeric reads/writes to the C++ Buffer. Once I've finished that refactor, you can use the C++ Buffer.write_varint64 directly.

WriteVarint64ZigZag should be removed now; Python uses the same C++ Buffer for reading and writing ints.

Let me take a look.

## Why?

- Python Cython buffer implementation duplicated C++ buffer logic and error handling.
- Centralizing buffer reads/writes in the C++ buffer reduces duplication and keeps behavior aligned.

## What does this PR do?

- Refactors `python/pyfory/buffer.pyx` to delegate reads/writes, varint/tagged encoding, and bounds checks to `fory::Buffer`.
- Adds C++ `Buffer` int24 helpers and exposes missing buffer APIs/errors to Cython via `libutil.pxd`.
- Introduces Python error mapping helpers and updates row/collection code to pass C++ buffer pointers correctly.

## Related issues

Closes #3218 #3216 #1017

## Does this PR introduce any user-facing change?

- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?

## Benchmark

python/pyfory/collection.pxi

pandalee99 added 3 commits January 26, 2026 20:14

perf(python): optimize int64 list/tuple serialization with C++ batch …

167c0b9

…processing Signed-off-by: lipan02 <pandali.kk@qq.com>

perf(python): optimize int64 list/tuple serialization with C++ batch …

c152666

…processing Signed-off-by: lipan02 <pandali.kk@qq.com>

clean codestyle

55c0062

Signed-off-by: lipan02 <pandali.kk@qq.com>

pandalee99 requested review from PragmaTwice and chaokunyang as code owners January 27, 2026 03:32

Merge branch 'main' into feat/opt_int64_list

2db282b