# How fast can the RPython GC allocate?

While working on a paper about [allocation profiling in
VMProf](https://pypy.org/posts/2025/02/pypy-gc-sampling.html) I got curious
about how quickly the RPython GC can allocate an object. I wrote a small
RPython benchmark program to get an idea of the order of magnitude.

The basic idea is to just allocate an instance in a tight loop:

```python
class A(object):
    pass

def run(loops):
    # preliminary idea, see below
    for i in range(loops):
        a = A()
        a.i = i
```

The RPython type inference will find out that instances of `A` have a single
`i` field, which is an integer. In addition to that field, every RPython object
needs one word of GC meta-information. Therefore one instance of `A` needs 16
bytes on a 64-bit architecture.

However, measuring like this is not good enough, because the RPython static
optimizer would remove the allocation since the object isn't used. But we can
confuse the escape analysis sufficiently by always keeping two instances alive
at the same time:

```python
class A(object):
    pass

def run(loops):
    a = prev = None
    for i in range(loops):
        prev = a
        a = A()
        a.i = i
    print(prev, a) # print the instances at the end
```

(I confirmed that the allocation isn't being removed by looking at the C code
that the RPython compiler generates from this.)

This is doing a little bit more work than needed, because of the `a.i = i`
instance attribute write. We can also (optionally) leave the field
uninitialized:

```python
import time # needed for the timing code below

def run(initialize_field, loops):
    t1 = time.time()
    if initialize_field:
        a = prev = None
        for i in range(loops):
            prev = a
            a = A()
            a.i = i
        print(prev, a) # make sure two objects are always alive
    else:
        a = prev = None
        for i in range(loops):
            prev = a
            a = A()
        print(prev, a)
    t2 = time.time()
    print(t2 - t1, 's')
    object_size_in_words = 2 # GC header, one integer field
    mem = loops * 8 * object_size_in_words / 1024.0 / 1024.0 / 1024.0
    print(mem, 'GB')
    print(mem / (t2 - t1), 'GB/s')
```

Then we need to add some RPython scaffolding:

```python
def main(argv):
    loops = int(argv[1])
    with_init = bool(int(argv[2]))
    if with_init:
        print("with initialization")
    else:
        print("without initialization")
    run(with_init, loops)
    return 0

def target(*args):
    return main
```

To build a binary:

```console
pypy rpython/bin/rpython targetallocatealot.py
```

This turns the RPython code into C code and uses a C compiler to turn that
into a binary, which contains both our code above and the RPython garbage
collector.
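As a quick plausibility check of the `mem` computation above, here's the size
arithmetic spelled out in plain Python (just a back-of-the-envelope sketch,
not part of the benchmark itself):

```python
# Two 8-byte words per instance of A on a 64-bit machine:
# one word of GC header, one word for the integer field i.
loops = 1000000000
object_size_in_words = 2
total_bytes = loops * 8 * object_size_in_words
print(total_bytes / 1024.0 ** 3)  # ~14.9, matching the GB figure below
```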
Then we can run it (all results in this post are from my AMD Ryzen 7 PRO
7840U, running Ubuntu Linux 24.04.2):

```console
$ ./targetallocatealot-c 1000000000 0
without initialization

0.433825 s
14.901161 GB
34.348322 GB/s
$ ./targetallocatealot-c 1000000000 1
with initialization

0.501856 s
14.901161 GB
29.692100 GB/s
```

Let's compare it with the Boehm GC:

```console
$ pypy rpython/bin/rpython --gc=boehm --output=targetallocatealot-c-boehm targetallocatealot.py
...
$ ./targetallocatealot-c-boehm 1000000000 0
without initialization

9.722585 s
14.901161 GB
1.532634 GB/s
$ ./targetallocatealot-c-boehm 1000000000 1
with initialization

9.684149 s
14.901161 GB
1.538717 GB/s
```

This is not a fair comparison: the Boehm GC uses conservative stack scanning
and therefore cannot move objects, which forces it to use a much more
complicated allocation path.

## Let's look at `perf stat`

We can use `perf` to get some statistics about the executions:

```console
$ perf stat -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./targetallocatealot-c 10000000000 0
without initialization

4.301442 s
149.011612 GB
34.642245 GB/s

 Performance counter stats for './targetallocatealot-c 10000000000 0':

     7,244,117,828      cache-references
        23,446,661      cache-misses          #  0.32% of all cache refs
    21,074,240,395      cycles
   110,116,790,943      instructions          #  5.23  insn per cycle
    20,024,347,488      branches
             1,287      faults
                24      migrations

       4.303071693 seconds time elapsed

       4.297557000 seconds user
       0.003998000 seconds sys

$ perf stat -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./targetallocatealot-c 10000000000 1
with initialization

5.016772 s
149.011612 GB
29.702688 GB/s

 Performance counter stats for './targetallocatealot-c 10000000000 1':

     7,571,461,470      cache-references
       241,915,266      cache-misses          #  3.20% of all cache refs
    24,503,497,532      cycles
   130,126,387,460      instructions          #  5.31  insn per cycle
    20,026,280,693      branches
             1,285      faults
                21      migrations

       5.019444749 seconds time elapsed

       5.012924000 seconds user
       0.005999000 seconds sys
```

This is pretty cool: we can run this loop at more than 5 instructions per
cycle. Every allocation takes `110116790943 / 10000000000 ≈ 11` instructions
and `21074240395 / 10000000000 ≈ 2.1` cycles, including the loop around it.


## How often does the GC run?

The RPython GC queries the L2 cache size to determine the size of the nursery.
We can find out what it is by turning on `PYPYLOG`, selecting the proper
logging categories, and printing to `stdout` via `:-`:

```console
$ PYPYLOG=gc-set-nursery-size,gc-hardware:- ./targetallocatealot-c 1 1
[f3e6970465723] {gc-set-nursery-size
nursery size: 270336
[f3e69704758f3] gc-set-nursery-size}
[f3e697047b9a1] {gc-hardware
L2cache = 1048576
[f3e69705ced19] gc-hardware}
[f3e69705d11b5] {gc-hardware
memtotal = 32274210816.000000
[f3e69705f4948] gc-hardware}
[f3e6970615f78] {gc-set-nursery-size
nursery size: 4194304
[f3e697061ecc0] gc-set-nursery-size}
with initialization
NULL
0.000008 s
0.000000 GB
0.001894 GB/s
```

So the nursery is 4 MiB. This means that to allocate 149 GiB the GC needs to
perform `10000000000 * 16 / 4194304 ≈ 38146` minor collections.
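Spelled out as a tiny plain-Python sketch (assuming that every time the
nursery fills up there is exactly one minor collection):

```python
# How often does a 4 MiB nursery fill up while allocating
# 10**10 objects of 16 bytes each?
loops = 10 ** 10
object_size = 16         # bytes per instance of A
nursery_size = 4194304   # nursery size reported in the log above
print(loops * object_size // nursery_size)  # 38146
```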
Let's confirm that:

```console
$ PYPYLOG=gc-minor:out ./targetallocatealot-c 10000000000 1
with initialization
w
5.315511 s
149.011612 GB
28.033356 GB/s
$ head out
[f3ee482f4cd97] {gc-minor
[f3ee482f53874] {gc-minor-walkroots
[f3ee482f54117] gc-minor-walkroots}
minor collect, total memory used: 0
number of pinned objects: 0
total size of surviving objects: 0
time taken: 0.000029
[f3ee482f67b7e] gc-minor}
[f3ee4838097c5] {gc-minor
[f3ee48380c945] {gc-minor-walkroots
$ grep "{gc-minor-walkroots" out | wc -l
38147
```

Each minor collection is very quick, because a minor collection is
O(surviving objects), and in this program only one object survives each time
(the other instance is in the process of being allocated).
Also, the shadow stack of GC roots contains only one entry, so walking it is
super quick as well. The time the minor collections take is logged to the
`out` file:

```console
$ grep "time taken" out | tail
time taken: 0.000002
time taken: 0.000002
time taken: 0.000002
time taken: 0.000002
time taken: 0.000002
time taken: 0.000002
time taken: 0.000002
time taken: 0.000003
time taken: 0.000002
time taken: 0.000002
$ grep "time taken" out | grep -o "0.*" | numsum
0.0988160000000011
```

(This number is super approximate due to float formatting rounding.)
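For reference, here's the same summation without `numsum`, as a small
plain-Python sketch (it assumes the `PYPYLOG` output file is called `out`,
as above):

```python
# Sum the "time taken" entries from the gc-minor log.
total = 0.0
with open("out") as f:
    for line in f:
        if line.startswith("time taken:"):
            total += float(line.split(":")[1])
print(total)  # ~0.0988 seconds
```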
That means that `0.0988160000000011 / 5.315511 ≈ 2%` of the time is spent in
the GC.


## What does the generated machine code look like?

The allocation fast path of the RPython GC is a simple bump pointer. In Python
pseudo-code it would look roughly like this:

```python
result = gc.nursery_free
# move the nursery_free pointer forward by totalsize
gc.nursery_free = result + totalsize
# check whether this allocation would exceed the nursery
if gc.nursery_free > gc.nursery_top:
    # if it does => collect the nursery and allocate afterwards
    result = collect_and_reserve(totalsize)
result.hdr = ... # initialize the GC header (flags and the type id of A)
```

So we can disassemble the compiled binary `targetallocatealot-c` and try to
find the equivalent logic in machine code. I'm super bad at reading machine
code, but I tried to annotate what I think is the core loop (the version
without initializing the `i` field) below:

```c-objdump
    ...
    # shuffle the two live object pointers
    cb68:   mov    %rbx,%rdi
    cb6b:   mov    %rdx,%rbx

    # initialize object header of object allocated in previous iteration
    cb6e:   movq   $0x4c8,(%rbx)

    # loop termination check
    cb75:   cmp    %rbp,%r12
    cb78:   je     ccb8

    # load nursery_free
    cb7e:   mov    0x33c13(%rip),%rdx

    # increment loop counter
    cb85:   add    $0x1,%rbp

    # add 16 (size of object) to nursery_free
    cb89:   lea    0x10(%rdx),%rax

    # compare nursery_top with new nursery_free
    cb8d:   cmp    %rax,0x33c24(%rip)

    # store new nursery_free
    cb94:   mov    %rax,0x33bfd(%rip)

    # if the new nursery_free does not exceed nursery_top, the allocation
    # fits: jump back to the top of the loop; otherwise fall through to the
    # slow path
    cb9b:   jae    cb68

    # slow path from here on:
    # save live object from last iteration to GC shadow stack
    cb9d:   mov    %rbx,-0x8(%rcx)
    cba1:   mov    %r13,%rdi
    cba4:   mov    $0x10,%esi
    # do minor collection
    cba9:   call   20800
    ...
```

## Running the benchmark as regular Python code

So far we ran this code as *RPython*, i.e. type inference is performed and the
program is translated to a C binary. We can also run it on top of PyPy, as a
regular Python 3 program. However, an instance of a user-defined class in
regular Python when run on PyPy is actually a much larger object, due to
[dynamic
typing](https://pypy.org/posts/2010/11/efficiently-implementing-python-objects-3838329944323946932.html).
It's at least 7 words, which is 56 bytes.

However, we can simply use `int` objects instead. Integers are allocated on
the heap and consist of two words, one for the GC header and one for the
machine-word-sized integer value, if the integer fits into a signed 64-bit
representation (otherwise a different, less compact representation is used,
which can represent arbitrarily large integers).

Therefore we can use this kind of code:

```python
import sys, time


def run(loops):
    t1 = time.time()
    a = prev = None
    for i in range(loops):
        prev = a
        a = i
    print(prev, a) # make sure two objects are always alive
    t2 = time.time()
    object_size_in_words = 2 # GC header, one integer field
    mem = loops * 8 * object_size_in_words / 1024.0 / 1024.0 / 1024.0
    print(mem, 'GB')
    print(mem / (t2 - t1), 'GB/s')

def main(argv):
    loops = int(argv[1])
    run(loops)
    return 0

if __name__ == '__main__':
    sys.exit(main(sys.argv))
```

In this case we can't really leave the value uninitialized though.

We can run this both with and without the JIT:

```console
$ pypy3 allocatealot.py 1000000000
999999998 999999999
14.901161193847656 GB
17.857494904899553 GB/s
$ pypy3 --jit off allocatealot.py 1000000000
999999998 999999999
14.901161193847656 GB
0.8275382375297171 GB/s
```

This is obviously much less efficient than the C code; the PyPy JIT generates
much less efficient machine code than GCC does. Still, "only" twice as slow is
kind of cool anyway.

(Running it with CPython doesn't really make sense for these measurements,
since CPython ints are bigger – `sys.getsizeof(5)` reports 28 bytes.)
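To see that concretely, here's a quick interactive check (plain CPython; the
exact numbers are for a 64-bit build):

```python
import sys

# on a 64-bit CPython build even a small int takes 28 bytes
print(sys.getsizeof(5))       # 28
# larger ints take more, since CPython ints are variable-sized
print(sys.getsizeof(2**100))  # more than 28
```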
## The machine code that the JIT generates

Unfortunately it's a bit of a journey to show the machine code that PyPy's JIT
generates for this. First we need to run with all the JIT logging categories
enabled:

```console
$ PYPYLOG=jit:out pypy3 allocatealot.py 1000000000
```

Then we can read the log file to find the trace IR for the loop under the
logging category `jit-log-opt`:

```
+532: label(p0, p1, p6, p9, p11, i34, p13, p19, p21, p23, p25, p29, p31, i44, i35, descr=TargetToken(137358545605472))
debug_merge_point(0, 0, 'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-9~#24 FOR_ITER')

# are we at the end of the loop?
+552: i45 = int_lt(i44, i35)
+555: guard_true(i45, descr=) [p0, p6, p9, p11, p13, p19, p21, p23, p25, p29, p31, p1, i44, i35, i34]
+561: i47 = int_add(i44, 1)
debug_merge_point(0, 0, 'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-9~#26 STORE_FAST')
debug_merge_point(0, 0, 'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-10~#28 LOAD_FAST')
debug_merge_point(0, 0, 'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-10~#30 STORE_FAST')
debug_merge_point(0, 0, 'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-11~#32 LOAD_FAST')
debug_merge_point(0, 0, 'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-11~#34 STORE_FAST')
debug_merge_point(0, 0, 'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-11~#36 JUMP_ABSOLUTE')

# update the iterator object
+565: setfield_gc(p25, i47, descr=)
+569: guard_not_invalidated(descr=) [p0, p6, p9, p11, p19, p21, p23, p25, p29, p31, p1, i44, i34]

# check for signals
+569: i49 = getfield_raw_i(137358624889824, descr=)
+582: i51 = int_lt(i49, 0)
+586: guard_false(i51, descr=) [p0, p6, p9, p11, p19, p21, p23, p25, p29, p31, p1, i44, i34]
debug_merge_point(0, 0, 'run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-9~#24 FOR_ITER')

# allocate the integer (allocation sunk to the end of the trace)
+592: p52 = new_with_vtable(descr=)
+630: setfield_gc(p52, i34, descr=)
+634: jump(p0, p1, p6, p9, p11, i44, p52, p19, p21, p23, p25, p29, p31, i47, i35, descr=TargetToken(137358545605472))
```

To find the machine code address of the trace, we need to search for this line:

```
Loop 1 (run;/home/cfbolz/projects/gitpypy/allocatealot.py:6-9~#24 FOR_ITER) \
    has address 0x7ced473ffa0b to 0x7ced473ffbb0 (bootstrap 0x7ced473ff980)
```

Then we can use a script in the PyPy repo to disassemble the generated machine
code:

```console
$ pypy rpython/jit/backend/tool/viewcode.py out
```

This will dump all the machine code to stdout, and open a [pygame-based
graphviz cfg](https://pypy.org/posts/2021/04/ways-pypy-graphviz.html). In there
we can search for the address and see this:

![Graphviz-based visualization of the machine code the JIT generates](/images/2025-allocatealot-machine-code.png)

Here's an annotated version with what I think this code does:

```objdump
# increment the profile counter
7ced473ffb40:  48 ff 04 25 20 9e 33   incq   0x38339e20
7ced473ffb47:  38

# check whether the loop is done
7ced473ffb48:  4c 39 fe               cmp    %r15,%rsi
7ced473ffb4b:  0f 8d 76 01 00 00      jge    0x7ced473ffcc7

# increment the iteration variable
7ced473ffb51:  4c 8d 66 01            lea    0x1(%rsi),%r12

# update the iterator object
7ced473ffb55:  4d 89 61 08            mov    %r12,0x8(%r9)

# check for ctrl-c/thread switch
7ced473ffb59:  49 bb e0 1b 0b 4c ed   movabs $0x7ced4c0b1be0,%r11
7ced473ffb60:  7c 00 00
7ced473ffb63:  49 8b 0b               mov    (%r11),%rcx
7ced473ffb66:  48 83 f9 00            cmp    $0x0,%rcx
7ced473ffb6a:  0f 8c 8f 01 00 00      jl     0x7ced473ffcff

# load the nursery_free pointer
7ced473ffb70:  49 8b 8b d8 30 f6 fe   mov    -0x109cf28(%r11),%rcx

# add the size (16) to it
7ced473ffb77:  48 8d 51 10            lea    0x10(%rcx),%rdx

# compare against nursery_top
7ced473ffb7b:  49 3b 93 f8 30 f6 fe   cmp    -0x109cf08(%r11),%rdx

# jump to the slow path if the nursery is full
7ced473ffb82:  0f 87 41 00 00 00      ja     0x7ced473ffbc9

# store the new value of nursery_free
7ced473ffb88:  49 89 93 d8 30 f6 fe   mov    %rdx,-0x109cf28(%r11)

# initialize the GC header
7ced473ffb8f:  48 c7 01 30 11 00 00   movq   $0x1130,(%rcx)

# initialize the integer field
7ced473ffb96:  48 89 41 08            mov    %rax,0x8(%rcx)

# shuffle registers for the next iteration and jump back to the top
7ced473ffb9a:  48 89 f0               mov    %rsi,%rax
7ced473ffb9d:  48 89 8d 60 01 00 00   mov    %rcx,0x160(%rbp)
7ced473ffba4:  4c 89 e6               mov    %r12,%rsi
7ced473ffba7:  e9 94 ff ff ff         jmp    0x7ced473ffb40
7ced473ffbac:  0f 1f 40 00            nopl   0x0(%rax)
```

## Conclusion

The careful design of the RPython GC's allocation fast path gives pretty good
allocation rates. This technique isn't really new, it's a pretty typical way
to design a GC. Apart from that, my main conclusion would be that computers
are fast or something? Indeed, when we ran the same code on my colleague's
two-year-old AMD, we got noticeably worse results, so a lot of the speed seems
to be due to the hard work of CPU architects.