11<!--
2- .. title: Low Overhead Allocation Sampling in a Garbage Collected Virtual Machine
2+ .. title: Low Overhead Allocation Sampling with VMProf in PyPy's GC
33.. slug: pypy-gc-sampling
4- .. date: 2025-01-26 14:38:00 UTC
4+ .. date: 2025-02-28 14:38:00 UTC
55.. tags: gc, profiling, vmprof
66.. category:
77.. link:
1414
1515There are many time-based statistical profilers around (like VMProf or py-spy
1616just to name a few). They allow the user to pick a trade-off between profiling
17- precision and runtime overhead. On the other hand there are memory profilers,
18- which profile every allocation, resulting in precise profiling, but larger
19- overhead (e.g. memray). In this post we describe our experimental approach to
20- low overhead statistical memory profiling tightly integrated into VMProf and
21- the PyPy Garbage Collector. The main technical insight is that the check,
22- whether an allocation should be sampled, can be folded into the bump pointer
23- allocator checks, PyPy’s GC uses to find out if it should start a minor
24- collection. This makes the fast path with and without memory sampling the same.
25- A tool like this can be handy for discovering functions, allocating lots of
26- memory, leading to longer GC pauses caused by collections.
17+ precision and runtime overhead.
18+
19+ On the other hand there are memory profilers
20+ such as [memray](https://github.com/bloomberg/memray). They can be handy for
21+ finding leaks or for discovering functions that allocate a lot of memory.
22+ Memory profilers typically save every single allocation a program does. This
23+ results in precise profiling, but larger overhead.
24+
25+ In this post we describe our experimental approach to low overhead statistical
26+ memory profiling. Instead of saving every single allocation a program does, it
27+ only saves every nth allocated byte. We have tightly integrated VMProf and the
28+ PyPy Garbage Collector to achieve this. The main technical insight is that the
29+ check whether an allocation should be sampled can be made free. This is done by
30+ folding it into the bump pointer allocator check that PyPy’s GC uses to
31+ find out if it should start a minor collection. In this way the fast path with
32+ and without memory sampling is exactly the same.
2733
2834## Background
2935
@@ -33,12 +39,11 @@ both of them first.
3339### VMProf
3440
3541[VMProf](https://github.com/vmprof/vmprof-python) is a statistical time-based
36- profiler for PyPy. This means, VMProf samples the stack a certain
37- user-configured number of times per second. By adjusting this number, the
42+ profiler for PyPy. VMProf samples the stack of currently running Python
43+ functions a certain user-configured number of times per second. By adjusting
44+ this number, the
3845overhead of profiling can be modified to pick the correct trade-off between
39- overhead and precision of the profile. Additionally, VMProf has the ability to
40- record JIT traces, giving an insight what the JIT does to the executed
41- functions of the program. In the resulting profile, functions with huge runtime
46+ overhead and precision of the profile. In the resulting profile, functions with huge runtime
4247stand out the most, functions with shorter runtime less so. If you want to get
4348a little more introduction to VMProf and how to use it with PyPy, you may look
4449at [ this blog
@@ -51,24 +56,23 @@ two spaces for allocated objects, the nursery and the old-space. Freshly
5156allocated objects will be allocated into the nursery. When the nursery is full
5257at some point, it will be collected and all objects that survive will be
5358tenured i.e. moved into the old-space. The old-space is much larger than the
54- nursery and is collected less frequent and incrementally (not completely
55- collected in one go, but step-by-step).
56-
57- This is still a quite simplified explanation of PyPy’s GC, but we don't want to
58- go over every detail right now. Instead, we will take a look at the fast path
59- (allocations in the nursery) and how the nursery is collected.
59+ nursery and is collected less frequently and
60+ [incrementally](/posts/2024/03/fixing-bug-incremental-gc.html) (not completely
61+ collected in one go, but step-by-step). The old space collection is not
62+ relevant for the rest of the post though. We will now take a look at nursery
63+ allocations and how the nursery is collected.
6064
6165### Bump Pointer Allocation in the Nursery
6266
63- The nursery (a small continuous memory area) utilizes pointers, to keep track
64- from where on the nursery is free and where it ends, called ` nursery_free ` and
67+ The nursery (a small contiguous memory area) uses two pointers to keep track
68+ of where its free space starts and where it ends. They are called `nursery_free` and
6569`nursery_top`. When memory is allocated, the GC checks if there is enough space
6670in the nursery left. If there is enough space, the ` nursery_free ` pointer will
67- be returned as the start address for the new allocated memory, and
71+ be returned as the start address for the newly allocated memory, and
6872` nursery_free ` will be moved forward by the amount of allocated memory.
6973
7074
71- <img src =" ../../.. /images/2025_02_allocation_sampling_images/nursery_allocation.svg" >
75+ <img src="/images/2025_02_allocation_sampling_images/nursery_allocation.svg">
7276
7377
7478``` Python
@@ -83,6 +87,11 @@ def allocate(totalsize):
8387 result = collect_and_reserve(totalsize)
8488 # result is a pointer into the nursery, obj will be allocated there
8589 return result
90+
91+ def collect_and_reserve(size_of_allocation):
92+     # do a minor collection and return the start of the nursery afterwards
93+     minor_collection()
94+     return gc.nursery_free
8695```
8796
8897Understanding this is crucial for our allocation sampling approach, so let us
@@ -91,7 +100,7 @@ go through this step-by-step.
91100We already saw an example of what an allocation into a non-full nursery
92101looks like. But what happens if the nursery is (too) full?
93102
94- <img src =" ../../.. /images/2025_02_allocation_sampling_images/nursery_full.svg" >
103+ <img src="/images/2025_02_allocation_sampling_images/nursery_full.svg">
95104
96105
97106As soon as an object doesn't fit into the nursery anymore, it will be
@@ -100,13 +109,15 @@ old-space, so that the nursery is free afterwards, and the requested allocation
100109can be made.
101110
102111
103- <img src =" ../../../images/2025_02_allocation_sampling_images/nursery_collected.svg " >
104-
112+ <img src="/images/2025_02_allocation_sampling_images/nursery_collected.svg">
105113
106- Note that this is still a bit of a simplification.
114+ (Note that this is still a bit of a simplification.)
107115
108116## Sampling Approach
109117
118+ The last section described how the nursery allocation works normally. Now we'll
119+ talk about how we integrate the new allocation sampling approach into it.
120+
110121To decide whether the GC should trigger a sample, the sampling logic is
111122integrated into the bump pointer allocation logic. Usually, when there is not
112123enough space in the nursery left to fulfill an allocation request, the nursery
@@ -116,20 +127,21 @@ is calculated by `sample_point = nursery_free + sample_n_bytes` where
116127` sample_n_bytes ` is the number of bytes allocated before a sample is made (i.e.
117128our sampling rate).
118129
119- Image we'd have a nursery of 2MB and want to sample every 512KB allocated, then
130+ Imagine we had a nursery of 2MB and wanted to sample every 512KB allocated. Then
120131you could imagine our nursery looking like this:
121132
122- <img src =" ../../.. /images/2025_02_allocation_sampling_images/nursery_sampling.svg" >
133+ <img src="/images/2025_02_allocation_sampling_images/nursery_sampling.svg">
123134
124- And now here comes our secret trick. We use the sample point as ` nursery_top ` ,
135+ We use the sample point as ` nursery_top ` ,
125136so that allocating a chunk of 512KB would exceed the nursery top and start a
126137nursery collection. But of course we don't want to do a minor collection just
127138then, so before starting a collection, we need to check if the nursery is
128139actually full or if that is just an exceeded sample point. The latter will then
129- trigger a sample via VMProf's C-interface.
140+ trigger a VMProf stack sample. Afterwards we don't actually do a minor
141+ collection, but change ` nursery_top ` and immediately return to the caller.
130142
131- Now I got to tell you, that the last picture is only a half-truth. There do not
132- exist more than one sample point at any time. After we created the sampling
143+ The last picture is a conceptual simplification. Only one sampling point exists
144+ at any given time. After we create the sampling
133145point, it is used as the nursery top. If it is exceeded at some point, we just
134146add `sample_n_bytes` to that sampling point, i.e. move it forward.
135147
@@ -141,58 +153,47 @@ def collect_and_reserve(size_of_allocation):
141153 if gc.nursery_top == gc.sample_point:
142154 # One allocation could exceed multiple sample points
143155 # Sample, move sample_point forward
144- while gc.sample_point < gc.nursery_free + size_of_allocation:
145- vmprof.sample_now()
146- gc.sample_point += sample_n_bytes
147-
156+ vmprof.sample_now()
157+ gc.sample_point += sample_n_bytes
158+
148159 # Set sample point as new nursery_top if it fits into the nursery
149160 if gc.sample_point <= gc.real_nursery_top:
150- gc.nursery_top = sample_point
161+ gc.nursery_top = gc.sample_point
151162 # Or use the real nursery top if it does not fit
152163 else:
153- gc.nursery_top = gc.real_nursery_top
164+ gc.nursery_top = gc.real_nursery_top
154165
155166 # Is there enough memory left inside the nursery
156167 if gc.nursery_free + size_of_allocation <= gc.nursery_top:
157- # Yes => move nursery_free forward
158- gc.nursery_free += size_of_allocation
159- return gc.nursery_free
160-
168+ # Yes => move nursery_free forward
169+ gc.nursery_free += size_of_allocation
170+ return gc.nursery_free
171+
161172 # We did not exceed a sampling point and must do a minor collection, or
162- # we exceeded a sample point but we needed to do a minor collection anways
173+ # we exceeded a sample point but we needed to do a minor collection anyway
163174 minor_collection()
164- return gc.nursery_free
175+ return gc.nursery_free
165176```
166177
167178## Why is the Overhead ‘low’
168179
169- The tight integration of sampling into the nursery's bump-pointer logic does
170- add only slight overhead.
171-
172- Every time an allocation exceeds the ` sample_point ` , ` collect_and_reserve ` is
173- called to sample over the size of the allocation (as described in previous
174- section). The resulting overhead is directly controlled by the sampling rate,
175- as the amount of samples is ` size_of_allocation / sample_n_bytes ` .
180+ The most important property of our approach is that the bump-pointer fast path
181+ is not changed at all. If sampling is turned off, the slow path in
182+ `collect_and_reserve` has three extra instructions for the `if` at the beginning,
183+ but these add only a very small amount of overhead compared to doing a minor
184+ collection.
176185
186+ When sampling is on, the extra logic in ` collect_and_reserve ` gets executed.
187+ Every time an allocation exceeds the ` sample_point ` , ` collect_and_reserve ` will
188+ sample the Python functions currently executing. The resulting overhead is
189+ directly controlled by ` sample_n_bytes ` .
177190After sampling, the ` sample_point ` and ` nursery_top ` must be set accordingly.
178191This will be done once after sampling in ` collect_and_reserve ` .
179-
180- Setting a pointer like ` sampling_point ` , ` nursery_top ` , etc. due to sampling,
181- are additions, subtractions, loads and stores on the lowest level. Those are
182- not expensive on their own, but rather when excecuted very often.
183-
184192At some point a nursery collection will free the nursery and set the new
185- ` sample_point ` afterwards.
186-
187- The total number of times sampling overhead is introduced is:
188- - Some pointer arithmetic and a call to VMProf's c-interface every time the
189- ` sample_point ` is exceeded
190- - Some pointer arithmetic after a nursery collection
191- - Some pointer arithmetic when enabling or disabling sampling (but that is of
192- course constant overhead)
193+ ` sample_point ` afterwards.
193194
194- That means that the overhead mostly depends on the sampling rate and the amount
195- of memory allocated by the user program, as the combination of those two
195+ That means that the overhead mostly depends on the sampling rate and the rate
196+ at which the user program allocates memory, as the combination of those two
196197factors determines the number of samples.
197198
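As a rough back-of-the-envelope illustration of that relationship, the sample rate is approximately the allocation rate divided by `sample_n_bytes`. The helper name `samples_per_second` and the workload of 256 MiB allocated per second are hypothetical values chosen for the example, not measurements.

```python
KiB = 1024
MiB = 1024 * KiB


def samples_per_second(allocation_rate_bytes_per_s, sample_n_bytes):
    """Approximate number of allocation samples triggered per second."""
    return allocation_rate_bytes_per_s / sample_n_bytes


# Hypothetical workload that allocates 256 MiB per second.
allocation_rate = 256 * MiB
for sample_n_bytes in (64, 512 * KiB, 4 * MiB):
    rate = samples_per_second(allocation_rate, sample_n_bytes)
    print(f"sample_n_bytes={sample_n_bytes}: ~{rate:.0f} samples/s")
```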
198199Since the sampling rate can be adjusted from as low as 64 Byte to a theoretical
@@ -324,7 +325,8 @@ Modified VMProf with allocation sampling support
324325
325326## Future Work
326327
327- There are multiple points we’d like to address in the future.
328+ We have a bunch of ideas for features that could be added to VMProf in the
329+ future.
328330
329331One very important thing when it comes to profiling is the overhead. One idea
330332on how to decrease the overhead per sample is to not walk the entire stack
@@ -344,7 +346,7 @@ pypylog as we do for vmprof, but this could introduce more overhead. A better
344346way of associating them could be to record the `TSC` with each sample so we’d
345347get an approximate alignment of logged events and samples.
346348
347- Another Idea would be extracting information about allocations from the GC,
349+ Another idea would be extracting information about allocations from the GC,
348350e.g. type of object to be allocated, object size or even some statistics about
349351lifetime (if possible). For example PyPy could also log jitted functions, so we
350352can see what function got jitted at what time. Together with the sampled