Commit 5f00234

evaluation

1 parent aa0f5f7 commit 5f00234

File tree

4 files changed: +19 −9 lines changed

Binary files not shown (99.2 KB and 59.5 KB).

posts/2025/02/allocation_sampling.md

Lines changed: 19 additions & 9 deletions
@@ -92,7 +92,7 @@ As soon as an object doesn't fit into the nursery anymore, it will be collected.
 
 The last section described how the nursery allocation works normally. Now we'll talk about how we integrate the new allocation sampling approach into it.
 
-To decide whether the GC should trigger a sample, the sampling logic is integrated into the bump-pointer allocation logic. Usually, when there is not enough space left in the nursery to fulfill an allocation request, the nursery will be collected and the allocation will be done afterwards. We re-use that mechanism for sampling by introducing a new pointer called `sample_point`, calculated as `sample_point = nursery_free + sample_n_bytes`, where `sample_n_bytes` is the number of bytes allocated before a sample is made (i.e. our sampling rate).
+To decide whether the GC should trigger a sample, the sampling logic is integrated into the bump-pointer allocation logic. Usually, when there is not enough space left in the nursery to fulfill an allocation request, the nursery will be collected and the allocation will be done afterwards. We reuse that mechanism for sampling by introducing a new pointer called `sample_point`, calculated as `sample_point = nursery_free + sample_n_bytes`, where `sample_n_bytes` is the number of bytes allocated before a sample is made (i.e. our sampling rate).
 
 Imagine we had a nursery of 2MB and wanted to sample every 512KB allocated; then our nursery would look like this:
 
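
To make the sample point mechanism in this hunk concrete, here is a minimal Python sketch of bump-pointer allocation with a sample point. This is not PyPy's actual RPython implementation: the nursery is modeled as plain byte offsets, and `Nursery`, `malloc`, and `collect` are illustrative stand-ins; only the `sample_point = nursery_free + sample_n_bytes` calculation is taken from the post.

```python
# Minimal sketch of nursery bump-pointer allocation with a sample point.
# Not PyPy's RPython code; "pointers" here are plain byte offsets.

class Nursery:
    def __init__(self, size, sample_n_bytes):
        self.nursery_top = size                # end of the nursery
        self.nursery_free = 0                  # bump pointer: next free byte
        self.sample_n_bytes = sample_n_bytes   # sampling rate in bytes
        # sample point, computed as in the post
        self.sample_point = self.nursery_free + sample_n_bytes
        self.samples = 0

    def malloc(self, n_bytes):
        # simplified: assumes n_bytes is much smaller than the nursery
        new_free = self.nursery_free + n_bytes
        if new_free > self.sample_point:
            if new_free > self.nursery_top:
                # nursery exhausted: collect, then allocate afterwards
                self.collect()
                return self.malloc(n_bytes)
            # we crossed the sample point: trigger a sample and push
            # the sample point another sample_n_bytes ahead
            self.samples += 1
            self.sample_point += self.sample_n_bytes
        result = self.nursery_free
        self.nursery_free = new_free
        return result

    def collect(self):
        # stand-in for a minor collection: empty the nursery and
        # recompute the sample point from the fresh nursery_free
        self.nursery_free = 0
        self.sample_point = self.nursery_free + self.sample_n_bytes


# The 2MB nursery / 512KB sampling rate example from the context line above:
nursery = Nursery(2 * 1024 * 1024, 512 * 1024)
for _ in range(100_000):
    nursery.malloc(64)
print(nursery.samples)  # roughly one sample per 512KB allocated
```

The appeal of this scheme, as the post describes it, is that sampling piggybacks on the existing "nursery full" check, so allocations that don't cross the sample point stay on the unchanged fast path.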

@@ -181,39 +181,49 @@ For testing and benchmarking, we usually started with a sampling rate of 128Kb a
 
 Now let us take a look at the allocation sampling overhead, by profiling some benchmarks.
 
-The y-axis shows the profiling overhead, while the x-axis tells the sampling rate. The overhead is computed as `runtime_with_sampling / runtime_without_sampling`.
+The x-axis shows the sampling rate, while the y-axis shows the overhead, which is computed as `runtime_with_sampling / runtime_without_sampling`.
 
-All benchmarks were executed five times on a PyPy with JIT and native profiling enabled.
+All benchmarks were executed five times on a PyPy with JIT and native profiling enabled, so that every dot in the plot is one run of a benchmark.
 
-<img src="/images/2025_02_allocation_sampling_images/images/2025_02_allocation_sampling_images/allocation_sampling_overhead.png">
+<img src="/images/2025_02_allocation_sampling_images/as_overhead.png">
+
+As you probably expected, the overhead drops with higher allocation sampling rates,
+ranging from as high as ~390% for 32kb allocation sampling to as low as less than 10% for 32mb.
+
+Let me give one concrete example: one run of the microbenchmark at 32kb sampling took 15.596 seconds and triggered 822050 samples.
+That is a ridiculous `822050 / 15.596 ≈ 52709` samples per second.
+
+There is probably no need for that many samples per second, so for 'real' application profiling a much higher sampling rate would be sufficient.
 
-...
 
 Let us compare that to time sampling.
 
-Again we ran those benchmarks with a series of time sampling rates. That is 1000, 5000, 10000, 15000, 20000, 25000 and 30000 samples per second.
+This time we ran those benchmarks with 100, 1000 and 2000 samples per second.
 
-[IMG time based sampling]
+<img src="/images/2025_02_allocation_sampling_images/ts_overhead.png">
 
-The overhead varies with the sampling rate. Both with allocation and time sampling you can reach any overhead you want and any level of profiling precision you want. The best approach probably is to just try out a sampling rate and choose what gives you the right tradeoff between precision and overhead (and disk usage).
+The overhead varies with the sampling rate. Both with allocation and time sampling, you can reach any amount of overhead and any level of profiling precision you want. The best approach probably is to just try out a sampling rate and choose what gives you the right tradeoff between precision and overhead (and disk usage).
 
 The benchmarks used are:
 
 microbenchmark
+
 - https://github.com/Cskorpion/microbenchmark
 - `pypy microbench.py 65536`
 
 gcbench
+
 - https://github.com/pypy/pypy/blob/main/rpython/translator/goal/gcbench.py
 - print statements removed
 - `pypy gcbench.py 1`
 
 pypy translate step
+
 - first step of the pypy translation (annotation step)
 - `pypy path/to/rpython --opt=0 --cc=gcc --dont-write-c-files --gc=incminimark --annotate path/to/pypy/goal/targetpypystandalone.py`
 
 interpreter pystone
+
 - pystone benchmark on top of an interpreted pypy on top of a translated pypy
 - `pypy path/to/pypy/bin/pyinteractive.py -c "import test.pystone; test.pystone.main(1)"`
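
As a quick sanity check on the figures added in this hunk, the samples-per-second number and the overhead metric can be reproduced in a few lines of Python. The 15.596 seconds and 822050 samples are the values from the diff above; the baseline runtime is not given here, so the overhead is left as a formula rather than filled in with an invented number.

```python
# Samples per second for the 32kb microbenchmark run quoted above.
runtime_with_sampling = 15.596   # seconds, from the post
samples = 822050                 # samples triggered in that run
print(round(samples / runtime_with_sampling))  # -> 52709 samples per second

# The overhead metric used in the plots is simply the runtime ratio.
def overhead(runtime_with_sampling, runtime_without_sampling):
    return runtime_with_sampling / runtime_without_sampling
```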
