-
Notifications
You must be signed in to change notification settings - Fork 61
Description
While discussing #96 I was able to identify a bottleneck
there was needlessly large time wasted in passing the arguments and retrieving the return
basically I made a simple 32-bit hash function cdb_djp_hash.c
in pure python the time is proportional to string length
in wasm it was almost constant time ~40ms regardless of string length 13 bytes to 1300 bytes (100x)
which means that most of those 40ms are in passing the arguments not in the actual loop (if it was the loop it would be proportional to string length or a large fraction of that).
This was confirmed using
pr=Profile()
pr.enable()
for i in range(10000):
cdb_djp_hash_wasm(large_a)
pr.disable()
#pr.print_stats()
#pr.print_stats('cumulative')
pr.print_stats('calls')the first note that the actual WASM call is fast (169ms out of 800ms)
which means that we might able to squeeze ~650ms of 800ms making it 4x faster
and I was able to identify bottleneck
- things that take too much cumulative time
- things that are called more than 10k (number of iterations)
examples of such things
- too many calls to
isinstancewhich is 170k (in a 10k iteration benchmark), that is 17 times per call - converting the argument
_value.py:129(_convert)too ~200ms out of 800ms - we got dynamically sized lists operations like
{method 'append' of 'list' objects}
Are we sure that passing arguments / converting can be made faster?
Yes
I can confirm _value.py:129(_convert) that is too slow
I've moved the conversion outside the loop and confirmed it was the bottleneck
i32_start = Val.i32(start)
i32_size = Val.i32(size)
for i in range(10000):
np_mem[start:stop]=input
hash_wasm = ctypes.c_uint32(_cdb_djp_hash_wasm(i32_start, i32_size)).valuealso I can confirm that this time is just waste because passing parameter should do the following
- convert little endian (which should not take time see below)
- set memory just after the stack pointer (which is now fast using Investigate adding the Python Buffer Protocol on the
memory#135 Setting a slice of memory without having to enumerate #81 FIXES #81: 400x faster slice assignment #134 ) - move the stack pointer (which should not take time)
to proof that we can convert in almost no time
in_struct=Struct('<LL')
in_pack=in_struct.pack
for i in range(10000):
b=pack(start, size)