optimize time wasted in passing argument and fetching return

While discussing #96 I was able to identify a bottleneck
there was needlessly large time wasted in passing the arguments and retrieving the return

basically I made a simple 32-bit hash function [cdb_djp_hash.c](https://github.com/muayyad-alsadi/wasm-demos/blob/main/cdb_djp_hash/cdb_djp_hash.c)

in pure python the time is proportional to string length
in wasm it was almost constant time ~40ms regardless of string length 13 bytes to 1300 bytes (100x)
which means that most of those 40ms are in passing the arguments not in the actual loop (if it was the loop it would be proportional to string length or a large fraction of that).

This was confirmed using 

```python
pr=Profile()
pr.enable()
for i in range(10000):
        cdb_djp_hash_wasm(large_a)
pr.disable()
#pr.print_stats()
#pr.print_stats('cumulative')
pr.print_stats('calls')
```

the first note that the actual WASM call is fast (169ms out of 800ms)
which means that we might able to squeeze ~650ms of 800ms making it 4x faster

and I was able to identify bottleneck

- things that take too much cumulative time
- things that are called more than 10k (number of iterations)

examples of such things

- too many calls to `isinstance` which is 170k (in a 10k iteration benchmark), that is 17 times per call
- converting the argument `_value.py:129(_convert)` too ~200ms out of 800ms
- we got dynamically sized lists operations like `{method 'append' of 'list' objects}` 

Are we sure that passing arguments / converting can be made faster?
Yes

I can confirm `_value.py:129(_convert)`  that is too slow

I've moved the conversion outside the loop and confirmed it was the bottleneck

```python
    i32_start = Val.i32(start)
    i32_size = Val.i32(size)
    for i in range(10000):
        np_mem[start:stop]=input
        hash_wasm = ctypes.c_uint32(_cdb_djp_hash_wasm(i32_start, i32_size)).value
```

also I can confirm that this time is just waste because passing parameter should do the following

- convert little endian (which should not take time see below)
- set memory just after the stack pointer (which is now fast using #135 #81 #134 )
- move the stack pointer (which should not take time)

to proof that we can convert in almost no time

```
in_struct=Struct('<LL')
in_pack=in_struct.pack
for i in range(10000):
    b=pack(start, size)
```



Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

optimize time wasted in passing argument and fetching return #137

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

optimize time wasted in passing argument and fetching return #137

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions