Performance comparison with Python version - slower inference speed #536
Hi there,

First of all, thank you for this awesome project that allows me to run large language models in my favorite language! ❤️

I've noticed that the inference speed is slightly slower compared to the Python version when using the same model. I'm wondering what might be causing this difference. Here are the timings I observed:

Python version (llama-cpp-python): 789 ms
Node.js version (CPU mode): 1125 ms

I observed that the graph splits value is much smaller in the Node.js version. Is this the reason for the performance difference? If so, is there any way to configure or adjust it?

Configuration details:

Or is this an unavoidable performance gap when using JavaScript/Node.js? In addition, is there any optimization when topK = Infinity?

I'd appreciate any insights or suggestions on how to improve the performance in Node.js! 🙏
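For context, here is a minimal sketch (not taken from the original report, and with a placeholder model path) of how such a chat can be driven through the node-llama-cpp API, with sampling options such as topK passed per prompt:

```typescript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

// Placeholder path; point this at the GGUF file used in the comparison.
const modelPath = "./models/Qwen3-0.6B.IQ4_XS.gguf";

const llama = await getLlama();
const model = await llama.loadModel({modelPath});
const context = await model.createContext({contextSize: 4096});
const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});

// Sampling options are passed per prompt; topK is the knob the question
// asks about (i.e. whether an effectively unlimited top-k gets optimized away).
const answer = await session.prompt("Hi there!", {
    maxTokens: 256,
    temperature: 0.8,
    topK: 40
});
console.log(answer);
```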
Replies: 1 comment 2 replies
The version of llama.cpp in the python package is from 4 months ago, while the version in node-llama-cpp is from a few days ago.

I've seen a few fixes implemented in llama.cpp in the last few months that improve stability and correctness that might lead to the difference you're seeing here. It could help if you could compare the performance with older versions of node-llama-cpp, specifically 3.12.3, since it uses a version that came shortly after the one used in the python package.

How did you measure the speed in node-llama-cpp? Have you excluded the time it takes to load the model and the context? Also, the first token might take some time to generate since things are still loading during…

I've run the command below on my machine (M1 Max Mac) and compared it with the performance of llama.cpp's llama-cli:

# with GPU
npx --yes node-llama-cpp chat hf:unsloth/Qwen3-0.6B-GGUF:IQ4_XS --prompt 'Hi there!' --contextSize 4096 --flashAttention --printTimings --timing

# without GPU
npx --yes node-llama-cpp chat hf:unsloth/Qwen3-0.6B-GGUF:IQ4_XS --prompt 'Hi there!' --contextSize 4096 --flashAttention --printTimings --timing --gpu false
Compared with:

llama-cli -m ~/.node-llama-cpp/models/hf_unsloth_Qwen3-0.6B.IQ4_XS.gguf -c 4096 -fa on -b 512 -ub 512 -ngl 999
llama-cli -m ~/.node-llama-cpp/models/hf_unsloth_Qwen3-0.6B.IQ4_XS.gguf -c 4096 -fa on -b 512 -ub 512 -ngl 0
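As a reference point for isolating generation time from load time, here is a rough sketch using the node-llama-cpp API; the model path, prompts, and token limits are arbitrary placeholders:

```typescript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();

// Model and context loading is timed separately so it doesn't count
// towards the measured inference speed.
const loadStart = performance.now();
const model = await llama.loadModel({modelPath: "./models/Qwen3-0.6B.IQ4_XS.gguf"}); // placeholder
const context = await model.createContext({contextSize: 4096});
const session = new LlamaChatSession({contextSequence: context.getSequence()});
console.log(`load time: ${Math.round(performance.now() - loadStart)}ms`);

// Warm-up prompt: the first evaluation can be slower while things are still loading.
await session.prompt("Hi there!", {maxTokens: 16});

// Timed prompt: this is the number to compare against llama-cpp-python / llama-cli.
const genStart = performance.now();
await session.prompt("Write one short sentence about llamas.", {maxTokens: 128});
console.log(`generation time: ${Math.round(performance.now() - genStart)}ms`);
```

The --printTimings and --timing flags used in the commands above also surface timing information from the CLI itself, which may be easier to compare directly against llama-cli's output.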