-
@aadrians1 Are you sure that the GPU isn't used in LM Studio? I've run the commands below on my machine (M1 Max Mac) and compared them with the performance of `llama-cli`:

```bash
# with GPU
npx --yes node-llama-cpp chat hf:LiquidAI/LFM2-1.2B-GGUF:Q8_0 --prompt 'Hi there!' --contextSize 4096 --flashAttention --printTimings --timing

# without GPU
npx --yes node-llama-cpp chat hf:LiquidAI/LFM2-1.2B-GGUF:Q8_0 --prompt 'Hi there!' --contextSize 4096 --flashAttention --printTimings --timing --gpu false
```

Compared with:

```bash
llama-cli -m ~/.node-llama-cpp/models/hf_LiquidAI_LFM2-1.2B.Q8_0.gguf -c 4096 -fa on -b 512 -ub 512 -ngl 999
llama-cli -m ~/.node-llama-cpp/models/hf_LiquidAI_LFM2-1.2B.Q8_0.gguf -c 4096 -fa on -b 512 -ub 512 -ngl 0
```
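If you want to take the CLI out of the equation, here is a rough programmatic version of the same GPU-on/GPU-off comparison using the node-llama-cpp API. This is a minimal sketch, not a reproduction of the `--printTimings` output: it assumes the GGUF file already sits in `~/.node-llama-cpp/models` (where the CLI runs above downloaded it) and it approximates throughput by tokenizing the answer and dividing by wall-clock time.

```ts
// compare-gpu-cpu.ts — rough GPU vs. CPU timing comparison with the node-llama-cpp API.
import os from "os";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

// Assumption: the model was already downloaded by the CLI runs above into this default folder.
const modelPath = path.join(
    os.homedir(), ".node-llama-cpp", "models", "hf_LiquidAI_LFM2-1.2B.Q8_0.gguf"
);

async function run(useGpu: boolean) {
    // `gpu: false` forces the CPU backend; otherwise node-llama-cpp picks Metal/CUDA/Vulkan itself.
    const llama = await getLlama(useGpu ? {} : {gpu: false});
    const model = await llama.loadModel({modelPath});
    const context = await model.createContext({contextSize: 4096});
    const session = new LlamaChatSession({contextSequence: context.getSequence()});

    const start = Date.now();
    const answer = await session.prompt("Hi there!");
    const seconds = (Date.now() - start) / 1000;

    // Rough throughput estimate: token count of the answer over total wall-clock time
    // (this includes prompt processing, so it reads lower than the pure generation speed).
    const tokens = model.tokenize(answer).length;
    console.log(
        `${useGpu ? "GPU" : "CPU"}: ${tokens} tokens in ${seconds.toFixed(2)}s ` +
        `(~${(tokens / seconds).toFixed(1)} tokens/s)`
    );

    // Free native resources before the next run.
    await context.dispose();
    await model.dispose();
}

await run(true);
await run(false);
```

Running it once with and once without the GPU keeps everything else (model, context size, prompt) identical, so the only variable between the two numbers is the backend.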
-
I understand this will never run as fast as something like LM Studio, but something seems wrong. When running on the CPU (i5-13600), it takes almost two minutes just to load the prompt while the CPU sits at 100%. When using the GPU, loading is much faster, memory is allocated correctly, and prompts run faster as well. On the integrated GPU, performance is limited by the GPU itself, but on the CPU a different bottleneck appears to be limiting performance. For example, in LM Studio I can get up to 100 tokens/sec, but here I'm lucky to get 10 at most.
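As a first sanity check, it can help to confirm which compute backend node-llama-cpp actually selected before comparing numbers. A minimal sketch, assuming the `llama.gpu` property exposed by the v3 API:

```ts
// check-backend.ts — print which compute backend node-llama-cpp selected on this machine.
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
// Assumption: `llama.gpu` reports the active backend, e.g. "vulkan", "cuda", "metal",
// or false when everything runs on the CPU.
console.log("GPU backend in use:", llama.gpu);
```

That makes it clear which backend produced the ~10 tokens/sec figure before putting it next to LM Studio's numbers.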
Here are some snippets: