-
@aadrians1 Are you sure that the GPU isn't used in LM Studio? I've run the commands below on my machine (M1 Max Mac) and compared them with the performance of `llama-cli`:

```bash
# with GPU
npx --yes node-llama-cpp chat hf:LiquidAI/LFM2-1.2B-GGUF:Q8_0 --prompt 'Hi there!' --contextSize 4096 --flashAttention --printTimings --timing

# without GPU
npx --yes node-llama-cpp chat hf:LiquidAI/LFM2-1.2B-GGUF:Q8_0 --prompt 'Hi there!' --contextSize 4096 --flashAttention --printTimings --timing --gpu false
```

Compared with:

```bash
llama-cli -m ~/.node-llama-cpp/models/hf_LiquidAI_LFM2-1.2B.Q8_0.gguf -c 4096 -fa on -b 512 -ub 512 -ngl 999
llama-cli -m ~/.node-llama-cpp/models/hf_LiquidAI_LFM2-1.2B.Q8_0.gguf -c 4096 -fa on -b 512 -ub 512 -ngl 0
```
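If you want to take the CLI out of the equation, here is a rough programmatic version of the same GPU-on/GPU-off comparison using the node-llama-cpp API. This is a minimal sketch, not a reproduction of the `--printTimings` output: it assumes the GGUF file already sits in `~/.node-llama-cpp/models` (where the CLI runs above downloaded it) and it approximates throughput by tokenizing the answer and dividing by wall-clock time.

```ts
// compare-gpu-cpu.ts — rough GPU vs. CPU timing comparison with the node-llama-cpp API.
import os from "os";
import path from "path";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

// Assumption: the model was already downloaded by the CLI runs above into this default folder.
const modelPath = path.join(
    os.homedir(), ".node-llama-cpp", "models", "hf_LiquidAI_LFM2-1.2B.Q8_0.gguf"
);

async function run(useGpu: boolean) {
    // `gpu: false` forces the CPU backend; otherwise node-llama-cpp picks Metal/CUDA/Vulkan itself.
    const llama = await getLlama(useGpu ? {} : {gpu: false});
    const model = await llama.loadModel({modelPath});
    const context = await model.createContext({contextSize: 4096});
    const session = new LlamaChatSession({contextSequence: context.getSequence()});

    const start = Date.now();
    const answer = await session.prompt("Hi there!");
    const seconds = (Date.now() - start) / 1000;

    // Rough throughput estimate: token count of the answer over total wall-clock time
    // (this includes prompt processing, so it reads lower than the pure generation speed).
    const tokens = model.tokenize(answer).length;
    console.log(
        `${useGpu ? "GPU" : "CPU"}: ${tokens} tokens in ${seconds.toFixed(2)}s ` +
        `(~${(tokens / seconds).toFixed(1)} tokens/s)`
    );

    // Free native resources before the next run.
    await context.dispose();
    await model.dispose();
}

await run(true);
await run(false);
```

Running it once with and once without the GPU keeps everything else (model, context size, prompt) identical, so the only variable between the two numbers is the backend.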
-
I understand this will never run as fast as something like LM Studio, but something seems wrong. When running on the CPU (i5-13600), it takes almost two minutes just to load the prompt while the CPU sits at 100%. When using the GPU, loading is much faster, memory is allocated correctly, and prompts run faster as well. On the integrated GPU, performance is limited by the GPU itself, but on the CPU a different bottleneck appears to be limiting performance. For example, in LM Studio I can get up to 100 tokens/sec, but here I'm lucky to get 10 at most.
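As a first sanity check, it can help to confirm which compute backend node-llama-cpp actually selected before comparing numbers. A minimal sketch, assuming the `llama.gpu` property exposed by the v3 API:

```ts
// check-backend.ts — print which compute backend node-llama-cpp selected on this machine.
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
// Assumption: `llama.gpu` reports the active backend, e.g. "vulkan", "cuda", "metal",
// or false when everything runs on the CPU.
console.log("GPU backend in use:", llama.gpu);
```

That makes it clear which backend produced the ~10 tokens/sec figure before putting it next to LM Studio's numbers.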
Here are some snippets: